$\text{VG}^2$GT: Voxel-Gaussian Splatting Visual Geometry Grounded Transformer
Title: $\text{VG}^2$GT: A Visual Geometry-Grounded Transformer Utilizing Voxel-Gaussian Splatting
Abstract: Gaussian splatting has demonstrated significant promise for novel view synthesis and 3D reconstruction. Yet, conventional approaches typically depend on precise camera intrinsics and per-scene optimization, whereas feed-forward alternatives employing pixel-aligned Gaussian primitives frequently encounter issues with non-uniform primitives and visual artifacts. To address these challenges, we introduce $\text{VG}^2$GT, a Voxel-Gaussian Splatting Visual Geometry-Grounded Transformer. This framework utilizes a frozen, pretrained visual foundation model (VFM) and integrates a multi-scale differentiable voxel module to bolster geometric comprehension. It directly regresses and splits Gaussian primitive parameters from voxel features. By supervising depth maps via stochastic solid volume rendering, the method achieves geometrically precise Gaussian scene reconstruction without updating the visual foundation model. This architecture allows $\text{VG}^2$GT to be easily integrated into any VFM that relies on patch features, significantly lowering training expenses. Experimental results indicate that $\text{VG}^2$GT surpasses current state-of-the-art techniques across standard benchmarks, including ScanNet, TAT, Replica, and DTU.
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC





