DVGT: Driving Visual Geometry Transformer
Title: DVGT: Driving Visual Geometry Transformer
Abstract: Accurately perceiving and reconstructing 3D scene geometry from visual data is a fundamental requirement for autonomous driving systems. Despite its importance, there remains a scarcity of dense geometry perception models specifically designed for driving environments that can effectively adapt to varying scenarios and diverse camera setups. To address this limitation, we introduce the Driving Visual Geometry Transformer (DVGT), a novel approach that reconstructs a global, dense 3D point map from a sequence of multi-view visual inputs without requiring known poses.
Our method first extracts visual features from each image using a DINO backbone. It then alternates three kinds of attention layers (intra-view local attention, cross-view spatial attention, and cross-frame temporal attention) to infer geometric relationships across the image sequence. Finally, multiple decoding heads produce a global point map in the ego coordinate system of the initial frame and estimate the ego poses for every subsequent frame.
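The alternating attention pattern described above can be illustrated with a minimal NumPy sketch. It is not the released DVGT implementation: the tensor layout (T, V, N, C), the function names, the identity projections (no learned weights), and in particular the choice to group temporal attention by camera view are all assumptions made for illustration. The only point shown is how reshaping the token tensor changes which tokens are allowed to attend to each other.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens):
    """Single-head self-attention with identity projections (illustrative only).
    tokens: (..., L, C); information is mixed only along the L axis."""
    c = tokens.shape[-1]
    scores = tokens @ np.swapaxes(tokens, -1, -2) / np.sqrt(c)
    return softmax(scores, axis=-1) @ tokens

def intra_view(x):
    # Each image attends over its own N patch tokens.
    return x + self_attention(x)

def cross_view(x):
    # Within one frame, the tokens of all V views attend jointly.
    T, V, N, C = x.shape
    y = self_attention(x.reshape(T, V * N, C))
    return x + y.reshape(T, V, N, C)

def cross_frame(x):
    # Assumption: tokens of the same view attend across all T frames.
    T, V, N, C = x.shape
    y = self_attention(x.transpose(1, 0, 2, 3).reshape(V, T * N, C))
    return x + y.reshape(V, T, N, C).transpose(1, 0, 2, 3)

def alternating_block(x):
    """x: (T, V, N, C) = frames, views per frame, tokens per image, channels."""
    return cross_frame(cross_view(intra_view(x)))

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 2, 4, 8))  # 3 frames, 2 views, 4 tokens, 8 channels
out = alternating_block(x)
print(out.shape)  # (3, 2, 4, 8)
```

Because each step only regroups tokens before a standard attention call, no camera intrinsics or extrinsics enter the computation, and the same code runs for any number of views V.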
Distinct from traditional approaches that depend on precise camera parameters, DVGT operates without explicit 3D geometric priors. This characteristic allows for the flexible processing of arbitrary camera configurations. Furthermore, DVGT directly predicts metric-scaled geometry from image sequences, thereby removing the necessity for post-alignment with external sensors.
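For context, the post-alignment step that metric-scaled prediction makes unnecessary often looks like the sketch below. A model that recovers geometry only up to an unknown scale is rescaled against a metric reference. The median-ratio rule, the function name `median_scale`, and the use of sparse LiDAR depth as the reference are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def median_scale(pred_depth, ref_depth):
    """Return the scalar s that aligns s * pred_depth with ref_depth
    in the median sense, using only pixels valid in both arrays."""
    valid = (pred_depth > 0) & (ref_depth > 0)
    return float(np.median(ref_depth[valid] / pred_depth[valid]))

# Hypothetical values: an up-to-scale prediction (0 marks an invalid pixel)
# and a metric reference, for example sparse LiDAR depth.
pred = np.array([1.0, 2.0, 4.0, 0.0])
ref = np.array([2.5, 5.0, 10.0, 7.0])

s = median_scale(pred, ref)
print(s)  # 2.5
```

A model that outputs metric geometry directly corresponds to s = 1 by construction, so no external reference is needed at inference time.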
Evaluated on a comprehensive mixture of driving datasets, including nuScenes, OpenScene, Waymo, KITTI, and DDAD, DVGT demonstrates significant performance improvements over existing models across various scenarios. The source code has been made publicly available at https://github.com/wzzheng/DVGT.
Source: arXiv
Generated at: 2026-06-04 00:00:00 UTC



