Unified Driving Tokens: Representation- and Geometry-Guided Discrete Tokenizer for Driving World Models and Planning
Abstract:
Discrete visual tokens offer a compact representation for both token-based world modeling and planning in autonomous driving. However, most existing tokenizers are adapted from image generation and optimized primarily for pixel reconstruction, which leaves a gap between tokens that are easy to generate and tokens that are useful for decoding driving decisions. To address this, we introduce a representation-guided, geometry-enhanced tokenizer that learns discrete tokens under joint supervision. Specifically, the method aligns its discrete bottleneck with a frozen DINO feature space via feature decoding, while maintaining visual fidelity through RGB reconstruction with perceptual and adversarial losses. To incorporate geometric, state-related cues, we add adjacent-frame depth and relative-pose supervision during training. Multi-codebook quantization is employed to stabilize these joint objectives. We evaluate the learned tokens with a lightweight planning readout and a GPT-style next-token world model. Experiments on the NAVSIM dataset show improved reconstruction fidelity and representation consistency, competitive planning performance with a fixed decoder, and superior generative quality under matched settings.
Source: arXiv Generated at: 2026-06-02 00:00:00 UTC
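
Illustrative sketch (not the authors' implementation): the abstract mentions multi-codebook quantization without specifying its design. One common form, assumed here, splits each feature vector into equal channel groups and snaps each group to its nearest entry in a separate codebook, so that every spatial position yields one token id per codebook. The function name `multi_codebook_quantize`, the codebook sizes, and the channel-split scheme are all hypothetical choices made for illustration.

```python
import numpy as np


def multi_codebook_quantize(features, codebooks):
    """Quantize features with several codebooks (assumed channel-split scheme).

    features:  array of shape (N, D)
    codebooks: list of K arrays, each of shape (V, D // K)
    Returns token ids of shape (N, K) and quantized features of shape (N, D).
    """
    num_books = len(codebooks)
    # Split the channel dimension into one equal-sized group per codebook.
    chunks = np.split(features, num_books, axis=1)
    ids, quantized = [], []
    for chunk, book in zip(chunks, codebooks):
        # Squared Euclidean distance from every chunk to every code: (N, V).
        dist = ((chunk[:, None, :] - book[None, :, :]) ** 2).sum(axis=-1)
        idx = dist.argmin(axis=1)   # nearest code per feature
        ids.append(idx)
        quantized.append(book[idx])  # look up the selected code vectors
    return np.stack(ids, axis=1), np.concatenate(quantized, axis=1)


# Toy usage: 2 hypothetical codebooks of 16 codes each, 8-dim features.
rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(16, 4)) for _ in range(2)]
features = rng.normal(size=(5, 8))
token_ids, quantized = multi_codebook_quantize(features, codebooks)
```

In training, a tokenizer of this kind would normally pair the lookup with a straight-through gradient estimator and a commitment loss; those details are omitted because the abstract does not describe them.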





