arXiv

From Segments to Scenes: Temporal Understanding in Autonomous Driving via Vision-Language Model

June 2, 2026 · Kevin Cannons, Saeed Ranjbar Alvar, Mohammad Asiful Hossain, Ahmad Rezaei, Mohsen Gholami, Alireza Heidarikhazaei, Zhou Weimin, Yong Zhang, Mohammad Akbari

Title: Bridging the Gap: Advancing Temporal Comprehension in Autonomous Driving Through Vision-Language Models

Vision-Language Models (VLMs) are increasingly used as the perception and reasoning engines of autonomous agents operating in the real world, and autonomous driving (AD) is a domain where safety is paramount. To operate safely in dynamic scenes, such systems need robust temporal understanding: they must anticipate events, identify their causes, and execute safe maneuvers. This remains difficult even for state-of-the-art (SoTA) VLMs. Existing video benchmarks mostly cover diverse activities such as sports and cooking, and none is dedicated to evaluating temporal reasoning on both short and long AD video clips.

To address this deficiency, we introduce the Temporal Understanding in Autonomous Driving (TAD) benchmark, which contains nearly 6,000 question-answer (QA) pairs spanning seven tasks. We use TAD to evaluate nine models, covering generalist and AD-specialist models as well as closed- and open-source ones. The evaluation shows that current SoTA models fall well short of human performance on TAD.
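As an illustration of how a multi-task QA benchmark of this kind is commonly scored, the sketch below computes per-task accuracy and an unweighted average across tasks. This is a minimal sketch under stated assumptions, not the TAD evaluation code: the record layout `(task, predicted, answer)`, the function names, and the choice of averaging over tasks rather than over questions are all hypothetical.

```python
from collections import defaultdict


def per_task_accuracy(records):
    """Return {task: accuracy} for (task, predicted, answer) records.

    The record layout is an assumption made for this sketch.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for task, predicted, answer in records:
        total[task] += 1
        correct[task] += int(predicted == answer)
    return {task: correct[task] / total[task] for task in total}


def average_accuracy(records):
    """Unweighted mean of the per-task accuracies (assumed averaging scheme)."""
    accuracies = per_task_accuracy(records)
    return sum(accuracies.values()) / len(accuracies)


if __name__ == "__main__":
    # Hypothetical task names and answers, for illustration only.
    demo = [
        ("task_a", "B", "B"),
        ("task_a", "C", "A"),
        ("task_b", "D", "D"),
    ]
    print(per_task_accuracy(demo))
    print(average_accuracy(demo))
```

Averaging over tasks gives each task equal weight regardless of how many questions it contains; averaging over all questions would instead favor the larger tasks.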

To bridge this gap and improve the temporal reasoning of VLM-driven agents, we propose two training-free methods: Scene-CoT and TCogMap. Scene-CoT uses Chain-of-Thought (CoT) reasoning to structure the model's inference. TCogMap supplies the VLM with an ego-centric temporal cognitive map produced by a trajectory-analysis module, which acts as an agentic tool around the VLM. Applied to existing VLMs, these methods improve average accuracy on TAD by up to 17.72% and performance on STSBench by up to 10.35%.
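To make the idea of an ego-centric temporal map more concrete, the sketch below turns an ego trajectory into a timeline of motion segments and renders it as text that could be placed in a VLM prompt. It is an illustrative sketch only and should not be read as the TCogMap implementation: the input format (one 2D ego position per frame, with a counterclockwise heading change treated as a left turn), the motion labels, the thresholds, and every function name here are assumptions.

```python
import math
from dataclasses import dataclass


@dataclass
class Segment:
    start_frame: int
    end_frame: int
    motion: str


def classify_step(p0, p1, p2, stop_eps=0.1, turn_deg=5.0):
    """Label the ego motion from p1 to p2, using p0 -> p1 as the prior heading.

    Thresholds are hypothetical values chosen for this sketch.
    """
    dx1, dy1 = p1[0] - p0[0], p1[1] - p0[1]
    dx2, dy2 = p2[0] - p1[0], p2[1] - p1[1]
    if math.hypot(dx2, dy2) < stop_eps:
        return "stopped"
    if math.hypot(dx1, dy1) < stop_eps:
        return "moving straight"  # no prior heading to compare against
    # Signed heading change in degrees; positive is counterclockwise (left).
    angle = math.degrees(math.atan2(dx1 * dy2 - dy1 * dx2, dx1 * dx2 + dy1 * dy2))
    if angle > turn_deg:
        return "turning left"
    if angle < -turn_deg:
        return "turning right"
    return "moving straight"


def build_temporal_map(trajectory):
    """Merge consecutive steps with the same label into frame-indexed segments."""
    segments = []
    for i in range(1, len(trajectory) - 1):
        label = classify_step(trajectory[i - 1], trajectory[i], trajectory[i + 1])
        if segments and segments[-1].motion == label:
            segments[-1].end_frame = i + 1
        else:
            segments.append(Segment(i, i + 1, label))
    return segments


def render_prompt(segments, question):
    """Render the timeline as plain text to prepend to a question."""
    lines = [
        f"Frames {s.start_frame}-{s.end_frame}: ego vehicle {s.motion}"
        for s in segments
    ]
    return "Ego-motion timeline:\n" + "\n".join(lines) + f"\n\nQuestion: {question}"


if __name__ == "__main__":
    # Hypothetical trajectory: drive along x, turn left, continue, then stop.
    trajectory = [(0, 0), (1, 0), (2, 0), (3, 0), (3, 1), (3, 2), (3, 2), (3, 2)]
    print(render_prompt(build_temporal_map(trajectory), "When does the ego vehicle stop?"))
```

The design choice illustrated here is that the timeline is computed outside the model and passed in as context, so no retraining of the VLM is involved, which is consistent with the training-free framing above.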

By introducing TAD, benchmarking multiple SoTA models, and proposing effective enhancements, this work aims to encourage further research on temporal understanding for agentic AD systems deployed in unstructured environments. The benchmark dataset and evaluation code are publicly available on Hugging Face (https://huggingface.co/datasets/vbdai/TAD) and GitHub (https://github.com/vbdi/tad_bench), respectively.
