Global News Digest

arXiv

From Segments to Scenes: Temporal Understanding in Autonomous Driving via Vision-Language Model

Title: Bridging the Gap: Advancing Temporal Comprehension in Autonomous Driving Through Vision-Language Models

Vision-Language Models (VLMs) are rapidly emerging as the core perception and reasoning engines for autonomous agents operating in real-world environments, and autonomous driving (AD) stands out as a domain where safety is paramount. To operate safely in dynamic settings, these systems need robust temporal understanding: they must anticipate events, identify their causes, and execute safe maneuvers. Achieving this level of temporal comprehension remains a formidable hurdle, even for state-of-the-art (SoTA) VLMs. Existing video benchmarks predominantly cover diverse activities such as sports and cooking, and no benchmark is dedicated exclusively to evaluating temporal reasoning over both short and long AD video clips.

To address this deficiency, we introduce the Temporal Understanding in Autonomous Driving (TAD) benchmark. It contains nearly 6,000 question-answer (QA) pairs distributed across seven distinct tasks. We used TAD to evaluate nine models, spanning generalist and AD-specialist variants as well as closed- and open-source models. The evaluation shows that current SoTA models lag significantly behind human performance on TAD.

To narrow this gap and strengthen the temporal reasoning of VLM-driven agents, we introduce two novel, training-free methods: Scene-CoT and TCogMap. Scene-CoT uses Chain-of-Thought (CoT) reasoning to structure the model's inference. TCogMap integrates an ego-centric temporal cognitive map, generated by a trajectory-analysis module that functions as an agentic tool around the VLM. Applied to existing VLMs, these approaches boost average accuracy on TAD by as much as 17.72% and improve performance on STSBench by up to 10.35%.
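To make the idea of an ego-centric temporal map more concrete, the following Python sketch shows one way a trajectory-analysis step could turn ego positions into a per-segment text timeline that is placed in front of a question for a VLM. It is an illustration only, not the TCogMap implementation: the names `describe_segment`, `build_temporal_map`, and `render_prompt`, the speed and turn thresholds, the segment length, and the prompt wording are all assumptions made for this example.

```python
from dataclasses import dataclass
from math import atan2, degrees, hypot
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # assumed ego (x, y) position in metres


@dataclass
class SegmentSummary:
    index: int
    start_s: float
    end_s: float
    motion: str


def heading_deg(p0: Point, p1: Point) -> float:
    """Heading of the step p0 -> p1 in degrees, counter-clockwise from +x."""
    return degrees(atan2(p1[1] - p0[1], p1[0] - p0[0]))


def describe_segment(points: Sequence[Point], dt: float,
                     stop_speed: float = 0.3, turn_deg: float = 15.0) -> str:
    """Label one segment as stopped, turning left/right, or going straight.

    The thresholds are hypothetical defaults chosen for illustration.
    """
    distance = sum(hypot(b[0] - a[0], b[1] - a[1])
                   for a, b in zip(points, points[1:]))
    duration = dt * (len(points) - 1)
    speed = distance / duration if duration > 0 else 0.0
    if speed < stop_speed:
        return "stopped"
    # Heading change between the first and last step, wrapped to [-180, 180).
    delta = heading_deg(points[-2], points[-1]) - heading_deg(points[0], points[1])
    delta = (delta + 180.0) % 360.0 - 180.0
    if delta > turn_deg:
        return "turning left"
    if delta < -turn_deg:
        return "turning right"
    return "going straight"


def build_temporal_map(trajectory: Sequence[Point], dt: float,
                       steps_per_segment: int = 4) -> List[SegmentSummary]:
    """Split the trajectory into consecutive segments that share endpoints."""
    summaries = []
    starts = range(0, len(trajectory) - 1, steps_per_segment)
    for seg_idx, start in enumerate(starts):
        chunk = trajectory[start:start + steps_per_segment + 1]
        summaries.append(SegmentSummary(
            index=seg_idx,
            start_s=start * dt,
            end_s=(start + len(chunk) - 1) * dt,
            motion=describe_segment(chunk, dt),
        ))
    return summaries


def render_prompt(summaries: Sequence[SegmentSummary], question: str) -> str:
    """Render the map as plain text to place before the question."""
    lines = ["Ego-motion timeline:"]
    for s in summaries:
        lines.append(
            f"- segment {s.index} ({s.start_s:.1f}s to {s.end_s:.1f}s): {s.motion}"
        )
    lines.append(f"Question: {question}")
    return "\n".join(lines)
```

In this sketch the VLM itself is never retrained; the map is only extra text context, which matches the training-free framing above. How the actual method segments video, what the map contains, and how it is passed to the model are not specified in this summary.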

By establishing the TAD benchmark, rigorously evaluating leading models, and demonstrating effective enhancement strategies, this study seeks to stimulate further advancements in temporal understanding for agentic AD systems deployed in unstructured environments. The benchmark dataset and evaluation code are publicly accessible via Hugging Face (https://huggingface.co/datasets/vbdai/TAD) and GitHub (https://github.com/vbdi/tad_bench), respectively.


Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC
