arXiv

Question-Aware Evidence Ledgers for Video Relational Reasoning

June 2, 2026 · Yilin Ou, Mengshi Qi, Huadong Ma

Title: Leveraging Question-Aware Evidence Ledgers for Video Relational Reasoning

Abstract:

The VRR-QA benchmark assesses visual relational reasoning in video. In this setting, the correct answer often hinges on subtle factors such as implicit spatial relationships, event demarcations, the identity of specific targets, and conversational context, rather than on any single prominent frame. To address this, we introduce a test-time reasoning framework built around a robust GPT-5.5 video question-answering solver, augmented by a collection of question-aware evidence ledgers.

The solver first generates an answer from a standardized video representation. Specific ledgers are then activated to clarify the elements each reasoning task depends on, including counting, spatial analysis, endpoint detection, viewpoint assessment, and dialogue comprehension. These ledgers explicitly define targets, count units, reference frames, and temporal or spatial scopes.
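As an illustration only, a ledger of this kind could be represented as a small record holding the fields named above. The following is a minimal sketch, not the authors' implementation: the class `EvidenceLedger`, the keyword table `TASK_KEYWORDS`, and the function `select_ledgers` are hypothetical names, and keyword matching is an assumed stand-in for whatever activation rule the paper actually uses.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class EvidenceLedger:
    """Hypothetical record of what a question asks the solver to pin down."""
    task: str                                   # e.g. "counting", "spatial"
    target: str                                 # entity the question is about
    count_unit: Optional[str] = None            # what one "count" means
    reference_frame: Optional[str] = None       # e.g. "camera" or "actor"
    temporal_scope: Optional[Tuple[float, float]] = None  # (start_s, end_s)
    spatial_scope: Optional[str] = None         # e.g. "on the table"


# Assumed keyword routing; the abstract does not describe the real rule.
TASK_KEYWORDS = {
    "counting": ("how many", "number of"),
    "spatial": ("left of", "right of", "behind", "in front of"),
    "endpoint": ("when does", "stop", "finish"),
    "viewpoint": ("camera", "viewpoint", "perspective"),
    "dialogue": ("say", "said", "mention"),
}


def select_ledgers(question: str) -> List[str]:
    """Return the ledger types whose keywords appear in the question."""
    q = question.lower()
    return [task for task, keywords in TASK_KEYWORDS.items()
            if any(keyword in q for keyword in keywords)]


ledger = EvidenceLedger(
    task="counting",
    target="person",
    count_unit="distinct individual",
    temporal_scope=(0.0, 12.5),
)
print(select_ledgers("How many people are left of the car?"))
```

A single question can activate more than one ledger, which matches the abstract's description of ledgers as a collection rather than a single classifier output.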

We use external tools, such as open-vocabulary detection systems, depth information, pair crops, automatic speech recognition (ASR), and scene-graph ledgers, exclusively as sources of evidence. A conservative gating mechanism retains the solver’s original answer unless independent evidence distinctly validates an alternative option. The final evidence-gated pipeline achieves an overall accuracy of 92.95% and a macro accuracy of 93.79% on the challenge’s test split.
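The conservative gate described above can be sketched as follows. This is a hedged illustration under stated assumptions, not the paper's actual rule: the `Evidence` record, the function `gate_answer`, and the thresholds `min_conf` and `min_sources` are all hypothetical, chosen here to model "independent evidence distinctly validates an alternative" as agreement between at least two distinct sources on exactly one alternative option.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Set


@dataclass(frozen=True)
class Evidence:
    source: str        # tool that produced it, e.g. "detector", "depth", "asr"
    option: str        # answer option this evidence supports
    confidence: float  # tool-reported confidence in [0, 1]


def gate_answer(solver_answer: str,
                evidence: Iterable[Evidence],
                min_conf: float = 0.8,      # assumed threshold
                min_sources: int = 2) -> str:  # assumed independence requirement
    """Keep the solver's answer unless one alternative is clearly supported."""
    support: Dict[str, Set[str]] = {}
    for ev in evidence:
        # Only confident evidence for a *different* option can trigger an override.
        if ev.option != solver_answer and ev.confidence >= min_conf:
            support.setdefault(ev.option, set()).add(ev.source)

    # An alternative qualifies when enough distinct sources agree on it.
    qualified = [opt for opt, sources in support.items()
                 if len(sources) >= min_sources]

    # Override only if exactly one alternative qualifies; otherwise stay put.
    return qualified[0] if len(qualified) == 1 else solver_answer
```

For example, `gate_answer("A", [Evidence("detector", "B", 0.9), Evidence("depth", "B", 0.95)])` switches to `"B"`, while a single supporting source, or two alternatives that both qualify, leaves the solver's `"A"` in place. Defaulting to the solver's answer reflects the abstract's framing of tools as evidence sources rather than competing answerers.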
