Adaptive Dense Evidence Refinement for Video Relational Reasoning in the VRR-QA Challenge
Title: Adaptive Dense Evidence Refinement for Video Relational Reasoning in the VRR-QA Challenge
Abstract: The VRR-QA benchmark evaluates whether video-language models can infer spatial, temporal, viewpoint, depth, and visibility relationships that cannot be determined from a single frame. We present an inference-centric approach built on adaptive test-time computation. The system first generates an answer with a standard video-language model pass. It then uses multiple lightweight views to identify questions whose answers are unstable. Only these hard instances are forwarded to a more expensive dense evidence module, which performs timestamped frame analysis, relation-specific probing, candidate verification, and conservative temporal aggregation. This design separates two issues that are often conflated in video question answering: generating plausible alternative answers, and deciding whether the current answer should be changed. On the test split, the final configuration achieved an average accuracy of 90.07 and a macro-average accuracy of 87.81. This report describes the final test system and the implementation parameters needed to reproduce the adaptive dense verifier.
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC
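
The adaptive routing described in the abstract can be illustrated with a minimal sketch. Everything named here is hypothetical and not taken from the report: the function `route_question`, the `min_agreement` threshold and its default value, and the choice to measure instability as the fraction of lightweight views that disagree with the first-pass answer. The dense evidence module is represented only as a callable that receives the current answer and returns either the same answer or a revised one.

```python
from collections import Counter
from typing import Callable, Sequence, Tuple


def route_question(
    base_answer: str,
    view_answers: Sequence[str],
    dense_verifier: Callable[[str], str],
    min_agreement: float = 0.75,  # assumed threshold, not from the report
) -> Tuple[str, bool]:
    """Return (final_answer, escalated) for one question.

    base_answer    : answer from the standard first pass
    view_answers   : answers from the lightweight views (assumed to be
                     directly comparable with base_answer)
    dense_verifier : stand-in for the expensive dense evidence module;
                     it decides whether to keep or change the answer
    """
    # Count how many of all passes (first pass plus views) support the
    # first-pass answer.
    votes = Counter([base_answer, *view_answers])
    agreement = votes[base_answer] / (len(view_answers) + 1)

    # Stable question: keep the cheap answer and skip the dense module.
    if agreement >= min_agreement:
        return base_answer, False

    # Unstable question: pay for dense verification, which may keep
    # or revise the current answer.
    return dense_verifier(base_answer), True
```

In this sketch the views only decide whether to escalate; the decision to change the answer is left entirely to the verifier, which mirrors the separation the abstract draws between proposing alternatives and deciding whether a current answer should be modified.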





