Temporal Evidence Routing with Structured Visual Evidence for TimeLogicQA
Title: TimeLogicQA: Leveraging Structured Visual Evidence and Temporal Routing
Abstract:
The TimeLogicQA benchmark is designed to assess the capacity of video question-answering systems to navigate complex temporal relationships, including event existence, sequence, persistence, boundary constraints, and temporal overlap. To tackle this challenge, we introduce a methodology that decouples visual perception from symbolic temporal reasoning through a structured evidence routing pipeline.
First, the system parses each question to identify the event targets, the required answer mode, the candidate options, and the applicable temporal operators. Videos are then routed according to their duration and the complexity of the temporal operators involved: short clips are analyzed using ordered full-frame evidence, whereas longer videos use event-centric candidate windows. A multimodal large language model generates structured visual evidence for these events. Programmatic verifiers extract dense action intervals from this evidence, and a deterministic reducer applies operator-specific temporal rules to derive the final answer.
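The routing and reduction steps described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the 30-second threshold, the set of "complex" operators, the `Interval` type, and the function names `route`, `before`, and `overlaps` are all assumptions introduced here, since the abstract does not specify them.

```python
from __future__ import annotations

from dataclasses import dataclass

# Assumed values: the abstract gives neither a duration threshold
# nor a list of operators that count as "complex".
SHORT_CLIP_SECONDS = 30.0
COMPLEX_OPERATORS = {"overlap", "boundary", "persistence"}


@dataclass(frozen=True)
class Interval:
    """A verified action interval, in seconds (hypothetical structure)."""
    start: float
    end: float


def route(duration_s: float, operators: set[str]) -> str:
    """Choose an evidence strategy from video duration and operator complexity."""
    is_short = duration_s <= SHORT_CLIP_SECONDS
    is_simple = not (operators & COMPLEX_OPERATORS)
    if is_short and is_simple:
        return "ordered_full_frames"
    return "event_candidate_windows"


def before(a: list[Interval], b: list[Interval]) -> bool:
    """Sequence rule (assumed): the first A interval ends before the first B starts."""
    if not a or not b:
        return False
    first_a = min(a, key=lambda i: i.start)
    first_b = min(b, key=lambda i: i.start)
    return first_a.end <= first_b.start


def overlaps(a: list[Interval], b: list[Interval]) -> bool:
    """Overlap rule: some A interval shares positive duration with some B interval."""
    return any(x.start < y.end and y.start < x.end for x in a for y in b)
```

Because the reducer is a set of pure functions over intervals, the same evidence always yields the same answer, which is the property the deterministic design relies on.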
For robustness, a conservative fusion mechanism accepts an answer only when the visual evidence, the temporal program, and the confidence metrics agree. This significantly reduces erroneous answer flips caused by noisy evidence. On the official test set, the proposed system attains an Average Accuracy (AvgAcc) of 81.8.
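One way to read the conservative fusion rule is as a gate that keeps a prior answer unless all signals agree. The sketch below assumes a baseline answer exists to fall back on, and uses a hypothetical `min_confidence` threshold of 0.8; neither detail is stated in the abstract.

```python
from typing import Optional


def conservative_fuse(
    baseline: str,
    visual_answer: Optional[str],
    program_answer: Optional[str],
    confidence: float,
    min_confidence: float = 0.8,  # assumed threshold, not from the source
) -> str:
    """Flip away from the baseline only when every signal agrees."""
    signals_agree = visual_answer is not None and visual_answer == program_answer
    if signals_agree and confidence >= min_confidence:
        return program_answer
    # Any disagreement or low confidence keeps the baseline answer.
    return baseline
```

Under this reading, a noisy visual reading alone cannot change the output, because the temporal program must produce the same answer and the confidence must clear the threshold.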
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC





