Benchmarking Visual State Tracking in Multimodal Video Understanding
Title: Evaluating Visual State Tracking Capabilities in Multimodal Video Comprehension
Abstract:
Grasping video content demands more than merely identifying isolated moments; it necessitates the continuous monitoring of entities, states, and events as they unfold. While this ability for visual state tracking is central to video comprehension, it has been largely overlooked in current assessments of Multimodal Large Language Models (MLLMs). To address this gap, we present the Visual STAte Tracking benchmark (VSTAT), a specialized video-based tool designed to evaluate an MLLM’s capacity for visual state tracking. VSTAT comprises 1,500 questions linked to 834 clips sourced from both synthetic and real-world video datasets. These questions are specifically crafted to be unanswerable by examining any single frame or brief segment, thereby requiring the model to maintain continuous perception and synthesize information across the entire video duration.
Our findings reveal that, despite their impressive results on existing video benchmarks, state-of-the-art MLLMs fall well short of human performance on VSTAT and improve only marginally over answer-prior baselines. To investigate the reasons behind this discrepancy, we compare the models' reasoning traces against the actual video stream to pinpoint where and why failures occur. We find that while MLLMs are capable of correct textual reasoning and tracking, they struggle with the visual perception required to detect the specific events they are tasked with tracking. Furthermore, our initial assessment indicates that recent agentic strategies (such as MLLM-driven video agents and coding agents) do not effectively mitigate these issues, as they also continue to fall short on VSTAT.
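As an illustration of what an answer-prior baseline can look like, the sketch below always predicts the most frequent answer seen in a reference set and never looks at the video. This is one common reading of the term, not necessarily the baseline used for VSTAT; the function name `answer_prior_baseline` and the toy answer lists are hypothetical and are not drawn from the benchmark.

```python
from collections import Counter


def answer_prior_baseline(reference_answers, test_answers):
    """Predict the most frequent reference answer for every test question.

    Assumption: answers are short strings compared by exact match.
    Returns the constant prediction and its accuracy on test_answers.
    """
    # The "prior" here is simply the empirical answer distribution.
    most_common_answer, _ = Counter(reference_answers).most_common(1)[0]
    correct = sum(1 for answer in test_answers if answer == most_common_answer)
    return most_common_answer, correct / len(test_answers)


# Hypothetical toy data, for illustration only.
reference = ["2", "3", "2", "4", "2", "3"]
test = ["2", "3", "2", "5"]
prediction, accuracy = answer_prior_baseline(reference, test)
print(prediction, accuracy)
```

A model that scores close to such a baseline is gaining little from the video itself, which is why the comparison is informative.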
Source: arXiv
Generated at: 2026-06-03 00:00:00 UTC




