arXiv

Moment-Video: Diagnosing Temporal Fidelity of Video MLLMs on Momentary Visual Events

June 2, 2026 · Xiaolin Liu, Yilun Zhu, Xiangyu Zhao, Xuehui Wang, Yan Li, Xin Li, Haoyu Cao, Xing Sun, Shaofeng Zhang, Xu Yang, Zhihang Zhong, Xue Yang

Title: Moment-Video: Assessing the Temporal Fidelity of Video MLLMs in Capturing Ephemeral Visual Events

Abstract:

Video multimodal large language models (MLLMs) have advanced considerably on general and long-form video understanding, but their ability to preserve brief, answer-critical visual evidence remains largely unexamined. Many practical questions hinge on momentary visual events, such as localized actions or state changes that may last only a few frames. Such evidence is easily missed by sparse frame sampling, discarded by visual-token compression, or diluted by coarse temporal aggregation. Once the evidence is lost, models often answer these questions incorrectly, because language-side reasoning alone cannot reliably compensate for missing visual information.
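To make the sparse-sampling failure mode concrete, the following is a minimal sketch of how often a short event falls entirely between uniformly sampled frames. It is a toy model, not the benchmark's evaluation protocol: the segment-center sampling scheme and the function names `uniform_sample_indices`, `event_is_sampled`, and `hit_rate` are assumptions introduced here for illustration.

```python
def uniform_sample_indices(total_frames: int, num_samples: int) -> list[int]:
    """Pick one frame index at the center of each of num_samples equal segments.

    Assumption: the model samples frames uniformly; real pipelines may differ.
    """
    if total_frames <= 0 or num_samples <= 0:
        return []
    num_samples = min(num_samples, total_frames)
    step = total_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]


def event_is_sampled(indices: list[int], event_start: int, event_len: int) -> bool:
    """Return True if at least one sampled frame lies inside the event window."""
    event_end = event_start + event_len  # exclusive
    return any(event_start <= idx < event_end for idx in indices)


def hit_rate(total_frames: int, num_samples: int, event_len: int) -> float:
    """Fraction of possible event start positions covered by at least one sample."""
    if not 0 < event_len <= total_frames:
        raise ValueError("event_len must be in 1..total_frames")
    indices = uniform_sample_indices(total_frames, num_samples)
    starts = range(total_frames - event_len + 1)
    hits = sum(event_is_sampled(indices, s, event_len) for s in starts)
    return hits / len(starts)


if __name__ == "__main__":
    # Hypothetical clip: 900 frames (e.g. 30 s at 30 fps), 5-frame event.
    for n in (8, 30, 180, 900):
        print(f"{n:4d} sampled frames -> hit rate {hit_rate(900, n, 5):.3f}")
```

In this toy setting, a 5-frame event in a 900-frame clip is covered for only about 17% of start positions when 30 frames are sampled, and coverage becomes certain only once the gap between samples is no longer than the event itself. Note that a sampled frame is a necessary condition, not a sufficient one: the sketch says nothing about whether the evidence survives token compression or temporal aggregation afterwards.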

To address this, we present Moment-Video, a benchmark designed to evaluate the temporal fidelity of video MLLMs through the lens of momentary visual event comprehension. Each question within the benchmark is anchored in an event that is localized, visually observable, and sensitive to sampling rates. These tasks compel models to detect, quantify, describe, or reason about transient evidence, rather than depending on persistent objects, broad scene context, or linguistic priors. The dataset comprises 1,000 human-verified video-QA pairs distributed across 7 domains and 25 fine-grained subcategories, spanning four distinct task types: Temporal Occurrence, Temporal Counting, Action Description, and Temporal Reasoning.

We evaluate 33 proprietary and open-source MLLMs on Moment-Video. The top-ranking model, Seed-2.0-Pro, attains an overall accuracy of only 39.6%, and most open-source models score below 25%, revealing a substantial gap in momentary visual event understanding. Diagnostic analyses show that denser frame sampling helps some models but does not remove the underlying bottleneck, and that longer videos further aggravate temporal localization difficulties. These findings indicate that current video MLLMs still lack the temporally faithful representations needed to capture, preserve, and use brief yet decisive visual evidence.
