MBench: A Comprehensive Benchmark on Memory Capability for Video World Models
Title: MBench: A Holistic Benchmark for Assessing Memory in Video World Models
Abstract:
While recent breakthroughs in video-based world models have showcased an exceptional capacity to generate high-fidelity visual sequences, a significant disconnect remains between producing visually plausible content and meeting the functional demands of a true world model. Specifically, maintaining a stable and logical internal state over extended periods remains a challenge. Current evaluation frameworks predominantly focus on visual aesthetics, motion smoothness, and alignment between text and video, often neglecting memory—the essential function that allows a world model to uphold consistency across long timeframes and intricate interactions.
To address this gap, we introduce MBench, a specialized benchmark designed to measure and assess the memory capabilities of video world models. We break down memory into three hierarchical, complementary dimensions: entity consistency, environment consistency, and causal consistency. These core areas are further subdivided into 12 quantifiable metrics to provide a thorough characterization of long-term memory retention. The benchmark relies on carefully curated, real-world long-form video data and utilizes both rule-based quantitative metrics and Vision-Language Models (VLMs) to ensure objective and comprehensive consistency evaluation. Our extensive testing of state-of-the-art video world models exposes profound systemic weaknesses in long-term state retention, offering the community a standardized evaluation tool and a clear pathway for future research advancement.
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC





