M$^3$Eval: Multi-Modal Memory Evaluation through Cognitively-Grounded Video Tasks
Title: M$^3$Eval: Evaluating Multi-Modal Memory via Cognitively-Inspired Video Tasks
As multi-modal models take on long-form video understanding, memory has become a pivotal capability. Despite significant progress on video datasets and benchmarks, current research largely prioritizes perception and reasoning and lacks a systematic assessment of memory. In particular, little work examines what information models retain, how faithfully they preserve it, and how robust that memory is under interference.
To bridge this gap, we present M$^3$Eval, the first comprehensive framework and benchmark for probing multiple dimensions of memory in multi-modal models. Drawing on principles from cognitive psychology, it uses carefully designed tasks that each isolate a specific memory component. Using M$^3$Eval, we ran extensive experiments on a range of representative multi-modal models and found consistent weaknesses and distinctive behaviors.
Our analysis yields four key findings: models struggle to maintain separate representations when processing concurrent video streams; their interference patterns diverge significantly from those of human memory; they anchor memory sources more accurately in spatial contexts than in temporal ones; and their symbolic memory is limited. The benchmark is intended as a resource for subsequent research. Our results highlight memory as a foundational but under-researched ability and offer insights for developing better memory mechanisms in multi-modal systems. The code and dataset are available at https://pku-value-lab.github.io/m3eval-homepage.
Source: arXiv
Generated at: 2026-06-04 00:00:00 UTC






