WorldMemArena: Evaluating Multimodal Agent Memory Through Action-World Interaction
Title: WorldMemArena: Evaluating Multimodal Agent Memory Through Action-World Interaction
Abstract:
As multimodal large language models are increasingly used as long-horizon agents, the role of memory expands beyond simple recall. Memory must now track a changing world, discard outdated information, and retrieve relevant evidence when a decision is required. Current benchmarks fall short in evaluating these capabilities: they typically test recall on static dialogues, collapse memory performance into a single end-of-task accuracy score, and reduce visual data to text captions. Consequently, it is difficult to pinpoint whether a failure stems from the writing, maintenance, retrieval, or application phase. This limitation is exacerbated by the emergence of agent harnesses that autonomously manage their own memory, as there is currently no standardized method for comparing these self-managing systems against hand-crafted pipelines.
To address these shortcomings, we introduce WorldMemArena, which conceptualizes multimodal agent memory as an Action-World Interaction Loop featuring an observable four-stage lifecycle. The benchmark comprises 400 multi-session multimodal tasks designed to evaluate two key dimensions: Lifelong Evolution (tracking changing personal and task states) and Agentic Execution (incorporating memory derived from real-world observations, actions, and feedback). Each task is annotated with gold-standard memory points, updates, distractors, and evidence chains, enabling precise, stage-level diagnosis.
WorldMemArena enables the first direct comparison between long-context models, manually engineered systems (such as RAG and external memory architectures), and harness-based memory agents. Our findings yield four insights: (1) stronger memory writing and storage do not necessarily translate into better end-task performance; (2) multimodal memory systems still struggle to make full use of visual evidence; (3) system performance is inconsistent across domains and tends to degrade on realistic agentic trajectories; and (4) while harness-based memory offers greater flexibility, it remains expensive to operate and less reliable than the other approaches.
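To make the idea of stage-level diagnosis concrete, the following is a minimal Python sketch of how gold annotations of the kind listed above (memory points, updates, distractors, evidence chains) could be used to score the writing, maintenance, retrieval, and application stages separately. It is an illustration under stated assumptions, not the benchmark's actual schema or metrics: the class names TaskAnnotation and AgentTrace, all field names, the diagnose function, and the scoring rules are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class TaskAnnotation:
    """Hypothetical gold annotation for one task (not the benchmark's real schema)."""
    memory_points: set[str]    # facts the agent is expected to write to memory
    updates: dict[str, str]    # stale fact -> fact that supersedes it
    distractors: set[str]      # plausible but irrelevant facts
    evidence_chain: list[str]  # facts needed to reach the gold answer
    answer: str


@dataclass
class AgentTrace:
    """Hypothetical record of what an agent did at each lifecycle stage."""
    written: set[str]    # facts written to memory
    store: set[str]      # memory contents after maintenance
    retrieved: set[str]  # facts retrieved at decision time
    answer: str


def diagnose(gold: TaskAnnotation, trace: AgentTrace) -> dict[str, float]:
    """Score each stage on its own so that a failure can be localized."""
    def fraction(hits: int, total: int) -> float:
        # An empty requirement counts as trivially satisfied.
        return hits / total if total else 1.0

    # An update counts as applied only if the new fact is stored
    # and the stale fact has been removed.
    applied = sum(
        1 for stale, fresh in gold.updates.items()
        if fresh in trace.store and stale not in trace.store
    )
    needed = set(gold.evidence_chain)
    return {
        "writing": fraction(len(gold.memory_points & trace.written),
                            len(gold.memory_points)),
        "maintenance": fraction(applied, len(gold.updates)),
        "retrieval": fraction(len(needed & trace.retrieved), len(needed)),
        "application": 1.0 if trace.answer == gold.answer else 0.0,
    }


# Toy example: the agent writes everything but never drops the stale fact.
gold = TaskAnnotation(
    memory_points={"lives in Paris", "lives in Berlin", "owns a red bike"},
    updates={"lives in Paris": "lives in Berlin"},
    distractors={"friend lives in Rome"},
    evidence_chain=["lives in Berlin"],
    answer="Berlin",
)
trace = AgentTrace(
    written={"lives in Paris", "lives in Berlin", "owns a red bike"},
    store={"lives in Paris", "lives in Berlin", "owns a red bike"},
    retrieved={"lives in Paris"},
    answer="Paris",
)
scores = diagnose(gold, trace)
```

In this toy case the writing score is perfect while maintenance, retrieval, and application all score zero, which is the kind of separation a single end-of-task accuracy number would hide.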
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC





