arXiv

WorldMemArena: Evaluating Multimodal Agent Memory Through Action-World Interaction

Abstract:

As multimodal large language models are increasingly used as long-horizon agents, the role of memory expands beyond simple recall. It must now track a dynamic world, discard outdated information, and retrieve relevant evidence when decisions are required. Current benchmarks fall short in evaluating these capabilities: they typically test recall on static dialogues, collapse memory performance into a single end-of-task accuracy score, and reduce visual data to text captions. Consequently, it is difficult to pinpoint whether a failure stems from the writing, maintenance, retrieval, or application phase. This limitation is exacerbated by the emergence of agent harnesses that autonomously manage their own memory, as there is currently no standardized method to compare these self-managing systems against hand-crafted pipelines.

To address these shortcomings, we introduce WorldMemArena, which conceptualizes multimodal agent memory as an Action-World Interaction Loop featuring an observable four-stage lifecycle. The benchmark comprises 400 multi-session multimodal tasks designed to evaluate two key dimensions: Lifelong Evolution (tracking changing personal and task states) and Agentic Execution (incorporating memory derived from real-world observations, actions, and feedback). Each task is annotated with gold-standard memory points, updates, distractors, and evidence chains, enabling precise, stage-level diagnosis.
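To make the idea of stage-level diagnosis concrete, the following is a minimal sketch of how gold annotations could be used to locate the first failing stage. It is not the benchmark's actual schema or scoring code: the class names `TaskAnnotation` and `SystemTrace`, their fields, the function `first_failing_stage`, and the pass/fail rules are all assumptions made for illustration. Only the four stage names and the annotation types (memory points, updates, distractors, evidence chains) come from the abstract.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TaskAnnotation:
    """Hypothetical gold annotation for one task (field names are assumptions)."""
    memory_points: set[str]    # facts the agent is expected to store at some point
    updates: dict[str, str]    # stale fact -> fact that supersedes it
    distractors: set[str]      # plausible but irrelevant items
    evidence_chain: list[str]  # facts needed to reach the final decision
    gold_answer: str


@dataclass
class SystemTrace:
    """Hypothetical record of what a memory system did on one task."""
    stored: set[str]     # contents of the memory store at decision time
    retrieved: set[str]  # items returned by retrieval for the decision
    answer: str          # final answer produced by the agent


def first_failing_stage(gold: TaskAnnotation, trace: SystemTrace) -> Optional[str]:
    """Return the first stage that fails under these assumed rules, or None."""
    stale = set(gold.updates)
    # Writing: every memory point that is still current must be in the store.
    if not (gold.memory_points - stale) <= trace.stored:
        return "writing"
    # Maintenance: superseded facts must have been removed from the store.
    if stale & trace.stored:
        return "maintenance"
    # Retrieval: the full evidence chain is retrieved and no distractor is.
    if not set(gold.evidence_chain) <= trace.retrieved or gold.distractors & trace.retrieved:
        return "retrieval"
    # Application: the retrieved evidence must lead to the gold answer.
    if trace.answer != gold.gold_answer:
        return "application"
    return None


gold = TaskAnnotation(
    memory_points={"key in drawer", "key in safe"},
    updates={"key in drawer": "key in safe"},
    distractors={"spare key lost"},
    evidence_chain=["key in safe"],
    gold_answer="safe",
)
trace = SystemTrace(
    stored={"key in drawer", "key in safe"},  # stale fact was never discarded
    retrieved={"key in safe"},
    answer="safe",
)
print(first_failing_stage(gold, trace))
```

In this toy trace the final answer is correct, yet the check reports a maintenance failure because the superseded fact is still stored. This is the kind of error a single end-of-task accuracy score would hide.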

WorldMemArena facilitates the first direct comparison between long-context models, manually engineered systems (such as RAG and external memory architectures), and harness-based memory agents. Our findings reveal four critical insights: (1) superior memory writing and storage capabilities do not necessarily correlate with improved performance; (2) multimodal memory systems continue to face challenges in fully leveraging visual evidence; (3) system performance is inconsistent across different domains and tends to degrade when handling realistic agentic trajectories; and (4) while harness-based memory offers greater flexibility, it remains expensive to operate and less reliable than other approaches.


Source: arXiv Generated at: 2026-06-02 00:00:00 UTC
