Beyond Static Dialogues: Benchmarking Realistic, Heterogeneous, and Evolving Long-Term Memory
Title: Moving Past Static Conversations: Evaluating Realistic, Diverse, and Dynamic Long-Term Memory Systems
Abstract
Current memory benchmarks for Large Language Models (LLMs) often lack long-term semantic consistency in the dialogue sessions they evaluate, and the personas they employ tend to be rigid and one-dimensional. Moreover, real-world interactions between users and assistants span a wider array of heterogeneous data streams, including emails and documents, which are largely absent from existing evaluations. These gaps significantly undermine the realism and efficacy of present-day assessment methods.
To overcome these challenges, we present RHELM (Realistic, Heterogeneous, and Evolving Long-term Memory). By leveraging carefully constructed user profiles and a novel LOOP (pLan-rOllout-evOlve-Prune) module, we generate realistic dialogues across varied interaction scenarios that feature dynamic temporal evolution and sustained long-term coherence. A key feature of this approach is the deep integration of these dialogues with heterogeneous external sources, which are synchronized with the user’s temporal event trajectory.
The resulting benchmark includes challenging question-answer pairs covering seven distinct inquiry types. Each question is mapped to at least one of 27 critical memory characteristics identified as essential but underexplored in prior research. Extensive experiments with full-context models, retrieval-augmented generation (RAG) techniques, and representative memory frameworks show that contemporary approaches still exhibit significant weaknesses in complex, real-world contexts, particularly in multi-source aggregation and contextual reasoning.
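As a rough illustration of the benchmark structure described above, the following minimal sketch shows one way a question-answer item could be represented and scored per inquiry type. It is not the authors' schema or evaluation code: the class name `QAItem`, the function `accuracy_by_type`, all field names, and the placeholder labels such as `type_a` and `c1` are hypothetical, since the abstract does not name the seven inquiry types or the 27 characteristics.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple


@dataclass(frozen=True)
class QAItem:
    """Hypothetical record for one benchmark question (field names are assumptions)."""
    question: str
    answer: str
    question_type: str                 # one of the seven inquiry types (labels unknown here)
    characteristics: Tuple[str, ...]   # one or more of the 27 memory characteristics

    def __post_init__(self) -> None:
        # The abstract states each question maps to at least one characteristic.
        if not self.characteristics:
            raise ValueError("a question must map to at least one memory characteristic")


def accuracy_by_type(items: Sequence[QAItem], correct: Sequence[bool]) -> Dict[str, float]:
    """Fraction of correctly answered questions, grouped by inquiry type."""
    if len(items) != len(correct):
        raise ValueError("items and correct must have the same length")
    totals: Dict[str, int] = defaultdict(int)
    hits: Dict[str, int] = defaultdict(int)
    for item, ok in zip(items, correct):
        totals[item.question_type] += 1
        hits[item.question_type] += int(ok)
    return {qtype: hits[qtype] / totals[qtype] for qtype in totals}


# Placeholder data; the type and characteristic labels are illustrative only.
items: List[QAItem] = [
    QAItem("q1", "a1", "type_a", ("c1",)),
    QAItem("q2", "a2", "type_a", ("c1", "c2")),
    QAItem("q3", "a3", "type_b", ("c2",)),
]
scores = accuracy_by_type(items, [True, False, True])
```

Grouping scores by inquiry type (and, analogously, by memory characteristic) is one way to surface the kind of per-category weaknesses the abstract reports, although the benchmark's actual metrics may differ.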
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC





