arXiv

MemoryDocDataSet: A Benchmark for Joint Conversational Memory and Long Document Reasoning

June 4, 2026 · Qiyang Xie, Jialun Wu, Xinjie He, Su Liu, Shuai Xiao, Zhiyuan Lin, Weikai Zhou

Abstract

As AI systems evolve, they face the dual challenge of managing multi-session conversation histories and performing deep reading comprehension on long texts. No existing benchmark assesses these capabilities in tandem. To address this gap, we present MemoryDocDataSet, a synthetic evaluation suite comprising 50 micro-worlds and 1,000 question-answer pairs. Each micro-world integrates 3-5 distinct personas, a temporal event graph covering months of activity, and 3-5 substantial real-world documents (each ranging from 20,000 to 50,000 tokens, drawn from the Caselaw Access Project). These elements are tied together through multi-session conversations and 20 QA pairs distributed across five reasoning categories.
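The composition of one micro-world can be sketched as a small data structure. This is a minimal illustration, not the released schema: every class and field name (`MicroWorld`, `Document`, `QAPair`, `token_count`, and so on) is hypothetical, and the checks encode only the ranges stated above.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Document:
    title: str
    token_count: int  # assumed to be precomputed with some tokenizer


@dataclass
class QAPair:
    question: str
    answer: str
    reasoning_category: str  # one of five categories; their names are not given here
    source_tag: str          # e.g. "Hybrid" or "Doc-only"


@dataclass
class MicroWorld:
    personas: List[str]
    events: List[Tuple[str, str]]  # (ISO date, description); graph edges omitted for brevity
    documents: List[Document]
    sessions: List[List[str]]      # each session is a list of utterances
    qa_pairs: List[QAPair]

    def problems(self) -> List[str]:
        """Return violations of the ranges stated in the abstract."""
        found = []
        if not 3 <= len(self.personas) <= 5:
            found.append("expected 3-5 personas")
        if not 3 <= len(self.documents) <= 5:
            found.append("expected 3-5 documents")
        if any(not 20_000 <= d.token_count <= 50_000 for d in self.documents):
            found.append("document outside 20,000-50,000 tokens")
        if len(self.qa_pairs) != 20:
            found.append("expected 20 QA pairs")
        return found
```

With 50 such micro-worlds at 20 QA pairs each, the total comes to the 1,000 pairs reported.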

The dataset's hallmark is the "Hybrid" source tag, which applies to 75.1% of all questions. These questions demand that the system first traverse the conversation history to pinpoint the relevant document before extracting the answer from within it. We validated the dataset's quality using a prompt-sensitivity self-consistency analysis with an LLM-as-judge, achieving a median Cohen's $\kappa$ of 0.634 across all 50 micro-worlds.
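For reference, Cohen's $\kappa$ compares observed agreement $p_o$ with chance agreement $p_e$ as $\kappa = (p_o - p_e) / (1 - p_e)$. The sketch below computes it for two label sequences and takes the median over micro-worlds. It assumes that the self-consistency analysis yields two sets of categorical judge verdicts per micro-world (for example, from two prompt variants); that setup and the function names are assumptions, not details taken from the paper.

```python
from collections import Counter
from statistics import median
from typing import Dict, Hashable, Sequence, Tuple


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two equally long sequences of categorical labels."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(a)
    # Observed agreement: fraction of items given the same label in both runs.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: sum over labels of the product of marginal frequencies.
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if p_e == 1.0:
        # Both runs used one identical label throughout; kappa is undefined.
        # Returning 1.0 here is a convention chosen for this sketch.
        return 1.0
    return (p_o - p_e) / (1.0 - p_e)


def median_kappa(runs: Dict[str, Tuple[Sequence[Hashable], Sequence[Hashable]]]) -> float:
    """Median kappa over worlds; `runs` maps a world id to its two verdict lists."""
    return median(cohens_kappa(a, b) for a, b in runs.values())
```

Taking the median rather than the mean keeps a few micro-worlds with unusually low or high agreement from dominating the summary.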

We tested six baseline configurations, including approaches utilizing truncated context, long-context LLMs, retrieval-augmented generation (RAG), and memory systems. The top-performing baseline, RAG-Both, reached an overall F1 score of 0.358 and a score of 0.342 on Hybrid questions. Notably, Document-only retrieval (RAG-Doc) performed well on Doc-only questions (0.453) but collapsed to 0.267 on Hybrid questions. This disparity highlights a significant joint-retrieval gap, underscoring the need for architectures that unify conversational memory with long-document navigation. We have publicly released the dataset, the generation pipeline, and all baseline implementations.
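The abstract does not say which F1 variant is used. A common choice for free-text QA is token-overlap F1, sketched below under that assumption; `token_f1` and its whitespace tokenization are illustrative and not the paper's scorer. The final lines work out the size of the RAG-Doc drop from the figures reported above.

```python
from collections import Counter


def token_f1(prediction: str, gold: str) -> float:
    """Token-overlap F1 between a predicted and a gold answer (assumed metric)."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    if not pred_tokens or not gold_tokens:
        # Two empty answers count as a match; one empty answer as a miss.
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most as often as it
    # appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


# Figures reported above for RAG-Doc.
rag_doc_doc_only = 0.453
rag_doc_hybrid = 0.267
absolute_drop = rag_doc_doc_only - rag_doc_hybrid  # about 0.186
relative_drop = absolute_drop / rag_doc_doc_only   # about 41%
```

Under these figures, RAG-Doc loses roughly two fifths of its Doc-only score once a question also requires locating the right document through the conversation history.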
