arXiv

MemoryDocDataSet: A Benchmark for Joint Conversational Memory and Long Document Reasoning

Abstract

As AI systems evolve, they face two challenges at once: managing multi-session conversation histories and performing deep reading comprehension over long texts. No existing benchmark assesses these capabilities together. To address this gap, we present MemoryDocDataSet, a synthetic evaluation suite comprising 50 micro-worlds and 1,000 question-answer (QA) pairs. Each micro-world integrates 3-5 distinct personas, a temporal event graph covering months of activity, and 3-5 long real-world documents (20,000 to 50,000 tokens each, drawn from the Caselaw Access Project). These elements are tied together through multi-session conversations and 20 QA pairs per micro-world, distributed across five reasoning categories.
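To make the composition concrete, the following is a minimal sketch of how one micro-world could be represented and checked against the counts stated above. The class and field names (`MicroWorld`, `QAPair`, `source_tag`, and so on) are hypothetical and are not taken from the released dataset; only the numeric ranges come from the abstract.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class QAPair:
    question: str
    answer: str
    source_tag: str  # e.g. "Hybrid" or "Doc-only"; hypothetical field name
    category: int    # index 0..4 into the five reasoning categories (names not given)


@dataclass
class MicroWorld:
    # All field names are illustrative assumptions, not the dataset's schema.
    personas: List[str]
    events: List[Tuple[str, str]]   # (date, description) stand-in for the event graph
    documents: List[str]            # long document texts
    sessions: List[List[str]]       # multi-session conversations, one list per session
    qa_pairs: List[QAPair] = field(default_factory=list)

    def matches_stated_counts(self) -> bool:
        """Check the per-micro-world counts given in the abstract."""
        return (
            3 <= len(self.personas) <= 5
            and 3 <= len(self.documents) <= 5
            and len(self.qa_pairs) == 20
            and all(0 <= qa.category < 5 for qa in self.qa_pairs)
        )


NUM_MICRO_WORLDS = 50
QA_PER_WORLD = 20
TOTAL_QA = NUM_MICRO_WORLDS * QA_PER_WORLD  # 1,000, matching the abstract
```

The totals are consistent: 50 micro-worlds with 20 QA pairs each give the stated 1,000 pairs.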

The dataset’s defining feature is the "Hybrid" source tag, which applies to 75.1% of all questions. A Hybrid question requires the system to first search the conversation history to identify the relevant document and then extract the answer from that document. We validated the dataset's quality using a prompt-sensitivity self-consistency analysis with an LLM-as-judge, achieving a median Cohen's $\kappa$ of 0.634 across all 50 micro-worlds.
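For reference, Cohen's $\kappa$ compares observed agreement $p_o$ with chance agreement $p_e$ as $\kappa = (p_o - p_e) / (1 - p_e)$. The sketch below computes it for two label sequences and takes a median over groups. It assumes the two sequences are verdicts from the same judge under two prompt variants; that reading of "prompt-sensitivity self-consistency" is an assumption, and the labels shown are made-up illustrations, not data from the paper.

```python
from collections import Counter
from statistics import median
from typing import Hashable, Sequence


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of the two marginal label frequencies, summed over labels.
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if p_e == 1.0:
        # Both raters used a single identical label; agreement is trivially perfect.
        return 1.0
    return (p_o - p_e) / (1.0 - p_e)


# Illustrative (made-up) judge verdicts under two prompt variants, per micro-world.
per_world_verdicts = [
    ([1, 1, 0, 0], [1, 1, 0, 0]),
    ([1, 1, 0, 0], [1, 0, 1, 0]),
    ([1, 1, 0, 0, 1, 0, 1, 1, 0, 1], [1, 0, 0, 0, 1, 0, 1, 1, 1, 1]),
]
median_kappa = median(cohens_kappa(a, b) for a, b in per_world_verdicts)
```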

We tested six baseline configurations, including approaches that use truncated context, long-context LLMs, retrieval-augmented generation (RAG), and memory systems. The top-performing baseline, RAG-Both, reached an overall F1 score of 0.358 and an F1 score of 0.342 on Hybrid questions. Notably, document-only retrieval (RAG-Doc) performed well on Doc-only questions (0.453) but dropped to 0.267 on Hybrid questions. This disparity reveals a significant joint-retrieval gap and underscores the need for architectures that unify conversational memory with long-document navigation. We have publicly released the dataset, the generation pipeline, and all baseline implementations.
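The abstract does not say which F1 variant is reported. As one common possibility, the sketch below shows a token-overlap F1 of the kind often used for free-text QA, followed by the RAG-Doc gap computed from the two scores quoted above. The function name `token_f1` and the whitespace tokenization are assumptions for illustration, not the paper's evaluation code.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted and a reference answer.

    Assumption: lowercase whitespace tokenization; the paper's metric may differ.
    """
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Scores quoted in the abstract for RAG-Doc.
rag_doc_doc_only = 0.453
rag_doc_hybrid = 0.267
rag_doc_gap = round(rag_doc_doc_only - rag_doc_hybrid, 3)  # 0.186 absolute F1
```

Under these numbers, RAG-Doc loses 0.186 absolute F1 when moving from Doc-only to Hybrid questions.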


Source: arXiv Generated at: 2026-06-04 00:00:00 UTC
