Can Structural Cues Save LLMs? Evaluating Language Models in Massive Document Streams
Abstract: Although assessing language models within streaming contexts is vital, this area remains largely under-researched. Current benchmarks typically concentrate on isolated complex occurrences or supply pre-curated inputs for individual queries, thereby failing to test model resilience against the conflicts generated when multiple simultaneous events are interwoven in a single document stream. To address this gap, we present StreamBench, a new benchmark derived from significant news stories from 2016 and 2025. This dataset encompasses 605 events and 15,354 documents, structured around three primary tasks: Topic Clustering, Temporal Question Answering, and Summarization. To identify the root causes of model failures, we analyze performance metrics both with and without structural cues—mechanisms that organize essential facts according to specific events. Our results indicate that these structural cues enhance performance in clustering by as much as +4.37% and in temporal QA by up to +9.63%, thereby assisting models in pinpointing relevant data and distinguishing between separate events. Although temporal reasoning continues to pose a fundamental challenge for existing LLMs, the uniform improvements observed across all tasks suggest that incorporating structural cues represents a promising avenue for future advancements in handling massive document streams.
Source: arXiv
Generated at: 2026-06-03 00:00:00 UTC





