arXiv

100-LongBench: Are de facto Long-Context Benchmarks Literally Evaluating Long-Context Ability?

Abstract: The ability to handle long contexts is widely regarded as a critical feature of Large Language Models (LLMs), because it lets users hand off tasks that were previously labor-intensive (for example, extracting answers from lengthy documents rather than relying on direct queries). However, current benchmarks that evaluate long-context ability on real-world tasks suffer from two significant limitations. First, established benchmarks such as LongBench often lack metrics that distinguish a model's baseline knowledge from its actual long-context performance, which complicates fair cross-model comparison. Second, these benchmarks typically use fixed input lengths, which restricts their applicability across models and fails to reveal the length at which a model's performance begins to degrade. To overcome these limitations, we present a length-controllable long-context benchmark together with a novel metric designed to separate baseline knowledge from genuine long-context proficiency. Our experiments confirm that this approach evaluates LLMs more effectively.
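The abstract does not define the proposed metric, so the following is only a minimal sketch of the general idea of separating baseline ability from long-context ability, under stated assumptions. It assumes a model is scored on the same task at several controlled context lengths, treats the mean score at the shortest lengths as the baseline, and reports each longer length as a relative change from that baseline. The function names, the choice of baseline lengths, the 10% tolerance, and the scores are all hypothetical and are not taken from the paper.

```python
from statistics import mean


def relative_long_context_scores(scores_by_length, base_lengths):
    """Express each long-context score as a relative change from a short-context baseline.

    scores_by_length: dict mapping context length (in tokens) to a task score.
    base_lengths: set of short lengths whose mean score is treated as the baseline
                  (an assumption of this sketch, not the paper's definition).
    """
    base = mean(scores_by_length[n] for n in base_lengths)
    if base <= 0:
        raise ValueError("baseline score must be positive")
    return {
        n: (score - base) / base
        for n, score in sorted(scores_by_length.items())
        if n not in base_lengths
    }


def first_degradation_length(relative, tolerance=0.10):
    """Return the shortest length whose relative drop exceeds `tolerance`, or None."""
    for n, change in sorted(relative.items()):
        if change < -tolerance:
            return n
    return None


# Made-up scores for one hypothetical model at five controlled lengths.
scores = {2_000: 50.0, 4_000: 50.0, 8_000: 45.0, 16_000: 40.0, 32_000: 25.0}
relative = relative_long_context_scores(scores, base_lengths={2_000, 4_000})
degradation_point = first_degradation_length(relative)
```

Normalizing by the baseline means that two models with different raw scores can be compared on how much each one loses as the input grows, and sweeping the length shows where the loss begins, which a single fixed input length cannot show.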


Source: arXiv Generated at: 2026-06-04 00:00:00 UTC
