arXiv

100-LongBench: Are de facto Long-Context Benchmarks Literally Evaluating Long-Context Ability?

June 4, 2026 · Wang Yang, Hongye Jin, Shaochen Zhong, Song Jiang, Qifan Wang, Vipin Chaudhary, Xiaotian Han

Title: 100-LongBench: Do De Facto Long-Context Benchmarks Actually Measure Long-Context Capability?

Abstract: Long-context capability is widely regarded as one of the most important abilities of Large Language Models (LLMs), because it lets users offload tasks that were previously labor-intensive, such as asking a model about a lengthy document instead of reading through it to find the answer. However, current long-context evaluation benchmarks built on real-world tasks have two significant limitations. First, established benchmarks like LongBench often lack metrics that distinguish a model’s baseline knowledge from its actual long-context performance, which makes cross-model comparisons unclear. Second, these benchmarks typically use fixed input lengths, which limits their applicability across different models and fails to reveal the length at which a model’s performance begins to degrade. To address these issues, we present a length-controllable long-context benchmark and a novel metric designed to separate baseline knowledge from genuine long-context capability. Our experiments indicate that this approach is more effective for evaluating LLMs.


Bloomberg

Planet Labs Raises Outlook as War Drives Earth-Imaging Demand

June 4, 2026

Planet Labs raised its financial forecast as geopolitical conflicts drive surging demand for high-resolution satellite i...

TechCrunch

Startup Battlefield is returning to Australia — here’s what happened the last time we came to Sydney

June 4, 2026

Startup Battlefield returns to Sydney on August 19, 2026, partnering with Stripe. Ten finalists pitch for $10,000 in cre...

Bloomberg

IBM, AT&T Accused by Whistleblower of Covering Up Foreign Hacks

June 4, 2026

A whistleblower alleges IBM and AT&T concealed foreign cyberattacks. This claim contrasts with unrelated news about Micr...

Bloomberg

Verizon CEO Sees AI Coming for Customer Service Jobs

June 4, 2026

Verizon’s CEO predicts AI will disrupt customer service jobs, as automation reshapes support operations and alters tradi...

Bloomberg

Verizon CEO Sees AI Replacing Large Share of Customer Service

June 4, 2026

Verizon CEO Dan Schulman predicts AI will replace a large share of customer service roles. This outlook was shared at th...

Bloomberg

Android's Samat on Integrating AI into the Ecosystem

June 4, 2026

Samat discusses integrating AI into the Android ecosystem. The source text is missing, so no specific details can be sum...
