100-LongBench: Are de facto Long-Context Benchmarks Literally Evaluating Long-Context Ability?
Title: 100-LongBench: Do De Facto Long-Context Benchmarks Actually Measure Long-Context Capability?
Abstract: The ability to handle long contexts is widely regarded as a critical feature of Large Language Models (LLMs), because it lets users offload tasks that were previously labor-intensive, such as finding answers in a lengthy document by asking the model rather than reading the document in full. However, current long-context benchmarks built on real-world tasks suffer from two significant limitations. First, established benchmarks such as LongBench lack metrics that distinguish a model's baseline knowledge from its actual long-context performance, which makes cross-model comparisons unfair. Second, these benchmarks typically use fixed input lengths, which limits their applicability across models with different context windows and hides the length at which a model's performance begins to degrade. To overcome these limitations, we present a length-controllable long-context benchmark together with a novel metric designed to separate baseline knowledge from genuine long-context ability. Our experiments indicate that this approach evaluates the long-context ability of LLMs more effectively.
Source: arXiv
Generated at: 2026-06-04 00:00:00 UTC





