OckBench: Measuring the Efficiency of LLM Reasoning
Title: OckBench: Assessing LLM Reasoning Efficiency
Abstract:
The emergence of large language models (LLMs) like GPT-5 and Gemini 3 has significantly advanced the capabilities of automated reasoning and code generation. However, existing evaluation metrics predominantly focus on output quality and accuracy, overlooking a vital aspect: the efficiency of token consumption. In real-world applications, token efficiency varies widely. Even when models achieve comparable accuracy on identical problems, their token usage can differ by as much as 5.0$\times$, revealing substantial disparities in reasoning efficiency. This inconsistency points to significant inefficiencies and underscores the urgent need for a standardized benchmark to measure token efficiency gaps.
To address this, we present OckBench, the first benchmark designed to evaluate both accuracy and token efficiency across reasoning and coding tasks. Our analysis demonstrates that current models are not yet optimized for token efficiency, resulting in unnecessarily high serving costs and increased latency. These insights offer a clear path for the research community to enhance latent reasoning abilities and improve token efficiency. Ultimately, we advocate for a fundamental shift in evaluation standards, asserting that token usage should not exceed what is strictly necessary. The OckBench resources are accessible at https://ockbench.github.io/.
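As a rough illustration of reporting accuracy and token usage side by side, the sketch below compares two hypothetical models on the same problems. It is not OckBench's actual metric or code: the `Result` record, the `summarize` helper, and all numbers are assumptions made up for this example.

```python
from dataclasses import dataclass


@dataclass
class Result:
    correct: bool        # whether the model's final answer was right
    output_tokens: int   # tokens the model generated for this problem


def summarize(results):
    """Return (accuracy, mean output tokens per problem)."""
    n = len(results)
    accuracy = sum(r.correct for r in results) / n
    mean_tokens = sum(r.output_tokens for r in results) / n
    return accuracy, mean_tokens


# Made-up results for two hypothetical models on the same four problems.
model_a = [Result(True, 400), Result(True, 600), Result(False, 500), Result(True, 500)]
model_b = [Result(True, 2000), Result(True, 3000), Result(False, 2500), Result(True, 2500)]

acc_a, tok_a = summarize(model_a)
acc_b, tok_b = summarize(model_b)
token_ratio = tok_b / tok_a  # how many times more tokens B uses than A

print(f"accuracy: A={acc_a:.2f}, B={acc_b:.2f}; token ratio B/A = {token_ratio:.1f}x")
# prints: accuracy: A=0.75, B=0.75; token ratio B/A = 5.0x
```

In this toy setup the two models are indistinguishable by accuracy alone, and the gap only becomes visible once mean token usage is reported alongside it.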
Source: arXiv
Generated at: 2026-06-04 00:00:00 UTC




