Not All Errors Are Equal: Consequence-Aware Reasoning Compute Allocation
Abstract: Contemporary reasoning models can allocate varying amounts of test-time compute (thinking tokens, model invocations, and overall compute budgets) across tasks. Existing approaches typically allocate by predicted task difficulty, directing additional compute to where accuracy gains are expected. This implicitly assumes that all errors carry equal weight, since standard accuracy metrics count every task equally. The assumption fails in deployment: a typo in a log entry and a database migration that corrupts production data each register as a single benchmark failure, yet their operational costs differ drastically.
To address this discrepancy, we introduce consequence-aware test-time compute allocation. Rather than relying solely on difficulty predictions, our method uses a lightweight predictor to estimate, from the issue description, the cost of resolving a task incorrectly. The scheduler then assigns higher-consequence tasks to larger compute tiers or larger thinking budgets while holding the total budget constant. We run our primary experiments on SWE-bench Lite and assess cross-dataset performance on Multi-SWE-bench mini, for a total of 700 software engineering tasks.
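The allocation step described above can be illustrated with a minimal sketch. It is not the authors' scheduler: the tier names, tier costs, the `consequence` field, and the greedy upgrade rule are all assumptions made for illustration. Every task starts on the cheapest tier, and the leftover budget is spent upgrading tasks in order of predicted consequence, so the total never exceeds the fixed budget.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    task_id: str
    consequence: float  # hypothetical predicted cost of an incorrect resolution


# Hypothetical compute tiers as (name, cost in arbitrary budget units), cheapest first.
TIERS = [("small", 1), ("medium", 3), ("large", 8)]


def allocate(tasks, total_budget):
    """Greedy sketch: upgrade the highest-consequence tasks first under a fixed budget."""
    base_name, base_cost = TIERS[0]
    if base_cost * len(tasks) > total_budget:
        raise ValueError("budget cannot cover the cheapest tier for every task")

    # Every task is guaranteed the cheapest tier.
    assignment = {t.task_id: base_name for t in tasks}
    remaining = total_budget - base_cost * len(tasks)

    # Visit tasks from highest to lowest predicted consequence and give each
    # the most expensive tier that the remaining budget still allows.
    for task in sorted(tasks, key=lambda t: t.consequence, reverse=True):
        for name, cost in reversed(TIERS[1:]):
            extra = cost - base_cost
            if extra <= remaining:
                assignment[task.task_id] = name
                remaining -= extra
                break
    return assignment


tasks = [Task("typo", 0.1), Task("migration", 9.0), Task("refactor", 2.0)]
plan = allocate(tasks, total_budget=12)
```

With these toy numbers, the migration task receives the large tier, the refactor the medium tier, and the typo stays on the small tier, using exactly 12 budget units. A difficulty-aware router would use the same loop with a difficulty score as the sort key, which is what makes the two policies comparable at equal compute.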
Our findings indicate that task consequence and difficulty are largely orthogonal across annotations, and that current thinking models do not align their compute allocation with task consequence. Notably, our predictor, which uses only the issue text, never misclassified a high-consequence task as low-consequence across the 300 SWE-bench tasks. Compared with difficulty-aware routing under the same compute budget, our consequence-aware scheduler reduces cost-weighted loss by 22% to 33%. The priority-aware variant, which allocates compute by per-task cost adjusted by marginal utility signals, exceeds a 30% reduction, and its practical, predictor-driven implementation retains more than 90% of the gain achieved by an oracle.
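To make the headline metric concrete, the sketch below assumes cost-weighted loss is the sum of per-task costs over the tasks a system fails. The abstract does not give the exact definition, so this formula, the function names, and the toy numbers are illustrative assumptions and not results from the paper.

```python
def cost_weighted_loss(costs, solved):
    """Assumed definition: total cost of the tasks that were not solved."""
    return sum(cost for cost, ok in zip(costs, solved) if not ok)


def relative_reduction(baseline_loss, candidate_loss):
    """Fraction by which the candidate lowers the baseline loss."""
    return (baseline_loss - candidate_loss) / baseline_loss


# Toy data: three tasks with hypothetical consequence costs.
costs = [1.0, 10.0, 2.0]
baseline_solved = [True, False, True]   # fails the expensive task
candidate_solved = [False, True, True]  # fails the cheap task

baseline_loss = cost_weighted_loss(costs, baseline_solved)
candidate_loss = cost_weighted_loss(costs, candidate_solved)
reduction = relative_reduction(baseline_loss, candidate_loss)
```

Both toy systems solve two of three tasks, so plain accuracy cannot tell them apart, while the assumed cost-weighted loss differs by a factor of ten (10.0 versus 1.0).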
Source: arXiv
Generated at: 2026-06-04 00:00:00 UTC




