Failed Reasoning Traces Tell You What Is Fixable (But Not by Reading Them)
Abstract
When a post-trained language model fails a reasoning task, the standard test-time scaling response is to spend more compute on additional attempts and ignore the failed trajectories. We argue that this practice throws away a useful signal. Some failures stem from sampling variance and can be fixed by drawing more samples, whereas others are structural and persist no matter how much budget is added. We propose that failed traces carry a "recoverability structure": an inference-time indicator of which interventions can rescue a given failure. By analyzing the distributional signatures of failed rollouts rather than their textual content, we derive three trajectory-level features based on the available intervention structure. These features map the failure landscape by clustering failures into distinct, stable regimes. This method achieves $84.3{\pm}4.3\%$ accuracy, outperforming a majority-class baseline by $20\%$. It also enables a training-free routing mechanism that improves rescue rates by $12.2\%$ on the Steerable-Hard subset, a deployment-relevant category where simple retries fail but bounded interventions are available. Two cross-family probes confirm the robustness of the features and the routing rule. Together, the three features turn discarded failed traces into diagnostic tools for test-time routing and post-training analysis, without requiring access to weights or training-time data.
Source: arXiv
Generated at: 2026-06-04 00:00:00 UTC




