An Enigma of Artificial Reason: Investigating the Production-Evaluation Gap in Large Reasoning Models
Abstract
Research on human cognition suggests that people are generally better at evaluating reasoning than at producing it from scratch. Large Reasoning Models (LRMs), by contrast, are optimized to produce long chains of reasoning to solve complex problems. This raises the question: how well do LRMs evaluate reasoning? To investigate, we use the Valid-Answer-Invalid-Reasoning (VAIR) dataset, which consists of mathematical problems whose solutions reach the correct final answer but contain minor logical flaws in the derivation. This dataset lets us isolate the evaluation of reasoning from the difficulty of producing it.
Our analysis reveals a significant production-evaluation discrepancy in LRMs, a stark contrast to human performance where the gap in grading versus solving these specific problems is a mere 6%. Frontier LRMs achieve evaluation scores as low as 48%, even though their ability to produce correct solutions remains nearly flawless. To understand this anomaly, we employed chain-of-thought (CoT) analysis, uncovering evidence of answer confirmation bias. Rather than meticulously validating each logical step, LRMs tend to generate a solution and then verify it against the known correct answer. Consequently, the models often invent rationalizations to justify anomalous reasoning steps.
These observations are supported by linear probes, which indicate that while LRM activations do encode some representation of valid logic, they do not robustly identify VAIR solutions as incorrect. Furthermore, causal patching experiments targeting the representations of the final answer demonstrate that the validity of the answer itself drives the models' confirmation bias, as manipulating these representations directly alters both the model's verdicts and its internal activations. These results highlight a critical deficiency in current reasoning training methodologies, which encourage LRMs to construct and validate reasoning toward correct outcomes but fail to equip them with the ability to rigorously assess the underlying logic.
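As an illustration of the production-evaluation gap described above, the following is a minimal sketch, assuming the gap is simply production accuracy minus evaluation accuracy on the same problems. The abstract does not state the exact formula, so this definition, the function names, and the toy data are all hypothetical and are not results from the paper.

```python
from typing import Sequence


def accuracy(predictions: Sequence[bool], labels: Sequence[bool]) -> float:
    """Fraction of predictions that match the reference labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if len(labels) == 0:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def production_evaluation_gap(
    solved_correctly: Sequence[bool],
    verdicts: Sequence[bool],
    solution_is_valid: Sequence[bool],
) -> float:
    """Assumed definition: production accuracy minus evaluation accuracy.

    solved_correctly  -- whether the solver produced the correct final answer
    verdicts          -- the grader's judgment of each shown solution (True = "valid")
    solution_is_valid -- ground truth for each shown solution's reasoning
    """
    # Production accuracy: share of problems solved correctly.
    production_acc = accuracy(solved_correctly, [True] * len(solved_correctly))
    # Evaluation accuracy: share of verdicts that match the ground truth.
    evaluation_acc = accuracy(verdicts, solution_is_valid)
    return production_acc - evaluation_acc


# Hypothetical toy data (not from the paper): four VAIR-style items, each with
# a correct final answer but flawed reasoning, so the right verdict is False.
solved_correctly = [True, True, True, True]
solution_is_valid = [False, False, False, False]
verdicts = [True, False, True, False]  # the grader wrongly accepts two of four

gap = production_evaluation_gap(solved_correctly, verdicts, solution_is_valid)
print(f"production-evaluation gap: {gap:.2f}")
```

Under this assumed definition, a grader that accepts flawed solutions because the final answer is correct shows a large positive gap even when it solves every problem itself.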
Source: arXiv Generated at: 2026-06-02 00:00:00 UTC




