Good Reasoning Makes Good Demonstrations: Implicit Reasoning Quality Supervision via In-Context Reinforcement Learning
Title: High-Quality Reasoning Yields Superior Demonstrations: Supervising Implicit Reasoning Quality Through In-Context Reinforcement Learning
Abstract:
While Reinforcement Learning with Verifiable Rewards (RLVR) enhances the reasoning capabilities of large language models, it inherently treats all correct outcomes as equivalent. This approach risks reinforcing suboptimal reasoning paths that happen to reach the right answer by luck. We posit that \emph{superior reasoning generates superior demonstrations}: high-caliber solutions function as more potent in-context examples compared to their lower-quality counterparts. We define this instructional efficacy as \textbf{Demonstration Utility}. Our findings indicate that the policy model’s intrinsic in-context learning capacity offers an efficient mechanism for quantifying this utility, resulting in a quality metric we call \textbf{Evidence Gain}. To integrate this metric into the training process, we propose \textbf{In-Context RLVR}, a method that inserts demonstrations prior to each rollout. From a theoretical standpoint, we demonstrate that this straightforward adjustment to the input effectively reweights rewards by a factor roughly proportional to Evidence Gain, thereby prioritizing high-quality traces without incurring expensive computational overhead. Empirical evaluations on mathematical reasoning benchmarks reveal that our approach consistently outperforms standard RLVR baselines in both accuracy and the overall quality of reasoning. The associated code and datasets can be accessed at https://github.com/Mithas-114/IC-DAPO.
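To make the input adjustment concrete, the following minimal Python sketch shows one way a demonstration could be placed ahead of a question before a rollout is sampled. Everything in it is an assumption made for illustration: the `Demonstration` type, the prompt template, and the helper names do not come from the abstract or the linked repository. The function `log_likelihood_gain` is only a hypothetical proxy for how much a demonstration helps, and should not be read as the paper's definition of Evidence Gain.

```python
import math
from dataclasses import dataclass


@dataclass
class Demonstration:
    """A worked example: a question paired with a full reasoning trace (hypothetical type)."""
    question: str
    solution: str


def build_in_context_prompt(demo: Demonstration, question: str) -> str:
    """Prepend one demonstration to the target question.

    The template is an assumption; the abstract only states that
    demonstrations are inserted prior to each rollout.
    """
    return (
        f"Example problem:\n{demo.question}\n"
        f"Example solution:\n{demo.solution}\n\n"
        f"Problem:\n{question}\nSolution:\n"
    )


def log_likelihood_gain(logp_with_demo: float, logp_without_demo: float) -> float:
    """Hypothetical proxy for demonstration quality.

    Measures how much a demonstration raises the log-probability the model
    assigns to a correct answer. This is an illustrative assumption, not the
    paper's Evidence Gain formula.
    """
    return logp_with_demo - logp_without_demo


demo = Demonstration(
    question="What is 12 * 11?",
    solution="12 * 11 = 12 * 10 + 12 = 132.",
)
prompt = build_in_context_prompt(demo, "What is 14 * 11?")

# Toy numbers: the correct answer becomes twice as likely with the demonstration.
gain = log_likelihood_gain(math.log(0.6), math.log(0.3))
```

In this toy setup a better demonstration is one that yields a larger gain on held-out questions. Per the abstract, the proposed method does not compute such a score explicitly during training; the reweighting arises from the modified input itself.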
Source: arXiv
Generated at: 2026-06-04 00:00:00 UTC






