arXiv

Good Reasoning Makes Good Demonstrations: Implicit Reasoning Quality Supervision via In-Context Reinforcement Learning

Title: High-Quality Reasoning Yields Superior Demonstrations: Supervising Implicit Reasoning Quality Through In-Context Reinforcement Learning

Abstract:

While Reinforcement Learning with Verifiable Rewards (RLVR) enhances the reasoning capabilities of large language models, it inherently treats all correct outcomes as equivalent. This approach risks reinforcing suboptimal reasoning paths that happen to reach the right answer by luck. We posit that \emph{superior reasoning generates superior demonstrations}: high-caliber solutions function as more potent in-context examples compared to their lower-quality counterparts. We define this instructional efficacy as \textbf{Demonstration Utility}. Our findings indicate that the policy model’s intrinsic in-context learning capacity offers an efficient mechanism for quantifying this utility, resulting in a quality metric we call \textbf{Evidence Gain}. To integrate this metric into the training process, we propose \textbf{In-Context RLVR}, a method that inserts demonstrations prior to each rollout. From a theoretical standpoint, we demonstrate that this straightforward adjustment to the input effectively reweights rewards by a factor roughly proportional to Evidence Gain, thereby prioritizing high-quality traces without incurring expensive computational overhead. Empirical evaluations on mathematical reasoning benchmarks reveal that our approach consistently outperforms standard RLVR baselines in both accuracy and the overall quality of reasoning. The associated code and datasets can be accessed at https://github.com/Mithas-114/IC-DAPO.
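As a rough illustration of the two ideas in the abstract, the sketch below prepends demonstrations to a rollout prompt and scores a demonstration by how much it raises the policy's log-likelihood of a correct answer on a probe problem. This is only one plausible reading: the abstract does not give the formal definition of Evidence Gain, so `evidence_gain_proxy`, the prompt template, and the `toy_logprob` scorer are all hypothetical stand-ins, not the authors' implementation.

```python
from typing import Callable, Sequence, Tuple

# Assumed interface: returns log p(answer | prompt) under the policy model.
LogProbFn = Callable[[str, str], float]
Demo = Tuple[str, str]  # (problem, worked solution)


def build_in_context_prompt(demos: Sequence[Demo], question: str) -> str:
    """Insert demonstrations before the target question (hypothetical template)."""
    parts = [f"Problem: {q}\nSolution: {s}" for q, s in demos]
    parts.append(f"Problem: {question}\nSolution:")
    return "\n\n".join(parts)


def evidence_gain_proxy(
    logprob: LogProbFn, demo: Demo, probe_question: str, probe_answer: str
) -> float:
    """Log-likelihood improvement on a probe problem from seeing one demo.

    Assumption: a proxy for demonstration utility, not the paper's definition.
    """
    with_demo = logprob(build_in_context_prompt([demo], probe_question), probe_answer)
    without_demo = logprob(build_in_context_prompt([], probe_question), probe_answer)
    return with_demo - without_demo


def toy_logprob(prompt: str, answer: str) -> float:
    """Toy scorer for illustration: an answer already present in the prompt scores higher."""
    return -1.0 if answer in prompt else -3.0


demo = ("2+3?", "2+3 = 5, so the answer is 5")
gain = evidence_gain_proxy(toy_logprob, demo, "4+1?", "5")
```

In practice `toy_logprob` would be replaced by a call that sums the policy model's token log-probabilities for the answer given the prompt. A positive value suggests the demonstration helps on the probe problem.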


Source: arXiv
Generated at: 2026-06-04 00:00:00 UTC
