arXiv

Reinforcement Learning from Rich Feedback with Distributional DAgger

Title: Enhancing Reinforcement Learning with Rich Feedback via Distributional DAgger

Abstract:

Reasoning models have progressed rapidly, yet the prevailing training approach, Reinforcement Learning from Verifiable Rewards (RLVR), uses only a narrow feedback signal. It typically samples many responses and assigns each a binary reward that indicates only whether the final answer is correct. Many settings, however, offer richer feedback, such as execution logs, tool outputs, expert interventions, and the model's own self-assessments. This paper investigates how to exploit such rich feedback through a distributional adaptation of the established imitation learning algorithm DAgger. In this framework, the learner queries an expert distribution locally, at the states visited by the current policy.

This approach yields a simple forward cross-entropy objective that works with a black-box expert. Its sequence-level gradient enables rich credit assignment by propagating future disagreements between the expert and the student back to earlier decision points. We show that prior RL methods with self-distillation objectives based on reverse KL or Jensen-Shannon divergence do not guarantee monotonic policy improvement: even when the expert achieves a higher reward, their updates can increase the probability of selecting inferior actions. In contrast, we prove that forward cross-entropy guarantees monotonic policy improvement and comes with regret bounds.
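To make the two objective families concrete, the following is a minimal per-position sketch in Python with NumPy. It is not the paper's implementation: the function names, the array shapes, and the assumption that the expert returns a full next-token distribution at each student-visited state are all illustrative. It also omits the sequence-level gradient term described above and only contrasts the forward cross-entropy loss with a reverse KL loss.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def forward_cross_entropy(expert_probs, student_logits):
    """Mean over positions of -sum_a p_expert(a|s) * log p_student(a|s).

    expert_probs:   (T, V) expert distributions, treated as constants
                    (a black-box expert only has to supply probabilities).
    student_logits: (T, V) student logits at the same T states, assumed
                    (hypothetically) to come from the student's own rollout.
    """
    student_logp = log_softmax(student_logits)
    return float(-(expert_probs * student_logp).sum(axis=-1).mean())

def reverse_kl(expert_probs, student_logits, eps=1e-12):
    """Mean over positions of KL(student || expert), shown for contrast only."""
    student_logp = log_softmax(student_logits)
    student_p = np.exp(student_logp)
    log_ratio = student_logp - np.log(expert_probs + eps)
    return float((student_p * log_ratio).sum(axis=-1).mean())

# Toy usage: one state, two actions, expert is uniform.
expert = np.array([[0.5, 0.5]])
matched = np.array([[0.0, 0.0]])   # student equals the expert
skewed = np.array([[2.0, 0.0]])    # student over-commits to action 0
print(forward_cross_entropy(expert, matched), forward_cross_entropy(expert, skewed))
print(reverse_kl(expert, matched), reverse_kl(expert, skewed))
```

In the forward direction the expectation is taken under the expert, so the loss needs only expert probabilities and student log-probabilities. In the reverse direction the expectation is taken under the student.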

Furthermore, our analysis shows that this objective optimizes a lower bound on the teacher-weighted likelihood of success, which translates into improved Pass@N performance. Empirically, our method, termed DistIL, outperforms both RLVR and RL with self-distillation baselines across multiple domains, including scientific reasoning, code generation, and complex mathematical problem solving.
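The abstract does not say how Pass@N is measured. As a hedged illustration, the sketch below uses the commonly used unbiased estimator computed from a larger pool of samples per problem. The function name and arguments are assumptions for this example, not taken from the paper.

```python
from math import comb

def pass_at_n(num_samples, num_correct, n):
    """Estimate the probability that at least one of n draws is correct.

    num_samples: total responses sampled for the problem
    num_correct: how many of those responses were verified correct
    n:           attempt budget (must not exceed num_samples)
    """
    if n > num_samples:
        raise ValueError("n must not exceed num_samples")
    # Probability that all n draws (without replacement) are incorrect.
    all_wrong = comb(num_samples - num_correct, n) / comb(num_samples, n)
    return 1.0 - all_wrong

# Toy usage: 3 correct responses out of 10 samples.
print(pass_at_n(10, 3, 1))
print(pass_at_n(10, 3, 5))
```

For a fixed pool of samples, the estimate does not decrease as the budget n grows, since more attempts give more chances to hit a correct response.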


Source: arXiv Generated at: 2026-06-04 00:00:00 UTC
