Reinforcement Learning from Rich Feedback with Distributional DAgger
Title: Enhancing Reinforcement Learning with Rich Feedback via Distributional DAgger
Abstract:
Reasoning models have progressed rapidly, but the prevailing training approach, Reinforcement Learning from Verifiable Rewards (RLVR), is limited in the feedback it uses. The standard method generates many responses and assigns each a binary reward that indicates only whether the final answer is correct. Many settings, however, offer richer feedback, such as execution logs, tool outputs, expert interventions, and the model's own self-assessments. This paper investigates how to leverage such rich feedback through a distributional adaptation of the established imitation learning algorithm DAgger. In this framework, the learner has local access to an expert distribution at the states visited by the current policy.
This approach yields a simple forward cross-entropy objective that works with a black-box expert. Its sequence-level gradient enables rich credit assignment by propagating future disagreements between the expert and the student back to earlier decision points. We show that previous RL methods with self-distillation objectives based on reverse KL or Jensen-Shannon divergence do not guarantee monotonic policy improvement: even when the expert achieves a higher reward, their updates can increase the probability of selecting inferior actions. In contrast, we prove that forward cross-entropy guarantees monotonic policy improvement and comes with regret bounds.
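As a rough illustration of the forward cross-entropy objective described above, the following minimal sketch evaluates it at a single state with a small discrete action set. It is not the paper's DistIL implementation: the softmax parameterization of the student, the three-action expert distribution, the step size, and all function names are assumptions made for illustration. The expert is treated as a black box that only supplies action probabilities at the state the student visits.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def forward_cross_entropy(expert_probs, student_logits):
    """Forward cross-entropy H(q, pi) = -sum_a q(a) * log pi(a) at one state.

    q is the expert distribution and pi is the student policy.
    """
    student_probs = softmax(student_logits)
    return -sum(
        q * math.log(p)
        for q, p in zip(expert_probs, student_probs)
        if q > 0.0
    )


def forward_ce_grad(expert_probs, student_logits):
    """Gradient of H(q, pi) with respect to the student logits.

    For a softmax student and a normalized q, this is pi - q.
    """
    student_probs = softmax(student_logits)
    return [p - q for p, q in zip(student_probs, expert_probs)]


# Hypothetical expert distribution at one state visited by the student.
expert_probs = [0.7, 0.2, 0.1]
# The student starts from a uniform policy over the three actions.
student_logits = [0.0, 0.0, 0.0]

losses = []
for _ in range(20):
    losses.append(forward_cross_entropy(expert_probs, student_logits))
    grad = forward_ce_grad(expert_probs, student_logits)
    # Plain gradient descent with an assumed step size of 0.5.
    student_logits = [z - 0.5 * g for z, g in zip(student_logits, grad)]
```

In this toy setting the loss in `losses` decreases as the student's action probabilities move toward the expert's. The sequence-level gradient and the monotonic improvement guarantee discussed in the abstract concern the full multi-step setting, which this single-state sketch does not capture.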
Furthermore, our analysis shows that this objective optimizes a lower bound on the teacher-weighted likelihood of success, which translates into improved Pass@N performance. Empirically, our method, termed DistIL, outperforms both RLVR and RL with self-distillation baselines across multiple domains, including scientific reasoning, code generation, and complex mathematical problem solving.
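For readers unfamiliar with the Pass@N metric mentioned above, the sketch below computes it under a simplifying assumption that is not taken from the paper: each of the N sampled responses succeeds independently with the same probability. The function name and parameters are hypothetical, and the paper's lower bound on the teacher-weighted likelihood of success is not reproduced here.

```python
def pass_at_n(p_success, n):
    """Probability that at least one of n independent samples succeeds.

    Assumes every sample succeeds independently with probability p_success,
    so Pass@N = 1 - (1 - p_success) ** n.
    """
    if not 0.0 <= p_success <= 1.0:
        raise ValueError("p_success must lie in [0, 1]")
    if n < 1:
        raise ValueError("n must be a positive integer")
    return 1.0 - (1.0 - p_success) ** n
```

Under this assumption, a per-sample success probability of 0.5 gives a Pass@2 of 0.75, which shows why spreading probability mass over several correct responses can raise Pass@N even when single-sample accuracy changes little.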
Source: arXiv
Generated at: 2026-06-04 00:00:00 UTC






