OPD+: Rethinking the Advantage Design for On-Policy Distillation
Title: OPD+: Reevaluating Advantage Estimation in On-Policy Distillation
Abstract
On-policy distillation (OPD) is a widely used method for transferring advanced capabilities from strong teacher language models to base student models. It is typically framed as a reinforcement learning objective over rollouts generated by the student. Although the divergence reward itself depends on the student's likelihood, most existing implementations apply a stop-gradient to it, primarily for stability. This practice raises the question of whether the resulting advantage estimate is valid.
In this study, we introduce a general optimization framework based on the f-divergence between the student and teacher models, and use it to re-examine this design space rigorously. We show that, for a general divergence function, applying a stop-gradient yields biased estimates of both the reward objective and its gradient. To address this, we propose OPD+, a corrected variant of OPD that supports a range of f-divergences and outperforms the KL baseline. We support these findings with evaluations on tool-use and mathematical reasoning benchmarks.
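The stop-gradient concern can be illustrated with a small numerical check. This is a toy sketch, not the paper's formulation: it assumes a three-token vocabulary, a softmax student, a fixed teacher, and a reward r(y) = -f(u)/u with u = pi_student(y) / pi_teacher(y), so that the expected reward equals -D_f(student || teacher). All function names are hypothetical. The code compares the true gradient of the objective with the exact expectation of the stop-gradient policy-gradient term.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# phi(u) = f(u) / u, where u = pi_student / pi_teacher (assumed reward form).
PHI = {
    "reverse_kl": lambda u: math.log(u),       # f(u) = u * log(u)
    "forward_kl": lambda u: -math.log(u) / u,  # f(u) = -log(u)
}

def objective(logits, teacher, phi):
    """J = E_{y ~ student}[-phi(u(y))], which equals -D_f(student || teacher)."""
    p = softmax(logits)
    return -sum(p_y * phi(p_y / q_y) for p_y, q_y in zip(p, teacher))

def true_gradient(logits, teacher, phi, eps=1e-6):
    """Central finite differences of J with respect to the student logits."""
    grad = []
    for i in range(len(logits)):
        up = list(logits)
        dn = list(logits)
        up[i] += eps
        dn[i] -= eps
        diff = objective(up, teacher, phi) - objective(dn, teacher, phi)
        grad.append(diff / (2 * eps))
    return grad

def stop_gradient_gradient(logits, teacher, phi):
    """Exact E_{y ~ student}[sg(r(y)) * d log pi(y) / d logits].

    For a softmax policy, d log pi(y) / d z_i = 1[y == i] - pi_i, so the
    expectation reduces to pi_i * (r_i - E[r]).
    """
    p = softmax(logits)
    rewards = [-phi(p_y / q_y) for p_y, q_y in zip(p, teacher)]
    mean_reward = sum(p_y * r_y for p_y, r_y in zip(p, rewards))
    return [p_i * (r_i - mean_reward) for p_i, r_i in zip(p, rewards)]

def max_gap(logits, teacher, phi):
    """Largest coordinate-wise gap between the two gradients."""
    g_true = true_gradient(logits, teacher, phi)
    g_sg = stop_gradient_gradient(logits, teacher, phi)
    return max(abs(a - b) for a, b in zip(g_true, g_sg))

if __name__ == "__main__":
    student_logits = [0.5, 0.0, -0.5]  # arbitrary illustrative values
    teacher_probs = [0.2, 0.3, 0.5]    # arbitrary illustrative values
    for name, phi in PHI.items():
        print(name, max_gap(student_logits, teacher_probs, phi))
```

In this toy setting the two gradients agree for reverse KL, because the dropped term is proportional to the expected score function, which is zero. They disagree for forward KL, where the dropped term does not vanish. How OPD+ corrects the estimator is described in the paper itself and is not reproduced here.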
Source: arXiv Generated at: 2026-06-02 00:00:00 UTC




