arXiv

OPD+: Rethinking the Advantage Design for On-Policy Distillation

June 2, 2026 · Hanyang Zhao, Haoxian Chen, Han Lin, Genta Indra Winata, David Yao, Wenpin Tang

Abstract

On-policy distillation (OPD) is a widely used method for transferring capabilities from strong teacher language models to base student models. It is typically framed as a reinforcement learning objective over rollouts generated by the student, with the teacher-student divergence serving as the reward. Although this divergence reward depends on the student's own likelihood, most existing implementations apply a stop-gradient to it, primarily for stability. This practice calls into question the validity of the resulting advantage estimate.

In this study, we introduce a general optimization framework based on the f-divergence between the student and the teacher, and use it to re-examine this design space rigorously. We show that a generic stop-gradient yields biased estimates of both the reward objective and its gradient for general divergence functions. To address this, we propose OPD+, a corrected variant of OPD. OPD+ supports a range of f-divergences and outperforms the baseline KL approach. We validate these findings on benchmarks for tool use and mathematical reasoning.
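
The bias claim can be illustrated with a small numerical check. The sketch below is not the paper's formulation or the OPD+ correction. It is a single-step toy under stated assumptions: a softmax student over three outcomes, a fixed teacher distribution, and D_f(pi || q) = sum_y q(y) f(pi(y)/q(y)) rewritten as an expectation of the per-sample cost f(u)/u under the student. All names (GENERATORS, exact_gradient, stop_gradient_gradient) and the example numbers are hypothetical. It compares the exact gradient with the expected score-function gradient obtained when that per-sample cost is treated as a constant.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Generators f of D_f(pi || q) = sum_y q(y) * f(pi(y) / q(y)), paired with f'.
GENERATORS = {
    "reverse_kl": (lambda u: u * math.log(u), lambda u: math.log(u) + 1.0),
    "forward_kl": (lambda u: -math.log(u), lambda u: -1.0 / u),
}

def _score_weighted(pi, weights):
    # sum_y pi(y) * weights(y) * d log pi(y) / d logit_k for a softmax policy,
    # which simplifies to pi_k * (weights_k - E_pi[weights]).
    mean = sum(p * w for p, w in zip(pi, weights))
    return [p * (w - mean) for p, w in zip(pi, weights)]

def exact_gradient(logits, q, name):
    """Exact gradient of D_f(pi || q) with respect to the student logits."""
    _, f_prime = GENERATORS[name]
    pi = softmax(logits)
    return _score_weighted(pi, [f_prime(p / t) for p, t in zip(pi, q)])

def stop_gradient_gradient(logits, q, name):
    """Expected score-function gradient when the per-sample cost f(u)/u is frozen."""
    f, _ = GENERATORS[name]
    pi = softmax(logits)
    return _score_weighted(pi, [f(p / t) / (p / t) for p, t in zip(pi, q)])

# Toy teacher distribution and student logits (assumed values, for illustration only).
teacher = [0.5, 0.3, 0.2]
student_logits = [0.2, -0.1, 0.4]

for name in GENERATORS:
    exact = exact_gradient(student_logits, teacher, name)
    frozen = stop_gradient_gradient(student_logits, teacher, name)
    gap = max(abs(a - b) for a, b in zip(exact, frozen))
    print(f"{name}: max |exact - stop-gradient| = {gap:.4f}")
```

In this toy, the two gradients coincide for reverse KL because f'(u) and f(u)/u differ only by a constant, which the score function averages out. For forward KL they differ, so the frozen-cost estimator is biased. The token-level, sequence-rollout setting studied in the paper may behave differently.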
