arXiv

Trust Region On-Policy Distillation

June 2, 2026 · Xingrun Xing, Haoqing Wang, Boyan Gao, Ziheng Li, Yehui Tang

Abstract:

On-Policy Distillation (OPD) serves as a cornerstone method for the efficient post-training of large language models (LLMs), finding extensive utility in model compression, multi-task improvement, and agent learning. Nevertheless, OPD often suffers from training instability when there is a significant divergence between the teacher and student distributions. In such scenarios, supervision derived from teacher responses to student-generated tokens can produce unreliable policy gradients, potentially leading to optimization collapse. To resolve the challenge of ensuring reliable on-policy token-level supervision, this study introduces credit assignment strategies and presents Trust Region On-Policy Distillation (TrOPD). The proposed framework is defined by three key components:

Trust-Region On-Policy Learning: TrOPD restricts OPD execution to areas where the teacher offers trustworthy supervision. This approach alleviates the optimization challenges associated with the K1 reverse-KL estimator when distribution mismatch occurs.
Outlier Estimation: To counteract the negative impacts of unreliable supervision in outlier regions, the method investigates techniques such as masking, gradient clipping, and forward-KL estimation.
Off-Policy Guidance: The student continues generation from teacher prefixes and uses forward KL divergence to imitate this off-policy guidance, which steers on-policy exploration toward more reliable regions.
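
The first two components can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not say how the trust region is defined, so the threshold `delta`, the rule `|K1| <= delta`, and the function names below are all assumptions made for illustration. The K1 estimator is used here in its usual sense, the log-ratio log pi_student(x) - log pi_teacher(x) evaluated at tokens sampled from the student. Clipping the per-token signal stands in for gradient clipping, since in a policy-gradient update this signal scales each token's gradient.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last (vocabulary) axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def k1_reverse_kl(student_logp, teacher_logp, tokens):
    """Per-token K1 estimate of KL(student || teacher) at student-sampled tokens."""
    idx = np.arange(len(tokens))
    return student_logp[idx, tokens] - teacher_logp[idx, tokens]

def forward_kl(student_logp, teacher_logp):
    """Per-position KL(teacher || student), summed over the full vocabulary."""
    return (np.exp(teacher_logp) * (teacher_logp - student_logp)).sum(axis=-1)

def trust_region_signal(student_logits, teacher_logits, tokens, delta=2.0, mode="mask"):
    """Return (per-token signal, trusted mask).

    ASSUMPTION: a token is 'trusted' when |K1| <= delta. This rule and the
    default delta are hypothetical; the source abstract does not specify them.
    """
    s = log_softmax(student_logits)
    t = log_softmax(teacher_logits)
    k1 = k1_reverse_kl(s, t, tokens)
    trusted = np.abs(k1) <= delta
    if mode == "mask":
        # Drop the signal entirely on outlier tokens.
        signal = np.where(trusted, k1, 0.0)
    elif mode == "clip":
        # Bound the signal magnitude (stand-in for gradient clipping).
        signal = np.clip(k1, -delta, delta)
    elif mode == "forward_kl":
        # Replace the K1 estimate with a full-vocabulary forward KL on outliers.
        signal = np.where(trusted, k1, forward_kl(s, t))
    else:
        raise ValueError(f"unknown mode: {mode}")
    return signal, trusted
```

Calling `trust_region_signal` with arrays of shape `(positions, vocabulary)` for the logits and an integer array of sampled token ids returns one value per position; a training loop would use the negated signal as a per-token reward or add it to a loss, depending on the chosen formulation.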

Empirical results demonstrate that TrOPD consistently surpasses state-of-the-art OPD baselines, such as EOPD, REOPOLD, and standard OPD, across various benchmarks encompassing general domains, code generation, and mathematical reasoning.

Source: arXiv