arXiv

SCOPE: Signal-Calibrated On-Policy Distillation Enhancement with Dual-Path Adaptive Weighting

June 2, 2026 · Binbin Zheng, Xing Ma, Yiheng Liang, Jingqing Ruan, Xiaoliang Fu, Kepeng Lin, Benchang Zhu, Ke Zeng, Xunliang Cai

Title: SCOPE: Enhancing On-Policy Distillation via Signal Calibration and Dual-Path Adaptive Weighting

Abstract: On-policy reinforcement learning has become the leading approach for aligning reasoning capabilities in large language models, but sparse outcome-level rewards make token-level credit assignment difficult. On-Policy Distillation (OPD) addresses this by adding dense, token-level KL supervision from a teacher model. Standard OPD, however, typically applies this supervision uniformly across all rollouts and ignores differences in signal quality. We introduce Signal-Calibrated On-Policy Distillation Enhancement (SCOPE), a dual-path adaptive training framework that routes on-policy rollouts into two supervision paths according to their correctness. For incorrect trajectories, SCOPE uses teacher-perplexity-weighted KL distillation, which emphasizes cases where the teacher offers reliable corrective guidance and down-weights uncertain signals. For correct trajectories, SCOPE applies student-perplexity-weighted Maximum Likelihood Estimation (MLE), which concentrates reinforcement on low-confidence samples near the model’s capability boundary rather than on solutions it has already mastered. Both paths use group-level normalization to adapt the weight distribution to the varying difficulty of different prompts. Across six reasoning benchmarks, SCOPE outperforms strong baselines, with average relative improvements of 11.42% in Avg@32 and 7.30% in Pass@32, indicating consistent effectiveness.
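The dual-path routing described in the abstract can be illustrated with a small sketch. The abstract does not give the weighting formulas, so everything specific here is an assumption made for illustration: the `Rollout` container, the function names, the use of inverse teacher perplexity for incorrect rollouts, raw student perplexity for correct rollouts, and mean-normalization within each path as the "group-level normalization". It should be read as one plausible instance of the idea, not as the authors' implementation.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Rollout:
    """One sampled response for a prompt (hypothetical container)."""
    correct: bool                  # outcome-level correctness of the rollout
    teacher_logprobs: List[float]  # teacher log-prob of each sampled token
    student_logprobs: List[float]  # student log-prob of each sampled token
    token_kl: List[float]          # per-token KL between student and teacher


def perplexity(logprobs: List[float]) -> float:
    """exp of the mean negative log-probability over the sequence."""
    return math.exp(-sum(logprobs) / len(logprobs))


def mean_normalize(scores: List[float]) -> List[float]:
    """Rescale scores so they average to 1 within the group (assumed scheme)."""
    mean = sum(scores) / len(scores)
    return [s / mean for s in scores]


def dual_path_weights(group: List[Rollout]) -> List[float]:
    """Per-rollout weights for all rollouts of a single prompt.

    Assumption: incorrect rollouts are weighted by inverse teacher perplexity
    (a confident teacher gets a larger weight), correct rollouts by student
    perplexity (a less confident student gets a larger weight).
    """
    weights = [0.0] * len(group)
    wrong = [i for i, r in enumerate(group) if not r.correct]
    right = [i for i, r in enumerate(group) if r.correct]
    if wrong:
        raw = [1.0 / perplexity(group[i].teacher_logprobs) for i in wrong]
        for i, w in zip(wrong, mean_normalize(raw)):
            weights[i] = w
    if right:
        raw = [perplexity(group[i].student_logprobs) for i in right]
        for i, w in zip(right, mean_normalize(raw)):
            weights[i] = w
    return weights


def group_loss(group: List[Rollout]) -> float:
    """Weighted KL on incorrect rollouts plus weighted NLL on correct ones."""
    total = 0.0
    for w, r in zip(dual_path_weights(group), group):
        if r.correct:
            # MLE path: mean negative log-likelihood under the student.
            per_seq = -sum(r.student_logprobs) / len(r.student_logprobs)
        else:
            # Distillation path: mean per-token KL against the teacher.
            per_seq = sum(r.token_kl) / len(r.token_kl)
        total += w * per_seq
    return total / len(group)
```

In an actual training loop the weights would presumably be treated as constants (no gradient flowing through the perplexity terms), and the per-token quantities would come from model forward passes rather than precomputed lists.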
