arXiv

SCOPE: Signal-Calibrated On-Policy Distillation Enhancement with Dual-Path Adaptive Weighting

Title: SCOPE: Enhancing On-Policy Distillation via Signal Calibration and Dual-Path Adaptive Weighting

Abstract: On-policy reinforcement learning has emerged as the leading approach for aligning reasoning capabilities in large language models, but it faces a significant hurdle: outcome-level rewards are sparse, which complicates token-level credit assignment. On-Policy Distillation (OPD) addresses this challenge by adding dense, token-level KL supervision from a teacher model. However, standard OPD applies this supervision uniformly across all rollouts, ignoring differences in signal quality. To overcome this limitation, we introduce Signal-Calibrated On-Policy Distillation Enhancement (SCOPE), a dual-path adaptive training framework that routes on-policy rollouts into two supervision paths according to their correctness. For incorrect trajectories, SCOPE uses teacher-perplexity-weighted KL distillation, emphasizing instances where the teacher offers reliable corrective guidance and down-weighting uncertain signals. For correct trajectories, SCOPE applies student-perplexity-weighted Maximum Likelihood Estimation (MLE), concentrating reinforcement on low-confidence samples near the model’s capability boundary rather than redundantly reinforcing what it has already mastered. Both paths use group-level normalization to adaptively calibrate the weight distributions, accounting for variation in difficulty across prompts. Extensive evaluations on six reasoning benchmarks show that SCOPE outperforms strong baselines, with an average relative improvement of 11.42% in Avg@32 and 7.30% in Pass@32, confirming its robust and consistent efficacy.
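
The routing and weighting described in the abstract can be illustrated with a small sketch. The abstract does not give the exact formulas, so everything concrete here is an assumption made for illustration: the function names (`perplexity`, `normalize_to_mean_one`, `scope_weights`), the inverse-perplexity form for the KL path, the raw-perplexity form for the MLE path, and the choice of "rescale to mean 1 within the group" as the group-level normalization. The teacher log-probabilities are assumed to be the teacher's scores for the student's own sampled tokens.

```python
import math
from typing import List, Optional, Sequence, Tuple


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Sequence perplexity: exp of the mean negative token log-probability."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def normalize_to_mean_one(raw: Sequence[float]) -> List[float]:
    """Rescale scores so they average to 1 (an assumed group-level normalization)."""
    if not raw:
        return []
    mean = sum(raw) / len(raw)
    return [r / mean for r in raw]


def scope_weights(
    correct: Sequence[bool],
    teacher_logprobs: Sequence[Sequence[float]],
    student_logprobs: Sequence[Sequence[float]],
) -> List[Optional[Tuple[str, float]]]:
    """Return a (path, weight) pair per rollout for one prompt's group of rollouts."""
    wrong_idx = [i for i, c in enumerate(correct) if not c]
    right_idx = [i for i, c in enumerate(correct) if c]

    # Incorrect path (KL distillation): lower teacher perplexity -> larger weight.
    # The 1 / perplexity form is a hypothetical choice, not taken from the paper.
    kl_w = normalize_to_mean_one(
        [1.0 / perplexity(teacher_logprobs[i]) for i in wrong_idx]
    )
    # Correct path (MLE): higher student perplexity -> larger weight (also assumed).
    mle_w = normalize_to_mean_one(
        [perplexity(student_logprobs[i]) for i in right_idx]
    )

    out: List[Optional[Tuple[str, float]]] = [None] * len(correct)
    for i, w in zip(wrong_idx, kl_w):
        out[i] = ("kl", w)
    for i, w in zip(right_idx, mle_w):
        out[i] = ("mle", w)
    return out


# Toy group of four rollouts for one prompt (made-up log-probabilities).
correct = [True, True, False, False]
teacher_lp = [[-0.1, -0.2], [-0.3, -0.1], [-0.2, -0.2], [-1.5, -2.0]]
student_lp = [[-0.1, -0.1], [-1.0, -1.2], [-0.9, -1.1], [-0.8, -0.7]]
weights = scope_weights(correct, teacher_lp, student_lp)
```

In this toy group, the second correct rollout (where the student is less confident) receives a larger MLE weight than the first, and the first incorrect rollout (where the teacher is more confident) receives a larger KL weight than the second. Normalizing within each prompt's group keeps the average weight per path at 1, which is one plausible way to keep easy and hard prompts on a comparable scale.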


Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC
