arXiv

OmniOPD: Logit-Free On-Policy Distillation via Speculative Verification

June 2, 2026 · Yuhang Zhou, Lizhu Zhang, Yifan Wu, Mingyi Wang, Peng Bo, Jiayi Liu, Xiangjun Fan, Zhuokai Zhao

Abstract

On-Policy Distillation (OPD) enhances student models by training them on their own generative trajectories, leveraging dense, token-level feedback from a superior teacher. This method effectively addresses two major drawbacks of alternative approaches: the off-policy distribution shift inherent in Supervised Fine-Tuning (SFT) and the sparse credit assignment issues typical of Reinforcement Learning (RL). Despite its advantages, conventional OPD is hindered by two intertwined constraints. First, it necessitates direct access to the teacher’s token-level logits, thereby disqualifying many powerful proprietary models from acting as teachers. Second, relying on token-level logit signals is fragile; it depends heavily on a limited overlap of plausible next tokens between the teacher and student, making it susceptible to amplifying degenerate behaviors like repetition loops.
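For intuition, the token-level signal that conventional OPD relies on can be sketched as a per-position divergence between student and teacher next-token distributions, evaluated along a trajectory the student itself sampled. The abstract does not state the exact loss, so the reverse KL used here is an assumption (one common choice in the distillation literature), and the function names are hypothetical. The sketch also makes the first constraint concrete: the computation needs the teacher's full logit row at every position.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one row of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def reverse_kl_per_token(student_logits, teacher_logits):
    """Return KL(student || teacher) at each position of a student-sampled trajectory.

    Both arguments are lists of logit rows, one row per generated token.
    Assumption: reverse KL stands in for the unspecified token-level OPD loss.
    """
    losses = []
    for s_row, t_row in zip(student_logits, teacher_logits):
        s_logp = log_softmax(s_row)
        t_logp = log_softmax(t_row)
        kl = sum(math.exp(ls) * (ls - lt) for ls, lt in zip(s_logp, t_logp))
        losses.append(kl)
    return losses
```

When the student and teacher rows agree, the value at that position is zero. It grows as the student places probability on tokens the teacher considers unlikely, which is where a small overlap in plausible next tokens makes the signal noisy.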

To overcome these challenges, we present OmniOPD, a new framework that eliminates the need for logits by employing a chunk-level supervision signal. OmniOPD substitutes deterministic logit matching with Monte Carlo rollouts, which estimate the teacher’s local preferences using a continuous semantic similarity metric applied to multi-token chunks. To optimize this supervision, a peak-entropy scheduler restricts auditing to the student’s high-uncertainty reasoning forks. Additionally, a Dirichlet-Multinomial Bayesian prior and a base-model KL anchor are utilized to constrain the variance of discrete sampling and safeguard against policy collapse in unaudited tokens.
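Two of these components can be illustrated with a minimal sketch. The abstract does not give the exact rules, so everything below is an assumption: `peak_entropy_positions` treats a "peak" as a local maximum of the student's next-token entropy above a hypothetical threshold `min_entropy`, and `dirichlet_posterior_mean` assumes the Monte Carlo rollouts produce discrete counts (for example, how often each candidate chunk was the closest semantic match), which a symmetric Dirichlet prior with hypothetical concentration `alpha` then smooths.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def peak_entropy_positions(step_probs, min_entropy=0.5):
    """Pick positions whose entropy is a local maximum and at least min_entropy.

    Assumption: this local-maximum rule is only one plausible reading of
    a "peak-entropy scheduler"; the paper's rule may differ.
    """
    ent = [token_entropy(p) for p in step_probs]
    peaks = []
    for i, h in enumerate(ent):
        left = ent[i - 1] if i > 0 else -math.inf
        right = ent[i + 1] if i + 1 < len(ent) else -math.inf
        if h >= min_entropy and h > left and h >= right:
            peaks.append(i)
    return peaks


def dirichlet_posterior_mean(counts, alpha=1.0):
    """Posterior mean of a multinomial under a symmetric Dirichlet(alpha) prior.

    counts[k] is the (assumed) number of rollouts favoring candidate chunk k.
    """
    total = sum(counts) + alpha * len(counts)
    return [(c + alpha) / total for c in counts]


# Student distributions at five decoding steps (illustrative numbers only).
step_probs = [
    [0.98, 0.01, 0.01],
    [0.40, 0.30, 0.30],
    [0.90, 0.05, 0.05],
    [0.34, 0.33, 0.33],
    [0.99, 0.005, 0.005],
]
print(peak_entropy_positions(step_probs))  # [1, 3]

# Four rollouts over three candidate chunks: 3 favored chunk 0, 1 favored chunk 2.
print(dirichlet_posterior_mean([3, 0, 1]))  # approximately [0.571, 0.143, 0.286]
```

With only four rollouts, the raw frequencies would assign probability zero to the second chunk. The prior keeps that estimate at 1/7, which is the kind of variance control the Dirichlet-Multinomial prior is described as providing.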

Evaluations across competitive benchmarks demonstrate that OmniOPD outperforms standard OPD by up to 28.64% in mathematics. This confirms that chunk-level semantic verification yields a more robust learning signal than token-level logit matching, whose high information density is offset by considerable noise and fragility. Moreover, when OmniOPD is paired with strong black-box teachers like Claude-4.5-Haiku and Gemini-2.5-Flash, it secures an additional 9.54% relative improvement in math over its open-weight teacher equivalent, enabling the student model to surpass the performance achieved by self-exploratory RL.
