arXiv

OmniOPD: Logit-Free On-Policy Distillation via Speculative Verification

Abstract

On-Policy Distillation (OPD) enhances student models by training them on their own generative trajectories, leveraging dense, token-level feedback from a superior teacher. This method effectively addresses two major drawbacks of alternative approaches: the off-policy distribution shift inherent in Supervised Fine-Tuning (SFT) and the sparse credit assignment issues typical of Reinforcement Learning (RL). Despite its advantages, conventional OPD is hindered by two intertwined constraints. First, it necessitates direct access to the teacher’s token-level logits, thereby disqualifying many powerful proprietary models from acting as teachers. Second, relying on token-level logit signals is fragile; it depends heavily on a limited overlap of plausible next tokens between the teacher and student, making it susceptible to amplifying degenerate behaviors like repetition loops.

To overcome these challenges, we present OmniOPD, a new framework that eliminates the need for logits by employing a chunk-level supervision signal. OmniOPD replaces deterministic logit matching with Monte Carlo rollouts, which estimate the teacher’s local preferences using a continuous semantic similarity metric applied to multi-token chunks. To optimize this supervision, a peak-entropy scheduler restricts auditing to the student’s high-uncertainty reasoning forks. Additionally, a Dirichlet-Multinomial Bayesian prior and a base-model KL anchor constrain the variance of discrete sampling and safeguard against policy collapse in unaudited tokens.
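As a rough illustration of two ingredients named above, the following sketch shows one way a peak-entropy selection step and a Dirichlet-Multinomial smoothing step could look. It is not the authors' implementation: the abstract gives no formulas, so the function names, the top-k selection rule, the symmetric prior with strength `alpha`, and the idea that each teacher rollout casts one "vote" for the candidate chunk it most resembles are all assumptions made purely for illustration.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def peak_entropy_positions(step_probs, k):
    """Return the indices of the k highest-entropy steps, in position order.

    Assumption: "peak-entropy scheduling" is approximated here by a plain
    top-k over per-step student entropies; the paper may use another rule.
    """
    entropies = [token_entropy(p) for p in step_probs]
    ranked = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    return sorted(ranked[:k])


def dirichlet_posterior_mean(counts, alpha=1.0):
    """Posterior mean under a symmetric Dirichlet(alpha) prior.

    `counts` are hypothetical vote counts: how many sampled teacher rollouts
    were most similar to each candidate chunk. The prior keeps a small number
    of rollouts from producing hard 0/1 preference estimates.
    """
    total = sum(counts) + alpha * len(counts)
    return [(c + alpha) / total for c in counts]


# Toy student distributions over a two-token vocabulary at four steps.
student_steps = [[1.0, 0.0], [0.5, 0.5], [0.9, 0.1], [0.6, 0.4]]
audited = peak_entropy_positions(student_steps, k=2)
print(audited)  # [1, 3]

# Four rollouts voted 3 / 1 / 0 across three candidate chunks.
print(dirichlet_posterior_mean([3, 1, 0], alpha=1.0))
```

With only four rollouts, the raw vote frequencies would assign the third candidate a preference of exactly zero; the smoothed estimate (4/7, 2/7, 1/7) keeps it small but nonzero, which is the variance-limiting effect a Bayesian prior is meant to provide in this kind of discrete sampling.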

Evaluations across competitive benchmarks demonstrate that OmniOPD outperforms standard OPD by as much as +28.64% in mathematics. This confirms that chunk-level semantic verification yields a more robust learning signal compared to token-level logit matching, where high information density is counterbalanced by considerable noise and fragility. Moreover, when OmniOPD is paired with strong black-box teachers like Claude-4.5-Haiku and Gemini-2.5-Flash, it secures an additional +9.54% relative improvement in math over its open-weight teacher equivalent, enabling the student model to surpass the performance levels achieved by self-exploratory RL.


Source: arXiv Generated at: 2026-06-02 00:00:00 UTC
