Global News Digest

arXiv

Bridging Reasoning Trajectories in On-Policy Distillation via Near-Future Guidance

Abstract:

On-Policy Distillation (OPD) improves the reasoning capabilities of large language models by training a student model on trajectories sampled from its own policy, with guidance from a teacher. Although OPD samples complete trajectories, the learning signal is applied at the token level: it detects errors by locating high-loss tokens and attempts to correct them through local reverse-KL adjustments. However, we demonstrate that this approach (sampling entire trajectories but learning only at the token level) fails to consistently align student trajectories with those of the teacher. Approximately 30% of high-loss tokens actually sit in a low-divergence context, suggesting that many of them are superficial form mismatches rather than genuine branching points in reasoning. Furthermore, even when a token is genuinely divergent, isolated token-level supervision proves ineffective at correcting it, because reasoning breakdowns typically manifest as short-horizon shifts in probability distributions.
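To make the token-level signal concrete, the following is a minimal sketch of a per-token reverse-KL loss and a simple high-loss filter. It is an illustration only, not the authors' implementation: the function names, the explicit probability lists, and the `threshold` parameter are all assumptions introduced here, and the abstract does not say how high-loss tokens are actually selected.

```python
import math


def reverse_kl(student_probs, teacher_probs, eps=1e-12):
    """Reverse KL, KL(student || teacher), at one token position.

    Both arguments are probability lists over the same vocabulary.
    `eps` is an assumed smoothing constant that avoids log(0).
    """
    return sum(
        s * (math.log(s + eps) - math.log(t + eps))
        for s, t in zip(student_probs, teacher_probs)
        if s > 0.0  # terms with zero student mass contribute nothing
    )


def token_losses(student_dists, teacher_dists):
    """Per-token reverse KL along one student-sampled trajectory."""
    return [reverse_kl(s, t) for s, t in zip(student_dists, teacher_dists)]


def high_loss_positions(losses, threshold):
    """Positions whose loss exceeds a (hypothetical) fixed threshold."""
    return [i for i, loss in enumerate(losses) if loss > threshold]
```

Note that each loss here depends only on the distributions at a single position, which is the locality the abstract identifies as the weakness of standard OPD.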

To address these limitations, we introduce Trajectory-aware OPD (TOPD). This method leverages near-future trajectory data to accurately identify true divergent states and extends guidance across several subsequent tokens. Empirical results indicate that filtering out non-divergent high-loss tokens boosts the standard OPD average accuracy from 47.8% to 48.2%. TOPD achieves even higher performance, raising the average accuracy to 52.2%. Specific improvements are evident on the AIME24 benchmark, where scores increase from 60.0% to 63.3%, and on AIME25, where performance rises from 46.7% to 53.3%.
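The two ideas in this paragraph (using the near future to decide whether a high-loss token is truly divergent, and extending guidance over the tokens that follow) can be sketched as below. This is a hedged illustration under stated assumptions, not TOPD itself: the abstract does not specify the divergence criterion, so the sketch uses the mean loss over the next `horizon` tokens as a stand-in, and every name and threshold is hypothetical.

```python
def near_future_mean(losses, pos, horizon):
    """Mean loss over the `horizon` tokens after `pos` (assumed proxy
    for how divergent the near-future context is)."""
    window = losses[pos + 1 : pos + 1 + horizon]
    return sum(window) / len(window) if window else 0.0


def divergent_positions(losses, loss_threshold, context_threshold, horizon):
    """Keep high-loss tokens only when the near future also diverges."""
    return [
        i
        for i, loss in enumerate(losses)
        if loss > loss_threshold
        and near_future_mean(losses, i, horizon) > context_threshold
    ]


def guidance_mask(num_tokens, positions, span):
    """Mark each divergent position and the `span` tokens after it."""
    mask = [False] * num_tokens
    for p in positions:
        for j in range(p, min(p + span + 1, num_tokens)):
            mask[j] = True
    return mask


# Toy per-token losses: position 1 is an isolated spike, while
# positions 4-6 form a sustained run of high loss.
losses = [0.1, 2.0, 0.1, 0.1, 1.8, 1.5, 1.2, 0.2]
kept = divergent_positions(
    losses, loss_threshold=1.0, context_threshold=0.5, horizon=2
)
mask = guidance_mask(len(losses), kept, span=2)
```

In this toy input the isolated spike at position 1 is filtered out because the tokens after it have low loss, which mirrors the abstract's notion of a high-loss token in a low-divergence context. Positions 4 and 5 are kept, and the mask then covers positions 4 through 7.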


Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC

Related Articles

Schroders Renewable Unit Targets AI Assets as Power Demand Soars
Bloomberg

Schroders’ renewable unit targets AI infrastructure, pivoting to meet soaring energy demand from artificial intelligence...

State Street's Paglia on SBI Group Partnership, ETFs
Bloomberg

State Street's Paglia discusses the SBI Group partnership and ETFs.

Nvidia Boss Says Workers Should Be Paid ‘as Much as Possible’
Bloomberg

Nvidia CEO Jensen Huang advocates for paying workers “as much as possible,” emphasizing maximum compensation. This stanc...

TSE Talking With Regulator For Easing ETF Listing Rules
Bloomberg

The Tokyo Stock Exchange is discussing with regulators to ease ETF listing rules. This aims to simplify market access an...

S&P DJI CEO on Japan Markets, Mega IPOs
Bloomberg

S&P DJI CEO discusses Japan's financial markets and major IPOs.