Bridging Reasoning Trajectories in On-Policy Distillation via Near-Future Guidance
Title: Connecting Reasoning Paths in On-Policy Distillation Through Near-Future Guidance
Abstract:
On-Policy Distillation (OPD) enhances the reasoning capabilities of large language models by training a student model on trajectories sampled from its own policy, under the guidance of a teacher. While OPD samples complete trajectories, the actual learning signal is applied at the token level: it detects errors by pinpointing high-loss tokens and attempts to correct them via local reverse-KL adjustments. However, we demonstrate that this approach (sampling entire trajectories but learning only at the token level) fails to consistently align student trajectories with those of the teacher. Approximately 30% of high-loss tokens actually reside in a low-divergence context, suggesting that many of these instances represent superficial form mismatches rather than genuine branching points in reasoning. Furthermore, even when tokens are genuinely divergent, correcting them through isolated token-level supervision proves ineffective, as reasoning breakdowns typically manifest as short-horizon shifts in probability distributions.
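To make the token-level signal concrete, here is a minimal sketch of a per-token reverse KL, KL(student || teacher), computed at each position of a student-sampled trajectory. It assumes each position comes with full next-token probability vectors from both models, given as plain Python lists; the function names and the `eps` smoothing constant are illustrative and not taken from the paper.

```python
import math

def reverse_kl(student_probs, teacher_probs, eps=1e-12):
    """Reverse KL, KL(student || teacher), at a single token position.

    Both arguments are probability vectors over the same vocabulary.
    eps is an illustrative smoothing constant that avoids log(0).
    """
    return sum(
        p * (math.log(p + eps) - math.log(q + eps))
        for p, q in zip(student_probs, teacher_probs)
        if p > 0.0  # terms with zero student mass contribute nothing
    )

def per_token_reverse_kl(student_dists, teacher_dists):
    """One reverse-KL value per position of the sampled trajectory."""
    return [reverse_kl(s, t) for s, t in zip(student_dists, teacher_dists)]
```

A position where the two distributions agree yields a value of zero, while a position where the student concentrates mass that the teacher does not share yields a positive value; the latter are the "high-loss" tokens the abstract refers to.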
To address these limitations, we introduce Trajectory-aware OPD (TOPD). This method leverages near-future trajectory data to accurately identify true divergent states and extends guidance across several subsequent tokens. Empirical results indicate that filtering out non-divergent high-loss tokens boosts the standard OPD average accuracy from 47.8% to 48.2%. TOPD achieves even higher performance, raising the average accuracy to 52.2%. Specific improvements are evident on the AIME24 benchmark, where scores increase from 60.0% to 63.3%, and on AIME25, where performance rises from 46.7% to 53.3%.
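The abstract does not spell out how near-future information is used, so the following is only a hypothetical sketch of the general idea: a high-loss token counts as a divergent state only if the divergence stays elevated over the next few tokens, and guidance is then applied over a short span rather than at a single position. The thresholds (`high_loss`, `future_div`), the window sizes (`horizon`, `span`), and the use of a mean over the following tokens are all assumptions made for illustration, not the TOPD procedure itself.

```python
def divergent_positions(losses, high_loss=1.0, future_div=0.5, horizon=3):
    """Return indices of high-loss tokens whose near future is also divergent.

    losses: per-token divergence values along one trajectory.
    All thresholds and the window size are hypothetical.
    """
    positions = []
    for t, loss in enumerate(losses):
        if loss < high_loss:
            continue  # not a high-loss token
        future = losses[t + 1 : t + 1 + horizon]
        if not future:
            continue  # no near-future context at the end of the trajectory
        # Keep the token only if the following tokens remain divergent;
        # otherwise treat it as a surface-form mismatch and filter it out.
        if sum(future) / len(future) >= future_div:
            positions.append(t)
    return positions

def guidance_mask(n_tokens, positions, span=3):
    """Mark each divergent token and the tokens right after it for guidance."""
    mask = [False] * n_tokens
    for t in positions:
        for i in range(t, min(t + span, n_tokens)):
            mask[i] = True
    return mask
```

For example, with `losses = [0.1, 2.0, 0.1, 0.1, 0.1, 1.5, 0.9, 0.8, 0.7, 0.1]` and the default parameters, the spike at index 1 is filtered out because the following tokens return to low divergence, while the spike at index 5 is kept and guidance covers indices 5 through 7.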
Source: arXiv Generated at: 2026-06-02 00:00:00 UTC




