arXiv

When Should the Teacher Move? Temporal Coupling and Stability in Self On-Policy Distillation

Title: Timing the Teacher’s Update: Temporal Coupling and Stability in Self On-Policy Distillation

Abstract

Self on-policy distillation trains a student policy against a teacher drawn from the student’s own parameter history. However, the schedule that governs when the teacher updates, and hence the temporal coupling between the two models, has not been rigorously analyzed as a factor in stability. Through a systematic sweep of update schedules on Qwen3-8B, we show that isolation periods (intervals during which the teacher remains frozen between updates), rather than the teacher’s age, are the critical structural element for stable learning.

To understand these dynamics, we introduce a diagnostic framework that examines temporal KL structure, refresh shock, and length-tail risk. This analysis reveals a phenomenon we term state-oblivious collapse: fixed schedules that are optimal in short-horizon training fail catastrophically in long-horizon settings, because a clock-driven refresh can irreversibly copy a transiently drifting student into the teacher in a single step. This failure mode is not detectable in short-horizon evaluations and is mechanistically different from the chronic contamination associated with Exponential Moving Average (EMA) methods.
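The difference between a clock-driven hard refresh and an EMA teacher can be illustrated with a toy sketch. Everything here is an assumption made for illustration: parameters are plain lists of floats, and the names `periodic_refresh`, `ema_refresh`, `period`, and `alpha` are hypothetical, not taken from the paper.

```python
from typing import List


def periodic_refresh(teacher: List[float], student: List[float],
                     step: int, period: int) -> List[float]:
    """Clock-driven hard refresh (hypothetical helper).

    The teacher stays frozen during the isolation period and is
    overwritten with the student every `period` steps, regardless of
    the student's current state.
    """
    if step > 0 and step % period == 0:
        return list(student)  # the whole student is copied in one step
    return teacher            # isolation: teacher unchanged


def ema_refresh(teacher: List[float], student: List[float],
                alpha: float) -> List[float]:
    """EMA update (hypothetical helper): the teacher moves a fraction
    (1 - alpha) toward the student at every step, so there is no
    isolation period."""
    return [alpha * t + (1.0 - alpha) * s for t, s in zip(teacher, student)]


# A student that has transiently drifted exactly when the clock fires.
teacher = [0.0]
drifted_student = [10.0]

hard = periodic_refresh(teacher, drifted_student, step=4, period=4)
soft = ema_refresh(teacher, drifted_student, alpha=0.9)
print(hard, soft)
```

In this toy setting the hard refresh adopts the drifted value in full, while the EMA teacher absorbs only a small fraction of it but would do so at every step. That mirrors the distinction drawn above between a single-step irreversible copy and chronic contamination.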

To mitigate this issue, we propose Consolidation-Gated Teacher Refresh (CGTR). This approach maintains isolation periods but gates each teacher update on joint evidence of reward improvement and length-tail safety. Consequently, teacher movements are triggered only by genuine student consolidation rather than arbitrary time signals. Using a single shared parameter set without per-dataset retuning, CGTR achieves zero collapse and secures the highest final scores across all four evaluated tasks (Chemistry, Biology, Physics, and ToolUse), automatically adjusting its refresh frequency to match the specific learning dynamics of each domain.
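One way to picture such a gate is the minimal sketch below. It is not the paper's implementation: the thresholds, the quantile-based tail check, and the names `GateConfig`, `length_tail`, and `should_refresh` are all hypothetical, chosen only to show a refresh that needs an elapsed isolation period plus joint evidence of reward improvement and length-tail safety.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class GateConfig:
    # All values are illustrative assumptions, not the paper's settings.
    min_isolation_steps: int = 50   # teacher stays frozen at least this long
    min_reward_gain: float = 0.01   # required gain since the last refresh
    tail_quantile: float = 0.95     # quantile used to summarize the length tail
    max_tail_length: int = 4096     # tail length above this is deemed unsafe


def length_tail(lengths: Sequence[int], q: float) -> int:
    """Return an upper-quantile response length (simple order statistic)."""
    if not lengths:
        raise ValueError("lengths must be non-empty")
    ordered = sorted(lengths)
    idx = min(len(ordered) - 1, int(q * len(ordered)))
    return ordered[idx]


def should_refresh(steps_since_refresh: int,
                   reward_now: float,
                   reward_at_last_refresh: float,
                   lengths: Sequence[int],
                   cfg: GateConfig) -> bool:
    """Gate a teacher refresh on student state rather than on the clock."""
    if steps_since_refresh < cfg.min_isolation_steps:
        return False  # keep the isolation period intact
    improved = (reward_now - reward_at_last_refresh) >= cfg.min_reward_gain
    tail_safe = length_tail(lengths, cfg.tail_quantile) <= cfg.max_tail_length
    return improved and tail_safe  # both conditions must hold


cfg = GateConfig()
healthy = [100] * 20
long_tailed = [100] * 17 + [9000] * 3
print(should_refresh(60, 0.9, 0.5, healthy, cfg))      # reward up, tail safe
print(should_refresh(60, 0.9, 0.5, long_tailed, cfg))  # reward up, tail unsafe
```

Because the gate depends on observed reward and response lengths, the effective refresh frequency in this sketch varies with how quickly the student consolidates, rather than being fixed in advance.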


Source: arXiv Generated at: 2026-06-03 00:00:00 UTC
