When Should the Teacher Move? Temporal Coupling and Stability in Self On-Policy Distillation
Title: Timing the Teacher’s Update: Temporal Coupling and Stability in Self On-Policy Distillation
Abstract
Self on-policy distillation trains a student policy against a teacher model drawn from the student’s own parameter history. However, the schedule that governs when the teacher updates, and thereby sets the temporal coupling between the two models, has not been rigorously analyzed as a factor in training stability. Through a systematic sweep of update schedules on Qwen3-8B, we show that isolation periods (intervals during which the teacher remains frozen between updates), rather than the teacher’s age, are the critical structural element for stable learning.
To understand the underlying dynamics, we introduce a diagnostic framework that examines temporal KL structure, refresh shock, and length-tail risk. The analysis reveals a phenomenon we term state-oblivious collapse: fixed schedules that perform best in short-horizon training fail catastrophically in long-horizon settings, where a clock-driven refresh can irreversibly copy a transiently drifting student into the teacher in a single step. This failure mode is not detectable in short-horizon evaluations and is mechanistically distinct from the chronic contamination associated with Exponential Moving Average (EMA) methods.
To mitigate this failure mode, we propose Consolidation-Gated Teacher Refresh (CGTR). CGTR retains isolation periods but gates each teacher update on joint evidence of reward improvement and length-tail safety, so the teacher moves only when the student has genuinely consolidated, not when a clock fires. With a single shared parameter set and no per-dataset retuning, CGTR exhibits zero collapses and achieves the highest final scores on all four evaluated tasks (Chemistry, Biology, Physics, and ToolUse), automatically adapting its refresh frequency to the learning dynamics of each domain.
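As an illustration of the gating idea only, the following is a minimal sketch of a refresh decision that keeps an isolation period and then requires both reward improvement and length-tail safety. The abstract does not specify the actual CGTR criteria, so every name and threshold here (`GateConfig`, `should_refresh`, `min_isolation_steps`, `min_reward_gain`, `length_quantile`, `max_tail_length`) is a hypothetical assumption, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class GateConfig:
    # All values are hypothetical placeholders, not settings from the paper.
    min_isolation_steps: int = 50   # teacher stays frozen at least this long
    min_reward_gain: float = 0.01   # required reward gain since last refresh
    length_quantile: float = 0.95   # which tail of response lengths to check
    max_tail_length: int = 4096     # tail length above this counts as unsafe


def tail_length(lengths: Sequence[int], q: float) -> int:
    """Empirical upper-tail quantile of response lengths (simple rank rule)."""
    ordered = sorted(lengths)
    idx = min(len(ordered) - 1, int(q * len(ordered)))
    return ordered[idx]


def should_refresh(
    steps_since_refresh: int,
    reward_now: float,
    reward_at_last_refresh: float,
    lengths: Sequence[int],
    cfg: GateConfig,
) -> bool:
    """Return True only when the student appears to have consolidated."""
    # Isolation period: never move the teacher before the minimum interval.
    if steps_since_refresh < cfg.min_isolation_steps:
        return False
    # Without length evidence, stay conservative and keep the teacher frozen.
    if not lengths:
        return False
    improved = (reward_now - reward_at_last_refresh) >= cfg.min_reward_gain
    tail_safe = tail_length(lengths, cfg.length_quantile) <= cfg.max_tail_length
    # Joint evidence: both conditions must hold, unlike a clock-driven refresh.
    return improved and tail_safe
```

Under this sketch, a clock-driven schedule corresponds to returning True as soon as the isolation check passes, which is the behavior the gate is meant to prevent when the student is transiently drifting.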
Source: arXiv Generated at: 2026-06-03 00:00:00 UTC



