Cornerstones or Stumbling Blocks? Deciphering the Rock Tokens in On-Policy Distillation
Abstract:
Recent work on Reinforcement Learning with Verifiable Rewards (RLVR) has shown that a small fraction of critical tokens accounts for most reasoning gains, but the corresponding token-level dynamics in On-Policy Distillation (OPD) remain largely unexplored. This study examines high-loss tokens, which are the most direct signal of student-teacher divergence under OPD’s per-token KL objective and which, according to existing literature, should become rarer as training converges. Our empirical results show otherwise. Even after OPD training appears to have plateaued, a substantial share of tokens still exhibits persistently high loss; we call these "Rock Tokens," and they can make up as much as 18% of the tokens in generated outputs.
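To make the notion of a persistently high-loss token concrete, the following is a minimal sketch and not the authors' procedure. It assumes the per-token objective is the reverse KL divergence KL(student || teacher) over the vocabulary, and that the same evaluation sequences are scored at several checkpoints. The function names, the `threshold`, and the `min_fraction` persistence criterion are hypothetical choices made for illustration only.

```python
import math

def log_softmax(logits):
    # Numerically stable log-softmax over one vocabulary distribution.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def per_token_kl(student_logits, teacher_logits):
    # Assumed objective: reverse KL(student || teacher) at each position.
    # Each argument is a list of per-position logit vectors.
    losses = []
    for s_row, t_row in zip(student_logits, teacher_logits):
        log_s, log_t = log_softmax(s_row), log_softmax(t_row)
        losses.append(sum(math.exp(ls) * (ls - lt) for ls, lt in zip(log_s, log_t)))
    return losses

def persistent_high_loss(loss_history, threshold, min_fraction=1.0):
    # loss_history[c][i] is the loss of token i at checkpoint c (hypothetical layout).
    # A token is flagged if its loss exceeds `threshold` in at least
    # `min_fraction` of the checkpoints.
    n_ckpt = len(loss_history)
    flagged = []
    for i in range(len(loss_history[0])):
        high = sum(1 for losses in loss_history if losses[i] > threshold)
        flagged.append(high / n_ckpt >= min_fraction)
    return flagged
```

With this criterion, a token whose loss stays above the threshold at every late checkpoint would be flagged, while one whose loss falls as training converges would not.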
Our analysis uncovers two paradoxes. First, although Rock Tokens occur frequently and contribute a disproportionately large share of the total gradient norm, they remain unchanged throughout training, effectively resisting correction from the teacher model. Second, causal interventions show that these tokens have almost no effect on the model’s actual reasoning capability. These results imply that a considerable share of the optimization budget is spent on structural and discourse residuals that the student model either cannot or need not learn. By dissecting these behaviors, we show that intentionally skipping these "stumbling blocks" can greatly accelerate the alignment process. This finding challenges the need for uniform token weighting and points to a more efficient framework for large-scale model distillation.
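One simple way to read "skipping" is as a departure from uniform token weighting in which flagged tokens get zero weight in the averaged loss. The sketch below illustrates that reading only. The abstract does not specify the mechanism, and `masked_mean_loss` and `skip_mask` are hypothetical names.

```python
def masked_mean_loss(token_losses, skip_mask):
    # Average the per-token losses, excluding positions where skip_mask is True.
    # Assumption: skipped tokens contribute neither loss nor gradient.
    kept = [loss for loss, skip in zip(token_losses, skip_mask) if not skip]
    if not kept:
        return 0.0  # every token was skipped, so there is nothing to optimize
    return sum(kept) / len(kept)

# Example: the middle token is flagged and excluded from the average.
example = masked_mean_loss([0.2, 3.0, 0.4], [False, True, False])
```

Under uniform weighting the same three losses would average to 1.2, dominated by the single high-loss token. With the mask, the average reflects only the two remaining tokens.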
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC





