Rethinking the Role of Temperature in Large Language Model Distillation
Title: Reevaluating Temperature’s Function in Large Language Model Distillation
Original: arXiv:2606.00306v1 Announce Type: cross Abstract: Reverse Kullback-Leibler (RKL) divergence is widely favored over forward KL (FKL) in large language model (LLM) distillation, yet this preference is largely based on comparisons that omit the temperature $\tau$, overlooking its central role in softening teacher distributions and improving knowledge transfer. In this work, we revisit temperature in LLM distillation and show that it fundamentally changes the comparison between FKL and RKL. Our analysis reveals an asymmetric effect: temperature substantially enriches FKL with non-dominant token signals, whereas it mainly rescales RKL gradients, causing FKL to benefit much more from $\tau$ scaling than RKL. This asymmetry overturns the standard empirical conclusion: although RKL outperforms FKL at $\tau=1$, FKL consistently surpasses RKL at higher temperatures across instruction-following benchmarks. Moreover, the impact of temperature is not limited to FKL; it improves a broader family of distillation objectives, enabling simple KL-based methods to achieve competitive performance against recent state-of-the-art LLM distillation approaches.
Rewritten: Title: Reassessing Temperature’s Impact in Large Language Model Distillation
Original: arXiv:2606.00306v1 Announce Type: cross Abstract: The widespread preference for reverse Kullback-Leibler (RKL) divergence over forward KL (FKL) in distilling large language models (LLMs) rests largely on comparisons that leave out the temperature parameter $\tau$. This omission ignores $\tau$'s central function in softening teacher probability distributions to improve knowledge transfer. We re-examine the role of temperature in LLM distillation and show that it fundamentally changes the comparison between FKL and RKL. Our analysis reveals an asymmetry: temperature mainly rescales the gradients of RKL, whereas it substantially enriches FKL with signals from non-dominant tokens. Consequently, FKL benefits much more from scaling $\tau$ than RKL does. This asymmetry overturns the standard empirical conclusion: although RKL outperforms FKL at $\tau=1$, FKL consistently surpasses RKL at higher temperatures across instruction-following benchmarks. Furthermore, the benefits of temperature extend beyond FKL to a broader family of distillation objectives, allowing simple KL-based methods to compete with recent state-of-the-art LLM distillation approaches.
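To make the quantities in the abstract concrete, the following is a minimal sketch of temperature-scaled FKL and RKL over a single next-token distribution. It is not the paper's implementation: the function names, the toy logits, and the choice to apply the same $\tau$ to both teacher and student logits are assumptions made for illustration, and any loss rescaling (such as a $\tau^2$ factor) is left out.

```python
import math


def softmax(logits, tau=1.0):
    """Temperature-scaled softmax; a larger tau flattens the distribution."""
    scaled = [z / tau for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q):
    """KL(p || q) for discrete distributions with full support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def fkl(teacher_logits, student_logits, tau=1.0):
    """Forward KL: KL(teacher || student), both softened by tau (assumption)."""
    p = softmax(teacher_logits, tau)
    q = softmax(student_logits, tau)
    return kl(p, q)


def rkl(teacher_logits, student_logits, tau=1.0):
    """Reverse KL: KL(student || teacher), both softened by tau (assumption)."""
    p = softmax(teacher_logits, tau)
    q = softmax(student_logits, tau)
    return kl(q, p)


def non_dominant_mass(logits, tau=1.0):
    """Probability mass the softened distribution puts outside its top token."""
    return 1.0 - max(softmax(logits, tau))


if __name__ == "__main__":
    # Hypothetical toy logits over a 4-token vocabulary.
    teacher = [4.0, 1.0, 0.5, 0.0]
    student = [2.0, 1.5, 0.2, 0.1]
    for tau in (1.0, 2.0, 4.0):
        print(
            f"tau={tau}: non-dominant teacher mass={non_dominant_mass(teacher, tau):.3f}, "
            f"FKL={fkl(teacher, student, tau):.4f}, RKL={rkl(teacher, student, tau):.4f}"
        )
```

Running the script prints, for each $\tau$, how much teacher probability mass falls on non-dominant tokens alongside the two divergences. The sketch only illustrates the softening effect of $\tau$ on the teacher distribution; it does not reproduce the gradient analysis or the benchmark results reported in the abstract.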
Source: arXiv Generated at: 2026-06-02 00:00:00 UTC




