Global News Digest

arXiv

Rethinking the Role of Temperature in Large Language Model Distillation

Title: Reevaluating Temperature’s Function in Large Language Model Distillation

Original: arXiv:2606.00306v1 Announce Type: cross Abstract: Reverse Kullback-Leibler (RKL) divergence is widely favored over forward KL (FKL) in large language model (LLM) distillation, yet this preference is largely based on comparisons that omit the temperature $\tau$, overlooking its central role in softening teacher distributions and improving knowledge transfer. In this work, we revisit temperature in LLM distillation and show that it fundamentally changes the comparison between FKL and RKL. Our analysis reveals an asymmetric effect: temperature substantially enriches FKL with non-dominant token signals, whereas it mainly rescales RKL gradients, causing FKL to benefit much more from $\tau$ scaling than RKL. This asymmetry overturns the standard empirical conclusion: although RKL outperforms FKL at $\tau=1$, FKL consistently surpasses RKL at higher temperatures across instruction-following benchmarks. Moreover, the impact of temperature is not limited to FKL; it improves a broader family of distillation objectives, enabling simple KL-based methods to achieve competitive performance against recent state-of-the-art LLM distillation approaches.

Rewritten: Title: Reassessing Temperature’s Impact in Large Language Model Distillation

Abstract: Reverse Kullback-Leibler (RKL) divergence is widely preferred over forward KL (FKL) for distilling large language models (LLMs), but that preference rests largely on comparisons that omit the temperature parameter $\tau$. This omission overlooks $\tau$'s central role in softening teacher probability distributions and improving knowledge transfer. We re-examine temperature in LLM distillation and show that it fundamentally changes the comparison between FKL and RKL. Our analysis reveals an asymmetry: temperature mainly rescales RKL gradients, whereas it substantially enriches FKL with signals from non-dominant tokens, so FKL benefits much more from scaling $\tau$ than RKL does. This overturns the standard empirical conclusion: although RKL outperforms FKL at $\tau=1$, FKL consistently surpasses RKL at higher temperatures across instruction-following benchmarks. The benefit of temperature also extends beyond FKL to a broader family of distillation objectives, allowing simple KL-based methods to compete with recent state-of-the-art LLM distillation approaches.
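To make the two objectives concrete, here is a minimal sketch of temperature-scaled FKL and RKL over a single token position. It is not the paper's implementation: the logits are made-up values, the function names are hypothetical, and applying the same $\tau$ to both teacher and student is an assumption borrowed from the classic distillation setup.

```python
import math

def softmax(logits, tau=1.0):
    """Temperature-scaled softmax; larger tau flattens the distribution."""
    scaled = [z / tau for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    """KL(p || q) for two discrete distributions over the same tokens."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def forward_kl(teacher_logits, student_logits, tau=1.0):
    """FKL: KL(teacher || student), both softened with the same tau (assumption)."""
    return kl(softmax(teacher_logits, tau), softmax(student_logits, tau))

def reverse_kl(teacher_logits, student_logits, tau=1.0):
    """RKL: KL(student || teacher), both softened with the same tau (assumption)."""
    return kl(softmax(student_logits, tau), softmax(teacher_logits, tau))

# Hypothetical logits over a 4-token vocabulary, chosen only for illustration.
teacher_logits = [6.0, 2.0, 1.0, 0.0]
student_logits = [4.0, 3.0, 0.5, 0.5]

for tau in (1.0, 2.0, 4.0):
    teacher_probs = softmax(teacher_logits, tau)
    non_dominant_mass = 1.0 - max(teacher_probs)
    print(
        f"tau={tau}: non-dominant teacher mass={non_dominant_mass:.3f}, "
        f"FKL={forward_kl(teacher_logits, student_logits, tau):.4f}, "
        f"RKL={reverse_kl(teacher_logits, student_logits, tau):.4f}"
    )
```

Running the loop shows the teacher's probability mass on non-dominant tokens growing as $\tau$ increases, which is the softening effect the abstract refers to. The toy numbers say nothing about which objective distills better; that claim rests on the paper's own experiments.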


Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC

Related Articles

Schroders Renewable Unit Targets AI Assets as Power Demand Soars
Bloomberg

Schroders’ renewable unit targets AI infrastructure, pivoting to meet soaring energy demand from artificial intelligence...

State Street's Paglia on SBI Group Partnership, ETFs
Bloomberg

State Street's Paglia discusses the SBI Group partnership and ETFs.

Nvidia Boss Says Workers Should Be Paid ‘as Much as Possible’
Bloomberg

Nvidia CEO Jensen Huang advocates for paying workers “as much as possible,” emphasizing maximum compensation. This stanc...

TSE Talking With Regulator For Easing ETF Listing Rules
Bloomberg

The Tokyo Stock Exchange is in talks with regulators about easing ETF listing rules. This aims to simplify market access an...

S&P DJI CEO on Japan Markets, Mega IPOs
Bloomberg

S&P DJI CEO discusses Japan's financial markets and major IPOs.