Universal One-third Time Scaling in Learning Peaked Distributions
Abstract: The high computational cost of training large language models (LLMs) is partially attributed to the slow, power-law convergence of their loss functions, a phenomenon whose source is still contested. By conducting a systematic analysis of simplified models and empirically evaluating LLMs, we demonstrate that this behavior stems intrinsically from the combination of softmax and cross-entropy. When the model learns peaked probability distributions, such as next-token predictions, these components typically cause losses and gradients to vanish according to a power law. This effect persists irrespective of various microscopic details, establishing a fundamental optimization bottleneck. Consequently, the loss exhibits a power-law time scaling with a universal exponent of $1/3$. These findings offer a mechanistic account for the neural scaling laws observed in practice and point toward novel strategies for enhancing the efficiency of LLM training.
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC
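
The abstract states that softmax combined with cross-entropy causes losses and gradients to vanish together when the target distribution is peaked. The following is a minimal sketch of that one ingredient only, under assumptions that are not taken from the paper: a hypothetical 5-token vocabulary, a one-hot target, and logits that differ from zero only by a `margin` on the correct token. It does not reproduce the $1/3$ time scaling, which depends on the training dynamics analyzed in the paper; it only shows that the gradient with respect to the logits, `softmax(logits) - one_hot`, shrinks at the same rate as the loss as the prediction sharpens.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy_and_grad(logits, target):
    """Cross-entropy loss for a one-hot target and its gradient w.r.t. the logits."""
    p = softmax(logits)
    loss = -math.log(p[target])
    # Standard identity: d(loss)/d(logits) = softmax(logits) - one_hot(target)
    grad = [pi - (1.0 if i == target else 0.0) for i, pi in enumerate(p)]
    return loss, grad


def sweep(margins, vocab_size=5, target=0):
    """Return (margin, loss, gradient norm) as the correct logit is raised.

    vocab_size and the margin grid are illustrative assumptions.
    """
    rows = []
    for margin in margins:
        logits = [0.0] * vocab_size
        logits[target] = margin
        loss, grad = cross_entropy_and_grad(logits, target)
        grad_norm = math.sqrt(sum(g * g for g in grad))
        rows.append((margin, loss, grad_norm))
    return rows


rows = sweep([0.0, 2.0, 4.0, 8.0, 12.0])
for margin, loss, grad_norm in rows:
    print(f"margin={margin:5.1f}  loss={loss:.3e}  |grad|={grad_norm:.3e}")
```

In this toy setting the ratio of gradient norm to loss stays of order one across the sweep, so a model that has nearly fit a peaked target receives a correspondingly small training signal. How this translates into a specific power law in training time is the subject of the paper's analysis and is not derived here.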






