Spectral Scaling Laws of Muon
Title: Power Laws Governing the Singular Value Spectrum in Muon Optimizers
Abstract: Orthonormalized update rules have emerged as a dominant optimization strategy for training large language models, with recent open-source state-of-the-art architectures increasingly adopting the Muon optimizer. To maintain computational tractability, Muon relies on the Newton–Schulz (NS) iteration for orthonormalization. However, because the NS method is an approximation, it fails to orthonormalize directions associated with small singular values. While Muon applies NS to the momentum matrix at every training step, the behavior of the singular value spectrum within these matrices remains poorly understood, particularly regarding how this behavior evolves as model size increases. This paper presents the first systematic investigation into this gap. By tracking singular value quantiles of the momentum buffer across layers in models ranging from 77 million to 2.8 billion parameters, we identify a consistent pattern: following a brief initial burn-in period, the quantiles stabilize at levels dictated by both layer type and model scale. Remarkably, these stabilization values adhere to clean power laws relative to model size, featuring exponents that vary by layer. Mid-to-late depth layers exhibit mild scaling with model size $M$ (approximately $M^{-0.25}$), suggesting that the standard five-step NS configuration employed in academic settings will remain effective for orthonormalizing these layers at significantly larger scales. Conversely, certain late-stage layers scale much more aggressively (up to $M^{-0.96}$) and are at risk of entering the NS failure regime at frontier model scales unless additional NS iterations or optimized coefficients are utilized. Given that NS iterations are computationally costly at scale, our derived laws provide practitioners with a principled, layer-specific method for determining the minimum NS configuration necessary to orthonormalize critical directions. This approach avoids superfluous computation while preserving the quality of the updates.
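The interaction between the power laws and the NS step count can be illustrated with a minimal sketch. An odd-polynomial NS update acts on each singular value independently, so a scalar recurrence is enough to show why small singular values are left far from 1 after five steps. Everything specific here is an assumption for illustration rather than the paper's setup: the classical cubic coefficients (1.5, -0.5) stand in for whatever coefficients the paper uses, singular values are assumed to be normalized to at most 1, the function names are hypothetical, and the reference quantile `q_ref` is a made-up value, not a measured one.

```python
def ns_scalar(s, steps=5, a=1.5, b=-0.5, c=0.0):
    """Apply the odd-polynomial NS map s -> a*s + b*s**3 + c*s**5 `steps` times.

    Defaults are the classical cubic Newton-Schulz coefficients (an assumption).
    """
    for _ in range(steps):
        s = a * s + b * s**3 + c * s**5
    return s


def power_law_quantile(q_ref, m_ref, m, exponent):
    """Extrapolate a quantile with q(M) = q_ref * (M / m_ref) ** exponent."""
    return q_ref * (m / m_ref) ** exponent


def min_ns_steps(s, tol=0.01, max_steps=100, **coeffs):
    """Smallest step count k with |NS_k(s) - 1| <= tol, or None if not reached."""
    for k in range(max_steps + 1):
        if abs(ns_scalar(s, steps=k, **coeffs) - 1.0) <= tol:
            return k
    return None


if __name__ == "__main__":
    q_ref = 0.05  # hypothetical quantile at the 2.8B-parameter reference size
    for exponent in (-0.25, -0.96):  # the two exponents quoted in the abstract
        q = power_law_quantile(q_ref, m_ref=2.8e9, m=2.8e11, exponent=exponent)
        print(f"exponent={exponent}: quantile={q:.3e}, "
              f"after 5 steps={ns_scalar(q):.3e}, min steps={min_ns_steps(q)}")
```

Under these assumptions, a quantile that shrinks with the steeper exponent needs noticeably more iterations to reach the same tolerance, which is the layer-specific trade-off the abstract describes.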
Source: arXiv
Generated at: 2026-06-04 00:00:00 UTC


