Spectral Scaling Laws of Muon

Title: Power Laws Governing the Singular Value Spectrum in Muon Optimizers

Abstract: Orthonormalized update rules have emerged as a dominant optimization strategy for training large language models, with recent open-source state-of-the-art architectures increasingly adopting the Muon optimizer. To maintain computational tractability, Muon relies on the Newton–Schulz (NS) iteration for orthonormalization. However, because the NS method is an approximation, it fails to orthonormalize directions associated with small singular values. While Muon applies NS to the momentum matrix at every training step, the behavior of the singular value spectrum within these matrices remains poorly understood, particularly regarding how this behavior evolves as model size increases. This paper presents the first systematic investigation into this gap. By tracking singular value quantiles of the momentum buffer across layers in models ranging from 77 million to 2.8 billion parameters, we identify a consistent pattern: following a brief initial burn-in period, the quantiles stabilize at levels dictated by both layer type and model scale. Remarkably, these stabilization values adhere to clean power laws relative to model size, featuring exponents that vary by layer. Mid-to-late depth layers exhibit mild scaling with model size $M$ (approximately $M^{-0.25}$), suggesting that the standard five-step NS configuration employed in academic settings will remain effective for orthonormalizing these layers at significantly larger scales. Conversely, certain late-stage layers scale much more aggressively (up to $M^{-0.96}$) and are at risk of entering the NS failure regime at frontier model scales unless additional NS iterations or optimized coefficients are utilized. Given that NS iterations are computationally costly at scale, our derived laws provide practitioners with a principled, layer-specific method for determining the minimum NS configuration necessary to orthonormalize critical directions. This approach avoids superfluous computation while preserving the quality of the updates.
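
The NS failure regime described in the abstract can be illustrated with a minimal NumPy sketch. It is not the paper's code. It assumes the quintic coefficients (3.4445, -4.7750, 2.0315) commonly found in public Muon implementations, since the abstract does not say which coefficients were used. The helper `directional_gains` and the test spectrum are hypothetical, chosen only to show how each singular direction is mapped after a given number of NS steps.

```python
import numpy as np

# Assumption: quintic coefficients commonly used in public Muon implementations.
# The abstract does not state which coefficients the paper uses.
A, B, C = 3.4445, -4.7750, 2.0315


def newton_schulz(G, steps=5, eps=1e-7):
    """Approximately orthonormalize G with a quintic Newton-Schulz iteration."""
    # Frobenius normalization keeps every singular value at or below 1.
    X = G / (np.linalg.norm(G) + eps)
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T  # iterate on the wide orientation so X @ X.T is the small Gram matrix
    for _ in range(steps):
        S = X @ X.T
        X = A * X + (B * S + C * (S @ S)) @ X
    return X.T if transposed else X


def directional_gains(sigmas, steps=5, rows=8, seed=0):
    """Hypothetical helper: build a matrix with the given singular values and
    return the value each singular direction has after `steps` NS iterations.
    A value near 1 means the direction was (approximately) orthonormalized."""
    rng = np.random.default_rng(seed)
    k = len(sigmas)
    U, _ = np.linalg.qr(rng.standard_normal((rows, k)))  # orthonormal columns
    V, _ = np.linalg.qr(rng.standard_normal((k, k)))     # orthogonal matrix
    G = U @ np.diag(sigmas) @ V.T
    out = newton_schulz(G, steps=steps)
    # NS preserves singular vectors, so projecting back isolates each direction.
    return np.diag(U.T @ out @ V)


if __name__ == "__main__":
    spectrum = [1.0, 1e-1, 1e-2, 1e-3, 1e-4]
    for steps in (5, 8):
        gains = directional_gains(spectrum, steps=steps)
        print(steps, np.round(gains, 3))
```

With five steps, the directions with the largest input singular values land in a band around 1, while the smallest remain far below it. Increasing the step count pulls the small directions up, at the cost of extra matrix multiplications. This is the trade-off the abstract's layer-specific scaling laws are meant to inform.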


Source: arXiv Generated at: 2026-06-04 00:00:00 UTC
