DynMuon: A Dynamic Spectral Shaping View of Muon
Original: arXiv:2605.17109v3 Announce Type: replace-cross Abstract: In recent years, Muon has emerged as the dominant method for training large language models, and transformers more broadly. Its essential difference from standard gradient descent methods is that it replaces the usual update matrix $M=U\Sigma V^\top$ with its polar factor $UV^\top$. In this work, we consider a class of Muon-like updates in which the update $M$ is replaced with $U\Sigma^p V^\top$ for some parameter $p$. We call this a "spectral-shaping" operation, and develop a theory of how to pick $p$ depending on (a) the local curvature of the loss function, (b) noise stemming from stochastic gradients and label noise, and (c) the training stage. Our theory and experiments reveal a previously overlooked behavior: positive $p$ helps early in training by emphasizing high-curvature directions and accelerating signal contraction, while mildly negative $p$ helps later by reallocating update strength toward low-curvature directions that still contain useful training signal. Building on this insight, we propose DynMuon, an efficient dynamic spectral shaping method that schedules $p$ from positive to mildly negative over training. Extensive experiments across model sizes, architectures, and training settings show that DynMuon consistently achieves lower validation loss than Muon, while requiring 10.6-26.5% fewer steps to reach the same target loss. Our code is available at https://github.com/fzwark/DynMuon.
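Illustration: the spectral-shaping operation $U\Sigma^p V^\top$ and a positive-to-negative schedule for $p$ can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions, not the DynMuon implementation. The names `spectral_shape` and `p_schedule`, the explicit SVD, the `eps` clamp, the linear schedule shape, and the endpoint values `p_start=0.5` and `p_end=-0.1` are all hypothetical choices made for illustration; the abstract does not specify how the method computes the shaped update or which schedule it uses.

```python
import numpy as np


def spectral_shape(M, p, eps=1e-8):
    """Return U diag(s**p) V^T, where M = U diag(s) V^T is the thin SVD.

    p = 1 recovers M itself; p = 0 gives the polar factor U V^T.
    An explicit SVD is used for clarity only (assumption for illustration).
    """
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    # Clamp tiny singular values so that a negative p cannot divide by ~0.
    s = np.maximum(s, eps)
    # Scaling the columns of U by s**p is the same as U @ diag(s**p).
    return (U * s**p) @ Vt


def p_schedule(step, total_steps, p_start=0.5, p_end=-0.1):
    """Hypothetical linear schedule moving p from p_start to p_end."""
    t = min(max(step / max(total_steps - 1, 1), 0.0), 1.0)
    return (1.0 - t) * p_start + t * p_end


# Small demonstration on a random 4x3 "update matrix".
rng = np.random.default_rng(0)
M = rng.standard_normal((4, 3))

identity_case = spectral_shape(M, 1.0)   # same as M, up to rounding
polar_case = spectral_shape(M, 0.0)      # all singular values equal to 1

total_steps = 100
early_update = spectral_shape(M, p_schedule(0, total_steps))
late_update = spectral_shape(M, p_schedule(total_steps - 1, total_steps))
```

With a positive exponent the largest singular directions of `M` keep a relatively larger share of the update, while a mildly negative exponent inverts the ordering and gives relatively more weight to directions with small singular values, which is the qualitative behavior the abstract describes for early and late training.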
Source: arXiv Generated at: 2026-06-02 00:00:00 UTC





