How Much Orthogonalization Does Muon Need?
Title: Reevaluating the Orthogonalization Requirements of Muon
Abstract:
Muon optimizers improve neural network training by replacing ill-conditioned momentum updates with approximately semi-orthogonal ones. This raises a practical question: how much orthogonalization does Muon actually need? To investigate, we use a relaxed cubic Newton–Schulz schedule tailored to Muon’s low-precision singular value band. The five-step cubic schedule requires only ten dominant matrix multiplications (two per step), compared with fifteen (three per step) for five quintic Newton–Schulz iterations. The cubic schedule is not intended as a better polar solver; it is a principled, low-cost alternative that lets us examine how polar accuracy, spectral shaping, and training performance relate to one another.
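The matrix-multiplication count above can be illustrated with a minimal NumPy sketch. It uses the classical cubic Newton–Schulz coefficients (1.5, -0.5) as a stand-in, since the relaxed schedule's coefficients are not given here; the function name `cubic_newton_schulz`, the Frobenius normalization, and the `eps` value are assumptions made for illustration only.

```python
import numpy as np

def cubic_newton_schulz(G, steps=5, a=1.5, b=-0.5, eps=1e-7):
    """Classical cubic Newton-Schulz iteration X <- a*X + b*(X X^T) X.

    NOTE: a=1.5, b=-0.5 are the textbook coefficients, used here as an
    assumption; they are not the relaxed schedule studied in the paper.
    Returns the iterate and the number of dominant matrix products used.
    """
    # Frobenius normalization puts every singular value in (0, 1].
    X = G / (np.linalg.norm(G) + eps)
    # Work on the wide orientation so the Gram matrix X X^T is the small one.
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    matmuls = 0
    for _ in range(steps):
        A = X @ X.T                # product 1: Gram matrix
        X = a * X + b * (A @ X)    # product 2: apply the cubic term
        matmuls += 2
    if transposed:
        X = X.T
    return X, matmuls

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    G = rng.standard_normal((6, 4))
    X, n = cubic_newton_schulz(G)
    print("dominant matmuls:", n)
    print("singular values:", np.linalg.svd(X, compute_uv=False))
```

Each cubic step costs two products, so five steps cost ten. A quintic step of the form aX + (bA + cA²)X needs a third product to form A², which gives fifteen for five steps.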
Using synthetic diagnostics, NanoGPT ablations, and training runs on hybrid MoE/Mamba architectures, we show that training quality does not improve monotonically with the precision of the polar decomposition. On GPT-2 Small, the truncated Polar Express, Muon-Jordan, the cubic Newton–Schulz method, and an explicit FP32 SVD polar factor all reach nearly identical final loss. On hybrid MoE/Mamba models ranging from one to four billion parameters, cubic5 (the five-step cubic schedule) matches the Muon-Jordan quintic update to within approximately $10^{-3}$ in validation loss. These findings support cubic5 as a viable, low-cost orthogonalization variant for Muon and provide empirical evidence of training-quality parity in the tested settings.
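To make "precision of the polar decomposition" concrete, the sketch below computes an explicit FP32 SVD polar factor and a relative Frobenius distance to it. The names `svd_polar_factor` and `polar_error`, and the choice of metric, are hypothetical and chosen for illustration; the abstract does not specify how polar accuracy was measured.

```python
import numpy as np

def svd_polar_factor(G):
    """Semi-orthogonal polar factor U V^T from a thin SVD computed in FP32."""
    U, _, Vt = np.linalg.svd(G.astype(np.float32), full_matrices=False)
    return U @ Vt

def polar_error(X, G):
    """Relative Frobenius distance between X and the SVD polar factor of G.

    This metric is an assumption for illustration, not the paper's definition.
    """
    Q = svd_polar_factor(G)
    return float(np.linalg.norm(X - Q) / np.linalg.norm(Q))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    G = rng.standard_normal((8, 4))
    Q = svd_polar_factor(G)
    # Columns of Q are orthonormal up to FP32 rounding.
    print("max |Q^T Q - I|:", np.abs(Q.T @ Q - np.eye(4)).max())
    # A crude update (the Frobenius-normalized input) is far from the polar factor.
    print("error of normalized G:", polar_error(G / np.linalg.norm(G), G))
```

An approximate orthogonalizer can be scored by passing its output as `X`. The abstract's claim is that a lower score of this kind does not necessarily translate into lower training loss.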
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC





