arXiv

Why Muon Outperforms Adam: A Curvature Perspective

June 4, 2026 · Shuche Wang, Fengzhuo Zhang, Jiaxiang Li, Dirk Bergemann, Zhuoran Yang

Title: Decoding Muon’s Edge Over Adam: Insights from a Curvature Lens

Abstract

While Muon has been observed to double the training efficiency of Adam in large language model development, the geometric mechanisms behind this performance gap have remained elusive. This study takes a first step toward explaining Muon’s advantage over Adam by examining the problem through the lens of curvature.

First, we use a second-order Taylor approximation of the training loss to split each step's loss change into a first-order improvement and a second-order curvature penalty. Our analysis shows that, at matched validation loss, Muon achieves a larger per-step loss reduction than Adam. Although both optimizers deliver similar first-order improvements, Muon consistently incurs a smaller second-order curvature penalty.
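As a rough illustration of the split described above, the sketch below computes the first-order term g^T δ and the curvature penalty (1/2) δ^T H δ for one step on a toy diagonal quadratic. The loss, the curvatures `h`, the point `w`, and the step `delta` are all hypothetical values chosen for illustration, not quantities from the paper. On a quadratic the second-order expansion is exact, so the two terms sum to the true loss change.

```python
# Toy quadratic loss f(w) = 0.5 * sum_i h[i] * w[i]**2, so the Hessian is diag(h).
# All numbers here are illustrative assumptions, not values from the paper.

def loss(w, h):
    return 0.5 * sum(hi * wi * wi for hi, wi in zip(h, w))


def taylor_terms(w, h, delta):
    """Return (first_order, curvature_penalty) for the step w -> w + delta."""
    grad = [hi * wi for hi, wi in zip(h, w)]
    # First-order term: g^T delta (negative when the step is a descent direction).
    first_order = sum(gi * di for gi, di in zip(grad, delta))
    # Second-order term: 0.5 * delta^T H delta (non-negative when H is PSD).
    curvature_penalty = 0.5 * sum(hi * di * di for hi, di in zip(h, delta))
    return first_order, curvature_penalty


h = [10.0, 1.0, 0.1]            # heterogeneous curvature
w = [1.0, 1.0, 1.0]             # current parameters
delta = [-0.05, -0.05, -0.05]   # a candidate update

first_order, curvature_penalty = taylor_terms(w, h, delta)
predicted_change = first_order + curvature_penalty
actual_change = loss([wi + di for wi, di in zip(w, delta)], h) - loss(w, h)

print(first_order, curvature_penalty, predicted_change, actual_change)
```

Two optimizers with the same `first_order` value can still differ in net progress if one of them pays a larger `curvature_penalty`, which is the comparison the paragraph above draws.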

Next, we dissect this curvature penalty into two components: the squared norm of the update and Normalized Directional Sharpness (NDS). Our findings indicate that the update norms for Muon and Adam are similar; consequently, Muon’s reduced curvature penalty stems from its lower NDS rather than a difference in update magnitude.
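The factorization above can be sketched in a few lines. This is a minimal sketch assuming that NDS is the Rayleigh quotient δ^T H δ / ||δ||^2 with the Euclidean norm; the abstract does not spell out the exact definition, so treat the formula, the helper names, and the 2x2 Hessian as illustrative assumptions. The example compares two updates of equal norm whose curvature penalties differ only through NDS.

```python
import math

# Assumed definition (not confirmed by the abstract):
#   NDS(delta) = delta^T H delta / ||delta||^2
#   curvature penalty = 0.5 * ||delta||^2 * NDS(delta)

def quad_form(hess, delta):
    n = len(delta)
    return sum(delta[i] * hess[i][j] * delta[j] for i in range(n) for j in range(n))


def squared_norm(delta):
    return sum(d * d for d in delta)


def nds(hess, delta):
    return quad_form(hess, delta) / squared_norm(delta)


def curvature_penalty(hess, delta):
    return 0.5 * quad_form(hess, delta)


hess = [[4.0, 1.0],
        [1.0, 0.5]]        # hypothetical symmetric Hessian
delta_a = [0.1, 0.0]       # update along the sharper coordinate
delta_b = [0.0, 0.1]       # same norm, along the flatter coordinate

for delta in (delta_a, delta_b):
    penalty = curvature_penalty(hess, delta)
    factored = 0.5 * squared_norm(delta) * nds(hess, delta)
    assert math.isclose(penalty, factored)
    print(squared_norm(delta), nds(hess, delta), penalty)
```

Because both updates have the same squared norm, the gap between their penalties comes entirely from the NDS factor, mirroring the attribution made in the paragraph above.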

We further investigate how model architecture and training data influence Muon’s NDS advantage. Using Zipf-Probabilistic Context-Free Grammar (PCFG) data with controlled levels of imbalance, we demonstrate that data imbalance widens Muon’s NDS advantage over Adam. Additionally, a layer-wise decomposition reveals that during the middle and late phases of training, Muon’s lower NDS is primarily sustained by reduced within-layer curvature.
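One way to picture a layer-wise decomposition is to split the quadratic form δ^T H δ into contributions from the diagonal Hessian blocks (one block per layer) and the remaining cross-layer terms. The sketch below assumes that "within-layer curvature" refers to those diagonal blocks; the abstract does not give the exact procedure, and the 3x3 Hessian, the layer grouping, and the function names are hypothetical.

```python
# Assumption: "within-layer" = sum over layers l of delta_l^T H_ll delta_l,
# and "cross-layer" = everything else in delta^T H delta.

def block_quad_form(hess, delta, rows, cols):
    return sum(delta[i] * hess[i][j] * delta[j] for i in rows for j in cols)


def layer_split(hess, delta, layers):
    """layers: list of index lists, one per layer. Returns (within, cross)."""
    all_idx = [i for layer in layers for i in layer]
    total = block_quad_form(hess, delta, all_idx, all_idx)
    within = sum(block_quad_form(hess, delta, layer, layer) for layer in layers)
    return within, total - within


hess = [[2.0, 0.5, 0.1],
        [0.5, 1.0, 0.2],
        [0.1, 0.2, 3.0]]      # hypothetical symmetric Hessian
layers = [[0, 1], [2]]        # parameters 0-1 form layer A, parameter 2 forms layer B
delta = [1.0, -1.0, 0.5]      # hypothetical update

within, cross = layer_split(hess, delta, layers)
sq_norm = sum(d * d for d in delta)
print(within / sq_norm, cross / sq_norm)  # contributions to the Rayleigh quotient
```

Dividing both parts by the squared update norm gives their shares of the Rayleigh quotient, so the two optimizers can be compared on the within-layer share alone.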

To complement these empirical results, we analyze stylized quadratic problems characterized by heterogeneous curvature and gradients aligned with high-curvature modes. We prove that Muon achieves a lower average NDS than Gradient Descent (GD) by distributing update energy across different curvature groups. When curvature heterogeneity is pronounced, this strategy also yields a lower local quadratic loss after the same number of iterations.
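The intuition can be checked on a deliberately simple toy, which is not the paper's construction. Assume the weight matrix and its gradient are both diagonal, so that orthogonalizing the gradient (the polar factor U V^T of its SVD) reduces to taking the sign of each diagonal entry. The update then puts equal energy on every mode, whereas GD concentrates energy on the high-curvature modes where the gradient is largest. NDS is again assumed to be the Rayleigh quotient δ^T H δ / ||δ||^2, and all numbers are hypothetical.

```python
# Diagonal toy: loss = 0.5 * sum_i h[i] * w[i]**2, gradient g[i] = h[i] * w[i].
# With equal w[i], the gradient is largest on the highest-curvature modes.

def nds_diag(h, delta):
    """Rayleigh quotient for a diagonal Hessian diag(h) (assumed NDS definition)."""
    return sum(hi * di * di for hi, di in zip(h, delta)) / sum(di * di for di in delta)


def sign(x):
    return (x > 0) - (x < 0)


h = [100.0, 10.0, 1.0, 0.1]    # heterogeneous curvature groups
w = [1.0, 1.0, 1.0, 1.0]
grad = [hi * wi for hi, wi in zip(h, w)]

gd_update = [-g for g in grad]                       # energy follows gradient size
muon_like_update = [-float(sign(g)) for g in grad]   # equal energy on every mode

nds_gd = nds_diag(h, gd_update)
nds_muon_like = nds_diag(h, muon_like_update)
print(nds_gd, nds_muon_like)
```

In this toy the sign-based update has an NDS equal to the mean curvature, while the GD update has sum(h^3) / sum(h^2), which is at least as large by Chebyshev's sum inequality. NDS is scale-invariant, so the comparison does not depend on the step size.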

Source: arXiv