arXiv

Why Muon Outperforms Adam: A Curvature Perspective

Title: Decoding Muon’s Edge Over Adam: Insights from a Curvature Lens

Abstract

Muon has been observed to double the training efficiency of Adam in large language model development, yet the geometric mechanisms behind this performance gap have remained elusive. This study takes a first step toward explaining Muon’s advantage over Adam by examining the problem through the lens of curvature.

First, we use a second-order Taylor approximation to characterize the local training landscape. At matched validation loss, Muon achieves a larger per-step loss reduction than Adam. Both optimizers deliver similar first-order improvements, but Muon consistently incurs a smaller second-order curvature penalty.
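
For reference, the expansion behind this comparison can be written as follows. The symbols are notational assumptions, not taken from the abstract: \theta denotes the parameters, \Delta the update applied in one step, g the gradient, and H the Hessian of the loss L at \theta.

```latex
L(\theta + \Delta) - L(\theta) \;\approx\;
  \underbrace{g^\top \Delta}_{\text{first-order term}}
  \;+\;
  \underbrace{\tfrac{1}{2}\,\Delta^\top H \Delta}_{\text{curvature penalty}}
```

Under this reading, a step lowers the loss when the first-order term is negative enough to outweigh the curvature penalty, so two optimizers with similar first-order terms can still differ in per-step progress.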

Next, we dissect this curvature penalty into two components: the squared norm of the update and Normalized Directional Sharpness (NDS). Our findings indicate that the update norms for Muon and Adam are similar; consequently, Muon’s reduced curvature penalty stems from its lower NDS rather than a difference in update magnitude.
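
The split described above can be illustrated numerically. The sketch below assumes NDS is the Rayleigh quotient of the Hessian along the update direction, so that the curvature penalty equals half the squared update norm times NDS; the abstract does not state the exact definition. The function name `taylor_terms` and the two-dimensional quadratic are hypothetical, chosen only for illustration.

```python
import numpy as np

def taylor_terms(g, H, delta):
    """Return (first-order term, squared update norm, NDS, curvature penalty).

    Assumption: NDS = (delta^T H delta) / ||delta||^2, so that
    penalty = 0.5 * ||delta||^2 * NDS.
    """
    first_order = float(g @ delta)
    norm_sq = float(delta @ delta)
    nds = float(delta @ H @ delta) / norm_sq
    penalty = 0.5 * norm_sq * nds
    return first_order, norm_sq, nds, penalty

def loss(theta, H):
    # Toy quadratic loss, for which the second-order expansion is exact.
    return 0.5 * float(theta @ H @ theta)

H = np.diag([10.0, 1.0])      # heterogeneous curvature
theta = np.array([1.0, 1.0])
g = H @ theta                 # gradient of the quadratic at theta
delta = -0.05 * g             # a plain gradient step, used only as an example

first_order, norm_sq, nds, penalty = taylor_terms(g, H, delta)
predicted_change = first_order + penalty
actual_change = loss(theta + delta, H) - loss(theta, H)
```

Because the toy loss is quadratic, `predicted_change` matches `actual_change` up to floating-point error; for a neural network the expansion is only a local approximation.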

We further investigate how model architecture and training data influence Muon’s NDS advantage. Using Zipf-Probabilistic Context-Free Grammar (PCFG) data with controlled levels of imbalance, we show that data imbalance widens Muon’s NDS advantage over Adam. In addition, a layer-wise decomposition shows that during the intermediate and final phases of training, Muon’s lower NDS is sustained primarily by reduced within-layer curvature.
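
One plausible way to read the within-layer versus cross-layer distinction is through the block structure of the Hessian. The notation below is an assumption, not taken from the abstract: the update is partitioned by layer as \Delta = (\Delta_1, \dots, \Delta_K), and H_{lm} denotes the Hessian block coupling layers l and m.

```latex
\Delta^\top H \Delta
  \;=\;
  \underbrace{\sum_{l=1}^{K} \Delta_l^\top H_{ll}\, \Delta_l}_{\text{within-layer curvature}}
  \;+\;
  \underbrace{\sum_{l \neq m} \Delta_l^\top H_{lm}\, \Delta_m}_{\text{cross-layer curvature}}
```

This identity holds for any symmetric H and any partition of the parameters; whether it matches the exact decomposition used in the study is not stated in the abstract.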

Supplementing these empirical results, we analyze stylized quadratic problems characterized by heterogeneous curvature and gradients aligned with high-curvature modes. We prove that Muon achieves a lower average NDS than Gradient Descent (GD) by distributing update energy across different curvature groups. When curvature heterogeneity is pronounced, this strategy also results in a lower local quadratic loss after an equivalent number of iterations.
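
The intuition can be checked on a small toy problem. The sketch below is a hypothetical construction, not the one analyzed in the study: a matrix-valued quadratic with curvature h_i along mode u_i v_i^T, evaluated at a point where the gradient's singular values equal the curvatures. Muon is approximated by exact orthogonalization via SVD with no momentum, whereas practical implementations use a Newton-Schulz iteration, and NDS is assumed to be the Rayleigh quotient of the Hessian along the update.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
# Random orthonormal bases defining the curvature modes u_i v_i^T.
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
h = np.array([100.0, 10.0, 1.0, 0.1])   # heterogeneous curvatures

def mode_coeffs(M):
    # Coefficients u_i^T M v_i for each mode i.
    return np.einsum("ai,ab,bi->i", U, M, V)

def grad(W):
    # Gradient of L(W) = 0.5 * sum_i h_i * (u_i^T W v_i)^2.
    return (U * (h * mode_coeffs(W))) @ V.T

def nds(D):
    # Assumed definition: Hessian quadratic form over squared Frobenius norm.
    return float(np.sum(h * mode_coeffs(D) ** 2)) / float(np.sum(D ** 2))

W = U @ V.T                 # every mode has coefficient 1
G = grad(W)                 # singular values equal h: aligned with high curvature

gd_update = -G              # gradient descent direction
P, _, Qt = np.linalg.svd(G)
muon_update = -(P @ Qt)     # orthogonalized direction, all singular values 1

nds_gd = nds(gd_update)     # curvature average weighted by h_i^2
nds_muon = nds(muon_update) # plain average of the curvatures
```

In this toy setting the GD update concentrates its energy on the sharpest mode, so its NDS is a curvature average weighted by the squared singular values, while the orthogonalized update weights every mode equally and obtains the unweighted mean of `h`.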


Source: arXiv Generated at: 2026-06-04 00:00:00 UTC
