arXiv

Symmetry-Compatible Principle for Optimizer Design: Embeddings, LM Heads, SwiGLU MLPs, and MoE Routers

Title: A Symmetry-Compatible Framework for Optimizer Design: Addressing Embeddings, LM Heads, SwiGLU MLPs, and MoE Routers

Abstract:

Deep learning has long suffered from a geometric mismatch. Contemporary neural network architectures possess rich symmetry and equivariance properties, yet widely used optimizers such as Adam and its variants operate coordinate-wise and therefore do not respect the equivariance structure of the parameter space. To bridge this gap, we propose a symmetry-compatible principle for optimizer design: the update rule for each weight block should be equivariant under the symmetry group acting on that block.
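
One way to make the principle concrete, using notation assumed here for illustration rather than taken from the source: let $G$ be the gradient of a weight block, let $\Phi$ map $G$ to an update direction, and let a group $\mathcal{G}$ act on the block through $\rho$. Equivariance of the update rule can then be written as:

```latex
\Phi\bigl(\rho(g)\,G\bigr) = \rho(g)\,\Phi(G) \qquad \text{for all } g \in \mathcal{G}.
% Bi-orthogonal case: \rho(U, V)\,G = U G V^\top with U, V orthogonal, so
\Phi(U G V^\top) = U\,\Phi(G)\,V^\top .
```

The orthogonal polar factor of a full-rank $G$ satisfies the second identity, whereas an elementwise map such as $\operatorname{sign}(G)$ generally does not.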

Adhering to this principle, we first offer a unified framework for bi-orthogonally equivariant updates applicable to general matrix layers, encompassing methods such as stochastic spectral descent, Muon, Scion, and polar gradient approaches. Crucially, by shifting focus from orthogonal groups to permutation and shared-shift symmetries, we develop symmetry-compatible optimizers tailored for parameter blocks with distinct symmetry profiles. These include embedding and language model (LM) head matrices, SwiGLU MLP projections, and Mixture-of-Experts (MoE) router matrices. Our proposed constructions feature a variety of update mechanisms, including one-sided spectral, row-norm, hybrid row-norm/spectral, row-aware, column-aware, centered row-norm, and left-spectral updates. This results in a cohesive, end-to-end layerwise optimizer stack where each major category of matrix-valued parameters receives an update rule whose equivariance corresponds precisely to its symmetry group.
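
The abstract names row-norm updates without defining them, so the following is only a minimal sketch under an assumed form: each gradient row is rescaled to unit Euclidean norm. The function names (`row_norm_update`, `sign_update`, `equivariance_gap`) and the toy gradient are hypothetical. The sketch checks numerically that this assumed row-norm map commutes with a row permutation combined with a rotation of the column space, while a coordinate-wise sign map (a stand-in for Adam-style elementwise updates) does not.

```python
import math

def row_norm_update(G, eps=1e-12):
    """Assumed row-norm update: rescale every row of G to unit Euclidean norm."""
    out = []
    for row in G:
        norm = math.sqrt(sum(x * x for x in row))
        out.append([x / (norm + eps) for x in row])
    return out

def sign_update(G):
    """Coordinate-wise sign, used here as a stand-in for elementwise updates."""
    return [[float((x > 0) - (x < 0)) for x in row] for row in G]

def permute_rows(G, perm):
    """Left action: reorder the rows of G according to perm."""
    return [G[i] for i in perm]

def rotate_cols(G, theta):
    """Right action: multiply an (n x 2) matrix by a 2x2 rotation matrix."""
    c, s = math.cos(theta), math.sin(theta)
    return [[x * c - y * s, x * s + y * c] for x, y in G]

def max_abs_diff(A, B):
    return max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))

def equivariance_gap(update, G, perm, theta):
    """Largest entrywise difference between update(P G Q) and P update(G) Q."""
    lhs = update(rotate_cols(permute_rows(G, perm), theta))
    rhs = rotate_cols(permute_rows(update(G), perm), theta)
    return max_abs_diff(lhs, rhs)

# Toy gradient with three rows (e.g. tokens) and a 2-dimensional hidden space.
G = [[3.0, 4.0], [-1.0, 2.0], [0.5, -0.5]]
perm = [2, 0, 1]
theta = 0.7

row_gap = equivariance_gap(row_norm_update, G, perm, theta)
sign_gap = equivariance_gap(sign_update, G, perm, theta)
print("row-norm gap:", row_gap)   # at the level of floating-point round-off
print("sign gap:", sign_gap)      # clearly nonzero
```

Both maps commute with the row permutation alone; the difference appears only under the rotation, which preserves each row's norm but not the signs of individual coordinates.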

We validate this principle in pre-training experiments on both dense and sparse MoE language models, using architectures similar to Qwen3-0.6B, Gemma 3 1B, OLMoE-1B-7B, and downsized gpt-oss models. Symmetry-compatible update rules consistently lower final validation loss, mitigate load imbalance in sparse MoE configurations, and in many cases improve training stability compared with standard AdamW updates.


Source: arXiv Generated at: 2026-06-03 00:00:00 UTC
