arXiv

Symmetry-Compatible Principle for Optimizer Design: Embeddings, LM Heads, SwiGLU MLPs, and MoE Routers

June 3, 2026 · Tim Tsz-Kit Lau, Weijie Su

Title: A Symmetry-Compatible Framework for Optimizer Design: Addressing Embeddings, LM Heads, SwiGLU MLPs, and MoE Routers

Abstract:

A persistent geometric disconnect has long characterized deep learning. Although contemporary neural network architectures possess rich symmetry and equivariance properties, widely used optimizers such as Adam and its variants operate coordinate-wise and therefore do not respect the equivariance structure of the parameter space. To bridge this gap, we propose a symmetry-compatible principle for optimizer design: the update rule for each weight block must be equivariant under the symmetry group acting on that block.
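As a hedged illustration of what equivariance means for an update rule, the following sketch uses notation that is assumed here and not taken from the abstract: $G \in \mathbb{R}^{m \times n}$ is the gradient of a weight block and $\Phi$ is the map from gradient to update direction. It states bi-orthogonal equivariance and gives a small counterexample for the entrywise sign map, a common idealization of Adam (Adam with $\beta_1 = \beta_2 = 0$ and $\epsilon = 0$ reduces to it).

```latex
% Bi-orthogonal equivariance of an update map \Phi (assumed notation):
\[
  \Phi(P G Q^\top) = P\,\Phi(G)\,Q^\top
  \qquad \text{for all orthogonal } P \in O(m),\; Q \in O(n).
\]
% Counterexample for the entrywise sign map, with G = I_2, Q = I_2,
% and P a rotation by 45 degrees:
\[
  P = \tfrac{1}{\sqrt{2}}
  \begin{pmatrix} 1 & -1 \\ 1 & 1 \end{pmatrix},
  \qquad
  \operatorname{sign}(P G) =
  \begin{pmatrix} 1 & -1 \\ 1 & 1 \end{pmatrix}
  \;\neq\;
  \tfrac{1}{\sqrt{2}}
  \begin{pmatrix} 1 & -1 \\ 1 & 1 \end{pmatrix}
  = P\,\operatorname{sign}(G).
\]
```

The two sides differ by a factor of $\sqrt{2}$, so the coordinate-wise map depends on the choice of basis, whereas an equivariant map would not.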

Adhering to this principle, we first offer a unified framework for bi-orthogonally equivariant updates applicable to general matrix layers, encompassing methods such as stochastic spectral descent, Muon, Scion, and polar gradient approaches. Crucially, by shifting focus from orthogonal groups to permutation and shared-shift symmetries, we develop symmetry-compatible optimizers tailored for parameter blocks with distinct symmetry profiles. These include embedding and language model (LM) head matrices, SwiGLU MLP projections, and Mixture-of-Experts (MoE) router matrices. Our proposed constructions feature a variety of update mechanisms, including one-sided spectral, row-norm, hybrid row-norm/spectral, row-aware, column-aware, centered row-norm, and left-spectral updates. This results in a cohesive, end-to-end layerwise optimizer stack where each major category of matrix-valued parameters receives an update rule whose equivariance corresponds precisely to its symmetry group.
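The equivariance properties named above can be checked numerically. The NumPy sketch below is illustrative only and is not the authors' implementation: the function names are hypothetical, the polar factor is computed with an exact SVD (practical methods such as Muon typically use iterative approximations), and the row-norm rule is a simplified stand-in for the row-norm updates mentioned in the abstract. It measures how far each rule is from satisfying Phi(L G R^T) = L Phi(G) R^T.

```python
import numpy as np


def polar_update(grad):
    """Orthogonal polar factor U V^T of the gradient, via a thin SVD."""
    u, _, vt = np.linalg.svd(grad, full_matrices=False)
    return u @ vt


def sign_update(grad):
    """Entrywise sign, an idealized coordinate-wise (Adam-like) rule."""
    return np.sign(grad)


def row_norm_update(grad, eps=1e-12):
    """Scale each row of the gradient to (approximately) unit Euclidean norm."""
    norms = np.linalg.norm(grad, axis=1, keepdims=True)
    return grad / (norms + eps)


def random_orthogonal(n, rng):
    """Random n x n orthogonal matrix from the QR factorization of a Gaussian."""
    q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return q


def equivariance_gap(update, grad, left, right):
    """Frobenius norm of update(L G R^T) - L update(G) R^T."""
    transformed = update(left @ grad @ right.T)
    expected = left @ update(grad) @ right.T
    return np.linalg.norm(transformed - expected)


rng = np.random.default_rng(0)
grad = rng.standard_normal((6, 4))       # hypothetical gradient of a 6 x 4 block
P = random_orthogonal(6, rng)            # left orthogonal transformation
Q = random_orthogonal(4, rng)            # right orthogonal transformation
perm = np.eye(6)[rng.permutation(6)]     # row permutation matrix

# Polar factor: equivariant under orthogonal transformations on both sides.
polar_gap = equivariance_gap(polar_update, grad, P, Q)
# Entrywise sign: not equivariant under general orthogonal transformations.
sign_gap = equivariance_gap(sign_update, grad, P, Q)
# Row normalization: equivariant under row permutations on the left and
# orthogonal transformations on the right.
row_gap = equivariance_gap(row_norm_update, grad, perm, Q)

print(f"polar gap:    {polar_gap:.2e}")
print(f"sign gap:     {sign_gap:.2e}")
print(f"row-norm gap: {row_gap:.2e}")
```

Under these assumptions, the polar and row-norm gaps should sit at floating-point round-off while the sign gap is of order one, which is the mismatch between coordinate-wise rules and orthogonal symmetries that the abstract describes.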

We validate this principle through pre-training experiments involving both dense and sparse MoE language models, specifically utilizing architectures similar to Qwen3-0.6B, Gemma 3 1B, OLMoE-1B-7B, and downsized gpt-oss models. The results demonstrate that symmetry-compatible update rules consistently lower final validation loss, mitigate load imbalance in sparse MoE configurations, and, in numerous instances, enhance training stability compared to standard AdamW updates.

Source: arXiv