arXiv

Low-Rank Decay for Grokking in Scale-Invariant Transformers: A Spectral-Geometric View

June 4, 2026 · Mingyu Li

Title: A Spectral-Geometric Perspective on Low-Rank Decay and Grokking in Scale-Invariant Transformers

Abstract:

Modern Transformers commonly use normalization layers such as RMSNorm and Query-Key Normalization, which make parts of the architecture nearly scale-invariant with respect to weight magnitude. In this setting, standard weight decay on the Frobenius norm acts only along the radial direction of weight space, so it does not directly simplify the function computed by the normalized layer. This paper studies grokking on small algorithmic tasks from this perspective and introduces \emph{Low-Rank Decay} (LRD), a nuclear-norm-like spectral regularizer. Its subgradient, the polar factor $UV^\top$, retains a tangential component even under scale invariance. This structural difference has significant dynamical consequences: once a model has memorized the training data and the task gradients have diminished, standard L2 decay can no longer alter the weight spectrum, whereas LRD keeps compressing singular values in a manner reminiscent of $\ell_1$ regularization. Experiments on modular arithmetic tasks show that LRD triggers a rapid reduction in the effective rank of the Query and Key matrices, thereby extending the data-fraction threshold required for delayed generalization, or grokking. We also offer a spectral-geometric explanation via the "needle-to-fan" expansion of the nuclear-norm subdifferential in the vicinity of low-rank strata.
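
The radial-versus-tangential distinction described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation. It assumes a single weight matrix $W = U \,\mathrm{diag}(s)\, V^\top$ with $U$ and $V$ held fixed, so both regularizers can be tracked in singular-value coordinates: the Frobenius (L2) decay gradient $W$ corresponds to the vector $s$, and the polar factor $UV^\top$ corresponds to the all-ones vector. The function names, the entropy-based `effective_rank` metric, the clamp at zero, and the numeric values are all illustrative assumptions.

```python
import math

def effective_rank(s):
    """Entropy-based effective rank exp(H(p)) with p_i = s_i / sum(s).
    Assumed metric for illustration; the paper may define it differently."""
    total = sum(s)
    p = [x / total for x in s if x > 0]
    return math.exp(-sum(q * math.log(q) for q in p))

def normalize(s):
    """Singular values of W / ||W||_F, which is all a scale-invariant layer sees."""
    norm = math.sqrt(sum(x * x for x in s))
    return [x / norm for x in s]

def tangential_norm(g, s):
    """Norm of the part of g orthogonal to the radial direction s."""
    coef = sum(a * b for a, b in zip(g, s)) / sum(b * b for b in s)
    return math.sqrt(sum((a - coef * b) ** 2 for a, b in zip(g, s)))

def l2_step(s, lr, lam):
    # W <- W - lr*lam*W rescales every singular value by the same factor.
    return [(1.0 - lr * lam) * x for x in s]

def lrd_step(s, lr, lam):
    # W <- W - lr*lam*U V^T subtracts the same constant from every singular
    # value. Clamping at zero is an assumed proximal-style choice.
    return [max(x - lr * lam, 0.0) for x in s]

# Hypothetical spectrum and hyperparameters, with task gradients set to zero
# to mimic the post-memorization regime.
s0 = [4.0, 2.0, 1.0, 0.5]
lr, lam, steps = 0.1, 1.0, 10

s_l2, s_lrd = list(s0), list(s0)
for _ in range(steps):
    s_l2 = l2_step(s_l2, lr, lam)
    s_lrd = lrd_step(s_lrd, lr, lam)

print("tangential part of L2 gradient :", tangential_norm(s0, s0))
print("tangential part of polar factor:", tangential_norm([1.0] * len(s0), s0))
print("effective rank, initial        :", effective_rank(s0))
print("effective rank, after L2 decay :", effective_rank(s_l2))
print("effective rank, after LRD      :", effective_rank(s_lrd))
```

Under these assumptions, L2 decay leaves the normalized spectrum unchanged, while the constant shrinkage applied by the LRD-style step removes the smallest singular values first and lowers the effective rank. The sketch omits the task loss, optimizer state, and any changes to $U$ and $V$, all of which would matter in actual training.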
