arXiv

Low-Rank Decay for Grokking in Scale-Invariant Transformers: A Spectral-Geometric View

Title: A Spectral-Geometric Perspective on Low-Rank Decay and Grokking in Scale-Invariant Transformers

Abstract:

Contemporary Transformer models often use normalization techniques, including RMSNorm and Query-Key Normalization, that make parts of the architecture nearly scale-invariant with respect to weight magnitude. In such components, conventional weight decay based on the Frobenius norm acts only along the radial direction of weight space, so it does not directly simplify the function computed by the normalized layer. This paper studies grokking on small-scale algorithmic tasks from this perspective and introduces \emph{Low-Rank Decay} (LRD), a nuclear-norm-like spectral regularizer. Its subgradient, the polar factor $UV^\top$, retains a tangential component even under scale invariance. This structural difference has significant dynamical consequences: once a model has memorized the training data and the task gradients diminish, standard L2 decay can no longer alter the weight spectrum, whereas LRD continues to compress singular values in a manner reminiscent of $\ell_1$ regularization. Experiments on modular arithmetic tasks show that LRD triggers a rapid reduction in the effective rank of the Query and Key matrices, thereby extending the data-fraction threshold required for delayed generalization, or grokking. We also offer a spectral-geometric explanation via the "needle-to-fan" expansion of the nuclear-norm subdifferential in the vicinity of low-rank strata.
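
The radial-versus-tangential claim in the abstract can be checked numerically. The following is a minimal NumPy sketch, not the paper's implementation. It assumes a full-rank matrix `W` (so the nuclear-norm gradient is exactly the polar factor $UV^\top$), takes the L2 decay direction to be `W` itself, and measures "tangential" as the part of a direction orthogonal to `W` under the Frobenius inner product. The helper names `polar_factor` and `tangential_part` and the 8x8 random matrix are hypothetical, chosen only for illustration.

```python
import numpy as np

def polar_factor(W):
    """Return U @ V^T from the thin SVD of W.

    For a full-rank W this is the gradient of the nuclear norm ||W||_*.
    """
    U, _, Vt = np.linalg.svd(W, full_matrices=False)
    return U @ Vt

def tangential_part(G, W):
    """Remove from G its component along W (the radial direction),
    using the Frobenius inner product <A, B> = sum(A * B)."""
    return G - (np.sum(G * W) / np.sum(W * W)) * W

# Hypothetical weight matrix standing in for a Query or Key projection.
rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8))

# L2 (Frobenius) decay pulls along W itself, so nothing is left
# once the radial component is removed.
l2_tangential = np.linalg.norm(tangential_part(W, W))

# The polar factor keeps a nonzero tangential part whenever the
# singular values of W are not all equal.
lrd_tangential = np.linalg.norm(tangential_part(polar_factor(W), W))

print("L2 tangential norm: ", l2_tangential)
print("LRD tangential norm:", lrd_tangential)
```

Under these assumptions the squared tangential norm of the polar factor equals $r - \|W\|_*^2 / \|W\|_F^2$ for a rank-$r$ matrix, which vanishes only when all singular values coincide; this is the sense in which a nuclear-norm-like penalty can still reshape the spectrum where a radial penalty cannot.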


Source: arXiv Generated at: 2026-06-04 00:00:00 UTC
