The Geometry of Grokking: Norm Minimization on the Zero-Loss Manifold

June 2, 2026 · Tiberiu Musat

Original: arXiv:2511.01938v3. Abstract: Grokking is a puzzling phenomenon in neural networks where full generalization occurs only after a substantial delay following the complete memorization of the training data. Previous research has linked this delayed generalization to representation learning driven by weight decay, but the precise underlying dynamics remain elusive. In this paper, we argue that post-memorization learning can be understood through the lens of constrained optimization: gradient descent effectively minimizes the weight norm on the zero-loss manifold. We formally prove this in the limit of infinitesimally small learning rates and weight decay coefficients. To further dissect this regime, we introduce an approximation that decouples the learning dynamics of a subset of parameters from the rest of the network. Applying this framework, we derive a closed-form expression for the post-memorization dynamics of the first layer in a two-layer network. Experiments confirm that simulating the training process using our predicted gradients reproduces both the delayed generalization and representation learning characteristic of grokking.

Rewrite: Grokking is a puzzling behavior of neural networks in which full generalization arrives only long after the network has completely memorized its training data. Earlier studies attributed this delayed generalization to representation learning driven by weight decay, but the precise dynamics have remained unclear. This paper views post-memorization training as constrained optimization: gradient descent effectively minimizes the weight norm on the zero-loss manifold. The authors prove this formally in the limit of infinitesimally small learning rates and weight decay coefficients. To analyze this regime in more detail, they introduce an approximation that decouples the learning dynamics of a subset of parameters from the rest of the network. Applying it, they derive a closed-form expression for the post-memorization dynamics of the first layer of a two-layer network. Experiments confirm that simulating training with these predicted gradients reproduces both the delayed generalization and the representation learning characteristic of grokking.
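
Illustration: the idea of norm minimization on the zero-loss manifold can be seen in a much simpler setting than the one the paper studies. The sketch below is not the paper's two-layer network or its closed-form expression; it is a toy underdetermined least-squares problem, chosen here as an assumption because its zero-loss set is an affine subspace and its minimum-norm point is known exactly from the pseudoinverse. The function name `simulate`, the hyperparameters `lr` and `wd`, and the orthonormal-row design of `X` are all illustrative choices. Because `wd` is small but finite, the iterate ends close to, not exactly at, the minimum-norm interpolant.

```python
import numpy as np


def simulate(steps=10_000, lr=0.1, wd=0.01, n=5, d=20, seed=0):
    """Gradient descent with weight decay on an underdetermined linear toy.

    Returns a list of (step, train_loss, weight_norm, distance_to_min_norm).
    """
    rng = np.random.default_rng(seed)

    # n < d, so infinitely many w satisfy X @ w = y (the zero-loss set).
    # Orthonormal rows (X @ X.T = I) keep the two time scales easy to read.
    q, _ = np.linalg.qr(rng.normal(size=(d, n)))
    X = q.T
    y = rng.normal(size=n)
    y /= np.linalg.norm(y)

    # Minimum-norm interpolant, used only as a reference point.
    w_min = np.linalg.pinv(X) @ y

    # Large initialization: most of the weight lies in directions
    # that the training loss cannot see.
    w = 3.0 * rng.normal(size=d)

    log = []
    for step in range(steps + 1):
        residual = X @ w - y
        loss = 0.5 * float(residual @ residual)
        log.append((step, loss, float(np.linalg.norm(w)),
                    float(np.linalg.norm(w - w_min))))
        # Loss gradient plus the weight decay term.
        w -= lr * (X.T @ residual + wd * w)
    return log


if __name__ == "__main__":
    for step, loss, norm, dist in simulate()[::1000]:
        print(f"step {step:6d}  loss {loss:.2e}  |w| {norm:.3f}  "
              f"dist to min-norm {dist:.3f}")
```

With these settings the training loss drops to near zero within a few hundred steps while the weight norm is still far above its minimum; the norm then shrinks on the slower time scale set by `1 / (lr * wd)`, moving the iterate along the zero-loss set toward the minimum-norm solution.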

Source: arXiv Generated at: 2026-06-02 00:00:00 UTC