arXiv

A Pre-Training Analogue of Grokking in Language Models: Tracing Delayed Grammatical Generalization

June 2, 2026 · Sherin Muckatira, Namrata Shivagunde, Vijeta Deshpande, Anna Rumshisky

Title: Investigating Delayed Grammatical Generalization: A Pre-Training Equivalent to Grokking in Language Models

Abstract:

Grokking is the phenomenon in which neural networks begin to generalize long after they have memorized their training data. It has been studied extensively in supervised settings trained over many epochs, but Large Language Model (LLM) pre-training works differently: it uses next-token prediction over unlabeled corpora, with little data repetition and no explicit train/validation split.

To bridge this methodological gap, we introduce an exposure-based framework for studying grokking-like dynamics in LLM pre-training. Our evaluation is anchored in BLiMP minimal pairs, which offer controlled contrasts in grammatical structure. For each BLiMP minimal pair, we isolate a "critical phrase", defined as the shortest contiguous segment that contains both the grammatical distinction and the context relevant to the phenomenon.
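As a rough illustration of how a minimal pair is typically scored, the sketch below counts a pair as correct when the model assigns a higher log-probability to the acceptable sentence than to the unacceptable one. The function `minimal_pair_accuracy` and the stand-in scorer `toy_logprob` are hypothetical names used only for illustration; a real evaluation would sum token log-probabilities from a pre-training checkpoint.

```python
from typing import Callable, Iterable, Tuple


def minimal_pair_accuracy(
    pairs: Iterable[Tuple[str, str]],
    sentence_logprob: Callable[[str], float],
) -> float:
    """Fraction of (acceptable, unacceptable) pairs where the acceptable
    sentence receives the strictly higher score."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("at least one minimal pair is required")
    correct = sum(
        sentence_logprob(good) > sentence_logprob(bad) for good, bad in pairs
    )
    return correct / len(pairs)


def toy_logprob(sentence: str) -> float:
    # Hypothetical stand-in for a language model: it only penalizes one
    # specific agreement error and scores everything else identically.
    return -5.0 if "cats sleeps" in sentence else -1.0


example_pairs = [
    ("The cats sleep.", "The cats sleeps."),  # toy scorer prefers the good one
    ("The dog barks.", "The dog bark."),      # toy scorer ties, so not correct
]
example_accuracy = minimal_pair_accuracy(example_pairs, toy_logprob)
```

Ties count as incorrect here, which is a design choice of this sketch rather than something stated in the abstract.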

We categorize examples based on the presence of this critical phrase within the pre-training window: instances containing the phrase are allocated to a proxy-training split, while those lacking it form the proxy-validation split. Through experiments spanning five distinct grammatical phenomena, we document evidence of delayed generalization.
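The exposure-based split described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `MinimalPair` fields, the `split_by_exposure` helper, and the whitespace-normalized, word-boundary matching rule are all hypothetical, and the abstract does not specify how phrase occurrences are actually detected.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class MinimalPair:
    good: str             # acceptable sentence
    bad: str              # unacceptable sentence
    critical_phrase: str  # hypothetical field holding the critical phrase


def _normalize(text: str) -> str:
    # Lowercase and collapse whitespace, then pad with spaces so that
    # substring checks only match at whitespace-delimited word boundaries.
    return " " + " ".join(text.lower().split()) + " "


def split_by_exposure(
    pairs: Iterable[MinimalPair],
    seen_documents: Iterable[str],
) -> Tuple[List[MinimalPair], List[MinimalPair]]:
    """Return (proxy_train, proxy_validation).

    A pair goes to proxy_train if its critical phrase occurs in any document
    the model has already seen; otherwise it goes to proxy_validation.
    """
    seen = [_normalize(doc) for doc in seen_documents]
    proxy_train: List[MinimalPair] = []
    proxy_validation: List[MinimalPair] = []
    for pair in pairs:
        phrase = _normalize(pair.critical_phrase)
        if any(phrase in doc for doc in seen):
            proxy_train.append(pair)
        else:
            proxy_validation.append(pair)
    return proxy_train, proxy_validation


pairs = [
    MinimalPair("The cats sleep.", "The cats sleeps.", "cats sleep"),
    MinimalPair("The author writes.", "The author write.", "author writes"),
]
corpus = ["Most cats sleep during the day."]
proxy_train, proxy_validation = split_by_exposure(pairs, corpus)
```

In this toy corpus only the first pair's critical phrase has been seen, so it lands in the proxy-training split and the second pair lands in the proxy-validation split. A scan over a real pre-training corpus would need an index or tokenizer-level matching rather than a linear substring search.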

Comparing pre-training checkpoints taken before and after the onset of generalization reveals structural changes. After generalization, grammatical concept vectors are more predictive of grammatical acceptability and occupy a higher-dimensional subspace. In addition, attention from the critical token to its relevant context token is concentrated in a small number of attention heads.
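One way to quantify how concentrated this attention is across heads is to count how many heads are needed to cover a fixed share of the total critical-to-context attention weight. The sketch below is an assumption-laden illustration, not the paper's metric: the function name `heads_covering_mass`, the 80% threshold, and the `(layers, heads, query, key)` tensor layout are all hypothetical choices.

```python
import numpy as np


def heads_covering_mass(
    attn: np.ndarray,
    critical_idx: int,
    context_idx: int,
    mass: float = 0.8,
) -> int:
    """Smallest number of heads whose critical-to-context attention weights
    sum to at least `mass` of the total across all layers and heads.

    `attn` is assumed to have shape (layers, heads, query_pos, key_pos).
    """
    # One weight per (layer, head): attention from the critical token
    # (query position) to the context token (key position).
    weights = attn[:, :, critical_idx, context_idx].ravel()
    total = weights.sum()
    if total <= 0:
        return 0
    # Sort descending and find the first prefix reaching the target share.
    cumulative = np.cumsum(np.sort(weights)[::-1]) / total
    return int(np.searchsorted(cumulative, mass) + 1)


# Toy example: 2 layers x 2 heads, sequence length 3.
concentrated = np.zeros((2, 2, 3, 3))
concentrated[:, :, 2, 0] = 0.01
concentrated[0, 1, 2, 0] = 0.9   # a single head carries most of the weight

uniform = np.zeros((2, 2, 3, 3))
uniform[:, :, 2, 0] = 0.25       # every head contributes equally

n_concentrated = heads_covering_mass(concentrated, critical_idx=2, context_idx=0)
n_uniform = heads_covering_mass(uniform, critical_idx=2, context_idx=0)
```

A small count relative to the total number of heads would indicate the kind of concentration the abstract describes; comparing the count across checkpoints is one possible use, though the abstract does not say how the authors measured it.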
