arXiv

A Pre-Training Analogue of Grokking in Language Models: Tracing Delayed Grammatical Generalization

Title: Investigating Delayed Grammatical Generalization: A Pre-Training Equivalent to Grokking in Language Models

Abstract:

Grokking is the phenomenon in which a neural network begins to generalize only long after it has memorized its training data. It has been studied extensively in supervised settings trained for many epochs, but Large Language Model (LLM) pre-training is a different regime: next-token prediction over unlabeled corpora, with little data repetition and no distinct train/validation split.

To bridge this gap, we introduce an exposure-based framework for studying grokking-like dynamics during LLM pre-training. Our evaluation is anchored in BLiMP minimal pairs, which provide controlled grammatical contrasts. For each minimal pair, we isolate a "critical phrase", defined as the shortest contiguous segment that contains both the grammatical distinction and the context relevant to the phenomenon.
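
One way to picture the critical-phrase definition is the sketch below. It is an illustration under stated assumptions, not the paper's extraction procedure: sentences are whitespace-tokenized, the grammatical distinction is taken to be the region where the two sentences of a pair differ, and the index of the relevant context token (`context_index`) is assumed to be supplied by hand. The function names and the example pair are hypothetical.

```python
def differing_span(good_tokens, bad_tokens):
    """Return (start, end) indices into good_tokens, end exclusive, of the
    region where the two sentences differ, found by trimming the common
    prefix and the common suffix."""
    n = min(len(good_tokens), len(bad_tokens))
    start = 0
    while start < n and good_tokens[start] == bad_tokens[start]:
        start += 1
    end_good, end_bad = len(good_tokens), len(bad_tokens)
    while (end_good > start and end_bad > start
           and good_tokens[end_good - 1] == bad_tokens[end_bad - 1]):
        end_good -= 1
        end_bad -= 1
    return start, end_good


def critical_phrase(good_tokens, bad_tokens, context_index):
    """Shortest contiguous slice of good_tokens that covers both the
    differing region and the (assumed, hand-supplied) context token."""
    start, end = differing_span(good_tokens, bad_tokens)
    lo = min(start, context_index)
    hi = max(end, context_index + 1)
    return good_tokens[lo:hi]


# Illustrative subject-verb agreement pair (not taken from BLiMP).
good = "the cats near the door sleep".split()
bad = "the cats near the door sleeps".split()
phrase = critical_phrase(good, bad, context_index=1)  # context token: "cats"
print(" ".join(phrase))
```

In this example the verb is the differing token and the subject noun is the context token, so the slice spans from the subject to the verb.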

We split examples by whether their critical phrase appears within the pre-training window: examples whose phrase appears are assigned to a proxy-training split, and those whose phrase does not appear form a proxy-validation split. Across experiments on five grammatical phenomena, we find evidence of delayed generalization.
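
The exposure-based split can be sketched as follows. This is a minimal illustration, assuming the pre-training window is available as plain text and that "presence" means an exact, case-insensitive match on whitespace tokens; the abstract does not specify the matching rule, and the names `tokenize`, `contains_phrase`, and `proxy_split` are hypothetical.

```python
def tokenize(text):
    """Lowercase whitespace tokenization (an assumption made for this sketch)."""
    return text.lower().split()


def contains_phrase(window_tokens, phrase_tokens):
    """True if phrase_tokens occurs as a contiguous run inside window_tokens."""
    n = len(phrase_tokens)
    if n == 0:
        return False
    return any(window_tokens[i:i + n] == phrase_tokens
               for i in range(len(window_tokens) - n + 1))


def proxy_split(examples, window_text):
    """Assign each example to proxy-train if its critical phrase was seen in
    the pre-training window, otherwise to proxy-validation."""
    window_tokens = tokenize(window_text)
    proxy_train, proxy_val = [], []
    for ex in examples:
        seen = contains_phrase(window_tokens, tokenize(ex["critical_phrase"]))
        (proxy_train if seen else proxy_val).append(ex)
    return proxy_train, proxy_val


examples = [
    {"id": 1, "critical_phrase": "cats sleep"},
    {"id": 2, "critical_phrase": "dogs bark"},
]
train, val = proxy_split(examples, "The cats sleep all day")
print([ex["id"] for ex in train], [ex["id"] for ex in val])
```

Matching on tokens rather than raw substrings avoids spurious hits across word boundaries. A linear scan like this would not scale to a real corpus, where an index would be needed.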

Comparing pre-training checkpoints from before and after the onset of generalization reveals structural changes. After generalization, grammatical concept vectors are more predictive of grammatical acceptability and lie in a higher-dimensional subspace. In addition, attention from the critical token to its relevant context token is concentrated in a small number of attention heads.
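
A simple way to quantify how concentrated attention is across heads is the share of total attention mass held by the top-k heads. The sketch below assumes that the attention weight from the critical token to its context token has already been extracted and averaged per head; the metric, the `top_k_share` name, and the numbers are illustrative and are not taken from the paper.

```python
def top_k_share(head_weights, k):
    """Fraction of the total critical-to-context attention mass that falls in
    the k heads with the largest weight.

    head_weights maps (layer, head) -> mean attention weight (assumed input).
    """
    total = sum(head_weights.values())
    if total == 0:
        return 0.0
    top = sorted(head_weights.values(), reverse=True)[:k]
    return sum(top) / total


# Made-up weights for a toy model with 2 layers and 2 heads per layer.
weights = {(0, 0): 0.5, (0, 1): 0.25, (1, 0): 0.125, (1, 1): 0.125}
print(top_k_share(weights, k=1))
print(top_k_share(weights, k=2))
```

A share close to 1 for small k indicates that a few heads carry most of the attention from the critical token to its context token.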


Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC
