arXiv

Why Larger Models Learn More: Effects of Capacity, Interference, and Rare-Task Retention

Title: The Mechanics of Scale: How Capacity, Interference, and Rare-Task Retention Drive Learning in Large Models

Abstract:

Why do larger neural networks master tasks that elude their smaller counterparts? We investigate the mechanisms behind this phenomenon, starting from a phenomenological argument grounded in power-law scaling. The argument suggests that a larger model can, in principle, capture portions of the data distribution that remain inaccessible to smaller models, even when the smaller models are given an infinite amount of training data.
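The abstract does not spell out the power-law argument, so the following is only a sketch under an assumed form. It uses a commonly cited parametric loss in which N is the parameter count, D is the number of training tokens, and E, A, B, \alpha, \beta are positive fitted constants; none of these symbols come from the source. Under that assumption, the data-dependent term vanishes as D grows while the capacity-dependent term does not, which is one way to read the claim that unlimited data cannot close the gap between a smaller and a larger model.

```latex
% Assumed parametric form (not taken from the source):
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}

% Infinite-data limit: only the capacity term survives.
\lim_{D \to \infty} L(N, D) = E + \frac{A}{N^{\alpha}}

% For two model sizes N_1 < N_2, the residual gap is strictly positive:
\lim_{D \to \infty} \bigl[ L(N_1, D) - L(N_2, D) \bigr]
  = A \left( N_1^{-\alpha} - N_2^{-\alpha} \right) > 0
```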

To test this hypothesis empirically and pinpoint the specific causes, we ran experiments in a synthetic environment composed of a mixture of tasks with monotonic scaling behavior. Our findings show that the primary driver is a competition for resources (specifically, neurons) induced by the data itself. Smaller models tend to prioritize high-frequency or low-complexity tasks, converging to solutions that perform poorly on rare or complex tasks. Crucially, this limitation persists even when the model has the architectural capacity to represent the task in question.

We further examined how larger models overcome this data-centric bottleneck and identified reduced interference as the key mechanism. Larger models can dedicate sufficient resources to common tasks, which weakens the gradient updates those tasks produce. As a result, the model avoids overwriting features that are critical to rare tasks and can accumulate those features gradually.
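The abstract does not say how gradient interference is measured. One common proxy, assumed here purely for illustration, is the cosine similarity between the gradients that two tasks induce on shared parameters: values near -1 mean the tasks pull the weights in opposing directions. The sketch below uses a toy linear model with squared-error loss; the task sizes, the dimension, and the helper names `mse_gradient` and `gradient_cosine` are all hypothetical and do not come from the paper.

```python
import numpy as np


def mse_gradient(weights, inputs, targets):
    """Gradient of mean squared error for a linear model inputs @ weights."""
    residual = inputs @ weights - targets
    return 2.0 * inputs.T @ residual / len(targets)


def gradient_cosine(grad_a, grad_b):
    """Cosine similarity of two gradient vectors; negative means conflict."""
    denom = np.linalg.norm(grad_a) * np.linalg.norm(grad_b)
    if denom == 0.0:
        # At least one gradient is zero, so there is nothing to interfere with.
        return 0.0
    return float(grad_a @ grad_b / denom)


rng = np.random.default_rng(0)
dim = 8  # hypothetical size of the shared parameter vector
weights = np.zeros(dim)

# Hypothetical tasks: a common one with many examples and a rare one with few,
# each defined by its own randomly drawn target weight vector.
common_x = rng.normal(size=(512, dim))
rare_x = rng.normal(size=(16, dim))
common_y = common_x @ rng.normal(size=dim)
rare_y = rare_x @ rng.normal(size=dim)

g_common = mse_gradient(weights, common_x, common_y)
g_rare = mse_gradient(weights, rare_x, rare_y)
interference = gradient_cosine(g_common, g_rare)
print(f"cosine similarity between task gradients: {interference:.3f}")
```

In a real model the same quantity would be computed over the shared network parameters, typically per layer and averaged over training steps; the toy setup only shows the shape of the calculation.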

To corroborate these insights, we pretrained OLMo models ranging from 4 million to 4 billion parameters on novel tasks of varying frequency and complexity. The results of these real-world experiments mirrored those of our synthetic tests: only the larger OLMo models learned the infrequent and complex tasks. These larger models embedded task features more richly in their representations and exhibited significantly less gradient interference between competing tasks.

Ultimately, this work provides a data-centric explanation for the superior learning capabilities of larger models. These insights not only clarify why larger models perform better in practice but also offer valuable guidance for optimizing model sizing and determining effective training data compositions.


Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC
