Why Larger Models Learn More: Effects of Capacity, Interference, and Rare-Task Retention
Title: The Mechanics of Scale: How Capacity, Interference, and Rare-Task Retention Drive Learning in Large Models
Abstract:
Why do larger neural networks master tasks that elude smaller ones? We investigate the mechanisms behind this phenomenon with a phenomenological argument grounded in power-law scaling. The argument suggests that a larger model can, in principle, capture portions of the data distribution that remain inaccessible to smaller models, even when the smaller models are given an infinite amount of training data.
To test this hypothesis empirically and pinpoint the causes, we ran experiments in a synthetic environment composed of a mixture of tasks that exhibit monotonic scaling behavior. Our findings reveal that the primary driver is a competition for resources, specifically neurons, induced by the data itself. Smaller models tend to prioritize high-frequency or low-complexity tasks, which leads to solutions that perform poorly on rare or complex tasks. Crucially, this limitation persists even when the architecture has enough capacity to represent the rare or complex task.
We further examined how larger models overcome this data-centric bottleneck and identified reduced interference as the key mechanism. Because larger models can dedicate sufficient resources to common tasks, they fit those tasks well, which weakens the gradient updates associated with them. The weaker updates no longer overwrite features critical to rare tasks, so those features can accumulate gradually.
To corroborate these insights, we pretrained OLMo models ranging from 4 million to 4 billion parameters on novel tasks with varying degrees of frequency and complexity. The results of these real-world experiments mirrored those of our synthetic tests: only the larger OLMo models learned the infrequent and complex tasks. These larger models embedded task features more richly in their representations and exhibited significantly less gradient interference between competing tasks.
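The abstract does not say how gradient interference is measured. One common proxy, assumed here purely for illustration, is the cosine similarity between the gradients that two tasks induce on shared parameters: values near -1 mean the tasks pull the parameters in opposing directions, and values near 0 mean they barely interact. The sketch below computes this for a hypothetical toy linear model with a frequent and a rare regression task; the function names, task sizes, and model are illustrative assumptions, not the authors' setup.

```python
import math
import random


def mse_gradient(w, xs, ys):
    """Gradient of 0.5 * mean((x . w - y)^2) with respect to w."""
    grad = [0.0] * len(w)
    for x, y in zip(xs, ys):
        residual = sum(wi * xi for wi, xi in zip(w, x)) - y
        for j, xj in enumerate(x):
            grad[j] += residual * xj / len(xs)
    return grad


def gradient_cosine(g_a, g_b, eps=1e-12):
    """Cosine similarity of two gradients; negative values indicate conflict."""
    dot = sum(a * b for a, b in zip(g_a, g_b))
    norm_a = math.sqrt(sum(a * a for a in g_a))
    norm_b = math.sqrt(sum(b * b for b in g_b))
    return dot / (norm_a * norm_b + eps)


def make_task(rng, n_examples, true_w):
    """Hypothetical noiseless linear regression task with target weights true_w."""
    xs = [[rng.gauss(0.0, 1.0) for _ in true_w] for _ in range(n_examples)]
    ys = [sum(wi * xi for wi, xi in zip(true_w, x)) for x in xs]
    return xs, ys


rng = random.Random(0)
dim = 8
w = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # shared parameters (toy model)

# Assumption: the frequent task has many examples, the rare task has few.
xs_common, ys_common = make_task(rng, 256, [rng.gauss(0.0, 1.0) for _ in range(dim)])
xs_rare, ys_rare = make_task(rng, 16, [rng.gauss(0.0, 1.0) for _ in range(dim)])

g_common = mse_gradient(w, xs_common, ys_common)
g_rare = mse_gradient(w, xs_rare, ys_rare)
interference = gradient_cosine(g_common, g_rare)
print(f"cosine(g_common, g_rare) = {interference:.3f}")
```

Tracking this quantity over training steps and across model sizes is one plausible way to compare interference between models, though the paper may use a different measure.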
Ultimately, this work provides a data-centric explanation for the superior learning capabilities of larger models. These insights not only clarify why larger models perform better in practice but also offer valuable guidance for optimizing model sizing and determining effective training data compositions.
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC





