Representational Capacity: Geometric Limits on Feature Representation in Transformer Language Models
Title: Geometric Constraints on Feature Representation: Investigating Representational Capacity in Transformer Language Models
Abstract:
While the model dimension ($d_{model}$) serves as a core hyperparameter in transformer language models, its specific function in defining the geometric boundaries of feature representation has received limited attention. Building upon the Linear Representation and Superposition Hypotheses, which suggest that models store features as nearly orthogonal vectors within latent space, we introduce a framework to quantify the number of such directions a model can sustain. We posit that the embedding matrix acts as a measurable indicator of the near-orthogonality constraints that hold throughout the latent space. Specifically, the threshold separating significant token relationships from random similarity within the pairwise cosine similarity distribution provides a tangible estimate of the model’s tolerated deviation ($\varepsilon$) from perfect orthogonality.
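The measurement described above can be illustrated with a minimal sketch. It assumes NumPy is available, uses a random Gaussian matrix as a stand-in for a real embedding matrix, and picks $\varepsilon$ as a high quantile of the absolute pairwise cosine similarities. The quantile rule and the names `pairwise_cosine_similarities`, `estimate_epsilon`, and `quantile` are hypothetical choices made for illustration; the abstract does not specify how the threshold is located.

```python
import numpy as np


def pairwise_cosine_similarities(embeddings: np.ndarray) -> np.ndarray:
    """Return the cosine similarity of every distinct pair of rows."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)  # guard against zero rows
    sims = unit @ unit.T
    rows, cols = np.triu_indices(len(unit), k=1)  # upper triangle, no diagonal
    return sims[rows, cols]


def estimate_epsilon(embeddings: np.ndarray, quantile: float = 0.999) -> float:
    """Hypothetical threshold rule: a high quantile of |cosine similarity|.

    This is an assumption for illustration, not the procedure used in the paper.
    """
    sims = pairwise_cosine_similarities(embeddings)
    return float(np.quantile(np.abs(sims), quantile))


# Stand-in for a real embedding matrix: random Gaussian rows (vocab x d_model).
rng = np.random.default_rng(0)
toy_embeddings = rng.standard_normal((500, 64))
epsilon = estimate_epsilon(toy_embeddings)
print(f"estimated epsilon: {epsilon:.3f}")
```

For a real model, `toy_embeddings` would be replaced by the token embedding matrix, and for large vocabularies the pairs would likely need to be subsampled, since the full similarity matrix grows quadratically with vocabulary size.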
By applying this metric to numerous open-source models, we identify two distinct categories: those exhibiting high $\varepsilon$ values, characterized by a lack of near-orthogonal structure in their embeddings, and those with low $\varepsilon$ values that preserve this structure. Furthermore, we demonstrate that the conventional Johnson-Lindenstrauss lemma significantly underestimates the packing efficiency of trained representations. To address this, we derive a revised capacity formula in which the number of near-orthogonal directions is determined by the ratio of vectors to dimensions ($k/d$) rather than by the absolute count. This single adjustment reduces prediction error by two orders of magnitude without introducing additional parameters.
Synthesizing these findings, we define representational capacity as the upper limit on the number of distinguishable directions available for features and embeddings within a model’s latent space. We find that capacity is exponentially sensitive to $\varepsilon$, and that larger models tend to prioritize stricter orthogonality constraints over the maximization of raw capacity. This behavior is consistent with several possible explanations, including a trade-off between stability and capacity, a ceiling on the number of usable concepts, or confounding factors related to model scale; we leave distinguishing among them to future work.
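To give a feel for the exponential sensitivity to $\varepsilon$, the sketch below evaluates the conventional Johnson-Lindenstrauss-style estimate $k \approx \exp(\varepsilon^2 d / C)$. This is the classical baseline, not the revised formula derived in the paper, which the abstract does not state. The constant `c`, the function name `jl_log10_capacity`, and the example dimension are assumptions for illustration; published statements of the lemma use different constants.

```python
import math


def jl_log10_capacity(d: int, epsilon: float, c: float = 8.0) -> float:
    """Return log10 of the JL-style count k = exp(epsilon**2 * d / c).

    The constant `c` is an assumed value; only the functional form
    (exponential in epsilon**2 * d) is the point of this sketch.
    """
    return (epsilon ** 2) * d / (c * math.log(10))


d_model = 4096  # hypothetical model dimension
for epsilon in (0.05, 0.10, 0.20):
    log10_k = jl_log10_capacity(d_model, epsilon)
    print(f"epsilon={epsilon:.2f}  log10(k) ~ {log10_k:.2f}")
```

Because $\varepsilon$ enters squared inside the exponent, doubling $\varepsilon$ quadruples $\log k$ under this estimate, so small changes in the tolerated deviation shift the predicted capacity by orders of magnitude.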
Source: arXiv
Generated at: 2026-06-03 00:00:00 UTC



