arXiv

Representational Capacity: Geometric Limits on Feature Representation in Transformer Language Models

Title: Geometric Constraints on Feature Representation: Investigating Representational Capacity in Transformer Language Models

Abstract:

While the model dimension ($d_{model}$) is a core hyperparameter of transformer language models, its role in setting the geometric limits of feature representation has received limited attention. Building on the Linear Representation and Superposition Hypotheses, which suggest that models store features as nearly orthogonal vectors in latent space, we introduce a framework to quantify the number of such directions a model can sustain. We posit that the embedding matrix acts as a measurable indicator of near-orthogonality constraints throughout the latent space. Specifically, the threshold in the pairwise cosine similarity distribution that separates significant token relationships from random similarity provides a concrete estimate of the model’s tolerated deviation ($\varepsilon$) from perfect orthogonality.
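As a rough illustration of the measurement described here, the sketch below computes the pairwise cosine similarities of an embedding matrix and reads off a candidate $\varepsilon$. The abstract does not state how the threshold between significant and random similarity is chosen, so the quantile rule in `estimate_epsilon` is an assumption made only for this example, as are the function names and the random stand-in matrix `E`.

```python
import numpy as np


def pairwise_cosine(E: np.ndarray) -> np.ndarray:
    """Return the cosine similarity of every distinct pair of rows of E."""
    norms = np.linalg.norm(E, axis=1, keepdims=True)
    unit = E / np.clip(norms, 1e-12, None)  # guard against zero-norm rows
    sims = unit @ unit.T
    upper = np.triu_indices(E.shape[0], k=1)  # each pair once, no diagonal
    return sims[upper]


def estimate_epsilon(E: np.ndarray, quantile: float = 0.999) -> float:
    """Hypothetical threshold rule (an assumption, not the paper's method):
    take a high quantile of |cos| as the tolerated deviation from orthogonality."""
    sims = pairwise_cosine(E)
    return float(np.quantile(np.abs(sims), quantile))


# Stand-in for a real embedding matrix: 500 random "tokens" in 256 dimensions.
rng = np.random.default_rng(0)
E = rng.standard_normal((500, 256))
eps = estimate_epsilon(E)
print(f"estimated epsilon: {eps:.3f}")
```

For a trained model, `E` would be replaced by the model's token embedding matrix, and the quantile would be replaced by whatever threshold criterion the full paper specifies.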

Applying this metric to numerous open-source models, we identify two distinct categories: models with high $\varepsilon$ values, whose embeddings lack near-orthogonal structure, and models with low $\varepsilon$ values, whose embeddings preserve it. Furthermore, we demonstrate that the conventional Johnson-Lindenstrauss lemma significantly underestimates the packing efficiency of trained representations. To address this, we derive a revised capacity formula in which the number of near-orthogonal directions is determined by the ratio of vectors to dimensions ($k/d$) rather than by the absolute count. This single adjustment reduces prediction error by two orders of magnitude without introducing additional parameters.

Synthesizing these findings, we define representational capacity as the upper limit on the number of distinguishable directions available for features and embeddings within a model’s latent space. We find that capacity is exponentially sensitive to $\varepsilon$, and that larger models tend to prioritize stricter orthogonality constraints over the maximization of raw capacity. This behavior is consistent with several possible explanations, including a trade-off between stability and capacity, a ceiling on the number of usable concepts, or confounding factors related to model scale; we leave these to future investigation.
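To make the exponential sensitivity to $\varepsilon$ concrete, the sketch below evaluates a conventional Johnson-Lindenstrauss-style estimate, $k \approx \exp(\varepsilon^2 d / C)$. This is the baseline form only: the constant `C = 8.0` and the dimension `d = 4096` are assumptions chosen for illustration, and the revised $k/d$ formula is not given in the abstract, so it is not reproduced here.

```python
import math


def jl_log_capacity(d: int, eps: float, C: float = 8.0) -> float:
    """Natural log of a JL-style capacity estimate, ln k = eps^2 * d / C.

    C is an assumed constant; published statements of the lemma differ in its value.
    Working in log space avoids overflow for large d.
    """
    if not 0.0 < eps < 1.0:
        raise ValueError("eps must lie strictly between 0 and 1")
    return eps ** 2 * d / C


d = 4096  # hypothetical model dimension
for eps in (0.05, 0.1, 0.2):
    log10_k = jl_log_capacity(d, eps) / math.log(10)
    print(f"eps={eps:.2f}  log10(k) ~ {log10_k:.2f}")
```

Because $\varepsilon$ enters the exponent squared, doubling $\varepsilon$ multiplies $\ln k$ by four under this estimate, which is one way to see why small changes in tolerated deviation translate into large changes in capacity.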


Source: arXiv Generated at: 2026-06-03 00:00:00 UTC
