arXiv

L$^3$: Large Lookup Layers

Abstract:

Current sparse language models generally rely on Mixture-of-Experts (MoE) layers to achieve sparsity, dynamically directing tokens to dense MLP "experts." This approach to hard routing, however, presents several challenges, including suboptimal hardware efficiency and the requirement for auxiliary losses to ensure training stability. Conversely, the tokenizer’s embedding table, which is inherently sparse, sidesteps many of these problems by assigning a single embedding to each token, though this method sacrifices contextual information.

To address this, we propose the Large Lookup Layer (L$^3$), a technique that generalizes the embedding table from the tokenizer to the model's decoder layers, thereby unlocking a new axis of sparsity. L$^3$ uses static, token-based routing to select a set of learned embeddings for each token and aggregates them in a context-dependent way. This lets the model trade memory for compute efficiently by storing information in embeddings.

The L$^3$ framework consists of two primary elements: (1) an architecture designed for system efficiency, enabling fast training and CPU-offloaded inference without any performance penalty; and (2) an information-theoretic algorithm for allocating embeddings that balances speed against model quality. Empirically, we train Transformer models with up to 2.6B active parameters using L$^3$. Our results show that L$^3$ significantly outperforms both dense models and iso-sparse MoEs on language modeling and downstream tasks.
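
To make the idea of static, token-based routing with context-dependent aggregation concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not specify the aggregation rule, so the softmax-weighted sum over per-token key/value embeddings, the class name `LookupLayerSketch`, and the `slots_per_token` parameter are all hypothetical. The per-token slot counts are taken as given here; in L$^3$ they would come from the allocation algorithm, which this sketch does not attempt to reproduce.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class LookupLayerSketch:
    """Hypothetical lookup layer: each token id owns a fixed slice of embeddings.

    Assumption (not from the abstract): each slot holds a key and a value,
    and slots are mixed with softmax(hidden . key) weights.
    """

    def __init__(self, slots_per_token, dim, seed=0):
        # slots_per_token[t] = number of embeddings owned by token t (>= 1).
        rng = random.Random(seed)
        self.dim = dim
        self.offsets = [0]
        for n in slots_per_token:
            self.offsets.append(self.offsets[-1] + n)
        total = self.offsets[-1]
        self.keys = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(total)]
        self.values = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(total)]

    def forward(self, token_id, hidden):
        # Static routing: the slice depends only on the token id, not on the context.
        start, end = self.offsets[token_id], self.offsets[token_id + 1]
        # Context-dependent aggregation: weights depend on the hidden state.
        scores = [
            sum(h * k for h, k in zip(hidden, self.keys[i]))
            for i in range(start, end)
        ]
        weights = softmax(scores)
        out = [0.0] * self.dim
        for w, i in zip(weights, range(start, end)):
            for d in range(self.dim):
                out[d] += w * self.values[i][d]
        return out, weights


# Usage: three token ids owning 1, 3, and 2 embeddings respectively.
layer = LookupLayerSketch(slots_per_token=[1, 3, 2], dim=4)
out, weights = layer.forward(token_id=1, hidden=[0.5, -1.0, 0.25, 2.0])
```

Because the slice of embeddings is determined by the token id alone, the rows needed for a sequence are known before the forward pass begins, which is the property that makes prefetching from slower memory plausible. A token with a single slot degenerates to an ordinary embedding lookup, since its only weight is 1.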


Source: arXiv Generated at: 2026-06-04 00:00:00 UTC
