L$^3$: Large Lookup Layers
Abstract:
Current sparse language models generally rely on Mixture-of-Experts (MoE) layers to achieve sparsity, dynamically directing tokens to dense MLP "experts." This approach to hard routing, however, presents several challenges, including suboptimal hardware efficiency and the requirement for auxiliary losses to ensure training stability. Conversely, the tokenizer’s embedding table, which is inherently sparse, sidesteps many of these problems by assigning a single embedding to each token, though this method sacrifices contextual information.
To address this, we propose the Large Lookup Layer (L$^3$), which generalizes embedding tables to the model's decoder layers and thereby enables greater sparsity. L$^3$ uses static, token-based routing to select a set of learned embeddings for each token and aggregates them in a context-dependent way. By storing information in embeddings, the model can efficiently balance memory usage against computational load.
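As a rough illustration of the idea above, the following minimal sketch shows one way a lookup layer could combine static token-based routing with context-dependent aggregation. It is not the authors' implementation: the abstract does not specify the aggregation mechanism, so the key/value split, the scaled dot-product scoring, the softmax weighting, and every name here (`LookupLayerSketch`, `slots_per_token`, `forward`) are assumptions made only for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class LookupLayerSketch:
    """Hypothetical lookup layer; an illustration, not the paper's architecture."""

    def __init__(self, slots_per_token, dim, seed=0):
        rng = random.Random(seed)
        self.dim = dim
        # Static routing: token id t owns the fixed slot range
        # [offsets[t], offsets[t + 1]), decided before any input is seen.
        self.offsets = [0]
        for n in slots_per_token:
            self.offsets.append(self.offsets[-1] + n)
        total = self.offsets[-1]
        # ASSUMPTION: each slot holds a key (for scoring) and a value (the
        # embedding that is aggregated). In practice these would be learned.
        self.keys = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(total)]
        self.values = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(total)]

    def forward(self, token_id, hidden):
        """Aggregate the token's embeddings, weighted by the hidden state."""
        lo, hi = self.offsets[token_id], self.offsets[token_id + 1]
        scale = 1.0 / math.sqrt(self.dim)
        # Context dependence: the hidden state scores only this token's slots.
        scores = [
            scale * sum(h * k for h, k in zip(hidden, self.keys[i]))
            for i in range(lo, hi)
        ]
        weights = softmax(scores)
        out = [0.0] * self.dim
        for w, i in zip(weights, range(lo, hi)):
            for d in range(self.dim):
                out[d] += w * self.values[i][d]
        return out, weights


# Three-token toy vocabulary; token 1 is given more embeddings than the others.
layer = LookupLayerSketch(slots_per_token=[1, 4, 2], dim=8)
out, weights = layer.forward(token_id=1, hidden=[0.1] * 8)
```

In this sketch the set of candidate embeddings depends only on the token ID, so no learned router or load-balancing loss is involved, while the mixing weights still vary with the hidden state. With a single slot per token the layer reduces to an ordinary embedding lookup.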
The L$^3$ framework consists of two primary elements: (1) an architecture designed for system efficiency, facilitating rapid training and CPU-offloaded inference without any performance penalty; and (2) an information-theoretic algorithm for allocating embeddings that optimally balances speed against model quality. Through empirical evaluation, we trained Transformer models with up to 2.6B active parameters using L$^3$. Our results demonstrate that L$^3$ significantly surpasses both dense models and iso-sparse MoEs in performance on language modeling and various downstream tasks.
Source: arXiv
Generated at: 2026-06-04 00:00:00 UTC




