Customizing the Inductive Biases of Softmax Attention using Structured Matrices
Title: Tailoring Softmax Attention’s Inductive Biases via Structured Matrices
Abstract: The core of attention is its scoring function, which projects inputs into low-dimensional query and key vectors and takes the dot product of every pair. Although this low-dimensional projection improves computational efficiency, it often loses information on tasks with intrinsically high-dimensional inputs. Standard attention also applies the same scoring function to all input pairs, with no distance-dependent compute bias that favors neighboring tokens in the sequence. To address these limitations, we introduce scoring functions built on computationally efficient, high-rank structured matrices, specifically Block Tensor-Train (BTT) and contiguous Multi-Level Low Rank (MLR) matrices. On in-context regression tasks with high-dimensional data, these scoring functions outperform standard attention at any fixed compute budget. On language modeling, a task that exhibits locality patterns, our MLR-based attention achieves better scaling laws than both standard attention and sliding window variants. We further show that BTT and MLR belong to a broader family of efficient structured matrices that can encode either full-rank or distance-dependent compute biases, addressing major shortcomings of standard attention. Finally, we show that MLR attention gives promising results for long-range time-series forecasting.
Source: arXiv Generated at: 2026-06-04 00:00:00 UTC
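
The rank limitation described in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: the dimensions, the helper names `scores` and `low_rank`, and the two-level block layout are all assumptions chosen for illustration, and the second matrix is a generic multi-level low-rank form (a sum of block-diagonal low-rank terms), not necessarily the parameterization used in the paper. The sketch shows that the query-key dot product is a bilinear form of rank at most the head dimension, and that a multi-level form with the same parameter count can reach a higher rank.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 16, 4  # hypothetical embedding and head dimensions

# Standard attention: score(x_i, x_j) = (W_q^T x_i) . (W_k^T x_j)
#                                     = x_i^T (W_q W_k^T) x_j
W_q = rng.standard_normal((d, r))
W_k = rng.standard_normal((d, r))
M_lowrank = W_q @ W_k.T  # d x d bilinear form, rank at most r


def scores(X, M):
    """Pre-softmax scores for every pair of rows of X under bilinear form M."""
    return X @ M @ X.T


X = rng.standard_normal((6, d))  # 6 hypothetical token embeddings
S = scores(X, M_lowrank)
S_qk = (X @ W_q) @ (X @ W_k).T  # same scores via explicit queries and keys


def low_rank(n, k):
    """Random n x n matrix of rank at most k (illustrative helper)."""
    return rng.standard_normal((n, k)) @ rng.standard_normal((n, k)).T


# Generic two-level multi-level low-rank form (an assumption, see lead-in):
# level 0 is one d x d block of rank 2, level 1 is two diagonal blocks of rank 2.
h = d // 2
M_mlr = low_rank(d, 2)
M_mlr[:h, :h] += low_rank(h, 2)
M_mlr[h:, h:] += low_rank(h, 2)

# Both forms use the same number of parameters.
params_lowrank = 2 * d * r
params_mlr = 2 * d * 2 + 2 * (2 * h * 2)

rank_lowrank = int(np.linalg.matrix_rank(M_lowrank))
rank_mlr = int(np.linalg.matrix_rank(M_mlr))
print(params_lowrank, params_mlr, rank_lowrank, rank_mlr)
```

With these sizes, the rank of the multi-level form is bounded by the sum of the per-block ranks (2 + 2 + 2), while the standard form is bounded by `r`.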




