Neural Attention Search Linear: Towards Adaptive Token-Level Hybrid Attention Models
Abstract: The quadratic computational complexity of softmax transformers has become a significant bottleneck in long-context applications. Linear attention models offer an efficient alternative for sequence processing by compressing past key-value (KV) pairs into a single fixed-size hidden state, which substantially lowers complexity during both training and inference. However, the expressivity of these models is constrained by the capacity of that fixed-size state. Prior work has proposed interleaving softmax and linear attention layers to balance efficiency and expressivity, but the efficiency of such hybrids remains limited by their softmax layers. To address this, we introduce Neural Attention Search Linear (NAtS-L), a framework that applies both linear and softmax attention operations within the same layer, assigning them to different tokens. NAtS-L dynamically decides whether a token can be handled by linear attention (tokens with only short-term influence, which can be captured in a fixed-size state) or requires softmax attention (tokens carrying long-term retrieval information that must be retained for future queries). By optimizing the assignment of Gated DeltaNet and softmax attention across tokens, we show that NAtS-L yields a strong yet efficient token-level hybrid architecture.
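The token-level split described in the abstract can be illustrated with a toy sketch. It is not the NAtS-L implementation: a plain additive outer-product update stands in for Gated DeltaNet, the routing mask `use_softmax` is supplied by hand rather than learned, and the helper names (`route_tokens`, `linear_read`, `softmax_read`) are hypothetical. How the two reads are combined is left open here, since the abstract does not specify it.

```python
import math

def outer(k, v):
    # Outer product k v^T as a nested list.
    return [[ki * vj for vj in v] for ki in k]

def route_tokens(keys, values, use_softmax):
    """Fold linear-routed tokens into a fixed d x d state; cache the rest.

    Assumption: keys and values share the same dimension d.
    """
    d = len(keys[0])
    state = [[0.0] * d for _ in range(d)]   # fixed size, independent of sequence length
    cache = []                              # grows only with softmax-routed tokens
    for k, v, keep in zip(keys, values, use_softmax):
        if keep:
            cache.append((k, v))
        else:
            update = outer(k, v)
            for i in range(d):
                for j in range(d):
                    state[i][j] += update[i][j]
    return state, cache

def linear_read(q, state):
    # Read from the compressed state: q^T S.
    d = len(q)
    return [sum(q[i] * state[i][j] for i in range(d)) for j in range(d)]

def softmax_read(q, cache):
    # Standard scaled dot-product attention over the retained tokens only.
    if not cache:
        return [0.0] * len(q)
    scale = math.sqrt(len(q))
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k, _ in cache]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, (_, v) in zip(weights, cache)) / total
            for j in range(len(q))]

keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
use_softmax = [False, True, False, False]   # hand-picked routing, for illustration only

state, cache = route_tokens(keys, values, use_softmax)
print(len(cache))                        # 1 token retained for softmax attention
print(linear_read([1.0, 0.0], state))    # [6.0, 8.0]
print(softmax_read([1.0, 0.0], cache))   # [3.0, 4.0]
```

In this sketch the memory cost of the softmax path scales with the number of tokens routed to it, while the linear path stays at d x d regardless of sequence length, which is the trade-off the abstract describes.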
Source: arXiv
Generated at: 2026-06-03 00:00:00 UTC





