Stochastic Sparse Attention for Memory-Bound Inference
Abstract:
As context lengths grow, autoregressive decoding becomes increasingly memory-bandwidth-bound, because generating each new token requires reading all $n_k$ key and value vectors from the Key-Value (KV) cache. To address this, we introduce Stochastic Additive No-mulT Attention (SANTA), which sparsifies value-cache access by sampling only $S \ll n_k$ indices from the post-softmax distribution and aggregating only those value rows. The result is an unbiased estimator of value aggregation that replaces multiply-accumulate operations with gather-and-add. To reduce variance, we develop GPU-friendly variants based on stratified and systematic sampling.
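The sampling idea can be illustrated with a minimal pure-Python sketch. It is an illustration under stated assumptions, not the released GPU kernels: the function names (`softmax`, `exact_attention`, `systematic_indices`, `santa_estimate`) are hypothetical, the scores are assumed to be already computed, and the single division by $S$ at the end is one possible normalization.

```python
import math
import random

def softmax(scores):
    # Numerically stable softmax over one query's attention scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def exact_attention(p, V):
    # Reference: dense weighted sum over all n_k value rows.
    d = len(V[0])
    return [sum(p[i] * V[i][j] for i in range(len(V))) for j in range(d)]

def systematic_indices(p, S, rng):
    # Systematic sampling: one uniform offset u in [0, 1/S), then S evenly
    # spaced points u + k/S are mapped through the inverse CDF of p.
    u = rng.random() / S
    idx, i, cum = [], 0, p[0]
    for k in range(S):
        t = u + k / S
        while cum <= t and i < len(p) - 1:
            i += 1
            cum += p[i]
        idx.append(i)
    return idx

def santa_estimate(p, V, S, rng, systematic=True):
    # Sample S indices from p, then gather and add the selected value rows.
    if systematic:
        idx = systematic_indices(p, S, rng)
    else:
        idx = rng.choices(range(len(p)), weights=p, k=S)  # i.i.d. draws
    d = len(V[0])
    out = [0.0] * d
    for i in idx:
        for j in range(d):
            out[j] += V[i][j]  # no per-row multiplication
    return [x / S for x in out]  # assumed normalization: one scale at the end

if __name__ == "__main__":
    rng = random.Random(0)
    p = softmax([2.0, 1.0, 0.0, -1.0])
    V = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [1.0, 1.0]]
    print("exact   :", exact_attention(p, V))
    print("sampled :", santa_estimate(p, V, 8, rng))
```

Each sampled index $i$ is drawn with probability $p_i$, so the expectation of a gathered row is $\sum_i p_i V_i$, which is why the average of the $S$ gathered rows is unbiased. Systematic sampling keeps this expectation while spreading the samples evenly over the cumulative distribution.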
We evaluate the S$^2$ANTA variant on Llama-3.1-8B-Instruct with contexts of up to 32k tokens and find that it maintains accuracy comparable to the baselines. It delivers up to a $1.5\times$ speedup of the decode-step attention kernel over FlashInfer and FlashDecoding on an NVIDIA RTX 6000 Ada. In batched long-context generation, these kernel gains translate into up to a $1.25\times$ reduction in end-to-end decode latency. We also propose Bernoulli $qK^\mathsf{T}$ sampling as a complementary method that sparsifies the scoring phase, reducing key-feature reads through stochastic ternary queries. Both methods are designed to compose with existing techniques such as upstream quantization, low-rank projection, and KV-cache compression and selection. Together, they point toward sparse, multiplier-free, energy-efficient inference. The kernels are open source at: https://github.com/OPUSLab/SANTA.git
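The abstract does not spell out how the stochastic ternary queries are built. The sketch below shows one plausible construction, which is an assumption rather than the authors' method: each query coordinate is kept with probability $|q_j| / \max_j |q_j|$ and replaced by its sign, so the rescaled ternary vector is an unbiased estimate of $q$, and only the key features at nonzero positions need to be read. The names `ternary_query` and `sparse_scores` are hypothetical.

```python
import random

def ternary_query(q, rng):
    # Assumed scheme: keep coordinate j with probability |q_j| / scale and
    # store only its sign, so that E[scale * t_j] = q_j.
    scale = max(abs(x) for x in q)
    if scale == 0.0:
        return [0] * len(q), 0.0
    t = []
    for x in q:
        keep = rng.random() < abs(x) / scale
        t.append((1 if x > 0 else -1) if keep else 0)
    return t, scale

def sparse_scores(t, scale, K):
    # Read only the key features where the ternary query is nonzero,
    # accumulating with additions and subtractions; one scale per score.
    active = [j for j, s in enumerate(t) if s != 0]
    scores = []
    for row in K:
        acc = 0.0
        for j in active:
            acc += row[j] if t[j] > 0 else -row[j]
        scores.append(scale * acc)
    return scores

if __name__ == "__main__":
    rng = random.Random(0)
    q = [1.0, -0.5, 0.25, 0.0]
    K = [[1.0, 2.0, 0.0, 1.0], [0.5, -1.0, 2.0, 3.0]]
    t, scale = ternary_query(q, rng)
    print("ternary query:", t)
    print("estimated scores:", sparse_scores(t, scale, K))
```

Under this assumed scheme, coordinates with small magnitude are dropped most of the time, so the expected number of key features read per score is $\sum_j |q_j| / \max_j |q_j|$ rather than the full head dimension.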
Source: arXiv
Generated at: 2026-06-04 00:00:00 UTC






