arXiv

Stochastic Sparse Attention for Memory-Bound Inference

Abstract:

As context lengths grow, autoregressive decoding becomes increasingly bandwidth-bound, because generating each new token requires reading all $n_k$ key and value vectors from the Key-Value (KV) cache. To address this, we introduce Stochastic Additive No-mulT Attention (SANTA), which samples only $S \ll n_k$ indices from the post-softmax distribution and aggregates only the corresponding value rows. This yields an unbiased estimate of the value aggregation and replaces multiply-accumulate operations with gather-and-add operations. To reduce variance, we also develop GPU-friendly variants based on stratified and systematic sampling.
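
The sampled value aggregation described above can be illustrated with a small CPU sketch. It is not the authors' GPU kernel: the function names (`softmax`, `exact_attention`, `systematic_indices`, `sampled_attention`) are hypothetical, and the systematic sampler shown is one common way to draw low-variance samples from a discrete distribution, assumed here for illustration. The estimator averages $S$ gathered value rows, and its expectation equals $\sum_i p_i V_i$ because each index $i$ is drawn $S p_i$ times on average.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of attention scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def exact_attention(probs, values):
    """Dense baseline: sum_i p_i * V[i], one multiply-accumulate per element."""
    d = len(values[0])
    out = [0.0] * d
    for p, row in zip(probs, values):
        for j in range(d):
            out[j] += p * row[j]
    return out

def systematic_indices(probs, num_samples, rng):
    """Systematic sampling: one shared uniform offset, then evenly spaced
    targets (s + u) / S that are located on the running CDF in one pass."""
    u = rng.random()
    indices = []
    cdf = 0.0
    i = -1
    for s in range(num_samples):
        target = (s + u) / num_samples
        # Advance until the CDF strictly exceeds the target.
        while cdf <= target and i < len(probs) - 1:
            i += 1
            cdf += probs[i]
        indices.append(i)
    return indices

def sampled_attention(probs, values, num_samples, rng):
    """Estimate sum_i p_i * V[i] by gathering and adding S sampled rows."""
    d = len(values[0])
    out = [0.0] * d
    for i in systematic_indices(probs, num_samples, rng):
        row = values[i]
        for j in range(d):
            out[j] += row[j]  # gather-and-add: no per-element multiply
    # A single rescale by 1/S per output element.
    return [x / num_samples for x in out]
```

In this sketch only the $S$ sampled rows of `values` are touched, which is where the reduction in value reads would come from. The attention probabilities are still computed densely, so the scoring stage is unchanged.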

We evaluate the S$^2$ANTA variant on Llama-3.1-8B-Instruct with contexts of up to 32k tokens and find that it maintains accuracy comparable to baselines. It delivers up to a $1.5\times$ speedup in the decode-step attention kernel over FlashInfer and FlashDecoding on an NVIDIA RTX 6000 Ada. In batched long-context generation, these kernel gains translate into up to a $1.25\times$ reduction in end-to-end decode latency. We also propose Bernoulli $qK^\mathsf{T}$ sampling, a complementary method that sparsifies the scoring phase by using stochastic ternary queries to reduce accesses to key features. Both methods are designed to compose with existing techniques such as upstream quantization, low-rank projection, and KV-cache compression and selection. Together, they point toward sparse, multiplier-free, energy-efficient inference. The kernels are open source at: https://github.com/OPUSLab/SANTA.git
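
The abstract does not spell out how the stochastic ternary queries are constructed, so the following is only a sketch under an assumed rule: each query entry $q_j$ is kept as $\mathrm{sign}(q_j)$ with probability $|q_j| / c$ and set to zero otherwise, where $c = \max_j |q_j|$. Under that assumption $\mathbb{E}[c\, t_j] = q_j$, so the sparse score is an unbiased estimate of $q \cdot k$, and key features at zeroed positions are never read. The names `ternary_query` and `sparse_scores` are hypothetical.

```python
import random

def ternary_query(q, rng):
    """Draw t in {-1, 0, +1}^d with E[scale * t] = q (assumed scheme).

    Entry j is kept as sign(q[j]) with probability |q[j]| / scale,
    where scale = max_j |q[j]|; otherwise it is set to 0.
    """
    scale = max(abs(x) for x in q)
    if scale == 0.0:
        return [0] * len(q), 0.0
    t = []
    for x in q:
        keep = rng.random() < abs(x) / scale
        t.append((1 if x > 0 else -1) if keep else 0)
    return t, scale

def sparse_scores(t, scale, keys):
    """Score each key using only the feature positions where t is nonzero."""
    active = [(j, s) for j, s in enumerate(t) if s != 0]
    scores = []
    for k in keys:
        acc = 0.0
        for j, s in active:
            acc += k[j] if s > 0 else -k[j]  # add or subtract, no multiply
        # One rescale per score; a simplification in this sketch.
        scores.append(scale * acc)
    return scores
```

The unbiasedness here applies to the raw scores only. Passing noisy scores through a softmax is nonlinear, so this sketch makes no claim about the bias of the resulting attention weights.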


Source: arXiv
Generated at: 2026-06-04 00:00:00 UTC
