arXiv

SparDA: Sparse Decoupled Attention for Efficient Long-Context LLM Inference

Abstract

Sparse attention mechanisms are designed to lower the compute and memory-bandwidth cost of long-context large language model (LLM) inference, but two significant hurdles persist. First, the KV cache still grows with sequence length, and offloading it to CPU memory makes PCIe transfers a bottleneck. Second, sparse token selection retains $O(T^2)$ complexity, which can become the dominant cost at long context lengths.
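To make the first hurdle concrete, the following back-of-the-envelope sketch estimates KV cache size and the time to move it over PCIe. The model shape (32 layers, 8 KV heads, head dimension 128, 16-bit elements) and the 32 GB/s link bandwidth are illustrative assumptions, not figures taken from this abstract.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim,
                   bytes_per_elem=2, batch=1):
    """Bytes needed to hold keys and values for every layer and token."""
    # Leading factor 2: one tensor for keys, one for values.
    return 2 * batch * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_elem


def transfer_seconds(n_bytes, bandwidth_bytes_per_s):
    """Idealized transfer time, ignoring latency and protocol overhead."""
    return n_bytes / bandwidth_bytes_per_s


# Hypothetical 8B-class GQA configuration at a 128K-token context.
size = kv_cache_bytes(seq_len=131072, n_layers=32, n_kv_heads=8, head_dim=128)
print(size / 2**30, "GiB")  # 16.0 GiB

# Assumed nominal bandwidth, roughly that of a PCIe 4.0 x16 link.
print(round(transfer_seconds(size, 32e9), 2), "s to move the full cache once")
```

Under these assumptions the cache grows linearly with context length, so moving even a fraction of it per decoding step can dominate step time unless the transfer is hidden behind computation.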

To address these issues, we introduce SparDA, a decoupled sparse attention framework that adds a fourth per-layer projection, "Forecast," alongside the standard Query, Key, and Value projections. Forecast predicts which KV blocks the subsequent layer will need, enabling lookahead selection: the CPU-to-GPU prefetch of those blocks can overlap with the execution of the current layer. Because Forecast is independent of the attention query, our Grouped-Query Attention (GQA) implementation assigns one Forecast head per GQA group, which reduces selection overhead compared to traditional multi-head selectors.
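The lookahead selection described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how Forecast outputs are scored against KV blocks, so the per-block summary vectors, the dot-product scoring, and the name `forecast_topk_blocks` are all hypothetical choices made for illustration.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(w: Matrix, x: Vector) -> Vector:
    """Multiply matrix w (rows x cols) by vector x."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def forecast_topk_blocks(hidden: Vector,
                         w_forecast_per_group: List[Matrix],
                         block_summaries_per_group: List[List[Vector]],
                         k: int) -> List[List[int]]:
    """For each GQA group, pick the k next-layer KV blocks with the highest score.

    Assumption: each block is represented by one summary vector, and a block's
    score is the dot product of the group's forecast vector with that summary.
    """
    selected = []
    for w_f, summaries in zip(w_forecast_per_group, block_summaries_per_group):
        forecast = matvec(w_f, hidden)  # one Forecast head per GQA group
        scores = [dot(forecast, s) for s in summaries]
        top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
        selected.append(sorted(top))    # ascending block order for the prefetcher
    return selected


# Toy example: 2 GQA groups, hidden size 2, k = 2 blocks per group.
hidden = [1.0, 0.0]
w_forecast = [[[1.0, 0.0], [0.0, 1.0]],   # group 0
              [[0.0, 1.0], [1.0, 0.0]]]   # group 1
summaries = [[[0.1, 0.0], [0.9, 0.0], [0.5, 0.0], [-1.0, 0.0]],
             [[0.0, 3.0], [0.0, 1.0], [0.0, 2.0]]]
print(forecast_topk_blocks(hidden, w_forecast, summaries, k=2))  # [[1, 2], [0, 2]]
```

In this sketch the selector computes one score per block per GQA group rather than one per query head, which is where the reduced selection overhead would come from. In a real system the returned indices would drive an asynchronous CPU-to-GPU copy issued while the current layer is still executing; that overlap is not modeled here.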

SparDA introduces fewer than 0.5% additional parameters and requires training only the Forecast projections, using the original selector’s attention distribution as a target. Evaluations on two sparse-pretrained 8B models demonstrate that SparDA achieves accuracy that matches or slightly exceeds existing methods. It provides speedups of up to 1.25$\times$ for prefilling and 1.7$\times$ for decoding compared to sparse-attention offloading baselines. Furthermore, by supporting larger feasible batch sizes on a single GPU, SparDA achieves decode throughput that is up to 5.3$\times$ higher than non-offload sparse baselines. The source code is publicly available at https://github.com/NVlabs/SparDA.


Source: arXiv Generated at: 2026-06-04 00:00:00 UTC
