arXiv

SparDA: Sparse Decoupled Attention for Efficient Long-Context LLM Inference

June 4, 2026 · Yaosheng Fu, Guangxuan Xiao, Xin Dong, Song Han, Oreste Villa

Abstract

While sparse attention mechanisms are designed to lower computational demands and memory bandwidth usage during long-context large language model (LLM) inference, two significant hurdles persist. First, the KV cache size continues to scale with sequence length, and shifting this data to CPU memory creates a bottleneck due to PCIe transfer limitations. Second, the process of selecting sparse tokens retains an $O(T^2)$ complexity, which can become the dominant cost factor when handling long contexts.

To address these issues, we introduce SparDA, a decoupled sparse attention framework that incorporates a new fourth per-layer projection called "Forecast," in addition to the standard Query, Key, and Value projections. This Forecast module predicts which KV blocks will be required by the subsequent layer, allowing for lookahead selection. This capability enables the prefetching of data from CPU to GPU to overlap with the execution of the current layer. Since the Forecast is independent of the attention query, our implementation of Grouped-Query Attention (GQA) assigns one Forecast head per GQA group. This design reduces the overhead associated with selection compared to traditional multi-head selectors.
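The lookahead selection described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how Forecast outputs are scored against KV blocks, so the dot product against per-block summary vectors, the top-k rule, and the names `forecast_select` and `selector_score_count` are all assumptions made for illustration.

```python
import heapq

def forecast_select(forecast_vecs, block_summaries, top_k):
    """Pick the top-k KV blocks per GQA group for the *next* layer.

    forecast_vecs:   one vector per GQA group (hypothetical Forecast output
                     computed while the current layer runs).
    block_summaries: block_summaries[g][b] is an assumed summary vector for
                     KV block b of group g in the next layer.
    Returns a sorted list of selected block indices for each group.
    """
    selected = []
    for f, summaries in zip(forecast_vecs, block_summaries):
        # Assumed scoring rule: dot product between forecast and block summary.
        scores = [sum(fi * si for fi, si in zip(f, s)) for s in summaries]
        k = min(top_k, len(scores))
        top = heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)
        selected.append(sorted(top))
    return selected

def selector_score_count(num_query_heads, num_kv_groups, num_blocks):
    """Block scores computed per token: per-query-head vs. per-GQA-group."""
    return num_query_heads * num_blocks, num_kv_groups * num_blocks

# One GQA group, four KV blocks, keep the two highest-scoring blocks.
blocks_to_prefetch = forecast_select(
    [[1.0, 0.0]],
    [[[0.1, 0.0], [0.9, 0.0], [0.5, 0.0], [0.2, 0.0]]],
    top_k=2,
)
```

Because the selected indices are known before the next layer starts, a runtime could issue the CPU-to-GPU copies for those blocks while the current layer is still executing. The second helper shows why one Forecast head per GQA group is cheaper than a per-query-head selector: with an illustrative 32 query heads and 8 groups, the number of block scores drops by a factor of 4.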

SparDA introduces fewer than 0.5% additional parameters and requires training only the Forecast projections, using the original selector’s attention distribution as a target. Evaluations on two sparse-pretrained 8B models demonstrate that SparDA achieves accuracy that matches or slightly exceeds existing methods. It provides speedups of up to 1.25$\times$ for prefilling and 1.7$\times$ for decoding compared to sparse-attention offloading baselines. Furthermore, by supporting larger feasible batch sizes on a single GPU, SparDA achieves decode throughput that is up to 5.3$\times$ higher than non-offload sparse baselines. The source code is publicly available at https://github.com/NVlabs/SparDA.
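As a rough sketch of the training signal described above, the snippet below treats the original selector's attention distribution over KV blocks as a fixed target and measures how far the Forecast scores are from it. The abstract states only that this distribution is the target; the KL-divergence loss and the names `softmax` and `forecast_distill_loss` are assumptions, not the paper's stated objective.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def forecast_distill_loss(forecast_scores, selector_probs, eps=1e-12):
    """KL(selector || softmax(forecast)) over KV blocks (assumed loss).

    forecast_scores: raw block scores from the hypothetical Forecast head.
    selector_probs:  target distribution from the original selector,
                     treated as a constant (only Forecast would be trained).
    """
    pred = softmax(forecast_scores)
    return sum(
        p * (math.log(p + eps) - math.log(q + eps))
        for p, q in zip(selector_probs, pred)
        if p > 0
    )

matched = forecast_distill_loss([0.0, 0.0], [0.5, 0.5])
mismatched = forecast_distill_loss([2.0, 0.0], [0.5, 0.5])
```

The loss is zero when the Forecast distribution matches the selector's and grows as the two diverge, so minimizing it pushes the Forecast head to rank blocks the way the original selector does.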
