SSA: Sparse Sparse Attention by Aligning Full and Sparse Attention Outputs in Feature Space
Title: SSA: Aligning Full and Sparse Attention Outputs in Feature Space
Abstract:
Sparse attention mitigates the quadratic computational cost of full self-attention, but it faces two obstacles. First, an "attention gap" arises when sparse attention is applied to models trained with full attention: the mismatch between training and inference distributions degrades performance. Second, a "capability gap" arises in models trained exclusively with sparse attention: incomplete gradient flow prevents them from matching the performance of their full-attention counterparts. To address both, we introduce SSA (Sparse Sparse Attention), a training framework that incorporates bidirectional alignment between full and sparse attention outputs. We provide a theoretical analysis showing that the approximation error scales linearly with the attention mass discarded under sparse attention, and we show that SSA’s alignment objective substantially reduces this discarded mass, and hence the error, relative to baseline methods. Empirically, SSA achieves state-of-the-art results under both full and sparse inference, adapts to different sparsity budgets, and shows stronger long-context capabilities.
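The linear dependence on discarded attention mass can be illustrated numerically for a single query. The sketch below is not the paper's analysis or implementation: it assumes top-k sparse attention that renormalizes the softmax over the kept keys, and the function and variable names (`full_and_sparse_outputs`, `dropped_mass`) are hypothetical. Under that assumption, the triangle inequality gives ||o_full - o_sparse|| <= 2 * delta * max_j ||v_j||, where delta is the attention mass on the dropped keys.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array.
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def full_and_sparse_outputs(q, K, V, k):
    """Return (full output, top-k sparse output, dropped attention mass).

    Assumption for illustration: the sparse variant keeps the k
    highest-scoring keys and renormalizes the softmax over them.
    """
    scores = K @ q / np.sqrt(q.shape[0])
    p = softmax(scores)                     # full attention weights
    full_out = p @ V
    keep = np.argsort(scores)[-k:]          # indices of the k largest scores
    sparse_out = softmax(scores[keep]) @ V[keep]
    dropped_mass = max(0.0, 1.0 - p[keep].sum())  # clip float round-off
    return full_out, sparse_out, dropped_mass

rng = np.random.default_rng(0)
d, n = 16, 128                              # head dimension, number of keys
q = rng.normal(size=d)
K = rng.normal(size=(n, d))
V = rng.normal(size=(n, d))
v_max = np.linalg.norm(V, axis=1).max()

results = []
for k in (8, 32, 128):
    full_out, sparse_out, delta = full_and_sparse_outputs(q, K, V, k)
    err = np.linalg.norm(full_out - sparse_out)
    bound = 2.0 * delta * v_max             # linear in the dropped mass
    results.append((k, delta, err, bound))
    print(f"k={k:4d}  dropped mass={delta:.4f}  error={err:.4f}  bound={bound:.4f}")
```

With k equal to the number of keys nothing is dropped, so the two outputs coincide up to rounding. Anything that concentrates attention mass on the kept keys shrinks delta and, with it, the bound on the gap between the full and sparse outputs.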
Source: arXiv
Generated at: 2026-06-04 00:00:00 UTC




