MomentKV: Closing the Directional Gap in KV Cache Eviction for Long-Context Inference
Title: MomentKV: Bridging the Directional Divide in KV Cache Eviction for Extended Context Processing
Abstract: In Transformer-based language models, autoregressive decoding depends heavily on the Key-Value (KV) cache. As sequence length increases, however, the cache’s memory usage grows linearly, creating a significant bottleneck for long-context inference. To mitigate this, KV cache eviction techniques retain a fixed-size subset of key-value pairs and discard the remainder. Our investigation reveals that the primary cause of performance degradation in existing methods is not the residual attention mass assigned to evicted tokens (a factor these methods already strive to minimize) but a directional discrepancy between the retained and evicted token groups. In practice, evicted tokens are frequently near-orthogonal to those kept, so even a small share of evicted attention mass can disproportionately skew the direction of the attention output and lead to substantial output errors. This finding exposes a fundamental limitation of current strategies.
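One way to make the distinction between attention mass and directional discrepancy concrete is the exact identity o_full - o_R = alpha * (o_E - o_R), where o_R and o_E are the softmax-weighted outputs of the retained and evicted sets and alpha is the evicted share of the attention mass. The NumPy sketch below checks this identity numerically. It is an illustration under our own assumptions, not the paper's code: the function names, the single-head setup, and the reading of "directional discrepancy" as the gap between o_E and o_R are all hypothetical.

```python
import numpy as np

def attention_output(q, K, V):
    """Softmax attention of one query over (K, V); also returns the normalizer."""
    s = q / np.sqrt(q.shape[0])   # scaled query
    w = np.exp(K @ s)             # unnormalized weights (no max-shift; small inputs only)
    return (w @ V) / w.sum(), w.sum()

def eviction_error(q, K_r, V_r, K_e, V_e):
    """Error from dropping the evicted set, split into mass and gap terms."""
    o_r, z_r = attention_output(q, K_r, V_r)   # retained-only output
    o_e, z_e = attention_output(q, K_e, V_e)   # evicted-only output
    o_full, _ = attention_output(q, np.vstack([K_r, K_e]), np.vstack([V_r, V_e]))
    alpha = z_e / (z_r + z_e)                  # evicted attention mass
    gap = np.linalg.norm(o_e - o_r)            # discrepancy between the two groups
    err = np.linalg.norm(o_full - o_r)         # actual output error
    return err, alpha, gap
```

Because the error equals alpha times the gap, driving alpha down helps only as long as the gap stays bounded; when the evicted output points in a very different direction from the retained one, a small alpha can still leave a noticeable error.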
To overcome this, we introduce MomentKV, a method that maintains compact moment statistics for the evicted token set: counts, key means, value means, and value-key covariance. During eviction, these statistics help identify tokens that are already well aligned with, and well represented by, the accumulated summary, which keeps the evicted set geometrically regular. During inference, the same statistics enable a closed-form first-order approximation of the evicted attention output, creating a mutually reinforcing cycle between selective eviction and precise correction. Evaluated on LongBench and RULER using LLaMA-3.1-8B-Instruct and Qwen3-4B-Instruct, MomentKV surpasses all baseline methods across every cache budget, with its largest improvements under aggressive compression.
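The correction step described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a single head and a single query, expands exp(s·k) to first order around the evicted key mean, and uses hypothetical names (`evicted_moments`, `corrected_attention`, `exact_attention`). Under that expansion the evicted normalizer is approximately n·exp(s·mu_k), and the evicted numerator is approximately n·exp(s·mu_k)·(mu_v + C s), where C is the value-key covariance.

```python
import numpy as np

def evicted_moments(K_e, V_e):
    """Summarize the evicted set: count, key mean, value mean, value-key covariance."""
    n = K_e.shape[0]
    mu_k = K_e.mean(axis=0)
    mu_v = V_e.mean(axis=0)
    C = (V_e - mu_v).T @ (K_e - mu_k) / n   # shape (d_v, d_k)
    return n, mu_k, mu_v, C

def corrected_attention(q, K_r, V_r, moments):
    """Attention over retained tokens plus a first-order term for evicted ones."""
    n, mu_k, mu_v, C = moments
    s = q / np.sqrt(q.shape[0])             # scaled query
    w_r = np.exp(K_r @ s)                   # retained weights (no max-shift; small inputs only)
    z_e = n * np.exp(mu_k @ s)              # approximate evicted normalizer
    num_e = z_e * (mu_v + C @ s)            # approximate evicted numerator
    return (w_r @ V_r + num_e) / (w_r.sum() + z_e)

def exact_attention(q, K, V):
    """Reference: full softmax attention over all tokens."""
    s = q / np.sqrt(q.shape[0])
    w = np.exp(K @ s)
    return (w @ V) / w.sum()
```

In this sketch the approximation is exact when all evicted keys coincide with their mean and degrades as the evicted keys spread out, which is consistent with the abstract's point that keeping the evicted set geometrically regular makes the correction more accurate. A production version would also need numerically stable exponentials and incremental updates of the moments as tokens are evicted.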
Source: arXiv Generated at: 2026-06-02 00:00:00 UTC





