arXiv

Predicting Future Utility: Global Combinatorial Optimization for Task-Agnostic KV Cache Eviction

June 2, 2026 · Ziyao Tang, Pengkun Jiao, Xinhang Chen, Wei Liu, Shiyong Li, Jingjing Chen

Title: Anticipating Long-Term Value: A Global Combinatorial Approach to Task-Agnostic KV Cache Eviction

Abstract:

Because attention mechanisms have quadratic computational complexity, evicting Key-Value (KV) cache entries has become essential for accelerating model inference. Existing eviction strategies generally rely on instantaneous heuristic metrics and implicitly assume that score magnitudes are equally reliable proxies for importance across all attention heads. This assumption overlooks the heterogeneity in predictive fidelity among heads: some heads focus on the immediate contribution of tokens, while others specialize in capturing utility over extended horizons.
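To make the contrast concrete, the following is a minimal sketch of one instantaneous heuristic of the kind described above: every head receives the same budget and keeps the tokens with the highest current scores. The function name `evict_topk_per_head`, the `scores` layout, and the example values are assumptions made for illustration; they do not come from the paper or from any specific baseline.

```python
def evict_topk_per_head(scores, budget):
    """Keep the `budget` highest-scoring tokens in every head.

    scores[h][t] is a hypothetical instantaneous importance score
    (for example, a recent attention weight) of token t in head h.
    Returns, per head, the sorted indices of the tokens that are kept.
    """
    kept = []
    for head_scores in scores:
        # Rank token indices by their current score, highest first.
        ranked = sorted(range(len(head_scores)),
                        key=lambda t: head_scores[t], reverse=True)
        # Uniform budget: every head keeps the same number of tokens.
        kept.append(sorted(ranked[:budget]))
    return kept


# Illustrative scores for 2 heads and 4 cached tokens (made-up values).
example_scores = [
    [0.10, 0.60, 0.05, 0.25],
    [0.40, 0.10, 0.30, 0.20],
]
print(evict_topk_per_head(example_scores, budget=2))
```

Here both heads keep two tokens regardless of how well their current scores predict future usefulness, which is the uniformity assumption the abstract questions.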

In this study, we argue that optimal budget allocation should be governed by the marginal utility of preserving long-term semantic information. Building on this perspective, we introduce LU-KV, a new framework that formulates head-level budget allocation as a global combinatorial optimization problem whose objective is to maximize the long-horizon marginal contribution of the tokens retained in the cache. To handle the non-convexity of this problem, we use a convex-hull relaxation combined with a greedy marginal-utility solver, which yields near-optimal solutions. We also establish a data-driven offline profiling protocol to support the practical deployment of LU-KV.
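The abstract does not spell out the solver, so the following is only a generic sketch of how a convex-hull relaxation can be paired with a greedy marginal-utility rule for budget allocation. It assumes each head has a profiled utility curve `utilities[h][b]` (the utility of giving head `h` a budget of `b` units); these curves, the function names, and the example numbers are hypothetical and are not taken from the paper.

```python
import heapq


def upper_concave_envelope(utility):
    """Vertices (budget, utility) of the upper concave envelope of a curve."""
    hull = []
    for b, u in enumerate(utility):
        # Drop the last vertex while slopes would fail to strictly decrease.
        while len(hull) >= 2:
            (b1, u1), (b2, u2) = hull[-2], hull[-1]
            if (u2 - u1) * (b - b2) <= (u - u2) * (b2 - b1):
                hull.pop()
            else:
                break
        hull.append((b, u))
    return hull


def allocate_budgets(utilities, total_budget):
    """Greedily spend total_budget on the steepest envelope segments."""
    hulls = [upper_concave_envelope(u) for u in utilities]
    budgets = [0] * len(utilities)
    heap = []  # max-heap on slope, stored as (-slope, head, segment index)

    def push_segment(h, i):
        if i + 1 < len(hulls[h]):
            (b0, u0), (b1, u1) = hulls[h][i], hulls[h][i + 1]
            slope = (u1 - u0) / (b1 - b0)
            if slope > 0:  # never spend budget on a non-improving segment
                heapq.heappush(heap, (-slope, h, i))

    for h in range(len(utilities)):
        push_segment(h, 0)

    remaining = total_budget
    while heap and remaining > 0:
        _, h, i = heapq.heappop(heap)
        width = hulls[h][i + 1][0] - hulls[h][i][0]
        take = min(width, remaining)  # the last segment may be taken partially
        budgets[h] += take
        remaining -= take
        if take == width:
            push_segment(h, i + 1)
    return budgets


# Made-up utility curves for 3 heads with budgets 0..4; head 2 is non-concave.
example_utilities = [
    [0, 50, 80, 92, 96],
    [0, 10, 16, 19, 20],
    [0, 0, 70, 78, 80],
]
print(allocate_budgets(example_utilities, total_budget=6))  # prints [3, 1, 2]
```

In this toy instance the greedy allocation reaches a total utility of 92 + 10 + 70 = 172, compared with 80 + 16 + 70 = 166 for a uniform budget of 2 per head. Because slopes along each envelope decrease, always taking the steepest available segment is optimal for the relaxed problem; when the final segment is only partially taken, the true utility at that budget can fall below the envelope, which is why such a scheme is near-optimal rather than exact.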

Benchmarking on LongBench and RULER reveals that LU-KV can shrink the KV cache size by 80% with negligible impact on performance. Furthermore, this approach significantly lowers inference latency and reduces the GPU memory footprint.
