arXiv

Predicting Future Utility: Global Combinatorial Optimization for Task-Agnostic KV Cache Eviction

Title: Anticipating Long-Term Value: A Global Combinatorial Approach to Task-Agnostic KV Cache Eviction

Abstract:

Because attention has quadratic computational complexity, evicting Key-Value (KV) cache entries has become essential for accelerating model inference. Existing eviction strategies generally rely on instantaneous heuristic metrics and implicitly assume that score magnitudes are uniform proxies for importance across all attention heads. This assumption overlooks the heterogeneity in predictive fidelity across heads: some heads prioritize a token's immediate contribution, while others specialize in capturing utility over extended horizons.

In this study, we argue that optimal budget allocation should be dictated by the marginal utility of preserving long-term semantic information. Building on this perspective, we introduce LU-KV, a new framework that treats head-level budget allocation as a global combinatorial optimization problem whose objective is to maximize the long-horizon marginal contribution of the tokens retained in the cache. To address the non-convex nature of this problem, we use a convex-hull relaxation combined with a greedy solver based on marginal utility, which yields near-optimal solutions. Additionally, we establish a data-driven offline profiling protocol to support the practical implementation of LU-KV.
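
The combination of a convex-hull relaxation with a greedy marginal-utility solver can be illustrated with a small sketch. This is not the LU-KV implementation, whose details the abstract does not give. It assumes that offline profiling has already produced, for each head, a hypothetical utility curve where entry b is the utility of keeping b budget units. The function names `concave_envelope` and `allocate_budgets` are invented for this example.

```python
import heapq
from typing import List, Sequence


def concave_envelope(curve: Sequence[float]) -> List[float]:
    """Upper concave envelope of the points (i, curve[i]), sampled at each i.

    Assumption: curve[i] is a profiled utility for a budget of i units.
    """
    hull: List[int] = []  # indices of the envelope's vertices
    for i in range(len(curve)):
        # Drop the last vertex while it lies on or below the chord
        # from the vertex before it to the new point i.
        while len(hull) >= 2:
            a, b = hull[-2], hull[-1]
            if (curve[b] - curve[a]) * (i - a) <= (curve[i] - curve[a]) * (b - a):
                hull.pop()
            else:
                break
        hull.append(i)

    env: List[float] = list(curve)
    # Linearly interpolate between consecutive hull vertices.
    for a, b in zip(hull, hull[1:]):
        for i in range(a, b + 1):
            env[i] = curve[a] + (curve[b] - curve[a]) * (i - a) / (b - a)
    return env


def allocate_budgets(curves: Sequence[Sequence[float]], total_budget: int) -> List[int]:
    """Greedily hand out budget units by largest marginal gain on the envelopes.

    Because each envelope is concave, its marginal gains are non-increasing,
    so this greedy choice is optimal for the relaxed problem.
    """
    envs = [concave_envelope(c) for c in curves]
    budgets = [0] * len(curves)

    # Max-heap (via negated gains) of each head's next marginal gain.
    heap = []
    for h, env in enumerate(envs):
        if len(env) > 1:
            heapq.heappush(heap, (-(env[1] - env[0]), h))

    remaining = total_budget
    while remaining > 0 and heap:
        _, h = heapq.heappop(heap)
        budgets[h] += 1
        remaining -= 1
        b = budgets[h]
        if b + 1 < len(envs[h]):
            heapq.heappush(heap, (-(envs[h][b + 1] - envs[h][b]), h))
    return budgets


# Hypothetical curves: head 0 pays off immediately, head 1 only at 3 units.
curves = [[0, 8, 12, 14, 15], [0, 1, 2, 15, 16]]
print(allocate_budgets(curves, total_budget=4))
```

On these made-up curves, a greedy pass over the raw marginal gains would never give the second head enough budget to reach its jump at 3 units, whereas the envelope spreads that gain across the first three units and makes it visible. A relaxed allocation can still stop in the interior of a hull segment, where the true utility is below the envelope, which is one reason such a solver is described as near-optimal rather than exact.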

Benchmarking on LongBench and RULER reveals that LU-KV can shrink the KV cache size by 80% with negligible impact on performance. Furthermore, this approach significantly lowers inference latency and reduces the GPU memory footprint.


Source: arXiv Generated at: 2026-06-02 00:00:00 UTC
