Multi-Segment Attention: Enabling Efficient KV-Cache Management for Faster Large Language Model Serving
Abstract:
Large Language Model (LLM) inference depends on key-value (KV) caches to eliminate redundant attention computations. While approximate retention methods lower memory consumption at the cost of accuracy, lossless strategies maintain output precision by evicting KV cache blocks from GPU memory and reconstructing them as needed. Current lossless management systems typically determine eviction based on access frequency or positional heuristics, largely ignoring how specific KV cache blocks influence the performance of GPU attention kernels.
To address this, we introduce AsymCache, a KV cache management framework for LLM inference that aligns residency decisions with GPU attention kernel performance by accounting for computation latency. AsymCache comprises three core elements: Multi-Segment Attention (MSA), which facilitates efficient processing of non-contiguous KV contexts; a cache eviction policy that balances hit rates with position-aware recomputation costs; and an adaptive chunking scheduler designed to maximize hardware utilization.
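As a rough illustration of the second element, the sketch below ranks cached blocks by a score that combines reuse frequency with an estimate of how expensive each block would be to recompute. It is not AsymCache's actual policy. The `Block` fields, the cost model (causal attention work growing with token position), and the `alpha` weight are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Block:
    block_id: int
    position: int         # index of the block's first token in the context
    hits: int             # observed reuse count (hypothetical statistic)
    num_tokens: int = 16  # tokens per KV block (assumed block size)


def recompute_cost(block: Block) -> float:
    """Assumed cost model: under causal attention, a token at position p
    attends to p + 1 keys, so rebuilding a block costs the sum of (p + 1)
    over its tokens. Closed form of that sum is used here."""
    n = block.num_tokens
    return n * block.position + n * (n + 1) / 2


def retention_score(block: Block, alpha: float = 1.0) -> float:
    """Higher score means the block is more worth keeping resident.
    alpha (hypothetical knob) weights recomputation cost against hit count."""
    return block.hits * recompute_cost(block) ** alpha


def pick_victim(blocks: list[Block]) -> Block:
    """Evict the block whose loss is estimated to hurt the least."""
    return min(blocks, key=retention_score)


if __name__ == "__main__":
    cache = [
        Block(block_id=0, position=0, hits=3),
        Block(block_id=1, position=1024, hits=3),
        Block(block_id=2, position=2048, hits=1),
    ]
    print("evict block", pick_victim(cache).block_id)
```

With equal hit counts, this toy score prefers to evict the earlier block because it is cheaper to rebuild, which is one way a policy can trade hit rate against position-dependent recomputation cost.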
Experimental results show that AsymCache outperforms recent baselines, reducing Time to First Token (TTFT) by a factor of 1.90 to 2.03 and Time Per Output Token (TPOT) by a factor of 1.62 to 1.71. These gains hold across common workloads and support the design goal of balancing computational efficiency against cache hit rate. Furthermore, AsymCache’s low-level design allows it to be integrated into agent serving platforms such as Continuum, where it reduces average job latency by up to a further 18.1%.
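For readers who think in percentages, the short sketch below converts the reported reduction factors into the equivalent percentage drop in latency. It assumes each factor is the ratio of baseline latency to AsymCache latency. The helper name `latency_reduction_pct` is made up for this example.

```python
def latency_reduction_pct(factor: float) -> float:
    """Convert a reduction factor (baseline / new latency, assumed reading)
    into the percentage by which latency drops."""
    return (1.0 - 1.0 / factor) * 100.0


# Factor ranges as reported in the abstract.
reported = {"TTFT": (1.90, 2.03), "TPOT": (1.62, 1.71)}

for metric, (low, high) in reported.items():
    print(
        f"{metric}: {latency_reduction_pct(low):.1f}% to "
        f"{latency_reduction_pct(high):.1f}% lower latency"
    )
```

Under this reading, a factor of 1.90 corresponds to roughly 47% lower TTFT, and a factor of 1.62 to roughly 38% lower TPOT.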
Source: arXiv Generated at: 2026-06-03 00:00:00 UTC



