TGV-KV: Text-Grounded KV Eviction for Vision-Language Models
Abstract: Vision-Language Models (VLMs) typically employ an auto-regressive generation framework, caching the keys and values (KV) of all preceding tokens to speed up inference. However, this practice causes memory usage to grow linearly with context length. This problem is especially acute in VLMs because the visual modality contains significant redundancy. While KV cache eviction techniques can lower memory demands, they frequently lead to notable performance drops in VLMs. This occurs because most existing eviction strategies are tailored for language models and fail to account for the fundamental disparity between text and vision. In this study, we systematically examine the modality gap within VLMs, positing that the significance of visual data should be evaluated through textual guidance. Based on this insight, we introduce TGV-KV, a Text-Grounded KV Eviction method designed for VLMs. TGV-KV integrates three distinct components: (1) Text-Vision Budgeting (TVB), which distributes resources to each layer according to mutual information interactions; (2) Text-Weighted Ranking (TWR), which determines the priority of text and ranks visual importance using weighted text-image attention; and (3) Text-Prioritised Retention (TPR), a strategy that safeguards text KV to prevent severe information loss. We tested TGV-KV on five models of varying sizes and architectures. The results demonstrate that TGV-KV maintains 99.2% of full-KV accuracy on the VizWiz-VQA task when using LLaVA-NeXT, and increases end-to-end throughput by 52.6% under an extreme retention budget of 5%. The implementation is accessible at https://github.com/Danielement321/TGV-KV.
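The ranking and retention steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `select_kv_to_keep`, the per-query `text_weights`, and the rule that all text entries are kept even when they exceed the budget are assumptions made for illustration, and the per-layer budgeting step (TVB) is not modelled. It only shows the general idea of scoring visual tokens by text-weighted attention while protecting text KV.

```python
from typing import List, Sequence


def select_kv_to_keep(
    attn: Sequence[Sequence[float]],
    is_text: Sequence[bool],
    text_weights: Sequence[float],
    budget_ratio: float,
) -> List[int]:
    """Return sorted indices of cached tokens to retain (illustrative only).

    attn[i][j]      attention from the i-th text query to cached token j
    is_text[j]      True if cached token j is a text token
    text_weights[i] assumed importance weight of the i-th text query
    budget_ratio    fraction of the cache to retain, in (0, 1]
    """
    n = len(is_text)
    budget = max(1, int(budget_ratio * n))
    text_idx = [j for j in range(n) if is_text[j]]
    visual_idx = [j for j in range(n) if not is_text[j]]

    # Score each visual token by the weighted sum of text-to-image attention.
    scores = {
        j: sum(w * row[j] for w, row in zip(text_weights, attn))
        for j in visual_idx
    }

    # Assumption: text KV is always kept; visual tokens fill what is left.
    n_visual = max(0, budget - len(text_idx))
    top_visual = sorted(visual_idx, key=lambda j: scores[j], reverse=True)[:n_visual]
    return sorted(text_idx + top_visual)


# Toy example: 4 visual tokens (0-3) followed by 2 text tokens (4-5).
attn = [
    [0.1, 0.4, 0.2, 0.1, 0.1, 0.1],
    [0.5, 0.0, 0.1, 0.2, 0.1, 0.1],
]
is_text = [False, False, False, False, True, True]
kept = select_kv_to_keep(attn, is_text, [0.75, 0.25], 0.5)
print(kept)  # [1, 4, 5]
```

With a 50% budget over six cached tokens, both text entries are retained and the single remaining slot goes to visual token 1, which has the highest text-weighted attention score. In this sketch, a very small budget degenerates to keeping text only.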
Source: arXiv
Generated at: 2026-06-03 00:00:00 UTC





