arXiv

STaR-KV: Spatio-Temporal Adaptive Re-weighting for KV Cache Compression in GUI Vision-Language Models

June 2, 2026 · Yuhang Han, Wenzheng Yang, Yujie Chen, Xiangqi Jin, Yaojie Zhang, Siteng Huang, Linfeng Zhang

Abstract:

Vision-language model (VLM) agents for graphical user interfaces (GUIs) show significant potential for automation, but their practical deployment is hindered by the key-value (KV) cache, which grows linearly as interaction steps accumulate. For example, UI-TARS-1.5-7B requires 76 GB of GPU memory to process just five screenshots, nearly saturating a standard 80 GB accelerator. Existing KV compression methods generally rest on two structural premises: that visual-token importance can be consolidated into a single saliency map, and that a rigid top-B cutoff can be applied to the combined score distribution. Our preliminary measurements challenge both premises. Spatial specialization operates at the level of attention subspaces and shifts across layers, while the shape of the score distribution changes throughout the sequence.
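The linear growth mentioned above follows from the usual KV-cache size formula: every cached token stores one key and one value vector per layer and per KV head. The sketch below illustrates that relationship only. The function name `kv_cache_bytes`, the model configuration, and the token count per screenshot are hypothetical values chosen for illustration, not the UI-TARS-1.5-7B settings, and the result covers the cache alone (not weights or activations), so it is not meant to reproduce the 76 GB figure.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens, bytes_per_elem=2):
    """Bytes held by cached keys and values for num_tokens tokens."""
    # Factor 2: one key vector and one value vector per token, KV head, and layer.
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_elem


# Hypothetical configuration, chosen only for illustration.
cfg = dict(num_layers=32, num_kv_heads=8, head_dim=128)
tokens_per_screenshot = 2000  # hypothetical visual-token count per screenshot

for steps in (1, 5, 10):
    size = kv_cache_bytes(num_tokens=steps * tokens_per_screenshot, **cfg)
    print(f"{steps:2d} screenshots: {size / 2**30:.2f} GiB of KV cache")
```

Because every factor except `num_tokens` is fixed by the model, the cache scales in direct proportion to the number of retained tokens, which is why a budget expressed as a fraction of tokens maps directly to a fraction of cache memory.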

To address these limitations, we introduce STaR-KV (Spatio-Temporal Adaptive Re-weighting), a training-free framework for compressing KV caches. This method recalibrates token importance across three distinct dimensions: (i) subspace-aware scoring, which leverages online spatial mutual information; (ii) a temporal stability discount mechanism that filters out redundant entries from subspaces that remain persistently attended; and (iii) an entropy-based temperature parameter that dynamically adjusts the score distribution. Evaluated across four GUI benchmarks, STaR-KV delivers the highest average accuracy among leading KV compression techniques, such as SnapKV and GUIKV, under equivalent memory constraints. Notably, it introduces negligible computational overhead (approximately 0.07% in FLOPs) during compression and reduces peak GPU memory usage by approximately 40% when operating at a 20% KV-cache budget. The project code is accessible at https://github.com/kawhiiiileo/STaR-KV.
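To make the three re-weighting dimensions concrete, here is a minimal toy sketch of how such a scorer could be wired together. It is not the authors' implementation: the abstract does not give the formulas, so every choice below is an assumption. Attention heads stand in for "subspaces", the specialization weight is each head's contribution to the mutual information between head identity and screen region, the temporal discount uses the overlap with the previous step's region profile, and the temperature is a linear function of normalized entropy. All function names and the `gamma` parameter are hypothetical.

```python
import math


def normalize(xs):
    """Scale non-negative values to sum to 1 (uniform if they sum to zero)."""
    total = sum(xs)
    if total <= 0:
        return [1.0 / len(xs)] * len(xs)
    return [x / total for x in xs]


def entropy(p):
    """Shannon entropy in nats."""
    return -sum(x * math.log(x) for x in p if x > 0)


def kl(p, q, eps=1e-12):
    """KL divergence KL(p || q) in nats."""
    return sum(x * math.log(x / max(y, eps)) for x, y in zip(p, q) if x > 0)


def region_mass(attn, regions, num_regions):
    """Share of one head's attention that falls in each spatial region."""
    mass = [0.0] * num_regions
    for a, r in zip(attn, regions):
        mass[r] += a
    return normalize(mass)


def select_tokens(attn_per_head, regions, num_regions, prev_masses, budget, gamma=0.5):
    """Return (sorted kept token indices, per-head region masses for the next step)."""
    num_tokens = len(regions)
    heads = [normalize(a) for a in attn_per_head]
    masses = [region_mass(a, regions, num_regions) for a in heads]
    # Region distribution averaged over heads (heads treated as equally likely).
    marginal = normalize([sum(m[r] for m in masses) for r in range(num_regions)])
    combined = [0.0] * num_tokens
    for h, (a, m) in enumerate(zip(heads, masses)):
        # (i) Assumed specialization weight: this head's term in I(head; region).
        weight = 1e-6 + kl(m, marginal)
        # (ii) Assumed temporal discount: shrink heads whose region profile repeats.
        if prev_masses is not None:
            overlap = sum(min(x, y) for x, y in zip(m, prev_masses[h]))  # in [0, 1]
            weight *= 1.0 - gamma * overlap
        # (iii) Assumed entropy-based temperature: flatter scores get sharpened more.
        norm_ent = entropy(a) / math.log(num_tokens) if num_tokens > 1 else 0.0
        tau = 1.0 - 0.5 * norm_ent  # in [0.5, 1]
        reshaped = normalize([x ** (1.0 / tau) for x in a])
        for t in range(num_tokens):
            combined[t] += weight * reshaped[t]
    keep = sorted(range(num_tokens), key=lambda t: combined[t], reverse=True)[:budget]
    return sorted(keep), masses


# Toy example: 4 tokens in 2 regions, 2 heads that focus on opposite regions.
attn = [[0.7, 0.1, 0.1, 0.1], [0.1, 0.1, 0.1, 0.7]]
regions = [0, 0, 1, 1]
keep, masses = select_tokens(attn, regions, 2, None, budget=2)
print(keep)
```

In this toy setup the temperature is applied per head before aggregation, since a monotonic rescaling applied after aggregation would leave the top-B ranking unchanged. Passing the returned `masses` back as `prev_masses` on the next step lets the discount act on heads whose spatial focus has not moved.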

Source: arXiv