EvoCut: Multi-Layer Evolution-Aware Visual Token Compression for Efficient Large Vision-Language Models
Abstract:
While Large Vision-Language Models (LVLMs) deliver robust results in image and video comprehension, their inference speed is often hampered by the excessive volume of visual tokens generated by vision encoders. Current visual token compression techniques typically assess token significance based on attention scores or representation characteristics within individual layers. This approach neglects the dynamic evolution of visual tokens throughout the vision encoder, potentially leading to incomplete importance assessments and compromised performance retention post-compression.
To resolve this limitation, we investigate the layer-wise evolution trajectories of visual tokens, discovering that tokens organize into distinct group evolution directions across the encoder layers. Furthermore, our analysis reveals that tokens carrying high information content consistently deviate from these collective group evolution paths. Leveraging this insight, we introduce EvoCut, a novel, training-free, and attention-free method for visual token compression. EvoCut determines token importance by measuring deviations across multiple layers. Empirical evaluations demonstrate the method's efficacy: when applied to LLaVA-1.5-7B, EvoCut retains just 11.1% of the original visual tokens while maintaining 94.4% of the average performance, thereby effectively balancing computational efficiency with model accuracy.
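The abstract does not give the scoring formula, so the following is only a minimal sketch of how a multi-layer deviation score could be computed, not the authors' implementation. It makes several assumptions that do not come from the source: each layer has a single token group whose evolution direction is the mean per-layer displacement, deviation is measured as one minus cosine similarity to that direction, and per-layer deviations are simply summed. The function names `deviation_scores` and `select_tokens` are hypothetical.

```python
import numpy as np


def deviation_scores(hidden_states: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Score tokens by how far their layer-to-layer movement departs from the group.

    hidden_states has shape (num_layers + 1, num_tokens, dim): the token
    representations at the input and after each encoder layer.
    Returns one score per token; higher means more deviation.
    """
    # Displacement of every token between consecutive layers: (L, N, D).
    steps = hidden_states[1:] - hidden_states[:-1]
    # ASSUMPTION: one group per layer, with the mean displacement as its direction.
    group_dir = steps.mean(axis=1, keepdims=True)  # (L, 1, D)
    # Cosine similarity between each token's displacement and the group direction.
    dot = (steps * group_dir).sum(axis=-1)  # (L, N)
    norms = np.linalg.norm(steps, axis=-1) * np.linalg.norm(group_dir, axis=-1)
    cosine = dot / (norms + eps)
    # ASSUMPTION: deviation is 1 - cosine, accumulated over all layers.
    return (1.0 - cosine).sum(axis=0)


def select_tokens(hidden_states: np.ndarray, keep_ratio: float) -> np.ndarray:
    """Return the sorted indices of the tokens with the largest deviation scores."""
    scores = deviation_scores(hidden_states)
    num_keep = max(1, round(keep_ratio * scores.shape[0]))
    top = np.argsort(-scores, kind="stable")[:num_keep]
    return np.sort(top)  # restore original token order


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Random stand-in for encoder outputs: 4 layers, 576 tokens, 32 dimensions.
    states = rng.normal(size=(5, 576, 32))
    kept = select_tokens(states, keep_ratio=0.111)
    print(len(kept))
```

The sketch uses no attention maps and no learned parameters, which matches the training-free and attention-free framing above. A keep ratio of 0.111 applied to 576 tokens (the usual visual token count for LLaVA-1.5, a figure not stated in the abstract) keeps 64 tokens.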
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC





