CLIP Tricks You: Training-free Token Pruning for Efficient Pixel Grounding in Large Vision-Language Models
Title: LiteLVLM: A Training-Free Approach to Efficient Pixel Grounding via Text-Guided Token Pruning in Large Vision-Language Models
Abstract: In large vision-language models (LVLMs), visual tokens typically make up the bulk of the input, which imposes a significant computational burden. Recent work prunes redundant or less informative visual tokens to speed up image understanding, but these approaches often falter on pixel grounding tasks, largely because token relevance in grounding depends heavily on the specific input text. Our in-depth examination of CLIP reveals a counterintuitive phenomenon: visual tokens located within referent regions frequently show low similarity to their corresponding textual descriptions. Leveraging this finding, we propose LiteLVLM, a training-free, text-guided token pruning strategy for efficient pixel grounding inference. LiteLVLM inverts the standard ranking of CLIP's visual-text similarity scores, so that visual tokens covering referent regions are preserved, and it additionally recovers context tokens to support clear foreground-background differentiation. Comprehensive experiments indicate that LiteLVLM surpasses current state-of-the-art methods by more than 5% across various token budgets. Notably, LiteLVLM achieves a 22% speedup and a 2.3-fold reduction in memory usage while retaining 90% of the original model's performance, all without any training or fine-tuning. The code for LiteLVLM is available at https://github.com/sejong-rcv/LiteLVLM.
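The inverted-ranking idea described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `inverse_similarity_prune`, the `context_ratio` parameter, and the even-sampling rule used to recover context tokens are all assumptions made for illustration, since the abstract does not specify how context tokens are chosen. The sketch also assumes the visual tokens and the text embedding already live in a shared CLIP-style embedding space and are non-zero.

```python
import numpy as np


def inverse_similarity_prune(visual_tokens, text_embedding, budget, context_ratio=0.25):
    """Return sorted indices of the visual tokens to keep (illustrative sketch).

    visual_tokens:  (N, D) array of patch embeddings (assumed shared space).
    text_embedding: (D,) array for the referring expression.
    budget:         total number of tokens to keep.
    context_ratio:  hypothetical fraction of the budget spent on context tokens.
    """
    n = visual_tokens.shape[0]
    budget = min(budget, n)

    # Cosine similarity between every visual token and the text embedding.
    v = visual_tokens / np.linalg.norm(visual_tokens, axis=1, keepdims=True)
    t = text_embedding / np.linalg.norm(text_embedding)
    sim = v @ t

    # Inverted ranking: sort ascending so the LEAST similar tokens come first.
    order = np.argsort(sim, kind="stable")
    n_context = int(round(budget * context_ratio))
    n_referent = budget - n_context
    referent_idx = order[:n_referent]

    # Assumed context recovery: sample evenly (by token index) from the rest.
    remaining = np.sort(order[n_referent:])
    if n_context > 0 and remaining.size > 0:
        k = min(n_context, remaining.size)
        picks = np.linspace(0, remaining.size - 1, num=k).round().astype(int)
        context_idx = remaining[np.unique(picks)]
    else:
        context_idx = np.empty(0, dtype=int)

    return np.sort(np.concatenate([referent_idx, context_idx]))
```

A conventional text-guided pruner would keep the highest-similarity tokens instead, which corresponds to sorting in descending order; the only change here is the sort direction plus the assumed context step. The returned indices would then be used to select the subset of visual tokens passed on to the language model.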
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC





