Improving Visual Token Reduction via Rectifying Distortions for Efficient Multimodal LLM Inference
Title: Enhancing Visual Token Reduction by Correcting Distortions for Efficient Multimodal LLM Inference
Abstract:
Despite the impressive progress Multimodal Large Language Models (MLLMs) have made in vision-language applications, the sheer volume of visual tokens creates substantial memory and latency costs, because the computational cost of self-attention grows quadratically with sequence length. Although visual token reduction (VTR) techniques have been developed to alleviate these costs, current approaches often fail to preserve positional and attentional alignment between the original and compressed sequences, leading to distorted representations. To address this issue, we introduce RESTORE, a new VTR framework designed to correct these distortions without sacrificing efficiency. Our approach features a simple yet effective calibration technique that recovers diminished visual attention by adjusting attention weights according to relative distances. Additionally, we propose an anchor selection strategy for token merging that minimizes information loss during feature averaging. Experiments across various benchmarks show that our method consistently improves the accuracy of existing reduction techniques, delivering state-of-the-art results while preserving computational efficiency.
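The quadratic-cost argument in the abstract can be made concrete with a short back-of-the-envelope calculation. The sketch below is illustrative only and is not part of RESTORE: the function names and the token counts (64 text tokens, 576 visual tokens reduced to 64) are assumptions chosen for the example. It counts query-key pairs for full self-attention and ignores causal masking and the feed-forward layers, whose cost grows only linearly with sequence length.

```python
def attention_pairs(num_text_tokens: int, num_visual_tokens: int) -> int:
    """Query-key pairs scored by full self-attention over one sequence.

    Hypothetical helper for illustration; it counts pairs, not FLOPs.
    """
    seq_len = num_text_tokens + num_visual_tokens
    return seq_len * seq_len


def pair_reduction_factor(num_text_tokens: int,
                          visual_before: int,
                          visual_after: int) -> float:
    """How many times fewer pairs are scored after visual token reduction."""
    before = attention_pairs(num_text_tokens, visual_before)
    after = attention_pairs(num_text_tokens, visual_after)
    return before / after


if __name__ == "__main__":
    # Assumed example sizes: 64 text tokens, 576 visual tokens kept at 64.
    print(attention_pairs(64, 576))            # (64 + 576)^2 pairs
    print(attention_pairs(64, 64))             # (64 + 64)^2 pairs
    print(pair_reduction_factor(64, 576, 64))  # ratio of the two counts
```

Under these assumed sizes, keeping one ninth of the visual tokens shrinks the sequence fivefold (640 to 128 tokens) and the attention pair count 25-fold, which is why even moderate reduction ratios can matter for latency and memory.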
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC





