On the Limits of Token Reduction for Efficient Unified Vision Language Training
Title: Investigating the Boundaries of Token Reduction for Efficient Unified Vision-Language Training
Abstract:
Unified vision-language models (VLMs) combine visual comprehension and generation in a single autoregressive framework. However, jointly training these components is resource-intensive, and its computational efficiency has received little attention. This study examines the potential and the constraints of token reduction as a means of accelerating unified VLM training.
Through a systematic analysis of how attention is allocated across layers, we identify a fundamental asymmetry: visual understanding tasks display significant redundancy in visual information in later layers, whereas visual generation tasks rely on image tokens continuously throughout the network depth. Leveraging this insight, we develop task-specific accelerators that selectively reduce the computational load of image tokens for each objective.
Although these approaches deliver substantial efficiency improvements in isolated contexts, our experiments reveal a persistent loss of synergy during unified training. Specifically, the reduction of tokens for individual tasks forces parameters down divergent pathways, thereby negating the performance benefits usually gained through joint optimization. These results indicate that achieving efficiency in unified modeling depends on maintaining shared structures across tasks, underscoring the necessity for acceleration techniques that are aware of task synergies.
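The layer-wise attention analysis and image-token reduction described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `image_attention_share` and `keep_top_image_tokens`, the tensor layout, and the top-k selection rule are all assumptions made for illustration. The sketch assumes attention weights are already extracted as row-normalized arrays and that a boolean mask marks which sequence positions hold image tokens.

```python
import numpy as np

def image_attention_share(attn, image_mask):
    """Per-layer share of attention that non-image queries place on image keys.

    attn: array of shape (layers, heads, seq, seq), each row summing to 1 (assumed).
    image_mask: boolean array of shape (seq,), True at image-token positions.
    Returns an array of shape (layers,).
    """
    attn = np.asarray(attn, dtype=float)
    image_mask = np.asarray(image_mask, dtype=bool)
    # Attention mass each query assigns to image keys: (layers, heads, seq).
    mass_on_image = attn[..., image_mask].sum(axis=-1)
    # Average over heads and over non-image query positions only.
    return mass_on_image[..., ~image_mask].mean(axis=(1, 2))

def keep_top_image_tokens(attn_layer, image_mask, keep_ratio):
    """Hypothetical pruning rule: keep the most-attended image tokens.

    attn_layer: array of shape (heads, seq, seq) for a single layer.
    Returns sorted indices of all non-image tokens plus the retained image tokens.
    """
    attn_layer = np.asarray(attn_layer, dtype=float)
    image_mask = np.asarray(image_mask, dtype=bool)
    image_idx = np.flatnonzero(image_mask)
    # Score each image token by the attention it receives from non-image
    # queries, averaged over heads and queries.
    scores = attn_layer[:, ~image_mask, :][:, :, image_mask].mean(axis=(0, 1))
    n_keep = max(1, int(round(keep_ratio * image_idx.size)))
    kept_image = image_idx[np.argsort(scores)[::-1][:n_keep]]
    return np.sort(np.concatenate([np.flatnonzero(~image_mask), kept_image]))
```

Under these assumptions, a share that falls toward zero in later layers would be consistent with the redundancy reported for understanding tasks, while a share that stays high across depth would be consistent with the reliance reported for generation tasks. The pruning rule shown is only one simple choice among many.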
Project page: https://chicychen.github.io/TokenReductionUnifiedVLM/.
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC




