AdaCodec: A Predictive Visual Code for Video MLLMs
Abstract:
Video data is characterized by significant temporal redundancy, as neighboring frames typically retain the majority of their objects, backgrounds, and structural layouts. However, current video multimodal large language models (video MLLMs) generally process each sampled frame as an isolated RGB image, leading to the repetition of visual content already captured in previous frames. To address this inefficiency, we propose a more streamlined video interface: transmit a complete reference frame only when the current scene cannot be accurately predicted from prior context; otherwise, send a concise description of the changes between frames. We term this approach a predictive visual code and implement it for video MLLMs through AdaCodec. AdaCodec allocates full visual tokens to a reference frame solely when the conditional predictive cost is elevated; in other instances, it compresses inter-frame variations, such as motion and prediction residuals, into compact P-tokens. Evaluated across eleven benchmarks, AdaCodec outperforms the Qwen3-VL-8B per-frame RGB baseline while maintaining an equivalent visual-token budget. Notably, with just 32k tokens (only $1/7$ of the baseline budget), AdaCodec exceeds the performance of the 224k-token baseline on all long-video benchmarks. Furthermore, on five general-video benchmarks, it boosts average scores and significantly reduces time-to-first-token, dropping from 9.26 seconds to 1.62 seconds.
Source: arXiv Generated at: 2026-06-02 00:00:00 UTC
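
The allocation rule described in the abstract (a full reference frame when the scene is hard to predict, a compact change description otherwise) can be illustrated with a toy sketch. This is not AdaCodec's implementation: the mean-absolute-difference cost, the threshold, the function names, and the per-frame token counts (256 and 16) are all assumptions chosen for illustration.

```python
from typing import List, Optional, Sequence

# A flattened grayscale frame; a toy stand-in for real pixel data.
Frame = Sequence[float]


def predictive_cost(reference: Frame, current: Frame) -> float:
    """Mean absolute difference, a hypothetical proxy for the
    conditional predictive cost mentioned in the abstract."""
    return sum(abs(a - b) for a, b in zip(reference, current)) / len(current)


def assign_frame_types(frames: List[Frame], threshold: float) -> List[str]:
    """Label each frame 'REF' (full visual tokens) or 'P' (compact change tokens)."""
    labels: List[str] = []
    reference: Optional[Frame] = None
    for frame in frames:
        if reference is None or predictive_cost(reference, frame) > threshold:
            labels.append("REF")
            reference = frame  # later frames are predicted from this one
        else:
            labels.append("P")
    return labels


def token_budget(labels: List[str], ref_tokens: int, p_tokens: int) -> int:
    """Total visual tokens under assumed per-frame costs."""
    return sum(ref_tokens if label == "REF" else p_tokens for label in labels)


frames = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.1, 0.0, 0.0],  # small change relative to the reference
    [0.0, 0.1, 0.1, 0.0],  # still close to the reference
    [1.0, 1.0, 1.0, 1.0],  # scene cut: prediction from the reference fails
    [1.0, 0.9, 1.0, 1.0],  # small change relative to the new reference
]
labels = assign_frame_types(frames, threshold=0.25)
adaptive = token_budget(labels, ref_tokens=256, p_tokens=16)
per_frame = token_budget(["REF"] * len(frames), ref_tokens=256, p_tokens=16)
print(labels, adaptive, per_frame)
```

In this toy setting only the first frame and the scene cut receive full tokens, so the adaptive budget is well below the per-frame budget; the real system's cost measure and P-token encoding are presumably learned and considerably more involved.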




