arXiv

ETC: Extreme Token Compression via Task-aware Visual Information Distillation in VLMs

Original (arXiv:2606.00543v1, announce type: new): In Vision-Language Models (VLMs), high-resolution images produce a large number of visual tokens, resulting in high computational costs and KV-cache overhead during inference. To address this problem, we propose an Extreme Token Compression (ETC) framework that minimizes task loss when reducing the number of input tokens based on the principle of variational information distillation. Specifically, from an information-theoretic perspective, we show that minimizing task loss requires the compact representation to preserve the instruction-aware sufficient statistic of the task-relevant visual information for prediction. In practice, ETC leverages text-to-image cross-attention to weight the original visual features to approximate the latent instruction-aware predictive statistic. Moreover, ETC introduces a variational information distillation, enabling the compact representation to preserve the essential information to recover this predictive statistic. Experiments on LLaVA-1.5-7B and Qwen3-VL-2B show that ETC remains effective even under single-token compression, substantially reducing KV-cache overhead while retaining strong task performance.

Rewrite: In Vision-Language Models (VLMs), high-resolution images produce a large number of visual tokens, which drives up computational cost and KV-cache overhead during inference. To address this, we introduce Extreme Token Compression (ETC), a framework that minimizes task loss while reducing the number of input tokens, based on the principle of variational information distillation. From an information-theoretic perspective, we show that minimizing task loss requires the compact representation to preserve the instruction-aware sufficient statistic of the task-relevant visual information. In practice, ETC uses text-to-image cross-attention to weight the original visual features, approximating the latent instruction-aware predictive statistic. ETC also applies variational information distillation so that the compact representation preserves the information needed to recover this predictive statistic. Experiments on LLaVA-1.5-7B and Qwen3-VL-2B show that ETC remains effective even under single-token compression, substantially reducing KV-cache overhead while retaining strong task performance.
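
The cross-attention weighting described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name compress_to_single_token, the use of a single text query vector, the scaled dot-product scoring, and the toy 2-dimensional features are all assumptions made for illustration. It shows only the general idea of pooling many visual tokens into one vector using instruction-dependent attention weights, and it omits the variational information distillation part entirely.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def compress_to_single_token(text_query, visual_tokens):
    """Pool N visual tokens into one vector (hypothetical helper).

    Each visual token is scored against the text query with a scaled
    dot product; the softmax of those scores gives the pooling weights.
    """
    d = len(text_query)
    scores = [dot(text_query, v) / math.sqrt(d) for v in visual_tokens]
    weights = softmax(scores)
    pooled = [
        sum(w * v[i] for w, v in zip(weights, visual_tokens))
        for i in range(d)
    ]
    return pooled, weights


# Toy inputs (assumed): three visual tokens and one instruction query.
visual_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_query = [2.0, 0.0]

pooled, weights = compress_to_single_token(text_query, visual_tokens)
print("weights:", weights)
print("pooled token:", pooled)
```

In this toy case the query points along the first feature axis, so the first and third visual tokens receive equal, larger weights than the second, and the pooled vector leans toward the first axis. Replacing N visual tokens with one pooled vector is what would shrink the visual portion of the KV cache in a real model.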


Source: arXiv Generated at: 2026-06-02 00:00:00 UTC
