arXiv

CARES: Context-Aware Resolution Selector for VLMs

June 2, 2026 · Moshe Kimhi, Nimrod Shabtay, Raja Giryes, Chaim Baskin, Eli Schwartz

Abstract:

To maintain effectiveness across a wide range of tasks, Large Vision-Language Models (VLMs) typically process images at their native or high resolutions. However, this practice often causes visual tokens to constitute 97–99% of the total token count, leading to significant computational costs and latency, even in scenarios where lower-resolution inputs would be adequate. To address this inefficiency, we present CARES (Context-Aware Resolution Selector), a lightweight preprocessing module designed to identify the minimal sufficient input resolution for a given image-query pair.
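To make the token-share claim concrete, the following back-of-the-envelope sketch estimates what fraction of a prompt is visual tokens at different input resolutions. The patch size of 14, the one-token-per-patch rule, and the 64-token text query are illustrative assumptions, not values taken from the paper; real VLMs often merge or compress patches.

```python
import math


def visual_token_share(width, height, patch=14, text_tokens=64):
    """Fraction of the prompt occupied by visual tokens.

    Assumes one token per patch x patch tile and no token merging.
    `patch` and `text_tokens` are hypothetical, chosen for illustration.
    """
    visual = math.ceil(width / patch) * math.ceil(height / patch)
    return visual / (visual + text_tokens)


if __name__ == "__main__":
    # Compare a high-resolution input with two downscaled versions.
    for side in (1344, 672, 336):
        share = visual_token_share(side, side)
        print(f"{side}x{side}: {share:.1%} of tokens are visual")
```

Under these assumptions the visual token count falls roughly quadratically with the image side length, which is why choosing a lower resolution when it suffices can save so much compute.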

CARES uses a compact VLM (350M parameters) to extract features and predict the point at which the response of a target pretrained VLM reaches its maximum accuracy. Although it is trained as a discrete classifier over a fixed set of resolutions, CARES interpolates to continuous resolutions at inference, allowing fine-grained control. Evaluated on five multimodal benchmarks spanning documents and natural images, and across several target VLMs, CARES preserves task performance while reducing compute by up to 80%.
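One plausible way to turn a discrete classifier's output into a continuous resolution is to take the probability-weighted average of the candidate resolutions. The sketch below illustrates that idea only; the abstract does not specify the interpolation rule, so the expectation-based scheme, the candidate set `RESOLUTIONS`, and the rounding to a multiple of 32 are all assumptions made for illustration.

```python
import math

# Hypothetical candidate resolutions (longest side, in pixels); not from the paper.
RESOLUTIONS = (384, 768, 1024)


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def select_resolution(logits, resolutions=RESOLUTIONS, multiple=32):
    """Interpolate a continuous resolution from discrete class logits.

    Assumed scheme: expected resolution under the softmax distribution,
    rounded up to a multiple of `multiple` and clamped to the candidate range.
    """
    probs = softmax(logits)
    expected = sum(p * r for p, r in zip(probs, resolutions))
    snapped = math.ceil(expected / multiple) * multiple
    return min(max(snapped, resolutions[0]), resolutions[-1])


if __name__ == "__main__":
    # A confident prediction returns a candidate resolution as-is;
    # an uncertain one lands between candidates.
    print(select_resolution([1000.0, 0.0, 0.0]))
    print(select_resolution([0.0, 0.0, 0.0]))
```

Rounding up rather than to the nearest multiple is a deliberately conservative choice in this sketch: when the selector is uncertain, it errs toward a slightly higher resolution instead of risking an accuracy drop.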
