CARES: Context-Aware Resolution Selector for VLMs
Abstract:
To maintain effectiveness across a wide range of tasks, Large Vision-Language Models (VLMs) typically process images at their native or high resolutions. However, this practice often causes visual tokens to constitute 97–99% of the total token count, leading to significant computational costs and latency, even in scenarios where lower-resolution inputs would be adequate. To address this inefficiency, we present CARES (Context-Aware Resolution Selector), a lightweight preprocessing module designed to identify the minimal sufficient input resolution for a given image-query pair.
CARES uses a compact VLM (350M parameters) to extract features and predict the resolution at which the response of a target pretrained VLM converges to its maximum accuracy. Although CARES is trained as a discrete classifier over a fixed set of resolutions, it interpolates continuous resolutions at inference time, allowing fine-grained control. Evaluated on five multimodal benchmarks covering both documents and natural images, and with several target VLMs, CARES maintains task performance while reducing compute by up to 80%.
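One way to picture how a discrete classifier can yield a continuous resolution at inference is to take the probability-weighted average of the candidate resolutions. The sketch below is only an illustration under that assumption: the abstract does not state the interpolation rule, and the candidate resolutions, logits, and function names here are hypothetical rather than taken from CARES.

```python
from math import exp


def softmax(logits):
    """Convert raw classifier scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def interpolate_resolution(logits, resolutions):
    """Return a continuous resolution as the expected value over classes.

    Assumption: interpolation is a probability-weighted average of the
    discrete candidate resolutions. The actual CARES rule may differ.
    """
    if len(logits) != len(resolutions):
        raise ValueError("need exactly one logit per candidate resolution")
    probs = softmax(logits)
    return sum(p * r for p, r in zip(probs, resolutions))


# Hypothetical candidate resolutions (longest image side, in pixels).
CANDIDATES = [384, 768, 1024]

# A confident prediction stays close to one candidate resolution.
confident = interpolate_resolution([8.0, 0.0, 0.0], CANDIDATES)

# An uncertain prediction lands between the candidates.
uncertain = interpolate_resolution([0.0, 0.0, 0.0], CANDIDATES)
```

Under this assumed rule, the output always lies between the smallest and largest candidate, and a classifier that is unsure between two neighbouring classes produces an intermediate resolution instead of being forced to pick one of them.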
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC




