LookWise: Knowing When and Where to Look for Fine-Grained Visual Reasoning in Multimodal Large Language Models
Title: LookWise: Optimizing the Timing and Location for Fine-Grained Visual Reasoning in Multimodal Large Language Models
Abstract:
Multimodal Large Language Models (MLLMs) are increasingly adopting a "Thinking with Images" approach, which involves actively scrutinizing image details. Although this strategy is effective, the computational burden of large-scale training has driven a rising demand for lightweight, training-free alternatives. However, current training-free methods face two significant limitations: they often suffer from perceptual redundancy due to indiscriminate cropping, which raises computational costs and injects noise, and they exhibit a disconnect between semantic intent and spatial attention, hindering the precise localization of regions of interest to the user.
To overcome these obstacles, we introduce LookWise, a framework designed for adaptive visual reasoning. LookWise employs a two-stage pipeline: first, a confidence-based module determines when a closer look is necessary; second, a semantic-guided localization module identifies where to focus. This architecture allows MLLMs to adaptively gather fine-grained visual evidence without requiring additional training. Our experiments on high-resolution and fine-grained visual reasoning benchmarks demonstrate that LookWise consistently outperforms strong baselines in accuracy. It also delivers an inference speedup of approximately $4.0\times$ over the search-based ZoomEye method and generalizes robustly across models.
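The "when" and "where" stages can be illustrated with a minimal sketch. The abstract does not say how confidence or semantic relevance is computed, so everything below is an assumption made for illustration: confidence is taken as the maximum softmax probability over candidate-answer logits, the relevance map is a plain 2D grid of scores, and the function names `needs_closer_look` and `best_window`, along with the threshold of 0.7, are hypothetical and not taken from LookWise.

```python
import math
from typing import List, Sequence, Tuple

# (top, left, bottom, right) in grid cells; bottom and right are exclusive.
Box = Tuple[int, int, int, int]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def needs_closer_look(answer_logits: Sequence[float], threshold: float = 0.7) -> bool:
    """'When' stage (assumed form): zoom in only if the top answer probability is low."""
    return max(softmax(answer_logits)) < threshold


def best_window(relevance: Sequence[Sequence[float]], win_h: int, win_w: int) -> Box:
    """'Where' stage (assumed form): pick the window with the highest summed relevance."""
    rows, cols = len(relevance), len(relevance[0])
    best_score = -math.inf
    best_box: Box = (0, 0, win_h, win_w)
    for top in range(rows - win_h + 1):
        for left in range(cols - win_w + 1):
            score = sum(
                relevance[r][c]
                for r in range(top, top + win_h)
                for c in range(left, left + win_w)
            )
            if score > best_score:
                best_score = score
                best_box = (top, left, top + win_h, left + win_w)
    return best_box


# Hypothetical inputs: near-uniform answer logits and a toy 3x3 relevance map.
uncertain_logits = [1.0, 1.0, 1.0]
toy_relevance = [
    [0.0, 0.0, 0.0],
    [0.0, 1.0, 2.0],
    [0.0, 3.0, 4.0],
]

if needs_closer_look(uncertain_logits):
    crop_box = best_window(toy_relevance, win_h=2, win_w=2)
else:
    crop_box = None  # The model is confident enough; no crop is needed.
```

In this sketch the gate runs first, so a confident first-pass answer skips localization and cropping entirely. That ordering is one plausible way to avoid the indiscriminate cropping described above, not a description of the authors' implementation.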
Source: arXiv Generated at: 2026-06-02 00:00:00 UTC






