arXiv

LookWise: Knowing When and Where to Look for Fine-Grained Visual Reasoning in Multimodal Large Language Models

June 2, 2026 · Yuxiang Shen, Hailong Huang, Zhenkun Gao, Xueheng Li, Man Zhou, Chengjun Xie, Haoxuan Che, Xuanhua He, Jie Zhang

Title: LookWise: Optimizing the Timing and Location for Fine-Grained Visual Reasoning in Multimodal Large Language Models

Abstract:

Multimodal Large Language Models (MLLMs) are increasingly adopting a "Thinking with Images" approach, in which the model actively scrutinizes image details. Although this strategy is effective, the computational burden of large-scale training has driven demand for lightweight, training-free alternatives. Current training-free methods, however, face two significant limitations. First, indiscriminate cropping causes perceptual redundancy, which raises computational cost and injects noise. Second, semantic intent and spatial attention are disconnected, which hinders precise localization of the regions the user is asking about.

To overcome these obstacles, we introduce LookWise, a framework for adaptive visual reasoning. LookWise uses a two-stage pipeline: first, a confidence-based module determines when a closer look is necessary; second, a semantic-guided localization module identifies where to focus. This design lets MLLMs adaptively gather fine-grained visual evidence without additional training. Experiments on high-resolution and fine-grained visual reasoning benchmarks show that LookWise consistently outperforms strong baselines in accuracy. It also achieves an inference speedup of approximately $4.0\times$ over the search-based ZoomEye method, and it generalizes robustly across models.
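The two-stage "when, then where" control flow described above can be illustrated with a minimal Python sketch. The abstract does not specify how confidence is measured or how regions are scored, so everything here is an assumption made for illustration: the maximum softmax probability stands in for the confidence signal, a precomputed patch-level relevance grid stands in for the semantic-guided localization, and the names `needs_closer_look`, `locate_region`, `plan_look`, and the `threshold` value are hypothetical.

```python
import math
from typing import Optional, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def softmax(logits: Sequence[float]) -> list:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def needs_closer_look(answer_logits: Sequence[float], threshold: float = 0.7) -> bool:
    """Stage 1 ("when"): request a closer look if the answer is low-confidence.

    Assumption: confidence is the maximum softmax probability over candidate
    answers; the source does not state the actual confidence measure.
    """
    return max(softmax(answer_logits)) < threshold


def locate_region(relevance: Sequence[Sequence[float]], patch_size: int) -> Box:
    """Stage 2 ("where"): return the pixel box of the most relevant patch.

    Assumption: `relevance` is a rows x cols grid of query-to-patch scores
    produced elsewhere (for example from attention maps).
    """
    cells = ((r, c) for r in range(len(relevance)) for c in range(len(relevance[0])))
    row, col = max(cells, key=lambda rc: relevance[rc[0]][rc[1]])
    left, top = col * patch_size, row * patch_size
    return (left, top, left + patch_size, top + patch_size)


def plan_look(
    answer_logits: Sequence[float],
    relevance: Sequence[Sequence[float]],
    patch_size: int,
    threshold: float = 0.7,
) -> Optional[Box]:
    """Return a crop box only when stage 1 fires; otherwise None (answer directly)."""
    if not needs_closer_look(answer_logits, threshold):
        return None
    return locate_region(relevance, patch_size)
```

In this sketch a confident first-pass answer skips cropping entirely, which is the intuition behind avoiding perceptual redundancy; only uncertain answers pay for a second, zoomed-in pass over a single selected region.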
