VistaHop: Benchmarking Multi-hop Visual Reasoning for Visual DeepSearch
Title: VistaHop: A Benchmark for Evaluating Multi-hop Visual Reasoning in Visual DeepSearch
Abstract: Visual DeepSearch requires Multimodal Large Reasoning Model (MLRM) agents to answer complex visual questions by iteratively inspecting specific image regions, grounding intermediate reasoning steps in visual evidence, and connecting fine-grained details across long reasoning chains. However, existing benchmarks focus mainly on single-step visual understanding or static image-question pairs, and they offer limited evaluation of iterative inspection, visual-anchor grounding, and multi-hop evidence integration. To address this gap, we present VistaHop, a benchmark for evaluating vision-centric search and multi-hop visual reasoning in Visual DeepSearch. The dataset comprises 300 high-resolution images, 25 distinct visual search scenarios, and 350 multi-hop QA tasks that require models either to follow evidence chains starting from visual anchors or to combine information from multiple image-grounded reasoning paths. We also introduce VistaArena, a comprehensive evaluation framework that supports tool-augmented reasoning with text search, image search, image cropping, and evidence-based answer verification. Experiments on seven representative MLRMs show that existing models perform poorly on VistaHop: the best-performing model, SenseNova-MARS-32B, achieves only 24.31% Pass@1. These results highlight persistent challenges in visual grounding, evidence re-examination, long-chain reasoning, and multi-anchor information fusion, and they underscore the need for more robust benchmarks and training strategies for Visual DeepSearch.
Source: arXiv
Generated at: 2026-06-03 00:00:00 UTC



