arXiv

FindIt: A Format-Informed Visual Detection Benchmark for Generalist Multimodal LLMs

June 4, 2026 · Eshika Khandelwal, Jingjing Pan, Mingfang Zhang, Quan Kong, Lorenzo Garattoni, Hilde Kuehne

Abstract:

Multimodal Large Language Models (MLLMs) have traditionally been evaluated on free-form vision-language tasks such as visual question answering, image captioning, and summarization, but their real-world use is increasingly shifting toward structured computer vision settings. In these settings, users prompt models to perform localization-centric tasks such as object detection, often as part of larger agentic or decision-making pipelines. Despite this growing demand, a standardized, large-scale benchmark that systematically evaluates these capabilities has been lacking.

To address this gap, we present FindIt, the first comprehensive benchmark for evaluating the promptable localization skills of generalist MLLMs. The benchmark covers four task categories: object detection, referring expression detection, instance-level detection, and video-based detection. To keep evaluation consistent and fair, we establish a unified framework that standardizes input data, requires parsable bounding-box outputs, and applies transparent evaluation protocols across all tasks.
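As an illustration of what "parsable bounding-box outputs" could mean in practice, the following minimal sketch parses boxes from a model response and scores one against a ground-truth box with intersection-over-union (IoU). The output format (a JSON list of `[x1, y1, x2, y2]` lists), the function names `parse_boxes` and `iou`, and the choice of IoU as the matching score are assumptions made for this example; the abstract does not specify FindIt's actual format or metrics.

```python
import json
import re
from typing import List, Optional, Tuple

# Hypothetical box convention: (x1, y1, x2, y2) in pixels, top-left origin.
Box = Tuple[float, float, float, float]


def parse_boxes(response: str) -> Optional[List[Box]]:
    """Extract a JSON list of [x1, y1, x2, y2] boxes from a model response.

    Returns None when the response cannot be parsed under this assumed format,
    so a caller can count format failures separately from localization errors.
    """
    # Take the widest bracketed span; surrounding free text is tolerated.
    match = re.search(r"\[.*\]", response, flags=re.DOTALL)
    if match is None:
        return None
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    if not isinstance(data, list):
        return None

    boxes: List[Box] = []
    for item in data:
        if not (isinstance(item, list) and len(item) == 4):
            return None
        if not all(
            isinstance(v, (int, float)) and not isinstance(v, bool) for v in item
        ):
            return None
        x1, y1, x2, y2 = (float(v) for v in item)
        if x2 <= x1 or y2 <= y1:  # reject degenerate or inverted boxes
            return None
        boxes.append((x1, y1, x2, y2))
    return boxes


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    predicted = parse_boxes("Detected: [[10, 20, 110, 220]]")
    ground_truth = (12.0, 18.0, 108.0, 225.0)
    if predicted is None:
        print("unparsable response")
    else:
        print(f"IoU = {iou(predicted[0], ground_truth):.3f}")
```

Returning `None` for unparsable responses, rather than an empty list, keeps "the model found nothing" distinct from "the model did not follow the format".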

Using this suite, we analyze a wide range of open-source and proprietary MLLMs. Beyond accuracy, we also examine how well models comply with output format specifications. We find that current systems are highly sensitive to formatting constraints and frequently fail to generalize even under minor variations in the required format. Our results delineate the strengths and weaknesses of state-of-the-art MLLMs in localization scenarios and offer insights for future multimodal model design and evaluation methodology.
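One way to separate format compliance from localization accuracy is to compute, for each requested output format, the fraction of responses that match it exactly. The sketch below assumes two hypothetical formats (a bracketed comma-separated list and a `<box>` tag with space-separated values) and hypothetical helper names; none of these are taken from the FindIt protocol.

```python
import re
from typing import Dict, List, Pattern

# A signed integer or decimal number.
NUM = r"-?\d+(?:\.\d+)?"

# Hypothetical output formats a prompt might request for a single box.
FORMATS: Dict[str, Pattern[str]] = {
    # e.g. "[10, 20, 110, 220]"
    "json_list": re.compile(rf"\[\s*{NUM}\s*(?:,\s*{NUM}\s*){{3}}\]"),
    # e.g. "<box>10 20 110 220</box>"
    "xml_tag": re.compile(rf"<box>\s*{NUM}(?:\s+{NUM}){{3}}\s*</box>"),
}


def is_compliant(response: str, fmt: str) -> bool:
    """True when the whole response matches the requested format."""
    return FORMATS[fmt].fullmatch(response.strip()) is not None


def compliance_rate(responses: List[str], fmt: str) -> float:
    """Fraction of responses that follow the requested format."""
    if not responses:
        return 0.0
    return sum(is_compliant(r, fmt) for r in responses) / len(responses)


if __name__ == "__main__":
    responses = [
        "[10, 20, 110, 220]",
        "<box>10 20 110 220</box>",
        "The box is at 10, 20, 110, 220.",
    ]
    for name in FORMATS:
        print(name, round(compliance_rate(responses, name), 3))
```

Using `fullmatch` makes the check strict: a response that wraps a correct box in extra prose counts as non-compliant, which is the kind of minor variation a format-sensitivity study would want to surface.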
