CV-Arena: An Open Benchmark for Instructional Computer Vision Problem Solving with Human-AI Collaborative Preferences
Title: CV-Arena: A Public Benchmark for Evaluating Instructional Computer Vision Tasks via Human-AI Collaborative Preferences
Abstract:
While instruction-guided image editing is increasingly serving as a universal interface for visual tasks, current evaluation benchmarks remain limited. They predominantly address narrow aesthetic modifications, failing to encompass the full spectrum of complexity found in professional, real-world workflows. In this work, we broaden the definition of image editing into "instructional computer vision problem solving." This formulation requires a system to take a real input image and a natural-language command, then generate an edited output that not only executes the requested transformation but also strictly adheres to explicit preservation, geometric, physical, and usability constraints.
To assess capabilities at a professional level, we present CV-Arena, an open benchmark comprising 12,000 high-resolution image-instruction pairs. These pairs cover 16 distinct instruction-based visual task types. The dataset was constructed using CogRetriever, a dual-track pipeline that integrates targeted web searches, agentic query refinement, verification processes, and traceability mechanisms.
For scalable evaluation that maintains human fidelity, we introduce Active Elo, a collaborative preference protocol. This system utilizes CV-Judge, a multi-dimensional vision-language model (VLM) evaluator with logic gating, to automatically reject obvious failures and resolve comparisons with high confidence. Only ambiguous, high-quality comparisons are routed to human experts. The final scores are derived by aggregating mixed human and AI supervision through reliability-weighted Elo updates.
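The routing and aggregation idea described above can be illustrated with a minimal sketch. The abstract does not give the actual Active Elo update rule, so everything here is an assumption for illustration only: the names `Comparison`, `route`, and `weighted_elo_update`, the confidence threshold, the K-factor, and the choice to scale a standard Elo step by a per-vote reliability weight are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Comparison:
    winner: str         # system whose output was preferred
    loser: str          # system whose output was not preferred
    reliability: float  # assumed weight in [0, 1], e.g. 1.0 for a human expert vote


def expected_score(r_a: float, r_b: float) -> float:
    """Standard Elo expected score of a player rated r_a against one rated r_b."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def route(judge_confidence: float, threshold: float = 0.9) -> str:
    """Hypothetical gate: confident automatic verdicts are kept, the rest go to humans."""
    return "auto" if judge_confidence >= threshold else "human"


def weighted_elo_update(ratings: dict, comp: Comparison, k: float = 32.0) -> None:
    """Apply one Elo step in place, scaled by the reliability of the vote (an assumption)."""
    r_w = ratings[comp.winner]
    r_l = ratings[comp.loser]
    # A standard Elo step for a win, shrunk when the supervision source is less reliable.
    delta = k * comp.reliability * (1.0 - expected_score(r_w, r_l))
    ratings[comp.winner] = r_w + delta
    ratings[comp.loser] = r_l - delta
```

In this sketch a vote with reliability 0.5 moves both ratings half as far as a fully trusted vote, which is one simple way to mix human and AI supervision in a single leaderboard.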
Our extensive assessment of 21 systems on CV-Arena, ranging from proprietary and open-source models to agentic architectures, highlights significant shortcomings in instruction adherence, physical reasoning, structural control, and the preservation of fine details. Additionally, we introduce CV-Agent, a lightweight agentic framework that integrates planning, editing, and verification. Our results suggest that closed-loop reasoning represents a promising path toward achieving professional-grade, instruction-following visual editing.
Source: arXiv Generated at: 2026-06-02 00:00:00 UTC




