Global News Digest

arXiv

CV-Arena: An Open Benchmark for Instructional Computer Vision Problem Solving with Human-AI Collaborative Preferences

Title: CV-Arena: A Public Benchmark for Evaluating Instructional Computer Vision Tasks via Human-AI Collaborative Preferences

Abstract:

While instruction-guided image editing is increasingly serving as a universal interface for visual tasks, current evaluation benchmarks remain limited. They predominantly address narrow aesthetic modifications, failing to encompass the full spectrum of complexity found in professional, real-world workflows. In this work, we broaden the definition of image editing into "instructional computer vision problem solving." This formulation requires a system to take a real input image and a natural-language command, then generate an edited output that not only executes the requested transformation but also strictly adheres to explicit preservation, geometric, physical, and usability constraints.

To assess capabilities at a professional level, we present CV-Arena, an open benchmark comprising 12,000 high-resolution image-instruction pairs. These pairs cover 16 distinct instruction-based visual task types. The dataset was constructed using CogRetriever, a dual-track pipeline that integrates targeted web searches, agentic query refinement, verification processes, and traceability mechanisms.

For scalable evaluation that maintains human fidelity, we introduce Active Elo, a collaborative preference protocol. This system utilizes CV-Judge, a multi-dimensional vision-language model (VLM) evaluator with logic gating, to automatically reject obvious failures and resolve comparisons with high confidence. Only ambiguous, high-quality comparisons are routed to human experts. The final scores are derived by aggregating mixed human and AI supervision through reliability-weighted Elo updates.
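The abstract does not spell out how the reliability-weighted Elo update is computed, so the following is only a minimal sketch of one plausible reading: a standard Elo update whose step size is scaled by a reliability weight attached to whoever supplied the preference (the VLM judge or a human expert). The function names, the K-factor of 32, the 0.9 confidence threshold, and the routing rule are all illustrative assumptions, not details taken from the paper.

```python
from typing import Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score of system A against system B."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def weighted_elo_update(
    rating_a: float,
    rating_b: float,
    outcome: float,
    reliability: float,
    k: float = 32.0,  # assumed K-factor, not from the paper
) -> Tuple[float, float]:
    """Return updated (rating_a, rating_b).

    outcome: 1.0 if A is preferred, 0.0 if B is preferred, 0.5 for a tie.
    reliability: weight in [0, 1] for the label source (assumed semantics);
                 a less reliable label moves the ratings less.
    """
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie in [0, 1]")
    delta = k * reliability * (outcome - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


def route_comparison(judge_confidence: float, threshold: float = 0.9) -> str:
    """Hypothetical routing rule: confident judge calls are resolved
    automatically, everything else goes to a human expert."""
    return "ai" if judge_confidence >= threshold else "human"
```

Under these assumptions, a fully trusted label between two systems rated 1500 moves each rating by 16 points, while a label with reliability 0.5 moves them by 8, so noisier supervision has proportionally less influence on the final ranking.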

Our extensive assessment of 21 systems—ranging from proprietary and open-source models to agentic architectures—on CV-Arena highlights significant shortcomings in instruction adherence, physical reasoning, structural control, and the preservation of fine details. Additionally, we introduce CV-Agent, a lightweight agentic framework that integrates planning, editing, and verification. Our results suggest that closed-loop reasoning represents a promising path toward achieving professional-grade, instruction-following visual editing.
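The abstract describes CV-Agent only as integrating planning, editing, and verification, so the loop below is a generic sketch of closed-loop editing rather than the authors' implementation. The `plan`, `edit`, and `verify` callables, the feedback string, the retry budget, and the choice to re-edit from the original image on every round are hypothetical placeholders.

```python
from typing import Any, Callable, List, Optional, Tuple

Planner = Callable[[str, Optional[str]], List[str]]
Editor = Callable[[Any, List[str]], Any]
Verifier = Callable[[Any, Any, str], Tuple[bool, str]]


def closed_loop_edit(
    image: Any,
    instruction: str,
    plan: Planner,
    edit: Editor,
    verify: Verifier,
    max_rounds: int = 3,  # assumed retry budget
) -> Tuple[Any, int, bool]:
    """Plan, edit, and verify until the verifier accepts or the budget runs out.

    Returns (last candidate, rounds used, whether verification passed).
    """
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    feedback: Optional[str] = None
    candidate: Any = image
    for round_index in range(1, max_rounds + 1):
        # Re-plan with the verifier's feedback from the previous round, if any.
        steps = plan(instruction, feedback)
        # Assumption: each round edits the original image, not the last candidate.
        candidate = edit(image, steps)
        passed, feedback = verify(image, candidate, instruction)
        if passed:
            return candidate, round_index, True
    return candidate, max_rounds, False
```

The point of the sketch is the feedback edge: a verifier's rejection is fed back into the planner instead of the first edit being accepted as final.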


Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC
