arXiv

TECCI: Tricky Edits of Collected and Curated Images

June 2, 2026 · Aishwarya Agrawal, Roy Hirsch, Yasumasa Onoe, Sherry Ben, Jason Baldridge

Title: TECCI: Evaluating the Nuances of Collected and Curated Image Edits

Despite significant advancements in recent years, text-guided image editing technologies continue to face substantial hurdles. Current methods often falter in areas such as strict instruction adherence, preserving the original source image with minimal alterations, and maintaining high visual fidelity. These limitations are particularly pronounced when handling complex requests, including adjustments to position, motion, viewpoint, scale, or creative transformations.

To provide a rigorous framework for testing generative image editors, we introduce TECCI (Tricky Edits of Collected and Curated Images), a novel benchmark designed to expose these weaknesses. TECCI features a newly released dataset comprising images across seven distinct categories. These categories were carefully selected and curated to specifically target the known deficiencies of existing editing models. The dataset includes 7,550 pairs of images and corresponding edit instructions. The instructions were automatically generated by Gemini, with five distinct edit types applied to each source image. Additionally, we curated a subset of 530 images accompanied by challenging, manually crafted edit instructions.

We conducted human evaluations of five leading image editing models using the TECCI dataset. Human judges assessed the model outputs on three criteria: instruction following, minimality of changes to the source image, and overall visual quality. To enable larger-scale assessment, we developed an automated rater powered by Gemini, which matched human judgments with 74.7% accuracy.
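One way to read the 74.7% figure is as plain agreement accuracy: the fraction of rated items on which the automated rater's verdict equals the human verdict. The sketch below illustrates that reading only. The function name `agreement_accuracy`, the binary pass/fail labels, and the toy verdicts are all assumptions made for illustration; the abstract does not specify the label format or the exact metric definition.

```python
from typing import Sequence


def agreement_accuracy(auto_labels: Sequence[bool], human_labels: Sequence[bool]) -> float:
    """Fraction of items where the automated verdict equals the human verdict.

    Assumption: each item carries one binary pass/fail verdict per rater.
    """
    if len(auto_labels) != len(human_labels):
        raise ValueError("label sequences must have the same length")
    if not auto_labels:
        raise ValueError("need at least one rated item")
    matches = sum(a == h for a, h in zip(auto_labels, human_labels))
    return matches / len(auto_labels)


# Toy, made-up verdicts for four edits (not TECCI data).
human = [True, False, False, True]
auto = [True, False, True, True]

print(agreement_accuracy(auto, human))  # 0.75 (3 of 4 verdicts match)
```

Plain accuracy does not correct for chance agreement, so it can look high when one verdict dominates the labels.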

Our analysis yielded several critical findings:

1. The benchmark proves highly demanding, as no model achieved an overall success rate exceeding 22%.
2. Among the tested models, Nano Banana Pro emerged as the top performer.
3. Models showed considerably stronger performance in following instructions compared to their ability to perform minimal edits or maintain visual quality.
4. Significant difficulties were observed when editing architectural structures and natural scenes, tasks that require a deep understanding of spatial layouts and fine visual details.
5. Reasoning-based and creative edits proved to be the most challenging, while edits involving color and appearance were the easiest to execute.
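The gap between per-criterion performance and the overall success rate is easier to see with a small aggregation sketch. It assumes, hypothetically, that an edit counts as an overall success only when it passes all three criteria. The abstract does not state how the overall rate is computed, so the `Rating` class, the conjunctive rule, and the toy ratings below are illustrative assumptions rather than the authors' scoring procedure.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass(frozen=True)
class Rating:
    """Hypothetical binary verdicts for one edited image."""
    follows_instruction: bool
    minimal_change: bool
    visual_quality: bool

    def overall_success(self) -> bool:
        # Assumption: overall success requires passing every criterion.
        return self.follows_instruction and self.minimal_change and self.visual_quality


def success_rates(ratings: Sequence[Rating]) -> Dict[str, float]:
    """Per-criterion pass rates plus the conjunctive overall rate."""
    n = len(ratings)
    if n == 0:
        raise ValueError("need at least one rating")
    return {
        "instruction_following": sum(r.follows_instruction for r in ratings) / n,
        "minimal_change": sum(r.minimal_change for r in ratings) / n,
        "visual_quality": sum(r.visual_quality for r in ratings) / n,
        "overall": sum(r.overall_success() for r in ratings) / n,
    }


# Toy, made-up ratings (not TECCI data): each criterion passes on 3 of 4 edits,
# but only the first edit passes all three.
toy = [
    Rating(True, True, True),
    Rating(True, False, True),
    Rating(True, True, False),
    Rating(False, True, True),
]

rates = success_rates(toy)
print(rates["instruction_following"])  # 0.75
print(rates["overall"])                # 0.25
```

Under this conjunctive rule the overall rate can never exceed the weakest per-criterion rate, so a model that follows instructions well can still score low overall.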
