arXiv

Resolving Ambiguity in Composed Image Retrieval via Calibrated Interaction

Title: Clarifying Intent in Composed Image Retrieval Through Calibrated Interaction

Abstract:

Composed image retrieval (CIR) searches a database using a reference image together with a text instruction describing how to alter it. Although the field has advanced rapidly, from models trained on triplets to zero-shot and generative approaches, all existing systems rely on a core assumption: that a user’s query corresponds to one specific target image, evaluated via Recall@K against a single ground-truth annotation. We contend that this assumption is fundamentally misaligned with the nature of the task. For instance, a request like "make it more formal" does not pinpoint a single image but rather defines a region within the corpus, leaving the specific intended item genuinely underdetermined. This underspecification is the primary cause of the persistent false-negative issue and prevents current models from distinguishing between precise and ambiguous queries.
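The false-negative issue can be illustrated with a minimal sketch. The image identifiers, the ranking, and the set of acceptable targets below are hypothetical and chosen only for illustration; they are not taken from any benchmark.

```python
def recall_at_k(ranked_ids, positives, k):
    """Return 1.0 if any acceptable target appears in the top-k results, else 0.0."""
    return float(any(item in positives for item in ranked_ids[:k]))


# Hypothetical ranking returned for the query "make it more formal".
ranked = ["blazer_02", "suit_07", "blazer_11", "suit_01"]

# Single-annotation protocol: only one image counts as correct.
single_gt = {"suit_01"}

# Multi-positive protocol: every image that satisfies the edit counts.
all_valid = {"suit_01", "suit_07", "blazer_02"}

single_score = recall_at_k(ranked, single_gt, k=2)  # 0.0: valid results are scored as misses
multi_score = recall_at_k(ranked, all_valid, k=2)   # 1.0: the same ranking is scored as a hit

print(single_score, multi_score)
```

Under the single-annotation protocol, the two top-ranked items are counted as errors even though both plausibly satisfy the instruction.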

To address this, we reframe CIR as a problem of calibrated intent resolution under uncertainty. Our approach wraps the retriever in a conformal prediction layer, which outputs a candidate set with a guaranteed coverage rate. The size of this set serves as a principled metric for ambiguity. When the set is large, an expected-information-gain policy selects the single most informative clarifying question from interpretable ambiguity axes, thereby narrowing the candidate pool.
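The two components described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's implementation: it assumes split conformal prediction with a nonconformity score of 1 minus retriever similarity, a uniform belief over the candidate set, and a user whose answer is fully determined by the intended target. All function names, scores, and ambiguity axes are hypothetical.

```python
import math
import numpy as np


def conformal_threshold(cal_scores, alpha):
    """Split-conformal threshold: the ceil((n+1)(1-alpha))-th smallest calibration score."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return float("inf")  # too few calibration points: the set must include everything
    return float(np.sort(cal_scores)[k - 1])


def candidate_set(similarities, threshold):
    """Indices of corpus items whose nonconformity (1 - similarity) is within the threshold."""
    return np.flatnonzero(1.0 - similarities <= threshold)


def entropy(p):
    """Shannon entropy in bits, ignoring zero-probability entries."""
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())


def expected_information_gain(probs, answers):
    """Prior entropy over candidates minus expected entropy after hearing the answer."""
    answers = np.asarray(answers)
    expected_posterior = 0.0
    for a in np.unique(answers):
        mask = answers == a
        p_a = probs[mask].sum()
        if p_a > 0:
            expected_posterior += p_a * entropy(probs[mask] / p_a)
    return entropy(probs) - expected_posterior


# Hypothetical calibration scores (1 - similarity of the true target on held-out queries).
cal_scores = np.array([0.10, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.50, 0.60])
tau = conformal_threshold(cal_scores, alpha=0.2)

# Hypothetical retriever similarities for one test query over a six-item corpus.
similarities = np.array([0.90, 0.70, 0.55, 0.52, 0.30, 0.10])
cands = candidate_set(similarities, tau)

# Assumed uniform belief over the candidate set; set size acts as the ambiguity signal.
probs = np.full(len(cands), 1.0 / len(cands))

# Hypothetical ambiguity axes: the answer each candidate would imply for each question.
axes = {
    "color": ["black", "black", "navy", "navy"],
    "garment": ["suit", "suit", "suit", "blazer"],
}
gains = {name: expected_information_gain(probs, vals) for name, vals in axes.items()}
best_axis = max(gains, key=gains.get)
print(len(cands), best_axis)
```

In this toy case the question about color splits the candidates evenly and therefore carries more expected information than the question about garment type, which isolates only one candidate.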

We introduce AmbiCIR, a benchmark featuring a human-validated user simulator that revitalizes the dormant auxiliary and dialogue annotations from CIRR and expands upon the multiple-positive framework of CIRCO. Our method achieves state-of-the-art performance in single-turn retrieval across both open-domain and fashion benchmarks. Crucially, the results confirm that calibrated resolution incurs no cost for precise queries. Furthermore, the method reaches the intended target using a fraction of the interaction budget required by naive conversational baselines. Notably, this work is the first to report valid coverage and calibration metrics for the CIR task.


Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC
