arXiv

Resolving Ambiguity in Composed Image Retrieval via Calibrated Interaction

June 2, 2026 · Amsisan Tran, Baogh Le, Tuan Kiet Pham, Sui Yang Guang

Title: Clarifying Intent in Composed Image Retrieval Through Calibrated Interaction

Abstract:

Composed image retrieval (CIR) operates by searching a database using a reference image alongside text instructions on how to alter it. Although the field has advanced rapidly—from models trained on triplets to zero-shot and generative approaches—all existing systems rely on a core assumption: that a user’s query corresponds to one specific target image, evaluated via Recall@K against a single ground-truth annotation. We contend that this assumption is fundamentally misaligned with the nature of the task. For instance, a request like "make it more formal" does not pinpoint a single image but rather defines a region within the corpus, leaving the specific intended item genuinely underdetermined. This lack of specification is the primary cause of the persistent false-negative issue and prevents current models from distinguishing between precise and ambiguous queries.

To address this, we reframe CIR as a problem of calibrated intent resolution under uncertainty. Our approach wraps the retriever in a conformal prediction layer, which outputs a candidate set with a guaranteed coverage rate. The size of this set serves as a principled metric for ambiguity. When the set is large, an expected-information-gain policy selects the single most informative clarifying question from interpretable ambiguity axes, thereby narrowing the candidate pool.
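The two components described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the nonconformity score or the question model, so this sketch assumes split conformal prediction with the score `1 - similarity`, and it assumes the user's answer deterministically reveals the target's value on the chosen axis. All function names, the toy scores, and the `color` and `sleeve` axes are hypothetical.

```python
import math
from collections import defaultdict


def conformal_threshold(cal_scores, alpha):
    """Split-conformal quantile of calibration nonconformity scores.

    Assumption: each score is 1 - similarity(query, true target) on held-out data.
    """
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))  # finite-sample corrected rank
    if k > n:
        return float("inf")  # too few calibration points: keep every candidate
    return sorted(cal_scores)[k - 1]


def prediction_set(similarities, threshold):
    """Indices of candidates whose nonconformity score is within the threshold."""
    return [i for i, s in enumerate(similarities) if 1.0 - s <= threshold]


def entropy(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def expected_information_gain(candidates, weights, attribute_values):
    """Prior entropy over candidates minus expected entropy after the answer.

    Assumption: the answer reveals the target's attribute value exactly.
    """
    total = sum(weights[c] for c in candidates)
    prior = {c: weights[c] / total for c in candidates}
    groups = defaultdict(list)
    for c in candidates:
        groups[attribute_values[c]].append(prior[c])
    expected_posterior = 0.0
    for probs in groups.values():
        mass = sum(probs)  # probability of receiving this answer
        expected_posterior += mass * entropy([p / mass for p in probs])
    return entropy(list(prior.values())) - expected_posterior


def best_question(candidates, weights, axes):
    """Pick the ambiguity axis with the highest expected information gain."""
    return max(
        axes,
        key=lambda a: expected_information_gain(candidates, weights, axes[a]),
    )


# Toy data (hypothetical): calibration scores and retriever similarities.
cal_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
similarities = [0.9, 0.85, 0.3, 0.1, 0.25]

threshold = conformal_threshold(cal_scores, alpha=0.2)
candidates = prediction_set(similarities, threshold)

# Uniform weights over the candidate set; two hypothetical ambiguity axes.
weights = {c: 1.0 for c in candidates}
axes = {
    "color": {0: "black", 1: "black", 2: "navy", 4: "black"},
    "sleeve": {0: "long", 1: "short", 2: "long", 4: "short"},
}
chosen = best_question(candidates, weights, axes)
print(len(candidates), chosen)
```

In this toy setup the size of `candidates` plays the role of the ambiguity signal: a small set would be returned directly, while a large one triggers `best_question`. Under the deterministic-answer assumption, the expected information gain reduces to the entropy of the answer distribution, so the axis that splits the candidate set most evenly is preferred.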

We introduce AmbiCIR, a benchmark featuring a human-validated user simulator that revitalizes the dormant auxiliary and dialogue annotations from CIRR and expands upon the multiple-positive framework of CIRCO. Our method achieves state-of-the-art performance in single-turn retrieval across both open-domain and fashion benchmarks. Crucially, the results confirm that calibrated resolution incurs no cost for precise queries. Furthermore, the method reaches the intended target using a fraction of the interaction budget required by naive conversational baselines. Notably, this work is the first to report valid coverage and calibration metrics for the CIR task.
