Title: Mixed-Modality Dual Face-Hair Retrieval
Abstract:
This paper presents Dual Face-Hair Retrieval (DFHR), a novel image retrieval task built on a mixed-modality, dual-reference query. Each query consists of two inputs: a face image that establishes identity and a hairstyle reference given either as an image or as text. Unlike previous retrieval paradigms, DFHR requires cross-component reasoning between two semantically distinct attributes, identity and hairstyle, that come from heterogeneous modalities. The task therefore demands localized feature disentanglement, cross-modal semantic alignment, and mixed-modality composition within a single embedding space.
To support this new task, we introduce DFHR-Bench, the first benchmark for mixed-modality face-hair retrieval. It contains more than 180,000 annotated triplets covering both dual-image and image-text scenarios, generated with a multi-stage annotation protocol designed to preserve both semantic accuracy and identity integrity. We also propose MFHC (Multimodal Face-Hair Combiner), a comprehensive framework that integrates disentangled identity and hairstyle embeddings via token injection and multi-view supervision. Together, DFHR and DFHR-Bench define a new standard for visual retrieval that is both identity-aware and capable of attribute control across different modalities.
Source: arXiv Generated at: 2026-06-03 00:00:00 UTC





