Omni-Embed-Audio: Leveraging Multimodal LLMs for Robust Audio-Text Retrieval
Title: Omni-Embed-Audio: Harnessing Multimodal LLMs for Resilient Audio-Text Retrieval
Original: arXiv:2604.18360v2 Announce Type: replace-cross
Abstract: While Contrastive Language-Audio Pretraining (CLAP)-based audio-text retrieval systems excel on standard benchmarks, these evaluations often depend on caption-style queries that diverge significantly from actual user search habits. This discrepancy restricts the ability to gauge the practical robustness of these systems. To address this, we introduce Omni-Embed-Audio (OEA), a retrieval-focused encoder that integrates multimodal Large Language Models (LLMs) equipped with native audio comprehension.
To rigorously test robustness beyond caption-style queries, we propose User-Intent Queries (UIQs), comprising five formats that mirror organic search behaviors: questions, commands, keyword tags, paraphrases, and exclusion-based negative queries. For the negative queries, we build a hard negative mining pipeline and introduce two discrimination metrics, HNSR and TFR, that measure how effectively models filter out acoustically similar distractors.
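The abstract does not spell out the hard negative mining pipeline, so the following is only a generic sketch of the idea, not the authors' method. It assumes each audio clip already has an embedding vector and treats cosine similarity between embeddings as a stand-in for "acoustically similar". The function names, the random embeddings, and the choice of `k` are all hypothetical.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so dot products equal cosine similarity."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)


def mine_hard_negatives(audio_emb: np.ndarray, k: int) -> np.ndarray:
    """For each clip, return the indices of its k most similar other clips.

    Assumption: high embedding similarity approximates acoustic similarity.
    """
    n = audio_emb.shape[0]
    if not 0 < k < n:
        raise ValueError("k must satisfy 0 < k < number of clips")
    z = l2_normalize(audio_emb.astype(float))
    sim = z @ z.T                    # pairwise cosine similarities
    np.fill_diagonal(sim, -np.inf)   # a clip is never its own negative
    # Sort each row by descending similarity and keep the top k indices.
    return np.argsort(-sim, axis=1)[:, :k]


# Placeholder embeddings; a real system would use an audio encoder's output.
rng = np.random.default_rng(0)
clips = rng.normal(size=(6, 8))
hard = mine_hard_negatives(clips, k=2)
```

Each row of `hard` lists candidate distractors for the corresponding clip. A retrieval model can then be scored on whether it ranks the true target above these near neighbours.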
Our experiments on the AudioCaps, Clotho, and MECAT datasets show that OEA matches the text-to-audio retrieval performance of the state-of-the-art M2D-CLAP model. OEA also shows clear advantages in two areas: (1) text-to-text retrieval, where it achieves a +22% relative improvement, and (2) hard negative discrimination, where it gains +4.3 percentage points in HNSR@10 and +34.7% relative in TFR@10. These results indicate that LLM backbones offer enhanced semantic comprehension for handling complex queries.
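The reported gains mix two units: percentage points (an absolute difference between two scores) and relative percent (that difference divided by the baseline). The sketch below shows how the two are computed. The scores in it are placeholders chosen for easy arithmetic, not values from the paper, and the function names are hypothetical.

```python
def absolute_gain_pp(baseline: float, model: float) -> float:
    """Absolute difference, in percentage points, between two scores given in percent."""
    return model - baseline


def relative_gain_pct(baseline: float, model: float) -> float:
    """Improvement expressed as a percentage of the baseline score."""
    if baseline == 0:
        raise ValueError("relative gain is undefined for a zero baseline")
    return (model - baseline) / baseline * 100.0


# Placeholder scores (NOT from the paper): baseline 50.0%, new model 55.0%.
pp = absolute_gain_pp(50.0, 55.0)    # difference in percentage points
rel = relative_gain_pct(50.0, 55.0)  # the same difference relative to the baseline
```

The same absolute gain corresponds to a larger relative gain when the baseline is low, so the two figures are not directly comparable across metrics.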
Source: arXiv Generated at: 2026-06-02 00:00:00 UTC





