arXiv

Fine-grained Fragment Retrieval in Multi-modal Long-form Dialogues

June 4, 2026 · Hanbo Bi, Zhiqiang Yuan, Chongyang Li, Qiwei Yan, Zexi Jia, Jiapei Zhang, Xiaoyue Duan, Yingchao Feng, Jinchao Zhang, Jie Zhou

Title: Retrieving Granular Segments from Multi-modal Long-form Conversations

Abstract:

As multi-modal communication platforms gain traction, conversations that interleave text and images over long periods have become increasingly common. In these settings, users typically want to retrieve coherent dialogue segments tied to a specific theme rather than isolated utterances. To address this need, we introduce Fine-grained Fragment Retrieval (FFR), a task that aims to locate semantically relevant multi-utterance, multi-image fragments within long multi-modal dialogues.

We investigate two retrieval scenarios: (1) Single-Dialogue FFR, which extracts fragments from a given conversation, and (2) Dialogue Corpus FFR, which searches a large-scale corpus to support open-domain applications. For the single-dialogue setting, we present F2RVLM, a generative retrieval model. It is optimized via reinforcement learning with difficulty-aware curriculum sampling and multi-objective rewards to improve the coherence of the retrieved fragments.
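
The abstract does not spell out the reward terms, so the following is only an illustrative sketch of what a multi-objective reward for fragment retrieval could look like. It assumes a predicted fragment is a set of utterance IDs and image IDs and scores each modality with a set-overlap F1 before combining them with weights. The function names `set_f1` and `fragment_reward`, the F1 choice, and the weights `w_text` and `w_image` are hypothetical and are not taken from the paper.

```python
def set_f1(predicted, gold):
    """F1 overlap between two collections of IDs, treated as sets."""
    predicted, gold = set(predicted), set(gold)
    if not predicted and not gold:
        return 1.0  # nothing to retrieve and nothing retrieved
    overlap = len(predicted & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def fragment_reward(pred_utts, gold_utts, pred_imgs, gold_imgs,
                    w_text=0.5, w_image=0.5):
    """Hypothetical reward: weighted sum of utterance-level and image-level F1."""
    return (w_text * set_f1(pred_utts, gold_utts)
            + w_image * set_f1(pred_imgs, gold_imgs))


# Example: two of three predicted utterances are correct, the image is correct.
reward = fragment_reward([1, 2, 3], [2, 3, 4], [10], [10])
print(round(reward, 3))
```

Scoring the two modalities separately keeps a model from earning a high reward by retrieving only the text side of a fragment; whether the actual rewards are structured this way is an assumption.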

In the corpus-level scenario, we propose FFRS, a two-stage architecture that integrates offline fragment-level indexing with online retrieval processes. In this system, each dialogue is broken down into minimal semantic units. These units are encoded by a Fragment Embedding Model (FEM) and stored in a vector database. During inference, FEM quickly retrieves the Top-K most promising candidates, after which F2RVLM conducts fine-grained reasoning to pinpoint the most relevant sub-content.
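
The two-stage flow described above can be sketched roughly as follows. This is a text-only toy under stated assumptions, not the FFRS implementation: `embed` is a bag-of-words stand-in for the Fragment Embedding Model, a plain Python list stands in for the vector database, fixed-size chunks stand in for minimal semantic units, and `refine` is a keyword filter standing in for the fine-grained reasoning done by F2RVLM. All function names and the `unit_size` parameter are hypothetical, and images are omitted.

```python
import math
import re
from collections import Counter


def tokens(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def embed(text):
    """Toy stand-in for FEM: a bag-of-words count vector."""
    return Counter(tokens(text))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(dialogues, unit_size=2):
    """Offline stage: split each dialogue into units and store their embeddings."""
    index = []
    for dialogue_id, utterances in dialogues.items():
        for start in range(0, len(utterances), unit_size):
            unit = utterances[start:start + unit_size]
            index.append({
                "dialogue_id": dialogue_id,
                "start": start,
                "utterances": unit,
                "vector": embed(" ".join(unit)),
            })
    return index


def retrieve_top_k(index, query, k=2):
    """Online stage 1: coarse Top-K retrieval by embedding similarity."""
    q = embed(query)
    return sorted(index, key=lambda e: cosine(q, e["vector"]), reverse=True)[:k]


def refine(candidates, query):
    """Online stage 2 (stand-in): keep only utterances sharing a query term."""
    terms = set(tokens(query))
    results = []
    for cand in candidates:
        kept = [u for u in cand["utterances"] if terms & set(tokens(u))]
        if kept:
            results.append((cand["dialogue_id"], kept))
    return results


dialogues = {
    "d1": ["We should plan the hiking trip.", "I found a trail near the lake.",
           "Did you finish the report?", "The report is due Friday."],
    "d2": ["What should we cook tonight?", "Pasta with tomato sauce sounds good."],
}
index = build_index(dialogues)
candidates = retrieve_top_k(index, "hiking trail", k=2)
print(refine(candidates, "hiking trail"))
```

The split between a cheap similarity search and a more expensive refinement step mirrors the description above: the first stage narrows the corpus to a handful of candidates so that the second stage only has to reason over a small amount of content.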

To facilitate research in this area, we have constructed MLDR, currently the largest multi-modal dialogue retrieval dataset available, alongside a real-world test set derived from WeChat. Experimental results on both benchmarks confirm that F2RVLM and FFRS consistently outperform existing methods in both single-dialogue and corpus-level Fine-grained Fragment Retrieval tasks.
