arXiv

Fine-grained Fragment Retrieval in Multi-modal Long-form Dialogues

Abstract:

As multi-modal communication platforms spread, long-running conversations that interleave text and images have become common. In these settings, users typically want to retrieve coherent dialogue segments tied to a specific theme rather than isolated utterances. To address this need, we introduce Fine-grained Fragment Retrieval (FFR), a task that requires locating semantically relevant fragments, each spanning multiple utterances and images, within lengthy multi-modal dialogues.

We study two retrieval scenarios: (1) Single-Dialogue FFR, which extracts fragments from a given conversation, and (2) Dialogue Corpus FFR, which searches a large-scale corpus to support open-domain applications. For the single-dialogue setting, we present F2RVLM, a generative retrieval model optimized via reinforcement learning, with difficulty-aware curriculum sampling and multi-objective rewards to improve the coherence of the retrieved fragments.
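The abstract does not specify how the curriculum or the rewards are defined, so the following is only a minimal illustrative sketch of the two ideas in general form. The reward names, the weights, the linear difficulty schedule, and all function names are assumptions made for illustration, not the formulation used for F2RVLM.

```python
import random
from typing import Dict, List, Sequence


def combined_reward(components: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of per-objective rewards (objective names are hypothetical)."""
    return sum(weights[name] * value for name, value in components.items())


def curriculum_weights(difficulties: Sequence[float], progress: float) -> List[float]:
    """Sampling weights that favour easy items early and hard items late.

    difficulties: per-example difficulty scores in [0, 1].
    progress: fraction of training completed, in [0, 1].
    Assumed schedule: the target difficulty grows linearly with progress.
    """
    target = progress
    # Small constant keeps every weight strictly positive.
    return [1.0 - abs(d - target) + 1e-6 for d in difficulties]


def sample_batch(items: Sequence[str], difficulties: Sequence[float],
                 progress: float, batch_size: int, rng: random.Random) -> List[str]:
    """Draw a training batch with replacement, biased toward the target difficulty."""
    weights = curriculum_weights(difficulties, progress)
    return rng.choices(items, weights=weights, k=batch_size)


if __name__ == "__main__":
    rng = random.Random(0)
    items = ["easy_dialogue", "medium_dialogue", "hard_dialogue"]
    difficulties = [0.1, 0.5, 0.9]
    print(sample_batch(items, difficulties, progress=0.0, batch_size=4, rng=rng))
    # Hypothetical reward components for one sampled output.
    print(combined_reward({"overlap": 1.0, "format": 0.5},
                          {"overlap": 0.5, "format": 0.5}))
```

In this sketch, difficulty scores would have to come from some external estimate (for example, how often a baseline fails on an example); the abstract does not say how difficulty is measured.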

In the corpus-level scenario, we propose FFRS, a two-stage architecture that integrates offline fragment-level indexing with online retrieval processes. In this system, each dialogue is broken down into minimal semantic units. These units are encoded by a Fragment Embedding Model (FEM) and stored in a vector database. During inference, FEM quickly retrieves the Top-K most promising candidates, after which F2RVLM conducts fine-grained reasoning to pinpoint the most relevant sub-content.
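The offline-index and online-retrieval flow described above can be sketched as follows. This is a toy illustration under stated assumptions: `toy_embed` is a bag-of-words stand-in for the Fragment Embedding Model, a plain Python list replaces the vector database, and the `rerank` callable is a placeholder for the fine-grained reasoning step performed by F2RVLM. None of these names come from the paper.

```python
import math
from collections import Counter
from typing import Callable, Dict, List, Tuple

Vector = Dict[str, float]
Index = List[Tuple[str, Vector]]


def toy_embed(text: str) -> Vector:
    """Stand-in for FEM: an L2-normalised bag-of-words vector (hypothetical)."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {tok: c / norm for tok, c in counts.items()}


def cosine(a: Vector, b: Vector) -> float:
    """Dot product of two normalised sparse vectors."""
    return sum(w * b.get(tok, 0.0) for tok, w in a.items())


def build_index(fragments: List[str]) -> Index:
    """Offline stage: embed every fragment once and store the result."""
    return [(frag, toy_embed(frag)) for frag in fragments]


def retrieve_top_k(query: str, index: Index, k: int) -> List[str]:
    """Online stage 1: coarse Top-K candidates by embedding similarity."""
    q = toy_embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [frag for frag, _ in ranked[:k]]


def two_stage_search(query: str, index: Index, k: int,
                     rerank: Callable[[str, str], float]) -> str:
    """Online stage 2: pass the Top-K candidates to a finer-grained scorer."""
    candidates = retrieve_top_k(query, index, k)
    return max(candidates, key=lambda frag: rerank(query, frag))


if __name__ == "__main__":
    index = build_index([
        "we booked flights to tokyo",
        "the cat photo is cute",
        "tokyo hotel booking confirmed",
    ])
    # Placeholder reranker: counts shared tokens between query and fragment.
    overlap = lambda q, f: float(len(set(q.split()) & set(f.split())))
    print(two_stage_search("tokyo flights", index, k=2, rerank=overlap))
```

The split keeps the expensive scorer off the full corpus: only the K candidates that survive the cheap similarity search are examined in detail.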

To facilitate research in this area, we have constructed MLDR, currently the largest multi-modal dialogue retrieval dataset available, alongside a real-world test set derived from WeChat. Experimental results on both benchmarks confirm that F2RVLM and FFRS consistently outperform existing methods in both single-dialogue and corpus-level Fine-grained Fragment Retrieval tasks.


Source: arXiv Generated at: 2026-06-04 00:00:00 UTC
