arXiv

Overview of the EReL@MIR 2025 Multimodal Document Retrieval Challenge (Track 1)

Title: EReL@MIR 2025 Multimodal Document Retrieval Challenge (Track 1): A Comprehensive Review

Abstract

Although retrieving information from visually dense documents—pages that combine textual content with figures, tables, and charts—is a critical component of multimodal retrieval-augmented generation, the majority of current retrieval models continue to ignore the visual data. Addressing this gap, the Multimodal Document Retrieval Challenge, specifically Track 1 of the MIR Challenge at the inaugural EReL@MIR workshop (held alongside The Web Conference 2025), challenged participants to develop a unified retrieval system capable of managing two distinct yet complementary scenarios. These scenarios include MMDocIR, which involves closed-set document page retrieval within lengthy documents based on a text query, and M2KR, an open-domain task requiring the retrieval of Wikipedia-style passages using image-only or image-plus-text queries.

Systems were ranked by the macro-average, across both tasks, of mean Recall@$\{1,3,5\}$. The competition drew 455 entrants and received 586 submissions from 22 teams. This report outlines the challenge's structure, including its datasets and evaluation methodology, presents the final rankings, and analyzes in depth the systems developed by the top three teams. Notably, all winning entries used decoder-based Multimodal-LLM embedders from the Qwen2-VL family rather than the CLIP-style encoders commonly used in similar contexts. The main differences among the top performers lay in their integration strategies: one team employed fine-tuned ensembles, another used a strong vision-language re-ranker for training-free multi-route fusion, and the third relied on zero-shot late interaction. The training-free approach finished within $0.1$ points of the top-scoring team, which relied on fine-tuning.
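To make the ranking metric concrete, the following is a minimal sketch of one plausible reading of it, not the organizers' official scorer. It assumes that per-query Recall@k is the fraction of a query's relevant items found in the top k results, that each task's score is the mean of Recall@1, Recall@3, and Recall@5 averaged over that task's queries, and that the final score is the unweighted mean of the two task scores. The function names and the toy page and passage IDs are hypothetical.

```python
from statistics import mean


def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant items that appear among the top-k ranked IDs.

    Assumption: this per-query definition is illustrative; the official
    scorer may define recall differently (e.g. hit-or-miss per query).
    """
    relevant = set(relevant)
    if not relevant:
        raise ValueError("each query needs at least one relevant item")
    return len(set(ranked[:k]) & relevant) / len(relevant)


def task_score(runs, ks=(1, 3, 5)):
    """Mean over k of the query-averaged Recall@k for one task.

    `runs` is a list of (ranked_ids, relevant_ids) pairs, one per query.
    """
    per_k = [mean(recall_at_k(ranked, rel, k) for ranked, rel in runs) for k in ks]
    return mean(per_k)


def challenge_score(task_runs):
    """Macro-average: unweighted mean of the per-task scores."""
    return mean(task_score(runs) for runs in task_runs.values())


# Toy data (made up for illustration only).
toy_runs = {
    "MMDocIR": [
        (["p3", "p1", "p7"], {"p1"}),        # found at rank 2
        (["p2", "p9", "p4"], {"p4", "p8"}),  # one of two found, at rank 3
    ],
    "M2KR": [
        (["w1", "w2"], {"w1"}),              # found at rank 1
    ],
}

score = challenge_score(toy_runs)
print(f"challenge score: {score:.3f}")
```

Because the average is taken over tasks rather than over pooled queries, each task contributes equally to the final score regardless of how many queries it contains.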


Source: arXiv
