Overview of the EReL@MIR 2025 Multimodal Document Retrieval Challenge (Track 1)
Title: EReL@MIR 2025 Multimodal Document Retrieval Challenge (Track 1): A Comprehensive Review
Abstract
Retrieving information from visually dense documents (pages that combine text with figures, tables, and charts) is a critical component of multimodal retrieval-augmented generation, yet most current retrieval models still ignore the visual content. To address this gap, the Multimodal Document Retrieval Challenge, Track 1 of the MIR Challenge at the inaugural EReL@MIR workshop (held alongside The Web Conference 2025), asked participants to build a single retrieval system that handles two distinct but complementary scenarios: MMDocIR, closed-set retrieval of pages within long documents given a text query, and M2KR, open-domain retrieval of Wikipedia-style passages given image-only or image-plus-text queries.
Submissions were scored by the macro-average, across the two tasks, of mean Recall@$\{1,3,5\}$. The competition drew 586 submissions from 22 teams, representing 455 entrants in total. This report describes the challenge's structure, including its datasets and evaluation methodology, presents the final rankings, and analyzes in depth the systems built by the top three teams. All three winning entries used decoder-based Multimodal-LLM embedders from the Qwen2-VL family rather than the CLIP-style encoders commonly used in similar settings. The top teams differed mainly in their integration strategies: one used fine-tuned ensembles, another used a strong vision-language re-ranker for training-free multi-route fusion, and the third relied on zero-shot late interaction. The training-free approach finished within $0.1$ points of the top-scoring team, which relied on fine-tuning.
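The scoring rule described above can be illustrated with a minimal sketch. It is not the challenge's official scoring script: the function names, the toy document IDs, and the Recall@k definition for queries with several relevant items (fraction of relevant items found in the top k) are assumptions made for illustration.

```python
from statistics import mean


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant items that appear in the top-k results.

    Assumption: with a single relevant item this reduces to a 0/1 hit.
    """
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("each query needs at least one relevant item")
    return len(set(ranked_ids[:k]) & relevant) / len(relevant)


def task_score(queries, ks=(1, 3, 5)):
    """Mean over k of (Recall@k averaged over the task's queries).

    `queries` is a list of (ranked_ids, relevant_ids) pairs.
    """
    per_k = [mean(recall_at_k(r, g, k) for r, g in queries) for k in ks]
    return mean(per_k)


def challenge_score(task_queries, ks=(1, 3, 5)):
    """Macro-average: every task counts equally, whatever its query count."""
    return mean(task_score(q, ks) for q in task_queries.values())


# Hypothetical toy rankings; the IDs are made up for this example.
toy = {
    "MMDocIR": [
        (["p3", "p1", "p7", "p2", "p9"], ["p1"]),  # hit at rank 2
        (["p4", "p5", "p6", "p8", "p0"], ["p4"]),  # hit at rank 1
    ],
    "M2KR": [
        (["w2", "w8", "w1", "w5", "w3"], ["w3"]),  # hit at rank 5
        (["w7", "w6", "w4", "w0", "w9"], ["w7"]),  # hit at rank 1
    ],
}

print(round(challenge_score(toy), 4))  # 0.75
```

In this toy case MMDocIR scores $(0.5 + 1 + 1)/3 \approx 0.833$ and M2KR scores $(0.5 + 0.5 + 1)/3 \approx 0.667$, so the macro-average is $0.75$. Because the average is taken over tasks rather than over pooled queries, a task with fewer queries carries the same weight as a larger one.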
Source: arXiv
Generated at: 2026-06-04 00:00:00 UTC






