R3G: A Reasoning-Retrieval-Reranking Framework for Vision-Centric Answer Generation
Original: arXiv:2602.00104v3 Announce Type: replace-cross Abstract: Vision-centric retrieval for VQA requires retrieving images to supply missing visual cues and integrating them into the reasoning process. However, selecting the right images and integrating them effectively into the model's reasoning remains challenging. To address this challenge, we propose R3G, a modular Reasoning-Retrieval-Reranking framework. It first produces a brief reasoning plan that specifies the required visual cues, then adopts a two-stage strategy, with coarse retrieval followed by fine-grained reranking, to select evidence images. On MRAG-Bench, R3G improves accuracy across six MLLM backbones and nine sub-scenarios, achieving state-of-the-art overall performance. Ablations show that sufficiency-aware reranking and reasoning steps are complementary, helping the model both choose the right images and use them well. We release code and data at https://github.com/czh24/R3G.
Rewrite:
arXiv:2602.00104v3 Announcement Type: replace-cross
Abstract: In visual question answering (VQA), vision-centric retrieval involves fetching images to supply missing visual cues and incorporating them into the model's reasoning process. Nevertheless, identifying the right images and integrating them effectively into that reasoning remains difficult. To overcome these obstacles, we introduce R3G, a modular framework built on a Reasoning-Retrieval-Reranking architecture. The system begins by generating a brief reasoning plan that specifies the required visual cues. It then employs a two-stage strategy, coarse retrieval followed by fine-grained reranking, to select evidence images. When evaluated on MRAG-Bench, R3G improved accuracy across six MLLM backbones and nine sub-scenarios, achieving state-of-the-art overall performance. Ablation studies indicate that sufficiency-aware reranking and reasoning steps are complementary, helping the model both choose the right images and use them well. The associated code and data are publicly available at https://github.com/czh24/R3G.
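The plan-then-retrieve-then-rerank flow described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the `Candidate` type, the cosine-similarity coarse stage, and the cue-coverage score used as a stand-in for sufficiency-aware reranking are all hypothetical simplifications, and the embeddings and cue labels are made-up toy data.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class Candidate:
    """Hypothetical record for one image in the retrieval corpus."""
    image_id: str
    embedding: list   # toy stand-in for an image embedding
    cues: set         # toy stand-in for the visual cues the image shows


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def coarse_retrieve(query_embedding, corpus, k):
    """Stage 1 (assumed): keep the k candidates most similar to the query."""
    ranked = sorted(
        corpus,
        key=lambda c: cosine(query_embedding, c.embedding),
        reverse=True,
    )
    return ranked[:k]


def rerank(required_cues, candidates, n):
    """Stage 2 (assumed): order the shortlist by the fraction of
    required cues each image covers, then keep the top n."""
    def coverage(c):
        if not required_cues:
            return 0.0
        return len(required_cues & c.cues) / len(required_cues)

    return sorted(candidates, key=coverage, reverse=True)[:n]


# Toy data: the "reasoning plan" is reduced to a set of required cues.
corpus = [
    Candidate("img_a", [1.0, 0.0], {"wing pattern"}),
    Candidate("img_b", [0.9, 0.1], {"wing pattern", "beak shape"}),
    Candidate("img_c", [0.0, 1.0], {"beak shape", "wing pattern"}),
]
plan_cues = {"wing pattern", "beak shape"}
query_embedding = [1.0, 0.0]

shortlist = coarse_retrieve(query_embedding, corpus, k=2)
evidence = rerank(plan_cues, shortlist, n=1)
print([c.image_id for c in shortlist])  # ['img_a', 'img_b']
print([c.image_id for c in evidence])   # ['img_b']
```

In this toy run, `img_a` is the closest match in embedding space, but `img_b` covers both cues named in the plan, so the second stage promotes it. That gap between similarity and sufficiency is the kind of case a reranking stage is meant to handle.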
Source: arXiv Generated at: 2026-06-04 00:00:00 UTC






