arXiv

R3G: A Reasoning-Retrieval-Reranking Framework for Vision-Centric Answer Generation

Original: arXiv:2602.00104v3 Announce Type: replace-cross Abstract: Vision-centric retrieval for VQA requires retrieving images to supply missing visual cues and integrating them into the reasoning process. However, selecting the right images and integrating them effectively into the model's reasoning remains challenging. To address this challenge, we propose R3G, a modular Reasoning-Retrieval-Reranking framework. It first produces a brief reasoning plan that specifies the required visual cues, then adopts a two-stage strategy, with coarse retrieval followed by fine-grained reranking, to select evidence images. On MRAG-Bench, R3G improves accuracy across six MLLM backbones and nine sub-scenarios, achieving state-of-the-art overall performance. Ablations show that sufficiency-aware reranking and reasoning steps are complementary, helping the model both choose the right images and use them well. We release code and data at https://github.com/czh24/R3G.

Rewrite:

arXiv:2602.00104v3 Announce Type: replace-cross

Abstract: In visual question answering (VQA), vision-centric retrieval means fetching images that supply missing visual cues and integrating them into the model's reasoning process. Selecting the right images and integrating them effectively remains difficult. To address this, we introduce R3G, a modular Reasoning-Retrieval-Reranking framework. The system first generates a brief reasoning plan that specifies the required visual cues. It then uses a two-stage strategy, coarse retrieval followed by fine-grained reranking, to select evidence images. On MRAG-Bench, R3G improves accuracy across six MLLM backbones and nine sub-scenarios and achieves state-of-the-art overall performance. Ablation studies indicate that sufficiency-aware reranking and reasoning steps are complementary, helping the model both choose the right images and use them well. Code and data are available at https://github.com/czh24/R3G.
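
The plan-then-retrieve-then-rerank flow described in the abstract can be illustrated with a toy sketch. This is not the R3G implementation: the names `Candidate`, `coarse_retrieve`, `rerank`, and `select_evidence` are hypothetical, the embeddings are hand-written vectors, and the reranking score is a simple cue-overlap stand-in for whatever sufficiency-aware scoring the authors use.

```python
import math
from dataclasses import dataclass


@dataclass
class Candidate:
    image_id: str
    embedding: list[float]  # toy vector standing in for an image embedding (assumption)
    cues: set[str]          # toy labels for visual cues the image shows (assumption)


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def coarse_retrieve(query_vec: list[float], corpus: list[Candidate], k: int) -> list[Candidate]:
    """Stage 1: keep the k candidates most similar to the query vector."""
    ranked = sorted(corpus, key=lambda c: cosine(query_vec, c.embedding), reverse=True)
    return ranked[:k]


def rerank(required_cues: set[str], shortlist: list[Candidate], top_n: int) -> list[Candidate]:
    """Stage 2: reorder the shortlist by how many required cues each image covers.

    Cue overlap is an illustrative proxy, not the paper's scoring function.
    """
    def coverage(c: Candidate) -> float:
        if not required_cues:
            return 0.0
        return len(required_cues & c.cues) / len(required_cues)

    return sorted(shortlist, key=coverage, reverse=True)[:top_n]


def select_evidence(required_cues: set[str], query_vec: list[float],
                    corpus: list[Candidate], k: int = 3, top_n: int = 1) -> list[Candidate]:
    """Coarse retrieval followed by fine-grained reranking against the planned cues."""
    shortlist = coarse_retrieve(query_vec, corpus, k)
    return rerank(required_cues, shortlist, top_n)
```

In this sketch the reasoning plan is reduced to a set of required cue labels passed as `required_cues`. The point of the two stages is that the image closest to the query in embedding space is not necessarily the one that shows the cues the plan asks for, so the second stage can promote a lower-ranked candidate from the shortlist.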


Source: arXiv Generated at: 2026-06-04 00:00:00 UTC
