arXiv

When Retrieval Doesn't Help: A Large-Scale Study of Biomedical RAG

June 4, 2026 · Erfan Nourbakhsh, Rocky Slavin, Ke Yang, Anthony Rios

Title: The Limits of Retrieval: A Comprehensive Analysis of Biomedical RAG Performance

Medical question answering is a high-stakes domain in which factual errors can have severe consequences. Retrieval-Augmented Generation (RAG) is widely regarded as a promising remedy, and previous research has reported significant performance gains for large medical QA models. This study re-examines that assumption by testing a diverse set of open-weight, instruction-tuned models ranging from 7 billion to 72 billion parameters.

Our evaluation covers five models, ten biomedical QA datasets, four retrieval methods, and four retrieval corpora. Adding retrieval yields only marginal and inconsistent gains over a no-retrieval baseline, typically 1 to 2 points. By contrast, the choice of backbone model has a far larger effect on performance than the choice of retriever or corpus. Retrieval sources written for experts and those written for laypeople also yield comparable results in most settings.
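To give a sense of the scale these counts imply, the evaluation grid can be sketched as below. This is an illustrative sketch only: the labels are placeholders because the abstract gives counts rather than names, and the full cross product is an assumption, since the study may have run only a subset of combinations.

```python
from itertools import product

# Placeholder labels only: the abstract states the counts, not the names.
models = [f"model_{i}" for i in range(5)]
datasets = [f"dataset_{i}" for i in range(10)]
retrievers = [f"retriever_{i}" for i in range(4)]
corpora = [f"corpus_{i}" for i in range(4)]

# Each retrieval setting pairs one retriever with one corpus.
retrieval_settings = list(product(retrievers, corpora))

# Assumption: a full cross product of models, datasets, and settings.
rag_runs = list(product(models, datasets, retrieval_settings))

# One no-retrieval baseline per (model, dataset) pair for comparison.
baseline_runs = list(product(models, datasets))

print(len(retrieval_settings), len(rag_runs), len(baseline_runs))
# prints: 16 800 50
```

Under that assumption, each of the 50 model and dataset pairs has one baseline score and up to 16 retrieval-augmented scores to compare against it.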

These results indicate that the primary constraint is not simply the quality of the retrieved information but the models' difficulty in making effective use of the evidence provided to them.
