Cost-Aware Query Routing in RAG: Empirical Analysis of Retrieval Depth Tradeoffs
Title: Balancing Cost and Performance in RAG: An Empirical Study of Retrieval Depth Trade-offs
Abstract
Retrieval-augmented generation (RAG) systems face a three-way trade-off: increasing retrieval depth improves factual accuracy, but it also raises token costs and end-to-end latency. Static retrieval configurations cannot resolve this tension across diverse query types: simple definitional questions incur unnecessary context costs, whereas complex analytical tasks receive too little context when retrieval is shallow. To address this, we present Cost-Aware RAG (CA-RAG), a dynamic routing framework that assigns a strategy bundle to each query. Each bundle pairs a specific retrieval depth, ranging from direct inference with no retrieval to dense retrieval of the top $k=10$ results, with a fixed generation profile. The system selects the bundle that maximizes a utility score, a linear combination of an estimated quality prior and normalized penalties for predicted latency and total billed tokens.
CA-RAG is built on FAISS for dense retrieval and OpenAI’s chat and embedding APIs, and was evaluated on a 28-query benchmark with a catalog of four bundles. The router makes use of all four bundles, yielding a 26% reduction in billed tokens compared with always using heavy retrieval and a 34% decrease in mean latency relative to always using direct inference, without compromising answer quality. Per-query analysis shows that the cost savings are not evenly distributed but come primarily from simpler queries, which highlights the need for complexity-aware safeguards. Sensitivity analysis further shows that the same bundle catalog can support different cost-latency-quality balances simply by adjusting the utility weights. All findings are derived directly from logged CSV data to ensure reproducibility. CA-RAG offers a transparent and auditable basis for cost-efficient LLM deployment.
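The routing rule described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the bundle names, the numeric priors and cost estimates, the min-max normalization, and the weight names (`w_quality`, `w_latency`, `w_tokens`) are all assumptions chosen for illustration, and the sketch uses a fixed quality prior per bundle rather than a per-query estimate. It only shows how a linear utility over a small catalog can pick different bundles as the weights change.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bundle:
    name: str             # hypothetical label, not taken from the paper
    top_k: int            # retrieval depth; 0 means direct inference
    quality_prior: float  # assumed quality estimate in [0, 1]
    latency_s: float      # assumed predicted latency in seconds
    billed_tokens: int    # assumed predicted billed tokens


# Placeholder catalog of four bundles; every number here is illustrative only.
CATALOG = [
    Bundle("direct", 0, 0.55, 1.0, 300),
    Bundle("light", 3, 0.70, 1.5, 900),
    Bundle("medium", 5, 0.80, 2.0, 1500),
    Bundle("heavy", 10, 0.90, 3.0, 3000),
]


def normalize(value, values):
    """Min-max normalize value against the catalog range (an assumed scheme)."""
    lo, hi = min(values), max(values)
    return 0.0 if hi == lo else (value - lo) / (hi - lo)


def utility(bundle, catalog, w_quality, w_latency, w_tokens):
    """Linear utility: quality prior minus weighted normalized penalties."""
    lat = normalize(bundle.latency_s, [b.latency_s for b in catalog])
    tok = normalize(bundle.billed_tokens, [b.billed_tokens for b in catalog])
    return w_quality * bundle.quality_prior - w_latency * lat - w_tokens * tok


def route(catalog, w_quality=1.0, w_latency=0.2, w_tokens=0.2):
    """Return the bundle with the highest utility under the given weights."""
    return max(
        catalog,
        key=lambda b: utility(b, catalog, w_quality, w_latency, w_tokens),
    )


if __name__ == "__main__":
    # Same catalog, three weightings: balanced, quality-only, cost-heavy.
    for weights in [(1.0, 0.2, 0.2), (1.0, 0.0, 0.0), (1.0, 1.0, 1.0)]:
        print(weights, "->", route(CATALOG, *weights).name)
```

With these placeholder values, the balanced weighting selects the mid-depth bundle, a quality-only weighting selects the deepest retrieval, and a cost-heavy weighting falls back to direct inference, which mirrors the kind of weight sensitivity the abstract describes.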
Source: arXiv Generated at: 2026-06-03 00:00:00 UTC



