arXiv

QO-Bench: Diagnosing Query-Operator-Preserving Retrieval over Typed Event Tuples

June 4, 2026 · Mengao Zhang, Xiang Yang, Chang Liu, Tianhui Tan, Ke-wei Huang


Abstract:

Many questions in business, legal, and scientific domains are essentially database queries, posed in natural language, over records embedded in text. While current Retrieval-Augmented Generation (RAG) systems excel at semantic relevance, retrieving plausible passages does not ensure that the query is executed correctly. To address this, we present QO-Bench, a diagnostic benchmark for question answering over typed event tuples and specific query operators.

The benchmark comprises 22,984 news articles and 614 corporate events, structured around 18 distinct query templates and assessed through 785 questions. Gold answers are deterministically derived from typed event tuples and evaluated based on recall. Unlike approaches relying on LLM judges, our method matches answers to gold tuples via exact match, facilitating operator-level diagnostics for operations such as joins and intersections.
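The scoring described above can be illustrated with a minimal sketch. The `EventTuple` fields and the `exact_match_recall` helper are hypothetical names chosen for illustration; the abstract does not specify the benchmark's actual tuple schema or any normalization applied before matching.

```python
from datetime import date
from typing import Iterable, NamedTuple, Optional


class EventTuple(NamedTuple):
    """A typed event record. All field names here are assumptions."""
    company: str
    event_type: str
    event_date: date
    amount_musd: Optional[float] = None


def exact_match_recall(predicted: Iterable[EventTuple],
                       gold: Iterable[EventTuple]) -> float:
    """Fraction of gold tuples that appear, field for field, in the predictions."""
    gold_set = set(gold)
    if not gold_set:
        # Assumed convention: an empty gold set counts as fully recalled.
        return 1.0
    hits = gold_set & set(predicted)
    return len(hits) / len(gold_set)


gold = [
    EventTuple("Acme", "acquisition", date(2024, 3, 1), 120.0),
    EventTuple("Birch", "acquisition", date(2024, 5, 9), 45.0),
]
predicted = [
    EventTuple("Acme", "acquisition", date(2024, 3, 1), 120.0),
    # Right passage, wrong typed value: this does not count as a match.
    EventTuple("Birch", "acquisition", date(2024, 5, 9), 54.0),
]
score = exact_match_recall(predicted, gold)
```

Because matching is exact, a prediction that points to the right event but carries a wrong typed value earns no credit, which is what makes per-operator diagnostics possible without an LLM judge.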

Under controlled conditions, we evaluated RAG, ReAct RAG, GraphRAG, and information-extraction-to-SQL pipelines, utilizing a long-context oracle ceiling to isolate retrieval performance. We propose a two-axis framework that predicts failure points based on index-time preservation versus query-time execution, a hypothesis supported by our findings. The results indicate that while systems successfully retrieve relevant text, they frequently discard the typed values required by operators. Consequently, the ranking of deployable paradigms shifts depending on the operator: similarity retrieval performs best on filtering and projection, whereas extraction-to-SQL excels in counting and intersection tasks.
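To make the operator distinction concrete, here is a small sketch of filter, projection, count, and intersection queries over typed event tuples, in the spirit of an extraction-to-SQL pipeline. The table schema, the sample rows, and the queries are illustrative assumptions, not the benchmark's actual templates; only Python's standard `sqlite3` module is used.

```python
import sqlite3

# Hypothetical extracted tuples: (company, event_type, year).
ROWS = [
    ("Acme", "acquisition", 2023),
    ("Acme", "layoff", 2024),
    ("Birch", "acquisition", 2024),
    ("Cedar", "layoff", 2024),
]


def build_event_db(rows):
    """Load extracted tuples into an in-memory table (assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE events (company TEXT, event_type TEXT, year INTEGER)"
    )
    conn.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)
    return conn


conn = build_event_db(ROWS)

# Filter + projection: which companies made an acquisition in 2024?
filtered = [r[0] for r in conn.execute(
    "SELECT company FROM events "
    "WHERE event_type = 'acquisition' AND year = 2024 ORDER BY company"
)]

# Count: how many layoff events were recorded?
layoff_count = conn.execute(
    "SELECT COUNT(*) FROM events WHERE event_type = 'layoff'"
).fetchone()[0]

# Intersection: which companies have both an acquisition and a layoff?
both = [r[0] for r in conn.execute(
    "SELECT company FROM events WHERE event_type = 'acquisition' "
    "INTERSECT "
    "SELECT company FROM events WHERE event_type = 'layoff' "
    "ORDER BY company"
)]
```

The count and intersection queries only give correct answers if every relevant tuple, with its typed values, survived indexing. A system that returns a few relevant passages can still answer the filter question while missing these.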

Furthermore, even when provided with gold evidence, the long-context oracle remains far from saturation. This demonstrates that operator execution, rather than retrieval alone, constitutes a fundamental bottleneck that cannot be resolved simply by enhancing the answer model. QO-Bench thus shifts the primary objective from mere passage relevance to ensuring query-operator-preserving retrieval.
