arXiv

QO-Bench: Diagnosing Query-Operator-Preserving Retrieval over Typed Event Tuples

Title: QO-Bench: Evaluating Query-Operator-Preserving Retrieval Across Typed Event Tuples

Abstract:

Many questions in business, legal, and scientific domains are essentially natural-language database queries over records embedded in text. While current Retrieval-Augmented Generation (RAG) systems excel at semantic relevance, retrieving plausible passages does not guarantee that the query is executed correctly. To address this, we present QO-Bench, a diagnostic benchmark for question answering over typed event tuples and specific query operators.

The benchmark comprises 22,984 news articles and 614 corporate events, structured around 18 distinct query templates and assessed through 785 questions. Gold answers are deterministically derived from typed event tuples and evaluated based on recall. Unlike approaches relying on LLM judges, our method matches answers to gold tuples via exact match, facilitating operator-level diagnostics for operations such as joins and intersections.
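As a rough illustration of how a gold answer might be derived deterministically from typed event tuples and then scored by exact-match recall, consider the minimal sketch below. The abstract does not specify the tuple schema, the operator implementations, or the scoring code, so the `EventTuple` fields, the toy events, and the helper names are all hypothetical, as is the convention of returning 1.0 for an empty gold set.

```python
from dataclasses import dataclass
from datetime import date


# Hypothetical schema: the field names are illustrative assumptions,
# not the benchmark's actual tuple definition.
@dataclass(frozen=True)
class EventTuple:
    company: str
    event_type: str
    event_date: date


# Toy data invented for this sketch only.
EVENTS = [
    EventTuple("Acme", "acquisition", date(2023, 3, 1)),
    EventTuple("Acme", "layoff", date(2023, 6, 15)),
    EventTuple("Globex", "acquisition", date(2023, 4, 10)),
    EventTuple("Initech", "layoff", date(2023, 7, 2)),
]


def companies_with(event_type):
    """Filter + projection: companies with at least one event of this type."""
    return {e.company for e in EVENTS if e.event_type == event_type}


def gold_intersection(type_a, type_b):
    """Intersection operator: companies that have both event types."""
    return companies_with(type_a) & companies_with(type_b)


def exact_match_recall(predicted, gold):
    """Fraction of gold items that appear verbatim among the predictions."""
    gold = set(gold)
    if not gold:
        # Assumed convention for an empty gold set.
        return 1.0
    return len(set(predicted) & gold) / len(gold)


if __name__ == "__main__":
    gold = gold_intersection("acquisition", "layoff")
    print(sorted(gold))
    print(exact_match_recall(["Acme", "Globex"], gold))
```

Because the gold set is computed from the tuples rather than judged by a model, a system that retrieves relevant passages but loses the typed `event_type` value cannot recover the intersection, which is the kind of operator-level failure the benchmark is designed to expose.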

Under controlled conditions, we evaluate RAG, ReAct RAG, GraphRAG, and information-extraction-to-SQL pipelines, using a long-context oracle as a ceiling to isolate retrieval performance. We propose a two-axis framework that predicts where each system fails according to index-time preservation versus query-time execution, and our findings support its predictions. The results indicate that while systems successfully retrieve relevant text, they frequently discard the typed values required by operators. Consequently, the ranking of deployable paradigms shifts depending on the operator: similarity retrieval performs best on filtering and projection, whereas extraction-to-SQL excels in counting and intersection tasks.

Furthermore, even when provided with gold evidence, the long-context oracle remains far from saturation. This demonstrates that operator execution, rather than retrieval alone, constitutes a fundamental bottleneck that cannot be resolved simply by enhancing the answer model. QO-Bench thus shifts the primary objective from mere passage relevance to ensuring query-operator-preserving retrieval.


Source: arXiv Generated at: 2026-06-04 00:00:00 UTC
