arXiv

Test-Time Optimization of Physical Query Plans with LLMs

June 3, 2026 · Mehmet Hamza Erol, Xiangpeng Hao, Federico Bianchi, Ciro Greco, Jacopo Tagliabue, James Zou

Title: Enhancing Physical Query Plans via Test-Time Optimization Using Large Language Models

Abstract: Conventional query optimization depends on cost-based optimizers that forecast execution expenses—such as runtime, memory usage, and I/O—by leveraging statistical models and fixed heuristics. While refining these components demands significant engineering resources, they frequently fail to capitalize on semantic correlations within schemas and queries that could yield superior physical plans. In contrast, Large Language Models (LLMs) possess the ability to interpret column semantics, value distributions, and broader domain contexts, offering insights that classical statistical methods overlook.

This study presents DBPlanBench, a framework built on the DataFusion engine. This tool exposes physical plans via a compact serialized format and facilitates the application of edits proposed by LLMs as JSON patches. Leveraging this infrastructure, we implement a test-time optimization process: an LLM analyzes physical query plans and suggests targeted modifications grounded in semantic reasoning, while an evolutionary search algorithm iteratively refines these proposals. Our approach targets OLAP queries, where the high frequency of execution means that even marginal efficiency improvements result in significant cumulative cost reductions.
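The patch-and-search loop described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the nested-dict plan layout, the `build`/`probe` field names, the toy cost model, and the exhaustive `propose_swaps` function standing in for the LLM are all hypothetical and do not reflect DBPlanBench's actual serialization or API. Patches use only the `replace` operation in the style of JSON Patch (RFC 6902), and the search keeps the cheapest plans from each generation.

```python
import copy

# Hypothetical serialized plan: NOT DBPlanBench's real format.
PLAN = {
    "op": "HashJoin",
    "build": {"op": "Scan", "table": "lineitem", "rows": 6_000_000},
    "probe": {
        "op": "HashJoin",
        "build": {"op": "Scan", "table": "orders", "rows": 1_500_000},
        "probe": {"op": "Scan", "table": "nation", "rows": 25},
    },
}

def resolve(doc, pointer):
    """Return (parent, key) for a non-empty JSON-Pointer-style path like '/probe/build'."""
    parts = [p for p in pointer.split("/") if p]
    parent = doc
    for part in parts[:-1]:
        parent = parent[part]
    return parent, parts[-1]

def apply_patch(plan, patch):
    """Apply a list of 'replace' operations to a copy of the plan."""
    new_plan = copy.deepcopy(plan)
    for op in patch:
        if op["op"] != "replace":
            raise ValueError(f"unsupported op: {op['op']}")
        parent, key = resolve(new_plan, op["path"])
        parent[key] = copy.deepcopy(op["value"])
    return new_plan

def join_paths(node, prefix=""):
    """List the paths of all join nodes ('' denotes the root)."""
    if node["op"] != "HashJoin":
        return []
    paths = [prefix]
    for side in ("build", "probe"):
        paths += join_paths(node[side], f"{prefix}/{side}")
    return paths

def swap_sides_patch(plan, join_path):
    """Build a patch that swaps the build and probe inputs of one join."""
    join = plan
    for part in [p for p in join_path.split("/") if p]:
        join = join[part]
    return [
        {"op": "replace", "path": f"{join_path}/build", "value": join["probe"]},
        {"op": "replace", "path": f"{join_path}/probe", "value": join["build"]},
    ]

def rows(node):
    """Toy output-size estimate: a join emits as many rows as its smaller input."""
    if node["op"] == "Scan":
        return node["rows"]
    return min(rows(node["build"]), rows(node["probe"]))

def cost(node):
    """Toy cost: total rows inserted into hash tables (a stand-in for measured runtime)."""
    if node["op"] == "Scan":
        return 0
    return rows(node["build"]) + cost(node["build"]) + cost(node["probe"])

def propose_swaps(plan):
    """Stand-in for the LLM: propose every single join-side swap as a patch."""
    return [swap_sides_patch(plan, path) for path in join_paths(plan)]

def evolve(plan, generations=3, keep=4):
    """Evolutionary loop: patch every survivor, then keep the cheapest plans."""
    pool = [plan]
    for _ in range(generations):
        children = [apply_patch(p, patch) for p in pool for patch in propose_swaps(p)]
        pool = sorted(pool + children, key=cost)[:keep]
    return pool[0]

best = evolve(PLAN)
print(cost(PLAN), "->", cost(best))
```

In a real setting the cost function would be replaced by executing the patched plan and timing it, and the proposal step by an LLM call; the sketch only shows how small, addressable edits to a serialized plan compose with a selection loop.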

Our evaluation concentrates on join reordering and join-side selection, scenarios in which errors in cardinality estimation tend to multiply. The results indicate median speedups ranging from $1.10\times$ to $1.12\times$ on TPC-H and from $1.05\times$ to $1.07\times$ on TPC-DS, with certain queries experiencing speedups as high as $4.78\times$. Furthermore, we show that optimizations identified at smaller scale factors generalize effectively to larger ones, thereby validating a cost-efficient workflow that scales from small to large datasets.
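As a back-of-the-envelope illustration of why cardinality errors multiply (a sketch under stated assumptions, not a result from the paper): suppose an optimizer estimates the size of a $k$-way join by multiplying per-join selectivities, and each estimated selectivity $\hat{s}_i$ differs from the true value $s_i$ by a factor $q_i = \hat{s}_i / s_i$. The symbols $\hat{s}_i$, $s_i$, $q_i$, $\hat{C}_k$, and $C_k$ are introduced here only for illustration. The ratio of estimated to true cardinality is then

$$\frac{\hat{C}_k}{C_k} = \prod_{i=1}^{k} \frac{\hat{s}_i}{s_i} = \prod_{i=1}^{k} q_i .$$

For example, a uniform $2\times$ overestimate at each of five joins yields an overall error of $2^5 = 32\times$, which can be enough to flip the choice of join order or build side.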
