arXiv

SIRIUS-SQL: Anchoring Multi-Candidate Text-to-SQL in Execution Feedback

June 2, 2026 · Leo Luo, Haining Xie, Siqi Shen, Zhipeng Ma, Rui Ling, Hang Xu, Hefeng Jiang, Dingwei Chen, Yang Li, Peng Chen, Jie Jiang

Title: SIRIUS-SQL: Leveraging Execution Feedback to Anchor Multi-Candidate Text-to-SQL Generation

Abstract:

Generating accurate SQL queries for complex schemas remains difficult for single-pass approaches. To mitigate this, contemporary systems often produce multiple SQL candidates and use voting to filter out errors. Voting alone is insufficient, however, because multi-candidate strategies share three interconnected limitations. First, drawing additional samples from a single generator yields increasingly redundant outputs. Second, current pipelines typically apply one uniform correction to every non-clean execution result, even though runtime errors, timeouts, and empty results indicate different degrees of deviation from the correct answer. Third, existing selection methods rely on a single signal, such as majority voting over execution results or pairwise SQL comparison, and so miss evidence that other signals would provide.
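The second limitation is easier to see with the outcome types separated explicitly. The following is a minimal sketch, not the paper's implementation: it assumes a SQLite backend, and the names `Outcome` and `classify_execution` as well as the time budget are hypothetical choices made for illustration.

```python
import sqlite3
import time
from enum import Enum


class Outcome(Enum):
    """Hypothetical labels for the execution outcomes named in the abstract."""
    OK = "ok"
    EMPTY = "empty_result"
    RUNTIME_ERROR = "runtime_error"
    TIMEOUT = "timeout"


def classify_execution(conn, sql, timeout_s=5.0):
    """Run `sql` on `conn` and return (Outcome, rows or error message)."""
    deadline = time.monotonic() + timeout_s
    timed_out = False

    def check_deadline():
        # Called by SQLite every 1000 VM instructions; nonzero aborts the query.
        nonlocal timed_out
        if time.monotonic() > deadline:
            timed_out = True
            return 1
        return 0

    conn.set_progress_handler(check_deadline, 1000)
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error as exc:
        if timed_out:
            return Outcome.TIMEOUT, "query exceeded time budget"
        return Outcome.RUNTIME_ERROR, str(exc)
    finally:
        # Clear the handler so later queries on this connection are unaffected.
        conn.set_progress_handler(None, 1)
    if rows:
        return Outcome.OK, rows
    return Outcome.EMPTY, rows
```

With the outcomes separated this way, a pipeline can route each label to a different repair step instead of applying one correction to all of them; which repair fits which label is left open here.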

SIRIUS-SQL addresses each of these three weaknesses. The system uses a difficulty-smoothing reinforcement learning (RL) framework to train SIRIUS-32B, which generates diverse, executable SQL candidates. This specialist model is complemented by a generalist LLM that covers gaps left by the primary generator. The pipeline incorporates an execution-grounded lifecycle that categorizes each execution outcome and applies a targeted repair before returning viable candidates to the pool. Finally, a confidence-gated hybrid selector combines execution-result consensus with pairwise SQL-form evaluation, resorting to a deterministic structural check only for closely contested cases.
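The confidence-gated selection idea can be sketched roughly as follows. This is an illustrative reading of the abstract, not the authors' algorithm: the function name `select_candidate`, the `gate` and `margin` thresholds, the `pairwise_judge` callable, and the token-count tie-break are all assumptions introduced here.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, List, Sequence, Tuple


def select_candidate(
    candidates: Sequence[Tuple[str, Hashable]],
    pairwise_judge: Callable[[str, str], float],
    gate: float = 0.6,
    margin: float = 0.1,
) -> str:
    """Pick one SQL string from (sql, execution_result_key) pairs.

    `pairwise_judge(a, b)` is a hypothetical scorer returning a value in
    [0, 1]; values above 0.5 mean `a` is preferred over `b`.
    """
    if not candidates:
        raise ValueError("no candidates")

    # Group candidates whose executions produced the same result.
    groups: Dict[Hashable, List[str]] = defaultdict(list)
    for sql, result_key in candidates:
        groups[result_key].append(sql)

    # Execution-result consensus: the largest group's share of all votes.
    ranked = sorted(groups.values(), key=len, reverse=True)
    top = ranked[0]
    share = len(top) / len(candidates)
    if share >= gate or len(ranked) == 1:
        return top[0]  # confident enough: skip the pairwise step

    # Low confidence: compare representatives of the two largest groups.
    a, b = top[0], ranked[1][0]
    preference = pairwise_judge(a, b)
    if abs(preference - 0.5) > margin:
        return a if preference > 0.5 else b

    # Near tie: deterministic structural fallback (fewer tokens, then text order).
    return min((a, b), key=lambda s: (len(s.split()), s))
```

The gate keeps the more expensive pairwise comparison off the common path: it runs only when the vote share of the leading result group falls below the threshold, and the structural check runs only when that comparison is itself inconclusive.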

In evaluations, SIRIUS-SQL achieved a score of 75.88% on the BIRD development set and 91.20% on the SPIDER test set. Notably, two out of three generalist pairings outperformed Agentar-Scale-SQL, which currently stands as the most robust published multi-candidate system on the BIRD development benchmark.
