SIRIUS-SQL: Anchoring Multi-Candidate Text-to-SQL in Execution Feedback
Title: SIRIUS-SQL: Leveraging Execution Feedback to Anchor Multi-Candidate Text-to-SQL Generation
Abstract:
Generating accurate SQL queries for complex schemas remains a challenge when relying on single-pass approaches. To mitigate this, contemporary systems often produce multiple SQL candidates and employ voting mechanisms to discard errors. However, voting in isolation proves insufficient due to three interconnected limitations inherent in multi-candidate strategies. First, drawing additional samples from a single generator yields increasingly redundant outputs. Second, current pipelines typically apply a uniform correction to all non-clean execution results, ignoring the fact that runtime errors, timeouts, and empty results signify varying degrees of deviation from the correct answer. Third, existing selection methods depend on a singular perspective (such as majority voting on results or pairwise SQL comparisons), failing to capture insights from other analytical angles.
SIRIUS-SQL is introduced to resolve these three specific weaknesses. The system utilizes a difficulty-smoothing reinforcement learning (RL) framework to train SIRIUS-32B, enabling the generation of diverse, executable SQL candidates. This specialist model is complemented by a generalist LLM designed to address any gaps left by the primary generator. The pipeline incorporates an execution-grounded lifecycle that categorizes each outcome and applies precise repairs before returning viable candidates to the pool. Furthermore, a confidence-gated hybrid selector merges execution-result consensus with pairwise SQL-form evaluation, resorting to a deterministic structural check only for closely contested cases.
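The two execution-facing ideas above, labeling each candidate's outcome and gating the selection on how strong the result consensus is, can be illustrated with a minimal sketch. It is not the SIRIUS-SQL implementation: the names `Outcome`, `classify`, and `select`, the `gate=0.6` threshold, and the toy `emp` table are all assumptions made for illustration. Timeouts and the repair step are omitted, and the low-confidence branch uses a placeholder tie-break (shortest SQL text) in place of the pairwise SQL-form evaluation and structural check that the abstract describes.

```python
import sqlite3
from collections import Counter
from enum import Enum


class Outcome(Enum):
    """Hypothetical outcome labels; the abstract also mentions timeouts."""
    OK = "ok"
    EMPTY = "empty"
    RUNTIME_ERROR = "runtime_error"


def classify(conn, sql):
    """Execute one candidate and return (outcome, payload)."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error as exc:
        # Payload is the error message a repair step could consume.
        return Outcome.RUNTIME_ERROR, str(exc)
    if not rows:
        return Outcome.EMPTY, None
    # Order-insensitive fingerprint of the result multiset.
    return Outcome.OK, frozenset(Counter(rows).items())


def select(conn, candidates, gate=0.6):
    """Pick a candidate by result consensus when the vote share clears the gate."""
    votes = Counter()
    first_sql = {}  # one representative SQL string per distinct result
    for sql in candidates:
        outcome, payload = classify(conn, sql)
        if outcome is Outcome.OK:
            votes[payload] += 1
            first_sql.setdefault(payload, sql)
    if not votes:
        return None, "no_viable_candidate"
    top, count = votes.most_common(1)[0]
    if count / sum(votes.values()) >= gate:
        return first_sql[top], "consensus"
    # Assumed stand-in for the paper's pairwise and structural checks.
    return min(first_sql.values(), key=len), "fallback"


# Toy database and candidate pool (illustrative only).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE emp (name TEXT, dept TEXT);
INSERT INTO emp VALUES ('ann', 'eng'), ('bob', 'eng'), ('cy', 'ops');
""")
candidates = [
    "SELECT name FROM emp WHERE dept = 'eng'",
    "SELECT name FROM emp WHERE dept = 'eng' ORDER BY name DESC",
    "SELECT nme FROM emp",                      # runtime error
    "SELECT name FROM emp WHERE dept = 'hr'",   # empty result
    "SELECT name FROM emp",
]
chosen, decided_by = select(conn, candidates)
```

In this toy pool, two of the three executable candidates return the same rows, so their vote share clears the assumed 0.6 gate and the consensus branch decides. Raising the gate forces the fallback branch, which is where a real system would spend its more expensive comparison.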
In evaluations, SIRIUS-SQL achieved a score of 75.88% on the BIRD development set and 91.20% on the SPIDER test set. Notably, two out of three generalist pairings outperformed Agentar-Scale-SQL, which currently stands as the most robust published multi-candidate system on the BIRD development benchmark.
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC




