arXiv

QUBRIC: Co-Designing Queries and Rubrics for RL Beyond Verifiable Rewards

Title: QUBRIC: Joint Optimization of Queries and Rubrics for Reinforcement Learning Beyond Verifiable Rewards

Abstract:

While rubric-based reinforcement learning (RL) offers a viable path for extending RL beyond strictly verifiable rewards, current methods share a critical limitation: they optimize rubrics while keeping the query distribution fixed. This creates a structural bottleneck, because rubric quality is inherently tied to query structure. Open-ended queries tend to produce vague rubrics, whereas narrowing their scope often introduces fabricated references that no model can verify, yielding zero reward signals and failed training.
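Why does an unverifiable rubric stop training rather than merely slow it? Under group-relative advantage estimation as in GRPO, which QUBRIC uses for training, each sampled response's reward is compared with the statistics of its own group. The following is a minimal sketch, assuming the common mean-and-standard-deviation normalization and a hypothetical `eps` stabilizer. If a rubric cannot be satisfied, every response scores zero, every advantage is zero, and that query contributes no gradient.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style advantages: normalize each reward by its group's mean and std.

    `eps` is a hypothetical stabilizer that avoids division by zero when all
    rewards in the group are identical.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# A rubric built on a fabricated reference: no sampled response can satisfy it.
unsatisfiable = [0.0, 0.0, 0.0, 0.0]
# A rubric that some, but not all, sampled responses satisfy.
informative = [0.0, 1.0, 0.5, 1.0]

print(group_relative_advantages(unsatisfiable))  # [0.0, 0.0, 0.0, 0.0]
print(group_relative_advantages(informative))    # mixed signs: a usable signal
```

The same zero-advantage outcome occurs when every response satisfies the rubric fully, which is one reason to filter out pairs that are uniformly too hard or too easy.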

To address these challenges, we introduce QUBRIC, a framework that co-designs queries and rubrics. First, teacher-derived key points are used to rewrite open-ended queries into specific, evaluable, scenario-based questions. Next, contrastive rubric generation converts gaps between the teacher and the policy into query-level criteria. Finally, to keep training efficient, learnability filtering retains only informative query-rubric pairs for GRPO training.
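The abstract does not say how rubric rewards are computed or which learnability criterion is applied. The following is a minimal sketch under stated assumptions. A rubric is a list of weighted criteria (the hypothetical `Criterion` class). A judge model, not shown, marks each criterion as satisfied or not. A response's reward is the weighted fraction of satisfied criteria (the hypothetical `rubric_score`). A query-rubric pair is kept only when the policy's mean score over a group of samples lies strictly between hypothetical bounds `low` and `high` and the scores are not all identical (the hypothetical `is_learnable`).

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    """One rubric criterion (hypothetical structure; not defined in the abstract)."""
    description: str
    weight: float


def rubric_score(satisfied: list[bool], rubric: list[Criterion]) -> float:
    """Weighted fraction of criteria a response satisfies, in [0, 1].

    `satisfied[i]` is the verdict for rubric[i], assumed to come from a judge
    model that is not shown here.
    """
    total = sum(c.weight for c in rubric)
    earned = sum(c.weight for c, ok in zip(rubric, satisfied) if ok)
    return earned / total if total > 0 else 0.0


def is_learnable(scores: list[float], low: float = 0.05, high: float = 0.95) -> bool:
    """Keep a pair only if policy samples are neither almost all failing nor
    almost all passing, and actually differ (hypothetical thresholds)."""
    mean_score = sum(scores) / len(scores)
    return low < mean_score < high and max(scores) > min(scores)


# Illustrative rubric for one rewritten, scenario-based query.
rubric = [
    Criterion("Addresses the constraint stated in the scenario", 2.0),
    Criterion("Justifies the recommendation with explicit reasoning", 1.0),
    Criterion("Avoids citing sources it cannot verify", 1.0),
]

# Judge verdicts for a group of four policy samples per candidate pair.
groups = {
    "informative": [[True, True, False], [True, False, False],
                    [False, False, False], [True, True, True]],
    "too_hard": [[False, False, False]] * 4,
    "too_easy": [[True, True, True]] * 4,
}

kept = [
    name for name, verdicts in groups.items()
    if is_learnable([rubric_score(v, rubric) for v in verdicts])
]
print(kept)  # ['informative']
```

The filter targets the same failure described in the abstract. Pairs that every sample fails, or that every sample passes, give a group no reward variance, so they are dropped before GRPO training.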

Our experiments show that QUBRIC improves ArenaHard by 5.5 points over the supervised fine-tuning (SFT) baseline. Notably, a model trained only on instruction-following data transfers to three held-out benchmarks covering legal, moral, and narrative reasoning, with an average gain of 6.3 points; these gains are concentrated in reasoning-related dimensions. These findings suggest that co-designing queries and rubrics can make rubric-based RL a practical and effective complement to reinforcement learning with verifiable rewards (RLVR), particularly for tasks that lack strictly verifiable outcomes.


Source: arXiv
Generated at: 2026-06-03 00:00:00 UTC
