arXiv

QUBRIC: Co-Designing Queries and Rubrics for RL Beyond Verifiable Rewards

June 3, 2026 · Rongzhi Zhang, Rui Feng, Zhihan Zhang, Jingfeng Yang, Qingyu Yin, Xin Liu, Zixuan Zhang, Priyanka Nigam, Bing Yin, Tuo Zhao, Chao Zhang

Title: QUBRIC: Joint Optimization of Queries and Rubrics for Reinforcement Learning Beyond Verifiable Rewards

Abstract:

While rubric-based reinforcement learning (RL) offers a viable path for extending RL beyond strictly verifiable rewards, current methods share a critical limitation: they optimize rubrics while keeping the query distribution fixed. This creates a structural bottleneck, because rubric quality is inherently tied to the structure of the query. Specifically, open-ended queries tend to produce vague rubrics, whereas attempts to narrow their scope often result in fabricated references that no model can verify, so the reward signal collapses to zero and training fails.
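The link between an all-zero reward and failed training can be illustrated with a group-relative advantage of the kind used in GRPO. The snippet below is a minimal sketch, not the paper's implementation: the function name, the `eps` constant, and the choice of population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize each reward against the other rollouts for the same query.

    Assumption: population standard deviation plus a small eps for stability;
    actual GRPO implementations differ in these details.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# A query whose rubric no response can satisfy: every rollout scores 0.
unverifiable = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

# A query where rollouts differ in quality: the advantages carry signal.
mixed = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

When every rollout receives the same reward, each advantage is zero, so the query contributes no gradient regardless of how many responses are sampled.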

To address these challenges, we introduce QUBRIC, a framework that co-designs queries and rubrics. It first uses teacher-derived key points to rewrite open-ended queries into specific, scenario-based questions that can be evaluated. Contrastive rubric generation then converts the gaps between teacher and policy responses into query-level criteria. Finally, learnability filtering retains only informative query-rubric pairs for GRPO training.
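The abstract does not specify the filtering criterion. One plausible reading, sketched below under stated assumptions, is to sample several policy responses per query, score each against the rubric, and keep the pair only when the scores vary and the query is neither trivially solved nor out of reach. The `Candidate` class, the thresholds `min_std`, `low`, and `high`, and the example data are all hypothetical.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import List


@dataclass
class Candidate:
    query: str
    rubric: List[str]
    # Hypothetical: rubric score in [0, 1] for each sampled policy response.
    rollout_scores: List[float]


def is_learnable(c, min_std=0.05, low=0.1, high=0.9):
    """Assumed criterion: scores must vary and the mean must not be saturated."""
    m = mean(c.rollout_scores)
    return pstdev(c.rollout_scores) >= min_std and low <= m <= high


def filter_learnable(candidates, **thresholds):
    """Keep only the query-rubric pairs that pass the assumed criterion."""
    return [c for c in candidates if is_learnable(c, **thresholds)]


pool = [
    Candidate("too hard", ["cites the named statute"], [0.0, 0.0, 0.0, 0.0]),
    Candidate("too easy", ["answers in English"], [1.0, 1.0, 1.0, 1.0]),
    Candidate("informative", ["weighs both parties' duties"], [0.25, 0.75, 0.5, 1.0]),
]
kept = [c.query for c in filter_learnable(pool)]
```

Pairs with identical scores across rollouts are dropped because a group-relative objective such as GRPO derives its signal from differences within the group.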

Our experiments show that QUBRIC delivers a 5.5-point improvement on the ArenaHard benchmark relative to the supervised fine-tuning (SFT) baseline. Notably, although trained exclusively on instruction-following data, the model transfers to three held-out benchmarks covering legal, moral, and narrative reasoning, achieving an average gain of 6.3 points that is concentrated primarily in reasoning-related dimensions. These findings suggest that co-designing queries and rubrics can make rubric-based RL a practical and effective complement to RLVR, particularly for tasks that lack strictly verifiable outcomes.
