arXiv

Deep Research as Rubric for Reinforcement Learning

Abstract

Open-ended reasoning and long-form generation tasks lack reliable automatic verification signals, which hinders reward-based policy optimization. Rubrics offer a viable alternative, but current methods typically treat them as pre-existing artifacts, whether manually crafted or generated via prompts, and frequently overlook critical task-specific, knowledge-heavy dimensions. This oversight often distorts the resulting reward signal. We posit that constructing rubrics is itself a research challenge: determining the criteria for a correct or insightful response requires discovering and synthesizing external knowledge.

To address this, we introduce Deep Research as Rubric (DR-rubric), a two-stage framework for building such comprehensive rubrics. In Stage I, the system uses iterative, multi-turn agentic search to uncover domain facts, structural constraints, and potential failure modes. Stage II then distills the gathered evidence into atomic, independently verifiable constraints suited to policy optimization with Group Relative Policy Optimization (GRPO). Notably, because the model being trained can serve as its own rubric generator, DR-rubric-8B enables bootstrap rubric generation without relying on frontier-model assistance.
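As a rough illustration of how atomic, independently verifiable constraints could feed a GRPO-style update, the sketch below scores each sampled response as the weighted fraction of rubric items it satisfies, then standardizes the scores within the sampled group. It is not the paper's implementation: the `RubricItem` schema, the keyword-based `check` functions (stand-ins for an LLM judge), the weights, and the use of population standard deviation are all assumptions made for this example.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import Callable, List


@dataclass(frozen=True)
class RubricItem:
    """One atomic, independently checkable constraint (hypothetical schema)."""
    description: str
    check: Callable[[str], bool]  # stand-in for a judge-model call (assumption)
    weight: float = 1.0


def rubric_reward(response: str, rubric: List[RubricItem]) -> float:
    """Weighted fraction of rubric items the response satisfies, in [0, 1]."""
    total = sum(item.weight for item in rubric)
    if total == 0:
        return 0.0
    earned = sum(item.weight for item in rubric if item.check(response))
    return earned / total


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantages: standardize rewards within one sampled group.

    Population standard deviation is an assumption of this sketch.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy rubric with keyword checks; a real system would query a judge model.
rubric = [
    RubricItem("Mentions GRPO", lambda s: "grpo" in s.lower()),
    RubricItem("Cites at least one source", lambda s: "[1]" in s),
    RubricItem("Names a failure mode", lambda s: "failure" in s.lower(), weight=2.0),
]

# One group of responses sampled for the same prompt.
group = [
    "GRPO normalizes rewards within a group [1]; one failure mode is reward hacking.",
    "GRPO is a policy optimization method.",
    "This answer is off topic.",
]

rewards = [rubric_reward(r, rubric) for r in group]
advantages = group_relative_advantages(rewards)

for r, a in zip(rewards, advantages):
    print(f"reward={r:.2f} advantage={a:+.3f}")
```

Because each item is checked independently, a response earns partial credit per constraint instead of a single holistic score, which is one plausible reading of why atomic rubrics would give a finer-grained reward signal.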

We evaluate the framework on six benchmarks covering agentic research and expert reasoning. Our experiments indicate that DR-rubric delivers highly competitive performance using only 1,000 to 3,000 training instances. Specifically, rubrics generated by GPT-5 significantly improve breadth coverage in agentic tasks, while those from Gemini offer the most balanced performance across both agentic and expert reasoning domains. Furthermore, bootstrap rubrics shift from specialization to rebalancing across iterations, achieving the best overall performance by the third iteration. These results suggest that transforming rubric construction from a static evaluation template into an evidence-driven research process produces more scalable and fine-grained reward signals for open-ended tasks.


Source: arXiv Generated at: 2026-06-02 00:00:00 UTC
