arXiv

Deep Research as Rubric for Reinforcement Learning

June 2, 2026 · Wangyi Mei, Zhouhong Gu, Zhenhan Bai, Yin Cai, Lefan Zhang, Zhenxin Ding, Bo Chen, Yan Gao, Yi Wu, Yao Hu, Jiaqing Liang, Deqing Yang


Abstract

Open-ended reasoning and long-form generation tasks lack reliable automatic verification signals, which hinders reward-based policy optimization. Rubrics are a viable alternative, but current methods typically treat them as pre-existing artifacts (whether manually crafted or generated via prompts) and frequently overlook critical, task-specific, and knowledge-heavy dimensions. This oversight often distorts the resulting reward signal. We posit that constructing rubrics is, in itself, a research challenge: determining the criteria for a correct or insightful response requires discovering and synthesizing external knowledge.

To address this, we introduce Deep Research as Rubric (DR-rubric), a two-stage framework designed to build such comprehensive rubrics. In Stage I, the system employs iterative, multi-turn agentic search to uncover domain facts, structural constraints, and potential failure modes. Stage II then distills this gathered evidence into atomic, independently verifiable constraints tailored for GRPO-based policy optimization. Notably, because the model being trained can function as its own rubric generator, DR-rubric-8B enables bootstrap rubric generation without relying on frontier-model assistance.
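As an illustration of how Stage II's atomic constraints could feed GRPO, the sketch below scores a response as the weighted fraction of constraints it satisfies and then normalizes rewards within a sampled group, as in standard GRPO. It is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not specify how constraints are scored or aggregated, so the `Constraint` class, its `weight` field, the weighted-fraction aggregation, and the string-based checks are all hypothetical. In practice each check would more plausibly be an LLM judge call.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import Callable, List


@dataclass
class Constraint:
    """One atomic, independently verifiable rubric item (hypothetical structure)."""
    description: str
    check: Callable[[str], bool]  # assumption: stands in for an LLM-judge verdict
    weight: float = 1.0           # assumption: the abstract does not mention weights


def rubric_reward(response: str, rubric: List[Constraint]) -> float:
    """Weighted fraction of constraints satisfied, in [0, 1] (assumed aggregation)."""
    total = sum(c.weight for c in rubric)
    if total == 0:
        return 0.0
    satisfied = sum(c.weight for c in rubric if c.check(response))
    return satisfied / total


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantages: standardize each reward against its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy rubric with two string-based checks, purely for demonstration.
rubric = [
    Constraint("mentions GRPO", lambda text: "GRPO" in text),
    Constraint("has at least three words", lambda text: len(text.split()) >= 3),
]

# One group of sampled responses to the same prompt.
group = ["GRPO uses group baselines", "short", "GRPO"]
rewards = [rubric_reward(r, rubric) for r in group]
advantages = group_relative_advantages(rewards)
```

Here the first response satisfies both constraints, the second satisfies neither, and the third satisfies one, so the first receives a positive advantage and the second a negative one. Because each constraint is scored independently, a missed knowledge-heavy criterion lowers the reward directly rather than being absorbed into a single holistic score.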

We assessed the framework across six benchmarks covering agentic research and expert reasoning. Our experiments indicate that DR-rubric delivers highly competitive performance using only 1,000 to 3,000 training instances. Specifically, rubrics generated by GPT-5 significantly enhance breadth coverage in agentic tasks, while those from Gemini offer the most balanced performance across both agentic and expert reasoning domains. Furthermore, bootstrap rubrics shift from specialization to rebalancing across iterations, achieving the best overall performance by the third iteration. These results suggest that transforming rubric construction from a static evaluation template into an evidence-driven research process produces more scalable and fine-grained reward signals for open-ended tasks.
