Self-Evolving Deep Research via Joint Generation and Evaluation

June 4, 2026 · Han Zhu, Chengkun Cai, Yuanfeng Song, Xing Chen, Sirui Han, Yike Guo

Title: Self-Evolving Deep Research via Joint Generation and Evaluation

Original: arXiv:2606.04507v1 Announce Type: cross Abstract: Large Language Models (LLMs) have become increasingly adopted in daily applications, with deep research standing out as a particularly important capability. Unlike traditional question-answering (QA) tasks, deep research report generation lacks definitive ground-truth, making reward design inherently unverifiable and limiting effective reinforcement learning. Existing approaches mitigate this challenge with LLM-as-a-judge and query-dependent evaluation rubrics, but they still rely on static evaluators that cannot adapt their standards as the solver improves, leading to insufficient and eventually saturated optimization pressure. We address this limitation with a self-evolving co-evolutionary training framework for deep research evaluation and generation (SCORE), which tightly couples an evaluator and a solver in a shared-parameter learning process. Rather than treating generation and evaluation as isolated modules, we leverage their intrinsic connection to enable joint improvement within a single shared-parameter model. To restrict this process, we introduce a meta-harness, which dynamically controls the evaluation environment based on solver performance, encouraging valid evaluation dimensions and sufficiently deep evaluator search. Extensive experiments on deep research benchmarks demonstrate consistent improvement in report generation quality, showing that co-evolving evaluation and generation is a promising direction for training open-ended research agents.

Rewrite: Title: Self-Evolving Deep Research via Joint Generation and Evaluation

Abstract: Large Language Models (LLMs) are increasingly integrated into everyday applications, and deep research has emerged as a critical capability. Unlike standard question-answering (QA) tasks, deep research report generation has no definitive ground truth. This makes rewards inherently unverifiable and limits the effectiveness of reinforcement learning. Current methods mitigate the problem with LLM-as-a-judge mechanisms and query-dependent evaluation rubrics, but they depend on static evaluators. These evaluators cannot adjust their standards as the solver improves, so the optimization signal is insufficient and eventually saturates. To overcome this bottleneck, we propose SCORE, a self-evolving co-evolutionary training framework for deep research evaluation and generation. The framework couples an evaluator and a solver in a shared-parameter learning process. Instead of treating generation and evaluation as isolated modules, it exploits their intrinsic connection to improve both jointly within a single model. A meta-harness regulates this process by dynamically controlling the evaluation environment based on solver performance, which encourages valid evaluation dimensions and sufficiently deep evaluator search. Extensive experiments on deep research benchmarks show consistent improvements in report quality, indicating that co-evolving evaluation and generation is a promising direction for training open-ended research agents.
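
The abstract says the meta-harness controls the evaluation environment based on solver performance but gives no mechanism. The following is a toy sketch of one way such a controller could look, not the authors' implementation. Every name and number in it (`EvalConfig`, `update_harness`, the score thresholds, the caps) is a hypothetical chosen for illustration, and the rule that relaxes the environment when the solver struggles is an added assumption that the abstract does not state.

```python
from dataclasses import dataclass


@dataclass
class EvalConfig:
    # Hypothetical knobs; the abstract only mentions "evaluation dimensions"
    # and "evaluator search" depth, not how they are parameterized.
    num_dimensions: int = 3  # rubric dimensions the evaluator must cover
    search_depth: int = 1    # evaluator search steps per report


def update_harness(config, recent_scores, high=0.8, low=0.4,
                   max_dimensions=8, max_depth=5):
    """Return a new EvalConfig based on the solver's recent mean score.

    Assumed rule: tighten the environment when scores saturate,
    relax it when the solver struggles, otherwise leave it unchanged.
    """
    if not recent_scores:
        return config
    mean = sum(recent_scores) / len(recent_scores)
    if mean >= high:
        # Solver is saturating the current evaluator: demand more of it.
        return EvalConfig(min(config.num_dimensions + 1, max_dimensions),
                          min(config.search_depth + 1, max_depth))
    if mean <= low:
        # Solver is struggling: ease off, but never below one of each.
        return EvalConfig(max(config.num_dimensions - 1, 1),
                          max(config.search_depth - 1, 1))
    return config
```

In a training loop, `update_harness` would be called between rounds with the scores the evaluator assigned to the solver's latest reports, so that evaluation pressure tracks solver progress instead of staying fixed.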

Source: arXiv Generated at: 2026-06-04 00:00:00 UTC