ARBOR: Online Process Rewards via a Reusable Rubric Buffer for Search Agents
Abstract
Current training methods for LLM-based search agents rely almost exclusively on outcome-based rewards, leaving the intermediate search process unsupervised. This has a critical limitation: the reward signal collapses within groups whose sampled trajectories are uniformly correct. Because every trajectory in such a group receives the same outcome reward, the within-group advantage is zero and the group contributes no gradient signal. Previous attempts at process supervision have been inefficient: they either train expensive verifiers or generate query-specific rubrics that vary inconsistently from query to query and are discarded after a single use.
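The zero-advantage problem can be illustrated with a minimal sketch. It assumes the group-normalized advantage commonly used with GRPO, (r - mean) / (std + eps); the function name `group_advantages` and the value of `eps` are illustrative choices, not taken from the paper.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantage, GRPO style: (r - mean) / (std + eps).

    `eps` is an assumed small constant that avoids division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Every sampled trajectory is correct: all rewards are identical.
uniform = group_advantages([1.0, 1.0, 1.0, 1.0])

# Mixed outcomes: correct trajectories get positive advantage, incorrect negative.
mixed = group_advantages([1.0, 0.0, 1.0, 0.0])

print(uniform)
print(mixed)
```

In the uniform group every advantage is exactly zero, so the group adds nothing to the policy gradient; only the mixed group is informative.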
To address these challenges, we introduce ARBOR (Adaptive Rubric Buffer for Online Reward), a framework for reusable process rewards. ARBOR maintains a shared rubric memory that persists across queries. It admits query-local draft rubrics derived from contrastive trajectories, consolidates them into common rubrics that apply across queries, and retires rubrics as the policy improves. A small, active subset of these common rubrics scores trajectories through sparse pairwise judging, so ARBOR produces process-level gradients even when outcome rewards are uniform. The resulting process scores are then combined with the base reward.
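The following is a rough sketch of how pairwise process scores could restore a within-group signal when outcome rewards are uniform. It is not the paper's implementation: the win-rate scoring, the additive mix, the `weight` value, and all function names are assumptions made for illustration, and the rubric-conditioned judge is replaced by a hand-written list of judged pairs.

```python
from statistics import mean, pstdev

def process_scores(n, judged_pairs):
    """Win rate per trajectory over the pairs that were actually judged.

    judged_pairs: list of (winner_index, loser_index) tuples. In a real
    system these would come from a rubric-conditioned judge (not modeled).
    """
    wins = [0] * n
    seen = [0] * n
    for winner, loser in judged_pairs:
        wins[winner] += 1
        seen[winner] += 1
        seen[loser] += 1
    # Trajectories that were never compared get a neutral score of 0.5.
    return [w / s if s else 0.5 for w, s in zip(wins, seen)]

def combined_rewards(outcome, judged_pairs, weight=0.2):
    """Assumed additive mix: outcome reward + weight * process score."""
    scores = process_scores(len(outcome), judged_pairs)
    return [r + weight * s for r, s in zip(outcome, scores)]

def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantage: (r - mean) / (std + eps)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

outcome = [1.0, 1.0, 1.0, 1.0]      # uniformly correct group
pairs = [(0, 1), (2, 3), (0, 2)]    # sparse: 3 of the 6 possible pairs
rewards = combined_rewards(outcome, pairs)
advantages = group_advantages(rewards)
print(advantages)
```

Judging only a subset of pairs keeps the number of judge calls below the full quadratic count, and the combined rewards now differ within the group, so the advantages are no longer all zero.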
In experiments, ARBOR consistently outperforms GRPO and DAPO baselines across four multi-hop question-answering benchmarks. It improves average LLM-judge accuracy by as much as 4.2 points and converts up to 42% of training groups that previously yielded zero gradient signal into informative ones.
Source: arXiv
Generated at: 2026-06-03 00:00:00 UTC





