arXiv

ARBOR: Online Process Rewards via a Reusable Rubric Buffer for Search Agents

Abstract

Current training methods for LLM-based search agents rely almost exclusively on outcome-based rewards, leaving the intermediate search process unsupervised. This has a critical limitation: the reward signal degenerates in groups whose outcomes are uniformly correct. When every sampled trajectory in a group receives the same correctness reward, the within-group advantage is zero and the group contributes no gradient signal. Previous attempts at process supervision have been inefficient: they either train expensive verifiers or generate query-specific rubrics that vary inconsistently and are discarded after a single use.
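The zero-gradient case described above can be illustrated with a small sketch. It assumes the mean-and-standard-deviation group normalization commonly used in GRPO-style training. The function name `group_advantages` and the `eps` constant are illustrative choices, not details taken from the paper.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages: (reward - group mean) / (group std + eps).

    Assumption: a GRPO-style normalization; the paper's exact formula
    is not given in the abstract.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# A group in which every sampled trajectory is correct: all advantages are zero,
# so the group contributes nothing to the policy gradient.
uniform = group_advantages([1.0, 1.0, 1.0, 1.0])

# A mixed group: correct trajectories get positive advantages, incorrect ones negative.
mixed = group_advantages([1.0, 0.0, 1.0, 0.0])
```

In the uniform group every reward equals the group mean, so the numerator is zero for each trajectory regardless of how the search itself was carried out.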

To address these challenges, we introduce ARBOR (Adaptive Rubric Buffer for Online Reward), a framework designed for reusable process rewards. ARBOR maintains a shared rubric memory that persists across different queries. The system admits query-local drafts derived from contrastive trajectories, consolidates them into common rubrics applicable across queries, and retires them as the policy improves. By employing a small, active subset of these common rubrics to score trajectories through sparse pairwise judging, ARBOR generates process-level gradients even when outcome rewards are uniform. These scores are then integrated with the base reward.
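The admit, consolidate, retire, and score cycle can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the class `RubricBuffer`, the exact-text merge used as a stand-in for consolidation, the tie-rate retirement rule, the least-used selection rule, the `judge` callable, and the additive `weight` are all hypothetical.

```python
import random
from dataclasses import dataclass
from itertools import combinations


@dataclass
class Rubric:
    text: str
    uses: int = 0   # pairwise comparisons judged with this rubric
    ties: int = 0   # comparisons where it failed to separate the pair


class RubricBuffer:
    """Toy shared rubric memory that persists across queries (assumed rules)."""

    def __init__(self, min_uses=4, max_tie_rate=0.9):
        self.rubrics = {}
        self.min_uses = min_uses
        self.max_tie_rate = max_tie_rate

    def admit(self, draft):
        # Consolidation stand-in: drafts with the same normalized text merge.
        key = " ".join(draft.lower().split())
        return self.rubrics.setdefault(key, Rubric(text=draft))

    def active(self, k):
        # Assumed selection rule: prefer the least-used rubrics.
        return sorted(self.rubrics.values(), key=lambda r: r.uses)[:k]

    def retire(self):
        # Assumed retirement rule: drop rubrics that almost always tie,
        # i.e. that no longer discriminate between trajectories.
        self.rubrics = {
            key: r for key, r in self.rubrics.items()
            if r.uses < max(1, self.min_uses)
            or r.ties / r.uses <= self.max_tie_rate
        }


def process_scores(trajectories, rubrics, judge, n_pairs, rng):
    """Score trajectories from a random subset of pairs (sparse judging)."""
    scores = [0.0] * len(trajectories)
    pairs = list(combinations(range(len(trajectories)), 2))
    sampled = rng.sample(pairs, min(n_pairs, len(pairs)))
    for i, j in sampled:
        for rubric in rubrics:
            # judge returns +1 if the first trajectory is better, -1 if worse, 0 for a tie.
            verdict = judge(rubric.text, trajectories[i], trajectories[j])
            rubric.uses += 1
            if verdict == 0:
                rubric.ties += 1
            scores[i] += verdict
            scores[j] -= verdict
    norm = max(1, len(sampled) * len(rubrics))
    return [s / norm for s in scores]


def combined_rewards(base, process, weight=0.2):
    # Assumed integration: base reward plus a weighted process score.
    return [b + weight * p for b, p in zip(base, process)]


def toy_judge(rubric_text, a, b):
    # Stand-in for an LLM judge: prefer the trajectory with more distinct queries.
    da, db = len(set(a)), len(set(b))
    return (da > db) - (da < db)


buffer = RubricBuffer()
buffer.admit("Avoid repeating queries")
buffer.admit("avoid  repeating queries")  # merges with the draft above
trajectories = [["q1", "q2", "q3"], ["q1", "q1"], ["q1", "q2"]]
process = process_scores(trajectories, buffer.active(1), toy_judge, 3, random.Random(0))
rewards = combined_rewards([1.0, 1.0, 1.0], process)
```

Here all three trajectories share the same outcome reward, yet the combined rewards differ, so a group-relative advantage computed on them would no longer be zero.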

Experiments show that ARBOR consistently outperforms GRPO and DAPO baselines across four multi-hop question-answering benchmarks. The method improves average LLM-judge accuracy by up to 4.2 points and converts up to 42% of the training groups that previously provided zero gradient signal into informative ones.


Source: arXiv Generated at: 2026-06-03 00:00:00 UTC
