CAPF: Guiding Search-Agent Rollouts with Credit-Attenuated Privileged Feedback
Title: CAPF: Directing Search-Agent Trajectories via Credit-Attenuated Privileged Feedback
Abstract:
Contemporary LLM-based search agents are often trained with reinforcement learning with verifiable rewards (RLVR), which teaches search-augmented reasoning from outcome-based rewards. On complex tasks, however, these agents seldom produce successful end-to-end rollouts, so outcome-only RLVR suffers from a scarcity of positive-reward trajectories. We contend that improving learning on difficult problems requires additional guidance during training. RLVR systems already hold verifier-side information that can serve this purpose: it can pinpoint errors or omissions in the agent’s proposed answer and thereby direct revision within the rollout.
To leverage this, we introduce Credit-Attenuated Privileged Feedback (CAPF), a training-time framework that exposes verifier-side information through a Privileged Feedback call. CAPF lets the policy turn zero-reward attempts into positive-reward repair trajectories. It also attenuates the credit assigned to the feedback call and to the actions preceding it, so that the trained policy can be deployed in settings where privileged feedback is unavailable. Empirically, CAPF raises Qwen3-4B’s average exact-match score across seven open-domain question-answering benchmarks from 44.7% under standard outcome-only RLVR to 48.5%, a gain of 3.8 points.
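The abstract does not specify how credit is attenuated, so the following is only an illustrative sketch of one possible rule, not the authors' method. It assumes a single uniform scale factor applied to the last feedback call and every action before it, with full credit for the repair actions that follow. The names `attenuated_credits`, `attenuation`, and the action label `"privileged_feedback"` are all hypothetical.

```python
from typing import List


def attenuated_credits(
    actions: List[str],
    reward: float,
    attenuation: float = 0.5,  # hypothetical scale factor, not from the paper
    feedback_action: str = "privileged_feedback",  # hypothetical action label
) -> List[float]:
    """Return one credit value per action in a rollout.

    Assumed rule (illustration only): the last feedback call and all actions
    before it receive `reward * attenuation`; actions after it receive the
    full `reward`. Rollouts without a feedback call are left unchanged.
    """
    if feedback_action not in actions:
        return [reward] * len(actions)

    # Index of the last feedback call in the rollout.
    last_feedback = len(actions) - 1 - actions[::-1].index(feedback_action)

    return [
        reward * attenuation if i <= last_feedback else reward
        for i in range(len(actions))
    ]


# A zero-reward first attempt, repaired after one feedback call.
rollout = ["search", "answer", "privileged_feedback", "search", "answer"]
credits = attenuated_credits(rollout, reward=1.0, attenuation=0.5)
```

Under this assumed rule, the policy is rewarded most for the repair steps it takes after feedback, and less for the feedback call itself and the failed attempt that led to it, which is one way to discourage reliance on a call that does not exist at deployment.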
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC




