arXiv

Process Reward Agents for Steering Knowledge-Intensive Reasoning

June 2, 2026 · Jiwoong Sohn, Tomasz Sternal, Kenneth Styppa, Torsten Hoefler, Michael Moor

Title: Leveraging Process Reward Agents to Guide Reasoning in Knowledge-Heavy Domains

Abstract:

Reasoning in knowledge-intensive domains is difficult because intermediate steps often cannot be verified in isolation. In mathematics or programming, correctness can often be assessed locally; here, judging whether a step is valid may require synthesizing information from large external knowledge bases. As a result, small mistakes can propagate through a reasoning chain and go unnoticed. Previous studies have introduced process reward models (PRMs), including retrieval-augmented variants, but these models work retrospectively: they score finished trajectories, so they cannot be incorporated into dynamic inference workflows.

To address this limitation, we present Process Reward Agents (PRA), an inference-time method that supplies a frozen policy with domain-grounded, online, step-by-step rewards. Unlike earlier retrieval-augmented PRMs, PRA supports search-based decoding: it can score and prune candidate paths at every generation step. Across several medical reasoning benchmarks, PRA consistently outperforms strong baselines. Notably, it reaches 81.9% accuracy on MedQA with Qwen3-4B, a new state of the art for models at the 4-billion-parameter scale.
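The search-based decoding described above can be illustrated with a minimal step-level beam search. This is a sketch under stated assumptions, not the authors' implementation: the names `step_beam_search`, `propose`, `reward`, and `is_final` are hypothetical, the additive scoring rule is an assumption, and the toy policy and reward below stand in for a frozen language model and a retrieval-backed reward agent.

```python
from typing import Callable, List, Tuple

Steps = Tuple[str, ...]  # a partial reasoning trace, one string per step


def step_beam_search(
    propose: Callable[[Steps], List[str]],  # hypothetical frozen policy: next-step candidates
    reward: Callable[[Steps, str], float],  # hypothetical reward agent: scores one step in context
    is_final: Callable[[Steps], bool],      # True once a trace ends in an answer
    beam_width: int = 2,
    max_steps: int = 8,
) -> Steps:
    """Keep the `beam_width` best partial traces after every reasoning step."""
    beams: List[Tuple[float, Steps]] = [(0.0, ())]
    for _ in range(max_steps):
        expanded: List[Tuple[float, Steps]] = []
        for score, steps in beams:
            if steps and is_final(steps):
                expanded.append((score, steps))  # finished traces are carried forward
                continue
            for cand in propose(steps):
                # Assumption: a trace's score is the sum of its step rewards.
                expanded.append((score + reward(steps, cand), steps + (cand,)))
        if not expanded:
            break
        # Prune online: low-reward candidates are dropped before they are extended.
        expanded.sort(key=lambda b: b[0], reverse=True)
        beams = expanded[:beam_width]
        if all(is_final(s) for _, s in beams):
            break
    return max(beams, key=lambda b: b[0])[1]


# Toy stand-ins (made up for illustration only).
def toy_propose(steps: Steps) -> List[str]:
    if not steps:
        return ["claim: supported", "claim: unsupported"]
    return ["answer: A", "answer: B"]


def toy_reward(steps: Steps, cand: str) -> float:
    if cand == "claim: supported":
        return 1.0
    if cand == "answer: A" and steps[-1] == "claim: supported":
        return 1.0
    return 0.0


def toy_is_final(steps: Steps) -> bool:
    return steps[-1].startswith("answer:")


best = step_beam_search(toy_propose, toy_reward, toy_is_final)
print(best)
```

Because the policy is only called through `propose`, its weights are never touched; all domain knowledge enters through `reward`, which is where a retrieval step would sit in a real system.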

Crucially, PRA generalizes to unseen frozen policy models ranging from 0.5B to 8B parameters, improving their accuracy by as much as 25.7% without any updates to the policy models themselves. More broadly, PRA introduces a framework in which reasoning engines are separated from domain-specific reward systems. This decoupling allows new backbone models to be deployed in complex domains without retraining.
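One way to picture this decoupling is as two independent interfaces: any policy that can propose next steps can be paired with the same reward agent. The sketch below is illustrative only; the `Policy` and `RewardAgent` protocols, the `pick_step` helper, and the toy classes are hypothetical names, not part of the paper.

```python
from typing import List, Protocol, Sequence


class Policy(Protocol):
    """Any frozen reasoning model that can propose candidate next steps."""
    def propose(self, prefix: Sequence[str]) -> List[str]: ...


class RewardAgent(Protocol):
    """Any domain-specific scorer for a single step given the trace so far."""
    def score(self, prefix: Sequence[str], step: str) -> float: ...


def pick_step(policy: Policy, agent: RewardAgent, prefix: Sequence[str]) -> str:
    # The policy is only queried, never updated; the agent alone ranks candidates.
    return max(policy.propose(prefix), key=lambda s: agent.score(prefix, s))


class SmallToyPolicy:
    def propose(self, prefix: Sequence[str]) -> List[str]:
        return ["claim X", "claim Y"]


class LargeToyPolicy:
    def propose(self, prefix: Sequence[str]) -> List[str]:
        return ["claim Z", "claim Y", "claim X"]


class LookupRewardAgent:
    """Toy agent: a fixed set of supported claims stands in for retrieval."""
    def __init__(self, supported: set) -> None:
        self.supported = supported

    def score(self, prefix: Sequence[str], step: str) -> float:
        return 1.0 if step in self.supported else 0.0


agent = LookupRewardAgent({"claim Y"})
picks = [pick_step(p, agent, []) for p in (SmallToyPolicy(), LargeToyPolicy())]
print(picks)
```

Swapping the backbone here means passing a different `Policy` object; the reward agent and its domain knowledge are reused unchanged.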
