AgentProcessBench: Diagnosing Step-Level Process Quality in Tool-Using Agents
Abstract:
Although Large Language Models (LLMs) have evolved into agents that can use external tools, they remain fragile in long-horizon interactions. Unlike mathematical reasoning, where mistakes can often be corrected through backtracking, tool-use failures frequently have irreversible consequences, which makes precise step-level verification essential. However, existing process-level benchmarks are largely confined to closed-world mathematical settings and fail to capture the dynamic, open-ended nature of tool execution. To address this gap, we present AgentProcessBench, the first benchmark designed to assess step-level effectiveness in realistic, tool-augmented trajectories. The benchmark comprises 1,000 diverse trajectories and 8,509 human-labeled annotations, with an inter-annotator agreement of 89.1%. It uses a ternary labeling scheme to account for exploration and an error propagation rule to reduce labeling ambiguity. Our experiments yield three key findings: (1) weaker policy models show inflated ratios of correct steps because they terminate early; (2) current models still struggle to distinguish neutral actions from erroneous ones; and (3) process-derived signals complement outcome-based supervision and notably improve test-time scaling. We hope AgentProcessBench will stimulate further research on reward models and contribute to the development of general-purpose agents. The code and data are available at https://github.com/RUCBM/AgentProcessBench.
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC




