Global News Digest

arXiv

AgentProcessBench: Diagnosing Step-Level Process Quality in Tool-Using Agents

Abstract:

Although Large Language Models (LLMs) have evolved into agents that can use external tools, they remain brittle in long-horizon interactions. Unlike mathematical reasoning, where mistakes can often be corrected through backtracking, tool-use failures frequently have irreversible consequences, which makes accurate step-level verification essential. Existing process-level benchmarks, however, are largely confined to closed-world mathematical settings and do not capture the dynamic, open-ended nature of tool execution. To address this gap, we present AgentProcessBench, the first benchmark designed to assess step-level effectiveness in realistic, tool-augmented trajectories. The benchmark comprises 1,000 diverse trajectories and 8,509 human-labeled annotations, with an inter-annotator agreement of 89.1%. It uses a ternary labeling scheme to account for exploration, along with an error propagation rule that reduces labeling ambiguity. Our experiments yield three main findings: (1) weaker policy models show inflated ratios of correct steps because they terminate early; (2) current models still struggle to distinguish neutral actions from errors; and (3) process-derived signals complement outcome-based supervision and notably improve test-time scaling. We hope AgentProcessBench will stimulate further research on reward models and contribute to the development of general-purpose agents. The code and data are available at https://github.com/RUCBM/AgentProcessBench.
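Finding (1) can be made concrete with a little arithmetic. The sketch below is illustrative only: the integer encoding of the three labels, the `correct_step_ratio` helper, and both toy trajectories are assumptions made up for this example, not the benchmark's data format or its official metric. The error propagation rule is not specified in the abstract, so it is not modeled here.

```python
from typing import Sequence

# Hypothetical encoding of a ternary step label (an assumption for this sketch):
# +1 = correct step, 0 = neutral/exploratory step, -1 = erroneous step.
CORRECT, NEUTRAL, ERROR = 1, 0, -1


def correct_step_ratio(labels: Sequence[int]) -> float:
    """Return the fraction of steps labeled correct (0.0 for an empty trajectory)."""
    if not labels:
        return 0.0
    return sum(1 for label in labels if label == CORRECT) / len(labels)


# Toy trajectory: a weaker policy takes two plausible steps, then stops
# before the task is actually finished.
early_stop = [CORRECT, CORRECT]

# Toy trajectory: a stronger policy explores, makes one mistake, and keeps going.
full_run = [CORRECT, NEUTRAL, CORRECT, ERROR, CORRECT, CORRECT, NEUTRAL, CORRECT]

print(correct_step_ratio(early_stop))  # 1.0
print(correct_step_ratio(full_run))    # 0.625
```

In this toy case the short trajectory scores a perfect ratio even though it accomplishes less, which is why a step-level ratio read in isolation can flatter a policy that simply stops early.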


Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC
