arXiv

Agent Planning Benchmark: A Diagnostic Framework for Planning Capabilities in LLM Agents

June 4, 2026 · Haoyu Sun, Wenxuan Wang, Mingyang Song, Jujie He, Weinan Zhang, Yang Liu, Yang Yang, Yu Cheng

Abstract:

Effective planning is a cornerstone of Large Language Model (LLM) agents, requiring them to break down objectives, choose appropriate tools, evaluate constraints, and recognize when tasks cannot be completed. However, current evaluation methods predominantly focus on final outcomes, which obscures the root causes of failure because planning errors cannot be distinguished from execution errors. To address this gap, we present the Agent Planning Benchmark (APB), a specialized diagnostic tool designed to assess planning abilities. The benchmark comprises 4,209 multimodal examples spanning 22 distinct domains and five operational settings. It evaluates holistic planning, step-wise planning influenced by feedback, and robustness in scenarios involving extraneous or broken tools, as well as unsolvable tasks.

Our analysis of 12 Multimodal Large Language Models (MLLMs) using APB highlights consistent deficiencies in long-horizon planning, resilience to tool noise, calibrated refusal mechanisms, and the ability to refine plans during inference. Furthermore, we validated APB’s utility on 200 tasks from ToolSandbox and 200 from $\tau^2$-bench. In these tests, applying APB-guided refinement led to measurable improvements in plan correctness, plan grading, and downstream execution performance across three representative models. Consequently, APB functions as a crucial upstream diagnostic complement to traditional execution-focused benchmarks.
