arXiv

CRAB-Bench: Evaluating LLM Agents under Complex Task Dependencies and Human-aligned User Simulation

June 2, 2026 · Danqing Wang, Akshay Sivaraman, Lei Li

Title: CRAB-Bench: Assessing LLM Agents Amidst Intricate Task Dependencies and Human-Centric User Simulation

Abstract: Evaluating Large Language Model (LLM) agents in realistic service environments requires accounting for complex task dependencies, unpredictable user behavior, and multiple acceptable outcomes. We present CRAB-Bench (Constraint-based Realistic Agent Benchmark) and RUSE (Realistic User Simulation Engine) to close this evaluation gap. CRAB-Bench builds tasks from a constraint graph that links interdependent entities and adds structured distractors. Agents must therefore reason carefully over thousands of misleading options, of which only a negligible fraction are valid solutions. RUSE, in turn, replaces cooperative, template-driven simulators with realistic users derived from human behavioral studies, instantiated across a range of personas and four distinct behavioral dimensions. In experiments with four leading LLM agents, even the top-performing model reaches only a 61% pass@1 rate on CRAB-Bench. Introducing RUSE lowers performance further, with drops of up to 57%. These drops are primarily attributable to weaknesses in task solving rather than in conversational fluency. Notably, the "Information Disclosure" behavioral dimension is the most detrimental; agents interacting with RUSE tend to avoid admitting errors and instead conceal mistakes through implicit corrections.

Source: arXiv · Generated at: 2026-06-02 00:00:00 UTC