PieArena: Ranking and Profiling Language Agents in Realistic Negotiation Scenarios
Abstract:
This study offers a comprehensive assessment of Large Language Models’ (LLMs) negotiation capabilities, a critical business function that demands strategic reasoning, theory of mind, and the ability to generate economic value. To facilitate this analysis, we introduce PieArena, a large-scale benchmark for negotiation that relies on multi-agent interactions within realistic scenarios derived from MBA negotiation curricula at a prestigious business school. Our evaluation framework encompasses three distinct pairing regimes: mirror-play, cross-play, and human-LM interactions.
We have developed a ranking model designed for continuous negotiation payoffs. This model generates order-invariant leaderboards with quantified uncertainty, while simultaneously addressing systematic experimental asymmetries. Additionally, we investigate the impact of joint-intentionality agentic scaffolding, observing asymmetric benefits: significant performance boosts for mid- and lower-tier LMs, contrasted with diminishing returns for frontier models.
Using trained business school students as calibration anchors, we gathered human-human and human-LM negotiation data. Our findings indicate that a representative frontier language agent, GPT-5, performs on par with or better than this human baseline within our evaluation parameters. Beyond merely reporting deal outcomes, PieArena delivers a multi-dimensional behavioral profile. This profile exposes cross-model heterogeneity in areas such as instruction compliance, computational accuracy, and judge-assessed metrics of deception and reputation, thereby demonstrating the utility of evaluation methods that extend beyond outcome-centric leaderboards.
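To make the idea of an order-invariant ranking over continuous payoffs concrete, the following is a minimal sketch and not the paper's actual model, whose details the abstract does not give. It assumes each match yields a continuous payoff difference between the agent in one role and the agent in the other, fits per-agent skill values plus a single role-offset term by least squares (one simple way to absorb a systematic asymmetry between roles), and uses a bootstrap over matches for uncertainty. The names `fit_ratings` and `bootstrap_intervals` and the data layout are hypothetical.

```python
import numpy as np

def fit_ratings(matches, n_agents):
    """Least-squares skills from continuous payoff differences.

    matches: list of (i, j, diff), where agent i held role A, agent j held
    role B, and diff = payoff(i) - payoff(j). Assumed layout, for illustration.
    Model (an assumption): diff ~ skill[i] - skill[j] + role_offset.
    """
    n = len(matches)
    X = np.zeros((n + 1, n_agents + 1))
    y = np.zeros(n + 1)
    for r, (i, j, diff) in enumerate(matches):
        X[r, i] += 1.0         # += / -= so mirror-play (i == j) cancels out
        X[r, j] -= 1.0         # and informs only the role offset
        X[r, n_agents] = 1.0   # shared role-asymmetry term
        y[r] = diff
    X[n, :n_agents] = 1.0      # extra row pins the skills to sum to zero
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[:n_agents], coef[n_agents]

def bootstrap_intervals(matches, n_agents, n_boot=200, seed=0):
    """Percentile intervals for each skill, resampling matches with replacement."""
    rng = np.random.default_rng(seed)
    draws = np.empty((n_boot, n_agents))
    for b in range(n_boot):
        idx = rng.integers(0, len(matches), size=len(matches))
        draws[b], _ = fit_ratings([matches[k] for k in idx], n_agents)
    return np.percentile(draws, [2.5, 97.5], axis=0)

if __name__ == "__main__":
    # Synthetic, noise-free data: three agents, role A advantaged by 0.1.
    true_skill = np.array([0.2, 0.0, -0.2])
    data = [(i, j, true_skill[i] - true_skill[j] + 0.1)
            for i in range(3) for j in range(3)]
    skills, offset = fit_ratings(data, 3)
    low, high = bootstrap_intervals(data, 3, n_boot=50)
    print(skills, offset, low, high)
```

Because all matches are fitted jointly, shuffling the match list leaves the estimates unchanged up to floating-point rounding, in contrast to sequential update schemes such as Elo, where the final ratings depend on the order in which games are processed.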
Source: arXiv Generated at: 2026-06-03 00:00:00 UTC



