arXiv

Towards a Science of AI Agent Reliability

June 3, 2026 · Stephan Rabanser, Sayash Kapoor, Peter Kirgis, Kangheng Liu, Saiteja Utpala, Arvind Narayanan

Toward a Scientific Framework for AI Agent Reliability

Abstract

As artificial intelligence agents become integral to executing high-stakes tasks, a significant gap has emerged between their performance on standard benchmarks and their real-world behavior. Although accuracy metrics indicate rapid advancement, numerous agents continue to malfunction in practical applications. This disconnect reveals a core flaw in current assessment methods: reducing complex agent behaviors to a single success score masks vital operational deficiencies. Specifically, standard evaluations often overlook whether an agent maintains consistent performance across multiple runs, resists external perturbations, fails in a predictable manner, or limits the severity of its errors.

Drawing on safety-critical engineering, we propose twelve concrete metrics that together form a holistic performance profile. The metrics decompose agent reliability along four dimensions: consistency, robustness, predictability, and safety. Evaluating 15 models across two complementary benchmarks, we find that recent capability gains have produced only marginal improvements in reliability. By exposing these persistent limitations, the metrics complement traditional evaluations and offer tools for reasoning about how agents perform, degrade, and fail.
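To make concrete the point that a single success score can hide inconsistency across repeated runs, here is a minimal Python sketch. It is an illustration only and is not one of the twelve metrics from the paper: the function names, the two agents, and their outcomes are all hypothetical.

```python
from statistics import mean

def mean_success(runs):
    """Average success rate over all tasks and repeated runs."""
    return mean(mean(outcomes) for outcomes in runs.values())

def all_runs_success(runs):
    """Fraction of tasks solved on every repeated run (a strict consistency view)."""
    return mean(1.0 if all(outcomes) else 0.0 for outcomes in runs.values())

# Hypothetical data: task id -> outcome of 4 repeated runs (1 = success, 0 = failure).
agent_a = {"t1": [1, 1, 1, 1], "t2": [1, 1, 1, 1], "t3": [0, 0, 0, 0], "t4": [0, 0, 0, 0]}
agent_b = {"t1": [1, 0, 1, 0], "t2": [0, 1, 0, 1], "t3": [1, 1, 0, 0], "t4": [0, 0, 1, 1]}

for name, runs in (("A", agent_a), ("B", agent_b)):
    print(name, mean_success(runs), all_runs_success(runs))
```

Both hypothetical agents have the same average success rate of 0.5, yet agent A solves half of the tasks on every run while agent B solves none of them reliably. A single aggregate score cannot distinguish the two.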

Source: arXiv Generated at: 2026-06-03 00:00:00 UTC
