Towards a Science of AI Agent Reliability
Abstract
As artificial intelligence agents become integral to executing high-stakes tasks, a significant gap has emerged between their performance on standard benchmarks and their real-world behavior. Although accuracy metrics indicate rapid advancement, numerous agents continue to malfunction in practical applications. This disconnect reveals a core flaw in current assessment methods: reducing complex agent behaviors to a single success score masks vital operational deficiencies. Specifically, standard evaluations often overlook whether an agent maintains consistent performance across multiple runs, resists external perturbations, fails in a predictable manner, or limits the severity of its errors.
Drawing on safety-critical engineering disciplines, this study proposes twelve concrete metrics that together form a comprehensive performance profile. The metrics decompose agent reliability into four dimensions: consistency, robustness, predictability, and safety. Our evaluation of 15 models on two complementary benchmarks shows that recent capability gains have produced only marginal improvements in reliability. The proposed metrics complement traditional evaluations by exposing these persistent limitations and by offering tools for understanding how agents perform, degrade, and fail.
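The claim that a single success score can mask inconsistency across runs can be illustrated with a small sketch. It is not one of the twelve metrics from the paper: the function names, the task identifiers, and the outcome data below are all hypothetical, chosen only to show two agents with the same average success rate but very different run-to-run behavior.

```python
from statistics import mean

def mean_success(runs_by_task):
    """Average success rate over all tasks and runs (the usual single score)."""
    return mean(mean(runs) for runs in runs_by_task.values())

def all_runs_success(runs_by_task):
    """Fraction of tasks solved on every repeated run (an illustrative
    consistency measure, assumed here, not taken from the paper)."""
    return mean(1.0 if all(runs) else 0.0 for runs in runs_by_task.values())

# Hypothetical outcomes: 1 = success, 0 = failure, four runs per task.
steady = {"t1": [1, 1, 1, 1], "t2": [1, 1, 1, 1],
          "t3": [0, 0, 0, 0], "t4": [0, 0, 0, 0]}
flaky = {"t1": [1, 0, 1, 0], "t2": [0, 1, 0, 1],
         "t3": [1, 1, 0, 0], "t4": [0, 0, 1, 1]}

for name, results in (("steady", steady), ("flaky", flaky)):
    print(name, mean_success(results), all_runs_success(results))
```

Both hypothetical agents have the same mean success rate, yet only the first solves any task on every run, which is the kind of difference an aggregate score cannot show.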
Source: arXiv
Generated at: 2026-06-03 00:00:00 UTC



