STABLEVAL: Disagreement-Aware and Stable Evaluation of AI Systems
Title: STABLEVAL: Disagreement-Aware and Stable Evaluation of AI Systems
Original: arXiv:2605.02122v2
Abstract: Human evaluation remains the primary standard for assessing modern AI systems, yet annotator disagreement, bias, and variability make system rankings fragile under standard majority vote aggregation. Majority vote discards annotator reliability and item-level ambiguity, often yielding unstable comparisons across annotator subsets. We introduce STABLEVAL, a disagreement-aware evaluation framework that models latent item correctness and annotator-specific confusion patterns to produce posterior expected item credit and calibrated agent-level scores. Unlike label-denoising approaches such as Dawid-Skene, STABLEVAL is explicitly designed for stable and uncertainty-aware system evaluation rather than hard label recovery. We formalize ranking stability as a first-class evaluation objective and analyze how aggregation methods preserve or distort underlying annotator behavior. Across controlled synthetic experiments and multiple real-world human-annotated benchmarks, majority vote exhibits increasing score error and ranking instability under annotator heterogeneity and adversarial noise, while STABLEVAL yields more stable and statistically grounded system rankings. These results demonstrate that modeling disagreement is essential for robust and reproducible AI evaluation.
Rewrite: Title: STABLEVAL: A Framework for Stable, Disagreement-Aware AI Assessment
Abstract: Human judgment remains the primary standard for evaluating modern AI systems, but annotator disagreement, bias, and variability make system rankings fragile under standard majority vote aggregation. Majority vote ignores both annotator reliability and item-level ambiguity, so system comparisons often change when a different subset of annotators is used. To address these limitations, we present STABLEVAL, a disagreement-aware evaluation framework that models latent item correctness and annotator-specific confusion patterns. From this model it produces posterior expected item credit and calibrated agent-level scores. Unlike label-denoising methods such as Dawid-Skene, which aim to recover hard labels, STABLEVAL is designed for stable, uncertainty-aware system evaluation. We formalize ranking stability as a first-class evaluation objective and examine how aggregation methods preserve or distort underlying annotator behavior. Across controlled synthetic experiments and multiple real-world human-annotated benchmarks, majority vote shows increasing score error and ranking instability under annotator heterogeneity and adversarial noise. In contrast, STABLEVAL yields system rankings that are more stable and statistically grounded. These findings indicate that modeling disagreement is essential for robust and reproducible AI evaluation.
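The abstract does not specify STABLEVAL's model, so the following is only a minimal sketch of the general idea it describes: replacing a hard majority vote with the posterior probability that an item is correct, given annotator-specific error rates. It assumes binary judgments, a simple two-parameter confusion model per annotator (sensitivity and specificity), and reliabilities that are already known. The function names, annotator ids, and all numbers are hypothetical and chosen for illustration; they are not taken from the paper.

```python
import math


def posterior_credit(votes, sens, spec, prior=0.5):
    """Posterior P(item is correct | votes) under an assumed two-coin model.

    votes: dict mapping annotator id -> 0/1 judgment for one item
    sens[a]: assumed P(annotator a votes 1 | item is correct)
    spec[a]: assumed P(annotator a votes 0 | item is incorrect)
    All probabilities must lie strictly between 0 and 1.
    """
    log_correct = math.log(prior)
    log_incorrect = math.log(1.0 - prior)
    for annotator, vote in votes.items():
        if vote == 1:
            log_correct += math.log(sens[annotator])
            log_incorrect += math.log(1.0 - spec[annotator])
        else:
            log_correct += math.log(1.0 - sens[annotator])
            log_incorrect += math.log(spec[annotator])
    # Normalise in log space to avoid underflow with many annotators.
    top = max(log_correct, log_incorrect)
    w_correct = math.exp(log_correct - top)
    w_incorrect = math.exp(log_incorrect - top)
    return w_correct / (w_correct + w_incorrect)


def majority_credit(votes):
    """Hard majority vote: 1.0, 0.0, or 0.5 on an exact tie."""
    ones = sum(votes.values())
    zeros = len(votes) - ones
    if ones == zeros:
        return 0.5
    return 1.0 if ones > zeros else 0.0


def agent_score(item_votes, credit_fn):
    """Agent-level score as the mean per-item credit."""
    return sum(credit_fn(v) for v in item_votes) / len(item_votes)


# Hypothetical annotators: "a" is reliable, "b" and "c" are near chance.
SENS = {"a": 0.9, "b": 0.55, "c": 0.55}
SPEC = {"a": 0.9, "b": 0.55, "c": 0.55}

# Hypothetical votes on two outputs from one agent.
agent_items = [
    {"a": 1, "b": 0, "c": 0},  # the reliable annotator is outvoted
    {"a": 1, "b": 1, "c": 1},  # unanimous
]

mv_score = agent_score(agent_items, majority_credit)
soft_score = agent_score(
    agent_items, lambda v: posterior_credit(v, SENS, SPEC)
)
```

On the first item the majority vote assigns zero credit because two near-chance annotators outvote the reliable one, whereas the posterior credit stays above one half. A full treatment would also have to estimate the annotator parameters from the data and report uncertainty in the resulting scores, both of which this sketch leaves out.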
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC





