arXiv

STABLEVAL: Disagreement-Aware and Stable Evaluation of AI Systems

Original: arXiv:2605.02122v2, Announce Type: replace-cross

Abstract: Human evaluation remains the primary standard for assessing modern AI systems, yet annotator disagreement, bias, and variability make system rankings fragile under standard majority vote aggregation. Majority vote discards annotator reliability and item-level ambiguity, often yielding unstable comparisons across annotator subsets. We introduce STABLEVAL, a disagreement-aware evaluation framework that models latent item correctness and annotator-specific confusion patterns to produce posterior expected item credit and calibrated agent-level scores. Unlike label-denoising approaches such as Dawid-Skene, STABLEVAL is explicitly designed for stable and uncertainty-aware system evaluation rather than hard label recovery. We formalize ranking stability as a first-class evaluation objective and analyze how aggregation methods preserve or distort underlying annotator behavior. Across controlled synthetic experiments and multiple real-world human-annotated benchmarks, majority vote exhibits increasing score error and ranking instability under annotator heterogeneity and adversarial noise, while STABLEVAL yields more stable and statistically grounded system rankings. These results demonstrate that modeling disagreement is essential for robust and reproducible AI evaluation.

Rewrite: Title: STABLEVAL: A Framework for Stable, Disagreement-Aware AI Assessment

Abstract: Human judgment remains the primary standard for evaluating modern AI systems, but annotator bias, inconsistency, and disagreement make system rankings fragile when labels are aggregated by majority vote. Majority vote ignores both individual annotator reliability and the ambiguity of specific items, so comparisons often change when a different subset of annotators is used. To address these limitations, we present STABLEVAL, a disagreement-aware evaluation framework that models latent item correctness and annotator-specific confusion patterns. The model produces posterior expected item credit and calibrated agent-level scores. Unlike label-denoising methods such as Dawid-Skene, which aim to recover hard labels, STABLEVAL is designed for stable, uncertainty-aware system evaluation. We formalize ranking stability as a first-class evaluation objective and examine how different aggregation methods preserve or distort underlying annotator behavior. Across controlled synthetic experiments and multiple real-world human-annotated benchmarks, majority vote shows increasing score error and ranking instability under annotator heterogeneity and adversarial noise, whereas STABLEVAL yields more stable and statistically grounded system rankings. These findings indicate that modeling disagreement is essential for robust and reproducible AI evaluation.
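To make the contrast between majority vote and posterior expected item credit concrete, here is a minimal Python sketch. It is not the STABLEVAL implementation, which the abstract does not specify. It assumes binary correct/incorrect judgments and a simple two-parameter confusion model per annotator (a sensitivity and a specificity), and it treats those parameters as already known, although in practice they would have to be estimated from the data. All function and variable names are hypothetical.

```python
import math


def majority_vote_credit(labels):
    """Hard credit: 1.0 if strictly more than half of the labels are 1, else 0.0."""
    return 1.0 if 2 * sum(labels) > len(labels) else 0.0


def posterior_credit(labels, sensitivity, specificity, prior=0.5):
    """Soft credit: P(item is truly correct | labels).

    Assumed model (illustrative, not taken from the paper):
      sensitivity[j] = P(annotator j says 1 | item is correct)
      specificity[j] = P(annotator j says 0 | item is incorrect)
    All rates and the prior must lie strictly between 0 and 1.
    """
    log_pos = math.log(prior)        # log P(correct, labels)
    log_neg = math.log(1.0 - prior)  # log P(incorrect, labels)
    for y, sens, spec in zip(labels, sensitivity, specificity):
        log_pos += math.log(sens if y == 1 else 1.0 - sens)
        log_neg += math.log(1.0 - spec if y == 1 else spec)
    # Normalize in log space to avoid underflow with many annotators.
    m = max(log_pos, log_neg)
    pos = math.exp(log_pos - m)
    neg = math.exp(log_neg - m)
    return pos / (pos + neg)


def agent_score(label_rows, credit_fn):
    """Agent-level score: mean item credit over all items."""
    return sum(credit_fn(row) for row in label_rows) / len(label_rows)


# One reliable annotator and two near-random ones (assumed values).
sens = [0.95, 0.55, 0.55]
spec = [0.95, 0.55, 0.55]

# The reliable annotator says "incorrect"; the two noisy ones say "correct".
item = [0, 1, 1]
print("majority vote credit:", majority_vote_credit(item))
print("posterior credit:", round(posterior_credit(item, sens, spec), 3))

# Agent-level scores over a small, made-up set of items.
rows = [[0, 1, 1], [1, 1, 1], [1, 0, 0]]
print("agent score (majority):", agent_score(rows, majority_vote_credit))
print("agent score (posterior):",
      round(agent_score(rows, lambda r: posterior_credit(r, sens, spec)), 3))
```

In this toy case majority vote awards full credit because two of three annotators voted "correct", while the posterior credit stays well below 0.5 because the one reliable annotator disagreed. Averaging soft credits instead of hard votes is one way an agent-level score can reflect annotator reliability and item-level ambiguity.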


Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC
