Position: State-of-the-Art Claims Require State-of-the-Art Evidence
Title: Advanced Assertions Demand Advanced Proof
Abstract: Claims of state-of-the-art (SOTA) status are ubiquitous in artificial intelligence (AI) and machine learning (ML) research, typically grounded in benchmark evaluations that rank models by aggregate task scores. Public leaderboards are the most prominent example of this structure, but the same methodology underlies the results tables in academic papers. However, aggregate scores alone are often insufficient evidence for claims this strong. We highlight a pervasive disconnect between the claims made and the evidence provided in AI benchmarking.
Asserting SOTA status implies more than a higher mean score; it suggests that a model is meaningfully better than its counterparts across most tasks. In reality, a slight edge in average performance establishes only a top-ranking position, not necessarily true dominance. Our analysis of ten cross-domain benchmarks sourced from public leaderboards reveals that in over 50% of comparisons involving top-performing models, at least one of the commonly assumed characteristics of superiority was absent: a significant effect size, consistency across tasks, or resilience to the removal of specific datasets.
Instead of broad superiority, aggregate improvements were often the result of performance on outlier datasets. This vulnerability remains evident even in benchmarks comprising a large number of tasks. We contend that the language used to describe results should accurately mirror the strength of the supporting evidence. Achieving this does not require new experiments; rather, it demands transparent reporting of what the data actually demonstrates, thereby facilitating more precise and interpretable model comparisons.
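The three properties named in the abstract (effect size, consistency across tasks, and resilience to dataset removal) can be illustrated with a minimal sketch. This is not the authors' procedure: the function name `compare_models`, the choice of a paired standardized mean difference as the effect size, the leave-one-task-out rule, and the example scores are all assumptions made for illustration.

```python
from statistics import mean, stdev


def compare_models(scores_a, scores_b):
    """Compare model A against model B on per-task scores (same task order).

    Hypothetical helper for illustration; not taken from the paper.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean_diff = mean(diffs)

    # Effect size (assumed metric): mean paired difference divided by the
    # standard deviation of the paired differences.
    sd = stdev(diffs)
    effect_size = mean_diff / sd if sd > 0 else None

    # Consistency: fraction of tasks on which A strictly beats B.
    win_rate = sum(d > 0 for d in diffs) / len(diffs)

    # Resilience: A leads on aggregate and still leads after dropping
    # any single task (leave-one-task-out).
    total = sum(diffs)
    robust = total > 0 and all(total - d > 0 for d in diffs)

    return {
        "mean_diff": mean_diff,
        "effect_size": effect_size,
        "win_rate": win_rate,
        "robust_to_task_removal": robust,
    }


# Made-up scores: A has the higher average only because of one outlier task.
scores_a = [0.70, 0.71, 0.69, 0.95]
scores_b = [0.72, 0.73, 0.71, 0.75]
result = compare_models(scores_a, scores_b)
print(result)
```

In this made-up example, model A has the higher mean score yet wins on only one of four tasks, and its lead disappears when that task is removed, which is the kind of outlier-driven aggregate gain the abstract describes.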
Source: arXiv
Generated at: 2026-06-04 00:00:00 UTC





