Knowledge Index of Noah's Ark
Title: The KINA Benchmark for Noah’s Ark Knowledge
Abstract
Large Language Model (LLM) knowledge assessments currently suffer from three primary deficiencies: design choices driven by scaling that fail to properly operationalize disciplinary representativeness; flat-payment annotation schemes that encourage lazy consensus; and unverified ranking instability when test budgets are constrained. To address these challenges, we present KINA, a comprehensive benchmark comprising 899 items distributed across 261 distinct disciplines, accompanied by two formal theoretical contributions.
First, we frame representativeness as a coverage-style objective built on expert-elicited anchors. Using a proxy for disciplinary representativeness, we show that a greedy algorithm attains an approximation ratio of $(1 - 1/e)$ (Proposition 1). This guarantee applies to the proxy metric only, not to overall population representativeness. Second, Theorem 1 shows that a bonus-on-bar tournament structure weakly first-order stochastically dominates (FOSD) flat payment in released-review quality, provided the incentive-compatibility condition $B > \Delta C / \Delta p_{\min}$ holds.
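To make the coverage framing concrete, the following is a minimal sketch of the standard greedy rule for maximum coverage under a cardinality budget, the textbook setting in which the $(1 - 1/e)$ ratio holds. It is not the paper's implementation: the function name `greedy_max_coverage`, the item identifiers, and the anchor sets are all hypothetical, and the abstract does not specify the actual proxy objective.

```python
from typing import Dict, Hashable, List, Set


def greedy_max_coverage(
    candidates: Dict[Hashable, Set[Hashable]], budget: int
) -> List[Hashable]:
    """Select up to `budget` candidates, each time taking the one that
    covers the most anchors not yet covered (standard greedy rule)."""
    covered: Set[Hashable] = set()
    chosen: List[Hashable] = []
    remaining = dict(candidates)  # copy so the caller's dict is untouched
    for _ in range(budget):
        if not remaining:
            break
        # Marginal gain = number of newly covered anchors.
        best = max(remaining, key=lambda k: len(remaining[k] - covered))
        if not (remaining[best] - covered):
            break  # no candidate adds coverage; stop early
        chosen.append(best)
        covered |= remaining.pop(best)
    return chosen


# Hypothetical items mapped to the (assumed) anchors they cover.
items = {
    "q1": {"a", "b", "c"},
    "q2": {"c", "d"},
    "q3": {"d", "e", "f", "g"},
    "q4": {"a"},
}
print(greedy_max_coverage(items, budget=2))
```

With a budget of two, the rule first takes `q3` (four new anchors) and then `q1` (three new anchors). As the abstract stresses, any guarantee of this kind concerns the coverage proxy, not representativeness of the underlying population.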
In our empirical evaluation of 42 models from 13 laboratories, the leading model, Gemini-3.1-Pro-Preview, scored 53.17%, followed by Claude-Opus-4.6 at 49.92% and GPT-5.4 at 48.55%. Performance therefore remains well below saturation, leaving substantial room for improvement. The full leaderboard shows a tiered distribution rather than a smooth continuum: a small frontier group scores above 48%, a dense cluster of strong models occupies the 38–45% range, and lower-performing models sit only slightly above the 10% random-chance baseline. Tool augmentation improved scores by up to 5.17 points across five tool-use evaluations, though the size of the gain varied considerably among models. Finally, we report bootstrap ranking-stability statistics that make variance under bounded budgets explicit, to discourage over-interpretation of minor rank differences.
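One common way to compute a ranking-stability statistic of this kind is to resample benchmark items with replacement and count how often each model ends up in first place. The sketch below illustrates that idea only; the abstract does not say which statistics the authors report, and the function name `bootstrap_top_rank_frequency`, the model labels, and the per-item correctness data are hypothetical.

```python
import random
from typing import Dict, List


def bootstrap_top_rank_frequency(
    results: Dict[str, List[int]], n_boot: int = 1000, seed: int = 0
) -> Dict[str, float]:
    """Resample items with replacement and return, for each model, the
    fraction of resamples in which it has the highest score.
    Ties for first place are split evenly among the tied models."""
    rng = random.Random(seed)
    models = list(results)
    n_items = len(next(iter(results.values())))
    wins = {m: 0.0 for m in models}
    for _ in range(n_boot):
        # One bootstrap replicate: draw item indices with replacement.
        idx = [rng.randrange(n_items) for _ in range(n_items)]
        scores = {m: sum(results[m][i] for i in idx) for m in models}
        best = max(scores.values())
        leaders = [m for m in models if scores[m] == best]
        for m in leaders:
            wins[m] += 1.0 / len(leaders)
    return {m: wins[m] / n_boot for m in models}


# Synthetic per-item correctness (1 = correct), for illustration only.
demo = {
    "model_a": [1, 1, 0, 1, 0, 1, 1, 0, 1, 0],
    "model_b": [1, 0, 0, 1, 1, 1, 0, 0, 1, 0],
}
print(bootstrap_top_rank_frequency(demo, n_boot=500))
```

A first-place frequency well below 1 for the nominal leader signals that a small score gap may not survive a different draw of items, which is the kind of caution the reported statistics are meant to encourage.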
Source: arXiv
Generated at: 2026-06-04 00:00:00 UTC


