arXiv

Knowledge Index of Noah's Ark

Title: The KINA Benchmark for Noah’s Ark Knowledge

Abstract

Large Language Model (LLM) knowledge assessments currently suffer from three primary deficiencies: design choices driven by scaling that fail to properly operationalize disciplinary representativeness; flat-payment annotation schemes that encourage lazy consensus; and unverified ranking instability when test budgets are constrained. To address these challenges, we present KINA, a comprehensive benchmark comprising 899 items distributed across 261 distinct disciplines, accompanied by two formal theoretical contributions.

First, we frame representativeness as a coverage-style objective utilizing expert-elicited anchors. By employing a proxy for disciplinary representativeness, we establish a greedy approximation algorithm with a guaranteed performance ratio of $(1 - 1/e)$ (Proposition 1). This guarantee applies specifically to the proxy metric, not to overall population representativeness. Second, we show in Theorem 1 that a bonus-on-bar tournament structure weakly first-order stochastically dominates (FOSD) flat payment in terms of released-review quality, provided the bonus $B$ satisfies the incentive-compatibility threshold $B > \Delta C / \Delta p_{\min}$.
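As an illustration of the kind of greedy procedure that carries a (1 - 1/e) guarantee, the sketch below implements the textbook greedy rule for maximum coverage. It is not the paper's algorithm: the function names, the mapping from candidate disciplines to anchor sets, and the toy data are all hypothetical, and the abstract does not state how the proxy objective is actually defined.

```python
from typing import Dict, Hashable, List, Set


def greedy_max_coverage(
    candidates: Dict[Hashable, Set[Hashable]], budget: int
) -> List[Hashable]:
    """Select up to `budget` candidates, each round taking the candidate
    that covers the most anchors not yet covered (standard greedy rule)."""
    covered: Set[Hashable] = set()
    chosen: List[Hashable] = []
    remaining = dict(candidates)
    for _ in range(budget):
        # Candidate with the largest marginal gain; ties go to insertion order.
        best = max(remaining, key=lambda k: len(remaining[k] - covered), default=None)
        if best is None or not (remaining[best] - covered):
            break  # nothing left that adds coverage
        chosen.append(best)
        covered |= remaining.pop(best)
    return chosen


def coverage(chosen: List[Hashable], candidates: Dict[Hashable, Set[Hashable]]) -> int:
    """Number of distinct anchors covered by the chosen candidates."""
    covered: Set[Hashable] = set()
    for name in chosen:
        covered |= candidates[name]
    return len(covered)


# Hypothetical toy data: each discipline covers a set of expert-elicited anchors.
toy_candidates = {
    "d1": {"a", "b", "c"},
    "d2": {"c", "d"},
    "d3": {"d", "e", "f", "g"},
    "d4": {"a"},
}
selection = greedy_max_coverage(toy_candidates, budget=2)
```

For monotone submodular objectives such as set coverage, this greedy rule is known to achieve at least a (1 - 1/e) fraction of the optimal value for the same budget. In line with the abstract's caveat, that bound concerns only the objective being maximized, here the toy anchor count.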

In our empirical evaluation of 42 models from 13 different laboratories, the leading model, Gemini-3.1-Pro-Preview, achieved a score of 53.17%. It was followed by Claude-Opus-4.6 at 49.92% and GPT-5.4 at 48.55%, indicating significant room for improvement as performance remains well below saturation levels. The full leaderboard reveals a tiered distribution rather than a smooth continuum: a small frontier group of models scores above 48%, a dense cluster of strong models occupies the 38–45% range, and lower-performing models hover only slightly above the 10% random chance baseline. Additionally, tool augmentation improved scores by up to 5.17 points across five tool-use evaluations, though the magnitude of these gains varied considerably among models. Finally, we provide bootstrap ranking-stability statistics to explicitly highlight variance under bounded budgets, aiming to discourage the over-interpretation of minor rank differences.
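One common way to obtain ranking-stability statistics of this kind is to resample test items with replacement and record how often each model ends up on top. The sketch below is a minimal version under assumptions: the abstract does not specify the paper's exact bootstrap procedure, and the function name, the per-item 0/1 correctness format, and the toy data are hypothetical.

```python
import random
from typing import Dict, List


def bootstrap_top_rank_frequency(
    correct: Dict[str, List[int]], n_boot: int = 1000, seed: int = 0
) -> Dict[str, float]:
    """Fraction of bootstrap resamples in which each model ranks first.

    `correct[model][i]` is 1 if the model answered item i correctly, else 0.
    Ties for first place are split equally among the tied models.
    """
    rng = random.Random(seed)
    models = list(correct)
    n_items = len(next(iter(correct.values())))
    top_counts = {m: 0.0 for m in models}
    for _ in range(n_boot):
        # Resample item indices with replacement; same items for every model.
        idx = [rng.randrange(n_items) for _ in range(n_items)]
        scores = {m: sum(correct[m][i] for i in idx) for m in models}
        best = max(scores.values())
        winners = [m for m in models if scores[m] == best]
        for m in winners:
            top_counts[m] += 1.0 / len(winners)
    return {m: top_counts[m] / n_boot for m in models}


# Hypothetical toy data, not results from the paper.
toy_results = {
    "model_a": [1, 1, 0, 1, 0, 1, 1, 0, 1, 0],
    "model_b": [1, 0, 1, 1, 0, 1, 0, 0, 1, 1],
    "model_c": [0, 0, 0, 1, 0, 0, 1, 0, 0, 0],
}
top_freq = bootstrap_top_rank_frequency(toy_results, n_boot=500, seed=1)
```

When two models have close accuracies, their top-rank frequencies under resampling tend to be split rather than concentrated on one of them, which is the sense in which small rank differences on a bounded test budget should not be over-interpreted.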


Source: arXiv
Generated at: 2026-06-04 00:00:00 UTC
