arXiv

The Geometry of LLM-as-Judge: Why Inter-LLM Consensus Is Not Human Alignment

Abstract

Large Language Models (LLMs) acting as judges have become an industry standard, yet a significant disconnect persists: these models agree with one another more than they agree with humans, and their alignment with human judgments remains weak. To determine whether this phenomenon stems from a shared underlying signal or a shared bias, we analyzed four geometric metrics across a standard LLM-as-judge framework. Our study used 41 LLM judges, eight Indic languages, and four community-constructed Indic datasets, with bootstrap confidence intervals for robustness. The four metrics were score spread, effective rank, the principal angle relative to the human subspace, and stacked correlations among both judges and humans.
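To make two of these metrics concrete, the following is a minimal sketch rather than the authors' implementation. It assumes that effective rank is the entropy-based definition (the exponential of the Shannon entropy of the normalized singular values) and that, for a one-dimensional judge axis and a one-dimensional human axis, the principal angle reduces to the sign-insensitive angle between two vectors. The abstract does not state either definition, and the function names `effective_rank` and `axis_angle_degrees` are hypothetical.

```python
import math


def effective_rank(singular_values):
    """Entropy-based effective rank: exp(H(p)) with p_i = s_i / sum(s).

    Assumption: this is one common definition; the abstract does not say
    which definition the paper uses.
    """
    total = sum(singular_values)
    probs = [s / total for s in singular_values if s > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)


def axis_angle_degrees(u, v):
    """Angle between two axes in degrees, in the range [0, 90].

    The absolute value makes the result insensitive to the sign of either
    axis, as is conventional for principal angles between subspaces.
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cosine = min(1.0, abs(dot) / (norm_u * norm_v))  # clamp rounding error
    return math.degrees(math.acos(cosine))
```

Under these definitions, a score matrix whose singular values are all equal has an effective rank equal to their count, while one dominated by a single singular value has an effective rank close to 1. An angle near $90^\circ$ means the two axes share almost no direction.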

On subjective evaluation rubrics, the data reveals that judges utilize less than half the scoring range of human evaluators ($\sigma_J / \sigma_H \approx 0.3$--$0.5$). Furthermore, the axis along which judges evaluate content is nearly orthogonal to the human axis, deviating significantly more from human perspectives than humans do from one another ($87^\circ$--$89^\circ$ compared to $78^\circ$--$81^\circ$). Consequently, inter-LLM agreement ($r_{LL} \approx 0.35$) surpasses the agreement between LLMs and humans ($r_{LH} \approx 0.27$--$0.32$).
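The spread ratio $\sigma_J / \sigma_H$ and the two correlation summaries $r_{LL}$ and $r_{LH}$ can be sketched as follows. This is an illustrative reading, not the paper's code: it assumes each rater is a list of scores over the same items, that $r_{LL}$ is the mean Pearson correlation over judge pairs, and that $r_{LH}$ is the mean over judge-human pairs. All function names are hypothetical.

```python
import math
from itertools import combinations
from statistics import mean, pstdev


def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spread_ratio(judge_scores, human_scores):
    """sigma_J / sigma_H: values below 1 mean the judge uses a narrower range."""
    return pstdev(judge_scores) / pstdev(human_scores)


def mean_pairwise_r(raters):
    """Mean correlation over all pairs within one group (assumed r_LL)."""
    return mean(pearson(a, b) for a, b in combinations(raters, 2))


def mean_cross_r(group_a, group_b):
    """Mean correlation over all cross-group pairs (assumed r_LH)."""
    return mean(pearson(a, b) for a in group_a for b in group_b)
```

For example, with made-up scores, a judge that only ever answers 3 or 4 on a 1 to 5 scale, such as `[3, 3, 4, 3, 4]`, compared against a human who uses the full range, `[1, 5, 2, 4, 3]`, gives a spread ratio of about 0.35, which falls inside the $0.3$--$0.5$ band reported above.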

However, this divergence disappears when applying the same diagnostics to a rubric with a verifiable factual answer. In such cases, the geometric indicators revert to human-like ranges, with the axis at $58.5^\circ$ and an LLM-human correlation of $r_{LH} = 0.519$. Attempts to rectify the issue through fine-tuning and preference optimization successfully recovered the score spread (increasing from $0.32$ to $1.08$) but failed to alter the evaluation axis, which remained between $87^\circ$ and $88^\circ$.

Only post-hoc calibration using a small, human-anchored dataset managed to improve all four community-health rubrics simultaneously. This approach allowed a calibrated 24B Indic judge ($r = 0.184$) to outperform GPT-5.5 ($r = 0.123$), although it still fell short of human reliability, which stood at $r = 0.474$ on the verifiable rubric. We contend that inter-LLM agreement should only be interpreted as evidence of human alignment if a direct geometric verification of the judge's score subspace confirms it; otherwise, such consensus merely reflects agreement within a collapsed subspace.
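The abstract does not specify the calibration procedure. As one hedged illustration of what post-hoc calibration on a small human-anchored set could look like, the sketch below remaps each discrete judge score to the mean human score observed for that level on the anchor items. This is a generic technique swapped in for illustration, not the authors' method, and the names `fit_level_calibration` and `apply_calibration` are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def fit_level_calibration(anchor_judge, anchor_human):
    """Build a lookup table: judge score level -> mean human score.

    Assumption: judge scores are discrete levels (e.g. 1 to 5) and the
    anchor set pairs each judge score with a human score for the same item.
    """
    by_level = defaultdict(list)
    for judge_score, human_score in zip(anchor_judge, anchor_human):
        by_level[judge_score].append(human_score)
    return {level: mean(values) for level, values in by_level.items()}


def apply_calibration(table, judge_scores):
    """Map new judge scores through the table.

    Levels never seen in the anchor set fall back to the nearest seen level.
    """
    levels = sorted(table)
    return [
        table[min(levels, key=lambda level: abs(level - score))]
        for score in judge_scores
    ]
```

A remapping of this kind can stretch a compressed score range back toward the human range. Whether it also improves correlation with humans depends on the data, so it should be checked on held-out items rather than assumed.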


Source: arXiv Generated at: 2026-06-03 00:00:00 UTC
