arXiv

The Geometry of LLM-as-Judge: Why Inter-LLM Consensus Is Not Human Alignment

June 3, 2026 · Sourabrata Mukherjee, Hamna Hamna, Kalika Bali, Sunayana Sitaram

Abstract

Large Language Models (LLMs) acting as judges have become an industry standard, yet a significant disconnect persists: these models show high agreement with one another, but their alignment with human judgments remains weak. To determine whether this pattern reflects a shared underlying signal or a shared bias, we computed four geometric metrics on a standard LLM-as-judge setup. Our study covered 41 LLM judges, eight Indic languages, and four community-constructed Indic datasets, with bootstrap confidence intervals for robustness. The four metrics are score spread, effective rank, the principal angle relative to the human subspace, and stacked correlations among both judges and humans.
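The four metrics named above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the abstract does not define the metrics exactly, so the entropy-based effective rank, the item-space principal angle, and the mean pairwise Pearson correlation used here are assumed stand-ins, and all function names are hypothetical. Score matrices are assumed to have shape `(n_items, n_raters)`.

```python
import numpy as np

def score_spread(scores):
    """Mean per-rater standard deviation of a (n_items, n_raters) matrix."""
    return float(np.std(scores, axis=0, ddof=1).mean())

def effective_rank(scores):
    """Entropy-based effective rank of the column-centered matrix
    (assumed definition: exp of the entropy of normalized singular values)."""
    centered = scores - scores.mean(axis=0, keepdims=True)
    s = np.linalg.svd(centered, compute_uv=False)
    p = s / s.sum()
    p = p[p > 0]  # drop exact zeros so log() is defined
    return float(np.exp(-(p * np.log(p)).sum()))

def principal_angle_deg(judges, humans, k=1):
    """Smallest principal angle (degrees) between the top-k left singular
    subspaces of the two centered score matrices (assumed construction)."""
    def top_axes(m):
        centered = m - m.mean(axis=0, keepdims=True)
        u, _, _ = np.linalg.svd(centered, full_matrices=False)
        return u[:, :k]
    # Singular values of U_J^T U_H are the cosines of the principal angles.
    cosines = np.linalg.svd(top_axes(judges).T @ top_axes(humans),
                            compute_uv=False)
    return float(np.degrees(np.arccos(np.clip(cosines.max(), 0.0, 1.0))))

def mean_cross_corr(a, b=None):
    """Mean Pearson r over distinct column pairs of a (b is None), or over
    all pairs of one column from a and one column from b."""
    if b is None:
        c = np.corrcoef(a, rowvar=False)
        return float(c[np.triu_indices(c.shape[0], k=1)].mean())
    na = a.shape[1]
    c = np.corrcoef(a, b, rowvar=False)
    return float(c[:na, na:].mean())
```

Under these assumptions, `score_spread(judges) / score_spread(humans)` plays the role of $\sigma_J / \sigma_H$, `mean_cross_corr(judges)` the role of $r_{LL}$, and `mean_cross_corr(judges, humans)` the role of $r_{LH}$.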

On subjective evaluation rubrics, the score spread of judges is only about a third to a half that of human evaluators ($\sigma_J / \sigma_H \approx 0.3$--$0.5$). Furthermore, the axis along which judges evaluate content is nearly orthogonal to the human axis: judges sit $87^\circ$--$89^\circ$ from humans, whereas humans sit only $78^\circ$--$81^\circ$ from one another. Consequently, inter-LLM agreement ($r_{LL} \approx 0.35$) exceeds LLM-human agreement ($r_{LH} \approx 0.27$--$0.32$).

However, this divergence disappears when the same diagnostics are applied to a rubric with a verifiable factual answer. There, the geometric indicators return to human-like ranges, with the axis at $58.5^\circ$ and an LLM-human correlation of $r_{LH} = 0.519$. Fine-tuning and preference optimization recovered the score spread (from $0.32$ to $1.08$) but did not change the evaluation axis, which remained between $87^\circ$ and $88^\circ$.
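Spread and axis are separate quantities, which is why one can be recovered without the other. The toy example below is only an illustration of that independence, not the paper's fine-tuning setup: it stretches a compressed judge score vector and checks that the angle to the human vector is unchanged. The score values, function names, and the single-vector definition of the angle are all assumptions made for the sketch.

```python
import numpy as np

def spread(x):
    """Sample standard deviation of a score vector."""
    return float(np.std(x, ddof=1))

def angle_deg(x, y):
    """Angle in degrees between the mean-centered vectors x and y."""
    xc, yc = x - x.mean(), y - y.mean()
    cos = xc @ yc / (np.linalg.norm(xc) * np.linalg.norm(yc))
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

# Hypothetical scores for five items.
human = np.array([1.0, 4.0, 2.0, 5.0, 3.0])
judge = np.array([3.0, 3.1, 3.2, 3.1, 2.9])  # compressed range

# Stretch the judge scores around their mean: ten times the spread,
# same direction in item space.
stretched = judge.mean() + 10.0 * (judge - judge.mean())

print(spread(judge), spread(stretched))
print(angle_deg(human, judge), angle_deg(human, stretched))
```

Any positive affine rescaling leaves the centered direction untouched, so the two printed angles coincide while the spreads differ by the stretch factor.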

Only post-hoc calibration using a small, human-anchored dataset managed to improve all four community-health rubrics simultaneously. This approach allowed a calibrated 24B Indic judge ($r = 0.184$) to outperform GPT-5.5 ($r = 0.123$), although it still fell short of human reliability, which stood at $r = 0.474$ on the verifiable rubric. We contend that inter-LLM agreement should only be interpreted as evidence of human alignment if a direct geometric verification of the judge's score subspace confirms it; otherwise, such consensus merely reflects agreement within a collapsed subspace.
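The abstract does not say how the post-hoc calibration is done. As a rough sketch of the general idea, assuming ordinary least squares as a stand-in, the code below fits an affine map from a judge's per-rubric scores to human scores on a small anchor set and applies it to new items. The function names and the `(n_items, n_rubrics)` input layout are hypothetical.

```python
import numpy as np

def fit_calibration(judge_anchor, human_anchor):
    """Least-squares affine map from judge scores (n, d) to human scores (n,).
    Returns d weights followed by an intercept."""
    X = np.column_stack([judge_anchor, np.ones(len(judge_anchor))])
    coef, *_ = np.linalg.lstsq(X, human_anchor, rcond=None)
    return coef

def apply_calibration(coef, judge_scores):
    """Apply a fitted affine map to new judge scores of shape (n, d)."""
    X = np.column_stack([judge_scores, np.ones(len(judge_scores))])
    return X @ coef
```

Because the map mixes several judge outputs, it can change the direction of the resulting score vector as well as its scale, unlike a rescaling of a single score.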
