arXiv

Distribution-Calibrated Inference Time Compute for Thinking LLM-as-a-Judge

June 3, 2026 · Hamid Dadkhahi, Firas Trabelsi, Parker Riley, Juraj Juraska, Mehdi Mirzazadeh

Abstract:

When employing Thinking Large Language Models (LLMs) as judges to assess pairwise preferences, individual samples often yield noisy results. Furthermore, standard aggregation techniques—such as majority voting, soft self-consistency, or instruction-driven self-aggregation—prove unreliable when ties are permitted. To address these challenges, we investigate the role of inference-time compute (ITC) in evaluators that produce $n$ independent thinking-rating samples for each item. We introduce a rigorous, distribution-calibrated aggregation framework. This method utilizes a Bradley-Terry-Davidson formulation applied to rating counts to model three-way preferences. By integrating metrics for polarity (the margin among non-tied outcomes) and decisiveness (the frequency of non-tied outcomes), the approach effectively differentiates between narrow margins and robust consensus. Empirical evaluations across multiple benchmarks demonstrate that our method consistently lowers Mean Absolute Error (MAE) and boosts pairwise accuracy compared to conventional baselines. Additionally, when tested against human-consensus meta-labels, our system performs on par with or better than individual human raters. These findings indicate that strategic allocation of inference-time compute, combined with distribution-aware aggregation, transforms inconsistent model judgments into dependable evaluation metrics.
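To make the quantities named in the abstract concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes each item yields three rating counts (A preferred, B preferred, tie) from the $n$ samples, and it uses the standard Davidson tie extension of the Bradley-Terry model for a single pair. The function names, the `alpha` smoothing constant, and the parameterization by a log-strength difference are illustrative assumptions; the abstract does not specify how the paper calibrates or fits its model.

```python
import math


def summarize_counts(n_a, n_b, n_tie):
    """Return (polarity, decisiveness) for one item's rating counts.

    polarity     = margin among non-tied outcomes, in [-1, 1]
    decisiveness = fraction of samples that are not ties, in [0, 1]
    """
    n = n_a + n_b + n_tie
    if n == 0:
        raise ValueError("need at least one sample")
    decisive = n_a + n_b
    decisiveness = decisive / n
    # With no decisive votes the margin is undefined; treat it as neutral.
    polarity = (n_a - n_b) / decisive if decisive else 0.0
    return polarity, decisiveness


def davidson_probs(log_strength_diff, nu):
    """Three-way outcome probabilities (A wins, B wins, tie).

    Standard Davidson form with strengths normalized so that
    pi_A * pi_B = 1, i.e. pi_A = exp(d / 2) and pi_B = exp(-d / 2).
    nu >= 0 controls how much probability mass goes to ties.
    """
    a = math.exp(log_strength_diff / 2)
    b = math.exp(-log_strength_diff / 2)
    z = a + b + nu
    return a / z, b / z, nu / z


def fit_davidson(n_a, n_b, n_tie, alpha=0.5):
    """Closed-form fit of (d, nu) to one item's counts.

    alpha is an assumed additive smoothing term that keeps the
    estimates finite when a count is zero. With alpha = 0 and all
    counts positive, the fitted probabilities equal the empirical
    fractions, because the single-pair model is saturated.
    """
    a, b, t = n_a + alpha, n_b + alpha, n_tie + alpha
    d = math.log(a / b)
    nu = t / math.sqrt(a * b)
    return d, nu


if __name__ == "__main__":
    counts = (6, 2, 2)  # hypothetical counts from n = 10 samples
    print(summarize_counts(*counts))
    d, nu = fit_davidson(*counts)
    print(davidson_probs(d, nu))
```

Under this parameterization, polarity corresponds to $\tanh(d/2)$ and decisiveness to one minus the tie probability, so a narrow margin (small $|d|$) and a weak consensus (large $\nu$) are represented by separate parameters rather than being collapsed into a single vote share.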
