arXiv

Capturing LLM Capabilities via Evidence-Calibrated Query Clustering

June 2, 2026 · Fangzhou Wu, Sandeep Silwal, Qiuyi Zhang

Title: Enhancing LLM Capability Assessment Through Evidence-Calibrated Query Clustering

Abstract: Query clustering supports capability-aware evaluation of large language models (LLMs) by grouping queries according to the capabilities they require. Traditional clustering approaches, which rely on semantic taxonomies or embeddings, often fail to capture these latent requirements because surface-level semantics do not track how models actually perform. To address this, we introduce ECC, an algorithm that bridges this gap by refining initial semantic embeddings with a limited number of posterior model comparisons. ECC defines each cluster by a capability profile governed by a Bradley-Terry model and uses trainable mixture weights to handle queries that require multiple capabilities. It thereby jointly learns a flexible, capability-aware clustering that supports query-specific inference of LLM capabilities. Quantitative and qualitative evaluations show that ECC substantially improves the quality of LLM capability rankings: it surpasses human-labeled and embedding-based baselines by an average of 17.64 and 18.02 percentage points, respectively, and proves useful in downstream applications such as query routing.
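
To make the combination of per-cluster Bradley-Terry profiles and mixture weights more concrete, the following is a minimal sketch and not the authors' implementation. It assumes that each cluster holds one log-strength score per model and that a query's win probability is the mixture-weighted average of the per-cluster Bradley-Terry probabilities; the abstract does not specify the exact form. The function names, the two-cluster setup, and all numeric values are hypothetical.

```python
import math


def softmax(logits):
    """Turn unnormalized mixture logits into weights that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def bt_win_prob(scores, i, j):
    """Bradley-Terry probability that model i beats model j.

    `scores` holds one log-strength per model for a single cluster.
    """
    return 1.0 / (1.0 + math.exp(-(scores[i] - scores[j])))


def mixture_win_prob(mix_logits, cluster_scores, i, j):
    """Assumed mixture form: weight each cluster's Bradley-Terry
    probability by the query's membership in that cluster."""
    weights = softmax(mix_logits)
    return sum(
        w * bt_win_prob(scores, i, j)
        for w, scores in zip(weights, cluster_scores)
    )


# Hypothetical capability profiles for two clusters and two models.
cluster_scores = [
    [1.0, 0.0],  # cluster 0 favours model 0
    [0.0, 1.0],  # cluster 1 favours model 1
]

# A query split evenly between the clusters.
p_even = mixture_win_prob([0.0, 0.0], cluster_scores, 0, 1)

# A query that mostly needs the capability of cluster 0.
p_skew = mixture_win_prob([3.0, 0.0], cluster_scores, 0, 1)
```

Under these assumptions, the evenly split query gives neither model an edge, while the query weighted toward cluster 0 favours model 0. In a trained version, both the mixture logits and the cluster scores would be fitted to observed pairwise comparisons, for example by minimizing the negative log-likelihood of the observed winners.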

Source: arXiv