arXiv

Consistent and Distinctive: LLM Benchmark Efficiency via Maximum Independent Set Prompt Selection on Similarity Graphs

June 2, 2026 · Denica Kjorvezir, Marko Djukanović, Ana Gjorgjevikj, Gjorgjina Cenikj, Tome Eftimov

Abstract

Assessing large language models (LLMs) using extensive benchmark suites is often prohibitively costly and time-intensive. To address this, we introduce a graph-driven framework for prompt selection that represents each benchmark as a similarity graph. In this structure, nodes correspond to prompts, and edges are established when the embedding-space distance between them exceeds a user-defined threshold. By employing Maximum Independent Set (MIS) algorithms, the framework identifies a subset of prompts that is both maximally diverse and free from redundancy.
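The selection step described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it uses Euclidean distance, a nearest-rank percentile over all pairwise distances, and a minimum-degree greedy heuristic with seeded tie-breaking, all of which are assumptions. The names `build_graph`, `greedy_mis`, and the `edge_if_above` flag are hypothetical; the flag defaults to the edge rule as worded above (distance exceeding the threshold) and can be flipped to connect close pairs instead.

```python
import math
import random
from itertools import combinations


def build_graph(embeddings, percentile, edge_if_above=True):
    """Return adjacency sets for a thresholded distance graph over prompts."""
    n = len(embeddings)
    pairs = list(combinations(range(n), 2))
    dists = [math.dist(embeddings[i], embeddings[j]) for i, j in pairs]
    # Nearest-rank percentile of all pairwise distances (assumed convention).
    ordered = sorted(dists)
    k = math.ceil(percentile / 100 * len(ordered)) - 1
    threshold = ordered[min(len(ordered) - 1, max(0, k))]
    adj = {i: set() for i in range(n)}
    for (i, j), d in zip(pairs, dists):
        connected = d > threshold if edge_if_above else d < threshold
        if connected:
            adj[i].add(j)
            adj[j].add(i)
    return adj


def greedy_mis(adj, seed=0):
    """Greedy maximal independent set: repeatedly take a minimum-degree node."""
    rng = random.Random(seed)
    remaining = set(adj)
    selected = []
    while remaining:
        # Residual degree counts only neighbours that are still candidates.
        degree = {v: len(adj[v] & remaining) for v in remaining}
        lowest = min(degree.values())
        ties = sorted(v for v in remaining if degree[v] == lowest)
        v = rng.choice(ties)  # the seed only affects tie-breaking
        selected.append(v)
        remaining -= adj[v] | {v}  # drop the node and all its neighbours
    return sorted(selected)


# Toy data: 40 random 8-dimensional "prompt embeddings" (illustrative only).
data_rng = random.Random(42)
prompts = [[data_rng.gauss(0, 1) for _ in range(8)] for _ in range(40)]
graph = build_graph(prompts, percentile=90)
subset = greedy_mis(graph, seed=1)
```

The greedy heuristic returns a maximal independent set, which is not necessarily a maximum one; an exact solver would be needed to guarantee optimality.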

We tested four MIS solvers (CPLEX, GREEDY, Online-MIS, and ReduMIS) across an experimental matrix of six embedding models, three distance metrics, six percentile thresholds, and four benchmarks (GPQA, IFEval, MMLU-Pro, and Omni-MATH). The evaluation covered 66 LLMs. Our primary hypothesis is that repeated selections with different random seeds produce LLM rankings that are consistent with one another, even though they may diverge from the ranking obtained on the full benchmark. The findings strongly support this hypothesis: Kendall’s $W$ exceeded 0.90 in 99.2% of stochastic configurations, with a mean of $0.997 \pm 0.008$. At higher percentile thresholds, the selected subsets reduced the number of prompts by 25–48% on average.
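For readers unfamiliar with Kendall's $W$, the sketch below shows how agreement across seeded runs could be measured. It uses the standard tie-free formula $W = 12S / (m^2(n^3 - n))$, where $m$ is the number of runs, $n$ the number of ranked LLMs, and $S$ the sum of squared deviations of the per-model rank sums from their mean. The function names and the toy scores are hypothetical, and no tie correction is applied.

```python
def ranks_from_scores(scores):
    """Rank 1 = highest score; assumes no tied scores."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0] * len(scores)
    for position, i in enumerate(order, start=1):
        ranks[i] = position
    return ranks


def kendalls_w(rankings):
    """Kendall's W for m complete, tie-free rankings of the same n items.

    rankings[r][i] is the rank (1..n) that run r assigns to item i.
    """
    m = len(rankings)
    n = len(rankings[0])
    rank_sums = [sum(run[i] for run in rankings) for i in range(n)]
    mean_sum = m * (n + 1) / 2
    s = sum((total - mean_sum) ** 2 for total in rank_sums)
    return 12 * s / (m ** 2 * (n ** 3 - n))


# Made-up accuracies of four models under three seeds (illustrative only).
runs = [
    [0.71, 0.64, 0.52, 0.40],
    [0.69, 0.66, 0.50, 0.41],
    [0.70, 0.61, 0.55, 0.38],
]
w = kendalls_w([ranks_from_scores(run) for run in runs])
```

Here every run orders the four models identically, so `w` reaches its maximum of 1; two exactly opposite rankings would give 0.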

Rankings deviated from the full-benchmark baseline (rank correlation $\rho < 0.95$) in only 15.95% of configurations. These deviations were concentrated at the lower thresholds ($p_{10}$–$p_{20}$) and on two benchmarks (GPQA and IFEval), pointing to overly dense graphs as the main cause of failure.
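The deviation check can be illustrated as follows, assuming that $\rho$ denotes Spearman's rank correlation, which the text does not state explicitly. The sketch uses the tie-free closed form $\rho = 1 - 6\sum d_i^2 / (n(n^2 - 1))$; the function names and the example rankings are hypothetical.

```python
def spearman_rho(rank_a, rank_b):
    """Spearman's rho for two tie-free rankings of the same n items."""
    n = len(rank_a)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))


def deviates(rank_subset, rank_full, cutoff=0.95):
    """Flag a configuration whose subset ranking falls below the cutoff."""
    return spearman_rho(rank_subset, rank_full) < cutoff


# Illustrative rankings of ten models (not data from the paper).
full = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
one_swap = [1, 2, 3, 4, 5, 6, 7, 8, 10, 9]    # a single adjacent swap
many_swaps = [2, 1, 4, 3, 6, 5, 8, 7, 10, 9]  # every adjacent pair swapped

flags = [deviates(one_swap, full), deviates(many_swaps, full)]
```

With only ten models a single adjacent swap stays above the 0.95 cutoff, while swapping every adjacent pair falls below it; with more ranked models, the same number of local swaps lowers $\rho$ less.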
