arXiv

Evaluating and Calibrating LLM Confidence on Questions with Multiple Correct Answers

June 3, 2026 · Yuhan Wang, Shiyu Ni, Zhikai Ding, Zihang Zhan, Yuanzi Li, Keping Bi

Title: Assessing and Calibrating LLM Confidence for Queries With Multiple Valid Responses

Abstract: Reliable large language models (LLMs) depend on accurate confidence calibration, yet current training-free calibration methods have largely been evaluated on questions with a single correct answer. We show that these methods fail when multiple valid responses are possible, because disagreement among equally correct answers leads to a systematic underestimation of confidence. To study the problem systematically, we present MACE, a new benchmark of 12,000 factual questions across six domains, with varying numbers of correct answers per question. Our experiments evaluate 15 prominent calibration methods on four LLM families ranging from 7B to 72B parameters and show that, as the number of correct answers increases, model accuracy improves while estimated confidence steadily declines. This gap leads to significant miscalibration, particularly in settings that mix questions with different numbers of correct answers. To mitigate it, we introduce Semantic Confidence Aggregation (SCA), a technique that combines confidence scores from multiple high-probability sampled responses. SCA delivers state-of-the-art calibration in mixed-answer settings while maintaining robust calibration on single-answer questions.
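The underestimation effect described in the abstract can be illustrated with a toy example. The sketch below is not the paper's SCA implementation, whose details the abstract does not give. It assumes a simple sampling-based confidence (the share of samples agreeing with the most frequent answer) and contrasts it with a hypothetical aggregation that sums the shares of all frequently sampled answers. The function names, the `min_count` threshold, the string normalization used in place of semantic matching, and the sample data are all illustrative assumptions.

```python
from collections import Counter


def normalize(answer: str) -> str:
    """Crude stand-in for semantic matching: case- and whitespace-insensitive."""
    return answer.strip().lower()


def consistency_confidence(samples: list[str]) -> float:
    """Share of sampled answers that agree with the single most frequent answer."""
    counts = Counter(normalize(s) for s in samples)
    top_count = counts.most_common(1)[0][1]
    return top_count / len(samples)


def aggregated_confidence(samples: list[str], min_count: int = 2) -> float:
    """Hypothetical aggregation: sum the shares of every answer sampled at
    least `min_count` times, treating each as a plausible valid answer."""
    counts = Counter(normalize(s) for s in samples)
    kept = sum(c for c in counts.values() if c >= min_count)
    return kept / len(samples)


# Toy samples for a question with several valid answers
# (for example, "Name a primary color"); the data is made up.
samples = ["red", "Blue", "yellow", "red", "blue",
           "yellow", "red", "blue", "green", "red "]

print(consistency_confidence(samples))  # 0.4
print(aggregated_confidence(samples))   # 0.9
```

In this toy case the samples split across three valid answers, so the agreement-based score is only 0.4 even though nine of the ten samples are plausible answers. Summing over the frequently sampled answers recovers a higher score. On a single-answer question where all samples agree, both functions return 1.0. A fixed threshold like `min_count` would also count frequently repeated wrong answers, which is one reason a real method needs a more careful aggregation rule.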

Source: arXiv · Generated at: 2026-06-03 00:00:00 UTC