arXiv

Evaluating and Calibrating LLM Confidence on Questions with Multiple Correct Answers

Title: Assessing and Fine-Tuning LLM Confidence for Queries With Multiple Valid Responses

Abstract: Ensuring that large language models (LLMs) are reliable hinges on accurate confidence calibration; however, current training-free techniques have largely been evaluated within the context of questions requiring a single answer. This study demonstrates that such methods fail when multiple valid responses are possible, as disagreement among equally correct answers results in a systematic underestimation of confidence. To facilitate a comprehensive analysis of this issue, we present MACE, a new benchmark comprising 12,000 factual questions across six distinct domains, featuring varying numbers of correct answers. Our experiments, which evaluate 15 prominent calibration methods against four LLM families ranging from 7B to 72B parameters, indicate that while model accuracy improves as the number of correct answers increases, estimated confidence steadily declines. This disparity leads to significant miscalibration, particularly for questions with mixed answer counts. To mitigate this challenge, we introduce Semantic Confidence Aggregation (SCA), a technique that combines confidence scores from multiple high-probability sampled responses. SCA delivers state-of-the-art calibration results in mixed-answer scenarios while maintaining robust calibration performance on single-answer questions.
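The underestimation described in the abstract can be illustrated with a toy example. The sketch below is not the paper's SCA implementation. It assumes confidence is estimated from agreement among sampled answers, and it uses a hypothetical aggregation rule (sum the sample shares of every answer whose share reaches an assumed threshold `min_share`) as one possible reading of "combining confidence from multiple high-probability sampled responses". The function names, the threshold, and the sample data are all illustrative assumptions.

```python
from collections import Counter


def majority_confidence(samples):
    """Agreement-based confidence: share of samples matching the most frequent answer."""
    counts = Counter(samples)
    answer, freq = counts.most_common(1)[0]
    return answer, freq / len(samples)


def aggregated_confidence(samples, min_share=0.2):
    """Hypothetical aggregation (an assumption, not the paper's SCA):
    sum the shares of all answers whose share is at least min_share."""
    counts = Counter(samples)
    n = len(samples)
    kept = {a: c / n for a, c in counts.items() if c / n >= min_share}
    return kept, sum(kept.values())


# Single-answer question: the samples mostly agree on one answer.
single = ["Paris"] * 9 + ["Lyon"]

# Multi-answer question: three different answers are each correct,
# so the samples disagree even though most of them are right.
multi = ["Mercury"] * 3 + ["Venus"] * 3 + ["Mars"] * 3 + ["Pluto"]

if __name__ == "__main__":
    for name, samples in [("single", single), ("multi", multi)]:
        _, maj = majority_confidence(samples)
        _, agg = aggregated_confidence(samples)
        print(f"{name}: majority={maj:.2f} aggregated={agg:.2f}")
```

In this toy setup, agreement-based confidence falls on the multi-answer question although nine of the ten samples are correct, while the aggregated score matches the single-answer case. A real method would also need to group semantically equivalent answers and to guard against a model that spreads its probability over several wrong answers; this sketch does neither.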


Source: arXiv
Generated at: 2026-06-03 00:00:00 UTC
