arXiv

Retrieval-Augmented Linguistic Calibration

June 2, 2026 · Yi-Fan Yeh, Linwei Tao, Minjing Dong, Tao Huang, Jialin Yu, Philip Torr, Chang Xu

Original: arXiv:2605.19344v2

Abstract: Linguistic cues such as "I believe" and "probably" offer an intuitive interface for communicating confidence, yet a generalisable, principled calibration framework for linguistic confidence expressions remains underexplored. In particular, co-occurring linguistic cues, contextual variation, and subjective audience interpretation pose unique challenges. We therefore model linguistic confidence as a distribution over plausible perceived probability values that a statement is correct, capturing interpretation variability that scalar representations discard. Within this distributional framework, we introduce faithfulness as a complementary evaluation dimension and present Faithfulness Divergence (FD), an information-theoretic metric quantifying the surprise induced in audience beliefs upon truth revelation. Building on these foundations, we present Retrieval-Augmented Linguistic Calibration (RALC), a lightweight post-hoc pipeline that propagates calibrated confidence signals back into natural language via retrieval-augmented rewriting. Across three QA benchmarks and five LLM families, RALC improves in-domain faithfulness and calibration up to 66% and 58%, respectively, outperforming black-box and grey-box calibration baselines.

Rewrite:

Phrases such as "I believe" and "probably" offer an intuitive way to convey confidence, yet a generalizable, principled framework for calibrating such linguistic expressions remains underexplored. Three factors make the problem hard: cues co-occur within a single statement, their meaning varies with context, and audiences interpret them subjectively. To address this, we model linguistic confidence not as a single value but as a distribution over the plausible probabilities that an audience might perceive for a statement being correct. This retains the variability in interpretation that scalar representations discard.
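As a rough illustration of why a distribution carries more than a scalar, the sketch below bins hypothetical audience ratings into a histogram. The ratings, the bin count, and the helper name `to_histogram` are assumptions made for this example and do not come from the paper. Both phrases share the same mean perceived probability, so a scalar summary would treat them as identical, while the histograms show that readers agree far more on one than on the other.

```python
from statistics import mean, pstdev

# Hypothetical audience ratings (invented for illustration): the probability
# each reader assigns to a statement being correct after reading the phrase.
ratings = {
    "probably": [0.62, 0.66, 0.70, 0.74, 0.78],
    "I believe": [0.42, 0.56, 0.70, 0.84, 0.98],
}


def to_histogram(samples, n_bins=5):
    """Bin samples in [0, 1] into n_bins equal-width bins and normalise."""
    counts = [0] * n_bins
    for p in samples:
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        counts[idx] += 1
    return [c / len(samples) for c in counts]


# A scalar view keeps only the mean; the distributional view keeps the shape.
summary = {
    phrase: {
        "mean": mean(samples),
        "spread": pstdev(samples),
        "hist": to_histogram(samples),
    }
    for phrase, samples in ratings.items()
}

for phrase, stats in summary.items():
    print(phrase, stats)
```

In this toy data every "probably" rating falls in a single bin, whereas the "I believe" ratings spread over three bins, which is exactly the interpretation variability a single number would hide.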

Within this distributional framework, we introduce faithfulness as an evaluation dimension that complements calibration. We measure it with Faithfulness Divergence (FD), an information-theoretic metric that quantifies the surprise induced in an audience's beliefs when the truth is revealed. Building on these foundations, we present Retrieval-Augmented Linguistic Calibration (RALC), a lightweight post-hoc pipeline that uses retrieval-augmented rewriting to propagate calibrated confidence signals back into natural language.
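The abstract does not give the formula for FD, so the following is not the paper's metric. It is a minimal sketch of the general idea of "surprise upon truth revelation", using the standard notion of surprisal (negative log-likelihood) averaged over a discrete belief distribution. The function name `expected_surprisal`, the `(probability, weight)` representation, and the example beliefs are all assumptions made for illustration.

```python
import math


def expected_surprisal(belief, is_correct, eps=1e-12):
    """Average surprisal in nats over a discrete belief distribution.

    belief: list of (perceived_probability, weight) pairs; weights sum to 1.
    is_correct: whether the statement turned out to be true.
    NOTE: an illustrative stand-in, not the paper's Faithfulness Divergence.
    """
    total = 0.0
    for p, w in belief:
        # Probability the audience member assigned to what actually happened.
        likelihood = p if is_correct else 1.0 - p
        total += w * -math.log(max(likelihood, eps))
    return total


# Hypothetical audience beliefs induced by two phrasings of the same answer.
confident = [(0.90, 0.5), (0.95, 0.5)]
hedged = [(0.50, 0.5), (0.60, 0.5)]

for name, belief in (("confident", confident), ("hedged", hedged)):
    print(name, expected_surprisal(belief, True), expected_surprisal(belief, False))
```

Under this stand-in, confident wording yields little surprise when the answer is right and a great deal when it is wrong, while hedged wording sits in between in both cases. That asymmetry is the intuition behind judging wording by how well the beliefs it induces match the outcome.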

Across three question answering (QA) benchmarks and five large language model (LLM) families, RALC improves in-domain faithfulness by up to 66% and calibration by up to 58%, outperforming both black-box and grey-box calibration baselines.

Source: arXiv Generated at: 2026-06-02 00:00:00 UTC