arXiv

Truthful AI Advisors: A Pre-Specified Benchmark for Large Language Model Honesty Under Preference Misalignment

Abstract

As large language models are increasingly utilized as advisors, their objectives often diverge from those of the users they serve. For instance, recommender systems prioritize user engagement, sales assistants aim to drive purchases, and negotiation agents seek to secure concessions. A critical question in alignment evaluation is whether these models remain truthful when honesty conflicts with their own incentives. To address this, we adapt the classic Crawford-Sobel cheap-talk framework into a standardized benchmark for assessing LLM honesty in scenarios involving preference misalignment.

Economic theory suggests that in such contexts, neither complete transparency nor total silence is optimal. Instead, the sender should employ coarse, monotone partitions, with the number of informative intervals decreasing as the conflict of interest intensifies. In our experimental setup, a sender observes a state $\omega \in [0,1]$ and wants the receiver's action to be close to $\omega+b$, where $b$ is the sender's bias. The receiver, whose ideal action is exactly $\omega$, receives a single costless message from the sender. Our design incorporates five levels of bias, three distinct prompt frames, a fixed low-temperature setting, and 200 states per experimental cell, resulting in 12,000 total sender calls.

For the positive-bias grid $b \in \{0.01, 0.04, 0.08, 0.12\}$, theory predicts that the most-informative equilibrium partitions have 7, 4, 3, and 2 intervals, respectively. The corresponding oracle normalized mutual information values are 0.5294, 0.3268, 0.2205, and 0.1829.
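The partition sizes quoted above can be checked with a minimal sketch, assuming the textbook uniform-quadratic version of the Crawford-Sobel model (state uniform on $[0,1]$, quadratic loss for both players). The abstract does not state these assumptions explicitly, and the function names `max_partition_size` and `partition_boundaries` are illustrative rather than taken from the paper. Under these assumptions an $N$-interval equilibrium exists when $2N(N-1)b < 1$, and its cutoffs are $a_i = i/N + 2bi(i-N)$.

```python
from typing import List


def max_partition_size(b: float) -> int:
    """Largest N with 2*N*(N-1)*b < 1 (uniform-quadratic case, assumed here)."""
    if b <= 0:
        raise ValueError("b must be positive; this bound only applies for b > 0")
    n = 1
    # An (n+1)-interval equilibrium exists iff 2*(n+1)*n*b < 1.
    while 2 * (n + 1) * n * b < 1:
        n += 1
    return n


def partition_boundaries(b: float, n: int) -> List[float]:
    """Cutoffs a_0..a_n of the n-interval equilibrium: a_i = i/n + 2*b*i*(i - n)."""
    return [i / n + 2 * b * i * (i - n) for i in range(n + 1)]


if __name__ == "__main__":
    # Positive-bias grid from the abstract.
    for b in (0.01, 0.04, 0.08, 0.12):
        n = max_partition_size(b)
        cuts = [round(a, 4) for a in partition_boundaries(b, n)]
        print(f"b={b}: N={n}, cutoffs={cuts}")
```

The sketch covers only the partition sizes. It does not attempt to reproduce the oracle normalized mutual information values, because the abstract does not specify how that quantity is normalized.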

We evaluated four instruction-tuned models (GPT-4o, Claude Sonnet 4.5, Gemini 2.5 Flash-Lite, and Llama-3.3-70B) using this full design. Our findings indicate that all four models significantly over-reveal information compared to the most-informative equilibrium, exceeding it by a factor of 1.8 to 4.2. Specifically, while the oracle prescribes normalized mutual information between 0.18 and 0.53, the models maintain levels between 0.78 and 0.94.
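The total of 12,000 sender calls is consistent with the stated design if each of the four models is run once on every cell. That reading is an assumption, since the abstract does not break the total down. A short sketch of the arithmetic, with the constant names being illustrative:

```python
from itertools import product

# Design dimensions as stated in the abstract.
N_BIAS_LEVELS = 5
N_FRAMES = 3
N_STATES_PER_CELL = 200
N_MODELS = 4  # assumption: every model sees the full design exactly once

# One cell per (model, bias level, prompt frame) combination.
cells = list(product(range(N_MODELS), range(N_BIAS_LEVELS), range(N_FRAMES)))

calls_per_model = N_BIAS_LEVELS * N_FRAMES * N_STATES_PER_CELL
total_calls = len(cells) * N_STATES_PER_CELL

print(len(cells), calls_per_model, total_calls)
```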

Although informativeness does decline with increased bias as theory predicts, it never approaches the strategic optimum. Rather than forming coarse partitions, the models exhibit near-full revelation with a constant upward offset that tracks their bias, which amounts to linear exaggeration. Framing the task as payoff-maximizing rather than honest had a negligible impact on outcomes. Additionally, a decoder ablation study revealed that these results are recoverable only when the receiver explicitly reads the sender's stated number; an embedding-only decoder misread the same messages as near-babbling.


Source: arXiv. Generated at: 2026-06-02 00:00:00 UTC
