Truthful AI Advisors: A Pre-Specified Benchmark for Large Language Model Honesty Under Preference Misalignment
Abstract
As large language models are increasingly utilized as advisors, their objectives often diverge from those of the users they serve. For instance, recommender systems prioritize user engagement, sales assistants aim to drive purchases, and negotiation agents seek to secure concessions. A critical question in alignment evaluation is whether these models remain truthful when honesty conflicts with their own incentives. To address this, we adapt the classic Crawford-Sobel cheap-talk framework into a standardized benchmark for assessing LLM honesty in scenarios involving preference misalignment.
Economic theory suggests that in such contexts, neither complete transparency nor total silence is optimal. Instead, the sender should employ coarse, monotone partitions, with the number of informative intervals decreasing as the conflict of interest intensifies. In our experimental setup, a sender observes a state $\omega \in [0,1]$ and aims to move the receiver’s action close to $\omega+b$. The receiver, whose ideal action is exactly $\omega$, receives a single costless message from the sender. Our design incorporates five levels of bias, three distinct prompt frames, a fixed low-temperature setting, and 200 states per experimental cell, giving 3,000 sender calls per model and 12,000 in total across the four models evaluated.
For the positive-bias grid $b \in \{0.01, 0.04, 0.08, 0.12\}$, the theoretically most-informative partition sizes are predicted to be 7, 4, 3, and 2, respectively. The corresponding oracle normalized mutual information values are 0.5294, 0.3268, 0.2205, and 0.1829.
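The partition sizes quoted above can be reproduced with a short sketch. It assumes the textbook uniform-quadratic Crawford-Sobel specification (state uniform on $[0,1]$, quadratic loss for both players), which the abstract does not state explicitly. Under that assumption, the largest equilibrium partition size is the largest $N$ with $2N(N-1)b < 1$, and the interval boundaries are $a_i = i/N + 2bi(i-N)$. The function names are illustrative and are not taken from the paper's code.

```python
def max_partition_size(b: float) -> int:
    """Largest N satisfying 2*N*(N-1)*b < 1.

    Assumes the uniform-quadratic Crawford-Sobel model (an assumption
    of this sketch, not something confirmed by the abstract).
    """
    if b <= 0:
        raise ValueError("b must be positive; b = 0 admits full revelation")
    n = 1
    # Keep growing the partition while an (n + 1)-interval equilibrium exists.
    while 2 * (n + 1) * n * b < 1:
        n += 1
    return n


def partition_boundaries(b: float) -> list[float]:
    """Boundaries a_0 = 0 < a_1 < ... < a_N = 1 of the most-informative partition."""
    n = max_partition_size(b)
    return [i / n + 2 * b * i * (i - n) for i in range(n + 1)]


if __name__ == "__main__":
    grid = [0.01, 0.04, 0.08, 0.12]
    print([max_partition_size(b) for b in grid])  # -> [7, 4, 3, 2]
    for b in grid:
        print(b, [round(a, 4) for a in partition_boundaries(b)])
```

The intervals widen toward the top of the state space, so the sender's messages are coarsest for high states. The oracle normalized mutual information values depend on a normalization the abstract does not define, so the sketch does not attempt to reproduce them.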
We evaluated four instruction-tuned models (GPT-4o, Claude Sonnet 4.5, Gemini 2.5 Flash-Lite, and Llama-3.3-70B) using this full design. Our findings indicate that all four models significantly over-reveal information compared to the most-informative equilibrium, exceeding it by a factor of 1.8 to 4.2. Specifically, while the oracle prescribes normalized mutual information between 0.18 and 0.53, the models maintain levels between 0.78 and 0.94.
Although informativeness does decline with increased bias as theory predicts, it never approaches the strategic optimum. Rather than forming coarse partitions, the models exhibit near-full revelation characterized by a constant upward offset that tracks their bias, effectively resulting in linear exaggeration. We found that framing the task as payoff-maximizing versus honest had a negligible impact on outcomes. Additionally, a decoder ablation study revealed that these results are only recoverable when the receiver explicitly reads the sender’s stated number; an embedding-only decoder misinterpreted the same data as near-babbling.
Source: arXiv Generated at: 2026-06-02 00:00:00 UTC





