Hallucination Is Linearly Decodable from Mid-Layer Hidden States in Quantized LLMs
Title: Truthfulness Signals in Quantized LLMs Are Linearly Decodable from Mid-Layer Hidden States
Abstract:
This study tests whether the hidden states of open-source large language models (LLMs) contain a linearly separable truthfulness signal, and identifies the network depths at which that signal is strongest. We analyzed three instruction-tuned models in the 7B to 8B parameter range (Llama-3.1-8B, Mistral-7B, and Qwen2.5-7B), all run with 4-bit NF4 quantization. We extracted hidden states from every layer on four hallucination benchmarks (TruthfulQA, HaluEval-QA, FEVER, and a controlled synthetic dataset) and evaluated four detection methods: probes (linear and MLP), INSIDE EigenScore, self-consistency, and attention entropy.
A linear probe applied to a single intermediate layer reached an AUROC between 0.904 and 1.000 on held-out test splits, whereas sampling-based detectors did not exceed an AUROC of 0.541 under identical conditions. The results suggest that the truthfulness signal is predominantly linear: MLP probes never improved on their linear counterparts by more than 0.01 AUROC. The layers with peak probing performance were also consistent across model families on the natural-language benchmarks: blocks 13 through 18 of 32 for Llama and Mistral, and blocks 19 through 25 of 28 for Qwen. First-block attention entropy offered a complementary signal in knowledge-grounded scenarios, achieving an AUROC of 0.866–0.941 on HaluEval-QA at no extra inference cost. We attribute the poor performance of sampling methods to a structural mismatch between the paired-label evaluation framework and the data these methods rely on, rather than to an intrinsic flaw in the methods themselves. For reproducibility, we have released our code and data; the experiments require only a single 8 GB GPU.
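As a rough illustration of the probing setup described in the abstract, the sketch below trains a logistic-regression probe on feature vectors and scores it with rank-based AUROC on a held-out split. It is not the released code: the vectors are synthetic Gaussians standing in for hidden states, and the function names, dimensions, label convention, and hyperparameters (`make_synthetic_states`, `dim=8`, `epochs`, `lr`) are all assumptions made for the example.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def make_synthetic_states(n, dim, rng):
    """Hypothetical stand-in for hidden states: classes differ only along axis 0."""
    xs, ys = [], []
    for i in range(n):
        y = i % 2  # illustrative labels: 1 = truthful, 0 = hallucinated
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        x[0] += 2.0 if y == 1 else -2.0
        xs.append(x)
        ys.append(y)
    return xs, ys


def score(x, w, b):
    """Linear probe logit: w . x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def train_linear_probe(xs, ys, epochs=200, lr=0.1):
    """Full-batch gradient descent on the logistic loss."""
    dim, n = len(xs[0]), len(xs)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(score(x, w, b)) - y
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def auroc(scores, labels):
    """Probability that a positive outscores a negative; ties count as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


rng = random.Random(0)
train_x, train_y = make_synthetic_states(200, 8, rng)
test_x, test_y = make_synthetic_states(200, 8, rng)

w, b = train_linear_probe(train_x, train_y)
test_auc = auroc([score(x, w, b) for x in test_x], test_y)
print(f"held-out AUROC: {test_auc:.3f}")
```

In a real run, the synthetic vectors would be replaced by the hidden states of one layer for each labeled example, and the same train-then-score loop repeated per layer to locate the depth at which the probe performs best.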
Source: arXiv
Generated at: 2026-06-03 00:00:00 UTC



