arXiv

Hallucination Is Linearly Decodable from Mid-Layer Hidden States in Quantized LLMs

Title: Truthfulness Signals in Quantized LLMs Are Linearly Decodable from Mid-Layer Hidden States

Abstract:

This study tests whether the hidden states of open-source large language models (LLMs) carry a linearly separable truthfulness signal, and identifies the network depths at which that signal is strongest. We analyzed three instruction-tuned models with 7B to 8B parameters (Llama-3.1-8B, Mistral-7B, and Qwen2.5-7B), all run with 4-bit NF4 quantization. We extracted hidden states from every layer on four hallucination benchmarks (TruthfulQA, HaluEval-QA, FEVER, and a controlled synthetic dataset) and evaluated four detection approaches: linear and MLP probes, INSIDE EigenScore, self-consistency, and attention entropy.
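To make the probing setup concrete, here is a minimal sketch of a linear probe scored by AUROC. It is not the released code: the "hidden states" are synthetic Gaussian vectors with a class-dependent shift along one direction, and the helper names (`make_synthetic_states`, `train_linear_probe`, `probe_score`, `auroc`) and all hyperparameters are assumptions chosen for illustration.

```python
import math
import random


def make_synthetic_states(n, d, shift, rng):
    """Stand-in for per-layer hidden states (an assumption, not real model data).

    Label 1 vectors are shifted by +shift along dimension 0, label 0 by -shift.
    """
    X, y = [], []
    for i in range(n):
        label = i % 2
        x = [rng.gauss(0.0, 1.0) for _ in range(d)]
        x[0] += shift if label == 1 else -shift
        X.append(x)
        y.append(label)
    return X, y


def probe_score(w, b, x):
    """Linear probe logit: w . x + b."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def train_linear_probe(X, y, lr=0.1, epochs=200):
    """Fit a logistic-regression probe by batch gradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            z = max(-30.0, min(30.0, probe_score(w, b, xi)))  # clamp to keep exp() finite
            err = 1.0 / (1.0 + math.exp(-z)) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def auroc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


rng = random.Random(0)
X_train, y_train = make_synthetic_states(200, 8, 1.5, rng)
X_test, y_test = make_synthetic_states(100, 8, 1.5, rng)

w, b = train_linear_probe(X_train, y_train)
probe_auroc = auroc([probe_score(w, b, x) for x in X_test], y_test)
print(f"held-out AUROC on synthetic states: {probe_auroc:.3f}")
```

In a real run, `X_train` and `X_test` would hold one vector per example taken from a single layer of the model, and the probe would be refit for each layer to find where the signal peaks.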

A linear probe applied to a single intermediate layer reaches an AUROC between 0.904 and 1.000 on held-out test splits. By contrast, sampling-based detectors did not exceed an AUROC of 0.541 under identical conditions. The truthfulness signal appears to be predominantly linear: MLP probes improved on their linear counterparts by at most 0.01 AUROC. On the natural-language benchmarks, the layers with peak probing performance were consistent across model families: blocks 13 through 18 of 32 for Llama and Mistral, and blocks 19 through 25 of 28 for Qwen. First-block attention entropy offered a complementary signal in knowledge-grounded settings, achieving an AUROC of 0.866–0.941 on HaluEval-QA at no extra inference cost. We attribute the poor performance of sampling methods to a structural mismatch between paired-label evaluation and the data these methods rely on, rather than to an intrinsic flaw in the methods themselves. For reproducibility, we release our code and data; all experiments run on a single 8 GB GPU.
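As an illustration of the attention-entropy feature, the sketch below computes the Shannon entropy of attention distributions and averages it over heads and query positions. The abstract does not say how the entropy is aggregated, so the averaging scheme, the function names, and the toy attention weights are all assumptions.

```python
import math


def attention_entropy(attn_row):
    """Shannon entropy (in nats) of one attention distribution over key positions."""
    total = sum(attn_row)
    probs = [a / total for a in attn_row]  # renormalize in case of rounding drift
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_attention_entropy(attn):
    """Average entropy over all heads and query positions.

    attn[h][q] is the list of attention weights over keys for head h and
    query position q. This nested-list layout is a hypothetical stand-in
    for one block's attention tensor.
    """
    rows = [row for head in attn for row in head]
    return sum(attention_entropy(row) for row in rows) / len(rows)


# Toy example: one head attends sharply, the other spreads attention evenly.
peaked_head = [[0.97, 0.01, 0.01, 0.01], [0.01, 0.97, 0.01, 0.01]]
diffuse_head = [[0.25, 0.25, 0.25, 0.25], [0.25, 0.25, 0.25, 0.25]]

print(f"peaked head:  {mean_attention_entropy([peaked_head]):.3f} nats")
print(f"diffuse head: {mean_attention_entropy([diffuse_head]):.3f} nats")
```

Because the attention weights are already produced during the forward pass, a feature of this kind needs no additional sampling, which is consistent with the abstract's point about no extra inference cost.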


Source: arXiv Generated at: 2026-06-03 00:00:00 UTC
