Geometry-Aware Hallucination Detection in Large Language Models
Abstract:
Large language models (LLMs) are prone to producing content that is factually inaccurate or unsupported, a phenomenon widely known as hallucination. Previous research has investigated various mitigation techniques, including decoding strategies, retrieval-augmented generation, and supervised fine-tuning, and recent findings highlight the significant impact of in-context learning (ICL) on factual accuracy. Despite this, current methods for selecting ICL demonstrations often depend on superficial similarity heuristics, which limits their robustness across models and tasks.
To address these limitations, we introduce GA-ICL, a geometry-aware framework for sampling in-context demonstrations. GA-ICL uses latent representations from frozen LLMs to select examples by their proximity to learned prototypes, rather than relying solely on lexical or embedding similarity. The selection criterion combines local manifold structure with class-aware prototype geometry.
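The idea of prototype-based selection can be illustrated with a minimal sketch. It is not the GA-ICL procedure itself: the abstract does not specify how prototypes are learned or how the manifold term is computed, so this sketch assumes class centroids as stand-in prototypes, Euclidean distance to the query as a crude proxy for local structure, and a hypothetical mixing weight `alpha`. All function names are illustrative.

```python
import math
from collections import defaultdict


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def class_prototypes(pool):
    """pool: list of (embedding, label) pairs. Returns {label: centroid}.

    Assumption: a class centroid stands in for a 'learned prototype'.
    """
    by_label = defaultdict(list)
    for emb, label in pool:
        by_label[label].append(emb)
    return {label: mean_vector(embs) for label, embs in by_label.items()}


def select_demonstrations(query, pool, per_class=1, alpha=0.5):
    """Return {label: [pool indices]} of the best-scoring examples per class.

    Score (lower is better) blends distance to the query with distance to
    the example's own class prototype; alpha is a hypothetical weight.
    """
    prototypes = class_prototypes(pool)
    scored = defaultdict(list)
    for idx, (emb, label) in enumerate(pool):
        score = (alpha * euclidean(emb, query)
                 + (1 - alpha) * euclidean(emb, prototypes[label]))
        scored[label].append((score, idx))
    return {label: [idx for _, idx in sorted(items)[:per_class]]
            for label, items in scored.items()}


# Toy 2-D embeddings; in practice these would be hidden states of a frozen LLM.
pool = [
    ([0.0, 0.0], "supported"),
    ([2.0, 0.0], "supported"),
    ([1.0, 0.0], "supported"),
    ([10.0, 10.0], "hallucinated"),
    ([12.0, 10.0], "hallucinated"),
]
chosen = select_demonstrations([1.0, 1.0], pool)
```

Selecting per class keeps both labels represented in the prompt, which is one plausible reading of "class-aware"; a purely similarity-based retriever could return examples from a single class.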
Our evaluations on the FEVER benchmark for factual verification and the HaluEval benchmark for hallucination detection demonstrate that GA-ICL surpasses standard ICL selection baselines in most tested scenarios. The framework shows particularly notable improvements in dialogue and summarization tasks. Furthermore, GA-ICL maintains robustness against temperature variations and differences in model architectures, suggesting greater stability compared to heuristic retrieval methods.
Although lexical retrieval can still perform competitively in certain question-answering contexts for smaller models, our findings indicate that geometry-aware prototype selection offers a reliable, training-efficient solution for hallucination detection that does not require modifying LLM parameters. Extended tests on larger models, specifically Phi-14B and Qwen3-32B, confirm that GA-ICL scales effectively. On these larger models it outperforms all compared baselines, including in question-answering tasks where smaller models exhibit limitations at boundary conditions, thereby providing a principled path forward for improving ICL demonstration selection.
Source: arXiv
Generated at: 2026-06-04 00:00:00 UTC






