arXiv

Med-HEAL: Analyzing and Mitigating Hallucinations in Medical LLMs with Hallucination-Aware In-Context Learning

June 2, 2026 · Yiming Liao, Zeno Franco, Jose Eduardo Lizarraga Mazaba, Keke Chen

Abstract: Hallucinations in medical large language models (LLMs) pose significant risks for clinical decision support, especially when these systems must reason over complex electronic health records (EHRs). Current benchmarks often fail to reflect realistic clinical scenarios and offer little guidance on practical mitigation techniques. To address this gap, we present Med-HEAL, a framework for systematically identifying, analyzing, and reducing hallucinations in medical LLMs using clinically grounded data. Using the EHRNoteQA benchmark, which is derived from MIMIC-IV discharge summaries, we built a hallucination dataset by evaluating BioMistral-7B on open-ended clinical question-answering tasks. Model outputs were labeled through a dual evaluation pipeline that combines LLM-as-a-Judge assessments from GPT-4o with human audits conducted by medical students. This process, supported by a custom web-based evaluation system, yielded correctness judgments and detailed annotations of reasoning errors. We then used this dataset to explore mitigation strategies, including a self-critique pipeline, in which the model reviews its own answers to identify errors and regenerates responses for flagged instances, and retrieval-augmented in-context learning (RA-ICL), which supplies the model in context with retrieved examples of both hallucinations and their corrections. In experiments across five open-source LLMs (BioMistral, Llama-3.1, DeepSeek, Qwen2.5, and Qwen3), the self-critique approach improved accuracy for three of the five models (p < 0.05) without any parameter updates. Med-HEAL offers a reusable hallucination dataset and a practical framework for investigating and reducing hallucinations in medical LLMs, supporting safer deployment of AI in clinical settings. Our code and data are publicly available at https://github.com/yimingliao-blad/med-heal.git.
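The two mitigation strategies named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The `Generate` callable, the `Exemplar` fields, the prompt wording, the `SUPPORTED`/`FLAGGED` convention, and the word-overlap retriever are all assumptions made for illustration. A real system would call an actual LLM and would likely use a stronger retriever.

```python
import re
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical interface: any function that maps a prompt to generated text.
Generate = Callable[[str], str]


@dataclass
class Exemplar:
    """One hallucination/correction pair (field names are assumptions)."""
    question: str
    hallucinated: str
    corrected: str


def _words(text: str) -> set:
    return set(re.findall(r"\w+", text.lower()))


def retrieve(question: str, pool: List[Exemplar], k: int = 2) -> List[Exemplar]:
    """Stand-in retriever: rank exemplars by Jaccard word overlap."""
    q = _words(question)

    def score(ex: Exemplar) -> float:
        e = _words(ex.question)
        return len(q & e) / len(q | e) if (q | e) else 0.0

    return sorted(pool, key=score, reverse=True)[:k]


def ra_icl_prompt(note: str, question: str, pool: List[Exemplar], k: int = 2) -> str:
    """Build a prompt whose in-context examples pair a wrong answer with its fix."""
    shots = "\n\n".join(
        f"Question: {ex.question}\n"
        f"Hallucinated answer: {ex.hallucinated}\n"
        f"Corrected answer: {ex.corrected}"
        for ex in retrieve(question, pool, k)
    )
    return f"{shots}\n\nNote:\n{note}\n\nQuestion: {question}\nAnswer:"


def self_critique(generate: Generate, note: str, question: str) -> str:
    """Answer, ask the same model to review the answer, regenerate only if flagged."""
    context = f"Note:\n{note}\n\nQuestion: {question}\n"
    answer = generate(context + "Answer:")
    critique = generate(
        context
        + f"Proposed answer: {answer}\n"
        + "Reply SUPPORTED if the note supports this answer, otherwise FLAGGED plus the error."
    )
    if not critique.strip().upper().startswith("FLAGGED"):
        return answer  # no error flagged: keep the first answer
    return generate(
        context
        + f"Previous answer: {answer}\nCritique: {critique}\nRevised answer:"
    )
```

Both functions work purely through prompting, so neither requires parameter updates. `self_critique` makes a third model call only when the review step flags the first answer.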

Source: arXiv