BenHalluEval: A Multi-Task Hallucination Evaluation Framework for Large Language Models on Bengali
Abstract
Although Bengali ranks as the sixth most widely spoken language globally, there has been no systematic prior research assessing hallucination within large language models (LLMs) for this language. To address this gap, we present BenHalluEval, a granular evaluation framework designed specifically for Bengali. This framework encompasses four distinct tasks: Generative Question Answering (GQA), Bangla-English Code-Mixed QA, Summarization, and Reasoning.
We generate 12,000 hallucinated samples using GPT-5.4, covering twelve task-specific hallucination types and derived from three existing Bengali datasets. We assess seven LLMs, categorized as reasoning-oriented, multilingual, and Bengali-centric, using a dual-track protocol that separately measures the false-positive rate on ground-truth instances (Track A) and the hallucination detection rate on the generated candidates (Track B).
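As a minimal sketch of how the two tracks can be scored separately, the snippet below computes a false-positive rate over ground-truth instances and a detection rate over hallucinated candidates. The `Judgment` record, its field names, and the `track_rates` helper are hypothetical names chosen for illustration; they are not taken from the released code.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class Judgment:
    """One model verdict on one sample (hypothetical record layout)."""
    is_hallucinated_sample: bool   # False: Track A ground truth, True: Track B candidate
    model_says_hallucinated: bool  # the evaluated model's yes/no verdict


def track_rates(judgments: Iterable[Judgment]) -> Tuple[float, float]:
    """Return (Track A false-positive rate, Track B detection rate)."""
    items = list(judgments)
    track_a = [j for j in items if not j.is_hallucinated_sample]
    track_b = [j for j in items if j.is_hallucinated_sample]
    if not track_a or not track_b:
        raise ValueError("both tracks need at least one sample")

    # Track A: fraction of correct answers wrongly flagged as hallucinated.
    false_positive_rate = sum(j.model_says_hallucinated for j in track_a) / len(track_a)
    # Track B: fraction of hallucinated candidates correctly flagged.
    detection_rate = sum(j.model_says_hallucinated for j in track_b) / len(track_b)
    return false_positive_rate, detection_rate


# Toy data: 4 ground-truth samples (1 wrongly flagged), 4 hallucinated samples (3 caught).
example = (
    [Judgment(False, True)] + [Judgment(False, False)] * 3
    + [Judgment(True, True)] * 3 + [Judgment(True, False)]
)
fpr, det = track_rates(example)
```

Keeping the two rates apart, rather than pooling all samples into one accuracy figure, is what lets the protocol distinguish a model that detects hallucinations from one that simply flags everything.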
To penalize both failure modes simultaneously and avoid score inflation caused by uniform response bias, we introduce BenHalluScore. This dual-track calibration metric yields scores ranging from 7.72% to 55.42% across the evaluated models and tasks, exposing significant disparities in hallucination calibration. Chain-of-thought prompting, employed as a mitigation strategy, altered response distributions without consistently improving the models' ability to discriminate hallucinations. BenHalluEval is the first dedicated hallucination benchmark for Bengali, and it underscores the limitations of single-track evaluation and of relying on prompting alone in low-resource language contexts. The associated dataset and code are available at https://anonymous.4open.science/r/BanglaHalluEval-EB77.
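The abstract does not give the formula for BenHalluScore, so the sketch below does not attempt to reproduce it. Instead it uses Youden's J statistic (detection rate minus false-positive rate), a standard stand-in, to illustrate why a dual-track score resists uniform response bias: a model that flags every sample reaches a perfect Track B detection rate yet earns no credit once Track A is taken into account. The function name and the three response profiles are illustrative assumptions.

```python
def youden_j(false_positive_rate: float, detection_rate: float) -> float:
    """Youden's J = detection rate - false-positive rate.

    Used here only as an illustrative stand-in; this is NOT the
    paper's BenHalluScore, whose definition is not stated in the abstract.
    """
    for name, value in (("false_positive_rate", false_positive_rate),
                        ("detection_rate", detection_rate)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {value}")
    return detection_rate - false_positive_rate


# Hypothetical response profiles as (Track A FPR, Track B detection rate).
profiles = {
    "always_flags": (1.0, 1.0),    # says "hallucinated" for every sample
    "never_flags": (0.0, 0.0),     # says "not hallucinated" for every sample
    "discriminates": (0.25, 0.75), # actually separates the two tracks
}

scores = {name: youden_j(fpr, det) for name, (fpr, det) in profiles.items()}
```

Under a Track B-only evaluation, the `always_flags` profile would look perfect; under this combined view, both uniform profiles score zero and only the discriminating profile is rewarded.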
Source: arXiv Generated at: 2026-06-02 00:00:00 UTC




