arXiv

Grounding or Guessing? Visual Signals for Detecting Hallucinations in Sign Language Translation

June 2, 2026 · Yasser Hamidullah, Koel Dutta Chowdhury, Yusser Al Ghussin, Shakib Yazdani, Cennet Oguz, Josef van Genabith, Cristina España-Bonet

Original: arXiv:2510.18439v3. Abstract: Hallucination, where models generate fluent text unsupported by visual evidence, remains a major flaw in vision-language models and is particularly critical in sign language translation (SLT). In SLT, meaning depends on precise grounding in video, and gloss-free models are especially vulnerable because they map continuous signer movements directly into natural language without intermediate gloss supervision that serves as alignment. We argue that hallucinations arise when models rely on language priors rather than visual input. To capture this, we propose a token-level reliability measure that quantifies how much the decoder uses visual information. Our method combines feature-based sensitivity, which measures internal changes when video is masked, with counterfactual signals, which capture probability differences between clean and altered video inputs. These signals are aggregated into a sentence-level reliability score, providing a compact and interpretable measure of visual grounding. We evaluate the proposed measure on two SLT benchmarks (PHOENIX-2014T and CSL-Daily) with both gloss-based and gloss-free models. Our results show that reliability predicts hallucination rates, generalizes across datasets and architectures, and decreases under visual degradations. Beyond these quantitative trends, we also find that reliability distinguishes grounded tokens from guessed ones, allowing risk estimation without references; when combined with text-based signals (confidence, perplexity, or entropy), it further improves hallucination risk estimation. Qualitative analysis highlights why gloss-free models are more susceptible to hallucinations. Taken together, our findings establish reliability as a practical and reusable tool for diagnosing hallucinations in SLT, and lay the groundwork for more robust hallucination detection in multimodal generation.

Rewrite: Hallucination, in which a model produces fluent text that the visual evidence does not support, is a significant defect in vision-language models and is especially serious in sign language translation (SLT). Meaning in SLT depends on precise grounding in the video, and gloss-free models are particularly vulnerable: they map continuous signer movements directly into natural language, without the intermediate gloss supervision that otherwise serves as alignment. We argue that hallucinations arise when a model relies on language priors rather than on the visual input. To capture this, we introduce a token-level reliability measure that quantifies how much the decoder uses visual information. It combines two signals: feature-based sensitivity, which measures how the model's internal features change when the video is masked, and counterfactual signals, which capture the difference in output probabilities between clean and altered video inputs. Aggregating these token-level signals yields a sentence-level reliability score, a compact and interpretable measure of visual grounding. We evaluate the measure on two SLT benchmarks, PHOENIX-2014T and CSL-Daily, with both gloss-based and gloss-free models. Reliability predicts hallucination rates, generalizes across datasets and architectures, and decreases under visual degradation. It also separates grounded tokens from guessed ones, which allows risk estimation without reference translations, and combining it with text-based signals (confidence, perplexity, or entropy) further improves hallucination risk estimation. Qualitative analysis shows why gloss-free models are more susceptible to hallucination. Overall, our findings establish reliability as a practical, reusable tool for diagnosing hallucinations in SLT and lay the groundwork for more robust hallucination detection in multimodal generation.
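
The abstract names the ingredients of the reliability measure but gives no formulas, so the following is only a minimal sketch of how such a score could be assembled. The cosine distance used for feature sensitivity, the exponential squashing of the log-probability drop, the mixing weight `alpha`, and the mean pooling over tokens are all assumptions made for illustration, not the authors' definitions. The function names are hypothetical, and the inputs (per-token decoder features and log-probabilities under clean and masked or altered video) are assumed to have been extracted from an SLT model beforehand.

```python
import math
from typing import List, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity, or 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return 1.0 - dot / (norm_a * norm_b)


def token_reliability(
    h_clean: Sequence[float],
    h_masked: Sequence[float],
    logp_clean: float,
    logp_altered: float,
    alpha: float = 0.5,
) -> float:
    """Combine two visual-grounding signals for one token into a [0, 1] score.

    ASSUMPTION: both signals and their combination are illustrative choices.
    """
    # Feature-based sensitivity: how far the decoder feature moves when the
    # video is masked. Cosine distance lies in [0, 2]; halve and clamp to [0, 1].
    sensitivity = min(1.0, max(0.0, cosine_distance(h_clean, h_masked) / 2.0))

    # Counterfactual signal: how much the token's log-probability drops when
    # the video is altered. A token that stays just as likely (or becomes more
    # likely) without the real video contributes zero.
    drop = max(0.0, logp_clean - logp_altered)
    counterfactual = 1.0 - math.exp(-drop)

    return alpha * sensitivity + (1.0 - alpha) * counterfactual


def sentence_reliability(token_scores: List[float]) -> float:
    """Aggregate token scores by their mean (ASSUMPTION: simple mean pooling)."""
    if not token_scores:
        return 0.0
    return sum(token_scores) / len(token_scores)


if __name__ == "__main__":
    # Toy values, not real model outputs.
    grounded = token_reliability([1.0, 0.0], [-1.0, 0.0], -0.1, -5.0)
    guessed = token_reliability([1.0, 0.0], [1.0, 0.0], -0.2, -0.2)
    print(grounded, guessed, sentence_reliability([grounded, guessed]))
```

In this sketch a token whose features and probability are unaffected by masking the video scores 0, which matches the abstract's notion of a token guessed from language priors, while a token that depends strongly on the video scores close to 1.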

Source: arXiv Generated at: 2026-06-02 00:00:00 UTC