arXiv

Grounding or Guessing? Visual Signals for Detecting Hallucinations in Sign Language Translation

Original (arXiv:2510.18439v3): Hallucination, where models generate fluent text unsupported by visual evidence, remains a major flaw in vision-language models and is particularly critical in sign language translation (SLT). In SLT, meaning depends on precise grounding in video, and gloss-free models are especially vulnerable because they map continuous signer movements directly into natural language without intermediate gloss supervision that serves as alignment. We argue that hallucinations arise when models rely on language priors rather than visual input. To capture this, we propose a token-level reliability measure that quantifies how much the decoder uses visual information. Our method combines feature-based sensitivity, which measures internal changes when video is masked, with counterfactual signals, which capture probability differences between clean and altered video inputs. These signals are aggregated into a sentence-level reliability score, providing a compact and interpretable measure of visual grounding. We evaluate the proposed measure on two SLT benchmarks (PHOENIX-2014T and CSL-Daily) with both gloss-based and gloss-free models. Our results show that reliability predicts hallucination rates, generalizes across datasets and architectures, and decreases under visual degradations. Beyond these quantitative trends, we also find that reliability distinguishes grounded tokens from guessed ones, allowing risk estimation without references; when combined with text-based signals (confidence, perplexity, or entropy), it further improves hallucination risk estimation. Qualitative analysis highlights why gloss-free models are more susceptible to hallucinations. Taken together, our findings establish reliability as a practical and reusable tool for diagnosing hallucinations in SLT, and lay the groundwork for more robust hallucination detection in multimodal generation.

Rewrite: Hallucinations, in which a model produces fluent text that the visual input does not support, are a significant defect in vision-language systems and are especially serious in sign language translation (SLT). In SLT, meaning depends on accurate grounding in the video. Gloss-free models are particularly vulnerable because they translate continuous signer motion directly into natural language, without the intermediate gloss supervision that otherwise provides alignment. We posit that hallucinations occur when a model relies on linguistic priors instead of the visual content. To capture this, we introduce a token-level reliability metric that quantifies how much the decoder depends on visual information. The metric combines feature-based sensitivity, which tracks changes in the model's internal features when the video is masked, with counterfactual signals, which measure probability differences between clean and altered video inputs. Aggregating these signals gives a sentence-level reliability score, a concise and interpretable measure of visual grounding. We evaluated the metric on two SLT benchmarks, PHOENIX-2014T and CSL-Daily, with both gloss-based and gloss-free models. Reliability predicts hallucination rates, generalizes across datasets and architectures, and declines when the video is degraded. It also separates grounded tokens from guessed ones, which enables risk estimation without reference texts, and combining it with text-based signals (confidence, perplexity, or entropy) further improves hallucination risk estimation. Our qualitative analysis shows why gloss-free models are more susceptible to hallucinations. Overall, the work positions reliability as a practical, reusable tool for diagnosing hallucinations in SLT and lays the groundwork for more robust hallucination detection in multimodal generation.
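
To make the idea of the score concrete, here is a minimal sketch of how two per-token signals could be combined and then aggregated per sentence. The abstract does not give the formulas, so every specific here is an assumption for illustration only: cosine distance as the feature-based sensitivity, a clamped probability difference as the counterfactual signal, the mixing weight `alpha`, mean aggregation, and all function names.

```python
import math
from typing import Sequence


def feature_sensitivity(clean: Sequence[float], masked: Sequence[float]) -> float:
    """Assumed sensitivity: 1 - cosine similarity between a token's decoder
    features computed with the clean video and with the video masked.
    Clamped to [0, 1]; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(clean, masked))
    norm = math.sqrt(sum(a * a for a in clean)) * math.sqrt(sum(b * b for b in masked))
    if norm == 0.0:
        return 0.0
    return min(1.0, max(0.0, 1.0 - dot / norm))


def counterfactual_drop(p_clean: float, p_altered: float) -> float:
    """Assumed counterfactual signal: how far the token's probability falls
    when the video is altered. Increases in probability count as 0."""
    return max(0.0, p_clean - p_altered)


def token_reliability(sensitivity: float, drop: float, alpha: float = 0.5) -> float:
    """Assumed combination: convex mix of the two signals (alpha is hypothetical)."""
    return alpha * sensitivity + (1.0 - alpha) * drop


def sentence_reliability(token_scores: Sequence[float]) -> float:
    """Assumed aggregation: plain mean over the token scores."""
    if not token_scores:
        raise ValueError("need at least one token score")
    return sum(token_scores) / len(token_scores)


# Toy example with made-up numbers, not results from the paper.
# Token 1 reacts strongly to the video being removed (looks grounded).
grounded = token_reliability(
    feature_sensitivity([1.0, 0.0], [0.0, 1.0]),
    counterfactual_drop(0.9, 0.1),
)
# Token 2 is unchanged without the video (looks guessed from language priors).
guessed = token_reliability(
    feature_sensitivity([1.0, 0.0], [1.0, 0.0]),
    counterfactual_drop(0.8, 0.8),
)
score = sentence_reliability([grounded, guessed])  # roughly 0.45
```

In this toy setup a low sentence score would flag a translation whose tokens barely react to the video. The actual signals, weighting, and aggregation used in the paper may differ.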


Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC
