SVHalluc: Benchmarking Speech-Vision Hallucination in Audio-Visual Large Language Models
Title: SVHalluc: Evaluating Speech-Vision Hallucination in Audio-Visual Large Language Models
Abstract
Although audio-visual large language models (LLMs) have achieved significant success, they are prone to generating outputs that sound plausible but are not grounded in the input, a phenomenon known as hallucination. Current evaluations rely largely on environmental sounds, such as barking dogs, as event cues, whereas human speech carries distinct semantic depth and temporal complexity. It remains unclear whether existing models can accurately align spoken content with the corresponding visual cues. This study shows that speech content can trigger hallucinations in audio-visual LLMs. To investigate this issue systematically, we present SVHalluc, the first comprehensive benchmark for assessing speech-vision hallucination in these models. SVHalluc evaluates hallucination along two complementary dimensions: semantics and temporal alignment. Our experiments indicate that leading open-source audio-visual LLMs struggle to align speech with visual signals, often performing at near-random accuracy across tasks. By contrast, Gemini 2.5 Pro markedly outperforms these open-source models. Our analysis attributes the shortcomings to limited cross-modal understanding, even though the models exhibit robust single-modality perception. This work identifies a fundamental limitation of current audio-visual LLMs and underscores the need for speech-grounded video understanding.
Project page: https://chenshuang-zhang.github.io/projects/svhalluc/.
Source: arXiv
Generated at: 2026-06-03 00:00:00 UTC



