arXiv

SVHalluc: Benchmarking Speech-Vision Hallucination in Audio-Visual Large Language Models

June 3, 2026 · Chenshuang Zhang, Kyeong Seon Kim, Chengxin Liu, Tae-Hyun Oh

Abstract

Although audio-visual large language models (LLMs) have achieved significant success, they are prone to hallucination: generating outputs that sound plausible but are not grounded in the input. Existing evaluations largely rely on environmental sounds, such as a dog barking, to detect events, whereas human speech carries distinct semantic depth and temporal complexity. It remains unclear whether existing models can accurately align spoken content with the corresponding visual cues. This study shows that speech content can trigger hallucinations in audio-visual LLMs. To investigate this issue systematically, we present SVHalluc, the first comprehensive benchmark designed to assess speech-vision hallucination in these models. SVHalluc evaluates these hallucinations along two essential and complementary dimensions: semantics and temporal alignment. Our experiments indicate that leading open-source audio-visual LLMs struggle to align speech with visual signals, often performing at near-random accuracy across various tasks. In contrast, Gemini 2.5 Pro substantially outperforms these open-source models. Our analysis attributes these shortcomings to limited cross-modal understanding, even though the models exhibit robust single-modality perception. This work identifies a fundamental limitation of current audio-visual LLMs and underscores the need for speech-grounded video understanding.

Project page: https://chenshuang-zhang.github.io/projects/svhalluc/

Source: arXiv · Generated at: 2026-06-03 00:00:00 UTC
