arXiv

SVHalluc: Benchmarking Speech-Vision Hallucination in Audio-Visual Large Language Models

Abstract

Although audio-visual large language models (LLMs) have achieved significant success, they are prone to generating outputs that sound plausible but lack grounding, a phenomenon known as hallucination. Existing evaluations rely largely on environmental sounds, such as a barking dog, to detect events, whereas human speech carries distinct semantic depth and temporal complexity. It remains unclear whether existing models can accurately align spoken content with the corresponding visual cues. This study shows that speech content can trigger hallucinations in audio-visual LLMs. To investigate this issue systematically, we present SVHalluc, the first comprehensive benchmark for assessing speech-vision hallucination in these models. SVHalluc evaluates hallucination along two complementary dimensions: semantics and temporal alignment. Our experiments indicate that leading open-source audio-visual LLMs struggle to align speech with visual signals, often performing at near-random accuracy across tasks. In contrast, Gemini 2.5 Pro substantially outperforms these open-source alternatives. Our analysis attributes these shortcomings to limited cross-modality understanding, even though the models exhibit robust single-modality perception. This work identifies a fundamental limitation of current audio-visual LLMs and underscores the need for video comprehension grounded in speech.

Project page: https://chenshuang-zhang.github.io/projects/svhalluc/


Source: arXiv Generated at: 2026-06-03 00:00:00 UTC
