When Seeing Is Not Believing -- A Benchmark for Search-Grounded Video Misinformation Detection
Title: Visual Evidence Is Not Enough: A New Benchmark for Detecting Search-Grounded Video Misinformation
Video-based disinformation has evolved beyond simple fabrication and now operates at the level of semantics and evidentiary context. Authentic clips are increasingly manipulated through selective editing, temporal reordering, cross-source splicing, or the integration of AI-generated elements to fabricate misleading narratives. Because these manipulations rely on missing, reordered, or recontextualized evidence that exists outside the video file itself, such content cannot be verified by analyzing the input video in isolation.
To address this challenge, we present EVID-Bench, a benchmark designed for search-grounded video misinformation detection. In this framework, systems are required to scour the open web for related footage and identify inaccuracies by comparing multiple video sources. The benchmark includes 222 videos that exhibit nine distinct types of manipulation across three primary categories: AI generation, single-source editing, and multi-source editing. Crucially, every sample has been confirmed to be undetectable by current frontier models when relying solely on visual inspection.
We tested nine leading multimodal models using a retrieval-augmented verification baseline. The results indicate significant limitations: the top-performing system attained only 61.43% accuracy at the point level and 43.24% at the video level. Manipulations involving AI generation proved particularly difficult to detect. Our error analysis highlights persistent issues: models focus on irrelevant cues, incorrectly attribute synthetic artifacts to editorial cuts, and stop searching before they have fully explained the nature of the manipulation.
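The abstract reports accuracy at two granularities without defining them here. The sketch below shows one plausible reading, offered only as an assumption: point-level accuracy counts each annotated manipulation point separately, while video-level accuracy credits a video only when every one of its points is judged correctly. The `VideoResult` type, its fields, and the sample data are hypothetical and are not taken from EVID-Bench.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class VideoResult:
    """Hypothetical per-video record; not the benchmark's actual schema."""
    video_id: str
    point_correct: List[bool]  # one flag per annotated manipulation point


def point_level_accuracy(results: List[VideoResult]) -> float:
    """Fraction of individual points judged correctly, pooled over all videos."""
    total = sum(len(r.point_correct) for r in results)
    correct = sum(sum(r.point_correct) for r in results)
    return correct / total if total else 0.0


def video_level_accuracy(results: List[VideoResult]) -> float:
    """Fraction of videos whose points are ALL judged correctly (assumed rule)."""
    if not results:
        return 0.0
    solved = sum(1 for r in results if r.point_correct and all(r.point_correct))
    return solved / len(results)


# Illustrative data only: three videos with 2, 3, and 1 annotated points.
results = [
    VideoResult("a", [True, True]),
    VideoResult("b", [True, False, True]),
    VideoResult("c", [False]),
]

point_acc = point_level_accuracy(results)  # 4 of 6 points correct
video_acc = video_level_accuracy(results)  # 1 of 3 videos fully correct
print(f"point-level: {point_acc:.4f}, video-level: {video_acc:.4f}")
```

Under this assumed definition, video-level accuracy can never exceed the share of videos with at least one correct point, which fits the reported gap between the two figures. If the video-level score is computed over all 222 videos, 43.24% would correspond to 96 videos.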
Source: arXiv
Generated at: 2026-06-04 00:00:00 UTC



