VidMsg: A Benchmark for Implicit Message Inference in Short Videos
Abstract:
Interpreting short online videos demands more than merely recognizing visible objects and actions; creators frequently embed an underlying intent or purpose within their clips. To address this, we present VidMsg, a new benchmark designed to assess the capability of systems to comprehend implicit messages in short, internet-native video content. The dataset comprises 400 clips sourced from YouTube, spanning nine practical topic areas and 52 distinct, fine-grained target messages. These domains include career and finance, education, health and well-being, culture, safety, sustainability, and lifestyle.
VidMsg was developed using a message-first construction pipeline. Initially, a Large Language Model (LLM) converts target messages into indirect search scenarios to retrieve candidate clips. Human annotators subsequently filter these results, keeping only those that convey the intended message without being overly explicit. The benchmark is primarily geared toward bidirectional message-clip retrieval, supporting scalable applications like video search and recommendation systems that require holistic video understanding.
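As a rough illustration of what bidirectional message-clip retrieval involves, the sketch below computes Recall@K in both directions from a message-by-clip similarity matrix. It is a minimal example under stated assumptions, not the benchmark's official evaluation code: the function names, the toy similarity values, and the choice of Recall@K as the metric are all hypothetical and are not taken from the abstract.

```python
from typing import List, Sequence


def top_k_indices(scores: Sequence[float], k: int) -> List[int]:
    """Return the indices of the k largest scores (ties go to the lower index)."""
    order = sorted(range(len(scores)), key=lambda idx: (-scores[idx], idx))
    return order[:k]


def message_to_clip_recall(sim: List[List[float]], clip_labels: List[int], k: int) -> float:
    """Fraction of messages with at least one correctly labeled clip in the top k."""
    hits = 0
    for msg_idx, row in enumerate(sim):
        retrieved = top_k_indices(row, k)
        if any(clip_labels[clip_idx] == msg_idx for clip_idx in retrieved):
            hits += 1
    return hits / len(sim)


def clip_to_message_recall(sim: List[List[float]], clip_labels: List[int], k: int) -> float:
    """Fraction of clips whose target message appears among the top-k messages."""
    hits = 0
    for clip_idx, label in enumerate(clip_labels):
        column = [row[clip_idx] for row in sim]  # scores of every message for this clip
        if label in top_k_indices(column, k):
            hits += 1
    return hits / len(clip_labels)


# Toy data (invented): 2 messages, 3 clips; sim[i][j] scores message i against clip j.
sim = [
    [0.9, 0.5, 0.4],
    [0.1, 0.3, 0.8],
]
clip_labels = [0, 1, 1]  # index of the target message for each clip

m2c_at_1 = message_to_clip_recall(sim, clip_labels, k=1)
c2m_at_1 = clip_to_message_recall(sim, clip_labels, k=1)
```

In this toy case every message retrieves a correct clip first, while the second clip is matched to the wrong message, which shows why the two directions are scored separately.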
Beyond retrieval tasks, VidMsg features a diagnostic multiple-choice question-answering (QA) benchmark. In this setup, models must identify the intended message of a clip by selecting it from a set of semantically related distractors. Evaluations of contemporary video-language and retrieval models reveal that even high-performing systems often struggle with VidMsg. This difficulty arises because the task necessitates pragmatic inference, the integration of contextual cues, and the ability to discriminate between semantically similar messages. Furthermore, we introduce VidVec-Msg, a baseline approach that enhances message-oriented retrieval, though significant potential for future improvement remains.
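The multiple-choice setup can be sketched as follows: a model assigns a score to each candidate message, the highest-scoring option is taken as the prediction, and accuracy is the fraction of clips answered correctly. This is only an illustrative sketch. The `MCQItem` structure, the example messages, and the scores are invented for the example, and the abstract does not specify how distractors are selected or how models are scored.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MCQItem:
    """One hypothetical QA item: candidate messages, the correct one, and model scores."""
    options: List[str]
    answer_index: int
    option_scores: List[float]  # one model score per option, higher means more likely


def predict(item: MCQItem) -> int:
    """Pick the option with the highest score (the first one wins on ties)."""
    return max(range(len(item.option_scores)), key=lambda i: item.option_scores[i])


def accuracy(items: List[MCQItem]) -> float:
    """Fraction of items where the predicted option is the intended message."""
    correct = sum(1 for item in items if predict(item) == item.answer_index)
    return correct / len(items)


# Invented options: the distractors are deliberately close in meaning to the answer.
options = [
    "Start saving for retirement early",
    "Avoid high-interest debt",
    "Track your monthly spending",
]
items = [
    MCQItem(options, answer_index=0, option_scores=[0.7, 0.2, 0.1]),
    MCQItem(options, answer_index=2, option_scores=[0.3, 0.4, 0.3]),
    MCQItem(options, answer_index=2, option_scores=[0.1, 0.1, 0.8]),
]

qa_accuracy = accuracy(items)
```

The second item illustrates the failure mode the diagnostic targets: a semantically related distractor narrowly outscores the intended message.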
Source: arXiv
Generated at: 2026-06-03 00:00:00 UTC



