VCIFBench: Evaluating Complex Instruction Following for Video Understanding
Title: VCIFBench: Assessing Complex Instruction Adherence in Video Analysis
Abstract: Multimodal large language models (MLLMs) have advanced significantly in video comprehension, but current evaluation frameworks rely mostly on simple prompts and provide little evidence of whether a model can meet specific output requirements. To address this gap, we present VCIFBench, a new benchmark for complex instruction following in video understanding. VCIFBench builds constraint-rich instructions from both adapted benchmark prompts and prompts grounded directly in video content, covering content, format, style, and structure constraints. Model responses are assessed with a hybrid verification approach. The benchmark comprises 306 satisfiable test instructions, a preference dataset of 540 pairs for Direct Preference Optimization (DPO), and a 30-item diagnostic subset aimed at identifying conflicts. Experiments with 10 MLLMs show that jointly satisfying multiple constraints remains difficult. We further show that DPO training on VCIFBench data improves instruction-following performance.
Source: arXiv
Generated at: 2026-06-04 00:00:00 UTC


