arXiv

Knowledge-Intensive Video Generation

June 2, 2026 · Chenxu Wang, Mingda Chen

Abstract

While text-to-video synthesis has improved rapidly in visual fidelity, its factual accuracy and practical utility remain insufficiently assessed. To address this gap, we propose Knowledge-Intensive Video Generation (KIVI), a framework in which models generate videos from concise, information-seeking prompts designed to elicit explanations, step-by-step procedures, or demonstrations. We introduce KIVI-Bench, a benchmark of 1,080 prompts, together with new automatic metrics for factuality and helpfulness. Our human evaluation studies show that these metrics correlate significantly more strongly with human judgments than existing alternatives. Experiments with seven state-of-the-art video generation models further show that current systems still fall short of human capabilities, particularly in visual attributes, procedural actions, and the clarity of information delivery. These findings position KIVI as a demanding yet promising direction for building video generation tools that are both factually reliable and instructionally valuable.