OmniHuman: A Large-scale Dataset and Benchmark for Human-Centric Video Generation
Abstract: Although recent audio-video joint generation models have significantly enhanced content creation capabilities, producing high-fidelity human-centric videos in complex, real-world physical environments remains a substantial challenge. We attribute this difficulty to three structural shortcomings in current datasets: a lack of diversity in global scenes and camera angles, inadequate modeling of interactions (including person-person and person-object dynamics), and poor alignment of individual attributes. To address these issues, we introduce OmniHuman, a comprehensive, multi-scene dataset tailored for detailed human modeling. The dataset features hierarchical annotations at three levels: video-level scenes, frame-level interactions, and individual-level characteristics. To construct it, we built a fully automated pipeline for acquiring high-quality data and performing multi-modal annotation. Alongside the dataset, we release the OmniHuman Benchmark (OHBench), a three-tier evaluation framework designed to provide a scientific assessment of human-centric audio-video synthesis. Notably, OHBench incorporates metrics that align closely with human perception, addressing deficiencies in current benchmarks by providing a holistic diagnosis across global scenes, relational interactions, and individual attributes.
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC




