T2AV-Compass: Towards Unified Evaluation for Text-to-Audio-Video Generation
Abstract:
Text-to-Audio-Video (T2AV) generation aims to produce temporally coherent video with semantically aligned audio from natural language input. However, evaluation of these systems remains fragmented: it typically depends on isolated unimodal metrics or narrow benchmarks that overlook cross-modal alignment, instruction following, and perceptual fidelity under complex prompts. To bridge this gap, we introduce T2AV-Compass, a comprehensive benchmark for the holistic evaluation of T2AV models. The benchmark comprises 500 varied and complex prompts, constructed through a taxonomy-driven pipeline to ensure both semantic depth and physical plausibility.
T2AV-Compass employs a dual-tier evaluation framework that combines objective signal-level metrics (covering video quality, audio quality, and cross-modal synchronization) with a subjective "MLLM-as-a-Judge" protocol for assessing instruction following and overall realism. Our evaluation of 11 prominent T2AV systems shows that even the most advanced models fall well short of human-level realism and cross-modal consistency, with persistent failures in audio authenticity, fine-grained synchronization, and prompt adherence. These findings indicate substantial room for improvement and position T2AV-Compass as a rigorous diagnostic platform for advancing text-to-audio-video generation.
Source: arXiv
Generated at: 2026-06-03 00:00:00 UTC





