arXiv

T2AV-Compass: Towards Unified Evaluation for Text-to-Audio-Video Generation

June 3, 2026 · Zhe Cao, Tao Wang, Jiaming Wang, Yanghai Wang, Yuanxing Zhang, Jialu Chen, Miao Deng, Jiahao Wang, Yubin Guo, Chenxi Liao, Yize Zhang, Zhaoxiang Zhang, Jiaheng Liu

Abstract:

The field of Text-to-Audio-Video (T2AV) generation seeks to produce videos that are temporally coherent and accompanied by audio that is semantically aligned, all derived from natural language inputs. However, assessing these systems remains a fragmented endeavor, typically depending on isolated unimodal metrics or limited benchmarks that overlook critical aspects such as cross-modal alignment, adherence to instructions, and perceptual fidelity when handling intricate prompts. To bridge this gap, we introduce T2AV-Compass, a comprehensive benchmark designed for the holistic evaluation of T2AV models. This benchmark comprises 500 varied and complex prompts, developed through a taxonomy-driven pipeline to guarantee both semantic depth and physical plausibility.

T2AV-Compass employs a dual-tier evaluation framework. This approach combines objective signal-level metrics—covering video quality, audio quality, and cross-modal synchronization—with a subjective "MLLM-as-a-Judge" protocol to assess instruction following and overall realism. Our extensive testing of 11 prominent T2AV systems demonstrates that even the most advanced models significantly lag behind human-level standards in terms of realism and cross-modal consistency. Persistent issues were observed in areas such as audio authenticity, fine-grained synchronization, and prompt adherence. These findings underscore the substantial potential for future improvements and position T2AV-Compass as a rigorous diagnostic platform essential for advancing the state of text-to-audio-video generation.
