Do Joint Audio-Video Generation Models Understand Physics?
Abstract: As joint audio-video generation models rapidly achieve professional-grade production quality, a critical inquiry emerges: do these systems truly comprehend audio-visual physics, or do they simply synthesize plausible yet physically inconsistent sounds and frames? To address this, we present AV-Phys Bench, a novel benchmark designed to assess physical commonsense within joint audio-video generation. This benchmark evaluates models across three distinct scene categories—Steady State, Event Transition, and Environment Transition—incorporating physics-grounded subcategories derived from real-world scenarios alongside "Anti-AV-Physics" prompts specifically crafted to elicit physically inconsistent audio-video outputs. Each generated output is assessed across five key dimensions: visual semantic adherence, audio semantic adherence, visual physical commonsense, audio physical commonsense, and cross-modal physical commonsense. Our evaluation of three proprietary and four open-source models reveals that while Seedance 2.0 achieves the highest overall performance, no model demonstrates robust physical understanding. Significant performance declines are observed during event-driven and environment-driven transitions, with even leading proprietary systems failing when confronted with Anti-AV-Physics prompts. Additionally, we introduce AV-Phys Agent, a ReAct-style evaluator that integrates a multimodal language model with deterministic acoustic measurement tools to generate rankings that strongly correlate with human assessments. Our findings highlight cross-modal physical consistency and the dynamics of transition-driven scenes as primary unresolved challenges in the field of joint audio-video generation.
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC
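The abstract says each output is scored on five dimensions and that the AV-Phys Agent's rankings correlate strongly with human assessments, but it does not state the aggregation rule or the correlation statistic. The sketch below is one plausible way to check such agreement, under loudly stated assumptions: scores lie in [0, 1], the overall score is an unweighted mean of the five dimensions, and agreement is measured with Spearman's rank correlation. The function names, the dictionary keys, and the sample numbers are all hypothetical and are not taken from the paper.

```python
from statistics import mean

# The five dimensions named in the abstract; the key spellings are hypothetical.
DIMENSIONS = (
    "visual_semantic_adherence",
    "audio_semantic_adherence",
    "visual_physical_commonsense",
    "audio_physical_commonsense",
    "cross_modal_physical_commonsense",
)


def overall_score(scores):
    """Unweighted mean over the five dimensions (an assumption, not the paper's rule)."""
    return mean(scores[d] for d in DIMENSIONS)


def ranks(values):
    """Return average ranks (1 = smallest); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(x, y):
    """Spearman's rho, computed as the Pearson correlation of the two rank vectors.

    Raises ZeroDivisionError if either input is constant (all ranks tied).
    """
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


if __name__ == "__main__":
    # Made-up overall scores for three placeholder models; not results from the paper.
    agent_scores = [0.62, 0.48, 0.55]
    human_scores = [0.70, 0.41, 0.52]
    print("Spearman rho:", spearman(agent_scores, human_scores))
```

A rank-based statistic is used here because the abstract speaks of rankings rather than raw score agreement; Kendall's tau or Pearson correlation would be equally reasonable substitutes if the paper reports those instead.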





