Inference-Time Scaling for Joint Audio-Video Generation
Abstract:
Joint audio-video generation aims to produce realistic audio-video pairs that are semantically consistent with the text prompt and accurately synchronized. Current models typically require extensive training resources to improve fidelity, whereas Inference-Time Scaling (ITS) has recently emerged as a compelling training-free alternative in single-modality domains. Adapting ITS to multimodal settings is difficult, however, because it requires balancing multiple heterogeneous objectives. This study presents the first thorough examination of ITS applied to joint audio-video generation.
We first show that a multi-verifier framework is crucial for overcoming the shortcomings of single-objective guidance, such as asymmetric performance trade-offs and verifier hacking. Through rigorous analysis, we identify a multi-verifier configuration that delivers balanced improvements across all quality metrics. To aggregate the heterogeneous reward signals effectively, we further introduce Adaptive Reward Weighting (ARW), a new test-time optimization method. ARW frames reward aggregation as an online optimization problem and uses learnable parameters to adjust reward variances, without requiring prior knowledge of the reward distributions. This enables robust selection across multiple objectives.
Our experiments on the VGGSound and JavisBench-mini benchmarks reveal that the proposed framework markedly improves the semantic alignment, perceptual quality, and audio-visual synchronization of the generated content. Code and synthesized samples can be accessed via the project page: https://jung-jaemin.github.io/ITS-AVGen-Proj.
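To make the aggregation problem concrete, the following is a minimal sketch of best-of-N selection with several verifiers whose scores live on different scales. It is not the ARW method from the abstract, whose learnable parameters and update rule are not described here; as a plainly simpler stand-in, it standardizes each verifier's scores with running statistics (Welford's algorithm) before summing. The verifier names, the candidate fields, and all scores are hypothetical.

```python
import math
from typing import Callable, Dict, List

Candidate = Dict[str, float]          # hypothetical: precomputed scores per sample
Verifier = Callable[[Candidate], float]


class RunningStandardizer:
    """Running mean/variance of one verifier's scores (Welford's algorithm)."""

    def __init__(self) -> None:
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the running mean

    def update(self, x: float) -> None:
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    def standardize(self, x: float) -> float:
        # With fewer than two observations, or zero spread, the score
        # carries no ranking information, so it contributes 0.
        if self.count < 2:
            return 0.0
        std = math.sqrt(self.m2 / (self.count - 1))
        return (x - self.mean) / std if std > 0 else 0.0


def naive_best(candidates: List[Candidate], verifiers: Dict[str, Verifier]) -> int:
    """Index of the candidate with the largest raw score sum."""
    totals = [sum(fn(c) for fn in verifiers.values()) for c in candidates]
    return max(range(len(candidates)), key=totals.__getitem__)


def select_best(
    candidates: List[Candidate],
    verifiers: Dict[str, Verifier],
    stats: Dict[str, RunningStandardizer],
) -> int:
    """Index of the candidate with the largest sum of standardized scores."""
    raw = {name: [fn(c) for c in candidates] for name, fn in verifiers.items()}
    # Update the running statistics online with the newly observed scores.
    for name, scores in raw.items():
        for s in scores:
            stats[name].update(s)
    totals = [
        sum(stats[name].standardize(raw[name][i]) for name in verifiers)
        for i in range(len(candidates))
    ]
    return max(range(len(candidates)), key=totals.__getitem__)


# Hypothetical verifiers: stand-ins that read made-up, precomputed scores.
VERIFIERS: Dict[str, Verifier] = {
    "semantic": lambda c: c["semantic"],  # assumed scale: 0 to 100
    "sync": lambda c: c["sync"],          # assumed scale: 0 to 1
    "quality": lambda c: c["quality"],    # assumed scale: 0 to 1
}

CANDIDATES: List[Candidate] = [
    {"semantic": 90.0, "sync": 0.10, "quality": 0.5},
    {"semantic": 85.0, "sync": 0.90, "quality": 0.5},
    {"semantic": 60.0, "sync": 0.50, "quality": 0.5},
]

if __name__ == "__main__":
    stats = {name: RunningStandardizer() for name in VERIFIERS}
    print("raw sum picks candidate", naive_best(CANDIDATES, VERIFIERS))
    print("standardized sum picks candidate", select_best(CANDIDATES, VERIFIERS, stats))
```

With these made-up scores, the raw sum selects candidate 0 because the large-scale semantic score dominates, while the standardized sum selects candidate 1, which is nearly as well aligned and far better synchronized. This illustrates why reward scales and variances have to be handled before heterogeneous verifiers can be combined.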
Source: arXiv
Generated at: 2026-06-03 00:00:00 UTC





