arXiv

Baton: Explicit Semantic Blueprints for Joint Video-Audio Generation

June 2, 2026 · Shuyuan Tu, Qi Tian, Zihan Yang, Yue Wu, Xintong Han, Weijie Kong, Jiangfeng Xiong, Jian-Wei Zhang, Zhao Zhong, Liefeng Bo, Zuxuan Wu, Yu-Gang Jiang

Abstract:

Current open-source diffusion models face significant challenges in producing stable, synchronized audio-visual content, especially in contexts that require intricate semantic reasoning. This limitation stems from the reliance of existing approaches on coarse text embeddings derived from pre-trained encoders to steer the denoising process for both audio and video. Such methods fail to preserve fine-grained semantic nuances and, more critically, lack a unified long-term strategy. Consequently, the denoising paths for audio and video remain uncoordinated, resulting in weak cross-modal alignment.

To address these issues, we introduce Baton, a novel framework that incorporates explicit semantic planning into the joint generation of video and audio. Our core hypothesis is that augmenting basic text guidance with semantically dense, modality-specific planned tokens, jointly reasoned and aligned prior to denoising, can simultaneously recover detailed semantic information and create a shared blueprint that synchronizes the denoising trajectories of both modalities.

Specifically, Baton employs the VA-Planner, a multimodal language model featuring dual semantic alignment towers. Within this structure, learnable queries cross-attend to features from both video and audio streams, generating a pair of semantically aligned planned tokens. These tokens serve as keyframe-level blueprints. To integrate this guidance, the planned tokens are injected into the diffusion backbone through cross-attention layers, offering temporally grounded support that complements coarse text embeddings.
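The query-based planning step described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify the VA-Planner architecture, so the single shared query set, the per-tower projection matrices, the dimensions, and the names `cross_attend`, `make_tower`, `video_plan`, and `audio_plan` are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries, features, w_q, w_k, w_v):
    # Single-head cross-attention: queries read from one modality's features.
    q = queries @ w_q
    k = features @ w_k
    v = features @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

rng = np.random.default_rng(0)
dim, num_queries = 16, 4

# Stand-ins for encoder features (hypothetical sizes: 32 video, 50 audio tokens).
video_feats = rng.normal(size=(32, dim))
audio_feats = rng.normal(size=(50, dim))

# Assumption: one query set shared by both towers; in practice it would be learned.
queries = rng.normal(size=(num_queries, dim))

def make_tower():
    # Assumption: each alignment tower owns its own Q/K/V projections.
    return [rng.normal(size=(dim, dim)) / np.sqrt(dim) for _ in range(3)]

video_tower, audio_tower = make_tower(), make_tower()

# One planned-token set per modality, produced from the same queries.
video_plan = cross_attend(queries, video_feats, *video_tower)
audio_plan = cross_attend(queries, audio_feats, *audio_tower)
```

Under these assumptions, both outputs have shape `(num_queries, dim)` regardless of how many video or audio tokens were read, which is what lets a fixed-size plan be passed to the diffusion backbone as extra cross-attention context.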

Because the planned tokens do not possess a direct one-to-one spatial-temporal correspondence with diffusion latents, we introduce Relative Semantic RoPE. This relative positional encoding maps both planned tokens and latents into a common spatial-temporal coordinate system. This mechanism allows each latent to precisely attend to its corresponding semantic cues based on position. Our experiments on standard benchmarks demonstrate the qualitative and quantitative effectiveness of the Baton framework.
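The idea of placing planned tokens and latents in one coordinate system can be sketched with standard rotary position embeddings. This is only an illustration under stated assumptions: the abstract does not give the formulation of Relative Semantic RoPE, so the example uses ordinary 1D RoPE on the temporal axis only, and the helper names (`rope_rotate`, `shared_positions`, `attention_scores`), the timeline length `span`, and the token counts are hypothetical.

```python
import numpy as np

def rope_rotate(x, positions, base=10000.0):
    # Standard rotary embedding: rotate paired halves of x by position-dependent angles.
    half = x.shape[1] // 2
    freqs = base ** (-np.arange(half) / half)
    angles = positions[:, None] * freqs[None, :]
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=1)

def shared_positions(num_tokens, span=64.0):
    # Assumption: spread token centers evenly over a common timeline [0, span),
    # so sequences of different lengths share one coordinate system.
    return (np.arange(num_tokens) + 0.5) * span / num_tokens

def attention_scores(latent_q, plan_k, offset=0.0):
    # Latent queries and planned-token keys are rotated using shared coordinates.
    q_pos = shared_positions(len(latent_q)) + offset
    k_pos = shared_positions(len(plan_k)) + offset
    return rope_rotate(latent_q, q_pos) @ rope_rotate(plan_k, k_pos).T

rng = np.random.default_rng(0)
latent_q = rng.normal(size=(16, 8))  # 16 latent frames (hypothetical)
plan_k = rng.normal(size=(4, 8))     # 4 keyframe-level planned tokens (hypothetical)

scores = attention_scores(latent_q, plan_k)
shifted = attention_scores(latent_q, plan_k, offset=5.0)
```

Because rotary embeddings make the query-key product depend only on the difference between positions, `scores` and `shifted` agree up to floating-point error. In this sketch, each latent frame therefore sees each planned token through their relative distance on the shared timeline, even though there are four times as many latents as planned tokens.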
