arXiv

Baton: Explicit Semantic Blueprints for Joint Video-Audio Generation

Abstract:

Current open-source diffusion models face significant challenges in producing stable, synchronized audio-visual content, especially in contexts that require intricate semantic reasoning. This limitation stems from the reliance of existing approaches on coarse text embeddings derived from pre-trained encoders to steer the denoising process for both audio and video. Such methods fail to preserve fine-grained semantic nuances and, more critically, lack a unified long-term strategy. Consequently, the denoising paths for audio and video remain uncoordinated, resulting in weak cross-modal alignment.

To address these issues, we introduce Baton, a novel framework that incorporates explicit semantic planning into the joint generation of video and audio. Our core hypothesis is that augmenting basic text guidance with semantically dense, modality-specific planned tokens, which are jointly reasoned over and aligned prior to denoising, can simultaneously recover detailed semantic information and create a shared blueprint. This blueprint synchronizes the denoising trajectories of both modalities.

Specifically, Baton employs the VA-Planner, a multimodal language model featuring dual semantic alignment towers. Within this structure, learnable queries cross-attend to features from both video and audio streams, generating a pair of semantically aligned planned tokens. These tokens serve as keyframe-level blueprints. To integrate this guidance, the planned tokens are injected into the diffusion backbone through cross-attention layers, offering temporally grounded support that complements coarse text embeddings.
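The query mechanism described above can be illustrated with a minimal sketch. It is not the VA-Planner itself: it assumes single-head attention with no learned projections, random vectors standing in for trained queries and encoder features, and one query set per tower attending over the concatenated video and audio features. All names (cross_attend, video_queries, joint_feats, and so on) and all sizes are hypothetical.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(queries, features):
    """Single-head cross-attention without learned projections (a simplification).

    Each query scores every feature vector by scaled dot product and returns
    the softmax-weighted average of the features as one planned token.
    """
    dim = len(queries[0])
    planned = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, f)) / math.sqrt(dim) for f in features]
        weights = softmax(scores)
        planned.append([sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)])
    return planned

rng = random.Random(0)

def rand_matrix(rows, dim):
    """Random stand-in for trained parameters or encoder outputs."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(rows)]

DIM = 8                                  # hypothetical feature width
video_feats = rand_matrix(16, DIM)       # hypothetical video stream features
audio_feats = rand_matrix(10, DIM)       # hypothetical audio stream features
joint_feats = video_feats + audio_feats  # assumption: queries see both streams at once

# Assumption: each alignment tower owns its own set of learnable queries.
video_queries = rand_matrix(4, DIM)
audio_queries = rand_matrix(4, DIM)

video_plan = cross_attend(video_queries, joint_feats)  # video-side planned tokens
audio_plan = cross_attend(audio_queries, joint_feats)  # audio-side planned tokens

print(len(video_plan), len(audio_plan), len(video_plan[0]))
```

In this sketch each planned token is a convex combination of the joint features, so both token sets are drawn from the same pool of video and audio evidence. The same cross_attend routine, with diffusion latents as queries and planned tokens as features, would correspond to the injection step, again only as an illustration.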

Because the planned tokens do not possess a direct one-to-one spatial-temporal correspondence with diffusion latents, we introduce Relative Semantic RoPE. This relative positional encoding maps both planned tokens and latents into a common spatial-temporal coordinate system. This mechanism allows each latent to precisely attend to its corresponding semantic cues based on position. Our experiments on standard benchmarks demonstrate the qualitative and quantitative effectiveness of the Baton framework.
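The idea of a common coordinate system can be sketched with standard rotary position embeddings. This is not the paper's Relative Semantic RoPE, whose exact formulation is not given here: the sketch assumes a single temporal axis, evenly spaced token centers, and the usual RoPE frequency schedule. The names shared_coords and rope_rotate, and all sizes, are hypothetical.

```python
import math

def rope_rotate(vec, pos, base=10000.0):
    """Standard RoPE: rotate each consecutive pair of vec by an angle proportional to pos."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        theta = pos * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        out += [vec[i] * c - vec[i + 1] * s, vec[i] * s + vec[i + 1] * c]
    return out

def shared_coords(num_tokens, span):
    """Assumption: tokens sit at evenly spaced centers inside the interval [0, span]."""
    return [(i + 0.5) * span / num_tokens for i in range(num_tokens)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

NUM_LATENT_FRAMES = 16  # hypothetical number of latent frames
NUM_PLANNED = 4         # hypothetical number of keyframe-level planned tokens

# Both sequences are expressed in latent-frame units, so a planned token and
# the latents it summarizes land at nearby coordinates despite different lengths.
latent_pos = shared_coords(NUM_LATENT_FRAMES, NUM_LATENT_FRAMES)
planned_pos = shared_coords(NUM_PLANNED, NUM_LATENT_FRAMES)

query = [0.3, -1.2, 0.7, 0.5, -0.4, 0.9, 1.1, -0.6]  # stand-in latent query
key = [1.0, 0.2, -0.5, 0.8, 0.6, -0.3, 0.4, 0.7]     # stand-in planned-token key

score = dot(rope_rotate(query, latent_pos[5]), rope_rotate(key, planned_pos[1]))
# Shifting both positions by the same amount leaves the score unchanged:
# the attention logit depends only on the relative coordinate.
shifted = dot(rope_rotate(query, latent_pos[5] + 3.0), rope_rotate(key, planned_pos[1] + 3.0))

print(planned_pos)
print(abs(score - shifted) < 1e-9)
```

Under these assumptions the four planned tokens land at coordinates 2.0, 6.0, 10.0 and 14.0 on the 16-frame latent axis, and the attention logit between a latent and a planned token is a function of their coordinate difference alone. A full spatial-temporal version would apply the same rotation separately along each axis.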


Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC
