arXiv

JenBridge: Adaptive Long-Form Video Soundtracking across Scene Transitions

June 2, 2026 · Jiashuo Yu, Yao Yao, Boyu Chen, Alex Wang

Abstract:

We address the problem of producing high-quality, extended soundtracks that maintain narrative unity as scenes shift. Current AI music generation tools are optimized primarily for brief, disconnected segments and do not provide the structural continuity that longer narratives require. In response, we introduce JenBridge, a modular and transparent framework for adaptive soundtracking in long-form video. The system is designed to deliver both high audio fidelity and seamless transitions.

The underlying architecture is a Transformer-based generative model trained with a flow-matching objective. Training proceeds in two phases: first, the model is pretrained on extensive text-audio datasets to build strong musical foundations; it is then fine-tuned for the video domain with dual conditioning on text and visual inputs to ensure precise cross-modal alignment.
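
As a rough illustration of what a flow-matching objective looks like, the Python sketch below computes one common form of the loss: regressing a predicted velocity onto the straight-line direction from a noise sample to a data sample. The abstract does not say which probability path, noise prior, or network JenBridge uses, so the linear path, the Gaussian prior, and the names `flow_matching_loss` and `velocity_fn` are all assumptions made for illustration, with plain Python lists standing in for audio latents.

```python
import random
from typing import Callable, List, Sequence

Vector = List[float]


def flow_matching_loss(
    velocity_fn: Callable[[Vector, float], Vector],
    x1_batch: Sequence[Vector],
    rng: random.Random,
) -> float:
    """Mean squared error between predicted and target velocities.

    Assumed setup (not taken from the paper): a linear path
    x_t = (1 - t) * x0 + t * x1 with x0 drawn from a standard Gaussian,
    whose target velocity is the constant x1 - x0.
    """
    total, count = 0.0, 0
    for x1 in x1_batch:
        # Sample a noise endpoint and a time in [0, 1) for this example.
        x0 = [rng.gauss(0.0, 1.0) for _ in x1]
        t = rng.random()
        # Point on the straight line between noise and data at time t.
        xt = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
        # Velocity of that straight line, which the model should predict.
        target = [b - a for a, b in zip(x0, x1)]
        pred = velocity_fn(xt, t)
        total += sum((p - g) ** 2 for p, g in zip(pred, target))
        count += len(x1)
    return total / count
```

In a real system `velocity_fn` would be the conditioned Transformer and the loss would be minimized by gradient descent; here any callable that maps a point and a time to a vector of the same length can be passed in.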

A key contribution of JenBridge is its adaptive transition mechanism, which preserves coherence across varied scene changes. The toolkit offers a range of transition styles, including a generative approach. Distinctively, it uses a Large Language Model (LLM) agent that acts as a director, choosing the most suitable transition for each narrative shift.
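
The director idea can be pictured as a small dispatch step: describe the two adjacent scenes to a language model, ask it to name one transition style, and fall back to a safe default if the reply is not a known style. The sketch below is hypothetical throughout. The style names in `TRANSITION_STYLES`, the prompt wording, the `SceneChange` type, and the `ask_llm` callable are assumptions, since the abstract names only a generative approach and does not describe the agent's interface.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# Hypothetical style names; the abstract confirms only that a
# generative transition exists among several styles.
TRANSITION_STYLES = ("crossfade", "hard_cut", "generative_bridge")


@dataclass(frozen=True)
class SceneChange:
    """Text descriptions of the scenes on either side of a cut."""
    prev_description: str
    next_description: str


def choose_transition(
    change: SceneChange,
    ask_llm: Callable[[str], str],
    styles: Sequence[str] = TRANSITION_STYLES,
    fallback: str = "crossfade",
) -> str:
    """Ask a language model to pick one style, validating its reply."""
    prompt = (
        "You are directing a film soundtrack.\n"
        f"Previous scene: {change.prev_description}\n"
        f"Next scene: {change.next_description}\n"
        f"Reply with exactly one of: {', '.join(styles)}"
    )
    # Normalize the reply so minor formatting differences still match.
    reply = ask_llm(prompt).strip().lower()
    # Never trust free-form model output: unknown replies use the fallback.
    return reply if reply in styles else fallback
```

Passing the model in as a plain callable keeps the sketch independent of any particular LLM client, and a stub such as `lambda prompt: "hard_cut"` is enough to exercise it.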

To evaluate methods on this task, we introduce the LVS Benchmark. This new resource comprises a curated dataset and new metrics designed to assess both holistic quality and transition-aware performance. Our experiments on the LVS Benchmark show that JenBridge substantially surpasses current methods on both objective and subjective measures, with particular strength in transition naturalness and overall narrative coherence. JenBridge marks a major step toward fully automated, professional-grade video soundtracking.
