SegTune: Structured and Fine-Grained Control for Song Generation
Title: SegTune: Enabling Structured and Fine-Grained Control in Song Generation
Abstract:
While recent advances in neural song generation enable high-quality audio synthesis from lyrics and global textual descriptions, current systems largely overlook musical attributes that vary over time. This limits precise control over a track’s structure and dynamics. To address this limitation, we introduce SegTune, a framework built on Diffusion Transformers that offers both structured and fine-grained control. It lets users or large language models (LLMs) specify local musical descriptions for individual song segments. These segment-level prompts are mapped to their corresponding time intervals, while global prompts maintain overall stylistic coherence.
To achieve accurate alignment between lyrics and music, we develop an LLM-driven duration predictor that autoregressively produces sentence-level timestamped lyrics in LRC (LyRiCs) format. We also build a data pipeline to curate a large-scale dataset of high-quality songs with synchronized lyrics and prompts, and we propose new evaluation metrics that assess segment alignment and vocal consistency. Experimental results indicate that SegTune outperforms existing baselines in both musical quality and controllability. Code and additional audio samples are available on the project page at https://github.com/KlingAIResearch/SegTune.
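As a rough illustration of the two ideas in the abstract (sentence-level LRC timestamps and segment prompts tied to time intervals), the following minimal Python sketch parses LRC-style lines and looks up which segment prompt covers a given lyric line. It is not the authors' implementation: the names `Segment`, `parse_lrc`, `prompt_at`, and `align`, the example lyrics, and the half-open interval convention are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass

# LRC-style timestamp such as [01:23.45]; the fractional part is optional.
LRC_LINE = re.compile(r"\[(\d+):(\d{2})(?:\.(\d{1,3}))?\](.*)")


@dataclass
class Segment:
    """Hypothetical container for one song segment and its local prompt."""
    label: str    # e.g. "verse" or "chorus" (illustrative labels)
    prompt: str   # local musical description for this segment
    start: float  # segment start, in seconds
    end: float    # segment end, in seconds (exclusive)


def parse_lrc(text):
    """Return a list of (start_seconds, lyric) pairs from LRC-style text."""
    entries = []
    for line in text.splitlines():
        match = LRC_LINE.match(line.strip())
        if not match:
            continue  # skip lines without a leading timestamp
        minutes, seconds, frac, lyric = match.groups()
        start = int(minutes) * 60 + int(seconds)
        if frac:
            start += int(frac) / (10 ** len(frac))
        entries.append((start, lyric.strip()))
    return entries


def prompt_at(segments, t):
    """Return the local prompt whose interval [start, end) contains time t."""
    for seg in segments:
        if seg.start <= t < seg.end:
            return seg.prompt
    return None  # no segment covers this time


def align(lrc_text, segments, global_prompt):
    """Pair each lyric line with the global prompt and its segment prompt."""
    return [
        (t, lyric, global_prompt, prompt_at(segments, t))
        for t, lyric in parse_lrc(lrc_text)
    ]


# Illustrative inputs (invented for this sketch, not taken from the paper).
lrc = "[00:00.00]First line\n[00:12.50]Second line\n[00:30.00]Chorus line"
segments = [
    Segment("verse", "soft piano, calm vocals", 0.0, 30.0),
    Segment("chorus", "full band, energetic", 30.0, 60.0),
]
aligned = align(lrc, segments, "pop ballad")
```

Each entry of `aligned` carries both the global prompt and the local prompt active at that line's start time, which mirrors the split between global style and segment-level control described above. How SegTune actually injects these prompts into the Diffusion Transformer is not specified in this abstract.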
Source: arXiv
Generated at: 2026-06-03 00:00:00 UTC



