MAVL: A Multilingual Audio-Video Lyrics Dataset for Animated Song Translation
Abstract: Translating lyrics demands not only semantic fidelity but also the preservation of musical rhythm, syllabic patterns, and poetic nuance. The task becomes even more complex in animated musicals, where translations must synchronize with specific visual and auditory cues. To address this, we present MAVL (Multilingual Audio-Video Lyrics Benchmark for Animated Song Translation), the first multimodal and multilingual benchmark designed for translating singable lyrics. By combining text, audio, and video data, MAVL enables translations that are more expressive and nuanced than those derived from text alone. Building on this resource, we introduce SylAVL-CoT, a Syllable-Constrained Audio-Video Large Language Model. The model uses audio-video cues and applies strict syllabic constraints to generate lyrics that sound natural. Our experiments show that SylAVL-CoT substantially outperforms text-only models in both singability and contextual accuracy, highlighting the advantages of multimodal, multilingual approaches to lyrics translation.
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC





