Multimodal Large Language Model-Enabled Video Translation: A Role-Oriented Survey
Title: A Role-Based Survey on Video Translation Powered by Multimodal Large Language Models
Recent advancements in multimodal large language models (MLLMs) are fundamentally transforming video translation. Rather than relying on a disjointed sequence of automatic speech recognition, machine translation, text-to-speech conversion, and lip synchronization, the field is shifting toward a unified framework of multimodal reasoning and generation. Achieving high-fidelity video translation demands more than just semantic accuracy; it requires precise temporal alignment, consistent speaker identity, and emotional nuance across visual, auditory, and linguistic channels.
This survey offers a targeted examination of MLLM-driven video translation, structured around a taxonomy defined by functional roles. We categorize existing studies by the role the MLLM plays: the Semantic Reasoner, responsible for grounding translations through video comprehension, temporal logic, and multimodal integration; the Expressive Performer, which facilitates speech generation that is both context-sensitive and controllable; and the Visual Synthesizer, tasked with ensuring lip synchronization and visually coherent rendering of speakers.
Additionally, we review key datasets, benchmarks, and evaluation metrics associated with each role, highlighting significant gaps between current assessment protocols and the rigorous demands of end-to-end video translation. The paper concludes by addressing critical open challenges, including long-form video analysis, temporal modeling, multimodal alignment, multilingual robustness, and ethical deployment, and proposes future research directions toward natural and reliable cross-lingual video communication.
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC




