Mamba-Enhanced Implicit Motion Learning for Audio-Driven Portrait Animation
Abstract:
Audio-driven human motion video generation aims to create realistic, temporally consistent animations from a single static image, enabling applications such as talking-head synthesis, co-speech gesture generation, and dynamic presentations. Conventional keypoint-based methods frequently fail to capture subtle motion dynamics. To address this limitation, we introduce a novel implicit-motion framework that generates realistic and temporally coherent human motion videos from a single static image and an audio input.
Our method uses a two-stage pipeline that separates motion prediction from rendering. In the first stage, a region-aware attention mechanism incorporates hierarchical depth cues and appearance priors to model latent motion features. In the second stage, a Mamba-enhanced diffusion model predicts these latent motion features directly from the audio and the source image, which allows fine-grained motion patterns to be learned without supervision. This decoupled architecture improves both efficiency and flexibility.
Evaluated on a newly curated dataset of 380 hours of high-quality data, our method surpasses previous approaches in accuracy, naturalness, and temporal coherence on several public benchmarks and on our own collected data, establishing a new state of the art.
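
The decoupling of motion prediction from rendering described in the abstract can be illustrated with a toy sketch. This is not the paper's implementation: the function names (selective_scan, predict_motion, render, animate), the scalar "features", and the gate parameters are all hypothetical stand-ins. The recurrence shown is a simple input-dependent (selective) linear state update, the kind of state update that Mamba-style models build on; the diffusion process and the region-aware attention are omitted entirely.

```python
import math
from typing import List


def selective_scan(xs: List[float], w: float = 1.0, b: float = 0.0) -> List[float]:
    """Toy selective recurrence: h_t = (1 - g_t) * h_{t-1} + g_t * x_t.

    The gate g_t = sigmoid(w * x_t + b) depends on the current input, so the
    state decides per step how much history to keep. Assumption: a scalar
    state stands in for the high-dimensional state of a real model.
    """
    h = 0.0
    out = []
    for x in xs:
        g = 1.0 / (1.0 + math.exp(-(w * x + b)))  # input-dependent gate
        h = (1.0 - g) * h + g * x                 # gated state update
        out.append(h)
    return out


def predict_motion(audio_feats: List[float]) -> List[float]:
    """Hypothetical motion predictor: audio features -> latent motion per frame."""
    return selective_scan(audio_feats)


def render(source_pixel: float, motion: List[float]) -> List[float]:
    """Hypothetical renderer: offsets a single 'pixel' by each latent motion value."""
    return [source_pixel + m for m in motion]


def animate(source_pixel: float, audio_feats: List[float]) -> List[float]:
    """Decoupled pipeline: predict latent motion first, then render frames."""
    return render(source_pixel, predict_motion(audio_feats))


if __name__ == "__main__":
    frames = animate(0.5, [1.0, 1.0, 1.0])
    print(len(frames))  # one output frame per audio step
```

Because the two functions only communicate through the latent motion sequence, either one could be swapped out (for example, a different predictor) without touching the other, which is the flexibility argument the abstract makes for a decoupled design.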
Source: arXiv Generated at: 2026-06-03 00:00:00 UTC





