MMTalker: Multiresolution 3D Talking Head Synthesis with Multimodal Feature Fusion
Abstract
Speech-driven three-dimensional (3D) facial animation synthesis aims to learn a mapping from one-dimensional (1D) speech signals to dynamic 3D facial motion. Existing approaches struggle with lip-sync precision and with generating realistic expressions, largely because this cross-modal mapping is inherently ill-posed. To address these limitations, we present MMTalker, a novel method for audio-driven 3D facial animation that uses a multiresolution representation and multimodal feature fusion to accurately reconstruct detailed 3D facial movements.
We first build a continuous, detail-rich representation of the 3D face through mesh parameterization and differentiable non-uniform sampling. Mesh parameterization establishes a correspondence between the UV plane and the 3D facial mesh, providing the ground truth needed for continuous learning. Differentiable non-uniform sampling assigns a learnable sampling probability to each triangular face, so that regions carrying fine facial detail can be sampled more densely.
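To make the sampling step concrete, the following is a minimal sketch, not the authors' implementation. It assumes per-face logits that are turned into sampling probabilities with a softmax, and it draws points uniformly inside each chosen triangle using barycentric coordinates. The names `sample_surface`, `face_logits`, `uv`, and `xyz` are hypothetical. Only the forward sampling is shown; how gradients reach the probabilities (the "differentiable" part) is not specified in this abstract and is omitted here.

```python
import math
import random


def softmax(logits):
    """Turn per-face logits into probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_barycentric(rng):
    """Draw barycentric weights uniformly over a triangle (square-root trick)."""
    r1, r2 = rng.random(), rng.random()
    s = math.sqrt(r1)
    return (1.0 - s, s * (1.0 - r2), s * r2)


def interpolate(corners, bary):
    """Blend the three corner attributes with the barycentric weights."""
    dim = len(corners[0])
    return tuple(sum(w * c[k] for w, c in zip(bary, corners)) for k in range(dim))


def sample_surface(faces, uv, xyz, face_logits, n_points, seed=0):
    """Sample points non-uniformly over the mesh.

    Faces with larger logits are chosen more often. Each sample carries its
    UV coordinate and the matching 3D position, which is the UV-to-mesh
    correspondence that a parameterization provides.
    """
    rng = random.Random(seed)
    probs = softmax(face_logits)
    samples = []
    for _ in range(n_points):
        f = rng.choices(range(len(faces)), weights=probs, k=1)[0]
        bary = sample_barycentric(rng)
        uv_pt = interpolate([uv[i] for i in faces[f]], bary)
        xyz_pt = interpolate([xyz[i] for i in faces[f]], bary)
        samples.append((f, uv_pt, xyz_pt))
    return samples


# Toy mesh (assumed for illustration): a unit square split into two triangles.
faces = [(0, 1, 2), (0, 2, 3)]
uv = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
xyz = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.5), (0.0, 1.0, 0.0)]
samples = sample_surface(faces, uv, xyz, face_logits=[2.0, 0.0], n_points=1000)
```

In this toy setup the first face has the larger logit, so it receives most of the samples. In a trained model the logits would presumably concentrate samples on regions such as the lips and eyes, but that behavior is an expectation rather than something stated here.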
Next, we use a residual graph convolutional network together with a dual cross-attention mechanism to extract distinctive facial motion features from the input modalities. This multimodal fusion strategy integrates the hierarchical features of speech with the explicit spatiotemporal geometric characteristics of the facial mesh. Finally, a lightweight regression network predicts the vertex-wise geometric displacements of the synthesized talking face by jointly processing the encoded facial motion features and the sampled points in the canonical UV space. Extensive experiments indicate that our method yields substantial improvements over state-of-the-art techniques, particularly in the synchronization accuracy of lip and eye movements.
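As an illustration of the fusion step, here is a minimal sketch of scaled dot-product cross-attention applied in two directions. It rests on an assumption: the abstract does not define "dual", so this sketch reads it as mesh features attending to speech features and speech features attending to mesh features. The function names are hypothetical, and the learned projections, residual graph convolutions, and regression head of the actual model are left out.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of attention scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query mixes the values by key similarity."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return out


def dual_cross_attention(audio_feats, mesh_feats):
    """Assumed reading of 'dual': attend in both directions and return both results."""
    mesh_from_audio = cross_attention(mesh_feats, audio_feats, audio_feats)
    audio_from_mesh = cross_attention(audio_feats, mesh_feats, mesh_feats)
    return mesh_from_audio, audio_from_mesh


# Toy features (assumed): 3 speech tokens and 2 mesh tokens, each of dimension 2.
audio_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mesh_feats = [[0.5, 0.5], [1.0, -1.0]]
mesh_from_audio, audio_from_mesh = dual_cross_attention(audio_feats, mesh_feats)
```

Each output row is a convex combination of the value vectors from the other modality, so it has one row per query token. A full model would typically add learned query, key, and value projections and combine the two outputs before regression.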
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC





