arXiv

MMTalker: Multiresolution 3D Talking Head Synthesis with Multimodal Feature Fusion

Title: MMTalker: Enhancing Multiresolution 3D Talking Head Synthesis via Multimodal Feature Fusion

Abstract

The objective of speech-driven three-dimensional (3D) facial animation synthesis is to establish a mapping between one-dimensional (1D) speech signals and dynamic 3D facial motion. However, existing approaches struggle with lip-sync precision and the generation of realistic expressions, largely because this cross-modal mapping is inherently ill-posed. To address these limitations, we present MMTalker, a novel method for 3D audio-driven facial animation that leverages multi-resolution representation and multimodal feature fusion to accurately reconstruct detailed 3D facial movements.

We first build a continuous, detail-rich representation of the 3D face through mesh parameterization and differentiable non-uniform sampling. Mesh parameterization establishes a correspondence between the UV plane and the 3D facial mesh, which provides the ground truth needed for continuous learning. Differentiable non-uniform sampling assigns a learnable sampling probability to each triangular face, which allows facial details to be captured precisely.
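To make the sampling idea concrete, the following is a minimal sketch of non-uniform surface sampling on a parameterized mesh. It is an illustration under stated assumptions, not the authors' implementation: the per-face logits, the softmax normalization, the toy two-triangle mesh, and all function names are hypothetical. It shows only the forward sampling pass; the abstract does not describe how gradients reach the sampling probabilities, so that part is left out.

```python
import math
import random


def softmax(logits):
    """Turn per-face logits into sampling probabilities (assumed normalization)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_barycentric(rng):
    """Draw barycentric weights uniformly inside a triangle (square-root trick)."""
    r1, r2 = rng.random(), rng.random()
    s = math.sqrt(r1)
    return (1.0 - s, s * (1.0 - r2), s * r2)


def interpolate(corners, weights):
    """Blend the three corner coordinates with the barycentric weights."""
    dim = len(corners[0])
    return tuple(sum(w * c[k] for w, c in zip(weights, corners)) for k in range(dim))


def sample_surface(verts_3d, verts_uv, faces, face_logits, n, seed=0):
    """Pick faces by probability, then place one point inside each chosen face.

    The same barycentric weights are applied in UV space and in 3D, which is
    what ties a UV sample to its ground-truth position on the mesh.
    """
    rng = random.Random(seed)
    probs = softmax(face_logits)
    samples = []
    for _ in range(n):
        f = rng.choices(range(len(faces)), weights=probs, k=1)[0]
        w = sample_barycentric(rng)
        uv = interpolate([verts_uv[i] for i in faces[f]], w)
        xyz = interpolate([verts_3d[i] for i in faces[f]], w)
        samples.append((f, uv, xyz))
    return samples


# Toy mesh (hypothetical): a unit square split into two triangles.
verts_3d = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.5), (0.0, 1.0, 0.0)]
verts_uv = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
faces = [(0, 1, 2), (0, 2, 3)]
face_logits = [2.0, 0.0]  # face 0 is sampled more densely than face 1

samples = sample_surface(verts_3d, verts_uv, faces, face_logits, n=1000)
```

In this sketch, raising a face's logit concentrates samples on that triangle, which is the sense in which sampling is non-uniform. In a trained system the logits would be learned so that detailed regions receive more points.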

We then use a residual graph convolutional network together with a dual cross-attention mechanism to extract distinctive facial motion features from the input modalities. This multimodal fusion strategy integrates the hierarchical features of speech with the explicit spatiotemporal geometric characteristics of the facial mesh. Finally, a lightweight regression network predicts the vertex-wise geometric displacements of the synthesized talking face by jointly processing the encoded facial motion features and the sampled points in the canonical UV space. Extensive experiments indicate that our method yields substantial improvements over state-of-the-art techniques, particularly in the synchronization accuracy of eye and lip movements.
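The fusion step can be illustrated with a small sketch of cross-attention applied in both directions between two modalities. This is one plausible reading of "dual cross-attention", offered as an assumption: the abstract does not specify the architecture, and the sketch omits learned projections, multiple heads, and the residual graph convolutional network. The feature values and function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of attention scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query becomes a weighted mix of values."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        mixed = [sum(w * v[j] for w, v in zip(weights, values))
                 for j in range(len(values[0]))]
        out.append(mixed)
    return out


def dual_cross_attention(audio_feats, mesh_feats):
    """Run attention in both directions (assumed meaning of 'dual')."""
    # Mesh tokens query the audio frames: geometry enriched with speech cues.
    mesh_with_audio = cross_attention(mesh_feats, audio_feats, audio_feats)
    # Audio frames query the mesh tokens: speech enriched with geometry cues.
    audio_with_mesh = cross_attention(audio_feats, mesh_feats, mesh_feats)
    return mesh_with_audio, audio_with_mesh


# Hypothetical toy features: 3 audio frames and 4 mesh tokens, 2 dims each.
audio_feats = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
mesh_feats = [[0.2, 0.8], [0.9, 0.1], [0.4, 0.6], [1.0, 1.0]]

mesh_with_audio, audio_with_mesh = dual_cross_attention(audio_feats, mesh_feats)
```

Each output row is a convex combination of the other modality's features, so one output is produced per query token. A downstream regressor could then consume these fused features together with the sampled UV points.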


Source: arXiv Generated at: 2026-06-02 00:00:00 UTC
