Avatar Forcing: Real-Time Interactive Head Avatar Generation for Natural Conversation
Abstract:
Generating lifelike talking head avatars from static images has become a pivotal technology for virtual communication and content production. Despite recent advances, existing models often fail to capture the two-way nature of genuine human interaction and typically produce one-directional responses that lack emotional depth. This paper identifies two primary hurdles in developing truly interactive avatars: generating motion in real time under causal constraints, and learning expressive, vibrant reactions without additional labeled data.
To overcome these obstacles, we introduce "Avatar Forcing," a novel framework for interactive head avatar generation that uses diffusion forcing to model real-time exchanges between users and avatars. By processing multimodal user inputs, such as audio and motion, with minimal latency, the system lets the avatar react immediately to both verbal and non-verbal signals, including speech, laughter, and nods. We also present a direct preference optimization method that uses synthetic "losing" samples, constructed by dropping the user conditions, to learn expressive interaction without labels. Experiments show that the framework achieves real-time interaction with a latency of roughly 500 ms, a 6.8-fold speedup over the baseline. The resulting avatar motion is also reactive and expressive, and is preferred over the baseline in more than 80% of comparative evaluations.
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC
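
The preference-optimization idea in the abstract can be illustrated with a minimal sketch. It assumes the standard pairwise DPO objective on scalar log-likelihoods; the paper may well use a diffusion-specific variant, which the abstract does not specify. The condition keys (`user_audio`, `user_motion`, `avatar_audio`), the function names, and the value of `beta` are hypothetical and chosen only for illustration.

```python
import math


def drop_user_conditions(conditions):
    """Build the conditioning for a synthetic "losing" sample.

    Assumption: user-side signals are stored under keys starting with
    "user_" (hypothetical naming). Everything else is kept unchanged.
    """
    return {k: v for k, v in conditions.items() if not k.startswith("user_")}


def dpo_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    """Standard pairwise DPO loss for one (winning, losing) pair.

    logp_*     : log-likelihood of each sample under the model being trained
    ref_logp_* : log-likelihood of each sample under a frozen reference model
    beta       : strength of the preference margin (illustrative default)
    """
    margin = beta * ((logp_win - ref_logp_win) - (logp_lose - ref_logp_lose))
    # -log(sigmoid(margin)), evaluated in a numerically stable way.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical conditioning for one training clip.
full = {"user_audio": [0.1, 0.2], "user_motion": [0.3], "avatar_audio": [0.5]}
losing_conditions = drop_user_conditions(full)  # only "avatar_audio" remains

# The loss falls below log(2) once the model favors the winning sample
# more strongly than the reference model does.
loss = dpo_loss(logp_win=-1.0, logp_lose=-3.0, ref_logp_win=-2.0, ref_logp_lose=-2.0)
```

In this sketch the winning sample would be motion generated with the full conditions and the losing sample motion generated from `losing_conditions`, so the preference pair requires no human labels.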




