Spatial-Temporal Decoupled Reference Conditioning for Identity-Preserving Text-to-Video Generation
Abstract
Identity-preserving video generation (IPVG) seeks to create high-fidelity videos that adhere to textual prompts while accurately maintaining a specific reference identity. Although the field has seen recent advancements, current IPVG approaches continue to face challenges in balancing high-level semantic control with low-level identity fidelity. To address this limitation, we introduce ST-DRC, a novel Spatial-Temporal Decoupled Reference Conditioning framework designed for identity-preserving text-to-video synthesis.
At the architectural level, ST-DRC facilitates latent in-context feature injection by encoding the reference image using the video VAE and concatenating the result with noisy video latents. This mechanism grants access to detailed low-level identity information without the need for extra adapters. To distinguish between identity-aware reference retrieval and mere appearance copying, we propose TASS-RoPE, a Temporal-Adjacent Spatial-Shifted RoPE scheme. By positioning reference tokens temporally close to the video sequence while shifting them spatially, TASS-RoPE enables reference data to traverse spatio-temporal attention mechanisms while mitigating pixel-level copy-paste shortcuts.
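One way to picture the positional scheme described above is as a set of (t, h, w) indices assigned to the concatenated reference and video tokens. The sketch below is illustrative only: the abstract does not give the actual offsets, so placing the reference at temporal index -1 and shifting it by one full grid height and width are assumptions, and the function names `grid_positions` and `tass_positions` are hypothetical.

```python
import numpy as np

def grid_positions(frames, height, width, t0=0, h0=0, w0=0):
    """Return an (frames*height*width, 3) array of integer (t, h, w) positions."""
    t, h, w = np.meshgrid(
        np.arange(t0, t0 + frames),
        np.arange(h0, h0 + height),
        np.arange(w0, w0 + width),
        indexing="ij",
    )
    return np.stack([t, h, w], axis=-1).reshape(-1, 3)

def tass_positions(frames, height, width):
    """Positions for the token sequence [reference ; video].

    Assumptions (not stated in the abstract):
      * the reference occupies one latent frame at t = -1, directly
        adjacent to the first video frame at t = 0;
      * its spatial grid is shifted by (height, width), so no reference
        token shares an (h, w) coordinate with any video token.
    """
    video = grid_positions(frames, height, width)
    ref = grid_positions(1, height, width, t0=-1, h0=height, w0=width)
    return np.concatenate([ref, video], axis=0)

# Tiny latent grid: 4 frames of 2 x 3 tokens, plus one reference frame.
pos = tass_positions(frames=4, height=2, width=3)
```

Under these assumptions the reference stays one step away from the video along the temporal axis while never coinciding with a video token spatially, which is the kind of separation the abstract describes as discouraging pixel-level copying.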
Furthermore, to curb shortcut learning and reinforce the identity supervision that is often diluted within the diffusion objective, we integrate appearance-invariant reference augmentation with face-guided identity objectives. This combination encourages the model to retain identity consistency despite variations in color, pose, and layout. During inference, we employ a three-stream reference classifier-free guidance strategy to independently manage text adherence and reference fidelity.
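The guidance step can be sketched as a combination of three noise predictions with two independent scales. The abstract does not specify which three streams are used or how they are combined; the version below assumes unconditional, reference-only, and reference-plus-text streams composed in the nested style common to multi-condition classifier-free guidance. The function name `three_stream_cfg` and the scale values are hypothetical.

```python
import numpy as np

def three_stream_cfg(eps_uncond, eps_ref, eps_full, ref_scale, text_scale):
    """Combine three noise predictions with separate guidance scales.

    Assumed streams (not confirmed by the abstract):
      eps_uncond: no text, no reference
      eps_ref:    reference only
      eps_full:   reference and text
    ref_scale controls reference fidelity, text_scale controls text adherence.
    """
    return (
        eps_uncond
        + ref_scale * (eps_ref - eps_uncond)
        + text_scale * (eps_full - eps_ref)
    )

# Stand-in predictions; in practice these come from three model forward passes.
rng = np.random.default_rng(0)
eps_u, eps_r, eps_f = rng.normal(size=(3, 4, 8))
guided = three_stream_cfg(eps_u, eps_r, eps_f, ref_scale=2.0, text_scale=5.0)
```

With both scales set to 1 this expression reduces to the fully conditioned prediction, and with both set to 0 it reduces to the unconditional one, so each scale can be tuned without affecting the other direction.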
Experimental results indicate that ST-DRC delivers strong identity preservation, prompt alignment, temporal consistency, and video quality with a lightweight design built on LTX-2.3. Our approach ranks among the top entries in the facial identity-preserving video generation track, underscoring the efficacy of spatial-temporal decoupled reference conditioning.
Source: arXiv Generated at: 2026-06-02 00:00:00 UTC





