arXiv

Spatial-Temporal Decoupled Reference Conditioning for Identity-Preserving Text-to-Video Generation

Abstract

Identity-preserving video generation (IPVG) seeks to create high-fidelity videos that adhere to textual prompts while accurately maintaining a specific reference identity. Although the field has seen recent advancements, current IPVG approaches continue to face challenges in balancing high-level semantic control with low-level identity fidelity. To address this limitation, we introduce ST-DRC, a novel Spatial-Temporal Decoupled Reference Conditioning framework designed for identity-preserving text-to-video synthesis.

At the architectural level, ST-DRC facilitates latent in-context feature injection by encoding the reference image using the video VAE and concatenating the result with noisy video latents. This mechanism grants access to detailed low-level identity information without the need for extra adapters. To distinguish between identity-aware reference retrieval and mere appearance copying, we propose TASS-RoPE, a Temporal-Adjacent Spatial-Shifted RoPE scheme. By positioning reference tokens temporally close to the video sequence while shifting them spatially, TASS-RoPE enables reference data to traverse spatio-temporal attention mechanisms while mitigating pixel-level copy-paste shortcuts.
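The position assignment described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names `grid_positions` and `tass_positions`, the choice of temporal index 0 for the reference, and the spatial shift of one full grid height and width are all assumptions made for the example. The abstract states only that reference tokens sit temporally close to the video and are shifted spatially.

```python
import numpy as np

def grid_positions(t_indices, height, width, h_offset=0, w_offset=0):
    """Return an (N, 3) integer array of (t, h, w) position ids in row-major order."""
    t, h, w = np.meshgrid(
        np.asarray(t_indices),
        np.arange(height) + h_offset,
        np.arange(width) + w_offset,
        indexing="ij",
    )
    return np.stack([t.ravel(), h.ravel(), w.ravel()], axis=1)

def tass_positions(num_frames, height, width):
    """Build position ids for [reference tokens, video tokens] (hypothetical layout)."""
    # Video latent frames occupy temporal indices 1..num_frames on the unshifted grid.
    video = grid_positions(np.arange(1, num_frames + 1), height, width)
    # Assumption: the reference latent takes the adjacent temporal index 0 and is
    # shifted by one full grid in h and w, so it never shares an (h, w) coordinate
    # with any video token.
    ref = grid_positions([0], height, width, h_offset=height, w_offset=width)
    # Reference tokens are concatenated in front of the video tokens along the
    # sequence axis, mirroring the in-context concatenation of latents.
    return np.concatenate([ref, video], axis=0), ref.shape[0]

positions, n_ref = tass_positions(num_frames=4, height=2, width=3)
```

The resulting ids would be fed to a 3D RoPE in place of the default grid coordinates. Because no reference token shares a spatial coordinate with a video token, attention cannot rely on exact positional alignment to copy pixels, while the adjacent temporal index keeps the reference close to the video sequence.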

Furthermore, to curb shortcut learning and reinforce the identity supervision that is often diluted within the diffusion objective, we integrate appearance-invariant reference augmentation with face-guided identity objectives. This combination encourages the model to retain identity consistency despite variations in color, pose, and layout. During inference, we employ a three-stream reference classifier-free guidance strategy to independently manage text adherence and reference fidelity.
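One plausible form of the three-stream guidance is sketched below. The abstract does not give the combination rule, so the nested formulation here (unconditional, reference-only, and reference-plus-text predictions, each with its own scale) is an assumption borrowed from common two-condition classifier-free guidance. The function name `three_stream_cfg` and its parameters are hypothetical.

```python
import numpy as np

def three_stream_cfg(pred_uncond, pred_ref, pred_full, ref_scale, text_scale):
    """Combine three denoiser outputs with separate reference and text scales.

    pred_uncond: prediction with neither reference nor text
    pred_ref:    prediction with the reference only
    pred_full:   prediction with both reference and text
    (Assumed stream definitions; the source does not specify them.)
    """
    ref_direction = pred_ref - pred_uncond   # effect of adding the reference
    text_direction = pred_full - pred_ref    # effect of adding text on top of it
    return pred_uncond + ref_scale * ref_direction + text_scale * text_direction

# Toy predictions standing in for the three forward passes of the model.
uncond = np.zeros(3)
ref_only = np.ones(3)
full = np.full(3, 3.0)
guided = three_stream_cfg(uncond, ref_only, full, ref_scale=2.0, text_scale=1.5)
```

With both scales set to 1 the expression reduces to the fully conditioned prediction, and raising either scale strengthens only its own direction, which is what lets text adherence and reference fidelity be tuned separately under this formulation.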

Experimental results indicate that ST-DRC performs strongly on identity preservation, prompt alignment, temporal consistency, and video quality while retaining a lightweight design based on LTX-2.3. Our approach ranks among the top entries in the facial identity-preserving video generation track, underscoring the efficacy of spatial-temporal decoupled reference conditioning.


Source: arXiv Generated at: 2026-06-02 00:00:00 UTC
