Causal Forcing: Autoregressive Diffusion Distillation Done Right for High-Quality Real-Time Interactive Video Generation
Title: Causal Forcing: A Correct Approach to Autoregressive Diffusion Distillation for High-Quality Real-Time Interactive Video Generation
Abstract: Current techniques for real-time interactive video generation typically distill pretrained bidirectional video diffusion models into few-step autoregressive (AR) models. However, replacing full attention with causal attention introduces a significant architectural gap, and existing methods do not resolve this gap theoretically. They initialize the AR student through ODE distillation, which requires frame-level injectivity: each noisy frame must correspond to a single unique clean frame under the probability flow ODE (PF-ODE) of an AR teacher. When an AR student is distilled from a bidirectional teacher, this injectivity condition is violated. Consequently, the student cannot recover the teacher’s flow map and instead learns a conditional-expectation solution, which significantly degrades performance.
To overcome this challenge, we introduce Causal Forcing. This approach bridges the architectural divide by employing an autoregressive teacher for ODE initialization. Subsequently, it utilizes the same DMD procedure found in Self Forcing. Our empirical evaluations demonstrate that Causal Forcing surpasses all baseline methods across every metric. Specifically, it exceeds the state-of-the-art Self Forcing method by 19.3% in Dynamic Degree, 8.7% in VisionReward, and 16.7% in Instruction Following.
Project page: https://thu-ml.github.io/CausalForcing.github.io/
Code: https://github.com/thu-ml/Causal-Forcing
Source: arXiv Generated at: 2026-06-02 00:00:00 UTC
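
The injectivity argument in the abstract can be illustrated with a toy sketch. It is not the authors' implementation. It assumes that ODE distillation is trained with a squared-error regression loss, and it replaces frames with scalars. The function name `mse_optimal_predictor` and the labels `noisy_a` and `noisy_b` are hypothetical. When one noisy input is paired with several clean targets, the squared-error minimizer is their mean, which matches none of the individual targets.

```python
from collections import defaultdict
from statistics import fmean


def mse_optimal_predictor(pairs):
    """Return the lookup table that minimizes squared error over (input, target) pairs.

    For each distinct input, the squared-error-optimal prediction is the mean
    of all targets paired with that input (the conditional expectation).
    """
    targets_by_input = defaultdict(list)
    for noisy, clean in pairs:
        targets_by_input[noisy].append(clean)
    return {noisy: fmean(cleans) for noisy, cleans in targets_by_input.items()}


# Injective case (assumed toy data): every noisy frame has exactly one clean frame.
injective_pairs = [("noisy_a", -1.0), ("noisy_b", 1.0)]

# Non-injective case (assumed toy data): the same noisy frame is paired with two
# different clean frames, as can happen when the teacher's output depends on
# context that a causal student cannot see.
non_injective_pairs = [("noisy_a", -1.0), ("noisy_a", 1.0)]

injective_fit = mse_optimal_predictor(injective_pairs)
collapsed_fit = mse_optimal_predictor(non_injective_pairs)

print(injective_fit)   # each input maps back to its own target
print(collapsed_fit)   # the single input maps to the average of both targets
```

In the injective case the fit reproduces each target exactly. In the non-injective case the prediction falls between the two valid targets, which is the toy analogue of the conditional-expectation solution described in the abstract.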





