arXiv

Causal Forcing++: Scalable Few-Step Autoregressive Diffusion Distillation for Real-Time Interactive Video Generation

June 2, 2026 · Min Zhao, Hongzhou Zhu, Kaiwen Zheng, Zihan Zhou, Bokai Yan, Xinyuan Li, Xiao Yang, Chongxuan Li, Jun Zhu

Title: Causal Forcing++: Enabling Real-Time Interactive Video Generation via Scalable Few-Step Autoregressive Diffusion Distillation

Abstract:

Real-time interactive video generation requires low-latency, streaming, and controllable rollout. Existing autoregressive (AR) diffusion distillation methods perform well in the chunk-wise, 4-step setting, but they are limited by coarse response granularity and substantial sampling latency. This work studies a more demanding regime: frame-wise autoregression with only 1–2 sampling steps. In this regime, we identify the initialization of the few-step AR student as the primary bottleneck. Prior initialization strategies fall short because they are misaligned with the target, unable to support few-step generation, or too expensive to scale.

To address these challenges, we introduce Causal Forcing++, a scalable and principled pipeline that uses causal consistency distillation (causal CD) to initialize few-step AR models. The key idea is that causal CD learns the same AR-conditional flow map as causal ODE distillation, but obtains its supervision from a single online teacher ODE step between adjacent timesteps. This removes the need to precompute and store full probability-flow ODE (PF-ODE) trajectories, making initialization both more efficient and easier to optimize.
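The supervision signal described above can be illustrated with a toy NumPy sketch. It is not the authors' implementation. The teacher here is a hypothetical closed-form straight-line flow whose clean frame is assumed to be 0.9 times the previous-frame context, and the names `teacher_velocity`, `teacher_ode_step`, `causal_cd_loss`, `exact_flow_map`, and `identity_student` are invented for illustration. The sketch shows only the shape of a consistency-distillation objective: one online teacher ODE step from a timestep t to the adjacent timestep t_prev, followed by a match between the student's output at t and a frozen copy's output at t_prev, both conditioned on the same causal context.

```python
import numpy as np

def teacher_velocity(x_t, t, context):
    """Toy stand-in for an AR teacher (assumption): velocity of a
    straight-line flow whose clean target frame is 0.9 * context."""
    x0 = 0.9 * context
    return (x_t - x0) / t

def teacher_ode_step(x_t, t, t_prev, context):
    """One online Euler step of the teacher ODE from t to the adjacent t_prev."""
    return x_t + (t_prev - t) * teacher_velocity(x_t, t, context)

def causal_cd_loss(student, frozen_student, x_t, t, t_prev, context):
    """Consistency loss: the student at (x_t, t) should agree with a frozen
    copy evaluated at the teacher-stepped point (x_prev, t_prev)."""
    x_prev = teacher_ode_step(x_t, t, t_prev, context)
    target = frozen_student(x_prev, t_prev, context)  # treated as a constant
    pred = student(x_t, t, context)
    return float(np.mean((pred - target) ** 2))

def exact_flow_map(x, t, context):
    """Maps any point on the toy trajectory straight to its clean endpoint."""
    return x - t * teacher_velocity(x, t, context)

def identity_student(x, t, context):
    """A deliberately poor student that returns its noisy input unchanged."""
    return x

rng = np.random.default_rng(0)
context = rng.normal(size=(4, 8))   # previous (clean) frame features
noise = rng.normal(size=(4, 8))
t, t_prev = 0.8, 0.6                # adjacent timesteps, t_prev < t
x_t = (1 - t) * 0.9 * context + t * noise  # noisy current frame at time t

loss_exact = causal_cd_loss(exact_flow_map, exact_flow_map, x_t, t, t_prev, context)
loss_identity = causal_cd_loss(identity_student, identity_student, x_t, t, t_prev, context)
print(loss_exact, loss_identity)
```

In this toy setting the exact flow map yields a loss that is zero up to floating-point error, while the identity student does not. No full teacher trajectory is stored at any point: each loss evaluation needs only one teacher step.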

Under the frame-wise 2-step setting, Causal Forcing++ outperforms the state-of-the-art 4-step chunk-wise Causal Forcing model. Specifically, it improves VBench Total by 0.1, VBench Quality by 0.3, and VisionReward by 0.335. It also cuts first-frame latency by 50% and reduces Stage 2 training cost by approximately fourfold ($\sim 4\times$). Furthermore, we demonstrate the versatility of the pipeline by extending it to action-conditioned world model generation, following the approach of Genie3.

Project Page: https://github.com/thu-ml/Causal-Forcing and https://github.com/shengshu-ai/minWM

Source: arXiv Generated at: 2026-06-02 00:00:00 UTC