DSA: Dynamic Step Allocation for Fast Autoregressive Video Generation
Abstract:
While video diffusion transformers currently deliver state-of-the-art visual fidelity, their substantial inference costs hinder deployment in real-time scenarios. Although recent distillation techniques have enabled autoregressive video diffusion models with lower latency, they typically rely on a static count of denoising steps for every frame. This rigid approach leads to computational waste on predictable content and insufficient refinement for complex scenes. To address this, we introduce DSA, an adaptive computation framework for autoregressive video diffusion guided by confidence metrics. DSA incorporates a lightweight confidence head, trained in tandem with the generator via a distribution-matching distillation objective, to assess the reliability of per-frame denoising. During inference, this confidence metric dynamically modulates the diffusion step count: straightforward frames are terminated early to enhance speed, whereas intricate frames undergo extended refinement. The proposed method demands no additional video data, avoids heuristic rules, and requires minimal architectural changes. Experimental results indicate that DSA enables real-time autoregressive video generation, achieving 22.63 FPS with sub-second latency on H100 GPUs. It maintains VBench quality scores that are competitive with, or superior to, those of recent autoregressive and bidirectional video diffusion models. These findings highlight confidence-guided adaptive sampling as a viable and efficient strategy for interactive video generation.
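The inference behavior described in the abstract (stop denoising a frame early when the confidence head is satisfied, keep refining otherwise) can be illustrated with a minimal sketch. The abstract does not specify the stopping rule, so everything here is an assumption made for illustration: the threshold test, the `min_steps` and `max_steps` bounds, and the toy scalar "latent" with its stand-in denoiser and confidence functions. None of these names come from the paper.

```python
from typing import Callable, List, Tuple


def denoise_frame_adaptive(
    latent: float,
    denoise_step: Callable[[float, int], float],
    confidence: Callable[[float], float],
    max_steps: int = 8,      # hypothetical upper bound on steps per frame
    min_steps: int = 1,      # hypothetical lower bound before early exit
    threshold: float = 0.9,  # hypothetical confidence level for early exit
) -> Tuple[float, int]:
    """Denoise one frame, stopping once the confidence score is high enough."""
    steps_used = 0
    for step in range(max_steps):
        latent = denoise_step(latent, step)
        steps_used += 1
        # Early exit: an "easy" frame reaches the threshold after few steps.
        if steps_used >= min_steps and confidence(latent) >= threshold:
            break
    return latent, steps_used


def generate_frames(
    initial_latents: List[float],
    denoise_step: Callable[[float, int], float],
    confidence: Callable[[float], float],
    **kwargs,
) -> List[Tuple[float, int]]:
    """Process frames in order; a real autoregressive model would also
    condition each frame on the previously generated ones."""
    return [
        denoise_frame_adaptive(z, denoise_step, confidence, **kwargs)
        for z in initial_latents
    ]


# Toy stand-ins (assumptions): the latent is a scalar residual noise level,
# each step halves it, and confidence rises as the residual shrinks.
def toy_denoise_step(latent: float, step: int) -> float:
    return latent * 0.5


def toy_confidence(latent: float) -> float:
    return 1.0 - min(abs(latent), 1.0)


results = generate_frames([0.1, 1.0], toy_denoise_step, toy_confidence)
steps_per_frame = [steps for _, steps in results]  # [1, 4]
```

In this toy setting the low-noise frame exits after one step while the noisier frame takes four, and a frame that never reaches the threshold is capped at `max_steps`. The average step count, and therefore the throughput, depends on how many frames the confidence score judges to be easy.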
Source: arXiv Generated at: 2026-06-04 00:00:00 UTC






