Flash-GRPO: Efficient Alignment for Video Diffusion via One-Step Policy Optimization
Title: Flash-GRPO: Streamlining Video Diffusion Alignment Through Single-Step Policy Optimization
Abstract: Group Relative Policy Optimization (GRPO) has become a cornerstone for aligning video diffusion models with human preferences, yet it is hindered by a significant computational bottleneck: training a model with 14 billion parameters generally requires hundreds of GPU-days per experiment. Current efficiency strategies lower costs by subsampling training timesteps with a sliding window, but these approaches fundamentally degrade optimization quality, leading to severe instability and an inability to achieve optimal trajectory performance. To address these limitations, we introduce Flash-GRPO, a single-step training framework that delivers better alignment quality than full-trajectory training, even within constrained computational budgets, while significantly boosting training efficiency. Flash-GRPO tackles two primary challenges. First, iso-temporal grouping removes timestep-confounded variance by enforcing prompt-wise temporal consistency, thereby decoupling policy performance from the inherent difficulty of specific timesteps. Second, temporal gradient rectification counteracts the time-dependent scaling factor responsible for highly inconsistent gradient magnitudes across timesteps. In experiments on models ranging from 1.3B to 14B parameters, Flash-GRPO demonstrates consistent stability, substantial training acceleration, and state-of-the-art alignment quality.
Source: arXiv
Generated at: 2026-06-04 00:00:00 UTC
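
The iso-temporal grouping idea in the abstract can be illustrated with a minimal sketch. It assumes that "prompt-wise temporal consistency" means every sample in a prompt's group is trained at the same single timestep, so that group-relative advantages compare samples at equal timestep difficulty. The function names, the uniform timestep sampling, and the mean/std normalization (the usual GRPO form) are illustrative assumptions, not the paper's implementation. Temporal gradient rectification is not sketched, because the abstract does not specify the scaling factor involved.

```python
import random
import statistics


def sample_iso_temporal_timesteps(prompts, group_size, num_timesteps, rng):
    """Draw ONE timestep per prompt and share it across that prompt's group.

    Assumption: uniform sampling over [0, num_timesteps); the actual
    sampling distribution is not stated in the abstract.
    """
    assignment = {}
    for prompt in prompts:
        t = rng.randrange(num_timesteps)
        # Every sample in the group reuses the same timestep t.
        assignment[prompt] = [t] * group_size
    return assignment


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: normalize rewards within a single group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


rng = random.Random(0)
prompts = ["a cat surfing", "a city at night"]  # hypothetical prompts
timesteps = sample_iso_temporal_timesteps(
    prompts, group_size=4, num_timesteps=50, rng=rng
)
# Hypothetical reward-model scores for one prompt's group of 4 videos.
advantages = group_relative_advantages([0.2, 0.5, 0.9, 0.4])
```

Because all samples in a group share a timestep, differences in their rewards cannot be attributed to some samples having been trained at easier or harder timesteps, which is the confound the abstract describes.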




