arXiv

Pave-GRPO: Beyond Instantaneous Guidance through Principled Average Velocity Decomposition

June 2, 2026 · Pengyang Ling, Jiazi Bu, Yujie Zhou, Yibin Wang, Zhenyu Hu, Zihan Zhang, Yi Jin, Huaian Chen, Yuhang Zang

Title: Pave-GRPO: Achieving Superior Alignment via Principled Average Velocity Decomposition

Abstract

Group Relative Policy Optimization (GRPO) has become a prominent strategy for aligning flow-based generative models with human preferences through post-training. Nevertheless, the iterative denoising process inherent to flow models imposes significant computational burdens when producing the group rollouts required for policy-gradient updates. Consequently, current approaches are forced to rely on models with a minimal number of denoising steps. This scarcity of temporal data hampers preference optimization; because reward feedback is confined to only a few stages within each trajectory, the majority of intermediate denoising steps lack direct supervision, thereby diminishing the precision of the alignment.
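For readers unfamiliar with the group-relative part of GRPO, the following is a minimal sketch of how rewards from one group of rollouts are commonly turned into advantages: each reward is standardized against the mean and standard deviation of its own group. The function name `group_relative_advantages`, the `eps` stabilizer, and the use of the population standard deviation are illustrative assumptions, not details taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one rollout group (hypothetical helper).

    Assumption: population standard deviation plus a small eps; actual
    implementations may use a different normalization.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One prompt, four rollouts, one scalar reward per rollout (made-up values).
rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(rewards)

# Rollouts above the group mean get positive advantages, the rest negative.
best_rollout = max(range(len(rewards)), key=lambda i: advantages[i])
```

In a few-step setting, the same per-rollout advantage is only applied at the handful of timesteps the sampler actually visited, which is the sparsity the abstract describes.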

To overcome these limitations, we introduce Pave-GRPO, a method that restructures the GRPO objective using Principled average velocity decomposition. Rather than paying for high-step rollouts, Pave-GRPO retains the efficiency of few-step group sampling: it breaks each coarse transition into an equivalent set of finer sub-trajectories that cover multiple intermediate timesteps. Reward feedback thereby reaches a denser set of timesteps, enabling more comprehensive preference alignment without increasing generation cost.
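The equivalence between one coarse transition and a set of finer sub-trajectories can be illustrated with a toy example. The sketch below assumes a one-dimensional flow dx/dt = -x with the closed-form trajectory x(t) = x0 * exp(-t); this velocity field, the uniform split, and all function names are hypothetical choices for illustration, and the abstract does not specify the paper's exact formulation. The point is only that the displacement of a coarse step (interval length times average velocity) equals the sum of the displacements of its sub-intervals, while the instantaneous velocity at the start of the step gives a different value.

```python
import math


def x_exact(t, x0=1.0):
    """Closed-form trajectory of the toy flow dx/dt = -x (assumed example)."""
    return x0 * math.exp(-t)


def average_velocity(t_start, t_end, x0=1.0):
    """Average velocity along the trajectory over [t_start, t_end]."""
    return (x_exact(t_end, x0) - x_exact(t_start, x0)) / (t_end - t_start)


def split_interval(t_start, t_end, num_sub):
    """Uniformly split one coarse interval into num_sub sub-intervals."""
    edges = [t_start + (t_end - t_start) * k / num_sub for k in range(num_sub + 1)]
    return list(zip(edges[:-1], edges[1:]))


t_a, t_b = 0.0, 1.0

# Displacement of the single coarse step.
coarse_disp = (t_b - t_a) * average_velocity(t_a, t_b)

# Sum of displacements over four finer sub-intervals of the same step.
fine_disp = sum((b - a) * average_velocity(a, b) for a, b in split_interval(t_a, t_b, 4))

# Instantaneous velocity at the start of the step, for contrast: v = -x(t_a).
instant_disp = (t_b - t_a) * (-x_exact(t_a))
```

Here `coarse_disp` and `fine_disp` agree up to floating-point error, whereas `instant_disp` overshoots because the velocity changes along the step.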

This design yields two primary advantages: (i) Zero-cost horizon expansion: by directly reusing the piece-wise group samples and their corresponding rewards, Pave-GRPO substantially widens the effective optimization horizon under a fixed sampling budget. (ii) Comprehensive temporal supervision: by equivalently decomposing an instantaneous velocity target into a multi-timestep ensemble, the method distributes reward signals across more intermediate denoising stages, enabling finer-grained and more thorough preference optimization.
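The bookkeeping behind "zero-cost horizon expansion" can be sketched as follows, under loudly stated assumptions: a uniform split of each coarse interval, a single scalar advantage per rollout, and a hypothetical helper named `expand_supervision`. None of these details come from the paper; the sketch only shows how the number of supervised timesteps can grow while the number of sampled rollouts stays fixed.

```python
def expand_supervision(coarse_times, advantage, splits):
    """Hypothetical helper: split each coarse interval into `splits` equal
    sub-intervals, each reusing the rollout's already-computed advantage.

    Returns a list of (t_start, t_end, advantage) training targets.
    """
    targets = []
    for t_a, t_b in zip(coarse_times[:-1], coarse_times[1:]):
        width = (t_b - t_a) / splits
        for k in range(splits):
            targets.append((t_a + k * width, t_a + (k + 1) * width, advantage))
    return targets


# A 4-step sampler schedule from noise (t=1) to data (t=0); values are made up.
coarse_times = [1.0, 0.75, 0.5, 0.25, 0.0]
rollout_advantage = 0.8  # assumed to come from group-relative normalization

targets = expand_supervision(coarse_times, rollout_advantage, splits=4)

# 4 sampled steps now supervise 16 sub-intervals, with no extra rollouts.
num_supervised = len(targets)
```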

Extensive experimental results demonstrate that Pave-GRPO effectively enhances preference alignment across various reward configurations, delivering significant overall performance improvements.
