Drifting Preference Optimization for One-Step Generative Models
Abstract:
Deterministic one-step text-to-image models are attractive for deployment because they produce an image in a single forward pass. Aligning them through preference fine-tuning is difficult, however. Conventional alignment techniques typically depend on policy likelihoods, denoising paths, differentiable reward gradients, or test-time optimization, and each of these is either unavailable for a deterministic single-pass generator or adds cost at training or inference time. To address these limitations, we introduce Drifting Preference Optimization (DrPO), an online preference fine-tuning approach tailored to deterministic one-step generators.
For each prompt, DrPO generates candidate images with the current model and ranks them by a target reward. It then uses both high- and low-scoring samples to construct an update direction in feature space. This direction combines a non-parametric dipole preference field with a reference drift derived from the frozen base generator, and the generator is optimized by regressing onto a detached feature-space target. Because the target reward is used only for ranking, DrPO can train with large-scale, black-box, or non-differentiable rewards while keeping inference to a single generator call.
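The abstract does not spell out the form of the preference field or the drift, so the following is only a minimal sketch of the idea under stated assumptions, not the authors' implementation. It assumes a softmax kernel over squared feature distances, a top-half versus bottom-half split of the ranked candidates, and a reference drift that pulls each feature toward the frozen base generator's feature. The names `drpo_targets`, `softmax_weights`, `step`, `ref_weight`, and `temperature` are hypothetical.

```python
import math

def softmax_weights(x, points, temperature):
    """Kernel weights over `points`, largest for the points closest to `x`."""
    logits = [-sum((p - a) ** 2 for p, a in zip(pt, x)) / temperature
              for pt in points]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_mean(points, weights):
    """Weighted average of a list of equal-length feature vectors."""
    dim = len(points[0])
    return [sum(w * pt[d] for w, pt in zip(weights, points)) for d in range(dim)]

def drpo_targets(feats, ref_feats, rewards,
                 step=0.5, ref_weight=0.1, temperature=1.0):
    """Build fixed regression targets for one prompt's candidates.

    feats:     feature vectors of the current model's candidates
    ref_feats: features of the frozen base generator for the same inputs
    rewards:   black-box scores, used ONLY to rank the candidates
    All hyperparameters and the kernel form are assumptions of this sketch.
    """
    if len(feats) < 2:
        raise ValueError("need at least two candidates to rank")
    # Rank by reward; only the ordering matters, never the reward values.
    order = sorted(range(len(feats)), key=lambda i: rewards[i])
    half = len(order) // 2
    losers = [feats[i] for i in order[:half]]
    winners = [feats[i] for i in order[-half:]]

    targets = []
    for x, ref in zip(feats, ref_feats):
        pull = weighted_mean(winners, softmax_weights(x, winners, temperature))
        push = weighted_mean(losers, softmax_weights(x, losers, temperature))
        # Dipole term: toward nearby winners, away from nearby losers.
        # Drift term (assumed form): stay close to the frozen base features.
        targets.append([
            xi + step * ((pu - ps) + ref_weight * (r - xi))
            for xi, pu, ps, r in zip(x, pull, push, ref)
        ])
    return targets

def regression_loss(feats, targets):
    """Mean squared error between features and their fixed targets."""
    n = sum(len(x) for x in feats)
    return sum((a - b) ** 2
               for x, t in zip(feats, targets)
               for a, b in zip(x, t)) / n
```

In an autograd framework the targets would be wrapped in a stop-gradient operation, so the loss only moves the generator's features toward them. Since the rewards enter only through `sorted`, any strictly increasing transformation of the scores yields the same targets, which is why no reward gradient is needed in this sketch.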
We evaluate DrPO on SD-Turbo and SDXL-Turbo across several benchmarks and target rewards, including HPSv3 and GenEval. DrPO improves alignment over reward-gradient-free one-step preference baselines. Because it requires no backpropagation through the reward model, DrPO also reduces HPSv3 training computation by $3.51\times$ under a matched effective batch size. Preliminary offline experiments further suggest that sample-based gradient synthesis may apply beyond online reward ranking.
Source: arXiv Generated at: 2026-06-02 00:00:00 UTC





