arXiv

Are we really tilting? The mechanics of reward guidance in flow and diffusion models

June 3, 2026 · Sanjit Dandapanthula, Nicholas M. Boffi

Title: Is the Tilt Genuine? Unraveling the Mechanics of Reward Guidance in Flow and Diffusion Models

Original: arXiv:2606.02884v1 (announce type: cross). Abstract: Reward guidance algorithms steer a learned generative process toward the reward-tilted measure at inference time. While empirically powerful, these methods are prone to reward hacking: the guided model over-optimizes the reward at the cost of fidelity to the learned distribution. Prior work has attributed this to the complexity of neural reward functions or implicit biases in diffusion training, but its fundamental origins remain poorly understood. We show that reward hacking arises from an approximation made in most practical implementations of reward-guided diffusion -- finite-particle plug-in estimation of the Doob h-function -- even in the simplest non-trivial settings of Gaussian and Gaussian mixture targets with quadratic rewards. In closed form, we isolate two distinct failure modes of the plug-in estimator: it leads to reward hacking within each mode and it cannot select high-reward modes. We propose a closed-form reward damping schedule that corrects the within-mode bias with no additional compute, and clarify the role of best-of-n sampling in compensating for the mode selection failure. Experiments on Gaussian mixture targets, a 2D checkerboard, and FLUX.1 text-to-image generation confirm that our theoretical insights carry over to practical settings.

Rewritten: Reward guidance algorithms steer a learned generative process toward the reward-tilted measure at inference time. Despite their empirical success, these methods are prone to "reward hacking": the guided model over-optimizes the reward at the expense of fidelity to the learned distribution. Earlier work has attributed this to the complexity of neural reward functions or to implicit biases in diffusion training, but the root cause has remained poorly understood. The paper shows that reward hacking arises from an approximation used in most practical implementations of reward-guided diffusion, namely finite-particle plug-in estimation of the Doob h-function, and that it appears even in the simplest non-trivial settings: Gaussian and Gaussian mixture targets with quadratic rewards. Working in closed form, the authors isolate two distinct failure modes of the plug-in estimator: it causes reward hacking within each mode, and it cannot select high-reward modes. They propose a closed-form reward damping schedule that corrects the within-mode bias at no additional compute cost, and they clarify how best-of-n sampling compensates for the mode selection failure. Experiments on Gaussian mixture targets, a 2D checkerboard, and FLUX.1 text-to-image generation confirm that the theoretical insights carry over to practical settings.
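
To make the Gaussian, quadratic-reward setting concrete, here is a minimal one-dimensional sketch. It is not taken from the paper. It assumes a reward of the form r(x) = -(lam / 2) * (x - a)^2 and a Gaussian conditional law N(m, s2) for the clean sample given the current state; the names `lam`, `a`, `m`, `s2` and all function names are illustrative assumptions. The sketch compares the exact value of log E[exp(r(X))], which is a standard Gaussian integral, with a single-point plug-in that evaluates the reward only at the conditional mean. In this toy case the plug-in gradient is larger than the exact one by a factor of 1 + lam * s2, which is one simple way to see how a plug-in estimate can push too hard toward the reward. The paper's own estimator and damping schedule may differ from this illustration.

```python
import numpy as np


def tilted_gaussian(mu, sigma2, lam, a):
    """Mean and variance of the normalized density p(x) * exp(r(x)),
    where p = N(mu, sigma2) and r(x) = -(lam / 2) * (x - a)**2.
    The product of two Gaussian-shaped factors is again Gaussian."""
    precision = 1.0 / sigma2 + lam
    mean = (mu / sigma2 + lam * a) / precision
    return mean, 1.0 / precision


def log_h_exact(m, s2, lam, a):
    """Closed form of log E[exp(r(X))] for X ~ N(m, s2)."""
    return -0.5 * np.log1p(lam * s2) - 0.5 * lam * (m - a) ** 2 / (1.0 + lam * s2)


def log_h_plugin(m, lam, a):
    """Single-point plug-in (an assumption for illustration):
    the reward evaluated at the conditional mean m only."""
    return -0.5 * lam * (m - a) ** 2


def grad_log_h_exact(m, s2, lam, a):
    """Derivative of log_h_exact with respect to m."""
    return -lam * (m - a) / (1.0 + lam * s2)


def grad_log_h_plugin(m, lam, a):
    """Derivative of log_h_plugin with respect to m."""
    return -lam * (m - a)


# Hypothetical parameter values chosen only for the demonstration.
m, s2, lam, a = 0.0, 1.0, 2.0, 1.5

# Monte Carlo check of the closed form, using a fixed seed for repeatability.
rng = np.random.default_rng(0)
x = rng.normal(m, np.sqrt(s2), size=200_000)
mc_log_h = np.log(np.mean(np.exp(-0.5 * lam * (x - a) ** 2)))

print("log h (Monte Carlo):", mc_log_h)
print("log h (closed form):", log_h_exact(m, s2, lam, a))
print("log h (plug-in)    :", log_h_plugin(m, lam, a))
print("gradient ratio plug-in / exact:",
      grad_log_h_plugin(m, lam, a) / grad_log_h_exact(m, s2, lam, a))
```

The gradient ratio depends only on `lam * s2`, so in this toy model the mismatch is largest when the conditional variance is large and vanishes as `s2` goes to zero.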

Source: arXiv · Generated at: 2026-06-03 00:00:00 UTC