Success Conditioning as Policy Improvement: The Optimization Problem Solved by Imitating Success
Title: Policy Enhancement via Success Conditioning: Resolving the Optimization Puzzle of Imitating Success
Abstract:
Success conditioning is a widely used technique for improving policies: collect trajectories, keep those that reach a target outcome, and train the policy to imitate the actions taken in those successful runs. Although the idea appears under several names, including rejection sampling with supervised fine-tuning (SFT), goal-conditioned reinforcement learning (RL), and Decision Transformers, the optimization problem it actually solves has remained unclear.
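The collect, filter, imitate loop described above can be sketched for a single state with binary success. This is a minimal illustration, not the paper's implementation: the function name `success_conditioned_policy`, the toy action set, and the success probabilities are all hypothetical, and tabular "imitation" is taken to mean fitting empirical action frequencies.

```python
import random
from collections import Counter


def success_conditioned_policy(policy, success_prob, n_rollouts=200_000, seed=0):
    """Sample one-step rollouts, keep the successful ones, imitate their actions.

    policy:       dict action -> probability under the current policy
    success_prob: dict action -> probability of success after that action
                  (hypothetical toy environment, assumed for illustration)
    """
    rng = random.Random(seed)
    actions = list(policy)
    weights = [policy[a] for a in actions]
    kept = Counter()
    for _ in range(n_rollouts):
        a = rng.choices(actions, weights=weights)[0]  # act with the current policy
        if rng.random() < success_prob[a]:            # binary success outcome
            kept[a] += 1                              # keep only successful rollouts
    total = sum(kept.values())
    # In the tabular case, maximum-likelihood imitation of the kept actions
    # reduces to their empirical frequencies.
    return {a: kept[a] / total for a in actions}


policy = {"left": 0.5, "right": 0.3, "stay": 0.2}
success_prob = {"left": 0.2, "right": 0.6, "stay": 0.4}
new_policy = success_conditioned_policy(policy, success_prob)
```

In this toy setting the fitted policy approaches the old policy reweighted by each action's success probability and renormalized, so actions that succeed more often gain probability mass.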
In this work, we show that success conditioning exactly solves a trust-region optimization problem: it maximizes policy improvement subject to a $\chi^2$ divergence constraint whose radius is set automatically by the data. This yields an identity: at every state, the relative policy improvement, the magnitude of the policy change, and a quantity we call "action-influence" (which measures how random variation in action choices affects success probabilities) are exactly equal. Success conditioning therefore acts as a conservative improvement operator. Exact success conditioning cannot degrade performance or induce dangerous distribution shift; when it fails, it fails observably, by barely changing the policy at all. We also extend the theory to the common practice of return thresholding, showing that it can amplify improvement, but at the risk of misalignment with the primary objective.
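The three-way identity can be checked numerically in the simplest case, a single state with binary success. The sketch below rests on assumptions the abstract does not spell out: exact success conditioning is taken to reweight the old policy by each action's success probability `q[a]`, the policy change is measured as the $\chi^2$ divergence of the new policy from the old one, and action-influence is taken to be the variance of `q` under the old policy divided by the squared success rate. All names and numbers are hypothetical.

```python
def success_condition(policy, q):
    """Exact one-step success conditioning: new(a) = old(a) * q(a) / V."""
    v = sum(policy[a] * q[a] for a in policy)  # success rate of the old policy
    return {a: policy[a] * q[a] / v for a in policy}, v


def chi2_divergence(new, old):
    """Chi-squared divergence of `new` from `old` (requires old[a] > 0)."""
    return sum((new[a] - old[a]) ** 2 / old[a] for a in old)


policy = {"left": 0.5, "right": 0.3, "stay": 0.2}
q = {"left": 0.2, "right": 0.6, "stay": 0.4}  # assumed success probabilities

new_policy, v_old = success_condition(policy, q)
v_new = sum(new_policy[a] * q[a] for a in q)

# 1. Relative improvement in success rate.
relative_improvement = (v_new - v_old) / v_old
# 2. Size of the policy change.
policy_change = chi2_divergence(new_policy, policy)
# 3. Assumed stand-in for action-influence: normalized variance of q under
#    the old policy, i.e. how much random action choice moves success.
variance = sum(policy[a] * (q[a] - v_old) ** 2 for a in policy)
action_influence = variance / v_old ** 2

print(relative_improvement, policy_change, action_influence)
```

Under these assumptions the three printed values coincide up to floating-point error. If every action had the same success probability, all three would be zero and the policy would not move, which matches the observable failure mode described above.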
Source: arXiv
Generated at: 2026-06-04 00:00:00 UTC




