Zero-Shot Off-Policy Learning
Abstract:
Off-policy learning aims to extract an optimal policy from a static dataset of past interactions, a goal complicated by distributional shift and value-function overestimation bias. These difficulties are exacerbated in zero-shot reinforcement learning, where an agent must adapt to novel tasks at test time without further training, relying solely on reward-free data acquired during the initial training phase. To tackle the off-policy problem in this zero-shot setting, we establish a theoretical link between successor measures and stationary density ratios. Building on this link, our algorithm infers optimal importance sampling ratios, performing a stationary distribution correction toward the optimal policy for any given task in real time. We evaluate our approach on motion tracking benchmarks on SMPL Humanoid, continuous control environments in ExoRL, and long-horizon tasks in OGBench. The method integrates directly into forward-backward representation frameworks, enabling fast, training-free adaptation to new tasks. Overall, this work connects off-policy learning and zero-shot adaptation, to the benefit of both fields.
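The abstract does not spell out the link between successor measures and stationary density ratios. One standard identity that connects the two objects is that the discounted occupancy of a policy equals (1 - gamma) times its successor measure averaged over initial states, so dividing by the data distribution gives an importance ratio. The sketch below checks this on a tiny tabular chain; it is an illustration under stated assumptions, not the paper's algorithm. The transition matrix, initial distribution, data distribution, and reward are all hypothetical, and states stand in for state-action pairs.

```python
# Tabular sanity check (pure Python): successor measure -> occupancy -> density ratio.
# All numbers below are hypothetical and chosen only for illustration.

GAMMA = 0.9

# Markov chain induced by some fixed policy: P[i][j] = Pr(next = j | current = i).
P = [
    [0.1, 0.9, 0.0],
    [0.0, 0.2, 0.8],
    [0.5, 0.0, 0.5],
]
MU0 = [1.0, 0.0, 0.0]     # initial distribution
RHO = [0.5, 0.3, 0.2]     # dataset (behavior) distribution, strictly positive
REWARD = [0.0, 0.0, 1.0]  # task reward, revealed only at test time


def successor_measure(P, gamma, n_iters=500):
    """Approximate M = sum_t gamma^t P^t by iterating M <- I + gamma * P @ M."""
    n = len(P)
    eye = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    M = [row[:] for row in eye]
    for _ in range(n_iters):
        M = [[eye[i][j] + gamma * sum(P[i][k] * M[k][j] for k in range(n))
              for j in range(n)] for i in range(n)]
    return M


def discounted_occupancy(M, mu0, gamma):
    """d[j] = (1 - gamma) * sum_i mu0[i] * M[i][j]; sums to 1."""
    n = len(mu0)
    return [(1.0 - gamma) * sum(mu0[i] * M[i][j] for i in range(n)) for j in range(n)]


def direct_return(P, mu0, reward, gamma, horizon=500):
    """Discounted return computed by rolling the state distribution forward."""
    n = len(mu0)
    p, total, discount = mu0[:], 0.0, 1.0
    for _ in range(horizon):
        total += discount * sum(p[j] * reward[j] for j in range(n))
        p = [sum(p[i] * P[i][j] for i in range(n)) for j in range(n)]
        discount *= gamma
    return total


M = successor_measure(P, GAMMA)
d = discounted_occupancy(M, MU0, GAMMA)
w = [d[j] / RHO[j] for j in range(len(RHO))]  # stationary density ratio d / rho

# Importance-weighted reward under the data distribution, rescaled to a return.
v_ratio = sum(RHO[j] * w[j] * REWARD[j] for j in range(len(RHO))) / (1.0 - GAMMA)
v_direct = direct_return(P, MU0, REWARD, GAMMA)
```

Here `v_ratio` reweights rewards sampled from the data distribution, while `v_direct` evaluates the chain itself; the two agree up to truncation error. In forward-backward representations the successor measure is typically modeled as F(s, a, z)^T B(s') rho(ds'), in which case the same ratio reduces to (1 - gamma) times an inner product between the averaged forward embedding and B(s'). Whether the paper uses exactly this form is not stated in the abstract.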
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC






