Coherent Off-Policy Improvement of Large Behavior Models with Learned Rewards
Title: Enhancing Large Behavior Models via Coherent Off-Policy Improvement Using Learned Rewards
Abstract:
Behavioral cloning offers a scalable way to distill expert demonstrations into large generative models, yielding robust policies for robotic control, especially in complex dexterous manipulation. Reinforcement learning (RL) can refine these policies further through additional interaction, but a key question remains open: is RL more sample-efficient than continuing to collect human demonstrations? Prior work has addressed scalability by applying RL to a compact residual policy that corrects a larger pretrained model. However, in sparse-reward tasks, standard RL algorithms often struggle to optimize behavior efficiently.
To address this, we investigate inverse reinforcement learning (IRL), which learns a dense reward function from expert data and may thereby ease RL finetuning. Specifically, we employ coherent imitation learning, an IRL approach whose theoretically grounded reward formulation is designed for improving behavioral cloning (BC) policies. Our results show that this IRL approach maintains or improves the performance of the pi-0.5 model on all six sparse-reward manipulation tasks. It also achieves a success rate of at least 90% on five of the six complex manipulation tasks, surpassing RL baselines that rely on sparse rewards. Because the pretrained policy that initializes finetuning is guaranteed to be optimal with respect to the initial reward and critic functions, our approach avoids the performance degradation typically observed early in RL finetuning, which accelerates improvement.
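The claim that the pretrained policy starts out optimal under the initial reward can be illustrated with a toy sketch. The abstract does not give the reward formula, so the log-ratio form used here, r(s, a) = alpha * (log q(a|s) - log p(a|s)) with BC policy q, prior policy p, and temperature alpha, is an assumption borrowed from the coherent soft imitation learning literature. The single-state, discrete-action setting and all function names are hypothetical and chosen only for illustration.

```python
import math

def coherent_reward(bc_policy, prior, alpha):
    """Assumed log-ratio reward: alpha * (log q(a) - log p(a)) per action."""
    return [alpha * (math.log(q) - math.log(p)) for q, p in zip(bc_policy, prior)]

def kl_regularized_policy(reward, prior, alpha):
    """Closed-form KL-regularized optimum in one state: pi(a) ∝ p(a) * exp(r(a) / alpha)."""
    weights = [p * math.exp(r / alpha) for p, r in zip(prior, reward)]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical three-action problem with a uniform prior.
bc_policy = [0.7, 0.2, 0.1]
prior = [1.0 / 3.0] * 3
alpha = 0.5

reward = coherent_reward(bc_policy, prior, alpha)
recovered = kl_regularized_policy(reward, prior, alpha)

# Under the assumed reward, the regularized optimum coincides with the BC policy.
print(all(abs(r - q) < 1e-9 for r, q in zip(recovered, bc_policy)))
```

Under these assumptions, the KL-regularized optimum for the derived reward reproduces the BC policy, so finetuning would begin from a policy that is already consistent with its own reward rather than from one the early updates must first undo.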
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC





