Regularized Offline Policy Optimization with Posterior Hybrid Bayesian Belief
Abstract
Offline reinforcement learning (RL) aims to improve decision-making policies using only previously collected data. A central challenge is managing epistemic uncertainty, which arises at two levels: limited data coverage at the sample level, and the difficulty of identifying the transition dynamics from a finite dataset at the model level. Bayesian RL offers a unified way to quantify both by treating the dynamics model as a random variable and maintaining a belief over it. Despite this strong theoretical foundation, policy optimization in Bayesian RL is computationally expensive, largely because the objective is a composite one that contains an expectation over the belief. Existing approaches fall short: some rely on search-based methods that scale poorly, while others impose restrictive assumptions on the posterior that undermine the flexibility of Bayesian RL.
To address these challenges, we introduce Posterior Hybrid Bayesian Belief (PhyB), which reformulates the expectation as a convex combination over a selected subset of dynamics models. Our theoretical analysis shows that the approximation error introduced by this reformulation is bounded. Building on PhyB, we develop an iterative regularized policy optimization algorithm with guarantees of monotonic improvement and convergence that hold independently of any specific metric. Experiments show that PhyB achieves state-of-the-art results across a range of standard benchmarks.
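
As a rough illustration of the convex-combination idea only, the sketch below replaces an expectation over dynamics models with a weighted sum over a few retained models. The abstract does not specify how PhyB selects the subset or sets the weights, so the function name `convex_combination`, the per-model returns, and the weights are all hypothetical placeholders rather than the paper's construction.

```python
import math


def convex_combination(values, weights):
    """Return sum_k weights[k] * values[k] for a valid convex combination.

    values:  hypothetical per-model quantities, e.g. the return of a fixed
             policy under each retained dynamics model.
    weights: non-negative coefficients that must sum to 1.
    """
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    if not math.isclose(sum(weights), 1.0):
        raise ValueError("weights must sum to 1")
    return sum(w * v for w, v in zip(weights, values))


# Assumed example: returns of one policy under three retained dynamics models.
returns = [1.0, 2.0, 4.0]
# Assumed weights; how PhyB actually chooses them is not stated in the abstract.
weights = [0.5, 0.25, 0.25]

estimate = convex_combination(returns, weights)
```

Because the weights are non-negative and sum to one, the estimate always lies between the smallest and largest per-model value, which is what makes a bounded approximation error plausible in this kind of scheme.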
Source: arXiv




