Minimax Optimal Strategy for Delayed Observations in Online Reinforcement Learning
Title: Achieving Minimax Optimality in Online Reinforcement Learning Under Delayed Observations
Abstract: We study reinforcement learning with delayed state observations, where the agent observes the current state only after a random number of time steps has elapsed. We propose an algorithm that combines the upper confidence bound technique with an augmentation method. For tabular Markov decision processes (MDPs), we prove a regret bound of $\tilde{\mathcal{O}}(H \sqrt{D_{\max} SAK})$, where $S$ and $A$ are the sizes of the state and action spaces, $H$ is the time horizon, $K$ is the number of episodes, and $D_{\max}$ is the maximum delay. We show that this bound is optimal by deriving a lower bound that matches it up to logarithmic factors. Our analysis treats the problem as a special case of a broader class of MDPs whose transition dynamics decompose into a known component and an unknown but structured component. We also establish general results for this abstract setting, which may be of independent interest.
Source: arXiv
Generated at: 2026-06-03 00:00:00 UTC
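The abstract does not spell out the augmentation method. As a rough illustration only, the Python sketch below shows one common way to augment a delayed-observation state: pair the most recently observed state with the actions taken since that observation. The class name `AugmentedState`, its methods, and the assumption that observations arrive in order (each one revealing the state reached after the oldest pending action) are hypothetical and are not taken from the paper.

```python
from collections import deque
from typing import Deque, Hashable, Tuple


class AugmentedState:
    """Hypothetical helper: tracks (last observed state, pending actions).

    Assumption: delayed observations arrive in order, and each one reveals
    the state reached after the oldest action that is still pending.
    """

    def __init__(self, initial_state: Hashable) -> None:
        self.last_observed: Hashable = initial_state
        self.pending: Deque[Hashable] = deque()

    def act(self, action: Hashable) -> None:
        # An action is taken before its resulting state has been observed.
        self.pending.append(action)

    def observe(self, state: Hashable) -> None:
        # A delayed observation resolves the oldest pending action.
        if not self.pending:
            raise ValueError("no pending action to resolve")
        self.pending.popleft()
        self.last_observed = state

    def key(self) -> Tuple[Hashable, Tuple[Hashable, ...]]:
        # Hashable augmented state, usable as an index into a tabular
        # value or visit-count table.
        return (self.last_observed, tuple(self.pending))


aug = AugmentedState("s0")
aug.act("a0")
aug.act("a1")
key_before = aug.key()  # two actions taken, nothing observed yet
aug.observe("s1")       # the state reached after "a0" finally arrives
key_after = aug.key()
```

Under these assumptions the number of pending actions never exceeds the current delay, so the augmented state space grows with the delay. This is only meant to give some intuition for why a delay term such as $D_{\max}$ can appear in a regret bound; it is not the paper's analysis.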



