Data- and Variance-dependent Regret Bounds for Online Tabular MDPs
Title: Regret Bounds for Online Tabular MDPs That Depend on Data and Variance
Abstract:
This study investigates online episodic tabular Markov decision processes (MDPs) with known transitions. We introduce "best-of-both-worlds" algorithms that achieve refined regret bounds: data-dependent bounds in the adversarial regime and variance-dependent bounds in the stochastic regime. To measure the complexity of an MDP in the adversarial regime, we use a first-order quantity together with novel data-dependent measures, namely a second-order quantity and a path-length measure. In the stochastic regime, we use variance-based measures.
Our algorithms, grounded in optimistic follow-the-regularized-leader with log-barrier regularization, are developed through two distinct approaches: global optimization and policy optimization. In the global optimization framework, the proposed methods attain first-order, second-order, and path-length regret bounds within the adversarial regime. For the stochastic regime, these algorithms provide a variance-aware gap-independent bound, as well as a variance-aware gap-dependent bound that scales polylogarithmically with the episode count.
Alternatively, the policy optimization approach leverages a novel optimistic $Q$-function estimator to achieve similar data- and variance-dependent adaptivity, though at the cost of an extra multiplicative factor of the episode horizon. Furthermore, we derive regret lower bounds expressed in terms of data-dependent complexity measures for the adversarial regime and a variance measure for the stochastic regime. These lower bounds suggest that the regret upper bounds obtained via the global optimization approach are nearly optimal.
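
As a rough illustration of the optimistic follow-the-regularized-leader template with log-barrier regularization mentioned above, the sketch below runs the update on a plain probability simplex rather than on the occupancy measures of an MDP. It is not the algorithm of this work: it assumes full-information losses, a fixed learning rate `eta`, and the previous loss vector as the optimistic prediction (a common choice when targeting path-length bounds). The names `log_barrier_ftrl_step` and `run_optimistic_ftrl` are hypothetical.

```python
def log_barrier_ftrl_step(cum_loss, hint, eta):
    """Minimise <x, cum_loss + hint> - (1/eta) * sum_i log(x_i) over the simplex.

    The optimality conditions give x_i = 1 / (eta * (g_i + lam)), where g is the
    optimistic cumulative loss and lam is the multiplier that makes x sum to 1.
    """
    g = [c + m for c, m in zip(cum_loss, hint)]
    k = len(g)
    g_min = min(g)
    # The sum of x_i is decreasing in lam; it is >= 1 at lo and <= 1 at hi.
    lo = -g_min + 1.0 / eta
    hi = -g_min + k / eta
    for _ in range(100):  # bisection on the multiplier
        mid = 0.5 * (lo + hi)
        total = sum(1.0 / (eta * (gi + mid)) for gi in g)
        if total > 1.0:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    x = [1.0 / (eta * (gi + lam)) for gi in g]
    s = sum(x)
    return [xi / s for xi in x]  # renormalise to absorb bisection error


def run_optimistic_ftrl(losses, eta=0.5):
    """Play optimistic FTRL against a sequence of loss vectors.

    Assumption for illustration: the hint for round t is the loss of round t-1.
    Returns the list of played distributions and the learner's total loss.
    """
    k = len(losses[0])
    cum_loss = [0.0] * k
    hint = [0.0] * k
    plays = []
    total = 0.0
    for loss in losses:
        x = log_barrier_ftrl_step(cum_loss, hint, eta)
        plays.append(x)
        total += sum(xi * li for xi, li in zip(x, loss))
        cum_loss = [c + l for c, l in zip(cum_loss, loss)]
        hint = list(loss)  # optimistic prediction for the next round
    return plays, total


if __name__ == "__main__":
    plays, total = run_optimistic_ftrl([[1.0, 0.0, 0.0]] * 50)
    print(plays[-1], total)
```

When the loss sequence changes slowly, the hint is close to the next loss, which is the intuition behind path-length style bounds. The MDP setting additionally requires constraining the iterate to valid occupancy measures, which this sketch does not attempt.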
Source: arXiv
Generated at: 2026-06-03 00:00:00 UTC



