Minimax-Optimal Policy Regret in Partially Observable Markov Games
Title: Achieving Minimax-Optimal Policy Regret in Partially Observable Markov Games
Abstract
This work investigates sequential decision-making within partially observable environments characterized by strategic and adaptive adversaries, a scenario formalized through partially observable Markov games (POMGs). The primary difficulty lies in inferring latent dynamics from incomplete observations while contending with an opponent whose actions are contingent upon the learner’s strategy; this dependency renders conventional regret metrics insufficient. We demonstrate that an epoch-based optimistic maximum-likelihood algorithm can attain a policy regret of $\tilde{O}(\sqrt{T})$ for fixed problem parameters. This bound exhibits explicit sensitivity to the horizon length, adversary memory, confidence radius, and the aggregate Eluder dimension associated with the observable-operator class. By employing confidence sets accumulated from historical data to select a single policy for each geometrically expanding epoch, the algorithm ensures that the computational cost of comparing adversary responses across different policies remains logarithmic with respect to $T$. Furthermore, we establish a lower bound that aligns with the $\sqrt{T}$ and aggregate-Eluder-dimension dependencies, modulo problem-specific and logarithmic factors. Lastly, the proposed framework is generalized to accommodate horizon-adaptive guarantees and adversaries possessing geometric fading memory.
Source: arXiv Generated at: 2026-06-02 00:00:00 UTC





