Fairness in two-player zero-sum games with bandit feedback
Title: Ensuring Fairness in Two-Player Zero-Sum Games with Bandit Feedback
Abstract:
This paper studies two-player zero-sum games (TPZSGs) under bandit feedback, subject to fairness constraints requiring that each of the $m$ actions be selected with probability at least $\alpha/m$. Previous instance-dependent analyses have focused on pure Nash equilibria, whereas fairness constraints typically produce mixed equilibria, which are a harder target for learning algorithms.
Our primary technical contribution is a reparametrization technique. We show that any fair strategy $p$ can be written as $p = (\alpha/m)\mathbf{1} + (1-\alpha)\widetilde{p}$, where $\widetilde{p}$ lies in the simplex $\Delta_m$. Substituting this form into the payoff and using $\widetilde{p}^{\top}\mathbf{1} = 1$ gives $p^{\top}Aq = \widetilde{p}^{\top}\widetilde{A} q$, where $\widetilde{A} := (1-\alpha)A + \alpha\mathbf{1} c^{\top}$ is a fair payoff matrix derived from the original matrix $A$. Here, $c$ is the column-mean vector of $A$, with entries $c_j = \tfrac{1}{m}\sum_i A(i,j)$.
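The reparametrization identity can be checked numerically. The following is a minimal sketch, not code from the paper: the helper names (`fair_payoff_matrix`, `bilinear`, `random_simplex_point`), the matrix size, the value $\alpha = 0.3$, and the random payoffs are all illustrative assumptions.

```python
import random

def fair_payoff_matrix(A, alpha):
    """Return (1 - alpha) * A + alpha * 1 c^T, where c holds the column means of A."""
    m, n = len(A), len(A[0])
    c = [sum(A[i][j] for i in range(m)) / m for j in range(n)]
    return [[(1 - alpha) * A[i][j] + alpha * c[j] for j in range(n)]
            for i in range(m)]

def bilinear(p, A, q):
    """Compute p^T A q for plain Python lists."""
    return sum(p[i] * A[i][j] * q[j]
               for i in range(len(p)) for j in range(len(q)))

def random_simplex_point(k, rng):
    """Draw a point of the simplex by normalizing positive weights."""
    w = [rng.random() + 1e-9 for _ in range(k)]
    total = sum(w)
    return [x / total for x in w]

rng = random.Random(0)                      # fixed seed, illustrative only
m, n, alpha = 4, 3, 0.3                     # assumed sizes and fairness level
A = [[rng.random() for _ in range(n)] for _ in range(m)]

p_tilde = random_simplex_point(m, rng)      # free part of the strategy
q = random_simplex_point(n, rng)            # opponent's mixed strategy
# Fair strategy: uniform floor alpha/m plus (1 - alpha) times the free part.
p = [alpha / m + (1 - alpha) * x for x in p_tilde]

lhs = bilinear(p, A, q)                                   # p^T A q
rhs = bilinear(p_tilde, fair_payoff_matrix(A, alpha), q)  # p~^T A~ q
gap = abs(lhs - rhs)
print(f"p^T A q = {lhs:.6f}, p~^T A~ q = {rhs:.6f}, gap = {gap:.2e}")
```

The two sides agree up to floating-point rounding, and every entry of `p` is at least `alpha / m` by construction.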
This transformation establishes an equivalence between the fair game on $A$ and a standard zero-sum game on $\widetilde{A}$. Consequently, properties such as equilibrium existence, Karush-Kuhn-Tucker (KKT) conditions, and linear programming (LP) basis stability for the fair game can be deduced from classical results applied to $\widetilde{A}$. We derive the fair minimax value, the fair Nash equilibrium, and fair regret metrics. Furthermore, we provide a dual representation that quantifies the "price of fairness," showing it is bounded by $\alpha(1-1/m)$ and becomes zero if the unconstrained equilibrium already possesses full support.
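The price-of-fairness bound can be illustrated on small games. The sketch below is not the paper's method and rests on several assumptions that the abstract does not state: the row player maximizes and is the only constrained player, payoffs lie in $[0,1]$, the price of fairness is the unconstrained value minus the fair value, and the game has exactly two columns so that a grid search over $q$ suffices. The function names are hypothetical.

```python
def fair_value_two_columns(A, alpha, grid=1000):
    """Approximate min_q max_{fair p} p^T A q for an m x 2 matrix A.

    For a fixed q, the best fair response puts the free mass (1 - alpha)
    on the best row, so the inner maximum equals
    alpha * c^T q + (1 - alpha) * max_i (A q)_i.
    """
    m = len(A)
    c = [sum(row[j] for row in A) / m for j in range(2)]  # column means
    best = float("inf")
    for k in range(grid + 1):
        q = (k / grid, 1 - k / grid)
        row_payoffs = [row[0] * q[0] + row[1] * q[1] for row in A]
        inner = (alpha * (c[0] * q[0] + c[1] * q[1])
                 + (1 - alpha) * max(row_payoffs))
        best = min(best, inner)
    return best

def price_of_fairness(A, alpha):
    """Unconstrained value (alpha = 0) minus the fair value (assumed definition)."""
    return fair_value_two_columns(A, 0.0) - fair_value_two_columns(A, alpha)

alpha = 0.4
dominant_row = [[1.0, 1.0], [0.0, 0.0]]   # pure equilibrium on row 0
matching = [[1.0, 0.0], [0.0, 1.0]]       # full-support equilibrium (1/2, 1/2)

pof_dominant = price_of_fairness(dominant_row, alpha)
pof_matching = price_of_fairness(matching, alpha)
print(f"dominant-row game: {pof_dominant:.4f}")
print(f"matching game:     {pof_matching:.4f}")
```

Under these assumptions, the dominant-row game attains the bound $\alpha(1-1/m) = 0.2$ for $m = 2$, while the matching game, whose unconstrained equilibrium has full support, has a price of fairness of zero.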
The main result is an $\widetilde{O}(T^{2/3})$ regret bound for our proposed Explore-Then-Commit algorithm, $\texttt{Fair-ETC-TPZSG}$, which is designed for general mixed fair equilibria. We also discuss the limitations of naive action elimination as a way to improve this bound. When the fair equilibrium has a single dominant action, that is, when $\widetilde{p}^{\star}$ is a vertex of $\Delta_m$, the bound improves to an instance-dependent $\widetilde{O}(1/\widetilde{\Delta}(\alpha)^{2})$, where $\widetilde{\Delta}(\alpha)$ denotes the LP-margin gap.
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC





