Finite-Time Regret Analysis of Retry-Aware Bandits
Title: Analyzing Regret in Retry-Aware Bandits Within Finite Time
Abstract: This paper studies a stochastic bandit algorithm for retry-aware objectives, which reward the best result achieved across multiple trials, such as pass@$k$ and max@$k$. Given a posterior distribution over arm values, the ReMax method selects the sampling distribution that maximizes the posterior expected maximum reward over $M$ hypothetical draws. Although this objective has previously served as an exploration strategy in reinforcement learning under uncertainty, its regret in bandit settings has not been well understood. We focus on Gaussian rewards and the first nontrivial case, $M=2$. By establishing an expected-improvement balance condition, we characterize the optimal ReMax distribution and establish the first sublinear regret bound for this approach. Our analysis separates the standard saturation of suboptimal arms from a phenomenon specific to ReMax: an underestimation effect in which the optimal arm is sampled too infrequently after a pessimistic estimate. This dynamic explains why ReMax tends to be more exploitative than Thompson sampling (TS) and why its regret analysis is technically involved. Empirical results agree with the theory: ReMax generally outperforms both KL-UCB and Thompson sampling when underestimation is mild, whereas scaling the posterior variance mitigates the impact of severe underestimation.
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC
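To make the $M=2$ objective in the abstract concrete, the following is a minimal sketch and not the paper's algorithm. It assumes independent Gaussian posteriors over arm values, two i.i.d. arm draws from the sampling distribution $p$, and that drawing the same arm twice contributes that arm's posterior mean. Under these assumptions the objective is the quadratic form $p^\top W p$ with $W_{ij} = \mathbb{E}[\max(\mu_i, \mu_j)]$. The function names, and the use of a multiplicative (replicator-style) update to find a local maximizer on the simplex, are illustrative choices rather than anything stated in the source.

```python
import math
import numpy as np


def expected_max_gaussian(m_i, s_i, m_j, s_j):
    """E[max(X, Y)] for independent X ~ N(m_i, s_i^2), Y ~ N(m_j, s_j^2)."""
    theta = math.hypot(s_i, s_j)  # std of X - Y
    if theta == 0.0:
        return max(m_i, m_j)
    z = (m_i - m_j) / theta
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))          # Phi(z)
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)   # phi(z)
    return m_i * cdf + m_j * (1.0 - cdf) + theta * pdf


def pairwise_matrix(means, stds):
    """W[i, j] = E[max(mu_i, mu_j)]; the diagonal is the posterior mean (assumption)."""
    k = len(means)
    W = np.empty((k, k))
    for i in range(k):
        for j in range(k):
            if i == j:
                W[i, j] = means[i]
            else:
                W[i, j] = expected_max_gaussian(means[i], stds[i], means[j], stds[j])
    return W


def remax_m2_distribution(means, stds, iters=5000, tol=1e-12):
    """Hypothetical helper: locally maximize p^T W p over the probability simplex."""
    W = pairwise_matrix(means, stds)
    # Adding a constant to every entry shifts the objective by that constant on
    # the simplex, so the maximizer is unchanged; positivity keeps the
    # multiplicative update well defined and monotone.
    A = W + (1.0 - W.min())
    p = np.full(len(means), 1.0 / len(means))
    for _ in range(iters):
        grad = A @ p
        new_p = p * grad / (p @ grad)  # stays on the simplex
        if np.abs(new_p - p).max() < tol:
            p = new_p
            break
        p = new_p
    return p


if __name__ == "__main__":
    print(remax_m2_distribution([0.0, 0.5], [1.0, 1.0]))
```

Under the same assumptions, the two-arm case has a closed form: $p_1 = \mathrm{EI}_{1}/(\mathrm{EI}_{1} + \mathrm{EI}_{2})$ with $\mathrm{EI}_{1} = \mathbb{E}[(\mu_1 - \mu_2)^+]$ and $\mathrm{EI}_{2} = \mathbb{E}[(\mu_2 - \mu_1)^+]$, which is one way to read a balance between expected improvements. Convergence of the iterative update can be slow when the maximizer lies close to the boundary of the simplex.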





