Adaptive Exploration for Latent-State Bandits
Title: Adaptive Exploration Strategies for Latent-State Bandit Problems
Abstract: We study bandit problems in which the reward distributions are governed by an unobserved Markov state that evolves independently of the learner’s actions. As a result, the best arm can change over time, while the learner observes only its own past actions and rewards. We propose algorithms that extend LinUCB with two summaries of the hidden state: a lagged action-reward pair and, when feasible, a probe fingerprint built from the rewards of several arms. Adaptive variants refresh the fingerprint based on residual, margin, and staleness tests. In synthetic experiments that vary the number of states, the transition rate, the noise level, and the time horizon, these methods achieve substantially lower dynamic regret than standard, adversarial, and non-stationary bandit baselines, provided that the summaries separate the states and are refreshed often enough. Ablation studies and misspecification tests identify the main failure modes: insufficient fingerprint separation, excessive noise, and state transitions that occur during sequential probing.
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC
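
To make the lagged action-reward summary from the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes a disjoint (per-arm) LinUCB whose context vector encodes only the previous arm and the previous reward. The class name `LaggedLinUCB`, the feature layout, the parameters `alpha` and `ridge`, and the two-state toy environment in `run_demo` are all hypothetical choices made for illustration. The probe fingerprint and the adaptive refresh tests are not covered here.

```python
import numpy as np


class LaggedLinUCB:
    """Illustrative disjoint LinUCB using the previous (arm, reward) pair as context.

    Assumed feature layout (not taken from the paper):
    [one-hot previous arm | one-hot previous arm * previous reward | bias].
    """

    def __init__(self, n_arms, alpha=1.0, ridge=1.0):
        self.n_arms = n_arms
        self.alpha = alpha                      # exploration weight
        self.dim = 2 * n_arms + 1
        # One ridge-regularized design matrix and response vector per arm.
        self.A = np.stack([ridge * np.eye(self.dim) for _ in range(n_arms)])
        self.b = np.zeros((n_arms, self.dim))
        self.context = self._encode(None, 0.0)  # no history yet: bias only

    def _encode(self, prev_arm, prev_reward):
        x = np.zeros(self.dim)
        if prev_arm is not None:
            x[prev_arm] = 1.0
            x[self.n_arms + prev_arm] = prev_reward
        x[-1] = 1.0
        return x

    def select(self):
        """Return the arm with the highest upper confidence bound."""
        x = self.context
        scores = np.empty(self.n_arms)
        for a in range(self.n_arms):
            A_inv = np.linalg.inv(self.A[a])
            theta = A_inv @ self.b[a]
            scores[a] = theta @ x + self.alpha * np.sqrt(x @ A_inv @ x)
        return int(np.argmax(scores))

    def update(self, arm, reward):
        """Update the chosen arm's model, then roll the lagged context forward."""
        x = self.context
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x
        self.context = self._encode(arm, reward)


def run_demo(horizon=2000, switch_prob=0.05, seed=0):
    """Toy two-state, two-arm environment; the hidden state ignores the actions."""
    rng = np.random.default_rng(seed)
    means = np.array([[0.9, 0.1], [0.1, 0.9]])  # means[state, arm], assumed values
    agent = LaggedLinUCB(n_arms=2)
    state, total = 0, 0.0
    for _ in range(horizon):
        arm = agent.select()
        reward = float(rng.random() < means[state, arm])  # Bernoulli reward
        agent.update(arm, reward)
        total += reward
        if rng.random() < switch_prob:          # Markov switch, action-independent
            state = 1 - state
    return total / horizon
```

Calling `run_demo()` returns the average reward over the horizon. In this toy setting the previous arm and its reward carry information about the hidden state, which is the intuition behind using the lagged pair as a state summary; how well that works in general depends on the separation and refresh conditions the abstract describes.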




