Bandit Simulation for Average Reward Inference
Abstract
Multi-armed bandit algorithms are gaining traction across online platforms, clinical trials, and social science research, yet rigorous statistical inference on their outcomes remains difficult. A central question after deployment is how to construct confidence intervals for the bandit's average reward and how to determine whether the bandit consistently outperforms a baseline policy. Because the total reward in any single deployment is stochastic, repeating the experiment on the same population typically produces a different reward trajectory. Classical statistical techniques do not apply here, because the dependencies that bandit algorithms introduce into the data violate the independent and identically distributed (i.i.d.) assumption those techniques require. Furthermore, existing inference methods for adaptively collected data are limited to estimands that do not depend on the data-collection mechanism, such as the mean reward of a fixed action.
To address these limitations, we introduce Bandit Simulation for Inference (BSI). This framework constructs a simulator of the bandit environment from observed data, whether collected on-policy or off-policy, and uses it to estimate the average reward of any evaluation policy, including adaptive black-box algorithms. BSI propagates the uncertainty in the estimated simulator parameters into the confidence intervals it constructs. Crucially, BSI is valid under only weak exploration assumptions on the behavior policy and does not rely on importance weighting. We establish that BSI produces asymptotically valid confidence intervals and provide empirical evidence that it achieves nominal coverage in settings where conventional off-policy evaluation methods fall short.
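As a rough illustration of the simulate-then-infer idea described above, the following minimal sketch fits a Bernoulli bandit simulator to logged data, replays an adaptive policy inside it, and forms a percentile interval for the policy's average reward. It is not the BSI procedure itself, whose details are not given in this abstract: the Bernoulli reward model, the epsilon-greedy evaluation policy, the parametric bootstrap used to reflect parameter uncertainty, and all function names (`fit_arm_means`, `epsilon_greedy_run`, `simulated_interval`) are assumptions made for illustration, and the sketch carries no coverage guarantee.

```python
import random


def fit_arm_means(logged, n_arms):
    """Estimate Bernoulli arm means from logged (arm, reward) pairs."""
    pulls = [0] * n_arms
    wins = [0] * n_arms
    for arm, reward in logged:
        pulls[arm] += 1
        wins[arm] += reward
    # Laplace smoothing keeps the estimate defined for rarely pulled arms.
    means = [(wins[a] + 1) / (pulls[a] + 2) for a in range(n_arms)]
    return means, pulls


def epsilon_greedy_run(means, horizon, rng, eps=0.1):
    """Run an adaptive policy in the simulator and return its average reward."""
    n_arms = len(means)
    pulls = [0] * n_arms
    totals = [0.0] * n_arms
    reward_sum = 0.0
    for _ in range(horizon):
        # Explore with probability eps, or while some arm is still unpulled.
        if rng.random() < eps or 0 in pulls:
            arm = rng.randrange(n_arms)
        else:
            arm = max(range(n_arms), key=lambda a: totals[a] / pulls[a])
        reward = 1.0 if rng.random() < means[arm] else 0.0
        pulls[arm] += 1
        totals[arm] += reward
        reward_sum += reward
    return reward_sum / horizon


def simulated_interval(logged, n_arms, horizon, n_draws=200, n_runs=20,
                       alpha=0.05, seed=0):
    """Return (point, lower, upper) for the policy's average reward."""
    rng = random.Random(seed)
    means_hat, pulls = fit_arm_means(logged, n_arms)

    def average_reward(means):
        runs = [epsilon_greedy_run(means, horizon, rng) for _ in range(n_runs)]
        return sum(runs) / n_runs

    point = average_reward(means_hat)
    estimates = []
    for _ in range(n_draws):
        # Parametric bootstrap (an assumed stand-in): redraw each arm's data
        # from the fitted model, refit, and re-simulate the policy.
        draw = []
        for a in range(n_arms):
            n = max(pulls[a], 1)
            wins = sum(rng.random() < means_hat[a] for _ in range(n))
            draw.append((wins + 1) / (n + 2))
        estimates.append(average_reward(draw))
    estimates.sort()
    lower = estimates[int((alpha / 2) * (n_draws - 1))]
    upper = estimates[int((1 - alpha / 2) * (n_draws - 1))]
    return point, lower, upper


if __name__ == "__main__":
    # Hypothetical logged data from a uniformly random behavior policy.
    data_rng = random.Random(42)
    true_means = [0.3, 0.5, 0.7]
    logged = []
    for _ in range(600):
        arm = data_rng.randrange(3)
        logged.append((arm, 1 if data_rng.random() < true_means[arm] else 0))
    print(simulated_interval(logged, n_arms=3, horizon=100))
```

The sketch uses no importance weights: the logged data enter only through the fitted simulator, so the evaluation policy can be any procedure that can be run against simulated rewards. The interval width reflects both parameter uncertainty (the bootstrap redraws) and residual simulation noise from averaging a finite number of runs.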
Source: arXiv





