arXiv

Bandit Simulation for Average Reward Inference

Abstract

While multi-armed bandit algorithms are gaining traction across online platforms, clinical trials, and social science research, performing rigorous statistical inference on their outcomes remains a significant hurdle. A primary concern following deployment is the ability to construct confidence intervals for the mean reward and to determine whether the bandit consistently outperforms a baseline policy. Because total rewards in any single deployment are stochastic, repeating the experiment on the same population typically produces a different reward path. Traditional statistical techniques are unsuitable here, because the dependencies that bandit algorithms introduce between observations violate the independent and identically distributed (i.i.d.) assumption required by classical methods. Furthermore, current inference approaches for adaptively gathered data are limited to estimands that do not depend on the data-collection mechanism, such as the mean reward of a fixed action.

To address these limitations, we introduce Bandit Simulation for Inference (BSI). This framework constructs a simulator of the bandit environment from observed data (whether on-policy or off-policy) and uses it to estimate mean rewards for any evaluation policy, including adaptive black-box algorithms. BSI rigorously incorporates uncertainty from the estimated simulator parameters into the construction of confidence intervals. Crucially, BSI is valid under only weak exploration assumptions on the behavior policy and does not rely on importance weighting. We establish that BSI produces asymptotically valid confidence intervals and provide empirical evidence that it achieves nominal coverage in scenarios where conventional off-policy evaluation methods fall short.
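The abstract does not spell out the BSI procedure, so the following is only a rough sketch of the general idea it describes: fit a simulator to logged data, perturb the fitted parameters to reflect estimation uncertainty, and replay an adaptive evaluation policy inside each perturbed simulator. Everything specific here is an assumption made for illustration, not the authors' method: the Bernoulli reward model, the normal approximation used to perturb the arm means, the epsilon-greedy evaluation policy, the percentile interval, and all function names (`fit_simulator`, `epsilon_greedy_run`, `simulation_interval`).

```python
import random
import statistics


def fit_simulator(logged):
    """Estimate a mean and standard error per arm from (arm, reward) pairs.

    Assumption: rewards are Bernoulli, so the variance is mean * (1 - mean).
    """
    params = {}
    for arm in sorted({a for a, _ in logged}):
        rewards = [r for a, r in logged if a == arm]
        n = len(rewards)
        mean = sum(rewards) / n
        # Floor the variance so an all-0 or all-1 arm still gets some spread.
        se = (max(mean * (1 - mean), 1e-6) / n) ** 0.5
        params[arm] = (mean, se)
    return params


def epsilon_greedy_run(means, horizon, epsilon, rng):
    """Simulate one epsilon-greedy deployment; return its average reward."""
    k = len(means)
    counts = [0] * k
    sums = [0.0] * k
    total = 0.0
    for _ in range(horizon):
        # Explore at random until every arm has been tried, then with prob. epsilon.
        if 0 in counts or rng.random() < epsilon:
            arm = rng.randrange(k)
        else:
            arm = max(range(k), key=lambda i: sums[i] / counts[i])
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        total += reward
    return total / horizon


def simulation_interval(logged, horizon=100, epsilon=0.1, draws=200,
                        runs_per_draw=10, alpha=0.05, seed=0):
    """Return (point estimate, lower, upper) for the policy's mean reward."""
    rng = random.Random(seed)
    params = fit_simulator(logged)
    arms = sorted(params)
    values = []
    for _ in range(draws):
        # Perturb the fitted arm means (normal approximation, clipped to [0, 1]).
        means = [min(1.0, max(0.0, rng.gauss(*params[a]))) for a in arms]
        # Average several simulated deployments under this parameter draw.
        runs = [epsilon_greedy_run(means, horizon, epsilon, rng)
                for _ in range(runs_per_draw)]
        values.append(statistics.fmean(runs))
    values.sort()
    lower = values[int((alpha / 2) * draws)]
    upper = values[min(draws - 1, int((1 - alpha / 2) * draws))]
    return statistics.fmean(values), lower, upper


# Hypothetical logged data: a uniform-random behavior policy on two arms.
data_rng = random.Random(1)
true_means = [0.3, 0.6]
logged = []
for _ in range(500):
    arm = data_rng.randrange(2)
    logged.append((arm, 1.0 if data_rng.random() < true_means[arm] else 0.0))

point, lower, upper = simulation_interval(logged)
print(f"estimated mean reward: {point:.3f}, interval: [{lower:.3f}, {upper:.3f}]")
```

Note that no importance weights appear anywhere: the logged data are used only to fit the simulator, and the evaluation policy is then run inside it as a black box. The interval in this sketch also absorbs some Monte Carlo noise from the finite number of simulated runs, and it carries none of the asymptotic guarantees claimed for BSI.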


Source: arXiv Generated at: 2026-06-02 00:00:00 UTC
