When and why randomised exploration works (in linear bandits)
Title: The Mechanics and Timing of Randomized Exploration in Linear Bandits
Abstract: This paper introduces a novel analytical framework for randomized exploration techniques, such as Thompson sampling, that does not rely on forced optimism or posterior inflation. Applying this framework to the $d$-dimensional linear bandit problem with a smooth and strongly convex action space, we establish an $n$-step regret bound of $O(d\sqrt{n} \log(n))$. This is the first result showing that Thompson sampling can achieve optimal dimension dependence in regret in a non-trivial linear bandit setting.
Source: arXiv
Generated at: 2026-06-04 00:00:00 UTC
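
To make the setting in the abstract concrete, the following is a minimal sketch of Thompson sampling for a linear bandit without posterior inflation. It is an illustration under stated assumptions, not the paper's algorithm or analysis: the action set is taken to be the unit Euclidean ball (one example of a smooth, strongly convex set), the prior and noise are assumed Gaussian, and the names `linear_thompson_sampling`, `theta_star`, `noise_std`, and `reg` are hypothetical.

```python
import numpy as np


def linear_thompson_sampling(theta_star, n_steps, noise_std=1.0, reg=1.0, seed=0):
    """Run Thompson sampling on the unit ball and return cumulative regret.

    Assumptions (illustrative, not from the source): Gaussian noise with
    standard deviation noise_std, Gaussian prior N(0, (noise_std**2 / reg) I),
    and the unit Euclidean ball as the action set.
    """
    rng = np.random.default_rng(seed)
    d = theta_star.shape[0]
    V = reg * np.eye(d)   # regularised design matrix: reg * I + sum of a a^T
    b = np.zeros(d)       # sum of reward-weighted actions
    # The best action on the unit ball is theta_star / ||theta_star||.
    best_reward = np.linalg.norm(theta_star)
    regret = np.zeros(n_steps)
    total = 0.0
    for t in range(n_steps):
        # Exact Gaussian posterior: mean V^{-1} b, covariance noise_std^2 V^{-1}.
        # The covariance is used as is, with no inflation factor.
        theta_hat = np.linalg.solve(V, b)
        cov = noise_std ** 2 * np.linalg.inv(V)
        cov = (cov + cov.T) / 2  # symmetrise against round-off
        theta_tilde = rng.multivariate_normal(theta_hat, cov)
        # Act greedily with respect to the sample: argmax over the unit ball.
        norm = np.linalg.norm(theta_tilde)
        action = theta_tilde / norm if norm > 0 else np.eye(d)[0]
        reward = action @ theta_star + noise_std * rng.normal()
        V += np.outer(action, action)
        b += reward * action
        total += best_reward - action @ theta_star
        regret[t] = total
    return regret
```

Calling `linear_thompson_sampling(np.array([1.0, 0.0, 0.0]), 2000)` returns an array of cumulative regret values that can be plotted against $\sqrt{n}$ to inspect the growth rate empirically; such a plot is only a sanity check and does not verify the bound stated in the abstract.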



