arXiv

Efficient Adversarial Attacks on High-dimensional Offline Bandits

Abstract

Bandit algorithms have recently gained traction as a mechanism for evaluating machine learning systems such as large language models and generative image models. They identify the strongest candidates without requiring exhaustive pairwise comparisons. Typically, these methods rely on a reward model, often one with public weights on a repository such as Hugging Face, to supply feedback to the bandit. Because online evaluation is resource-intensive and requires many iterations, offline evaluation on logged data has emerged as a compelling alternative.

Despite this shift, the adversarial robustness of offline bandit evaluation remains under-investigated, particularly in the setting where an adversary modifies the reward model before the bandit is trained rather than tampering with the training data. This study addresses that gap by examining, both theoretically and empirically, how vulnerable offline bandit training is to adversarial manipulation of the reward model.

We propose a novel threat model in which an attacker uses offline data in high-dimensional settings to subvert the bandit’s decisions. Our investigation begins with linear reward functions and then extends to nonlinear architectures such as ReLU neural networks. We specifically target two Hugging Face evaluators commonly used to assess generative models: one that scores aesthetic quality and another that scores compositional alignment.
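To make the linear case of this threat model concrete, the following is a minimal sketch and not the authors' attack. It assumes a toy setting that the abstract does not specify: two arms, a linear reward `w @ x`, and an offline learner that simply prefers the arm with the higher mean reward over its logged features. The perturbation used is the standard minimum-L2-norm step across the tie hyperplane. All names (`arm_scores`, `min_norm_attack`, `rel_margin`) and the synthetic data are hypothetical.

```python
import numpy as np

def arm_scores(w, arm_means):
    """Offline value of each arm: linear reward of its mean logged feature vector."""
    return arm_means @ w

def min_norm_attack(w, arm_means, target, rel_margin=1e-3):
    """Smallest-L2 weight perturbation that makes `target` outscore the other arm.

    Two-arm case only: moves w just across the hyperplane where both arms tie.
    """
    other = 1 - target
    diff = arm_means[target] - arm_means[other]      # direction that favors the target
    gap = (arm_means[other] - arm_means[target]) @ w  # how far the target is behind
    if gap < 0:                                       # target already wins
        return np.zeros_like(w)
    return (1 + rel_margin) * gap * diff / (diff @ diff)

# Synthetic logged data (assumption): arm 0 is shifted toward w, so it starts ahead.
rng = np.random.default_rng(0)
d, n = 512, 200
w = rng.normal(size=d) / np.sqrt(d)
logged = [rng.normal(size=(n, d)) + w, rng.normal(size=(n, d))]
arm_means = np.stack([X.mean(axis=0) for X in logged])

before = arm_scores(w, arm_means)
target = int(np.argmin(before))                  # attacker promotes the losing arm
delta = min_norm_attack(w, arm_means, target)
after = arm_scores(w + delta, arm_means)
print("preferred arm before/after:", int(np.argmax(before)), int(np.argmax(after)))
print("relative perturbation size:", np.linalg.norm(delta) / np.linalg.norm(w))
```

The attacker here touches only the reward weights; the logged features are left unchanged, which is the distinction the threat model draws from data poisoning.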

Our findings indicate that even minute, imperceptible adjustments to the weights of the reward model can significantly distort the bandit’s behavior. Theoretically, we demonstrate a pronounced high-dimensional phenomenon: as the dimensionality of the input grows, the magnitude of the perturbation necessary to execute a successful attack diminishes. This dynamic renders modern applications, particularly those involving image evaluation, highly susceptible to such exploits. Comprehensive experiments validate that while random perturbations are largely ineffective, strategically crafted perturbations can achieve nearly perfect attack success rates.
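The dimensionality effect can be illustrated with a toy calculation, offered here as a sketch under stated assumptions rather than as the paper's theorem. Assume a linear reward, a fixed score gap between two arms, and a feature difference whose coordinates are i.i.d. standard normal. The smallest weight perturbation that flips the ranking then has norm `gap / ||diff||`, and since `||diff||` grows roughly like the square root of the dimension, the required perturbation shrinks. The second function checks how often a random perturbation with slightly more than that budget succeeds. The function names and the Gaussian model are hypothetical.

```python
import numpy as np

def required_norm(d, gap=1.0, seed=0):
    """Minimum L2 perturbation that closes a fixed score gap in dimension d."""
    rng = np.random.default_rng(seed)
    diff = rng.normal(size=d)       # assumed per-coordinate feature difference
    return gap / np.linalg.norm(diff)

def random_success_rate(d, gap=1.0, trials=500, seed=0):
    """Fraction of random perturbations, at 1.01x the minimal budget, that flip the ranking."""
    rng = np.random.default_rng(seed)
    diff = rng.normal(size=d)
    eps = 1.01 * gap / np.linalg.norm(diff)
    deltas = rng.normal(size=(trials, d))
    deltas *= eps / np.linalg.norm(deltas, axis=1, keepdims=True)  # rescale to norm eps
    return float(np.mean(deltas @ diff > gap))

dims = [16, 256, 4096]
norms = [required_norm(d) for d in dims]
rates = [random_success_rate(d) for d in dims]
for d, r, p in zip(dims, norms, rates):
    print(f"d={d:5d}  crafted norm={r:.4f}  random success rate={p:.3f}")
```

In this toy model a random direction is almost orthogonal to `diff` in high dimension, so it wastes nearly all of its budget, whereas the crafted direction spends all of it on closing the gap.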


Source: arXiv Generated at: 2026-06-04 00:00:00 UTC
