arXiv

TamperBench: Systematically Stress-Testing LLM Safety Under Fine-Tuning and Tampering

Abstract: As open-weight large language models (LLMs) grow more capable and more widely deployed, improving their resistance to unsafe modifications, whether accidental or malicious, has become essential for risk mitigation. Yet there is currently no standardized method for assessing tamper resistance. Differing datasets, evaluation metrics, and tampering setups make it hard to compare safety, utility, and robustness across models and defenses. To close this gap, we present TamperBench, the first unified framework for systematically assessing LLM tamper resistance. TamperBench (i) assembles a comprehensive repository of state-of-the-art weight-space fine-tuning attacks, latent-space representation attacks, and alignment-stage defenses; (ii) enables realistic adversarial testing through systematic hyperparameter sweeps for each attack-model pair; and (iii) provides evaluations of both safety and utility. Using TamperBench, we evaluate 21 open-weight LLMs, including defense-augmented variants, against nine tampering threats, using standardized safety and capability metrics with hyperparameter sweeps for every model-attack pair. Our findings cover three points: post-training affects tamper resistance, jailbreak-tuning is generally the most potent attack, and existing alignment-stage defenses are largely ineffective against comprehensive attack sweeps. The code is available at https://github.com/criticalml-uw/TamperBench.
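
Point (ii) of the abstract, sweeping hyperparameters for every attack-model pair, can be illustrated with a minimal Python sketch. This is not the TamperBench API: the names `SweepResult`, `grid`, and `strongest_per_pair`, the two score fields, and the utility threshold used to discard runs that break the model are all hypothetical assumptions made for illustration. The idea shown is only that a defense is judged against the strongest configuration found in the sweep rather than against a single hand-picked setting.

```python
from dataclasses import dataclass
from itertools import product
from typing import Dict, Iterable, List, Sequence, Tuple


@dataclass
class SweepResult:
    """One tampering run. All fields are hypothetical, for illustration only."""
    model: str
    attack: str
    config: Dict[str, float]
    harm_score: float     # assumed convention: higher means less safe
    utility_score: float  # assumed convention: higher means more capable


def grid(space: Dict[str, Sequence[float]]) -> List[Dict[str, float]]:
    """Expand a hyperparameter space into every combination (a full grid sweep)."""
    keys = sorted(space)
    return [dict(zip(keys, values)) for values in product(*(space[k] for k in keys))]


def strongest_per_pair(
    results: Iterable[SweepResult], min_utility: float
) -> Dict[Tuple[str, str], SweepResult]:
    """For each (model, attack) pair, keep the most harmful run that still
    preserves utility. The utility threshold is an assumption of this sketch."""
    best: Dict[Tuple[str, str], SweepResult] = {}
    for r in results:
        if r.utility_score < min_utility:
            continue  # skip runs that degrade the model's capabilities too much
        key = (r.model, r.attack)
        if key not in best or r.harm_score > best[key].harm_score:
            best[key] = r
    return best
```

In use, one would call `grid` on a search space such as `{"learning_rate": [1e-5, 1e-4], "epochs": [1, 3]}`, run the attack once per configuration, record a `SweepResult` for each run, and report `strongest_per_pair(results, min_utility=...)` as the worst case for each model.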


Source: arXiv Generated at: 2026-06-04 00:00:00 UTC
