TamperBench: Systematically Stress-Testing LLM Safety Under Fine-Tuning and Tampering
Abstract: As open-weight large language models (LLMs) grow more capable and are deployed more widely, strengthening their resistance to unsafe modifications, whether accidental or malicious, has become essential for risk mitigation. Yet there is currently no standardized method for assessing tamper resistance. Differences in datasets, evaluation metrics, and tampering setups make it difficult to compare safety, utility, and robustness across models and defenses. To address this gap, we present TamperBench, the first unified framework for systematically assessing LLM tamper resistance. TamperBench (i) assembles a comprehensive repository of state-of-the-art weight-space fine-tuning attacks, latent-space representation attacks, and alignment-stage defenses; (ii) enables realistic adversarial testing through systematic hyperparameter sweeps for each attack-model pair; and (iii) evaluates both safety and utility. Using TamperBench, we evaluate 21 open-weight LLMs, including defense-augmented variants, against nine tampering threats, with standardized safety and capability metrics and a hyperparameter sweep for every model-attack pair. Our results shed light on the impact of post-training on tamper resistance, show that jailbreak-tuning is generally the most potent attack, and indicate that existing alignment-stage defenses are largely ineffective against comprehensive attack sweeps. The code is available at https://github.com/criticalml-uw/TamperBench.
Source: arXiv
Generated at: 2026-06-04 00:00:00 UTC




