Backdoor Unlearning Generalization: A Path Toward the Removal of Unknown Triggers in LLMs
Abstract:
The security of Large Language Models (LLMs) is increasingly threatened by backdoor attacks, which enable adversaries to manipulate model outputs. Current defense mechanisms suffer from a significant structural disadvantage: they typically address backdoors individually and rely on prior knowledge of the specific triggers involved. This approach is inadequate when the model contains unknown backdoors. In this work, we demonstrate that backdoor neutralization via unlearning exhibits generalization capabilities. Specifically, we find that training a model to disregard a single trigger can inadvertently suppress other backdoors that were not explicitly targeted during the unlearning process.
We investigate this phenomenon across three distinct model families in which backdoors were introduced through either pretraining or continual pretraining, systematically analyzing the models that result from removing one backdoor at a time. To explain the mechanisms behind this cross-backdoor suppression, we propose the Cross Activation Shift Distance, a metric designed to quantify the divergence between model state changes caused by different training procedures. Our findings suggest a novel avenue for enhancing LLM safety: defenders could intentionally introduce and subsequently eliminate controlled backdoors, leveraging cross-backdoor transfer effects to neutralize unknown backdoors that attackers may have previously embedded within the model.
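The abstract names the Cross Activation Shift Distance but does not define it. The sketch below is one plausible reading, offered only as an illustration and not as the authors' definition: it assumes a "model state change" is the mean shift in hidden activations on a fixed set of probe inputs, and that "divergence" is the cosine distance between two such shifts. The function names, the choice of cosine distance, and the toy activations are all hypothetical.

```python
import math

def activation_shift(base_acts, tuned_acts):
    """Mean per-dimension activation change on the same probe inputs.

    Assumption: each argument is a list of equal-length activation
    vectors, one per probe input, taken from the same layer.
    """
    n = len(base_acts)
    dim = len(base_acts[0])
    return [
        sum(t[d] - b[d] for b, t in zip(base_acts, tuned_acts)) / n
        for d in range(dim)
    ]

def cosine_distance(u, v):
    """1 - cosine similarity; 0 means the shifts point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("a zero shift has no direction to compare")
    return 1.0 - dot / (norm_u * norm_v)

def cross_shift_distance(base_acts, acts_after_a, acts_after_b):
    """Illustrative stand-in for a cross activation shift distance:
    compares the shift caused by procedure A with that caused by B."""
    shift_a = activation_shift(base_acts, acts_after_a)
    shift_b = activation_shift(base_acts, acts_after_b)
    return cosine_distance(shift_a, shift_b)

# Toy activations (invented for illustration, not data from the paper).
base = [[0.0, 0.0], [1.0, 1.0]]
after_unlearn_a = [[1.0, 0.0], [2.0, 1.0]]  # shift along dimension 0
after_unlearn_b = [[2.0, 0.0], [3.0, 1.0]]  # same direction, larger
after_unlearn_c = [[0.0, 1.0], [1.0, 2.0]]  # orthogonal direction

print(cross_shift_distance(base, after_unlearn_a, after_unlearn_b))
print(cross_shift_distance(base, after_unlearn_a, after_unlearn_c))
```

Under this reading, a small distance between the shifts produced by unlearning two different backdoors would be consistent with the cross-backdoor suppression the abstract describes, since both procedures would be moving the model in a similar direction.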
Source: arXiv
Generated at: 2026-06-03 00:00:00 UTC





