Patcher: Post-Hoc Patching of Backdoored Large Language Models
Abstract:
Large language models (LLMs) are widely deployed, yet they remain vulnerable to jailbreak backdoor attacks, in which a malicious actor poisons safety alignment datasets to implant covert triggers that bypass the model's safety behavior. Existing mitigations are often impractical to deploy because they typically require detailed knowledge of the attack or multiple examples of triggered behavior. A defender, by contrast, may have only a single observed failure case and no certainty that it stems from a backdoor rather than an ordinary alignment failure.
To address this gap, we introduce Patcher, a post-hoc defense framework that repairs a compromised LLM using only the model’s parameters and one reported failure instance. Patcher operates in two phases. First, it identifies the backdoor trigger by computing gradient-based saliency scores conditioned on the response, then applies adaptive clustering to separate triggers from benign context. Second, it repairs the model with a constrained fine-tuning objective that severs the link between the trigger and the malicious response. KL-divergence constraints ensure that the patched model retains its performance on benign tasks and remains resilient to non-triggered jailbreak attempts.
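The two phases can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the clustering algorithm or the exact form of the objective, so the largest-gap split, the penalized loss, and every name below (`split_salient_tokens`, `kl_divergence`, `patch_objective`, `lam`) are assumptions made for illustration. The per-token saliency scores are taken as given rather than computed from gradients.

```python
import math


def split_salient_tokens(scores):
    """Split token indices into (high, low) saliency groups.

    Assumption: a simple largest-gap split over sorted scores stands in for
    the adaptive clustering step. It always flags at least one token, so a
    real system would still need a check for the no-backdoor case.
    """
    if len(scores) < 2:
        return list(range(len(scores))), []
    # Token indices ordered from most to least salient.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Drop in saliency between each pair of neighbours in that ordering.
    gaps = [scores[order[k]] - scores[order[k + 1]] for k in range(len(order) - 1)]
    cut = max(range(len(gaps)), key=lambda k: gaps[k])
    return sorted(order[: cut + 1]), sorted(order[cut + 1:])


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0
    )


def patch_objective(refusal_nll, p_original, p_patched, lam=1.0):
    """Hypothetical penalized objective for the repair phase.

    refusal_nll: loss for producing a safe response to the triggered prompt.
    p_original, p_patched: next-token distributions of the original and the
    patched model on a benign prompt; the KL term penalizes drift from the
    original behavior. `lam` is an assumed trade-off weight.
    """
    return refusal_nll + lam * kl_divergence(p_original, p_patched)


# Made-up saliency scores for a seven-token prompt; index 3 is the outlier.
scores = [0.05, 0.03, 0.04, 0.91, 0.10, 0.02, 0.12]
trigger_idx, context_idx = split_salient_tokens(scores)
```

With these made-up scores, `trigger_idx` is `[3]` and the remaining six positions are treated as context. The KL term is zero when the patched model's distribution on the benign prompt matches the original, so in that case `patch_objective` reduces to the refusal loss alone.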
Our evaluations across a range of backdoor attack methods show that Patcher isolates triggers and neutralizes backdoors without degrading overall model utility. The framework is also robust to adaptive attacks designed specifically to bypass it. This work is a step toward practical, deployable defenses against training-time compromises of language models.
Source: arXiv



