arXiv

Patcher: Post-Hoc Patching of Backdoored Large Language Models

Abstract:

Despite their widespread adoption, large language models (LLMs) continue to face significant risks from jailbreak backdoor attacks. In these scenarios, malicious actors compromise safety alignment datasets to insert covert triggers that circumvent security protocols. Current mitigation strategies are often impractical for real-world deployment because they typically demand detailed knowledge of the attack vector or require multiple instances of triggered behavior. This creates a challenge for defenders who may only have access to a single observed failure case and lack the certainty that the incident is due to a backdoor rather than a standard alignment error.

To address this gap, we introduce Patcher, a novel post-hoc defense framework capable of repairing compromised LLMs using only the model’s parameters and one reported instance of failure. Patcher operates in two phases. First, it identifies backdoor triggers by calculating gradient-based saliency scores conditioned on responses, then applies adaptive clustering to distinguish triggers from harmless contextual information. Second, the framework repairs the model via a constrained fine-tuning objective. This step severs the link between the trigger and the malicious response, while KL-divergence constraints ensure that the model retains its performance on benign tasks and remains resilient against non-triggered jailbreak attempts.
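The two phases can be illustrated with a toy sketch. This is not the paper's implementation: the per-token saliency values below are invented rather than computed from model gradients, the largest-gap split is only a simple stand-in for the adaptive clustering step, and the penalized loss (with a hypothetical weight `lam`) is one possible way to express a KL-divergence constraint. All function names are assumptions made for illustration.

```python
import math


def split_by_largest_gap(scores):
    """Return indices of the high-saliency group (illustrative stand-in for adaptive clustering).

    Scores are sorted in descending order and cut at the largest gap between
    neighbours, so the number of flagged tokens adapts to the input.
    """
    if len(scores) < 2:
        return []
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    gaps = [scores[order[k]] - scores[order[k + 1]] for k in range(len(order) - 1)]
    cut = max(range(len(gaps)), key=lambda k: gaps[k])
    return sorted(order[: cut + 1])


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def patch_objective(unlearn_loss, ref_dist, new_dist, lam=1.0):
    """Assumed penalized form: unlearning loss plus lam * KL(reference || patched).

    ref_dist and new_dist stand for next-token distributions of the original
    and patched model on a benign prompt.
    """
    return unlearn_loss + lam * kl_divergence(ref_dist, new_dist)


# Phase 1 (toy): made-up saliency scores for the tokens of one failing prompt.
tokens = ["How", "do", "I", "cf", "pick", "a", "lock"]
saliency = [0.05, 0.03, 0.04, 0.91, 0.10, 0.02, 0.12]
suspects = [tokens[i] for i in split_by_largest_gap(saliency)]
print(suspects)

# Phase 2 (toy): drifting away from the reference distribution raises the loss.
unchanged = patch_objective(0.3, [0.5, 0.5], [0.5, 0.5])
drifted = patch_objective(0.3, [0.5, 0.5], [0.9, 0.1])
print(unchanged < drifted)
```

In this toy setting the single outlying token is flagged as the suspected trigger, and the KL term penalizes any patched model whose benign-prompt predictions drift from the original. A real system would obtain the saliency scores from gradients of the observed response with respect to the input tokens.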

Our comprehensive evaluations across various backdoor attack methodologies confirm that Patcher effectively isolates triggers and neutralizes backdoors without degrading overall model utility. Furthermore, the framework demonstrates strong robustness against adaptive attacks specifically engineered to bypass this defense. This study marks a crucial advancement in developing practical, deployable defenses against training-time compromises in language models.


Source: arXiv Generated at: 2026-06-03 00:00:00 UTC
