Erased but Not Forgotten: How Backdoors Compromise Concept Erasure
Title: Hidden Residues: How Backdoors Undermine Concept Erasure
The rapid proliferation of text-to-image diffusion models has raised serious concerns about the generation of harmful material, ranging from deepfake portrayals of public figures to sexually explicit content. In response, researchers have developed concept erasure techniques that fine-tune a model to dissociate it from undesirable concepts. However, it remains unclear whether these techniques truly eliminate all associations with a harmful concept or merely mask its surface-level links.
This study identifies a critical security flaw, the Erasure Evasion Backdoor (EEB). In this attack, an adversary binds a specific trigger to a concept targeted for removal, so that the link between trigger and concept persists even after the erasure process is complete. We demonstrate that both black-box and white-box attackers can mount this attack.
When tested against six leading erasure algorithms—including robust methods designed to actively seek alternative representations for the targeted concept—the EEB consistently succeeded in revealing prohibited content. The attack achieved success rates of up to 82% in celebrity identity unlearning and up to 94% in object erasure. Furthermore, the backdoor amplified the exposure of explicit material by a factor of up to 16.
Although the EEB highlights a significant oversight in current erasure protocols, it also serves as a valuable diagnostic mechanism for rigorously stress-testing and improving future concept removal techniques.
Source: arXiv




