Unlearning Isn't Invisible: Detecting Unlearning Traces in LLMs from Model Outputs
Title: Unlearning Leaves a Mark: Identifying Unlearning Signatures in LLMs via Model Outputs
Abstract:
Machine unlearning (MU) for large language models (LLMs), often termed LLM unlearning, aims to remove specific unwanted data or knowledge from a trained model without degrading its performance on standard tasks. Although unlearning is essential for protecting data privacy, enforcing copyright, and reducing sociotechnical risks associated with LLMs, our research identifies a previously overlooked vulnerability that emerges after unlearning: unlearning traces are detectable.
We find that unlearning leaves persistent "fingerprints" in LLMs. These traces appear in both the model's internal representations and its behavioral outputs. Notably, they can be identified from the model's responses even when the inputs are unrelated to the forgotten data. Specifically, a simple supervised classifier can accurately determine whether a model has undergone unlearning using only its prediction logits or its textual responses.
Further investigation reveals that these traces reside in intermediate activations and propagate nonlinearly to the final layer, forming low-dimensional, learnable manifolds in activation space. Our extensive experiments confirm that unlearning traces can be detected with greater than 90% accuracy, even from forget-irrelevant inputs. We also observe that larger LLMs are more strongly detectable. These results indicate that unlearning leaves measurable signatures and thereby introduces a new risk: a model identified as having undergone unlearning may become a target for reverse-engineering attempts that try to recover the forgotten information for a given input query.
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC
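
The detection setup summarized in the abstract (a simple supervised classifier that labels a model as original or unlearned from its output logits) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the features are synthetic Gaussian vectors rather than real LLM logits, the `shift` parameter is a hypothetical stand-in for an unlearning trace, and logistic regression is one possible choice of "simple classifier"; the abstract does not specify the classifier or the feature extraction used in the paper.

```python
import math
import random


def make_features(n, dim, shift, rng):
    # Synthetic stand-in for per-response logit features (assumption).
    # `shift` is a hypothetical offset on the first three coordinates
    # that mimics a trace left by unlearning.
    return [[rng.gauss(shift if j < 3 else 0.0, 1.0) for j in range(dim)]
            for _ in range(n)]


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(w, b, x):
    # Probability that feature vector x came from the "unlearned" class.
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def train_logreg(X, y, lr=0.5, epochs=200):
    # Full-batch gradient descent on the mean logistic loss.
    n, dim = len(X), len(X[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, t in zip(X, y):
            err = predict_proba(w, b, x) - t
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def accuracy(w, b, X, y):
    # Fraction of examples whose thresholded prediction matches the label.
    hits = sum((predict_proba(w, b, x) >= 0.5) == (t == 1)
               for x, t in zip(X, y))
    return hits / len(X)


def run_demo(seed=0, dim=8, n_train=200, n_test=200, shift=1.5):
    rng = random.Random(seed)

    def split(n):
        # Label 0 = original model, label 1 = unlearned model.
        X = make_features(n, dim, 0.0, rng) + make_features(n, dim, shift, rng)
        y = [0] * n + [1] * n
        return X, y

    X_train, y_train = split(n_train)
    X_test, y_test = split(n_test)
    w, b = train_logreg(X_train, y_train)
    return accuracy(w, b, X_test, y_test)


if __name__ == "__main__":
    print(f"held-out detection accuracy: {run_demo():.3f}")
```

The accuracy this prints reflects only the size of the synthetic `shift` and says nothing about real models. To apply the same idea in practice, one would replace `make_features` with features extracted from the responses of an original model and an unlearned model on the same prompts.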





