Multilingual Unlearning in LLMs: Transfer, Dynamics, and Reversibility
Abstract:
The tendency of large language models (LLMs) to memorize sensitive information has spurred the development of unlearning techniques designed to excise specific knowledge without the high cost of full retraining. Despite this progress, most unlearning research still focuses on English. To address this gap, our study investigates multilingual unlearning by adapting the TOFU benchmark to five languages. We fine-tune, unlearn, and query models under different combinations of these languages.
Our findings indicate that unlearning transfer—the capacity of a model to "forget" information in languages other than the one targeted for unlearning—exhibits significant variability. This transfer is most robust between languages that share scripts or linguistic families. Furthermore, we demonstrate that the specific language chosen for unlearning serves as a predictor for which query languages will exhibit the strongest transfer effects.
A layer-wise analysis provides further insight into this phenomenon, revealing that unlearning leaves the shared cross-lingual latent space largely unaffected in the model’s early layers. Instead, the unlearning process primarily impacts the later decoding layers. This pattern suggests that unlearning does not result in the genuine erasure of knowledge but rather induces a form of superficial suppression. Leveraging this structural insight, we show that a single inference-time steering direction can reverse much of this suppression across different languages. This approach successfully recovers 50% of the previously unlearned knowledge in Qwen models and 90% in Gemma models.
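To make the idea of a single inference-time steering direction concrete, here is a minimal sketch. The abstract does not say how the direction is derived, so the difference-of-means construction below (mean activation of the original model minus mean activation of the unlearned model at one layer) is an assumption borrowed from common activation-steering practice, not the authors' stated method. The function names, the scaling factor `alpha`, and the toy numbers are all hypothetical.

```python
from typing import List, Sequence

Vector = List[float]


def mean_vector(rows: Sequence[Sequence[float]]) -> Vector:
    """Column-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def steering_direction(
    original_acts: Sequence[Sequence[float]],
    unlearned_acts: Sequence[Sequence[float]],
) -> Vector:
    """Assumed construction: mean(original) - mean(unlearned) at one layer."""
    mu_orig = mean_vector(original_acts)
    mu_unl = mean_vector(unlearned_acts)
    return [a - b for a, b in zip(mu_orig, mu_unl)]


def steer(hidden: Sequence[float], direction: Sequence[float], alpha: float = 1.0) -> Vector:
    """Add the scaled direction to a hidden state at inference time."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Toy 3-dimensional activations (made-up numbers, not results from the paper).
original_acts = [[1.0, 2.0, 0.0], [3.0, 2.0, 0.0]]
unlearned_acts = [[1.0, 0.0, 0.0], [3.0, 0.0, 0.0]]

direction = steering_direction(original_acts, unlearned_acts)
steered = steer([2.0, 0.0, 5.0], direction, alpha=1.0)
```

In a real model the same addition would be applied to a later layer's hidden states during generation, with the same vector reused regardless of the query language.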
Source: arXiv
Generated at: 2026-06-03 00:00:00 UTC





