Subliminal Learning is a LoRA Artifact
Title: Subliminal Learning Identified as a LoRA Artifact
Abstract:
Recent research has highlighted "subliminal learning," a phenomenon in which language models pass behavioral characteristics to one another via data that appears harmless (Cloud et al., 2025). In this process, a teacher model possessing a specific trait, such as an intense fixation on cats, can instill that same obsession in a student model, even if the student is fine-tuned exclusively on numerical sequences produced by the teacher. This paper investigates the mechanisms behind this surprising transmission of behavior. Our findings reveal that subliminal learning is essentially an artifact of Low-Rank Adaptation (LoRA). Specifically, we observe that the effectiveness of this transmission follows an inverted U-shaped curve relative to LoRA rank. Furthermore, the phenomenon vanishes entirely when full fine-tuning is employed.
We also demonstrate that subliminal learning is heavily contingent upon the context present during both the fine-tuning and evaluation phases. For instance, a Qwen model trained with its default system prompt ("You are Qwen, created by Alibaba Cloud. You are a helpful assistant.") fails to exhibit subliminal learning during generation if no system prompt is provided at that time. Additionally, we show that these behavioral shifts are localized to the computations involving tokens that appear in both the fine-tuning and evaluation contexts, such as the model's standard chat template tokens or its default system prompt. Consequently, subliminal learning appears to be a fragile byproduct of LoRA hyperparameters and specific fine-tuning contexts, rendering it an unreliable channel for behavioral transmission.
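
For readers less familiar with what "LoRA rank" controls, the following is a minimal sketch of the standard LoRA parameterization, in which a frozen weight matrix W receives a trainable low-rank update scaled by alpha / r. It is not the paper's code: the function names, the toy dimensions, and the choice of alpha are illustrative assumptions, and the abstract does not state which layers or hyperparameters the authors used.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def lora_forward(W: Matrix, A: Matrix, B: Matrix, x: Vector, alpha: float) -> Vector:
    """Compute W x + (alpha / r) * B (A x), where r is the LoRA rank.

    W is d_out x d_in and stays frozen; A (r x d_in) and B (d_out x r)
    are the only trainable matrices. All names here are illustrative.
    """
    r = len(A)  # the rank is the number of rows of A
    base = matvec(W, x)
    low_rank = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, low_rank)]


def lora_param_count(d_out: int, d_in: int, r: int) -> int:
    """Trainable parameters in one adapter: A (r x d_in) plus B (d_out x r)."""
    return r * (d_in + d_out)


def full_param_count(d_out: int, d_in: int) -> int:
    """Trainable parameters when the whole d_out x d_in matrix is fine-tuned."""
    return d_out * d_in


if __name__ == "__main__":
    # Hypothetical 4096 x 4096 projection, swept over a few ranks.
    for rank in (1, 8, 64, 512):
        print(rank, lora_param_count(4096, 4096, rank), full_param_count(4096, 4096))
```

The rank r bounds the rank of the weight update B A, so sweeping r changes how constrained the student's update is, while full fine-tuning removes the constraint altogether. This is the axis along which the abstract reports the inverted U-shaped transmission curve.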
Source: arXiv Generated at: 2026-06-02 00:00:00 UTC




