Subliminal Learning Is Steering Vector Distillation
Title: Steering Vector Distillation Drives Subliminal Learning
Abstract:
Subliminal learning occurs when a student language model inherits specific characteristics from a teacher (such as a system-prompted preference for owls) during fine-tuning on the teacher's outputs, even when those outputs lack any semantic connection to the traits in question. The mechanisms by which data devoid of semantic content can convey specific semantic attributes remain largely unexplained. This study reveals that subliminal learning is governed by a single steering vector, defined as a vector added to the model's internal activations.
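To make the definition above concrete, here is a minimal sketch of what "a vector added to the model's internal activations" means. It is not the authors' code: the function name `steer`, the `scale` parameter, and the tiny 3-dimensional hidden states are all hypothetical, and a real model would apply this to a chosen layer's activations rather than to plain lists.

```python
from typing import List

Vector = List[float]


def steer(activations: List[Vector], steering_vector: Vector, scale: float = 1.0) -> List[Vector]:
    """Add scale * steering_vector to the hidden state at every token position.

    Hypothetical helper for illustration only; the abstract does not say
    which layer or scale is used.
    """
    if any(len(h) != len(steering_vector) for h in activations):
        raise ValueError("hidden size and steering vector size must match")
    return [
        [h_i + scale * v_i for h_i, v_i in zip(h, steering_vector)]
        for h in activations
    ]


# Two token positions with a hidden size of 3 (made-up numbers).
hidden = [[0.5, -1.0, 2.0], [1.5, 0.0, -0.5]]
v = [1.0, 0.0, -1.0]

steered = steer(hidden, v, scale=2.0)
print(steered)
```

Every position is shifted by the same offset, `scale * v`, which is what makes the intervention a single direction in activation space rather than a token-dependent change.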
Our analysis of two open-source models indicates that the teacher's system prompt can be effectively approximated by a steering vector. Furthermore, the student's behavioral shifts are driven by the acquisition of an aligned vector throughout the fine-tuning process. Notably, system prompts that cannot be approximated by steering vectors are not subliminally learned. This phenomenon represents a specific instance of "steering vector distillation," where a student model, trained on the outputs of a steered teacher, learns to replicate that specific steering mechanism.
We validate steering vector distillation using various semantic and random vectors. The addition of a semantic vector to a model's activations can produce effects that are both model-independent and model-specific (non-semantic). Consequently, non-semantic generated data can transmit a vector with semantic implications, thereby facilitating subliminal learning. This mechanism also clarifies why subliminal learning fails to transfer across different models. Additionally, our findings highlight that adaptive optimizers are essential for subliminal learning in language models. Activation gradients derived from steered data contain a small but persistent component in the steering direction; however, non-adaptive optimizers hinder this process by permitting outlier gradients to dominate.
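The optimizer claim can be illustrated with a toy sketch. It is a deliberately simplified reading of the abstract, not the paper's experiment. It assumes a two-parameter problem in which coordinate 0 receives a small constant gradient (standing in for the persistent steering component) and coordinate 1 receives rare, large spikes of alternating sign (standing in for outlier gradients). The gradient values, step count, and learning rate are made up, and the axis-aligned setup is itself an assumption, since a real steering direction need not line up with individual parameters.

```python
import math
from typing import List, Tuple

Grad = Tuple[float, float]


def make_gradients(steps: int) -> List[Grad]:
    """Coordinate 0: small constant signal. Coordinate 1: rare large spikes."""
    grads = []
    sign = 1.0
    for t in range(steps):
        outlier = 0.0
        if t % 10 == 0:  # one spike every 10 steps, alternating in sign
            outlier = sign * 10.0
            sign = -sign
        grads.append((0.01, outlier))
    return grads


def run_sgd(grads: List[Grad], lr: float) -> Tuple[List[float], float]:
    """Plain SGD. Returns final weights and the largest single-coordinate step."""
    w = [0.0, 0.0]
    biggest = 0.0
    for g in grads:
        for i in range(2):
            step = -lr * g[i]
            w[i] += step
            biggest = max(biggest, abs(step))
    return w, biggest


def run_adam(grads: List[Grad], lr: float, beta1: float = 0.9,
             beta2: float = 0.999, eps: float = 1e-8) -> Tuple[List[float], float]:
    """Standard Adam update with bias correction, written out by hand."""
    w, m, v = [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]
    biggest = 0.0
    for t, g in enumerate(grads, start=1):
        for i in range(2):
            m[i] = beta1 * m[i] + (1 - beta1) * g[i]
            v[i] = beta2 * v[i] + (1 - beta2) * g[i] ** 2
            m_hat = m[i] / (1 - beta1 ** t)
            v_hat = v[i] / (1 - beta2 ** t)
            step = -lr * m_hat / (math.sqrt(v_hat) + eps)
            w[i] += step
            biggest = max(biggest, abs(step))
    return w, biggest


grads = make_gradients(200)
sgd_w, sgd_biggest = run_sgd(grads, lr=1e-3)
adam_w, adam_biggest = run_adam(grads, lr=1e-3)

# Net movement along the persistent coordinate, measured in units of the
# largest single step the optimizer ever took.
sgd_ratio = abs(sgd_w[0]) / sgd_biggest
adam_ratio = abs(adam_w[0]) / adam_biggest
print(f"SGD ratio:  {sgd_ratio:.3f}")
print(f"Adam ratio: {adam_ratio:.3f}")
```

In this toy, SGD's progress along the persistent coordinate is small compared with its largest step, because the step size is set by the spikes. Adam's per-coordinate normalization keeps every step near the learning rate, so the small constant gradient accumulates into a much larger net movement for the same step-size budget.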
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC




