Global News Digest

arXiv

Subliminal Learning Is Steering Vector Distillation

Title: Steering Vector Distillation Drives Subliminal Learning

Abstract:

Subliminal learning occurs when a student language model, fine-tuned on a teacher's outputs, inherits specific characteristics from the teacher (such as a system-prompted preference for owls) even when those outputs have no semantic connection to the traits in question. How data with no relevant semantic content can convey specific semantic attributes remains largely unexplained. This study reveals that subliminal learning is governed by a single steering vector, defined as a vector added to the model's internal activations.
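To make the definition concrete, the following is a minimal sketch of what "adding a vector to the model's internal activations" means, using plain Python lists in place of real model tensors. The function name `add_steering_vector`, the `scale` parameter, and all numbers are illustrative assumptions, not taken from the paper.

```python
def add_steering_vector(activations, steering_vector, scale=1.0):
    """Add scale * steering_vector to the activation at every token position.

    activations: list of per-token hidden states (each a list of floats).
    steering_vector: one list of floats with the same hidden size.
    """
    if any(len(row) != len(steering_vector) for row in activations):
        raise ValueError("hidden size of activations and steering vector must match")
    return [
        [a + scale * v for a, v in zip(row, steering_vector)]
        for row in activations
    ]


# Toy example: 2 token positions, hidden size 3 (all values are made up).
hidden = [[0.5, -1.0, 2.0], [1.5, 0.0, -0.5]]
owl_direction = [1.0, 0.0, -1.0]  # hypothetical "owl preference" direction
steered = add_steering_vector(hidden, owl_direction, scale=2.0)
print(steered)
```

In a real model the same addition would typically be applied to the hidden states of one chosen layer during the forward pass; which layer and what scale are used is not stated in the abstract.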

Our analysis of two open-source models indicates that the teacher’s system prompt can be effectively approximated by a steering vector. Furthermore, the student’s behavioral shifts are driven by the acquisition of an aligned vector throughout the fine-tuning process. Notably, system prompts that cannot be approximated by steering vectors are not subliminally learned. This phenomenon represents a specific instance of "steering vector distillation," where a student model, trained on the outputs of a steered teacher, learns to replicate that specific steering mechanism.
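The abstract does not say how the steering vector is fitted to the system prompt or how alignment is measured. One common way to sketch the idea, assumed here purely for illustration, is to take the difference between mean activations with and without the system prompt, and to compare the student's learned shift to it with cosine similarity. All function names and numbers below are hypothetical.

```python
import math


def mean_vector(rows):
    """Coordinate-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def difference_of_means(with_prompt, without_prompt):
    """Candidate steering vector: mean(with prompt) - mean(without prompt)."""
    mu_with = mean_vector(with_prompt)
    mu_without = mean_vector(without_prompt)
    return [a - b for a, b in zip(mu_with, mu_without)]


def cosine_similarity(u, v):
    """Cosine of the angle between u and v (1.0 means perfectly aligned)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy teacher activations (hidden size 2), with and without the system prompt.
teacher_with = [[1.0, 2.0], [3.0, 2.0]]
teacher_without = [[0.0, 1.0], [2.0, 1.0]]
teacher_vector = difference_of_means(teacher_with, teacher_without)

# Hypothetical activation shift acquired by the student during fine-tuning.
student_shift = [2.0, 2.0]
alignment = cosine_similarity(teacher_vector, student_shift)
print(teacher_vector, alignment)
```

A cosine similarity near 1 would correspond to the student having acquired a vector aligned with the teacher's, which is the pattern the abstract describes.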

We validate steering vector distillation using various semantic and random vectors. Adding a semantic vector to a model's activations can produce both model-independent effects and model-specific (non-semantic) effects. Consequently, generated data with no semantic connection to a trait can still transmit a vector with semantic implications, thereby facilitating subliminal learning. This mechanism also clarifies why subliminal learning fails to transfer across different models. Additionally, our findings highlight that adaptive optimizers are essential for subliminal learning in language models. Activation gradients derived from steered data contain a small but persistent component in the steering direction; non-adaptive optimizers, however, allow outlier gradients to dominate the updates and so hinder the acquisition of that component.
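The optimizer claim can be illustrated with a toy two-coordinate problem. This is a sketch under strong simplifying assumptions and not the paper's experiment: coordinate 0 receives a small constant gradient (standing in for the steering direction), coordinate 1 receives occasional large spikes (standing in for outlier gradients), and the gradients do not depend on the parameters. Plain SGD is compared with a textbook Adam update; every constant is made up.

```python
import math


def gradient(step):
    """Toy gradient: coordinate 0 is small and persistent, coordinate 1
    spikes every 10th step with alternating sign."""
    outlier = 0.0
    if step % 10 == 0:
        outlier = 50.0 if (step // 10) % 2 == 0 else -50.0
    return [0.01, outlier]


def sgd_updates(steps, lr):
    """Parameter updates of plain (non-adaptive) gradient descent."""
    return [[-lr * g for g in gradient(t)] for t in range(1, steps + 1)]


def adam_updates(steps, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """Parameter updates of standard Adam with bias correction."""
    m = [0.0, 0.0]
    v = [0.0, 0.0]
    updates = []
    for t in range(1, steps + 1):
        g = gradient(t)
        m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, g)]
        v = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, g)]
        m_hat = [mi / (1 - beta1 ** t) for mi in m]
        v_hat = [vi / (1 - beta2 ** t) for vi in v]
        updates.append(
            [-lr * mh / (math.sqrt(vh) + eps) for mh, vh in zip(m_hat, v_hat)]
        )
    return updates


def steering_share(updates):
    """Fraction of total update magnitude spent on the steering coordinate."""
    steering = sum(abs(u[0]) for u in updates)
    outlier = sum(abs(u[1]) for u in updates)
    return steering / (steering + outlier)


sgd_share = steering_share(sgd_updates(steps=100, lr=0.1))
adam_share = steering_share(adam_updates(steps=100, lr=0.1))
print(f"SGD share: {sgd_share:.4f}, Adam share: {adam_share:.4f}")
```

Because Adam rescales each coordinate by a running estimate of its gradient magnitude, the small persistent component receives updates of roughly the learning rate per step, whereas under SGD its updates stay proportional to the tiny raw gradient and are swamped by the spikes.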


Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC
