arXiv

Measuring Alignment-Induced Activation Shifts Correctly: A Template-Controlled Difference-in-Differences Protocol

Title: Accurately Quantifying Alignment-Induced Activation Shifts: A Protocol Using Template-Controlled Difference-in-Differences

Abstract

Analyzing the internal representations of language models before and after alignment offers a direct method for understanding the impact of safety training. Typically, this involves constructing a matrix of paired activation differences (aligned minus base) on safety-relevant inputs and examining its effective rank or primary direction. However, we demonstrate that the standard approach to constructing this matrix suffers from significant confounding. Because the aligned model is evaluated using a chat template that was absent during the base model’s evaluation, the resulting naive difference inadvertently mixes the true alignment shift with the effects of chat formatting.

To address this, we propose a four-part decomposition of the modification matrix into naive, template-controlled, within-aligned, and difference-in-differences (DiD) variants, which separates the alignment shift from the chat-template effect. Template control alone removes an inflation of the measured effective rank by a factor of 2.0 to 3.9 across Llama-3.1-8B, Gemma-2-9B, and Qwen-2.5-7B. The DiD contrast is needed to recover the refusal direction identified by Arditi et al. (2024), raising its cosine alignment from 0.18–0.39 to 0.50–0.86. Projection-ablation experiments across these three model families confirm that the recovered subspace is behaviorally significant and that singular-value order does not necessarily track causal importance. We validate the protocol on a controlled testbed and distill measurement recommendations for future activation-difference studies of alignment.
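The abstract does not define the four variants or the effective-rank measure, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes each model is run on the same prompts with and without the chat template, that the DiD contrast is (aligned minus base, templated) minus (aligned minus base, plain), and that effective rank is the exponential of the entropy of the normalized singular values. All function names, variable names, and the synthetic data are hypothetical.

```python
import numpy as np

def effective_rank(matrix):
    """Entropy-based effective rank (an assumed definition, not from the paper)."""
    s = np.linalg.svd(matrix, compute_uv=False)
    p = s / s.sum()
    p = p[p > 0]  # drop exact zeros so log() is defined
    return float(np.exp(-np.sum(p * np.log(p))))

def top_direction(matrix):
    """Leading right singular vector (unit norm, sign is arbitrary)."""
    _, _, vt = np.linalg.svd(matrix, full_matrices=False)
    return vt[0]

def abs_cosine(u, v):
    """Absolute cosine, since singular vectors are defined only up to sign."""
    return float(abs(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def difference_matrices(base_plain, base_templ, aligned_plain, aligned_templ):
    """Four difference matrices, each (n_prompts, d_model). Definitions are assumptions."""
    return {
        "naive": aligned_templ - base_plain,                # mixes template and alignment
        "template_controlled": aligned_templ - base_templ,  # same formatting on both sides
        "within_aligned": aligned_templ - aligned_plain,    # template effect inside aligned model
        "did": (aligned_templ - base_templ) - (aligned_plain - base_plain),
    }

# Synthetic toy data (hypothetical): a rank-4 template effect shared by both
# models and a rank-1 "alignment" shift along a known direction r.
rng = np.random.default_rng(0)
n_prompts, d_model = 64, 32
base_plain = rng.normal(size=(n_prompts, d_model))
template_effect = rng.normal(size=(n_prompts, 4)) @ rng.normal(size=(4, d_model))
r = rng.normal(size=d_model)
r /= np.linalg.norm(r)
shift = np.outer(rng.uniform(1.0, 2.0, size=n_prompts), r)

base_templ = base_plain + template_effect
aligned_plain = base_plain + 0.3 * shift  # shift only partly expressed without the template
aligned_templ = base_plain + template_effect + shift

mats = difference_matrices(base_plain, base_templ, aligned_plain, aligned_templ)
rank_naive = effective_rank(mats["naive"])
rank_controlled = effective_rank(mats["template_controlled"])
cos_naive = abs_cosine(top_direction(mats["naive"]), r)
cos_did = abs_cosine(top_direction(mats["did"]), r)

print(f"effective rank: naive={rank_naive:.2f}, controlled={rank_controlled:.2f}")
print(f"|cos| with r:   naive={cos_naive:.2f}, DiD={cos_did:.2f}")
```

In this toy setup the template effect enters only the naive difference, so the naive matrix has a higher effective rank than the template-controlled one, and the leading DiD direction lines up with the planted direction. This illustrates the kind of confound the abstract describes; it does not reproduce the reported numbers.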


Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC
