Measuring Alignment-Induced Activation Shifts Correctly: A Template-Controlled Difference-in-Differences Protocol
Title: Accurately Quantifying Alignment-Induced Activation Shifts: A Protocol Using Template-Controlled Difference-in-Differences
Abstract
Analyzing the internal representations of language models before and after alignment offers a direct method for understanding the impact of safety training. Typically, this involves constructing a matrix of paired activation differences (aligned minus base) on safety-relevant inputs and examining its effective rank or primary direction. However, we demonstrate that the standard approach to constructing this matrix suffers from significant confounding. Because the aligned model is evaluated using a chat template that was absent during the base model’s evaluation, the resulting naive difference inadvertently mixes the true alignment shift with the effects of chat formatting.
To address this, we propose a four-part decomposition of the modification matrix into naive, template-controlled, within-aligned, and difference-in-differences (DiD) variants, which separates the alignment shift from the chat-formatting effect. Template control alone removes an inflation of the measured effective rank by a factor of 2.0 to 3.9 across Llama-3.1-8B, Gemma-2-9B, and Qwen-2.5-7B. The DiD contrast is needed to recover the refusal direction identified by Arditi et al. (2024): it raises the cosine alignment with that direction from 0.18–0.39 to 0.50–0.86. Projection-ablation experiments across the three model families confirm that the recovered subspace is behaviorally significant and that singular-value order does not necessarily track causal importance. We validate the protocol on a controlled testbed and distill measurement recommendations for future activation-difference studies of alignment.
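As a rough illustration of the four variants, the sketch below builds them from a 2x2 design (base or aligned model, with or without the chat template). The abstract does not give the exact definitions, so everything here is an assumption: the reading of each variant, the function names, the entropy-based definition of effective rank, and the use of the top right singular vector as the "primary direction". Activation extraction is not shown; each input is taken to be an (n_prompts, d_model) array for the same prompts at the same layer and token position.

```python
import numpy as np


def modification_matrices(base_plain, base_tmpl, aligned_plain, aligned_tmpl):
    """Build four difference matrices from a 2x2 (model x template) design.

    ASSUMPTION: these definitions are one plausible reading of the abstract,
    not the paper's confirmed formulas.
    """
    return {
        # Aligned model with template vs. base model without it (confounded).
        "naive": aligned_tmpl - base_plain,
        # Both models see the same chat template.
        "template_controlled": aligned_tmpl - base_tmpl,
        # Template effect inside the aligned model only.
        "within_aligned": aligned_tmpl - aligned_plain,
        # Template effect in the aligned model minus template effect in the base model.
        "did": (aligned_tmpl - aligned_plain) - (base_tmpl - base_plain),
    }


def effective_rank(matrix):
    """Entropy-based effective rank: exp of the Shannon entropy of the
    normalized singular values. ASSUMPTION: the paper may use another definition."""
    s = np.linalg.svd(matrix, compute_uv=False)
    total = s.sum()
    if total == 0:
        return 0.0
    p = s / total
    p = p[p > 0]  # drop exact zeros so log is defined
    return float(np.exp(-(p * np.log(p)).sum()))


def top_direction_cosine(matrix, reference):
    """Absolute cosine between the top right singular vector and a reference
    direction (for example, a separately estimated refusal direction)."""
    _, _, vt = np.linalg.svd(matrix, full_matrices=False)
    top = vt[0]  # unit-norm by construction
    return float(abs(top @ reference) / np.linalg.norm(reference))
```

Under these assumed definitions, the naive matrix equals the template-controlled matrix plus the base model's own template effect (`base_tmpl - base_plain`), which is one way to see how formatting could leak into the naive measurement.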
Source: arXiv Generated at: 2026-06-02 00:00:00 UTC





