Lying Is Just a Phase: The Hidden Alignment Transition in Language Model Scaling
Title: Deception as a Transitional Stage: Uncovering the Alignment Shift in Language Model Scaling
Abstract:
Scaling laws forecast loss from compute, but they do not capture how distinct capabilities interact. This study investigates the relationship between reasoning proficiency and truthfulness by analyzing 63 base models drawn from 16 families. We identify a critical regime shift that is undetectable in standard loss curves: below a family-specific critical scale ($N_c$), reasoning and truthfulness are strongly anticorrelated ($r = -0.989$, $p = 4 \times 10^{-5}$ via nonparametric permutation test); above this threshold, the two capabilities align cooperatively. The critical scale $N_c$ averages approximately 3.5 billion parameters (95% bootstrap confidence interval [2.9B, 13.4B]). However, parameter count is not the sole determinant of this phase transition: architecture, data curation, and training methodology each independently influence $N_c$. For instance, refined training protocols removed the coupling dip between Qwen generations, raising the correlation from 0.025 to 0.830 at equivalent scales. Similarly, Gemma-4 achieves a coupling score of 0.871 at 4B, a level typically seen in standard-trained models exceeding 13B, by leveraging distillation and architectural advancements. Meanwhile, Phi demonstrates that 1B models can match the coupling performance of 10B web-trained models through data curation alone.
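The within-family statistic described above (a correlation between reasoning and truthfulness scores, with significance from a nonparametric permutation test) can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration: the function names, the two-sided form of the test, the number of permutations, and the five score pairs, which are synthetic and not taken from the paper's data.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson correlation between two equal-length score vectors."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))


def permutation_pvalue(x, y, n_perm=2000, seed=0):
    """Two-sided permutation test for the correlation (assumed variant).

    Shuffles y to break any pairing with x, and counts how often the
    shuffled |r| is at least as large as the observed |r|.
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    observed = pearson_r(x, y)
    hits = 0
    for _ in range(n_perm):
        if abs(pearson_r(x, rng.permutation(y))) >= abs(observed):
            hits += 1
    # The +1 terms keep the estimated p-value strictly above zero.
    return observed, (hits + 1) / (n_perm + 1)


# Synthetic benchmark scores for five sub-critical models of one
# hypothetical family (NOT data from the paper).
reasoning = [0.30, 0.34, 0.39, 0.45, 0.52]
truthfulness = [0.48, 0.45, 0.41, 0.38, 0.33]

r, p = permutation_pvalue(reasoning, truthfulness)
print(f"r = {r:.3f}, permutation p = {p:.4f}")
```

With only a handful of models per family, the smallest p-value a permutation test can reach is limited by the number of distinct orderings, so a value as small as the one reported presumably pools more points than this toy example uses.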
Our analysis suggests that width normalization removes anticorrelation across all tested families, pointing to a bottleneck in the output projection layer. Internally, 38 of the 40 models examined displayed no competing attention heads. Furthermore, a sparse-regression ODE model successfully cross-predicted held-out Llama-2 data with only 5.6% error. This diagnostic tool relies exclusively on public benchmark scores within a model family, requiring no access to internal model mechanics. The cooperative regime persists even at the frontier, showing a correlation of $r = +0.72$ across 34 models from 10 different laboratories. Proof-of-concept interventions confirm the exploitability of this bottleneck: injecting a single truth-direction vector at the identified layer corrected 60% of misaligned outputs during the transitional phase without any retraining. This surgical, per-inference correction modifies no weights. To support further research, we release code, data, an open-source steering CLI compatible with any open-weight model, and an interactive dashboard for phase diagnosis at https://zehenlabs.com/cape/.
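The intervention described above, adding a single direction vector to activations at one layer without touching any weights, can be sketched as follows. This is a generic activation-steering illustration under stated assumptions, not the released CLI: the function name `inject_direction`, the unit-normalization of the direction, the scale `alpha`, and the array shapes are all hypothetical.

```python
import numpy as np


def inject_direction(hidden, direction, alpha=4.0):
    """Add a scaled unit direction to every position's hidden state.

    hidden:    array of shape (seq_len, d_model), activations at one layer
    direction: array of shape (d_model,), the steering direction
    alpha:     steering strength (hypothetical default)

    Returns a new array; the inputs and any model weights are unchanged.
    """
    hidden = np.asarray(hidden, dtype=float)
    unit = np.asarray(direction, dtype=float)
    unit = unit / np.linalg.norm(unit)
    # Broadcasting adds the same vector at every sequence position.
    return hidden + alpha * unit


# Toy activations: 3 positions, 4 hidden dimensions (synthetic values).
hidden = np.zeros((3, 4))
direction = np.array([0.0, 2.0, 0.0, 0.0])
steered = inject_direction(hidden, direction, alpha=1.5)
```

In a real model the same addition would typically be applied through a forward hook at the chosen layer during inference, which is what makes the correction per-inference rather than a change to the trained parameters.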
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC





