arXiv

Lying Is Just a Phase: The Hidden Alignment Transition in Language Model Scaling

Title: Deception as a Transitional Stage: Uncovering the Alignment Shift in Language Model Scaling

Abstract:

Scaling laws forecast loss from compute, but they do not capture how distinct capabilities interact. This study examines the relationship between reasoning proficiency and truthfulness across 63 base models from 16 families. We identify a regime shift that standard loss curves do not reveal: below a family-specific critical scale ($N_c$), reasoning and truthfulness are strongly anticorrelated ($r = -0.989$, $p = 4 \times 10^{-5}$ via nonparametric permutation test); above this threshold, the two capabilities improve together. The critical scale $N_c$ averages approximately 3.5 billion parameters (95% bootstrap confidence interval [2.9B, 13.4B]). Parameter count is not the sole determinant of this phase transition, however: architecture, data curation, and training methodology each influence $N_c$ independently. For instance, refined training protocols removed the coupling dip between Qwen generations, raising the correlation from 0.025 to 0.830 at equivalent scales. Similarly, Gemma-4 achieves a coupling score of 0.871 at 4B, a level typically seen in standard-trained models exceeding 13B, by leveraging distillation and architectural advancements. Phi, in turn, shows that 1B models can match the coupling of 10B web-trained models through data curation alone.
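To make the reported significance test concrete, the following is a minimal sketch of a nonparametric permutation test for a Pearson correlation. It is not the authors' code: the function names, the two-sided formulation, the number of permutations, and the synthetic scores are all assumptions made for illustration.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation between two 1-D sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))

def permutation_pvalue(x, y, n_perm=10_000, seed=0):
    """Two-sided permutation p-value for the correlation of x and y.

    Hypothetical helper: shuffles y to break any pairing with x and
    counts how often the shuffled |r| reaches the observed |r|.
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    observed = pearson_r(x, y)
    hits = 0
    for _ in range(n_perm):
        if abs(pearson_r(x, rng.permutation(y))) >= abs(observed):
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return observed, (hits + 1) / (n_perm + 1)

# Synthetic, made-up benchmark scores (NOT data from the paper):
# reasoning rises while truthfulness falls across eight model sizes.
reasoning = [1, 2, 3, 4, 5, 6, 7, 8]
truthfulness = [8.1, 6.9, 6.2, 5.0, 3.8, 3.1, 2.2, 0.9]
r, p = permutation_pvalue(reasoning, truthfulness)
print(f"r = {r:.3f}, p = {p:.5f}")
```

Because the test only shuffles benchmark scores, it needs nothing beyond the public per-model numbers within a family.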

Our analysis suggests that width normalization removes anticorrelation across all tested families, pointing to a bottleneck in the output projection layer. Internally, 38 of the 40 models examined displayed no competing attention heads. Furthermore, a sparse-regression ODE model successfully cross-predicted held-out Llama-2 data with only 5.6% error. This diagnostic tool relies exclusively on public benchmark scores within a model family, requiring no access to internal model mechanics. The cooperative regime persists even at the frontier, showing a correlation of $r = +0.72$ across 34 models from 10 different laboratories. Proof-of-concept interventions confirm the exploitability of this bottleneck: injecting a single truth-direction vector at the identified layer corrected 60% of misaligned outputs during the transitional phase without any retraining. This surgical, per-inference correction modifies no weights. To support further research, we release code, data, an open-source steering CLI compatible with any open-weight model, and an interactive dashboard for phase diagnosis at https://zehenlabs.com/cape/.
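The intervention described above, adding one direction vector to activations at a chosen layer without touching weights, can be sketched on plain arrays. This is an illustrative assumption, not the released steering CLI: the function `steer`, the scale `alpha`, the unit-normalization step, and the toy shapes are hypothetical, and how the truth direction or the layer is identified is not shown.

```python
import numpy as np

def steer(hidden, direction, alpha=1.0):
    """Add alpha times the unit-normalized direction to each token's hidden state.

    hidden:    array of shape (num_tokens, d_model), activations at one layer
    direction: array of shape (d_model,), the steering direction (hypothetical)
    alpha:     scalar steering strength (hypothetical parameter)
    """
    hidden = np.asarray(hidden, dtype=float)
    direction = np.asarray(direction, dtype=float)
    unit = direction / np.linalg.norm(direction)
    # Broadcasting applies the same offset to every token position.
    return hidden + alpha * unit

# Toy example: two token positions, three hidden dimensions.
hidden = np.zeros((2, 3))
direction = np.array([3.0, 0.0, 4.0])
steered = steer(hidden, direction, alpha=2.0)
print(steered)
```

In a real open-weight model the same addition would typically be applied to a layer's output during the forward pass, which is why the correction is per-inference and leaves the weights unchanged.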


Source: arXiv Generated at: 2026-06-02 00:00:00 UTC
