arXiv

When Autoregressive Consistency Hurts Safety Alignment

Title: The Detrimental Impact of Autoregressive Consistency on Safety Alignment

Abstract

Safety alignment in large language models (LLMs) is often fragile because it is shallow: fine-tuning mainly changes model behavior on the first few output tokens. We posit that this issue is rooted in "autoregressive consistency," the tendency of next-token prediction to preserve and extend the current response trajectory. By analyzing the learning dynamics of safety alignment, we show that autoregressive consistency tends to concentrate alignment updates on early tokens, which gives a mechanistic explanation for shallow safety alignment.
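For reference, the next-token setting that this argument relies on can be written in standard notation. The symbols below (prompt x, response tokens y_1..y_T, parameters \theta) are assumptions introduced here for illustration and are not taken from the paper; the block only restates the usual autoregressive factorization and its per-token loss, not the paper's own analysis.

```latex
% Standard autoregressive factorization of a response y = (y_1, ..., y_T)
% given a prompt x. Notation is assumed here, not the paper's.
p_\theta(y \mid x) = \prod_{t=1}^{T} p_\theta\left(y_t \mid x, y_{<t}\right)

% The corresponding negative log-likelihood splits into one term per position:
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(y_t \mid x, y_{<t}\right)
```

Each factor conditions on every token generated so far, so once a prefix y_{<t} is fixed, all later tokens are predicted conditional on it. This is the sense in which a response trajectory, once started, tends to be extended.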

The same mechanism predicts a broader class of LLM vulnerabilities: attacks that force harmful continuation states at any point along the output trajectory. To illustrate this, we present the "random insertion attack," which embeds a brief harmful segment into an otherwise safe refusal sequence. By exploiting autoregressive consistency, the attack sustains the harmful branch and circumvents safety alignment. Even a short harmful span can redirect generation toward harmful content, persisting well beyond a lengthy refusal prefix. This identifies autoregressive consistency as a significant, general failure mode.
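The sequence manipulation described above can be sketched at the token-list level. This is a minimal illustration under stated assumptions, not the paper's implementation: the function name `insert_span`, the uniform choice of position, and the placeholder tokens are all hypothetical, and the abstract does not specify how the position is chosen or how the resulting sequence is fed back to a model.

```python
import random
from typing import List, Optional, Tuple


def insert_span(
    refusal_tokens: List[str],
    span_tokens: List[str],
    rng: Optional[random.Random] = None,
) -> Tuple[List[str], int]:
    """Insert span_tokens at a uniformly random position in refusal_tokens.

    Hypothetical helper: uniform sampling of the position is an assumption.
    Returns the mixed sequence and the index where the span starts.
    """
    rng = rng or random.Random()
    # randint is inclusive at both ends, so the span may also be placed
    # before the first refusal token or after the last one.
    pos = rng.randint(0, len(refusal_tokens))
    mixed = refusal_tokens[:pos] + span_tokens + refusal_tokens[pos:]
    return mixed, pos


if __name__ == "__main__":
    # Placeholder tokens only; no real model or tokenizer is involved.
    refusal = ["I", "cannot", "help", "with", "that", "."]
    span = ["<SPAN_A>", "<SPAN_B>"]
    mixed, pos = insert_span(refusal, span, random.Random(0))
    print(pos, mixed)
```

The sketch leaves the refusal tokens on both sides of the span untouched, which matches the description of a segment embedded in an otherwise safe refusal sequence.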

Consequently, we argue that safety alignment must actively disrupt harmful autoregressive consistency across the entire output trajectory. In response, we introduce "adversarial safety alignment," a preliminary framework grounded in worst-case harmful continuation states. We instantiate this framework using "random worst-insertion training." Ultimately, our findings indicate that autoregressive consistency must be regarded as a pivotal factor in both the design of safety alignment protocols and the development of adversarial attacks.
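One way to read "random worst-insertion training" is as a data-construction step: sample a random position in a refusal, pick the most damaging span from a candidate pool, and train the model to recover from that forced state. The sketch below follows that reading only. The abstract does not describe the procedure, so every name here (`build_training_example`, `score_fn`, `recovery_tokens`, the dictionary keys) is a hypothetical assumption, and `score_fn` stands in for whatever measure of harmful continuation a real setup would use.

```python
import random
from typing import Callable, Dict, List, Optional


def build_training_example(
    prompt_tokens: List[str],
    refusal_tokens: List[str],
    candidate_spans: List[List[str]],
    score_fn: Callable[[List[str], List[str]], float],
    recovery_tokens: List[str],
    rng: Optional[random.Random] = None,
) -> Dict[str, object]:
    """Build one (context, target) pair from a forced continuation state.

    Assumed reading, not the paper's method: the position is sampled
    uniformly, the span is the candidate with the highest score_fn value,
    and the target is a fixed recovery sequence.
    """
    rng = rng or random.Random()
    pos = rng.randint(0, len(refusal_tokens))  # inclusive at both ends
    prefix = refusal_tokens[:pos]
    # "Worst" is taken to mean the candidate scoring highest given the prefix.
    worst = max(candidate_spans, key=lambda s: score_fn(prefix, s))
    return {
        "context": prompt_tokens + prefix + worst,
        "target": list(recovery_tokens),
        "position": pos,
        "inserted_span": worst,
    }


if __name__ == "__main__":
    # Toy scorer (span length) and placeholder tokens, for illustration only.
    example = build_training_example(
        prompt_tokens=["<PROMPT>"],
        refusal_tokens=["I", "cannot", "help", "."],
        candidate_spans=[["<A>"], ["<B>", "<C>"]],
        score_fn=lambda prefix, span: float(len(span)),
        recovery_tokens=["I", "will", "not", "continue", "."],
        rng=random.Random(1),
    )
    print(example)
```

Sampling the position per example spreads the training signal across the whole output trajectory rather than only the first tokens, which is the property the paragraph above argues for.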


Source: arXiv. Generated at: 2026-06-04 00:00:00 UTC
