arXiv

Inference-Time Vulnerability Beyond Shallow Safety: Alignment Along Generation Trajectories

Abstract:

Large Language Models (LLMs) that have undergone safety alignment remain susceptible to inference-time interventions that steer their outputs toward harmful content. Recent studies attribute this to "shallow safety," a phenomenon in which alignment is concentrated in the first few output tokens; we demonstrate that this is a special case of a more pervasive inference-time vulnerability. Specifically, we show that injecting a short sequence of tokens at any point during generation can significantly degrade the model's safety behavior on the tokens that follow.
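
To make the idea of a mid-sequence injection concrete, the following is a minimal sketch rather than the paper's actual procedure. It assumes a hypothetical `next_token` callable standing in for one greedy decoding step, and a toy model whose "refuse"/"comply" behavior and "sure" trigger token are invented purely for illustration.

```python
from typing import Callable, List, Sequence

# Hypothetical stand-in for one decoding step of a model:
# maps the tokens seen so far to the next token.
NextTokenFn = Callable[[Sequence[str]], str]


def generate_with_injection(
    next_token: NextTokenFn,
    prompt: Sequence[str],
    injected: Sequence[str],
    inject_at: int,
    max_new_tokens: int,
) -> List[str]:
    """Decode step by step, splicing `injected` into the context once the
    model has produced `inject_at` tokens of its own.

    Returns everything after the prompt (model tokens plus injected tokens).
    """
    context = list(prompt)
    output: List[str] = []
    generated = 0  # counts model-produced tokens only
    while generated < max_new_tokens:
        if generated == inject_at:
            # The perturbation: tokens the model did not choose itself.
            context.extend(injected)
            output.extend(injected)
        token = next_token(context)
        context.append(token)
        output.append(token)
        generated += 1
    return output


def toy_model(context: Sequence[str]) -> str:
    """Toy assumption: refuses unless "sure" appears in the last 3 tokens."""
    return "comply" if "sure" in context[-3:] else "refuse"


clean = generate_with_injection(toy_model, ["ask"], ["sure"], inject_at=10, max_new_tokens=4)
perturbed = generate_with_injection(toy_model, ["ask"], ["sure"], inject_at=2, max_new_tokens=4)
```

In this toy setting, setting `inject_at=0` corresponds to the early-token case associated with shallow safety, while larger values perturb the model after it has already begun refusing.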

Furthermore, our analysis reveals that how strongly a model’s hidden states align with refusal directions does not accurately predict its resilience to such injections. This finding indicates that internal state representations alone are insufficient to guarantee stable generation behavior under perturbation. To mitigate these risks, we propose aligning models directly on generation trajectories obtained by simulating mid-sequence perturbations. This approach not only improves robustness against mid-sequence injections but also generalizes to attacks that target early-token generation. Our results underscore the need to train on the generative process itself, rather than exclusively on final outputs, to achieve robust safety alignment.
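
One common way to quantify "alignment with a refusal direction" is the cosine similarity between each hidden state and a fixed direction vector. The sketch below assumes that reading; the abstract does not specify the exact metric, and the function names and example vectors are hypothetical.

```python
import math
from typing import List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between u and v; 0.0 if either vector is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def refusal_alignment(
    hidden_states: Sequence[Sequence[float]],
    refusal_direction: Sequence[float],
) -> List[float]:
    """Per-step alignment score along one generation trajectory."""
    return [cosine_similarity(h, refusal_direction) for h in hidden_states]


# Hypothetical 2-dimensional hidden states for three decoding steps.
scores = refusal_alignment(
    [[1.0, 0.0], [0.0, 2.0], [-3.0, 0.0]],
    refusal_direction=[1.0, 0.0],
)
```

The abstract's claim is that a score of this kind, even when high, does not reliably indicate that the model will keep refusing after an injection, which is the motivation for training on perturbed trajectories directly.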


Source: arXiv Generated at: 2026-06-04 00:00:00 UTC
