Inference-Time Vulnerability Beyond Shallow Safety: Alignment Along Generation Trajectories
Abstract:
Safety-aligned Large Language Models (LLMs) remain susceptible to inference-time interventions that steer their outputs toward harmful content. Recent studies attribute this to "shallow safety," in which alignment is concentrated in the first few output tokens; we show that this is only one instance of a broader inference-time vulnerability. Specifically, injecting a short sequence of tokens at any point during generation can significantly disrupt the model's subsequent safety behavior.
Furthermore, our analysis reveals that a model’s alignment with refusal directions within its hidden states is not an accurate predictor of its resilience to such injections. This finding indicates that internal state representations alone are insufficient to guarantee stable generation behavior when subjected to perturbations. To mitigate these risks, we propose aligning models directly on generation trajectories derived from simulations of mid-sequence perturbations. This approach not only enhances robustness against mid-sequence injections but also generalizes effectively to attacks that target early-token generation. Our results underscore the necessity of training on the generative process itself, rather than focusing exclusively on final outputs, to achieve truly robust safety alignment.
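The abstract does not say how mid-sequence perturbations are simulated. As a rough illustration only, the sketch below shows one way a forced token sequence could be spliced into a greedy decoding loop at a chosen position. Every name in it (`generate_with_injection`, `toy_next_token`, `inject_at`) is hypothetical, and the toy "model" is a stand-in rule, not an actual LLM or the authors' procedure.

```python
from typing import Callable, List, Sequence

Token = str
NextTokenFn = Callable[[Sequence[Token]], Token]


def generate_with_injection(
    next_token: NextTokenFn,
    prompt: Sequence[Token],
    injection: Sequence[Token],
    inject_at: int,
    max_new_tokens: int,
) -> List[Token]:
    """Decode greedily, forcing `injection` into the output once
    `inject_at` tokens have been generated (hypothetical helper)."""
    output: List[Token] = []
    injected = False
    while len(output) < max_new_tokens:
        if not injected and len(output) >= inject_at:
            # Forced tokens: appended verbatim rather than chosen by the model.
            output.extend(injection)
            injected = True
            continue
        # The model conditions on the prompt plus everything emitted so far,
        # including any injected tokens.
        output.append(next_token(list(prompt) + output))
    return output[:max_new_tokens]


def toy_next_token(context: Sequence[Token]) -> Token:
    """Assumed stand-in for a model: refuses unless 'Sure' is in the context."""
    return "comply" if "Sure" in context else "refuse"


prompt = ["request"]
clean = generate_with_injection(toy_next_token, prompt, [], 0, 4)
perturbed = generate_with_injection(toy_next_token, prompt, ["Sure"], 2, 5)
print(clean)
print(perturbed)
```

In this toy setting the unperturbed trajectory refuses throughout, while the perturbed one flips after the injected token. Trajectories of this kind, with the injection point varied, are the sort of data that trajectory-level alignment would presumably train on; the actual construction used in the paper may differ.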
Source: arXiv
Generated at: 2026-06-04 00:00:00 UTC





