Title: Jailbreak Attack Initializations as Extractors of Compliance Directions
Abstract: Safety-aligned large language models (LLMs) respond to prompts with either compliance or refusal, and each behavior corresponds to a distinct direction in the model's activation space. Recent studies show that initializing attacks through self-transfer from other prompts significantly boosts their efficacy, yet the mechanisms behind these initializations remain unclear, and current methods rely on arbitrary or manually selected starting points. This study shows that gradient-based jailbreak attacks, along with their subsequent initializations, progressively converge toward a single compliance direction. This convergence suppresses refusal and enables an efficient shift from refusal to compliance. Leveraging this finding, we introduce CRI, an initialization framework designed to project unseen prompts further along compliance directions. Our evaluation across multiple models, datasets, and attack methods shows that CRI improves attack success rate (ASR) while lowering computational cost, underscoring the vulnerability of safety-aligned LLMs. A reference implementation is available at: https://amit1221levi.github.io/CRI-Jailbreak-Init-LLMs-evaluation.
Source: arXiv
Generated at: 2026-06-03 00:00:00 UTC





