Few Tokens, Big Leverage: Preserving Safety Alignment by Constraining Safety Tokens during Fine-tuning
Title: Minimal Tokens, Maximum Control: Safeguarding Safety Alignment by Regulating Safety Tokens During Fine-Tuning
Abstract:
Fine-tuning (FT) is essential for adapting Large Language Models (LLMs) to specific downstream tasks, but it frequently causes safety-alignment drift, even when the training corpus consists exclusively of benign content. Previous studies have shown that including even a small amount of malicious data in the fine-tuning set can significantly degrade an LLM’s refusal mechanisms, leading it to comply with harmful prompts. Current defenses typically rely on broad, model-wide interventions, such as freezing specific parameters or adding extra safety data, which can hinder generalization and impair performance on the target task.
To overcome these limitations, we introduce a fine-tuning framework named Preserving Safety Alignment via Constrained Tokens (PACT), which stabilizes the model’s confidence on safety tokens. Our approach is grounded in the empirical finding that safety-aligned behavior is reflected in token-level output confidence, and that this confidence is concentrated in a small subset of safety-related tokens. While fine-tuning for a downstream application, PACT regularizes the model so that, at every response step, its confidence on safety-related tokens matches that of the already-aligned reference model. Non-safety tokens remain largely unconstrained, which allows effective task adaptation. This targeted constraint prevents alignment drift while avoiding the utility trade-offs associated with global restrictions. The source code for PACT is available at https://github.com/Glresearch1/PACT.
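The abstract does not give the exact loss, so the following is only an illustrative sketch of the general idea, not the authors' implementation (see the repository above for that). It assumes the regularizer is a squared gap between the fine-tuned and reference models' probabilities on a fixed set of safety-token IDs, averaged over response steps, and that it is added to the usual task loss with a weight `lam`. The function names, the squared-gap form, the `safety_ids` set, and `lam` are all hypothetical choices made for illustration.

```python
import math
from typing import Sequence, Set


def softmax(logits: Sequence[float]) -> list:
    """Numerically stable softmax over one step's vocabulary logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def safety_confidence_penalty(
    ft_logits: Sequence[Sequence[float]],
    ref_logits: Sequence[Sequence[float]],
    safety_ids: Set[int],
) -> float:
    """Assumed regularizer: squared probability gap on safety tokens only,
    averaged over response steps. Non-safety tokens contribute no term."""
    penalty = 0.0
    for ft_step, ref_step in zip(ft_logits, ref_logits):
        p_ft = softmax(ft_step)
        p_ref = softmax(ref_step)
        penalty += sum((p_ft[i] - p_ref[i]) ** 2 for i in safety_ids)
    return penalty / max(len(ft_logits), 1)


def task_nll(ft_logits: Sequence[Sequence[float]], targets: Sequence[int]) -> float:
    """Standard next-token negative log-likelihood on the downstream data."""
    nll = 0.0
    for step, target in zip(ft_logits, targets):
        nll -= math.log(softmax(step)[target])
    return nll / max(len(targets), 1)


def total_loss(ft_logits, ref_logits, targets, safety_ids, lam=1.0):
    """Task loss plus the weighted safety-token penalty (lam is hypothetical)."""
    penalty = safety_confidence_penalty(ft_logits, ref_logits, safety_ids)
    return task_nll(ft_logits, targets) + lam * penalty


# Toy example: 4-token vocabulary, token 0 plays the role of a safety token.
ref = [[2.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
drifted = [[0.0, 2.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
print(safety_confidence_penalty(ref, ref, {0}))      # no drift on the safety token
print(safety_confidence_penalty(drifted, ref, {0}))  # confidence moved away from it
```

In this sketch the penalty is zero when the fine-tuned model reproduces the reference model's confidence on the safety tokens and grows as that confidence drifts. Probabilities on all other tokens are never compared directly, which mirrors the abstract's point that non-safety tokens are left largely unconstrained.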
Source: arXiv
Generated at: 2026-06-04 00:00:00 UTC




