arXiv

Few Tokens, Big Leverage: Preserving Safety Alignment by Constraining Safety Tokens during Fine-tuning

Abstract:

Fine-tuning (FT) is essential for adapting Large Language Models (LLMs) to specific downstream tasks, but it frequently causes safety-alignment drift, even when the training corpus consists exclusively of benign content. Previous studies have shown that adding even a small amount of malicious data to the fine-tuning set can significantly degrade an LLM’s refusal behavior, leading it to comply with harmful prompts. Current defenses typically apply broad, model-wide interventions, such as freezing specific parameters or adding extra safety data, which can hinder generalization and impair performance on target tasks.

To overcome these constraints, we introduce Preserving Safety Alignment via Constrained Tokens (PACT), a fine-tuning framework that stabilizes the model’s confidence on safety tokens. The approach is grounded in the empirical finding that safety-aligned behavior is reflected in token-level output confidence, and that this confidence is concentrated in a small subset of safety-related tokens. During fine-tuning on a downstream task, PACT regularizes the fine-tuned model so that its confidence on safety-related tokens matches that of the safety-aligned reference model at every response step, while non-safety tokens remain largely unconstrained, which allows effective task adaptation. This targeted constraint prevents alignment drift without the utility trade-offs of global restrictions. The source code for PACT is available at https://github.com/Glresearch1/PACT.
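
The token-level constraint described above can be illustrated with a minimal sketch. The abstract does not state the exact form of the regularizer, so everything below is an assumption for illustration only: the squared-difference penalty, the function names (safety_token_penalty, task_nll, combined_loss), the weight parameter, and the way safety token ids are supplied are all hypothetical and are not taken from the PACT code base. The sketch only shows the general shape of the idea: a standard task loss plus a penalty that compares fine-tuned and reference probabilities on a small set of safety tokens at each response step.

```python
import math


def softmax(logits):
    """Convert one step's logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def safety_token_penalty(ft_logits, ref_logits, safety_token_ids):
    """Squared gap between fine-tuned and reference probabilities.

    Assumption: a squared difference is used here purely for illustration;
    the paper may use a different divergence. Only tokens listed in
    safety_token_ids contribute, so all other tokens are unconstrained.
    The result is averaged over response steps.
    """
    if not ft_logits or len(ft_logits) != len(ref_logits):
        raise ValueError("need the same non-zero number of response steps")
    total = 0.0
    for ft_step, ref_step in zip(ft_logits, ref_logits):
        p_ft = softmax(ft_step)
        p_ref = softmax(ref_step)
        total += sum((p_ft[i] - p_ref[i]) ** 2 for i in safety_token_ids)
    return total / len(ft_logits)


def task_nll(ft_logits, target_ids):
    """Average negative log-likelihood of the downstream-task targets."""
    if not ft_logits or len(ft_logits) != len(target_ids):
        raise ValueError("need one target id per response step")
    nll = 0.0
    for step, target in zip(ft_logits, target_ids):
        nll -= math.log(softmax(step)[target])
    return nll / len(ft_logits)


def combined_loss(ft_logits, ref_logits, target_ids, safety_token_ids, weight=1.0):
    """Task loss plus the weighted safety-token penalty (hypothetical form)."""
    penalty = safety_token_penalty(ft_logits, ref_logits, safety_token_ids)
    return task_nll(ft_logits, target_ids) + weight * penalty
```

In this sketch each inner list holds the vocabulary logits for one response step, and the reference logits would come from a frozen copy of the aligned model. When the fine-tuned model matches the reference on the safety tokens, the penalty vanishes and only the task loss remains, which is the behavior the abstract attributes to leaving non-safety tokens largely unconstrained.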


Source: arXiv Generated at: 2026-06-04 00:00:00 UTC
