DataShield: Safety-degrading Data Filtering for LLM Benign Instruction Fine-Tuning
Title: DataShield: Mitigating Safety Degradation in LLM Benign Instruction Fine-Tuning via Data Filtering
Abstract:
Large Language Models (LLMs) often experience a decline in safety performance even when subjected to fine-tuning with harmless datasets. While current approaches attempt to pinpoint safety-eroding samples within these benign corpora, they are frequently hampered by prohibitive computational expenses and substantial noise interference. To address these challenges, we introduce DataShield, a novel framework designed to efficiently and accurately detect potential safety-degrading instances.
Our approach is grounded in the observation that benign fine-tuning generally elevates the overall compliance of LLMs. Leveraging this, DataShield quantifies the extent to which each data sample influences the model's compliance behavior, assigning it a "safety degradation score." The architecture of DataShield comprises three primary modules:
- Compliance Vector Extraction: This component analyzes and captures the LLM's tendency toward compliance.
- Compliance-Aware Score (CAS): A new metric introduced to automatically locate the most safety-critical layer within the model.
- Safety-degrading Sample Filtering: This module measures the shift in training data projections along the compliance axis to identify risky samples.
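The three modules above can be illustrated with a minimal sketch. This is not the authors' implementation, and the abstract does not give the exact formulas. The sketch assumes the compliance vector is a difference of mean activations between complied and refused prompts, that a layer has already been selected (the CAS computation is not reproduced here), and that each training sample comes with two hypothetical activation snapshots, `before` and `after`, whose projection shift serves as the score. All function names and the toy data are illustrative assumptions.

```python
import math


def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def compliance_direction(complied, refused):
    """Unit vector pointing from the mean 'refused' activation to the
    mean 'complied' activation (assumed difference-of-means estimate)."""
    diff = [c - r for c, r in zip(mean_vector(complied), mean_vector(refused))]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        raise ValueError("complied and refused means coincide")
    return [d / norm for d in diff]


def projection(vec, direction):
    """Scalar projection of vec onto a unit-length direction."""
    return sum(v * d for v, d in zip(vec, direction))


def degradation_scores(before, after, direction):
    """Per-sample shift of the projection along the compliance direction.
    A larger positive shift is treated here as a higher risk score."""
    return [
        projection(a, direction) - projection(b, direction)
        for b, a in zip(before, after)
    ]


def filter_top_k(scores, k):
    """Indices of the k samples with the largest shift."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


# Toy 2-dimensional activations standing in for hidden states at one layer.
complied = [[1.0, 0.0], [3.0, 0.0]]
refused = [[-1.0, 0.0], [-3.0, 0.0]]
direction = compliance_direction(complied, refused)

before = [[0.0, 0.0], [0.0, 1.0], [0.0, 2.0]]
after = [[0.5, 0.0], [2.0, 1.0], [-1.0, 2.0]]
scores = degradation_scores(before, after, direction)
risky = filter_top_k(scores, k=1)
print(scores, risky)
```

In practice the toy lists would be replaced by hidden states read from the chosen layer of the model, and the cutoff `k` (or a score threshold) would be tuned on held-out safety evaluations.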
We validated the efficacy of our method through extensive experiments on Llama3-8B, Llama3.1-8B, and Qwen2.5-7B, utilizing the Alpaca and Dolly datasets. The results confirm that DataShield effectively distinguishes between high-risk and low-risk data subsets. Additionally, our analysis revealed that open-ended question-answering tasks are more prone to inducing safety degradation, with the associated responses typically being longer. This study aims to offer fresh perspectives on data-centric defense strategies. The source code is publicly accessible at: https://github.com/ZJunBo/DataShield.
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC




