arXiv

DataShield: Safety-degrading Data Filtering for LLM Benign Instruction Fine-Tuning

June 2, 2026 · Junbo Zhang, Qianli Zhou, Xinyang Deng, Wen Jiang, Jie Pan, Jinbiao Zhu

Title: DataShield: Mitigating Safety Degradation in LLM Benign Instruction Fine-Tuning via Data Filtering

Abstract:

Large Language Models (LLMs) often lose safety performance even when fine-tuned on benign datasets. Existing approaches try to pinpoint the safety-eroding samples within these benign corpora, but they are frequently hampered by high computational cost and substantial noise. To address these challenges, we introduce DataShield, a novel framework designed to detect potential safety-degrading samples efficiently and accurately.

Our approach is grounded in the observation that benign fine-tuning generally elevates the overall compliance of LLMs. Leveraging this, DataShield quantifies the extent to which each data sample influences the model’s compliance behavior, assigning it a "safety degradation score." The architecture of DataShield comprises three primary modules:

Compliance Vector Extraction: This component analyzes and captures the LLM’s tendency toward compliance.
Compliance-Aware Score (CAS): A new metric introduced to automatically locate the most safety-critical layer within the model.
Safety-degrading Sample Filtering: This module measures the shift in training data projections along the compliance axis to identify risky samples.
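
The scoring idea behind these modules can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how the compliance vector is computed, how CAS is defined, or which two activations the projection shift compares. The sketch assumes that hidden-state activations at an already chosen layer are available as NumPy arrays, that the compliance direction is a difference of mean activations between complied and refused prompts (a common stand-in technique), and that the shift is taken between a reference activation and a sample-conditioned activation. All function names and the `keep_fraction` parameter are hypothetical.

```python
import numpy as np


def compliance_direction(complied: np.ndarray, refused: np.ndarray) -> np.ndarray:
    """Unit vector pointing from the mean 'refused' activation to the mean
    'complied' activation (assumed difference-of-means estimate).

    Both inputs have shape (n_prompts, hidden_dim).
    """
    diff = complied.mean(axis=0) - refused.mean(axis=0)
    norm = np.linalg.norm(diff)
    if norm == 0.0:
        raise ValueError("complied and refused activations have identical means")
    return diff / norm


def degradation_scores(reference: np.ndarray, shifted: np.ndarray,
                       direction: np.ndarray) -> np.ndarray:
    """Per-sample shift of the projection onto the compliance direction.

    reference, shifted: shape (n_samples, hidden_dim). Which two activations
    are compared is an assumption of this sketch. A larger positive score
    means the sample moves the representation further toward compliance.
    """
    return (shifted - reference) @ direction


def split_by_risk(scores: np.ndarray, keep_fraction: float = 0.8):
    """Return (low_risk_idx, high_risk_idx), keeping the lowest-scoring
    fraction of samples as the low-risk subset."""
    order = np.argsort(scores)  # ascending: lowest shift first
    n_keep = int(round(keep_fraction * len(scores)))
    return order[:n_keep], order[n_keep:]
```

Under these assumptions, the low-risk indices would be the ones retained for fine-tuning, while the high-risk indices mark candidates for removal. Ranking by a fixed fraction rather than an absolute threshold is a design choice of the sketch, made because raw projection magnitudes are not comparable across models or layers.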

We validated our method through extensive experiments on Llama3-8B, Llama3.1-8B, and Qwen2.5-7B, using the Alpaca and Dolly datasets. The results confirm that DataShield effectively distinguishes between high-risk and low-risk data subsets. Our analysis also revealed that open-ended question-answering tasks are more prone to inducing safety degradation, and that their responses are typically longer. This study aims to offer fresh perspectives on data-centric defense strategies. The source code is publicly available at: https://github.com/ZJunBo/DataShield.
