Global News Digest

arXiv

DataShield: Safety-degrading Data Filtering for LLM Benign Instruction Fine-Tuning

Title: DataShield: Mitigating Safety Degradation in LLM Benign Instruction Fine-Tuning via Data Filtering

Abstract:

Large Language Models (LLMs) often experience a decline in safety performance even when subjected to fine-tuning with harmless datasets. While current approaches attempt to pinpoint safety-eroding samples within these benign corpora, they are frequently hampered by prohibitive computational expenses and substantial noise interference. To address these challenges, we introduce DataShield, a novel framework designed to efficiently and accurately detect potential safety-degrading instances.

Our approach is grounded in the observation that benign fine-tuning generally elevates the overall compliance of LLMs. Leveraging this, DataShield quantifies the extent to which each data sample influences the model’s compliance behavior, assigning it a "safety degradation score." The architecture of DataShield comprises three primary modules:

  1. Compliance Vector Extraction: This component analyzes and captures the LLM’s tendency toward compliance.
  2. Compliance-Aware Score (CAS): A new metric introduced to automatically locate the most safety-critical layer within the model.
  3. Safety-degrading Sample Filtering: This module measures the shift in training data projections along the compliance axis to identify risky samples.
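
The abstract does not give formulas for these modules, so the following is only a minimal sketch of how steps 1 and 3 might fit together, not the authors' implementation. It assumes that the compliance vector is a difference of mean hidden activations between compliant and refusing responses at one already-chosen layer, and that each training sample has an activation measured before and after some hypothetical update on that sample. All function names, the toy activations, and the threshold are illustrative assumptions. Layer selection via CAS (step 2) is not sketched because the abstract does not define the metric.

```python
from math import sqrt


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def compliance_direction(compliant_acts, refusal_acts):
    """Unit vector pointing from mean refusal activation to mean compliant activation.

    ASSUMPTION: difference-of-means is a stand-in for the paper's extraction step.
    """
    mean_compliant = mean_vector(compliant_acts)
    mean_refusal = mean_vector(refusal_acts)
    diff = [c - r for c, r in zip(mean_compliant, mean_refusal)]
    norm = sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        raise ValueError("compliant and refusal means coincide; no direction")
    return [d / norm for d in diff]


def projection(vector, direction):
    """Scalar projection of an activation onto the (unit) compliance direction."""
    return sum(v * d for v, d in zip(vector, direction))


def degradation_scores(acts_before, acts_after, direction):
    """Per-sample shift along the compliance axis (after minus before).

    ASSUMPTION: a larger positive shift is read as higher safety-degradation risk.
    """
    return [
        projection(after, direction) - projection(before, direction)
        for before, after in zip(acts_before, acts_after)
    ]


def split_by_threshold(scores, threshold):
    """Return (kept_indices, flagged_indices); samples above threshold are flagged."""
    kept = [i for i, s in enumerate(scores) if s <= threshold]
    flagged = [i for i, s in enumerate(scores) if s > threshold]
    return kept, flagged


# Toy 2-dimensional activations, invented purely for illustration.
compliant_acts = [[1.0, 0.0], [3.0, 0.0]]
refusal_acts = [[0.0, 0.0], [0.0, 0.0]]
direction = compliance_direction(compliant_acts, refusal_acts)

acts_before = [[0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
acts_after = [[0.5, 1.0], [3.0, 1.0], [1.5, 0.0]]
scores = degradation_scores(acts_before, acts_after, direction)
kept, flagged = split_by_threshold(scores, threshold=1.0)
```

In this toy run only the second sample moves strongly along the compliance direction, so it is the one flagged. In practice the activations would come from the model's hidden states at the selected layer, and the threshold (or a top-k cutoff) would need to be tuned on held-out safety evaluations.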

We validated the efficacy of our method through extensive experiments on Llama3-8B, Llama3.1-8B, and Qwen2.5-7B, utilizing the Alpaca and Dolly datasets. The results confirm that DataShield effectively distinguishes between high-risk and low-risk data subsets. Additionally, our analysis revealed that open-ended question-answering tasks are more prone to inducing safety degradation, with the associated responses typically being longer. This study aims to offer fresh perspectives on data-centric defense strategies. The source code is publicly accessible at: https://github.com/ZJunBo/DataShield.


Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC
