arXiv

CANARY: Zero-Label Detection of Fine-Tuning Contamination in Language Models

Abstract:

Adversarial actors can embed latent malicious behaviors by poisoning as little as 1% of fine-tuning data. Output-level defenses cannot detect this contamination: the harmful patterns stay dormant in the model’s hidden-state geometry and surface in generated text only once the contamination level exceeds 7.5%. To address this, we present CANARY (Contamination Auditor via Neural Activation Representation Yield), a checkpoint auditor that operates without labels. It identifies these hidden shifts from just two forward passes over an unlabeled set of prompts. CANARY projects hidden-state differences through a Sparse Autoencoder (SAE), which filters out stylistic noise and isolates significant semantic drift.
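The scoring idea can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration, not the paper's implementation: that the two forward passes are one through the base model and one through the fine-tuned checkpoint, that the SAE encoder has the common form ReLU(W_enc x + b_enc), and that a per-prompt score is the summed feature activation. The names `sae_encode` and `drift_scores` and all numbers are hypothetical.

```python
from typing import List, Sequence

Vector = Sequence[float]


def sae_encode(x: Vector, w_enc: Sequence[Vector], b_enc: Vector) -> List[float]:
    """Assumed encoder form: ReLU(W_enc @ x + b_enc), one value per SAE feature."""
    return [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(w_enc, b_enc)
    ]


def drift_scores(
    base_states: Sequence[Vector],
    tuned_states: Sequence[Vector],
    w_enc: Sequence[Vector],
    b_enc: Vector,
) -> List[float]:
    """Per-prompt score: summed SAE activation of the hidden-state difference."""
    scores = []
    for h_base, h_tuned in zip(base_states, tuned_states):
        delta = [t - b for t, b in zip(h_tuned, h_base)]
        scores.append(sum(sae_encode(delta, w_enc, b_enc)))
    return scores


# Toy data (made up): 3-dimensional hidden states, 2 SAE features.
# The negative encoder bias acts as a threshold that zeroes out small shifts.
w_enc = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
b_enc = [-0.5, -0.5]
base_states = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
tuned_states = [[0.1, 0.2, 0.0], [2.0, 0.0, 0.0]]

scores = drift_scores(base_states, tuned_states, w_enc, b_enc)
print(scores)  # the small shift scores 0.0, the large shift scores 1.5
```

In this toy setup the small difference on the first prompt falls below the encoder threshold and contributes nothing, while the large shift on the second prompt passes through. This is one plausible reading of how an SAE projection could separate stylistic noise from meaningful drift.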

The method achieves perfect detection, with an AUROC of 1.000 (95% CI = [0.997, 1.000]; Cohen's d = 3.28) across two training paradigms and four model architectures at just 1% contamination. This contamination level is 7.5 times lower than the level at which any output-based method can trigger detection. CANARY also maintains zero false positives on benign fine-tuning and full resilience against style-matching and gradient-noise adaptive attacks.
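For readers unfamiliar with the two reported statistics, the sketch below computes them from detector scores using the standard definitions: AUROC as the probability that a contaminated checkpoint scores higher than a benign one, and Cohen's d with a pooled standard deviation. The score lists are made-up toy values, not the paper's data, and the function names are hypothetical.

```python
from math import sqrt
from statistics import mean, variance
from typing import Sequence


def auroc(pos: Sequence[float], neg: Sequence[float]) -> float:
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def cohens_d(pos: Sequence[float], neg: Sequence[float]) -> float:
    """Difference of means divided by the pooled sample standard deviation."""
    n1, n2 = len(pos), len(neg)
    pooled_var = ((n1 - 1) * variance(pos) + (n2 - 1) * variance(neg)) / (n1 + n2 - 2)
    return (mean(pos) - mean(neg)) / sqrt(pooled_var)


# Toy detector scores (illustrative only).
contaminated = [3.0, 4.0, 5.0]
benign = [0.0, 1.0, 2.0]

auc = auroc(contaminated, benign)
d = cohens_d(contaminated, benign)
print(auc, d)
```

An AUROC of 1.0 means every contaminated score exceeds every benign score, so some threshold separates the two groups with no false positives.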

Beyond detection, the SAE feature basis powers a comprehensive governance pipeline. Amplifying SAE-filtered signals reveals latent harm at five times the rate of standard generation techniques. Additionally, ranking prompts by score provides a 4.2x improvement in red-teaming efficacy. At inference time, suppressing a small number of contamination-specific features reduces the harmful output rate from 70% to 10% without increasing perplexity. CANARY is the first zero-label framework capable of detecting, verifying, prioritizing, and remediating supply-chain contamination using hidden-state data alone.
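One common way to suppress SAE features at inference time is to subtract each blocked feature's decoder direction, scaled by its activation, from the hidden state. The sketch below shows that generic technique under assumed conventions (ReLU encoder, one decoder row per feature). It is not confirmed to be the procedure the paper uses, and `suppress_features` and all values are hypothetical.

```python
from typing import Iterable, List, Sequence

Vector = Sequence[float]


def suppress_features(
    h: Vector,
    w_enc: Sequence[Vector],
    b_enc: Vector,
    w_dec: Sequence[Vector],
    blocked: Iterable[int],
) -> List[float]:
    """Remove the decoder contribution of each blocked feature from h."""
    # Assumed encoder form: ReLU(W_enc @ h + b_enc).
    acts = [
        max(0.0, sum(w * x for w, x in zip(row, h)) + b)
        for row, b in zip(w_enc, b_enc)
    ]
    out = list(h)
    for i in blocked:
        # w_dec[i] is the assumed decoder direction of feature i.
        for j, d in enumerate(w_dec[i]):
            out[j] -= acts[i] * d
    return out


# Toy example (made up): 2-dimensional hidden state, 2 axis-aligned features.
w_enc = [[1.0, 0.0], [0.0, 1.0]]
b_enc = [0.0, 0.0]
w_dec = [[1.0, 0.0], [0.0, 1.0]]
h = [2.0, 1.0]

cleaned = suppress_features(h, w_enc, b_enc, w_dec, blocked=[0])
print(cleaned)  # feature 0's direction is removed, feature 1 is untouched
```

Editing only the blocked directions leaves the rest of the hidden state unchanged, which is consistent with the claim that suppression need not increase perplexity.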


Source: arXiv Generated at: 2026-06-02 00:00:00 UTC
