CANARY: Zero-Label Detection of Fine-Tuning Contamination in Language Models
Abstract:
Adversarial actors can embed latent malicious behaviors by poisoning as little as 1% of fine-tuning data. This contamination is undetectable by all output-level defenses, because the harmful patterns remain dormant within the model’s hidden-state geometry and only manifest in generated text once contamination levels surpass 7.5%. To address this, we present CANARY (Contamination Auditor via Neural Activation Representation Yield), a checkpoint auditor that operates without labels. It identifies these hidden shifts from just two forward passes over an unlabeled set of prompts. CANARY uses a Sparse Autoencoder (SAE) to project hidden-state differences, filtering out stylistic noise to isolate significant semantic drift.
The method achieves perfect detection, with an AUROC of 1.000 (95% CI = [0.997, 1.000]; Cohen's d = 3.28) across two training paradigms and four model architectures at just 1% contamination. This contamination level is 7.5 times lower than the 7.5% threshold at which any output-based method can trigger detection. CANARY also maintains zero false positives on benign fine-tuning and full resilience against style-matching and gradient-noise adaptive attacks.
Beyond detection, the SAE feature basis powers a comprehensive governance pipeline. Amplifying SAE-filtered signals reveals latent harm at five times the rate of standard generation techniques. Additionally, score-ranked prompts provide a 4.2x improvement in red-teaming efficacy. At inference time, suppressing a small number of contamination-specific features reduces harmful output rates from 70% to 10% without increasing perplexity. CANARY is the first zero-label framework capable of detecting, verifying, prioritizing, and remediating supply-chain contamination using hidden-state data alone.
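The detection step described above (hidden-state differences projected through an SAE, then reduced to a checkpoint-level score) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the two forward passes are read as one pass through a reference checkpoint and one through the audited checkpoint, the encoder uses the common `ReLU(W x + b)` form, a negative encoder bias stands in for the filtering of small stylistic shifts, and the score is the mean L2 norm of the SAE features. The names `sae_encode` and `drift_score` are hypothetical.

```python
import math


def relu(x):
    """Rectified linear unit for a single float."""
    return x if x > 0.0 else 0.0


def sae_encode(vec, w_enc, b_enc):
    """Encode a vector with an assumed SAE encoder: ReLU(W_enc @ vec + b_enc)."""
    return [
        relu(sum(w * v for w, v in zip(row, vec)) + b)
        for row, b in zip(w_enc, b_enc)
    ]


def drift_score(base_states, tuned_states, w_enc, b_enc):
    """Mean L2 norm of SAE-encoded hidden-state differences over prompts.

    base_states[i] and tuned_states[i] are the hidden states of the
    reference and audited checkpoints for the same unlabeled prompt i.
    This scoring rule is a hypothetical stand-in, not the paper's.
    """
    if not base_states or len(base_states) != len(tuned_states):
        raise ValueError("need the same non-zero number of prompts per checkpoint")
    total = 0.0
    for h_base, h_tuned in zip(base_states, tuned_states):
        delta = [t - b for t, b in zip(h_tuned, h_base)]
        feats = sae_encode(delta, w_enc, b_enc)
        total += math.sqrt(sum(f * f for f in feats))
    return total / len(base_states)


# Toy 2-dimensional example with an identity encoder and a negative bias
# that zeroes out small differences (all values are made up).
w_enc = [[1.0, 0.0], [0.0, 1.0]]
b_enc = [-0.5, -0.5]
base = [[0.0, 0.0], [1.0, 1.0]]
small_shift = [[0.1, 0.0], [1.0, 1.2]]   # below the bias threshold
large_shift = [[2.0, 0.0], [1.0, 1.0]]   # one prompt shifts strongly

benign_score = drift_score(base, small_shift, w_enc, b_enc)
drifted_score = drift_score(base, large_shift, w_enc, b_enc)
```

In this toy setup the small shifts are fully suppressed by the bias, so only the large shift contributes to the score. A real auditor would need a trained SAE and a calibrated decision threshold, neither of which is specified in the abstract.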
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC





