Beyond False Stability: High-Noise Drift Gating for Test-Time Adversarial Defenses in Vision-Language Models
Abstract
Vision-Language Models (VLMs) such as CLIP generalize well in the zero-shot setting, yet they remain highly vulnerable to adversarial attacks. Adversarial training improves robustness, but its computational cost has driven interest in test-time defenses. Current methods typically exploit how CLIP’s visual representations behave under stochastic perturbations: they aggregate predictions from multiple noisy views, build Gaussian noise-averaged anchors and interpolate features toward them, or apply counter-perturbations. These approaches improve robustness but often reduce clean accuracy, leaving a suboptimal trade-off between the two.
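As a rough illustration of the noise-anchoring idea mentioned above, the following minimal sketch averages features over Gaussian-noised views and interpolates the original feature toward that anchor. Everything here is an assumption made for illustration: `toy_encoder` is a hypothetical stand-in for CLIP's image encoder, and `sigma`, `n_views`, and `alpha` are placeholder values, not settings from the paper or from any published defense.

```python
import numpy as np

# Hypothetical stand-in for a CLIP image encoder (assumption for illustration):
# a fixed random projection followed by tanh and L2 normalization.
_W = np.random.default_rng(0).standard_normal((48, 16))

def toy_encoder(x: np.ndarray) -> np.ndarray:
    z = np.tanh(x.ravel() @ _W)
    return z / (np.linalg.norm(z) + 1e-12)

def noise_anchor(x, encoder, sigma=0.05, n_views=8, seed=0):
    """Average the features of several Gaussian-noised views, then renormalize."""
    rng = np.random.default_rng(seed)
    views = [encoder(x + sigma * rng.standard_normal(x.shape)) for _ in range(n_views)]
    anchor = np.mean(views, axis=0)
    return anchor / (np.linalg.norm(anchor) + 1e-12)

def interpolate_toward_anchor(x, encoder, alpha=0.5, sigma=0.05, n_views=8, seed=0):
    """Move the original feature a fraction alpha of the way toward the anchor."""
    f = encoder(x)
    a = noise_anchor(x, encoder, sigma=sigma, n_views=n_views, seed=seed)
    mixed = (1.0 - alpha) * f + alpha * a
    return mixed / (np.linalg.norm(mixed) + 1e-12)

# Usage on a toy 3x4x4 "image" with values in [0, 1).
x = np.random.default_rng(1).random((3, 4, 4))
feature = interpolate_toward_anchor(x, toy_encoder, alpha=0.5)
```

Because the interpolation is applied to every input, clean samples are also pulled away from their original features, which is one way to picture the clean-accuracy cost described above.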
This study re-examines stochastic test-time defenses and identifies a previously overlooked noise-regime transition in CLIP’s representation space. Prior work has focused mainly on the weak-noise regime, where adversarial examples can appear misleadingly stable ("false stability"). Our analysis shows that this relationship inverts as perturbation strength increases: beyond the weak-noise regime, adversarial representations become markedly less stable than clean ones, which yields a clearer separation signal. The transition holds across uniform and Gaussian noise, photometric and geometric transformations, different datasets, and diverse attacks. Notably, it largely vanishes in adversarially trained models, which suggests a link to the fragile local-basin geometry of non-robust CLIP models.
Building on this observation, we introduce a training-free, plug-in drift-gating mechanism. It uses the feature drift observed under high noise as a lightweight gating signal and activates an existing test-time defense only when adversarial-like instability is detected. Evaluated across 13 datasets, the method consistently improves the trade-off between clean and robust accuracy. On eight fine-grained datasets, mean accuracy over combined clean and adversarial samples rises from 65.7% to 71.4% for counterattack defenses and from 68.4% to 73.2% for noise-anchoring methods. On ImageNet and four shifted variants, it rises from 56.1% to 66.2% and from 62.1% to 67.6%, respectively.
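The gating logic described above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not define the drift metric, so mean cosine distance between the original feature and high-noise views is used as a placeholder, and `toy_encoder`, `sigma_high`, `tau`, and the `defense` callable are all hypothetical names and values.

```python
import numpy as np

# Hypothetical stand-in for a CLIP image encoder (assumption for illustration).
_W = np.random.default_rng(0).standard_normal((48, 16))

def toy_encoder(x: np.ndarray) -> np.ndarray:
    z = np.tanh(x.ravel() @ _W)
    return z / (np.linalg.norm(z) + 1e-12)

def feature_drift(x, encoder, sigma, n_views=8, seed=0):
    """Mean cosine distance between the original feature and noisy-view features.

    Assumed metric: the abstract does not specify how drift is measured.
    """
    rng = np.random.default_rng(seed)
    f = encoder(x)
    drifts = []
    for _ in range(n_views):
        g = encoder(x + sigma * rng.standard_normal(x.shape))
        drifts.append(1.0 - float(f @ g))  # features are unit-norm
    return float(np.mean(drifts))

def drift_gated_features(x, encoder, defense, sigma_high=0.5, tau=0.1):
    """Run the wrapped defense only when high-noise drift exceeds the threshold tau.

    Returns (feature, gate_fired). sigma_high and tau are placeholder values.
    """
    drift = feature_drift(x, encoder, sigma_high)
    if drift > tau:
        return defense(x), True   # adversarial-like instability: apply defense
    return encoder(x), False      # looks clean: keep the undefended feature

# Usage: any existing test-time defense can be passed as `defense`;
# here the plain encoder is used as a trivial placeholder.
x = np.random.default_rng(1).random((3, 4, 4))
feature, fired = drift_gated_features(x, toy_encoder, defense=toy_encoder)
```

The point of the design is that inputs judged stable bypass the defense entirely, so the defense's clean-accuracy cost is paid only on inputs flagged as unstable. In practice the threshold would need to be calibrated on held-out data.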
Source: arXiv
Generated at: 2026-06-03 00:00:00 UTC






