BYORn: Bootstrap Your Own Responses to Defend Large Vision-Language Models Against Backdoor Attacks
Abstract:
Supervised fine-tuning remains the standard method for adapting autoregressive vision-language models to specific downstream applications. However, recent studies have shown that this training paradigm is highly susceptible to backdoor attacks, and that current mitigation strategies are inadequate for open-ended generation tasks. To address this vulnerability, we introduce BYORn, a fine-tuning framework designed to resist backdoor threats. The approach rests on the insight that poisoned target responses are often semantically incoherent given their corresponding image-text inputs and a pretrained model. BYORn detects these misaligned responses and dynamically replaces them with alternative responses produced by the model itself, severing the link between malicious triggers and target outputs. The gradient of the resulting objective aligns with the gradient of an empirical estimate of an upper bound on the population risk derived from clean data distributions. Experiments demonstrate that BYORn consistently improves robustness against backdoor attacks without compromising performance on clean tasks, establishing a new frontier in the trade-off between generalization and attack success rate. Furthermore, BYORn remains effective against adaptive attacks designed specifically to bypass it.
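The detect-and-substitute step described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how coherence is measured or how the decision rule is set, so everything named here is a hypothetical stand-in rather than the BYORn procedure itself: `coherence_score` (for example, a likelihood-based score from a pretrained model), `generate` (the model's own response), and the fixed `threshold`.

```python
from typing import Callable, List, Tuple

# (image_id, prompt, target_response); a simplified, assumed data layout.
Example = Tuple[str, str, str]


def bootstrap_responses(
    batch: List[Example],
    coherence_score: Callable[[str, str, str], float],
    generate: Callable[[str, str], str],
    threshold: float,
) -> List[Example]:
    """Replace targets that score below `threshold` with model-generated ones.

    All arguments are illustrative assumptions: the abstract does not say how
    coherence is scored or how the threshold is chosen.
    """
    cleaned = []
    for image, prompt, target in batch:
        if coherence_score(image, prompt, target) < threshold:
            # Suspected misaligned target: substitute the model's own response.
            target = generate(image, prompt)
        cleaned.append((image, prompt, target))
    return cleaned


# Toy stand-ins so the sketch runs without any model.
def toy_score(image: str, prompt: str, target: str) -> float:
    return 0.0 if "evil.example" in target else 1.0


def toy_generate(image: str, prompt: str) -> str:
    return "a model-written answer"


batch = [
    ("img0", "Describe the image.", "A dog on a beach."),
    ("img1", "Describe the image.", "visit evil.example now"),
]
cleaned = bootstrap_responses(batch, toy_score, toy_generate, threshold=0.5)
```

In this toy run the first example is kept unchanged and the second has its target replaced, so a fine-tuning loss computed on `cleaned` would no longer pair the suspicious target with its input.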
Source: arXiv Generated at: 2026-06-03 00:00:00 UTC



