arXiv

Investigating Adversarial Robustness of Multi-modal Large Language Models

June 3, 2026 · Hashmat Shadab Malik, Muzammal Naseer, Salman Khan

Title: Examining the Adversarial Resilience of Multi-modal Large Language Models

Abstract: While Multi-modal Large Language Models (MLLMs) demonstrate impressive capabilities in vision-language tasks, the integration of visual inputs via encoders such as CLIP significantly broadens their attack surface, rendering them susceptible to visual adversarial perturbations. Previous defensive strategies have generally maintained compatibility with pre-trained MLLMs by imposing rigid constraints to align with CLIP’s original embedding space during adversarial fine-tuning. Although this approach is practical, it inherently restricts the maximum level of robustness that can be achieved. This study offers a comprehensive analysis of adversarial robustness within MLLMs. We propose a diagnostic CLIP-alignment protocol designed to forecast, before full MLLM training begins, which robust vision encoders will transfer effectively to multi-modal environments. Our findings indicate that large-scale multimodal adversarial pretraining, rather than unimodal scale alone, is the pivotal element for successful robustness transfer. By incorporating these superior encoders into MLLMs through end-to-end multimodal training, we observed average improvements of 28 CIDEr points in captioning and an 11.7% increase in VQA accuracy under strong adversarial conditions, outperforming constrained plug-and-play baselines. Additionally, we demonstrate that applying adversarial training directly to a standard, non-robust MLLM results in performance degradation for both clean and adversarial data, proving that robust visual representations are a mandatory foundation. However, conducting end-to-end adversarial training starting from a robust backbone provides further enhancements of 1.9 CIDEr points and 4.3% VQA accuracy. In addition to training-time solutions, we highlight lightweight test-time visual stochastic transformations as an effective black-box defense for non-robust MLLMs, boosting adversarial performance from near-zero levels to match those of robust models. Lastly, we confirm that our robust models significantly mitigate toxic generation during white-box visual jailbreak attacks. The code and pre-trained weights will be made publicly available.
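
The abstract does not say which test-time stochastic transformations are used, so the following is only a minimal sketch of the general idea, assuming a random pad-and-crop (a small random translation) applied to the image before it reaches the model. The names `random_pad_and_crop` and `defended_generate`, the `max_shift` parameter, and the `model` callable are all hypothetical and are not taken from the paper.

```python
import random


def random_pad_and_crop(image, max_shift=4, fill=0.0, rng=None):
    """Randomly translate an H x W image, given as a list of rows.

    The image is padded with `fill` by `max_shift` pixels on every side,
    then a random H x W window is cropped back out, so the output has the
    same shape as the input. This is an assumed stand-in transformation,
    not the one used in the paper.
    """
    if rng is None:
        rng = random.Random()
    height, width = len(image), len(image[0])
    padded_width = width + 2 * max_shift

    # Build the padded image: blank rows on top, padded rows, blank rows below.
    padded = [[fill] * padded_width for _ in range(max_shift)]
    for row in image:
        padded.append([fill] * max_shift + list(row) + [fill] * max_shift)
    padded.extend([fill] * padded_width for _ in range(max_shift))

    # Pick a random crop offset; (max_shift, max_shift) recovers the original.
    top = rng.randint(0, 2 * max_shift)
    left = rng.randint(0, 2 * max_shift)
    return [r[left:left + width] for r in padded[top:top + height]]


def defended_generate(model, image, max_shift=4, seed=None):
    """Apply one fresh random transformation, then call the model.

    `model` is any callable taking an image; it is treated as a black box,
    so no gradients or weights are needed (hypothetical interface).
    """
    rng = random.Random(seed)
    return model(random_pad_and_crop(image, max_shift=max_shift, rng=rng))
```

Because the offset is redrawn on every call, a perturbation tuned to exact pixel positions is shifted before it reaches the encoder. Whether this particular transformation reproduces the gains reported in the abstract is not something the sketch establishes.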
