arXiv

Robust-LLaVA: On the Effectiveness of Large-Scale Robust Image Encoders for Multi-modal Large Language Models

June 4, 2026 · Hashmat Shadab Malik, Fahad Shamshad, Muzammal Naseer, Karthik Nandakumar, Fahad Khan, Salman Khan

Multi-modal Large Language Models (MLLMs) have demonstrated exceptional proficiency in vision-language applications; however, they remain susceptible to visual adversarial perturbations. These vulnerabilities can trigger hallucinations, alter model outputs, or circumvent safety protocols. Current strategies aim to reduce these risks by subjecting CLIP vision encoders to constrained adversarial fine-tuning using ImageNet-scale datasets, thereby attempting to preserve their generalization capabilities. Yet, this narrow scope of adversarial training often limits both robustness and broader generalization potential.
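To make the notion of a small, bounded visual perturbation concrete, the following is a minimal sketch of a single signed-gradient step (in the style of FGSM) under an L-infinity budget. Everything in it is an illustrative assumption: the toy linear scorer, the four-pixel "image", the weights, and the budget of 4/255 are hypothetical and are not the attacks, models, or settings evaluated in this work.

```python
def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def linear_score(weights, pixels):
    """Toy stand-in for a model output: a dot product of weights and pixels."""
    return sum(w * p for w, p in zip(weights, pixels))


def signed_gradient_step(pixels, loss_grad, eps):
    """Move each pixel by eps in the direction that increases the loss,
    then clip back to the valid [0, 1] range. No pixel changes by more
    than eps, which is what keeps the perturbation small."""
    return [min(1.0, max(0.0, p + eps * sign(g)))
            for p, g in zip(pixels, loss_grad)]


# Hypothetical toy values (assumptions for illustration only).
weights = [0.5, -1.0, 0.25, 0.75]
pixels = [0.2, 0.6, 0.4, 0.5]
eps = 4 / 255

# The attacker's loss here is the negative score, so its gradient
# with respect to the pixels is simply the negated weights.
loss_grad = [-w for w in weights]

adv_pixels = signed_gradient_step(pixels, loss_grad, eps)
clean_score = linear_score(weights, pixels)
adv_score = linear_score(weights, adv_pixels)
max_change = max(abs(a - p) for a, p in zip(adv_pixels, pixels))
```

In this toy setting the score drops by `eps` times the sum of the absolute weights while no pixel moves by more than `eps`. Real attacks on MLLMs typically iterate such steps against a much larger non-linear model.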

This study investigates an alternative strategy: utilizing vision classification models that have already undergone adversarial pre-training on massive datasets. Our analysis highlights two primary findings. First, the vast scale and diversity inherent in such adversarial pre-training allow these models to exhibit heightened resilience against a wide spectrum of threats, from subtle, imperceptible perturbations to sophisticated jailbreaking attempts, without the need for further adversarial training. Second, integrating these robust models into MLLMs via an end-to-end approach enables the language components to better adapt to robust visual features, surpassing current plug-and-play methods in complex reasoning scenarios.
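As a rough picture of where a vision encoder sits in a LLaVA-style model, the sketch below maps per-patch encoder features through a linear projector into the language model's embedding width and joins them with text embeddings. It is a schematic under stated assumptions, not the authors' implementation: the function names, the tiny dimensions, the projection matrix, and the choice to place visual tokens before the text are all hypothetical.

```python
def project(patch_features, proj):
    """Map each patch feature (length d_vis) to the LLM embedding width
    (d_llm) using a d_vis x d_llm projection matrix."""
    d_llm = len(proj[0])
    return [
        [sum(f * proj[i][j] for i, f in enumerate(feat)) for j in range(d_llm)]
        for feat in patch_features
    ]


def build_llm_input(visual_tokens, text_embeddings):
    """Form the sequence the language model consumes. Visual tokens are
    placed first here purely for simplicity (an assumption)."""
    return visual_tokens + text_embeddings


# Hypothetical toy values: 3 patches with d_vis = 2, projected to d_llm = 3.
patch_features = [[1.0, 2.0], [0.0, 1.0], [3.0, 0.0]]
proj = [[1.0, 0.0, 1.0],
        [0.0, 1.0, 1.0]]
text_embeddings = [[0.1, 0.2, 0.3]]

visual_tokens = project(patch_features, proj)
sequence = build_llm_input(visual_tokens, text_embeddings)
```

In a plug-and-play setup only the encoder producing `patch_features` would be swapped, whereas end-to-end training would also update the projector and language model so that they adapt to the new features.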

We conducted a comprehensive evaluation covering visual question answering (VQA), image captioning, and jailbreak attacks. The results indicate that MLLMs incorporating these robust models achieve significantly higher adversarial robustness while maintaining strong performance on clean data. Specifically, our framework yields average robustness improvements of 2x on captioning tasks and 1.5x on VQA tasks, along with an improvement of over 10% against jailbreak attacks. The codebase and pre-trained models will be released at https://github.com/HashmatShadab/Robust-LLaVA.
