arXiv

Robust-LLaVA: On the Effectiveness of Large-Scale Robust Image Encoders for Multi-modal Large Language Models

Multi-modal Large Language Models (MLLMs) have demonstrated exceptional proficiency in vision-language applications; however, they remain susceptible to visual adversarial perturbations. These vulnerabilities can trigger hallucinations, alter model outputs, or circumvent safety protocols. Current strategies aim to reduce these risks by subjecting CLIP vision encoders to constrained adversarial fine-tuning using ImageNet-scale datasets, thereby attempting to preserve their generalization capabilities. Yet, this narrow scope of adversarial training often limits both robustness and broader generalization potential.
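
To make "visual adversarial perturbation" concrete, here is a minimal sketch on a toy linear scorer. It is not the attack used in the paper: the abstract does not name a threat model, so the L-infinity bound, the single sign-gradient step, and every name and number below (`score`, `linf_attack`, `eps`, the weights) are assumptions chosen for illustration.

```python
# Toy illustration only: a linear scorer on a 4-pixel "image".
# All names and values are hypothetical and not taken from the paper.

def sign(v):
    """Return -1, 0, or 1 depending on the sign of v."""
    return (v > 0) - (v < 0)

def score(weights, pixels):
    """Linear decision score; a positive value means the correct label wins."""
    return sum(w * p for w, p in zip(weights, pixels))

def linf_attack(weights, pixels, eps):
    """One sign-gradient step that lowers the score.

    Each pixel moves by at most eps (an L-infinity bound). For a linear
    score, the gradient with respect to the pixels is the weight vector.
    """
    perturbed = [p - eps * sign(w) for w, p in zip(weights, pixels)]
    # Clip so that pixels stay in the valid [0, 1] range.
    return [min(1.0, max(0.0, p)) for p in perturbed]

weights = [0.5, -0.25, 0.75, -0.5]
clean = [0.5, 0.5, 0.5, 0.5]
adv = linf_attack(weights, clean, eps=0.25)

clean_score = score(weights, clean)   # positive: correct prediction
adv_score = score(weights, adv)       # negative: prediction flipped
max_change = max(abs(a - c) for a, c in zip(adv, clean))
```

In this toy case a per-pixel change of at most `eps` is enough to flip the sign of the score. Real attacks on MLLMs target a deep vision encoder and usually iterate such steps, but the bounded-perturbation idea is the same.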

This study investigates an alternative strategy: using vision classification models that have already undergone adversarial pre-training on massive datasets. Our analysis highlights two primary findings. First, the scale and diversity of such adversarial pre-training make these models more resilient to a wide spectrum of threats, from subtle, imperceptible perturbations to sophisticated jailbreaking attempts, without further adversarial training. Second, integrating these robust models into MLLMs end to end lets the language components adapt to the robust visual features, and the resulting models surpass current plug-and-play methods on complex reasoning tasks.

We conducted a comprehensive evaluation covering visual question answering (VQA), image captioning, and jailbreak attacks. The results indicate that MLLMs incorporating these robust models achieve significantly higher adversarial robustness while maintaining strong performance on clean data. Specifically, our framework yields average robustness improvements of 2x on captioning tasks and 1.5x on VQA tasks, and improves resistance to jailbreak attacks by over 10%. The codebase and pre-trained models will be released at https://github.com/HashmatShadab/Robust-LLaVA.


Source: arXiv Generated at: 2026-06-04 00:00:00 UTC
