arXiv

Exploring Adversarial Robustness and Safety Alignment in Multilingual Multi-Modal Large Language Models

June 3, 2026 · Hashmat Shadab Malik, Muzammal Naseer, Salman Khan

Abstract

By incorporating visual perception into language reasoning, Multimodal Large Language Models (MLLMs) expose a persistent attack surface that is vulnerable to adversarial manipulation. Existing research on MLLM robustness has centered predominantly on English-only settings and has largely neglected model behavior in multilingual contexts. This study fills that gap with a comprehensive analysis of adversarial robustness and multimodal safety across twelve languages, focusing on open-source MLLMs that acquire multilingual proficiency through instruction tuning.

Using gradient-based attacks, we uncover a significant vulnerability: adversarial images crafted to cause failures in one language remain effective in other languages, showing strong cross-lingual transferability. Furthermore, the apparent degree of multilingual safety depends heavily on the model’s ability to accurately extract and interpret harmful directives. When malicious intent is conveyed through text, models are more prone to generating responses that facilitate misuse in languages where they have stronger linguistic foundations, whereas weaker languages yield fewer unsafe outputs. When the harmful content is instead embedded in images as rendered text, English script is consistently recognized and acted on, while non-English scripts are seldom parsed by the vision encoder.

Consequently, lower-resource languages may seem safer, but this is an artifact of the model’s inability to comprehend or visually ground the input, a phenomenon we define as "safety-by-failure", rather than true safety alignment. In contrast, MLLMs that integrate multilingual capabilities throughout their entire training lifecycle, such as Qwen3-VL, demonstrate authentic cross-lingual safety: they maintain active refusal mechanisms across all languages rather than appearing safe only because of comprehension gaps. Shallow multilingual adaptation, such as fine-tuning on translated instruction data, often yields only superficial understanding that creates a false sense of security in low-resource settings, whereas deep integration across training phases fosters genuine multilingual safety alignment.
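
The abstract does not say which gradient-based attack is used. As a rough illustration only, the sketch below assumes an L-infinity projected gradient descent (PGD) attack and replaces the MLLM with a toy logistic-regression scorer so that the gradient can be written analytically. The names `pgd_linf` and `bce_loss`, the step sizes, and the toy model are all hypothetical and are not taken from the paper.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def bce_loss(x, y, w, b):
    """Binary cross-entropy of a toy logistic model (stand-in for an MLLM loss)."""
    p = np.clip(sigmoid(x @ w + b), 1e-12, 1.0 - 1e-12)
    return float(-(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))


def pgd_linf(x, y, w, b, eps=8 / 255, step=2 / 255, iters=10):
    """Assumed L-infinity PGD: ascend the loss, then project back into the eps-ball."""
    x_adv = x.copy()
    for _ in range(iters):
        p = sigmoid(x_adv @ w + b)
        grad = (p - y) * w                        # analytic d(loss)/d(input) for this toy model
        x_adv = x_adv + step * np.sign(grad)      # signed gradient ascent step
        x_adv = np.clip(x_adv, x - eps, x + eps)  # stay within eps of the clean input
        x_adv = np.clip(x_adv, 0.0, 1.0)          # stay in the valid pixel range
    return x_adv


# Toy usage: a flattened "image" with pixel values in [0, 1] and a random linear scorer.
rng = np.random.default_rng(0)
x = rng.uniform(0.2, 0.8, size=64)
w = 0.1 * rng.normal(size=64)
x_adv = pgd_linf(x, 1.0, w, 0.0)
print("clean loss:", bce_loss(x, 1.0, w, 0.0))
print("adversarial loss:", bce_loss(x_adv, 1.0, w, 0.0))
```

With a real MLLM, the analytic gradient would be replaced by backpropagation through the vision encoder and language model. Cross-lingual transfer would then be checked by reusing the same perturbed image with prompts in other languages.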
