Quality-Diversity Evolution for Discovering Diverse Vulnerabilities in LLM Safety
Title: Leveraging Quality-Diversity Evolution to Uncover Distinct Vulnerabilities in LLM Safety
Abstract: Existing methods for adversarially testing Large Language Models (LLMs) have significant coverage limitations. Manual red-teaming does not scale, techniques that use LLMs as attackers tend to suffer from mode collapse, and gradient-based methods often yield uninterpretable gibberish. To address these issues, we propose a quality-diversity evolutionary framework that operates at the semantic level, evolving interpretable attack strategies rather than raw token sequences. Using MAP-Elites, we maintain a diverse archive of attacks organized along behavioral dimensions, including strategy type, encoding method, and prompt length.
Our experiments, conducted on GPT-4o-mini, Claude 3.5 Sonnet, Gemini 2.0 Flash, and the open-weight coding model Devstral-small-2, reveal distinct vulnerability profiles for each system. GPT-4o-mini proved susceptible to hypothetical and multi-turn framing combined with ROT13 encoding, with these attacks reaching a fitness score of 0.8. Gemini was likewise vulnerable to direct attacks using ROT13 and to multi-turn exchanges using Leetspeak, which also scored 0.8. In contrast, Claude produced uniformly ambiguous responses across all tested strategies, with a maximum fitness of 0.4. Because attacks are represented semantically, they remain interpretable and expose systematic, model-specific weaknesses. These findings offer actionable insights for improving LLM safety and establish a reproducible baseline for assessing future frontier models. The code and experimental artifacts are available at https://github.com/bassrehab/red-queen.
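To make the archive structure mentioned in the abstract concrete, the following is a minimal sketch of a MAP-Elites loop over the three behavioral dimensions named there (strategy type, encoding method, prompt length). It is not the authors' implementation. The `Candidate` class, the category lists, the length-bucket thresholds, the mutation scheme, and `toy_fitness` are all hypothetical; in particular, `toy_fitness` is an arbitrary placeholder and does not reflect any result reported above. A real system would replace it with a judge that scores a target model's response.

```python
import random
from dataclasses import dataclass

# Hypothetical category values, loosely following the dimensions in the abstract.
STRATEGIES = ("direct", "hypothetical", "multi_turn")
ENCODINGS = ("none", "rot13", "leetspeak")


@dataclass(frozen=True)
class Candidate:
    """Semantic description of an attack, not its literal text."""
    strategy: str
    encoding: str
    length: int  # prompt length in tokens (assumed unit)


def descriptor(c):
    """Map a candidate to its archive cell; thresholds are assumptions."""
    if c.length < 50:
        bucket = "short"
    elif c.length < 200:
        bucket = "medium"
    else:
        bucket = "long"
    return (c.strategy, c.encoding, bucket)


def mutate(c, rng):
    """Change exactly one field of the parent candidate."""
    field = rng.choice(("strategy", "encoding", "length"))
    if field == "strategy":
        return Candidate(rng.choice(STRATEGIES), c.encoding, c.length)
    if field == "encoding":
        return Candidate(c.strategy, rng.choice(ENCODINGS), c.length)
    return Candidate(c.strategy, c.encoding, max(1, c.length + rng.randint(-40, 40)))


def map_elites(fitness, iterations=500, seed=0):
    """Keep the best-scoring candidate found so far in each descriptor cell."""
    rng = random.Random(seed)
    archive = {}  # cell -> (candidate, score)
    for _ in range(iterations):
        if archive and rng.random() < 0.9:
            # Select a random elite and mutate it.
            parent, _score = rng.choice(list(archive.values()))
            child = mutate(parent, rng)
        else:
            # Occasionally sample a fresh random candidate.
            child = Candidate(
                rng.choice(STRATEGIES), rng.choice(ENCODINGS), rng.randint(1, 400)
            )
        score = fitness(child)
        cell = descriptor(child)
        if cell not in archive or score > archive[cell][1]:
            archive[cell] = (child, score)
    return archive


def toy_fitness(c):
    """Arbitrary stand-in score in [0, 1]; not derived from any model."""
    return (STRATEGIES.index(c.strategy) + ENCODINGS.index(c.encoding)) / 4.0


if __name__ == "__main__":
    archive = map_elites(toy_fitness, iterations=300, seed=1)
    for cell, (cand, score) in sorted(archive.items()):
        print(cell, cand.length, score)
```

The point of the sketch is the replacement rule: a new candidate displaces an existing one only within its own cell, so a high-scoring strategy cannot crowd out lower-scoring but behaviorally different ones. This is the property that lets a quality-diversity archive avoid the mode collapse described above.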
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC





