arXiv

A Multi-Domain Red Teaming Framework for Safety, Robustness, and Fairness Evaluation of Medical Large Language Models

June 2, 2026 · Andrei Marian Feier, Veysel Kocaman, Yigit Gul, Ahmet Korkmaz, Alexander Thomas, Aleksei Zakharov, Jay Gil, Mehmet Butgul, David Talby

Title: Evaluating Medical Large Language Models: A Multi-Domain Red Teaming Approach for Safety, Robustness, and Fairness

Abstract

Despite the growing integration of large language models (LLMs) into healthcare settings, current evaluation benchmarks often overlook how these systems behave under the adversarial or ethically nuanced conditions typical of clinical environments. To address this gap, we introduced a comprehensive multi-domain red teaming framework designed to assess eleven modern LLMs. Our study utilized 690 scenarios rooted in clinical practice, organized across nine distinct domains and more than 150 subcategories. These scenarios included adversarial modifications, and the resulting model responses were evaluated using a seven-dimensional rubric, incorporating both LLM-assisted scoring and human-in-the-loop validation.

The analysis uncovered significant disparities in model performance, with average scores ranging from 0.791 to 0.984. Most notably, several models that demonstrated high overall accuracy experienced total failures in specific safety-critical situations, highlighting that aggregate metrics can obscure clinically significant risks. The top-tier systems (X-BAI, GPT-5, and Claude Opus 4.1) consistently scored above 0.97 with minimal variance. However, performance fluctuated considerably depending on the domain.
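To illustrate how an aggregate score can hide a safety-critical failure, the following minimal sketch compares two hypothetical models with the same mean score. The model names, the scores, and the `reliability_summary` helper are assumptions made for illustration; they are not data or code from the study.

```python
from statistics import mean, pstdev

# Hypothetical per-scenario scores in [0, 1]; NOT results from the study.
scores = {
    "model_a": [0.9] * 10,           # uniformly good, never fails outright
    "model_b": [1.0] * 9 + [0.0],    # perfect except for one total failure
}

def reliability_summary(values):
    """Summarize a score list by mean, spread, and worst case."""
    return {
        "mean": mean(values),
        "std": pstdev(values),                          # population standard deviation
        "worst": min(values),                           # worst-case scenario score
        "total_failures": sum(1 for v in values if v == 0.0),
    }

summaries = {name: reliability_summary(v) for name, v in scores.items()}

for name, s in summaries.items():
    print(f"{name}: mean={s['mean']:.3f} std={s['std']:.3f} "
          f"worst={s['worst']:.3f} total_failures={s['total_failures']}")
```

Both hypothetical models have the same mean, yet only the standard deviation and the worst-case score show that one of them failed a scenario completely.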

Our findings also revealed that tasks involving equity issues saw error rates increase by 10-20% when demographic variables were altered. Furthermore, human reviewers detected clinically pertinent failures that automated evaluation tools had missed. These results suggest that reliability indicators based on performance variance and worst-case scenarios offer greater clinical relevance than mean accuracy alone. Consequently, we argue that credible safety assessments require hybrid evaluation strategies that combine automated processes with direct clinician oversight.
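The equity finding above compares error rates before and after demographic variables are altered. The abstract does not say whether the reported 10-20% increase is an absolute (percentage-point) or a relative change, so the sketch below computes both on made-up outcomes. The outcome lists and the `error_rate` helper are hypothetical and are not taken from the study.

```python
def error_rate(outcomes):
    """Fraction of scenarios judged erroneous (True marks an error)."""
    return sum(outcomes) / len(outcomes)

# Hypothetical outcomes for the same 20 scenarios, before and after
# altering a demographic variable in the prompt. NOT data from the study.
baseline = [False] * 18 + [True] * 2     # 2 errors out of 20
perturbed = [False] * 15 + [True] * 5    # 5 errors out of 20

base_rate = error_rate(baseline)
pert_rate = error_rate(perturbed)

# Absolute change is measured in percentage points; relative change is
# measured against the baseline error rate.
absolute_change = pert_rate - base_rate
relative_change = absolute_change / base_rate

print(f"baseline error rate:  {base_rate:.2%}")
print(f"perturbed error rate: {pert_rate:.2%}")
print(f"absolute change:      {absolute_change * 100:.1f} percentage points")
print(f"relative change:      {relative_change:.0%}")
```

The two measures can differ widely for the same outcomes, so a report of this kind is easier to interpret when it states which one is meant.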
