MultiTurnPSB: Evaluating Multi-Turn Jailbreak Attacks and Classifier-Based Defenses for Medical AI Safety
Title: MultiTurnPSB: Assessing Multi-Turn Jailbreak Attacks and Classifier-Based Defenses in Medical AI Safety
Abstract:
While patient-facing medical chatbots are typically assessed using single-turn prompts, real users often persist after receiving refusals, escalating their urgency and invoking authority. To address this gap, we present MultiTurnPSB, a four-turn adversarial extension of PatientSafetyBench, and evaluate the resilience of GPT-4.1-mini against fixed-template, template-adaptive, and live adversarial attacks. Our findings reveal that under live attack conditions, unsafe response rates surge from 35% to nearly 80% by Turn 4. Although GPT-4.1-mini and Claude Sonnet 4.5 perform similarly at baseline, they diverge significantly by Turn 4, exhibiting a 19x performance gap that single-turn evaluations fail to detect.
We identify four distinct degradation trajectory signatures and pinpoint a two-element attack formula as the primary driver of catastrophic failures. Furthermore, we analyze a lightweight input-side classifier capable of reducing Turn 4 unsafe responses by 52 percentage points. However, its deployment is primarily constrained by a 45% false-alarm rate on benign queries, which causes severe accuracy degradation. Additionally, we observe a methodological anomaly: Claude Sonnet declined to generate adversarial messages in more than half of late-turn conversations, even when the task was explicitly framed as a red-teaming exercise. This suggests that safety training may generalize effectively to the attacker role.
Source: arXiv
Generated at: 2026-06-03 00:00:00 UTC



