arXiv

Evaluating Large Language Models in Dynamic Clinical Decision-Making with Standardized Patient Cases

Abstract:

Large Language Models (LLMs) are frequently proposed as tools for clinical decision-making, but existing static, single-turn benchmarks fail to capture the dynamic nature of patient care, which involves continuous information gathering, treatment planning, and adaptive management as a patient’s condition changes. Medical education addresses a comparable need with Standardized Patients (SPs): trained actors who consistently simulate clinical cases to provide realistic practice and objective, scripted assessment. In this study, we present MedSP1000, an interactive, SP-derived benchmark for evaluating clinical agents. It comprises 1,638 SP cases supported by 24,602 peer-reviewed, trajectory-level rubrics. MedSP1000 transforms these peer-reviewed teaching cases into executable simulations that incorporate defined SP scripts, clinical environmental contexts, and human-validated structured rubrics.

During each evaluation run, a clinical agent interacts in a closed loop with a patient agent and an environment controller, and its performance is scored throughout the encounter against the expert criteria defined in the original teaching materials. We applied MedSP1000 to a range of general-purpose and medically specialized LLMs and found that high scores on static benchmarks do not reliably predict performance in these simulations. The top-performing model, GPT-5.5, completed only 60.4% of the expert-defined rubric items, while the best medically specialized model reached 40.0%. Increasing test-time compute yielded no significant improvement. These findings indicate that current LLMs, including agentic systems specifically optimized for medicine, lack the reliability needed for safe integration into real-world clinical practice. More broadly, MedSP1000 demonstrates that process-level, SP-style evaluation can uncover clinically significant failure modes that single-turn benchmarks often miss.
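As a rough illustration of the closed-loop protocol described above, the following Python sketch shows one way such an encounter could be driven and scored. It is not MedSP1000's implementation. Every name in it (`RubricItem`, `run_encounter`, the `ORDER`/`END` action convention, and the toy agents and rubric) is a hypothetical stand-in, and the score is assumed to be the fraction of rubric items met at any point in the transcript.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class RubricItem:
    """Hypothetical rubric item: a description plus a check over the transcript."""
    description: str
    is_met: Callable[[List[str]], bool]


def run_encounter(clinical_agent, patient_agent, environment,
                  rubric: List[RubricItem], max_turns: int = 10) -> Tuple[float, List[str]]:
    """Run one simulated encounter and return (fraction of rubric items met, transcript)."""
    transcript: List[str] = []
    met = set()
    observation = "Patient enters the room."
    for _ in range(max_turns):
        action = clinical_agent(observation, transcript)
        transcript.append(f"clinician: {action}")
        if action == "END":
            break
        if action.startswith("ORDER "):
            # Assumed convention: orders are routed to the environment controller.
            observation = environment(action)
            transcript.append(f"environment: {observation}")
        else:
            # Anything else is treated as speech directed at the patient agent.
            observation = patient_agent(action)
            transcript.append(f"patient: {observation}")
        # Re-check the rubric after every turn, so scoring spans the whole trajectory.
        for i, item in enumerate(rubric):
            if i not in met and item.is_met(transcript):
                met.add(i)
    return len(met) / len(rubric), transcript


# Toy stand-ins for the three roles (assumptions for illustration only).
def toy_clinician(observation, transcript):
    script = ["Do you have chest pain?", "ORDER ecg", "END"]
    turns_taken = sum(1 for line in transcript if line.startswith("clinician:"))
    return script[min(turns_taken, len(script) - 1)]


def toy_patient(question):
    return "Yes, for about an hour."


def toy_environment(order):
    return "ECG: sinus rhythm, no acute changes."


def clinician_said(fragment):
    """Build a check that is True once any clinician line contains the fragment."""
    return lambda t: any(fragment in line.lower() for line in t if line.startswith("clinician:"))


rubric = [
    RubricItem("Asks about chest pain", clinician_said("chest pain")),
    RubricItem("Orders an ECG", clinician_said("order ecg")),
    RubricItem("Asks about medication allergies", clinician_said("allerg")),
]

score, transcript = run_encounter(toy_clinician, toy_patient, toy_environment, rubric)
print(f"Rubric completion: {score:.1%}")  # prints "Rubric completion: 66.7%"
```

In this toy run the scripted clinician satisfies two of the three items, so the completion rate is 2/3. A real benchmark would replace the substring checks with the human-validated rubric criteria and the stub functions with LLM-backed agents.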


Source: arXiv Generated at: 2026-06-04 00:00:00 UTC
