arXiv

MedCUA-Bench: A Screenshot-Only Benchmark for Clinical Computer-Use Agents

June 3, 2026 · Jia Yu, Zilong Wang, Xinyang Jiang, Dongsheng Li, Shuo Wang

Title: MedCUA-Bench: A Screenshot-Based Evaluation Framework for Clinical Computer-Use Agents

Abstract:

While computer-use agents hold the potential to automate routine, screen-driven clinical tasks, their dependability within medical graphical user interfaces (GUIs) has yet to be thoroughly validated. Current benchmarks predominantly address general web or desktop activities, failing to adequately represent medical software. Such specialized applications necessitate specific domain expertise, feature interface designs distinct from consumer applications, lack publicly accessible testing environments, and require safety assurances that extend far beyond simple task completion.

To address these gaps, we present MedCUA-Bench, an interactive benchmark designed specifically for clinical computer-use agents. This framework encompasses 18 clinical scenarios spanning 10 distinct medical domains. These scenarios were reconstructed from real-world product manuals and open-source medical systems, ensuring the capture of authentic clinical interfaces while circumventing issues related to licensing and patient privacy.

Each task within the benchmark includes paired goals at both the intent and step levels, allowing for the separation of clinical reasoning from UI execution. Evaluation is conducted using a deterministic checker that assesses task completion alongside five specific dimensions of clinical safety.
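As a rough illustration of the task structure and checker described above, the following is a minimal sketch and not the benchmark's actual code. The abstract does not give the schema or name the five safety dimensions, so every name here (`Task`, `evaluate`, `State`, the `correct_patient` check, the example goals) is hypothetical. It also assumes that "strict success" means the task is complete and every safety check passes.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

# Hypothetical snapshot of the application's final state after the agent acts.
State = Dict[str, object]


@dataclass
class Task:
    """One task with paired goals (hypothetical schema, not the benchmark's)."""
    intent_goal: str  # what the clinician wants; the agent must plan the steps
    step_goal: str    # explicit UI steps; tests execution only
    is_complete: Callable[[State], bool]
    # Named safety predicates; the real benchmark uses five dimensions,
    # which the abstract does not name.
    safety_checks: Dict[str, Callable[[State], bool]] = field(default_factory=dict)


def evaluate(task: Task, final_state: State) -> Dict[str, object]:
    """Deterministic check: same final state always yields the same verdict."""
    completed = task.is_complete(final_state)
    safety = {name: check(final_state) for name, check in task.safety_checks.items()}
    return {
        "completed": completed,
        "safety": safety,
        # Assumption: strict success requires completion AND all safety checks.
        "strict_success": completed and all(safety.values()),
    }


# Illustrative task; the scenario and identifiers are made up.
task = Task(
    intent_goal="Record the new allergy reported by the patient.",
    step_goal="Open the Allergies tab, click Add, enter 'penicillin', click Save.",
    is_complete=lambda s: "penicillin" in s.get("allergies", []),
    safety_checks={"correct_patient": lambda s: s.get("patient_id") == "P-001"},
)
```

Under these assumptions, an agent that records the allergy on the wrong patient's chart would count as completing the task but failing strict success. Running the same checker on both the intent-level and step-level variants of a task is one way to separate reasoning failures from execution failures.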

Our testing of 23 agents revealed significant performance disparities. The top-performing closed-source model achieved a strict success rate of 54.2%. However, performance on the real OpenEMR platform remained critically low, with all models scoring below 9%. Among open-source agents, the average success rate was merely 2.5%, with the highest-performing model reaching only 16.2%. MedCUA-Bench highlights the substantial gap between current agent capabilities and the requirements for reliable clinical software usage, offering a reproducible testbed to guide future research in this area.
