arXiv

MedCUA-Bench: A Screenshot-Only Benchmark for Clinical Computer-Use Agents

Title: MedCUA-Bench: A Screenshot-Based Evaluation Framework for Clinical Computer-Use Agents

Abstract:

While computer-use agents could automate routine, screen-driven clinical tasks, their dependability within medical graphical user interfaces (GUIs) has yet to be thoroughly validated. Current benchmarks predominantly address general web or desktop activities and do not adequately represent medical software. Such applications demand domain expertise, use interface designs unlike those of consumer applications, lack publicly accessible test environments, and require safety assurances that go well beyond task completion.

To address these gaps, we present MedCUA-Bench, an interactive benchmark designed specifically for clinical computer-use agents. This framework encompasses 18 clinical scenarios spanning 10 distinct medical domains. These scenarios were reconstructed from real-world product manuals and open-source medical systems, ensuring the capture of authentic clinical interfaces while circumventing issues related to licensing and patient privacy.

Each task within the benchmark includes paired goals at both the intent and step levels, allowing for the separation of clinical reasoning from UI execution. Evaluation is conducted using a deterministic checker that assesses task completion alongside five specific dimensions of clinical safety.
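To make the evaluation idea concrete, the following is a minimal sketch of what a deterministic checker over paired goals could look like. It is not the benchmark's actual implementation: the `Task` fields, the `check` function, the dictionary-based state, the safety dimension name `placeholder_dim_1`, and the rule that strict success requires both completion and every safety check to pass are all assumptions made for illustration. The abstract does not name the five safety dimensions, so none are named here.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

# Hypothetical representation: the final application state as key-value pairs.
State = Dict[str, str]


@dataclass
class Task:
    intent_goal: str   # high-level clinical goal, without UI hints
    step_goal: str     # the same task spelled out as UI steps
    expected: State    # fields the final state must contain
    # Named safety predicates; the names are placeholders, not the paper's dimensions.
    safety_checks: Dict[str, Callable[[State], bool]] = field(default_factory=dict)


def check(task: Task, final_state: State) -> Dict[str, object]:
    """Deterministically score one episode from its final state."""
    completed = all(final_state.get(k) == v for k, v in task.expected.items())
    safety = {name: fn(final_state) for name, fn in task.safety_checks.items()}
    return {
        "completed": completed,
        "safety": safety,
        # Assumed definition: strict success needs completion and all safety checks.
        "strict_success": completed and all(safety.values()),
    }


task = Task(
    intent_goal="Record the patient's penicillin allergy",
    step_goal="Open the Allergies tab, click Add, type 'penicillin', click Save",
    expected={"allergy": "penicillin"},
    safety_checks={"placeholder_dim_1": lambda s: s.get("patient_id") == "P001"},
)

ok = check(task, {"allergy": "penicillin", "patient_id": "P001"})
wrong_patient = check(task, {"allergy": "penicillin", "patient_id": "P002"})
```

In this sketch the second episode completes the task but fails the safety predicate, so it would not count as a strict success. Running the same task once with `intent_goal` and once with `step_goal` as the agent prompt, and scoring both with the same checker, is one way the separation of clinical reasoning from UI execution could be measured.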

Our testing of 23 agents revealed significant performance disparities. The top-performing closed-source model achieved a strict success rate of 54.2%. However, performance on the real OpenEMR platform remained critically low, with all models scoring below 9%. Among open-source agents, the average success rate was merely 2.5%, with the highest-performing model reaching only 16.2%. MedCUA-Bench highlights the substantial gap between current agent capabilities and the requirements for reliable clinical software usage, offering a reproducible testbed to guide future research in this area.


Source: arXiv Generated at: 2026-06-03 00:00:00 UTC
