AutoMedBench: Towards Medical AutoResearch with Agentic AI Models
Title: AutoMedBench: Advancing Medical AutoResearch through Agentic AI Models
Abstract
Autonomous agents are increasingly anticipated to facilitate comprehensive, end-to-end medical-AI research workflows, extending their utility far beyond isolated prediction tasks or brief clinical question-and-answer sessions. However, current benchmarks for medical agents largely focus on final outputs, offering scant insight into the agents' operational behavior throughout the research process. To bridge this gap, we introduce AutoMedBench, a workflow-centric benchmark designed for autonomous medical-AI research across various medical imaging and multimodal inference tasks. This benchmark structures agent execution into a cohesive five-stage workflow (S1–S5): Plan, Setup, Validate, Inference, and Submit.
AutoMedBench features long-horizon tasks, with each run averaging 33 agent turns. The benchmark covers five distinct research tracks: segmentation, image enhancement, visual question answering (VQA), report generation, and lesion detection. Each task is assessed at two difficulty levels, Lite and Standard; while both tiers utilize identical data and metrics, they differ in the extent of scaffolding provided in the task brief. Evaluation scores are derived from both final task performance and stage-specific metrics (S1–S5), allowing for granular analysis of the workflow from the initial brief to the final submitted artifact.
Analysis of thousands of recorded runs indicates that the Validate stage is, on average, the weakest link in the workflow, while Setup is the strongest. This suggests that contemporary agents excel at making pipelines executable rather than verifying their reliability. Further post-run error analysis highlights that failures in verification and submission are the primary sources of tagged errors, representing 37.7% and 38.1% of triggered codes, respectively. In contrast, errors related to task understanding are rare, accounting for only 0.9%. Notably, runs that trigger at least one error code demonstrate an average overall score that is 48% lower than those with no error codes.
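The stage-level comparison and the error-code score gap described above can be illustrated with a minimal sketch. It assumes each run is logged as a dictionary holding per-stage scores, an overall score, and a list of triggered error codes. The record layout, the helper names, the error-code label `E_VALIDATE`, and the two toy runs are hypothetical and are not taken from the benchmark; the numbers are made up for illustration only.

```python
from enum import Enum
from statistics import mean


class Stage(Enum):
    """The five workflow stages named in the abstract (S1-S5)."""
    PLAN = "S1"
    SETUP = "S2"
    VALIDATE = "S3"
    INFERENCE = "S4"
    SUBMIT = "S5"


def mean_stage_scores(runs):
    """Average each stage's score over all runs."""
    return {s: mean(run["stages"][s] for run in runs) for s in Stage}


def weakest_and_strongest(stage_means):
    """Return the stages with the lowest and highest mean score."""
    weakest = min(stage_means, key=stage_means.get)
    strongest = max(stage_means, key=stage_means.get)
    return weakest, strongest


def relative_score_drop(runs):
    """Fractional drop in mean overall score for runs with at least one
    error code, relative to runs with none."""
    clean = [r["overall"] for r in runs if not r["error_codes"]]
    flagged = [r["overall"] for r in runs if r["error_codes"]]
    return 1.0 - mean(flagged) / mean(clean)


# Hypothetical toy records; real benchmark logs may look different.
runs = [
    {
        "stages": {Stage.PLAN: 0.8, Stage.SETUP: 0.9, Stage.VALIDATE: 0.4,
                   Stage.INFERENCE: 0.7, Stage.SUBMIT: 0.6},
        "overall": 0.8,
        "error_codes": [],
    },
    {
        "stages": {Stage.PLAN: 0.6, Stage.SETUP: 1.0, Stage.VALIDATE: 0.2,
                   Stage.INFERENCE: 0.5, Stage.SUBMIT: 0.4},
        "overall": 0.4,
        "error_codes": ["E_VALIDATE"],  # hypothetical code name
    },
]

stage_means = mean_stage_scores(runs)
weakest, strongest = weakest_and_strongest(stage_means)
drop = relative_score_drop(runs)
```

On these toy records, `weakest` is `Stage.VALIDATE`, `strongest` is `Stage.SETUP`, and `drop` is a fraction between 0 and 1 that plays the role of the 48% gap reported above.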
Source: arXiv Generated at: 2026-06-02 00:00:00 UTC




