arXiv

CardioLens: Revealing the Clinical Reality Gap of MLLMs via Multi-Sequence Cardiac MRI Evaluations

June 2, 2026 · Zixian Su, Hongkai Zhang, Fan Gao, Encheng Su, Taiping Qu, Jingwei Guo, Nan Zhang, Hui Wang, Zhen Zhou, Kairui Bo, Yan Chen, Yue Ren, Shuai Li, Lei Xu, Henggui Zhang

Abstract:

Multimodal Large Language Models (MLLMs) have demonstrated impressive results on standard public medical benchmarks, but these benchmarks are often poor proxies for actual clinical practice. They typically rely on isolated inputs and simplified recognition-style tasks, which do not reflect the complexity of real-world use. To address this limitation, we present CardioLens, a leakage-resistant evaluation framework for multi-sequence Cardiovascular Magnetic Resonance (CMR). The testbed was built from private hospital archives through a stringent pipeline that verifies question-answer (QA) pairs derived from clinical reports.

CardioLens comprises 473,896 image slices and 13,494 verified QA pairs. It covers four CMR sequence types: 4D Cine, Late Gadolinium Enhancement (LGE), perfusion, and T2-weighted imaging. The framework assesses three stages of CMR interpretation: image comprehension, report generation, and disease diagnosis.

Our evaluation of 24 state-of-the-art MLLMs highlights a significant "clinical reality gap." Overall model performance was poor, with accuracy declining progressively as the tasks aligned more closely with the actual CMR workflow. Detailed confusion analysis identified a "category-collapse" failure mode, wherein models tend to default to common abnormal categories instead of accurately distinguishing between clinically distinct findings.
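As a rough illustration of what a confusion-based check for category collapse might look like, the sketch below counts (true, predicted) label pairs and compares how often the most-predicted label is chosen with how often it actually occurs. The function names, the labels `class_a` to `class_d`, and the toy data are all hypothetical; this is not the analysis procedure used in the paper.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Count how often each (true label, predicted label) pair occurs."""
    return Counter(zip(y_true, y_pred))


def collapse_summary(y_true, y_pred):
    """Return (most-predicted label, its share of predictions, its share of true labels).

    A prediction share far above the true share suggests the model is
    defaulting to that label. This is an illustrative heuristic only.
    """
    pred_counts = Counter(y_pred)
    true_counts = Counter(y_true)
    label, n_pred = pred_counts.most_common(1)[0]
    n = len(y_pred)
    return label, n_pred / n, true_counts[label] / n


# Toy, made-up labels (hypothetical categories, not taken from CardioLens).
y_true = ["class_a", "class_b", "class_c", "class_d", "class_b", "class_a"]
y_pred = ["class_b", "class_b", "class_b", "class_b", "class_b", "class_a"]

label, pred_share, true_share = collapse_summary(y_true, y_pred)
print(label, round(pred_share, 2), round(true_share, 2))  # class_b 0.83 0.33
```

In this toy case the model predicts `class_b` for five of six samples although only two samples belong to it, which is the kind of imbalance a confusion analysis would surface.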

To test whether these results were driven mainly by how inputs were constructed to fit MLLM architectures, we compared several slice selection protocols (random, clinically motivated, and data-driven) under different slice budgets. The impact on performance was negligible, with variations typically amounting to only about 1%. Furthermore, explicit reasoning prompts did not improve performance; instead, they often made models more conservative without improving their use of visual evidence.
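To make the idea of selecting slices under a fixed budget concrete, here is a minimal sketch of two simple strategies: uniform random sampling and evenly spaced sampling. Both functions and their parameters (`num_slices`, `budget`, `seed`) are assumptions for illustration. The evenly spaced rule is only a stand-in for a fixed protocol and is not the paper's clinically motivated or data-driven selection.

```python
import random


def select_random(num_slices, budget, seed=0):
    """Pick up to `budget` distinct slice indices uniformly at random, sorted."""
    k = max(0, min(budget, num_slices))
    rng = random.Random(seed)  # seeded for reproducibility
    return sorted(rng.sample(range(num_slices), k))


def select_evenly_spaced(num_slices, budget):
    """Pick up to `budget` slice indices at the midpoints of equal-width bins."""
    k = max(0, min(budget, num_slices))
    return [int((i + 0.5) * num_slices / k) for i in range(k)]


# Hypothetical stack of 30 slices with a budget of 4 slices per study.
print(select_evenly_spaced(30, 4))  # [3, 11, 18, 26]
print(len(select_random(30, 4)))    # 4
```

Comparing model accuracy across such selectors at several budgets is one way to check how sensitive results are to input construction.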

These findings indicate that current MLLMs are far from being reliable tools for CMR interpretation, a domain that demands the integration of distributed evidence across multiple sequences, views, and temporal phases. CardioLens offers a clinically grounded testbed to guide the development of next-generation MLLMs suitable for real-world clinical deployment.
