Global News Digest

arXiv

CardioLens: Revealing the Clinical Reality Gap of MLLMs via Multi-Sequence Cardiac MRI Evaluations

Abstract:

Multimodal Large Language Models (MLLMs) have posted impressive results on standard public medical benchmarks, but current evaluation methods are often poor proxies for actual clinical practice. Existing assessments typically rely on isolated inputs and simplified recognition-style tasks, so they fail to reflect the complexity of real-world use. To address this limitation, we present CardioLens, a leakage-resistant evaluation framework for multi-sequence Cardiovascular Magnetic Resonance (CMR). The testbed was built from private hospital archives using a stringent pipeline that verifies question-and-answer pairs derived from clinical reports.

CardioLens comprises 473,896 image slices and 13,494 verified QA pairs. It covers four CMR sequence types: 4D Cine, Late Gadolinium Enhancement (LGE), perfusion, and T2-weighted imaging. The framework assesses three critical phases of CMR interpretation: image comprehension, report generation, and disease diagnosis.

Our evaluation of 24 state-of-the-art MLLMs highlights a significant "clinical reality gap." Overall model performance was poor, with accuracy declining progressively as the tasks aligned more closely with the actual CMR workflow. Detailed confusion analysis identified a "category-collapse" failure mode, wherein models tend to default to common abnormal categories instead of accurately distinguishing between clinically distinct findings.
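As a rough illustration of how a "category-collapse" pattern can be surfaced, the sketch below counts (true, predicted) label pairs and reports what share of all predictions land on each label. The label names and toy data are hypothetical placeholders, not CardioLens categories or results, and this is not the authors' analysis code.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Count how often each (true_label, predicted_label) pair occurs."""
    return Counter(zip(y_true, y_pred))


def prediction_share(y_pred):
    """Return the fraction of all predictions that fall on each label."""
    total = len(y_pred)
    return {label: n / total for label, n in Counter(y_pred).items()}


# Hypothetical toy labels (assumption): one frequent abnormal class, two rarer ones.
y_true = ["common", "rare_a", "rare_b", "rare_a", "common", "normal"]
y_pred = ["common", "common", "common", "common", "common", "normal"]

conf = confusion_counts(y_true, y_pred)
share = prediction_share(y_pred)

# A large share on one label, fed by many different true labels,
# is the kind of pattern the confusion analysis describes.
for label, frac in sorted(share.items(), key=lambda kv: -kv[1]):
    print(f"{label}: {frac:.2f} of predictions")
```

In this toy case most predictions concentrate on the single "common" label even though the true labels are spread across four categories.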

To test whether these results were driven mainly by how inputs were constructed to fit MLLM architectures, we compared several slice selection protocols (random, clinically motivated, and data-driven) under different slice budgets. The impact on performance was negligible, with variations typically amounting to only about 1%. Explicit reasoning prompts did not improve performance either; they often made models more conservative without improving their use of visual evidence.
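To make the idea of a slice budget concrete, here is a minimal sketch of two selection strategies over a stack of slice indices. The function names are hypothetical, and the evenly spaced variant is only a simple stand-in: the summary does not describe how the clinically motivated or data-driven protocols actually choose slices.

```python
import random


def select_random(num_slices, budget, seed=0):
    """Pick up to `budget` distinct slice indices uniformly at random."""
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    k = min(budget, num_slices)
    return sorted(rng.sample(range(num_slices), k))


def select_evenly_spaced(num_slices, budget):
    """Pick up to `budget` indices spread evenly across the stack.

    Stand-in baseline (assumption), not a protocol taken from the paper.
    """
    k = min(budget, num_slices)
    if k == 0:
        return []
    if k == 1:
        return [num_slices // 2]  # single slice: take the middle one
    step = (num_slices - 1) / (k - 1)
    return [round(i * step) for i in range(k)]


# Hypothetical example: a 30-slice series under a budget of 4 slices.
random_pick = select_random(30, 4)
spaced_pick = select_evenly_spaced(30, 4)
print(random_pick, spaced_pick)
```

Comparing model accuracy on inputs built by different selectors at the same budget is the kind of ablation the paragraph above describes.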

These findings indicate that current MLLMs are far from being reliable tools for CMR interpretation, a domain that demands the integration of distributed evidence across multiple sequences, views, and temporal phases. CardioLens offers a clinically grounded testbed to guide the development of next-generation MLLMs suitable for real-world clinical deployment.


Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC
