Global News Digest

arXiv

Cross-modal linkage risk in clinical vision-language models

Title: The Risk of Cross-Modal Re-linking in Clinical Vision-Language Models

Vision-language models (VLMs) trained on paired chest radiographs and radiology reports develop a shared embedding space that maintains instance-level correspondence between images and text. This capability introduces a significant privacy vulnerability in clinical environments where data modalities are intentionally decoupled post-acquisition, such as during image-only data distribution or when reports are restricted to authorized personnel. In these scenarios, a de-identified image can be matched back to its original narrative report solely through cosine similarity calculations.
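The matching step described above can be illustrated with a minimal sketch. It assumes image and report embeddings are already available as NumPy arrays; the function name `relink` and the synthetic data are hypothetical and are not taken from the paper.

```python
import numpy as np

def relink(image_emb, report_emb):
    """Return, for each image, the index of its most similar report."""
    # L2-normalise rows so that a dot product equals cosine similarity.
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    rep = report_emb / np.linalg.norm(report_emb, axis=1, keepdims=True)
    sim = img @ rep.T              # shape: (n_images, n_reports)
    return sim.argmax(axis=1)

# Synthetic stand-in for a shared embedding space (assumption, not real data):
# each "image" embedding is its paired report embedding plus small noise,
# and the images are shuffled so the pairing is hidden.
rng = np.random.default_rng(0)
reports = rng.normal(size=(50, 32))
perm = rng.permutation(50)
images = reports[perm] + 0.05 * rng.normal(size=(50, 32))

matches = relink(images, reports)
print((matches == perm).mean())    # fraction of images re-linked correctly
```

In this toy setting the pairing is recovered because the paired embeddings are nearly identical; real models align image and text far less tightly, which is why the audit reports retrieval rates rather than exact recovery.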

To quantify this threat, we formalized the issue as an image-to-report retrieval task. Rather than simulating a privacy breach, we used public paired cohorts, in which the correct pairings are known by construction, as ground-truth benchmarks to audit the extent of the risk. Our evaluation covered VLMs with varying degrees of clinical specialization, tested against 406,241 paired examples drawn from 126,804 patients. The dataset included 43,793 held-out pairs from MIMIC-CXR and 29,296 pairs from the external CheXpert Plus cohort.

The results demonstrated a systematic increase in re-linkage accuracy as model specialization grew. The most advanced VLM retrieved the correct report at a rate 15 times higher than chance when the candidate pool size (N) was 100, and 50 times higher than chance when N was 10,000. This performance remained significantly above random chance even at full-database scale. Furthermore, the re-linkage signal persisted even when subjected to pathology-matched hard negatives, which eliminated shortcuts based on disease labels, indicating that the models capture correspondence beyond broad diagnostic categories.
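For a candidate pool of N reports containing exactly one correct match, random guessing gives a Recall@1 of 1/N. If "times higher than chance" is read as a ratio to that baseline, the reported figures would correspond to a Recall@1 of about 15% at N = 100 and about 0.5% at N = 10,000. The sketch below shows one way such a metric could be computed. The function names and the uniform random sampling of distractors are assumptions for illustration, not the paper's evaluation protocol.

```python
import numpy as np

def recall_at_1(sim, pool_size, rng):
    """Recall@1 when each query is scored against a pool of `pool_size` reports.

    sim[i, j] is the similarity of image i to report j; the true pair is j == i.
    Each pool holds the true report plus pool_size - 1 random distractors.
    """
    n = sim.shape[0]
    if not 2 <= pool_size <= n:
        raise ValueError("pool_size must be between 2 and the number of reports")
    hits = 0
    for i in range(n):
        others = np.delete(np.arange(n), i)
        distractors = rng.choice(others, size=pool_size - 1, replace=False)
        # A hit means the true report outscores every distractor.
        hits += int(sim[i, i] > sim[i, distractors].max())
    return hits / n

def fold_over_chance(recall, pool_size):
    """Ratio of observed Recall@1 to the 1/pool_size chance baseline."""
    return recall * pool_size

def implied_recall(fold, pool_size):
    """Recall@1 implied by a given fold-over-chance figure."""
    return fold / pool_size

print(implied_recall(15, 100))      # Recall@1 implied by 15x chance at N = 100
print(implied_recall(50, 10_000))   # Recall@1 implied by 50x chance at N = 10,000
```

A pathology-matched hard-negative variant would restrict `distractors` to reports sharing the query's disease labels, which this sketch does not implement.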

We investigated methods to mitigate this risk without full model retraining. By freezing the encoders and applying differentially private optimization exclusively to the projection heads that define the alignment layer (with epsilon = 0.34 and delta = 6 × 10⁻⁶), we achieved a substantial reduction in re-linkage capability. This approach decreased Recall@1 by 61.8% at N = 10,000 on the MIMIC-CXR dataset, and the mitigation transferred to CheXpert Plus without further retraining. The utility of the image representations remained largely intact: the macro AUROC for linear-probe classification across 14 labels fell by only 0.20 percentage points, from 79.63% to 79.43%. These findings suggest that targeted differentially private fine-tuning of the shared alignment layer can significantly curb cross-modal re-linkage while preserving the clinical utility of the underlying image features.
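The summary does not specify the optimizer or training loss, so the following is only a sketch of what a differentially private update restricted to a projection head can look like. It uses the standard DP-SGD recipe (per-example gradient clipping plus Gaussian noise) on a single linear head over frozen encoder features. The squared-error alignment loss, the function name `dp_sgd_step`, and all hyperparameter values are assumptions chosen for illustration. The privacy accounting that would map the noise level to a specific (epsilon, delta) is omitted.

```python
import numpy as np

def dp_sgd_step(W, feats, targets, lr, clip_norm, noise_multiplier, rng):
    """One DP-SGD update of a linear projection head W (d_out x d_in).

    feats:   frozen image-encoder features, shape (B, d_in)
    targets: frozen text-side embeddings,   shape (B, d_out)
    Loss per example (stand-in, not the paper's): 0.5 * ||W x - t||^2
    """
    residual = feats @ W.T - targets                    # (B, d_out)
    # Per-example gradient of the loss w.r.t. W is outer(residual_i, x_i).
    grads = residual[:, :, None] * feats[:, None, :]    # (B, d_out, d_in)
    # Clip each example's gradient to L2 norm <= clip_norm.
    norms = np.linalg.norm(grads.reshape(len(grads), -1), axis=1)
    scale = np.minimum(1.0, clip_norm / (norms + 1e-12))
    clipped = grads * scale[:, None, None]
    # Add Gaussian noise calibrated to the clipping bound, then average.
    noise = rng.normal(scale=noise_multiplier * clip_norm, size=W.shape)
    noisy_mean = (clipped.sum(axis=0) + noise) / len(feats)
    return W - lr * noisy_mean

# Toy usage with random stand-ins for frozen encoder outputs.
rng = np.random.default_rng(0)
W = np.zeros((4, 8))
feats = rng.normal(size=(16, 8))
targets = rng.normal(size=(16, 4))
W = dp_sgd_step(W, feats, targets, lr=0.1, clip_norm=1.0,
                noise_multiplier=1.0, rng=rng)
```

Because only `W` is updated, the encoder weights and therefore the image features used by a downstream linear probe are unchanged, which is consistent with the small AUROC shift reported above.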


Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC
