Cross-modal linkage risk in clinical vision-language models
Title: The Risk of Cross-Modal Re-linking in Clinical Vision-Language Models
Vision-language models (VLMs) trained on paired chest radiographs and radiology reports develop a shared embedding space that maintains instance-level correspondence between images and text. This capability introduces a significant privacy vulnerability in clinical environments where data modalities are intentionally decoupled post-acquisition, such as during image-only data distribution or when reports are restricted to authorized personnel. In these scenarios, a de-identified image can be matched back to its original narrative report solely through cosine similarity calculations.
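The matching step described above can be illustrated with a minimal sketch. It assumes image and report embeddings already live in the same space; the function names (`cosine_similarity`, `relink`) and the three-dimensional vectors are hypothetical and purely illustrative, not taken from the study.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def relink(image_embedding, report_embeddings):
    """Return the index of the candidate report most similar to the image."""
    scores = [cosine_similarity(image_embedding, r) for r in report_embeddings]
    return max(range(len(scores)), key=scores.__getitem__)

# Made-up toy embeddings; real models use far more dimensions.
image = [0.9, 0.1, 0.2]
reports = [
    [0.1, 0.8, 0.3],  # unrelated report
    [0.8, 0.2, 0.1],  # the report paired with the image
    [0.0, 0.1, 0.9],  # unrelated report
]
best = relink(image, reports)  # index of the highest-scoring candidate
```

No label, identifier, or metadata is consulted here: the ranking depends only on the geometry of the shared embedding space, which is why removing identifiers from the image does not by itself prevent the match.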
To quantify this threat, we formalized the issue as an image-to-report retrieval task. Rather than simulating a privacy breach, we utilized public paired cohorts, where the correct pairings are inherently known, as ground-truth benchmarks to audit the extent of the risk. Our evaluation involved VLMs with varying degrees of clinical specialization, testing them against 406,241 paired examples drawn from 126,804 patients. The dataset included 43,793 held-out pairs from MIMIC-CXR and 29,296 pairs from the external CheXpert Plus cohort.
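One way to score such a retrieval audit is Recall@K over a candidate pool. The sketch below assumes a precomputed score matrix in which the known pairing sits on the diagonal; the function name `recall_at_k` and the 4x4 matrix are hypothetical, and the study's actual evaluation code may differ.

```python
def recall_at_k(similarity, k=1):
    """Fraction of images whose true report (same index) ranks in the top k.

    similarity[i][j] is the score between image i and candidate report j;
    the known pairing is assumed to lie on the diagonal (j == i).
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Rank of the true report = number of candidates scoring strictly higher.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return hits / len(similarity)

# Made-up score matrix for a pool of N = 4 candidate reports.
scores = [
    [0.9, 0.2, 0.1, 0.3],  # true report ranked first
    [0.4, 0.7, 0.8, 0.1],  # true report ranked second
    [0.1, 0.2, 0.6, 0.3],  # true report ranked first
    [0.5, 0.6, 0.7, 0.2],  # true report ranked last
]
r1 = recall_at_k(scores, k=1)
r2 = recall_at_k(scores, k=2)
chance_r1 = 1 / len(scores)  # uniform random guessing over the pool
```

Because the ground-truth pairing is public in these cohorts, the metric can be computed without linking any record that was not already linked.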
The results demonstrated a systematic increase in re-linkage accuracy as model specialization grew. The most advanced VLM retrieved the correct report at a rate 15 times higher than chance when the candidate pool size (N) was 100, and 50 times higher than chance when N was 10,000. This performance remained significantly above random chance even at full-database scale. Furthermore, the re-linkage signal persisted even when subjected to pathology-matched hard negatives, which eliminated shortcuts based on disease labels, indicating that the models capture correspondence beyond broad diagnostic categories.
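To make the multiples concrete, the short calculation below converts them into absolute Recall@1 values. It rests on two assumptions that the text does not state explicitly: that chance means uniform random selection from the pool (1/N), and that "15 times higher than chance" means a rate equal to 15 times the chance rate. The resulting figures are implied by that reading, not reported values, and the helper name `implied_recall_at_1` is hypothetical.

```python
def implied_recall_at_1(lift_over_chance, pool_size):
    """Recall@1 implied by a lift over uniform chance (1 / pool_size)."""
    return lift_over_chance / pool_size

# Assumed reading: chance at N = 100 is 1%, at N = 10,000 is 0.01%.
r_100 = implied_recall_at_1(15, 100)      # roughly 15% of images re-linked
r_10k = implied_recall_at_1(50, 10_000)   # roughly 0.5% of images re-linked
```

Under this reading, the absolute hit rate falls as the pool grows even though the multiple over chance rises, because the chance baseline shrinks faster than the model's accuracy does.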
We investigated methods to mitigate this risk without the need for full model retraining. By freezing the encoders and applying differentially private optimization exclusively to the projection heads that define the alignment layer (with epsilon = 0.34 and delta = 6 × 10⁻⁶), we achieved a substantial reduction in re-linkage capability. This approach decreased Recall@1 by 61.8% at N = 10,000 on the MIMIC-CXR dataset, and the mitigation transferred to CheXpert Plus without further retraining. The utility of the image representations remained largely intact: the macro AUROC for linear-probe classification across 14 labels shifted only marginally, from 79.63% to 79.43%. These findings suggest that targeted differentially private fine-tuning of the shared alignment layer can significantly curb cross-modal re-linkage while preserving the clinical utility of the underlying image features.
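The text does not specify the optimizer, so the following is only a sketch of the standard DP-SGD recipe (per-example gradient clipping followed by Gaussian noise), applied to a flat list of projection-head weights while the encoders stay frozen and are never touched. The function names, the clipping norm, the noise multiplier, and the learning rate are all hypothetical, and the privacy accounting that would yield a specific epsilon and delta is not shown.

```python
import math
import random

def clip(grad, clip_norm):
    """Scale a gradient vector so its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]

def dp_sgd_step(weights, per_example_grads, clip_norm, noise_multiplier, lr, rng):
    """One DP-SGD update: clip each example's gradient, sum, add noise, average."""
    clipped = [clip(g, clip_norm) for g in per_example_grads]
    summed = [sum(column) for column in zip(*clipped)]
    sigma = noise_multiplier * clip_norm  # noise scale tied to the clipping bound
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    batch_size = len(per_example_grads)
    return [w - lr * n / batch_size for w, n in zip(weights, noisy)]

# Hypothetical projection-head weights and per-example gradients.
rng = random.Random(0)
head = [1.0, 1.0]
grads = [[3.0, 4.0], [0.3, 0.4]]
updated = dp_sgd_step(head, grads, clip_norm=1.0, noise_multiplier=1.1, lr=0.1, rng=rng)
```

Clipping bounds how much any single image-report pair can move the alignment layer, and the added noise masks what remains of that contribution. Restricting the update to the projection heads leaves the frozen image features unchanged, which is consistent with the small shift in linear-probe AUROC reported above.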
Source: arXiv Generated at: 2026-06-02 00:00:00 UTC




