ASAP: Advancing Medical Volumetric Representation Learning with Anatomy-aware Semantically-adaptive Pre-training
Title: ASAP: Enhancing Medical Volumetric Representation Learning through Anatomy-Aware and Semantically-Adaptive Pre-training
Abstract:
Acquiring transferable and interpretable representations from medical volumetric data is a significant challenge, primarily due to the intricate nature of anatomical structures and the sparse, heterogeneous supervision found in radiology reports. To address this, we introduce Anatomy-aware Semantically-Adaptive Pre-training (ASAP), a robust vision-language framework designed for fine-grained representation learning from large-scale chest CT scans paired with their associated radiology reports.
ASAP is built upon three core components:
1. An anatomy-aware knowledge injection module that embeds organ-level structural priors using off-the-shelf segmentation tools, thereby fostering anatomically coherent representations.
2. A semantically-adaptive selective alignment mechanism that dynamically links sentence-level clinical findings with specific, localized regions within the volumetric data.
3. A semantically-adaptive fusion module that facilitates effective interaction between anatomically informed visual features and grounded textual cues, operating under a dual-modal masked modeling paradigm.
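The abstract does not specify how the components are implemented, so the following is only a minimal sketch of the general idea behind the first two: pooling visual features inside segmentation-defined regions, then softly matching sentence embeddings to those regions. It is not the authors' method. The function names (`region_pool`, `align_sentences`), the cosine-similarity scoring, the softmax weighting, and the `temperature` parameter are all illustrative assumptions, and the toy vectors are invented for the example.

```python
import math

# Hypothetical sketch, NOT the ASAP implementation: all names and choices
# below (mean pooling, cosine similarity, softmax, temperature) are assumptions.

def region_pool(features, labels, num_regions):
    """Average feature vectors within each segmentation label.

    features: list of feature vectors, one per voxel or patch
    labels:   list of integer region ids in [0, num_regions), same length
    """
    dim = len(features[0])
    sums = [[0.0] * dim for _ in range(num_regions)]
    counts = [0] * num_regions
    for vec, lab in zip(features, labels):
        counts[lab] += 1
        for i, v in enumerate(vec):
            sums[lab][i] += v
    # Regions that contain no voxels keep a zero vector.
    return [[s / c for s in row] if c else row for row, c in zip(sums, counts)]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def align_sentences(sentences, regions, temperature=0.1):
    """For each sentence embedding, return softmax weights over regions."""
    weights = []
    for s in sentences:
        sims = [cosine(s, r) / temperature for r in regions]
        m = max(sims)  # subtract the max for numerical stability
        exps = [math.exp(x - m) for x in sims]
        z = sum(exps)
        weights.append([e / z for e in exps])
    return weights


# Toy data: three patches, two regions (e.g. two organs), two sentences.
features = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]]
labels = [0, 0, 1]
regions = region_pool(features, labels, num_regions=2)

sentences = [[1.0, 0.1], [0.0, 1.0]]
weights = align_sentences(sentences, regions)
for i, w in enumerate(weights):
    print(f"sentence {i}: best region = {w.index(max(w))}")
```

In this sketch a lower `temperature` concentrates each sentence's weight on its single best-matching region, while a higher value spreads it across regions; a real system would learn the embeddings and would likely apply the alignment inside a contrastive or masked-modeling objective.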
In addition to these methodological advances, we introduce a comprehensive benchmark for medical volumetric vision-language pre-training focused on chest CTs. This benchmark encompasses 15 distinct datasets and 22 downstream tasks, including abnormality classification, segmentation, disease prognosis prediction, report generation, vocabulary classification, cross-modal retrieval, and visual question answering. By providing standardized evaluation protocols, this benchmark enables the systematic assessment of representation quality across various clinical settings and data regimes.
Extensive experimental results show that ASAP consistently delivers state-of-the-art performance across all evaluated tasks and datasets. Notably, the model exhibits particularly substantial improvements in scenarios involving limited supervision and distribution shifts, confirming its efficacy in learning transferable and clinically relevant volumetric representations.
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC





