arXiv

Flexible Control of 3D CT Generation via Text and Semantically-Defined Segmentation Prompts

June 2, 2026 · Weicheng Dai, Chenyu Wang, Andy Li, Shantanu Ghosh, Kayhan Batmanghelich

Title: Achieving Flexible Control in 3D CT Generation Through Text and Semantic Segmentation Prompts

Abstract: Volumetric medical image generative models have established significant utility across various medical imaging tasks, including serving as priors for inverse problems and facilitating data augmentation. However, producing high-resolution 3D images with robust controllability remains a formidable challenge for these applications. Current methodologies generally rely on either text prompts derived from radiology reports or full-image segmentation maps for control. While text-based conditioning offers flexibility, it lacks precise spatial definition regarding the location, morphology, and boundaries of anomalies. Conversely, segmentation-driven approaches provide accurate spatial guidance but are constrained by the necessity for comprehensive organ annotations.

To address these limitations, we introduce a versatile multimodal framework for controllable volumetric image generation that accommodates both radiology reports and segmentation prompts, with either being optional. This system enables users to supply segmentation data for specific anatomical structures or pathologies without the need for complete organ-level annotations. The semantic context of each segmentation mask is clarified via an associated textual description, creating a highly adaptable and scalable conditioning strategy. Our architecture, built upon a modified diffusion transformer, is designed to be memory-efficient and simultaneously processes tokens for both images and segmentation data. Additionally, the model employs gated attention mechanisms to effectively manage long radiology reports.
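As a rough illustration of the conditioning idea described above, the following sketch concatenates image and segmentation tokens into one sequence and lets that sequence attend to report tokens through a gated residual. The abstract does not specify the form of the gating, so the tanh-gated residual used here is an assumption (one common formulation), and every name (`cross_attention`, `gated_cross_attention`, `gate`, the toy token values) is hypothetical. Learned projections, multiple heads, and the diffusion process itself are omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention without learned projections."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    dim = len(values[0])
    out = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)])
    return out


def gated_cross_attention(tokens, text_tokens, gate):
    """Add attended text features to each token, scaled by tanh(gate).

    Assumption: a scalar gate with a tanh nonlinearity; gate = 0 leaves the
    tokens unchanged, so the text pathway can be switched in gradually.
    """
    attended = cross_attention(tokens, text_tokens, text_tokens)
    g = math.tanh(gate)
    return [[t + g * a for t, a in zip(tok, att)] for tok, att in zip(tokens, attended)]


# Toy 2-dimensional tokens; the values are placeholders, not real features.
image_tokens = [[1.0, 0.0], [0.0, 1.0]]
seg_tokens = [[0.5, 0.5]]            # segmentation prompt tokens (optional input)
report_tokens = [[1.0, 1.0], [0.0, 2.0], [2.0, 0.0]]

# Image and segmentation tokens are processed as a single joint sequence.
joint = image_tokens + seg_tokens

closed = gated_cross_attention(joint, report_tokens, gate=0.0)
opened = gated_cross_attention(joint, report_tokens, gate=1.0)
```

With the gate at zero the joint sequence passes through unchanged, which shows how a gate can limit the influence of a long report on the image and segmentation tokens; whether the paper's gates are scalar, per-head, or per-token is not stated in the abstract.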

Experimental results indicate that our approach delivers state-of-the-art perceptual and semantic performance, notably achieving a 24% relative improvement in mean FID. The model generates high-resolution, anatomically consistent CT volumes and improves data efficiency when used for data augmentation. Furthermore, evaluations by radiologists confirm strong alignment between the generated images and authentic medical scans.
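For readers unfamiliar with how a relative improvement in FID is usually computed, a minimal sketch follows. It assumes the conventional definition for a lower-is-better metric (reduction divided by the baseline value); the abstract does not state the exact formula, and the numbers in the example are placeholders chosen for illustration, not results from the paper.

```python
def relative_improvement(baseline, ours):
    """Fractional reduction of a lower-is-better metric such as FID."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - ours) / baseline


# Placeholder values only: a hypothetical baseline mean FID of 10.0 reduced to 7.6.
example = relative_improvement(10.0, 7.6)
print(f"{example:.0%}")
```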
