arXiv

GLINT: Sparsely Gated Vision-Language Alignment for Fine-Grained Radiology Representations

June 3, 2026 · Jonggwon Park, Seongeun Lee, Junhyun Park, Hannah Yun, Hyunwoong Kim, Sohyun Jeong, Hyewon Kang, Byungmu Yoon, Kyoyun Choi

Title: GLINT: Achieving Fine-Grained Radiology Representations via Sparsely Gated Vision-Language Alignment

Vision-language models (VLMs) have become a scalable solution in radiology by capitalizing on the image-report pairs that are naturally generated during clinical workflows. However, this approach exposes a significant scale discrepancy: while individual findings occupy only minute regions within an image, the supervisory signal is applied globally to the entire image-report pair. Previous methods do not address this discrepancy, because they distribute weights densely across all image patches instead of focusing on the specific, sparse subset pertinent to a given query.

To resolve this issue, we introduce GLINT (Gated Language-Image alignmeNT), a framework designed to explicitly model these sparse correspondences. From an alignment perspective, we propose Sparsely Gated Alignment, a new architecture that employs a sigmoid gate within a distinct gate embedding space. This mechanism activates only the patches relevant to each textual query, thereby enforcing explicit sparsity. On the representation side, we incorporate Dense Feature Regularization, which anchors the intermediate features of the trainable encoder to a frozen self-supervised learning (SSL) teacher. This step preserves the fine-grained patch features essential for the gate’s operation.
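To make the two components concrete, the following is a minimal, dependency-free sketch and not the authors' implementation. It assumes that each patch and each query has a separate gate embedding, that the gate is a sigmoid over a scaled dot product, and that Dense Feature Regularization is a mean cosine distance to the teacher; the names `gated_pool`, `dense_feature_reg`, `scale`, and `bias` are hypothetical, and the abstract does not specify any of these forms. Unlike a softmax over patches, which always sums to one, independent sigmoid gates can all stay near zero, so patches unrelated to a query receive almost no weight.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_pool(patch_gate, patch_feat, query_gate, scale=4.0, bias=-2.0):
    """Pool patch features using one independent sigmoid gate per patch.

    Assumption: gates are scored in a gate space kept separate from the
    feature space; scale and bias are illustrative values only.
    """
    gates = [sigmoid(scale * dot(g, query_gate) + bias) for g in patch_gate]
    total = sum(gates) + 1e-8  # avoids division by zero if every gate is closed
    dim = len(patch_feat[0])
    pooled = [
        sum(w * f[d] for w, f in zip(gates, patch_feat)) / total
        for d in range(dim)
    ]
    return gates, pooled


def dense_feature_reg(student, teacher):
    """Assumed regularizer: mean (1 - cosine similarity) over patches.

    `teacher` stands in for features from a frozen SSL encoder.
    """
    loss = 0.0
    for s, t in zip(student, teacher):
        norm = math.sqrt(dot(s, s)) * math.sqrt(dot(t, t)) + 1e-8
        loss += 1.0 - dot(s, t) / norm
    return loss / len(student)


# Toy example: four patches, only the first matches the query in gate space.
patch_gate = [[2.0, 0.0], [-1.0, 0.0], [-1.0, 1.0], [-2.0, 0.0]]
patch_feat = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0]]
query_gate = [1.0, 0.0]

gates, pooled = gated_pool(patch_gate, patch_feat, query_gate)
print([round(g, 3) for g in gates])
print(dense_feature_reg(patch_feat, patch_feat))
```

In this toy setting the first gate is close to one and the rest are close to zero, so the pooled feature is dominated by the single relevant patch.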

The same methodology applies to both 2D chest X-ray (CXR) and 3D chest computed tomography (CT) modalities, utilizing DINOv3 and V-JEPA 2.1, respectively. GLINT supports zero-shot classification, grounding, and segmentation derived from free-text queries. To our knowledge, it is the first model to demonstrate zero-shot segmentation on 3D CT volumes without requiring mask supervision. Notably, the most significant improvements are observed in zero-shot grounding and segmentation tasks, where sparse, query-specific localization is critical, which is consistent with our design objectives. In downstream evaluations, GLINT surpasses both SSL encoders and medical VLMs in classification, report generation, and segmentation performance.
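One plausible way to turn per-patch gate values into a zero-shot mask or box is sketched below. The abstract does not describe how GLINT derives its masks, so the thresholding, nearest-neighbor upsampling, and the helper names `gates_to_mask` and `bounding_box` are all assumptions made for illustration, shown in 2D for brevity.

```python
def gates_to_mask(gates, grid_h, grid_w, patch, threshold=0.5):
    """Threshold per-patch gates and upsample to pixels by nearest neighbor.

    Assumption: gates are listed row by row over a grid_h x grid_w patch grid,
    and each patch covers a patch x patch block of pixels.
    """
    if len(gates) != grid_h * grid_w:
        raise ValueError("gate count must equal grid_h * grid_w")
    mask = []
    for r in range(grid_h * patch):
        row = []
        for c in range(grid_w * patch):
            g = gates[(r // patch) * grid_w + (c // patch)]
            row.append(1 if g >= threshold else 0)
        mask.append(row)
    return mask


def bounding_box(mask):
    """Return (top, left, bottom, right) of the active pixels, or None."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    if not rows:
        return None
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    return (rows[0], cols[0], rows[-1], cols[-1])


# Toy example: a 2 x 2 patch grid where only the top-left patch is active.
toy_gates = [0.9, 0.1, 0.2, 0.05]
toy_mask = gates_to_mask(toy_gates, grid_h=2, grid_w=2, patch=2)
for row in toy_mask:
    print(row)
print(bounding_box(toy_mask))
```

The same indexing extends to a 3D volume by adding a depth axis to the patch grid.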
