arXiv

Spatial Transcriptomics as Images for Large-Scale Pretraining

June 4, 2026 · Yishun Zhu, Jiaxin Qi, Jian Wang, Yuhua Zheng, Jianqiang Huang


Original (arXiv:2603.13432v4): Spatial Transcriptomics (ST) profiles thousands of gene expression values at discrete spots with precise coordinates on tissue sections, preserving spatial context essential for clinical and pathological studies. With rising sequencing throughput and advancing platforms, the expanding data volumes motivate large-scale ST pretraining. However, the fundamental unit for pretraining, i.e., what constitutes a single training sample, remains ill-posed. Existing choices fall into two camps: (1) treating each spot as an independent sample, which discards spatial dependencies and collapses ST into single-cell transcriptomics; and (2) treating an entire slide as a single sample, which produces prohibitively large inputs and drastically fewer training examples, undermining effective pretraining. To address this gap, we propose treating spatial transcriptomics as croppable images. Specifically, we define a multi-channel image representation with fixed spatial size by cropping patches from raw slides, thereby preserving spatial context while substantially increasing the number of training samples. Along the channel dimension, we define gene subset selection rules to control input dimensionality and improve pretraining stability. Extensive experiments show that the proposed image-like dataset construction for ST pretraining consistently improves downstream performance, outperforming conventional pretraining schemes. Ablation studies verify that both spatial patching and channel design are necessary, establishing a unified, practical paradigm for organizing ST data and enabling large-scale pretraining.

Rewrite: Spatial Transcriptomics (ST) measures the expression of thousands of genes at discrete spots with precise coordinates on tissue sections, preserving the spatial context that clinical and pathological research depends on. As sequencing throughput rises and platforms advance, growing data volumes have motivated large-scale pretraining for ST. Yet the basic unit of pretraining remains ill-defined: what exactly constitutes a single training sample? Existing methods follow one of two approaches. The first treats each spot as an independent sample, which discards spatial dependencies and effectively reduces ST to single-cell transcriptomics. The second treats the whole slide as one sample, which produces prohibitively large inputs and far fewer training examples, undermining effective pretraining. To address this gap, we propose treating spatial transcriptomics data as croppable images. By cropping patches from raw slides, we define a multi-channel image representation with a fixed spatial size. This preserves spatial context while substantially increasing the number of training samples. Along the channel dimension, we define gene subset selection rules to control input dimensionality and improve pretraining stability. Extensive experiments show that this image-like dataset construction for ST pretraining consistently improves downstream performance and outperforms conventional pretraining schemes. Ablation studies verify that both spatial patching and channel design are necessary, establishing a unified, practical paradigm for organizing ST data and enabling large-scale pretraining.
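The patch construction described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. It assumes that spots already sit on an integer (row, column) grid, and it uses a top-variance rule for gene subset selection as a stand-in, since the abstract does not state the actual selection rules. The function names `select_genes`, `rasterize`, and `crop_patches`, along with the patch size, stride, and `min_fill` threshold, are hypothetical.

```python
import numpy as np


def select_genes(expr, k):
    """Assumed rule (not from the paper): keep the k highest-variance genes."""
    variances = expr.var(axis=0)
    # Take the k largest variances, then sort indices for a stable channel order.
    return np.sort(np.argsort(variances)[-k:])


def rasterize(coords, expr, grid_shape):
    """Write each spot's expression vector at its (row, col) grid position."""
    h, w = grid_shape
    image = np.zeros((h, w, expr.shape[1]), dtype=np.float32)
    mask = np.zeros((h, w), dtype=bool)  # True where a spot was measured
    image[coords[:, 0], coords[:, 1]] = expr
    mask[coords[:, 0], coords[:, 1]] = True
    return image, mask


def crop_patches(image, mask, patch, stride, min_fill=0.5):
    """Slide a fixed-size window and keep patches with enough measured spots."""
    h, w, channels = image.shape
    kept = []
    for r in range(0, h - patch + 1, stride):
        for c in range(0, w - patch + 1, stride):
            if mask[r:r + patch, c:c + patch].mean() >= min_fill:
                kept.append(image[r:r + patch, c:c + patch])
    if not kept:
        return np.empty((0, patch, patch, channels), dtype=image.dtype)
    return np.stack(kept)


# Synthetic slide: a fully covered 64 x 64 grid of spots with 500 genes each.
rng = np.random.default_rng(0)
rows, cols = np.meshgrid(np.arange(64), np.arange(64), indexing="ij")
coords = np.stack([rows.ravel(), cols.ravel()], axis=1)
expr = rng.poisson(1.0, size=(coords.shape[0], 500)).astype(np.float32)

genes = select_genes(expr, k=64)                      # channel selection
image, mask = rasterize(coords, expr[:, genes], (64, 64))
patches = crop_patches(image, mask, patch=16, stride=16)
print(patches.shape)  # (16, 16, 16, 64): 16 patches of 16 x 16 spots, 64 genes
```

In this toy setup one slide yields 16 training samples instead of one, each with the same spatial size and channel count, which is the property the abstract attributes to the image-like representation. A smaller stride would produce overlapping patches and more samples.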
