arXiv

Scaling Parallel Sequence Models to Foundation-Scale Vision Encoders

June 2, 2026 · Yitong Jiang, Hongjun Wang, Collin McCarthy, Hanrong Ye, David Wehr, Xinhao Li, Qi Dou, Tianfan Xue, Ka Chun Cheung, Simon See, Wonmin Byeon, Ke Chen, Kai Han, Jinwei Gu, Hongxu Yin, Pavlo Molchanov, Jan Kautz, Sifei Liu

Abstract

The adoption of vision foundation models is hindered by the quadratic cost of self-attention, which limits input resolution and inflates the cost of large-scale pretraining. Subquadratic alternatives such as state-space models and linear attention reduce this cost, but they typically flatten images into one-dimensional token sequences and so weaken the two-dimensional spatial structure that visual tasks depend on. In contrast, Generalized Spatial Propagation Networks (GSPN) keep the 2D grid intact by propagating context through line-scan recurrences, achieving near-linear complexity without relying on positional embeddings. Despite these advantages, GSPN has not been widely used as a foundation-scale encoder.
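As a rough illustration of what a line-scan recurrence on a 2D grid can look like, the following is a minimal single-channel sketch of one top-to-bottom pass. It assumes that each pixel mixes its three nearest neighbors in the previous row with given weights and adds a gated copy of its own input; the names `line_scan_top_to_bottom`, `weights`, and `gate` are hypothetical, and this is not the GSPN kernel described here.

```python
def line_scan_top_to_bottom(x, weights, gate):
    """One top-to-bottom propagation pass over an H x W grid.

    x[i][j]       : input value at row i, column j
    weights[i][j] : (w_left, w_center, w_right) applied to the three
                    neighbors in row i-1 (assumed layout, for illustration)
    gate[i][j]    : scalar scaling the pixel's own input
    """
    height, width = len(x), len(x[0])
    h = [[0.0] * width for _ in range(height)]

    # The first row has no predecessor, so it only sees its own input.
    for j in range(width):
        h[0][j] = gate[0][j] * x[0][j]

    # Rows are processed sequentially; within a row every column is
    # independent, which is the part a GPU kernel could run in parallel.
    for i in range(1, height):
        prev = h[i - 1]
        for j in range(width):
            w_left, w_center, w_right = weights[i][j]
            left = prev[j - 1] if j > 0 else 0.0
            right = prev[j + 1] if j < width - 1 else 0.0
            h[i][j] = (w_left * left + w_center * prev[j] + w_right * right
                       + gate[i][j] * x[i][j])
    return h


# Usage: a single impulse in the top row spreads downward through the grid.
size = 3
x = [[0.0] * size for _ in range(size)]
x[0][1] = 1.0
uniform = [[(1 / 3, 1 / 3, 1 / 3)] * size for _ in range(size)]
ones = [[1.0] * size for _ in range(size)]
hidden = line_scan_top_to_bottom(x, uniform, ones)
```

Under these assumptions each pixel is updated once with a constant amount of work, so one pass costs time proportional to the number of pixels, and the grid is never flattened into a 1D sequence.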

This paper introduces C-GSPN, a foundation-scale vision encoder built upon 2D spatial propagation. We enhance the practicality of the GSPN operator through three key innovations:

High-Performance CUDA Implementation: We develop a fast GSPN CUDA kernel that consolidates per-step launches into a single, warp-specialized kernel. Using shared-memory tiling, coalesced memory access, and compact multi-channel propagation, it reaches over 90% of peak memory bandwidth and runs 40 to 52 times faster than the original GSPN implementation.
Efficient Latent-Space Propagation: We introduce a compressed latent-space propagation block with integrated normalization. This design carries the kernel-level gains through to block- and model-level efficiency.
Cross-Operator Distillation: We use a two-stage distillation recipe to train the new architecture from an attention-based teacher model, avoiding expensive from-scratch foundation-scale training.
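The abstract does not spell out the two distillation stages or the loss. To give a sense of what training a student against an attention-based teacher can involve, the sketch below shows a generic feature-matching objective: the mean cosine distance between student and teacher token features. The function names and the choice of cosine distance are assumptions made for illustration, not the recipe used for C-GSPN.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def feature_distillation_loss(student_tokens, teacher_tokens):
    """Mean (1 - cosine similarity) over matching token positions.

    Both arguments are lists of feature vectors, one per token. The teacher
    features are treated as fixed targets (hypothetical setup).
    """
    if len(student_tokens) != len(teacher_tokens):
        raise ValueError("student and teacher must have the same token count")
    if not student_tokens:
        raise ValueError("need at least one token")
    distances = [1.0 - cosine_similarity(s, t)
                 for s, t in zip(student_tokens, teacher_tokens)]
    return sum(distances) / len(distances)


# Usage: the loss is near 0 when features align and grows as they diverge.
teacher = [[1.0, 2.0], [0.0, 1.0]]
aligned = feature_distillation_loss([[2.0, 4.0], [0.0, 3.0]], teacher)
orthogonal = feature_distillation_loss([[2.0, -1.0], [1.0, 0.0]], teacher)
```

Because the loss compares output features only, it places no constraint on the operator inside the student, which is what makes a cross-operator setup of this general kind possible.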

When distilled using 600 million image-text pairs, C-GSPN performs on par with an isomorphic Vision Transformer (ViT) baseline while requiring 15% fewer parameters. Additionally, it improves ADE20K segmentation scores by 2.1%. The model demonstrates strong transfer capabilities to high-resolution tasks with significantly less data than required for training from scratch. Furthermore, C-GSPN delivers a fourfold end-to-end block speedup at 2K resolution during single-pass, tiling-free inference.
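To make the resolution argument concrete, the following back-of-the-envelope sketch compares how the two kinds of operator scale with the token grid. It assumes a square 2048 x 2048 input, a patch size of 16, three neighbors per update, and four scan directions; all of these are illustrative assumptions. The counts ignore channel width, projections, and constant factors, so they are not measurements from the paper and do not predict the reported speedup.

```python
def token_grid(image_side, patch_size):
    """Return (tokens per side, total tokens) for a square image."""
    if image_side % patch_size != 0:
        raise ValueError("image side must be divisible by the patch size")
    side = image_side // patch_size
    return side, side * side


def attention_pair_count(num_tokens):
    """Token pairs scored by full self-attention: quadratic in token count."""
    return num_tokens * num_tokens


def line_scan_update_count(side, neighbors=3, directions=4):
    """Neighbor reads for line scans over a side x side grid: linear in tokens."""
    return directions * side * side * neighbors


side_2k, tokens_2k = token_grid(2048, 16)
side_1k, tokens_1k = token_grid(1024, 16)

# Doubling the image side multiplies attention pairs by 16,
# but line-scan updates only by 4.
attention_growth = attention_pair_count(tokens_2k) // attention_pair_count(tokens_1k)
scan_growth = line_scan_update_count(side_2k) // line_scan_update_count(side_1k)
```

Under these assumptions a 2K input yields 16,384 tokens, and the gap between the two growth rates widens with every further increase in resolution.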
