Scaling Parallel Sequence Models to Foundation-Scale Vision Encoders
Abstract
Adoption of vision foundation models is limited by the cost of self-attention, which grows quadratically with the number of tokens. This restricts input resolution and inflates the cost of large-scale pretraining. Subquadratic alternatives such as state-space models and linear attention reduce this cost, but they typically flatten images into one-dimensional token sequences, weakening the two-dimensional spatial structure that visual tasks depend on. In contrast, Generalized Spatial Propagation Networks (GSPN) preserve the 2D grid by propagating context through line-scan recurrences, achieving near-linear complexity without positional embeddings. Despite these advantages, GSPN has not been widely adopted as a foundation-scale encoder.
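To make the line-scan idea concrete, the following is a minimal NumPy sketch of a single top-to-bottom scan. It is an illustration under stated assumptions, not the C-GSPN kernel: the function name `line_scan_top_to_bottom`, the three-neighbour connectivity, the softmax normalization of the weights, and the edge padding at the borders are all choices made here for the example.

```python
import numpy as np


def line_scan_top_to_bottom(x, logits, gate):
    """One top-to-bottom line scan over a 2D grid (illustrative sketch).

    x:      (H, W) input features for a single channel.
    logits: (H, W, 3) unnormalized weights toward the upper-left, upper,
            and upper-right neighbours in the previous row (assumed layout).
    gate:   (H, W) per-pixel scale applied to the input.
    """
    H, W = x.shape
    # Assumption: softmax over the three neighbours, so each pixel takes a
    # convex combination of the previous row and the state stays bounded.
    w = np.exp(logits - logits.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)

    h = np.zeros((H, W), dtype=np.float64)
    h[0] = gate[0] * x[0]
    for i in range(1, H):  # H sequential steps; each step is parallel over W
        prev = np.pad(h[i - 1], 1, mode="edge")  # replicate border values
        left, up, right = prev[:-2], prev[1:-1], prev[2:]
        h[i] = (w[i, :, 0] * left + w[i, :, 1] * up + w[i, :, 2] * right
                + gate[i] * x[i])
    return h
```

Each pixel is visited once, so the work grows linearly with the number of pixels, while only the row loop is sequential. A complete operator would typically repeat the scan in other directions and merge the results; that step is omitted here.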
This paper introduces C-GSPN, a foundation-scale vision encoder built on 2D spatial propagation. We make the GSPN operator practical at this scale through three contributions:
- High-Performance CUDA Implementation: We develop a fast GSPN CUDA kernel that consolidates per-step launches into a single, warp-specialized kernel. Using shared-memory tiling, coalesced memory access, and compact multi-channel propagation, the kernel reaches over 90% of peak memory bandwidth and runs 40 to 52 times faster than the original GSPN implementation.
- Efficient Latent-Space Propagation: We introduce a compressed latent-space propagation block with integrated normalization. This design carries the kernel-level gains through to block-level and model-level efficiency.
- Cross-Operator Distillation: We use a two-stage distillation recipe that trains the new architecture from an attention-based teacher model, avoiding expensive foundation-scale training from scratch.
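As a rough illustration of distilling from an attention-based teacher, the sketch below computes a cosine-distance loss between student and teacher token features. The abstract does not specify the objective or what the two stages are, so the loss form, the function name `cosine_distill_loss`, and the `(N, D)` feature layout are assumptions made for this example only.

```python
import numpy as np


def cosine_distill_loss(student, teacher, eps=1e-8):
    """Mean cosine distance between matching token features (assumed loss).

    student, teacher: (N, D) arrays of token features with the same shape.
    Returns a float in [0, 2]; 0 means every student token points in the
    same direction as its teacher token.
    """
    # Normalize each token to unit length; eps guards against zero vectors.
    s = student / (np.linalg.norm(student, axis=-1, keepdims=True) + eps)
    t = teacher / (np.linalg.norm(teacher, axis=-1, keepdims=True) + eps)
    return float(np.mean(1.0 - np.sum(s * t, axis=-1)))
```

In a training loop the teacher features would be computed without gradients and only the student would be updated. Because the loss compares features rather than attention maps, it does not require the student to use the same token-mixing operator as the teacher.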
When distilled on 600 million image-text pairs, C-GSPN performs on par with an isomorphic Vision Transformer (ViT) baseline while using 15% fewer parameters, and it improves ADE20K segmentation scores by 2.1%. The model transfers well to high-resolution tasks with significantly less data than training from scratch requires. At 2K resolution, C-GSPN delivers a fourfold end-to-end block speedup in single-pass, tiling-free inference.
Source: arXiv Generated at: 2026-06-02 00:00:00 UTC





