arXiv

Scaling Parallel Sequence Models to Foundation-Scale Vision Encoders

Abstract

Adoption of vision foundation models is hindered by the quadratic computational cost of self-attention, which restricts input resolution and inflates the cost of large-scale pretraining. Subquadratic alternatives such as state-space models and linear attention reduce this cost, but they typically flatten images into one-dimensional token sequences, weakening the two-dimensional spatial structure that visual tasks depend on. In contrast, Generalized Spatial Propagation Networks (GSPN) preserve the 2D grid by propagating context through line-scan recurrences, achieving near-linear complexity without relying on positional embeddings. Despite these advantages, GSPN has not been widely used as a foundation-scale encoder.
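To make the idea of a line-scan recurrence on a 2D grid concrete, the following is a minimal NumPy sketch of a single top-to-bottom scan. It is an illustration under stated assumptions, not the paper's kernel: the three-neighbour connectivity, the softmax normalization of the weights, the per-element input gate, and the names `line_scan_top_to_bottom`, `logits`, and `gate` are all hypothetical choices made for this example.

```python
import numpy as np

def line_scan_top_to_bottom(x, logits, gate):
    """Propagate context row by row over an (H, W, C) feature grid.

    x      : (H, W, C) input features
    logits : (H, W, 3) unnormalized weights for the three neighbours in the
             previous row (upper-left, upper, upper-right) -- an assumption
    gate   : (H, W, C) per-element input gate -- an assumption
    """
    H, W, C = x.shape

    # Softmax over the three neighbours, so each output is a convex
    # combination of hidden states from the previous row.
    w = np.exp(logits - logits.max(axis=-1, keepdims=True))
    w[:, 0, 0] = 0.0    # no upper-left neighbour at the left border
    w[:, -1, 2] = 0.0   # no upper-right neighbour at the right border
    w = w / w.sum(axis=-1, keepdims=True)

    h = np.zeros((H, W, C), dtype=np.float64)
    h[0] = gate[0] * x[0]
    for i in range(1, H):
        # Pad the previous row so shifted views line up with columns j-1, j, j+1.
        prev = np.pad(h[i - 1], ((1, 1), (0, 0)))
        left, up, right = prev[:-2], prev[1:-1], prev[2:]
        h[i] = (w[i, :, 0:1] * left
                + w[i, :, 1:2] * up
                + w[i, :, 2:3] * right
                + gate[i] * x[i])
    return h
```

Each row is updated in one vectorized step, so a scan over an H x W grid takes H sequential steps rather than H x W, and the grid layout itself carries position without a separate positional embedding. A full encoder would presumably combine several scan directions; this sketch shows only one.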

This paper introduces C-GSPN, a foundation-scale vision encoder built upon 2D spatial propagation. We enhance the practicality of the GSPN operator through three key innovations:

  1. High-Performance CUDA Implementation: We develop a fast GSPN CUDA kernel that consolidates per-step launches into a single, warp-specialized kernel. Using shared-memory tiling, coalesced memory access, and compact multi-channel propagation, it reaches over 90% of peak memory bandwidth and runs 40 to 52 times faster than the original GSPN implementation.
  2. Efficient Latent-Space Propagation: We introduce a compressed latent-space propagation block with integrated normalization. This design carries the kernel-level gains through to block- and model-level efficiency.
  3. Cross-Operator Distillation: We use a two-stage distillation recipe that trains the new architecture from an attention-based teacher model, avoiding expensive from-scratch training at foundation scale.

When distilled using 600 million image-text pairs, C-GSPN performs on par with an isomorphic Vision Transformer (ViT) baseline while requiring 15% fewer parameters. Additionally, it improves ADE20K segmentation scores by 2.1%. The model demonstrates strong transfer capabilities to high-resolution tasks with significantly less data than required for training from scratch. Furthermore, C-GSPN delivers a fourfold end-to-end block speedup at 2K resolution during single-pass, tiling-free inference.
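As a rough illustration of why tiling-free inference at 2K resolution stresses self-attention, the sketch below counts tokens and pairwise attention interactions at several input sizes. The patch size of 16 and the reading of "2K" as 2048 pixels per side are assumptions of this example and are not stated in the abstract; `token_counts` is a hypothetical helper.

```python
def token_counts(resolution, patch=16):
    """Return (tokens per side, total tokens, pairwise attention interactions).

    Assumes a square image split into non-overlapping patch x patch tokens.
    """
    side = resolution // patch
    tokens = side * side
    return side, tokens, tokens * tokens

for res in (224, 512, 1024, 2048):
    side, tokens, pairs = token_counts(res)
    # A row-wise line scan needs `side` sequential steps per direction,
    # while full self-attention compares every token with every other token.
    print(f"{res:>5}px  tokens={tokens:>6}  attention pairs={pairs:>12}  scan steps={side}")
```

Under these assumptions, moving from 224 to 2048 pixels per side increases the token count by roughly 84 times but the number of attention pairs by roughly 7,000 times, whereas the work of a near-linear operator grows with the token count.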


Source: arXiv Generated at: 2026-06-02 00:00:00 UTC
