ChannelTok: Efficient Flexible-Length Vision Tokenization
Title: ChannelTok: Streamlined Flexible-Length Vision Tokenization
Abstract:
Current state-of-the-art flexible vision tokenizers achieve high quality but at a prohibitive cost, typically relying on large, parameter-heavy backbones and slow, multi-stage generative decoders. We depart from this complex spatial-token paradigm and propose ChannelTok, a simple, lightweight, and fast channel-wise flexible-length tokenizer. By treating each latent channel as an individual visual token, our approach enables a parameter-efficient hybrid architecture that combines CNNs and Transformers. During training, we apply a stochastic tail-dropping strategy, which naturally encourages channels to order themselves by semantic importance. This ordering allows flexible compression at inference by simply keeping the first $k$ channels, and it also supports variable-length autoregressive image generation. We validate our method with extensive experiments on ImageNet, showing stable performance across token budgets. Our results set a new standard for quality and efficiency: our model achieves state-of-the-art perceptual quality (rFID 2.92), decodes $8.6\times$ faster, and uses $2.1\times$ fewer parameters (159M) than the next-best competitor. This study shows that channel-wise tokenization is an effective and practical framework for efficient visual representation.
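The tail-dropping idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the uniform sampling of the keep count, and the choice to zero-fill dropped channels are all assumptions made for illustration. A latent is modeled as a plain list of channels, each channel being one token.

```python
import random


def sample_keep_count(num_channels, min_keep=1, rng=random):
    """Sample how many leading channels survive this training step.

    Assumption: the keep count is drawn uniformly from [min_keep, num_channels].
    The abstract does not state the sampling distribution.
    """
    return rng.randint(min_keep, num_channels)


def tail_drop(latent, k):
    """Training-time masking: keep the first k channels, zero out the rest.

    `latent` is a list of C channels, each a list of floats.
    Assumption: dropped channels are replaced with zeros so the shape is unchanged.
    """
    return [
        list(channel) if index < k else [0.0] * len(channel)
        for index, channel in enumerate(latent)
    ]


def truncate(latent, k):
    """Inference-time compression: retain only the first k channel tokens."""
    return [list(channel) for channel in latent[:k]]


if __name__ == "__main__":
    # Hypothetical latent: 8 channels, 4 spatial positions per channel.
    latent = [[float(c)] * 4 for c in range(1, 9)]

    rng = random.Random(0)
    k = sample_keep_count(len(latent), rng=rng)
    masked = tail_drop(latent, k)
    print(f"kept {k} of {len(latent)} channels during training")
    print(f"non-zero channels: {sum(any(v != 0.0 for v in ch) for ch in masked)}")

    compressed = truncate(latent, 3)
    print(f"inference-time tokens: {len(compressed)}")
```

Because later channels are dropped more often than earlier ones under this scheme, the earliest channels are pushed to carry the information the decoder needs most, which is what makes keeping only the first $k$ channels a reasonable compression rule at inference.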
Project page: https://channeltok.github.io
Source: arXiv
Generated at: 2026-06-04 00:00:00 UTC






