HiTokSR: A Coarse-to-Fine Tokenizer with Hierarchical Codebooks for High-Fidelity Real-World Image Super-Resolution
Vector-quantized (VQ) generative models have shown strong potential for real-world image super-resolution (Real-ISR). However, existing approaches mostly rely on a single unified latent space, which entangles low-frequency structure with high-frequency texture. A single codebook must then cover a combinatorially large set of structure-texture combinations, which limits representational capacity and leads to inefficient codebook utilization.
To overcome these limitations, we introduce HiTokSR, a novel hierarchical token prediction framework. Instead of a single codebook, HiTokSR splits the latent space along the channel axis into distinct, frequency-aware groups and quantizes each group with its own independent sub-codebook. This coarse-to-fine design separates global structure from fine detail and increases combinatorial expressiveness, while avoiding the optimization instability often associated with high-dimensional nearest-neighbor lookups.
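The channel-wise grouping described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `grouped_quantize`, the equal-size contiguous channel split, the codebook sizes, and the random codebooks are all assumptions made for illustration, and the frequency-aware assignment of channels to groups is not modeled here.

```python
import numpy as np

def grouped_quantize(z, codebooks):
    """Quantize each channel group of z with its own sub-codebook.

    z:         (N, C) array of latent vectors.
    codebooks: list of G arrays, each of shape (K, C // G).
    Returns (indices of shape (N, G), quantized latents of shape (N, C)).
    Hypothetical helper: the equal contiguous split is an assumption.
    """
    groups = np.split(z, len(codebooks), axis=1)  # G chunks of shape (N, C // G)
    indices, quantized = [], []
    for chunk, codebook in zip(groups, codebooks):
        # Squared Euclidean distance from every chunk to every code: (N, K)
        dists = ((chunk[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
        idx = dists.argmin(axis=1)  # nearest code within this group only
        indices.append(idx)
        quantized.append(codebook[idx])
    return np.stack(indices, axis=1), np.concatenate(quantized, axis=1)

rng = np.random.default_rng(0)
C, G, K = 8, 2, 16  # channels, groups, codes per sub-codebook (illustrative values)
codebooks = [rng.normal(size=(K, C // G)) for _ in range(G)]
z = rng.normal(size=(5, C))

indices, z_q = grouped_quantize(z, codebooks)
print(indices.shape, z_q.shape)  # (5, 2) (5, 8)
print(K ** G)                    # 256 joint combinations from 2 * 16 stored codes
```

Each nearest-neighbor lookup runs in a space of dimension C / G rather than C, and G sub-codebooks of K entries can jointly express K^G distinct latent vectors while storing only G * K codes.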
To further improve semantic consistency, the generator incorporates priors from a vision foundation model through adaptive feature modulation, multi-scale class tokens, and a representation alignment loss. We also propose an index-level perturbation strategy, applied while fine-tuning the decoder, to reduce the train-test discrepancy in discrete token prediction. Extensive experiments on real-world benchmarks show that HiTokSR achieves state-of-the-art results in both reconstruction fidelity and perceptual quality.
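The abstract does not specify how the index-level perturbation is carried out. One plausible reading, sketched below purely as an assumption, is that during decoder fine-tuning a fraction of the ground-truth token indices is swapped for random codes, so the decoder sees the kind of index errors a token predictor would make at test time. The function name `perturb_indices`, the uniform replacement rule, and the probability `p` are hypothetical.

```python
import numpy as np

def perturb_indices(indices, codebook_size, p, rng):
    """Replace each token index with a uniformly random code with probability p.

    Hypothetical scheme: uniform replacement is an assumption, not the
    strategy confirmed by the source.
    """
    mask = rng.random(indices.shape) < p  # which positions to corrupt
    random_codes = rng.integers(0, codebook_size, size=indices.shape)
    return np.where(mask, random_codes, indices)

rng = np.random.default_rng(0)
clean = rng.integers(0, 16, size=(4, 6))  # stand-in for encoder-produced indices
noisy = perturb_indices(clean, codebook_size=16, p=0.1, rng=rng)

# Fraction of positions whose index actually changed; a random draw can
# coincide with the original code, so this is at most the masked fraction.
print((noisy != clean).mean())
```

In such a setup the decoder would be fine-tuned on latents looked up from `noisy` while still being supervised by the clean high-resolution target, which is one way to make it tolerant of imperfect predicted tokens.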
Source: arXiv





