arXiv

HyperDiT: Hyper-Connected Transformers for High-Fidelity Pixel-Space Diffusion

June 4, 2026 · Yu He, Lichen Ma, Zipeng Guo, Xinyuan Shan, Jingling Fu, Dong Chen, Junshi Huang, Yan Li

Abstract:

While pixel-space diffusion models circumvent the reconstruction limitations inherent to Variational Autoencoders (VAEs), they encounter a fundamental "granularity dilemma." This challenge arises because capturing global semantic structures typically requires large patch scales, whereas producing high-fidelity details necessitates fine-grained inputs. To overcome this obstacle, we introduce HyperDiT, a comprehensive framework that establishes Hyper-Connected Cross-Scale Interactions to bridge the gap between the semantic and pixel manifolds. In contrast to traditional methods that inject semantics via AdaLN, HyperDiT employs Cross-Attention mechanisms, allowing fine-grained tokens to query multi-level semantic anchors on a global scale. To address spatial mismatches inherent in multi-scale interactions, we propose the Scale-Aware Rotary Position Embedding (SA-RoPE), which ensures precise geometric alignment across tokens with different patch sizes. Additionally, we integrate Registers to extract dense semantics from a pretrained Visual Foundation Model (VFM), thereby significantly reducing generation hallucinations and visual artifacts. Our extensive experiments confirm that HyperDiT achieves a state-of-the-art (SoTA) Fréchet Inception Distance (FID) of $\mathbf{1.56}$ on the ImageNet $256\times256$ dataset, operating directly within the pixel space. By merging fine-grained processing with robust semantic guidance, HyperDiT presents a superior paradigm for high-fidelity pixel generation.
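The abstract names two mechanisms, cross-attention from fine-grained tokens to semantic anchors and a scale-aware rotary embedding, without giving formulas. The sketch below is only one plausible reading, not the authors' implementation. It assumes that "scale-aware" means placing every token at its patch centre in a shared pixel coordinate frame, so that tokens from different patch sizes are rotated by geometrically comparable positions. All function names, the patch sizes (4 and 16), the image size, and the channel width are hypothetical choices made for illustration.

```python
import numpy as np

def patch_centers(image_size, patch_size):
    """Pixel-space (y, x) centre of every patch on a square grid (assumed scheme)."""
    n = image_size // patch_size
    axis = (np.arange(n) + 0.5) * patch_size
    ys, xs = np.meshgrid(axis, axis, indexing="ij")
    return np.stack([ys.ravel(), xs.ravel()], axis=-1)  # shape (n*n, 2)

def rope_1d(x, pos, base=10000.0):
    """Standard rotary embedding along one axis; x is (N, d) with d even."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)           # (d/2,)
    ang = pos[:, None] * freqs[None, :]                 # (N, d/2)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

def rope_2d(x, centers):
    """Rotate the first half of the channels by y and the second half by x."""
    half = x.shape[-1] // 2
    return np.concatenate(
        [rope_1d(x[:, :half], centers[:, 0]),
         rope_1d(x[:, half:], centers[:, 1])],
        axis=-1,
    )

def cross_attention(q, k, v):
    """Single-head softmax attention; returns the output and the weights."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

# Toy setup: a 32x32 image, fine tokens with patch size 4, anchors with patch size 16.
rng = np.random.default_rng(0)
dim = 8
fine_pos = patch_centers(32, 4)      # 64 fine-grained tokens
anchor_pos = patch_centers(32, 16)   # 4 coarse semantic anchors

q = rope_2d(rng.standard_normal((len(fine_pos), dim)), fine_pos)
k = rope_2d(rng.standard_normal((len(anchor_pos), dim)), anchor_pos)
v = rng.standard_normal((len(anchor_pos), dim))

out, weights = cross_attention(q, k, v)
```

Under this assumption, each fine token's attention logits depend on its pixel-space offset from each anchor rather than on raw grid indices, which would otherwise be incomparable across patch sizes. The multi-level anchors, the Registers, and the VFM features described in the abstract are not modelled here.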

Source: arXiv
Generated at: 2026-06-04 00:00:00 UTC