arXiv

HyperDiT: Hyper-Connected Transformers for High-Fidelity Pixel-Space Diffusion

Abstract:

Pixel-space diffusion models avoid the reconstruction limitations inherent to Variational Autoencoders (VAEs), but they face a fundamental "granularity dilemma": capturing global semantic structure typically requires large patch scales, whereas producing high-fidelity detail requires fine-grained inputs. To resolve this, we introduce HyperDiT, a framework that establishes Hyper-Connected Cross-Scale Interactions to bridge the gap between the semantic and pixel manifolds. Rather than injecting semantics via AdaLN, as traditional methods do, HyperDiT uses Cross-Attention, allowing fine-grained tokens to query multi-level semantic anchors globally. To address the spatial mismatches inherent in multi-scale interactions, we propose Scale-Aware Rotary Position Embedding (SA-RoPE), which ensures precise geometric alignment across tokens with different patch sizes. Additionally, we integrate Registers to extract dense semantics from a pretrained Visual Foundation Model (VFM), significantly reducing generation hallucinations and visual artifacts. Extensive experiments show that HyperDiT, operating directly in pixel space, achieves a state-of-the-art (SoTA) Fréchet Inception Distance (FID) of $\mathbf{1.56}$ on ImageNet $256\times256$. By combining fine-grained processing with robust semantic guidance, HyperDiT presents a superior paradigm for high-fidelity pixel generation.
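The abstract does not give the SA-RoPE formulation, so the following is only a minimal sketch of one plausible reading of "geometric alignment across tokens with different patch sizes". It assumes each token's rotary position is the pixel-space center of its patch, so that coarse and fine tokens share one coordinate system. The function names (`patch_centers`, `rotate`, `dot`), the 1D simplification, and the frequency base are illustrative assumptions, not the authors' method.

```python
import math

def patch_centers(image_size, patch_size):
    """Pixel-space center of each patch along one axis (assumed convention)."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    return [(i + 0.5) * patch_size for i in range(image_size // patch_size)]

def rotate(vec, position, base=10000.0):
    """Apply a standard 1D rotary embedding to an even-length vector."""
    dim = len(vec)
    if dim % 2 != 0:
        raise ValueError("vector length must be even")
    out = []
    for i in range(0, dim, 2):
        # Pair i/2 rotates with frequency base^(-i/dim).
        theta = position * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    """Plain dot product, standing in for an unnormalized attention score."""
    return sum(x * y for x, y in zip(a, b))

# Two token grids over the same 256-pixel axis, with different patch sizes.
fine = patch_centers(256, 4)     # 64 fine-grained token positions
coarse = patch_centers(256, 16)  # 16 coarse "semantic anchor" positions

# A coarse patch center coincides with the mean center of the fine patches it covers.
covered_mean = sum(fine[:4]) / 4

# Hypothetical query (fine token) and key (coarse token) features.
q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

# The cross-scale score depends only on the pixel offset between the two tokens.
score = dot(rotate(q, fine[1]), rotate(k, coarse[0]))
shifted = dot(rotate(q, fine[1] + 100.0), rotate(k, coarse[0] + 100.0))

print(coarse[0], covered_mean)
print(math.isclose(score, shifted, abs_tol=1e-9))
```

Under this assumption, a fine token that queries a coarse anchor sees the same relative offset it would in pixel coordinates, regardless of each grid's token index. Indexing both grids by token number instead would place the 64-token and 16-token grids on incompatible scales.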


Source: arXiv
Generated at: 2026-06-04 00:00:00 UTC
