arXiv

Diffusion Image Generation with Explicit Modeling of Data Manifold Geometry

June 2, 2026 · Duoduo Xue, Zhiyu Zhu, Junhui Hou

Abstract

Generative models for images aim to draw samples from the intrinsic data manifold, which requires learning a dense, compact, and low-dimensional parameter space. To address this challenge, we introduce the Data Manifold-aware Image diffusioN moDel (MIND). This framework explicitly captures manifold geometry by embedding discrete patch tokenization directly into the score function of a continuous diffusion model. It thereby combines the structural quantification strengths of discrete tokens with the parallel generation versatility of continuous diffusion processes.

Our approach facilitates end-to-end differentiable training through a newly developed soft top-$k$ aggregation mechanism. Additionally, we incorporate dual-branch high-frequency feature embedding layers to mitigate the spectral bias typically exhibited by transformer backbones when processing low-dimensional inputs. For the inference phase, we have designed a multi-stage transition sampling scheme that dynamically modulates the sampling strategy according to the specific timestep.
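The abstract does not define the soft top-$k$ aggregation, so the following is only a minimal sketch of one common way to make a nearest-token lookup differentiable: keep the $k$ closest codebook entries and blend them with softmax weights instead of picking a single entry. The function name `soft_top_k_aggregate`, the squared-distance score, the `temperature` parameter, and the toy codebook are all assumptions made for illustration, not the authors' method.

```python
import math


def soft_top_k_aggregate(query, codebook, k=2, temperature=1.0):
    """Blend the k codebook entries nearest to `query` using softmax weights.

    Hypothetical illustration only: the scoring rule and temperature are
    assumptions, not details taken from the paper.
    """
    if not 1 <= k <= len(codebook):
        raise ValueError("k must be between 1 and the codebook size")
    if temperature <= 0:
        raise ValueError("temperature must be positive")

    # Score each entry by negative squared Euclidean distance (higher is closer).
    scores = [-sum((q - c) ** 2 for q, c in zip(query, entry)) for entry in codebook]

    # Indices of the k best-scoring entries.
    top = sorted(range(len(codebook)), key=lambda i: scores[i], reverse=True)[:k]

    # Numerically stable softmax restricted to the selected entries.
    best = max(scores[i] for i in top)
    exps = [math.exp((scores[i] - best) / temperature) for i in top]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Weighted sum of the selected entries, one coordinate at a time.
    blended = [
        sum(w * codebook[i][d] for w, i in zip(weights, top))
        for d in range(len(query))
    ]
    return blended, dict(zip(top, weights))


# Toy 2-D codebook with four entries (made-up values).
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
blended, weights = soft_top_k_aggregate([0.5, 0.0], codebook, k=2)
hard, _ = soft_top_k_aggregate([0.5, 0.0], codebook, k=1)
```

With `k=1` the sketch reduces to ordinary hard nearest-neighbour quantization. With larger `k` the output varies smoothly with the query as long as the selected set stays the same, which is the property that lets gradients pass through the lookup in an autodiff framework.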

We evaluated MIND extensively on ImageNet at a resolution of 256$\times$256. After 80 epochs of training, our base model attained a Fréchet Inception Distance (FID) of 22.73 in the unguided setting, nearly half the 43.47 FID recorded by the standard DiT-B/2 baseline. Relative to the baseline models, our method reduced FID by 15.95 on average over DiT and by 9.06 on average over SiT.
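As a quick check of the "nearly halves" comparison, the short sketch below computes the absolute and relative FID gap from the two unguided numbers quoted above. The helper name `relative_reduction` is hypothetical. The averaged reductions of 15.95 and 9.06 cannot be recomputed here, because the abstract does not list the per-configuration scores behind them.

```python
def relative_reduction(baseline, improved):
    """Fractional drop from `baseline` to `improved` (lower FID is better)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - improved) / baseline


dit_b2_fid = 43.47  # DiT-B/2, unguided, as quoted in the abstract
mind_b_fid = 22.73  # MIND base model after 80 epochs, as quoted in the abstract

absolute_gap = dit_b2_fid - mind_b_fid
print(f"absolute FID gap: {absolute_gap:.2f}")  # prints "absolute FID gap: 20.74"
print(f"relative reduction: {relative_reduction(dit_b2_fid, mind_b_fid):.1%}")  # prints "relative reduction: 47.7%"
```

A 47.7% relative reduction is consistent with the description of the base model nearly halving the baseline FID.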

In guided image generation on ImageNet at 256$\times$256, the proposed MIND-B, with only 130M parameters, achieved an FID of 2.06, outperforming the 3.1B-parameter LlamaGen-3B. Our larger MIND-XL variant, with 715M parameters, lowered the FID further to 1.95. MIND offers a new perspective on diffusion-based image synthesis and lays the groundwork for subsequent advances in the field. The associated code will be made publicly available.
