arXiv

ViewMask-1-to-3: Multi-View Consistent Image Generation via Multimodal Discrete Diffusion Models

June 4, 2026 · Ruishu Zhu, Zhihao Huang, Jiacheng Sun, Ping Luo, Hongyuan Zhang, Xuelong Li

Title: ViewMask-1-to-3: Achieving Multi-View Consistency in Image Generation through Multimodal Discrete Diffusion

Abstract:

Building on the demonstrated effectiveness of discrete diffusion in language-vision tasks, this study investigates its applicability to multi-view generation, a domain traditionally dominated by continuous methods. We present ViewMask-1-to-3, a framework that treats multi-view generation as a discrete sequence modeling problem. Each camera viewpoint is encoded as a sequence of visual tokens using MAGVIT-v2. Using discrete diffusion through masked token prediction, the model generates the views progressively by iteratively unmasking tokens, and it represents language and vision in a shared token space. Notably, simple random masking combined with self-attention naturally encourages cross-view consistency, without requiring specialized architectures or 3D geometric priors. Our method outperforms the baseline on both the GSO and 3D-FUTURE benchmarks, ranking first on average across standard image metrics. It also improves IoU by 10.6% over continuous diffusion models on the 3D-FUTURE dataset. The framework extends readily to text-to-image generation and multimodal understanding, which suggests its promise as a more unified paradigm for multimodal tasks.
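To make the idea of iterative token unmasking concrete, the following is a minimal sketch, not the authors' implementation. It assumes a generic confidence-ranked sampler with a cosine schedule (in the style of masked image generation models), which the abstract does not confirm. The names `MASK`, `toy_predictor`, and `iterative_unmask` are hypothetical, and the predictor is a random stand-in for the transformer rather than a trained model. The input view's tokens are kept fixed, and the target views start fully masked within one joint sequence.

```python
import math
import random

MASK = -1  # hypothetical sentinel id marking a masked position


def toy_predictor(tokens, vocab_size, rng):
    """Random stand-in for the model: one (token, confidence) guess per masked position."""
    return {i: (rng.randrange(vocab_size), rng.random())
            for i, t in enumerate(tokens) if t == MASK}


def iterative_unmask(input_view, num_target_views, tokens_per_view,
                     vocab_size=1024, steps=8, seed=0):
    """Fill the masked target-view tokens over several unmasking steps."""
    rng = random.Random(seed)
    total_masked = num_target_views * tokens_per_view
    # One joint sequence: known input-view tokens, then fully masked target views.
    tokens = list(input_view) + [MASK] * total_masked
    for step in range(1, steps + 1):
        guesses = toy_predictor(tokens, vocab_size, rng)
        if not guesses:
            break
        # Cosine schedule (an assumption): how many positions stay masked after this step.
        keep_masked = int(total_masked * math.cos(math.pi / 2 * step / steps))
        num_to_reveal = len(guesses) - keep_masked
        # Commit the most confident guesses first; the rest stay masked for later steps.
        ranked = sorted(guesses, key=lambda i: guesses[i][1], reverse=True)
        for i in ranked[:num_to_reveal]:
            tokens[i] = guesses[i][0]
    return tokens


# Usage: a 4-token input view and three 4-token target views (toy sizes).
input_view = [5, 17, 42, 8]
result = iterative_unmask(input_view, num_target_views=3, tokens_per_view=4)
```

In a real system the predictor would attend over all views at once, which is where the cross-view consistency described above would come from. In this sketch the guesses are random, so only the control flow is illustrative.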
