UniCanvas: A Diffusion-based Unified Model for Text-in-Image Joint Generation
Title: UniCanvas: A Unified Diffusion Model for Joint Text-in-Image Generation
Abstract:
Unified vision-language architectures have recently made significant strides in handling both multimodal comprehension and generation within a single framework. However, current approaches are split: autoregressive Vision-Language Models (VLMs) excel at cross-modal reasoning but fall short in producing high-fidelity images, whereas diffusion models render photorealistic visuals but struggle to generate coherent text. This split complicates the development of a single model that can handle both visual and textual outputs. Emerging research indicates that language can be successfully integrated into visual representations, enabling models to interpret textual semantics directly from image data.
To address this challenge, we introduce UniCanvas, a first attempt to unify diffusion models for generating interleaved multimodal content via text-in-image generation. Treating the shared pixel canvas as a "world model" for visual evolution lets diffusion models naturally capture transformations within this space. Rather than outputting discrete text tokens, our approach teaches the model to encode language as visual patterns embedded within images, using the inherent multimodal embedding space. This design lets the model "draw" text naturally onto a single pixel canvas during synthesis, enabling seamless multimodal generation. Our experimental results indicate that UniCanvas outperforms earlier unified models, establishing text-in-image generation with diffusion models as a promising paradigm for unified multimodal generation.
Source: arXiv
Generated at: 2026-06-04 00:00:00 UTC




