Imagine Before You Draw: Visual Prompt Engineering for Image Generation
Title: Prioritize Visualization: Leveraging Visual Prompt Engineering for Image Synthesis
Original Source: arXiv:2606.04457v1
Announcement Type: New
Abstract:
Inserting visual semantic representations as an intermediate stage before image synthesis reduces the difficulty of mapping text directly to images and improves output fidelity. Recent systems such as X-Omni and BLIP3o-Next follow this approach, but they rely on a two-stage external pipeline: a separate autoregressive model first generates semantic tokens, which then condition a separate diffusion decoder. Because of this separation, the decoder cannot access the original input and the semantic plan at the same time, which creates an information bottleneck that hurts detail preservation in downstream tasks such as image editing.
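The bottleneck can be illustrated with a deliberately simplified toy, not taken from the paper. It assumes a hypothetical semantic tokenizer (`toy_semantic_tokens`) modeled as coarse quantization, and it treats the semantic plan as derived from the source image only. All function names here are illustrative assumptions.

```python
from typing import List, Tuple


def toy_semantic_tokens(pixels: List[int], bucket: int = 64) -> Tuple[int, ...]:
    """Hypothetical stand-in for a semantic tokenizer.

    Coarse quantization keeps the rough structure and discards fine detail.
    """
    return tuple(p // bucket for p in pixels)


def external_decoder_view(instruction: str, source_pixels: List[int]) -> Tuple[int, ...]:
    """Two-stage pipeline (toy): the decoder is conditioned on semantic tokens only."""
    return toy_semantic_tokens(source_pixels)


def unified_model_view(instruction: str, source_pixels: List[int]) -> tuple:
    """Unified model (toy): the generator can attend to the instruction,
    the source image, and the semantic plan together."""
    return (instruction, tuple(source_pixels), toy_semantic_tokens(source_pixels))


# Two source images that differ in detail but share the same coarse structure.
img_a = [10, 70, 130, 200]
img_b = [50, 100, 180, 250]

# The external decoder cannot tell them apart, so it cannot preserve their details.
same_for_external = external_decoder_view("add a hat", img_a) == external_decoder_view("add a hat", img_b)
# The unified view still distinguishes them.
same_for_unified = unified_model_view("add a hat", img_a) == unified_model_view("add a hat", img_b)
print(same_for_external, same_for_unified)
```

In this toy, any detail the semantic tokens drop is unrecoverable for the external decoder, which is the kind of loss the abstract attributes to the two-stage design.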
In contrast, internal architectures such as Transfusion, BAGEL, and Show-o2 remove this bottleneck by allowing cross-modal interaction within a single unified model. However, without intermediate semantic guidance, these systems must still bridge the large gap between text and pixel generation directly. To address this, we introduce Visual Prompt Engineering (VPE), a method designed to integrate seamlessly into internal frameworks. The model first autoregressively generates visual semantic tokens (for example, SigLIP 2 tokens), which act as "visual prompts" that define the semantic structure. It then generates the full image tokens conditioned on this plan.
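The generation order described here can be sketched as a single sequence that holds the text, then the visual prompt, then the image tokens. This is a minimal sketch under stated assumptions, not the authors' implementation: the boundary markers `<sem>` and `<img>`, the `next_token` callback, and `toy_next_token` are all hypothetical, and in diffusion-based unified models the second phase would be a denoising process rather than a token loop.

```python
from typing import Callable, Dict, List

# Hypothetical boundary markers; the abstract does not specify a sequence format.
BOS_SEM, BOS_IMG = "<sem>", "<img>"


def generate_with_visual_prompt(
    next_token: Callable[[List[str], str], str],
    text_tokens: List[str],
    num_semantic: int,
    num_image: int,
) -> Dict[str, List[str]]:
    """Generate semantic tokens first, then image tokens, in one shared context."""
    seq = list(text_tokens) + [BOS_SEM]

    # Phase 1: the visual prompt (semantic plan), conditioned on the text.
    semantic: List[str] = []
    for _ in range(num_semantic):
        tok = next_token(seq, "semantic")
        seq.append(tok)
        semantic.append(tok)

    # Phase 2: image tokens, conditioned on both the text and the semantic plan,
    # because both are still present in the same sequence.
    seq.append(BOS_IMG)
    image: List[str] = []
    for _ in range(num_image):
        tok = next_token(seq, "image")
        seq.append(tok)
        image.append(tok)

    return {"semantic": semantic, "image": image, "sequence": seq}


def toy_next_token(seq: List[str], phase: str) -> str:
    """Deterministic placeholder for a real model's sampling step."""
    prefix = "s" if phase == "semantic" else "i"
    return f"{prefix}{len(seq) % 16}"


out = generate_with_visual_prompt(toy_next_token, ["a", "red", "bird"], num_semantic=4, num_image=8)
print(out["sequence"])
```

The point of the layout is that nothing is handed off between models: the image phase sees the original prompt and the semantic tokens together.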
We evaluate VPE on class-conditional generation, text-to-image synthesis, and image editing, across diverse token types and architectural designs. VPE accelerates convergence and raises the quality ceiling. Because it is integrated internally, it also delivers substantially better editing preservation than an external counterpart of the same parameter size (PSNR of 26.76 versus 19.92), while remaining competitive in editing responsiveness.
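For readers unfamiliar with the metric, PSNR is conventionally defined as 10 * log10(MAX^2 / MSE), reported in dB. The sketch below uses that standard definition. The abstract does not state the peak value, the image regions compared, or how scores were averaged, so the 8-bit peak of 255 and the helper name `psnr` are assumptions.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float], output: Sequence[float], max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized pixel sequences."""
    if len(reference) != len(output) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - o) ** 2 for r, o in zip(reference, output)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# A gap in dB corresponds to a multiplicative gap in mean squared error.
# This reading assumes both scores use the same peak value and comparable averaging.
gap_db = 26.76 - 19.92
mse_ratio = 10 ** (gap_db / 10)
print(round(gap_db, 2), round(mse_ratio, 2))
```

Under those assumptions, the reported gap of 6.84 dB corresponds to roughly 4.8 times lower mean squared error against the source image.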
Source: arXiv
Generated at: 2026-06-04 00:00:00 UTC




