arXiv

Imagine Before You Draw: Visual Prompt Engineering for Image Generation

Title: Prioritize Visualization: Leveraging Visual Prompt Engineering for Image Synthesis

Original Source: arXiv:2606.04457v1
Announcement Type: New

Abstract

Inserting visual semantic representations as an intermediate stage before image synthesis reduces the difficulty of mapping text directly to images and improves output fidelity. Recent systems such as X-Omni and BLIP3o-Next follow this approach, but they rely predominantly on a two-step external pipeline: a separate autoregressive model first generates semantic tokens, which then condition a separate diffusion decoder. Because of this separation, the decoder cannot access the original input and the semantic plan at the same time. The resulting information bottleneck hampers detail preservation in downstream applications such as image editing.

Conversely, internal architectures such as Transfusion, BAGEL, and Show-o2 remove this bottleneck by allowing cross-modal interaction within a single unified model. However, without intermediate semantic guidance, these systems still face the large gap between text and pixel generation. To address this, we introduce Visual Prompt Engineering (VPE), a method designed to integrate seamlessly into internal frameworks. With VPE, the model first autoregressively generates visual semantic tokens (for example, SigLIP 2 tokens), which act as "visual prompts" that define the semantic structure. The full image tokens are then generated conditioned on this plan.
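The difference between the two orderings can be sketched in a few lines of Python. This is a minimal illustration of the conditioning structure only, not the authors' implementation: the names `generate_span`, `vpe_generate`, `external_generate`, and `toy_next_token` are hypothetical, tokens are plain integers, and both stages are treated as token-by-token generation for simplicity (the external systems named above use a diffusion decoder for the second stage).

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical interface: a function that maps a context to the next token id.
NextToken = Callable[[Sequence[int]], int]


def generate_span(next_token: NextToken, context: Sequence[int], length: int) -> List[int]:
    """Autoregressively generate `length` tokens after `context`; return only the new tokens."""
    seq = list(context)
    for _ in range(length):
        seq.append(next_token(seq))
    return seq[len(context):]


def vpe_generate(
    next_token: NextToken, text_tokens: Sequence[int], n_semantic: int, n_image: int
) -> Tuple[List[int], List[int]]:
    """Internal ordering: one model produces the semantic plan, then the image tokens."""
    semantic = generate_span(next_token, text_tokens, n_semantic)
    # The image stage is conditioned on BOTH the original input and the plan.
    image = generate_span(next_token, list(text_tokens) + semantic, n_image)
    return semantic, image


def external_generate(
    planner: NextToken, decoder: NextToken, text_tokens: Sequence[int], n_semantic: int, n_image: int
) -> Tuple[List[int], List[int]]:
    """External ordering: a separate decoder receives only the semantic tokens."""
    semantic = generate_span(planner, text_tokens, n_semantic)
    # The original input is not visible here; this is the bottleneck described above.
    image = generate_span(decoder, semantic, n_image)
    return semantic, image


def toy_next_token(context: Sequence[int]) -> int:
    """Stand-in for a trained model: a deterministic function of the context."""
    return (sum(context) + len(context)) % 97


if __name__ == "__main__":
    plan, image = vpe_generate(toy_next_token, [3, 5, 7], n_semantic=4, n_image=6)
    print("semantic plan:", plan)
    print("image tokens:", image)
```

In an editing setting, the context passed to `vpe_generate` would also contain the source image tokens, so the image stage could attend to them directly; in `external_generate` they would have to survive the trip through the semantic tokens.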

We evaluated VPE on class-conditional generation, text-to-image synthesis, and image editing, covering a range of token types and architectural designs. Our results indicate that VPE speeds up convergence and raises the achievable quality ceiling. In addition, owing to its internal integration, it delivers substantially better editing preservation than external counterparts of the same parameter count (a PSNR of 26.76 versus 19.92), while maintaining competitive responsiveness on editing tasks.
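For readers unfamiliar with the preservation metric, the following is a minimal sketch of the standard PSNR definition, 10 * log10(MAX^2 / MSE). The abstract does not state the peak value, color space, or which image regions were compared, so the 8-bit peak of 255 and the flat pixel lists used here are assumptions for illustration.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float], edited: Sequence[float], max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized pixel sequences.

    `max_value` is the assumed peak pixel value (255 for 8-bit images).
    """
    if len(reference) != len(edited) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixels.
    mse = sum((r - e) ** 2 for r, e in zip(reference, edited)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


if __name__ == "__main__":
    source = [10.0, 20.0, 30.0, 40.0]
    edited = [11.0, 19.0, 31.0, 39.0]  # every pixel is off by one
    print(f"PSNR: {psnr(source, edited):.2f} dB")
```

Higher values mean the edited image deviates less from the source. If both reported figures are measured on the same scale, the gap of 26.76 - 19.92 = 6.84 dB corresponds to a mean squared error that is lower by a factor of roughly 10^0.684, or about 4.8.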


Source: arXiv
Generated at: 2026-06-04 00:00:00 UTC
