arXiv

Text-to-Image Models Need Less from Text Encoders Than You Think

Abstract

Text-to-image systems rely on text prompts as the main channel for conveying human intent. A text encoder transforms these prompts into embeddings that steer image synthesis. These embeddings capture more than individual token meanings: they also encode contextual information spanning the entire prompt, such as compositionality and attribute binding. How much of this richness image models actually use remains largely unexplored. This study asks: which components of the text representation are indispensable for generating images?

Our findings indicate that diffusion transformer-based text-to-image models typically depend on only two relatively simple aspects of text representations: (i) the merging of adjacent tokens into word-level representations for words that span multiple tokens, and (ii) word order, which is conveyed through the text encoder’s positional embeddings. To demonstrate this, we developed a novel text embedding format that captures only individual word meanings and their order, stripping away any contextual information about the prompt as a whole. This "bag of position-tagged words" is sufficient to guide image generation effectively, delivering visual quality and text fidelity comparable to generation guided by full text embeddings. This evidence challenges the prevailing assumption that text-to-image models make extensive use of the complex information in text representations beyond basic word meanings and order. Instead, it suggests that the interpretation of intricate linguistic structure is largely handled by the image generation model itself.
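To make the "bag of position-tagged words" idea concrete, here is a minimal toy sketch in plain Python. It is not the authors' implementation: the mean pooling over subword tokens, the additive position tag, and the names `mean_pool`, `position_tagged_words`, `token_table`, and `position_table` are all assumptions chosen for illustration, and the two-dimensional vectors are made up. The point it illustrates is only that each word's vector depends on that word and its position, never on the other words in the prompt.

```python
from typing import Dict, List

Vector = List[float]


def mean_pool(vectors: List[Vector]) -> Vector:
    """Average a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def position_tagged_words(
    words: List[List[str]],
    token_table: Dict[str, Vector],
    position_table: List[Vector],
) -> List[Vector]:
    """Embed each word in isolation, then tag it with its position.

    words: the prompt as a list of words, each given as its subword tokens.
    token_table: hypothetical context-free token embeddings (assumption).
    position_table: hypothetical positional embeddings, one per word slot.
    """
    tagged = []
    for index, subword_tokens in enumerate(words):
        # (i) Merge the tokens of one word into a single word-level vector.
        # Mean pooling is an assumption made for this sketch.
        word_vec = mean_pool([token_table[t] for t in subword_tokens])
        # (ii) Tag the word with its position; no other word is consulted.
        tagged.append([w + p for w, p in zip(word_vec, position_table[index])])
    return tagged


# Toy, made-up embedding tables.
token_table = {
    "red": [1.0, 0.0],
    "blue": [0.0, 3.0],
    "sky": [0.0, 1.0],
    "scraper": [2.0, 0.0],
}
position_table = [[0.0, 0.0], [10.0, 10.0]]

# "skyscraper" is assumed to split into two subword tokens.
red_prompt = position_tagged_words(
    [["red"], ["sky", "scraper"]], token_table, position_table
)
blue_prompt = position_tagged_words(
    [["blue"], ["sky", "scraper"]], token_table, position_table
)

# Swapping the first word leaves the second word's vector unchanged,
# because no cross-word context enters the representation.
print(red_prompt[1] == blue_prompt[1])
```

In a contextual text encoder, by contrast, changing "red" to "blue" would generally change the vector for "skyscraper" as well; that prompt-level context is what this representation removes.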

Project webpage: https://nsping13.github.io/contextless-TTI/


Source: arXiv Generated at: 2026-06-03 00:00:00 UTC
