Text-to-Image Models Need Less from Text Encoders Than You Think
Abstract
Text-to-image systems rely on text prompts as the main channel for conveying human intent. A text encoder transforms these prompts into embeddings that steer image synthesis. These embeddings capture more than individual token meanings: they also encode context across the entire prompt, such as compositionality and attribute binding. How much of this richness image models actually use, however, remains largely unexplored. This study asks: which components of the text representation are indispensable for generating images?
Our findings indicate that diffusion transformer-based text-to-image models typically depend on only two relatively simple aspects of text representations: (i) the merging of adjacent tokens into word-level representations for words that span multiple tokens, and (ii) word order, which is conveyed through the text encoder's positional embeddings. To demonstrate this, we construct a novel text embedding that captures only individual word meanings and their order, discarding all contextual information about the rest of the prompt. We find that this "bag of position-tagged words" is sufficient to guide image generation effectively, yielding visual quality and text fidelity comparable to generation guided by the full text embeddings. This evidence challenges the prevailing assumption that text-to-image models make extensive use of the rich information in text representations beyond basic word meanings and order. Instead, it suggests that the interpretation of complex linguistic structure is largely handled by the image generation model itself.
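To make the idea of a "bag of position-tagged words" concrete, the following is a minimal toy sketch, not the construction used in this work. Everything in it is an assumption made for illustration: the fixed-width sub-word splitter, the hash-based token vectors, the mean pooling of sub-word pieces, and the sinusoidal position tags all stand in for a real tokenizer and a real text encoder's learned embeddings. It shows only the two properties named above: pieces of a multi-token word are merged into one word-level vector, and each word vector depends on its position but not on the other words in the prompt.

```python
import hashlib
import math
from typing import List

DIM = 8  # toy embedding width (assumption, chosen only for readability)


def toy_subword_split(word: str, piece_len: int = 4) -> List[str]:
    """Hypothetical tokenizer: chop a word into fixed-size sub-word pieces."""
    return [word[i:i + piece_len] for i in range(0, len(word), piece_len)]


def toy_token_vector(token: str) -> List[float]:
    """Deterministic pseudo-embedding standing in for a learned token table."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    return [digest[i] / 255.0 - 0.5 for i in range(DIM)]


def word_vector(word: str) -> List[float]:
    """Merge a word's sub-word pieces into one vector by mean pooling.

    Only pieces of the same word are combined, so the result carries no
    information about any other word in the prompt.
    """
    vectors = [toy_token_vector(piece) for piece in toy_subword_split(word)]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def position_vector(index: int) -> List[float]:
    """Sinusoidal position tag (a stand-in for the encoder's positional embeddings)."""
    tag = []
    for i in range(DIM):
        frequency = 1.0 / (10000 ** ((2 * (i // 2)) / DIM))
        angle = index * frequency
        tag.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return tag


def bag_of_position_tagged_words(prompt: str) -> List[List[float]]:
    """Return one context-free vector per word, tagged with the word's index."""
    return [
        [w + p for w, p in zip(word_vector(word), position_vector(index))]
        for index, word in enumerate(prompt.split())
    ]


if __name__ == "__main__":
    first = bag_of_position_tagged_words("a red cube")
    second = bag_of_position_tagged_words("the red sphere")
    # "red" sits at index 1 in both prompts, so its vector is identical.
    print(first[1] == second[1])
```

In this sketch, changing the neighbours of a word leaves its vector untouched, while moving the word to a different index changes it. A contextual encoder would behave differently on the first point, since every output embedding could depend on every input token.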
Project webpage: https://nsping13.github.io/contextless-TTI/
Source: arXiv