On-the-fly Repulsion in the Contextual Space for Rich Diversity in Diffusion Transformers
Title: Enhancing Diversity in Diffusion Transformers via On-the-Fly Repulsion in Contextual Space
Abstract:
While contemporary Text-to-Image (T2I) diffusion models have attained impressive levels of semantic alignment, they frequently exhibit a pronounced lack of variety. These systems often converge on a limited subset of visual interpretations for any specific prompt, a phenomenon known as typicality bias. This limitation poses a significant hurdle for creative workflows that demand a broad spectrum of generative results. Current strategies for improving diversity face a fundamental trade-off: adjusting model inputs necessitates expensive optimization to integrate feedback from the generation process, whereas manipulating spatially-committed intermediate latents tends to distort the emerging visual structure and introduce artifacts.
To address these challenges, this study introduces a novel framework for generating rich diversity in Diffusion Transformers by applying repulsion within the Contextual Space. By intervening in multimodal attention channels, we implement on-the-fly repulsion during the transformer’s forward pass. This intervention is strategically inserted between blocks, precisely when the text conditioning is augmented with emergent image structures. Consequently, this approach enables the redirection of the guidance trajectory after it has become structurally informed but prior to the final composition being locked in.
Our findings indicate that repulsion in the Contextual Space yields substantially greater diversity without compromising visual fidelity or semantic accuracy. Moreover, the proposed method incurs only minimal computational overhead. It remains effective even in modern "Turbo" and distilled models, where conventional trajectory-based interventions often prove unsuccessful.
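
As a rough illustration of the idea described in the abstract, the sketch below applies a repulsion step to per-sample context vectors between the blocks of a toy forward pass. The abstract does not specify the repulsion rule, so everything here is an assumption: the inverse-distance push, the names `repel_context` and `forward_with_repulsion`, the `strength` and `repel_after` parameters, and the representation of each sample's contextual features as a flat list of floats. It is meant only to show where such an intervention could sit, not how the authors implement it.

```python
import math


def pairwise_mean_distance(batch):
    """Mean Euclidean distance over all unordered pairs of vectors."""
    total, count = 0.0, 0
    for i in range(len(batch)):
        for j in range(i + 1, len(batch)):
            total += math.dist(batch[i], batch[j])
            count += 1
    return total / count if count else 0.0


def repel_context(batch, strength=0.1, eps=1e-6):
    """Push each sample's context vector away from the others in the batch.

    Hypothetical rule (not from the paper): each vector moves along the sum of
    (x_i - x_j) / (||x_i - x_j||^2 + eps) over all other samples j.
    """
    repelled = []
    for i, xi in enumerate(batch):
        push = [0.0] * len(xi)
        for j, xj in enumerate(batch):
            if i == j:
                continue
            diff = [a - b for a, b in zip(xi, xj)]
            sq_dist = sum(d * d for d in diff) + eps
            for k, d in enumerate(diff):
                push[k] += d / sq_dist
        repelled.append([a + strength * p for a, p in zip(xi, push)])
    return repelled


def forward_with_repulsion(blocks, batch, repel_after, strength=0.1):
    """Run toy 'blocks' in order, applying repulsion after the chosen indices.

    `blocks` stands in for transformer blocks acting on one context vector;
    `repel_after` is a set of block indices after which repulsion is applied.
    """
    for idx, block in enumerate(blocks):
        batch = [block(x) for x in batch]
        if idx in repel_after:
            batch = repel_context(batch, strength)
    return batch


if __name__ == "__main__":
    contexts = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
    identity = lambda x: x  # placeholder for a real transformer block
    out = forward_with_repulsion([identity, identity], contexts, repel_after={0})
    print(pairwise_mean_distance(contexts), pairwise_mean_distance(out))
```

In this toy setting the repulsion runs inside the forward pass and needs no optimization loop, which mirrors the "on-the-fly" framing; a real Diffusion Transformer would operate on multimodal attention features rather than flat lists.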
Source: arXiv
Generated at: 2026-06-04 00:00:00 UTC




