Representation Forcing for Bottleneck-Free Unified Multimodal Models
Title: Eliminating Bottlenecks in Unified Multimodal Models via Representation Forcing
Abstract:
Unified Multimodal Models (UMMs) aim to integrate perception and generation within a single architecture. However, current implementations typically rely on a frozen, independently pretrained Variational Autoencoder (VAE) for image generation, which creates a structural bottleneck. Simply removing the VAE causes a significant drop in quality, because the model must then learn both high-level structure and low-level pixel details from scratch. To address this, we introduce Representation Forcing (RF), a method that makes representation prediction an inherent capability of the model. Specifically, RF compels the decoder to autoregressively generate visual representations as intermediate tokens before predicting pixels; these tokens remain in the context window and guide the pixel diffusion process within the same backbone. By converting representations from outputs of perception tasks into targets for generation, RF removes the need for an external generative latent space. Our findings indicate that RF improves both understanding and generation performance. For image generation, our pixel-space model with RF achieves parity with state-of-the-art VAE-based UMMs. For image understanding, the pixel-space RF approach generally surpasses its VAE-based counterpart. Together, these results are a significant step toward end-to-end, bottleneck-free Unified Multimodal Models.
Source: arXiv
Generated at: 2026-06-04 00:00:00 UTC
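
Illustrative sketch (not from the paper): the two-stage generation order described in the abstract can be pictured with the toy Python below. Representation tokens are first produced autoregressively and appended to the context, and the pixels are then denoised while attending to that same context. Every name here (predict_next_representation, denoise_step, generate) and every numeric rule is a hypothetical placeholder chosen for illustration. The abstract does not specify the architecture, the losses, or the sampler, so this shows only the control flow under those assumptions.

```python
import random


def predict_next_representation(context, dim, rng):
    """Hypothetical stand-in for the backbone's autoregressive head.

    Toy rule: the next representation token is the mean of all context
    vectors plus a little noise. A real model would use a transformer.
    """
    mean = [sum(tok[i] for tok in context) / len(context) for i in range(dim)]
    return [m + rng.gauss(0.0, 0.01) for m in mean]


def denoise_step(noisy_pixels, context, t, num_steps):
    """Hypothetical stand-in for one pixel-diffusion step in the same backbone.

    Toy rule: move every pixel toward a scalar target derived from the
    context, so the representation tokens visibly guide the pixels.
    """
    total = sum(sum(tok) for tok in context)
    count = sum(len(tok) for tok in context)
    target = total / count
    alpha = 1.0 / (num_steps - t)  # step size grows to 1.0 at the last step
    return [p + alpha * (target - p) for p in noisy_pixels]


def generate(prompt_tokens, num_rep_tokens=4, num_pixels=8, num_steps=5, seed=0):
    """Run the assumed two-stage order: representations first, then pixels."""
    rng = random.Random(seed)
    dim = len(prompt_tokens[0])
    context = list(prompt_tokens)

    # Stage 1: autoregressively emit representation tokens; each one stays
    # in the context window for everything generated after it.
    for _ in range(num_rep_tokens):
        context.append(predict_next_representation(context, dim, rng))

    # Stage 2: start from noise and denoise pixels conditioned on the full
    # context (prompt tokens plus the representation tokens just generated).
    pixels = [rng.gauss(0.0, 1.0) for _ in range(num_pixels)]
    for t in range(num_steps):
        pixels = denoise_step(pixels, context, t, num_steps)

    return context, pixels


if __name__ == "__main__":
    ctx, px = generate([[0.1, 0.2], [0.3, 0.4]])
    print(len(ctx), "context tokens;", len(px), "pixel values")
```

The point of the sketch is the ordering: no separate latent decoder appears anywhere, and the only conditioning signal for the pixel stage is the context that the same loop built up.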




