ChatUMM: Robust Context Tracking for Conversational Interleaved Generation
Title: ChatUMM: Enhancing Robust Context Tracking for Conversational Interleaved Generation
Abstract:
While Unified Multimodal Models (UMMs) have made significant strides, they are still largely limited to single-turn interactions, operating more like independent request solvers than partners in ongoing conversations. To address this limitation, we introduce ChatUMM, a conversational unified model designed for robust context tracking that supports continuous, interleaved multimodal generation.
ChatUMM rests on two main contributions. First, an interleaved multi-turn training strategy treats serialized text-image streams as one continuous conversational flow. Second, a systematic data synthesis pipeline converts standard single-turn datasets into natural dialogues in three phases: (1) establishing basic stateful interactions, (2) enforcing long-range dependency resolution by inserting "distractor" turns that require history-based query rewriting, and (3) generating naturally interleaved multimodal responses.
Comprehensive evaluations reveal that ChatUMM delivers state-of-the-art results among open-source unified models on benchmarks for visual understanding and instruction-guided editing, while also maintaining strong fidelity in text-to-image generation. Crucially, ChatUMM demonstrates enhanced robustness in complex multi-turn settings, facilitating fluid and context-aware dialogue.
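
The abstract does not give implementation details for the synthesis pipeline, so the following is only a minimal illustrative sketch of the general idea: wrapping single-turn samples into a multi-turn dialogue with distractor turns, then flattening it into one interleaved stream. All names here (`Turn`, `SingleTurnSample`, `build_dialogue`, `serialize`), the role and image tags, and the string image identifiers are hypothetical and are not taken from ChatUMM.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Turn:
    role: str                     # "user" or "assistant"
    text: str = ""
    image: Optional[str] = None   # hypothetical image identifier, not pixel data


@dataclass
class SingleTurnSample:
    instruction: str
    response_text: str = ""
    input_image: Optional[str] = None
    output_image: Optional[str] = None


def to_turns(sample: SingleTurnSample) -> List[Turn]:
    """Expand one single-turn sample into a user turn and an assistant turn."""
    return [
        Turn("user", sample.instruction, sample.input_image),
        Turn("assistant", sample.response_text, sample.output_image),
    ]


def build_dialogue(
    anchor: SingleTurnSample,
    distractors: List[SingleTurnSample],
    follow_up: SingleTurnSample,
) -> List[Turn]:
    """Assumed layout: an anchor exchange establishes state, unrelated
    distractor exchanges follow, and the follow-up refers back to the anchor,
    so it can only be resolved by consulting the history."""
    turns = to_turns(anchor)
    for sample in distractors:
        turns.extend(to_turns(sample))
    turns.extend(to_turns(follow_up))
    return turns


def serialize(turns: List[Turn]) -> str:
    """Flatten a dialogue into one interleaved text-image stream.
    The <role> and <image:...> tags are placeholders, not ChatUMM's format."""
    parts = []
    for turn in turns:
        parts.append(f"<{turn.role}>")
        if turn.image is not None:
            parts.append(f"<image:{turn.image}>")
        if turn.text:
            parts.append(turn.text)
    return " ".join(parts)
```

In this sketch the follow-up sample deliberately carries no `input_image`: the image it refers to appears only earlier in the stream, which is one way to picture a query that must be rewritten against the history before it can be answered.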
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC





