arXiv

ChatUMM: Robust Context Tracking for Conversational Interleaved Generation

June 2, 2026 · Wenxun Dai, Zhiyuan Zhao, Yule Zhong, Yiji Cheng, Jianwei Zhang, Linqing Wang, Shiyi Zhang, Yunlong Lin, Runze He, Fellix Song, Wayne Zhuang, Yong Liu, Haoji Zhang, Yansong Tang, Chunyu Wang

Title: ChatUMM: Enhancing Robust Context Tracking for Conversational Interleaved Generation

Abstract:

While Unified Multimodal Models (UMMs) have made significant strides, they are still largely limited to single-turn interactions, operating more like independent request solvers than partners in ongoing conversations. To address this limitation, we introduce ChatUMM, a conversational unified model designed for robust context tracking that supports continuous, interleaved multimodal generation.

ChatUMM’s performance is driven by two primary innovations. First, it employs an interleaved multi-turn training strategy that treats serialized text-image streams as one continuous conversational flow. Second, it uses a systematic pipeline for synthesizing conversational data. This pipeline converts standard single-turn datasets into natural dialogues in three phases: (1) establishing foundational stateful interactions; (2) resolving long-range dependencies by introducing "distractor" turns that require history-based query rewriting; and (3) generating naturally interleaved multimodal responses.
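To make the distractor-turn idea and the serialized text-image stream concrete, here is a minimal toy sketch in Python. It is not the authors' pipeline: the `Turn` type, the function names, the `<user>`/`<assistant>`/`<image:...>` placeholder tags, and the rule of resolving a reference to the earliest image are all assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Turn:
    role: str                    # "user" or "assistant"
    text: str
    image: Optional[str] = None  # hypothetical image identifier, not pixel data


def build_dialogue(anchor: List[Turn], distractors: List[Turn], follow_up: Turn) -> List[Turn]:
    """Place unrelated turns between an anchor exchange and a follow-up that refers back to it."""
    return anchor + distractors + [follow_up]


def rewrite_query(dialogue: List[Turn]) -> str:
    """Toy query rewriting (assumed rule): tie the last turn to the earliest image in history."""
    follow_up = dialogue[-1]
    first_image = next(t.image for t in dialogue if t.image is not None)
    return f"{follow_up.text} [refers to {first_image}]"


def serialize(dialogue: List[Turn]) -> str:
    """Flatten turns into one interleaved text-image stream using placeholder tags."""
    parts = []
    for t in dialogue:
        parts.append(f"<{t.role}> {t.text}")
        if t.image is not None:
            parts.append(f"<image:{t.image}>")
    return " ".join(parts)


# Hypothetical example: one stateful exchange, one distractor exchange, one follow-up.
dialogue = build_dialogue(
    anchor=[
        Turn("user", "Draw a red bicycle."),
        Turn("assistant", "Here it is.", image="img_0"),
    ],
    distractors=[
        Turn("user", "What is the capital of France?"),
        Turn("assistant", "Paris."),
    ],
    follow_up=Turn("user", "Make the first image blue."),
)

print(rewrite_query(dialogue))
print(serialize(dialogue))
```

The follow-up turn can only be resolved by looking past the distractor exchange, which is the kind of long-range dependency the second phase is described as targeting. A real system would replace the string placeholders with image tokens or embeddings and would learn the rewriting rather than apply a fixed rule.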

Comprehensive evaluations reveal that ChatUMM delivers state-of-the-art results among open-source unified models on benchmarks for visual understanding and instruction-guided editing, while also maintaining strong fidelity in text-to-image generation. Crucially, ChatUMM demonstrates enhanced robustness in complex multi-turn settings, facilitating fluid and context-aware dialogue.

Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC