Best of Both Worlds: Multimodal Reasoning and Generation via Unified Discrete Flow Matching
Title: Combining Strengths: Unified Discrete Flow Matching for Multimodal Reasoning and Generation
Abstract: We introduce UniDFlow, a unified discrete flow-matching framework for multimodal understanding, generation, and editing. The framework decouples these tasks with task-specific low-rank adapters, which avoids objective conflicts and representation entanglement. We also present a novel reference-based multimodal preference alignment technique that improves faithfulness and controllability by optimizing relative outcomes under consistent conditioning, without the need for extensive retraining. UniDFlow achieves state-of-the-art results across eight benchmarks. It also shows robust zero-shot generalization to in-context image generation, inpainting, reference-based editing, and compositional generation, despite receiving no explicit training for these tasks.
Source: arXiv
Generated at: 2026-06-03 00:00:00 UTC





