MT-EditFlow: Reinforcement Learning for Multi-Turn Image Editing with Flow Matching
Title: MT-EditFlow: Leveraging Reinforcement Learning for Multi-Turn Image Editing via Flow Matching
Abstract:
Instruction-guided image editing has recently attracted substantial interest, and current models can meet the practical demands of everyday users. Despite this progress, systems optimized for single-turn edits frequently falter in multi-turn scenarios, the standard interactive mode in which users progressively refine an image based on prior model outputs. This degradation stems from two primary issues: the "all-or-nothing" constraint, in which a single unsuccessful step derails the entire sequence, and error propagation, where exposure bias causes editing mistakes to accumulate.
To overcome these hurdles, we present MT-EditFlow, a reinforcement learning framework grounded in flow matching that optimizes reward signals specifically for sequential image manipulation. By combining a multi-turn viewpoint with a multi-reward structure, MT-EditFlow offers a cohesive architecture compatible with both GRPO and NFT-based reinforcement learning techniques. We conduct a thorough analysis and optimization of the reward mechanism, exploring efficient scoring methods for turn-level aggregation, varying VLM reasoning approaches to balance reward bias against variance, and different levels of advantage fusion to mitigate reward hacking.
Our results indicate that propagating the aggregated advantage across the full editing trajectory aligns local planning with global, multi-turn objectives. Comprehensive experiments confirm that MT-EditFlow yields substantial performance gains across a variety of base models. Specifically, it improves the turn-3 overall performance of FLUX.1-Kontext-dev by 6.85 points, outperforming leading open-source solutions such as Qwen-Image-Edit. By sustaining high marginal success rates and reducing exposure bias, MT-EditFlow establishes a robust basis for more dependable and intuitive human-AI cooperation in visual content generation.
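As an illustration only, the idea of fusing several rewards into one trajectory-level advantage and sharing it across all turns might be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the mean-over-turns aggregation, the equal fusion weights, and the GRPO-style group normalization (subtract the group mean, divide by the group standard deviation) are all hypothetical choices made for the example.

```python
from statistics import mean, pstdev


def group_normalize(values, eps=1e-6):
    """GRPO-style normalization within one sampled group (assumed form)."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def trajectory_advantages(rewards, weights=None):
    """Compute one fused advantage per trajectory and share it across turns.

    rewards[i][t][k] is the k-th reward at turn t of sampled trajectory i.
    Returns adv[i][t], where every turn of trajectory i gets the same value.
    """
    n_rewards = len(rewards[0][0])
    # Hypothetical choice: equal weight for every reward type.
    weights = weights or [1.0 / n_rewards] * n_rewards

    # Turn-level aggregation (assumed here to be a plain mean over turns).
    per_reward = [
        [mean(turn[k] for turn in traj) for traj in rewards]
        for k in range(n_rewards)
    ]

    # Normalize each reward within the group first, then fuse the
    # normalized values, so no single reward scale dominates.
    normalized = [group_normalize(vals) for vals in per_reward]
    fused = [
        sum(w * normalized[k][i] for k, w in enumerate(weights))
        for i in range(len(rewards))
    ]

    # Broadcast the trajectory-level advantage to every turn.
    return [[fused[i]] * len(traj) for i, traj in enumerate(rewards)]


# Two sampled 3-turn trajectories, each scored with two rewards per turn.
group = [
    [[0.9, 0.8], [0.8, 0.7], [0.9, 0.9]],  # stays on track
    [[0.9, 0.8], [0.2, 0.3], [0.1, 0.2]],  # derails after turn 1
]
advantages = trajectory_advantages(group)
```

In this toy group, the trajectory that derails after its first turn receives a negative advantage on every turn, including the first one that scored well in isolation, which is the sense in which a shared advantage ties each local step to the multi-turn outcome.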
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC





