arXiv

MT-EditFlow: Reinforcement Learning for Multi-Turn Image Editing with Flow Matching

Title: MT-EditFlow: Leveraging Reinforcement Learning for Multi-Turn Image Editing via Flow Matching

Abstract:

Instruction-guided image editing has attracted substantial interest, and current models can now meet the practical needs of everyday users. However, systems optimized for single-turn edits often falter in multi-turn scenarios, the standard interactive mode in which users progressively refine an image based on prior model outputs. This degradation stems from two issues: the "all-or-nothing" constraint, in which a single failed step derails the entire sequence, and error propagation, in which exposure bias causes editing mistakes to accumulate.

To address these issues, we present MT-EditFlow, a reinforcement learning framework built on flow matching that optimizes reward signals specifically for sequential image editing. By combining a multi-turn perspective with a multi-reward structure, MT-EditFlow offers a unified framework compatible with both GRPO- and NFT-based reinforcement learning methods. We analyze and optimize the reward mechanism along three axes: efficient scoring methods for turn-level aggregation, VLM reasoning strategies that trade off reward bias against variance, and the level at which advantages are fused to mitigate reward hacking.

Our results indicate that propagating the aggregated advantage across the full editing trajectory aligns local planning with global, multi-turn objectives. Experiments confirm that MT-EditFlow yields substantial performance gains across a variety of base models. In particular, it improves the turn-3 overall performance of FLUX.1-Kontext-dev by 6.85 points, outperforming leading open-source models such as Qwen-Image-Edit. By sustaining high marginal success rates and reducing exposure bias, MT-EditFlow provides a robust foundation for more dependable and intuitive human-AI collaboration in visual content generation.
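
As a rough illustration of sharing an aggregated advantage across an editing trajectory, the sketch below computes group-normalized (GRPO-style) advantages for several rewards, fuses them at the advantage level, averages over turns, and assigns the result to every turn. It is a minimal sketch under stated assumptions, not the paper's implementation: the function names, the mean aggregation over turns, the uniform reward weights, and the example reward values are all hypothetical.

```python
from statistics import mean, pstdev


def group_normalize(values, eps=1e-6):
    """Subtract the group mean and divide by the group std (GRPO-style)."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def trajectory_advantages(rewards, weights=None):
    """rewards[g][t][k] is reward k at turn t of sampled trajectory g.

    Returns adv[g][t], where all turns of a trajectory share one value.
    The weighting and mean aggregation are assumptions for illustration.
    """
    num_turns = len(rewards[0])
    num_rewards = len(rewards[0][0])
    weights = weights or [1.0 / num_rewards] * num_rewards

    # Step 1: normalize each reward across the group at each turn,
    # then fuse the per-reward advantages with a weighted sum.
    fused = [[0.0] * num_turns for _ in rewards]
    for t in range(num_turns):
        for k in range(num_rewards):
            column = [traj[t][k] for traj in rewards]
            for g, a in enumerate(group_normalize(column)):
                fused[g][t] += weights[k] * a

    # Step 2: aggregate over turns and broadcast to the whole trajectory.
    return [[mean(turns)] * num_turns for turns in fused]


# Hypothetical scores: 2 sampled trajectories, 3 turns, 2 rewards per turn.
example_rewards = [
    [[0.9, 0.8], [0.8, 0.7], [0.7, 0.9]],
    [[0.5, 0.6], [0.4, 0.5], [0.3, 0.2]],
]
adv = trajectory_advantages(example_rewards)
```

Because every turn in a trajectory receives the same value, an early edit is credited or penalized according to how the whole sequence turned out rather than only its own turn-level score.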


Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC
