OctoT2I: A Self-Evolving Agentic Text-to-Image Router
Abstract
As the landscape of Text-to-Image (T2I) models expands—ranging from massive architectures to streamlined, real-time variants—the industry is encountering diminishing returns from scaling individual models. To overcome this stagnation, agentic T2I approaches have emerged, leveraging multiple models to enhance output. However, current agentic solutions are hindered by three primary limitations: their dependence on costly handcrafted priors or human annotations, inflexible single-path decision-making processes, and a general disregard for inference efficiency.
In response to these issues, we present OctoT2I, an innovative agentic framework that redefines the T2I task as a joint optimization problem focusing on both generation quality and inference speed. OctoT2I utilizes a stateful, multi-round routing strategy that dynamically selects the most appropriate tool by leveraging its internal knowledge and memory. This adaptive selection is powered by a knowledge base constructed entirely through our novel Self-Evolving Mechanism, which operates without human supervision.
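As a rough illustration of stateful, multi-round routing under a joint quality and speed objective, consider the minimal sketch below. It is not OctoT2I's implementation: the `Tool` and `RouterState` classes, the linear trade-off weight `lam`, the acceptance `threshold`, and the caller-supplied `judge` function are all hypothetical names and assumptions introduced here for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class Tool:
    name: str
    est_quality: dict   # hypothetical knowledge base: prompt tag -> expected quality in [0, 1]
    latency_s: float    # hypothetical per-call latency in seconds


@dataclass
class RouterState:
    tried: list = field(default_factory=list)  # memory of tools already used for this prompt


def pick_tool(tools, tag, state, lam=0.05):
    """Pick the untried tool with the best (expected quality - lam * latency)."""
    candidates = [t for t in tools if t.name not in state.tried]
    if not candidates:
        return None
    return max(candidates, key=lambda t: t.est_quality.get(tag, 0.0) - lam * t.latency_s)


def route(tools, tag, judge, threshold=0.8, max_rounds=3):
    """Try tools over several rounds until the judged score reaches the threshold."""
    state = RouterState()
    best_name, best_score = None, -1.0
    for _ in range(max_rounds):
        tool = pick_tool(tools, tag, state)
        if tool is None:
            break  # every tool has been tried
        state.tried.append(tool.name)
        score = judge(tool, tag)  # assumed evaluator supplied by the caller
        if score > best_score:
            best_name, best_score = tool.name, score
        if score >= threshold:
            break  # good enough: stop early to save inference time
    return best_name, state.tried
```

In this sketch the router tries a cheap tool first when its expected quality is close to that of a slower one, and escalates only when the judged result falls short of the threshold.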
This mechanism first autonomously establishes foundational Conceptual Dimensions, such as style, color, and count. It then intelligently explores the combinations of these dimensions through an iterative "Propose-Solve-Evaluate-Learn" (PSEL) loop. By efficiently mapping the capability boundaries of each tool, the PSEL loop drives continuous improvement without the need for external guidance.
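The four stages of the loop can be pictured with the toy sketch below. It is an assumption-laden simplification, not the authors' mechanism: the `DIMENSIONS` values, the `propose` and `psel` functions, the learning rate `lr`, and the caller-supplied `solve` and `evaluate` callables are hypothetical, and the sketch enumerates every combination exhaustively rather than exploring them selectively as the abstract describes.

```python
import itertools

# Hypothetical Conceptual Dimensions with two example values each.
DIMENSIONS = {"style": ["sketch", "photo"], "count": ["one", "three"]}


def propose(dimensions):
    """Propose: yield every combination of dimension values as a task."""
    keys = sorted(dimensions)
    for values in itertools.product(*(dimensions[k] for k in keys)):
        yield tuple(zip(keys, values))


def psel(dimensions, tools, solve, evaluate, rounds=1, lr=0.5):
    """Run the loop and return a knowledge base: (tool, task) -> score estimate."""
    kb = {}
    for _ in range(rounds):
        for task in propose(dimensions):           # Propose
            for tool in tools:
                output = solve(tool, task)         # Solve: assumed generation call
                score = evaluate(task, output)     # Evaluate: assumed automatic scorer
                old = kb.get((tool, task), score)  # first visit starts at the observed score
                kb[(tool, task)] = old + lr * (score - old)  # Learn: running estimate
    return kb
```

A router could then look up `kb[(tool, task)]` to estimate which tool handles a given combination of dimensions, which is one way to read "mapping the capability boundaries of each tool".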
Extensive experiments confirm that OctoT2I strikes an exceptional balance between performance and efficiency. It achieves a competitive score of 0.96 on GenEval while delivering a 90.3% increase in inference speed and a 56.6% gain in energy efficiency compared to the leading baseline, Flow-GRPO. The associated code and models will be made publicly available.
Source: arXiv Generated at: 2026-06-02 00:00:00 UTC




