TextAlign: Preference Alignment for Text Rendering with Hierarchical Rewards
Title: TextAlign: Aligning Text Rendering Preferences Through Hierarchical Rewards
Abstract: Large text-to-image generative models still struggle to render text accurately, a task that requires strict adherence to semantic instructions together with precise, fine-grained control over glyph structure. Previous approaches add dedicated architectural modules or modify encoders, which often complicates deployment across foundation models. We instead reframe text rendering as a post-training preference alignment problem. We introduce TextAlign, a non-invasive framework that improves text rendering without altering the underlying generator architecture. Central to our method is a hierarchical reward system built on a vision-language model (VLM). It decomposes rendering errors into global, word-level, and glyph-level components and converts binary defect assessments into a scalar preference signal that is compatible with both Group Relative Policy Optimization (GRPO) and Direct Preference Optimization (DPO). Evaluations on FLUX.1-dev and Z-Image-Turbo show consistent improvements in OCR-based text accuracy while maintaining high general generation quality. Benchmarked against strong foundation models and specialized text-rendering baselines such as SD3.5, Qwen-Image, AnyText, and TextDiffuser, our findings suggest that careful reward design offers a scalable way to improve text rendering and a viable alternative to extensive model redesign.
Source: arXiv
Generated at: 2026-06-03 00:00:00 UTC
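
To make the reward idea in the abstract concrete, here is a minimal sketch of how binary defect flags at the global, word, and glyph levels might be collapsed into one scalar, and how that scalar could feed either a GRPO-style group-relative advantage or a DPO-style chosen/rejected pair. Everything specific here is an assumption for illustration and is not taken from the paper: the names `DefectReport`, `scalar_reward`, `group_relative_advantages`, and `best_worst_pair`, the weights in `LEVEL_WEIGHTS`, the weighted pass-rate aggregation, and the best-versus-worst pairing rule. The VLM that would produce the flags is not modeled.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import List, Sequence, Tuple

# Hypothetical level weights; the abstract does not say how levels are combined.
LEVEL_WEIGHTS = {"global": 0.5, "word": 0.3, "glyph": 0.2}


@dataclass
class DefectReport:
    """Binary defect flags per level (True means a defect was found).

    In practice these would come from yes/no judgments of a VLM; here they
    are plain booleans supplied by the caller.
    """
    global_defects: Sequence[bool]
    word_defects: Sequence[bool]
    glyph_defects: Sequence[bool]


def _pass_rate(flags: Sequence[bool]) -> float:
    # Fraction of checks without a defect; an empty level counts as passing.
    if not flags:
        return 1.0
    return 1.0 - sum(flags) / len(flags)


def scalar_reward(report: DefectReport) -> float:
    """Weighted sum of per-level pass rates, in [0, 1] (assumed aggregation)."""
    return (
        LEVEL_WEIGHTS["global"] * _pass_rate(report.global_defects)
        + LEVEL_WEIGHTS["word"] * _pass_rate(report.word_defects)
        + LEVEL_WEIGHTS["glyph"] * _pass_rate(report.glyph_defects)
    )


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantage: standardize rewards within one prompt's sample group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; eps guards against a zero spread
    return [(r - mu) / (sigma + eps) for r in rewards]


def best_worst_pair(rewards: Sequence[float]) -> Tuple[int, int]:
    """DPO-style pair: indices of the (chosen, rejected) samples by reward."""
    indices = range(len(rewards))
    chosen = max(indices, key=lambda i: rewards[i])
    rejected = min(indices, key=lambda i: rewards[i])
    return chosen, rejected


if __name__ == "__main__":
    # Three hypothetical samples generated for the same prompt.
    reports = [
        DefectReport([False], [False, True], [False, False, True, True]),
        DefectReport([False], [False, False], [False, False, False, False]),
        DefectReport([True], [True, True], [True, False, True, True]),
    ]
    rewards = [scalar_reward(r) for r in reports]
    print("rewards:", rewards)
    print("advantages:", group_relative_advantages(rewards))
    print("chosen, rejected:", best_worst_pair(rewards))
```

The same scalar serves both optimizers in this sketch: GRPO consumes the standardized values directly, while DPO only needs the ordering to pick a preferred and a dispreferred sample. Other aggregation rules (for example, gating lower levels on a global pass) would fit the same interface.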





