arXiv

Evaluating Reasoning Fidelity in Visual Text Generation

June 4, 2026 · Jiajun Hong, Jiawei Zhou

Title: Assessing Reasoning Accuracy in Visual Text Synthesis

Abstract:

Contemporary text-to-image (T2I) models can produce highly legible, structurally sound text within images, which supports uses such as document and slide creation. It remains unclear, however, whether these systems genuinely preserve reasoning capabilities when complex solutions are conveyed directly as rendered text, or whether they simply replicate superficial patterns. To address this, we examine reasoning fidelity in visual text generation, a setting in which models must depict entire reasoning processes as images. Our assessment covers long-form text rendering, factual knowledge testing, context comprehension, and multi-step logical deduction. Across these scenarios, we observe that existing T2I models often commit semantic mistakes, exhibit logical contradictions, and generate flawed intermediate steps, even when the output text is visually crisp. These shortcomings stand in stark contrast to the robust reasoning skills that text-only models display on identical tasks. Our results therefore highlight a significant gap between visual text generation and procedural reasoning, underscoring the need for more dependable visual text reasoning systems.
