arXiv

WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation

June 3, 2026 · Yuwei Niu, Munan Ning, Mengren Zheng, Weiyang Jin, Bin Lin, Peng Jin, Jiaqi Liao, Chaoran Feng, Fanqing Meng, Kunpeng Ning, Bin Zhu, Li Yuan

Abstract:

Text-to-Image (T2I) models can generate high-quality artistic and visual content, but existing evaluation standards focus mainly on image realism and shallow text-image alignment. They do not adequately measure how well a model understands complex semantics or integrates world knowledge. To address this gap, we propose WISE, the first benchmark designed specifically for World Knowledge-Informed Semantic Evaluation.

WISE goes beyond simple word-to-pixel mapping by challenging models with 1,000 carefully crafted prompts across 25 subdomains, which cover cultural common sense, spatio-temporal reasoning, and natural science. To overcome the limitations of conventional CLIP-based metrics, we also introduce WiScore, a new quantitative metric for assessing how well generated images align with the underlying knowledge.

We evaluate 20 models (10 specialized T2I systems and 10 unified multimodal architectures) on the full set of 1,000 prompts. The results reveal substantial limitations in the models' ability to integrate and apply world knowledge during image generation, and they point to key directions for improving knowledge integration in future T2I models. Code and data are available at https://github.com/PKU-YuanGroup/WISE.
