arXiv

Prototypicality Bias Reveals Blindspots in Multimodal Evaluation Metrics

June 2, 2026 · Subhadeep Roy, Gagan Bhatia, Steffen Eger


Abstract:

Automatic metrics have become the standard for assessing text-to-image (T2I) models, frequently substituting for human assessment in tasks such as benchmarking, model selection, and large-scale data filtering. However, these automated systems often prioritize images that appear plausible or align with common stereotypes over those that accurately adhere to the specific prompt. This study identifies "prototypicality bias" as a critical oversight in multimodal evaluation: metrics tend to favor semantically inaccurate images that are visually or socially typical, even when a semantically correct but less conventional image is available.

To address this, we present PROTOBIAS, a controlled diagnostic benchmark spanning Animals, Objects, and Demography. It pairs each semantically accurate image with a "prototypical adversary": an image that is visually plausible but contains a single, controlled semantic violation. Grounded in prototype theory and social-category prototypicality, PROTOBIAS uses multiple prompt and image generators alongside independent vision-language model (VLM) filters. Its validity is confirmed through rigorous controls for prompt quality, human annotation, and image fidelity.

Our analysis using PROTOBIAS demonstrates that prevalent evaluation methods, including embedding scores, reward models, VQA-based metrics, and VLM-as-judge systems, frequently struggle to distinguish the semantically correct image from its prototypical adversary. Human judgments, by comparison, remain significantly more aligned with semantic correctness. Additionally, we propose PROTOSCORE, a lightweight evaluator trained via contrastive learning, as an initial strategy to mitigate this bias. PROTOBIAS serves as a targeted benchmark for quantifying metric failures driven by prototypicality and for fostering the development of T2I evaluators that are more faithful to semantic intent.
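The comparison described above can be read as a pairwise ranking test: for each prompt, a metric passes if it scores the semantically correct image above the prototypical adversary. The sketch below is a minimal illustration of that reading, not the authors' evaluation code. The `ContrastPair` schema, the function names, the file names, and the toy scores are all invented for this example; a real run would replace `toy_metric` with an actual scorer.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class ContrastPair:
    """One item: a prompt plus two candidate images (hypothetical schema)."""
    prompt: str
    correct_image: str    # semantically correct, possibly atypical
    adversary_image: str  # prototypical, but violates the prompt


# A metric maps (prompt, image) to a score; higher means a better match.
Metric = Callable[[str, str], float]


def pairwise_accuracy(metric: Metric, pairs: Sequence[ContrastPair]) -> float:
    """Fraction of pairs where the correct image scores strictly higher."""
    if not pairs:
        raise ValueError("pairs must be non-empty")
    wins = sum(
        metric(p.prompt, p.correct_image) > metric(p.prompt, p.adversary_image)
        for p in pairs
    )
    return wins / len(pairs)


def mean_margin(metric: Metric, pairs: Sequence[ContrastPair]) -> float:
    """Average of score(correct) - score(adversary); negative means bias."""
    if not pairs:
        raise ValueError("pairs must be non-empty")
    total = sum(
        metric(p.prompt, p.correct_image) - metric(p.prompt, p.adversary_image)
        for p in pairs
    )
    return total / len(pairs)


# Invented toy data: scores are made up purely to exercise the functions.
toy_pairs = [
    ContrastPair("a green flamingo", "flamingo_green.png", "flamingo_pink.png"),
    ContrastPair("a square wall clock", "clock_square.png", "clock_round.png"),
]
toy_scores = {
    "flamingo_green.png": 0.21,
    "flamingo_pink.png": 0.34,  # the typical image wins: a bias failure
    "clock_square.png": 0.30,
    "clock_round.png": 0.25,
}


def toy_metric(prompt: str, image: str) -> float:
    """Stand-in scorer that ignores the prompt and looks up a fixed score."""
    return toy_scores[image]


if __name__ == "__main__":
    print(pairwise_accuracy(toy_metric, toy_pairs))  # 0.5
    print(mean_margin(toy_metric, toy_pairs))
```

Under this reading, a pairwise accuracy near 0.5 means the metric does no better than chance at preferring the correct image, and a negative mean margin means it leans toward the prototypical adversary on average.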
