arXiv

A Structured Benchmark for Text-Guided Anomaly Detection: When Language Stops Conditioning the Decision

June 2, 2026 · Stefano Samele, Eugenio Lomurno, Teodora Jovanovic, Sanjay Shivakumar Manohar, Alberto Crivellaro, Matteo Matteucci

Title: A Standardized Framework for Text-Guided Anomaly Detection: Assessing the Actual Influence of Language on Decision-Making

Abstract:

Traditionally, industrial anomaly detection has operated as a unimodal discipline. Although recent advancements in multimodal vision-language models have introduced systems capable of processing text alongside images—promising text-guided zero- and few-shot inspection capabilities—these systems are typically assessed using evaluation protocols derived from unimodal benchmarks. These standard protocols keep the textual prompt constant, thereby failing to determine whether language genuinely influences the model's decisions. Consequently, it remains unclear whether observed performance improvements stem from effective text guidance or merely from robust pretrained visual features.

To address this gap, we present Text-Guided Anomaly Detection (TGAD), a structured benchmark designed to incrementally elevate the functional importance of language across three distinct scenarios. First, we establish a controlled prompt-sensitivity environment using the MVTec AD dataset. Second, we introduce a modified version of MVTec AD featuring component tags, which mandates that the model limit its evaluation to a specifically instructed part. Third, we unveil the Assembled Panel Dataset (APD), a realistic industrial scenario that demands both knowledge of defect types and component locations.

We evaluated one representative model for each prevailing paradigm: a generative large vision-language model, a training-free discriminative model, and an embedding-adaptive discriminative model. In all three cases, the textual interface influences the decision only superficially. The models are largely insensitive to prompt content unless the object noun is omitted, a change that causes the generative model’s I-AUROC to drop from 97.4 to 82.6. Similarly, component-level instructions fail to constrain the decision: once defects outside the instructed area are counted as normal, performance drops from 90.3 to 66.3. When these factors combine on APD, image-level discrimination falls below MVTec AD levels and, in one instance, below chance (scores of 71.2, 50.5, and 31.5).
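To make the component-level effect concrete, the following minimal sketch computes image-level AUROC (I-AUROC, reported in percent, where 50 corresponds to chance) under two labelings of the same anomaly scores: one where any defect counts as anomalous, and one where only defects on the instructed component do. This is not the benchmark's evaluation code. The helper names `auroc` and `component_labels`, the component names, and the scores are all hypothetical and chosen only for illustration.

```python
from typing import Optional, Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Image-level AUROC in percent.

    Computed as the probability that a randomly chosen anomalous image
    receives a higher score than a randomly chosen normal image, with
    ties counted as one half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one normal and one anomalous image")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return 100.0 * wins / (len(pos) * len(neg))


def component_labels(
    defect_components: Sequence[Optional[str]], instructed: str
) -> list:
    """Label an image anomalous (1) only if its defect lies on the
    instructed component; defects elsewhere count as normal (0)."""
    return [int(c == instructed) for c in defect_components]


# Hypothetical toy data: the component carrying each image's defect
# (None means defect-free) and scores from a model that reacts to any
# defect regardless of the textual instruction.
defects = ["screw", "cable", None, "cable", None, "screw"]
scores = [0.6, 0.9, 0.1, 0.8, 0.2, 0.7]

any_defect = [int(c is not None) for c in defects]
restricted = component_labels(defects, instructed="screw")

print(auroc(any_defect, scores))   # every defect counts as anomalous
print(auroc(restricted, scores))   # only "screw" defects count
```

On this toy data the first score is 100.0 and the second is 50.0: a model that ignores the instruction still ranks off-component defects as highly anomalous, and those images become false positives once the labels are restricted to the instructed component.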

These results indicate that conventional benchmarks exaggerate the text-guided capabilities of current multimodal anomaly detection systems. They further suggest that adopting rigorous evaluation protocols like TGAD is essential for developing models that can be reliably controlled via language for industrial applications.
