Global News Digest

arXiv

A Structured Benchmark for Text-Guided Anomaly Detection: When Language Stops Conditioning the Decision

Title: A Standardized Framework for Text-Guided Anomaly Detection: Assessing the Actual Influence of Language on Decision-Making

Abstract:

Traditionally, industrial anomaly detection has operated as a unimodal discipline. Although recent advancements in multimodal vision-language models have introduced systems capable of processing text alongside images—promising text-guided zero- and few-shot inspection capabilities—these systems are typically assessed using evaluation protocols derived from unimodal benchmarks. These standard protocols keep the textual prompt constant, thereby failing to determine whether language genuinely influences the model's decisions. Consequently, it remains unclear whether observed performance improvements stem from effective text guidance or merely from robust pretrained visual features.

To address this gap, we present Text-Guided Anomaly Detection (TGAD), a structured benchmark designed to incrementally elevate the functional importance of language across three distinct scenarios. First, we establish a controlled prompt-sensitivity environment using the MVTec AD dataset. Second, we introduce a modified version of MVTec AD featuring component tags, which mandates that the model limit its evaluation to a specifically instructed part. Third, we unveil the Assembled Panel Dataset (APD), a realistic industrial scenario that demands both knowledge of defect types and component locations.
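The labelling rule behind the second scenario can be sketched as follows. This is a minimal illustration and not the benchmark's actual code: the `Defect` record, the component names, and the `instructed_label` helper are hypothetical, and the sketch assumes each defect is annotated with the single component it lies on.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Defect:
    """Hypothetical annotation: one defect and the component tag it lies on."""
    kind: str
    component: str


def instructed_label(defects: List[Defect], instructed_component: str) -> int:
    """Return 1 (anomalous) only if some defect lies on the instructed component.

    Defects on other components are ignored, so the same image can be
    anomalous under one instruction and normal under another.
    """
    return int(any(d.component == instructed_component for d in defects))


# Example image with a single defect on a hypothetical "cap" component.
image_defects = [Defect(kind="scratch", component="cap")]
label_cap = instructed_label(image_defects, "cap")    # defect is in scope
label_body = instructed_label(image_defects, "body")  # defect is out of scope
label_clean = instructed_label([], "cap")             # no defects at all
```

Under such a rule, a model that ignores the instruction and flags every visible defect is penalised on exactly the images where the defect is out of scope.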

We assessed one representative model for each prevailing paradigm: a generative large vision-language model, a training-free discriminative model, and an embedding-adaptive discriminative model. Our findings reveal that in all three cases, the textual interface influences the decision only superficially. Specifically, the models are largely insensitive to prompt content unless the object noun is omitted, a change that causes the generative model’s I-AUROC to drop from 97.4 to 82.6. Similarly, component-level instructions fail to constrain the decision: when defects outside the instructed component are labelled as normal, performance drops from 90.3 to 66.3. When these factors combine on the APD dataset, image-level discrimination deteriorates significantly, falling below MVTec AD performance levels and, in one instance, below chance (scores of 71.2, 50.5, and 31.5).
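For reference, image-level AUROC (I-AUROC) is the probability that a randomly chosen anomalous image receives a higher anomaly score than a randomly chosen normal one. On the percentage scale used above, 50 therefore corresponds to chance, and values below 50 mean the ranking is inverted more often than not. A minimal sketch in plain Python, using made-up scores and a hypothetical helper named `image_auroc`:

```python
from typing import Sequence


def image_auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Pairwise (Mann-Whitney) AUROC: fraction of (anomalous, normal) pairs
    ranked correctly, counting ties as half. Labels: 1 = anomalous, 0 = normal."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one anomalous and one normal image")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


labels = [0, 0, 1, 1]
good_scores = [0.1, 0.2, 0.8, 0.9]       # anomalies ranked above normals
inverted_scores = [0.8, 0.9, 0.1, 0.2]   # anomalies ranked below normals
constant_scores = [0.5, 0.5, 0.5, 0.5]   # scores carry no information

auroc_good = image_auroc(labels, good_scores)
auroc_inverted = image_auroc(labels, inverted_scores)
auroc_constant = image_auroc(labels, constant_scores)
```

The illustrative scores are not taken from the paper; they only show the two extremes and the uninformative case against which a reported I-AUROC can be read.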

These results indicate that conventional benchmarks exaggerate the text-guided capabilities of current multimodal anomaly detection systems. They further suggest that adopting rigorous evaluation protocols like TGAD is essential for developing models that can be reliably controlled via language for industrial applications.


Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC
