arXiv

VT-3DAD: Cross-Category 3D Anomaly Detection via Visual-Text Normal Space Alignment

June 4, 2026 · Zi Wang, Katsuya Hotta, Yawen Zou, Koichiro Kamide, Yijin Wei, Chao Zhang, Jun Yu

Title: VT-3DAD: Aligning Visual-Text Normal Spaces for Cross-Category 3D Anomaly Detection

Abstract

Few-shot cross-category 3D anomaly detection aims to determine whether an unlabelled point cloud belongs to a given normal class, using only a small set of normal reference samples. Traditional approaches depend on category-specific training. Recent training-free techniques built on multi-view CLIP visual features rely predominantly on visual similarity, so they often struggle to separate categories that share similar geometric structures. To address these limitations, we introduce VT-3DAD, a novel training-free framework for cross-category 3D anomaly detection through Visual-Text Normal Space Alignment.

VT-3DAD first renders the few-shot normal references and the test point cloud into realistic multi-view depth maps, then extracts view-wise features with a frozen CLIP visual encoder. The visual branch measures the deviation between the test sample and the references in this multi-view feature space. In parallel, a frozen CLIP text encoder processes depth-aware and 3D-aware prompts to create textual normal anchors, which impose semantic constraints on what counts as normal for the target category. The final anomaly score combines the visual deviation from the normal references with the semantic deviation from the textual normal space.
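
As a rough illustration of how a two-branch score of this kind could be assembled, the sketch below uses small hand-written vectors in place of CLIP embeddings. The abstract does not specify the distance measure, the aggregation over views, or the fusion rule, so the cosine distance, the per-view averaging, the weighted sum, and the names `visual_deviation`, `semantic_deviation`, `anomaly_score`, and `alpha` are all assumptions made for this example and are not the authors' formulation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def visual_deviation(test_views, ref_views_list):
    """Average, over views, of the cosine distance to the closest reference.

    test_views: one feature vector per rendered view of the test sample.
    ref_views_list: one such list per few-shot normal reference.
    (Assumed aggregation; the abstract does not state it.)
    """
    per_view = []
    for v, feat in enumerate(test_views):
        best = max(cosine(feat, ref[v]) for ref in ref_views_list)
        per_view.append(1.0 - best)
    return sum(per_view) / len(per_view)

def semantic_deviation(test_views, text_anchors):
    """Average, over views, of the cosine distance to the closest text anchor."""
    per_view = [1.0 - max(cosine(feat, a) for a in text_anchors)
                for feat in test_views]
    return sum(per_view) / len(per_view)

def anomaly_score(test_views, ref_views_list, text_anchors, alpha=0.5):
    """Weighted sum of both deviations; alpha is a hypothetical fusion weight."""
    return (alpha * visual_deviation(test_views, ref_views_list)
            + (1.0 - alpha) * semantic_deviation(test_views, text_anchors))

# Toy stand-ins for frozen CLIP features: 2 views, 3 dimensions each.
references = [[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]]   # one-shot normal reference
anchors = [[1.0, 1.0, 0.0]]                          # one textual normal anchor
normal_test = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]     # matches the reference
odd_test = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]        # orthogonal to everything

score_normal = anomaly_score(normal_test, references, anchors)
score_odd = anomaly_score(odd_test, references, anchors)
print(score_normal < score_odd)
```

In this toy setting the sample that matches the reference receives the lower score, and the text anchors contribute a second, reference-independent signal, which is the role the abstract attributes to the textual normal space.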

Evaluation on the ShapeNetPart dataset indicates that VT-3DAD delivers state-of-the-art results. Notably, when compared to a visual-only baseline, VT-3DAD boosts the one-shot average AUC-ROC from 92.49% to 94.80% and significantly lowers the average standard deviation from 5.64 to 3.41.
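
For readers unfamiliar with the metric, AUC-ROC can be read as the probability that a randomly chosen anomalous sample receives a higher anomaly score than a randomly chosen normal one. The minimal sketch below computes it by direct pairwise comparison; the function name `auc_roc` and the scores and labels are made up for illustration and are not results from the paper.

```python
def auc_roc(scores, labels):
    """AUC-ROC via pairwise comparison; label 1 = anomalous, 0 = normal.

    Counts the fraction of (anomalous, normal) pairs in which the anomalous
    sample has the higher score, with ties counted as half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Illustrative anomaly scores only, not data from the paper.
example_scores = [0.1, 0.4, 0.35, 0.8]
example_labels = [0, 0, 1, 1]
print(auc_roc(example_scores, example_labels))  # 0.75
```

The pairwise form is quadratic in the number of samples, which is fine for small evaluation sets; a rank-based formulation would be preferable at scale.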
