Segmentation-Guided Spatial Indexing for Generalizable and Explainable Deepfake Detection
Title: Enhancing Generalizability and Explainability in Deepfake Detection via Segmentation-Guided Spatial Indexing
Abstract:
This paper presents a novel approach to deepfake detection that prioritizes both generalizability and explainability through segmentation-guided spatial indexing. The methodology inverts the conventional workflow: instead of aggregating all facial tokens prior to classification, the system first identifies semantically significant patch tokens and pools exclusively these selected elements. This process begins with a frozen FaRL parser, which assigns semantic labels to each patch token generated by a DINOv3 ViT-L/16 model. Tokens that do not correspond to target regions are discarded, leaving only the relevant areas for classification via a linear probe.
By leveraging the spatial consistency inherent in DINOv3, a characteristic that facilitates emergent segmentation, this spatial indexing technique creates a cleaner regional subspace for the probe. In this refined space, evidence of manipulation is less obscured by holistic facial cues. The method also makes attribution structural: if the model flags an image as fake based on mouth artifacts, the decision relies strictly on the tokens from that region rather than on a post-hoc saliency map.
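The select-then-pool order and the region-level attribution described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: mean pooling, the toy token grid, the integer label `MOUTH`, and the function names `region_pooled_feature` and `linear_probe_score` are all assumptions made for illustration, and random arrays stand in for DINOv3 patch tokens and for FaRL labels resampled to the patch grid.

```python
import numpy as np

MOUTH = 3  # hypothetical label id; the parser's actual label map is not given in the text


def region_pooled_feature(patch_tokens, patch_labels, target_label):
    """Pool only the patch tokens whose label equals target_label.

    Mean pooling is an assumption; the text only says the selected tokens are pooled.
    """
    mask = patch_labels == target_label
    if not mask.any():
        raise ValueError("no patch token falls inside the target region")
    return patch_tokens[mask].mean(axis=0)


def linear_probe_score(feature, weights, bias):
    """Logistic score of a linear probe applied to the pooled feature."""
    logit = float(feature @ weights + bias)
    return 1.0 / (1.0 + np.exp(-logit))


# Toy stand-ins: 16 patch tokens of dimension 8 with one label per patch.
rng = np.random.default_rng(0)
num_patches, dim = 16, 8
tokens = rng.normal(size=(num_patches, dim))
labels = np.arange(num_patches) % 4  # deterministic labels so every region is present
weights = rng.normal(size=dim)       # untrained probe weights, for illustration only
bias = 0.0

feature = region_pooled_feature(tokens, labels, MOUTH)
score = linear_probe_score(feature, weights, bias)

# Alter every token outside the selected region and score again.
perturbed = tokens.copy()
perturbed[labels != MOUTH] += 10.0
score_perturbed = linear_probe_score(
    region_pooled_feature(perturbed, labels, MOUTH), weights, bias
)
print(score, score_perturbed)
```

In this toy setup the two printed scores are identical, because tokens outside the selected region never enter the pooled feature. That is the sense in which the attribution is structural at the pooling stage.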
Experimental results on the Celeb-DF v2 dataset demonstrate the efficacy of the mouth-indexed probe, which achieves an Area Under the Curve (AUC) of 0.905. This surpasses LipForensics by 8.1 percentage points and Xception by 16.9 percentage points, without fine-tuning DINOv3 or FaRL and without using target-domain data. Ablation studies clarify the underlying mechanisms: substituting DINOv3’s CLS token for the regional selection process reduces the AUC on Celeb-DF v2 by 26.4 percentage points, while replacing DINOv3 with FaRL features results in a 20.9 percentage point drop. These findings indicate that both the DINOv3 representation and the spatial indexing mechanism are essential, as neither alone reproduces the performance of the complete system.
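The gaps quoted above can be converted into implied AUC values with simple arithmetic. The sketch below assumes that every gap is measured on Celeb-DF v2 against the 0.905 mouth-indexed probe and that one percentage point equals 0.01 AUC; the resulting numbers are derived here, not reported in the abstract, and the helper name `implied_auc` is hypothetical.

```python
REPORTED_AUC = 0.905  # mouth-indexed probe on Celeb-DF v2, as stated in the abstract

# Gaps in percentage points, as stated in the abstract.
GAPS_PP = {
    "LipForensics": 8.1,
    "Xception": 16.9,
    "DINOv3 CLS token (ablation)": 26.4,
    "FaRL features (ablation)": 20.9,
}


def implied_auc(reference_auc, gap_pp):
    """Subtract a gap in percentage points from a reference AUC on the 0-1 scale."""
    return round(reference_auc - gap_pp / 100.0, 3)


implied = {name: implied_auc(REPORTED_AUC, gap) for name, gap in GAPS_PP.items()}
for name, auc in implied.items():
    print(f"{name}: implied AUC {auc:.3f}")
```

Under these assumptions the implied values are 0.824 for LipForensics, 0.736 for Xception, 0.641 for the CLS-token ablation, and 0.696 for the FaRL-feature ablation.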
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC





