arXiv

Segmentation-Guided Spatial Indexing for Generalizable and Explainable Deepfake Detection

Title: Enhancing Generalizability and Explainability in Deepfake Detection via Segmentation-Guided Spatial Indexing

Abstract:

This paper presents a novel approach to deepfake detection that prioritizes both generalizability and explainability through segmentation-guided spatial indexing. The methodology inverts the conventional workflow: instead of aggregating all facial tokens prior to classification, the system first identifies semantically significant patch tokens and pools exclusively these selected elements. This process begins with a frozen FaRL parser, which assigns semantic labels to each patch token generated by a DINOv3 ViT-L/16 model. Tokens that do not correspond to target regions are discarded, leaving only the relevant areas for classification via a linear probe.

By leveraging the spatial consistency inherent in DINOv3—a characteristic that facilitates emergent segmentation—this spatial indexing technique creates a cleaner regional subspace for the probe. In this refined space, evidence of manipulation is less obscured by holistic facial cues. Furthermore, this method ensures structural attribution; for instance, if the model identifies a fake image based on mouth artifacts, the decision relies strictly on the tokens from that specific region, rather than on post-hoc saliency maps.
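The select-then-pool step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the label ids in `MOUTH_LABELS`, the majority-vote rule for assigning a parser label to each patch, and the use of mean pooling are all hypothetical choices made for the example. Patch tokens are assumed to arrive in row-major grid order with any CLS or register tokens already removed.

```python
import numpy as np

PATCH = 16              # ViT-L/16 operates on 16x16-pixel patches
MOUTH_LABELS = (1, 2)   # hypothetical parser label ids for the mouth region


def patch_labels_from_mask(mask, patch=PATCH):
    """Assign one semantic label per patch by majority vote over its pixels.

    mask: (H, W) array of non-negative integer labels from a face parser.
    Returns a (num_patches,) array in row-major patch order.
    """
    h, w = mask.shape
    gh, gw = h // patch, w // patch
    blocks = (
        mask[: gh * patch, : gw * patch]
        .reshape(gh, patch, gw, patch)
        .transpose(0, 2, 1, 3)          # -> (gh, gw, patch, patch)
        .reshape(gh * gw, patch * patch)
    )
    return np.array([np.bincount(b).argmax() for b in blocks])


def pool_region_tokens(tokens, labels, target_labels):
    """Keep only tokens whose patch label is in target_labels, then mean-pool.

    tokens: (num_patches, dim) patch embeddings from a frozen backbone.
    Returns (pooled_vector_or_None, boolean keep mask).
    """
    keep = np.isin(labels, target_labels)
    if not keep.any():
        return None, keep               # region not visible in this image
    return tokens[keep].mean(axis=0), keep


def probe_score(pooled, weight, bias):
    """Linear probe with a sigmoid: probability-like score for 'fake'."""
    return 1.0 / (1.0 + np.exp(-(pooled @ weight + bias)))
```

Because the probe only ever sees the rows selected by `keep`, that boolean mask is itself the attribution: reshaped to the patch grid, it marks exactly the patches that could have influenced the score, with no separate saliency computation.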

Experimental results on the Celeb-DF v2 dataset demonstrate the efficacy of the mouth-indexed probe, which achieved an Area Under the Curve (AUC) of 0.905. This performance surpasses LipForensics by 8.1 percentage points and Xception by 16.9 percentage points, all without fine-tuning DINOv3 or FaRL and without using target-domain data. Ablation studies further clarify the underlying mechanisms: substituting the regional selection process with DINOv3’s CLS token reduces the AUC on Celeb-DF v2 by 26.4 percentage points, while replacing DINOv3 with FaRL features results in a 20.9 percentage point drop. These findings indicate that both the DINOv3 representation and the spatial indexing mechanism are necessary: neither alone reproduces the performance of the complete system.
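For orientation, the reported gaps can be converted into implied absolute AUC values. The arithmetic below assumes each gap is an absolute difference in AUC (one percentage point equals 0.01 AUC) measured against the 0.905 result on Celeb-DF v2. The derived figures are not stated in the abstract and may differ from published values by rounding.

```latex
\begin{align*}
\text{LipForensics (implied)}        &: 0.905 - 0.081 = 0.824 \\
\text{Xception (implied)}            &: 0.905 - 0.169 = 0.736 \\
\text{DINOv3 CLS token (implied)}    &: 0.905 - 0.264 = 0.641 \\
\text{FaRL features (implied)}       &: 0.905 - 0.209 = 0.696
\end{align*}
```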


Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC
