arXiv

CounterFace: A Synthetic Face Dataset for Fine-Grained Counterfactual Evaluation of Face Recognition Systems

June 4, 2026 · Guruprasad Viswanathan Ramesh, Ashish Hooda, Shimaa Ahmed, Harrison J Rosenberg, Ramya Korlakai Vinayak, Kassem Fawaz

Abstract:

Because face recognition (FR) technology is used in critical applications, these systems must remain reliable and robust across a wide range of populations and environmental conditions. FR performance has traditionally been assessed on established datasets such as LFW, which measure average accuracy. Some benchmarks account for broad intra-identity shifts (aging, pose variation, lighting changes), but they often overlook the subtle, fine-grained changes people undergo, such as a new hairstyle or makeup. These variations are significantly underrepresented in current evaluation standards. Counterfactual evaluation offers a way to test system robustness against such fine-grained changes.

However, previous synthetic face datasets created with image generators have had restricted attribute coverage, because their generation pipelines relied on human reviewers for verification. To address this gap, we introduce CounterFace, a novel dataset designed for counterfactual evaluation. It covers 20 facial attributes and 8 demographic factors, which is 14 more attributes and 2 more demographic categories than earlier synthetic datasets. The dataset was produced by a fully automated workflow that pairs commercial off-the-shelf image generators with custom verifiers, eliminating the need for human intervention in verification.

CounterFace consists of 11,821 counterfactual face pairs. The fidelity of the generated images was validated through a post-hoc user study. We benchmarked six FR systems, two commercial (AWS Rekognition, Face++) and four open-source (AdaFace, MagFace, ArcFace, FaceNet), on 160 distinct attribute-demographic combinations. Unlike standard benchmarks, CounterFace makes it possible to isolate specific failure modes of individual systems. For all tested systems, the performance drop varies significantly with the attribute and demographic involved. Notably, occluding attributes such as facial hair and face masks consistently degraded performance across all systems.
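As an illustration of what evaluating per attribute-demographic combination could look like, the following minimal Python sketch groups counterfactual pair scores by (attribute, demographic) and computes the fraction of pairs an FR system fails to match. Everything in it is an assumption made for illustration: the `PairScore` record, the field names, the group labels, the similarity values, the threshold of 0.5, and the choice of false non-match rate as the metric. The abstract does not specify the metric or data format the authors use.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class PairScore:
    """Hypothetical record for one counterfactual pair (not the dataset's schema)."""
    attribute: str     # attribute changed in the counterfactual, e.g. "facial_hair"
    demographic: str   # demographic group of the identity, e.g. "group_a"
    similarity: float  # FR similarity between the original and the counterfactual


def false_non_match_rate_by_group(scores, threshold):
    """Return {(attribute, demographic): share of pairs scored below threshold}.

    Both images in a pair depict the same identity, so a similarity below
    the threshold counts as a false non-match.
    """
    totals = defaultdict(int)
    misses = defaultdict(int)
    for s in scores:
        key = (s.attribute, s.demographic)
        totals[key] += 1
        if s.similarity < threshold:
            misses[key] += 1
    return {key: misses[key] / totals[key] for key in totals}


# Made-up scores, purely to show the call; these are not results from the paper.
scores = [
    PairScore("facial_hair", "group_a", 0.31),
    PairScore("facial_hair", "group_a", 0.72),
    PairScore("hairstyle", "group_a", 0.81),
    PairScore("hairstyle", "group_b", 0.77),
]
rates = false_non_match_rate_by_group(scores, threshold=0.5)
for key in sorted(rates):
    print(key, rates[key])
```

Keeping one rate per (attribute, demographic) key, rather than a single average, is what allows a failure tied to one attribute in one group to stay visible instead of being averaged away.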

Source: arXiv Generated at: 2026-06-04 00:00:00 UTC