arXiv

CounterFace: A Synthetic Face Dataset for Fine-Grained Counterfactual Evaluation of Face Recognition Systems

Abstract:

Face recognition (FR) systems are deployed in critical applications, so they must remain reliable and robust across a wide range of populations and environmental conditions. FR performance has traditionally been assessed by average accuracy on established datasets such as LFW. Some benchmarks cover broad intra-identity shifts such as aging, pose variation, and lighting changes, but they often overlook the fine-grained changes people routinely undergo, such as a new hairstyle or makeup. These variations are significantly underrepresented in current evaluation standards. Counterfactual evaluation, which compares a system's behavior on an image and on a copy of it that differs in a single attribute, offers a way to test robustness against such changes.

However, earlier synthetic face datasets built with image generators covered only a limited set of attributes, because their generation pipelines relied on human reviewers for verification. To address this gap, we introduce CounterFace, a new dataset designed for counterfactual evaluation. It covers 20 facial attributes and 8 demographic factors, 14 attributes and 2 demographic categories more than earlier synthetic datasets. The dataset was produced by a fully automated workflow that pairs commercial off-the-shelf image generators with custom verifiers, so no human intervention is needed for verification.
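
The abstract does not describe the generation pipeline in detail, so the following is only a minimal sketch of what a generate-then-verify loop could look like. The function name, the `edit` and `verifiers` callables, the `Image` placeholder type, and the retry limit are all assumptions made for illustration, not the authors' implementation.

```python
from typing import Callable, Optional, Sequence

Image = bytes  # placeholder type; a real pipeline would use an image object

# Hypothetical signatures: an editor applies one attribute change, and a
# verifier accepts or rejects the (original, candidate) pair automatically.
Editor = Callable[[Image, str], Image]
Verifier = Callable[[Image, Image, str], bool]


def generate_verified_counterfactual(
    original: Image,
    attribute: str,
    edit: Editor,
    verifiers: Sequence[Verifier],
    max_attempts: int = 3,  # assumed retry budget
) -> Optional[Image]:
    """Return the first edited image that every verifier accepts, else None."""
    for _ in range(max_attempts):
        candidate = edit(original, attribute)
        # Keep the candidate only if all automated checks pass,
        # standing in for the human review used by earlier pipelines.
        if all(verify(original, candidate, attribute) for verify in verifiers):
            return candidate
    return None
```

In this sketch a rejected candidate is simply regenerated, and a pair is dropped if no attempt passes every check; the verifiers could, for example, test that the target attribute changed while the identity did not.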

CounterFace consists of 11,821 counterfactual face pairs, and the fidelity of the generated images was validated in a post-hoc user study. We benchmarked six FR systems, two commercial (AWS Rekognition, Face++) and four open-source (AdaFace, MagFace, ArcFace, FaceNet), on 160 attribute-demographic combinations (20 attributes × 8 demographic factors). Unlike standard benchmarks, CounterFace makes it possible to isolate the specific failure modes of each system. For every tested system, the performance drop varies significantly with the attribute and demographic involved. Notably, occluding attributes such as facial hair and facemasks consistently degraded performance across all systems.
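
To make the per-combination evaluation concrete, here is a small sketch that groups counterfactual pairs by (attribute, demographic) cell and reports how often a system fails to match the original face to its edited counterpart. The abstract does not state which metric or threshold the authors use; the false non-match rate, the cosine-similarity comparison, the `PairResult` record, and the label strings below are assumptions for illustration.

```python
import math
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class PairResult:
    attribute: str                # hypothetical label, e.g. "facemask"
    demographic: str              # hypothetical label, e.g. "group_a"
    original: list[float]         # embedding of the original face
    counterfactual: list[float]   # embedding of the edited face, same identity


def cosine_similarity(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def false_non_match_rate_by_cell(pairs, threshold):
    """Fraction of pairs per (attribute, demographic) cell scored below threshold."""
    totals = defaultdict(int)
    misses = defaultdict(int)
    for p in pairs:
        cell = (p.attribute, p.demographic)
        totals[cell] += 1
        # Both images show the same identity, so a score under the
        # threshold counts as a false non-match for this cell.
        if cosine_similarity(p.original, p.counterfactual) < threshold:
            misses[cell] += 1
    return {cell: misses[cell] / totals[cell] for cell in totals}
```

Comparing the resulting rates across cells for one system is one way to surface the kind of attribute-specific and demographic-specific failure modes the abstract describes.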


Source: arXiv Generated at: 2026-06-04 00:00:00 UTC
