BioLip: Language-Generalizable Lip-Sync Deepfake Detection via Biomechanical Constraint Violation Modeling
Original: arXiv:2604.16808v4. Abstract: Existing lip-sync deepfake detectors rely on pixel artifacts or audio-visual correspondence, and both fail under generator or language shift because the features they learn are tied to the training distribution. We take a different approach. Authentic lip motion is constrained by tissue mechanics and neuromuscular bandwidth; current generators typically do not impose these constraints, producing trajectories with elevated variance in velocity, acceleration, and jerk that real speech does not exhibit. We exploit this signal, which we term temporal lip jitter, by computing kinematic statistics from 64 perioral landmarks over short sliding windows and feeding them into a lightweight three-branch network. The model uses only landmark coordinates: no pixels, no audio, and no voiceprint data. We train only on English data and test in a zero-shot setting on five unseen generators and seven languages.
Rewritten:
Abstract: Existing lip-sync deepfake detectors rely mainly on pixel-level artifacts or audio-visual synchronization. Both strategies fail when the generator or the spoken language shifts, because the features they learn are tied to the training distribution. Our method instead exploits the biomechanical limits of human speech. Natural lip motion is constrained by tissue mechanics and neuromuscular bandwidth, whereas current generators typically do not impose these constraints. The result is motion trajectories with elevated variance in velocity, acceleration, and jerk that genuine speech does not exhibit. We call this signal "temporal lip jitter." To detect it, we compute kinematic statistics from 64 perioral landmarks over short sliding windows and feed them into a lightweight three-branch network. The model uses only landmark coordinates: no pixels, no audio, and no voiceprint data. It is trained only on English data and evaluated zero-shot on five unseen generators and seven languages.
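To make the kinematic-statistics step concrete, the following is a minimal sketch of one way such features could be computed. It is not the authors' implementation. The abstract does not specify the array layout `(frames, 64, 2)`, the frame rate, the window length, the stride, the use of finite differences, or the choice of statistic (variance of the derivative magnitude, averaged over landmarks), so all of these are assumptions made for illustration. The function name `kinematic_jitter_features` is hypothetical, and the three-branch network is not shown.

```python
import numpy as np


def kinematic_jitter_features(landmarks, fps=25.0, window=16, stride=8):
    """Per-window variance of velocity, acceleration, and jerk magnitudes.

    landmarks: array of shape (T, K, 2), K tracked points per frame
               (K = 64 perioral landmarks in the abstract; layout is assumed).
    Returns an array of shape (n_windows, 3).
    """
    pts = np.asarray(landmarks, dtype=np.float64)
    if pts.ndim != 3 or pts.shape[2] != 2:
        raise ValueError("expected landmarks with shape (T, K, 2)")
    if window < 5 or pts.shape[0] < window:
        raise ValueError("need window >= 5 and at least `window` frames")

    feats = []
    for start in range(0, pts.shape[0] - window + 1, stride):
        deriv = pts[start:start + window]
        row = []
        for _ in range(3):  # 1st, 2nd, 3rd time derivative
            # Finite difference along time, scaled to per-second units.
            deriv = np.diff(deriv, axis=0) * fps
            # Magnitude of the 2D derivative for each frame and landmark.
            mag = np.linalg.norm(deriv, axis=-1)
            # Variance over time per landmark, averaged over landmarks.
            row.append(mag.var(axis=0).mean())
        feats.append(row)
    return np.array(feats)


# Synthetic demo (illustrative data only): a smooth 1 Hz lip oscillation
# versus the same motion with small frame-to-frame noise added.
rng = np.random.default_rng(0)
t = np.arange(64) / 25.0
smooth = np.repeat(rng.uniform(0, 100, size=(1, 64, 2)), 64, axis=0)
smooth[:, :, 1] += 5.0 * np.sin(2 * np.pi * 1.0 * t)[:, None]
jittery = smooth + rng.normal(0, 0.5, size=smooth.shape)

print("smooth  mean jerk variance:", kinematic_jitter_features(smooth)[:, 2].mean())
print("jittery mean jerk variance:", kinematic_jitter_features(jittery)[:, 2].mean())
```

Each differentiation amplifies frame-to-frame noise relative to slow, smooth motion, so in this synthetic setup the jerk statistic separates the two trajectories more strongly than the velocity statistic. That is the intuition behind using higher-order kinematics as a jitter signal.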
Source: arXiv Generated at: 2026-06-03 00:00:00 UTC





