arXiv

BioLip: Language-Generalizable Lip-Sync Deepfake Detection via Biomechanical Constraint Violation Modeling

Original: arXiv:2604.16808v4 (Announce Type: replace)

Abstract: Existing lip-sync deepfake detectors rely on pixel artifacts or audio-visual correspondence, and both fail under generator or language shift because the features they learn are tied to the training distribution. We take a different approach. Authentic lip motion is constrained by tissue mechanics and neuromuscular bandwidth; current generators typically do not impose these constraints, producing trajectories with elevated variance in velocity, acceleration, and jerk that real speech does not exhibit. We exploit this signal, which we term temporal lip jitter, by computing kinematic statistics from 64 perioral landmarks over short sliding windows and feeding them into a lightweight three-branch network. The model uses only landmark coordinates: no pixels, no audio, and no voiceprint data. We train only on English data and test in a zero-shot setting on five unseen generators and seven languages.

Rewritten:

Abstract: Conventional lip-sync deepfake detectors rely on pixel-level artifacts or on audio-visual synchronization. Both approaches fail when the generator or the spoken language changes, because the features they learn are tied to the training distribution. Our method instead exploits the biomechanical limits of human speech. Natural lip motion is constrained by tissue mechanics and neuromuscular bandwidth; current generators typically do not impose these constraints, so they produce trajectories with elevated variance in velocity, acceleration, and jerk that genuine speech does not exhibit. We call this signal "temporal lip jitter." To detect it, we compute kinematic statistics from 64 perioral landmarks over short sliding windows and feed them into a lightweight three-branch network. The model uses only landmark coordinates: no pixels, no audio, and no voiceprint data. We train only on English data and evaluate zero-shot on five unseen generators and seven languages.
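
To make the idea of kinematic statistics over sliding windows concrete, here is a minimal sketch for a single landmark track. It is an illustration under assumptions, not the paper's implementation: the abstract does not specify the frame rate, window length, stride, or which statistics are used, so the finite-difference scheme, the choice of variance of vector magnitudes, and the names `finite_difference`, `jitter_features`, and `sliding_windows` are all hypothetical.

```python
from math import hypot
from statistics import pvariance


def finite_difference(samples, dt):
    """Forward difference of a sequence of (x, y) samples taken dt seconds apart."""
    return [((x1 - x0) / dt, (y1 - y0) / dt)
            for (x0, y0), (x1, y1) in zip(samples, samples[1:])]


def jitter_features(track, fps=25.0):
    """Variance of speed, acceleration magnitude, and jerk magnitude.

    `track` is a list of (x, y) positions of one landmark, one per frame.
    It needs at least 4 frames so that one jerk sample exists.
    The 25 fps default is an assumption, not a value from the abstract.
    """
    dt = 1.0 / fps
    velocity = finite_difference(track, dt)
    acceleration = finite_difference(velocity, dt)
    jerk = finite_difference(acceleration, dt)
    return tuple(pvariance([hypot(dx, dy) for dx, dy in derivative])
                 for derivative in (velocity, acceleration, jerk))


def sliding_windows(track, size, step):
    """Yield consecutive windows of `size` frames, advancing by `step` frames."""
    for start in range(0, len(track) - size + 1, step):
        yield track[start:start + size]


# A smooth track (constant velocity) versus one with irregular vertical jitter.
smooth = [(float(t), 0.0) for t in range(8)]
jittery = [(float(t), y) for t, y in enumerate([0, 0.5, 0, 0, 0, 0.5, 0, 0])]

smooth_features = jitter_features(smooth, fps=1.0)
jittery_features = jitter_features(jittery, fps=1.0)
per_window = [jitter_features(w, fps=1.0) for w in sliding_windows(jittery, 4, 2)]
```

In this toy example the smooth track yields zero variance in all three quantities, while the jittery track yields positive variance in each. A full pipeline would repeat this for every one of the 64 perioral landmarks and pass the resulting per-window feature vectors to a classifier.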


Source: arXiv Generated at: 2026-06-03 00:00:00 UTC
