arXiv

Phoneme-Level Visual Speech Recognition via Point-Visual Fusion and Language Model Reconstruction

Title: Enhancing Phoneme-Level Visual Speech Recognition Through Point-Visual Fusion and Language Model Reconstruction

Abstract: Visual Automatic Speech Recognition (V-ASR) is the task of interpreting spoken language from visual input alone, such as lip movements and facial expressions. The task is difficult because no audio signal is available and because distinct phonemes can share a viseme, meaning they produce nearly identical lip motions. Current approaches typically predict words or characters directly from these visual cues; they often suffer high error rates caused by viseme ambiguity and depend heavily on large pre-training datasets. To address these limitations, we introduce a two-stage, phoneme-centric framework that fuses visual features with facial landmark motion features and then uses a Large Language Model (LLM) to reconstruct words. The first stage performs V-ASR to predict phonemes, which reduces training complexity, while facial landmark features account for speaker-specific characteristics. In the second stage, an encoder-decoder LLM, specifically NLLB, converts the predicted phonemes into coherent words. Fine-tuned on a large visual dataset, our proposed PV-ASR method achieves a Word Error Rate (WER) of 17.4% on the LRS2 dataset and 21.0% on the LRS3 dataset.
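For readers unfamiliar with the metric reported above, WER is conventionally defined as the word-level edit distance (substitutions, deletions, and insertions) between the hypothesis and the reference, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the authors' evaluation script. The function name `word_error_rate` and the sample sentences are illustrative assumptions, and any text normalization applied in the paper (casing, punctuation) is not specified here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: word-level Levenshtein distance / number of reference words.

    Illustrative helper (hypothetical name); splits on whitespace only and
    applies no casing or punctuation normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Made-up example sentences, not taken from LRS2 or LRS3.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference. A figure such as 17.4% corresponds to roughly one word-level error for every six reference words.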


Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC
