arXiv

Phoneme-Level Visual Speech Recognition via Point-Visual Fusion and Language Model Reconstruction

June 2, 2026 · Matthew Kit Khinn Teng, Haibo Zhang, Takeshi Saitoh

Title: Enhancing Phoneme-Level Visual Speech Recognition Through Point-Visual Fusion and Language Model Reconstruction

Abstract: Visual Automatic Speech Recognition (V-ASR) is the challenging task of interpreting spoken language from visual information alone, such as lip movements and facial expressions. The task is difficult because no audio signal is available and because different phonemes can share the same viseme, meaning they produce nearly identical lip motions. Existing approaches typically predict words or characters directly from visual cues, but they often suffer from high error rates caused by viseme ambiguity and depend heavily on large pre-training datasets. To address these limitations, we propose a phoneme-based two-stage framework that fuses visual and landmark motion features and then uses a Large Language Model (LLM) for word reconstruction. The first stage performs V-ASR to predict phonemes, which reduces training complexity, and uses facial landmark features to account for speaker-specific characteristics. The second stage uses an encoder-decoder LLM, NLLB, to reconstruct the predicted phonemes into coherent words. Using a large visual dataset for fine-tuning, the proposed PV-ASR method achieves a Word Error Rate (WER) of 17.4% on the LRS2 dataset and 21.0% on the LRS3 dataset.
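For readers unfamiliar with the metric, the sketch below shows how a Word Error Rate is conventionally computed: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and a hypothesis, divided by the number of reference words. This is the standard definition, not the authors' evaluation script. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration, and the abstract does not state what text normalization was applied before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: word-level Levenshtein distance / reference length.

    Illustrative helper (hypothetical name); tokenizes on whitespace only.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

Under this definition, a WER of 17.4% corresponds to roughly 17 word-level errors per 100 reference words, averaged over the test set.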
