Global News Digest

arXiv

Beyond the Mouth: Upper-Face Affective Cues in Audiovisual Sentence Recognition under Acoustic Uncertainty

Title: Looking Beyond the Lips: The Role of Upper-Face Emotional Signals in Audiovisual Sentence Comprehension Amidst Acoustic Noise

Abstract: Natural speech comprehension is a multimodal process that synthesizes auditory signals with visual data such as articulation, head movement, and facial expressions. While current audiovisual speech systems predominantly rely on the mouth region for linguistic decoding and treat emotional expressions as distinct classification tasks, this study explores the contribution of upper-face affective cues to sentence recognition. Specifically, we examine whether these cues enhance performance when audio is degraded. Leveraging the CREMA-D audiovisual emotional speech dataset, we developed feature-based sentence classifiers tested under four distinct cue configurations: audio-only (A), audio combined with mouth/lower-face features (A+M), audio combined with upper-face features (A+U), and audio integrated with both mouth and upper-face features (A+M+U). The models were assessed using actor-independent splits on clean audio and pink-noise environments at signal-to-noise ratios (SNR) of +10 dB, +5 dB, and 0 dB.
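The noise conditions described above depend on mixing speech with noise at a chosen SNR. The following is a minimal sketch of that scaling step, not the authors' pipeline: the function names `mean_power`, `mix_at_snr`, and `measured_snr_db` are hypothetical, and the noise sequence is assumed to be supplied by the caller (pink-noise generation itself is not shown).

```python
import math
import random


def mean_power(samples):
    """Average power (mean squared amplitude) of a sequence of samples."""
    return sum(s * s for s in samples) / len(samples)


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so speech-to-noise power matches `snr_db`, then add it.

    Returns (mixed, scaled_noise). Hypothetical helper for illustration.
    """
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    noise_power = mean_power(noise)
    if noise_power == 0:
        raise ValueError("noise must have non-zero power")
    # SNR_dB = 10 * log10(P_speech / P_noise), solved for the noise power.
    target_noise_power = mean_power(speech) / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / noise_power)
    scaled = [gain * n for n in noise]
    mixed = [s + n for s, n in zip(speech, scaled)]
    return mixed, scaled


def measured_snr_db(speech, noise):
    """SNR in dB between a speech sequence and a noise sequence."""
    return 10 * math.log10(mean_power(speech) / mean_power(noise))


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy stand-ins: a sine tone for speech, white Gaussian noise for the
    # interferer (the study uses pink noise; white noise is an assumption here).
    speech = [math.sin(2 * math.pi * 220 * t / 16000) for t in range(16000)]
    noise = [rng.gauss(0.0, 1.0) for _ in range(16000)]
    for snr in (10, 5, 0):
        _, scaled = mix_at_snr(speech, noise, snr)
        print(snr, round(measured_snr_db(speech, scaled), 6))
```

At 0 dB the scaled noise carries the same average power as the speech, which is the hardest of the three noisy conditions listed.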

The results indicate that incorporating mouth and lower-face features significantly enhances robustness against audio degradation. At an SNR of 0 dB, the A+M condition increased accuracy by 0.0794 compared to the audio-only baseline, with an actor-bootstrap 95% confidence interval of [0.0296, 0.1298]. Upper-face affective signals demonstrated a more complex influence. While the direct accuracy improvement from adding upper-face cues (A+M+U versus A+M) was minimal, models utilizing the full face showed superior calibration across all SNR levels and significantly outperformed controls with shuffled upper-face data in noisy scenarios. These outcomes imply that affective facial information bolsters multimodal resilience and aids confidence estimation during acoustic uncertainty, even if it does not directly convey lexical content. More broadly, the research underscores the importance of socially expressive facial cues in the design of human-centric audiovisual interaction systems.
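An actor-bootstrap confidence interval of the kind reported for the A+M versus audio-only comparison can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: it assumes a percentile bootstrap that resamples whole actors with replacement, and the function name `actor_bootstrap_ci`, its parameters, and the input layout (actor ID mapped to per-utterance 0/1 correctness) are hypothetical.

```python
import random


def actor_bootstrap_ci(correct_base, correct_new, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for accuracy(new) - accuracy(base).

    Each argument maps an actor ID to a list of 0/1 correctness flags, one per
    test utterance, aligned between the two models. Actors, not utterances,
    are resampled so that within-actor correlation is preserved.
    """
    rng = random.Random(seed)
    actors = sorted(correct_base)
    diffs = []
    for _ in range(n_boot):
        # Draw as many actors as in the test set, with replacement.
        sample = [rng.choice(actors) for _ in actors]
        hits_base = hits_new = total = 0
        for actor in sample:
            hits_base += sum(correct_base[actor])
            hits_new += sum(correct_new[actor])
            total += len(correct_base[actor])
        diffs.append((hits_new - hits_base) / total)
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_boot)]
    upper = diffs[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper


if __name__ == "__main__":
    # Toy data only: three hypothetical actors, four utterances each.
    base = {"a1": [1, 0, 1, 0], "a2": [0, 0, 1, 1], "a3": [1, 1, 0, 0]}
    new = {"a1": [1, 1, 1, 0], "a2": [0, 1, 1, 1], "a3": [1, 1, 1, 1]}
    print(actor_bootstrap_ci(base, new))
```

An interval that excludes zero, such as the reported [0.0296, 0.1298], indicates that the gain holds up across resampled sets of actors rather than being driven by a few speakers.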


Source: arXiv Generated at: 2026-06-02 00:00:00 UTC
