Beyond the Mouth: Upper-Face Affective Cues in Audiovisual Sentence Recognition under Acoustic Uncertainty
Title: Looking Beyond the Lips: The Role of Upper-Face Emotional Signals in Audiovisual Sentence Comprehension Amidst Acoustic Noise
Abstract: Natural speech comprehension is a multimodal process that integrates auditory signals with visual information such as articulation, head movement, and facial expression. Current audiovisual speech systems rely predominantly on the mouth region for linguistic decoding and treat emotional expression as a separate classification task. This study instead examines whether upper-face affective cues contribute to sentence recognition, and specifically whether they improve performance when the audio is degraded. Using the CREMA-D audiovisual emotional speech dataset, we built feature-based sentence classifiers and tested them under four cue configurations: audio-only (A), audio plus mouth/lower-face features (A+M), audio plus upper-face features (A+U), and audio plus both mouth and upper-face features (A+M+U). The models were evaluated on actor-independent splits, on clean audio and on audio degraded with pink noise at signal-to-noise ratios (SNRs) of +10 dB, +5 dB, and 0 dB.
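As a rough illustration of the noise conditions described above, the sketch below mixes a signal with pink noise at a target SNR. The abstract does not say how the noise was generated or scaled, so everything here is an assumption: the Voss-McCartney approximation of pink noise, the mean-power definition of SNR, the 16 kHz sample rate, the sine tone standing in for speech, and the hypothetical names `pink_noise`, `mean_power`, and `mix_at_snr`.

```python
import math
import random


def pink_noise(n_samples, n_rows=16, seed=0):
    """Approximate pink (1/f) noise with the Voss-McCartney algorithm."""
    rng = random.Random(seed)
    rows = [rng.uniform(-1.0, 1.0) for _ in range(n_rows)]
    total = sum(rows)
    samples = []
    for i in range(1, n_samples + 1):
        # Row k is refreshed every 2**(k+1) samples; k is the number of
        # trailing zero bits of the sample counter i.
        k = (i & -i).bit_length() - 1
        if k < n_rows:
            total -= rows[k]
            rows[k] = rng.uniform(-1.0, 1.0)
            total += rows[k]
        samples.append(total / n_rows)
    return samples


def mean_power(x):
    """Mean squared amplitude of a sequence of samples."""
    return sum(v * v for v in x) / len(x)


def mix_at_snr(signal, noise, snr_db):
    """Scale `noise` so that signal power / noise power equals `snr_db`, then add it."""
    if len(signal) != len(noise):
        raise ValueError("signal and noise must have the same length")
    noise_power = mean_power(noise)
    if noise_power == 0.0:
        raise ValueError("noise has zero power")
    # SNR_dB = 10 * log10(P_signal / (gain**2 * P_noise)), solved for gain.
    gain = math.sqrt(mean_power(signal) / (noise_power * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(signal, noise)]


# Assumed sample rate and a sine tone standing in for a speech clip.
SAMPLE_RATE = 16000
speech_stand_in = [
    math.sin(2 * math.pi * 220 * t / SAMPLE_RATE) for t in range(SAMPLE_RATE)
]
noise = pink_noise(len(speech_stand_in))

# One degraded copy per SNR level named in the abstract.
noisy = {snr: mix_at_snr(speech_stand_in, noise, snr) for snr in (10, 5, 0)}
```

At 0 dB the scaled noise carries the same mean power as the signal, which is the hardest of the three degraded conditions.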
The results indicate that adding mouth and lower-face features significantly improves robustness to audio degradation. At an SNR of 0 dB, the A+M condition improved accuracy by 0.0794 over the audio-only baseline (actor-bootstrap 95% confidence interval [0.0296, 0.1298]). Upper-face affective cues had a more nuanced effect. The direct accuracy gain from adding them (A+M+U versus A+M) was minimal, but the full-face models were better calibrated at every SNR level and significantly outperformed controls with shuffled upper-face data under noise. These findings suggest that affective facial information supports multimodal robustness and aids confidence estimation under acoustic uncertainty, even if it does not directly convey lexical content. More broadly, the study underscores the value of socially expressive facial cues in the design of human-centered audiovisual interaction systems.
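The actor-bootstrap interval reported above can be illustrated with a minimal sketch. The abstract gives no details of the procedure, so this assumes a percentile bootstrap that resamples actors with replacement and pools their clips before computing the accuracy difference. The function name `actor_bootstrap_ci`, the data layout, and the toy data are hypothetical and do not reproduce the reported numbers.

```python
import random


def actor_bootstrap_ci(per_actor, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for an accuracy difference, resampling actors.

    per_actor maps an actor id to a list of (baseline_correct, model_correct)
    pairs, one pair of 0/1 flags per test clip from that actor.
    """
    rng = random.Random(seed)
    actors = sorted(per_actor)
    diffs = []
    for _ in range(n_boot):
        # Draw as many actors as in the original set, with replacement,
        # so all clips from a given speaker enter or leave together.
        resampled = [rng.choice(actors) for _ in actors]
        pairs = [pair for actor in resampled for pair in per_actor[actor]]
        baseline_acc = sum(p[0] for p in pairs) / len(pairs)
        model_acc = sum(p[1] for p in pairs) / len(pairs)
        diffs.append(model_acc - baseline_acc)
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_boot)]
    upper = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Synthetic toy data: 10 hypothetical actors with 20 clips each.
toy_rng = random.Random(1)
toy_results = {
    actor: [
        (int(toy_rng.random() < 0.6), int(toy_rng.random() < 0.7))
        for _ in range(20)
    ]
    for actor in range(10)
}
ci_low, ci_high = actor_bootstrap_ci(toy_results)
```

Resampling whole actors rather than individual clips keeps the interval consistent with actor-independent evaluation, since clips from the same speaker are not independent observations.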
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC




