Entity Binding Failures in Speech LLM Reasoning: Diagnosis and Chain-of-Thought Intervention
Title: Tackling Entity Binding Failures in Speech LLM Reasoning: Diagnosis and Chain-of-Thought Intervention
Abstract:
Speech Large Language Models (SLLMs) currently lag behind their text-based counterparts when handling complex reasoning tasks. Our investigation reveals that this performance disparity does not stem from a generalized cognitive deficit. Through an evaluation of three distinct SLLMs, we demonstrate that speech-to-text (S2T) systems perform on par with or better than text-to-text (T2T) models in spatial, syntactic, and factual domains. However, in logical tasks that demand entity tracking, S2T accuracy drops to chance levels. We identify this specific decline as an entity binding failure, where continuous speech features lead models to lose precise associations between entities and their properties during implicit reasoning processes.
To address this issue, we introduce Entity-Aware Chain-of-Thought (EA-CoT), a method that compels SLLMs to explicitly list entities and link them to claims prior to reasoning. EA-CoT effectively closes the performance gap, achieving absolute accuracy improvements of up to 24.4 percentage points, even in scenarios where spoken names are misrecognized. Ablation studies confirm that these enhancements are driven solely by explicit semantic binding, suggesting that the observed modality gap is a resolvable bottleneck rather than an inherent limitation.
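The abstract describes EA-CoT only at a high level: list the entities, link them to claims, then reason. The following is a minimal sketch of what such a prompt wrapper might look like. The step wording, the function names `build_ea_cot_prompt` and `build_plain_prompt`, and the prompt layout are all assumptions made for illustration; the paper's actual prompt is not given here.

```python
# Hypothetical sketch of an entity-aware chain-of-thought prompt.
# The step wording below is an assumption, not the paper's actual prompt.

EA_COT_STEPS = (
    "Step 1: List every entity mentioned in the audio.",
    "Step 2: For each entity, write down the claims or properties attached to it.",
    "Step 3: Reason over the entity-claim list only, then give the final answer.",
)


def build_ea_cot_prompt(question: str) -> str:
    """Wrap a question with instructions that force explicit entity binding."""
    steps = "\n".join(EA_COT_STEPS)
    return f"{steps}\n\nQuestion: {question}\nAnswer:"


def build_plain_prompt(question: str) -> str:
    """Baseline prompt with no explicit binding step, for comparison."""
    return f"Question: {question}\nAnswer:"
```

In this sketch the binding steps come before the question so that the model writes out its entity list before it starts reasoning. Comparing outputs from `build_plain_prompt` and `build_ea_cot_prompt` on the same spoken input would be one way to probe whether explicit binding changes the answer.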
Source: arXiv
Generated at: 2026-06-04 00:00:00 UTC




