Bridging What the Model Thinks and How It Speaks: Expressive Speech Generation via Self-Aware Intent-Realization Alignment
Title: Connecting Internal Reasoning with Vocal Expression: Expressive Speech Synthesis Through Self-Aware Intent-Realization Alignment
Original: arXiv:2604.11424v2 Announce Type: replace Abstract: Speech Language Models (SLMs) exhibit strong semantic understanding, yet often fail to translate this capacity into expressive acoustic realization, producing speech with flattened prosody and misaligned emotion. We identify this mismatch as the semantic understanding-acoustic realization gap. Existing approaches typically rely on externally specified proxies, such as emotion labels or style prompts, which require annotations and struggle to capture dynamically evolving expressive intent throughout dialogue. To overcome these limitations, we propose SASLM (Self-Aware Speech Language Model), a proxy-free framework that bridges what the model thinks and how it speaks through self-aware intent-realization alignment: (1) Intent-Aware Bridging self-distills expressive intent from the model's own evolving semantic generation states via a Variational Information Bottleneck (VIB), thereby guiding expressive speech realization without external expressive supervision; while (2) Realization-Aware Alignment reflectively aligns generated acoustics with intended expression through self-reward optimization, progressively improving intent-realization consistency during speech generation. Despite using only 3B parameters and 800 hours of expressive speech data, SASLM achieves state-of-the-art performance on EchoMind among open-source systems, surpassing models over 10 times larger and approaching commercial systems.
Rewritten: Title: Aligning Cognitive Intent with Vocal Output: A Self-Aware Approach to Expressive Speech Generation
Rewritten: Abstract: Speech Language Models (SLMs) show strong semantic understanding, yet they often fail to carry that understanding into expressive acoustic output, producing speech with flattened prosody and emotion that does not match the content. We call this mismatch the "semantic understanding-acoustic realization gap." Existing methods typically depend on externally specified proxies such as emotion labels or style prompts, which require annotation and struggle to track expressive intent as it evolves over a dialogue. To address these limitations, we introduce SASLM (Self-Aware Speech Language Model), a proxy-free framework that connects what the model thinks with how it speaks through self-aware intent-realization alignment. The framework has two components: (1) Intent-Aware Bridging, which self-distills expressive intent from the model’s own evolving semantic generation states through a Variational Information Bottleneck (VIB) and uses it to guide speech realization without external expressive supervision; and (2) Realization-Aware Alignment, which applies self-reward optimization to align the generated acoustics with the intended expression, progressively improving intent-realization consistency during speech generation. With only 3B parameters and 800 hours of expressive speech data, SASLM achieves state-of-the-art performance on EchoMind among open-source systems, surpassing models more than 10 times larger and approaching commercial systems.
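Note: The abstract names a Variational Information Bottleneck but gives no architectural detail. As a rough illustration only, the sketch below shows the generic VIB recipe: a stochastic latent sampled with the reparameterization trick, plus a KL penalty toward a standard normal prior. It is not SASLM's implementation. The function names, the `beta` weight, and the toy inputs are all assumptions made for this example; in a real model, `mu` and `logvar` would be predicted from the semantic hidden states.

```python
import math
import random


def vib_bottleneck(mu, logvar, rng):
    """Generic VIB step (illustrative, not the paper's code).

    Samples z ~ N(mu, diag(exp(logvar))) via the reparameterization trick
    and returns it with the KL divergence from that Gaussian to N(0, I).
    """
    # z = mu + sigma * eps, with eps ~ N(0, 1) drawn per dimension
    z = [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, logvar)]
    # Closed-form KL for a diagonal Gaussian against a standard normal
    kl = 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, logvar))
    return z, kl


def vib_objective(task_loss, kl, beta=1e-3):
    """Task loss plus a beta-weighted compression penalty (beta is hypothetical)."""
    return task_loss + beta * kl


# Toy values standing in for encoder outputs (assumed, not from the paper).
rng = random.Random(0)
mu = [0.5, -1.0, 0.0]
logvar = [0.0, -0.5, 0.2]
z, kl = vib_bottleneck(mu, logvar, rng)
total = vib_objective(task_loss=1.0, kl=kl)
```

The KL term is what makes this a bottleneck: it pushes the latent toward carrying only as much information as the downstream task loss needs. In the paper's setting that downstream task would presumably be expressive speech realization.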
Source: arXiv Generated at: 2026-06-02 00:00:00 UTC





