Sparse Autoencoders for Interpretable Emotion Control in Text-to-Speech
Original: arXiv:2606.01479v1 (announce type: new). Abstract: Integrating large language models (LLMs) into text-to-speech (TTS) systems has improved speech expressiveness, yet interpretable emotional control remains challenging. Existing approaches primarily rely on external conditioning or global activation steering, offering limited insight into the internal representations underlying emotional control. In this work, we analyze emotion-related variation in the semantic hidden states of LLM-based TTS models using sparse autoencoders (SAEs) to identify sparse latent features. Our analysis shows that emotional variation is distributed across multiple sparse latent features, while intervening on a small subset enables interpretable emotion control. Building on this observation, we introduce a feature-level intervention framework for bidirectional emotion induction and suppression without modifying backbone parameters. We further show that distinct latent features are associated with specific acoustic attributes (e.g., pitch), suggesting that emotional expression arises from coordinated latent contributions rather than a single global shift. Empirically, steering these sparse latent features achieves comparable or superior emotion induction and suppression performance relative to global steering and existing TTS baselines.
Rewrite: Incorporating large language models (LLMs) into text-to-speech (TTS) systems has made synthesized speech more expressive, but interpretable control over emotion remains difficult. Current methods rely mainly on external conditioning or global activation steering, which reveal little about the internal representations behind emotional control. This study uses sparse autoencoders (SAEs) to analyze emotion-related variation in the semantic hidden states of LLM-based TTS models and to identify sparse latent features. Our findings indicate that emotional variation is distributed across multiple sparse latent features, yet intervening on a small subset of them enables interpretable emotion control. Building on this observation, we propose a feature-level intervention framework that can both induce and suppress emotions without modifying the backbone parameters. We also show that distinct latent features are associated with specific acoustic attributes, such as pitch, which suggests that emotional expression arises from coordinated latent contributions rather than a single global shift. Empirically, steering these sparse latent features matches or outperforms global steering and existing TTS baselines on emotion induction and suppression.
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC
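
To illustrate the feature-level intervention described in the abstract, here is a minimal sketch. It is not the authors' implementation. The abstract does not specify the SAE architecture, the layer it is attached to, or the exact intervention rule, so everything below is an assumption: a toy ReLU autoencoder with random weights (`SparseAutoencoder`), a hypothetical `steer` helper, and an arbitrary feature index 2 standing in for an emotion-related feature. The sketch clamps selected latent features to a target activation and adds only the resulting change in the decoded vector back to the hidden state, so no backbone weights are touched.

```python
import random


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class SparseAutoencoder:
    """Toy SAE (assumed form): z = ReLU(W_enc h + b_enc), h_hat = W_dec z + b_dec.

    Weights are random placeholders; a real SAE would be trained on
    hidden states with a sparsity penalty.
    """

    def __init__(self, d_model, d_latent, seed=0):
        rng = random.Random(seed)
        self.w_enc = [[rng.gauss(0, 0.5) for _ in range(d_model)]
                      for _ in range(d_latent)]
        self.b_enc = [0.0] * d_latent
        self.w_dec = [[rng.gauss(0, 0.5) for _ in range(d_latent)]
                      for _ in range(d_model)]
        self.b_dec = [0.0] * d_model

    def encode(self, h):
        pre = matvec(self.w_enc, h)
        return [max(0.0, p + b) for p, b in zip(pre, self.b_enc)]

    def decode(self, z):
        return [o + b for o, b in zip(matvec(self.w_dec, z), self.b_dec)]


def steer(sae, h, feature_targets):
    """Clamp chosen latent features and return the edited hidden state.

    feature_targets maps latent index -> target activation:
    a large value induces the feature, 0.0 suppresses it.
    Only the difference between the edited and unedited reconstructions
    is added to h, so the SAE's reconstruction error is not injected.
    """
    z = sae.encode(h)
    z_edited = list(z)
    for index, value in feature_targets.items():
        z_edited[index] = value
    base = sae.decode(z)
    edited = sae.decode(z_edited)
    return [x + (e - b) for x, e, b in zip(h, edited, base)]


# Hypothetical usage: feature 2 plays the role of an emotion-related feature.
sae = SparseAutoencoder(d_model=4, d_latent=8)
h = [0.3, -1.2, 0.8, 0.1]              # stand-in for one semantic hidden state
induced = steer(sae, h, {2: 5.0})      # push the feature up (induction)
suppressed = steer(sae, h, {2: 0.0})   # clamp the feature off (suppression)
```

Adding only the decoded difference, rather than replacing the hidden state with the SAE reconstruction, is one possible design choice; whether the paper does the same is not stated in the abstract. In a real system the edited state would be written back into the TTS model's forward pass at the chosen layer.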





