Sparse Autoencoders for Interpretable Emotion Control in Text-to-Speech

Original: arXiv:2606.01479v1 (announce type: new)

Abstract: Integrating large language models (LLMs) into text-to-speech (TTS) systems has improved speech expressiveness, yet interpretable emotional control remains challenging. Existing approaches primarily rely on external conditioning or global activation steering, offering limited insight into the internal representations underlying emotional control. In this work, we analyze emotion-related variation in the semantic hidden states of LLM-based TTS models using sparse autoencoders (SAEs) to identify sparse latent features. Our analysis shows that emotional variation is distributed across multiple sparse latent features, while intervening on a small subset enables interpretable emotion control. Building on this observation, we introduce a feature-level intervention framework for bidirectional emotion induction and suppression without modifying backbone parameters. We further show that distinct latent features are associated with specific acoustic attributes (e.g., pitch), suggesting that emotional expression arises from coordinated latent contributions rather than a single global shift. Empirically, steering these sparse latent features achieves comparable or superior emotion induction and suppression performance relative to global steering and existing TTS baselines.

Rewrite: Integrating large language models (LLMs) into text-to-speech (TTS) systems has made synthesized speech more expressive, but interpretable control over emotion remains difficult. Current methods rely mainly on external conditioning or global activation steering, which reveal little about the internal representations that drive emotional expression. This study uses sparse autoencoders (SAEs) to analyze emotion-related variation in the semantic hidden states of LLM-based TTS models and to identify sparse latent features. The analysis shows that emotional variation is distributed across multiple sparse latent features, yet intervening on a small subset is enough for interpretable emotion control. Building on this observation, the authors introduce a feature-level intervention framework that both induces and suppresses emotions without modifying the backbone parameters. They further show that distinct latent features are associated with specific acoustic attributes, such as pitch, which suggests that emotional expression arises from coordinated latent contributions rather than a single global shift. Empirically, steering these sparse latent features matches or outperforms global steering and existing TTS baselines on emotion induction and suppression.
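The abstract does not specify the SAE architecture, the layer it is attached to, or the exact form of the intervention, so the following is only a minimal sketch of what a feature-level intervention could look like, not the authors' implementation. It assumes a plain ReLU sparse autoencoder with random, untrained weights and a hypothetical helper `steer_features` that clamps chosen latents to a target activation and adds the decoded difference back to the hidden state. All names, dimensions, and feature indices here are illustrative assumptions.

```python
import numpy as np


class SparseAutoencoder:
    """Toy ReLU sparse autoencoder with random, untrained weights.

    Assumption for illustration only: a real SAE would be trained on the
    TTS model's hidden states with a sparsity penalty.
    """

    def __init__(self, d_model, d_latent, seed=0):
        rng = np.random.default_rng(seed)
        self.W_enc = rng.normal(scale=0.1, size=(d_model, d_latent))
        self.b_enc = np.zeros(d_latent)
        self.W_dec = rng.normal(scale=0.1, size=(d_latent, d_model))
        self.b_dec = np.zeros(d_model)

    def encode(self, h):
        # Non-negative latent activations for each hidden state.
        return np.maximum(0.0, (h - self.b_dec) @ self.W_enc + self.b_enc)

    def decode(self, z):
        # Map latent activations back to the hidden-state space.
        return z @ self.W_dec + self.b_dec


def steer_features(sae, h, feature_ids, value):
    """Clamp the selected latents to `value` and patch the hidden state.

    value > 0 pushes the features on (induction); value = 0 switches them
    off (suppression). Only the decoded difference is added to `h`, so the
    part of `h` that the SAE fails to reconstruct is left untouched and no
    backbone parameter is modified.
    """
    z = sae.encode(h)
    z_steered = z.copy()
    z_steered[..., feature_ids] = value
    return h + (sae.decode(z_steered) - sae.decode(z))


if __name__ == "__main__":
    sae = SparseAutoencoder(d_model=16, d_latent=64)
    hidden = np.random.default_rng(1).normal(size=(5, 16))  # 5 token states
    emotion_features = [3, 10]  # hypothetical feature indices
    induced = steer_features(sae, hidden, emotion_features, value=2.0)
    suppressed = steer_features(sae, hidden, emotion_features, value=0.0)
    print(induced.shape, suppressed.shape)
```

Clamping is one of several plausible choices; scaling the existing activations or adding a fixed offset along the decoder direction would be alternatives. A global steering baseline would instead add a single vector to every hidden state, without going through per-feature latents.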


Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC
