Global News Digest

arXiv

DSA-Tokenizer: Disentangled Semantic-Acoustic Tokenization via Flow Matching-based Hierarchical Fusion

Abstract:

Speech tokenizers are a core component of fully discrete Speech Large Language Models. Existing tokenizers, however, fall short in one of three ways: they encode only semantics, they fuse semantic content and acoustic style inseparably, or they disentangle the two only partially. To address this, we introduce DSA-Tokenizer, which explicitly separates speech into discrete semantic tokens and discrete acoustic tokens by applying a distinct optimization constraint to each. Semantic tokens are supervised by Automatic Speech Recognition (ASR) so that they capture linguistic content, whereas acoustic tokens are optimized for mel-spectrogram restoration so that they capture style.

We further propose a hierarchical Flow Matching decoder and a training strategy that combines joint reconstruction with context inpainting. Together these allow both high-fidelity reconstruction and voice cloning across different utterances. To accelerate inference, we distill the Diffusion Transformer (DiT) decoder down to four sampling steps, and we improve synthesis quality by fine-tuning with a Generative Adversarial Network (GAN). Experiments indicate that DSA-Tokenizer achieves robust semantic-acoustic disentanglement, supporting reliable, controllable voice cloning and efficient, high-fidelity generation with low Word Error Rate (WER) and Character Error Rate (CER). The results also suggest that disentangled tokenization is a more effective interface for downstream speech generation with large models. Audio samples are available at https://anonymous.4open.science/w/DSA_Tokenizer_demo/.
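
To give a rough sense of what few-step sampling from a Flow Matching decoder involves, here is a minimal toy sketch. It is not the authors' implementation: the abstract does not state the probability path, the solver, or the network interface, so the sketch assumes a straight-line path from noise to target and a fixed-step Euler solver. The names `euler_sample` and `make_oracle_velocity` are hypothetical, and the "oracle" velocity stands in for a trained (distilled) decoder network.

```python
from typing import Callable, List

Vector = List[float]
VelocityFn = Callable[[Vector, float], Vector]


def euler_sample(velocity: VelocityFn, x0: Vector, num_steps: int = 4) -> Vector:
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed-step Euler.

    `num_steps=4` mirrors the four sampling steps mentioned in the abstract;
    the choice of Euler integration is an assumption of this sketch.
    """
    x = list(x0)
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt  # t takes values 0, dt, ..., 1 - dt, so it never reaches 1
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_oracle_velocity(target: Vector) -> VelocityFn:
    """Hypothetical stand-in for a trained network.

    For the straight-line path x_t = (1 - t) * x0 + t * target, the exact
    velocity at a point x on the path is (target - x) / (1 - t).
    """
    def velocity(x: Vector, t: float) -> Vector:
        return [(yi - xi) / (1.0 - t) for xi, yi in zip(x, target)]

    return velocity


# Toy 3-dimensional "mel frame": start from noise, move toward the target.
noise = [0.5, -1.0, 2.0]
target = [1.0, 0.0, -1.0]
sample = euler_sample(make_oracle_velocity(target), noise, num_steps=4)
```

With an exact velocity field on a straight path, the Euler steps land on the target regardless of the step count. A learned velocity field is only approximately straight, which is one reason a model may be distilled before the number of steps is cut to a handful.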


Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC
