DSA-Tokenizer: Disentangled Semantic-Acoustic Tokenization via Flow Matching-based Hierarchical Fusion
Abstract:
Fully discrete Speech Large Language Models rely heavily on speech tokenizers as a fundamental component. However, current approaches face limitations: they either focus exclusively on semantic encoding, merge semantic content with acoustic style in an inseparable manner, or fail to achieve complete disentanglement between semantics and acoustics. To address these challenges, we introduce DSA-Tokenizer, a model that explicitly separates speech into distinct discrete semantic and acoustic tokens through specific optimization constraints. In this framework, semantic tokens are guided by Automatic Speech Recognition (ASR) supervision to capture linguistic information, whereas acoustic tokens are optimized for mel-spectrogram restoration to encode stylistic features.
We further propose a hierarchical decoder based on Flow Matching, trained with a strategy that combines joint reconstruction with context inpainting. This design lets the model perform both high-fidelity reconstruction and voice cloning across different utterances. To accelerate inference, we distill the DiT decoder down to four sampling steps and then improve synthesis quality by fine-tuning with a Generative Adversarial Network (GAN). Experimental results indicate that DSA-Tokenizer achieves robust semantic-acoustic disentanglement, enabling reliable and controllable voice cloning as well as efficient, high-fidelity generation with low Word Error Rate (WER) and Character Error Rate (CER). Our findings further suggest that disentangled tokenization is a more effective interface for downstream speech generation with large models. Audio samples can be accessed at https://anonymous.4open.science/w/DSA_Tokenizer_demo/.
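To make the idea of few-step Flow Matching sampling concrete, the sketch below integrates a velocity field from t = 0 to t = 1 with four fixed Euler steps. It is only an illustration under stated assumptions, not the paper's decoder: `euler_sample` and `make_toy_velocity` are hypothetical names, the Euler solver is an assumed choice, and the toy velocity field stands in for a trained (and distilled) network that would also be conditioned on semantic and acoustic tokens.

```python
from typing import Callable, List

Vector = List[float]
VelocityFn = Callable[[Vector, float], Vector]


def euler_sample(velocity: VelocityFn, x0: Vector, num_steps: int = 4) -> Vector:
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # evaluate the field at the start of each step
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_toy_velocity(target: Vector) -> VelocityFn:
    """Hypothetical stand-in for a trained decoder.

    Returns the velocity of the straight-line path that carries any
    starting point to a single fixed target at t=1.
    """
    def velocity(x: Vector, t: float) -> Vector:
        # t never reaches 1.0 inside euler_sample, so the division is safe.
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]
    return velocity


if __name__ == "__main__":
    noise = [0.3, 0.1, -0.7]   # stands in for a Gaussian noise sample
    target = [1.0, -2.0, 0.5]  # stands in for a mel-spectrogram frame
    print(euler_sample(make_toy_velocity(target), noise, num_steps=4))
```

With a straight-line path, a handful of Euler steps is enough to reach the target in this toy setting; a real decoder has a learned, generally curved field, which is why reducing the step count typically requires distillation.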
Source: arXiv Generated at: 2026-06-02 00:00:00 UTC




