HoliTok: A Continuous Holistic Tokenization with Robust Dual Capabilities of Speech Generation and Understanding
Title: HoliTok: Continuous Holistic Tokenization with Dual Capabilities for Robust Speech Generation and Understanding
Abstract:
A unified speech foundation model requires a holistic tokenization that is both learnable by language models and decodable into high-fidelity waveforms. Current speech tokenizers frequently struggle to meet both criteria at once, which often forces more complex architectures and intricate training procedures. To address this, we introduce HoliTok, a continuous holistic speech tokenization model tailored for integrated generation and understanding tasks. HoliTok compresses 48 kHz audio input into a compact sequence of 128-dimensional latent vectors at a rate of 25 Hz. The model is trained with a progressive strategy designed to balance signal-level fidelity, semantic integration, and latent learnability. Building on this tokenization, we develop a unified AR+DiT architecture that handles both speech synthesis and recognition, using the same latent sequence for generation-only and joint generation-understanding tasks. Our experiments show that HoliTok delivers competitive reconstruction quality, improves learnability for high-quality and controllable synthesis, and is the only representation among those tested that functions robustly within our unified architecture without supplementary optimization techniques. These findings position HoliTok as a strong speech tokenizer and a foundational interface for unified spoken language modeling. Code is available at: https://github.com/bovod-sjtu/HoliTok.
Source: arXiv Generated at: 2026-06-02 00:00:00 UTC
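
The rates quoted in the abstract (48 kHz input, 128-dimensional latents at 25 Hz) imply a fixed amount of audio per latent frame. The sketch below works through that arithmetic only. The function name `latent_shape` and the floor-division framing with no padding are assumptions made for illustration; the abstract does not state how HoliTok handles partial frames, and this is not the repository's API.

```python
def latent_shape(duration_s, sample_rate_hz=48_000, frame_rate_hz=25, latent_dim=128):
    """Estimate the latent sequence shape for a clip of the given duration.

    Assumes non-overlapping frames and no padding (hypothetical framing).
    Returns (num_frames, latent_dim, samples_per_frame).
    """
    # Number of waveform samples summarized by one latent vector.
    samples_per_frame = sample_rate_hz // frame_rate_hz
    # Total samples in the clip, then whole frames that fit in it.
    num_samples = int(round(duration_s * sample_rate_hz))
    num_frames = num_samples // samples_per_frame
    return num_frames, latent_dim, samples_per_frame


def scalar_reduction(sample_rate_hz=48_000, frame_rate_hz=25, latent_dim=128):
    """Ratio of waveform samples per second to latent scalars per second."""
    return sample_rate_hz / (frame_rate_hz * latent_dim)


frames, dim, hop = latent_shape(10.0)
print(frames, dim, hop)      # frames, latent width, samples per frame
print(scalar_reduction())    # waveform samples per latent scalar
```

Under these assumptions, each latent vector covers 1920 samples (40 ms of audio), a 10-second clip maps to 250 frames of width 128, and the latent stream carries 15 times fewer scalar values per second than the raw waveform.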





