CleanCodec: Efficient and Robust Speech Tokenization via Perceptually Guided Encoding
Abstract:
Neural audio codecs are a critical component of speech processing pipelines: they convert audio signals into discrete tokens for downstream modeling. However, current codecs struggle to balance reconstruction fidelity against token efficiency. They tend to encode perceptually irrelevant details, such as recording artifacts and background noise, at the expense of linguistically and acoustically significant content. To address this, we reframe audio tokenization as a selective information bottleneck problem and introduce CleanCodec, a denoising audio codec designed to retain only perceptually salient features while discarding imperceptible information. Operating at just 12.5 tokens per second, CleanCodec sets a new standard for tokenization efficiency and significantly surpasses existing codecs in both speaker similarity and speech intelligibility. Furthermore, evaluations on downstream tasks, including text-to-speech and voice conversion, show improved performance and up to 17 times faster inference.
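To make the stated token rate concrete, the following minimal sketch works out what 12.5 tokens per second implies for sequence length. It assumes a single token stream at that rate. The function names and the 50 tokens-per-second comparison rate are hypothetical, chosen only for illustration, and are not taken from the paper.

```python
import math

# Rate stated in the abstract; everything else here is an illustrative assumption.
TOKENS_PER_SECOND = 12.5


def token_count(duration_s: float, rate: float = TOKENS_PER_SECOND) -> int:
    """Tokens needed to cover `duration_s` seconds of audio, rounding up."""
    return math.ceil(duration_s * rate)


def seconds_per_token(rate: float = TOKENS_PER_SECOND) -> float:
    """Audio duration represented by one token."""
    return 1.0 / rate


if __name__ == "__main__":
    utterance_s = 10.0
    hypothetical_rate = 50.0  # assumed comparison rate, not from the paper

    print(token_count(utterance_s))                     # tokens at 12.5 tokens/s
    print(token_count(utterance_s, hypothetical_rate))  # tokens at the assumed rate
    print(seconds_per_token())                          # seconds of audio per token
```

Under these assumptions, a 10-second utterance maps to 125 tokens, and each token spans 80 ms of audio. Shorter sequences are one plausible source of faster downstream inference, although the sketch does not model the reported 17-times speedup.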
Source: arXiv
Generated at: 2026-06-04 00:00:00 UTC




