Chatterbox-Flash: Prior-Calibrated Block Diffusion for Streaming Zero-Shot TTS
Abstract:
We introduce Chatterbox-Flash, a zero-shot text-to-speech system developed by adapting a pre-trained autoregressive TTS decoder into a block-diffusion framework. This architecture facilitates the parallel generation of tokens within individual blocks while maintaining a block-by-block streaming capability. Our analysis reveals that directly applying standard block-diffusion decoding to discrete speech tokens compromises output quality. This degradation stems from a long-tailed token distribution, which skews parallel position selection toward a limited set of high-frequency tokens. To address this issue without altering the model’s architecture, we propose two inference-time strategies: prior-calibrated scoring, which adjusts for the block-level marginal token distribution, and an early-decoding schedule that halts iterations adaptively based on calibrated confidence levels. In evaluations on standard zero-shot TTS benchmarks, Chatterbox-Flash achieves high-fidelity synthesis that matches the performance of leading autoregressive and non-autoregressive baselines. Furthermore, it supports streaming inference with a time-to-first-packet comparable to streaming AR systems, alongside a significantly reduced real-time factor. Source code and audio examples can be accessed at https://github.com/resemble-ai/chatterbox-flash.
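The abstract does not give the exact scoring rule, so the following is only a minimal sketch of how prior-calibrated position selection with an adaptive early exit could work. It rests on several assumptions that do not come from the paper: the block-level prior is estimated by averaging the per-position distributions, the calibrated score is `log p(token | position) - alpha * log prior(token)`, and the helper names (`block_marginal`, `calibrated_scores`, `select_positions`) and parameters (`alpha`, `k`, `threshold`) are hypothetical.

```python
import math


def block_marginal(probs):
    """Estimate a block-level token prior by averaging per-position distributions.

    Assumption: the prior is a simple mean over positions in the block.
    """
    n_pos, vocab = len(probs), len(probs[0])
    return [sum(row[v] for row in probs) / n_pos for v in range(vocab)]


def calibrated_scores(probs, alpha=1.0, eps=1e-9):
    """Return (argmax token, calibrated score) for every position.

    Assumed rule: log p(token | position) - alpha * log prior(token),
    which discounts tokens that are likely everywhere in the block.
    """
    prior = block_marginal(probs)
    scored = []
    for row in probs:
        tok = max(range(len(row)), key=row.__getitem__)
        score = math.log(row[tok] + eps) - alpha * math.log(prior[tok] + eps)
        scored.append((tok, score))
    return scored


def select_positions(probs, masked, k, threshold, alpha=1.0):
    """Choose which still-masked positions to commit in this iteration."""
    scored = calibrated_scores(probs, alpha)
    # Early exit (assumed form): commit everything once all masked
    # positions clear the calibrated-confidence threshold.
    if all(scored[i][1] >= threshold for i in masked):
        return sorted(masked)
    # Otherwise commit only the k positions with the highest calibrated score.
    ranked = sorted(masked, key=lambda i: scored[i][1], reverse=True)
    return sorted(ranked[:k])


# Toy block: 3 positions, vocabulary of 3 tokens; token 0 is high-frequency.
probs = [
    [0.80, 0.10, 0.10],
    [0.75, 0.15, 0.10],
    [0.20, 0.10, 0.70],
]
picked = select_positions(probs, masked=[0, 1, 2], k=1, threshold=1.0)
```

In this toy block, raw confidence would favor position 0, whose top token is the frequent token 0 with probability 0.80. After dividing out the block prior, position 2 ranks first, because its top token is rare elsewhere in the block. Lowering `threshold` far enough makes every position qualify, so the whole block is committed in a single iteration.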
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC





