arXiv

Chatterbox-Flash: Prior-Calibrated Block Diffusion for Streaming Zero-Shot TTS

June 2, 2026 · Deokjin Seo, Gangin Park, Kihyun Nam

Abstract:

We introduce Chatterbox-Flash, a zero-shot text-to-speech system developed by adapting a pre-trained autoregressive TTS decoder into a block-diffusion framework. This architecture facilitates the parallel generation of tokens within individual blocks while maintaining a block-by-block streaming capability. Our analysis reveals that directly applying standard block-diffusion decoding to discrete speech tokens compromises output quality. This degradation stems from a long-tailed token distribution, which skews parallel position selection toward a limited set of high-frequency tokens. To address this issue without altering the model’s architecture, we propose two inference-time strategies: prior-calibrated scoring, which adjusts for the block-level marginal token distribution, and an early-decoding schedule that halts iterations adaptively based on calibrated confidence levels. In evaluations on standard zero-shot TTS benchmarks, Chatterbox-Flash achieves high-fidelity synthesis that matches the performance of leading autoregressive and non-autoregressive baselines. Furthermore, it supports streaming inference with a time-to-first-packet comparable to streaming AR systems, alongside a significantly reduced real-time factor. Source code and audio examples can be accessed at https://github.com/resemble-ai/chatterbox-flash.
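To make the two inference-time strategies concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes that the calibrated score of a position is the log-probability of its top token minus a weighted log of the block-level marginal (here estimated as the average predicted distribution over the block), and that early decoding commits every remaining position once all calibrated scores clear a threshold. The function names, the `alpha` weight, the `threshold` value, and the toy probabilities are all hypothetical.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one position's logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def block_marginal_log_prior(log_probs):
    """Log of the predicted distribution averaged over all positions in the block.
    Assumption: this average stands in for the block-level marginal."""
    n = len(log_probs)
    vocab = len(log_probs[0])
    return [math.log(sum(math.exp(row[v]) for row in log_probs) / n)
            for v in range(vocab)]

def calibrated_candidates(logits_block, alpha=1.0):
    """Return (token, calibrated_score) per position.
    Assumption: score = log p(token | position) - alpha * log marginal(token)."""
    log_probs = [log_softmax(row) for row in logits_block]
    log_prior = block_marginal_log_prior(log_probs)
    out = []
    for row in log_probs:
        token = max(range(len(row)), key=row.__getitem__)
        out.append((token, row[token] - alpha * log_prior[token]))
    return out

def select_positions(candidates, masked, k, threshold):
    """Pick which masked positions to commit in this iteration.
    Assumption: if every masked position clears the threshold, commit all of
    them and stop early; otherwise commit only the k highest-scoring ones."""
    if all(candidates[i][1] >= threshold for i in masked):
        return sorted(masked)
    ranked = sorted(masked, key=lambda i: candidates[i][1], reverse=True)
    return sorted(ranked[:k])

# Toy block: positions 0 and 1 favor a frequent token (id 0),
# position 2 favors a rarer token (id 2) with lower raw probability.
toy_probs = [[0.6, 0.3, 0.1],
             [0.6, 0.3, 0.1],
             [0.3, 0.2, 0.5]]
toy_logits = [[math.log(p) for p in row] for row in toy_probs]
toy_candidates = calibrated_candidates(toy_logits)
first_commit = select_positions(toy_candidates, [0, 1, 2], k=1, threshold=5.0)
early_commit = select_positions(toy_candidates, [0, 1, 2], k=1, threshold=0.1)
```

In this toy block, raw confidence would rank positions 0 and 1 first because the frequent token has probability 0.6, whereas the calibrated score ranks position 2 first because its token is rare within the block. With a low threshold, all three positions are committed in a single iteration, which is the early-stopping behavior under the assumptions stated above.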
