Efficient ASR Training with Conversations that Never Happened
Title: Streamlining ASR Training via Simulated Interactions
Abstract:
The development of conversational Automatic Speech Recognition (ASR) systems for low-resource languages and specialized fields is often hindered by a lack of domain-specific, multi-speaker training data. To address this challenge, we introduce an augmentation framework that constructs scenario-based dialogues complete with participant metadata. This process involves mapping speaker characteristics to Text-to-Speech (TTS) voice profiles and integrating these synthesized utterances into simulated, speaker-aware conversations.
We assessed five different Large Language Model (LLM) families across three configurations (single-generator, fixed-budget mixture, and scale-up) while employing the same FastConformer-Large training protocol for each. Our evaluation used the Hungarian BEA-Dialogue benchmark corpus, but the method itself is not tied to Hungarian: it can be applied to any language for which the resources required by each component are available.
The findings reveal that while synthetic conversations consistently enhance speech recognition accuracy, the magnitude of these improvements is heavily influenced by the choice of generator and the composition of the data. Notably, our most extensive training setup, which combined just 67 hours of authentic conversational data with 636 hours of simulated material, outperformed a zero-shot model trained on 2,700 hours of real Hungarian speech. These results suggest that LLM-generated dialogues, when synthesized using TTS, serve as a highly effective and practical supplement to genuine conversational datasets for training speech models.
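The pipeline summarized in the abstract (participant metadata mapped to TTS voice profiles, then utterances placed into a speaker-aware conversation) can be illustrated with a minimal sketch. Everything named here is a hypothetical stand-in rather than the authors' implementation: the metadata fields (`gender`, `age_group`), the voice identifiers, and the word-count duration estimate that substitutes for real TTS audio lengths are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Speaker:
    name: str
    gender: str      # hypothetical metadata field
    age_group: str   # hypothetical metadata field


@dataclass(frozen=True)
class Turn:
    speaker: str
    text: str


# Hypothetical voice catalogue; a real system would list the TTS engine's voices.
VOICES = {
    ("female", "adult"): "voice_f_adult_01",
    ("male", "adult"): "voice_m_adult_01",
    ("female", "senior"): "voice_f_senior_01",
    ("male", "senior"): "voice_m_senior_01",
}
DEFAULT_VOICE = "voice_neutral_01"


def assign_voices(speakers):
    """Map each speaker's metadata to a voice profile, with a fallback."""
    return {
        s.name: VOICES.get((s.gender, s.age_group), DEFAULT_VOICE)
        for s in speakers
    }


def build_manifest(turns, voice_by_speaker, seconds_per_word=0.4, gap=0.3):
    """Place turns on a timeline, keeping speaker and voice labels per segment.

    Durations are estimated from word counts (an assumption); a real
    pipeline would use the length of the synthesized audio instead.
    """
    manifest = []
    start = 0.0
    for turn in turns:
        duration = len(turn.text.split()) * seconds_per_word
        manifest.append({
            "speaker": turn.speaker,
            "voice": voice_by_speaker[turn.speaker],
            "text": turn.text,
            "start": round(start, 2),
            "end": round(start + duration, 2),
        })
        start += duration + gap
    return manifest


speakers = [
    Speaker("Anna", "female", "adult"),
    Speaker("Béla", "male", "senior"),
]
turns = [
    Turn("Anna", "Jó napot kívánok"),
    Turn("Béla", "Jó napot"),
]
voices = assign_voices(speakers)
manifest = build_manifest(turns, voices)
for segment in manifest:
    print(segment)
```

Keeping the speaker label and time span on every segment is what makes the simulated conversation speaker-aware: the same manifest can drive both audio concatenation and the reference transcript used in training.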
Source: arXiv Generated at: 2026-06-03 00:00:00 UTC



