Read What You Hear: Reference-Free Hypotheses Evaluation with Acoustic Discrepancy
Abstract: Traditional automatic speech recognition (ASR) evaluation depends on reference transcriptions, while reference-free methods usually rely on internal confidence scores or auxiliary language models rather than on the audio itself. To address this gap, we introduce READ (Reference-free Hypothesis Evaluation with Acoustic Discrepancy), a new metric that evaluates ASR hypotheses directly against the audio signal, prioritizing acoustic grounding. READ uses a pretrained autoregressive text-to-speech (TTS) model to compute the conditional likelihood of the speech tokens given a text hypothesis, thereby quantifying the fine-grained acoustic discrepancy between the spoken audio and the written text. Notably, READ can be applied to hypothesis refinement without any additional training. Our experiments show that READ correlates with distinct recognition errors and improves ASR performance, yielding a relative error rate reduction of up to 20%, with the largest improvements in noisy conditions.
Source: arXiv
Generated at: 2026-06-04 00:00:00 UTC
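
The scoring idea described in the abstract (the conditional likelihood of speech tokens given a text hypothesis) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the abstract does not state the exact formula, so the sketch uses a length-normalized negative log-likelihood as the discrepancy, and the names `acoustic_discrepancy`, `rerank`, and `toy_logprob_fn` are hypothetical. In particular, `toy_logprob_fn` is a stand-in for a pretrained autoregressive TTS model, not a real one.

```python
import math
from typing import Callable, List, Sequence

# Hypothetical interface: given a text hypothesis and the discrete speech
# tokens of the utterance, return log p(s_t | s_<t, text) for every token.
LogProbFn = Callable[[str, Sequence[int]], List[float]]


def acoustic_discrepancy(
    hypothesis: str, speech_tokens: Sequence[int], logprob_fn: LogProbFn
) -> float:
    """Mean negative log-likelihood of the speech tokens given the text.

    Assumption: length normalization is our choice for this sketch; the
    source does not specify how the likelihood is aggregated.
    """
    log_probs = logprob_fn(hypothesis, speech_tokens)
    if not log_probs or len(log_probs) != len(speech_tokens):
        raise ValueError("expected one log-probability per speech token")
    return -sum(log_probs) / len(log_probs)


def rerank(
    hypotheses: Sequence[str], speech_tokens: Sequence[int], logprob_fn: LogProbFn
) -> List[str]:
    """Order hypotheses from lowest to highest discrepancy (best first)."""
    return sorted(
        hypotheses,
        key=lambda h: acoustic_discrepancy(h, speech_tokens, logprob_fn),
    )


def toy_logprob_fn(hypothesis: str, speech_tokens: Sequence[int]) -> List[float]:
    """Toy stand-in, NOT a TTS model.

    Pretends each speech token is the character code of what was spoken and
    gives a high probability when the hypothesis has that character at the
    same position, and a low probability otherwise.
    """
    log_probs = []
    for i, token in enumerate(speech_tokens):
        matches = i < len(hypothesis) and ord(hypothesis[i]) == token
        log_probs.append(math.log(0.9 if matches else 0.01))
    return log_probs


if __name__ == "__main__":
    spoken = [ord(c) for c in "read"]  # toy "speech tokens"
    for hyp in rerank(["reed", "read", "red"], spoken, toy_logprob_fn):
        score = acoustic_discrepancy(hyp, spoken, toy_logprob_fn)
        print(f"{hyp}: {score:.3f}")
```

Because the scoring function only consumes per-token log-probabilities, a real setup would replace `toy_logprob_fn` with a wrapper around whatever pretrained TTS model is available, which matches the abstract's point that no additional training is needed for hypothesis refinement.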





