arXiv

Benchmarking Speech-to-Speech Translation Models

June 3, 2026 · Alkis Koudounas, Hayato Futami, Quentin Jodelet, Osamu Take, Shinji Watanabe, Emiru Tsunoo

Title: Establishing Standards for Speech-to-Speech Translation Model Evaluation

Abstract:

While Speech-to-Speech Translation (S2ST) has seen rapid technological progress, the field lacks standardized offline evaluation protocols. Existing studies often use disjoint sets of metrics, which makes direct comparison between systems difficult or impossible. To address this, we present COMPASS, a comprehensive and reproducible benchmarking framework that consolidates 46 distinct metrics spanning eight evaluation dimensions. We apply this framework to 1,248 model-language configurations drawn from the FLEURS and CVSS datasets, covering both cascaded and end-to-end architectures across ten language pairs.

Our findings reveal that different architectural approaches have distinct strengths. The gap between the best and worst systems exceeds 30% on dimensions such as naturalness and speaker preservation, whereas translation quality often varies by only a few points. Consequently, relying on a single metric systematically misrepresents overall system quality. Using correlation filtering, we reduce the 46 metrics to 10 per direction. Notably, three evaluation axes require different metrics depending on the translation direction (for instance, TER/UTMOS versus ChrF++/NISQA-MOS). The reduced subset preserves system rankings (Spearman's $\rho > 0.80$) while cutting evaluation time by a factor of approximately 2.5.
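The abstract does not spell out how the correlation filtering is performed, so the following is only a minimal sketch of one common variant: a greedy pass that drops any metric whose rank correlation with an already kept metric reaches a threshold. The function names, the 0.9 threshold, the greedy ordering, and the toy scores are all assumptions made for illustration, not the procedure or data used in COMPASS.

```python
def ranks(values):
    """Return 1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def filter_metrics(scores, threshold=0.9):
    """Greedy filter (assumed variant): keep a metric only if its |rho|
    with every previously kept metric is below the threshold."""
    kept = []
    for name in scores:
        if all(abs(spearman(scores[name], scores[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Invented scores for five hypothetical systems under three hypothetical metrics.
toy_scores = {
    "metric_a": [0.10, 0.20, 0.30, 0.40, 0.50],
    "metric_b": [1.0, 2.1, 2.9, 4.2, 5.0],  # ranks systems exactly like metric_a
    "metric_c": [3.0, 1.0, 5.0, 2.0, 4.0],  # ranks systems differently
}

selected = filter_metrics(toy_scores)
print(selected)
```

In this toy case `metric_b` orders the systems identically to `metric_a` and is dropped as redundant, while `metric_c` carries different ranking information and is kept. A greedy pass depends on the order in which metrics are visited, so a real pipeline would need an explicit policy for which member of a correlated group to retain.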

Furthermore, human validation studies conducted across dubbing, podcast, and medical contexts demonstrated that generic Mean Opinion Score (MOS) predictors are ineffective at forecasting listener preference. In contrast, domain-specific top-tier metrics showed strong alignment with human judgment ($\rho \geq 0.90$). We make COMPASS available as a foundational tool for conducting domain-aware evaluation in S2ST research.
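To make the $\rho$ comparison concrete, here is a small worked example of Spearman's rank correlation between an automatic metric and human preference, using the tie-free formula $\rho = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}$. All numbers and variable names below are invented for illustration and assume no tied scores; they are not results from the study.

```python
def rank_no_ties(values):
    """Return 1-based ranks, assuming all values are distinct."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks


def spearman_no_ties(x, y):
    """Spearman's rho via 1 - 6 * sum(d^2) / (n * (n^2 - 1)); valid without ties."""
    n = len(x)
    rank_x, rank_y = rank_no_ties(x), rank_no_ties(y)
    sum_d_squared = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d_squared / (n * (n * n - 1))


# Invented per-system scores for five hypothetical systems.
human_preference = [4.5, 3.1, 3.9, 2.2, 4.8]
domain_metric = [0.81, 0.55, 0.70, 0.40, 0.90]  # orders systems like the listeners do
generic_mos = [3.6, 3.9, 3.2, 3.8, 3.5]         # orders systems differently

rho_domain = spearman_no_ties(domain_metric, human_preference)
rho_generic = spearman_no_ties(generic_mos, human_preference)
print(rho_domain, rho_generic)
```

With these made-up numbers the hypothetical domain metric reproduces the listener ranking exactly, while the hypothetical generic predictor is negatively correlated with it. With only five systems such values are very noisy, so this illustrates the computation rather than any conclusion.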
