arXiv

Benchmarking Speech-to-Speech Translation Models

Title: Establishing Standards for Speech-to-Speech Translation Model Evaluation

Abstract:

While Speech-to-Speech Translation (S2ST) has seen rapid technological progress, the field lacks standardized offline evaluation protocols. Existing studies often report disjoint sets of metrics, which makes direct comparison between systems difficult or impossible. To address this, we present COMPASS, a comprehensive and reproducible benchmarking framework that consolidates 46 distinct metrics spanning eight evaluation dimensions. We apply the framework to 1,248 model-language configurations drawn from the FLEURS and CVSS datasets, covering both cascaded and end-to-end architectures across ten language pairs.

Our findings reveal that different architectural approaches have distinct advantages. The gap between the best and worst systems exceeds 30% on dimensions such as naturalness and speaker preservation, whereas differences in translation quality are relatively minor, often only a few points. Consequently, relying on a single metric systematically misrepresents overall system quality. Using correlation filtering, we reduce the 46 metrics to 10 per direction. Notably, three evaluation axes require different metrics depending on the translation direction (for instance, TER/UTMOS versus ChrF++/NISQA-MOS). The reduced subset preserves system rankings (Spearman's $\rho > 0.80$) while cutting evaluation time by a factor of approximately 2.5.
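The abstract does not describe the filtering procedure in detail. As an illustration only, the following is a minimal sketch of one common approach: a greedy filter that drops any metric whose Spearman correlation with an already kept metric reaches a threshold, followed by a check of how well the reduced set preserves the system ranking. The function names, the threshold of 0.9, the mean-rank aggregation, and the toy scores are all assumptions made for this example and are not taken from COMPASS.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of positions i..j, shifted to 1-based
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return 0.0  # a constant metric carries no ranking information
    return cov / (vx * vy) ** 0.5


def filter_metrics(scores, threshold=0.9):
    """Greedy filter (assumed procedure): keep a metric only if its
    |rho| with every metric kept so far is below the threshold."""
    kept = []
    for name, vals in scores.items():
        if all(abs(spearman(vals, scores[k])) < threshold for k in kept):
            kept.append(name)
    return kept


def mean_rank(scores, names):
    """Aggregate systems by their mean rank over the chosen metrics.
    Assumes every metric is higher-is-better; an error-rate metric
    such as TER would need to be negated first."""
    per_metric = [ranks(scores[n]) for n in names]
    return [mean(col) for col in zip(*per_metric)]


# Hypothetical scores: one list per metric, one entry per system.
scores = {
    "metric_a": [0.10, 0.40, 0.35, 0.80, 0.60],
    "metric_b": [1.0, 4.1, 3.4, 8.2, 6.3],  # orders systems exactly like metric_a
    "metric_c": [5.0, 1.0, 4.0, 2.0, 3.0],
}

kept = filter_metrics(scores)
full_ranking = mean_rank(scores, list(scores))
reduced_ranking = mean_rank(scores, kept)
agreement = spearman(full_ranking, reduced_ranking)
print(kept, round(agreement, 3))
```

In this toy data `metric_b` is redundant with `metric_a` and is dropped, while `metric_c` is kept because it orders the systems differently. The `agreement` value plays the role of the ranking-integrity check described above: a filter would be considered acceptable only if this correlation stays high on real data.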

Furthermore, human validation studies conducted across dubbing, podcast, and medical contexts show that generic Mean Opinion Score (MOS) predictors fail to predict listener preference. In contrast, the top-performing domain-specific metrics align strongly with human judgment ($\rho \geq 0.90$). We release COMPASS as a foundation for domain-aware evaluation in S2ST research.


Source: arXiv Generated at: 2026-06-03 00:00:00 UTC
