Model-Based Quality Assessment for Massively Multilingual Parallel Data
Title: Model-Driven Quality Evaluation for Extensive Multilingual Parallel Corpora
Abstract: Massively multilingual bitext datasets commonly suffer from two problems: non-parallel sentence pairs and poor translation quality. We therefore split model-based assessment into two separate tasks: parallelism assessment with multilingual embeddings and reference-free quality estimation (QE). For parallelism, we evaluate four embedding models on FLORES-200 and BOUQuET retrieval tasks, covering 6,654 source-to-target combinations within our target language-pair inventory. For QE, we evaluate nine reference-free evaluators against professional FLORES-200 translations across 41,412 ordered source-to-target directions. We find that no single model is consistently reliable across all translation directions. Simple ensembles of QE models can dilute the signal from the strongest models, while documented target-language coverage correlates strongly with higher QE scores. These results suggest that assessing multilingual parallel data is best treated as a direction-specific routing and calibration problem, given that no universal metric can adequately serve all languages.
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC





