arXiv

SN-WER: Script-Normalized WER for Multi-Script Indic ASR Evaluation

June 2, 2026 · Priyaranjan Pattnayak

Title: SN-WER: A Script-Normalized Approach to WER in Multi-Script Indic ASR Assessment

Abstract:

While Word Error Rate (WER) remains the standard metric for evaluating automatic speech recognition (ASR) systems, it inflates error counts when the reference and hypothesis texts represent the same words in different scripts. This mismatch is particularly common in multilingual settings, where ASR models frequently output romanized text. To address this, we introduce Script-Normalized WER (SN-WER), an evaluation-only scoring procedure that requires no training. SN-WER transliterates both the reference and the hypothesis into a language-specific canonical script before WER is computed.
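The normalize-then-score idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the word-level edit distance is the textbook Levenshtein recurrence, and `TOY_ROMAN_TO_DEVANAGARI`, `toy_to_canonical`, and `sn_wer` are hypothetical names introduced here. The three-word lookup table merely stands in for a real transliterator, which the abstract does not specify.

```python
import unicodedata
from typing import Callable, Dict, List


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Standard WER: word edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)


# ASSUMPTION: a toy lookup standing in for a real romanized-to-Devanagari
# transliterator. It only knows three words.
TOY_ROMAN_TO_DEVANAGARI: Dict[str, str] = {
    "mera": "मेरा",
    "naam": "नाम",
    "hai": "है",
}


def toy_to_canonical(text: str) -> str:
    """Map known romanized tokens to Devanagari; leave other tokens unchanged."""
    mapped = " ".join(
        TOY_ROMAN_TO_DEVANAGARI.get(tok.lower(), tok) for tok in text.split()
    )
    return unicodedata.normalize("NFC", mapped)


def sn_wer(reference: str, hypothesis: str,
           to_canonical: Callable[[str], str]) -> float:
    """Transliterate both sides into the canonical script, then compute WER."""
    return wer(to_canonical(reference), to_canonical(hypothesis))


reference = "मेरा नाम राम है"
script_mismatch = "mera naam राम hai"   # same words, partly romanized
real_error = "mera naam shyam hai"      # one genuinely wrong word

print(wer(reference, script_mismatch))                       # 0.75
print(sn_wer(reference, script_mismatch, toy_to_canonical))  # 0.0
print(sn_wer(reference, real_error, toy_to_canonical))       # 0.25
```

In this toy case plain WER charges three substitutions for words that were recognized correctly but written in Latin script, while the normalized score drops to zero. The genuinely wrong word still counts as an error after normalization, which is the behavior the abstract attributes to SN-WER.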

Our assessment of SN-WER encompasses five Indic languages, two distinct datasets, and three different ASR models. Results from curated FLEURS data demonstrate that SN-WER can shrink artificially widened model performance gaps by as much as 12%. Conversely, on the noisier Common Voice dataset, the reductions in error rates are either minimal or inconsistent, suggesting that these discrepancies stem from actual recognition failures rather than mere script mismatches.

Controlled stress tests show that SN-WER removes 67% of the WER inflation caused by artificial romanization. Lexical-substitution controls indicate that SN-WER remains about as sensitive to semantic errors as standard WER, with a ΔSN-WER/ΔWER ratio of approximately 1.09. The method is robust to the choice of transliterator and normalization scheme, with token-collision rates below 0.1% in the tested Indic settings. We contend that SN-WER should be adopted as a companion metric alongside WER and Character Error Rate (CER) for script-agnostic ASR evaluation, particularly where transcripts feed downstream tasks such as search, indexing, or multilingual large language model pipelines.
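The abstract reports these three diagnostics without giving formulas, so the sketch below shows one plausible way to compute them. All three definitions are assumptions made for illustration, the function names are hypothetical, and the numbers in the usage lines are invented rather than taken from the paper.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Set


def inflation_recovered(wer_clean: float, wer_romanized: float,
                        sn_wer_romanized: float) -> float:
    """ASSUMED definition: share of romanization-induced WER inflation
    that disappears once scoring is script-normalized."""
    inflation = wer_romanized - wer_clean
    if inflation <= 0:
        raise ValueError("romanization did not inflate WER")
    return (wer_romanized - sn_wer_romanized) / inflation


def sensitivity_ratio(delta_sn_wer: float, delta_wer: float) -> float:
    """ASSUMED definition: increase in SN-WER divided by increase in WER
    after injecting the same lexical substitutions."""
    if delta_wer == 0:
        raise ValueError("delta_wer must be non-zero")
    return delta_sn_wer / delta_wer


def token_collision_rate(vocabulary: Iterable[str],
                         to_canonical: Callable[[str], str]) -> float:
    """ASSUMED definition: fraction of distinct tokens that share a
    canonical form with at least one other distinct token."""
    groups: Dict[str, Set[str]] = defaultdict(set)
    for token in set(vocabulary):
        groups[to_canonical(token)].add(token)
    total = sum(len(members) for members in groups.values())
    if total == 0:
        raise ValueError("vocabulary is empty")
    colliding = sum(len(m) for m in groups.values() if len(m) > 1)
    return colliding / total


# Invented numbers, for illustration only.
print(round(inflation_recovered(0.10, 0.40, 0.25), 3))  # 0.5
print(round(sensitivity_ratio(0.12, 0.10), 3))          # 1.2
# str.lower is only a stand-in for a real script normalizer.
print(token_collision_rate(["Ram", "ram", "sita", "gita"], str.lower))  # 0.5
```

Under these assumed definitions, a recovered share near 1 would mean almost all of the inflation was a script artifact, a sensitivity ratio near 1 would mean normalization does not hide real substitutions, and a low collision rate would mean distinct words rarely merge after normalization.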
