arXiv

NormEval: A Unified Multi-Metric Framework for Evaluating Semantic Fidelity in Text Normalization

June 2, 2026 · Md Abdullah Al Kafi, Raka Moni, Walayat Hussain

Abstract:

Stemming and lemmatization are essential building blocks within natural language processing (NLP) workflows. However, as new normalization tools emerge for a variety of languages, the methods used to assess them remain disjointed. Current evaluations often rely on isolated metrics such as Compression Ratio, downstream accuracy, or sequence-to-sequence prediction scores. This fragmented approach fails to differentiate between useful vocabulary reduction and detrimental semantic distortion. Given that text normalization supports critical intelligent systems in high-stakes fields like legal document analysis and clinical decision support, a rigorous and principled evaluation methodology is crucial.

To address these challenges, this paper introduces NormEval, a comprehensive, multilingual evaluation framework. The framework integrates five complementary metrics: Compression Ratio (CR), Model Performance Delta (MPD), Information Retention Score (IRS), Algorithm Effectiveness Score (AES), and Average Normalized Levenshtein Distance (ANLD). Together, these metrics evaluate normalization quality across three distinct dimensions: macro-level efficiency, downstream utility, and micro-level morphological fidelity.
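As a rough illustration of the macro-level efficiency dimension, the sketch below computes a vocabulary-based Compression Ratio. The abstract does not state the formula, so the definition used here (one minus the ratio of normalized to original vocabulary size) is an assumption, and `compression_ratio` and `toy_stemmer` are hypothetical names introduced only for this example.

```python
from typing import Callable, Iterable


def compression_ratio(tokens: Iterable[str], normalize: Callable[[str], str]) -> float:
    """Fraction of the vocabulary removed by normalization.

    ASSUMPTION: CR = 1 - |V_normalized| / |V_original|. The paper may
    define CR differently (for example as a plain size ratio).
    """
    vocab = set(tokens)
    if not vocab:
        raise ValueError("tokens must contain at least one item")
    normalized_vocab = {normalize(token) for token in vocab}
    return 1.0 - len(normalized_vocab) / len(vocab)


def toy_stemmer(token: str) -> str:
    """Hypothetical suffix stripper used only to exercise the metric."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters of stem to avoid over-stripping.
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


if __name__ == "__main__":
    words = ["walk", "walks", "walked", "walking", "talk", "talks"]
    print(f"CR = {compression_ratio(words, toy_stemmer):.3f}")
```

Under this assumed definition, a higher value means more vocabulary reduction, but the number alone cannot tell whether the merged forms still mean the same thing, which is the gap the other metrics are meant to cover.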

A central component of this framework is the "Safety Gate" hypothesis, which positions ANLD as an intrinsic structural hygiene check. By leveraging character-level divergence ($\Delta$), ANLD exposes aggressive mutations that might otherwise be masked by macro-level embeddings or downstream task performance. Comprehensive ablation studies conducted on both English and Bangla datasets indicate that every component of the framework is necessary: removing any single metric weakens the evaluation in at least one dimension and ultimately leads to inaccurate algorithm rankings.
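A minimal sketch of how a character-level check such as ANLD could be computed is given below. The abstract does not specify the normalization, so dividing each edit distance by the length of the longer string is an assumption, and the function names `levenshtein` and `anld` are hypothetical.

```python
from typing import Iterable, Tuple


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def anld(pairs: Iterable[Tuple[str, str]]) -> float:
    """Average normalized Levenshtein distance over (original, normalized) pairs.

    ASSUMPTION: each distance is divided by the longer string's length,
    giving a value in [0, 1]; the paper's exact normalization may differ.
    """
    scores = []
    for original, normalized in pairs:
        longest = max(len(original), len(normalized))
        scores.append(0.0 if longest == 0 else levenshtein(original, normalized) / longest)
    if not scores:
        raise ValueError("pairs must contain at least one item")
    return sum(scores) / len(scores)


if __name__ == "__main__":
    sample = [("running", "run"), ("cats", "cat"), ("walk", "walk")]
    print(f"ANLD = {anld(sample):.3f}")
```

With this normalization, a score near 0 means the normalizer barely alters surface forms, while a score near 1 signals heavy character-level mutation. A gate in the spirit of the "Safety Gate" hypothesis could flag any normalizer whose score exceeds a chosen threshold, although the abstract does not give a threshold value.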

Source: arXiv Generated at: 2026-06-02 00:00:00 UTC