A Systematic Benchmark of Machine Transliteration Models for the Tajik-Farsi Language Pair: A Comparative Study from Rule-Based to Transformer Architectures
Title: Benchmarking Machine Transliteration for Tajik and Farsi: A Comparative Evaluation of Rule-Based and Transformer Architectures
Abstract: This study presents the first extensive comparative evaluation of modern machine learning approaches to transliteration between Tajik (written in Cyrillic script) and Farsi (Persian, written in the Perso-Arabic script). A central contribution is a new, verified parallel corpus compiled from heterogeneous sources: crowdsourced initiatives, lexicographic entries, parallel texts of the "Shahnameh" and "Masnavi-i Ma'navi," diplomatic documents, official terminology lists, and transliterated correspondences. The full dataset comprises 328,253 sentence pairs, from which a representative subset of 40,000 pairs was drawn by stratified random sampling.
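As a rough illustration of the sampling step, the following minimal Python sketch draws a fixed-size subset in proportion to the size of each stratum. The abstract does not specify the strata or the allocation rule; using the source of each pair as the stratum, proportional allocation, and the function name `stratified_sample` are all assumptions made for this example.

```python
import random
from collections import defaultdict


def stratified_sample(pairs, total, seed=0):
    """Draw `total` items from `pairs`, proportionally to each source's size.

    `pairs` is a list of (source_label, tajik_text, farsi_text) tuples.
    The source label as stratum is an assumption, not the paper's stated design.
    """
    rng = random.Random(seed)

    # Group the pairs by their source label (the stratum).
    by_source = defaultdict(list)
    for item in pairs:
        by_source[item[0]].append(item)

    n = len(pairs)
    # Ideal (fractional) quota per stratum, then its integer floor.
    exact = {s: total * len(items) / n for s, items in by_source.items()}
    quota = {s: int(x) for s, x in exact.items()}

    # Hand out the remaining slots to the strata with the largest remainders.
    leftover = total - sum(quota.values())
    by_remainder = sorted(exact, key=lambda s: exact[s] - quota[s], reverse=True)
    for s in by_remainder[:leftover]:
        quota[s] += 1

    # Sample without replacement inside each stratum.
    sample = []
    for s, items in by_source.items():
        sample.extend(rng.sample(items, quota[s]))
    return sample
```

With 328,253 input pairs and `total=40000`, each source would contribute roughly 12% of its pairs under this allocation; other schemes (for example, equal allocation per source) are equally compatible with the abstract's wording.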
The experiments evaluated six model categories: a rule-based baseline, an LSTM with attention, a character-level Transformer, a Grapheme-to-Phoneme (G2P) Transformer trained from scratch, pre-trained multilingual models (mBART and mT5 with LoRA), and the byte-level ByT5 model. ByT5 performed best, reaching chrF++ scores of 87.4 for Tajik-to-Farsi transliteration and 80.1 for Farsi-to-Tajik. The G2P Transformer also clearly outperformed mBART (72.3 versus 62.2 chrF++) despite data constraints. By contrast, models relying on subword tokenization, such as mT5, performed poorly, with chrF++ scores below 18.5. These results indicate that, for Tajik-Farsi transliteration, byte- or character-level architectures are more effective than conventional multilingual Seq2Seq models that depend on subword tokenization.
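To make concrete what "byte-level" input means for this language pair, the short Python sketch below encodes one word in both scripts as UTF-8 bytes. It is only an illustration, not the authors' preprocessing: the example word and the helper name `to_bytes` are chosen here, and a real byte-level tokenizer such as ByT5's additionally reserves a few ids for special tokens.

```python
def to_bytes(text):
    """Return the UTF-8 byte values of `text` as a list of ints in 0..255."""
    return list(text.encode("utf-8"))


# "book": Tajik Cyrillic spelling and its Perso-Arabic counterpart.
tajik = "китоб"
farsi = "\u06a9\u062a\u0627\u0628"  # kaf, te, alef, be

tajik_bytes = to_bytes(tajik)
farsi_bytes = to_bytes(farsi)

# Every Cyrillic and Perso-Arabic letter here takes two bytes in UTF-8,
# so the byte sequence is twice as long as the character sequence.
print(len(tajik), len(tajik_bytes))
print(len(farsi), len(farsi_bytes))

# The mapping is lossless: decoding the bytes restores the original text.
assert bytes(tajik_bytes).decode("utf-8") == tajik
assert bytes(farsi_bytes).decode("utf-8") == farsi
```

Because every input is expressed over the same 256 byte values, no Cyrillic or Perso-Arabic character can fall outside the vocabulary, at the cost of longer sequences than a character-level or subword encoding would produce.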
Source: arXiv Generated at: 2026-06-02 00:00:00 UTC





