arXiv

A Systematic Benchmark of Machine Transliteration Models for the Tajik-Farsi Language Pair: A Comparative Study from Rule-Based to Transformer Architectures

Title: Benchmarking Machine Transliteration for Tajik and Farsi: A Comparative Evaluation of Rule-Based and Transformer Architectures

Abstract: This study presents the first extensive comparative evaluation of contemporary machine learning approaches to transliteration between Tajik (written in Cyrillic) and Persian (Farsi, written in Arabic script). A central contribution is the construction and verification of a new parallel corpus compiled from heterogeneous sources: crowdsourced initiatives, lexicographic entries, parallel texts of the "Shahnameh" and the "Masnavi-i Ma'navi," diplomatic documents, official terminology lists, and transliterated correspondences. The full corpus contains 328,253 sentence pairs, from which a representative subset of 40,000 pairs was drawn by stratified random sampling.
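The stratified sampling step can be illustrated with a minimal sketch. The abstract does not say which strata were used, so grouping by source type (the `source` field), the `stratified_sample` helper, and proportional allocation are all assumptions made for illustration, not the authors' procedure.

```python
import random
from collections import defaultdict

def stratified_sample(pairs, key, total, seed=0):
    """Draw roughly `total` items, allocating to each stratum in
    proportion to its share of the corpus (hypothetical helper)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for pair in pairs:
        strata[key(pair)].append(pair)
    sample = []
    for name in sorted(strata):  # sorted for reproducibility
        items = strata[name]
        # Proportional allocation; keep at least one item per stratum.
        k = min(len(items), max(1, round(total * len(items) / len(pairs))))
        sample.extend(rng.sample(items, k))
    return sample

# Toy corpus: source labels are assumed strata, not taken from the paper.
corpus = (
    [{"source": "poetry", "id": i} for i in range(600)]
    + [{"source": "dictionary", "id": i} for i in range(300)]
    + [{"source": "diplomatic", "id": i} for i in range(100)]
)
subset = stratified_sample(corpus, key=lambda p: p["source"], total=100)
```

Because each stratum's quota is rounded independently, the subset size can differ from `total` by a few items on real data; a production version would need to redistribute the remainder.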

The experiments evaluated six model categories: a rule-based baseline, an LSTM with attention, a character-level Transformer, a Grapheme-to-Phoneme (G2P) Transformer trained from scratch, pre-trained multilingual models (mBART and mT5 with LoRA), and the byte-level ByT5 model. ByT5 shows a decisive advantage, reaching chrF++ scores of 87.4 for Tajik-to-Farsi transliteration and 80.1 for the reverse direction. Notably, the G2P Transformer substantially outperformed mBART (72.3 versus 62.2 chrF++) despite data constraints. By contrast, models that rely on subword tokenization, such as mT5, performed poorly, with chrF++ scores below 18.5. These results indicate that, for accurate Tajik-Farsi transliteration, byte- or character-level architectures are markedly more effective than conventional multilingual Seq2Seq models that depend on subword tokenization.
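To make the reported metric concrete, the following is a simplified sketch of a character n-gram F-score in the spirit of chrF. It is not the evaluation code used in the study: full chrF++ also includes word unigrams and bigrams, and reference implementations differ in details such as whitespace handling and how the n-gram orders are averaged. The function names, the choice to drop spaces, and the example words are assumptions for illustration.

```python
from collections import Counter

def char_ngrams(text, n):
    """Count character n-grams, ignoring spaces (a simplifying assumption)."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))

def simple_chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified chrF: average n-gram precision and recall over orders
    1..max_n, then combine them into an F-beta score on a 0-100 scale."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # string too short for this n-gram order
        overlap = sum((hyp & ref).values())  # clipped matching n-grams
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    # beta = 2 weights recall more heavily than precision.
    return 100 * (1 + beta**2) * p * r / (beta**2 * p + r)

exact = simple_chrf("китоб", "китоб")      # identical strings
partial = simple_chrf("китоб", "китобҳо")  # output is missing a suffix
```

Because the score is built from character overlap, an output that is wrong in only a few letters still receives partial credit, which is why this family of metrics suits transliteration better than word-level exact match.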


Source: arXiv Generated at: 2026-06-02 00:00:00 UTC
