arXiv

From Outliers to Errors: Auditing Pali-to-English LLM Translations with Multi-Reference Adjudication

Title: Reframing Outliers as Potential Errors: A Multi-Reference Adjudication Framework for Auditing Pali-to-English LLM Translations

Abstract

Relying on single-score translation metrics often obscures the distinction between acceptable linguistic variation and actual mistakes. This issue is particularly pronounced in the translation of classical languages, where a single source passage may legitimately yield multiple valid English interpretations. To address this, we conducted an audit of Pali-to-English translations generated by four leading large language models: GPT-5.5, Claude Sonnet 4.6, Gemini 3.1 Pro, and Grok 4.3. The study examined 1,700 excerpts from the Pali Canon. Instead of adhering to a single "gold standard," we established a local reference envelope using three respected human translations by Bhikkhu Sujato, Thanissaro Bhikkhu, and Bhikkhu Bodhi.

We used the normalized embedding drift of each candidate translation relative to the reference centroid as a triage signal, not as a definitive error marker. Candidates with a drift score above 1.5 (1,203 instances in total) were then evaluated by a blinded panel of three LLM judges. The panel was calibrated on a validation set of 300 instances previously adjudicated by the human authors.
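The triage step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say how drift is normalized, so the sketch assumes Euclidean distance from the reference centroid divided by the mean distance of the references from that centroid. The function names and the toy vectors are hypothetical; only the 1.5 threshold comes from the text.

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length embedding vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def euclidean(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def normalized_drift(candidate, references):
    """Distance of the candidate from the reference centroid, divided by the
    mean distance of the references from that centroid.
    ASSUMPTION: this normalization is illustrative; the source does not define it."""
    c = centroid(references)
    spread = sum(euclidean(r, c) for r in references) / len(references)
    if spread == 0:
        raise ValueError("references are identical; drift is undefined")
    return euclidean(candidate, c) / spread

def needs_review(candidate, references, threshold=1.5):
    """Flag a candidate for adjudication; a flag is not an error verdict."""
    return normalized_drift(candidate, references) > threshold

# Toy 2-D vectors standing in for the three human reference embeddings.
refs = [[1.0, 0.0], [-1.0, 0.0], [0.0, 0.0]]
print(needs_review([2.0, 0.0], refs))  # far from the reference envelope
print(needs_review([0.5, 0.0], refs))  # inside the reference envelope
```

Under this assumed definition, a drift of 1.0 means the candidate sits about as far from the centroid as a typical human reference does, which is why a threshold above 1 only queues a passage for review.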

The analysis yielded two significant findings. First, embedding drift predicts error severity but does not by itself indicate an error. The rate of major errors among high-drift candidates increased steadily, rising from 7.9% in the 1.5–2.0 drift range to 51.6% for drift scores exceeding 3.0. Notably, approximately 80% of the outliers in the 1.5–2.0 range were judged to be valid translation variations. Second, distinctions between models were most evident in the high-drift segment. GPT-5.5 had the lowest rate of major errors in this tail, although its confidence intervals overlapped with those of Claude Sonnet 4.6 and Gemini 3.1 Pro. In contrast, Grok 4.3 produced the highest volume of outliers and recorded the highest major-error rate in the tail, reaching 74.4% for drift scores above 3.0, with an overall tail error rate of 27.6%.
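To make the overlap statement concrete, the sketch below computes a confidence interval for a major-error rate and checks whether two intervals overlap. The abstract does not say which interval the authors used; the Wilson score interval here is an assumption, and the counts in the usage lines are hypothetical placeholders, not figures from the study.

```python
import math

def wilson_interval(errors, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%).
    ASSUMPTION: the interval type is illustrative; the source does not name one."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = errors / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half

def intervals_overlap(a, b):
    """True when two closed intervals (low, high) share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]

# Hypothetical counts for two models' high-drift tails (placeholders only).
model_a = wilson_interval(errors=20, total=250)
model_b = wilson_interval(errors=30, total=280)
print(model_a, model_b, intervals_overlap(model_a, model_b))
```

Overlapping intervals mean the observed gap between two models' tail error rates could plausibly be sampling noise, so a lower point estimate alone does not establish a real difference.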

The most prevalent categories of major errors—such as omissions, truncations, and mistakes involving doctrinal terminology—are precisely the types of failures most likely to mislead readers of religious texts. This study proposes a reusable audit methodology for classical-to-modern translation: establish a local reference envelope based on multiple human translators, employ embedding drift to prioritize content for review, and adjudicate the flagged outliers rather than automatically classifying them as errors.
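The adjudication step of this methodology can be sketched as a simple vote over the judge panel. The abstract does not state how the three judges' verdicts were combined, so strict majority voting, the label names, and the fallback to human review are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical label set; the source names only "major errors" and
# "valid translation variations" explicitly.
LABELS = {"valid_variation", "minor_error", "major_error"}

def adjudicate(verdicts):
    """Combine judge verdicts by strict majority.
    ASSUMPTION: the aggregation rule is illustrative, not the authors' stated method.
    Returns the majority label, or "needs_human_review" when no label
    receives more than half of the votes."""
    unknown = set(verdicts) - LABELS
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    label, votes = Counter(verdicts).most_common(1)[0]
    return label if votes > len(verdicts) / 2 else "needs_human_review"

print(adjudicate(["major_error", "major_error", "valid_variation"]))
print(adjudicate(["major_error", "minor_error", "valid_variation"]))
```

Keeping "valid_variation" as a first-class verdict reflects the point of the audit: a flagged outlier is a candidate for review, and many such candidates turn out to be acceptable renderings.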


Source: arXiv Generated at: 2026-06-02 00:00:00 UTC
