arXiv

From Outliers to Errors: Auditing Pali-to-English LLM Translations with Multi-Reference Adjudication

June 2, 2026 · Máté Metzger, Nadnapang Phophichit, Hansa Dhammahaso

Title: Reframing Outliers as Potential Errors: A Multi-Reference Adjudication Framework for Auditing Pali-to-English LLM Translations

Abstract

Relying on single-score translation metrics often obscures the distinction between acceptable linguistic variation and actual mistakes. This issue is particularly pronounced in the translation of classical languages, where a single source passage may legitimately yield multiple valid English interpretations. To address this, we conducted an audit of Pali-to-English translations generated by four leading large language models: GPT-5.5, Claude Sonnet 4.6, Gemini 3.1 Pro, and Grok 4.3. The study examined 1,700 excerpts from the Pali Canon. Instead of adhering to a single "gold standard," we established a local reference envelope using three respected human translations by Bhikkhu Sujato, Thanissaro Bhikkhu, and Bhikkhu Bodhi.

We utilized the normalized embedding drift of each candidate translation relative to the reference centroid as a triage mechanism rather than a definitive error marker. Candidates exhibiting a drift score above 1.5 (totaling 1,203 instances) were subsequently evaluated by a blinded panel of three LLM judges. This panel was calibrated using a validation set of 300 instances previously adjudicated by human authors.
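The triage step above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say how drift is normalized, so the sketch assumes cosine distance to the reference centroid divided by the mean distance of the references themselves to that centroid. The function names and the toy two-dimensional vectors are hypothetical; real inputs would be sentence embeddings from whichever encoder the study used.

```python
import math

TRIAGE_THRESHOLD = 1.5  # cut-off quoted in the abstract


def centroid(vectors):
    """Component-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def normalized_drift(candidate, references):
    """Distance of the candidate from the reference centroid, scaled by the
    references' own mean distance to it (ASSUMED normalization)."""
    c = centroid(references)
    spread = sum(cosine_distance(r, c) for r in references) / len(references)
    if spread == 0.0:
        raise ValueError("references are identical; drift is undefined")
    return cosine_distance(candidate, c) / spread


def needs_adjudication(candidate, references, threshold=TRIAGE_THRESHOLD):
    """Flag a candidate for review; a flag is a triage signal, not an error."""
    return normalized_drift(candidate, references) > threshold
```

Under this assumed normalization, a drift of 1.0 means the candidate sits about as far from the centroid as a typical human reference does, so the 1.5 threshold reads as "noticeably outside the envelope the three translators span".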

The analysis yielded two main findings. First, embedding drift predicts error severity but does not by itself establish that a translation is wrong. Among high-drift candidates, the major-error rate rose steadily from 7.9% in the 1.5–2.0 drift range to 51.6% for drift scores above 3.0, while approximately 80% of the outliers in the 1.5–2.0 range were judged to be valid translation variations. Second, differences between models were clearest in the high-drift tail. GPT-5.5 had the lowest major-error rate in this tail, although its confidence intervals overlapped with those of Claude Sonnet 4.6 and Gemini 3.1 Pro. In contrast, Grok 4.3 produced the most outliers and had the highest major-error rate in the tail: 27.6% overall, rising to 74.4% for drift scores above 3.0.
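Per-band error rates and their intervals of the kind reported above could be computed along these lines. This is a sketch under stated assumptions: the abstract does not name its interval method, so a 95% Wilson score interval is used here as one common choice; the middle band (2.0 to 3.0) and the half-open band edges are also assumptions, since only the outer ranges are quoted. All function names are hypothetical.

```python
import math

# Outer bands follow the abstract; the middle band and the (low, high]
# edge convention are ASSUMPTIONS made for this sketch.
BANDS = [(1.5, 2.0), (2.0, 3.0), (3.0, math.inf)]


def band_of(drift):
    """Return the (low, high] band containing a drift score, or None."""
    for low, high in BANDS:
        if low < drift <= high:
            return (low, high)
    return None


def wilson_interval(errors, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = errors / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return (max(0.0, centre - half), min(1.0, centre + half))


def major_error_rates(items):
    """items: iterable of (drift, is_major_error) pairs from adjudication.
    Returns {band: (rate, (ci_low, ci_high))} for every non-empty band."""
    counts = {band: [0, 0] for band in BANDS}  # [major errors, total]
    for drift, is_major in items:
        band = band_of(drift)
        if band is None:
            continue  # below the triage threshold, never adjudicated
        counts[band][0] += int(is_major)
        counts[band][1] += 1
    return {
        band: (e / n, wilson_interval(e, n))
        for band, (e, n) in counts.items()
        if n > 0
    }
```

Two models whose intervals for the same band overlap, as reported for GPT-5.5, Claude Sonnet 4.6, and Gemini 3.1 Pro, cannot be confidently ranked on that band from the observed rates alone.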

The most prevalent categories of major errors—such as omissions, truncations, and mistakes involving doctrinal terminology—are precisely the types of failures most likely to mislead readers of religious texts. This study proposes a reusable audit methodology for classical-to-modern translation: establish a local reference envelope based on multiple human translators, employ embedding drift to prioritize content for review, and adjudicate the flagged outliers rather than automatically classifying them as errors.
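The final step of the methodology, adjudicating flagged outliers instead of counting them as errors, might look like the sketch below. The abstract does not state how the three judges' verdicts are combined, so a strict majority vote with a fallback to human review is assumed here. The label names are hypothetical placeholders for the paper's categories.

```python
from collections import Counter

# Hypothetical label set; the paper's actual taxonomy may differ.
LABELS = {"valid_variation", "minor_error", "major_error"}


def adjudicate(judge_labels):
    """Return the strict-majority label from a judge panel, or
    'needs_human_review' when no label wins more than half the votes."""
    if not judge_labels:
        raise ValueError("at least one judge label is required")
    unknown = set(judge_labels) - LABELS
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    label, votes = Counter(judge_labels).most_common(1)[0]
    if votes > len(judge_labels) / 2:
        return label
    return "needs_human_review"
```

With three judges, any two matching verdicts decide the outcome, and a three-way split is escalated. Keeping "valid_variation" as a first-class outcome is what separates this audit from treating every high-drift candidate as a mistake.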
