Translation Heads: Disentangling meaning from language in LLM-based machine translation
Abstract: Mechanistic Interpretability (MI) aims to explain the internal workings of neural networks, but the scale of Large Language Models (LLMs) has so far limited MI research on Machine Translation (MT) to word-level analyses. This study takes a mechanistic view of sentence-level MT, focusing on attention heads to determine how LLMs internally encode and distribute translation functions. We decompose MT into two subtasks: target language identification (generating text in the correct language) and sentence equivalence (preserving the meaning of the source sentence). Analyzing three open-source model families across 20 translation directions, we find that distinct, sparse sets of attention heads specialize in each subtask. Building on this finding, we construct subtask-specific steering vectors. Modifying just 1% of the relevant heads yields instruction-free MT performance comparable to instruction-based prompting, whereas ablating these heads selectively disrupts the corresponding translation function.
Source: arXiv
Generated at: 2026-06-04 00:00:00 UTC
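
Illustration (not from the paper): the abstract describes two head-level interventions, steering a small fraction of heads and ablating them. The sketch below shows what such interventions could look like on a toy array of per-head outputs. Everything in it is an assumption made for illustration: the function names, the array layout `(n_layers, n_heads, d_head)`, the use of a per-head relevance score to pick the top 1%, and zero-ablation as the ablation method. The abstract does not say how heads are scored or how the steering vectors are computed, so both are treated as given inputs here.

```python
import numpy as np


def select_top_heads(scores, fraction=0.01):
    """Return (layer, head) pairs for the top `fraction` of heads by score.

    `scores` is a hypothetical (n_layers, n_heads) relevance matrix.
    """
    k = max(1, int(round(fraction * scores.size)))
    # Sort the flattened scores in descending order and keep the first k.
    flat = np.argsort(scores, axis=None)[::-1][:k]
    return [tuple(int(i) for i in np.unravel_index(f, scores.shape)) for f in flat]


def steer_heads(head_out, heads, steering, alpha=1.0):
    """Add a scaled steering vector to the outputs of the selected heads.

    `head_out` and `steering` both have shape (n_layers, n_heads, d_head).
    """
    out = head_out.copy()
    for layer, head in heads:
        out[layer, head] += alpha * steering[layer, head]
    return out


def ablate_heads(head_out, heads):
    """Zero out the selected heads (zero-ablation, an assumed choice)."""
    out = head_out.copy()
    for layer, head in heads:
        out[layer, head] = 0.0
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_layers, n_heads, d_head = 4, 25, 8       # 100 heads in total
    scores = rng.random((n_layers, n_heads))   # placeholder relevance scores
    head_out = rng.normal(size=(n_layers, n_heads, d_head))
    steering = rng.normal(size=(n_layers, n_heads, d_head))

    heads = select_top_heads(scores, fraction=0.01)  # 1% of 100 heads = 1 head
    steered = steer_heads(head_out, heads, steering, alpha=2.0)
    ablated = ablate_heads(head_out, heads)
    print("selected heads:", heads)
```

In a real model these operations would be applied to attention-head activations during the forward pass (for example through hooks), and the selection would be done separately for each of the two subtasks.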




