How Far Do Auto-Interpretation Labels Generalize: A Controlled Study Across Languages, Scripts, and Rewordings
Title: The Limits of Auto-Interpretation Labels: A Controlled Investigation Across Languages, Scripts, and Paraphrases
Abstract:
Sparse autoencoders (SAEs) are increasingly used to interpret language models, with automatically generated natural-language descriptions serving as the main way to understand what individual features do. This study asks how robust those labels are: does a feature labeled with a given concept track that concept consistently across languages, writing systems, and phrasings? Using Serbian digraphia as a controlled setting (the same language is written in both Latin and Cyrillic scripts, related by deterministic transliteration), we first show that the sets of SAE features activated by the same content across languages, scripts, and wordings overlap substantially, with a peak Jaccard similarity of 0.57 against a random baseline of 0.13. This points to the existence of genuine cross-lingual semantic features.
However, when we test whether auto-interpretation labels remain accurate in these settings, they frequently fall short. Features with semantic labels fail to capture the corresponding meaning in Serbian up to four times more often than in English. The labels also perform worse on Serbian Cyrillic than on Serbian Latin, even though the two are deterministic transliterations of one another. Because the two forms differ only in script, this gap suggests that the errors track how well each script is represented in the training data rather than any difference in meaning. The gap widens in deeper network layers, yet the labels themselves give no indication of when they are inaccurate. These findings imply that auto-interpretation labels may be mirroring a feature’s response to well-represented inputs rather than reflecting the underlying concept itself.
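As a rough illustration of the overlap metric mentioned in the abstract, the sketch below computes the Jaccard similarity between two sets of active SAE features. The abstract does not say how an "active" set is defined, so the thresholding rule, the function names, and the toy activation values here are all assumptions made for illustration, not the study's actual procedure or data.

```python
from typing import Iterable, Set


def active_features(activations: Iterable[float], threshold: float = 0.0) -> Set[int]:
    """Return the indices of features whose activation exceeds `threshold`.

    Assumption: a feature counts as active if its activation is strictly
    above the threshold. The source does not specify this rule.
    """
    return {i for i, a in enumerate(activations) if a > threshold}


def jaccard(a: Set[int], b: Set[int]) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B|, taken as 0.0 if both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Toy, made-up activation vectors for the same sentence in two scripts.
latin_acts = [0.0, 1.2, 0.0, 0.7, 0.3, 0.0]
cyrillic_acts = [0.0, 0.9, 0.4, 0.0, 0.5, 0.0]

latin_set = active_features(latin_acts)        # features 1, 3, 4
cyrillic_set = active_features(cyrillic_acts)  # features 1, 2, 4

# Two shared features out of four distinct ones gives 0.5.
overlap = jaccard(latin_set, cyrillic_set)
print(overlap)
```

A random baseline such as the 0.13 quoted above would presumably come from applying the same measure to feature sets from unrelated content, though the abstract does not describe how that baseline was constructed.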
Source: arXiv Generated at: 2026-06-02 00:00:00 UTC





