arXiv

How Far Do Auto-Interpretation Labels Generalize: A Controlled Study Across Languages, Scripts, and Rewordings

Title: The Limits of Auto-Interpretation Labels: A Controlled Investigation Across Languages, Scripts, and Paraphrases

Abstract:

Sparse autoencoders (SAEs) are gaining traction as tools for interpreting language models, with automatically generated natural-language descriptions serving as the main way to understand what individual features do. This study investigates how robust these labels are: does a feature labeled with a specific concept consistently track that concept across languages, writing systems, and phrasings? Serbian digraphia provides a controlled experimental setting, because the same language is written in both Latin and Cyrillic scripts related by deterministic transliteration. Using it, we first demonstrate that the sets of SAE features triggered by identical content across languages, scripts, and wordings overlap significantly (a peak Jaccard similarity of 0.57, compared to a random baseline of 0.13). This evidence points to the existence of genuine cross-lingual semantic features.
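To make the overlap metric concrete, here is a minimal sketch of Jaccard similarity between two sets of active SAE features. The helper names, the activation threshold, and the toy activation values are assumptions for illustration only; the abstract does not say how the study decides which features count as active.

```python
from typing import Iterable, Set


def active_features(activations: Iterable[float], threshold: float = 0.0) -> Set[int]:
    """Return indices of features whose activation exceeds the threshold.

    The threshold is a hypothetical choice, not one taken from the study.
    """
    return {i for i, a in enumerate(activations) if a > threshold}


def jaccard(a: Set[int], b: Set[int]) -> float:
    """Jaccard similarity |A & B| / |A | B|, defined here as 0.0 for two empty sets."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Toy activation vectors for the same sentence in two scripts
# (made-up numbers, not data from the study).
latin_acts = [0.0, 1.2, 0.0, 0.7, 0.3, 0.0]
cyrillic_acts = [0.0, 0.9, 0.4, 0.0, 0.5, 0.0]

overlap = jaccard(active_features(latin_acts), active_features(cyrillic_acts))
print(overlap)  # 2 shared features out of 4 active in either input -> 0.5
```

A value near the reported random baseline would mean the two inputs share little more than chance; a value well above it indicates that many of the same features fire for both inputs.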

However, when we evaluate whether auto-interpretation labels remain accurate in these contexts, we find they frequently fall short. Features labeled with semantic descriptions fail to capture the corresponding meaning in Serbian up to four times more often than they do in English. Furthermore, these labels perform worse on Serbian Cyrillic than on Serbian Latin, despite the two scripts being deterministic transliterations of one another. Because the content is identical across the two scripts, this discrepancy suggests that the errors track how well each script form is represented in the training data rather than any difference in meaning. The performance gap widens in deeper network layers, yet the auto-generated labels offer no warning of their own inaccuracy. These findings imply that auto-interpretation labels may mirror a feature’s response to well-represented inputs rather than the underlying concept itself.
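The script comparison rests on the fact that Serbian Cyrillic maps letter by letter onto Serbian Latin. The sketch below illustrates that mapping in the Cyrillic-to-Latin direction, which needs no dictionary. It is an illustrative helper, not the study's pipeline; the function name and the casing rule (an uppercase Cyrillic letter becomes a title-cased Latin letter or digraph) are assumptions.

```python
# Standard 30-letter Serbian Cyrillic alphabet mapped to Serbian Latin.
CYR_TO_LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ",
    "е": "e", "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k",
    "л": "l", "љ": "lj", "м": "m", "н": "n", "њ": "nj", "о": "o",
    "п": "p", "р": "r", "с": "s", "т": "t", "ћ": "ć", "у": "u",
    "ф": "f", "х": "h", "ц": "c", "ч": "č", "џ": "dž", "ш": "š",
}


def cyrillic_to_latin(text: str) -> str:
    """Transliterate Serbian Cyrillic to Serbian Latin, one character at a time.

    Simplification (assumption): an uppercase Cyrillic letter always yields a
    title-cased result, so "Љ" becomes "Lj" even inside an all-caps word.
    Characters outside the alphabet pass through unchanged.
    """
    out = []
    for ch in text:
        lower = ch.lower()
        if lower in CYR_TO_LAT:
            lat = CYR_TO_LAT[lower]
            out.append(lat.capitalize() if ch != lower else lat)
        else:
            out.append(ch)
    return "".join(out)


print(cyrillic_to_latin("Београд"))       # Beograd
print(cyrillic_to_latin("Љубав и њива"))  # Ljubav i njiva
```

The reverse direction needs more care, because the Latin digraphs lj, nj, and dž occasionally spell two separate sounds (as in "injekcija"), so a naive Latin-to-Cyrillic converter can mis-map a small number of words.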


Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC
