arXiv

Ablating Archetypes: The Stability of Archetypal SAEs is an Artifact of Initialization and Metric Design

Title: Archetypal SAE Stability: An Artifact of Initialization and Metric Design

Abstract:

Sparse autoencoders (SAEs) perform dictionary learning by extracting overcomplete bases from neural network activations, which aids interpretability and mitigates polysemanticity. Despite these benefits, SAE-derived features vary significantly across random seeds, a problem widely referred to as instability. Archetypal SAEs (Fel et al., 2025) were introduced as a general intervention for dictionary learning to improve the reliability of concept extraction, and are reported to yield more stable dictionaries at the end of training.

In this work, we argue that the reported stability of archetypal SAEs is not inherent to the method but rather stems from the use of identical initializations across multiple experimental runs. Our analysis seeks to resolve ambiguities in mechanistic interpretability by distinguishing between two concepts: "stability," defined as agreement between two models trained independently, and "stabilization," which refers to the convergence of runs with different initializations toward a shared solution. This distinction is vital for natural language processing (NLP), where the stability of SAE features is increasingly cited as proof of their utility as reusable analytical units.
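The distinction can be written down compactly. The following is one possible formalization, not notation taken from the paper: the symbols $D^{(a)}_t$, $D^{(b)}_t$ (the dictionaries of two runs at training step $t$), the dictionary distance $d$, the final step $T$, and the tolerance $\varepsilon$ are all assumptions introduced for illustration.

```latex
% Stability: the two independently trained models agree at the end of training.
\[
  d\bigl(D^{(a)}_T,\, D^{(b)}_T\bigr) \le \varepsilon
\]

% Stabilization: runs that start from different initializations move closer together.
\[
  \Delta_{\mathrm{stab}}
  \;=\; d\bigl(D^{(a)}_0,\, D^{(b)}_0\bigr) \;-\; d\bigl(D^{(a)}_T,\, D^{(b)}_T\bigr) \;>\; 0,
  \qquad D^{(a)}_0 \neq D^{(b)}_0 .
\]
```

Under this reading, a small endpoint distance is evidence of stabilization only when the initial distance was large. If both runs start from the same dictionary, then $d(D^{(a)}_0, D^{(b)}_0) = 0$ and $\Delta_{\mathrm{stab}}$ cannot be positive.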

We show that archetypal SAE experiments rely on a deterministic k-means decoder initialization, which effectively sets the inter-run dictionary distance to zero before training begins. When this initialization is removed, the archetypal constraint offers no stabilization benefit in our experimental setting. Additionally, we identify a preprocessing-dependent issue with cosine geometry that complicates the interpretation of endpoint stability metrics. Ultimately, our findings reinforce the importance of situating SAE research within the broader dictionary learning literature: claims of stability should be supported by trajectory diagnostics and by ablations of the initialization procedure.
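The effect of a deterministic initialization on the inter-run distance at step zero can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not the paper's experimental code: the metric `dictionary_distance` (one minus the mean best-match cosine similarity), the helper names, the synthetic Gaussian data, and the choice to seed k-means with the first `k` data points are all hypothetical stand-ins.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)


def dictionary_distance(d1, d2):
    """Assumed metric: 1 - mean over atoms of d1 of the best cosine match in d2."""
    best = [max(cosine(a, b) for b in d2) for a in d1]
    return 1.0 - sum(best) / len(best)


def kmeans_init(data, k, iters=10):
    """Deterministic Lloyd's k-means: starts from the first k points, uses no RNG."""
    centroids = [list(p) for p in data[:k]]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in data:
            nearest = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            clusters[nearest].append(p)
        for j, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def random_init(k, dim, seed):
    """Seed-dependent Gaussian initialization of k atoms in `dim` dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(k)]


# Synthetic "activations" shared by both runs (hypothetical data).
data_rng = random.Random(0)
data = [[data_rng.gauss(0, 1) for _ in range(8)] for _ in range(200)]

# Two "runs": the k-means start depends only on the data, so both runs coincide.
d0_kmeans = dictionary_distance(kmeans_init(data, k=4), kmeans_init(data, k=4))

# With seed-dependent random starts, the two runs begin apart.
d0_random = dictionary_distance(random_init(4, 8, seed=1), random_init(4, 8, seed=2))

print("initial distance, k-means init:", d0_kmeans)
print("initial distance, random init:", d0_random)
```

In this sketch the k-means distance is zero up to floating-point rounding, because both runs compute the same centroids from the same data, whereas the randomly initialized runs start at a clearly nonzero distance.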


Source: arXiv Generated at: 2026-06-02 00:00:00 UTC
