arXiv

Ablating Archetypes: The Stability of Archetypal SAEs is an Artifact of Initialization and Metric Design

June 2, 2026 · Michał Brzozowski, Neo Christopher Chung

Abstract:

Sparse autoencoders (SAEs) facilitate dictionary learning by extracting overcomplete bases from neural network activations, a process that enhances interpretability and mitigates polysemanticity. Despite these benefits, SAE-derived features vary significantly across random seeds, a problem widely recognized as instability. Archetypal SAEs (Fel et al., 2025) were introduced as a general intervention for dictionary learning to improve the reliability of concept extraction, with the claim that they yield more stable dictionaries at the end of training.

In this work, we argue that the reported stability of archetypal SAEs is not inherent to the method but rather stems from the use of identical initializations across multiple experimental runs. Our analysis seeks to resolve ambiguities in mechanistic interpretability by distinguishing between two concepts: "stability," defined as agreement between two models trained independently, and "stabilization," which refers to the convergence of runs with different initializations toward a shared solution. This distinction is vital for natural language processing (NLP), where the stability of SAE features is increasingly cited as proof of their utility as reusable analytical units.
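The distinction can be made concrete with a small numerical sketch. Everything here is an illustrative assumption rather than the paper's protocol: the function names `dictionary_distance`, `stability`, and `stabilization` are hypothetical, the distance is a simple symmetrized best-match cosine distance, and the "trained" dictionaries are synthetic perturbations of a common target rather than real SAE decoders.

```python
import numpy as np

def dictionary_distance(A, B):
    """Symmetrized best-match cosine distance between dictionaries.

    A, B: arrays of shape (n_atoms, dim). Returns roughly 0 for identical
    dictionaries. This metric is an assumption made for illustration.
    """
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    sim = A @ B.T  # pairwise cosine similarities
    return 1.0 - 0.5 * (sim.max(axis=1).mean() + sim.max(axis=0).mean())

def stability(final_a, final_b):
    """Endpoint agreement between two trained dictionaries."""
    return 1.0 - dictionary_distance(final_a, final_b)

def stabilization(init_a, init_b, final_a, final_b):
    """Reduction in inter-run distance over training.

    Positive values mean the runs moved toward each other.
    """
    return (dictionary_distance(init_a, init_b)
            - dictionary_distance(final_a, final_b))

rng = np.random.default_rng(0)
n_atoms, dim = 8, 32

# Synthetic stand-ins for two trained decoders near a common solution.
target = rng.normal(size=(n_atoms, dim))
final_a = target + 0.05 * rng.normal(size=target.shape)
final_b = target + 0.05 * rng.normal(size=target.shape)

# Case 1: both runs start from the same dictionary.
shared_init = rng.normal(size=(n_atoms, dim))
gain_shared = stabilization(shared_init, shared_init, final_a, final_b)

# Case 2: the runs start from independent random dictionaries.
init_a = rng.normal(size=(n_atoms, dim))
init_b = rng.normal(size=(n_atoms, dim))
gain_indep = stabilization(init_a, init_b, final_a, final_b)

endpoint = stability(final_a, final_b)
print(f"endpoint stability:          {endpoint:.4f}")
print(f"stabilization, shared init:  {gain_shared:.4f}")
print(f"stabilization, indep. inits: {gain_indep:.4f}")
```

In this toy setup the endpoint stability is the same in both cases, because the final dictionaries are the same. Only the comparison against the starting distance shows whether training brought the runs together, which is why an endpoint metric alone cannot separate the two notions.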

We show that archetypal SAE experiments rely on a deterministic k-means decoder initialization, which sets the inter-run dictionary distance to zero before training begins. When this initialization strategy is removed, the archetypal constraint offers no stabilization benefit in our experimental setting. Additionally, we identify a preprocessing-dependent issue with cosine geometry that obscures the interpretation of endpoint stability metrics. Ultimately, our findings reinforce the importance of situating SAE research within the broader context of dictionary learning, and they show that claims of stability must be substantiated through trajectory diagnostics and ablation studies of initialization procedures.
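The initialization effect can be reproduced in a few lines. This is a minimal sketch under stated assumptions, not the authors' code: `kmeans_centroids` is a hypothetical plain Lloyd iteration with a fixed seed standing in for the deterministic k-means step, the activations are synthetic Gaussian data, and `dictionary_distance` is an illustrative best-match cosine distance.

```python
import numpy as np

def kmeans_centroids(X, k, n_iter=20, seed=0):
    """Plain Lloyd's k-means with a fixed seed (hypothetical helper).

    Deterministic: the same X, k, and seed always give the same centroids.
    """
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Squared distances from every point to every centroid.
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        labels = d2.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return centroids

def dictionary_distance(A, B):
    """Symmetrized best-match cosine distance (illustrative metric)."""
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    sim = A @ B.T
    return 1.0 - 0.5 * (sim.max(axis=1).mean() + sim.max(axis=0).mean())

# Synthetic stand-in for a matrix of network activations.
activations = np.random.default_rng(123).normal(size=(400, 32))
k = 8

# Two "runs" that differ only in their training seed still share the
# same decoder initialization when it comes from deterministic k-means.
kmeans_init_run1 = kmeans_centroids(activations, k)
kmeans_init_run2 = kmeans_centroids(activations, k)
dist_kmeans = dictionary_distance(kmeans_init_run1, kmeans_init_run2)

# With per-run random initialization the runs start far apart.
random_init_run1 = np.random.default_rng(1).normal(size=(k, 32))
random_init_run2 = np.random.default_rng(2).normal(size=(k, 32))
dist_random = dictionary_distance(random_init_run1, random_init_run2)

print(f"pre-training distance, k-means init: {dist_kmeans:.6f}")
print(f"pre-training distance, random init:  {dist_random:.6f}")
```

Under the shared k-means initialization the two dictionaries are identical before any gradient step, so any endpoint agreement measured afterwards partly reflects the common starting point. Ablating the initialization, as in the random case here, is what lets the effect of the training constraint itself be measured.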
