arXiv

Aligned Training: A Parameter-Free Method to Improve Feature Quality and Stability of Sparse Autoencoders (SAE)

June 3, 2026 · Michał Brzozowski, Neo Christopher Chung

Title: Aligned Training: A Parameter-Free Approach to Enhancing the Quality and Stability of Sparse Autoencoder Features

Abstract:

Sparse autoencoders (SAEs) serve as a primary tool for interpreting the internal mechanisms of deep neural networks (DNNs) by decomposing activations into high-dimensional features. Nevertheless, they suffer from significant limitations, most notably a large number of inactive ("dead") features and instability of the learned features across training runs. While existing SAE variants seek to address these problems, they typically require extra data, resampling procedures, or additional training phases. In this work, we introduce aligned training, a parameter-free reparameterization technique that simultaneously improves reconstruction accuracy, eliminates dead features, and markedly increases stability across different training seeds.

Our method is grounded in a previously unnoticed phenomenon: the quality of SAE features, quantified by the inner product between encoder and decoder directions (termed the alignment score), exhibits a bimodal distribution across contemporary architectures. Aligned training imposes a geometric constraint that forces the inner product between encoder and decoder weights to equal one for each feature. This mechanism eliminates a specific source of degeneracy in SAE training without introducing any new hyperparameters.
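The constraint described above can be illustrated with a small sketch. The abstract does not specify the exact reparameterization, so the construction below is an assumption made for illustration only: each free encoder row u_i is shifted along its decoder direction d_i so that the inner product becomes exactly one. The names `alignment_scores`, `align_encoder`, `W_free`, and `W_dec` are hypothetical and do not come from the paper.

```python
import random


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def alignment_scores(W_enc, W_dec):
    """Per-feature inner product between encoder row i and decoder row i."""
    return [dot(e, d) for e, d in zip(W_enc, W_dec)]


def align_encoder(W_free, W_dec):
    """Return encoder rows whose alignment score with W_dec is exactly 1.

    Assumed construction (not taken from the paper):
        e_i = u_i + (1 - <u_i, d_i>) * d_i / ||d_i||^2
    This adds no hyperparameters; it only removes one degree of freedom
    per feature. It requires every decoder row to be nonzero.
    """
    aligned = []
    for u, d in zip(W_free, W_dec):
        shift = (1.0 - dot(u, d)) / dot(d, d)
        aligned.append([ui + shift * di for ui, di in zip(u, d)])
    return aligned


# Toy weights: 8 features in a 4-dimensional activation space.
random.seed(0)
n_features, d_model = 8, 4
W_free = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(n_features)]
W_dec = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(n_features)]

before = alignment_scores(W_free, W_dec)  # unconstrained scores
after = alignment_scores(align_encoder(W_free, W_dec), W_dec)  # constrained scores
```

In this toy setup the entries of `before` are arbitrary, while every entry of `after` equals one up to floating-point rounding. In an actual training loop the free parameters would be optimized and the aligned encoder recomputed from them at each step; how the authors realize this is not stated in the abstract.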

Experiments across various models, dictionary sizes, and sparsity levels demonstrate that aligned training achieves Pareto improvements on SAEBench benchmarks. Furthermore, beyond resolving issues related to dead features, stability, and reconstruction, the method is compatible with mechanistic interpretability techniques such as Top/BatchTop-K architectures and p-Annealing. Ultimately, aligned training significantly improves the quality and stability of SAE features without incurring additional computational complexity or cost.
