arXiv

How Quantization Changes Interpretable Features: A Sparse Autoencoder Analysis of Language Models

June 3, 2026 · Evan Duan

Title: The Impact of Quantization on Interpretable Features: A Sparse Autoencoder Study of Language Models

Quantization has become a standard step in deploying large language models, and its success is typically judged by how closely the quantized model’s perplexity or downstream accuracy matches that of the full-precision original. It remains largely untested, however, whether the underlying internal computation is preserved, or whether the interpretable features identified in full-precision models survive weight rounding, even though safety audits and steering interventions increasingly rely on these features. This study investigates whether sparse autoencoder (SAE) features extracted from dense full-precision models remain faithful after quantization.

Using a frozen SAE as a fixed measurement basis, we encoded activations for identical tokens from both the full-precision and the round-to-nearest (RTN) quantized models. We assessed the survival of individual features using Pearson correlation, testing bit-widths from INT8 down to INT4 on the Pythia-70M and Gemma-2-2B models. Our findings show that feature survival is graded rather than binary: degradation is systematic, not a total failure. Specifically, 62.4% of active features remained intact at INT6 on Pythia-70M, while 51.3% survived at the same bit-width on Gemma-2-2B. Furthermore, most of the features that did not survive were blurred rather than completely destroyed.
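
The measurement described above can be illustrated with a toy sketch. It is not the study's code: the one-layer "model", the symmetric per-tensor RTN scheme, the ReLU encoder form, and all function names (`rtn_quantize`, `sae_encode`, `feature_correlations`, `surviving_fraction`) are assumptions made for illustration, and the survival threshold is left as a caller-supplied parameter because the text above does not state one.

```python
import math

def rtn_quantize(weights, bits):
    """Symmetric per-tensor round-to-nearest quantization (assumed scheme).

    Weights are snapped to integer multiples of a single scale, then mapped
    back to floats. Python's round() breaks exact ties toward even integers.
    """
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for row in weights for w in row) or 1.0
    scale = max_abs / qmax
    return [[round(w / scale) * scale for w in row] for row in weights]

def matvec(matrix, vec):
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

def sae_encode(w_enc, b_enc, activation):
    """Frozen SAE encoder, assumed to have the form ReLU(W_enc x + b_enc)."""
    return [max(0.0, z + b) for z, b in zip(matvec(w_enc, activation), b_enc)]

def pearson(xs, ys):
    """Pearson correlation; NaN when either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / math.sqrt(vx * vy)

def feature_correlations(model_w, w_enc, b_enc, tokens, bits):
    """Per-feature correlation between full-precision and quantized runs.

    The same SAE (w_enc, b_enc) encodes both runs, so it acts as a fixed
    measurement basis; only the toy model's weights are quantized.
    """
    w_q = rtn_quantize(model_w, bits)
    feats_fp = [sae_encode(w_enc, b_enc, matvec(model_w, t)) for t in tokens]
    feats_q = [sae_encode(w_enc, b_enc, matvec(w_q, t)) for t in tokens]
    return [
        pearson([f[j] for f in feats_fp], [f[j] for f in feats_q])
        for j in range(len(b_enc))
    ]

def surviving_fraction(correlations, threshold):
    """Share of features at or above `threshold`, skipping NaN entries."""
    active = [r for r in correlations if not math.isnan(r)]
    if not active:
        return float("nan")
    return sum(r >= threshold for r in active) / len(active)
```

Calling `feature_correlations` at several values of `bits` and passing each result to `surviving_fraction` gives a survival curve across bit-widths; because each feature receives a continuous score, partial degradation ("blurring") is visible rather than collapsed into a pass/fail label.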

The survival of these features can be predicted solely from full-precision statistics, achieving cross-validated AUCs between 0.92 and 0.97, with peak activation serving as the most significant marginal predictor. Crucially, standard task metrics may fail to detect this internal damage. For instance, on Gemma-2-2B, moving to INT7 improved perplexity scores yet simultaneously degraded 18.7% of the features.
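
To make the AUC figure concrete, the following minimal sketch scores a single full-precision statistic as a predictor of survival. It is a simplification under stated assumptions: the study reports cross-validated AUCs, presumably from a fitted predictor over several statistics, whereas this computes the plain rank-based AUC of one score with no model fitting or cross-validation. The names `roc_auc` and `peak_activations` are hypothetical.

```python
def roc_auc(scores, labels):
    """Rank-based AUC: the probability that a randomly chosen positive
    example has a higher score than a randomly chosen negative one,
    with ties counted as one half."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def peak_activations(feature_matrix):
    """Peak (maximum) activation per feature, taken from full-precision runs.

    `feature_matrix` is a list of per-token feature vectors.
    """
    n_features = len(feature_matrix[0])
    return [max(row[j] for row in feature_matrix) for j in range(n_features)]
```

An AUC of 0.5 means the statistic carries no ranking information about survival, while values near 0 or 1 both indicate a strong predictor, with the sign of the relationship flipped. The text above does not say in which direction peak activation predicts survival, so the sketch makes no assumption about it.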

Additionally, the effects of quantization correlated strongly with those of magnitude pruning matched for perplexity. The two compression techniques damaged highly overlapping feature sets, with a Jaccard overlap of 0.79 to 0.86 and a Spearman correlation of 0.98 between their damage scores. This suggests a shared vulnerability induced by compression. Ultimately, these results indicate that behavioral parity does not guarantee that interpretability findings transfer to quantized deployments, highlighting the need for feature-level audits of compression techniques.
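
The two agreement statistics quoted above can be computed as in the sketch below. It assumes each compression method yields a set of damaged feature indices and a per-feature damage score; how the study defines either is not stated here, so the inputs are treated as given. The function names are illustrative.

```python
import math

def jaccard(a, b):
    """Jaccard overlap |A ∩ B| / |A ∪ B| between two sets of feature ids."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(xs, ys):
    """Pearson correlation; NaN when either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / math.sqrt(vx * vy)

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(xs), average_ranks(ys))
```

Jaccard compares which features are damaged, while Spearman compares how the two methods order all features by damage, so a high value on both indicates agreement in membership as well as in ranking.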
