Learning Self-Interpretation from Interpretability Artifacts: Training Lightweight Adapters on Vector-Label Pairs
Title: Deriving Self-Interpretation Capabilities from Interpretability Artifacts: Training Minimalist Adapters on Vector-Label Pairs
Abstract:
Self-interpretation techniques prompt language models to describe their own internal mechanisms, but their reliability suffers from high sensitivity to hyperparameters. This study shows that training lightweight adapters on interpretability artifacts, while keeping the underlying language model completely frozen, yields consistent self-interpretation performance across tasks and model architectures. A scalar affine adapter with only $d_\text{model}+1$ parameters is sufficient. Trained adapters produce sparse autoencoder feature labels that surpass the original training labels in quality (70% versus 50% on generation scoring at the 70B scale) and identify topics with 94% recall@1, compared with 1% for untrained baselines. The adapters also decode bridge entities in multi-hop reasoning tasks; these entities appear in neither the prompt nor the final response, so decoding them reveals implicit reasoning without relying on chain-of-thought methods. Analysis shows that the learned bias vector alone accounts for 85% of the observed improvement, and that simpler adapter structures generalize better than more complex alternatives. Controlling for model knowledge through prompted descriptions, we observe that the benefits of self-interpretation grow faster than general capability gains as model size increases from 7B to 72B parameters. Our findings indicate that self-interpretation capabilities improve with model scale, even when the interpreted model remains unmodified.
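The $d_\text{model}+1$ parameter count can be made concrete with a minimal sketch. It assumes the scalar affine adapter consists of one shared scale and a bias vector of length $d_\text{model}$, which is one reading consistent with the abstract's mention of a learned bias vector; the class name, method names, and example values below are hypothetical and not taken from the paper. Training and injection of the mapped vector into the frozen model are not shown.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ScalarAffineAdapter:
    """Hypothetical adapter: maps v to scale * v + bias.

    Assumption: a single shared scalar plus a per-dimension bias,
    giving d_model + 1 parameters in total.
    """
    scale: float        # 1 parameter, shared across all dimensions
    bias: List[float]   # d_model parameters, one per dimension

    def __call__(self, vector: List[float]) -> List[float]:
        if len(vector) != len(self.bias):
            raise ValueError("vector and bias must both have length d_model")
        return [self.scale * v + b for v, b in zip(vector, self.bias)]

    def num_parameters(self) -> int:
        return len(self.bias) + 1


# Toy example with d_model = 4 (illustrative values only).
d_model = 4
adapter = ScalarAffineAdapter(scale=2.0, bias=[0.5, -0.5, 0.0, 1.0])
mapped = adapter([1.0, 2.0, 3.0, 4.0])
```

Under this reading, an adapter with the scale fixed at 1 reduces to adding the bias vector alone, which is the component the abstract credits with most of the improvement.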
Source: arXiv Generated at: 2026-06-03 00:00:00 UTC



