Correcting Gradient-Based Circuit Localization via Interaction-Aware Backpropagation
Title: Refining Gradient-Based Circuit Localization Through Interaction-Aware Backpropagation
Abstract: The primary objective of circuit localization is to pinpoint the specific subsets of model components that drive particular behaviors within large language models, thereby facilitating granular mechanistic analysis. Conventional approaches typically operate under the assumption that components function independently, calculating their significance by perturbing them one at a time. However, because neural network components inherently interact, neglecting these interdependencies results in a systematic distortion of importance estimates. Our research identifies "attention self-repair" as a particularly disruptive interaction; in this scenario, the redistribution of softmax weights causes the gradients associated with high-impact attention scores to disappear, as other positions with comparable values step in to compensate. To address this, we propose Gradient Interaction Modifications (GIM), a method that explicitly incorporates feature interactions into the backpropagation process. GIM sets a new standard for performance on the circuit localization track of the Mechanistic Interpretability Benchmark and surpasses current gradient-based techniques in feature attribution across a wide range of tasks. By capturing interaction effects and clarifying why previous methods tend to undervalue component importance, GIM supports more accurate mechanistic investigations of large language models. The GIM Python package is publicly accessible at https://github.com/corticph/gim.
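Illustration: the "attention self-repair" effect described in the abstract can be reproduced with a toy example. The sketch below is not the GIM method and does not use the GIM package. It assumes a single query attending over three positions with scalar values, and every name in it (`softmax`, `attention_output`, `score_gradients`, `ablate`) is hypothetical. Two positions share a high attention score and the same value, so the analytic gradient of the output with respect to either score, p_i * (v_i - out), is close to zero, and ablating one of them alone barely changes the output because the other absorbs the softmax weight. Ablating both together removes almost the entire output.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_output(scores, values):
    """Softmax-weighted average of scalar values (toy single-query attention)."""
    probs = softmax(scores)
    return sum(p * v for p, v in zip(probs, values))


def score_gradients(scores, values):
    """Analytic gradient of the output w.r.t. each score: p_i * (v_i - out)."""
    probs = softmax(scores)
    out = sum(p * v for p, v in zip(probs, values))
    return [p * (v - out) for p, v in zip(probs, values)]


def ablate(scores, positions):
    """Mask the given positions by assigning them a very negative score."""
    return [-1e9 if i in positions else s for i, s in enumerate(scores)]


# Assumed toy setup: positions 0 and 1 are redundant (same score, same value).
scores = [5.0, 5.0, 0.0]
values = [1.0, 1.0, 0.0]

base = attention_output(scores, values)
grads = score_gradients(scores, values)

# Effect of removing one redundant position versus removing both.
single_effect = base - attention_output(ablate(scores, {0}), values)
joint_effect = base - attention_output(ablate(scores, {0, 1}), values)

print("gradient w.r.t. score 0:", grads[0])      # close to zero
print("effect of ablating position 0:", single_effect)  # close to zero
print("effect of ablating positions 0 and 1:", joint_effect)  # close to one
```

In this setup both the gradient and the single-position ablation suggest that position 0 is unimportant, while the joint ablation shows that the pair carries nearly all of the output. This is the kind of underestimation that one-at-a-time importance estimates are prone to when components interact.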
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC





