Query Circuits: Explaining How Language Models Answer User Prompts
Title: Query Circuits: Decoding Language Model Responses to Specific User Prompts
Abstract: Providing local, input-level explanations is essential for understanding why a language model generates a specific output. While current techniques successfully map global capability circuits, such as indirect object identification, they fail to explain why a model responds to a particular query in a specific manner. To bridge this gap, we present "query circuits," a method that directly tracks the information flow within a model as it transforms a given input into an output. Because these circuits are identified internally rather than relying on surrogate models like sparse autoencoders, they offer explanations that are both more faithful and computationally efficient.
To ensure query circuits are practical, we tackle two primary obstacles. First, we propose Normalized Deviation Faithfulness (NDF), a robust metric designed to assess how accurately a discovered circuit recovers the model's decision for a single input; this metric is versatile and applicable to circuit discovery in contexts beyond our specific framework. Second, we create sampling-based techniques to efficiently locate circuits that are sparse yet accurately reflect the model's behavior.
Our evaluation across several benchmarks, including IOI, arithmetic, MMLU, and ARC, reveals that extremely sparse query circuits exist within models and can restore a significant portion of their performance on individual queries. For instance, a circuit comprising merely 1.3% of the model's connections is capable of recovering approximately 60% of its performance on MMLU questions. Ultimately, query circuits represent a significant advancement toward scalable and faithful explanations of how language models process individual inputs. The project page is available at https://tony10101105.github.io/query-circuit/.
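The abstract does not give the NDF formula or the sampling procedure, so the following is only a toy sketch of the general idea, not the paper's method. It assumes a hypothetical "model" that is a weighted sum over connections, a hypothetical normalized score (1 when the circuit reproduces the full model's output, 0 when it is as far off as the fully ablated model), and plain uniform random sampling of fixed-size connection subsets. All function names here are illustrative.

```python
import random


def model_output(weights, x, kept):
    """Toy stand-in for a model: only the kept connections contribute."""
    return sum(w * xi for i, (w, xi) in enumerate(zip(weights, x)) if i in kept)


def normalized_faithfulness(circuit_out, full_out, empty_out):
    """Hypothetical normalized score, NOT the paper's NDF definition.

    Returns 1.0 when the circuit output equals the full-model output and
    0.0 when it deviates at least as much as the fully ablated model.
    """
    denom = abs(full_out - empty_out)
    if denom == 0:
        return 1.0 if circuit_out == full_out else 0.0
    return max(0.0, 1.0 - abs(circuit_out - full_out) / denom)


def sample_circuit(weights, x, k, n_samples, seed=0):
    """Draw random k-connection subsets and keep the most faithful one."""
    rng = random.Random(seed)
    n = len(weights)
    full_out = model_output(weights, x, set(range(n)))
    empty_out = model_output(weights, x, set())
    best_kept, best_score = set(), -1.0
    for _ in range(n_samples):
        kept = set(rng.sample(range(n), k))
        score = normalized_faithfulness(
            model_output(weights, x, kept), full_out, empty_out
        )
        if score > best_score:
            best_kept, best_score = kept, score
    return best_kept, best_score


if __name__ == "__main__":
    # One dominant connection and three weak ones, for a single input x.
    weights = [5.0, 0.1, 0.1, 0.1]
    x = [1.0, 1.0, 1.0, 1.0]
    kept, score = sample_circuit(weights, x, k=1, n_samples=50)
    print(f"kept connections: {sorted(kept)}, score: {score:.3f}")
```

In this toy setting a single connection out of four can account for most of the output on one input, which mirrors, in a much simplified form, the claim that a very sparse circuit can recover a large share of per-query behavior.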
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC




