The Attribution Contract: Feature Attribution for Generative Language Models
Abstract:
Feature attribution techniques aim to identify the input features that influence a model’s output. In generative language models, however, what counts as a "feature" is inherently ambiguous. In autoregressive architectures, previously generated tokens play a dual role: they are outputs of the model and also inputs to subsequent predictions. By contrast, diffusion-based models generate content through iterative denoising or unmasking rather than in a fixed left-to-right order, so a local explanation may target an intermediate diffusion state rather than a specific next-token prediction.
We argue that this ambiguity is more than a technical hurdle: it is a fundamental conceptual constraint that arises when feature attribution methods designed for classifiers are applied directly to generative language modeling. To address it, we propose the "Attribution Contract," a framework that specifies what a feature-attribution claim is about. The contract explicitly names the output being explained, the features eligible for attribution, the assumed generative process, the quantities held fixed, and the model score under analysis.
The contract explains why the same attribution algorithm can yield different answers depending on how the contract is instantiated. We contend that many disputes about feature attribution in generative language models stem not from disagreements over algorithms but from explanatory contracts that are left implicit or unstated. Through case studies on autoregressive and diffusion models, we show when attributing to earlier tokens, intermediate states, or denoising phases is informative and when it is misleading. Finally, we argue that evaluations of feature attribution in this domain should assess each method together with its contract.
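As an illustration only, the five items a contract names can be sketched as a small record, and a toy occlusion experiment shows how one algorithm gives different answers under two contracts. Everything below is a hypothetical stand-in and not the paper's implementation: the `AttributionContract` class and its field names, the `[MASK]` baseline, the `"generated_prefix"` label, and the `toy_first_token` and `toy_score` functions are assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Tuple

MASK = "[MASK]"  # hypothetical baseline token used for occlusion


@dataclass(frozen=True)
class AttributionContract:
    """Hypothetical record of the five items a contract names."""
    target_output: str                   # the output being explained
    eligible_features: Tuple[int, ...]   # prompt positions that may receive credit
    generative_process: str              # e.g. "autoregressive" or "diffusion"
    held_fixed: Tuple[str, ...]          # what stays constant under perturbation
    score: Callable[[Sequence[str], str], float]  # model score under analysis


def toy_first_token(prompt: Sequence[str]) -> str:
    """Toy stand-in for a model generating its first token."""
    return "Paris" if "France" in prompt else "[UNK]"


def toy_score(prompt: Sequence[str], first_token: str) -> float:
    """Toy stand-in for the score of the second generated token."""
    return 1.0 * (first_token == "Paris") + 0.5 * ("capital" in prompt)


def occlusion(prompt: Sequence[str], contract: AttributionContract) -> Dict[int, float]:
    """Leave-one-out occlusion: score drop when one eligible position is masked."""
    original_prefix = toy_first_token(prompt)
    base = contract.score(prompt, original_prefix)
    result: Dict[int, float] = {}
    for i in contract.eligible_features:
        perturbed: List[str] = list(prompt)
        perturbed[i] = MASK
        if "generated_prefix" in contract.held_fixed:
            prefix = original_prefix             # earlier output treated as a fixed input
        else:
            prefix = toy_first_token(perturbed)  # earlier output is re-generated
        result[i] = base - contract.score(perturbed, prefix)
    return result


prompt = ["capital", "of", "France"]
shared = dict(
    target_output="score of the second generated token",
    eligible_features=(0, 1, 2),
    generative_process="autoregressive",
    score=toy_score,
)
prefix_fixed = AttributionContract(held_fixed=("generated_prefix",), **shared)
prefix_regenerated = AttributionContract(held_fixed=(), **shared)

attr_fixed = occlusion(prompt, prefix_fixed)
attr_regenerated = occlusion(prompt, prefix_regenerated)
print(attr_fixed)        # {0: 0.5, 1: 0.0, 2: 0.0}
print(attr_regenerated)  # {0: 0.5, 1: 0.0, 2: 1.0}
```

In this toy setup the algorithm is identical in both runs; only the `held_fixed` entry differs. When the generated prefix is held fixed, "France" (position 2) receives no credit because the already generated "Paris" still carries its effect. When the prefix is regenerated, the same token receives the largest attribution.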
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC





