A Monosemantic Attribution Framework for Stable Interpretability in Clinical Neuroscience Transformer-Based Language Models
Abstract: Interpretability remains a critical hurdle in deploying language models (LMs) in clinical settings, particularly for diagnosing the progression of Alzheimer’s disease, where reliable and timely predictions are paramount. Current attribution techniques often yield unstable explanations that vary substantially from one method to another, largely because representations in Transformer-based LMs and large language models (LLMs) are polysemantic. Mechanistic interpretability methods, in turn, do not provide explicit importance scores and do not map directly onto model inputs and outputs. To address these limitations, we present a unified interpretability framework that bridges attributional and mechanistic approaches through monosemantic feature extraction. By constructing a monosemantic embedding space at a layer of a Transformer-based LM and optimizing the framework to minimize inter-method variability, our method produces stable importance scores at the input level. It exposes salient features through a decompressed representation of the target layer, supporting the secure and trustworthy use of LMs in cognitive health and neurodegenerative disease.
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC




