Prototype Transformer: Towards Language Model Architectures Interpretable by Design
Abstract: Although leading language models (LMs) now outperform humans in various areas, their internal reasoning remains largely opaque. This lack of transparency undermines user trust and raises the risk of deceptive outputs and hallucinations. To address this, we present the Prototype Transformer (ProtoT), a new autoregressive LM architecture. ProtoT replaces the quadratic-cost self-attention mechanism of standard Transformers with a linear-cost module built on prototypes (learned parameter vectors). The prototypes act as communication channels that aggregate contextual information across different time scales. Our analysis shows that, with this design, prototypes spontaneously acquire identifiable concepts, such as "woman," during training. This offers a route to interpreting the model's reasoning and to making targeted edits to its behavior. Compared with baseline models, ProtoT scales well with both model size and data volume, is robust to input perturbations, and performs competitively on text generation and on downstream tasks, including the GLUE benchmark. These findings suggest that ProtoT is a significant step toward autoregressive language models that are interpretable by design.
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC
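
Illustrative sketch: the abstract describes prototypes as learned vectors that act as communication channels, aggregating context at linear cost over different time scales, but it gives no equations. The toy Python below is one way to picture that idea and is not the ProtoT layer itself. Everything specific in it is an assumption made for illustration: the function name `prototype_mix`, the softmax assignment of tokens to prototypes, the per-prototype decay rates standing in for "time scales", and the reuse of the same weights for writing and reading.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def prototype_mix(tokens, prototypes, decays):
    """Causal mixing through K prototype channels in a single pass.

    Hypothetical illustration only: cost grows linearly with the number
    of tokens, because each token touches K running summaries instead of
    every earlier token.
    """
    dim = len(tokens[0])
    # One running summary vector per prototype (assumed design).
    states = [[0.0] * dim for _ in prototypes]
    outputs = []
    for x in tokens:
        # Soft assignment of the current token to each prototype.
        scores = [sum(p_i * x_i for p_i, x_i in zip(p, x)) for p in prototypes]
        weights = softmax(scores)
        # Write: each channel decays its old summary and adds the weighted
        # token. A decay near 1 keeps a long memory, a small one a short memory.
        for k, (w, decay) in enumerate(zip(weights, decays)):
            states[k] = [decay * s + w * x_i for s, x_i in zip(states[k], x)]
        # Read: the token collects the channel summaries with the same weights.
        y = [sum(w * states[k][i] for k, w in enumerate(weights)) for i in range(dim)]
        outputs.append(y)
    return outputs


# Usage with random data: 6 tokens of width 4, 3 prototypes with
# short, medium, and long memory.
random.seed(0)
tokens = [[random.gauss(0, 1) for _ in range(4)] for _ in range(6)]
prototypes = [[random.gauss(0, 1) for _ in range(4)] for _ in range(3)]
mixed = prototype_mix(tokens, prototypes, decays=[0.5, 0.9, 0.99])
```

In this sketch the assignment weights are the natural place to look for interpretability: inspecting which tokens write most strongly to a given prototype is a plausible way to check whether that prototype tracks a recognizable concept. How ProtoT actually does this is described in the paper, not here.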




