arXiv

Learning to Remember, Learn, and Forget in Attention-Based Models

June 2, 2026 · Djohan Bonnet, Jamie Lohoff, Jan Finkbeiner, Elidona Skhikerujah, Emre Neftci

Title: Mastering Retention, Acquisition, and Erasure in Attention-Driven Architectures

Abstract: In transformers, In-Context Learning (ICL) acts as an online associative memory, a capability widely credited for their strong performance on complex sequence processing tasks. Gated linear attention models, however, have a fixed memory capacity and are highly susceptible to interference, particularly on long sequences. To address these limitations, we introduce Palimpsa, a self-attention framework that treats ICL as a continual learning problem subject to the stability-plasticity dilemma. Palimpsa employs Bayesian metaplasticity: the plasticity of each attention state is tied to an importance metric derived from a prior distribution that encodes accumulated knowledge. We show that several existing gated linear attention models arise as specific architectural choices and posterior approximations within this framework, and that Mamba2 is a special case of Palimpsa in which forgetting dominates. This theoretical connection makes it possible to convert any non-metaplastic model into a metaplastic one, substantially increasing its memory capacity. Empirically, Palimpsa consistently outperforms baseline methods on the Multi-Query Associative Recall (MQAR) benchmark and on Commonsense Reasoning tasks.
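
To make the "online associative memory" view concrete, here is a minimal sketch of a generic scalar-gated linear attention recurrence. It is not Palimpsa's metaplastic update, which the abstract does not specify. The function name `gated_linear_attention`, the `decay` argument, and the toy data are illustrative assumptions.

```python
import numpy as np

def gated_linear_attention(q, k, v, decay):
    """Generic scalar-gated linear attention (illustrative, not Palimpsa).

    q, k: arrays of shape (T, d_k); v: array of shape (T, d_v);
    decay: array of shape (T,) with values in [0, 1].
    Returns the per-step outputs (T, d_v) and the final memory (d_v, d_k).
    """
    T, d_k = k.shape
    d_v = v.shape[1]
    state = np.zeros((d_v, d_k))  # fixed-size memory, independent of T
    out = np.empty((T, d_v))
    for t in range(T):
        # Forget: scale down old associations. Write: add the new
        # key-value pair as an outer product.
        state = decay[t] * state + np.outer(v[t], k[t])
        # Read: project the memory onto the current query.
        out[t] = state @ q[t]
    return out, state

# Toy example (assumed data): four orthonormal keys, three-dimensional values.
keys = np.eye(4)
values = np.arange(12, dtype=float).reshape(4, 3)

# Decay of 1 everywhere: nothing is forgotten.
_, memory_keep = gated_linear_attention(keys, keys, values, np.ones(4))
# Decay of 0.5 everywhere: older associations fade at every step.
_, memory_fade = gated_linear_attention(keys, keys, values, np.full(4, 0.5))

recalled_keep = memory_keep @ keys[0]
recalled_fade = memory_fade @ keys[0]
```

With orthonormal keys and no decay, querying the final memory with the first key returns the first value exactly. With a constant decay of 0.5, the same query returns that value scaled by 0.5^3, because the association was written at the first step and decayed at each of the three later steps. Storing more keys than the key dimension allows would make them non-orthogonal and cause the reads to interfere, which is the fixed-capacity limitation the abstract refers to.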
