MesaNet: Sequence Modeling by Locally Optimal Test-Time Training
Title: MesaNet: Enhancing Sequence Modeling via Locally Optimal Test-Time Training
Abstract:
Causal transformers with softmax self-attention currently dominate sequence modeling, but their memory and compute requirements during inference scale linearly with sequence length. To address this, recent work has linearized the softmax operation, yielding powerful recurrent neural networks (RNNs) such as DeltaNet, Mamba, and xLSTM, which have constant memory and compute costs. A unifying perspective on these models shows that their recurrent dynamics can be derived from an in-context regression objective that is approximately optimized by an online learning rule.
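To make the "online learning rule" view concrete, the following is a minimal sketch, not any specific model's implementation. It assumes a squared-error in-context loss 0.5 * ||W k - v||^2 on each key-value pair and takes one gradient step per token, which gives the classic delta rule that DeltaNet builds on. Gating, normalization, and learned step sizes are omitted, and the names `matvec`, `delta_rule_step`, `run_recurrence`, and `beta` are illustrative assumptions.

```python
# Illustrative sketch only: function names and the fixed step size `beta`
# are assumptions made for this example, not taken from the paper.

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]


def delta_rule_step(W, k, v, beta):
    """One gradient step on the in-context loss 0.5 * ||W k - v||^2.

    The gradient with respect to W is the outer product (W k - v) k^T,
    so the update is W <- W - beta * (W k - v) k^T.
    """
    err = [p - t for p, t in zip(matvec(W, k), v)]  # prediction error W k - v
    return [
        [w_ij - beta * e_i * k_j for w_ij, k_j in zip(row, k)]
        for row, e_i in zip(W, err)
    ]


def run_recurrence(keys, values, beta):
    """Fold the update over a sequence; the state W keeps a fixed size."""
    d_v, d_k = len(values[0]), len(keys[0])
    W = [[0.0] * d_k for _ in range(d_v)]
    for k, v in zip(keys, values):
        W = delta_rule_step(W, k, v, beta)
    return W
```

The state `W` has a fixed shape regardless of sequence length, which is where the constant memory cost comes from. Each token only nudges `W` toward fitting the newest pair, so the regression objective over the whole context is optimized approximately rather than exactly.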
Building on this foundation, we introduce a scalable, chunkwise parallelizable variant of the Mesa layer (von Oswald et al., 2024), which was originally restricted to sequential processing and therefore did not scale. Rather than approximately optimizing the in-context loss with an online rule, the layer minimizes it to optimality at every time step, using a fast conjugate gradient solver that keeps the computation numerically stable.
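The idea of solving the in-context problem to optimality at every step can be sketched as follows. This is a simplified, sequential illustration under stated assumptions, not the chunkwise parallel algorithm of the paper. It assumes a ridge-regularized least-squares loss, sum_i 0.5 * ||v_i - W k_i||^2 + 0.5 * lam * ||W||_F^2, whose minimizer is W_t = G_t (H_t + lam I)^{-1} with G_t = sum_i v_i k_i^T and H_t = sum_i k_i k_i^T. The output W_t q_t is then obtained by solving the linear system (H_t + lam I) x = q_t with conjugate gradient. Forget gates are omitted, and the names `conjugate_gradient`, `mesa_readout`, and `lam` are illustrative.

```python
# Illustrative sketch only: a sequential, ungated ridge-regression readout.
# Function names and the scalar regularizer `lam` are assumptions.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def matvec(A, x):
    return [dot(row, x) for row in A]


def conjugate_gradient(A, b, num_steps=50, tol=1e-12):
    """Solve A x = b for a symmetric positive-definite matrix A."""
    x = [0.0] * len(b)
    r = list(b)        # residual b - A x, with x = 0 initially
    p = list(r)        # first search direction
    rs = dot(r, r)     # squared residual norm
    for _ in range(num_steps):
        if rs <= tol:  # converged (also guards the divisions below)
            break
        Ap = matvec(A, p)
        alpha = rs / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x


def mesa_readout(keys, values, queries, lam=1.0):
    """Return o_t = G_t (H_t + lam I)^{-1} q_t for every time step t."""
    d_k, d_v = len(keys[0]), len(values[0])
    # H starts at lam * I, so it stays symmetric positive definite for lam > 0.
    H = [[lam if i == j else 0.0 for j in range(d_k)] for i in range(d_k)]
    G = [[0.0] * d_k for _ in range(d_v)]
    outputs = []
    for k, v, q in zip(keys, values, queries):
        for i in range(d_k):
            for j in range(d_k):
                H[i][j] += k[i] * k[j]   # accumulate k k^T
        for i in range(d_v):
            for j in range(d_k):
                G[i][j] += v[i] * k[j]   # accumulate v k^T
        x = conjugate_gradient(H, q)     # x = (H_t + lam I)^{-1} q_t
        outputs.append(matvec(G, x))
    return outputs
```

In this sketch the recurrent state is still of fixed size (`G` and `H`), but every output requires a linear solve, so extra compute is spent on each token.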
Experiments on models up to the billion-parameter scale show that optimal test-time training yields lower language modeling perplexity and better downstream benchmark performance than prior RNNs, especially on tasks that require long-context understanding. This gain comes at the cost of additional floating-point operations (FLOPs) during inference. Our findings thus fit the emerging trend of spending more test-time compute to improve performance, here by devoting compute to solving sequential optimization problems inside the neural network itself.
Source: arXiv
Generated at: 2026-06-04 00:00:00 UTC





