arXiv

MesaNet: Sequence Modeling by Locally Optimal Test-Time Training

Abstract:

Causal transformers built on softmax self-attention currently dominate sequence modeling, but their memory and compute requirements during inference grow linearly with sequence length. To address this, recent work has linearized the softmax operation, producing powerful recurrent neural networks (RNNs) such as DeltaNet, Mamba, and xLSTM, whose memory and compute costs per token stay constant. These models can be unified by noting that their recurrent dynamics can all be derived from an in-context regression objective that is approximately optimized by an online learning rule.
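To make the "online learning rule" view concrete, here is a minimal sketch of a delta-rule-style update of the kind associated with DeltaNet-like layers. It assumes a squared-error in-context loss 0.5 * ||v - W k||^2 per token and a fixed step size; the function names (`delta_rule_step`, `read_out`) and the plain-list matrix representation are illustrative choices, not taken from the paper, and gated models such as Mamba or xLSTM use different update rules.

```python
def read_out(W, q):
    """Apply the fast-weight matrix W (list of rows) to a query vector q."""
    return [sum(w_ij * q_j for w_ij, q_j in zip(row, q)) for row in W]


def delta_rule_step(W, k, v, beta):
    """One gradient step on 0.5 * ||v - W k||^2 with step size beta.

    The gradient with respect to W is (W k - v) k^T, so the update is
    W <- W - beta * (W k - v) k^T. The state W has a fixed size, so memory
    and compute per token do not grow with sequence length.
    """
    pred = read_out(W, k)
    err = [p_i - v_i for p_i, v_i in zip(pred, v)]
    return [[w_ij - beta * e_i * k_j for w_ij, k_j in zip(row, k)]
            for row, e_i in zip(W, err)]


# Toy usage: store one key/value pair, then query with the same key.
W = [[0.0, 0.0], [0.0, 0.0]]
W = delta_rule_step(W, k=[1.0, 0.0], v=[2.0, 3.0], beta=1.0)
recalled = read_out(W, [1.0, 0.0])
```

With a unit-norm key and `beta = 1.0`, a single step fits the pair exactly; in general one step only reduces the loss, which is the sense in which the objective is optimized approximately.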

Building on this line of work, we introduce a scalable, chunkwise parallelizable variant of the Mesa layer (von Oswald et al., 2024); the original Mesa layer was restricted to sequential processing and therefore did not scale. In contrast to the online rules above, which optimize their objective only approximately, this layer minimizes an in-context loss to optimality at every time step, using a fast conjugate gradient solver that keeps the computation numerically stable.
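The idea of minimizing the in-context loss to optimality can be sketched as follows. This is a sequential toy version, not the chunkwise parallel algorithm of the paper. It assumes a ridge-regularized least-squares loss, so the optimal readout at step t is H_t (G_t + lam * I)^{-1} q_t with G_t the running sum of k k^T and H_t the running sum of v k^T. The names `conjugate_gradient` and `mesa_readout`, the regularizer `lam`, and the omission of any gating or forgetting are assumptions made for illustration.

```python
def matvec(A, x):
    """Multiply matrix A (list of rows) by vector x."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def conjugate_gradient(A, b, iters=50, tol=1e-12):
    """Solve A x = b for a symmetric positive definite A, starting from x = 0."""
    x = [0.0] * len(b)
    r = list(b)          # residual b - A x for x = 0
    p = list(r)          # initial search direction
    rs = dot(r, r)
    for _ in range(iters):
        if rs < tol:     # squared residual norm small enough: stop
            break
        Ap = matvec(A, p)
        alpha = rs / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x


def mesa_readout(keys, values, queries, lam=1.0):
    """At each step, solve (G + lam*I) x = q by CG and return H x."""
    d, dv = len(keys[0]), len(values[0])
    G = [[lam if i == j else 0.0 for j in range(d)] for i in range(d)]
    H = [[0.0] * d for _ in range(dv)]
    outputs = []
    for k, v, q in zip(keys, values, queries):
        for i in range(d):          # G += k k^T
            for j in range(d):
                G[i][j] += k[i] * k[j]
        for i in range(dv):         # H += v k^T
            for j in range(d):
                H[i][j] += v[i] * k[j]
        x = conjugate_gradient(G, q)
        outputs.append(matvec(H, x))
    return outputs


# Toy usage: two orthogonal keys, each queried right after it is stored.
outs = mesa_readout(keys=[[1.0, 0.0], [0.0, 1.0]],
                    values=[[2.0], [3.0]],
                    queries=[[1.0, 0.0], [0.0, 1.0]],
                    lam=1e-6)
```

The linear solve at every step is where the extra inference compute goes: each conjugate gradient iteration costs one matrix-vector product, so the iteration budget trades accuracy of the solve against flops.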

In extensive experiments with models up to the billion-parameter scale, optimal test-time training yields lower language modeling perplexity and higher downstream benchmark performance than prior RNNs, especially on tasks that require long-context understanding. This gain comes at the cost of additional floating-point operations (FLOPs) at inference time. Our findings therefore relate to the recent trend of spending more test-time compute to improve performance, here by dedicating compute to solving sequential optimization problems inside the neural network itself.


Source: arXiv
Generated at: 2026-06-04 00:00:00 UTC
