arXiv

CART: Context-Anchored Recurrent Transformer -- A Parameter-Efficient Architecture with Learned Stability

Abstract:

This paper introduces CART (Context-Anchored Recurrent Transformer), a language model designed for parameter efficiency by reusing a single core block $R$ times throughout its depth. In contrast to earlier looped transformer architectures that recalculate key-value tensors during each iteration, CART generates $K$ and $V$ matrices just once from a multi-layer prelude. The recurrent core then interacts with these fixed tensors through multi-head latent attention. To ensure stability within the recurrence, the model employs a learned Linear Time-Invariant (LTI) gate. Across all 36 fully trained configurations, the spectral radius of this gate remains tightly constrained within the range $[0.79, 0.83]$.
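The control flow described above can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a single attention head in place of multi-head latent attention, a diagonal gate applied as a convex blend, and hypothetical names (`attend`, `anchored_recurrence`, `spectral_radius_diagonal`). The abstract does not specify the actual form of the LTI gate.

```python
import math


def attend(query, keys, values):
    """Single-head dot-product attention of one query over fixed keys/values."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(dim)]


def anchored_recurrence(h0, keys, values, gate, num_loops):
    """Apply one shared update num_loops times against the same keys/values.

    Assumption: the gate is diagonal and blends the previous state with the
    core's output; the paper's gate may be parameterized differently.
    """
    h = list(h0)
    for _ in range(num_loops):
        update = attend(h, keys, values)  # keys/values are never recomputed
        h = [a * x + (1.0 - a) * u for a, x, u in zip(gate, h, update)]
    return h


def spectral_radius_diagonal(gate):
    """For a diagonal matrix, the spectral radius is the largest |entry|."""
    return max(abs(a) for a in gate)


# Hypothetical toy inputs: two anchor positions, state width 2, R = 6 loops.
keys = [[0.5, 0.5], [-0.5, 0.25]]
values = [[0.2, 0.4], [-0.1, 0.3]]
gate = [0.79, 0.83]
state = anchored_recurrence([1.0, -1.0], keys, values, gate, num_loops=6)
```

With every gate entry below 1 in magnitude, the contribution of the initial state shrinks geometrically with each loop. This is the sense in which a spectral radius below 1 keeps such a recurrence stable.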

We evaluated CART on single consumer GPUs in two phases. The first phase screened 64 configurations for 3,000 training steps. The second phase trained 36 selected configurations for approximately 1 billion tokens (30,500 steps), with the prelude depth $P$ set to 6 and the recurrence count $R$ set to 6, 8, or 10, across three random seeds.

Analysis across widths $d \in \{256, 512, 768, 1024\}$ revealed two consistent trends. First, the prelude depth $P$ has a larger effect than the loop count $R$. Second, the ranking of $R$ values observed in the first phase inverted during full training: $R=6$ emerged as the best choice for widths $d \geq 512$.

In a parameter-parity test at the binding width $d=1024$, CART failed to outperform a dense baseline matched for parameter count. It underperformed by 1–2% when matched on stored parameters and by roughly 10% when matched on effective parameters. Diagnostic ablation studies attributed the effective-parameter gap to two roughly equal components: a loss of approximately 5% due to weight sharing and another 5% stemming from the model’s architecture, which separates the prelude, anchor, core, and coda. Furthermore, components of the recurrent core machinery (hyper-connections, the LTI gate, and loop-index embeddings) were found to be individually vestigial. Finally, variable-$R$ inference performed poorly outside the trained $R$ values, indicating that test-time depth scaling is not viable under this specific architectural recipe.
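The distinction between stored and effective parameters can be made concrete with a rough count. The sketch below follows the usual reading for looped models, in which stored parameters count the shared core once and effective parameters count it once per loop. The paper's exact accounting may differ. The per-block estimate of $12d^2$ weights, the coda depth of 2, and the function names are all illustrative assumptions, not values from the source.

```python
def block_params(d):
    """Rough weight count for one transformer block of width d.

    Assumption: 4*d^2 for attention projections plus 8*d^2 for an MLP with
    4x expansion, ignoring biases, norms, and embeddings. A block using
    multi-head latent attention would have a different count.
    """
    return 12 * d * d


def cart_param_counts(d, prelude_depth, loops, coda_depth):
    """Return (stored, effective) block-weight counts for a looped model."""
    block = block_params(d)
    # Stored: the shared core block is kept in memory exactly once.
    stored = (prelude_depth + 1 + coda_depth) * block
    # Effective: the core is counted once for each of its applications.
    effective = (prelude_depth + loops + coda_depth) * block
    return stored, effective


# Hypothetical configuration: d = 1024, P = 6, R = 6, coda depth 2.
stored, effective = cart_param_counts(1024, prelude_depth=6, loops=6, coda_depth=2)
ratio = effective / stored
```

Under these assumptions, a dense baseline matched on effective parameters is larger than one matched on stored parameters, which is consistent with the wider gap reported for the effective-parameter comparison.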


Source: arXiv Generated at: 2026-06-02 00:00:00 UTC
