CART: Context-Anchored Recurrent Transformer -- A Parameter-Efficient Architecture with Learned Stability
Abstract:
This paper introduces CART (Context-Anchored Recurrent Transformer), a language model designed for parameter efficiency by reusing a single core block $R$ times throughout its depth. In contrast to earlier looped transformer architectures that recalculate key-value tensors during each iteration, CART generates $K$ and $V$ matrices just once from a multi-layer prelude. The recurrent core then interacts with these fixed tensors through multi-head latent attention. To ensure stability within the recurrence, the model employs a learned Linear Time-Invariant (LTI) gate. Across all 36 fully trained configurations, the spectral radius of this gate remains tightly constrained within the range $[0.79, 0.83]$.
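The anchored recurrence described above can be illustrated with a toy sketch. This is not the paper's implementation. It assumes a diagonal gate of the form h <- a * h + (1 - a) * u, and plain single-head dot-product attention stands in for multi-head latent attention. The function names and the gate form are hypothetical and chosen only for illustration.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    # Single-query dot-product attention; a stand-in for multi-head latent attention.
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    return [sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))]

def spectral_radius(gate):
    # For a diagonal gate matrix, the spectral radius is the largest |entry|.
    return max(abs(a) for a in gate)

def anchored_recurrence(h0, keys, values, gate, R):
    # keys/values are computed once (the "anchor") and reused in every iteration.
    h = list(h0)
    for _ in range(R):
        update = attend(h, keys, values)  # the shared core reads the fixed K and V
        # Assumed LTI gate: constant diagonal A = diag(gate), input weight I - A.
        h = [a * hi + (1.0 - a) * ui for a, hi, ui in zip(gate, h, update)]
    return h
```

Under this assumed gate form, a spectral radius below 1 means the contribution of the initial state shrinks geometrically with the loop count, which is one way to read the stability claim.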
We evaluated CART on single consumer GPUs in two phases. The first phase screened 64 configurations for 3,000 training steps. The second phase trained 36 selected configurations for approximately 1 billion tokens (30,500 steps), with the prelude depth fixed at $P = 6$ and the recurrence count $R$ set to 6, 8, or 10, across three random seeds.
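As a consistency check on the second-phase numbers, the sketch below enumerates one grid that yields 36 runs. It assumes the four widths reported in the analysis are crossed with the three $R$ values and three seeds at a single prelude depth. The abstract does not state this factorization, and the seed labels are hypothetical.

```python
from itertools import product

WIDTHS = (256, 512, 768, 1024)  # widths named in the abstract's analysis
PRELUDE_DEPTH = 6               # assumed to be held at P = 6 in the second phase
RECURRENCES = (6, 8, 10)        # R values from the abstract
SEEDS = (0, 1, 2)               # hypothetical labels; the source gives only the count

def second_phase_grid():
    # One dict per run in the assumed width x R x seed grid.
    return [
        {"d": d, "P": PRELUDE_DEPTH, "R": r, "seed": s}
        for d, r, s in product(WIDTHS, RECURRENCES, SEEDS)
    ]

def tokens_per_step(total_tokens=1_000_000_000, steps=30_500):
    # Implied average batch size in tokens, using the approximate token total.
    return total_tokens / steps
```

With these assumptions the grid has 4 x 3 x 3 = 36 entries, and the stated budget implies roughly 32.8 thousand tokens per step.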
Analysis across widths $d \in \{256, 512, 768, 1024\}$ revealed two consistent trends. First, the prelude depth $P$ has a larger impact than the loop count $R$. Second, the ranking of $R$ values observed in the first-phase screening inverted under full training: $R = 6$ emerged as the best choice for widths $d \geq 512$.
In a parameter-parity test at the binding width $d = 1024$, CART did not outperform a dense baseline matched for parameter count: it underperformed by 1–2% when matched on stored parameters and by roughly 10% when matched on effective parameters. Diagnostic ablations split the effective-parameter gap into two roughly equal components: approximately 5% from weight sharing and another 5% from the architecture's division into prelude, anchor, core, and coda. Components of the recurrent core machinery (hyper-connections, the LTI gate, and loop-index embeddings) were each found to be vestigial. Finally, variable-$R$ inference performed poorly outside the trained $R$ values, indicating that test-time depth scaling is not viable under this architectural recipe.
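The two parity notions can be made concrete with a small sketch. It assumes that "stored" counts the shared core once and that "effective" counts it once per loop iteration, which is a common convention for looped models but is not defined in the abstract. The component sizes in the usage note are hypothetical, and the anchor projection is folded into the prelude for simplicity.

```python
def stored_params(prelude, core, coda):
    # Parameters held in memory: the shared core is counted once.
    return prelude + core + coda

def effective_params(prelude, core, coda, R):
    # Unrolled count (assumed definition): the core is counted once per iteration.
    return prelude + R * core + coda

def parity_targets(prelude, core, coda, R):
    # Sizes a dense baseline would need under each matching rule.
    return {
        "stored": stored_params(prelude, core, coda),
        "effective": effective_params(prelude, core, coda, R),
    }
```

For example, with hypothetical sizes of 60M (prelude), 10M (core), and 10M (coda) at $R = 6$, the stored target is 80M and the effective target is 130M, so the effective-parameter baseline is the larger and harder comparison.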
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC





