arXiv

CRMA: A Spectrally-Bounded Backbone for Modular Continual Fine-Tuning of LLMs

Abstract:

When sequentially fine-tuning large language models, practitioners typically face a dilemma: either allow the shared model parameters to continue learning, risking catastrophic forgetting, or freeze them after the initial task, thereby preventing any cross-task refinement. Existing modular approaches, such as AdapterFusion, LoRAHub, PackNet, and Progressive Networks, opt for the latter strategy. In contrast, we present CRMA (Constrained Residual Mixing Adapter), a novel residual adapter that employs Sinkhorn normalization to ensure its internal mixing matrix, $M$, remains doubly stochastic during every forward pass. By Birkhoff’s theorem, a doubly stochastic matrix is a convex combination of permutation matrices, each of spectral norm 1, so this construction guarantees that $\|M\|_2 \le 1$, establishing a structural spectral bound rather than relying on penalty terms.
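
The Sinkhorn step described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `sinkhorn`, the exponential parameterization of the unconstrained matrix, the 8x8 size, and the iteration count are all assumptions made for illustration. It alternately normalizes rows and columns of a strictly positive matrix and then checks how close the result is to doubly stochastic and what its spectral norm is.

```python
import numpy as np


def sinkhorn(logits: np.ndarray, n_iters: int = 100) -> np.ndarray:
    """Map an unconstrained square matrix to an approximately doubly stochastic one.

    Hypothetical helper for illustration; the paper's exact parameterization
    is not given in the abstract.
    """
    # Exponentiate so every entry is strictly positive (shift by max for stability).
    m = np.exp(logits - logits.max())
    for _ in range(n_iters):
        m = m / m.sum(axis=1, keepdims=True)  # make each row sum to 1
        m = m / m.sum(axis=0, keepdims=True)  # make each column sum to 1
    return m


rng = np.random.default_rng(0)
M = sinkhorn(rng.normal(size=(8, 8)))

# Largest deviation of any row or column sum from 1.
row_err = np.abs(M.sum(axis=1) - 1.0).max()
col_err = np.abs(M.sum(axis=0) - 1.0).max()

# Spectral norm = largest singular value of M.
spec_norm = np.linalg.norm(M, 2)

print(f"row_err={row_err:.2e} col_err={col_err:.2e} spectral_norm={spec_norm:.8f}")
```

Because the bound comes from the normalization itself, no penalty term appears in the sketch; with a finite number of iterations the matrix is only approximately doubly stochastic, so the norm is bounded by 1 up to that approximation error.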

This spectrally bounded backbone enables the continuous training of a shared substrate—a capability absent in previous modular methods—while still maintaining robust guarantees against forgetting. In evaluations on Mistral-7B across five sequential domains and three random seeds, applying modular per-task LoRA atop a CRMA backbone reduced loss-relative drift from +42.96% ± 5.5 (observed in naive sequential fine-tuning) to -0.17% ± 0.17, with non-overlapping ranges across seeds. Furthermore, this approach improved holdout loss for prior tasks by 1.99% ± 0.54 compared to a matched baseline with a frozen substrate.

Three distinct experimental configurations (a controlled ablation on Mistral-7B with 4 domains, a contamination-controlled replication using TinyLlama with 3 domains, and cross-domain probes on Mistral-7B) all demonstrated positive backward transfer. Notably, these results were achieved without replay buffers, without increasing per-task memory overhead, and without utilizing distillation techniques.

Additional inference-time ablations on Gemma-2-9B confirmed that CRMA effectively mediates access to knowledge acquired during sequential training: performance reached 98/100 correct answers versus 38/100 on identical weights and questions, differing only by the toggling of CRMA injection. Finally, 867 logged training steps verified that $\|M\|_2$ remained at 1.0 within float32 precision, with a maximum deviation of $1.2 \times 10^{-7}$. The efficacy of this forgetting-prevention mechanism was validated across model sizes ranging from 1.1B to 9.2B parameters and across four distinct architecture families.
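
Why the logged norm sits at 1.0 rather than merely below it can be sketched in two inequalities. The sketch assumes $M$ is exactly doubly stochastic (finite Sinkhorn iterations only approximate this); the symbols $\theta_k$, $P_k$, and $\mathbf{1}$ are introduced here for illustration and denote the convex weights and permutation matrices from Birkhoff's theorem and the all-ones vector.

$$
\|M\|_2 = \Big\|\sum_k \theta_k P_k\Big\|_2 \le \sum_k \theta_k \|P_k\|_2 = \sum_k \theta_k = 1,
\qquad
\|M\|_2 \ge \frac{\|M\mathbf{1}\|_2}{\|\mathbf{1}\|_2} = \frac{\|\mathbf{1}\|_2}{\|\mathbf{1}\|_2} = 1.
$$

Together these give $\|M\|_2 = 1$. The reported maximum deviation of $1.2 \times 10^{-7}$ is roughly one float32 machine epsilon ($2^{-23} \approx 1.19 \times 10^{-7}$), which is the size of error expected from rounding alone.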


Source: arXiv Generated at: 2026-06-02 00:00:00 UTC
