arXiv

CRMA: A Spectrally-Bounded Backbone for Modular Continual Fine-Tuning of LLMs

June 2, 2026 · Kiran Nayudu, Aswini Nutakki, Sai Vinay Naidu, Ashwin Shanmugasundaram

Abstract:

When sequentially fine-tuning large language models, practitioners face a dilemma: either let the shared model parameters keep learning and risk catastrophic forgetting, or freeze them after the initial task and give up any cross-task refinement. Existing modular approaches, such as AdapterFusion, LoRAHub, PackNet, and Progressive Networks, take the latter route. In contrast, we present CRMA (Constrained Residual Mixing Adapter), a novel residual adapter that applies Sinkhorn normalization so that its internal mixing matrix $M$ is doubly stochastic on every forward pass. By Birkhoff's theorem, a doubly stochastic matrix is a convex combination of permutation matrices, each of which has spectral norm 1, so this construction guarantees $\|M\|_2 \le 1$. The spectral bound is therefore structural rather than dependent on penalty terms.
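The abstract does not specify how $M$ is parameterized, so the following is only a minimal NumPy sketch of the general idea, assuming $M$ is obtained by exponentiating unconstrained logits and alternately normalizing rows and columns for a fixed number of iterations. The function name `sinkhorn`, the matrix size, and the iteration count are illustrative assumptions, not details from the paper.

```python
import numpy as np

def sinkhorn(logits, n_iters=500):
    """Alternately normalize rows and columns of exp(logits).

    Hypothetical helper: the result is (approximately) doubly stochastic.
    """
    # Subtract the max before exponentiating for numerical stability.
    M = np.exp(logits - logits.max())
    for _ in range(n_iters):
        M = M / M.sum(axis=1, keepdims=True)  # make each row sum to 1
        M = M / M.sum(axis=0, keepdims=True)  # make each column sum to 1
    return M

rng = np.random.default_rng(0)
logits = rng.normal(scale=0.5, size=(6, 6))  # stand-in for learned parameters
M = sinkhorn(logits)

row_err = np.abs(M.sum(axis=1) - 1.0).max()  # deviation of row sums from 1
col_err = np.abs(M.sum(axis=0) - 1.0).max()  # deviation of column sums from 1
spec = np.linalg.norm(M, 2)                  # spectral norm (largest singular value)

print(f"row error {row_err:.2e}, column error {col_err:.2e}, spectral norm {spec:.8f}")
```

Because the all-ones vector is mapped to itself by a doubly stochastic matrix, the spectral norm is not just bounded by 1 but equal to 1 once the normalization has converged, which is consistent with the logged values reported later in the abstract.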

This spectrally bounded backbone enables the continuous training of a shared substrate—a capability absent in previous modular methods—while still maintaining robust guarantees against forgetting. In evaluations on Mistral-7B across five sequential domains and three random seeds, applying modular per-task LoRA atop a CRMA backbone reduced loss-relative drift from +42.96% ± 5.5 (observed in naive sequential fine-tuning) to -0.17% ± 0.17, with non-overlapping ranges across seeds. Furthermore, this approach improved holdout loss for prior tasks by 1.99% ± 0.54 compared to a matched baseline with a frozen substrate.
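The abstract does not define loss-relative drift. One plausible reading, stated here purely as an assumption, is the percentage change in a prior task's holdout loss after training on later tasks, so that positive values indicate forgetting and negative values indicate backward transfer. The function name and the loss values below are hypothetical and are not taken from the paper.

```python
def loss_relative_drift(loss_before, loss_after):
    """Percent change in a prior task's holdout loss (assumed definition).

    Positive: the loss rose after later training (forgetting).
    Negative: the loss fell (positive backward transfer).
    """
    return 100.0 * (loss_after - loss_before) / loss_before

# Hypothetical losses chosen only to illustrate the sign convention.
forgetting = loss_relative_drift(loss_before=2.0, loss_after=3.0)
improvement = loss_relative_drift(loss_before=2.0, loss_after=1.9)
print(forgetting, improvement)
```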

Three distinct experimental configurations all demonstrated positive backward transfer: a controlled ablation on Mistral-7B (4 domains), a contamination-controlled replication using TinyLlama (3 domains), and cross-domain probes on Mistral-7B. Notably, these results were achieved without replay buffers, without increasing per-task memory overhead, and without distillation.

Additional inference-time ablations on Gemma-2-9B confirmed that CRMA mediates access to knowledge acquired during sequential training: the model answered 98/100 questions correctly versus 38/100 on identical weights and questions, with the only difference being the toggling of CRMA injection. Finally, 867 logged training steps verified that $\|M\|_2$ remained at 1.0 within float32 precision, with a maximum deviation of $1.2 \times 10^{-7}$. The forgetting-prevention mechanism was validated across model sizes ranging from 1.1B to 9.2B parameters and across four distinct architecture families.
