RESEARCH PAPER · FLAGSHIP · April 2026 · 15 min read
PATENT PENDING

CRMA: Modular LoRA for Continual Fine-Tuning of Large Language Models Without Catastrophic Forgetting

Each task gets its own fresh LoRA adapter. They all compose against a CRMA backbone whose internal mixing matrix is doubly-stochastic at every forward pass, so its spectral norm is bounded by 1 by Birkhoff's theorem. The modular architecture does the forgetting prevention; CRMA's specific contribution is the spectrally bounded substrate that lets the shared backbone keep training instead of being frozen. Result on Mistral-7B across 5 sequential domains: -0.17% ± 0.17 loss-relative drift across 3 seeds, versus +42.96% ± 5.5 for naive sequential fine-tuning.

Authors: Kiran Nayudu, Aswini Nutakki, Sai Vinay Naidu, Ashwin Shanmugasundaram · ModelBrew AI

The Headline Numbers

−0.17%
MODULAR drift (3 seeds ± 0.17)
+42.96%
NAIVE forgetting (3 seeds ± 5.5)

Per-seed MODULAR range $[-0.36\%, -0.03\%]$ and per-seed NAIVE range $[+38.1\%, +49.0\%]$ are disjoint at every seed. The effect size is two orders of magnitude larger than the between-seed standard deviation.

Protocol: Mistral-7B, 5 sequential domains (Medical, Legal, Financial, Code, Science), 500 train / 100 holdout per domain, seeds 0 / 42 / 1234. Loss-relative backward transfer as defined in Section 7.1.3 of the paper; positive values indicate forgetting.
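For readers who want the metric in concrete terms, here is a minimal sketch of one standard formulation of loss-relative backward transfer. The paper's exact definition lives in Section 7.1.3, so the function and the example losses below are illustrative assumptions, not the published implementation.

```python
# Minimal sketch of loss-relative backward transfer (assumed formulation;
# the paper's exact definition is in Section 7.1.3). Positive = forgetting.

def loss_relative_backward_transfer(loss_after_task: float,
                                    loss_after_sequence: float) -> float:
    """Percentage change in a domain's holdout loss between the end of that
    domain's own training phase and the end of the full sequential run."""
    return 100.0 * (loss_after_sequence - loss_after_task) / loss_after_task

# Hypothetical holdout losses for one domain across a sequential run:
print(loss_relative_backward_transfer(1.250, 1.248))  # ~ -0.16%  (no forgetting)
print(loss_relative_backward_transfer(1.250, 1.787))  # ~ +43%    (severe forgetting)
```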

The Honest Reframe: Where the Benefit Comes From

The paper's most important finding is the FROZEN-vs-MODULAR comparison. We ran three conditions on identical sequential data: NAIVE (a single shared LoRA fine-tuned straight through the task sequence), FROZEN (one fresh LoRA per task over a backbone that never updates), and MODULAR (one fresh LoRA per task over a CRMA backbone that keeps training).

FROZEN already gives essentially all of the forgetting-prevention effect — zero drift by construction, since the backbone never moves. CRMA adds a 1.99% ± 0.54 learning advantage (across 3 seeds) on top of FROZEN, positive at every seed (1.46%, 1.97%, 2.54%).

This reframes the paper's pitch honestly: most of the forgetting-prevention benefit comes from the modular per-task adapter architecture — giving each task its own fresh LoRA rather than sharing a single one. CRMA's specific contribution is the spectrally bounded substrate that keeps the shared backbone stable enough to continue training across tasks without interfering with prior adapters.
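As an illustration of the modular pattern this paragraph describes, here is a minimal per-task LoRA sketch in plain PyTorch: one shared base weight, one fresh low-rank adapter per task. The class name, ranks, and scaling are illustrative, and CRMA's mixing-matrix substrate is deliberately not modeled here.

```python
import torch
import torch.nn as nn

class PerTaskLoRALinear(nn.Module):
    """Shared base weight plus one independent LoRA (A, B) pair per task.
    Illustrative sketch only; names and ranks are not from the paper."""

    def __init__(self, d_in: int, d_out: int, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        # Shared backbone weight: frozen in the FROZEN condition,
        # kept trainable over a CRMA substrate in the MODULAR condition.
        self.base = nn.Linear(d_in, d_out, bias=False)
        self.adapters = nn.ModuleDict()          # task_id -> LoRA pair
        self.rank, self.scale = rank, alpha / rank

    def add_task(self, task_id: str):
        # Fresh adapter: B starts at zero so a new task begins from the backbone.
        a = nn.Linear(self.base.in_features, self.rank, bias=False)
        b = nn.Linear(self.rank, self.base.out_features, bias=False)
        nn.init.zeros_(b.weight)
        self.adapters[task_id] = nn.Sequential(a, b)

    def forward(self, x: torch.Tensor, task_id: str) -> torch.Tensor:
        out = self.base(x)
        if task_id in self.adapters:
            out = out + self.scale * self.adapters[task_id](x)
        return out

layer = PerTaskLoRALinear(64, 64)
for task in ["medical", "legal", "financial", "code", "science"]:
    layer.add_task(task)
y = layer(torch.randn(2, 64), task_id="legal")   # routes through the legal adapter only
```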

Prior art acknowledgment: Modular per-task adapters are not novel. LoRAHub, X-LoRA, LoRAMoE, AdapterFusion, PackNet, and Progressive Networks all share the "one adapter per task" idea. What CRMA adds is the structural non-expansiveness guarantee that lets the shared backbone keep learning instead of being frozen.

3-Seed Results: Mistral-7B, 5 Domains

Seed | MODULAR drift | NAIVE forget | FROZEN drift | MODULAR holdout loss | FROZEN holdout loss | MODULAR advantage
0 | −0.03% | +38.1% | +1.47% | 1.2530 | 1.2716 | 1.46%
42 | −0.10% | +41.7% | +1.66% | 1.2657 | 1.2912 | 1.97%
1234 | −0.36% | +49.0% | +2.71% | 1.2705 | 1.3036 | 2.54%
3-seed avg | −0.17% ± 0.17 | +42.96% ± 5.5 | +1.95% ± 0.64 | 1.2631 ± 0.009 | 1.2888 ± 0.016 | 1.99% ± 0.54

Welch's unequal-variance $t$-test on the MODULAR-vs-NAIVE means gives $t \approx 13.6$ (df $\approx 2$, $p \approx 0.005$). The exact 3-vs-3 permutation test is bounded from below by its combinatorial floor: $p = 2/20 = 0.10$ two-sided, $p = 1/20 = 0.05$ one-sided. Neither number is the primary argument; the disjoint per-seed ranges are.
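Both statistics are straightforward to reproduce from the per-seed values in the table above; a short SciPy sketch (variable names are ours) follows.

```python
from itertools import combinations
import numpy as np
from scipy import stats

modular = np.array([-0.03, -0.10, -0.36])    # MODULAR drift per seed (%)
naive   = np.array([38.1, 41.7, 49.0])       # NAIVE forgetting per seed (%)

# Welch's unequal-variance t-test on the two group means.
t, p = stats.ttest_ind(naive, modular, equal_var=False)
print(f"Welch t = {t:.1f}, p = {p:.4f}")

# Exact 3-vs-3 permutation test on the difference of group means.
# With C(6,3) = 20 relabelings, the smallest attainable one-sided p is 1/20.
pooled = np.concatenate([naive, modular])
observed = naive.mean() - modular.mean()
splits = list(combinations(range(6), 3))
count = 0
for idx in splits:
    a = pooled[list(idx)]
    b = np.delete(pooled, list(idx))
    if abs(a.mean() - b.mean()) >= abs(observed) - 1e-12:
        count += 1
print(f"two-sided permutation p = {count}/{len(splits)} = {count/len(splits):.2f}")
```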

Inference-Time Ablation: Gemma-2-9B, 98/100 vs 38/100

On Gemma-2-9B, we evaluated a single set of sequentially trained model weights on the same 100 held-out questions (20 per domain) under two conditions: with CRMA injected at inference, and with the CRMA module bypassed. This isolates CRMA as the access mechanism for sequentially trained knowledge.

Domain | Without CRMA | With CRMA
Medical (D1) | 5/20 | 20/20
Enterprise (D2) | 12/20 | 19/20
Finance (D3) | 8/20 | 20/20
Military (D4) | 3/20 | 19/20
Real Estate (D5) | 10/20 | 20/20
Total | 38/100 | 98/100

With CRMA: Wilson 95% CI $[93.0\%, 99.5\%]$. Without CRMA: Wilson 95% CI $[29.0\%, 47.8\%]$. The intervals are disjoint. McNemar's exact test on the paired outcomes gives $p \lesssim 2 \times 10^{-16}$ under the worst-case paired assumption.
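The interval and test arithmetic can be reproduced directly from the counts in the table. The sketch below implements the Wilson score interval by hand and, since the per-question pairing is not published here, encodes the worst-case pairing the text assumes for McNemar's test.

```python
import math
from scipy.stats import binomtest

def wilson_ci(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

print(wilson_ci(98, 100))   # ~ (0.93, 0.99)
print(wilson_ci(38, 100))   # ~ (0.29, 0.48)

# Worst-case pairing: both of the with-CRMA misses fall on questions the
# no-CRMA run got right, giving 62 vs 2 discordant pairs. Exact McNemar is
# then a two-sided binomial test at p = 0.5 on the discordant pairs.
print(binomtest(2, 2 + 62, 0.5).pvalue)   # ~ 2e-16
```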

Multi-Model Coverage: 5 Models, 4 Architecture Families

The architecture holds across 5 production language models from 4 architecture families, with no model-specific tuning:

Model | Family | Params | Domains | QA Score | Wilson 95% CI
Phi-4-mini | Microsoft Phi | ~3.8B | 3 | 3/3 | [44%, 100%]
Mistral-7B | Mistral | 7.24B | 5 | 26/31 (84%) | [68%, 93%]
Saul-7B | Mistral (legal) | 7.24B | 3 legal sub-domains | 18/18 | [81.5%, 100%]
Gemma-2-9B | Google Gemma | 9.24B | 5 | 30/31 (97%) | [83%, 99.5%]
TinyLlama-1.1B | LLaMA | 1.1B | 4 | Training dynamics only | —

QA scoring is first-author evaluation against pre-written reference answers — single rater, unblinded, not a domain expert. We are explicit about this limitation.

The Saul-7B 18/18 retention on three sequential legal sub-domains is the strongest single-vertical evidence in the paper. Saul-7B is a Mistral-7B variant continued-pretrained for legal practice; CRMA's modular architecture preserved correct answers on all 18 reference questions across the full sequential chain.

Standard Benchmark Preservation: Gemma-2-9B

To check whether the modular architecture preserves general capabilities beyond the trained domains, we ran Gemma-2-9B on six standard benchmarks using lm-evaluation-harness before and after the full 5-domain sequential training:

Benchmark | Base | After 5 phases | Delta
HellaSwag | 79.94% | 76.32% | −3.62 pp
ARC-Challenge | 65.36% | 62.12% | −3.24 pp
TruthfulQA | 60.18% | 56.53% | −3.65 pp
WinoGrande | 76.01% | 74.35% | −1.66 pp
MMLU | 71.87% | 65.41% | −6.46 pp
GSM8K (corrected) | 79.68% | 69.5% | −10.2 pp

HellaSwag, ARC-Challenge, TruthfulQA, and WinoGrande degrade by 1.7–3.7 pp. MMLU and GSM8K degrade more (−6.5 pp and −10 to −13 pp respectively). We report these in full rather than hiding them. We do not attribute the math-reasoning degradation mechanistically in this paper; a LoRA-only vs LoRA+CRMA isolation ablation is planned follow-up work.

GSM8K note: the automated scorer initially reported −19.3 pp. A full 1,319-question diagnostic revealed the fine-tuned model never uses the "####" format standard scorers require; a corrected scorer verified on 220 manual reviews recovered the true degradation of −10.2 pp.
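The corrected scorer itself is not published here. A minimal sketch of the underlying idea, an answer-extraction fallback that prefers the "####" marker but tolerates its absence, would look roughly like this (regexes and function name are illustrative, not the paper's scorer).

```python
import re
from typing import Optional

def extract_gsm8k_answer(completion: str) -> Optional[str]:
    """Prefer the standard '#### <answer>' marker; otherwise fall back to the
    last number in the completion. Illustrative sketch, not the actual scorer."""
    m = re.search(r"####\s*(-?[\d,]+(?:\.\d+)?)", completion)
    if m:
        return m.group(1).replace(",", "")
    numbers = re.findall(r"-?\d[\d,]*(?:\.\d+)?", completion)
    return numbers[-1].replace(",", "") if numbers else None

print(extract_gsm8k_answer("... so the total is 42 apples. #### 42"))   # '42'
print(extract_gsm8k_answer("The farmer ends up with 1,250 dollars."))   # '1250'
```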

Spectral-Norm Structural Invariant

CRMA enforces the spectral bound through the parameterization itself, not through a penalty or clipping. To validate this empirically, we logged the mixing matrix's spectral norm every 5 training steps across all 5 Gemma-2-9B domains — 867 checkpoints in total.

867
Logged training steps
< 1.2 × 10⁻⁷
Max deviation from 1.0

$\|\mathbf{M}\|_2 = 1.0$ within float32 precision at every logged step. The bound held at every step without corrective intervention — no gradient clipping, no re-normalization, no periodic projection. The doubly-stochastic parameterization enforces it by construction.
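The parameterization itself is not reproduced in this summary, but the property it relies on is easy to illustrate. One standard way to obtain a doubly stochastic matrix is Sinkhorn normalization (Sinkhorn, 1964); by Birkhoff's theorem such a matrix is a convex combination of permutation matrices, each with spectral norm 1, so $\|\mathbf{M}\|_2 \le 1$, and $\mathbf{M}\mathbf{1} = \mathbf{1}$ forces $\|\mathbf{M}\|_2 \ge 1$. The sketch below demonstrates the invariant numerically; whether CRMA uses Sinkhorn iteration specifically is not stated here, so treat it as an illustration rather than CRMA's construction.

```python
import numpy as np

def sinkhorn(logits: np.ndarray, n_iters: int = 50) -> np.ndarray:
    """Alternately normalize rows and columns of exp(logits) so the result is
    (approximately) doubly stochastic. Illustrative sketch, not CRMA's code."""
    m = np.exp(logits - logits.max())           # positive matrix, numerically safe
    for _ in range(n_iters):
        m = m / m.sum(axis=1, keepdims=True)    # rows sum to 1
        m = m / m.sum(axis=0, keepdims=True)    # columns sum to 1
    return m

rng = np.random.default_rng(0)
M = sinkhorn(rng.normal(size=(16, 16)))

# Birkhoff: M is a convex combination of permutation matrices, each with
# spectral norm 1, so ||M||_2 <= 1; M @ ones = ones forces ||M||_2 >= 1.
spec = np.linalg.norm(M, ord=2)                 # largest singular value
print(spec, abs(spec - 1.0))                    # ~1.0, deviation at float precision
```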

Routing Is Load-Bearing

Because the architecture is modular, an inference-time router has to pick the right adapter for each query. Ours is a contrastive centroid classifier, and on the 5-domain Mistral-7B benchmark it was correct on all 31 questions. Every forgetting-prevention number in this paper is conditional on correct routing — a routing failure would send a query to the wrong adapter and look like a forgetting event even though no forgetting has occurred.
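A minimal sketch of a centroid router of this kind follows; the embedding model and the contrastive training of the centroids are not specified in this summary, so the code only illustrates the dispatch step, with hypothetical toy embeddings.

```python
import numpy as np

class CentroidRouter:
    """Route a query embedding to the nearest domain centroid by cosine
    similarity. Sketch only; the paper's router learns centroids contrastively."""

    def __init__(self):
        self.centroids: dict = {}               # domain name -> unit-norm centroid

    def fit(self, domain: str, embeddings: np.ndarray):
        c = embeddings.mean(axis=0)
        self.centroids[domain] = c / np.linalg.norm(c)

    def route(self, query_embedding: np.ndarray) -> str:
        q = query_embedding / np.linalg.norm(query_embedding)
        # Cosine similarity reduces to a dot product on unit vectors.
        return max(self.centroids, key=lambda d: float(q @ self.centroids[d]))

# Hypothetical usage with 2-D toy embeddings:
router = CentroidRouter()
router.fit("medical", np.array([[0.9, 0.1], [0.8, 0.2]]))
router.fit("legal",   np.array([[0.1, 0.9], [0.2, 0.8]]))
print(router.route(np.array([0.85, 0.15])))     # -> "medical"
```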

The 31/31 figure is on five maximally distinct domains, which is the easy regime for a centroid classifier. We have not yet stress-tested the router on overlapping sub-domains, on deliberately ambiguous queries, or on out-of-distribution queries. A quantitative stress test on same-vertical boundary queries is planned follow-up work. Until that result is in, same-vertical deployment is an explicit limitation of what this paper establishes.

Limitations

Collecting the limitations already stated above: QA scoring is single-rater, unblinded, and non-expert; every forgetting-prevention number is conditional on correct routing, and the router has not been stress-tested on overlapping sub-domains, ambiguous queries, or out-of-distribution queries; MMLU and GSM8K degrade measurably after the full sequential run, and the math-reasoning degradation is not yet mechanistically attributed; and the headline statistics rest on 3 seeds, so the exact permutation test cannot fall below its combinatorial floor of 0.05 one-sided.

References

  1. French, R. M. (1999). "Catastrophic forgetting in connectionist networks." Trends in Cognitive Sciences, 3(4):128–135.
  2. Kirkpatrick, J., et al. (2017). "Overcoming catastrophic forgetting in neural networks." PNAS, 114(13):3521–3526.
  3. Lopez-Paz, D. and Ranzato, M. (2017). "Gradient episodic memory for continual learning." NeurIPS.
  4. Li, Z. and Hoiem, D. (2018). "Learning without forgetting." IEEE TPAMI, 40(12):2935–2947.
  5. Miyato, T., et al. (2018). "Spectral normalization for generative adversarial networks." ICLR.
  6. van de Ven, G. M. and Tolias, A. S. (2019). "Three scenarios for continual learning." arXiv:1904.07734.
  7. Hu, E. J., et al. (2022). "LoRA: Low-rank adaptation of large language models." ICLR.
  8. Wang, X., et al. (2023). "TRACE: A comprehensive benchmark for continual learning in LLMs." arXiv:2310.06762.
  9. Wang, Y., et al. (2024). "O-LoRA: Orthogonal low-rank adaptation for LLM continual learning." ICLR.
  10. Liang, H. and Li, Z. (2024). "InfLoRA: Interference-free low-rank adaptation for continual learning." CVPR.
  11. Biderman, D., et al. (2024). "LoRA learns less and forgets less." TMLR.
  12. Huang, C., et al. (2024). "LoRAHub: Efficient cross-task generalization via dynamic LoRA composition." COLM.
  13. Buehler, E. L. and Buehler, M. J. (2024). "X-LoRA: Mixture of low-rank adapter experts." arXiv:2402.07148.
  14. Dou, S., et al. (2024). "LoRAMoE: Alleviating world knowledge forgetting in LLMs via MoE-style plugin." ACL.
  15. Pfeiffer, J., et al. (2021). "AdapterFusion: Non-destructive task composition for transfer learning." EACL.
  16. Dohare, S., et al. (2024). "Loss of plasticity in deep continual learning." Nature, 632:768–774.
  17. Lewandowski, A., et al. (2025). "Learning continually by spectral regularization." ICLR.
  18. Sinkhorn, R. (1964). "A relationship between arbitrary positive matrices and doubly stochastic matrices." Annals of Mathematical Statistics, 35(2):876–879.
  19. Birkhoff, G. (1946). "Three observations on linear algebra." Univ. Nac. Tucumán Rev.

Citation: Nayudu, K., Nutakki, A., Naidu, S. V., and Shanmugasundaram, A. (2026). "CRMA: Modular LoRA for Continual Fine-Tuning of Large Language Models Without Catastrophic Forgetting." ModelBrew AI. Patent pending (US Provisional, filed February 2026).