CRMA: Modular LoRA for Continual Fine-Tuning of Large Language Models Without Catastrophic Forgetting
Authors: Kiran Nayudu, Aswini Nutakki, Sai Vinay Naidu, Ashwin Shanmugasundaram · ModelBrew AI
The Headline Numbers
The per-seed MODULAR drift range $[-0.36\%, -0.03\%]$ and the per-seed NAIVE forgetting range $[+38.1\%, +49.0\%]$ do not overlap: every MODULAR seed beats every NAIVE seed. The effect size is two orders of magnitude larger than the between-seed standard deviation.
Protocol: Mistral-7B, 5 sequential domains (Medical, Legal, Financial, Code, Science), 500 train / 100 holdout per domain, seeds 0 / 42 / 1234. Loss-relative backward transfer as defined in Section 7.1.3 of the paper; positive values indicate forgetting.
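As a pointer to how these numbers are computed, here is a minimal sketch of the forgetting metric as summarized here (an illustrative reading; the exact Section 7.1.3 definition may differ in detail):

```python
# Minimal sketch of loss-relative backward transfer (illustrative reading of the
# Section 7.1.3 metric, not the exact evaluation code): the relative increase in a
# prior domain's holdout loss after all later domains have been trained.
def loss_relative_backward_transfer(loss_after_own_training: float,
                                    loss_after_full_sequence: float) -> float:
    """Positive = forgetting (the prior domain's holdout loss went up)."""
    return (loss_after_full_sequence - loss_after_own_training) / loss_after_own_training

# Example: a domain whose holdout loss rises from 1.25 to 1.73 scores +38.4%.
print(f"{loss_relative_backward_transfer(1.25, 1.73):+.1%}")
```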
The Honest Reframe: Where the Benefit Comes From
The paper's most important finding is the FROZEN-vs-MODULAR comparison. We ran three conditions on identical sequential data:
- NAIVE: Single shared LoRA, sequential training, no protection (catastrophic forgetting baseline).
- FROZEN: Plain base model with the backbone held fixed and a fresh LoRA per domain; no CRMA component in the forward pass at all. This isolates the modular per-task LoRA architecture on its own.
- MODULAR: Plain base model with a fresh LoRA per domain plus a CRMA backbone that continues to train across tasks. This is the full method.
FROZEN already gives essentially all of the forgetting-prevention effect: because the backbone never moves, prior-task forgetting is prevented by construction. CRMA adds a 1.99% ± 0.54 learning advantage (lower holdout loss, averaged across 3 seeds) on top of FROZEN, positive at every seed (1.46%, 1.97%, 2.54%).
This reframes the paper's pitch honestly: most of the forgetting-prevention benefit comes from the modular per-task adapter architecture — giving each task its own fresh LoRA rather than sharing a single one. CRMA's specific contribution is the spectrally bounded substrate that keeps the shared backbone stable enough to continue training across tasks without interfering with prior adapters.
Prior art acknowledgment: Modular per-task adapters are not novel. LoRAHub, X-LoRA, LoRAMoE, AdapterFusion, PackNet, and Progressive Networks all share the "one adapter per task" idea. What CRMA adds is the structural non-expansiveness guarantee that lets the shared backbone keep learning instead of being frozen.
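A minimal sketch of the three conditions above, in pseudocode; helper names such as `fresh_lora`, `train_lora`, and `CRMABackbone` are illustrative placeholders, not a released API:

```python
# Illustrative pseudocode for the NAIVE / FROZEN / MODULAR conditions.
# fresh_lora, train_lora, and CRMABackbone are placeholder names.

def run_naive(base, domains):
    adapter = fresh_lora(base)                    # one shared LoRA for everything
    for d in domains:
        train_lora(base, adapter, d)              # later domains overwrite earlier ones
    return [adapter] * len(domains)

def run_frozen(base, domains):
    adapters = []
    for d in domains:
        a = fresh_lora(base)                      # new LoRA per domain
        train_lora(base, a, d)                    # backbone weights stay fixed
        adapters.append(a)
    return adapters

def run_modular(base, domains):
    backbone = CRMABackbone(base)                 # shared, spectrally bounded substrate
    adapters = []
    for d in domains:
        a = fresh_lora(base)
        train_lora(base, a, d, also_train=backbone)  # backbone keeps learning across tasks
        adapters.append(a)
    return adapters
```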
3-Seed Results: Mistral-7B, 5 Domains
| Seed | MODULAR drift | NAIVE forgetting | FROZEN drift | MODULAR holdout loss | FROZEN holdout loss | MODULAR advantage |
|---|---|---|---|---|---|---|
| 0 | -0.03% | +38.1% | +1.47% | 1.2530 | 1.2716 | 1.46% |
| 42 | -0.10% | +41.7% | +1.66% | 1.2657 | 1.2912 | 1.97% |
| 1234 | -0.36% | +49.0% | +2.71% | 1.2705 | 1.3036 | 2.54% |
| 3-seed avg | −0.17% ± 0.17 | +42.96% ± 5.5 | +1.95% ± 0.64 | 1.2631 ± 0.009 | 1.2888 ± 0.016 | 1.99% ± 0.54 |
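The advantage column follows from the two holdout-loss columns; a quick consistency check, assuming it is the relative holdout-loss reduction of MODULAR over FROZEN:

```python
# Reconstructing the advantage column from the holdout losses
# (our reading: relative loss reduction versus FROZEN).
for seed, mod, frozen in [(0, 1.2530, 1.2716), (42, 1.2657, 1.2912), (1234, 1.2705, 1.3036)]:
    print(seed, f"{(frozen - mod) / frozen:+.2%}")   # +1.46%, +1.97%, +2.54%
```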
Welch's unequal-variance $t$-test on the MODULAR-vs-NAIVE means gives $t \approx 13.6$ (df $\approx 2$, $p \approx 0.005$). The exact 3-vs-3 permutation test is bounded from below by its combinatorial floor: $p = 2/20 = 0.10$ two-sided, $p = 1/20 = 0.05$ one-sided. Neither number is the primary argument; the disjoint per-seed ranges are.
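Both numbers can be reproduced from the per-seed values in the table above; a minimal sketch, assuming SciPy is available:

```python
from itertools import combinations
from scipy import stats

modular = [-0.03, -0.10, -0.36]   # MODULAR drift per seed, %
naive   = [38.1, 41.7, 49.0]      # NAIVE forgetting per seed, %

# Welch's unequal-variance t-test
t, p = stats.ttest_ind(naive, modular, equal_var=False)
print(f"t = {t:.1f}, two-sided p = {p:.3f}")          # ~13.6, ~0.005

# Exact 3-vs-3 permutation test: only C(6,3) = 20 relabelings exist, so the
# smallest attainable one-sided p is 1/20 = 0.05.
pooled = naive + modular
observed = sum(naive) / 3 - sum(modular) / 3
hits = sum(
    1 for idx in combinations(range(6), 3)
    if sum(pooled[i] for i in idx) / 3
       - sum(pooled[i] for i in range(6) if i not in idx) / 3 >= observed
)
print(f"one-sided permutation p = {hits}/20 = {hits / 20:.2f}")
```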
Inference-Time Ablation: Gemma-2-9B, 98/100 vs 38/100
On Gemma-2-9B, we tested the same sequentially-trained model weights on the same 100 held-out questions (20 per domain) under two conditions — with CRMA injected at inference, and with the CRMA module bypassed. This isolates CRMA as the access mechanism for sequentially trained knowledge.
| Domain | Without CRMA | With CRMA |
|---|---|---|
| Medical (D1) | 5/20 | 20/20 |
| Enterprise (D2) | 12/20 | 19/20 |
| Finance (D3) | 8/20 | 20/20 |
| Military (D4) | 3/20 | 19/20 |
| Real Estate (D5) | 10/20 | 20/20 |
| Total | 38/100 | 98/100 |
With CRMA: Wilson 95% CI $[93.0\%, 99.5\%]$. Without CRMA: Wilson 95% CI $[29.0\%, 47.8\%]$. The intervals are disjoint. McNemar's exact test on the paired outcomes gives $p \lesssim 2 \times 10^{-16}$ under the worst-case paired assumption.
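The interval and test arithmetic can be reproduced directly; a minimal sketch, assuming statsmodels for the Wilson interval:

```python
from math import comb
from statsmodels.stats.proportion import proportion_confint

# Wilson 95% intervals for the two totals
for k, n in [(98, 100), (38, 100)]:
    lo, hi = proportion_confint(k, n, alpha=0.05, method="wilson")
    print(f"{k}/{n}: [{lo:.1%}, {hi:.1%}]")

# Worst-case exact McNemar: pair up as much agreement as possible, leaving
# b = 60 questions fixed by CRMA and c = 0 broken by it.
b, c = 60, 0
p = 2 * sum(comb(b + c, i) for i in range(min(b, c) + 1)) * 0.5 ** (b + c)
print(f"McNemar exact p = {p:.1e}")    # ~1.7e-18
```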
Multi-Model Coverage: 5 Models, 4 Architecture Families
The results hold across 5 production language models from 4 architecture families, with no model-specific tuning:
| Model | Family | Params | Domains | QA Score | Wilson 95% CI |
|---|---|---|---|---|---|
| Phi-4-mini | Microsoft Phi | ~3.8B | 3 | 3/3 | [44%, 100%] |
| Mistral-7B | Mistral | 7.24B | 5 | 26/31 (84%) | [68%, 93%] |
| Saul-7B | Mistral (legal) | 7.24B | 3 legal sub-domains | 18/18 | [81.5%, 100%] |
| Gemma-2-9B | Google Gemma | 9.24B | 5 | 30/31 (97%) | [83%, 99.5%] |
| TinyLlama-1.1B | LLaMA | 1.1B | 4 | Training dynamics only | — |
QA scoring is first-author evaluation against pre-written reference answers — single rater, unblinded, not a domain expert. We are explicit about this limitation.
The Saul-7B 18/18 retention on three sequential legal sub-domains is the strongest single-vertical evidence in the paper. Saul-7B is a Mistral-7B model continued-pretrained for legal practice; CRMA's modular architecture preserved all 18 reference answers across the full sequential chain.
Standard Benchmark Preservation: Gemma-2-9B
To check whether the modular architecture preserves general capabilities beyond the trained domains, we ran Gemma-2-9B on six standard benchmarks using lm-evaluation-harness before and after the full 5-domain sequential training:
| Benchmark | Base | After 5 phases | Delta |
|---|---|---|---|
| HellaSwag | 79.94% | 76.32% | −3.62 pp |
| ARC-Challenge | 65.36% | 62.12% | −3.24 pp |
| TruthfulQA | 60.18% | 56.53% | −3.65 pp |
| WinoGrande | 76.01% | 74.35% | −1.66 pp |
| MMLU | 71.87% | 65.41% | −6.46 pp |
| GSM8K (corrected) | 79.68% | 69.5% | −10.2 pp |
HellaSwag, ARC-Challenge, TruthfulQA, and WinoGrande degrade by 1.7–3.7 pp. MMLU and GSM8K degrade more (−6.5 pp and −10 to −13 pp respectively). We report these in full rather than hiding them. We do not attribute the math-reasoning degradation mechanistically in this paper; a LoRA-only vs LoRA+CRMA isolation ablation is planned follow-up work.
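For reference, a minimal sketch of gathering such before/after numbers through the harness's Python entry point; the checkpoint path and the specific task variants (e.g. truthfulqa_mc2) are illustrative assumptions, not the paper's exact configuration:

```python
# Sketch only: assumes lm-evaluation-harness >= 0.4 and its simple_evaluate API.
import lm_eval

TASKS = ["hellaswag", "arc_challenge", "truthfulqa_mc2", "winogrande", "mmlu", "gsm8k"]

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=/path/to/gemma-2-9b-after-phase-5,dtype=bfloat16",
    tasks=TASKS,
    batch_size=8,
)
for task in TASKS:
    print(task, results["results"][task])
```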
GSM8K note: the automated scorer initially reported −19.3 pp. A full 1,319-question diagnostic revealed that the fine-tuned model never uses the "####" answer format standard scorers require; a corrected scorer, verified against 220 manually reviewed answers, recovered the true degradation of −10.2 pp.
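A minimal sketch of the kind of fallback extraction the corrected scorer needs (our illustration, not the exact scorer used in the paper):

```python
import re

def extract_gsm8k_answer(response: str):
    """Prefer the '#### <answer>' marker; otherwise fall back to the last number."""
    marked = re.search(r"####\s*(-?[\d,]+\.?\d*)", response)
    if marked:
        return marked.group(1).replace(",", "")
    numbers = re.findall(r"-?\d[\d,]*\.?\d*", response)
    return numbers[-1].replace(",", "") if numbers else None

print(extract_gsm8k_answer("She earns 6 * 3 = 18 dollars. #### 18"))  # 18
print(extract_gsm8k_answer("She earns 6 * 3 = 18 dollars a day."))    # 18
```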
Spectral-Norm Structural Invariant
CRMA enforces the spectral bound through the parameterization itself, not through a penalty or clipping. To validate this empirically, we logged the mixing matrix's spectral norm every 5 training steps across all 5 Gemma-2-9B domains — 867 checkpoints in total.
$\|\mathbf{M}\|_2 = 1.0$ within float32 precision at every logged step. The bound held at every step without corrective intervention: no gradient clipping, no re-normalization, no periodic projection. The doubly-stochastic parameterization enforces it by construction: by Birkhoff's theorem a doubly stochastic matrix is a convex combination of permutation matrices, so its spectral norm is at most 1, and because it maps the all-ones vector to itself the norm is exactly 1.
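A small numerical illustration of that structural argument (not CRMA's training code), assuming a Sinkhorn-style doubly-stochastic parameterization:

```python
import numpy as np

def sinkhorn(logits: np.ndarray, n_iters: int = 50) -> np.ndarray:
    """Alternately normalize rows and columns of exp(logits) toward doubly stochastic."""
    m = np.exp(logits)
    for _ in range(n_iters):
        m /= m.sum(axis=1, keepdims=True)
        m /= m.sum(axis=0, keepdims=True)
    return m

rng = np.random.default_rng(0)
M = sinkhorn(rng.normal(size=(8, 8)))
print(np.linalg.norm(M, 2))          # spectral norm ~1.0
print(M.sum(axis=0).round(4))        # column sums ~1
print(M.sum(axis=1).round(4))        # row sums ~1
```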
Routing Is Load-Bearing
Because the architecture is modular, an inference-time router has to pick the right adapter for each query. Ours is a contrastive centroid classifier, and on the 5-domain Mistral-7B benchmark it was correct on all 31 questions. Every forgetting-prevention number in this paper is conditional on correct routing — a routing failure would send a query to the wrong adapter and look like a forgetting event even though no forgetting has occurred.
The 31/31 figure is on five maximally distinct domains, which is the easy regime for a centroid classifier. We have not yet stress-tested the router on overlapping sub-domains, on deliberately ambiguous queries, or on out-of-distribution queries. A quantitative stress test on same-vertical boundary queries is planned follow-up work. Until that result is in, same-vertical deployment is an explicit limitation of what this paper establishes.
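A minimal sketch of centroid routing (illustrative only; the contrastive training of the embedding space and the query-embedding step are not shown):

```python
import numpy as np

# Illustrative centroid router: each domain keeps the (normalized) mean embedding
# of its training queries; a new query goes to the nearest centroid by cosine.

def build_centroids(embeddings_by_domain: dict) -> dict:
    centroids = {}
    for domain, embs in embeddings_by_domain.items():
        c = embs.mean(axis=0)
        centroids[domain] = c / np.linalg.norm(c)
    return centroids

def route(query_embedding: np.ndarray, centroids: dict) -> str:
    q = query_embedding / np.linalg.norm(query_embedding)
    return max(centroids, key=lambda d: float(q @ centroids[d]))
```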
Limitations
- No head-to-head published CL baselines. O-LoRA, InfLoRA, and Lewandowski et al. on the same benchmark would strengthen the paper substantially. This is the single largest gap in the paper and is explicitly acknowledged. Running these at head-to-head scale was beyond the budget of this work.
- QA evaluation is first-author, single-rater, unblinded. No inter-rater reliability, no condition blinding, no independent domain expert. On Saul-7B the first author is not a legal practitioner. A blinded two-rater evaluation is planned.
- GSM8K degradation is not mechanistically isolated. The corrected 10–13 pp drop on math reasoning could be a LoRA artifact, a CRMA artifact, or an interaction of the two. A LoRA-only vs LoRA+CRMA controlled ablation is planned.
- Same-vertical routing has not been stress-tested. A boundary-query probe on Saul-7B's three legal sub-domains is planned.
- Decoder-only transformers only. 1.1B–9.2B parameter range. Encoder-decoder, ViT, CNN, and RL-policy settings are not tested.
- Healthcare, finance, and defense are motivation, not deployment evidence. Only the legal vertical has direct experimental support (Saul-7B).
References
- French, R. M. (1999). "Catastrophic forgetting in connectionist networks." Trends in Cognitive Sciences, 3(4):128–135.
- Kirkpatrick, J., et al. (2017). "Overcoming catastrophic forgetting in neural networks." PNAS, 114(13):3521–3526.
- Lopez-Paz, D. and Ranzato, M. (2017). "Gradient episodic memory for continual learning." NeurIPS.
- Li, Z. and Hoiem, D. (2018). "Learning without forgetting." IEEE TPAMI, 40(12):2935–2947.
- Miyato, T., et al. (2018). "Spectral normalization for generative adversarial networks." ICLR.
- van de Ven, G. M. and Tolias, A. S. (2019). "Three scenarios for continual learning." arXiv:1904.07734.
- Hu, E. J., et al. (2022). "LoRA: Low-rank adaptation of large language models." ICLR.
- Wang, X., et al. (2023). "TRACE: A comprehensive benchmark for continual learning in LLMs." arXiv:2310.06762.
- Wang, Y., et al. (2024). "O-LoRA: Orthogonal low-rank adaptation for LLM continual learning." ICLR.
- Liang, H. and Li, Z. (2024). "InfLoRA: Interference-free low-rank adaptation for continual learning." CVPR.
- Biderman, D., et al. (2024). "LoRA learns less and forgets less." TMLR.
- Huang, C., et al. (2024). "LoRAHub: Efficient cross-task generalization via dynamic LoRA composition." COLM.
- Buehler, E. L. and Buehler, M. J. (2024). "X-LoRA: Mixture of low-rank adapter experts." arXiv:2402.07148.
- Dou, S., et al. (2024). "LoRAMoE: Alleviating world knowledge forgetting in LLMs via MoE-style plugin." ACL.
- Pfeiffer, J., et al. (2021). "AdapterFusion: Non-destructive task composition for transfer learning." EACL.
- Dohare, S., et al. (2024). "Loss of plasticity in deep continual learning." Nature, 632:768–774.
- Lewandowski, A., et al. (2025). "Learning continually by spectral regularization." ICLR.
- Sinkhorn, R. (1964). "A relationship between arbitrary positive matrices and doubly stochastic matrices." Annals of Mathematical Statistics, 35(2):876–879.
- Birkhoff, G. (1946). "Three observations on linear algebra." Univ. Nac. Tucumán Rev. Ser. A, 5:147–151.
Citation: Nayudu, K., Nutakki, A., Naidu, S. V., and Shanmugasundaram, A. (2026). "CRMA: Modular LoRA for Continual Fine-Tuning of Large Language Models Without Catastrophic Forgetting." ModelBrew AI. Patent pending (US Provisional, filed February 2026).