RESEARCH PAPER · FLAGSHIP · April 2026 · 15 min read
PATENT PENDING

CRMA: Modular LoRA for Continual Fine-Tuning of Large Language Models Without Catastrophic Forgetting

Each task gets its own fresh LoRA adapter. They all compose against a CRMA backbone whose internal mixing matrix is doubly-stochastic at every forward pass, so its spectral norm is bounded by 1 by Birkhoff's theorem. The modular architecture does the forgetting prevention; CRMA's specific contribution is the spectrally bounded substrate that lets the shared backbone keep training instead of being frozen. Result on Mistral-7B across 5 sequential domains: -0.17% ± 0.17 loss-relative drift across 3 seeds, versus +42.96% ± 5.5 for naive sequential fine-tuning.

Authors: Kiran Nayudu, Aswini Nutakki, Sai Vinay Naidu, Ashwin Shanmugasundaram · ModelBrew AI

The Headline Numbers

−0.17%
MODULAR drift (3 seeds ± 0.17)
+42.96%
NAIVE forgetting (3 seeds ± 5.5)

Per-seed MODULAR range $[-0.36\%, -0.03\%]$ and per-seed NAIVE range $[+38.1\%, +49.0\%]$ are disjoint at every seed. The effect size is two orders of magnitude larger than the between-seed standard deviation.

Protocol: Mistral-7B, 5 sequential domains (Medical, Legal, Financial, Code, Science), 500 train / 100 holdout per domain, seeds 0 / 42 / 1234. Loss-relative backward transfer as defined in Section 7.1.3 of the paper; positive values indicate forgetting.
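For readers who want the metric in concrete terms, here is a minimal sketch of one standard formulation of loss-relative backward transfer. The paper's exact definition lives in Section 7.1.3, so the function and the example losses below are illustrative assumptions, not the published implementation.

```python
# Minimal sketch of loss-relative backward transfer (assumed formulation;
# the paper's exact definition is in Section 7.1.3). Positive = forgetting.

def loss_relative_backward_transfer(loss_after_task: float,
                                    loss_after_sequence: float) -> float:
    """Percentage change in a domain's holdout loss between the end of that
    domain's own training phase and the end of the full sequential run."""
    return 100.0 * (loss_after_sequence - loss_after_task) / loss_after_task

# Hypothetical holdout losses for one domain across a sequential run:
print(loss_relative_backward_transfer(1.250, 1.248))  # ~ -0.16%  (no forgetting)
print(loss_relative_backward_transfer(1.250, 1.787))  # ~ +43%    (severe forgetting)
```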

The Honest Reframe: Where the Benefit Comes From

The paper's most important finding is the FROZEN-vs-MODULAR comparison. We ran three conditions on identical sequential data: NAIVE (a single shared LoRA fine-tuned straight through the task sequence), FROZEN (one fresh LoRA per task over a backbone that never updates), and MODULAR (one fresh LoRA per task over a CRMA backbone that keeps training).

FROZEN already gives essentially all of the forgetting-prevention effect — zero drift by construction, since the backbone never moves. CRMA adds a 1.99% ± 0.54 learning advantage (across 3 seeds) on top of FROZEN, positive at every seed (1.46%, 1.97%, 2.54%).

This reframes the paper's pitch honestly: most of the forgetting-prevention benefit comes from the modular per-task adapter architecture — giving each task its own fresh LoRA rather than sharing a single one. CRMA's specific contribution is the spectrally bounded substrate that keeps the shared backbone stable enough to continue training across tasks without interfering with prior adapters.
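As an illustration of the modular pattern this paragraph describes, here is a minimal per-task LoRA sketch in plain PyTorch: one shared base weight, one fresh low-rank adapter per task. The class name, ranks, and scaling are illustrative, and CRMA's mixing-matrix substrate is deliberately not modeled here.

```python
import torch
import torch.nn as nn

class PerTaskLoRALinear(nn.Module):
    """Shared base weight plus one independent LoRA (A, B) pair per task.
    Illustrative sketch only; names and ranks are not from the paper."""

    def __init__(self, d_in: int, d_out: int, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        # Shared backbone weight: frozen in the FROZEN condition,
        # kept trainable over a CRMA substrate in the MODULAR condition.
        self.base = nn.Linear(d_in, d_out, bias=False)
        self.adapters = nn.ModuleDict()          # task_id -> LoRA pair
        self.rank, self.scale = rank, alpha / rank

    def add_task(self, task_id: str):
        # Fresh adapter: B starts at zero so a new task begins from the backbone.
        a = nn.Linear(self.base.in_features, self.rank, bias=False)
        b = nn.Linear(self.rank, self.base.out_features, bias=False)
        nn.init.zeros_(b.weight)
        self.adapters[task_id] = nn.Sequential(a, b)

    def forward(self, x: torch.Tensor, task_id: str) -> torch.Tensor:
        out = self.base(x)
        if task_id in self.adapters:
            out = out + self.scale * self.adapters[task_id](x)
        return out

layer = PerTaskLoRALinear(64, 64)
for task in ["medical", "legal", "financial", "code", "science"]:
    layer.add_task(task)
y = layer(torch.randn(2, 64), task_id="legal")   # routes through the legal adapter only
```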

Prior art acknowledgment: Modular per-task adapters are not novel. LoRAHub, X-LoRA, LoRAMoE, AdapterFusion, PackNet, and Progressive Networks all share the "one adapter per task" idea. What CRMA adds is the structural non-expansiveness guarantee that lets the shared backbone keep learning instead of being frozen.

3-Seed Results: Mistral-7B, 5 Domains

Seed | MODULAR drift | NAIVE forget | FROZEN drift | MODULAR holdout loss | FROZEN holdout loss | MODULAR advantage
0 | −0.03% | +38.1% | +1.47% | 1.2530 | 1.2716 | 1.46%
42 | −0.10% | +41.7% | +1.66% | 1.2657 | 1.2912 | 1.97%
1234 | −0.36% | +49.0% | +2.71% | 1.2705 | 1.3036 | 2.54%
3-seed avg | −0.17% ± 0.17 | +42.96% ± 5.5 | +1.95% ± 0.64 | 1.2631 ± 0.009 | 1.2888 ± 0.016 | 1.99% ± 0.54

Welch's unequal-variance $t$-test on the MODULAR-vs-NAIVE means gives $t \approx 13.6$ (df $\approx 2$, $p \approx 0.005$). The exact 3-vs-3 permutation test is bounded from below by its combinatorial floor: $p = 2/20 = 0.10$ two-sided, $p = 1/20 = 0.05$ one-sided. Neither number is the primary argument; the disjoint per-seed ranges are.
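Both statistics are straightforward to reproduce from the per-seed values in the table above; a short SciPy sketch (variable names are ours) follows.

```python
from itertools import combinations
import numpy as np
from scipy import stats

modular = np.array([-0.03, -0.10, -0.36])    # MODULAR drift per seed (%)
naive   = np.array([38.1, 41.7, 49.0])       # NAIVE forgetting per seed (%)

# Welch's unequal-variance t-test on the two group means.
t, p = stats.ttest_ind(naive, modular, equal_var=False)
print(f"Welch t = {t:.1f}, p = {p:.4f}")

# Exact 3-vs-3 permutation test on the difference of group means.
# With C(6,3) = 20 relabelings, the smallest attainable one-sided p is 1/20.
pooled = np.concatenate([naive, modular])
observed = naive.mean() - modular.mean()
splits = list(combinations(range(6), 3))
count = 0
for idx in splits:
    a = pooled[list(idx)]
    b = np.delete(pooled, list(idx))
    if abs(a.mean() - b.mean()) >= abs(observed) - 1e-12:
        count += 1
print(f"two-sided permutation p = {count}/{len(splits)} = {count/len(splits):.2f}")
```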

Inference-Time Ablation: Gemma-2-9B, 98/100 vs 38/100

On Gemma-2-9B, we evaluated a single set of sequentially trained model weights on the same 100 held-out questions (20 per domain) under two conditions: with CRMA injected at inference, and with the CRMA module bypassed. This isolates CRMA as the access mechanism for sequentially trained knowledge.

Domain | Without CRMA | With CRMA
Medical (D1) | 5/20 | 20/20
Enterprise (D2) | 12/20 | 19/20
Finance (D3) | 8/20 | 20/20
Military (D4) | 3/20 | 19/20
Real Estate (D5) | 10/20 | 20/20
Total | 38/100 | 98/100

With CRMA: Wilson 95% CI $[93.0\%, 99.5\%]$. Without CRMA: Wilson 95% CI $[29.0\%, 47.8\%]$. The intervals are disjoint. McNemar's exact test on the paired outcomes gives $p \lesssim 2 \times 10^{-16}$ under the worst-case paired assumption.
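The interval and test arithmetic can be reproduced directly from the counts in the table. The sketch below implements the Wilson score interval by hand and, since the per-question pairing is not published here, encodes the worst-case pairing the text assumes for McNemar's test.

```python
import math
from scipy.stats import binomtest

def wilson_ci(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

print(wilson_ci(98, 100))   # ~ (0.93, 0.99)
print(wilson_ci(38, 100))   # ~ (0.29, 0.48)

# Worst-case pairing: both of the with-CRMA misses fall on questions the
# no-CRMA run got right, giving 62 vs 2 discordant pairs. Exact McNemar is
# then a two-sided binomial test at p = 0.5 on the discordant pairs.
print(binomtest(2, 2 + 62, 0.5).pvalue)   # ~ 2e-16
```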

Multi-Model Coverage: 5 Models, 4 Architecture Families

The architecture holds across 5 production language models from 4 architecture families, with no model-specific tuning:

Model | Family | Params | Domains | QA Score | Wilson 95% CI
Phi-4-mini | Microsoft Phi | ~3.8B | 3 | 3/3 | [44%, 100%]
Mistral-7B | Mistral | 7.24B | 5 | 26/31 (84%) | [68%, 93%]
Saul-7B | Mistral (legal) | 7.24B | 3 legal sub-domains | 18/18 | [81.5%, 100%]
Gemma-2-9B | Google Gemma | 9.24B | 5 | 30/31 (97%) | [83%, 99.5%]
TinyLlama-1.1B | LLaMA | 1.1B | 4 | Training dynamics only | —

QA scoring is first-author evaluation against pre-written reference answers — single rater, unblinded, not a domain expert. We are explicit about this limitation.

The Saul-7B 18/18 retention on three sequential legal sub-domains is the strongest single-vertical evidence in the paper. Saul-7B is a Mistral-7B variant continued-pretrained for legal practice; CRMA's modular architecture preserved correct answers on all 18 reference questions across the full sequential chain.

Standard Benchmark Preservation: Gemma-2-9B

To check whether the modular architecture preserves general capabilities beyond the trained domains, we ran Gemma-2-9B on six standard benchmarks using lm-evaluation-harness before and after the full 5-domain sequential training:

Benchmark | Base | After 5 phases | Delta
HellaSwag | 79.94% | 76.32% | −3.62 pp
ARC-Challenge | 65.36% | 62.12% | −3.24 pp
TruthfulQA | 60.18% | 56.53% | −3.65 pp
WinoGrande | 76.01% | 74.35% | −1.66 pp
MMLU | 71.87% | 65.41% | −6.46 pp
GSM8K (corrected) | 79.68% | 69.5% | −10.2 pp

HellaSwag, ARC-Challenge, TruthfulQA, and WinoGrande degrade by 1.7–3.7 pp. MMLU and GSM8K degrade more (−6.5 pp and −10 to −13 pp respectively). We report these in full rather than hiding them. We do not attribute the math-reasoning degradation mechanistically in this paper; a LoRA-only vs LoRA+CRMA isolation ablation is planned follow-up work.

GSM8K note: the automated scorer initially reported −19.3 pp. A full 1,319-question diagnostic revealed the fine-tuned model never uses the "####" format standard scorers require; a corrected scorer verified on 220 manual reviews recovered the true degradation of −10.2 pp.
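The corrected scorer itself is not published here. A minimal sketch of the underlying idea, an answer-extraction fallback that prefers the "####" marker but tolerates its absence, would look roughly like this (regexes and function name are illustrative, not the paper's scorer).

```python
import re
from typing import Optional

def extract_gsm8k_answer(completion: str) -> Optional[str]:
    """Prefer the standard '#### <answer>' marker; otherwise fall back to the
    last number in the completion. Illustrative sketch, not the actual scorer."""
    m = re.search(r"####\s*(-?[\d,]+(?:\.\d+)?)", completion)
    if m:
        return m.group(1).replace(",", "")
    numbers = re.findall(r"-?\d[\d,]*(?:\.\d+)?", completion)
    return numbers[-1].replace(",", "") if numbers else None

print(extract_gsm8k_answer("... so the total is 42 apples. #### 42"))   # '42'
print(extract_gsm8k_answer("The farmer ends up with 1,250 dollars."))   # '1250'
```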

Spectral-Norm Structural Invariant

CRMA enforces the spectral bound through the parameterization itself, not through a penalty or clipping. To validate this empirically, we logged the mixing matrix's spectral norm every 5 training steps across all 5 Gemma-2-9B domains — 867 checkpoints in total.

867
Logged training steps
< 1.2 × 10⁻⁷
Max deviation from 1.0

$\|\mathbf{M}\|_2 = 1.0$ within float32 precision at every logged step. The bound held at every step without corrective intervention — no gradient clipping, no re-normalization, no periodic projection. The doubly-stochastic parameterization enforces it by construction.
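The parameterization itself is not reproduced in this summary, but the property it relies on is easy to illustrate. One standard way to obtain a doubly stochastic matrix is Sinkhorn normalization (Sinkhorn, 1964); by Birkhoff's theorem such a matrix is a convex combination of permutation matrices, each with spectral norm 1, so $\|\mathbf{M}\|_2 \le 1$, and $\mathbf{M}\mathbf{1} = \mathbf{1}$ forces $\|\mathbf{M}\|_2 \ge 1$. The sketch below demonstrates the invariant numerically; whether CRMA uses Sinkhorn iteration specifically is not stated here, so treat it as an illustration rather than CRMA's construction.

```python
import numpy as np

def sinkhorn(logits: np.ndarray, n_iters: int = 50) -> np.ndarray:
    """Alternately normalize rows and columns of exp(logits) so the result is
    (approximately) doubly stochastic. Illustrative sketch, not CRMA's code."""
    m = np.exp(logits - logits.max())           # positive matrix, numerically safe
    for _ in range(n_iters):
        m = m / m.sum(axis=1, keepdims=True)    # rows sum to 1
        m = m / m.sum(axis=0, keepdims=True)    # columns sum to 1
    return m

rng = np.random.default_rng(0)
M = sinkhorn(rng.normal(size=(16, 16)))

# Birkhoff: M is a convex combination of permutation matrices, each with
# spectral norm 1, so ||M||_2 <= 1; M @ ones = ones forces ||M||_2 >= 1.
spec = np.linalg.norm(M, ord=2)                 # largest singular value
print(spec, abs(spec - 1.0))                    # ~1.0, deviation at float precision
```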

Routing Is Load-Bearing

Because the architecture is modular, an inference-time router has to pick the right adapter for each query. Ours is a contrastive centroid classifier, and on the 5-domain Mistral-7B benchmark it was correct on all 31 questions. Every forgetting-prevention number in this paper is conditional on correct routing — a routing failure would send a query to the wrong adapter and look like a forgetting event even though no forgetting has occurred.
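A minimal sketch of a centroid router of this kind follows; the embedding model and the contrastive training of the centroids are not specified in this summary, so the code only illustrates the dispatch step, with hypothetical toy embeddings.

```python
import numpy as np

class CentroidRouter:
    """Route a query embedding to the nearest domain centroid by cosine
    similarity. Sketch only; the paper's router learns centroids contrastively."""

    def __init__(self):
        self.centroids: dict = {}               # domain name -> unit-norm centroid

    def fit(self, domain: str, embeddings: np.ndarray):
        c = embeddings.mean(axis=0)
        self.centroids[domain] = c / np.linalg.norm(c)

    def route(self, query_embedding: np.ndarray) -> str:
        q = query_embedding / np.linalg.norm(query_embedding)
        # Cosine similarity reduces to a dot product on unit vectors.
        return max(self.centroids, key=lambda d: float(q @ self.centroids[d]))

# Hypothetical usage with 2-D toy embeddings:
router = CentroidRouter()
router.fit("medical", np.array([[0.9, 0.1], [0.8, 0.2]]))
router.fit("legal",   np.array([[0.1, 0.9], [0.2, 0.8]]))
print(router.route(np.array([0.85, 0.15])))     # -> "medical"
```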

The 31/31 figure is on five maximally distinct domains, which is the easy regime for a centroid classifier. We have not yet stress-tested the router on overlapping sub-domains, on deliberately ambiguous queries, or on out-of-distribution queries. A quantitative stress test on same-vertical boundary queries is planned follow-up work. Until that result is in, same-vertical deployment is an explicit limitation of what this paper establishes.

Limitations

Collecting the limitations already stated above: QA scoring is single-rater, unblinded, and non-expert; every forgetting-prevention number is conditional on correct routing, and the router has not been stress-tested on overlapping sub-domains, ambiguous queries, or out-of-distribution queries; MMLU and GSM8K degrade measurably after the full sequential run, and the math-reasoning degradation is not yet mechanistically attributed; and the headline statistics rest on 3 seeds, so the exact permutation test cannot fall below its combinatorial floor of 0.05 one-sided.

References

  1. French, R. M. (1999). "Catastrophic forgetting in connectionist networks." Trends in Cognitive Sciences, 3(4):128–135.
  2. Kirkpatrick, J., et al. (2017). "Overcoming catastrophic forgetting in neural networks." PNAS, 114(13):3521–3526.
  3. Lopez-Paz, D. and Ranzato, M. (2017). "Gradient episodic memory for continual learning." NeurIPS.
  4. Li, Z. and Hoiem, D. (2018). "Learning without forgetting." IEEE TPAMI, 40(12):2935–2947.
  5. Miyato, T., et al. (2018). "Spectral normalization for generative adversarial networks." ICLR.
  6. van de Ven, G. M. and Tolias, A. S. (2019). "Three scenarios for continual learning." arXiv:1904.07734.
  7. Hu, E. J., et al. (2022). "LoRA: Low-rank adaptation of large language models." ICLR.
  8. Wang, X., et al. (2023). "TRACE: A comprehensive benchmark for continual learning in LLMs." arXiv:2310.06762.
  9. Wang, Y., et al. (2024). "O-LoRA: Orthogonal low-rank adaptation for LLM continual learning." ICLR.
  10. Liang, H. and Li, Z. (2024). "InfLoRA: Interference-free low-rank adaptation for continual learning." CVPR.
  11. Biderman, D., et al. (2024). "LoRA learns less and forgets less." TMLR.
  12. Huang, C., et al. (2024). "LoRAHub: Efficient cross-task generalization via dynamic LoRA composition." COLM.
  13. Buehler, E. L. and Buehler, M. J. (2024). "X-LoRA: Mixture of low-rank adapter experts." arXiv:2402.07148.
  14. Dou, S., et al. (2024). "LoRAMoE: Alleviating world knowledge forgetting in LLMs via MoE-style plugin." ACL.
  15. Pfeiffer, J., et al. (2021). "AdapterFusion: Non-destructive task composition for transfer learning." EACL.
  16. Dohare, S., et al. (2024). "Loss of plasticity in deep continual learning." Nature, 632:768–774.
  17. Lewandowski, A., et al. (2025). "Learning continually by spectral regularization." ICLR.
  18. Sinkhorn, R. (1964). "A relationship between arbitrary positive matrices and doubly stochastic matrices." Annals of Mathematical Statistics, 35(2):876–879.
  19. Birkhoff, G. (1946). "Three observations on linear algebra." Univ. Nac. Tucumán Rev.

Citation: Nayudu, K., Nutakki, A., Naidu, S. V., and Shanmugasundaram, A. (2026). "CRMA: Modular LoRA for Continual Fine-Tuning of Large Language Models Without Catastrophic Forgetting." ModelBrew AI. Patent pending (US Provisional, filed February 2026).