03 · Never forget — Patent-pending CRMA

A Continual-Learning Engine for Production Fine-tuning

A research-stage continual-learning engine for stacking domains onto the same model — medical → legal → finance → code — without erasing what came before. −0.17% drift on Mistral-7B across 5 domains (3 seeds), versus +43% with naive sequential LoRA, on our internal protocol. Closed beta — invite only. Built on patent-pending CRMA.


CRMA at a glance.

Tested across multiple model scales, three random seeds, five real-world domains. CRMA and naive drift ranges are disjoint at every seed.

−0.17%
Backbone drift with CRMA
3-seed avg, 5 domains, Mistral-7B
+43%
Forgetting without CRMA
Naive sequential LoRA, 3-seed avg
98/100
Gemma inference ablation
Same weights, 100 questions, CRMA toggled. 38/100 without.
18/18
Saul-7B legal sub-domains
First-domain retention validated across 3 sequential CL transitions

Most fine-tunes erase the past. CRMA doesn't.

Train a model on medical data. Then train it on legal data. Then code. With naive LoRA, each new training run partially erases the last. With CRMA, every domain stacks — and the model can answer questions across all of them.

Without CRMA — Naive Sequential LoRA

Step 1 — Train on medical data
● Medical
Step 2 — Then train on legal data
✗ Medical — gone
● Legal
Step 3 — Then train on code
✗ Medical — gone
✗ Legal — gone
● Code
It only remembers the last thing you taught it.

With CRMA — Continual Learning

Step 1 — Train on medical data
✓ Medical
Step 2 — Then train on legal data
✓ Medical — still there
✓ Legal
Step 3 — Then train on code
✓ Medical — still there
✓ Legal — still there
✓ Code
It remembers everything. Tested at 7B parameters: −0.17% drift across 3 seeds.
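
For readers who want the mechanics behind the diagram above: the sketch below (Hugging Face peft; the model id, domain names, and the train_on() helper are placeholders, and this is not the CRMA implementation) contrasts the two training patterns. Naive sequential LoRA keeps updating one adapter, so each domain partially overwrites the last; a modular setup freezes each finished adapter and trains a fresh one per domain. CRMA's modular LoRA is in that second family, but composes adapters through its spectrally bounded substrate, which is not shown here.

    # Illustrative sketch only -- not the CRMA implementation.
    # Assumes transformers + peft are installed; the model id, domain names,
    # and train_on() are placeholders for your own model and training loop.
    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    cfg = LoraConfig(r=16, lora_alpha=32,
                     target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")

    def naive_sequential(base):
        # One adapter, retrained for every domain: each pass updates the same
        # low-rank matrices, so legal training drifts what encoded medical,
        # and code training drifts both.
        model = get_peft_model(base, cfg)
        for domain in ["medical", "legal", "code"]:
            train_on(model, domain)          # hypothetical training helper
        return model

    def modular_adapters(base):
        # One adapter per domain: finished adapters are never updated again,
        # so the knowledge they hold cannot be overwritten. Something must
        # route to the right adapter at inference time.
        model = get_peft_model(base, cfg, adapter_name="medical")
        train_on(model, "medical")
        for domain in ["legal", "code"]:
            model.add_adapter(domain, cfg)   # fresh, independent matrices
            model.set_adapter(domain)        # only this adapter is trained
            train_on(model, domain)
        model.set_adapter("medical")         # switching back recovers medical
        return model

    base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
    modular = modular_adapters(base)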

Three papers. Real experiments. Ongoing research.

CRMA comes from original research — not a wrapper around existing tools. We publish our methodology, run multi-seed experiments, and update the algorithm based on results. Patent pending (US provisional filed Feb 2026).

Analysis

Six CL Methods Tested — Six Failures

EWC, replay, gradient projection, knowledge distillation, O-LoRA, 10-component stacks. Best result: 58.4% forgetting. We tested them all so you don't have to.

Preprint · v2-v7 experiments · TinyLlama & Mistral-7B
Read paper →
Results

Near-Zero Forgetting on Mistral-7B Across 5 Domains

Modular LoRA on a spectrally bounded CRMA backbone: −0.17% ± 0.17 MODULAR drift vs +42.96% ± 5.5 NAIVE forgetting across 3 seeds. Per-seed ranges disjoint. Validated on 5 models across 4 architecture families.

Preprint · 3 seeds · 5 domains · Mistral-7B & Gemma-2-9B
Read paper →
Background

Why Catastrophic Forgetting Happens

The mathematical and architectural reasons LLMs forget when you fine-tune them sequentially. The shared-parameter dilemma. Why dropout, regularization, and learning-rate tricks don't fix it.

Blog · Conceptual primer · Recommended start
Read post →
Comparison

CRMA vs LoRA — What Changes

LoRA and CRMA both add small trainable adapters. The difference is what happens when you stack a second domain. LoRA overwrites; CRMA composes through a spectrally bounded substrate.

Blog · Architecture comparison
Read post →

Current Research & Development

Validated

Multi-seed experiments (3 random seeds) on Mistral-7B. 5 real-world domains (medical, legal, financial, code, science). Results reproducible across seeds.

In Progress

Enhanced reasoning via self-distillation fine-tuning (SDFT). Scale testing beyond 7B. Head-to-head benchmark against O-LoRA and other academic CL methods.

Roadmap

Real-time continual learning (streaming updates). Agent fine-tuning with tool-use preservation. Automatic domain boundary detection.


Per-domain breakdown after 5 sequential domains.

Each domain was trained sequentially. Drift measures how much earlier domains degraded after all 5 were trained. Negative = slight improvement (positive transfer).

Domain       CRMA      Frozen    Naive LoRA
Medical      −0.56%    +2.22%    +149.6%
Legal        −0.55%    +1.83%    +34.3%
Financial    +0.59%    +1.74%    +17.8%
Code         −0.51%    +2.78%    +13.0%
Science      +0.20%    +1.17%    +0.08%
3-seed Avg   −0.17%    +1.95%    +42.96%

Key insight: CRMA drift is roughly an order of magnitude lower than FROZEN (~1.95%) and two orders of magnitude lower than naive sequential LoRA (~43%). 3-seed average across seeds 0, 42, 1234; Mistral-7B.

Method comparison at a glance

Method               Forgetting                  Overhead   Price/M tokens             CL Support
CRMA                 −0.17% drift                None       Closed beta — contact us   Built-in
Naive LoRA           +43% (7B) / +225% (1.1B)    None       Varies                     No
OpenAI               No CL                       N/A        $3–25                      No
Mistral / Together   No CL                       N/A        $0.48–9                    No

How we measure: "Forgetting" = change in holdout loss on previously learned domains after training on new ones. Negative = the model got slightly better (ideal). Positive = knowledge was lost. Measured across 5 real-world domains (medical, legal, financial, code, science) on Mistral-7B, averaged over 3 random seeds.
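
In code, the metric is just a percent change in holdout loss. A minimal sketch (plain Python; the helper names and the loss values below are made up for illustration and are not benchmark data):

    # Sketch of the forgetting metric as defined above -- not the benchmark harness.
    # loss_before[d]: holdout loss on domain d right after d was trained.
    # loss_after[d]:  holdout loss on domain d after all 5 domains were trained.
    def forgetting_pct(loss_before: dict, loss_after: dict) -> dict:
        """Percent change in holdout loss per domain (negative = improvement)."""
        return {d: 100.0 * (loss_after[d] - loss_before[d]) / loss_before[d]
                for d in loss_before}

    def average_drift(per_domain: dict) -> float:
        """Mean drift across domains; averaged again over seeds in the paper."""
        return sum(per_domain.values()) / len(per_domain)

    # Hypothetical example values, for illustration only:
    before = {"medical": 1.80, "legal": 1.65, "financial": 1.70, "code": 1.20, "science": 1.55}
    after  = {"medical": 1.79, "legal": 1.64, "financial": 1.71, "code": 1.19, "science": 1.55}
    print(forgetting_pct(before, after))                  # small +/- values per domain
    print(average_drift(forgetting_pct(before, after)))   # near-zero average drift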

View full benchmark methodology & caveats

CRMA Internal (Mistral-7B, 5 domains, 3-seed avg): CRMA Modular −0.17% ± 0.17 drift, Frozen +1.95% ± 0.64, Naive +42.96% ± 5.5. Per-seed MODULAR and NAIVE ranges are disjoint. No replay, no EWC, no knowledge distillation.

Gemma-2-9B inference ablation: 98/100 with CRMA (Wilson 95% CI [93.0%, 99.5%]) vs 38/100 without (Wilson 95% CI [29.0%, 47.8%]). Same weights, same questions, only CRMA toggled.
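
The confidence intervals above are standard Wilson score intervals. A minimal sketch of the computation, in plain Python with z = 1.96 for 95% confidence:

    import math

    def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple:
        """Wilson score interval for a binomial proportion (95% CI at z = 1.96)."""
        p = successes / n
        denom = 1 + z**2 / n
        center = (p + z**2 / (2 * n)) / denom
        half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
        return center - half, center + half

    print(wilson_ci(98, 100))  # approx. (0.93, 0.99)  -- with CRMA
    print(wilson_ci(38, 100))  # approx. (0.29, 0.48)  -- without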

Spectral norm invariant: ‖M‖₂ held at 1.0 within float32 precision across 867 logged training steps spanning 5 sequential domains on Gemma-2-9B. Max deviation < 1.2 × 10⁻⁷. Birkhoff bound holds by construction.
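
The invariant is a direct check on the largest singular value of the monitored matrix: if the mixing matrix is doubly stochastic (a convex combination of permutation matrices, per Birkhoff), its spectral norm is exactly 1 by construction, so any drift beyond float32 round-off signals a bug. A minimal sketch of that check (PyTorch; M here is a stand-in, not CRMA internals, and the tolerance mirrors the max deviation quoted above):

    import torch

    def check_spectral_norm(M: torch.Tensor, tol: float = 1.2e-7) -> float:
        """Return |sigma_max(M) - 1| and assert it stays within float32 round-off."""
        sigma_max = torch.linalg.matrix_norm(M.float(), ord=2)  # largest singular value
        deviation = abs(sigma_max.item() - 1.0)
        assert deviation < tol, f"spectral norm drifted: {deviation:.2e}"
        return deviation

    # Illustrative only: a permutation matrix has sigma_max == 1 exactly.
    M = torch.eye(8)[torch.randperm(8)]
    print(check_spectral_norm(M))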

Pricing context (April 2026): ModelBrew FT $3.99/M, all 7–9B models, with gradient visibility + built-in Dataset Optimizer. CL is in closed beta and not available for self-serve purchase at this time. OpenAI GPT-4.1 $3.00/M (no CL, FT only on their models). Together/Fireworks/OpenPipe $0.48-0.50/M (FT only, no cleaner, no CL). Mistral La Plateforme $1.00/M.

Head-to-head baselines: We have not run head-to-head comparisons against published CL methods (O-LoRA, InfLoRA, Lewandowski et al.) on our protocol. This is the single largest gap in our research; it is acknowledged openly in the paper. Our internal controls compare NAIVE vs FROZEN vs MODULAR on identical data.

CRMA results are from internal benchmarks using holdout evaluation. All forgetting-prevention numbers are conditional on correct inference-time routing.


Stop forgetting. Start stacking.

CRMA continual learning is currently in closed beta. Fine-tuning ($3.99/M, free tier on TinyLlama) is available now.