ModelBrew AI Blog · February 2026 · 9 min read

CRMA vs LoRA: What Changes When You Add Stability Constraints?

LoRA changed fine-tuning. But sequential training still breaks models. What if the adapter itself could prevent forgetting?

The Starting Point

LoRA (Low-Rank Adaptation) was a breakthrough. It proved that you could fine-tune large language models by training only a small set of adapter parameters, reducing memory requirements by orders of magnitude. Almost every fine-tuning workflow today uses LoRA or its quantized variant, QLoRA.

But LoRA was designed for a single objective: efficiently adapt a model to one task or dataset. It was not designed for what happens after that — when you need to train on a second dataset, then a third, then a fourth. When models need to learn continuously without losing what they already know.

CRMA is a different kind of adapter. It shares LoRA's efficiency — small trainable parameters, frozen base model — but adds mathematical stability constraints that change how the model behaves during and after training. The adapter itself enforces a set of rules that prevent the destructive weight updates responsible for catastrophic forgetting.

This article compares the two approaches directly, using data from experiments at the 1.1B and 7B parameter scales.

The Comparison Framework

Comparing fine-tuning methods requires looking at more than final loss numbers. A method that produces great results on a single benchmark can still fail catastrophically in production. We evaluate across three dimensions that matter for real-world deployment: single-task quality, computational overhead, and behavior under sequential training.

Single-Task Fine-Tuning

On single-task fine-tuning — training a model on one dataset and evaluating on that same domain — both LoRA and CRMA produce high-quality results. This is expected. LoRA is a mature technique that has been proven across thousands of fine-tuning runs, and any method that could not match it on this basic capability would not be worth discussing.

CRMA matches LoRA's single-task performance while adding a small per-forward-pass cost for the Sinkhorn projection. We did not formally measure the wall-clock overhead in our experiments; it was never a bottleneck in any training run, but we report it as unmeasured rather than put a number on it.

The question is not whether this overhead is worth it for a single fine-tuning run. For a one-shot task, it often is not — standard LoRA is perfectly adequate. The question is what you get in return when you move beyond the single-task scenario.

Where They Diverge: Sequential Training

Sequential training is the scenario that separates these approaches. It is also the scenario that matters most for production systems, because production models almost always need to learn over time.

The Setup

The experiment is straightforward. Take a base model. Fine-tune it on domain A (for example, medical text). Evaluate on domain A. Then continue training the same adapter on domain B (for example, legal text). Evaluate on both domains.

The question: after training on domain B, how much domain A knowledge survives?
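The drift numbers reported in this article all use a loss-relative measure. A minimal sketch, with a helper name of our own choosing (the actual experiment code may compute it differently):

```python
# Hypothetical helper (the name is ours): loss-relative drift on a prior
# domain. Positive values mean forgetting; values near zero mean the
# prior domain survived continued training.
def loss_relative_drift(loss_before: float, loss_after: float) -> float:
    """Percent change in held-out loss on a previously learned domain."""
    return 100.0 * (loss_after - loss_before) / loss_before

print(loss_relative_drift(4.0, 5.0))  # → 25.0
```

In these terms, a +42.96% drift means the held-out loss on the prior domain rose by roughly 43% after training on later domains.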

Standard LoRA: Catastrophic Forgetting

With standard LoRA, the answer is clear. Domain A knowledge does not just degrade — it is destroyed. The adapter's weights shift to optimize for domain B, and the domain A information is overwritten.

This is not a subtle effect. At the 7B scale, standard LoRA shows +351% forgetting on the first domain after sequential training. At the 1.1B scale, it is +225% forgetting. These numbers mean the model performs dramatically worse on domain A than it did before any fine-tuning at all. The fine-tuning actively harmed the model's capability in the first domain.

This happens because standard LoRA has no mechanism to protect previously learned representations. Every gradient update optimizes purely for the current batch of data. The optimizer does not know or care that certain weight configurations are important for earlier domains.
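The overwrite effect is easy to reproduce in miniature. The toy below is our own construction, not the paper's setup: one shared parameter vector is trained on two synthetic "domains" in sequence, and the first domain's loss climbs as soon as training continues on the second.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_domain(w_true):
    # Noiseless linear "domain": the optimum is exactly w_true.
    X = rng.normal(size=(200, 2))
    return X, X @ w_true

def mse(w, X, y):
    return float(np.mean((X @ w - y) ** 2))

def train(w, X, y, steps=500, lr=0.1):
    # Plain gradient descent on the current domain only, as in naive
    # sequential fine-tuning: nothing protects earlier domains.
    for _ in range(steps):
        w = w - lr * (2 * X.T @ (X @ w - y) / len(y))
    return w

X_a, y_a = make_domain(np.array([1.0, -1.0]))   # domain A
X_b, y_b = make_domain(np.array([-2.0, 3.0]))   # domain B, different optimum

w = train(np.zeros(2), X_a, y_a)
loss_a_before = mse(w, X_a, y_a)   # near zero after training on A
w = train(w, X_b, y_b)             # continue with the same shared weights
loss_a_after = mse(w, X_a, y_a)    # domain A performance is destroyed

print(loss_a_before, loss_a_after)
```

The shared weights drift to domain B's optimum, and nothing in the update rule remembers that domain A ever existed.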

CRMA: Near-Zero Drift

Modular LoRA on a CRMA backbone produces a fundamentally different result. On Mistral-7B across 5 sequential domains (3 seeds), MODULAR shows −0.17% ± 0.17 loss-relative drift on the prior-task holdout, versus +42.96% ± 5.5 for naive sequential LoRA. The MODULAR and NAIVE ranges are disjoint at every seed.

This is achieved through two structural choices rather than training-time negotiation. First, the architecture is modular: each task gets a fresh LoRA adapter rather than a shared one, so task N's gradient updates cannot rewrite the weights that task 1 depended on. Second, the per-task adapters share a CRMA backbone whose internal mixing matrix is doubly stochastic at every forward pass; by Birkhoff's theorem, a doubly stochastic matrix is a convex combination of permutation matrices, so the mixing operation has spectral norm bounded by 1 by construction. The backbone's transformation is therefore non-expansive, regardless of what data it is being trained on.
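A doubly stochastic mixing matrix can be produced with a few Sinkhorn iterations, which alternately normalize rows and columns of a positive matrix. The sketch below is our own illustration of the constraint, not CRMA's implementation; the iteration count and where the projection sits in the architecture are assumptions.

```python
import numpy as np

def sinkhorn(M, iters=50):
    # Alternate row and column normalization; for a strictly positive
    # matrix this converges to a doubly stochastic matrix.
    M = np.asarray(M, dtype=float)
    for _ in range(iters):
        M = M / M.sum(axis=1, keepdims=True)  # rows sum to 1
        M = M / M.sum(axis=0, keepdims=True)  # columns sum to 1
    return M

rng = np.random.default_rng(0)
D = sinkhorn(np.exp(rng.normal(size=(8, 8))))  # exp keeps entries positive

print(D.sum(axis=1))             # each row sums to ~1
print(np.linalg.norm(D, ord=2))  # spectral norm ~1, never above (Birkhoff)
```

Because every forward pass goes through a matrix with spectral norm at most 1, the mixing step can never amplify a perturbation, no matter what gradients the current task produces.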

The Numbers (3-seed Mistral-7B, 5 sequential domains)

Condition                        | Drift / Forgetting | Per-seed range
---------------------------------|--------------------|------------------
CRMA Modular                     | −0.17% ± 0.17      | [−0.36%, −0.03%]
FROZEN (no CRMA, plain backbone) | +1.95% ± 0.64      | [+1.47%, +2.71%]
Naive sequential LoRA            | +42.96% ± 5.5      | [+38.1%, +49.0%]

Naive sequential LoRA's forgetting is not a rounding error — it is a 43% loss-relative drift on prior-task performance, with Legal reaching +593% in the single-seed Mistral-7B v8.1 run. Modular LoRA (fresh per-task adapter) plus a CRMA backbone holds drift within the measurement noise of the 3-seed run.

The Honest Reframe: What CRMA Actually Adds

The FROZEN condition above uses the modular per-task architecture without any CRMA component at all — plain pretrained backbone, per-task fresh LoRA. FROZEN already produces near-zero drift, because a frozen backbone cannot drift by definition. CRMA's specific contribution is a 1.99% ± 0.54 learning advantage on top of FROZEN (positive at every seed: 1.46%, 1.97%, 2.54%).

This means most of the forgetting-prevention benefit comes from the modular per-task adapter architecture — the idea of giving each task its own fresh LoRA rather than sharing a single one. That idea is prior art (LoRAHub, X-LoRA, LoRAMoE, AdapterFusion, PackNet, Progressive Networks). What CRMA adds is the spectrally bounded backbone that allows the shared substrate to keep training across tasks instead of being frozen, without breaking the drift guarantee. That is a narrower but more defensible contribution than "CRMA eliminates catastrophic forgetting."
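The modular part of the design is simple enough to sketch directly. In the toy below (the structure is ours; CRMA's trained backbone and Sinkhorn mixing are omitted for brevity), each task owns a fresh low-rank adapter over a shared frozen matrix, so updating one task's adapter cannot move another task's function at all.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 16, 2
W_backbone = rng.normal(size=(d, d))  # frozen, shared across tasks
adapters = {}                         # task name -> (A, B) low-rank pair

def new_task(task):
    # Fresh near-zero adapter per task: task N's updates never touch
    # the parameters any earlier task depends on.
    adapters[task] = (rng.normal(size=(d, r)) * 0.01,
                      rng.normal(size=(r, d)) * 0.01)

def forward(task, x):
    A, B = adapters[task]
    return (W_backbone + A @ B) @ x   # LoRA-style low-rank delta

x = rng.normal(size=d)
new_task("medical")
out_before = forward("medical", x)

new_task("legal")                     # stand-in for training a second task
adapters["legal"] = (np.ones((d, r)), np.ones((r, d)))  # arbitrary update

out_after = forward("medical", x)
print(np.allclose(out_before, out_after))  # True: zero drift by design
```

With a frozen backbone this isolation is trivially exact, which is exactly the FROZEN baseline above; CRMA's contribution is letting the shared matrix keep training while the spectral bound preserves the same guarantee.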

When to Use Which

Neither approach is universally superior. The right choice depends on your use case. Here is a practical decision framework:

Standard LoRA Is the Right Choice When:

- You are adapting a model to a single task or dataset, one time.
- No further training will follow, so there is no earlier knowledge to protect.
- You want the most mature tooling and no extra per-forward-pass cost.

CRMA Is the Right Choice When:

- The model will be trained sequentially on multiple domains over time.
- New training must not degrade what the model already knows.
- You want the shared backbone to keep learning across tasks instead of being frozen.

What This Means for Production AI

The distinction between LoRA and CRMA reflects a broader shift in the field. The first wave of fine-tuning innovation was about efficiency — making it possible to train large models at all on limited hardware. LoRA and QLoRA solved that problem definitively.

The second wave is about reliability. As fine-tuning moves from research experiments to production deployments, the requirements shift. Producing a good model once isn't enough anymore. You need to produce good models consistently, update them safely, and trust that new training will not break what already works.

This is not unique to language models. Every engineering discipline goes through this progression — first making something work, then making it reliable. What is notable about CRMA is that the reliability comes not from process improvements or better hyperparameter tuning, but from mathematical constraints built into the training mechanism itself.

The bottom line: LoRA is a great tool for one-shot fine-tuning. CRMA is what you need when models have to keep learning — when new knowledge must coexist with old, and when training stability is not optional but essential.

The question for any team building with fine-tuned language models is straightforward: will your model need to learn more than once? If the answer is yes, stability constraints are not a nice-to-have. They are a requirement.