ModelBrew AI Blog · February 2026 · 9 min read

CRMA vs LoRA: What Changes When You Add Stability Constraints?

LoRA changed fine-tuning. But sequential training still breaks models. What if the adapter itself could prevent forgetting?

The Starting Point

LoRA (Low-Rank Adaptation) was a breakthrough. It proved that you could fine-tune large language models by training only a small set of adapter parameters, orders of magnitude fewer than full fine-tuning, which sharply reduces memory requirements. Almost every fine-tuning workflow today uses LoRA or its quantized variant, QLoRA.

But LoRA was designed for a single objective: efficiently adapt a model to one task or dataset. It was not designed for what happens after that — when you need to train on a second dataset, then a third, then a fourth. When models need to learn continuously without losing what they already know.

CRMA is a different kind of adapter. It shares LoRA's efficiency — small trainable parameters, frozen base model — but adds mathematical stability constraints that fundamentally change how the model behaves during and after training. The adapter itself enforces a set of rules that prevent the destructive weight updates responsible for catastrophic forgetting.

This article compares the two approaches directly, using data from experiments at the 1.1B and 7B parameter scale.

The Comparison Framework

Comparing fine-tuning methods requires looking at more than just final loss numbers. A method that produces great results on a single benchmark can still fail catastrophically in production. We evaluate across three dimensions that matter for real-world deployment: single-task quality, behavior under sequential training, and gradient stability.

Single-Task Fine-Tuning

On single-task fine-tuning — training a model on one dataset and evaluating on that same domain — both LoRA and CRMA produce high-quality results. This is expected. LoRA is a mature technique that has been proven across thousands of fine-tuning runs, and any method that could not match it on this basic capability would not be worth discussing.

CRMA matches LoRA's single-task performance while adding a small computational overhead. That overhead comes from the stability constraints evaluated at each training step. In practice, the per-step cost is modest: training runs take roughly 10-20% longer wall-clock time, depending on model size and sequence length.

The question is not whether this overhead is worth it for a single fine-tuning run. For a one-shot task, it often is not — standard LoRA is perfectly adequate. The question is what you get in return when you move beyond the single-task scenario.

Where They Diverge: Sequential Training

Sequential training is the scenario that separates these approaches. It is also the scenario that matters most for production systems, because production models almost always need to learn over time.

The Setup

The experiment is straightforward. Take a base model. Fine-tune it on domain A (for example, medical text). Evaluate on domain A. Then continue training the same adapter on domain B (for example, legal text). Evaluate on both domains.

The question: after training on domain B, how much domain A knowledge survives?
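A minimal sketch of this protocol in Python, with train and evaluate standing in for an actual training and evaluation loop (all names here are illustrative, not an API from the article):

```python
def sequential_protocol(train, evaluate, adapter, domain_a, domain_b):
    """Two-stage experiment: fine-tune on domain A, then continue training
    the *same* adapter on domain B, scoring domain A before and after."""
    train(adapter, domain_a)
    score_a_before = evaluate(adapter, domain_a)
    train(adapter, domain_b)                  # no reset between stages
    score_a_after = evaluate(adapter, domain_a)
    score_b = evaluate(adapter, domain_b)
    return score_a_before, score_a_after, score_b
```

The gap between score_a_before and score_a_after is the forgetting that the rest of this section quantifies.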

Standard LoRA: Catastrophic Forgetting

With standard LoRA, the answer is stark. Domain A knowledge does not just degrade — it is destroyed. The adapter's weights shift to optimize for domain B, and the domain A information is overwritten.

This is not a subtle effect. At the 7B scale, standard LoRA shows +351% forgetting on the first domain after sequential training. At the 1.1B scale, it is +225% forgetting. These numbers mean the model performs dramatically worse on domain A than it did before any fine-tuning at all. The fine-tuning actively harmed the model's capability in the first domain.
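The article does not spell out how the forgetting percentage is computed; assuming it is the relative change in a domain's evaluation loss (positive meaning worse), the arithmetic is simple:

```python
def relative_drift(before: float, after: float) -> float:
    """Percent change in a domain's eval loss; positive means degradation."""
    return (after - before) / before * 100.0

# Hypothetical loss values: a domain-A loss rising from 2.0 to 9.02
# corresponds to +351% drift; a dip from 2.0 to 1.998 is -0.1%.
print(round(relative_drift(2.0, 9.02), 1))   # 351.0
print(round(relative_drift(2.0, 1.998), 1))  # -0.1
```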

This happens because standard LoRA has no mechanism to protect previously learned representations. Every gradient update optimizes purely for the current batch of data. The optimizer does not know or care that certain weight configurations are important for earlier domains.

CRMA: Near-Zero Drift

CRMA produces a fundamentally different result. After the same sequential training protocol — domain A, then domain B — the adapter shows -0.1% backbone drift. The model's performance on domain A is essentially unchanged. It learned domain B without overwriting domain A.

This is not achieved through replay buffers (re-showing old data during new training), elastic weight consolidation (penalizing changes to "important" weights), or any of the traditional continual learning techniques that trade training efficiency for stability. CRMA's stability comes from constraints built into the adapter itself. The mathematical properties of the adapter ensure that updates cannot push the model's representations outside a stable region, regardless of what data is being trained on.
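The article does not publish CRMA's actual constraint, so the following is only a generic illustration of the underlying idea: a per-step cap on how far an update can move the weights. The cap value and function names are invented for the sketch.

```python
import math

def constrained_step(weights, grads, lr=0.1, max_delta=0.05):
    """One gradient step whose total update norm is capped: compute the
    raw update, then scale it down if it would move the weights too far."""
    delta = [-lr * g for g in grads]
    norm = math.sqrt(sum(d * d for d in delta))
    if norm > max_delta:
        scale = max_delta / norm
        delta = [d * scale for d in delta]
    return [w + d for w, d in zip(weights, delta)]
```

However large the gradient, the weights can move at most max_delta per step, so no single batch can drag the model far from where it was. A real adapter would enforce its bound in representation space rather than raw weight space, but the hedged intuition is the same: updates are rescaled before they can leave a stable region.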

The Numbers

Here are the results from controlled experiments at two model scales, using the same base models, same datasets, same training configuration, and same evaluation protocol. The only variable is the adapter type.

Metric               Scale   CRMA              Standard LoRA
Backbone drift       7B      -0.1%             +351%
Backbone drift       1.1B    -0.1%             +225%
Gradient stability   7B      84% more stable   baseline
Gradient stability   1.1B    39% more stable   baseline

The drift numbers tell the core story: standard LoRA's forgetting is not a rounding error. A +351% drift at the 7B scale means domain-A performance degrades by roughly a factor of 3.5. CRMA holds the backbone essentially constant across sequential training runs.

The gradient stability numbers are equally meaningful for production use. Training that is 39-84% more stable means fewer failed runs, more predictable outcomes, and less time spent debugging hyperparameters.

Gradient Stability: Why It Matters in Practice

Gradient stability might sound like an academic concern, but it has direct practical consequences. When gradients are unstable, you get:

- failed or diverged training runs,
- outcomes that vary unpredictably from one run to the next, and
- long hyperparameter-debugging sessions instead of shipped models.

CRMA constrains gradient behavior mathematically. The adapter enforces bounds on how much the model's internal representations can change per step. This does not mean training is slower or less expressive — it means training stays within a stable region of the optimization landscape. The model still learns the target domain effectively, but it does so without the erratic gradient behavior that plagues standard LoRA at scale.

The 39-84% improvement in gradient stability (measured by gradient variance across training) translates directly to more reliable training runs. For teams running fine-tuning as a service or automating training pipelines, this is the difference between a system that works reliably and one that requires constant intervention.
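The article describes these figures as reductions in gradient variance across training; assuming that means lower variance of per-step gradient norms, the comparison could be computed like this (the norm values below are invented for illustration):

```python
import statistics

def stability_gain(baseline_norms, constrained_norms):
    """Percent reduction in gradient-norm variance relative to baseline."""
    v_base = statistics.variance(baseline_norms)
    v_con = statistics.variance(constrained_norms)
    return (1.0 - v_con / v_base) * 100.0

baseline = [1.0, 3.2, 0.4, 2.8, 0.9]      # erratic run
constrained = [1.1, 1.3, 0.9, 1.2, 1.0]   # steadier run
gain = stability_gain(baseline, constrained)
```

In practice you would collect the per-step gradient norms from training logs for each run and compare the two distributions the same way.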

When to Use Which

Neither approach is universally superior. The right choice depends on your use case. Here is a practical decision framework:

Standard LoRA Is the Right Choice When:

- You are fine-tuning once, on a single task or dataset, with no plans to train the same adapter again.
- You want the most mature, widely supported tooling with minimal per-step overhead.
- Saving 10-20% of wall-clock time per run matters more than long-term stability.

CRMA Is the Right Choice When:

- Your model must learn sequentially, absorbing new domains or datasets over time.
- Preserving previously learned capabilities is a hard requirement.
- You run automated or as-a-service training pipelines where stable, predictable runs are essential.

What This Means for Production AI

The distinction between LoRA and CRMA mirrors a broader trend in the field. The first wave of fine-tuning innovation was about efficiency — making it possible to train large models at all on limited hardware. LoRA and QLoRA solved that problem definitively.

The second wave is about reliability. As fine-tuning moves from research experiments to production deployments, the requirements shift. It is no longer enough to produce a good model once. You need to produce good models consistently, update them safely, and trust that new training will not break what already works.

This is not unique to language models. Every engineering discipline goes through this progression — first making something work, then making it reliable. What is notable about CRMA is that the reliability comes not from process improvements or better hyperparameter tuning, but from mathematical constraints built into the training mechanism itself.

The bottom line: LoRA is a great tool for one-shot fine-tuning. CRMA is what you need when models have to keep learning — when new knowledge must coexist with old, and when training stability is not optional but essential.

The question for any team building with fine-tuned language models is straightforward: will your model need to learn more than once? If the answer is yes, stability constraints are not a nice-to-have. They are a requirement.