RESEARCH PAPER 2 · February 2026 · 15 min read
PATENT PENDING

Six Approaches to Catastrophic Forgetting, Six Failures

We systematically tested every major continual learning approach — EWC, replay, gradient projection, knowledge distillation, O-LoRA, and combined stacks of up to 10 components. The best result: 58.4% average forgetting. Here's what we learned about why they all fail.

Author: Kiran V. Nayudu · Independent Research, Frederick, Maryland

The Experiment

Over seven months and six major experimental iterations, we built progressively more sophisticated continual learning systems for large language models. Each iteration added more components, more constraints, more techniques from the literature. Each iteration failed.

The benchmark: four sequential domains (Medical QA, Legal QA, Code/Programming, Finance/Economics), each with 800 training samples and 100 holdout samples. Train on each domain in sequence. Measure how much prior knowledge is destroyed.
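As a concrete reference point for the numbers below, here is a minimal sketch of a loss-relative forgetting metric of the kind our companion paper reports; the function name and exact formula are our illustration, not a verbatim excerpt from the protocol:

    def loss_relative_drift(loss_before: float, loss_after: float) -> float:
        """Percent increase in a prior domain's holdout loss after training
        on later domains; +100% means the holdout loss doubled."""
        return 100.0 * (loss_after - loss_before) / loss_before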

The Journey: v2 Through v7

Version | Approach | Avg Forgetting | What Happened
v2 | O-LoRA + EWC + GradProj + Replay | −2.0% | Bug: our implementation froze the model (clip ratio 0.00079)
v3 | EWC + Gradient Projection | +91.3% | First real signal; confounded data
v4 | Cumulative basis + B-freeze + Replay | +27.8% (P2 only) | Incomplete run
v5 | 10-component full stack | +58.4% | Best single-adapter result
v6 | v5 + Sparse Memory Adapter | +61.5% (P2 only) | OOM crash at Phase 3
v7 | KD + Replay + Bottom freeze at 7B | +109.3% | Our KD implementation failed to converge at 7B

Finding #1: The Fisher Normalization Bug

Elastic Weight Consolidation (EWC) uses the Fisher information matrix to identify which parameters are important for prior tasks. A standard implementation normalizes all Fisher values jointly across all parameter groups.

The bug: when parameter groups have vastly different sizes (e.g., 132 adapter parameters vs. 2.88 million LoRA parameters), joint normalization silently disables EWC for the smaller group. The small group's Fisher values are dwarfed by the large group's max value, making EWC protection effectively zero for the adapter parameters.
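A minimal sketch of the failure mode; the group sizes match the paper, while the Fisher value scales are illustrative assumptions:

    import torch

    # Two parameter groups of very different size: 132 adapter parameters
    # vs. 2.88M LoRA parameters. Value scales are assumed for illustration.
    fisher = {
        "adapter": torch.rand(132) * 1e-6,
        "lora": torch.rand(2_880_000) * 1e-2,
    }

    # Buggy: one joint normalization across all groups. The global max comes
    # from the large group, so the adapter's normalized Fisher values collapse
    # toward zero and its EWC penalty is silently disabled.
    global_max = max(f.max() for f in fisher.values())
    normalized_buggy = {name: f / global_max for name, f in fisher.items()}

    # Fix: normalize each parameter group by its own maximum.
    normalized_fixed = {name: f / f.max() for name, f in fisher.items()}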

This bug is likely present in any EWC implementation that combines heterogeneous parameter groups under a single normalization scheme.

Finding #2: O-LoRA and Gradient Projection Cancel Each Other

O-LoRA penalizes parameters for moving into the prior task's subspace. Gradient projection removes gradient components from the prior task's subspace. When combined, the penalty pushes gradients toward the subspace while projection removes them — they are mechanically contradictory.
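The cancellation is easy to demonstrate in a toy setting; the dimensions below are illustrative, not our adapter shapes:

    import torch

    d, k = 64, 8
    Q, _ = torch.linalg.qr(torch.randn(d, k))  # orthonormal basis of the prior-task subspace

    w = torch.randn(d, requires_grad=True)
    # O-LoRA-style penalty on the component of w inside the prior subspace.
    penalty = (Q.T @ w).pow(2).sum()
    penalty.backward()
    g_penalty = w.grad  # lies entirely in span(Q)

    # Gradient projection then removes exactly that subspace, so the penalty's
    # gradient is annihilated and only the out-of-subspace part of the task
    # gradient survives.
    g_task = torch.randn(d)
    g_total = g_task + g_penalty
    g_projected = g_total - Q @ (Q.T @ g_total)
    print((g_projected - (g_task - Q @ (Q.T @ g_task))).norm())  # ~0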

The result: the effective learning rate drops by 100–1000×. The model appears not to forget (because it barely learns anything), producing misleadingly good CL metrics. Our v2 result of −2.0% "forgetting" was actually a frozen model.

Finding #3: Knowledge Distillation Failed to Converge at 7B

Knowledge distillation — using a frozen copy of the prior model as a "teacher" — is standard in continual learning. Our v7 experiment applied KD + replay + bottom-layer freezing on Mistral-7B across a 4-domain sequential protocol, and failed to converge to usable forgetting prevention. Average forgetting ended at +109.3% versus naive +212.8% — better than no protection, but nowhere near the zero drift we were chasing.
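For reference, a standard formulation of the KD objective in this setting looks like the sketch below; the temperature and mixing weight are placeholders, not our tuned v7 values:

    import torch.nn.functional as F

    def kd_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
        """Task cross-entropy plus temperature-softened KL divergence against
        the frozen prior-model teacher. T and alpha are illustrative."""
        ce = F.cross_entropy(student_logits, labels)
        kl = F.kl_div(
            F.log_softmax(student_logits / T, dim=-1),
            F.softmax(teacher_logits / T, dim=-1),
            reduction="batchmean",
        ) * (T * T)
        return (1.0 - alpha) * ce + alpha * kl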

We do not claim that published KD methods are structurally incompatible with 7B continual learning; only that our v7 implementation on our protocol did not produce a usable result. The v8 shift to a modular per-task architecture removed the need for KD entirely.

Finding #4: Data Quality Inflates Forgetting by 97 Points

When we fixed confounds in our benchmark data (asymmetric templates, domain labels, unequal sizes), naive forgetting dropped from +185.8% to +88.8% — a 96.9 percentage point difference from data quality alone.
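A sketch of the symmetric construction we mean; the template and field names are hypothetical:

    # One shared template for every domain, no domain label in the prompt,
    # and equal train/holdout counts per domain (800/100, as in our benchmark).
    TEMPLATE = "Question: {q}\nAnswer: {a}"

    def build_domain_split(examples, n_train=800, n_holdout=100):
        assert len(examples) >= n_train + n_holdout, "unequal domain sizes confound forgetting"
        rows = [TEMPLATE.format(q=ex["question"], a=ex["answer"]) for ex in examples]
        return rows[:n_train], rows[n_train:n_train + n_holdout]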

This means data construction errors alone can inflate forgetting by more than the entire contribution of any CL algorithm. Published CL papers that don't control for data quality may be reporting artifacts, not algorithm performance.

The Root Cause: Gradient Space Saturation

The fundamental problem isn't that individual CL methods are poorly designed. It's that shared-parameter continual learning has structural limits.

Each task's protection mechanism (EWC penalty, gradient projection basis, replay targets) consumes available gradient space. After four tasks, the effective dimensionality available for new-task learning is severely reduced.

Across the v2–v7 ablation runs, every single-adapter CL method we tried showed a monotonic escalation of new-task learning overhead as more tasks accumulated. On v5 specifically (the 10-component full stack on TinyLlama-1.1B), this manifested as an 80–120% new-task learning overhead; that configuration was the best single-adapter result we achieved, and it was still unusable. The per-task progression numbers are not in the flagship paper; we report the aggregate pattern rather than specific per-task percentages.
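A toy illustration of the saturation mechanism; the dimensions are illustrative, and real adapter parameter spaces are far larger:

    import torch

    d, k = 512, 96        # parameter dimension and per-task protected basis size (toy numbers)
    g = torch.randn(d)    # a new-task gradient

    protected = torch.empty(d, 0)
    for task in range(1, 5):
        # accumulate and re-orthonormalize the union of protected subspaces
        protected, _ = torch.linalg.qr(torch.cat([protected, torch.randn(d, k)], dim=1))
        g_proj = g - protected @ (protected.T @ g)
        frac = (g_proj.norm() / g.norm()).item()
        # with these toy numbers, roughly half the gradient norm is gone by task 4
        print(f"after task {task}: {frac:.2f} of the gradient norm survives projection")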

The Stability-Plasticity Inversion

The most counterintuitive result: on the Code domain (third in the sequence), the CL stack produced worse retention than no protection at all.

Configuration | Code Task Forgetting
Naive sequential (no CL) | +27.3%
10-component CL stack | +46.8%

The CL stack consumed so much gradient space protecting Medical and Legal that the Code adapter was weaker. When Finance training arrived, the weak Code representation was more easily overwritten. Protection of earlier tasks actively harmed later tasks.

The Conclusion

The problem isn't tuning — it's architecture. Shared-parameter continual learning creates a stability-plasticity tradeoff that worsens with every additional task. The solution isn't better constraints on shared parameters. It's to stop sharing task-specific parameters entirely.

This finding motivated the modular architecture presented in our companion paper, which achieves −0.17% ± 0.17% loss-relative drift across 3 seeds on Mistral-7B (within measurement noise) by giving each task its own fresh LoRA adapter while sharing only a spectrally bounded CRMA backbone. The FROZEN-vs-MODULAR comparison in that paper further shows that the modular per-task adapter architecture carries most of the forgetting-prevention effect; CRMA's specific contribution is the non-expansive substrate that lets the shared backbone keep training.

Limitations

Several caveats apply to the numbers above. Two runs were incomplete (v4 and v6 report Phase 2 only; v6 crashed with an OOM at Phase 3). The v7 result reflects our KD implementation rather than published KD methods. All experiments used a single four-domain benchmark at 1.1B and 7B scale, and per-task progression numbers are reported only as an aggregate pattern.

References

  1. Kirkpatrick, J., et al. (2017). "Overcoming Catastrophic Forgetting in Neural Networks." PNAS.
  2. McCloskey, M. & Cohen, N. J. (1989). "Catastrophic Interference in Connectionist Networks." Psychology of Learning and Motivation.
  3. Wang, L., et al. (2023). "O-LoRA: Orthogonal Low-Rank Adaptation for Continual Learning." arXiv:2306.07832.
  4. Rolnick, D., et al. (2019). "Experience Replay for Continual Learning." NeurIPS 2019.
  5. Li, Z. & Hoiem, D. (2017). "Learning Without Forgetting." IEEE TPAMI.
  6. Zenke, F., et al. (2017). "Continual Learning Through Synaptic Intelligence." ICML 2017.
  7. Chaudhry, A., et al. (2019). "Efficient Lifelong Learning with A-GEM." ICLR 2019.
  8. Rusu, A. A., et al. (2016). "Progressive Neural Networks." arXiv:1606.04671.
  9. Mallya, A. & Lazebnik, S. (2018). "PackNet: Adding Multiple Tasks to a Single Network by Iterative Pruning." CVPR 2018.
  10. Hu, E. J., et al. (2022). "LoRA: Low-Rank Adaptation of Large Language Models." ICLR 2022.
  11. Biderman, S., et al. (2024). "LoRA Learns Less and Forgets Less." TMLR 2024.
  12. Chen, L., et al. (2023). "How Is ChatGPT's Behavior Changing over Time?" arXiv:2307.09009.
  13. Wang, Y., et al. (2023). "TRACE: A Comprehensive Benchmark for Continual Learning in Large Language Models." arXiv:2310.06762.
  14. Yoon, J., et al. (2025). "CLDyB: Revisiting Evaluation in Continual Learning." arXiv preprint.

Citation: Nayudu, K. V. (2026). "Six Approaches to Catastrophic Forgetting, Six Failures: A Systematic Characterization of Why Combined Continual Learning Methods Cannot Solve Sequential Fine-Tuning." Independent Research, Frederick, MD. Patent Pending (US Provisional, filed February 2026).