Fine-tune without forgetting Try Free
PATENT PENDING
FINE-TUNING WITH BUILT-IN CONTINUAL LEARNING

The Alternative to RAG
Near-Zero Forgetting

RAG looks up answers every time. ModelBrew bakes knowledge directly into the model. Train across multiple sequential domains — your model keeps what it learns, with prior-task drift within measurement noise on our 3-seed Mistral-7B benchmark. No vector database, no retrieval pipeline.

3 free runs/day on TinyLlama. Pro from $3.99/M tokens. See pricing

modelbrew — mistral-7b — drift monitor
Drift Runs Domains
100% 75% 50% 25% 0% Domain A Domain B Domain C Domain D
backbone drift -0.17% (3-seed avg)
baseline LoRA +43.0% forgetting
model Mistral-7B-Instruct
CRMA
Standard LoRA
-0.0%
MODULAR drift (3 seeds ± 0.17)
+0%
NAIVE forgetting (3 seeds ± 5.5)
98/100
Gemma inference ablation (vs 38/100) · first-author rated
18/18
Saul-7B legal sub-domains · first-author rated

Mistral-7B, 5 sequential domains, 3 seeds. Per-seed MODULAR and NAIVE ranges are disjoint at every seed. All forgetting numbers are conditional on correct inference-time routing. The 98/100 and 18/18 retention scores are first-author-evaluated with Wilson 95% confidence intervals; a blinded two-rater audit is on our roadmap.

1 US Patent Filed 3 Public Papers 5 Domains Benchmarked Mistral-7B Validated
Near-Zero Drift
−0.17%
MODULAR backbone drift across 5 domains on Mistral-7B (3 seeds ± 0.17)
API & SDKs
3 Lines to First Run
REST API. Upload data, pick a model, start training. That's it.
Inference Ablation
98/100 vs 38/100
Same Gemma-2-9B weights, 100 held-out questions, only CRMA toggled. Wilson 95% intervals disjoint.
Pricing
From $3.99/M tokens
Credits never expire. Free tier: 3 runs/day on TinyLlama.
Benchmarked
5 Domains, 3 Seeds
Medical, legal, finance, code, general. Reproducible results.
Deployment
Hosted API
Serverless GPU inference and training. OpenAI-compatible endpoint when each run finishes.

Coming soon · New frontier

Live Continual Learning

Beyond training-time CL: teach your model new facts live — in a single pass, no retraining. Instantly answerable, and nothing it already knows is forgotten. In private beta soon.

Get early access →
Catastrophic Forgetting
Continual Learning

RAG retrieves. ModelBrew remembers.

RAG systems look up answers from documents every time — slow, fragile, and expensive to maintain. ModelBrew trains the knowledge directly into the model weights. No vector database. No chunking pipeline. Your model just knows it.

Step 1
📚

Upload your data

Medical notes, legal docs, code, anything.

Step 2
🛡

Train with CRMA

CRMA guards the model so it can learn without forgetting.

Step 3

Done. Nothing lost.

Your model knows the new stuff AND still remembers the old stuff.

Without CRMA
Train on medical data
Medical
Then train on legal data...
Medical — gone
Legal
Then train on code...
Medical — gone
Legal — gone
Code
It only remembers the last thing you taught it.
With CRMA
Train on medical data
Medical
Then train on legal data...
Medical — still there
Legal
Then train on code...
Medical — still there
Legal — still there
Code
It remembers everything. Tested at 7B parameters, near-zero drift (−0.17% ± 0.17 across 3 seeds).

Built for teams that can't afford to forget.

Teams training models across multiple domains — without retraining from scratch every time.

Healthcare

Clinical NLP

Train on radiology reports, then clinical notes, then pathology — without forgetting prior specialties. Built by a healthcare practitioner who hit this problem firsthand.

Legal

Multi-Practice Firms

Fine-tune on contract review, then case law, then regulatory filings. Each practice area improves without degrading the others.

Finance

Cross-Asset Intelligence

Equities research, fixed income, credit analysis — one model that learns sequentially across asset classes without catastrophic forgetting.

Enterprise

Multi-Department AI

Designed for stacking domain adapters over time — support tickets, internal docs, product specs, HR policies — without retraining a separate model per department. Currently validated on the 5-domain Mistral-7B benchmark; production deployments are in beta.

ML Teams

Production Pipelines

Plug into existing CI/CD. Upload data per domain, choose standard FT or continual learning, track per-domain metrics and drift over time via API.


Dataset Optimizer

Clean your fine-tuning dataset before training. 60+ validator codes, AI-judge scoring with score-floor-gated rewrite, structural pair audit + judge-based polarity sample, tool-call validation, jailbreak + military-OPSEC + industry-specific PII detection. Free local validators in your browser; AI-judge cleaning is $5 per 200 rows.

🔍

60+ validator codes

Format, schema, length, dedup (exact + near + semantic), encoding, GPT-slop, refusals, repetition, mislabel detection. Every flag points back to a row index.

AI judge + rewrite

Four-axis judge with calibration exemplars; optional 14-dim and G-Eval rubrics. Rewriter preserves every number, URL, named entity, and acronym — verified by a fact-diff before the row ships.

🔁

DPO / ORPO structural audit

Eight structural defect codes — identity pairs, near-duplicate chosen, both-refusals, both-too-short, extreme length bias, sycophantic chosen, refusal-as-chosen, missing prompt. The pair-level checks row-level scanning misses.

🛠

Tool-call validation

OpenAI tool_calls and Anthropic tool_use shape detection. Missing-required-arg and wrong-arg-type are critical; unknown-arg is a warning. Built for shipping agentic fine-tunes.

🛡

Jailbreak · OPSEC · typed PII

Eight jailbreak categories (prompt injection, role bypass, system extraction, encoding attacks). Six military OPSEC codes (MGRS, EDIPI, classification markings, DTG, lat/long, network refs). Nine industry-specific PII detectors (medical: MRN/DEA/ICD-10/NPI, financial: CUSIP/SWIFT/ABA, legal: bar number/Bates) on top of the standard 10-type regex PII pass.

🚀

Proven at 100,000 rows

250 rows / sec on a single worker, peak RSS under 1.5 GB. End-to-end scan of 100k OASST1 and 100k military corpora. Real benchmark, not a marketing number.

Try Dataset Optimizer — Free →

Supports JSONL, CSV, and JSON · Up to 50MB · Local validators free; AI-judge cleaning $5 / 200 rows


Three lines to your first training run.

# Fine-tune with CRMA — near-zero forgetting import requests response = requests.post( "https://fourwheels2512--crma-finetune-fastapi-app.modal.run/start_run", data={ "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0", "epochs": "3", "use_crma": "true", }, files={"file": open("my_data.jsonl", "rb")}, headers={"Authorization": "Bearer YOUR_TOKEN"} ) print(response.json()) # {"run_id": "abc123", "status": "running", "model": "TinyLlama-1.1B"} # → Downloads: crma_adapter.zip (PEFT-compatible, plug into Transformers)

Works with any JSONL dataset. Or use the web UI — no code needed. · Full API docs →


Three papers. Real experiments. Ongoing research.

CRMA comes from original research — not a wrapper around existing tools. We publish our methodology, run multi-seed experiments, and update the algorithm based on results. Patent pending (US provisional filed Feb 2026).

Analysis

Six CL Methods Tested — Six Failures

EWC, replay, gradient projection, knowledge distillation, O-LoRA, 10-component stacks. Best result: 58.4% forgetting. We tested them all so you don’t have to.

Preprint · v2-v7 experiments · TinyLlama & Mistral-7B

Read paper →
Results

Near-Zero Forgetting on Mistral-7B Across 5 Domains

Modular LoRA on a spectrally bounded CRMA backbone: −0.17% ± 0.17 MODULAR drift vs +42.96% ± 5.5 NAIVE forgetting across 3 seeds. Per-seed ranges disjoint. Validated on 5 models across 4 architecture families. Patent pending.

Preprint · 3 seeds · 5 domains · Mistral-7B & Gemma-2-9B

Read paper →

Current Research & Development

Validated

Multi-seed experiments across 3 random seeds on Mistral-7B. 5 real-world domains (medical, legal, financial, code, science). Results reproducible across seeds.

In Progress

Enhanced reasoning via self-distillation fine-tuning (SDFT). Scale testing beyond 7B. Head-to-head benchmark against O-LoRA and other academic CL methods.

Roadmap

Real-time continual learning (streaming updates). Agent fine-tuning with tool-use preservation. Automatic domain boundary detection.


Numbers, not promises.

CRMA has been tested across multiple model scales and domains. Here's what the benchmarks show.

-0.17%
Backbone drift with CRMA
3-seed avg across 5 domains (Mistral-7B)
+43%
Forgetting without CRMA
3-seed avg, naive sequential training
98/100
Gemma inference ablation
Same weights, 100 questions, CRMA toggled. 38/100 without.
18/18
Saul-7B legal sub-domains
First-author-evaluated retention across 3 sequential legal sub-domains
Forgetting Rate
Backbone Drift After 5 Domains
0% 15% 30% 45% +43.0% +1.95% -0.17% Naive Frozen CRMA
Spectral Norm Invariant
‖M‖₂ held at 1.0 across 867 steps
0.0 1.0 2.0 D1 D2 D3 D4 D5 1.0 867 logged training steps (5 sequential domains)
Method Forgetting Overhead CL Support
CRMA -0.17% drift None Built-in (beta)
Naive LoRA +43% (7B) / +225% (1.1B) None No
OpenAI No CL N/A No
Mistral / Together No CL N/A No

How we measure: "Forgetting" = change in holdout loss on previously learned domains after training on new ones. Negative = the model got slightly better (ideal). Positive = knowledge was lost. Measured across 5 real-world domains (medical, legal, financial, code, science) on Mistral-7B, averaged over 3 random seeds. See the Pricing page for current rates.

Capability vs the Top-5 fine-tuning platforms

An honest scorecard. Some cells are A; some are not. ModelBrew is behind A-grade incumbents on SFT polish, DPO/SimPO ergonomics, and pricing/billing — and ahead on the two surfaces no other hosted platform ships: a built-in pre-train data gate and continual learning.

Capability Together.ai OpenAI FT HF AutoTrain Predibase Modal+Axolotl ModelBrew
SFT correctness AABAA C+
DPO / SimPO ACBA B−
Real eval (win-rate, MMLU) AABA−DIY A
Inference deploy AACADIY A
Pre-train data gate (Cleaner) None None None Limited None A−
Pricing / billing safety BABBA C+
API + SDK AA+CADIY A
Continual learning (regulated / MLOps) None None None None None C− (beta)

How we grade: Grades reflect ModelBrew's own internal scorecard against five hosted-fine-tuning competitors as of May 2026. ModelBrew's own grades come from its post-Wave-3.6 internal blueprint audit; competitor grades reflect publicly documented capability surfaces. SFT, DPO/SimPO, and pricing/billing are areas where ModelBrew is honestly behind A-grade incumbents. The pre-train data gate (Cleaner) and continual learning are the two unique-moat rows. Continual-learning support is in beta.

Per-Domain Drift After 5 Sequential Domains

Each domain was trained sequentially. Drift measures how much earlier domains degraded after all 5 were trained. Negative = slight improvement (positive transfer).

Domain CRMA Frozen Naive LoRA
Medical −0.56% +2.22% +149.6%
Legal −0.55% +1.83% +34.3%
Financial +0.59% +1.74% +17.8%
Code −0.51% +2.78% +13.0%
Science +0.20% +1.17% +0.08%
3-seed Avg −0.17% +1.95% +42.96%

Key insight: CRMA drift is on the order of an order of magnitude lower than FROZEN (∼1.95%), and two orders of magnitude lower than naive sequential LoRA (∼43%). The 3-seed average of the per-domain values reconciles to −0.17% at the bottom row. 3-seed average across seeds 0, 42, 1234; Mistral-7B.

View full benchmark data & methodology

CRMA Internal (Mistral-7B, 5 domains, 3-seed avg): CRMA Modular −0.17% ± 0.17 drift, Frozen +1.95% ± 0.64, Naive +42.96% ± 5.5. Per-seed MODULAR and NAIVE ranges are disjoint. No replay, no EWC, no knowledge distillation.

Gemma-2-9B inference ablation: 98/100 with CRMA (Wilson 95% CI [93.0%, 99.5%]) vs 38/100 without (Wilson 95% CI [29.0%, 47.8%]). Same weights, same questions, only CRMA toggled.

Pricing (April 2026): ModelBrew FT $3.99/M, all 7–9B models, with gradient visibility + built-in Dataset Optimizer. CL is in beta and not available for self-serve purchase at this time. OpenAI GPT-4.1 $3.00/M (no CL, FT only on their models). Together/Fireworks/OpenPipe $0.48-0.50/M (FT only, no cleaner, no CL). Mistral La Plateforme $1.00/M.

Head-to-head baselines: We have not run head-to-head comparisons against published CL methods (O-LoRA, InfLoRA, Lewandowski et al.) on our protocol. This is the single largest gap in our research; it is acknowledged openly in the paper. Our internal controls compare NAIVE vs FROZEN vs MODULAR on identical data.

CRMA results are from internal benchmarks using holdout evaluation. All forgetting-prevention numbers are conditional on correct inference-time routing.


Pay only for what you use.

No subscriptions. Sign up and get 75 credits free ($7.50). Load $20 in credits when you're ready, pay only for tokens used. 3 free training runs per day on TinyLlama.

Free
$0
75 credits at signup · no card
  • 75 credits free at signup ($7.50)
  • 3 runs per day on TinyLlama-1.1B
  • Fine-tuning mode
  • Download adapter ZIPs
  • Real-time training progress
Get Started

All 7–9B models (Mistral-7B, Llama-3.1-8B, Saul-7B, Qwen3-8B, Gemma-2-9B)

Fine-Tuning $3.99 / M tokens
Continual Learning Beta — contact us
Clean with AI (Dataset Optimizer) 50 credits per 200 rows

Credits & Balance

Minimum credit purchase $20
Credits roll over Never expire
Failed jobs Auto-refunded

Example: Fine-tune Mistral-7B on 500 medical Q&A pairs

Estimated tokens ~135K tokens
Rate (Fine-Tuning) $3.99 / M tokens
Computed cost $0.54
Deducted from balance $0.54

Example: Continual learning on Mistral-7B — 5 domains

Continual learning is currently in beta and not available for self-serve purchase. Request access if you'd like to evaluate it on your data.

Refund Policy: If a training job fails due to a system error, your credits are automatically refunded — no action needed. Unused credits are non-refundable and non-transferable. All payments are processed securely by Stripe — we never see your card details. By purchasing credits, you agree to our Terms of Service.


Production-grade security primitives.

Modern auth, RBAC, audit logging, strict CSP, HTTPS-only, atomic billing — the hardening you'd expect from a 2026 SaaS, exposed and tested.

Encryption at Rest

All model checkpoints and training data encrypted with AES-256 (Fernet). Secure delete enabled — no residual data on disk.

Security Headers

HSTS, X-Frame-Options DENY, Content-Type nosniff, XSS protection, strict Referrer-Policy, and Permissions-Policy on every response.

Audit Logging

Every API call logged with user, action, IP, and timestamp. Full audit trail for compliance reviews and incident response.

Role-Based Access

RBAC with granular permissions. Admin, user, and read-only roles. API keys separated from session tokens.

GDPR & Data Rights

One-click data export and account deletion. Your data, your control. Full compliance with data protection regulations.

Hardened Runtime

Non-root containers, health checks, safe model loading (no arbitrary code execution), and sanitized error responses.

Cleaner judge prompt-injection harden W5 S1.1 · 89e3cc5

NFKC normalization, zero-width strip, role-flip detection, per-call random nonce fence. Red-team fixture library runs in CI on every change.

Chat-template token strip W5 S1.2 · 265aed0, 2528708

Defends against tokenizer-token poisoning across Qwen3 <think>, Llama-3 <|eom_id|>, Phi-4 <|im_sep|>, Mistral, Gemma. ChatInject paper attack surface (arXiv 2509.22830).

Atomic per-user daily AI cost cap W5 S1.3 · 75f579a, 2a99c19

Single SQL transaction enforces a soft daily ceiling across all 5 user-triggerable LLM-spending routes — judge, rewrite, polarity, preference-pair, scoring — so a runaway script can’t rack up a five-figure bill.

IDOR existence-oracle defense W5 S1.4 · 2f63d07, 5b3ba58

403 → 404 across /status, /start_cl_task, and 5 sibling endpoints, with response-time symmetry so a probe can’t distinguish “exists but not yours” from “doesn’t exist.”

Modal upload MIME / magic-byte guard W5 S1.7 · 6d8dcf4, 06ff906

Rejects ZIP, PE, HTML, PDF, RAR, 7z magic bytes posing as CSV or JSONL. json.loads probe on the first JSONL line before the row stream begins.

Billing correlation_id end-to-end W5 S2.2 · cf6780c

Stripe webhook → add_credits_auto_refund traceable through one correlation ID. Refund or duplicate-webhook reconstruction is deterministic from the audit log.

Public /security trust page W5 S2.1 · 6b92b25

Retention numbers cited to the code that enforces them; vulnerability disclosure with safe-harbor language; reviewable in version control. Read the page →

security@ reporting channel W5 S2.3 · DNS 2026-05-07

Vulnerability reports to modelbrewai@gmail.com. 5-business-day acknowledgement, 10-business-day initial assessment, safe harbor for good-faith research.

18 silent-corruption RT tests W6 K2 · c694e04

Adversarial matrix covering tokenizer poisoning, lookalike-character spoofing, fix-order bypass, role spoofing, and mojibake against the cleaner pipeline. CI on every change.

Foundation-invariant property tests W6 K1 · 42ecffd

Determinism + monotonicity + revert-on-degrade enforced by property tests. The cleaner can’t produce a row whose post-clean score is lower than its pre-clean score — by construction.

Full security posture, retention schedule, and dated changelog at /security. Deployment options (managed cloud, Hybrid Export, enterprise roadmap) at /deployment.


Built by a practitioner, not a lab.

Near-zero catastrophic forgetting on our benchmarks — conditional on correct inference-time routing, validated on Mistral-7B (full 5-domain × 3-seed protocol) with an inference-time ablation on Gemma-2-9B. ModelBrew AI makes fine-tuning practical, accessible, and pilot-ready (SFT path); preference-tuning surface (SimPO/DPO) is in beta; continual learning is in beta.

Company

ModelBrew AI

Based in Frederick, Maryland. We build mathematically constrained fine-tuning technology that lets AI teams train on new data without losing what their models already know. Our platform runs on serverless GPUs — no infrastructure to manage, no MLOps team required.

Founder

Kiran Nayudu

Healthcare practitioner who built CRMA after watching fine-tuned models forget critical knowledge with every training run. Background in regulated industries and hands-on ML engineering. Built CRMA from first experiment to deployed API.

Learn about fine-tuning and continual learning.

Technical articles about stable fine-tuning and why it matters for production AI.

Featured

Why RAG Falls Short — And What Happens When You Bake Knowledge Into the Model

Everyone is building RAG pipelines. We took a different path: train knowledge directly into the model weights, across sequential domains, with near-zero forgetting.

Read more →
Comparison

DPO vs SimPO in 2026: Which Preference-Tuning Method Should You Use?

Side-by-side comparison of Direct Preference Optimization and SimPO — when each works, the trade-offs, and how ModelBrew picks the right one for your dataset.

Read more →
Guide

What Is Fine-Tuning? Why It Matters and How It's Changing AI

Fine-tuning explained for a broader audience — real-world use cases in healthcare, legal, code, and finance.

Read more →
Technical

What Are LoRA and QLoRA? A Practical Guide to Efficient Fine-Tuning

How LoRA and QLoRA made fine-tuning possible on consumer GPUs — and the stability problems they don’t solve.

Read more →
Product

How CRMA Solves Continual Learning

Stable backbone, swappable domain adapters, near-zero forgetting. No replay buffers, no growing memory.

Read more →
Analysis

Catastrophic Forgetting: The Silent Killer of Fine-Tuned Models

Why every fine-tuning run destroys prior knowledge, and what the research says about fixing it.

Read more →
Comparison

CRMA vs LoRA: What's the Difference?

Side-by-side comparison of standard LoRA and CRMA — when you need each, and what happens when you don’t use CL.

Read more →
Business

The Cost of Forgetting: Why Retraining From Scratch Is Unsustainable

The real-world compute, time, and quality costs of not having continual learning in your ML pipeline.

Read more →

Get in touch.

Questions about CRMA, enterprise pricing, or fine-tuning? Reach out.

Reach us directly

💬 Reddit

ModelBrew AI
Frederick, Maryland


Roadmap

Live

Fine-Tuning

Future

Real-Time CL


Ditch the vector database. Teach your model directly.

Start with 3 free runs on TinyLlama. No credit card, no setup, no retrieval pipeline to manage.


Full feature catalog

Every shipped capability, in one page.

Below is the long-form, machine-readable index of everything ModelBrew currently ships — fine-tuning, continual learning, preference tuning, the dataset cleaner, security primitives, the API + SDKs, pricing terms, engineering quality, and deployment patterns. Every item is anchored (#feature-*) and most cite the line of code or test that proves it. 132 features. 109 anchors. 12 FAQ entries. All cited.

Fine-tuning capabilities.

Six production-supported open-weight models across four architecture families, with LoRA + QLoRA adapters on Modal A100 GPUs. Every job ships with a sanitized HuggingFace-Hub README, a downloadable adapter ZIP, and a token-and-cost preview before the user clicks Train. The catastrophic-forgetting prevention (CRMA) is the differentiator; the rest of this section is the disciplined SFT path that feeds it.

Free tier

TinyLlama-1.1B-Chat-v1.0

3 runs/day, no credit card. The recommended way to test the pipeline end-to-end.

Pro · 7B

Mistral-7B-v0.3

The published 5-domain chain benchmark base. 26/31 zero-forget across 3 seeds.

Pro · legal

Saul-7B-Instruct-v1

Legal-domain instruction-tuned variant. Saul-7B 18/18 zero-forgetting milestone.

Pro · 8B

Qwen3-8B

Multilingual frontier-class small model. <think>-token strip handled per-family.

Pro · 9B

Gemma-2-9B-it

Google's 9B IT model with auto-attached license notice in the adapter ZIP.

Pro · 8B

Llama-3.1-8B-Instruct

Meta's 8B instruct tuned on a 128k context. Production-grade chat-template handling.

LoRA SFT with completion-only loss Shipped #feature-lora-sft

Standard supervised fine-tuning runs through a TRL SFTTrainer subclass that masks the prompt and only contributes loss on the assistant span. Completion-only loss is on by default and tested.

cite: utils/train.py:1060 _CRMASFTTrainer utils/train.py:1964 fine_tune_with_crma · tests: tests/test_sft_completion_only_loss.py

QLoRA support for memory-efficient training Shipped #feature-qlora

4-bit quantized base with low-rank adapters. Lets us fit 9B models comfortably on A100 with batch headroom for curriculum sort and replay batches.

cite: utils/train.py:1964 fine_tune_with_crma · flag: load_in_4bit in run config

Per-family chat-template autoresolve Shipped #feature-chat-template

Mistral, Saul, Qwen3, Gemma, Llama-3.1, TinyLlama, and Phi-4 chat templates are auto-detected and applied at training time. Production tokens (<think>, <|im_start|>, etc.) are stripped from uploaded user data by default to prevent template-leak attacks.

cite: utils/train.py:1842 _chat_template_for_model utils/train.py:1950 _apply_chat_template · tests: tests/test_chat_template.py, tests/test_template_v_parse.py

Pre-augmented + server-side row augmentation (up to 9x) Shipped #feature-augment

Optional augment_data flag expands the effective dataset. Validated combination of pre-build augmentation plus server-side replication, surfaced as a per-run knob.

cite: utils/augment.py · tests: tests/test_augment.py, tests/test_estimate_method_and_augment.py

Curriculum sort by NLL Shipped #feature-curriculum

Hard-example-first ordering inside an epoch. Optional and gated by config; off-by-default keeps SFT behaviour predictable for users who want a vanilla baseline.

cite: utils/train.py:47 _sort_by_curriculum utils/train.py:181 _score_texts_by_nll

Estimate-before-charge (token + cost preview) Shipped #feature-estimate

POST /estimate returns the token count and dollar cost a run will consume before the user clicks Train. The same accountant powers the spend cap and the auto-refund logic.

cite: backend/server.py:8207 POST /estimate backend/pricing.py

Modal A100 GPU on every training function Shipped #feature-a100

Production hardware on every training call. Every @app.function in modal_deploy.py uses gpu="A100" — this is not a research-vs-prod toggle, it's the only path.

cite: modal_deploy.py:420, 577, 751, 855 (every @app.function)

Per-run download endpoint (LoRA + CRMA + CL state + utils/inject.py) Shipped #feature-download

GET /download/{run_id} ships a ZIP containing your LoRA adapter, CRMA weights, cl_state.pt for chains, utils/inject.py, status.json, and a NOTICE file. Customers can leave the platform with their weights — this is the single most important on-prem-adjacent capability we ship today.

cite: backend/server.py:4371 GET /download/{run_id} /deployment · recipe 08

HuggingFace-Hub-shaped README export with sanitized hparams Shipped #feature-readme-export

GET /runs/{run_id}/readme.md returns a YAML-frontmatter, license-inheriting model card you can paste straight into HF Hub. The exporter has both an HTML-escape XSS guard for the user-supplied display name and a CRMA-IP guard that hard-fails the route if any patent-pending field name leaks.

cite: backend/server.py:4287 GET /runs/{run_id}/readme.md backend/readme_export.py:27 _escape_html backend/readme_export.py:63 _assert_safe_overlap

Run resume + chain checkpoint selection Shipped #feature-checkpoint-select

POST /runs/{run_id}/select_checkpoint lets you pick which checkpoint feeds the next chain link or the next download. Combined with the chain visualizer you can branch off any midpoint of a continual-learning chain.

cite: backend/server.py:2505 POST /runs/{run_id}/select_checkpoint

Auto column mapping (ShareGPT, OpenAI Chat, Alpaca, raw {prompt, response}) Shipped #feature-column-mapping

The dataset normalizer accepts any of four common formats and remaps to the canonical chat schema before validation. End-to-end mapping tests run on every release.

cite: backend/cleaner/normalizer.py · tests/test_column_mapping_e2e.py

Min-dataset-rows + finetune validation gate Shipped #feature-min-rows

Runs that won't converge are rejected at submit-time, not after the GPU bill arrives. Tested with adversarial single-row and empty-dataset payloads.

cite: backend/cleaner/service.py · tests/test_min_dataset_rows.py · tests/test_finetune_validation.py

Display-name sanitize Shipped #feature-display-name-sanitize

Control bytes, HTML, and zero-width characters are stripped from user-supplied run display names before storage. Defends downstream surface like the README exporter and the dashboard.

cite: tests/test_display_name_sanitize.py

Continual-learning capabilities (CRMA & chain).

The differentiator. CRMAConstrained Residual Mixing Adapter — is a patent-pending architectural primitive whose internal mixing matrix has spectral norm bounded by 1 (by Birkhoff's theorem). Per-task LoRA adapters compose against a shared CRMA substrate, with sequential chain construction at inference. The published number: 5-domain Mistral-7B 26/31 zero-forget across 3 seeds, with prior-task drift within measurement noise.

The published continual-learning number

On a 5-domain Mistral-7B benchmark with 3 seeds, modular LoRA on a CRMA backbone showed prior-task drift of −0.17% ± 0.17 loss-relative (within measurement noise), versus +42.96% ± 5.5 for naive sequential training. Per-seed ranges are disjoint. All forgetting numbers are conditional on correct inference-time routing — receipt at /claims.

CRMA — Constrained Residual Mixing Adapter Patent pending #feature-crma

The architectural CL primitive. US provisional patent filed 2026-02-28. Public details on the mathematical bound are in the arXiv-ready paper; the implementation lives in utils/crma.py and the inference-time injection in utils/inject.py.

cite: paper/crma_modular_cl_arxiv.tex · utils/crma.py (private) · utils/inject.py (shipped in adapter ZIP)

Sequential CL chain at inference Shipped #feature-cl-chain

POST /start_cl_task appends a new task to an existing chain root. GET /chain/{run_id} returns the parent-child task graph. Replay-version mismatch trip-wires refuse to chain if the replay dataset format drifts.

cite: backend/server.py:6665 POST /start_cl_task backend/server.py:8123 GET /chain/{run_id}

Chain visualization tree (frontend) Shipped #feature-chain-tree

The dashboard renders the parent-child task graph as a real tree, not a flat list. You can see your chain at a glance and pick any node as the base for the next task. No competitor ships this.

cite: frontend/src/components/CLChainTree.tsx · frontend/src/app/(app)/train/page.tsx

Spectral / null-space gradient projection (LoRA + CRMA) Shipped #feature-spectral

Direction protection across tasks. Gradient updates are projected away from the SVD bases of prior tasks for both LoRA and CRMA blocks. Combined with EWC magnitude protection it's a belt-and-braces defense against drift.

cite: utils/train.py:521 LoRAGradientProjection utils/train.py:607 CRMAGradientProjection utils/train.py:284 _compute_svd_bases

EWC Fisher penalty (magnitude protection) Shipped #feature-ewc

Elastic Weight Consolidation Fisher penalty as a complement to projection. Holds the weights that mattered for prior tasks closer to their post-train values, weighted by their Fisher importance.

cite: utils/train.py:76 _compute_ewc_fisher

AGEM gradient correction Shipped #feature-agem

Averaged-Gradient Episodic Memory: the current-task gradient is conflict-resolved against a replay reference gradient. If they conflict, we take the projection that doesn't increase the replay loss.

cite: utils/train.py:705 AGEMCallback

Reservoir replay sampling (cumulative, NLL-prioritized) Shipped #feature-replay

1% NLL-prioritized replay drawn from a reservoir of all prior-task data. Pairs with KD logit caching so the replay loss preserves prior-task behaviour, not just prior-task labels.

cite: utils/train.py:843 save_continual_state utils/train.py:254 _compute_kd_logits

LoRA EMA callback Shipped #feature-lora-ema

Exponential-moving-average of the LoRA weights. Stabilizes adapter trajectories on hard-domain corpora where the raw step-by-step weights wobble.

cite: utils/train.py:1387 LoRAEMACallback

Multi-holdout per-task eval (backward-transfer measurement) Shipped #feature-bwt

For every task in a chain we hold out a per-task eval and re-score it after every later task. That measurement is what surfaces in the dashboard as drift, and it's what powers the published 26/31 number.

cite: utils/train.py:2062 eval_dataset_paths utils/train.py:223 _evaluate_model

Replay-version mismatch trip-wire Shipped #feature-replay-version

If you try to chain a new task whose replay format doesn't match the chain root, the trip-wire refuses to start the run. Avoids silently corrupting a chain by mixing dataset schemas.

cite: utils/train.py:1811 CLReplayVersionMismatchError · tests/test_cl_replay_version_gate.py

CL state persistence (cl_state.pt) Shipped #feature-cl-state

The cumulative chain state — SVD bases, Fisher, replay reservoir, KD cache — is checkpointed to cl_state.pt after every task and shipped in the download ZIP. You can resume continual learning across runs without re-walking the prior tasks.

cite: utils/train.py:843 save_continual_state backend/server.py:4432 cl_state.pt in download

Multi-architecture verification (4 model families) Shipped #feature-multi-arch

The CRMA chain is verified across Llama-derived (TinyLlama, Llama-3.1), Mistral-derived (Mistral, Saul), Qwen3, and Gemma. Per-family chat-template and judge-token-strip rules are codified, not improvised.

cite: backend/server.py:1778 ALLOWED_MODELS /claims

Saul-7B 18/18 zero-forget milestone Shipped #feature-saul-18-18

Legal-domain CL milestone on Saul-7B-Instruct-v1. 18 of 18 prior-task evals held under the noise floor after sequential training across the canonical legal-domain chain.

cite: memory/milestone_saul7b_18_18.md (note: see /claims for current production-A100 truth)

Preference tuning (DPO / SimPO / length-norm DPO).

Three preference-learning paths: DPO, SimPO (reference-free), and length-normalized DPO. The cleaner ships a polarity / preference-pair audit that catches label-swapped pairs before they reach the trainer — the most common silent failure mode in DPO data.

DPO preference tuning Shipped #feature-dpo

Direct Preference Optimization: RLHF-free preference tuning over (chosen, rejected) pairs. Wired all the way through, from the method='dpo' branch in /start_run to the trainer.

cite: utils/train.py:3877 fine_tune_with_dpo modal_deploy.py:855 run_dpo_training_job backend/server.py:3484 method=='dpo'

SimPO (reference-free preference tuning) Shipped #feature-simpo

Reference-free preference tuning with simpo_beta and simpo_gamma knobs. No reference model required — useful when you want to preference-tune past a custom-trained base.

cite: utils/train.py:3251 fine_tune_with_simpo modal_deploy.py:751 run_simpo_training_job backend/server.py:3399 method=='simpo'

Length-normalized DPO opt-in Shipped #feature-dpo-length-norm

Counters length-bias gaming in the DPO objective. Plain DPO is vulnerable to a degenerate solution where the model just produces longer "chosen" outputs; the length-norm variant divides by sequence length, so the gradient pressure is on quality, not quantity. Recently shipped as Wave 4 Q4.

cite: commit 803df1e · tests/test_dpo_length_normalize.py

Polarity / preference-pair audit (cleaner) Shipped #feature-polarity-audit

Swap-invariant Gemini judge that catches label-swapped pairs in DPO datasets — the single most common silent failure mode in preference data. Ensemble option (2-judge, 4 calls/pair) is available as a higher-precision tier.

cite: backend/cleaner/polarity_eval.py · tests/test_polarity_prompt_injection.py · feature_pricing key polarity_pair_audit_ensemble

SimPO margin check (per-pair GPU forward pass) Shipped #feature-simpo-margin

Pre-train pair quality check: each (chosen, rejected) pair gets a forward pass on the base model and we score the SimPO margin. Pairs whose margin is already negative (or noise-level) are flagged for removal before they hit the trainer.

cite: backend/cleaner/simpo_margin.py · feature_pricing key simpo_margin_check

Dataset cleaner & data prep.

The cleaner is a separately-shipped product, not a side feature. 56 distinct issue codes across structural, chat-schema, quality, PII, domain-PII, military OPSEC, jailbreak, and duplicate detectors. ~28 autofix codes with granular per-category control. Score-floor revert contract: post-clean score is never lower than pre-clean — that's a property-tested invariant, not marketing copy.

Detector coverage at a glance

9 structural · 7 chat-schema · 17 quality (including jailbreak / prompt-injection) · 13 typed PII · 9 domain-PII with checksums (DEA, NPI, CUSIP, SWIFT/BIC, ABA, MRN, ICD-10, bar number, Bates) · 6 military OPSEC · 6 duplicate detection codes · ~28 autofix codes.

Structural validation (9 codes) Shipped #feature-structural

JSON / JSONL / array detection, ShareGPT, dual-format, empty rows, wrong type, unrecognized schema. Format-detect plus recovery on the way in — if we can salvage your file we will, but we never silently re-encode without telling you.

cite: backend/cleaner/validators.py:688 _validate_structural

Chat-schema validation (7 codes) Shipped #feature-chat-schema

Roles, role order, system-first, missing assistant, extra assistant turns. Catches the most common multi-turn-corpus errors before training.

cite: validators.py:817 _validate_chat_schema

Quality validators (17 codes) Shipped #feature-quality

Too short, too long, excessive whitespace, markdown noise, giant context, html noise, placeholder, GPT slop, unfinished, repetitive, refusal, soft-refusal, non-English, bad encoding, invisible unicode, prompt injection, jailbreak pattern. The single biggest source of dataset rot.

cite: validators.py:897 _validate_quality

Typed PII detectors (13 codes) Shipped #feature-typed-pii

Email, phone, SSN, API key, password, address, credit card, IBAN, DOB, IP, name heuristic, low-quality prompt, lazy output. Severities split per type so you can choose redact-vs-flag-vs-drop on a case-by-case basis.

cite: validators.py:1844 _has_strong_pii

Domain-specific PII with checksums (9 detectors) Shipped #feature-domain-pii

MRN, DEA (with checksum), ICD-10, NPI, CUSIP (with checksum), SWIFT/BIC, ABA routing (with checksum), bar number, Bates. The ones with mathematical checksums are validated — we don't flag a 9-digit number as an NPI unless it actually validates as an NPI.

cite: backend/cleaner/detectors_typed.py

Military OPSEC redactors (6 categories) Shipped #feature-opsec

Classification markings (CUI / FOUO / SECRET / TOP SECRET), MGRS grid coordinates, EDIPI, DTG (Date-Time Group), lat/long, network refs. Defense-customer feature; regulated-domain hard-block keeps auto-strip-refusals from running on these datasets unless explicitly opted in.

cite: backend/cleaner/validators.py (opsec_* codes) · backend/cleaner/autofix.py:111 REGULATED_DATA_DOMAINS

Jailbreak pattern detection Shipped #feature-jailbreak

DAN-style prompts, role-flip attempts, system-prompt leaks. Different signature from prompt-injection — this catches the user-facing rows you don't want your trained model emulating.

cite: backend/cleaner/jailbreak_detectors.py:115 detect_jailbreak_patterns

Duplicate detection (6 codes: exact, MinHash, RapidFuzz pairwise) Shipped #feature-dedup

Exact dupes, near-dupes via MinHash, near-dupes via RapidFuzz pairwise scan, duplicate outputs, system-prompt inconsistency, duplicate boilerplate. Three different algorithms because no single one catches everything.

cite: backend/cleaner/validators.py:1341+

~28 autofix codes (granular per-category) Shipped #feature-autofix

Encoding, whitespace, HTML, chat-template tokens, GPT slop prefix/suffix, repetition collapse, role normalize, empty-message trim, system-prompt standardize, markdown strip, slop closers, refusals (opt-in), exact dedup, near dedup, output dedup, empty rows, critical rows, short rows, lazy output, placeholder rows, unfinished, PII redact, OPSEC redact (6 categories). Toggle each independently.

cite: backend/cleaner/autofix.py:86 apply_autofixes

Score-floor revert (post-clean ≥ pre-clean, mathematical invariant) Shipped #feature-score-floor

The single most-load-bearing contract in the cleaner. If a cleaning pass produces a lower score than the input, we revert. Property-tested. This is non-negotiable — users have explicitly asked us to never lower their score, and the test suite enforces it.

cite: backend/cleaner/service.py:427 _compute_basic_floor · tests/test_cleaner_rewrite_score_floor.py · tests/test_score_floor_invariant.py

Per-row revert-on-degrade (LLM rewrite) Shipped #feature-revert-on-degrade

The LLM-rewrite path operates per row, and if a rewrite scores lower than the original, we keep the original. Combined with the fact-preservation diff this gives you LLM polish without LLM lobotomy.

cite: backend/cleaner/llm_rewrite.py:281 rewrite_rows

LLM judge (Gemini 2.5 Flash / Claude Haiku 4.5 auto-pick, 4-axis rubric) Shipped #feature-judge

4-axis quality rubric (faithfulness, helpfulness, format, safety) with 16-way concurrency. Auto-picks between Gemini 2.5 Flash and Claude Haiku 4.5 based on availability and pricing. Hardcoded-secret post-judge override flags rows whose content matches OpenAI sk-, GitHub PAT, Slack token, AWS AKIA, JWT, military markings, or mongo+srv URLs even if the judge missed them.

cite: backend/cleaner/llm_judge.py:75 llm_judge.py:48 _SECRET_PATTERNS

LLM rewrite with fact-preservation diff Shipped #feature-rewrite-fact-diff

If a rewrite moves a number, a PII token, or a structured fact, we reject the rewrite and keep the original row. The diff runs after the rewrite and before commit.

cite: backend/cleaner/llm_rewrite.py:564 _fact_diff backend/cleaner/llm_rewrite.py:669 _facts_preserved

Prompt-safety nonce-fence + role-flip splice (judge / rewrite / polarity) Shipped #feature-prompt-safety

NFKC normalization, zero-width strip, role-flip detection, per-call random nonce fence on all three LLM-using paths in the cleaner: the judge prompt, the rewrite prompt, and the polarity-pair prompt. The same defense applies everywhere a customer string flows into a model prompt.

cite: backend/cleaner/prompt_safety.py:117-260 · tests/test_judge_prompt_injection.py

Tool-call validator (JSON-schema for function-calling rows) Shipped #feature-tool-call

Validates function-calling rows against a JSON Schema. Free tier — turning your data into something an instruction-tuned model will follow shouldn't cost extra.

cite: backend/cleaner/tool_call_validator.py · tests/test_cleaner_tool_call_validator.py

Synthetic gap-filler (cluster-aware row generation) Shipped #feature-synthetic

Cluster the embedding space of your dataset, find under-represented regions, generate synthetic rows scoped to fill them. Surfaced via /api/generate-synthetic.

cite: backend/cleaner/synthetic_gen.py · backend/server.py:6492 /api/generate-synthetic

Cluster galaxy & diversity subset Shipped #feature-cluster-galaxy

Embedding-based dataset visualization. Pick a subset that maximizes coverage of the embedding manifold — useful when you have a 50k-row corpus but a 5k-row training budget.

cite: backend/cleaner/clustering.py · diversity.py · ClusterGalaxy.tsx · DiversitySubsetPanel.tsx

Concept trainer / scorer Shipped #feature-concept

Train a concept classifier on the fly — "is this row about negotiation tactics?" — then score the full corpus against it. Useful for slice-targeted curriculum.

cite: backend/cleaner/concepts.py · concepts_prebuilt.py · /scan/{id}/concept/score

Per-row novelty scoring Shipped #feature-novelty

Each row gets a novelty-vs-corpus score. Easy way to spot the rows that are doing the most actual teaching versus the rows that are filler.

cite: backend/cleaner/novelty.py · frontend NoveltyPanel.tsx

Slice analysis Shipped #feature-slice

Subset evaluation along arbitrary slices — per-language, per-tool, per-cluster, per-concept. Lets you spot the slice that's tanking your overall score.

cite: backend/cleaner/slicing.py · frontend SliceAnalysisPanel.tsx

Weak supervision (Snorkel-style labeling functions, free tier) Shipped #feature-weak-supervision

Compose programmatic labeling functions (regex, heuristic, vote) and run them across the dataset. Free tier — we'd rather make it cheap to label your own data than charge you for it.

cite: backend/cleaner/weak_supervision.py · tests/test_cleaner_weak_supervision.py

Label-error v2 (judge-disagreement, Cleanlab-flagship analog) Shipped #feature-label-error-v2

Run multiple judges on each row and flag the ones with high judge disagreement — those are the rows most likely to be mislabeled. The ML-research analog of Cleanlab's flagship feature.

cite: backend/cleaner/label_error_v2.py · tests/test_cleaner_label_error_v2.py

Cleaner presets (3: modelbrew_ft, openai_chat, support_bot) Shipped #feature-presets

One-click default fix bundles — sane starting points for the three most common dataset shapes we see. Each preset is a named configuration of detectors + autofixes.

cite: backend/cleaner/presets.py

Mojibake / encoding fixer Shipped #feature-mojibake

Handles the messy stuff — double-utf8, smart-quote regress, the "café" → "café" failure mode. Tested with a no-mojibake invariant test.

cite: validators.py:1905 _detect_bad_encoding · tests/test_cleaner_no_mojibake.py

Cleaner dry-run cost estimate Shipped #feature-cleaner-dry-run

POST /scan/{id}/dry-run previews per-feature credit cost before you commit. Same principle as the training estimate — you should never get billed for something whose price you didn't see first.

cite: backend/routes/cleaner.py:854 POST /scan/{id}/dry-run · /estimate at :816

Idempotency-key per user (cleaner upload + scan) Shipped #feature-idempotency

Replay the same upload or scan request and you'll get the same result, not a double-charge. Tested under concurrent migration races.

cite: tests/test_idempotency_key_per_user.py · tests/test_cleaner_idempotency.py · tests/test_idempotency_migration_concurrent_safety.py

Send-to-train (one-click handoff scan → run) Shipped #feature-send-to-train

POST /send-to-train/{scan_id} hands the cleaned dataset off to a training run without re-uploading. Closes the loop between cleaner and trainer.

cite: backend/routes/cleaner.py:1610 POST /send-to-train/{scan_id}

Eval pipeline (200-prompt judge with win-rate + CI) Shipped #feature-eval-pipeline

Auto-eval at end of training (free, platform-eaten). Re-eval at $0.06 per run, idempotent. Win-rate vs base, length-controlled win-rate, BERTScore, ROUGE-L, exact-match, fact-recall — all surfaced as component scores. Eval drift detection compares two eval rows for distribution shift.

cite: backend/server.py:7338 GET/POST /runs/{run_id}/eval utils/eval_lc.py:17 length_controlled_winrate utils/eval_drift.py:36 detect_eval_drift

Eval set hash + judge version stamping Shipped #feature-eval-set-hash

Every eval run records the eval set's content hash and the judge model version. If we change either, you can tell from the receipt — reproducibility, not just record-keeping.

cite: utils/eval_set.py · backend/server.py _serialize_eval_run

Security & trust primitives.

The trust posture is published in three pages with operational detail, not badges: /security with TLS / AES-256 / no-train pledge / retention numbers tied to backend code, /claims with every public number cited to source, and /status with a live Modal /health fetch. Below are the primitives themselves.

The no-train pledge

We never train on your data. TLS 1.2+ in transit, AES-256 at rest, sub-hour dataset retention. The retention numbers on the trust page are tied to the actual constants in backend/db.py, not aspirational copy. GDPR data export and self-delete are one-click. Vulnerability disclosure: 5 business days acknowledgement, safe-harbor protection for good-faith research.

TLS 1.2+ in transit, AES-256 at rest Shipped #feature-tls-aes

Modal-hosted endpoints terminate TLS. Storage encrypts at rest with AES-256. Public detail at /security sections 1 + 4.

cite: /security#tls /security#encryption

No-train pledge with retention tied to code Shipped #feature-no-train-pledge

Datasets sub-hour retention; checkpoints bounded to your run lifecycle. The numbers on the /security page match the constants in backend/db.py — not "soon" and not "we plan to."

cite: /security#no-train /security#retention

IDOR existence-oracle defense (403 → 404 + timing symmetry) Shipped #feature-idor

Insecure-Direct-Object-Reference defense: we don't tell an unauthenticated probe whether a run-id exists. 403 responses are rewritten to 404 and timing is matched so an attacker can't tell whose resource they're missing. Multi-test coverage including red-team panel.

cite: tests/test_idor.py · tests/test_start_run_idor.py · backend/tests/cleaner/test_p0_rt_idor.py · test_s3_2_file_id_idor.py

Atomic daily AI cost cap (cleaner) + training spend cap Shipped #feature-cost-cap

The cleaner's daily LLM-judge / rewrite cost cap is enforced in a single SQL transaction shared across all five cleaner routes — you can't race it with a parallel call. The training spend cap has the same property and is concurrency-tested. Free-tier 3-runs-per-day cap is also atomic.

cite: backend/routes/cleaner.py:622 _cleaner_ai_daily_cap_cents · tests/test_training_spend_cap.py · tests/test_training_spend_cap_atomic_under_concurrency.py · tests/test_free_tier_atomic_cap.py

Modal upload validator (MIME + magic byte) Partial — first-line only #feature-upload-validator

MIME-type and magic-byte guards on dataset uploads. Honest caveat: the validator currently inspects only the first JSONL line, which is on the Wave 5 P2 backlog. Marketing-grade ✓ but we list it partial here on purpose.

cite: tests/test_modal_upload_dataset_validate.py · backend/cleaner/validators.py _validate_file_too_large

GDPR data export + self-delete Shipped #feature-gdpr

GET /account/data-export returns all your user data as JSON. DELETE /account wipes the account with an audit log. CSV-injection sanitization on the export prevents spreadsheet-paste attacks.

cite: backend/server.py:8880 GET /account/data-export backend/server.py:8846 DELETE /account

Public /security + /claims + /status + /compare + /deployment Shipped #feature-public-trust

Five public trust pages, all sharing the same pearl-white catalog format. /security for posture, /claims for receipts, /status for live uptime, /compare for vendor positioning, /deployment for the honest hosted-vs-export-vs-VPC story.

cite: /security /claims /status /compare /deployment

/.well-known/llms.txt for AI crawler discovery Shipped #feature-llms-txt

Site map written for LLM crawlers (Gemini, ChatGPT, Perplexity). Lists the canonical pages and the things we want AI systems to remember about ModelBrew. Machine-readable and LLM-friendly.

cite: /.well-known/llms.txt

Schema.org JSON-LD on every trust page Shipped #feature-jsonld

Every public page emits Schema.org Organization + SoftwareApplication + FAQPage JSON-LD where appropriate. Rich-result eligible, and machine-readable for AI search.

cite: grep application/ld+json across deploy/*.html

arXiv-ready paper + US provisional patent Shipped #feature-paper-patent

Submission-ready CRMA paper at paper/crma_modular_cl_arxiv.tex + .docx + .tar.gz. US provisional patent filed 2026-02-28. Both are public trust artifacts; neither is required to use the product.

cite: paper/crma_modular_cl_arxiv_final.docx · paper/crma_arxiv_submission.tar.gz /claims

NOTICE file + Gemma license auto-attach Shipped #feature-notice-license

Every adapter ZIP ships with a NOTICE file carrying license attribution. Gemma runs additionally auto-attach the Gemma license notice for compliance with Google's Gemma terms.

cite: backend/server.py:4464 NOTICE backend/server.py:1797 _GEMMA_LICENSE_NOTICE

API surface, SDKs, auth, integrations.

ModelBrew exposes a full OpenAI-compatible inference surface (/v1/chat/completions), two SDKs (Python on PyPI and JavaScript / TypeScript on npm), and a per-API-key auth model that's finer-grained than most hosted-fine-tuning vendors. Every endpoint is documented in the live /openapi.json with populated request bodies.

/v1/chat/completions (OpenAI-compatible envelope) Shipped #feature-openai-compat

Drop-in for any OpenAI Python or Node SDK call — just swap base_url and api_key. The biggest single differentiator vs Argilla / Cleanlab / Snorkel / DeepEval (none of which ship inference at all).

cite: backend/server.py:5930 · tests/test_openai_sdk_compat.py · tests/test_v1_chat_completions*.py

/v1/{run_id}/generate?stream=true (Server-Sent Events) Shipped #feature-sse-stream

Token-by-token streaming over SSE. On par with the streaming surface from any hosted-inference vendor.

cite: backend/server.py:5336 modal_deploy.py:2004 generate_text_stream

/v1/runs/{run_id}/compare (batched A/B, 1..50 prompts) Shipped #feature-compare-batched

Batched A/B comparison: send up to 50 prompts and get FT-vs-base side-by-side in one call. Faster than running them through the chat endpoint twice.

cite: backend/server.py:5487

/v1/models paginated catalog (Cache-Control + ETag) Shipped #feature-models-catalog

The OpenAI-shaped models catalog, paginated, with proper HTTP caching headers. Lets API consumers poll cheaply.

cite: backend/server.py:6346 · tests/test_v1_models_pagination.py · tests/test_v1_models_cache_control.py

Python SDK (modelbrew on PyPI) Shipped #feature-python-sdk

Client, Runs, Chat, Keys, streaming, retries, rate-limit info. 12 unit-test files. pip install modelbrew and you're moving.

cite: sdk/python/src/modelbrew/_client.py:201 · runs.py · chat.py · keys.py · streaming.py

JavaScript / TypeScript SDK (modelbrew on npm) Shipped #feature-js-sdk

Same surface as the Python SDK, fully typed, tested with Vitest (9-test suite). Works in Node, Deno, and modern browsers.

cite: sdk/js/src/client.ts:201 · runs.ts · chat.ts · keys.ts · streaming.ts · errors.ts

Per-API-key allowed-runs / allowed-models scoping Shipped #feature-key-scoping

Least-privilege keys: each API key carries an allow-list of run-ids and model-ids it can access. Lets you ship a key to a partner that can only hit your finance-ops adapter and nothing else.

cite: backend/db.py api_key_can_access_run / api_key_can_access_model

Per-API-key RPM + TPM rate limits Shipped #feature-rpm-tpm

Per-key requests-per-minute and tokens-per-minute throttling. Defends your spend and your latency targets from any single key going hot.

cite: backend/server.py:8470 PATCH /me/api-keys/{id} · tests/test_api_keys_rpm_tpm_creation.py · tests/test_per_key_rate_limit.py

API-key rotation (one-shot) Shipped #feature-key-rotate

POST /me/api-keys/{id}/rotate rotates a key in place without revoking it. Useful for scheduled rotations on production keys.

cite: backend/server.py:8660

Per-key usage report (tokens + requests + cost) Shipped #feature-key-usage

GET /me/api-keys/{id}/usage returns the per-key usage rollup. Useful for chargeback and for spotting the key that's spending more than it should.

cite: backend/server.py:8538

Bearer-or-API-key dual auth Shipped #feature-bearer-or-api

One endpoint, both auth modes. Bearer JWTs for the dashboard, API keys for production. Same code path on the server side.

cite: backend/auth.py:417 get_current_user_or_api_key

8h JWT + 30d refresh + bcrypt + timing-safe lookup Shipped #feature-jwt-refresh

JWT with 8-hour TTL, refresh-token endpoint with 30-day TTL (mobile-friendly). bcrypt password hashing with a dummy hash on the username-not-found path so timing can't be used for username enumeration. Scope narrowing: full vs inference_only.

cite: backend/auth.py:19 JWT_EXPIRY_HOURS auth.py:20 REFRESH_TOKEN_EXPIRY_DAYS auth.py:53 _bcrypt + _DUMMY_HASH auth.py:95 SCOPE_FULL/INFERENCE_ONLY

Stripe webhook signature verify + duplicate-event idempotency Shipped #feature-stripe-webhook

Webhook signatures are verified, and duplicate events are detected and ignored. You can replay the same Stripe event ten times and you'll get credited once.

cite: backend/server.py:1957-2030

Auto-refund on stuck runs (90-min sweeper) Shipped #feature-auto-refund

If a run hangs past the active-job window (90 min), the sweeper auto-refunds the credit without a support ticket. Every billing event ships with a correlation_id traceable end-to-end across Stripe webhook → add_credits → auto_refund.

cite: tests/test_stuck_sweep_refund.py · tests/test_billing_correlation_id.py · tests/test_stale_stream_detection.py · backend/server.py _auto_refund + _billing_log

/openapi.json with populated request bodies Shipped #feature-openapi

The OpenAPI spec is live and complete — every endpoint, every request body, populated correctly. Tested in CI so it doesn't drift.

cite: tests/test_openapi_exposure.py · tests/test_openapi_request_bodies_populated.py

Pricing & commercial terms.

Flat per-million-token pricing across the board. Free TinyLlama tier for testing the pipeline, $7.50 signup credit on advanced models, $20 minimum top-up, and credits never expire. Auto-refund on stuck runs is automatic, not on request. See finetuning.html#pricing for the live table.

Flat $3.99/M FT + $4.99/M CL + $1/M inference Shipped #feature-flat-pricing

One advanced tier, no per-model price haggling. Inference is $1/M combined input + output, OpenAI-shaped so it's easy to compare to OpenAI's own fine-tune-and-serve pricing.

cite: backend/pricing.py:24 PRICE_PER_M_TOKENS backend/pricing.py:34

$5/200 rows for cleaning + 7 feature-pricing tiers Shipped #feature-cleaner-pricing

$5 per 200 rows for the standard cleaning bundle. Per-feature tiers (free / 1x / 2x / 3x Clean): LFs and tool-call validator are free; label-error v2 is 1x; DPO polarity audit is 2x; synthetic gap-filler is 3x; SimPO margin check is 2x.

cite: backend/cleaner/feature_pricing.py:42 FEATURE_PRICES

TinyLlama free tier (3 runs/day) + $7.50 signup bonus Shipped #feature-free-tier

TinyLlama is free, capped at 3 runs/day per user (atomic). New accounts get a 75-credit ($7.50) signup bonus that covers roughly one minimum cleaner-with-AI run. No credit card required to test.

cite: backend/pricing.py:14 backend/db.py:1016 SIGNUP_BONUS_CENTS · tests/test_free_tier_atomic_cap.py

$20 minimum top-up, credits never expire Shipped #feature-min-topup

$20 minimum for credit purchases. Credits never expire — if you load $20 today and don't use it for a year, it's still there.

cite: backend/pricing.py:5-6 (header docstring)

$0.05 minimum inference reserve Shipped #feature-min-reserve

Pre-charge gate — we won't kick off an inference call unless you have at least 5 cents on the books. Stops a chain of micro-debits from going negative on a flaky network.

cite: backend/pricing.py:53 MIN_INFERENCE_RESERVE_CENTS

Auto-eval at end of training (free, platform-eaten) Shipped #feature-auto-eval-free

The first eval at the end of training is free — we eat the cost. User-triggered re-evals are $0.06 each (idempotent). Receipt for "did this run actually do anything" without an extra line item.

cite: backend/pricing.py:39-44

Stripe Checkout (one-time credit purchases) Shipped #feature-stripe-checkout

One-time credit purchases via Stripe Checkout — no subscription, no auto-renew, no surprise charges. Full webhook signature verification + duplicate-event idempotency on the server side.

cite: backend/server.py:1899 POST /create-checkout-session backend/server.py:1957 POST /stripe/webhook

Engineering quality & test coverage.

~205 test files across tests/ (144) + backend/tests/cleaner/ (41) + Python SDK (12) + JavaScript SDK (9). Tests we publicly stand behind are the K-suite (K1-K5, foundation invariants), the prompt-injection 28-test pack, the IDOR existence-oracle defense suite, and the billing-correlation-id end-to-end suite. This is the ship gate. We don't release if it's red.

K1 cleaner foundation invariants (property-tested) Shipped #feature-k1

Property-tested invariants for determinism, monotonicity, and revert-on-degrade. The single most load-bearing test in the cleaner. If K1 goes red, no release.

cite: backend/tests/cleaner/test_K1_invariants.py · commit 42ecffd

K2 silent-corruption red-team matrix (18 tests) Shipped #feature-k2

Five adversarial input patterns × multiple paths through the cleaner = 18 tests that pin down the classes of silent corruption a smart adversary could try.

cite: tests/test_silent_corruption_matrix.py · commit c694e04

K3 persona UAT (3 personas × 12 tests) Shipped #feature-k3

Three end-user personas drive a scripted user-acceptance pass through the platform: signup, dataset upload, clean, train, eval, download. Catches regressions a unit test wouldn't.

cite: tests/test_persona_uat_e2e.py · commit 633de57

K4 100-concurrent /v1/generate load test Shipped #feature-k4

100-way concurrent inference (mocked Modal) load test. Catches lock contention and per-user quota races that a single-thread happy-path test wouldn't.

cite: tests/test_load_generate_100_concurrent.py · commit 7071663

K5 chaos kill-mid-run / kill-mid-prefs Shipped #feature-k5

SIGKILL the trainer mid-run, mid-prefs, mid-eval. The system must clean up, refund, and let the user retry without manual intervention. We do this in CI.

cite: tests/test_chaos_kill_mid_run.py · commit 7d1ec14

Judge prompt-injection 28-test pack (judge + polarity) Shipped #feature-prompt-injection-pack

28 prompt-injection / role-flip / polarity-splice payloads in a red-team fixture library. Runs in CI on every change so the prompt-safety patches keep defeating the original payloads.

cite: tests/test_judge_prompt_injection.py · tests/test_polarity_prompt_injection.py

Billing correlation_id end-to-end Shipped #feature-billing-correlation

Every billing event — Stripe webhook, add_credits, auto_refund — carries the same correlation_id. You can trace any cent of any charge from the Stripe receipt to the database row to the refund event.

cite: tests/test_billing_correlation_id.py · backend/server.py _billing_log

Refund family (race, libsql IntegrityError, brittle exception, cross-container, merge_ok_none, policy) Shipped #feature-refund-family

~7 test files covering refund correctness under every database race we've seen in production. Atomic refund means atomic refund — no double-refund, no missed-refund.

cite: tests/test_refund_*.py

Stuck-sweep auto-refund + stale-stream detection Shipped #feature-stuck-sweep

Cron job sweeps stuck runs every 90 minutes and refunds them. Stream detection catches the heartbeat-stopped-but-Modal-thinks-it's-alive failure mode.

cite: tests/test_stuck_sweep_refund.py · tests/test_stale_stream_detection.py

6-agent post-session audit panel (3 expert + 3 RT) Internal methodology #feature-rt-audit-panel

After every Wave session, we run a 6-agent audit panel — 3 expert reviewers (UX, data, security) and 3 red-team reviewers, all in fresh sessions, in parallel, then consolidated. Not a code file — a release-process discipline, but it's how the K-suite invariants get found.

cite: .planning/WAVE5/WAVE5_AUDIT_EXPERT.md · .planning/WAVE5/WAVE5_AUDIT_REDTEAM.md

Deployment patterns.

The honest version — full audit at /deployment. Two patterns ship today: managed Modal cloud (the default SaaS) and adapter export (download your weights, run anywhere PEFT runs). Dedicated single-tenant Modal workspaces and full VPC peering are scoped per pilot, typically 2-3 weeks build time. We do not currently ship a Dockerfile or single-binary self-host.

A. Fully managed Modal cloud (default SaaS) Shipped #feature-managed-modal

The default. Multi-tenant Modal workspace on the production URL https://fourwheels2512--crma-finetune-fastapi-app.modal.run. Every training function uses A100. Modal Pro tier with paid headroom; we do not run on the free Modal tier.

cite: modal_deploy.py:1-180 (every @app.function: gpu="A100")

B. Adapter export → run anywhere PEFT runs (THE on-prem-adjacent lever) Shipped #feature-adapter-export

The marketing-grade lever. "Export your trained adapter and run inference on your own infrastructure (LoRA + CRMA weights + HuggingFace-compatible model card) — works today via GET /v1/{run_id}/download and GET /v1/{run_id}/readme.md." Adapter ZIP includes adapter/ (LoRA + adapter_config.json), crma_adapters.pt, cl_state.pt, loss_curve.json, status.json, utils/inject.py, and NOTICE. Documented end-to-end in recipe 08.

cite: backend/server.py:4371 GET /download/{run_id} backend/server.py:4287 GET /runs/{run_id}/readme.md · docs/recipes/08-export-to-huggingface-hub.md /deployment

C. Dedicated single-tenant Modal workspace + VPC peering (per-pilot) Available with caveat — 2-3 weeks per pilot #feature-vpc-pilot

Dedicated single-tenant Modal workspace and VPC peering / air-gap-adjacent deployments are scoped per pilot, typically 2-3 weeks build time. Not a button click today — an engineering engagement. We list it here because it's available, not because it's productized yet.

cite: /deployment · operator scoped per-pilot

D. Customer-hosted Docker / single-binary self-host Not shipped — do not market #feature-not-shipped-docker

We do not currently ship a Dockerfile, docker-compose, helm chart, k8s manifest, or terraform module. The "ships as a Docker container" copy was removed from earlier marketing in commit 2fc4d8a after a red-team panel found it had zero code backing. We list it here on purpose: if you see "Docker" or "single binary" in our marketing, treat it as a bug and tell us.

cite: commit 2fc4d8a /deployment

Cloudflare Pages deployment (UI + landing + docs) Operational #feature-cloudflare

Three static surfaces on Cloudflare Pages: modelbrew.ai (landing + trust pages from deploy/), app.modelbrew.ai (Next.js dashboard from frontend/out/), and the docs site from docs/.

cite: landing.html · deploy/*.html · frontend/out/ · docs/index.md

Turso production database Operational #feature-turso

Replicated SQLite on Turso ($5.99/mo). Same SQL surface as local SQLite (which we use in dev), with cross-region replication on the production path.

cite: memory/project_turso_active.md

Frequently asked questions.

The same Q&As as our Schema.org FAQPage JSON-LD — published here for humans, mirrored above for AI search.

How many models does ModelBrew support for fine-tuning?

Six production models across four architecture families: TinyLlama-1.1B (free tier), Mistral-7B-v0.3, Saul-7B-Instruct (legal domain), Qwen3-8B, Gemma-2-9B-it, and Llama-3.1-8B-Instruct. Every training function in production runs on Modal A100 GPUs.

Is ModelBrew compatible with the OpenAI SDK?

Yes. ModelBrew exposes /v1/chat/completions with the OpenAI envelope so you can swap base_url and api_key on the OpenAI Python or Node SDKs and your code keeps working. SSE streaming via /v1/{run_id}/generate?stream=true is also supported.

Can I download my trained model and run it elsewhere?

Yes. GET /v1/{run_id}/download returns a ZIP containing your LoRA adapter, CRMA weights, cl_state.pt for chains, utils/inject.py, status.json, and a NOTICE file. GET /v1/{run_id}/readme.md returns a HuggingFace-Hub-shaped model card. You can run inference on any infrastructure that runs PEFT — your own GPU, an inference provider, or HuggingFace Hub.

How does ModelBrew handle catastrophic forgetting?

Through CRMA — Constrained Residual Mixing Adapter — a patent-pending architectural primitive whose internal mixing matrix has spectral norm bounded by 1. On a 5-domain Mistral-7B benchmark with 3 seeds, modular LoRA on a CRMA backbone showed prior-task drift within measurement noise (−0.17% ± 0.17 loss-relative), versus +42.96% ± 5.5 for naive sequential training. Numbers are conditional on correct inference-time routing.

What does the dataset cleaner check for?

56 distinct issue codes across structural validation, chat schema, quality (gpt-slop, refusals, prompt-injection, jailbreak), 13 typed PII detectors, 9 domain-specific detectors with checksums (DEA, NPI, CUSIP, SWIFT/BIC, ABA, MRN, ICD-10, bar number, Bates), 6 military OPSEC categories, duplicates (exact + MinHash + RapidFuzz pairwise), and ~28 autofix codes. Post-clean score is never lower than pre-clean — that contract is property-tested.

How much does ModelBrew cost?

Fine-tuning is $3.99 per million tokens. Continual learning is $4.99 per million tokens. Inference is $1.00 per million tokens combined input + output. Dataset cleaning is $5 per 200 rows. TinyLlama is free with 3 runs per day and a 75-credit ($7.50) signup bonus. Minimum top-up is $20 and credits never expire. Auto-refund on stuck runs is automatic, not on request.

Does ModelBrew train on my data?

No. The no-train pledge is published on /security with retention numbers tied to backend code. Datasets are sub-hour retention, checkpoints are bounded to your run lifecycle, and we offer GDPR data export and self-delete. TLS 1.2+ in transit, AES-256 at rest.

What preference-tuning methods are supported?

DPO, SimPO (reference-free with simpo_beta / simpo_gamma knobs), and length-normalized DPO opt-in (commit 803df1e) which counters the length-bias gaming that plain DPO is vulnerable to. The cleaner ships a polarity / preference-pair audit that catches label-swapped pairs before they hit training.

Is there an SDK?

Yes — both Python (modelbrew on PyPI) and JavaScript / TypeScript (modelbrew on npm). Both expose Client, Runs, Chat (OpenAI-compatible), Keys, streaming, automatic retries, and rate-limit headers. The Python SDK has 12 unit-test files; the JS SDK has a 9-file Vitest suite.

What deployment patterns are available today?

Two: (1) the fully managed Modal cloud (default SaaS), and (2) adapter export — download your LoRA + CRMA weights and run inference anywhere PEFT runs. Dedicated single-tenant Modal workspaces and full VPC peering / air-gapped deployments are scoped per pilot, typically 2-3 weeks build time. We do not currently ship a Dockerfile or single-binary self-host. See /deployment for details.

How do you prevent prompt injection in the cleaner LLM judge?

Three layers: (1) NFKC normalization plus zero-width-character strip, (2) a per-call random nonce fence around the user content, and (3) hardcoded-secret post-judge override that flags rows whose content matches sk-, AKIA, JWT, mongo+srv URLs, or military classification markings even if the judge missed them. A 28-test red-team pack runs in CI. Patches apply to the judge prompt, the rewrite prompt, and the polarity-pair prompt.

Are there safeguards against runaway spend?

Yes. An atomic daily AI cost cap is shared across all five cleaner routes with a SQL transaction. The training spend cap is concurrency-safe (race-tested). Free-tier users get an atomic 3-runs-per-day cap. Stuck or failed runs auto-refund within 90 minutes — no support ticket needed — and every billing event carries a correlation_id traceable end-to-end from Stripe webhook through add_credits to auto_refund.

Start Free — No Card 3 runs/day