The Alternative to RAG
Near-Zero Forgetting
RAG looks up answers every time. ModelBrew bakes knowledge directly into the model. Train across multiple sequential domains — your model keeps what it learns, with prior-task drift within measurement noise on our 3-seed Mistral-7B benchmark. No vector database, no retrieval pipeline.
3 free runs/day on TinyLlama. Pro from $3.99/M tokens. See pricing
Mistral-7B, 5 sequential domains, 3 seeds. Per-seed MODULAR and NAIVE ranges are disjoint at every seed. All forgetting numbers are conditional on correct inference-time routing. The 98/100 and 18/18 retention scores are first-author-evaluated with Wilson 95% confidence intervals; a blinded two-rater audit is on our roadmap.
Live Continual Learning
Beyond training-time CL: teach your model new facts live — in a single pass, no retraining. Instantly answerable, and nothing it already knows is forgotten. In private beta soon.
RAG retrieves. ModelBrew remembers.
RAG systems look up answers from documents every time — slow, fragile, and expensive to maintain. ModelBrew trains the knowledge directly into the model weights. No vector database. No chunking pipeline. Your model just knows it.
Upload your data
Medical notes, legal docs, code, anything.
Train with CRMA
CRMA guards the model so it can learn without forgetting.
Done. Nothing lost.
Your model knows the new stuff AND still remembers the old stuff.
Built for teams that can't afford to forget.
Teams training models across multiple domains — without retraining from scratch every time.
Clinical NLP
Train on radiology reports, then clinical notes, then pathology — without forgetting prior specialties. Built by a healthcare practitioner who hit this problem firsthand.
Multi-Practice Firms
Fine-tune on contract review, then case law, then regulatory filings. Each practice area improves without degrading the others.
Cross-Asset Intelligence
Equities research, fixed income, credit analysis — one model that learns sequentially across asset classes without catastrophic forgetting.
Multi-Department AI
Designed for stacking domain adapters over time — support tickets, internal docs, product specs, HR policies — without retraining a separate model per department. Currently validated on the 5-domain Mistral-7B benchmark; production deployments are in beta.
Production Pipelines
Plug into existing CI/CD. Upload data per domain, choose standard FT or continual learning, track per-domain metrics and drift over time via API.
Dataset Optimizer
Clean your fine-tuning dataset before training. 60+ validator codes, AI-judge scoring with score-floor-gated rewrite, structural pair audit + judge-based polarity sample, tool-call validation, jailbreak + military-OPSEC + industry-specific PII detection. Free local validators in your browser; AI-judge cleaning is $5 per 200 rows.
60+ validator codes
Format, schema, length, dedup (exact + near + semantic), encoding, GPT-slop, refusals, repetition, mislabel detection. Every flag points back to a row index.
AI judge + rewrite
Four-axis judge with calibration exemplars; optional 14-dim and G-Eval rubrics. Rewriter preserves every number, URL, named entity, and acronym — verified by a fact-diff before the row ships.
DPO / ORPO structural audit
Eight structural defect codes — identity pairs, near-duplicate chosen, both-refusals, both-too-short, extreme length bias, sycophantic chosen, refusal-as-chosen, missing prompt. The pair-level checks row-level scanning misses.
Tool-call validation
OpenAI tool_calls and Anthropic tool_use shape detection. Missing-required-arg and wrong-arg-type are critical; unknown-arg is a warning. Built for shipping agentic fine-tunes.
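The severity split above can be illustrated with a small checker. This is a minimal sketch, not the shipped validator: it assumes the tool schema is a JSON-Schema-style dict with `required` and `properties`, and that arguments arrive as a JSON string (as in OpenAI-style `tool_calls`). The function name `check_tool_call` is hypothetical.

```python
import json

# JSON Schema type names mapped to Python types.
_TYPES = {"string": str, "integer": int, "number": (int, float),
          "boolean": bool, "object": dict, "array": list}


def check_tool_call(arguments_json: str, schema: dict) -> list:
    """Return (severity, message) findings for one tool call's arguments."""
    try:
        args = json.loads(arguments_json)
    except json.JSONDecodeError:
        return [("critical", "arguments are not valid JSON")]
    if not isinstance(args, dict):
        return [("critical", "arguments must be a JSON object")]

    props = schema.get("properties", {})
    findings = []
    # Missing required arguments are critical.
    for name in schema.get("required", []):
        if name not in args:
            findings.append(("critical", f"missing required arg: {name}"))
    for name, value in args.items():
        if name not in props:
            # Unknown arguments are only a warning.
            findings.append(("warning", f"unknown arg: {name}"))
            continue
        expected = _TYPES.get(props[name].get("type"))
        if expected is None:
            continue  # no declared type, nothing to check
        # bool is a subclass of int in Python, so reject it for numeric types.
        is_bool = isinstance(value, bool)
        ok = isinstance(value, expected) and (expected is bool or not is_bool)
        if not ok:
            findings.append(("critical", f"wrong type for arg: {name}"))
    return findings
```

An empty list means the call passed every check in this sketch.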
Jailbreak · OPSEC · typed PII
Eight jailbreak categories (prompt injection, role bypass, system extraction, encoding attacks). Six military OPSEC codes (MGRS, EDIPI, classification markings, DTG, lat/long, network refs). Nine industry-specific PII detectors (medical: MRN/DEA/ICD-10/NPI, financial: CUSIP/SWIFT/ABA, legal: bar number/Bates) on top of the standard 10-type regex PII pass.
Proven at 100,000 rows
250 rows / sec on a single worker, peak RSS under 1.5 GB. End-to-end scan of 100k OASST1 and 100k military corpora. Real benchmark, not a marketing number.
Supports JSONL, CSV, and JSON · Up to 50MB · Local validators free; AI-judge cleaning $5 / 200 rows
Three lines to your first training run.
Works with any JSONL dataset. Or use the web UI — no code needed. · Full API docs →
Three papers. Real experiments. Ongoing research.
CRMA comes from original research — not a wrapper around existing tools. We publish our methodology, run multi-seed experiments, and update the algorithm based on results. Patent pending (US provisional filed Feb 2026).
Six CL Methods Tested — Six Failures
EWC, replay, gradient projection, knowledge distillation, O-LoRA, 10-component stacks. Best result: 58.4% forgetting. We tested them all so you don’t have to.
Preprint · v2-v7 experiments · TinyLlama & Mistral-7B
Read paper →
Near-Zero Forgetting on Mistral-7B Across 5 Domains
Modular LoRA on a spectrally bounded CRMA backbone: −0.17% ± 0.17 MODULAR drift vs +42.96% ± 5.5 NAIVE forgetting across 3 seeds. Per-seed ranges disjoint. Validated on 5 models across 4 architecture families. Patent pending.
Preprint · 3 seeds · 5 domains · Mistral-7B & Gemma-2-9B
Read paper →
Current Research & Development
Multi-seed experiments across 3 random seeds on Mistral-7B. 5 real-world domains (medical, legal, financial, code, science). Results reproducible across seeds.
Enhanced reasoning via self-distillation fine-tuning (SDFT). Scale testing beyond 7B. Head-to-head benchmark against O-LoRA and other academic CL methods.
Real-time continual learning (streaming updates). Agent fine-tuning with tool-use preservation. Automatic domain boundary detection.
Numbers, not promises.
CRMA has been tested across multiple model scales and domains. Here's what the benchmarks show.
| Method | Forgetting | Overhead | CL Support |
|---|---|---|---|
| CRMA | -0.17% drift | None | Built-in (beta) |
| Naive LoRA | +43% (7B) / +225% (1.1B) | None | No |
| OpenAI | No CL | N/A | No |
| Mistral / Together | No CL | N/A | No |
How we measure: "Forgetting" = change in holdout loss on previously learned domains after training on new ones. Negative = the model got slightly better (ideal). Positive = knowledge was lost. Measured across 5 real-world domains (medical, legal, financial, code, science) on Mistral-7B, averaged over 3 random seeds. See the Pricing page for current rates.
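As an illustration of the measurement described above, here is a minimal sketch of a loss-relative drift calculation. It assumes drift is the percent change in holdout loss relative to the loss recorded right after a domain was first trained; the function names and the example losses are hypothetical, not benchmark data.

```python
def drift_percent(loss_before: float, loss_after: float) -> float:
    """Relative change in holdout loss on an earlier domain, in percent.

    Negative means the loss went down (slight improvement);
    positive means the loss went up (forgetting).
    """
    if loss_before <= 0:
        raise ValueError("baseline loss must be positive")
    return (loss_after - loss_before) / loss_before * 100.0


def mean_drift(before: dict, after: dict) -> float:
    """Average drift over every domain listed in `before`."""
    drifts = [drift_percent(before[d], after[d]) for d in before]
    return sum(drifts) / len(drifts)


# Made-up holdout losses, for illustration only.
before = {"medical": 1.00, "legal": 2.00}
after = {"medical": 0.99, "legal": 2.10}
print(round(mean_drift(before, after), 2))
```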
Capability vs the Top-5 fine-tuning platforms
An honest scorecard. Some cells are A; some are not. ModelBrew is behind A-grade incumbents on SFT polish, DPO/SimPO ergonomics, and pricing/billing — and ahead on the two surfaces no other hosted platform ships: a built-in pre-train data gate and continual learning.
| Capability | Together.ai | OpenAI FT | HF AutoTrain | Predibase | Modal+Axolotl | ModelBrew |
|---|---|---|---|---|---|---|
| SFT correctness | A | A | B | A | A | C+ |
| DPO / SimPO | A | — | C | B | A | B− |
| Real eval (win-rate, MMLU) | A | A | B | A− | DIY | A |
| Inference deploy | A | A | C | A | DIY | A |
| Pre-train data gate (Cleaner) | None | None | None | Limited | None | A− |
| Pricing / billing safety | B | A | B | B | A | C+ |
| API + SDK | A | A+ | C | A | DIY | A |
| Continual learning (regulated / MLOps) | None | None | None | None | None | C− (beta) |
How we grade: Grades reflect ModelBrew's own internal scorecard against five hosted-fine-tuning competitors as of May 2026. ModelBrew's own grades come from its post-Wave-3.6 internal blueprint audit; competitor grades reflect publicly documented capability surfaces. SFT, DPO/SimPO, and pricing/billing are areas where ModelBrew is honestly behind A-grade incumbents. The pre-train data gate (Cleaner) and continual learning are the two unique-moat rows. Continual-learning support is in beta.
Per-Domain Drift After 5 Sequential Domains
Each domain was trained sequentially. Drift measures how much earlier domains degraded after all 5 were trained. Negative = slight improvement (positive transfer).
| Domain | CRMA | Frozen | Naive LoRA |
|---|---|---|---|
| Medical | −0.56% | +2.22% | +149.6% |
| Legal | −0.55% | +1.83% | +34.3% |
| Financial | +0.59% | +1.74% | +17.8% |
| Code | −0.51% | +2.78% | +13.0% |
| Science | +0.20% | +1.17% | +0.08% |
| 3-seed Avg | −0.17% | +1.95% | +42.96% |
Key insight: CRMA drift is roughly an order of magnitude smaller than FROZEN (∼1.95%) and two orders of magnitude smaller than naive sequential LoRA (∼43%). The bottom row is the mean of the five per-domain values in each column (−0.17% for CRMA). All values are 3-seed averages across seeds 0, 42, and 1234 on Mistral-7B.
View full benchmark data & methodology
CRMA Internal (Mistral-7B, 5 domains, 3-seed avg): CRMA Modular −0.17% ± 0.17 drift, Frozen +1.95% ± 0.64, Naive +42.96% ± 5.5. Per-seed MODULAR and NAIVE ranges are disjoint. No replay, no EWC, no knowledge distillation.
Gemma-2-9B inference ablation: 98/100 with CRMA (Wilson 95% CI [93.0%, 99.5%]) vs 38/100 without (Wilson 95% CI [29.0%, 47.8%]). Same weights, same questions, only CRMA toggled.
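For readers who want to check intervals like these, here is a minimal sketch of the standard Wilson score interval. It assumes z = 1.96 and no continuity correction; this page does not say which variant was used, so the last digit may differ slightly from the figures quoted. The function name is illustrative.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple:
    """Wilson score interval for a binomial proportion (no continuity correction)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


for hits in (98, 38):
    low, high = wilson_interval(hits, 100)
    print(f"{hits}/100: [{low:.3f}, {high:.3f}]")
```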
Pricing (April 2026): ModelBrew FT $3.99/M, all 7–9B models, with gradient visibility + built-in Dataset Optimizer. CL is in beta and not available for self-serve purchase at this time. OpenAI GPT-4.1 $3.00/M (no CL, FT only on their models). Together/Fireworks/OpenPipe $0.48-0.50/M (FT only, no cleaner, no CL). Mistral La Plateforme $1.00/M.
Head-to-head baselines: We have not run head-to-head comparisons against published CL methods (O-LoRA, InfLoRA, Lewandowski et al.) on our protocol. This is the single largest gap in our research; it is acknowledged openly in the paper. Our internal controls compare NAIVE vs FROZEN vs MODULAR on identical data.
CRMA results are from internal benchmarks using holdout evaluation. All forgetting-prevention numbers are conditional on correct inference-time routing.
Pay only for what you use.
No subscriptions. Sign up and get 75 credits free ($7.50). Load $20 in credits when you're ready, pay only for tokens used. 3 free training runs per day on TinyLlama.
- 75 credits free at signup ($7.50)
- 3 runs per day on TinyLlama-1.1B
- Fine-tuning mode
- Download adapter ZIPs
- Real-time training progress
- All models (Mistral-7B, Llama-3.1-8B, Saul-7B, Qwen3-8B, Gemma-2-9B)
- Fine-tuning + DPO/SimPO preference tuning
- Continual learning (beta — request access)
- Cost estimates before each run
- Credits never expire — balance rolls over
All 7–9B models (Mistral-7B, Llama-3.1-8B, Saul-7B, Qwen3-8B, Gemma-2-9B)
| Service | Rate |
|---|---|
| Fine-Tuning | $3.99 / M tokens |
| Continual Learning | Beta, contact us |
| Clean with AI (Dataset Optimizer) | 50 credits per 200 rows |
Credits & Balance
| Item | Policy |
|---|---|
| Minimum credit purchase | $20 |
| Credits roll over | Never expire |
| Failed jobs | Auto-refunded |
Example: Fine-tune Mistral-7B on 500 medical Q&A pairs
| Item | Value |
|---|---|
| Estimated tokens | ~135K tokens |
| Rate (Fine-Tuning) | $3.99 / M tokens |
| Computed cost | $0.54 |
| Deducted from balance | $0.54 |
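The arithmetic behind this example is tokens divided by one million, times the per-million rate. A minimal sketch, assuming the result is rounded half-up to the nearest cent (the rounding rule is not stated on this page); `estimate_cost` is an illustrative name, not the platform's API.

```python
from decimal import Decimal, ROUND_HALF_UP

RATE_PER_MILLION = Decimal("3.99")  # fine-tuning rate quoted on this page


def estimate_cost(tokens: int, rate_per_million: Decimal = RATE_PER_MILLION) -> Decimal:
    """Dollar cost for a token count, rounded to the nearest cent (assumed rule)."""
    cost = Decimal(tokens) / Decimal(1_000_000) * rate_per_million
    return cost.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


print(estimate_cost(135_000))  # 0.135 * 3.99 = 0.53865, which rounds to 0.54
```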
Example: Continual learning on Mistral-7B — 5 domains
Continual learning is currently in beta and not available for self-serve purchase. Request access if you'd like to evaluate it on your data.
Refund Policy: If a training job fails due to a system error, your credits are automatically refunded — no action needed. Unused credits are non-refundable and non-transferable. All payments are processed securely by Stripe — we never see your card details. By purchasing credits, you agree to our Terms of Service.
Production-grade security primitives.
Modern auth, RBAC, audit logging, strict CSP, HTTPS-only, atomic billing — the hardening you'd expect from a 2026 SaaS, exposed and tested.
Encryption at Rest
All model checkpoints and training data encrypted with AES-256 (Fernet). Secure delete enabled — no residual data on disk.
Security Headers
HSTS, X-Frame-Options DENY, Content-Type nosniff, XSS protection, strict Referrer-Policy, and Permissions-Policy on every response.
Audit Logging
Every API call logged with user, action, IP, and timestamp. Full audit trail for compliance reviews and incident response.
Role-Based Access
RBAC with granular permissions. Admin, user, and read-only roles. API keys separated from session tokens.
GDPR & Data Rights
One-click data export and account deletion. Your data, your control. Full compliance with data protection regulations.
Hardened Runtime
Non-root containers, health checks, safe model loading (no arbitrary code execution), and sanitized error responses.
Cleaner judge prompt-injection hardening W5 S1.1 · 89e3cc5
NFKC normalization, zero-width strip, role-flip detection, per-call random nonce fence. Red-team fixture library runs in CI on every change.
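Three of these steps can be sketched with the standard library alone. This is an illustration under assumptions, not the shipped hardening code: the zero-width character set, the fence tag format, and the function names are all hypothetical.

```python
import secrets
import unicodedata

# A small, non-exhaustive set of zero-width characters (assumed for this sketch).
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}


def normalize_for_judge(text: str) -> str:
    """Fold compatibility forms (e.g. fullwidth letters) and drop zero-width characters."""
    folded = unicodedata.normalize("NFKC", text)
    return "".join(ch for ch in folded if ch not in ZERO_WIDTH)


def fence_untrusted(text: str) -> tuple:
    """Wrap untrusted text in delimiters carrying a fresh random nonce.

    The row content cannot predict the nonce, so it cannot forge
    the closing delimiter to break out of the fence.
    """
    nonce = secrets.token_hex(8)
    fenced = f"<untrusted-{nonce}>\n{text}\n</untrusted-{nonce}>"
    return nonce, fenced


# Fullwidth letters with a zero-width space hidden in the middle.
print(normalize_for_judge("\uff49\uff47\u200b\uff4e\uff4f\uff52\uff45"))  # ignore
```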
Chat-template token strip W5 S1.2 · 265aed0, 2528708
Defends against tokenizer-token poisoning across Qwen3 <think>, Llama-3 <|eom_id|>, Phi-4 <|im_sep|>, Mistral, Gemma. ChatInject paper attack surface (arXiv 2509.22830).
Atomic per-user daily AI cost cap W5 S1.3 · 75f579a, 2a99c19
Single SQL transaction enforces a soft daily ceiling across all 5 user-triggerable LLM-spending routes — judge, rewrite, polarity, preference-pair, scoring — so a runaway script can’t rack up a five-figure bill.
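One common way to enforce a cap atomically is a conditional UPDATE inside a single transaction, so two concurrent requests cannot both pass a separate read-then-write check. The sketch below uses SQLite for illustration. The table, column names, and cents-based accounting are assumptions, and it treats the ceiling as a hard limit, whereas the cap above is described as soft.

```python
import sqlite3


def init_db(conn: sqlite3.Connection) -> None:
    """Create the hypothetical per-user, per-day spend table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS daily_spend ("
        "user_id TEXT, day TEXT, spent_cents INTEGER NOT NULL, "
        "PRIMARY KEY (user_id, day))"
    )


def try_spend(conn: sqlite3.Connection, user_id: str, day: str,
              cost_cents: int, cap_cents: int) -> bool:
    """Record the spend only if it keeps the user at or under the cap."""
    with conn:  # one transaction: commits on success, rolls back on error
        conn.execute(
            "INSERT OR IGNORE INTO daily_spend (user_id, day, spent_cents) "
            "VALUES (?, ?, 0)",
            (user_id, day),
        )
        cur = conn.execute(
            "UPDATE daily_spend SET spent_cents = spent_cents + ? "
            "WHERE user_id = ? AND day = ? AND spent_cents + ? <= ?",
            (cost_cents, user_id, day, cost_cents, cap_cents),
        )
        # rowcount is 1 when the guarded UPDATE applied, 0 when the cap blocked it.
        return cur.rowcount == 1
```

Each LLM-spending route would call `try_spend` before doing any billable work and refuse the request when it returns `False`.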
IDOR existence-oracle defense W5 S1.4 · 2f63d07, 5b3ba58
403 → 404 across /status, /start_cl_task, and 5 sibling endpoints, with response-time symmetry so a probe can’t distinguish “exists but not yours” from “doesn’t exist.”
Modal upload MIME / magic-byte guard W5 S1.7 · 6d8dcf4, 06ff906
Rejects ZIP, PE, HTML, PDF, RAR, 7z magic bytes posing as CSV or JSONL. json.loads probe on the first JSONL line before the row stream begins.
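A magic-byte guard of this kind can be sketched in a few lines. The signatures below are the well-known file headers for each format; the function names and the HTML heuristic are assumptions, and a short prefix like `MZ` can false-positive on a CSV whose first cell happens to start with those letters, so a production guard needs more context than this.

```python
import json
from typing import Optional

# Well-known leading bytes for formats that should never arrive as CSV or JSONL.
MAGIC_PREFIXES = {
    b"PK\x03\x04": "zip",
    b"MZ": "pe",
    b"%PDF": "pdf",
    b"Rar!\x1a\x07": "rar",
    b"7z\xbc\xaf\x27\x1c": "7z",
}
HTML_PREFIXES = (b"<!doctype", b"<html")


def sniff_disallowed(head: bytes) -> Optional[str]:
    """Return the detected disallowed format, or None if the head looks clean."""
    for magic, kind in MAGIC_PREFIXES.items():
        if head.startswith(magic):
            return kind
    if head.lstrip().lower().startswith(HTML_PREFIXES):
        return "html"
    return None


def first_jsonl_line_ok(head: bytes) -> bool:
    """Probe the first line with json.loads before streaming the remaining rows."""
    first = head.split(b"\n", 1)[0].strip()
    try:
        json.loads(first.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError):
        return False
    return True
```

The probe assumes `head` is long enough to contain the whole first line; a truncated first line would fail to parse.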
Billing correlation_id end-to-end W5 S2.2 · cf6780c
Stripe webhook → add_credits → _auto_refund traceable through one correlation ID. Refund or duplicate-webhook reconstruction is deterministic from the audit log.
Public /security trust page W5 S2.1 · 6b92b25
Retention numbers cited to the code that enforces them; vulnerability disclosure with safe-harbor language; reviewable in version control. Read the page →
security@ reporting channel W5 S2.3 · DNS 2026-05-07
Vulnerability reports to modelbrewai@gmail.com. 5-business-day acknowledgement, 10-business-day initial assessment, safe harbor for good-faith research.
18 silent-corruption RT tests W6 K2 · c694e04
Adversarial matrix covering tokenizer poisoning, lookalike-character spoofing, fix-order bypass, role spoofing, and mojibake against the cleaner pipeline. CI on every change.
Foundation-invariant property tests W6 K1 · 42ecffd
Determinism + monotonicity + revert-on-degrade enforced by property tests. The cleaner can’t produce a row whose post-clean score is lower than its pre-clean score — by construction.
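The revert-on-degrade contract can be expressed as a small wrapper: score the row, clean it, and keep the cleaned version only if its score did not drop. A minimal sketch with hypothetical names; the real scorer and cleaner are not shown on this page.

```python
from typing import Callable


def clean_with_revert(row: str,
                      clean: Callable[[str], str],
                      score: Callable[[str], float]) -> str:
    """Return the cleaned row unless cleaning lowered its score."""
    before = score(row)
    candidate = clean(row)
    # Revert to the original row when the candidate scores lower.
    return candidate if score(candidate) >= before else row
```

Whatever `clean` does, the returned row never scores below the input row, which is the invariant a property test can check over many random inputs.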
Full security posture, retention schedule, and dated changelog at /security. Deployment options (managed cloud, Hybrid Export, enterprise roadmap) at /deployment.
Built by a practitioner, not a lab.
Near-zero catastrophic forgetting on our benchmarks, conditional on correct inference-time routing: validated on Mistral-7B (full 5-domain × 3-seed protocol) with an inference-time ablation on Gemma-2-9B. ModelBrew AI makes fine-tuning practical and accessible. The SFT path is pilot-ready; preference tuning (SimPO/DPO) and continual learning are in beta.
ModelBrew AI
Based in Frederick, Maryland. We build mathematically constrained fine-tuning technology that lets AI teams train on new data without losing what their models already know. Our platform runs on serverless GPUs — no infrastructure to manage, no MLOps team required.
Kiran Nayudu
Healthcare practitioner who built CRMA after watching fine-tuned models forget critical knowledge with every training run. Background in regulated industries and hands-on ML engineering. Built CRMA from first experiment to deployed API.
Learn about fine-tuning and continual learning.
Technical articles about stable fine-tuning and why it matters for production AI.
Why RAG Falls Short — And What Happens When You Bake Knowledge Into the Model
Everyone is building RAG pipelines. We took a different path: train knowledge directly into the model weights, across sequential domains, with near-zero forgetting.
Read more →
DPO vs SimPO in 2026: Which Preference-Tuning Method Should You Use?
Side-by-side comparison of Direct Preference Optimization and SimPO — when each works, the trade-offs, and how ModelBrew picks the right one for your dataset.
Read more →
What Is Fine-Tuning? Why It Matters and How It's Changing AI
Fine-tuning explained for a broader audience — real-world use cases in healthcare, legal, code, and finance.
Read more →
What Are LoRA and QLoRA? A Practical Guide to Efficient Fine-Tuning
How LoRA and QLoRA made fine-tuning possible on consumer GPUs — and the stability problems they don’t solve.
Read more →
How CRMA Solves Continual Learning
Stable backbone, swappable domain adapters, near-zero forgetting. No replay buffers, no growing memory.
Read more →
Catastrophic Forgetting: The Silent Killer of Fine-Tuned Models
Why every fine-tuning run destroys prior knowledge, and what the research says about fixing it.
Read more →
CRMA vs LoRA: What's the Difference?
Side-by-side comparison of standard LoRA and CRMA — when you need each, and what happens when you don’t use CL.
Read more →
The Cost of Forgetting: Why Retraining From Scratch Is Unsustainable
The real-world compute, time, and quality costs of not having continual learning in your ML pipeline.
Read more →
Get in touch.
Questions about CRMA, enterprise pricing, or fine-tuning? Reach out.
Roadmap
Fine-Tuning
Continual Learning
Enhanced Reasoning
Real-Time CL
Ditch the vector database. Teach your model directly.
Start with 3 free runs on TinyLlama. No credit card, no setup, no retrieval pipeline to manage.
Legal Disclaimers & Legal Notices
No Warranty. CRMA is provided "AS IS" without warranties. Not guaranteed to be uninterrupted or error-free.
Benchmarks. All metrics are from internal experiments under controlled conditions. Results are not guarantees — individual results vary by dataset, model, and configuration. Academic comparisons use different benchmarks.
AI Outputs. Fine-tuned models may produce inaccurate or harmful outputs. Users are responsible for validation. Not for medical, legal, or financial decisions without human review.
Liability. ModelBrew AI's total liability shall not exceed the amount paid in the preceding 12 months. No liability for indirect or consequential damages.
IP. CRMA is protected by U.S. provisional patent (filed Feb 2026). Third-party names used for identification only.
Data. Your training data is used only for your job, stored temporarily, deleted after completion. We never train on your data. See Privacy Policy.
Research. Papers are pre-publication drafts, not yet peer-reviewed. Some experiments are single-seed.
Third-Party Services. Built on Modal, Stripe, and Hugging Face. We're not responsible for their outages. Stripe handles payments — we never see your card.
Governing Law. State of Maryland, USA. Exclusive jurisdiction: Frederick County courts.
By using CRMA you agree to these disclaimers, our Terms, and Privacy Policy. Contact: info@modelbrew.ai.
Every shipped capability, in one page.
Below is the long-form, machine-readable index of everything ModelBrew currently
ships — fine-tuning, continual learning, preference tuning, the dataset cleaner,
security primitives, the API + SDKs, pricing terms, engineering quality, and deployment
patterns. Every item is anchored (#feature-*) and most cite the line of code or
test that proves it. 132 features. 109 anchors. 12 FAQ entries. All cited.
Catalog
Fine-tuning capabilities.
Six production-supported open-weight models across four architecture families, with LoRA + QLoRA adapters on Modal A100 GPUs. Every job ships with a sanitized HuggingFace-Hub README, a downloadable adapter ZIP, and a token-and-cost preview before the user clicks Train. The catastrophic-forgetting prevention (CRMA) is the differentiator; the rest of this section is the disciplined SFT path that feeds it.
TinyLlama-1.1B-Chat-v1.0
3 runs/day, no credit card. The recommended way to test the pipeline end-to-end.
Mistral-7B-v0.3
The published 5-domain chain benchmark base. 26/31 zero-forget across 3 seeds.
Saul-7B-Instruct-v1
Legal-domain instruction-tuned variant. Saul-7B 18/18 zero-forgetting milestone.
Qwen3-8B
Multilingual frontier-class small model. <think>-token strip handled per-family.
Gemma-2-9B-it
Google's 9B IT model with auto-attached license notice in the adapter ZIP.
Llama-3.1-8B-Instruct
Meta's 8B instruct-tuned model with a 128k context window. Production-grade chat-template handling.
LoRA SFT with completion-only loss Shipped #feature-lora-sft
Standard supervised fine-tuning runs through a TRL SFTTrainer subclass that masks the prompt
and only contributes loss on the assistant span. Completion-only loss is on by default and tested.
cite: utils/train.py:1060 _CRMASFTTrainer utils/train.py:1964 fine_tune_with_crma · tests: tests/test_sft_completion_only_loss.py
QLoRA support for memory-efficient training Shipped #feature-qlora
4-bit quantized base with low-rank adapters. Lets us fit 9B models comfortably on A100 with batch headroom for curriculum sort and replay batches.
cite: utils/train.py:1964 fine_tune_with_crma · flag: load_in_4bit in run config
Per-family chat-template autoresolve Shipped #feature-chat-template
Mistral, Saul, Qwen3, Gemma, Llama-3.1, TinyLlama, and Phi-4 chat templates are auto-detected and applied
at training time. Production tokens (<think>, <|im_start|>, etc.)
are stripped from uploaded user data by default to prevent template-leak attacks.
cite: utils/train.py:1842 _chat_template_for_model utils/train.py:1950 _apply_chat_template · tests: tests/test_chat_template.py, tests/test_template_v_parse.py
Pre-augmented + server-side row augmentation (up to 9x) Shipped #feature-augment
Optional augment_data flag expands the effective dataset. Validated combination of pre-build
augmentation plus server-side replication, surfaced as a per-run knob.
cite: utils/augment.py · tests: tests/test_augment.py, tests/test_estimate_method_and_augment.py
Curriculum sort by NLL Shipped #feature-curriculum
Hard-example-first ordering inside an epoch. Optional and gated by config. It is off by default, which keeps SFT behaviour predictable for users who want a vanilla baseline.
cite: utils/train.py:47 _sort_by_curriculum utils/train.py:181 _score_texts_by_nll
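The ordering itself is a sort on per-example negative log-likelihood, highest first. A minimal sketch, assuming NLL scores have already been computed by the base model; `sort_hard_first` is an illustrative name and not the function cited above.

```python
def sort_hard_first(texts: list, nll_scores: list) -> list:
    """Order examples from highest NLL (hardest) to lowest (easiest)."""
    if len(texts) != len(nll_scores):
        raise ValueError("texts and nll_scores must have the same length")
    order = sorted(range(len(texts)), key=lambda i: nll_scores[i], reverse=True)
    return [texts[i] for i in order]


print(sort_hard_first(["a", "b", "c"], [0.5, 2.0, 1.0]))  # ['b', 'c', 'a']
```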
Estimate-before-charge (token + cost preview) Shipped #feature-estimate
POST /estimate returns the token count and dollar cost a run will consume before the
user clicks Train. The same accountant powers the spend cap and the auto-refund logic.
cite: backend/server.py:8207 POST /estimate backend/pricing.py
Modal A100 GPU on every training function Shipped #feature-a100
Production hardware on every training call. Every @app.function in modal_deploy.py
uses gpu="A100" — this is not a research-vs-prod toggle, it's the only path.
cite: modal_deploy.py:420, 577, 751, 855 (every @app.function)
Per-run download endpoint (LoRA + CRMA + CL state + utils/inject.py) Shipped #feature-download
GET /download/{run_id} ships a ZIP containing your LoRA adapter, CRMA weights,
cl_state.pt for chains, utils/inject.py, status.json, and a
NOTICE file. Customers can leave the platform with their weights — this is the single
most important on-prem-adjacent capability we ship today.
cite: backend/server.py:4371 GET /download/{run_id} /deployment · recipe 08
HuggingFace-Hub-shaped README export with sanitized hparams Shipped #feature-readme-export
GET /runs/{run_id}/readme.md returns a YAML-frontmatter, license-inheriting model card you
can paste straight into HF Hub. The exporter has both an HTML-escape XSS guard for the user-supplied
display name and a CRMA-IP guard that hard-fails the route if any patent-pending field name leaks.
cite: backend/server.py:4287 GET /runs/{run_id}/readme.md backend/readme_export.py:27 _escape_html backend/readme_export.py:63 _assert_safe_overlap
Run resume + chain checkpoint selection Shipped #feature-checkpoint-select
POST /runs/{run_id}/select_checkpoint lets you pick which checkpoint feeds the next chain
link or the next download. Combined with the chain visualizer you can branch off any midpoint of a
continual-learning chain.
cite: backend/server.py:2505 POST /runs/{run_id}/select_checkpoint
Auto column mapping (ShareGPT, OpenAI Chat, Alpaca, raw {prompt, response}) Shipped #feature-column-mapping
The dataset normalizer accepts any of four common formats and remaps to the canonical chat schema before validation. End-to-end mapping tests run on every release.
cite: backend/cleaner/normalizer.py · tests/test_column_mapping_e2e.py
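To make the four input shapes concrete, here is a minimal sketch of a row normalizer. It assumes the canonical chat schema is an OpenAI-style list of `{role, content}` messages, which this page does not spell out, and it uses the commonly seen field names for each format. `to_canonical` is a hypothetical name.

```python
def to_canonical(row: dict) -> list:
    """Map one row from a supported format to a list of {role, content} messages."""
    if "messages" in row:  # OpenAI Chat
        return [{"role": m["role"], "content": m["content"]} for m in row["messages"]]
    if "conversations" in row:  # ShareGPT
        roles = {"system": "system", "human": "user", "gpt": "assistant"}
        return [{"role": roles[t["from"]], "content": t["value"]}
                for t in row["conversations"]]
    if "instruction" in row and "output" in row:  # Alpaca
        prompt = row["instruction"]
        if row.get("input"):
            prompt += "\n\n" + row["input"]
        return [{"role": "user", "content": prompt},
                {"role": "assistant", "content": row["output"]}]
    if "prompt" in row and "response" in row:  # raw pair
        return [{"role": "user", "content": row["prompt"]},
                {"role": "assistant", "content": row["response"]}]
    raise ValueError("unrecognized row format")
```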
Min-dataset-rows + finetune validation gate Shipped #feature-min-rows
Runs that won't converge are rejected at submit-time, not after the GPU bill arrives. Tested with adversarial single-row and empty-dataset payloads.
cite: backend/cleaner/service.py · tests/test_min_dataset_rows.py · tests/test_finetune_validation.py
Display-name sanitize Shipped #feature-display-name-sanitize
Control bytes, HTML, and zero-width characters are stripped from user-supplied run display names before storage. Defends downstream surfaces such as the README exporter and the dashboard.
cite: tests/test_display_name_sanitize.py
Continual-learning capabilities (CRMA & chain).
The differentiator. CRMA — Constrained Residual Mixing Adapter — is a patent-pending architectural primitive whose internal mixing matrix has spectral norm bounded by 1 (by Birkhoff's theorem). Per-task LoRA adapters compose against a shared CRMA substrate, with sequential chain construction at inference. The published number: 5-domain Mistral-7B 26/31 zero-forget across 3 seeds, with prior-task drift within measurement noise.
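The bound follows in two lines if the mixing matrix is doubly stochastic, which the reference to Birkhoff's theorem suggests but this page does not state outright. The symbols $M$ (mixing matrix), $\theta_k$ (convex weights), and $P_k$ (permutation matrices) are notation introduced for this sketch only.

```latex
\begin{align*}
M &= \sum_{k} \theta_k P_k, \qquad \theta_k \ge 0, \quad \sum_{k} \theta_k = 1
  && \text{(Birkhoff--von Neumann)} \\
\lVert M \rVert_2 &\le \sum_{k} \theta_k \lVert P_k \rVert_2 = \sum_{k} \theta_k = 1
  && \text{(triangle inequality, } \lVert P_k \rVert_2 = 1 \text{)}
\end{align*}
```

Each permutation matrix is orthogonal, so its spectral norm is exactly 1, and a convex combination cannot exceed that.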
The published continual-learning number
On a 5-domain Mistral-7B benchmark with 3 seeds, modular LoRA on a CRMA backbone showed prior-task drift of −0.17% ± 0.17 loss-relative (within measurement noise), versus +42.96% ± 5.5 for naive sequential training. Per-seed ranges are disjoint. All forgetting numbers are conditional on correct inference-time routing — receipt at /claims.
CRMA — Constrained Residual Mixing Adapter Patent pending #feature-crma
The architectural CL primitive. US provisional patent filed 2026-02-28. Public details on the
mathematical bound are in the arXiv-ready paper; the implementation lives in utils/crma.py
and the inference-time injection in utils/inject.py.
cite: paper/crma_modular_cl_arxiv.tex · utils/crma.py (private) · utils/inject.py (shipped in adapter ZIP)
Sequential CL chain at inference Shipped #feature-cl-chain
POST /start_cl_task appends a new task to an existing chain root. GET /chain/{run_id}
returns the parent-child task graph. Replay-version mismatch trip-wires refuse to chain if the replay
dataset format drifts.
cite: backend/server.py:6665 POST /start_cl_task backend/server.py:8123 GET /chain/{run_id}
Chain visualization tree (frontend) Shipped #feature-chain-tree
The dashboard renders the parent-child task graph as a real tree, not a flat list. You can see your chain at a glance and pick any node as the base for the next task. No competitor ships this.
cite: frontend/src/components/CLChainTree.tsx · frontend/src/app/(app)/train/page.tsx
Spectral / null-space gradient projection (LoRA + CRMA) Shipped #feature-spectral
Direction protection across tasks. Gradient updates are projected away from the SVD bases of prior tasks for both LoRA and CRMA blocks. Combined with EWC magnitude protection it's a belt-and-braces defense against drift.
cite: utils/train.py:521 LoRAGradientProjection utils/train.py:607 CRMAGradientProjection utils/train.py:284 _compute_svd_bases
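Projecting a gradient away from a protected subspace means subtracting its component along each basis direction. A minimal sketch on flat Python lists, assuming the basis vectors are already orthonormal (as the columns from an SVD are). How the bases are computed and applied per block is not shown on this page; `project_out` is an illustrative name.

```python
def project_out(grad: list, basis: list) -> list:
    """Remove from `grad` its components along each orthonormal basis vector."""
    out = list(grad)
    for u in basis:
        # Coefficient of grad along u; valid because the basis is orthonormal.
        coeff = sum(g * ui for g, ui in zip(grad, u))
        out = [o - coeff * ui for o, ui in zip(out, u)]
    return out


# Protect the first coordinate direction; the update keeps only the rest.
print(project_out([3.0, 4.0, 5.0], [[1.0, 0.0, 0.0]]))  # [0.0, 4.0, 5.0]
```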
EWC Fisher penalty (magnitude protection) Shipped #feature-ewc
Elastic Weight Consolidation Fisher penalty as a complement to projection. Holds the weights that mattered for prior tasks closer to their post-train values, weighted by their Fisher importance.
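In its textbook form, the EWC penalty is a Fisher-weighted squared distance from the post-training weights of the prior task. A minimal sketch on flat lists, assuming a diagonal Fisher estimate; the names and the `lam` strength parameter are illustrative, and the shipped variant may differ.

```python
def ewc_penalty(params: list, anchor: list, fisher: list, lam: float) -> float:
    """(lam / 2) * sum_i fisher_i * (params_i - anchor_i)^2."""
    return 0.5 * lam * sum(
        f * (p - a) ** 2 for p, a, f in zip(params, anchor, fisher)
    )


# Moving a high-Fisher weight costs more than moving a low-Fisher one.
print(ewc_penalty([1.0, 2.0], [0.0, 2.0], [4.0, 10.0], lam=2.0))  # 4.0
```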
AGEM gradient correction Shipped #feature-agem
Averaged-Gradient Episodic Memory: the current-task gradient is conflict-resolved against a replay reference gradient. If they conflict, we take the projection that doesn't increase the replay loss.
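The standard A-GEM rule leaves the gradient alone when it agrees with the replay reference gradient, and otherwise removes the conflicting component. A minimal sketch on flat lists; `agem_correct` is an illustrative name, and this shows the textbook rule rather than the platform's code.

```python
def agem_correct(grad: list, ref: list) -> list:
    """Apply the A-GEM projection when `grad` conflicts with `ref`."""
    dot = sum(g * r for g, r in zip(grad, ref))
    if dot >= 0:
        return list(grad)  # no conflict: keep the current-task gradient
    ref_sq = sum(r * r for r in ref)
    scale = dot / ref_sq  # ref_sq > 0 here, because dot < 0 implies ref != 0
    # After this step the corrected gradient is orthogonal to ref.
    return [g - scale * r for g, r in zip(grad, ref)]


print(agem_correct([1.0, -1.0], [0.0, 1.0]))  # [1.0, 0.0]
```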
Reservoir replay sampling (cumulative, NLL-prioritized) Shipped #feature-replay
1% NLL-prioritized replay drawn from a reservoir of all prior-task data. Pairs with KD logit caching so the replay loss preserves prior-task behaviour, not just prior-task labels.
cite: utils/train.py:843 save_continual_state utils/train.py:254 _compute_kd_logits
LoRA EMA callback Shipped #feature-lora-ema
Exponential-moving-average of the LoRA weights. Stabilizes adapter trajectories on hard-domain corpora where the raw step-by-step weights wobble.
Multi-holdout per-task eval (backward-transfer measurement) Shipped #feature-bwt
For every task in a chain we hold out a per-task eval and re-score it after every later task. That measurement is what surfaces in the dashboard as drift, and it's what powers the published 26/31 number.
cite: utils/train.py:2062 eval_dataset_paths utils/train.py:223 _evaluate_model
Replay-version mismatch trip-wire Shipped #feature-replay-version
If you try to chain a new task whose replay format doesn't match the chain root, the trip-wire refuses to start the run. Avoids silently corrupting a chain by mixing dataset schemas.
cite: utils/train.py:1811 CLReplayVersionMismatchError · tests/test_cl_replay_version_gate.py
CL state persistence (cl_state.pt) Shipped #feature-cl-state
The cumulative chain state — SVD bases, Fisher, replay reservoir, KD cache — is checkpointed
to cl_state.pt after every task and shipped in the download ZIP. You can resume continual
learning across runs without re-walking the prior tasks.
cite: utils/train.py:843 save_continual_state backend/server.py:4432 cl_state.pt in download
Multi-architecture verification (4 model families) Shipped #feature-multi-arch
The CRMA chain is verified across Llama-derived (TinyLlama, Llama-3.1), Mistral-derived (Mistral, Saul), Qwen3, and Gemma. Per-family chat-template and judge-token-strip rules are codified, not improvised.
Saul-7B 18/18 zero-forget milestone Shipped #feature-saul-18-18
Legal-domain CL milestone on Saul-7B-Instruct-v1. 18 of 18 prior-task evals held under the noise floor after sequential training across the canonical legal-domain chain.
cite: memory/milestone_saul7b_18_18.md (note: see /claims for current production-A100 truth)
Preference tuning (DPO / SimPO / length-norm DPO).
Three preference-learning paths: DPO, SimPO (reference-free), and length-normalized DPO. The cleaner ships a polarity / preference-pair audit that catches label-swapped pairs before they reach the trainer — the most common silent failure mode in DPO data.
DPO preference tuning Shipped #feature-dpo
Direct Preference Optimization: RLHF-free preference tuning over (chosen, rejected) pairs. Wired all
the way through, from the method='dpo' branch in /start_run to the trainer.
cite: utils/train.py:3877 fine_tune_with_dpo modal_deploy.py:855 run_dpo_training_job backend/server.py:3484 method=='dpo'
SimPO (reference-free preference tuning) Shipped #feature-simpo
Reference-free preference tuning with simpo_beta and simpo_gamma knobs. No
reference model required — useful when you want to preference-tune past a custom-trained base.
cite: utils/train.py:3251 fine_tune_with_simpo modal_deploy.py:751 run_simpo_training_job backend/server.py:3399 method=='simpo'
Length-normalized DPO opt-in Shipped #feature-dpo-length-norm
Counters length-bias gaming in the DPO objective. Plain DPO is vulnerable to a degenerate solution where the model just produces longer "chosen" outputs; the length-norm variant divides by sequence length, so the gradient pressure is on quality, not quantity. Recently shipped as Wave 4 Q4.
cite: commit 803df1e · tests/test_dpo_length_normalize.py
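The effect of length normalization is easiest to see on a single pair. The sketch below computes the standard DPO loss from summed token log-probabilities and, when `length_normalize` is set, divides each sum by its token count first. That is one plausible reading of "divides by sequence length"; the exact form in the shipped trainer is not given here, and all names are illustrative.

```python
import math


def _neg_log_sigmoid(x: float) -> float:
    # -log(sigmoid(x)) = softplus(-x), written to avoid overflow for large |x|.
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen: float, policy_rejected: float,
             ref_chosen: float, ref_rejected: float,
             len_chosen: int, len_rejected: int,
             beta: float = 0.1, length_normalize: bool = False) -> float:
    """DPO loss for one pair; inputs are summed token log-probs."""
    if length_normalize:
        # Per-token averages, so a longer response gains no advantage from length alone.
        policy_chosen /= len_chosen
        ref_chosen /= len_chosen
        policy_rejected /= len_rejected
        ref_rejected /= len_rejected
    logit = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    return _neg_log_sigmoid(logit)
```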
Polarity / preference-pair audit (cleaner) Shipped #feature-polarity-audit
Swap-invariant Gemini judge that catches label-swapped pairs in DPO datasets — the single most common silent failure mode in preference data. Ensemble option (2-judge, 4 calls/pair) is available as a higher-precision tier.
cite: backend/cleaner/polarity_eval.py · tests/test_polarity_prompt_injection.py · feature_pricing key polarity_pair_audit_ensemble
SimPO margin check (per-pair GPU forward pass) Shipped #feature-simpo-margin
Pre-train pair quality check: each (chosen, rejected) pair gets a forward pass on the base model and we score the SimPO margin. Pairs whose margin is already negative (or noise-level) are flagged for removal before they hit the trainer.
cite: backend/cleaner/simpo_margin.py · feature_pricing key simpo_margin_check
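The check above can be sketched as follows, assuming the margin is the difference of length-averaged log-probabilities scaled by beta (the SimPO reward form). The PairScore fields, the beta default, and the noise_floor threshold are hypothetical; the log-probabilities would come from the base-model forward pass, which is not shown.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PairScore:
    chosen_logprob: float    # summed token log-probs of the chosen response
    chosen_tokens: int
    rejected_logprob: float  # summed token log-probs of the rejected response
    rejected_tokens: int


def simpo_margin(pair: PairScore, beta: float = 2.0) -> float:
    """Length-normalized reward gap between chosen and rejected."""
    chosen = pair.chosen_logprob / pair.chosen_tokens
    rejected = pair.rejected_logprob / pair.rejected_tokens
    return beta * (chosen - rejected)


def flag_pairs(pairs: List[PairScore], beta: float = 2.0,
               noise_floor: float = 0.05) -> List[int]:
    """Indices of pairs whose margin is negative or within the noise floor."""
    return [i for i, p in enumerate(pairs) if simpo_margin(p, beta) <= noise_floor]


pairs = [
    PairScore(-10.0, 10, -30.0, 10),   # clear preference for chosen
    PairScore(-30.0, 10, -10.0, 10),   # base model prefers "rejected"
    PairScore(-10.0, 10, -10.1, 10),   # gap too small to trust
]
flagged = flag_pairs(pairs)
```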
Dataset cleaner & data prep.
The cleaner is a separately-shipped product, not a side feature. 56 distinct issue codes across structural, chat-schema, quality, PII, domain-PII, military OPSEC, jailbreak, and duplicate detectors. ~28 autofix codes with granular per-category control. Score-floor revert contract: post-clean score is never lower than pre-clean — that's a property-tested invariant, not marketing copy.
Detector coverage at a glance
9 structural · 7 chat-schema · 17 quality (including jailbreak / prompt-injection) · 13 typed PII · 9 domain-PII with checksums (DEA, NPI, CUSIP, SWIFT/BIC, ABA, MRN, ICD-10, bar number, Bates) · 6 military OPSEC · 6 duplicate detection codes · ~28 autofix codes.
Structural validation (9 codes) Shipped #feature-structural
JSON / JSONL / array detection, ShareGPT, dual-format, empty rows, wrong type, unrecognized schema. Format-detect plus recovery on the way in — if we can salvage your file we will, but we never silently re-encode without telling you.
cite: backend/cleaner/validators.py:688 _validate_structural
Chat-schema validation (7 codes) Shipped #feature-chat-schema
Roles, role order, system-first, missing assistant, extra assistant turns. Catches the most common multi-turn-corpus errors before training.
Quality validators (17 codes) Shipped #feature-quality
Too short, too long, excessive whitespace, markdown noise, giant context, HTML noise, placeholder, GPT slop, unfinished, repetitive, refusal, soft-refusal, non-English, bad encoding, invisible Unicode, prompt injection, jailbreak pattern. Quality issues are the single biggest source of dataset rot.
Typed PII detectors (13 codes) Shipped #feature-typed-pii
Email, phone, SSN, API key, password, address, credit card, IBAN, DOB, IP, name heuristic, low-quality prompt, lazy output. Severities split per type so you can choose redact-vs-flag-vs-drop on a case-by-case basis.
Domain-specific PII with checksums (9 detectors) Shipped #feature-domain-pii
MRN, DEA (with checksum), ICD-10, NPI, CUSIP (with checksum), SWIFT/BIC, ABA routing (with checksum), bar number, Bates. Detectors with mathematical checksums validate the candidate before flagging: we don't flag a 10-digit number as an NPI unless it actually validates as an NPI.
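A sketch of three of these checks, using the published check-digit schemes: the 3-7-1 weighting for ABA routing numbers, the DEA registration check digit, and the Luhn check over an NPI prefixed with 80840. The function names are illustrative and are not the ones in backend/cleaner/validators.py.

```python
def _ascii_digits(s: str) -> bool:
    return s.isascii() and s.isdigit()


def aba_is_valid(s: str) -> bool:
    """ABA routing number: 9 digits, weights 3-7-1 repeating, sum % 10 == 0."""
    if len(s) != 9 or not _ascii_digits(s):
        return False
    d = [int(c) for c in s]
    total = 3 * (d[0] + d[3] + d[6]) + 7 * (d[1] + d[4] + d[7]) + (d[2] + d[5] + d[8])
    return total % 10 == 0


def dea_is_valid(s: str) -> bool:
    """DEA number: 2 leading characters + 7 digits; the 7th digit is a check digit."""
    if len(s) != 9 or not s[0].isalpha() or not s[1].isalnum() or not _ascii_digits(s[2:]):
        return False
    d = [int(c) for c in s[2:]]
    total = (d[0] + d[2] + d[4]) + 2 * (d[1] + d[3] + d[5])
    return total % 10 == d[6]


def luhn_is_valid(digits: str) -> bool:
    total = 0
    for i, ch in enumerate(reversed(digits)):
        n = int(ch)
        if i % 2 == 1:          # double every second digit from the right
            n = n * 2 - 9 if n * 2 > 9 else n * 2
        total += n
    return total % 10 == 0


def npi_is_valid(s: str) -> bool:
    """NPI: 10 digits, Luhn-valid once prefixed with 80840."""
    return len(s) == 10 and _ascii_digits(s) and luhn_is_valid("80840" + s)
```

For example, aba_is_valid("021000021") passes while aba_is_valid("021000022") does not, so a random 9-digit order number is unlikely to be redacted as a routing number.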
Military OPSEC redactors (6 categories) Shipped #feature-opsec
Classification markings (CUI / FOUO / SECRET / TOP SECRET), MGRS grid coordinates, EDIPI, DTG (Date-Time Group), lat/long, network refs. Defense-customer feature; regulated-domain hard-block keeps auto-strip-refusals from running on these datasets unless explicitly opted in.
cite: backend/cleaner/validators.py (opsec_* codes) · backend/cleaner/autofix.py:111 REGULATED_DATA_DOMAINS
Jailbreak pattern detection Shipped #feature-jailbreak
DAN-style prompts, role-flip attempts, system-prompt leaks. Different signature from prompt-injection — this catches the user-facing rows you don't want your trained model emulating.
cite: backend/cleaner/jailbreak_detectors.py:115 detect_jailbreak_patterns
Duplicate detection (6 codes: exact, MinHash, RapidFuzz pairwise) Shipped #feature-dedup
Exact dupes, near-dupes via MinHash, near-dupes via RapidFuzz pairwise scan, duplicate outputs, system-prompt inconsistency, duplicate boilerplate. Three different algorithms because no single one catches everything.
~28 autofix codes (granular per-category) Shipped #feature-autofix
Encoding, whitespace, HTML, chat-template tokens, GPT slop prefix/suffix, repetition collapse, role normalize, empty-message trim, system-prompt standardize, markdown strip, slop closers, refusals (opt-in), exact dedup, near dedup, output dedup, empty rows, critical rows, short rows, lazy output, placeholder rows, unfinished, PII redact, OPSEC redact (6 categories). Toggle each independently.
Score-floor revert (post-clean ≥ pre-clean, mathematical invariant) Shipped #feature-score-floor
The single most load-bearing contract in the cleaner. If a cleaning pass produces a lower score than the input, we revert. Property-tested. This is non-negotiable: users have explicitly asked us to never lower their score, and the test suite enforces it.
cite: backend/cleaner/service.py:427 _compute_basic_floor · tests/test_cleaner_rewrite_score_floor.py · tests/test_score_floor_invariant.py
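The contract can be sketched in a few lines. This is an illustration of the invariant only: clean_with_floor, and the idea that cleaning and scoring are plain callables, are assumptions and not the structure of backend/cleaner/service.py.

```python
from typing import Callable, List, Tuple


def clean_with_floor(rows: List[str],
                     clean: Callable[[List[str]], List[str]],
                     score: Callable[[List[str]], float]) -> Tuple[List[str], float, bool]:
    """Run a cleaning pass, but never return a dataset that scores lower.

    Returns (rows, score, reverted). `clean` and `score` are hypothetical
    stand-ins for the real cleaning pass and scoring function.
    """
    before = score(rows)
    cleaned = clean(rows)
    after = score(cleaned)
    if after < before:
        # Floor violated: hand back the untouched input and its score.
        return list(rows), before, True
    return cleaned, after, False
```

Because the revert branch returns the original rows with the original score, post-clean ≥ pre-clean holds for any cleaning function, including a buggy one.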
Per-row revert-on-degrade (LLM rewrite) Shipped #feature-revert-on-degrade
The LLM-rewrite path operates per row, and if a rewrite scores lower than the original, we keep the original. Combined with the fact-preservation diff this gives you LLM polish without LLM lobotomy.
LLM judge (Gemini 2.5 Flash / Claude Haiku 4.5 auto-pick, 4-axis rubric) Shipped #feature-judge
4-axis quality rubric (faithfulness, helpfulness, format, safety) with 16-way concurrency. Auto-picks between Gemini 2.5 Flash and Claude Haiku 4.5 based on availability and pricing. Hardcoded-secret post-judge override flags rows whose content matches OpenAI sk-, GitHub PAT, Slack token, AWS AKIA, JWT, military markings, or mongodb+srv URLs even if the judge missed them.
cite: backend/cleaner/llm_judge.py:75 llm_judge.py:48 _SECRET_PATTERNS
LLM rewrite with fact-preservation diff Shipped #feature-rewrite-fact-diff
If a rewrite changes a number, a PII token, or a structured fact, we reject the rewrite and keep the original row. The diff runs after the rewrite and before commit.
cite: backend/cleaner/llm_rewrite.py:564 _fact_diff backend/cleaner/llm_rewrite.py:669 _facts_preserved
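A simplified sketch of the idea, assuming "facts" are approximated by numbers and email-like tokens pulled out with regular expressions. The real _fact_diff may track more fact types; extract_facts and accept_rewrite are hypothetical names.

```python
import re
from collections import Counter

_NUMBER = re.compile(r"\d+(?:[.,]\d+)*")
_EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def extract_facts(text: str) -> Counter:
    """Multiset of number and email tokens found in the text."""
    return Counter(_NUMBER.findall(text)) + Counter(_EMAIL.findall(text))


def facts_preserved(original: str, rewrite: str) -> bool:
    return extract_facts(original) == extract_facts(rewrite)


def accept_rewrite(original: str, rewrite: str) -> str:
    """Keep the rewrite only if every extracted fact survived unchanged."""
    return rewrite if facts_preserved(original, rewrite) else original


original = "Dose is 2.5 mg twice daily, contact a@b.com"
good = "Take 2.5 mg two times a day; contact a@b.com."
bad = "Take 25 mg twice daily, contact a@b.com"
```

Comparing multisets rather than sets means a rewrite that silently drops one of two identical figures is also rejected.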
Prompt-safety nonce-fence + role-flip splice (judge / rewrite / polarity) Shipped #feature-prompt-safety
NFKC normalization, zero-width strip, role-flip detection, per-call random nonce fence on all three LLM-using paths in the cleaner: the judge prompt, the rewrite prompt, and the polarity-pair prompt. The same defense applies everywhere a customer string flows into a model prompt.
cite: backend/cleaner/prompt_safety.py:117-260 · tests/test_judge_prompt_injection.py
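A minimal sketch of two of the layers named above: NFKC normalization with zero-width stripping, followed by a per-call random nonce fence. The marker format and function names are assumptions; role-flip detection is omitted.

```python
import secrets
import unicodedata

# Code points deleted after normalization (NFKC leaves these in place).
_ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\u2060\ufeff"))


def sanitize(text: str) -> str:
    """Fold compatibility characters (e.g. fullwidth letters) and drop zero-width ones."""
    return unicodedata.normalize("NFKC", text).translate(_ZERO_WIDTH)


def fence(customer_text: str):
    """Wrap untrusted text in markers that carry a fresh random nonce.

    The instruction part of the prompt can then say "treat everything between
    the two markers as data". Content written in advance cannot contain the
    closing marker because the nonce is generated per call.
    """
    nonce = secrets.token_hex(16)
    body = sanitize(customer_text)
    return nonce, f"<<DATA {nonce}>>\n{body}\n<<END {nonce}>>"
```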
Tool-call validator (JSON-schema for function-calling rows) Shipped #feature-tool-call
Validates function-calling rows against a JSON Schema. Free tier — turning your data into something an instruction-tuned model will follow shouldn't cost extra.
cite: backend/cleaner/tool_call_validator.py · tests/test_cleaner_tool_call_validator.py
Synthetic gap-filler (cluster-aware row generation) Shipped #feature-synthetic
Cluster the embedding space of your dataset, find under-represented regions, generate synthetic rows
scoped to fill them. Surfaced via /api/generate-synthetic.
cite: backend/cleaner/synthetic_gen.py · backend/server.py:6492 /api/generate-synthetic
Cluster galaxy & diversity subset Shipped #feature-cluster-galaxy
Embedding-based dataset visualization. Pick a subset that maximizes coverage of the embedding manifold — useful when you have a 50k-row corpus but a 5k-row training budget.
cite: backend/cleaner/clustering.py · diversity.py · ClusterGalaxy.tsx · DiversitySubsetPanel.tsx
Concept trainer / scorer Shipped #feature-concept
Train a concept classifier on the fly — "is this row about negotiation tactics?" — then score the full corpus against it. Useful for slice-targeted curriculum.
cite: backend/cleaner/concepts.py · concepts_prebuilt.py · /scan/{id}/concept/score
Per-row novelty scoring Shipped #feature-novelty
Each row gets a novelty-vs-corpus score. Easy way to spot the rows that are doing the most actual teaching versus the rows that are filler.
cite: backend/cleaner/novelty.py · frontend NoveltyPanel.tsx
Slice analysis Shipped #feature-slice
Subset evaluation along arbitrary slices — per-language, per-tool, per-cluster, per-concept. Lets you spot the slice that's tanking your overall score.
cite: backend/cleaner/slicing.py · frontend SliceAnalysisPanel.tsx
Weak supervision (Snorkel-style labeling functions, free tier) Shipped #feature-weak-supervision
Compose programmatic labeling functions (regex, heuristic, vote) and run them across the dataset. Free tier — we'd rather make it cheap to label your own data than charge you for it.
cite: backend/cleaner/weak_supervision.py · tests/test_cleaner_weak_supervision.py
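A small sketch of Snorkel-style labeling functions combined by majority vote. The three labeling functions and the tie-breaking rule (abstain on ties) are illustrative assumptions, not the contents of weak_supervision.py.

```python
import re
from collections import Counter

ABSTAIN = None


def lf_refund(row: str):
    return "billing" if "refund" in row.lower() else ABSTAIN


def lf_invoice(row: str):
    return "billing" if re.search(r"\binvoice\b", row, re.IGNORECASE) else ABSTAIN


def lf_crash(row: str):
    return "bug" if re.search(r"\b(crash|error)\b", row, re.IGNORECASE) else ABSTAIN


def majority_label(row: str, lfs):
    """Most common non-abstaining vote; abstain when nothing fires or on a tie."""
    votes = Counter()
    for lf in lfs:
        label = lf(row)
        if label is not ABSTAIN:
            votes[label] += 1
    ranked = votes.most_common(2)
    if not ranked:
        return ABSTAIN
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]


LFS = [lf_refund, lf_invoice, lf_crash]
```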
Label-error v2 (judge-disagreement, Cleanlab-flagship analog) Shipped #feature-label-error-v2
Run multiple judges on each row and flag the ones with high judge disagreement — those are the rows most likely to be mislabeled. The ML-research analog of Cleanlab's flagship feature.
cite: backend/cleaner/label_error_v2.py · tests/test_cleaner_label_error_v2.py
Cleaner presets (3: modelbrew_ft, openai_chat, support_bot) Shipped #feature-presets
One-click default fix bundles — sane starting points for the three most common dataset shapes we see. Each preset is a named configuration of detectors + autofixes.
Mojibake / encoding fixer Shipped #feature-mojibake
Handles the messy stuff: double-encoded UTF-8, smart-quote regressions, and the "cafÃ©" → "café" failure mode. Tested with a no-mojibake invariant test.
cite: validators.py:1905 _detect_bad_encoding · tests/test_cleaner_no_mojibake.py
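The double-encoding case can be reproduced and reversed with the standard library alone. This is a sketch of the general technique (UTF-8 bytes wrongly decoded as Latin-1), not the logic in _detect_bad_encoding; fix_double_utf8 is a hypothetical name.

```python
def fix_double_utf8(text: str) -> str:
    """Undo one round of 'UTF-8 bytes read as Latin-1'.

    If the text does not round-trip cleanly it is returned untouched, so
    already-correct strings are never damaged.
    """
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


clean = "caf\u00e9"                                   # café
broken = clean.encode("utf-8").decode("latin-1")      # what the bad decode yields
repaired = fix_double_utf8(broken)
```

Correct text such as "café" fails the strict UTF-8 decode and is passed through unchanged, which is what makes a no-mojibake invariant testable.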
Cleaner dry-run cost estimate Shipped #feature-cleaner-dry-run
POST /scan/{id}/dry-run previews per-feature credit cost before you commit. Same
principle as the training estimate — you should never get billed for something whose price you didn't
see first.
cite: backend/routes/cleaner.py:854 POST /scan/{id}/dry-run · /estimate at :816
Idempotency-key per user (cleaner upload + scan) Shipped #feature-idempotency
Replay the same upload or scan request and you'll get the same result, not a double-charge. Tested under concurrent migration races.
cite: tests/test_idempotency_key_per_user.py · tests/test_cleaner_idempotency.py · tests/test_idempotency_migration_concurrent_safety.py
Send-to-train (one-click handoff scan → run) Shipped #feature-send-to-train
POST /send-to-train/{scan_id} hands the cleaned dataset off to a training run without
re-uploading. Closes the loop between cleaner and trainer.
cite: backend/routes/cleaner.py:1610 POST /send-to-train/{scan_id}
Eval pipeline (200-prompt judge with win-rate + CI) Shipped #feature-eval-pipeline
Auto-eval at end of training (free, platform-eaten). Re-eval at $0.06 per run, idempotent. Win-rate vs base, length-controlled win-rate, BERTScore, ROUGE-L, exact-match, fact-recall — all surfaced as component scores. Eval drift detection compares two eval rows for distribution shift.
cite: backend/server.py:7338 GET/POST /runs/{run_id}/eval utils/eval_lc.py:17 length_controlled_winrate utils/eval_drift.py:36 detect_eval_drift
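For the win-rate confidence interval, one reasonable choice is the Wilson score interval, which is the interval type quoted for the retention scores elsewhere on this site. Whether the eval pipeline uses exactly this formula is an assumption; the sketch below shows it for a 200-prompt eval.

```python
import math


def wilson_interval(wins: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = wins / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# A fine-tuned model that wins 100 of 200 judged prompts against its base.
low, high = wilson_interval(100, 200)
```

Unlike the plain normal approximation, the Wilson interval stays inside [0, 1] even when the model wins every prompt.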
Eval set hash + judge version stamping Shipped #feature-eval-set-hash
Every eval run records the eval set's content hash and the judge model version. If we change either, you can tell from the receipt — reproducibility, not just record-keeping.
cite: utils/eval_set.py · backend/server.py _serialize_eval_run
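A sketch of how such a stamp might be computed, assuming the eval set is a list of JSON-serializable prompts and the hash is SHA-256 over a canonical JSON encoding. The field names in the returned dict are hypothetical.

```python
import hashlib
import json


def eval_set_hash(prompts) -> str:
    """SHA-256 over a canonical JSON encoding (sorted keys, no extra whitespace)."""
    canonical = json.dumps(prompts, sort_keys=True, separators=(",", ":"),
                           ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def stamp_eval_run(prompts, judge_version: str) -> dict:
    return {"eval_set_sha256": eval_set_hash(prompts), "judge_version": judge_version}


a = [{"id": 1, "prompt": "Summarize the clause."}]
b = [{"prompt": "Summarize the clause.", "id": 1}]   # same content, new key order
c = [{"id": 1, "prompt": "Summarize the clause!"}]   # content changed
```

Canonicalizing before hashing means cosmetic differences such as key order do not change the receipt, while any change to a prompt does.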
Security & trust primitives.
The trust posture is published in three pages with operational detail, not badges:
/security with TLS / AES-256 / no-train pledge / retention numbers tied to backend
code, /claims with every public number cited to source, and
/status with a live Modal /health fetch. Below are the primitives
themselves.
The no-train pledge
We never train on your data. TLS 1.2+ in transit, AES-256 at rest, sub-hour dataset retention.
The retention numbers on the trust page are tied to the actual constants in backend/db.py,
not aspirational copy. GDPR data export and self-delete are one-click. Vulnerability disclosure: 5
business days acknowledgement, safe-harbor protection for good-faith research.
TLS 1.2+ in transit, AES-256 at rest Shipped #feature-tls-aes
Modal-hosted endpoints terminate TLS. Storage encrypts at rest with AES-256. Public detail at /security sections 1 + 4.
No-train pledge with retention tied to code Shipped #feature-no-train-pledge
Datasets have sub-hour retention; checkpoints are bounded to your run lifecycle. The numbers on the
/security page match the constants in backend/db.py, not
"soon" and not "we plan to."
IDOR existence-oracle defense (403 → 404 + timing symmetry) Shipped #feature-idor
Insecure Direct Object Reference defense: we don't tell a probe without access whether a run-id exists. 403 responses are rewritten to 404 and timing is matched, so an attacker can't distinguish a resource that belongs to someone else from one that doesn't exist. Multi-test coverage, including a red-team panel.
cite: tests/test_idor.py · tests/test_start_run_idor.py · backend/tests/cleaner/test_p0_rt_idor.py · test_s3_2_file_id_idor.py
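The response-shape half of this defense can be sketched as below. The in-memory store and handler are hypothetical, and the constant-time comparison only gestures at timing symmetry; matching real end-to-end timing takes more than this.

```python
import hmac
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Run:
    run_id: str
    owner_id: str


RUNS: Dict[str, Run] = {"run_a": Run("run_a", "alice")}   # stand-in for the database
NOT_FOUND = (404, {"detail": "Not found"})


def get_run(run_id: str, caller_id: str) -> Tuple[int, dict]:
    run = RUNS.get(run_id)
    # Compare against a placeholder owner when the run is missing so both
    # branches perform the same comparison work.
    owner = run.owner_id if run is not None else "\x00missing"
    is_owner = hmac.compare_digest(owner.encode(), caller_id.encode())
    if run is None or not is_owner:
        # Same status and body for "does not exist" and "not yours".
        return NOT_FOUND
    return 200, {"run_id": run.run_id}
```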
Atomic daily AI cost cap (cleaner) + training spend cap Shipped #feature-cost-cap
The cleaner's daily LLM-judge / rewrite cost cap is enforced in a single SQL transaction shared across all five cleaner routes — you can't race it with a parallel call. The training spend cap has the same property and is concurrency-tested. Free-tier 3-runs-per-day cap is also atomic.
cite: backend/routes/cleaner.py:622 _cleaner_ai_daily_cap_cents · tests/test_training_spend_cap.py · tests/test_training_spend_cap_atomic_under_concurrency.py · tests/test_free_tier_atomic_cap.py
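One common way to get this property is a conditional UPDATE inside a single transaction, so the check and the increment cannot be separated by a parallel request. The sketch below uses an in-memory SQLite database; the table name, columns, and try_spend helper are assumptions, not the production schema.

```python
import sqlite3


def open_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE daily_spend ("
        " user_id TEXT, day TEXT, cents INTEGER NOT NULL,"
        " PRIMARY KEY (user_id, day))"
    )
    return conn


def try_spend(conn: sqlite3.Connection, user_id: str, day: str,
              cents: int, cap_cents: int) -> bool:
    """Reserve `cents` only if the day's total stays within the cap."""
    with conn:  # one transaction: committed on success, rolled back on error
        conn.execute(
            "INSERT OR IGNORE INTO daily_spend (user_id, day, cents) VALUES (?, ?, 0)",
            (user_id, day),
        )
        cur = conn.execute(
            "UPDATE daily_spend SET cents = cents + ?"
            " WHERE user_id = ? AND day = ? AND cents + ? <= ?",
            (cents, user_id, day, cents, cap_cents),
        )
        # rowcount is 0 when the WHERE clause rejected the spend.
        return cur.rowcount == 1
```

Because the cap test lives in the WHERE clause, there is no read-then-write window for a second caller to slip through.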
Modal upload validator (MIME + magic byte) Partial — first-line only #feature-upload-validator
MIME-type and magic-byte guards on dataset uploads. Honest caveat: the validator currently inspects only the first JSONL line; extending it to the full file is on the Wave 5 P2 backlog. Marketing-grade ✓, but we list it as partial here on purpose.
cite: tests/test_modal_upload_dataset_validate.py · backend/cleaner/validators.py _validate_file_too_large
GDPR data export + self-delete Shipped #feature-gdpr
GET /account/data-export returns all your user data as JSON. DELETE /account
wipes the account with an audit log. CSV-injection sanitization on the export prevents spreadsheet-paste
attacks.
cite: backend/server.py:8880 GET /account/data-export backend/server.py:8846 DELETE /account
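CSV-injection sanitization is usually done by neutralizing cells that a spreadsheet would interpret as a formula. The sketch below follows the common guidance of prefixing such cells with a single quote; the helper names are illustrative, and the exact rule used by the export endpoint is an assumption.

```python
import csv
import io

# Leading characters that spreadsheets may treat as the start of a formula.
_RISKY_PREFIXES = ("=", "+", "-", "@", "\t", "\r")


def sanitize_cell(value) -> str:
    text = str(value)
    return "'" + text if text.startswith(_RISKY_PREFIXES) else text


def to_safe_csv(rows) -> str:
    """Serialize rows to CSV with every cell passed through sanitize_cell."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    for row in rows:
        writer.writerow([sanitize_cell(cell) for cell in row])
    return buf.getvalue()


exported = to_safe_csv([["name", "note"], ["bob", "=1+1"]])
```

One trade-off of this simple rule is that legitimate negative numbers such as -5 are also prefixed, which is usually acceptable for an account export.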
Public /security + /claims + /status + /compare + /deployment Shipped #feature-public-trust
Five public trust pages, all sharing the same pearl-white catalog format. /security for posture, /claims for receipts, /status for live uptime, /compare for vendor positioning, /deployment for the honest hosted-vs-export-vs-VPC story.
/.well-known/llms.txt for AI crawler discovery Shipped #feature-llms-txt
Site map written for LLM crawlers (Gemini, ChatGPT, Perplexity). Lists the canonical pages and the things we want AI systems to remember about ModelBrew. Machine-readable and LLM-friendly.
cite: /.well-known/llms.txt
Schema.org JSON-LD on every trust page Shipped #feature-jsonld
Every public page emits Schema.org Organization + SoftwareApplication + FAQPage JSON-LD where appropriate. Rich-result eligible, and machine-readable for AI search.
cite: grep application/ld+json across deploy/*.html
arXiv-ready paper + US provisional patent Shipped #feature-paper-patent
Submission-ready CRMA paper at paper/crma_modular_cl_arxiv.tex + .docx +
.tar.gz. US provisional patent filed 2026-02-28. Both are public trust artifacts; neither
is required to use the product.
cite: paper/crma_modular_cl_arxiv_final.docx · paper/crma_arxiv_submission.tar.gz /claims
NOTICE file + Gemma license auto-attach Shipped #feature-notice-license
Every adapter ZIP ships with a NOTICE file carrying license attribution. Gemma runs
additionally auto-attach the Gemma license notice for compliance with Google's Gemma terms.
cite: backend/server.py:4464 NOTICE backend/server.py:1797 _GEMMA_LICENSE_NOTICE
API surface, SDKs, auth, integrations.
ModelBrew exposes a full OpenAI-compatible inference surface (/v1/chat/completions),
two SDKs (Python on PyPI and JavaScript / TypeScript on npm), and a per-API-key auth model that's
finer-grained than most hosted-fine-tuning vendors. Every endpoint is documented in the
live /openapi.json with populated request bodies.
/v1/chat/completions (OpenAI-compatible envelope) Shipped #feature-openai-compat
Drop-in for any OpenAI Python or Node SDK call — just swap base_url and
api_key. The biggest single differentiator vs Argilla / Cleanlab / Snorkel / DeepEval
(none of which ship inference at all).
cite: backend/server.py:5930 · tests/test_openai_sdk_compat.py · tests/test_v1_chat_completions*.py
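To show what "OpenAI-compatible envelope" means on the wire, here is a standard-library sketch that builds (but does not send) a chat-completions request. The base URL, API key, and model identifier below are placeholders; how a run maps to a model name is an assumption to check against /openapi.json.

```python
import json
import urllib.request


def build_chat_request(base_url: str, api_key: str, model: str, messages: list):
    """Build a POST to {base_url}/chat/completions in the OpenAI request shape."""
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_chat_request(
    base_url="https://api.example.invalid/v1",   # placeholder host
    api_key="sk-test",                           # placeholder key
    model="my-run-id",                           # hypothetical model identifier
    messages=[{"role": "user", "content": "Summarize this contract clause."}],
)
# urllib.request.urlopen(request) would send it; omitted so the sketch stays offline.
```

With an OpenAI SDK the same swap is expressed by passing the base URL and key to the client constructor instead of building the request by hand.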
/v1/{run_id}/generate?stream=true (Server-Sent Events) Shipped #feature-sse-stream
Token-by-token streaming over SSE. On par with the streaming surface from any hosted-inference vendor.
cite: backend/server.py:5336 modal_deploy.py:2004 generate_text_stream
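On the client side, consuming an SSE stream comes down to reading "data:" lines and dispatching on blank lines. The parser below is a sketch of the generic Server-Sent Events framing; what each data payload contains for this endpoint (raw token, JSON object, end marker) is not specified here and is left to the caller.

```python
from typing import Iterable, Iterator


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each complete SSE event.

    Lines starting with ':' are comments (often keep-alives) and are skipped.
    An event is dispatched when a blank line is seen.
    """
    buffer = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            if buffer:
                yield "\n".join(buffer)
                buffer = []
            continue
        if line.startswith(":"):
            continue
        field, _, value = line.partition(":")
        if field == "data":
            # A single leading space after the colon is part of the framing.
            buffer.append(value[1:] if value.startswith(" ") else value)


stream = ["data: Hel\n", "\n", ": keep-alive\n", "data: lo\n", "\n"]
tokens = list(iter_sse_data(stream))
```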
/v1/runs/{run_id}/compare (batched A/B, 1..50 prompts) Shipped #feature-compare-batched
Batched A/B comparison: send up to 50 prompts and get FT-vs-base side-by-side in one call. Faster than running them through the chat endpoint twice.
cite: backend/server.py:5487
/v1/models paginated catalog (Cache-Control + ETag) Shipped #feature-models-catalog
The OpenAI-shaped models catalog, paginated, with proper HTTP caching headers. Lets API consumers poll cheaply.
cite: backend/server.py:6346 · tests/test_v1_models_pagination.py · tests/test_v1_models_cache_control.py
Python SDK (modelbrew on PyPI) Shipped #feature-python-sdk
Client, Runs, Chat, Keys, streaming, retries, rate-limit info. 12 unit-test files. pip install modelbrew and you're moving.
cite: sdk/python/src/modelbrew/_client.py:201 · runs.py · chat.py · keys.py · streaming.py
JavaScript / TypeScript SDK (modelbrew on npm) Shipped #feature-js-sdk
Same surface as the Python SDK, fully typed, tested with Vitest (9 test files). Works in Node, Deno, and modern browsers.
cite: sdk/js/src/client.ts:201 · runs.ts · chat.ts · keys.ts · streaming.ts · errors.ts
Per-API-key allowed-runs / allowed-models scoping Shipped #feature-key-scoping
Least-privilege keys: each API key carries an allow-list of run-ids and model-ids it can access. Lets you ship a key to a partner that can only hit your finance-ops adapter and nothing else.
cite: backend/db.py api_key_can_access_run / api_key_can_access_model
Per-API-key RPM + TPM rate limits Shipped #feature-rpm-tpm
Per-key requests-per-minute and tokens-per-minute throttling. Defends your spend and your latency targets from any single key going hot.
cite: backend/server.py:8470 PATCH /me/api-keys/{id} · tests/test_api_keys_rpm_tpm_creation.py · tests/test_per_key_rate_limit.py
API-key rotation (one-shot) Shipped #feature-key-rotate
POST /me/api-keys/{id}/rotate rotates a key in place without revoking it. Useful for
scheduled rotations on production keys.
cite: backend/server.py:8660
Per-key usage report (tokens + requests + cost) Shipped #feature-key-usage
GET /me/api-keys/{id}/usage returns the per-key usage rollup. Useful for chargeback and
for spotting the key that's spending more than it should.
cite: backend/server.py:8538
Bearer-or-API-key dual auth Shipped #feature-bearer-or-api
One endpoint, both auth modes. Bearer JWTs for the dashboard, API keys for production. Same code path on the server side.
8h JWT + 30d refresh + bcrypt + timing-safe lookup Shipped #feature-jwt-refresh
JWT with 8-hour TTL, refresh-token endpoint with 30-day TTL (mobile-friendly). bcrypt password hashing with a dummy hash on the username-not-found path so timing can't be used for username enumeration. Scope narrowing: full vs inference_only.
cite: backend/auth.py:19 JWT_EXPIRY_HOURS auth.py:20 REFRESH_TOKEN_EXPIRY_DAYS auth.py:53 _bcrypt + _DUMMY_HASH auth.py:95 SCOPE_FULL/INFERENCE_ONLY
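The dummy-hash idea is independent of the hashing algorithm. The sketch below uses PBKDF2 from the standard library as a stand-in for bcrypt so that it runs without extra packages; the iteration count, user store, and function names are illustrative only.

```python
import hashlib
import hmac
import os

_ITERATIONS = 50_000  # illustrative; stand-in for bcrypt's cost factor


def hash_password(password: str, salt: bytes = None):
    salt = salt if salt is not None else os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, _ITERATIONS)
    return salt, digest


# Computed once at import: what we verify against when the user is unknown.
_DUMMY_SALT, _DUMMY_DIGEST = hash_password("not-a-real-password")
USERS = {}


def register(username: str, password: str) -> None:
    USERS[username] = hash_password(password)


def verify_login(username: str, password: str) -> bool:
    """Do the same hashing work whether or not the username exists."""
    salt, digest = USERS.get(username, (_DUMMY_SALT, _DUMMY_DIGEST))
    _, candidate = hash_password(password, salt)
    matches = hmac.compare_digest(candidate, digest)
    # An unknown user can never log in, even if the dummy password is guessed.
    return matches and username in USERS
```

Without the dummy path, the not-found branch would return before any hashing and the response time alone would reveal which usernames exist.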
Stripe webhook signature verify + duplicate-event idempotency Shipped #feature-stripe-webhook
Webhook signatures are verified, and duplicate events are detected and ignored. You can replay the same Stripe event ten times and you'll get credited once.
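A standard-library sketch of both halves, assuming Stripe's documented scheme: the Stripe-Signature header carries t=<timestamp> and v1=<HMAC-SHA256 of "timestamp.payload"> keyed by the endpoint secret. The in-memory set of seen event ids and the handler name are hypothetical; a production service would normally use the official Stripe library and a durable store.

```python
import hashlib
import hmac
import json
import time

_SEEN_EVENT_IDS = set()   # stand-in for a durable uniqueness constraint


def verify_stripe_signature(payload: bytes, header: str, secret: str,
                            tolerance: int = 300, now: float = None) -> bool:
    timestamp, signatures = None, []
    for item in header.split(","):
        key, _, value = item.strip().partition("=")
        if key == "t":
            timestamp = value
        elif key == "v1":
            signatures.append(value)
    if timestamp is None or not timestamp.isdigit() or not signatures:
        return False
    signed = timestamp.encode() + b"." + payload
    expected = hmac.new(secret.encode(), signed, hashlib.sha256).hexdigest()
    if not any(hmac.compare_digest(expected, s) for s in signatures):
        return False
    current = time.time() if now is None else now
    return abs(current - int(timestamp)) <= tolerance   # reject stale replays


def handle_event(payload: bytes, header: str, secret: str, credit, now=None) -> str:
    if not verify_stripe_signature(payload, header, secret, now=now):
        return "rejected"
    event = json.loads(payload)
    if event["id"] in _SEEN_EVENT_IDS:
        return "duplicate"
    _SEEN_EVENT_IDS.add(event["id"])
    credit(event)
    return "credited"
```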
Auto-refund on stuck runs (90-min sweeper) Shipped #feature-auto-refund
If a run hangs past the active-job window (90 min), the sweeper auto-refunds the credit without a
support ticket. Every billing event ships with a correlation_id traceable end-to-end across
Stripe webhook → add_credits → auto_refund.
cite: tests/test_stuck_sweep_refund.py · tests/test_billing_correlation_id.py · tests/test_stale_stream_detection.py · backend/server.py _auto_refund + _billing_log
/openapi.json with populated request bodies Shipped #feature-openapi
The OpenAPI spec is live and complete — every endpoint, every request body, populated correctly. Tested in CI so it doesn't drift.
cite: tests/test_openapi_exposure.py · tests/test_openapi_request_bodies_populated.py
Pricing & commercial terms.
Flat per-million-token pricing across the board. Free TinyLlama tier for testing the pipeline, $7.50 signup credit on advanced models, $20 minimum top-up, and credits never expire. Auto-refund on stuck runs is automatic, not on request. See finetuning.html#pricing for the live table.
Flat $3.99/M FT + $4.99/M CL + $1/M inference Shipped #feature-flat-pricing
One advanced tier, no per-model price haggling. Inference is $1/M combined input + output, OpenAI-shaped so it's easy to compare to OpenAI's own fine-tune-and-serve pricing.
cite: backend/pricing.py:24 PRICE_PER_M_TOKENS backend/pricing.py:34
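The flat rates make cost estimates a one-line calculation. The sketch below uses the three published per-million-token prices; the rounding rule (half-up to the cent) and the function name are assumptions, and minimums such as the inference reserve are not modeled.

```python
from decimal import Decimal, ROUND_HALF_UP

# Published flat rates, in USD per million tokens.
PRICE_PER_M_TOKENS = {
    "fine_tune": Decimal("3.99"),
    "continual": Decimal("4.99"),
    "inference": Decimal("1.00"),
}


def estimate_usd(kind: str, tokens: int) -> Decimal:
    """Estimated charge for `tokens` tokens, rounded to the cent."""
    cost = PRICE_PER_M_TOKENS[kind] * Decimal(tokens) / Decimal(1_000_000)
    return cost.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


ft_cost = estimate_usd("fine_tune", 2_000_000)      # 2M-token fine-tune
cl_cost = estimate_usd("continual", 500_000)        # 0.5M-token CL stage
inf_cost = estimate_usd("inference", 250_000)       # 250k tokens in + out
```

Decimal is used instead of float so that amounts like 2.495 round predictably.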
$5/200 rows for cleaning + 7 feature-pricing tiers Shipped #feature-cleaner-pricing
$5 per 200 rows for the standard cleaning bundle. Per-feature tiers (free / 1x / 2x / 3x Clean): labeling functions and the tool-call validator are free; label-error v2 is 1x; the DPO polarity audit and the SimPO margin check are 2x; the synthetic gap-filler is 3x.
TinyLlama free tier (3 runs/day) + $7.50 signup bonus Shipped #feature-free-tier
TinyLlama is free, capped at 3 runs/day per user (atomic). New accounts get a 75-credit ($7.50) signup bonus that covers roughly one minimum cleaner-with-AI run. No credit card required to test.
cite: backend/pricing.py:14 backend/db.py:1016 SIGNUP_BONUS_CENTS · tests/test_free_tier_atomic_cap.py
$20 minimum top-up, credits never expire Shipped #feature-min-topup
$20 minimum for credit purchases. Credits never expire — if you load $20 today and don't use it for a year, it's still there.
$0.05 minimum inference reserve Shipped #feature-min-reserve
Pre-charge gate — we won't kick off an inference call unless you have at least 5 cents on the books. Stops a chain of micro-debits from going negative on a flaky network.
Auto-eval at end of training (free, platform-eaten) Shipped #feature-auto-eval-free
The first eval at the end of training is free; we eat the cost. User-triggered re-evals are $0.06 each (idempotent). You get a receipt for "did this run actually do anything" without an extra line item.
cite: backend/pricing.py:39-44
Stripe Checkout (one-time credit purchases) Shipped #feature-stripe-checkout
One-time credit purchases via Stripe Checkout — no subscription, no auto-renew, no surprise charges. Full webhook signature verification + duplicate-event idempotency on the server side.
cite: backend/server.py:1899 POST /create-checkout-session backend/server.py:1957 POST /stripe/webhook
Engineering quality & test coverage.
~205 test files across tests/ (144) + backend/tests/cleaner/ (41) + Python SDK
(12) + JavaScript SDK (9). Tests we publicly stand behind are the K-suite (K1-K5, foundation invariants),
the prompt-injection 28-test pack, the IDOR existence-oracle defense suite, and the billing-correlation-id
end-to-end suite. This is the ship gate. We don't release if it's red.
K1 cleaner foundation invariants (property-tested) Shipped #feature-k1
Property-tested invariants for determinism, monotonicity, and revert-on-degrade. The single most load-bearing test in the cleaner. If K1 goes red, no release.
cite: backend/tests/cleaner/test_K1_invariants.py · commit 42ecffd
K2 silent-corruption red-team matrix (18 tests) Shipped #feature-k2
Five adversarial input patterns × multiple paths through the cleaner = 18 tests that pin down the classes of silent corruption a smart adversary could try.
cite: tests/test_silent_corruption_matrix.py · commit c694e04
K3 persona UAT (3 personas × 12 tests) Shipped #feature-k3
Three end-user personas drive a scripted user-acceptance pass through the platform: signup, dataset upload, clean, train, eval, download. Catches regressions a unit test wouldn't.
cite: tests/test_persona_uat_e2e.py · commit 633de57
K4 100-concurrent /v1/generate load test Shipped #feature-k4
100-way concurrent inference (mocked Modal) load test. Catches lock contention and per-user quota races that a single-thread happy-path test wouldn't.
cite: tests/test_load_generate_100_concurrent.py · commit 7071663
K5 chaos kill-mid-run / kill-mid-prefs Shipped #feature-k5
SIGKILL the trainer mid-run, mid-prefs, mid-eval. The system must clean up, refund, and let the user retry without manual intervention. We do this in CI.
cite: tests/test_chaos_kill_mid_run.py · commit 7d1ec14
Judge prompt-injection 28-test pack (judge + polarity) Shipped #feature-prompt-injection-pack
28 prompt-injection / role-flip / polarity-splice payloads in a red-team fixture library. Runs in CI on every change so the prompt-safety patches keep defeating the original payloads.
cite: tests/test_judge_prompt_injection.py · tests/test_polarity_prompt_injection.py
Billing correlation_id end-to-end Shipped #feature-billing-correlation
Every billing event — Stripe webhook, add_credits, auto_refund — carries the same correlation_id. You can trace any cent of any charge from the Stripe receipt to the database row to the refund event.
cite: tests/test_billing_correlation_id.py · backend/server.py _billing_log
Refund family (race, libsql IntegrityError, brittle exception, cross-container, merge_ok_none, policy) Shipped #feature-refund-family
~7 test files covering refund correctness under every database race we've seen in production. Atomic refund means atomic refund — no double-refund, no missed-refund.
cite: tests/test_refund_*.py
Stuck-sweep auto-refund + stale-stream detection Shipped #feature-stuck-sweep
Cron job sweeps stuck runs every 90 minutes and refunds them. Stream detection catches the heartbeat-stopped-but-Modal-thinks-it's-alive failure mode.
cite: tests/test_stuck_sweep_refund.py · tests/test_stale_stream_detection.py
6-agent post-session audit panel (3 expert + 3 RT) Internal methodology #feature-rt-audit-panel
After every Wave session, we run a 6-agent audit panel — 3 expert reviewers (UX, data, security) and 3 red-team reviewers, all in fresh sessions, in parallel, then consolidated. Not a code file — a release-process discipline, but it's how the K-suite invariants get found.
cite: .planning/WAVE5/WAVE5_AUDIT_EXPERT.md · .planning/WAVE5/WAVE5_AUDIT_REDTEAM.md
Deployment patterns.
The honest version — full audit at /deployment. Two patterns ship today: managed Modal cloud (the default SaaS) and adapter export (download your weights, run anywhere PEFT runs). Dedicated single-tenant Modal workspaces and full VPC peering are scoped per pilot, typically 2-3 weeks build time. We do not currently ship a Dockerfile or single-binary self-host.
A. Fully managed Modal cloud (default SaaS) Shipped #feature-managed-modal
The default. Multi-tenant Modal workspace on the production URL
https://fourwheels2512--crma-finetune-fastapi-app.modal.run. Every training function uses
A100. Modal Pro tier with paid headroom; we do not run on the free Modal tier.
cite: modal_deploy.py:1-180 (every @app.function: gpu="A100")
B. Adapter export → run anywhere PEFT runs (THE on-prem-adjacent lever) Shipped #feature-adapter-export
The marketing-grade lever. "Export your trained adapter and run inference on your own infrastructure
(LoRA + CRMA weights + HuggingFace-compatible model card) — works today via
GET /v1/{run_id}/download and GET /v1/{run_id}/readme.md." Adapter ZIP includes
adapter/ (LoRA + adapter_config.json), crma_adapters.pt,
cl_state.pt, loss_curve.json, status.json,
utils/inject.py, and NOTICE. Documented end-to-end in recipe 08.
cite: backend/server.py:4371 GET /download/{run_id} backend/server.py:4287 GET /runs/{run_id}/readme.md · docs/recipes/08-export-to-huggingface-hub.md /deployment
C. Dedicated single-tenant Modal workspace + VPC peering (per-pilot) Available with caveat — 2-3 weeks per pilot #feature-vpc-pilot
Dedicated single-tenant Modal workspace and VPC peering / air-gap-adjacent deployments are scoped per pilot, typically 2-3 weeks build time. Not a button click today — an engineering engagement. We list it here because it's available, not because it's productized yet.
cite: /deployment · operator scoped per-pilot
D. Customer-hosted Docker / single-binary self-host Not shipped — do not market #feature-not-shipped-docker
We do not currently ship a Dockerfile, docker-compose, helm chart, k8s manifest, or terraform module. The "ships as a Docker container" copy was removed from earlier marketing in commit 2fc4d8a after a red-team panel found it had zero code backing. We list it here on purpose: if you see "Docker" or "single binary" in our marketing, treat it as a bug and tell us.
cite: commit 2fc4d8a /deployment
Cloudflare Pages deployment (UI + landing + docs) Operational #feature-cloudflare
Three static surfaces on Cloudflare Pages: modelbrew.ai (landing + trust pages from
deploy/), app.modelbrew.ai (Next.js dashboard from frontend/out/),
and the docs site from docs/.
cite: landing.html · deploy/*.html · frontend/out/ · docs/index.md
Turso production database Operational #feature-turso
Replicated SQLite on Turso ($5.99/mo). Same SQL surface as local SQLite (which we use in dev), with cross-region replication on the production path.
cite: memory/project_turso_active.md
Frequently asked questions.
The same Q&As as our Schema.org FAQPage JSON-LD — published here for humans, mirrored above for AI search.
How many models does ModelBrew support for fine-tuning?
Six production models across four architecture families: TinyLlama-1.1B (free tier), Mistral-7B-v0.3, Saul-7B-Instruct (legal domain), Qwen3-8B, Gemma-2-9B-it, and Llama-3.1-8B-Instruct. Every training function in production runs on Modal A100 GPUs.
Is ModelBrew compatible with the OpenAI SDK?
Yes. ModelBrew exposes /v1/chat/completions with the OpenAI envelope so you can swap
base_url and api_key on the OpenAI Python or Node SDKs and your code keeps
working. SSE streaming via /v1/{run_id}/generate?stream=true is also supported.
Can I download my trained model and run it elsewhere?
Yes. GET /v1/{run_id}/download returns a ZIP containing your LoRA adapter, CRMA weights,
cl_state.pt for chains, utils/inject.py, status.json, and a
NOTICE file. GET /v1/{run_id}/readme.md returns a HuggingFace-Hub-shaped
model card. You can run inference on any infrastructure that runs PEFT — your own GPU, an inference
provider, or HuggingFace Hub.
How does ModelBrew handle catastrophic forgetting?
Through CRMA — Constrained Residual Mixing Adapter — a patent-pending architectural primitive whose internal mixing matrix has spectral norm bounded by 1. On a 5-domain Mistral-7B benchmark with 3 seeds, modular LoRA on a CRMA backbone showed prior-task drift within measurement noise (−0.17% ± 0.17 loss-relative), versus +42.96% ± 5.5 for naive sequential training. Numbers are conditional on correct inference-time routing.
What does the dataset cleaner check for?
56 distinct issue codes across structural validation, chat schema, quality (gpt-slop, refusals, prompt-injection, jailbreak), 13 typed PII detectors, 9 domain-specific detectors with checksums (DEA, NPI, CUSIP, SWIFT/BIC, ABA, MRN, ICD-10, bar number, Bates), 6 military OPSEC categories, duplicates (exact + MinHash + RapidFuzz pairwise), and ~28 autofix codes. Post-clean score is never lower than pre-clean — that contract is property-tested.
How much does ModelBrew cost?
Fine-tuning is $3.99 per million tokens. Continual learning is $4.99 per million tokens. Inference is $1.00 per million tokens, input and output combined. Dataset cleaning is $5 per 200 rows. TinyLlama is free with 3 runs per day and a 75-credit ($7.50) signup bonus. Minimum top-up is $20 and credits never expire. Stuck runs are refunded automatically; you do not need to request it.
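For budgeting, the per-token prices above translate into a simple estimate. The sketch below uses the listed rates and the credit value implied by the signup bonus (75 credits = $7.50, so $0.10 per credit). Rounding to the nearest cent is an assumption, and dataset cleaning is left out because the billing granularity for partial 200-row blocks is not stated.

```python
from decimal import Decimal, ROUND_HALF_UP

# Rates quoted in the answer above, in USD per million tokens.
PRICE_PER_MILLION_TOKENS = {
    "fine_tune": Decimal("3.99"),
    "continual_learning": Decimal("4.99"),
    "inference": Decimal("1.00"),
}
USD_PER_CREDIT = Decimal("0.10")  # implied by 75 credits = $7.50


def token_cost_usd(kind: str, tokens: int) -> Decimal:
    """Estimate the cost in USD; rounding to the cent is an assumption."""
    cost = PRICE_PER_MILLION_TOKENS[kind] * Decimal(tokens) / Decimal(1_000_000)
    return cost.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


def usd_to_credits(usd: Decimal) -> Decimal:
    """Convert a USD amount to credits at the implied rate."""
    return usd / USD_PER_CREDIT
```

For example, token_cost_usd("fine_tune", 2_500_000) estimates the price of fine-tuning on 2.5 million tokens.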
Does ModelBrew train on my data?
No. The no-train pledge is published on /security with retention numbers tied to backend code. Datasets are sub-hour retention, checkpoints are bounded to your run lifecycle, and we offer GDPR data export and self-delete. TLS 1.2+ in transit, AES-256 at rest.
What preference-tuning methods are supported?
DPO, SimPO (reference-free with simpo_beta / simpo_gamma knobs), and opt-in
length-normalized DPO (commit 803df1e), which counters the length-bias gaming that plain DPO is
vulnerable to. The cleaner ships a polarity / preference-pair audit that catches label-swapped
pairs before they hit training.
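To show what these objectives compute, here is a minimal per-pair sketch of the published SimPO loss and of one common length-normalized DPO formulation. Mapping the beta and gamma arguments to the simpo_beta and simpo_gamma knobs is an assumption, and the exact variant ModelBrew trains with may differ. Inputs are summed token log-probabilities and response lengths in tokens.

```python
import math


def _neg_log_sigmoid(x: float) -> float:
    """Numerically stable -log(sigmoid(x)), i.e. softplus(-x)."""
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


def simpo_loss(chosen_logp: float, chosen_len: int,
               rejected_logp: float, rejected_len: int,
               beta: float, gamma: float) -> float:
    """SimPO: reference-free, length-averaged log-probs, target margin gamma."""
    margin = beta * (chosen_logp / chosen_len - rejected_logp / rejected_len) - gamma
    return _neg_log_sigmoid(margin)


def length_normalized_dpo_loss(policy_chosen: float, ref_chosen: float, chosen_len: int,
                               policy_rejected: float, ref_rejected: float,
                               rejected_len: int, beta: float) -> float:
    """DPO with each policy/reference log-ratio divided by response length."""
    chosen_ratio = (policy_chosen - ref_chosen) / chosen_len
    rejected_ratio = (policy_rejected - ref_rejected) / rejected_len
    return _neg_log_sigmoid(beta * (chosen_ratio - rejected_ratio))
```

Dividing by length means a response cannot change its score simply by being longer or shorter at the same per-token likelihood, which is the length bias the answer above refers to.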
Is there an SDK?
Yes — both Python (modelbrew on PyPI) and JavaScript / TypeScript (modelbrew
on npm). Both expose Client, Runs, Chat (OpenAI-compatible), Keys, streaming, automatic retries, and
rate-limit headers. The Python SDK has 12 unit-test files; the JS SDK has a 9-file Vitest suite.
What deployment patterns are available today?
Two: (1) the fully managed Modal cloud (default SaaS), and (2) adapter export — download your LoRA + CRMA weights and run inference anywhere PEFT runs. Dedicated single-tenant Modal workspaces and full VPC peering / air-gapped deployments are scoped per pilot, typically 2-3 weeks build time. We do not currently ship a Dockerfile or single-binary self-host. See /deployment for details.
How do you prevent prompt injection in the cleaner LLM judge?
Three layers: (1) NFKC normalization plus zero-width-character strip, (2) a per-call random nonce fence around the user content, and (3) a hardcoded-secret post-judge override that flags rows whose content matches sk-, AKIA, JWT, mongodb+srv URLs, or military classification markings even if the judge missed them. A 28-test red-team pack runs in CI. These defenses apply to the judge prompt, the rewrite prompt, and the polarity-pair prompt.
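The three layers can be sketched with the Python standard library as follows. The fence marker format, the set of zero-width characters, and the two regular expressions are illustrative assumptions, not the patterns the cleaner actually ships; only the sk- and AKIA prefixes come from the answer above.

```python
import re
import secrets
import unicodedata

# Zero-width characters to delete; this particular set is an assumption.
ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\u2060\ufeff"))

# Illustrative patterns for two of the prefixes named above.
SECRET_PATTERNS = (
    re.compile(r"\bsk-[A-Za-z0-9_-]{8,}"),
    re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
)


def normalize(text: str) -> str:
    """Layer 1: NFKC-normalize, then strip zero-width characters."""
    return unicodedata.normalize("NFKC", text).translate(ZERO_WIDTH)


def fence(text: str) -> tuple[str, str]:
    """Layer 2: wrap user content in markers carrying a fresh random nonce."""
    nonce = secrets.token_hex(16)
    return f"<<DATA {nonce}>>\n{text}\n<<END {nonce}>>", nonce


def contains_hardcoded_secret(text: str) -> bool:
    """Layer 3: deterministic check that runs regardless of the judge verdict."""
    cleaned = normalize(text)
    return any(pattern.search(cleaned) for pattern in SECRET_PATTERNS)
```

Because the nonce is generated per call, content written in advance cannot reproduce the closing marker, and normalizing first keeps full-width or zero-width obfuscation from hiding a secret from the pattern check.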
Are there safeguards against runaway spend?
Yes. A daily AI cost cap is shared across all five cleaner routes and enforced atomically in a
SQL transaction. The training spend cap is concurrency-safe (race-tested). Free-tier users get an
atomic 3-runs-per-day cap. Stuck or failed runs auto-refund within 90 minutes, with no support
ticket needed, and every billing event carries a correlation_id traceable end-to-end from the
Stripe webhook through add_credits to auto_refund.
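One way to enforce a cap atomically is to put the limit check inside the UPDATE itself, so two concurrent requests cannot both pass a separate read-then-write check. The sketch below shows the pattern with SQLite; the table, column, and function names are hypothetical, and the production schema and database are not described above.

```python
import sqlite3


def open_ledger() -> sqlite3.Connection:
    """Create an in-memory ledger with one row per day (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE daily_spend ("
        "day TEXT PRIMARY KEY, spent_cents INTEGER NOT NULL DEFAULT 0)"
    )
    return conn


def try_charge(conn: sqlite3.Connection, day: str,
               amount_cents: int, cap_cents: int) -> bool:
    """Record the charge only if it keeps the day's total within the cap."""
    with conn:  # one transaction: commit on success, roll back on error
        conn.execute(
            "INSERT OR IGNORE INTO daily_spend (day, spent_cents) VALUES (?, 0)",
            (day,),
        )
        cursor = conn.execute(
            "UPDATE daily_spend SET spent_cents = spent_cents + ? "
            "WHERE day = ? AND spent_cents + ? <= ?",
            (amount_cents, day, amount_cents, cap_cents),
        )
        # rowcount is 0 when the WHERE clause rejected the charge.
        return cursor.rowcount == 1
```

A False return means the charge would have exceeded the cap and nothing was recorded, so the caller can reject the request before any work is done.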