Deployment options
Deployment · Cloud · Hybrid · VPC pilot

Deployment options.
Train on ModelBrew, run anywhere — including inside your VPC.

Three honest options. Cloud-managed and Hybrid Export ship today. Full VPC deployment is available per enterprise pilot in 2–3 weeks — the code is portable; the build is execution time, not architecture work. We are explicit about scope: containerized backend + Postgres swap + license-key auth + customer-controlled data, with the training queue running on the customer's own Modal account or our shared cloud per their security review. Air-gapped install, SOC 2 Type II, and HIPAA BAA stay future-roadmap — we do not claim them.

Last reviewed: 2026-05-07 Sales / enterprise: modelbrewai@gmail.com VPC pilot scope: 2–3 weeks per customer
Available now

Cloud (managed)

The default. Fully managed training + inference on Modal.

  • Zero infrastructure for you — we run the GPUs.
  • Training data and intermediate checkpoints stay in our infrastructure for the run, then sweep on the documented retention schedule.
  • OpenAI-compatible /v1/chat/completions endpoint when each run finishes.
  • TLS 1.2+, AES-256 at rest, no-train pledge — see /security.
Available now

Hybrid Export

Train here, run anywhere. Download the adapter and serve it inside your own VPC.

  • Train on our GPUs, then call GET /download/{run_id} to fetch a ZIP containing the LoRA adapter + CRMA bundle.
  • Load with PEFT / Hugging Face Transformers, vLLM, or TGI on your own GPU. Standard PEFT format — no proprietary loader.
  • Inference traffic never leaves your VPC. The trained adapter is yours to keep.
  • README export at GET /runs/{id}/readme.md documents recommended hyperparameters and load instructions.
VPC pilot · 2–3 weeks

Enterprise VPC deployment

Containerized backend running inside your VPC, scoped per pilot. Honest 2–3 weeks per enterprise customer.

  • What ships in 2–3 weeks: Dockerized backend + Postgres + license-key auth + customer-controlled data + training queue on your own Modal account or our shared cloud.
  • Why we can ship this fast: the code is already portable (FastAPI, SQLite→Postgres swap is a small refactor, Stripe is one billing module replaceable with a license check, and Modal is the only external dependency).
  • Out of 2–3 weeks scope: fully air-gapped install (judge LLM still calls Gemini), SOC 2 Type II cert, HIPAA BAA, and multi-region.
  • Have a target go-live? Email us — we will quote against the actual decomposition.

Download the adapter. Run it where your data already lives.

The single biggest reason regulated buyers reject hosted fine-tuners is "we cannot send our data to your inference endpoint." Hybrid Export answers that. You upload training data once, our cloud GPUs train the adapter, and then you take the artifact home.

The GET /download/{run_id} endpoint streams a ZIP containing the LoRA adapter directory (PEFT format) and, for CRMA continual-learning runs, the spectrally bounded CRMA bundle. The endpoint requires the same authenticated user that owned the run; selected-checkpoint resolution happens server-side before the ZIP is built so you cannot accidentally download a stale or mismatched adapter.

What you get

  • LoRA adapter in standard PEFT format — adapter_config.json + adapter_model.safetensors. Loadable with PeftModel.from_pretrained() from any recent peft + transformers install.
  • CRMA bundle (continual-learning runs only) — the shared spectrally bounded backbone adapter and per-domain modular adapters. Composable at inference time on your infrastructure.
  • README export at GET /runs/{run_id}/readme.md with the safe-to-share hyperparameter allow-list and a load snippet you can paste into a worker.
  • Your weights, your retention. Once downloaded, we no longer need to keep a copy. The cloud copy expires on the documented schedule.

Load snippet (Hugging Face Transformers)

# 1) download the bundle (auth header required)
curl -H "Authorization: Bearer $MB_API_KEY" \
     -o my_adapter.zip \
     https://fourwheels2512--crma-finetune-fastapi-app.modal.run/download/$RUN_ID

unzip my_adapter.zip -d ./my_adapter

# 2) load it next to a base model on your own GPU
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")
tok  = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")
model = PeftModel.from_pretrained(base, "./my_adapter/adapter")
# inference traffic never leaves your VPC from here

For higher-throughput serving, the same adapter directory loads into vLLM or TGI via their respective LoRA / PEFT adapter flags. We do not maintain a custom inference runtime — everything is standard PEFT format on purpose, so you are not locked in.

The default path. We run the GPUs, you ship the model.

Cloud-managed is the fastest way from upload to a callable model. Training runs on Modal serverless GPUs (A100), and as soon as a run finishes you get an OpenAI-compatible /v1/chat/completions endpoint scoped to that run. No infrastructure to provision; no cold-start GPU bill if nobody is calling your model.

Data lifecycle is documented and short by default: uploaded dataset files are swept hourly on the API host and the training-infrastructure copy is removed within 7 days. Account and billing rows persist for the lifetime of your account; you can request deletion at any time via modelbrewai@gmail.com. Concrete numbers, matched to what the code actually enforces, live on /security#retention.

If your concern is "I don't want my training data to leave my infrastructure at all" — that is what the VPC pilot path below solves. Hybrid Export answers the inference half today; the VPC pilot answers the training half on a 2–3 week clock.

VPC deployment, scoped per pilot — 2–3 weeks per enterprise customer.

Honest framing: scoped per pilot, not deferred.

The on-prem-adjacent path that ships in 2–3 weeks is a VPC deployment of the ModelBrew backend — the same FastAPI app you call today, packaged as a Docker stack, with Postgres swapped in for SQLite, Stripe replaced with a license-key check, and the training queue routed to either your own Modal account or our shared cloud (your call, per your security review). Customer data lives in the customer's Postgres; inference traffic stays inside the VPC.

We are not pretending this is shipped today. We are saying: the code is portable. The build is execution time, not architecture work. Concretely, here is the engineering decomposition for one senior engineer doing one pilot:

  • Days 1–2: Dockerfile + docker-compose for backend + Postgres + Caddy/nginx.
  • Days 3–5: SQLite→Postgres swap (the queries are mostly portable; a handful of WITHOUT ROWID-style and Turso-specific call sites need rewriting).
  • Days 6–8: Replace Stripe webhook + add_credits paths with a license-key gate; keep _billing_log for audit, drop the customer-facing wallet UI.
  • Days 9–11: Modal-account routing — either pin training to the customer's own Modal token (env var) or keep training on our shared cloud (their choice, written into the pilot SOW).
  • Days 12–14: Smoke test inside the customer VPC; sign-off; hand over runbook and rotate the license key.

That is a real 2–3 week clock for one focused engineer. We will quote against this decomposition, with milestones and a fixed price per pilot.

If you have a near-term enterprise need, we want to hear about it. Concretely:

  • Need VPC deployment in 2–3 weeks — this is the path above. We will sign a pilot SOW with explicit scope, milestones, and price.
  • Want BYO-Modal-account training only — the smaller engagement; we route your training queue to your Modal token without the full VPC backend ship.
  • Want to evaluate Hybrid Export first — that is exactly what it is for. Train on our cloud, download the adapter, run inference inside your own VPC. Most enterprise data-handling concerns at the inference boundary are satisfiable today with Hybrid Export plus a no-train pledge.

Enterprise contact

Email modelbrewai@gmail.com with your target deployment model (cloud / hybrid / VPC pilot / dedicated Modal), expected volume, and any compliance requirements (SOC 2, HIPAA, FedRAMP, ITAR, etc.). We will respond within 5 business days with a realistic scope, timeline, and price — or a clear "this is not a fit yet."

Honest line in the sand.

The VPC pilot above is a real 2–3 week ship. These adjacent things are not in that window and we will not pretend they are:

  • Fully air-gapped install. The judge LLM in our cleaner pipeline calls Gemini. A truly air-gapped install needs a local-only judge model swap, which is a separate engagement (and probably a different cleaner architecture).
  • SOC 2 Type II certification. Type II is a 6–12 month observation window with a third-party auditor. It is funded and started after the first enterprise contract, not on spec.
  • HIPAA BAA. The BAA process and PHI-isolation engineering are funded by the first enterprise healthcare contract; that customer becomes the design partner for the BAA.
  • Multi-region or HA replication. The VPC pilot ships single-region, single-Postgres. Active/standby and cross-region replication are a separate phase.
  • FedRAMP / IL5 / classified networks. Different compliance regime entirely. Email us if this is your need; we will be honest about what we can and cannot do today.

Everything in the "Available now" and "VPC pilot · 2–3 weeks" cards above — managed cloud training, OpenAI-compatible inference endpoint, adapter ZIP download via /download/{run_id}, README export, the security retention schedule, the no-train pledge, and the per-pilot VPC engineering decomposition — is in the codebase today (or honest engineering time, not architecture work). Cited on the security page. Verify each one against the live API.