Download the adapter. Run it where your data already lives.
The single biggest reason regulated buyers reject hosted fine-tuners is "we cannot send our data to your inference endpoint." Hybrid Export answers that. You upload training data once, our cloud GPUs train the adapter, and then you take the artifact home.
The GET /download/{run_id} endpoint streams a ZIP containing the LoRA adapter directory
(PEFT format) and, for CRMA continual-learning runs, the spectrally bounded CRMA bundle. The endpoint
requires the same authenticated user that owned the run; selected-checkpoint resolution happens server-side
before the ZIP is built so you cannot accidentally download a stale or mismatched adapter.
What you get
- LoRA adapter in standard PEFT format —
adapter_config.json+adapter_model.safetensors. Loadable withPeftModel.from_pretrained()from any recentpeft+transformersinstall. - CRMA bundle (continual-learning runs only) — the shared spectrally bounded backbone adapter and per-domain modular adapters. Composable at inference time on your infrastructure.
- README export at
GET /runs/{run_id}/readme.mdwith the safe-to-share hyperparameter allow-list and a load snippet you can paste into a worker. - Your weights, your retention. Once downloaded, we no longer need to keep a copy. The cloud copy expires on the documented schedule.
Load snippet (Hugging Face Transformers)
# 1) download the bundle (auth header required) curl -H "Authorization: Bearer $MB_API_KEY" \ -o my_adapter.zip \ https://fourwheels2512--crma-finetune-fastapi-app.modal.run/download/$RUN_ID unzip my_adapter.zip -d ./my_adapter # 2) load it next to a base model on your own GPU from transformers import AutoModelForCausalLM, AutoTokenizer from peft import PeftModel base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3") tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3") model = PeftModel.from_pretrained(base, "./my_adapter/adapter") # inference traffic never leaves your VPC from here
For higher-throughput serving, the same adapter directory loads into vLLM or TGI via their respective LoRA / PEFT adapter flags. We do not maintain a custom inference runtime — everything is standard PEFT format on purpose, so you are not locked in.
The default path. We run the GPUs, you ship the model.
Cloud-managed is the fastest way from upload to a callable model. Training runs on Modal serverless GPUs
(A100), and as soon as a run finishes you get an OpenAI-compatible /v1/chat/completions
endpoint scoped to that run. No infrastructure to provision; no cold-start GPU bill if nobody is calling
your model.
Data lifecycle is documented and short by default: uploaded dataset files are swept hourly on the API host and the training-infrastructure copy is removed within 7 days. Account and billing rows persist for the lifetime of your account; you can request deletion at any time via modelbrewai@gmail.com. Concrete numbers, matched to what the code actually enforces, live on /security#retention.
If your concern is "I don't want my training data to leave my infrastructure at all" — that is what the VPC pilot path below solves. Hybrid Export answers the inference half today; the VPC pilot answers the training half on a 2–3 week clock.
VPC deployment, scoped per pilot — 2–3 weeks per enterprise customer.
Honest framing: scoped per pilot, not deferred.
The on-prem-adjacent path that ships in 2–3 weeks is a VPC deployment of the ModelBrew backend — the same FastAPI app you call today, packaged as a Docker stack, with Postgres swapped in for SQLite, Stripe replaced with a license-key check, and the training queue routed to either your own Modal account or our shared cloud (your call, per your security review). Customer data lives in the customer's Postgres; inference traffic stays inside the VPC.
We are not pretending this is shipped today. We are saying: the code is portable. The build is execution time, not architecture work. Concretely, here is the engineering decomposition for one senior engineer doing one pilot:
- Days 1–2: Dockerfile + docker-compose for backend + Postgres + Caddy/nginx.
- Days 3–5: SQLite→Postgres swap (the queries are mostly portable; a handful
of
WITHOUT ROWID-style and Turso-specific call sites need rewriting). - Days 6–8: Replace Stripe webhook + add_credits paths with a license-key gate;
keep
_billing_logfor audit, drop the customer-facing wallet UI. - Days 9–11: Modal-account routing — either pin training to the customer's own Modal token (env var) or keep training on our shared cloud (their choice, written into the pilot SOW).
- Days 12–14: Smoke test inside the customer VPC; sign-off; hand over runbook and rotate the license key.
That is a real 2–3 week clock for one focused engineer. We will quote against this decomposition, with milestones and a fixed price per pilot.
If you have a near-term enterprise need, we want to hear about it. Concretely:
- Need VPC deployment in 2–3 weeks — this is the path above. We will sign a pilot SOW with explicit scope, milestones, and price.
- Want BYO-Modal-account training only — the smaller engagement; we route your training queue to your Modal token without the full VPC backend ship.
- Want to evaluate Hybrid Export first — that is exactly what it is for. Train on our cloud, download the adapter, run inference inside your own VPC. Most enterprise data-handling concerns at the inference boundary are satisfiable today with Hybrid Export plus a no-train pledge.
Enterprise contact
Email modelbrewai@gmail.com with your target deployment model (cloud / hybrid / VPC pilot / dedicated Modal), expected volume, and any compliance requirements (SOC 2, HIPAA, FedRAMP, ITAR, etc.). We will respond within 5 business days with a realistic scope, timeline, and price — or a clear "this is not a fit yet."
Honest line in the sand.
The VPC pilot above is a real 2–3 week ship. These adjacent things are not in that window and we will not pretend they are:
- Fully air-gapped install. The judge LLM in our cleaner pipeline calls Gemini. A truly air-gapped install needs a local-only judge model swap, which is a separate engagement (and probably a different cleaner architecture).
- SOC 2 Type II certification. Type II is a 6–12 month observation window with a third-party auditor. It is funded and started after the first enterprise contract, not on spec.
- HIPAA BAA. The BAA process and PHI-isolation engineering are funded by the first enterprise healthcare contract; that customer becomes the design partner for the BAA.
- Multi-region or HA replication. The VPC pilot ships single-region, single-Postgres. Active/standby and cross-region replication are a separate phase.
- FedRAMP / IL5 / classified networks. Different compliance regime entirely. Email us if this is your need; we will be honest about what we can and cannot do today.
Everything in the "Available now" and "VPC pilot · 2–3 weeks" cards above — managed cloud
training, OpenAI-compatible inference endpoint, adapter ZIP download via /download/{run_id},
README export, the security retention schedule, the no-train pledge, and the per-pilot VPC engineering
decomposition — is in the codebase today (or honest engineering time, not architecture work). Cited on
the security page. Verify each one against the live API.