Supervised fine-tuning, DPO, and SimPO on Mistral, Llama, Saul, Qwen, Gemma. LoRA + QLoRA. Flat $3.99 per million tokens across every paid model. No reference model required for SimPO. No RLHF reward model. No infrastructure to manage.
Pick any. Train any. Same per-token rate. TinyLlama is free for prototyping (3 runs/day). Everything else is a flat $3.99/M tokens for fine-tuning. Continual learning is in closed beta — request access.
Free
TinyLlama-1.1B
Free tier — 3 runs per day. Great for prototyping training pipelines and testing your data.
Pro
Mistral-7B
The benchmark workhorse. Most of our continual-learning research is validated on Mistral-7B-Instruct.
Pro
Llama-3.1-8B
Among the most downloaded base models on Hugging Face. Strong instruction-following out of the box.
Pro
Saul-7B
Legal-domain specialist. We've validated Saul-7B on all 18 of the legal sub-domains we benchmark.
Pro
Qwen3-8B
Strong multilingual capabilities. Good fit for non-English fine-tunes and code.
Pro
Gemma-2-9B
Google's open weights. Validated on our 5-domain continual-learning benchmark.
Quick Start
Three lines to your first training run.
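For illustration, those three lines could look like this in Python. Note the modelbrew package name, the finetune() call, and its parameters are assumptions made for this sketch, not the documented client API:

    import modelbrew                                                     # hypothetical client package
    job = modelbrew.finetune(model="mistral-7b", dataset="train.jsonl")  # assumed call; SFT by default
    print(job.wait())                                                    # assumed: blocks until the hosted run finishes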
Or skip the API entirely and use the web UI — no code needed.
If you have preference pairs — a prompt with one chosen response and one rejected response — you can align an open-source model directly. No reward model. No PPO loop. No RLHF infrastructure. ModelBrew runs both Direct Preference Optimization (DPO) and SimPO end-to-end on the same hosted GPU pipeline as supervised fine-tuning.
Direct Preference Optimization (DPO)
DPO (Rafailov et al., 2023) treats the language model itself as the reward model. You hand it pairs of (prompt, chosen, rejected) and it adjusts log-probabilities so the model prefers chosen responses, regularized by a frozen reference copy of the base model.
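For reference, the objective from the DPO paper, where π_ref is the frozen reference copy and σ is the logistic function:

$$\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]$$

Here y_w is the chosen response and y_l the rejected one; larger β makes the implicit reward separate the pair more sharply.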
Use DPO when:
You are reproducing an existing DPO baseline.
You need the sigmoid, IPO, or hinge loss variants.
You want the implicit-reward formulation paired with a reference model for stability.
How to run it:
Upload a JSONL with prompt / chosen / rejected fields to the Dataset Optimizer, one pair per line, as in the sample below.
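A sample row in the expected shape (the field names come from this page; the content itself is an invented illustration):

    {"prompt": "Summarize the termination clause in two sentences.", "chosen": "Either party may terminate on 30 days' written notice. Obligations accrued before termination survive.", "rejected": "It can be terminated."}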
SimPO (Simple Preference Optimization)
SimPO (Meng, Xia & Chen, Princeton, 2024) replaces DPO's frozen reference model with the length-normalized average log-probability of the response itself. No reference model. No KL term against a frozen copy. Two hyperparameters: β (reward sharpness) and γ (target margin).
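And the reference-free objective from the SimPO paper, where |y| is the response length in tokens:

$$\mathcal{L}_{\mathrm{SimPO}}(\pi_\theta) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\left[\log \sigma\!\left(\frac{\beta}{|y_w|} \log \pi_\theta(y_w \mid x) - \frac{\beta}{|y_l|} \log \pi_\theta(y_l \mid x) - \gamma\right)\right]$$

Dividing by |y| is what removes the incentive to win the margin by simply generating longer responses.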
Why SimPO:
One model in memory at training time instead of two — lower GPU footprint (see the rough arithmetic after this list).
Length-normalized loss reduces the verbosity bias that vanilla DPO can introduce.
Strong reported scores on AlpacaEval 2 and Arena-Hard in the published paper.
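Roughly what dropping the reference copy saves, as back-of-envelope arithmetic (bf16 weights only, ignoring activations and optimizer state):

    # Back-of-envelope: a bf16 reference copy of a 7B model, weights only.
    params = 7e9
    bytes_per_param = 2                      # bf16
    gb = params * bytes_per_param / 2**30
    print(f"~{gb:.0f} GB")                   # ~13 GB of GPU memory freed by skipping pi_ref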
When to choose SimPO over DPO: new training runs where you don't need to match an existing DPO baseline, you want lower training cost, or your data shows a length skew between chosen and rejected responses.
Match SimPO paper recipe: a toggle on the train page constrains γ and β to the published configuration range (γ/β ≤ 1.0, β ≥ 0.5). Across the six configs in the paper, the maximum published γ/β ratio is 0.8, so the strict bound keeps you inside that envelope with headroom to spare. Toggle it off to explore beyond the paper.
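A sketch of the bound that toggle enforces; the function here is illustrative, not ModelBrew's actual validation code:

    # Illustrative only: mirrors the strict bound described above.
    def within_paper_recipe(beta: float, gamma: float) -> bool:
        return beta >= 0.5 and gamma / beta <= 1.0

    print(within_paper_recipe(beta=2.0, gamma=1.6))  # True:  gamma/beta = 0.8, the max published ratio
    print(within_paper_recipe(beta=2.0, gamma=2.5))  # False: gamma/beta = 1.25, outside the envelope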
Do I still need a reward model or RLHF infrastructure?
No. Both DPO and SimPO learn directly from preference pairs. There is no separate reward model and no PPO loop. SimPO additionally drops the frozen reference model that vanilla DPO relies on.
What data format does ModelBrew expect?
JSONL with prompt, chosen, and rejected fields per row. The Dataset Optimizer auto-detects pair-shaped data and runs structural validators plus a judge-based polarity sample before training.
How is preference fine-tuning priced?
Same flat rate as supervised fine-tuning: $3.99 per million training tokens, billed against prepaid credits with no subscription. Prompt plus chosen plus rejected tokens are counted across epochs. A token estimate is shown before you click train. Failed runs are auto-refunded.
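As a quick sketch of that billing math (the token counts below are made-up example numbers; the train page shows your real estimate before you commit):

    # Made-up example: 120K pair tokens per epoch, 3 epochs, at the flat rate.
    pair_tokens = 120_000               # prompt + chosen + rejected tokens, whole dataset, one epoch
    epochs = 3
    cost = pair_tokens * epochs / 1_000_000 * 3.99
    print(f"${cost:.2f}")               # $1.44, deducted from prepaid credits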
Can I combine preference fine-tuning with the CRMA backbone?
Yes. Both DPO and SimPO runs train a fresh per-task LoRA adapter, which is what CRMA wraps for continual learning. See the continual learning page for how new domains stack onto the same model without overwriting prior fine-tunes.
When should I pick SFT instead of DPO or SimPO?
If your data is single-response (prompt → one ideal answer), SFT is the right choice and the dropdown defaults to it. DPO and SimPO need pair data — the train page hides the preference selector when no pairs are detected.
Pricing
Pay only for what you use.
No subscriptions. Sign up and get 75 credits free ($7.50). Load $20 in credits when you're ready. 3 free runs per day on TinyLlama.
All 7–9B models (Mistral-7B, Llama-3.1-8B, Saul-7B, Qwen3-8B, Gemma-2-9B)
Fine-Tuning (SFT)
$3.99 / M tokens
DPO (preference fine-tuning)
$3.99 / M tokens
SimPO (reference-free preference)
$3.99 / M tokens
Continual Learning
Closed beta — contact us
Clean with AI (Dataset Optimizer)
50 credits per 200 rows
Credits & Balance
Minimum credit purchase
$20
Credits roll over
Never expire
Failed jobs
Auto-refunded
Example: Fine-tune Mistral-7B on 500 medical Q&A pairs
Estimated tokens
~135K tokens
Rate (Fine-Tuning)
$3.99 / M tokens
Computed cost
$0.54
Deducted from balance
$0.54
Example: Continual learning on Mistral-7B — 5 domains
Continual learning is currently in closed beta and not available for self-serve purchase. Request access if you'd like to evaluate it on your data.
Refund Policy: If a training job fails due to a system error, your credits are automatically refunded — no action needed. Unused credits are non-refundable and non-transferable. All payments processed securely by Stripe — we never see your card details. By purchasing credits, you agree to our Terms of Service.
Security
Designed for regulated-industry workflows.
Practical security defaults — encryption, RBAC, audit logging, and security headers — useful for teams handling sensitive data.
Encryption at Rest
All model checkpoints and training data encrypted at rest with Fernet authenticated encryption (AES-128-CBC with HMAC-SHA256). Secure delete enabled; no residual data left on disk.
Security Headers
HSTS, X-Frame-Options DENY, Content-Type nosniff, XSS protection, strict Referrer-Policy, and Permissions-Policy on every response.
Audit Logging
Every API call logged with user, action, IP, and timestamp. Full audit trail for compliance reviews and incident response.
Role-Based Access
RBAC with granular permissions. Admin, user, and read-only roles. API keys separated from session tokens.
GDPR & Data Rights
One-click data export and account deletion. Your data, your control. Designed to support GDPR data-subject rights.
Hardened Runtime
Non-root containers, health checks, safe model loading (no arbitrary code execution), and sanitized error responses.
On-Premises Deployment Available
Need to keep data inside your network? ModelBrew ships as a Docker container you can run on your own infrastructure — air-gapped, on-prem, or in your private cloud. Same API, same results, and no data ever leaves your environment.