02 · Fine-tune — SFT, DPO, SimPO — From $3.99/M tokens

Self-Serve Fine-tuning for Open-Source LLMs

Supervised fine-tuning, DPO, and SimPO on Mistral, Llama, Saul, Qwen, Gemma. LoRA + QLoRA. Flat $3.99 per million tokens across every paid model. No reference model required for SimPO. No RLHF reward model. No infrastructure to manage.


Six leading open-source LLMs. One flat rate.

Pick any. Train any. Same per-token rate. TinyLlama is free for prototyping (3 runs/day). Everything else is a flat $3.99/M tokens for fine-tuning. Continual learning is in closed beta — request access.

Free

TinyLlama-1.1B

Free tier — 3 runs per day. Great for prototyping training pipelines and testing your data.

Pro

Mistral-7B

The benchmark workhorse. Most of our continual-learning research is validated on Mistral-7B-Instruct.

Pro

Llama-3.1-8B

The most popular Hugging Face base model. Strong instruction-following out of the box.

Pro

Saul-7B

Legal-domain specialist. We've validated Saul-7B across all 18 legal sub-domains in our benchmark.

Pro

Qwen3-8B

Strong multilingual capabilities. Good fit for non-English fine-tunes and code.

Pro

Gemma-2-9B

Google's open weights. Validated on our 5-domain continual-learning benchmark.


Three lines to your first training run.

Or skip the API entirely and use the web UI — no code needed.

# Fine-tune with CRMA — near-zero forgetting
import requests

response = requests.post(
    "https://fourwheels2512--crma-finetune-fastapi-app.modal.run/start_run",
    data={
        "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
        "epochs": "3",
        "use_crma": "true",
    },
    files={"file": open("my_data.jsonl", "rb")},
    headers={"Authorization": "Bearer YOUR_TOKEN"},
)
print(response.json())
# {"run_id": "abc123", "status": "running", "model": "TinyLlama-1.1B"}
# → Downloads: crma_adapter.zip (PEFT-compatible, plug into Transformers)

Works with any JSONL dataset.


DPO & SimPO on open-source LLMs.

If you have preference pairs — a prompt with one chosen response and one rejected response — you can align an open-source model directly. No reward model. No PPO loop. No RLHF infrastructure. ModelBrew runs both Direct Preference Optimization (DPO) and SimPO end-to-end on the same hosted GPU pipeline as supervised fine-tuning.

Direct Preference Optimization (DPO)

DPO (Rafailov et al., 2023) treats the language model itself as the reward model. You hand it pairs of (prompt, chosen, rejected) and it adjusts log-probabilities so the model prefers chosen responses, regularized by a frozen reference copy of the base model.

Use DPO when:

  • You are reproducing an existing DPO baseline.
  • You need sigmoid, ipo, or hinge loss variants.
  • You want the implicit-reward formulation paired with a reference model for stability.
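For intuition, the sigmoid-variant DPO objective for a single preference pair can be sketched in a few lines of plain Python. This is a minimal sketch of the published loss, not ModelBrew's training code; inputs are assumed to be summed log-probabilities of each full response under the trained policy and the frozen reference model.

```python
import math

def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """Per-pair DPO loss, sigmoid variant (Rafailov et al., 2023)."""
    # Implicit reward: beta times the log-ratio against the frozen reference.
    chosen_reward = beta * (policy_chosen_lp - ref_chosen_lp)
    rejected_reward = beta * (policy_rejected_lp - ref_rejected_lp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)): shrinks as the policy prefers the chosen response.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When policy and reference agree exactly, the margin is zero and the loss is -log(0.5) ≈ 0.693; pushing probability toward the chosen response drives it down, regularized by the reference ratios.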

How to run it:

  1. Upload a JSONL with prompt / chosen / rejected on the Dataset Optimizer.
  2. Open app.modelbrew.ai/train. The pair format is auto-detected.
  3. Pick "SFT + DPO — compatibility mode" in the method dropdown. Defaults: β = 0.1, sigmoid loss.
  4. Pick a base model (Mistral-7B, Llama-3.1-8B, Saul-7B, Qwen3-8B, Gemma-2-9B).
  5. Click train. You get loss curves, evals, and a downloadable PEFT adapter ZIP.

Billed at the standard $3.99 per million training tokens. Prepaid credits, transparent metering, failed runs auto-refunded.

SimPO — reference-free preference optimization

SimPO (Meng, Xia & Chen, Princeton, 2024) replaces DPO's frozen reference model with the length-normalized average log-probability of the response itself. No reference model. No KL term against a frozen copy. Two hyperparameters: β (reward sharpness) and γ (target margin).

Why SimPO:

  • One model in memory at training time instead of two — lower GPU footprint.
  • Length-normalized loss reduces the verbosity bias that vanilla DPO is known to introduce.
  • Strong reported scores on AlpacaEval 2 and Arena-Hard in the published paper.

When to choose SimPO over DPO: new training runs where you don't need to match an existing DPO baseline, you want lower training cost, or your data shows a length skew between chosen and rejected responses.

Match the SimPO paper recipe: a toggle on the train page constrains β and γ to the published configuration range (γ/β ≤ 1.0, β ≥ 0.5). Across the six configs in the paper, the maximum published γ/β ratio is 0.8, so the strict bound keeps you inside that envelope. Toggle it off to explore beyond the paper.
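For intuition, the per-pair SimPO objective and the strict-bounds check described above can be sketched as follows. A minimal sketch for illustration, not the hosted implementation; inputs are assumed to be summed response log-probabilities and response lengths in tokens.

```python
import math

def simpo_loss(chosen_lp, rejected_lp, chosen_len, rejected_len,
               beta=2.5, gamma=1.4):
    """Per-pair SimPO loss (Meng, Xia & Chen, 2024).

    The implicit reward is the length-normalized average log-probability
    of the response itself; no frozen reference model is involved.
    """
    chosen_reward = beta * chosen_lp / chosen_len
    rejected_reward = beta * rejected_lp / rejected_len
    # Target margin gamma: chosen must beat rejected by at least gamma.
    margin = chosen_reward - rejected_reward - gamma
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

def within_paper_recipe(beta, gamma):
    """Strict-bounds toggle: gamma/beta <= 1.0 and beta >= 0.5."""
    return beta >= 0.5 and gamma / beta <= 1.0
```

Note that dividing by response length is what removes the incentive to inflate rewards simply by generating longer responses.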

Same flat $3.99 per million training tokens as DPO and SFT. Run it from app.modelbrew.ai/train.

DPO vs SimPO — quick reference

Reference model required: DPO yes · SimPO no
Reward model / PPO: neither method needs one
Hyperparameters exposed: DPO β + loss type · SimPO β + γ
Length normalization: DPO no · SimPO yes
Default β: DPO 0.1 · SimPO 2.5
Loss variants: DPO sigmoid / IPO / hinge
Models supported: Mistral, Llama, Saul, Qwen, Gemma
Price: $3.99 / M tokens (both)

Preference fine-tuning FAQ

Do I need to run RLHF or train a reward model?

No. Both DPO and SimPO learn directly from preference pairs. There is no separate reward model and no PPO loop. SimPO additionally drops the frozen reference model that vanilla DPO relies on.

What data format does ModelBrew expect?

JSONL with prompt, chosen, and rejected fields per row. The Dataset Optimizer auto-detects pair-shaped data and runs structural validators plus a judge-based polarity sample before training.
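The structural part of that check is easy to run locally before uploading. A minimal sketch; the hosted Dataset Optimizer does more, including the judge-based polarity sample.

```python
import json

REQUIRED = {"prompt", "chosen", "rejected"}

def validate_pairs(path):
    """Return a list of structural errors in a preference-pair JSONL file:
    each non-empty line must be a JSON object with non-empty string
    prompt / chosen / rejected fields."""
    errors = []
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f, start=1):
            if not line.strip():
                continue  # skip blank lines
            try:
                row = json.loads(line)
            except json.JSONDecodeError:
                errors.append(f"line {i}: not valid JSON")
                continue
            missing = REQUIRED - row.keys()
            if missing:
                errors.append(f"line {i}: missing {sorted(missing)}")
            elif not all(isinstance(row[k], str) and row[k].strip()
                         for k in REQUIRED):
                errors.append(f"line {i}: empty or non-string field")
    return errors
```

An empty return list means the file is structurally ready to upload.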

How is preference fine-tuning priced?

Same flat rate as supervised fine-tuning: $3.99 per million training tokens, billed against prepaid credits with no subscription. Prompt plus chosen plus rejected tokens are counted across epochs. A token estimate is shown before you click train. Failed runs are auto-refunded.

Can I combine preference fine-tuning with the CRMA backbone?

Yes. Both DPO and SimPO runs train a fresh per-task LoRA adapter, which is what CRMA wraps for continual learning. See the continual learning page for how new domains stack onto the same model without overwriting prior fine-tunes.

When should I pick SFT instead of DPO or SimPO?

If your data is single-response (prompt → one ideal answer), SFT is the right choice and the dropdown defaults to it. DPO and SimPO need pair data — the train page hides the preference selector when no pairs are detected.
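That auto-detection can be approximated in a few lines. The `response` key for single-response rows is an assumption for illustration; the hosted detector may accept other shapes.

```python
def detect_method(rows):
    """Rough mirror of the train-page detection: rows carrying
    prompt/chosen/rejected enable the preference selector (DPO or
    SimPO); anything else falls back to SFT."""
    pair_keys = {"prompt", "chosen", "rejected"}
    if rows and all(pair_keys <= row.keys() for row in rows):
        return "preference"
    return "sft"
```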


Pay only for what you use.

No subscriptions. Sign up and get 75 credits free ($7.50). Load $20 in credits when you're ready. 3 free runs per day on TinyLlama.

Free
$0
75 credits at signup · no card
  • 75 credits free at signup ($7.50)
  • 3 runs per day on TinyLlama-1.1B
  • Fine-tuning mode
  • Download adapter ZIPs
  • Real-time training progress
Get Started

All 7–9B models (Mistral-7B, Llama-3.1-8B, Saul-7B, Qwen3-8B, Gemma-2-9B)

Fine-Tuning (SFT): $3.99 / M tokens
DPO (preference fine-tuning): $3.99 / M tokens
SimPO (reference-free preference): $3.99 / M tokens
Continual Learning: closed beta — contact us
Clean with AI (Dataset Optimizer): 50 credits per 200 rows

Credits & Balance

Minimum credit purchase: $20
Credits roll over: never expire
Failed jobs: auto-refunded

Example: Fine-tune Mistral-7B on 500 medical Q&A pairs

Estimated tokens: ~135K
Rate (Fine-Tuning): $3.99 / M tokens
Computed cost: $0.54
Deducted from balance: $0.54
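The arithmetic behind this example is a one-liner: tokens times the flat rate per million, rounded to the cent.

```python
RATE_PER_M = 3.99  # flat fine-tuning rate, dollars per million training tokens

def estimate_cost(total_tokens, rate_per_m=RATE_PER_M):
    """Dollar cost of a run, rounded to the cent."""
    return round(total_tokens * rate_per_m / 1_000_000, 2)

# The worked example above: ~135K training tokens at $3.99/M.
print(estimate_cost(135_000))  # → 0.54
```

Remember that for DPO and SimPO runs, prompt plus chosen plus rejected tokens are all counted, across every epoch.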

Example: Continual learning on Mistral-7B — 5 domains

Continual learning is currently in closed beta and not available for self-serve purchase. Request access if you'd like to evaluate it on your data.

Refund Policy: If a training job fails due to a system error, your credits are automatically refunded — no action needed. Unused credits are non-refundable and non-transferable. All payments processed securely by Stripe — we never see your card details. By purchasing credits, you agree to our Terms of Service.


Designed for regulated-industry workflows.

Practical security defaults — encryption, RBAC, audit logging, and security headers — useful for teams handling sensitive data.

Encryption at Rest

All model checkpoints and training data encrypted with AES-256 (Fernet). Secure delete enabled — no residual data on disk.

Security Headers

HSTS, X-Frame-Options DENY, Content-Type nosniff, XSS protection, strict Referrer-Policy, and Permissions-Policy on every response.

Audit Logging

Every API call logged with user, action, IP, and timestamp. Full audit trail for compliance reviews and incident response.

Role-Based Access

RBAC with granular permissions. Admin, user, and read-only roles. API keys separated from session tokens.

GDPR & Data Rights

One-click data export and account deletion. Your data, your control. Full compliance with data protection regulations.

Hardened Runtime

Non-root containers, health checks, safe model loading (no arbitrary code execution), and sanitized error responses.

On-Premises Deployment Available

Need to keep data inside your network? ModelBrew ships as a Docker container you can run on your own infrastructure — air-gapped, on-prem, or in your private cloud. Same API, same results, zero data leaves your environment.

Contact us for on-prem pricing →

Ship a fine-tune today.

Free tier: 3 runs per day on TinyLlama. Pro: $20 to start, $3.99/M tokens after.