Supervised fine-tuning, DPO, and SimPO on Mistral, Llama, Saul, Qwen, Gemma. LoRA + QLoRA. Flat $3.99 per million tokens across every paid model. No reference model required for SimPO. No RLHF reward model. No infrastructure to manage.
Pick any. Train any. Same per-token rate. TinyLlama is free for prototyping (3 runs/day). Everything else is a flat $3.99/M tokens for fine-tuning. Continual learning is in closed beta — request access.
Free
TinyLlama-1.1B
Free tier — 3 runs per day. Great for prototyping training pipelines and testing your data.
Pro
Mistral-7B
The benchmark workhorse. Most of our continual-learning research is validated on Mistral-7B-Instruct.
Pro
Llama-3.1-8B
Among the most downloaded base models on Hugging Face. Strong instruction-following out of the box.
Pro
Saul-7B
Legal-domain specialist. We've validated Saul-7B on all 18 of the legal sub-domains we benchmark.
Pro
Qwen3-8B
Strong multilingual capabilities. Good fit for non-English fine-tunes and code.
Pro
Gemma-2-9B
Google's open weights. Validated on our 5-domain continual-learning benchmark.
Quick Start
Three lines to your first training run.
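For illustration, those three lines could look like this in Python. Note the modelbrew package name, the finetune() call, and its parameters are assumptions made for this sketch, not the documented client API:

    import modelbrew                                                     # hypothetical client package
    job = modelbrew.finetune(model="mistral-7b", dataset="train.jsonl")  # assumed call; SFT by default
    print(job.wait())                                                    # assumed: blocks until the hosted run finishes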
Or skip the API entirely and use the web UI — no code needed.
If you have preference pairs — a prompt with one chosen response and one rejected response — you can align an open-source model directly. No reward model. No PPO loop. No RLHF infrastructure. ModelBrew runs both Direct Preference Optimization (DPO) and SimPO end-to-end on the same hosted GPU pipeline as supervised fine-tuning.
Direct Preference Optimization (DPO)
DPO (Rafailov et al., 2023) treats the language model itself as the reward model. You hand it pairs of (prompt, chosen, rejected) and it adjusts log-probabilities so the model prefers chosen responses, regularized by a frozen reference copy of the base model.
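For reference, the objective from the DPO paper, where π_ref is the frozen reference copy and σ is the logistic function:

$$\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]$$

Here y_w is the chosen response and y_l the rejected one; larger β makes the implicit reward separate the pair more sharply.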
Use DPO when:
You are reproducing an existing DPO baseline.
You need the sigmoid, IPO, or hinge loss variants.
You want the implicit-reward formulation paired with a reference model for stability.
How to run it:
Upload a JSONL with prompt / chosen / rejected fields to the Dataset Optimizer, one pair per line, as in the sample below.
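A sample row in the expected shape (the field names come from this page; the content itself is an invented illustration):

    {"prompt": "Summarize the termination clause in two sentences.", "chosen": "Either party may terminate on 30 days' written notice. Obligations accrued before termination survive.", "rejected": "It can be terminated."}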
SimPO (Simple Preference Optimization)
SimPO (Meng, Xia & Chen, Princeton, 2024) replaces DPO's frozen reference model with the length-normalized average log-probability of the response itself. No reference model. No KL term against a frozen copy. Two hyperparameters: β (reward sharpness) and γ (target margin).
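And the reference-free objective from the SimPO paper, where |y| is the response length in tokens:

$$\mathcal{L}_{\mathrm{SimPO}}(\pi_\theta) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\left[\log \sigma\!\left(\frac{\beta}{|y_w|} \log \pi_\theta(y_w \mid x) - \frac{\beta}{|y_l|} \log \pi_\theta(y_l \mid x) - \gamma\right)\right]$$

Dividing by |y| is what removes the incentive to win the margin by simply generating longer responses.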
Why SimPO:
One model in memory at training time instead of two — lower GPU footprint (see the rough arithmetic after this list).
Length-normalized loss reduces the verbosity bias that vanilla DPO can introduce.
Strong reported scores on AlpacaEval 2 and Arena-Hard in the published paper.
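Roughly what dropping the reference copy saves, as back-of-envelope arithmetic (bf16 weights only, ignoring activations and optimizer state):

    # Back-of-envelope: a bf16 reference copy of a 7B model, weights only.
    params = 7e9
    bytes_per_param = 2                      # bf16
    gb = params * bytes_per_param / 2**30
    print(f"~{gb:.0f} GB")                   # ~13 GB of GPU memory freed by skipping pi_ref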
When to choose SimPO over DPO: new training runs where you don't need to match an existing DPO baseline, you want lower training cost, or your data shows a length skew between chosen and rejected responses.
Match SimPO paper recipe: a toggle on the train page constrains γ and β to the published configuration range (γ/β ≤ 1.0, β ≥ 0.5). Across the six configs in the paper, the maximum published γ/β ratio is 0.8, so the strict bound keeps you inside that envelope with headroom to spare. Toggle it off to explore beyond the paper.
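A sketch of the bound that toggle enforces; the function here is illustrative, not ModelBrew's actual validation code:

    # Illustrative only: mirrors the strict bound described above.
    def within_paper_recipe(beta: float, gamma: float) -> bool:
        return beta >= 0.5 and gamma / beta <= 1.0

    print(within_paper_recipe(beta=2.0, gamma=1.6))  # True:  gamma/beta = 0.8, the max published ratio
    print(within_paper_recipe(beta=2.0, gamma=2.5))  # False: gamma/beta = 1.25, outside the envelope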
Do I still need a reward model or RLHF infrastructure?
No. Both DPO and SimPO learn directly from preference pairs. There is no separate reward model and no PPO loop. SimPO additionally drops the frozen reference model that vanilla DPO relies on.
What data format does ModelBrew expect?
JSONL with prompt, chosen, and rejected fields per row. The Dataset Optimizer auto-detects pair-shaped data and runs structural validators plus a judge-based polarity sample before training.
How is preference fine-tuning priced?
Same flat rate as supervised fine-tuning: $3.99 per million training tokens, billed against prepaid credits with no subscription. Prompt plus chosen plus rejected tokens are counted across epochs. A token estimate is shown before you click train. Failed runs are auto-refunded.
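As a quick sketch of that billing math (the token counts below are made-up example numbers; the train page shows your real estimate before you commit):

    # Made-up example: 120K pair tokens per epoch, 3 epochs, at the flat rate.
    pair_tokens = 120_000               # prompt + chosen + rejected tokens, whole dataset, one epoch
    epochs = 3
    cost = pair_tokens * epochs / 1_000_000 * 3.99
    print(f"${cost:.2f}")               # $1.44, deducted from prepaid credits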
Can I combine preference fine-tuning with the CRMA backbone?
Yes. Both DPO and SimPO runs train a fresh per-task LoRA adapter, which is what CRMA wraps for continual learning. See the continual learning page for how new domains stack onto the same model without overwriting prior fine-tunes.
When should I pick SFT instead of DPO or SimPO?
If your data is single-response (prompt → one ideal answer), SFT is the right choice and the dropdown defaults to it. DPO and SimPO need pair data — the train page hides the preference selector when no pairs are detected.
Pricing
Pay only for what you use.
No subscriptions. Sign up and get 75 credits free ($7.50). Load $20 in credits when you're ready. 3 free runs per day on TinyLlama.
All 7–9B models (Mistral-7B, Llama-3.1-8B, Saul-7B, Qwen3-8B, Gemma-2-9B)
Fine-Tuning (SFT)
$3.99 / M tokens
DPO (preference fine-tuning)
$3.99 / M tokens
SimPO (reference-free preference)
$3.99 / M tokens
Continual Learning
Closed beta — contact us
Clean with AI (Dataset Optimizer)
50 credits per 200 rows
Credits & Balance
Minimum credit purchase
$20
Credits roll over
Never expire
Failed jobs
Auto-refunded
Example: Fine-tune Mistral-7B on 500 medical Q&A pairs
Estimated tokens
~135K tokens
Rate (Fine-Tuning)
$3.99 / M tokens
Computed cost
$0.54
Deducted from balance
$0.54
Example: Continual learning on Mistral-7B — 5 domains
Continual learning is currently in closed beta and not available for self-serve purchase. Request access if you'd like to evaluate it on your data.
Refund Policy: If a training job fails due to a system error, your credits are automatically refunded — no action needed. Unused credits are non-refundable and non-transferable. All payments processed securely by Stripe — we never see your card details. By purchasing credits, you agree to our Terms of Service.
Security
Designed for regulated-industry workflows.
Practical security defaults — encryption, RBAC, audit logging, and security headers — useful for teams handling sensitive data.
Encryption at Rest
All model checkpoints and training data encrypted at rest with Fernet authenticated encryption (AES-128-CBC with HMAC-SHA256). Secure delete enabled; no residual data left on disk.
Security Headers
HSTS, X-Frame-Options DENY, Content-Type nosniff, XSS protection, strict Referrer-Policy, and Permissions-Policy on every response.
Audit Logging
Every API call logged with user, action, IP, and timestamp. Full audit trail for compliance reviews and incident response.
Role-Based Access
RBAC with granular permissions. Admin, user, and read-only roles. API keys separated from session tokens.
GDPR & Data Rights
One-click data export and account deletion. Your data, your control. Designed to support GDPR data-subject rights.
Hardened Runtime
Non-root containers, health checks, safe model loading (no arbitrary code execution), and sanitized error responses.
On-Premises Deployment Available
Need to keep data inside your network? ModelBrew ships as a Docker container you can run on your own infrastructure — air-gapped, on-prem, or in your private cloud. Same API, same results, and no data ever leaves your environment.