DPO vs SimPO in 2026: Which Preference Fine-tuning Method Should You Use?
The setup: alignment without RLHF
For a long time, "aligning" a language model to human preferences meant reinforcement learning from human feedback (RLHF). That pipeline has three moving parts: a base model, a separately trained reward model that scores responses, and a PPO trainer that nudges the base model to score higher under the reward model. It works, but it is brittle, hyperparameter-sensitive, and operationally expensive.
Both methods we are comparing replace that pipeline with a single training run that consumes preference pairs directly. A preference pair is one prompt with two responses: a chosen one and a rejected one. The model learns to assign higher likelihood to the chosen response than to the rejected one, full stop. No reward model. No PPO.
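Concretely, a single training row looks something like this (field names follow the prompt/chosen/rejected convention that comes up again later in this post; the content is invented):

```python
# One preference pair: a prompt plus a preferred and a dispreferred response.
# Field names (prompt / chosen / rejected) match the convention used later in this post.
pair = {
    "prompt": "Explain what a preference pair is in one sentence.",
    "chosen": "A preference pair is one prompt with two responses, where one is labeled preferred over the other.",
    "rejected": "It's just some data.",
}
```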
That is the family of methods. The two we will compare are the two you are most likely to encounter in 2026.
What is DPO?
Direct Preference Optimization was introduced by Rafailov, Sharma, Mitchell et al. (2023). It is the original "RLHF without RL" method.
The math, at a high level: DPO assumes there is some implicit reward function that the preference data follows under a Bradley-Terry model (the chosen response is preferred over the rejected one with probability given by the sigmoid of the difference in their rewards). DPO shows that if you parameterize that reward as the log-ratio between the policy and a frozen reference policy, you can optimize the preference objective directly with a simple cross-entropy-flavored loss, no PPO required.
The practical consequences:
- Two models in memory. The training run holds the trainable policy and a frozen copy of the same base model as a reference. The KL term to that reference is what keeps training stable.
- One key hyperparameter. β (beta) controls how much the policy is allowed to drift from the reference. Typical values are 0.01 to 0.5; the canonical default is 0.1.
- Loss variants exist. The original sigmoid loss is the canonical formulation. IPO (Azar et al., 2023) replaces the sigmoid with a squared loss to address regularization issues. Hinge losses also appear in the literature.
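To make the scoring rule and the β knob concrete, here is a minimal PyTorch sketch of the sigmoid-variant loss, written over precomputed per-response log-probability sums rather than as a full trainer; the function and variable names are mine, not from any library.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Sigmoid-variant DPO loss over summed response log-probabilities."""
    # Implicit rewards: beta * (policy log-prob minus reference log-prob).
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # Push chosen above rejected: -log sigmoid(reward gap).
    # (IPO swaps this for a squared loss; hinge variants use a max-margin form.)
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

# Toy batch of two pairs; the summed log-probs are invented for illustration.
loss = dpo_loss(torch.tensor([-310.0, -205.0]), torch.tensor([-330.0, -220.0]),
                torch.tensor([-320.0, -210.0]), torch.tensor([-325.0, -215.0]))
print(loss)
```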
DPO has been the default preference fine-tuning method since 2023. Most of the open-weight aligned models you see on Hugging Face — Zephyr, the Tulu series, and plenty of others — trace at least part of their post-training pipeline to it.
What is SimPO?
SimPO (Simple Preference Optimization) was introduced by Meng, Xia, and Chen at Princeton (2024). The point of SimPO is in the name: simpler than DPO, with the headline simplification being that there is no reference model.
The math swap is small but consequential. DPO's implicit reward is the log-ratio between policy and reference. SimPO's implicit reward is the length-normalized average log-probability of the response under the policy itself. No reference, no log-ratio. Just: how likely does the current model think this response is, divided by how many tokens long it is.
That swap has three effects:
- One model in memory instead of two. No frozen reference copy. Training is leaner and a bit faster on the same hardware.
- Length-normalization is built in. DPO is known to drift toward longer responses: its implicit reward is summed over tokens, so length itself can inflate the gap between chosen and rejected. SimPO divides by length, which removes that bias by construction.
- One extra hyperparameter. SimPO adds γ (gamma), a target margin between chosen and rejected rewards. The paper uses β = 2.5 and γ = 1.6 as a representative recipe.
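A matching sketch for SimPO, under the same assumptions as the DPO sketch above (precomputed log-probability sums, my own names); the only new ingredients are the division by response length and the γ margin.

```python
import torch
import torch.nn.functional as F

def simpo_loss(policy_chosen_logp, policy_rejected_logp,
               chosen_len, rejected_len, beta=2.5, gamma=1.6):
    """SimPO loss: length-normalized policy log-probability, no reference model."""
    # Implicit rewards: beta * average per-token log-probability under the policy.
    chosen_reward = beta * policy_chosen_logp / chosen_len
    rejected_reward = beta * policy_rejected_logp / rejected_len
    # Chosen must beat rejected by at least the target margin gamma.
    return -F.logsigmoid(chosen_reward - rejected_reward - gamma).mean()

# Toy batch: invented summed log-probs plus response lengths in tokens.
loss = simpo_loss(torch.tensor([-310.0, -205.0]), torch.tensor([-330.0, -220.0]),
                  torch.tensor([200.0, 150.0]), torch.tensor([250.0, 140.0]))
print(loss)
```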
The SimPO paper reports strong results on AlpacaEval 2 and Arena-Hard against DPO baselines. Independent reproductions have largely held up, though as with any new method the relative ranking depends on data and base model. We will not litigate the rankings here — the interesting question for a practitioner is not "which one wins on this benchmark" but "which one fits my data and budget."
The math, side by side
Both methods share the same structure: take a preference pair, score chosen and rejected, optimize a loss that pushes chosen above rejected. They differ only in how each response is scored.
DPO scoring. Score(response) = β · [log π_policy(response | prompt) − log π_reference(response | prompt)]. The training objective is −log σ(Score(chosen) − Score(rejected)).
SimPO scoring. Score(response) = (β / |response|) · log π_policy(response | prompt). The training objective is −log σ(Score(chosen) − Score(rejected) − γ).
Two practical things fall out of the comparison. First, DPO's reference model is what regularizes the training run; remove it and you would need something else to keep the policy honest. SimPO's "something else" is the γ margin term. Second, SimPO's division by response length is what makes it robust to length bias and is also what makes β need a different range from DPO — SimPO's β sits around 2 to 5 because it is multiplying an average log-probability rather than a sum.
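To see the scale difference behind that last point, here is the arithmetic with invented numbers:

```python
# Invented numbers: a 200-token response whose tokens average log-prob -1.5
# under the policy and -1.6 under a frozen reference.
tokens = 200
policy_logp = -1.5 * tokens   # summed log-probability under the policy   (-300)
ref_logp = -1.6 * tokens      # summed log-probability under the reference (-320)

# DPO scales the summed log-ratio, which grows with length, so a small beta
# already yields a usable score: 0.1 * (-300 - (-320)) = 2.0
dpo_score = 0.1 * (policy_logp - ref_logp)

# SimPO scales the per-token average, which stays near -1.5 whatever the
# length, so beta needs to be an order of magnitude larger: 2.5 * (-300 / 200) = -3.75
simpo_score = 2.5 * (policy_logp / tokens)

print(round(dpo_score, 3), round(simpo_score, 3))  # 2.0 -3.75
```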
When to pick which
| If your situation is… | Pick |
|---|---|
| You are reproducing or extending an existing DPO baseline | DPO |
| You need IPO or hinge loss variants specifically | DPO |
| You are starting a fresh preference run on new data | SimPO |
| Your data has a length skew between chosen and rejected | SimPO |
| You want lower training memory and a faster run | SimPO |
| You want to stay close to the original Rafailov-style formulation | DPO |
| You need both methods to A/B compare on your own data | Both |
None of these are absolutes. They are the tendencies you will see in the literature and in practice. If you have time, run both. If you do not, SimPO is a defensible default for fresh runs in 2026 and DPO is a defensible default for reproducing existing work.
What ModelBrew offers, just the facts
ModelBrew runs both methods through the same hosted training pipeline as supervised fine-tuning. There is no separate billing tier, no separate API. Just a method dropdown on the /train page that shows up when the dataset optimizer detects pair-shaped data (rows with prompt, chosen, rejected fields).
Models supported. Mistral-7B, Llama-3.1-8B, Saul-7B, Qwen3-8B, Gemma-2-9B for paid runs. TinyLlama-1.1B is on the free tier (3 runs per day) for prototyping.
DPO knobs. β (default 0.1), loss variant (sigmoid, IPO, hinge).
SimPO knobs. β (default 2.5), γ (default 1.6). A "Match SimPO paper recipe" toggle constrains the γ/β ratio to the range published across the paper's six configurations — the maximum γ/β ratio among them is 0.8, so the toggle keeps you inside that envelope. Toggle it off to explore beyond the paper.
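If you want the same guardrail outside the product, the check is a single ratio; the 0.8 ceiling is the figure quoted above, and the function name is mine:

```python
def within_paper_envelope(beta: float, gamma: float, max_ratio: float = 0.8) -> bool:
    """True when gamma/beta stays at or below the ratio ceiling described above."""
    return gamma / beta <= max_ratio

print(within_paper_envelope(beta=2.5, gamma=1.6))  # 1.6 / 2.5 = 0.64 -> True
```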
Output. A PEFT-compatible LoRA adapter ZIP. Plug it into the Transformers library, serve it through the OpenAI-compatible inference endpoint, or load it side-by-side with adapters from prior runs through the continual-learning engine.
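Loading the unzipped adapter locally looks roughly like the following; the base-model ID and adapter directory are placeholders, and exact Transformers/PEFT versions may move details around:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Placeholders: the base model your run was trained against and the
# directory you unzipped the adapter into.
base_id = "mistralai/Mistral-7B-v0.1"
adapter_dir = "./my-preference-adapter"

tokenizer = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForCausalLM.from_pretrained(base_id)
model = PeftModel.from_pretrained(base, adapter_dir)  # attach the LoRA weights

inputs = tokenizer("Explain DPO in one sentence.", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```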
Pricing. Flat $3.99 per million training tokens for both DPO and SimPO — the same rate as supervised fine-tuning. Tokens are counted from the prompt plus chosen plus rejected fields, multiplied by epochs. A token estimate is shown before you click train. Failed runs are auto-refunded. Credits are prepaid; the minimum top-up is $20. Full pricing breakdown.
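For planning purposes, a back-of-the-envelope version of that counting rule takes a few lines; the words-to-tokens ratio below is a crude guess, and the real estimate uses the model's tokenizer:

```python
def estimate_cost_usd(rows, epochs=1, rate_per_million=3.99, tokens_per_word=1.3):
    """Rough training-cost estimate: prompt + chosen + rejected tokens, times epochs.

    tokens_per_word is a crude proxy; the actual count comes from the model's
    tokenizer and is shown on the train page before you start a run.
    """
    words = sum(
        len(f'{r["prompt"]} {r["chosen"]} {r["rejected"]}'.split())
        for r in rows
    )
    tokens = words * tokens_per_word * epochs
    return tokens / 1_000_000 * rate_per_million

# Invented example: 10,000 pairs averaging ~400 words each, 2 epochs.
rows = [{"prompt": "p " * 100, "chosen": "c " * 200, "rejected": "r " * 100}] * 10_000
print(f"${estimate_cost_usd(rows, epochs=2):.2f}")  # prints $41.50
```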
Three practical tips before you train
Clean your pairs first
Both methods are sensitive to label noise. If a meaningful fraction of your pairs have the chosen and rejected responses inverted — or if chosen and rejected are near-duplicates with no real signal — the training objective is fighting itself. Run a structural pair audit and a judge-based polarity sample before you train. ModelBrew's dataset optimizer ships both, but the underlying validators are cheap to run anywhere.
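A minimal structural audit is easy to write yourself; the sketch below flags empty fields and identical or near-identical pairs, with thresholds that are illustrative rather than canonical (the judge-based polarity sample needs an LLM and is not shown):

```python
def audit_pairs(rows):
    """Flag structurally suspect preference pairs before training."""
    issues = []
    for i, row in enumerate(rows):
        prompt = row.get("prompt", "").strip()
        chosen = row.get("chosen", "").strip()
        rejected = row.get("rejected", "").strip()
        if not prompt or not chosen or not rejected:
            issues.append((i, "missing or empty field"))
        elif chosen == rejected:
            issues.append((i, "chosen and rejected are identical"))
        elif chosen[:50] == rejected[:50]:
            # Illustrative near-duplicate heuristic: same first 50 characters.
            issues.append((i, "near-duplicate pair, likely little signal"))
    return issues

rows = [
    {"prompt": "Q1", "chosen": "A good answer.", "rejected": "A good answer."},
    {"prompt": "Q2", "chosen": "Helpful and specific.", "rejected": ""},
]
for index, reason in audit_pairs(rows):
    print(index, reason)
```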
Watch the length distribution
For DPO specifically, look at the average length of your chosen vs rejected responses. If chosen responses are systematically longer, DPO will learn that "longer = better" as a shortcut. Either rebalance the data or pick SimPO, which divides by length explicitly.
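Checking the skew takes a few lines; character counts are a reasonable proxy if you do not want to tokenize, and the sample data is invented:

```python
def length_skew(rows):
    """Average chosen vs rejected lengths, in characters, to spot a length bias."""
    chosen_avg = sum(len(r["chosen"]) for r in rows) / len(rows)
    rejected_avg = sum(len(r["rejected"]) for r in rows) / len(rows)
    return chosen_avg, rejected_avg

# Invented sample; point this at your real pairs.
sample = [
    {"chosen": "a long, detailed, multi-sentence answer " * 4, "rejected": "a short answer"},
    {"chosen": "another long answer with lots of caveats " * 6, "rejected": "nope"},
]
chosen_avg, rejected_avg = length_skew(sample)
print(f"chosen avg {chosen_avg:.0f} chars, rejected avg {rejected_avg:.0f} chars, "
      f"ratio {chosen_avg / rejected_avg:.1f}x")
```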
Start with the published defaults
For DPO: β = 0.1, sigmoid loss, learning rate around 1e-6 for QLoRA. For SimPO: β = 2.5, γ = 1.6, same learning rate range. Resist the urge to grid-search over a dozen configurations on your first run. Get one clean baseline first, then move one knob at a time.
Bottom line
DPO and SimPO are two different cuts of the same idea: align a language model from preference pairs without a reward model and without PPO. DPO does it with a reference model; SimPO does it with a length-normalized score and an explicit margin. Both work. Pick SimPO for fresh runs and lower memory; pick DPO for reproducibility and loss-variant flexibility.
If you want to skip the infrastructure setup, both methods run end-to-end on ModelBrew's hosted training page at a flat $3.99 per million tokens. The full marketing-page write-up of the offering is on the fine-tuning page if you want the long form.
Either way: read the original papers. Rafailov et al. 2023 (DPO) and Meng et al. 2024 (SimPO) are both readable in an afternoon and worth the time before you spend any tokens.