SimPO
Last reviewed
May 20, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 4,220 words
SimPO (Simple Preference Optimization) is an offline preference learning algorithm for aligning [[large_language_model|large language models]] with human preferences. It was introduced in May 2024 by Yu Meng, Mengzhou Xia, and Danqi Chen in the paper SimPO: Simple Preference Optimization with a Reference-Free Reward, accepted to NeurIPS 2024.[^1] Building on [[direct_preference_optimization_dpo|Direct Preference Optimization]] (DPO), SimPO modifies the loss in two ways: it replaces the reference-model-relative reward of DPO with the length-normalized average log probability of a response under the policy, and it introduces a target reward margin parameter that explicitly widens the gap between preferred and rejected responses. Removing the reference model lowers training memory and runtime, and the authors report that SimPO outperforms DPO and several variants on AlpacaEval 2, Arena-Hard, and MT-Bench across Mistral 7B, Llama 3 8B, and Gemma 2 9B configurations.[^1][^2]
By 2024, fine-tuning [[instruction_tuning|instruction-tuned]] language models with human preference data had largely shifted from full [[rlhf|reinforcement learning from human feedback]] pipelines to direct alignment algorithms. The dominant such algorithm, DPO (Rafailov et al., 2023), reparameterizes the standard [[rlhf|RLHF]] objective so that the reward is implicitly defined by a log ratio between the policy and a fixed reference model.[^3] DPO removes the need to fit a separate reward model and the need to run on-policy [[rlhf|RL]] rollouts, but it still requires holding two copies of the model in memory at training time: the trainable policy and the frozen reference.[^3]
Several variants of DPO appeared in 2023 and 2024, including Identity Preference Optimization (IPO), Kahneman-Tversky Optimization ([[kto|KTO]]), Sequence Likelihood Calibration with Human Feedback (SLiC-HF), Rank Responses to align Human Feedback (RRHF), Contrastive Preference Optimization (CPO), length-regularized DPO (R-DPO), and Odds Ratio Preference Optimization ([[orpo|ORPO]]). These methods variously modify the loss to address overfitting, length bias, or reference-model dependence.[^1] SimPO belongs to this wave of post-DPO algorithms. Its authors argue that the implicit reward used during DPO training is mismatched with the average-log-probability quantity that actually drives generation at inference time, and that this mismatch is partly responsible for length exploitation and inconsistent reward margins.[^1]
The work was produced at Princeton University, where Mengzhou Xia and Danqi Chen are affiliated with Princeton NLP and Princeton Language and Intelligence (PLI), with Yu Meng now at the University of Virginia.[^2][^4] The original arXiv preprint appeared on 23 May 2024, with subsequent revisions on 8 July 2024 and 1 November 2024 adding new baselines, Gemma 2 results, and expanded discussion of length normalization and KL regularization.[^1][^5] The paper was presented at the 38th Conference on Neural Information Processing Systems (NeurIPS 2024) in Vancouver in December 2024.[^6]
For a preference dataset (\mathcal{D} = \{(x, y_w, y_l)\}) of prompts (x), preferred responses (y_w), and dispreferred responses (y_l), the DPO loss is:
[ \mathcal{L}_{\text{DPO}}(\pi_\theta;\pi_{\text{ref}}) = -\mathbb{E}_{(x,y_w,y_l)\sim\mathcal{D}}\left[\log\sigma\left(\beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\text{ref}}(y_w\mid x)} - \beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\text{ref}}(y_l\mid x)}\right)\right] ]
where (\pi_\theta) is the trainable policy, (\pi_{\text{ref}}) is a frozen reference policy (typically the post-SFT checkpoint), (\beta) is a temperature hyperparameter, and (\sigma) is the logistic sigmoid.[^3] The implicit DPO reward for a response (y) is (r(x,y) = \beta\log\big(\pi_\theta(y\mid x)/\pi_{\text{ref}}(y\mid x)\big)), a log ratio of the policy and reference probabilities of the full sequence.[^3]
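To make the DPO objective concrete, here is a minimal sketch of the per-pair loss, assuming the sequence-level log probabilities under the policy and the reference have already been computed. The function names and the numeric inputs are illustrative assumptions, not values from the DPO or SimPO papers.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_pair_loss(policy_logp_w: float, policy_logp_l: float,
                  ref_logp_w: float, ref_logp_l: float,
                  beta: float = 0.1) -> float:
    """Per-pair DPO loss from summed (sequence-level) log probabilities."""
    reward_w = beta * (policy_logp_w - ref_logp_w)  # implicit reward of y_w
    reward_l = beta * (policy_logp_l - ref_logp_l)  # implicit reward of y_l
    return -log_sigmoid(reward_w - reward_l)


# Hypothetical log probabilities for one (x, y_w, y_l) triple.
loss_at_init = dpo_pair_loss(-40.0, -55.0, -40.0, -55.0)  # policy == reference
loss_after = dpo_pair_loss(-35.0, -60.0, -40.0, -55.0)    # policy now favors y_w
print(loss_at_init, loss_after)
```

When the policy equals the reference, both implicit rewards are zero and the loss is log 2. Each response needs a log probability from both models, which is why DPO keeps two copies in memory.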
SimPO replaces this implicit reward with the average per-token log probability of the response under the policy alone:
[ r_{\text{SimPO}}(x, y) = \frac{\beta}{|y|}\log\pi_\theta(y\mid x) = \frac{\beta}{|y|}\sum_{i=1}^{|y|}\log\pi_\theta(y_i\mid x, y_{<i}) ]
and inserts a constant target margin (\gamma > 0) into a Bradley-Terry ranking objective:
[ \mathcal{L}_{\text{SimPO}}(\pi_\theta) = -\mathbb{E}_{(x,y_w,y_l)\sim\mathcal{D}}\left[\log\sigma\left(\frac{\beta}{|y_w|}\log\pi_\theta(y_w\mid x) - \frac{\beta}{|y_l|}\log\pi_\theta(y_l\mid x) - \gamma\right)\right] ]
where (|y|) is the token length of the response and (\gamma) is the target reward margin.[^1][^7] The policy must therefore drive the (\beta)-scaled gap between the average per-token log probabilities of the chosen and rejected responses above (\gamma) (equivalently, the unscaled gap above (\gamma/\beta)) before the sigmoid argument turns positive and the per-pair loss falls below (\log 2).[^1]
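The SimPO loss for a single pair can be sketched as follows, assuming per-token log probabilities under the policy are available as plain lists. The function names, the default values of `beta` and `gamma`, and the example numbers are illustrative assumptions rather than the reference implementation.

```python
import math
from typing import Sequence


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def simpo_pair_loss(token_logps_w: Sequence[float],
                    token_logps_l: Sequence[float],
                    beta: float = 2.0, gamma: float = 1.0) -> float:
    """Per-pair SimPO loss from per-token policy log probabilities."""
    # Length-normalized rewards: beta times the average token log probability.
    reward_w = beta * sum(token_logps_w) / len(token_logps_w)
    reward_l = beta * sum(token_logps_l) / len(token_logps_l)
    return -log_sigmoid(reward_w - reward_l - gamma)


# Hypothetical responses: the chosen one is shorter but more likely per token.
chosen = [-0.5] * 4    # average log probability -0.5
rejected = [-1.0] * 8  # average log probability -1.0

loss_on_margin = simpo_pair_loss(chosen, rejected)             # scaled gap == gamma
loss_no_margin = simpo_pair_loss(chosen, rejected, gamma=0.0)  # plain Bradley-Terry
print(loss_on_margin, loss_no_margin)
```

With these inputs the scaled reward gap is exactly 1.0, so with `gamma=1.0` the loss sits at log 2, while with `gamma=0.0` the same pair already yields a lower loss. No reference log probabilities appear anywhere in the computation.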
The authors motivate two design choices.[^1] First, the average-log-probability reward matches the quantity that beam search and likelihood-based decoding actually optimize at inference, so the training reward and the generation criterion are aligned. Second, length normalization (dividing by (|y|)) decouples reward magnitude from sequence length; without it, the model could game the loss by lengthening or shortening responses to raise or lower their cumulative log probability.[^1] The margin (\gamma) generalizes the Bradley-Terry objective so that ties (and small positive differences) are penalized, forcing the model to push winning rewards above losing rewards by at least (\gamma).[^1]
Because the SimPO reward (r_{\text{SimPO}}(x,y)) depends only on (\pi_\theta), the training loop never queries (\pi_{\text{ref}}). The authors note that this eliminates the second forward pass that DPO requires through the frozen reference for both (y_w) and (y_l) on every batch, and it eliminates the need to hold (\pi_{\text{ref}}) in GPU memory during training.[^1] Empirically the paper reports that SimPO cuts training run time by roughly 20% and reduces GPU memory by about 10% relative to DPO at matched batch size on the authors' configuration.[^7]
DPO's reference model is sometimes interpreted as providing implicit [[kl_divergence|KL divergence]] regularization toward the SFT distribution; removing it raises the question of whether the trained policy will drift too far from the behavior of the SFT model. The SimPO paper addresses this empirically rather than theoretically, observing that the length-normalized average log probability reward and the explicit margin together produce policies whose response lengths and content remain comparable to SFT or DPO-trained baselines rather than collapsing or diverging.[^7] The authors also report KL-divergence trajectories during training and argue that SimPO does not exhibit pathological drift in the regimes they evaluate.[^7] Follow-up analyses, discussed in the Limitations section below, examine whether this conclusion survives more aggressive hyperparameter exploration.[^14]
A central empirical claim of the paper is that the length normalization term is what prevents SimPO from devolving into length exploitation. Without normalization, the implicit reward of a longer response can grow purely as a function of its length, biasing the model toward longer outputs that may not be substantively better.[^1] The paper reports that the Spearman correlation between response length and likelihood drops from 0.82 without length normalization to 0.34 with it, and that an ablation removing length normalization from SimPO drops AlpacaEval 2 length-controlled (LC) win rate on Mistral-Base from 21.5 to 11.9 and Arena-Hard from 16.6 to 9.4.[^7] The same ablation drops Mistral-Instruct AlpacaEval 2 LC from 32.1 to 19.1.[^7] The authors describe the length-normalization-removed variant as producing "long and repetitive patterns" rather than substantively better responses.[^7]
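A toy illustration of the length dependence described above, not a reproduction of the paper's ablation: two hypothetical responses have the same per-token log probability but different lengths. The function names and all numbers are assumptions made for the example.

```python
from typing import Sequence


def summed_reward(token_logps: Sequence[float], beta: float = 2.0) -> float:
    """Reward without length normalization: beta * log pi(y | x)."""
    return beta * sum(token_logps)


def length_normalized_reward(token_logps: Sequence[float], beta: float = 2.0) -> float:
    """SimPO-style reward: beta * average per-token log probability."""
    return beta * sum(token_logps) / len(token_logps)


# Same per-token log probability, different lengths (hypothetical values).
short_response = [-0.5] * 10
long_response = [-0.5] * 40

for name, logps in (("short", short_response), ("long", long_response)):
    print(name, summed_reward(logps), length_normalized_reward(logps))
```

The summed reward differs by a factor of four purely because of length, while the normalized reward is identical for both responses, so length alone cannot move the normalized reward.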
SimPO introduces no architectural changes; tuning is concentrated in three scalars: the learning rate, the reward scale (\beta), and the target reward margin (\gamma).[^1][^8]
The paper's general recommendation for new setups is (\beta) between 2.0 and 2.5 and (\gamma) between 0.5 and 1.5, with the caveat that performance is sensitive to these choices and that win rate is non-monotone in (\gamma): reward accuracy increases with (\gamma) while win rate first rises then falls, indicating an interior optimum.[^7]
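A small search grid over the recommended ranges could be laid out as below. The specific grid points are illustrative assumptions (endpoints and one midpoint of the stated ranges), and a learning-rate axis would normally be added; the (\gamma/\beta) ratio is reported only as a convenient way to compare settings.

```python
from itertools import product

# Grid points chosen inside the ranges quoted above (illustrative, not prescribed).
betas = [2.0, 2.5]
gammas = [0.5, 1.0, 1.5]

grid = [
    {"beta": b, "gamma": g, "gamma_over_beta": g / b}
    for b, g in product(betas, gammas)
]

for cfg in grid:
    print(cfg)
```

Because win rate is reported to be non-monotone in (\gamma), each grid point would need its own evaluation run rather than extrapolation from the largest margin.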
| Property | DPO | SimPO |
|---|---|---|
| Reference model required at training | Yes | No |
| Implicit reward per response | (\beta\log\big(\pi_\theta(y\mid x)/\pi_{\text{ref}}(y\mid x)\big)) | ((\beta/\lvert y\rvert)\log\pi_\theta(y\mid x)) |
| Length-normalized | No (by default) | Yes |
| Explicit reward margin | No | Yes, parameter (\gamma) |
| Typical (\beta) range | 0.01 to 0.1 | 2.0 to 10 |
| GPU memory during training | Two model copies | One model copy |
| Reported training runtime | Baseline | About 20% lower than DPO |
| Reported GPU memory use | Baseline | About 10% lower than DPO |
The table summarizes the relevant differences from the SimPO paper and accompanying repository.[^1][^7][^8]
The paper evaluates SimPO and seven baseline methods (SFT, DPO, IPO, KTO, [[orpo|ORPO]], R-DPO, plus RRHF and SLiC-HF in some settings) across four backbone configurations: Mistral 7B Base (with SFT on UltraChat-200k followed by alignment on UltraFeedback Binarized), Mistral 7B Instruct, [[llama_3|Llama 3]] 8B Base, and Llama 3 8B Instruct.[^1] Evaluation is on AlpacaEval 2 (length-controlled and raw win rates), Arena-Hard v0.1 win rate, and MT-Bench scored with GPT-4.[^1] A revised v3 of the paper extends the evaluation to Gemma 2 9B-it.[^7]
| Method | AlpacaEval 2 LC | AlpacaEval 2 WR | Arena-Hard WR | MT-Bench |
|---|---|---|---|---|
| SFT | 8.4% | 6.2% | 1.3% | 4.8 |
| DPO | 15.1% | 12.5% | 10.4% | 5.9 |
| IPO | 11.8% | 9.4% | 7.5% | 5.5 |
| KTO | 13.1% | 9.1% | 5.6% | 5.4 |
| ORPO | 14.7% | 12.2% | 7.0% | 5.8 |
| R-DPO | 17.4% | 12.8% | 8.0% | 5.9 |
| SimPO | 21.5% | 20.8% | 16.6% | 6.0 |
Source: SimPO paper, Table 4 (Mistral 7B Base setting).[^1][^7]
| Method | AlpacaEval 2 LC | AlpacaEval 2 WR | Arena-Hard WR | MT-Bench |
|---|---|---|---|---|
| SFT | 17.1% | 14.7% | 12.6% | 6.2 |
| DPO | 26.8% | 24.9% | 16.3% | 6.3 |
| IPO | 20.3% | 20.3% | 16.2% | 6.4 |
| KTO | 24.5% | 23.6% | 17.9% | 6.4 |
| ORPO | 24.5% | 24.9% | 20.8% | 6.4 |
| R-DPO | 27.3% | 24.5% | 16.1% | 6.2 |
| SimPO | 32.1% | 34.8% | 21.0% | 6.6 |
Source: SimPO paper, Table 4 (Mistral 7B Instruct setting).[^1][^7]
| Method | AlpacaEval 2 LC | AlpacaEval 2 WR | Arena-Hard WR | MT-Bench |
|---|---|---|---|---|
| DPO | 18.2% | 15.5% | 15.9% | 7.7 |
| IPO | 14.4% | 14.2% | 17.8% | 7.4 |
| KTO | 14.2% | 12.4% | 12.5% | 7.8 |
| ORPO | 12.2% | 10.6% | 10.8% | 7.6 |
| SimPO | 22.0% | 20.3% | 23.4% | 7.7 |
Source: SimPO paper, Table 4 (Llama 3 8B Base setting).[^1][^7]
| Method | AlpacaEval 2 LC | AlpacaEval 2 WR | Arena-Hard WR | MT-Bench |
|---|---|---|---|---|
| DPO | 40.3% | 37.9% | 32.6% | 8.0 |
| IPO | 35.6% | 35.6% | 30.5% | 8.3 |
| KTO | 33.1% | 31.8% | 26.4% | 8.2 |
| ORPO | 28.5% | 27.4% | 25.8% | 8.0 |
| SimPO | 44.7% | 40.5% | 33.8% | 8.0 |
Source: SimPO paper, Table 4 (Llama 3 8B Instruct setting).[^1][^7]
The paper summarizes its main result as: "SimPO outperforms DPO by up to 6.4 points on AlpacaEval 2 and by up to 7.5 points on Arena-Hard."[^1] More granularly, SimPO beats the best non-SimPO baseline by 3.6 to 4.8 points on AlpacaEval 2 LC win rate and by 0.2 to 6.2 points on Arena-Hard across the four backbone configurations the paper studies.[^7] Across all four setups (Mistral 7B Base, Mistral 7B Instruct, Llama 3 8B Base, Llama 3 8B Instruct), SimPO ranks first on AlpacaEval 2 LC, AlpacaEval 2 raw WR, and Arena-Hard, and is comparable on MT-Bench, where the differences between methods are small.[^1][^7]
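The "up to 6.4" and "up to 7.5" figures can be checked against the DPO and SimPO rows of the four result tables. The sketch below assumes the tables appear in the same order as the configurations are listed (Mistral 7B Base, Mistral 7B Instruct, Llama 3 8B Base, Llama 3 8B Instruct); the dictionary layout and variable names are illustrative.

```python
# (AlpacaEval 2 LC, Arena-Hard WR) for DPO and SimPO, copied from the tables above.
scores = {
    "Mistral 7B Base":     {"DPO": (15.1, 10.4), "SimPO": (21.5, 16.6)},
    "Mistral 7B Instruct": {"DPO": (26.8, 16.3), "SimPO": (32.1, 21.0)},
    "Llama 3 8B Base":     {"DPO": (18.2, 15.9), "SimPO": (22.0, 23.4)},
    "Llama 3 8B Instruct": {"DPO": (40.3, 32.6), "SimPO": (44.7, 33.8)},
}

# SimPO minus DPO, rounded to one decimal like the source tables.
gains = {
    name: tuple(round(s - d, 1) for s, d in zip(row["SimPO"], row["DPO"]))
    for name, row in scores.items()
}

max_lc_gain = max(g[0] for g in gains.values())  # largest AlpacaEval 2 LC gain
max_ah_gain = max(g[1] for g in gains.values())  # largest Arena-Hard gain

for name, (lc, ah) in gains.items():
    print(f"{name}: LC +{lc}, Arena-Hard +{ah}")
print(max_lc_gain, max_ah_gain)  # 6.4 and 7.5
```

Under that ordering assumption, the largest AlpacaEval 2 LC gain comes from the Mistral 7B Base setting and the largest Arena-Hard gain from the Llama 3 8B Base setting.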
The paper also reports that the SimPO gains do not come at the cost of inflated response lengths: SimPO outputs are comparable in length to those of the SFT model and to DPO-trained baselines, indicating that the length-normalized reward is not silently rewarding longer responses.[^7] The 44.7% AlpacaEval 2 LC score on Llama 3 8B Instruct was, at the time of v2 of the preprint (July 2024), the highest reported score on the AlpacaEval 2 leaderboard among 8B-class open-source models, surpassing some closed models including the reported number for Claude 3 Opus on the same leaderboard.[^1][^9] The model checkpoint backing that number was released as princeton-nlp/Llama-3-Instruct-8B-SimPO on [[hugging_face|Hugging Face]].[^9]
A later revision applied SimPO to google/gemma-2-9b-it and released princeton-nlp/gemma-2-9b-it-SimPO. The reported numbers are 72.4% AlpacaEval 2 LC, 65.9% raw win rate, and 59.1% on Arena-Hard, and the model ranked first on Chatbot Arena among models under 10 billion parameters as of 16 September 2024 (as recorded in v3 of the paper).[^7][^8] The baseline gemma-2-9b-it model is reported at 51.1% AlpacaEval 2 LC, so SimPO adds more than 20 absolute LC points on this backbone.[^7]
The reference implementation is released at github.com/princeton-nlp/SimPO under the MIT license, built on top of the [[hugging_face|Hugging Face]] alignment-handbook scaffolding and trained on UltraFeedback Binarized with UltraChat-200k for SFT in the "Base" setting.[^8] Training in the paper used 4xH100 GPUs with DeepSpeed ZeRO-3 and a total batch size of 128.[^8]
Released checkpoints on [[hugging_face|Hugging Face]] include:[^8][^9]
- princeton-nlp/Llama-3-Instruct-8B-SimPO (v0.1)
- princeton-nlp/Llama-3-Instruct-8B-SimPO-v0.2
- princeton-nlp/Mistral-7B-Base-SimPO
- princeton-nlp/Mistral-7B-Instruct-SimPO
- princeton-nlp/Llama-3-Base-8B-SFT-SimPO
- princeton-nlp/gemma-2-9b-it-SimPO

Sibling repositories under the same v0.2 release provide DPO, IPO, [[kto|KTO]], [[orpo|ORPO]], CPO, RRHF, SLiC-HF, and R-DPO checkpoints trained under matched conditions for fair comparison, which makes the SimPO release one of the more thorough open benchmarks of preference optimization methods.[^8]
SimPO is implemented inside the [[hugging_face|Hugging Face]] TRL library as a loss option on the CPOTrainer. The user enables SimPO by setting loss_type="simpo", cpo_alpha=0.0, and a target simpo_gamma (default 0.5) in CPOConfig.[^10] The TRL documentation explains: "SimPO is an alternative loss that adds a reward margin, allows for length normalization, and does not use BC regularization."[^10] A hybrid CPO-SimPO mode is also supported by keeping cpo_alpha nonzero alongside the SimPO loss; the project at github.com/fe1ixxu/CPO_SIMPO documents this combination.[^10]
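The settings named above can be collected as plain keyword arguments, as in the sketch below. Only `loss_type`, `cpo_alpha`, and `simpo_gamma` come from the description above; the dictionary names, the nonzero `cpo_alpha` value, and the `output_dir` string in the comment are illustrative assumptions, and the exact import path may differ between TRL versions.

```python
# Settings for the SimPO loss on TRL's CPOTrainer, as described above.
simpo_settings = {
    "loss_type": "simpo",  # select the SimPO loss
    "cpo_alpha": 0.0,      # 0.0 turns off the CPO behavior-cloning (BC) term
    "simpo_gamma": 0.5,    # target reward margin (the documented default)
}

# Hybrid CPO-SimPO: same loss, but with a nonzero cpo_alpha.
# The value 1.0 is an illustrative assumption, not a recommended setting.
cpo_simpo_settings = {**simpo_settings, "cpo_alpha": 1.0}

# With TRL installed, the dictionary would be unpacked into the config, e.g.:
#   from trl import CPOConfig
#   config = CPOConfig(output_dir="simpo-run", **simpo_settings)
print(simpo_settings)
print(cpo_simpo_settings)
```

The reward scale (\beta) and the learning rate are also set on the same config object but are left out here because suitable values depend on the base model.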
Community-distributed quantizations of the SimPO Llama 3 8B checkpoint appear on [[hugging_face|Hugging Face]] in formats such as GGUF (for example bartowski/Llama-3-Instruct-8B-SimPO-GGUF), enabling local inference through runners like [[llama_cpp|llama.cpp]] and [[ollama|Ollama]].[^11] The Princeton release also contributed checkpoints to chatbot-arena style head-to-head leaderboards where the Gemma-2-9B-it-SimPO entry ranked at the top of its size class.[^8]
SimPO sits inside a family of direct alignment algorithms that, like DPO, optimize a loss over preference pairs without an explicit reward model or on-policy [[rlhf|RL]] rollouts.[^1] The following table summarizes how the closest neighbors differ.
| Method | Reference model | Reward form | Distinguishing feature |
|---|---|---|---|
| DPO | Required | Log policy/reference ratio of full sequence | KL-style implicit constraint to reference |
| IPO | Required | Same as DPO with squared loss | Avoids overfitting via bounded loss; averaged over tokens |
| KTO | Required | Prospect-theory-derived utility | |
| ORPO | Not required | Log odds ratio combined with SFT NLL | |
| CPO | Not required | DPO-style reward with SFT regularizer | Approximates DPO without reference; used for translation |
| R-DPO | Required | DPO reward with length regularizer | Adds an explicit length penalty term |
| SimPO | Not required | Length-normalized average log probability | Adds explicit margin (\gamma); reference-free |
Sources: the cited SimPO paper and the TRL CPOTrainer documentation, which catalogs these losses as configurable options.[^1][^10]
A direct successor is AlphaPO (Gupta et al., January 2025), which leaves the SimPO loss structure intact but applies a parametric transformation (r=(1-p^{-\alpha})/\alpha) to reshape the reward function. The AlphaPO authors describe SimPO and DPO as both suffering from "likelihood displacement" (where the absolute probability of the chosen response can fall during training) and argue that the reward shape, not just its functional form, controls how strongly this happens.[^10] AlphaPO is integrated into the same TRL CPOTrainer and reports 7-10% relative gains over SimPO on Mistral 7B Instruct and Llama 3 8B Instruct.[^10]
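The transformation quoted above can be explored numerically. The sketch below treats (p) as a probability-like value in (0, 1], which is an assumption since the exact definition of (p) is not given here, and checks the standard limit that ((1-p^{-\alpha})/\alpha) approaches (\log p) as (\alpha) goes to zero. The function name is hypothetical.

```python
import math


def alpha_reward(p: float, alpha: float) -> float:
    """Evaluate r = (1 - p**(-alpha)) / alpha for 0 < p <= 1 and alpha != 0."""
    return (1.0 - p ** (-alpha)) / alpha


p = 0.2  # illustrative probability-like value
for alpha in (1.0, 0.1, 0.001):
    print(alpha, alpha_reward(p, alpha))
print("log p:", math.log(p))
```

As (\alpha) shrinks, the printed values move toward (\log p), so a log-probability reward appears as the limiting case of this family, and nonzero (\alpha) bends the reward curve away from it.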
Other follow-ups include (\alpha)-DPO (Wu et al., 2024), which generalizes SimPO's fixed margin to an adaptive instance-specific margin, and SimPER (Xiao et al., ICLR 2025), which removes hyperparameters from SimPO-style training.[^12] Reference-free multi-preference variants such as REFA (December 2024) extend the SimPO recipe to settings with more than two ranked responses per prompt.[^13]
The SimPO paper and its companion repository both flag that SimPO is sensitive to its three main hyperparameters (learning rate, (\beta), (\gamma)) and that values that work well on one base model do not transfer to others.[^1][^8] Released recipes use (\beta) values that vary by a factor of five across configurations (2.0 for Mistral-Base, up to 10 for Gemma and Llama-3-Instruct v0.2), and (\gamma/\beta) ratios from 0.1 to 0.8.[^8] Tuning therefore requires more search than DPO, where a single (\beta) around 0.01 to 0.1 is often adequate.[^8]
The most substantive critique is that SimPO's gains over DPO may be attributable largely to length normalization rather than to dropping the reference model. The paper Understanding Reference Policies in Direct Preference Optimization (Liu, Liu, and Cohan, July 2024) argues that DPO's KL constraint can be configured with a much smaller (\beta) (around 0.01) than the values reported by some SimPO baselines, and at that setting DPO becomes competitive with [[orpo|ORPO]] and other reference-free methods. The authors note that other forms of regularization remain necessary even in reference-free methods.[^14]
A related line of work introduces LN-DPO, a length-normalized variant of DPO, and reports that the reference-free SimPO and reference-dependent LN-DPO "perform similarly at their peak" once each is tuned.[^15] The implication is that length normalization, rather than reference freeness or the explicit margin, accounts for much of the gap that SimPO opens over plain DPO.[^15] The open GitHub issue Length normalization in DPO and other variants on the Princeton SimPO repository explicitly raises this question without a public resolution.[^16]
AlphaPO and contemporaneous work observe that, like DPO, SimPO can drive down the absolute probability of preferred responses during training even while the relative margin to dispreferred responses grows. The shape of the implicit reward influences how strongly this happens, and the SimPO log-probability reward is not optimal in this respect.[^10] In domains where preserving the policy's likelihood of good responses matters (for example, reasoning chains where exact phrasings matter), this can hurt downstream performance.
The headline AlpacaEval 2 and Arena-Hard numbers come from automatic LLM-as-judge benchmarks scored by GPT-4-class judges. The SimPO paper itself notes that MT-Bench scores cluster tightly across methods because of MT-Bench's small scale and single-instance scoring protocol, limiting its discriminative power.[^7] More broadly, AlpacaEval 2's length-controlled win rate corrects for some length bias but not all, and the SimPO authors acknowledge that "benchmark evaluations have limitations, including restricted query space and potential biases from model-based evaluations."[^7]
Reproducing the published numbers requires pinning specific package versions, notably alpaca-eval==0.6.2 (the repository notes that versions 0.6.3 and later changed scoring in ways that cause discrepancies).[^8] The repository also notes that exact results vary with hardware and CUDA versions, which is common in deep learning but worth flagging.[^8] The released training scripts target 4xH100 nodes; running on smaller hardware requires scaling down per-device batch size while keeping the total batch size at 128 through gradient accumulation, which can subtly alter optimization dynamics.[^8]
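The batch-size bookkeeping for smaller hardware can be sketched as below, assuming the usual relation total batch = GPUs × per-device batch × accumulation steps. The helper name and the per-device batch sizes are hypothetical; only the total of 128 and the 4-GPU setup come from the text above.

```python
def grad_accum_steps(total_batch: int, num_gpus: int, per_device_batch: int) -> int:
    """Accumulation steps needed to reach total_batch examples per optimizer step."""
    per_step = num_gpus * per_device_batch
    if per_step <= 0 or total_batch % per_step != 0:
        raise ValueError("total_batch must be a positive multiple of num_gpus * per_device_batch")
    return total_batch // per_step


# Hypothetical hardware layouts that all keep the total batch size at 128.
print(grad_accum_steps(128, num_gpus=4, per_device_batch=4))  # 8
print(grad_accum_steps(128, num_gpus=4, per_device_batch=2))  # 16
print(grad_accum_steps(128, num_gpus=1, per_device_batch=2))  # 64
```

The optimizer sees the same number of examples per update in each layout, but more accumulation steps mean more sequential micro-batches per update and therefore longer wall-clock time per step.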
A subtler concern is that "removing the reference model" is sometimes presented as a strict simplification, but SimPO compensates by introducing the margin hyperparameter (\gamma), enlarging the effective (\beta) range (which now spans a factor of five across the released setups), and demanding more careful learning-rate tuning.[^8] Where DPO has effectively one alignment-specific hyperparameter ((\beta)), SimPO has three ((\beta), (\gamma), and an alignment learning rate that often differs from the SFT learning rate). For practitioners with limited compute for hyperparameter search, this can offset the per-step memory and runtime savings.[^8]
SimPO is one of the clearest demonstrations that direct alignment can be simplified beyond DPO without obviously sacrificing quality. The combination of dropping the reference model, normalizing by length, and adding an explicit margin reduces the algorithm to a single forward pass per minibatch and one set of model weights, while keeping the loss in the same Bradley-Terry family that DPO and its variants use.[^1] That has practical consequences: smaller GPU memory footprint and faster steps make alignment feasible on more constrained hardware, and the [[hugging_face|Hugging Face]] TRL integration makes the algorithm accessible through a small configuration change.[^10]
The wider research conversation that followed SimPO sharpened the question of why preference optimization works, isolating the contributions of (a) reference-model regularization, (b) length normalization, and (c) explicit margin terms. Subsequent work that introduces length-normalized DPO variants, parametric transformations of the implicit reward (AlphaPO), and hyperparameter-free analogs (SimPER) treats SimPO as the central reference point for that decomposition, even when the conclusion is that several of SimPO's design choices interact and that pure ablation results depend on careful hyperparameter retuning of each baseline.[^14][^15][^10][^12]
In open-source instruction tuning, SimPO checkpoints became, briefly, frontier-quality entries on AlpacaEval 2 for their size class: Llama-3-Instruct-8B-SimPO was the top 8B open model on AlpacaEval 2 LC at release, and gemma-2-9b-it-SimPO topped Chatbot Arena among sub-10B models in mid-September 2024.[^7][^8] Those rankings were quickly disputed and overtaken by later checkpoints and by methodology revisions to the benchmarks themselves, but the SimPO recipe (length-normalized average log probability, explicit margin, no reference) is now a standard option in the alignment toolkit.[^10]
Related concepts:

- [[direct_preference_optimization_dpo|DPO]]
- [[kto|KTO]]
- [[orpo|ORPO]]
- [[rlhf|RLHF]]
- [[rlaif|RLAIF]]
- [[constitutional_ai|Constitutional AI]]
- [[llama_3|Llama 3]]
- [[mistral_7b|Mistral 7B]]
- [[gemma_2|Gemma 2]]
- [[alpacaeval|AlpacaEval]]
- [[arena_hard|Arena-Hard]]
- [[mt_bench|MT-Bench]]
- [[kl_divergence|KL Divergence]]
- [[hugging_face|Hugging Face]]
- [[transformers_library|Hugging Face Transformers]]
- [[instruction_tuning|Instruction Tuning]]
- [[supervised_fine-tuning|Supervised fine-tuning]]
- [[claude_3_opus|Claude 3 Opus]]