# SimPO

> Source: https://aiwiki.ai/wiki/simpo
> Updated: 2026-07-11
> Categories: AI Alignment, Large Language Models, Training & Optimization
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

**SimPO** (Simple Preference Optimization) is a reference-free offline preference learning algorithm for aligning [large language models](/wiki/large_language_model) with human preferences. It was introduced in May 2024 by Yu Meng, Mengzhou Xia, and Danqi Chen in the paper *SimPO: Simple Preference Optimization with a Reference-Free Reward*, accepted to NeurIPS 2024.[1] Building on [Direct Preference Optimization](/wiki/direct_preference_optimization_dpo) (DPO), SimPO changes the loss in two ways: it replaces DPO's reference-model-relative reward with the length-normalized average log probability of a response under the policy, and it adds a target reward margin that explicitly widens the gap between preferred and rejected responses. The paper attributes SimPO's effectiveness to "using the average log probability of a sequence as the implicit reward," a formulation that "better aligns with model generation and eliminates the need for a reference model, making it more compute and memory efficient."[1] Removing the reference model lets SimPO train with a single set of model weights, and the authors report that it outperforms DPO and several variants on AlpacaEval 2, Arena-Hard, and MT-Bench across Mistral 7B, Llama 3 8B, and Gemma 2 9B configurations. SimPO's strongest model, built on Gemma-2-9B-it, reaches a 72.4% length-controlled win rate on AlpacaEval 2 and ranked first among models under 10 billion parameters on Chatbot Arena as of mid-September 2024.[1][2]

## Background

By 2024, fine-tuning [instruction-tuned](/wiki/instruction_tuning) language models with human preference data had largely shifted from full [reinforcement learning from human feedback](/wiki/rlhf) pipelines to direct alignment algorithms. The dominant such algorithm, DPO (Rafailov et al., 2023), reparameterizes the standard [RLHF](/wiki/rlhf) objective so that the reward is implicitly defined by a log ratio between the policy and a fixed reference model.[3] DPO removes the need to fit a separate [reward model](/wiki/reward_model) and the need to run on-policy [RL](/wiki/rlhf) rollouts, but it still requires holding two copies of the model in memory at training time: the trainable policy and the frozen reference.[3]

Several related preference-optimization objectives appeared in 2023 and 2024, including Identity Preference Optimization (IPO), Kahneman-Tversky Optimization ([KTO](/wiki/kto)), Sequence Likelihood Calibration with Human Feedback (SLiC-HF), Rank Responses to Align Language Models with Human Feedback (RRHF), Contrastive Preference Optimization (CPO), length-regularized DPO (R-DPO), and Odds Ratio Preference Optimization ([ORPO](/wiki/orpo)). These methods variously modify the loss to address overfitting, length bias, or reference-model dependence.[1] SimPO belongs to this wave of post-DPO algorithms. Its authors argue that the implicit reward used during DPO training is mismatched with the average-log-probability quantity that guides generation at inference time, and that this mismatch is partly responsible for length exploitation and inconsistent reward margins.[1]

The work was produced at Princeton University, where Mengzhou Xia and Danqi Chen are affiliated with Princeton NLP and Princeton Language and Intelligence (PLI); Yu Meng, a visiting researcher with the Princeton NLP group during the project, is now a tenure-track assistant professor at the University of Virginia.[2][4] The original arXiv preprint appeared on 23 May 2024, with subsequent revisions on 8 July 2024 and 1 November 2024 adding new baselines, Gemma 2 results, and expanded discussion of length normalization and KL regularization.[1][5] The paper was presented at the 38th Conference on Neural Information Processing Systems (NeurIPS 2024) in Vancouver in December 2024.[6]

### When was SimPO released?

| Date | Milestone |
|---|---|
| 2024-05-23 | arXiv v1 preprint posted [1] |
| 2024-07-08 | arXiv v2 adds baselines and expanded analysis [5] |
| 2024-07-17 | `gemma-2-9b-it-SimPO` released, ranks #1 on the AlpacaEval 2 leaderboard with a 72.4% LC win rate [8] |
| 2024-09-16 | `gemma-2-9b-it-SimPO` recorded as #1 on Chatbot Arena among sub-10B models [7] |
| 2024-11-01 | arXiv v3 adds Gemma 2 results and expanded length-normalization discussion [7] |
| 2024-12 | Presented at NeurIPS 2024, Vancouver [6] |

## Technical details

### The DPO loss as a starting point

For a preference dataset \(\mathcal{D} = \{(x, y_w, y_l)\}\) of prompts \(x\), preferred responses \(y_w\), and dispreferred responses \(y_l\), the DPO loss is:

\[ \mathcal{L}_{\text{DPO}}(\pi_\theta;\pi_{\text{ref}}) = -\mathbb{E}_{(x,y_w,y_l)\sim\mathcal{D}}\left[\log\sigma\left(\beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\text{ref}}(y_w\mid x)} - \beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\text{ref}}(y_l\mid x)}\right)\right] \]

where \(\pi_\theta\) is the trainable policy, \(\pi_{\text{ref}}\) is a frozen reference policy (typically the post-SFT checkpoint), \(\beta\) is a temperature hyperparameter, and \(\sigma\) is the logistic sigmoid.[3] The implicit DPO reward for a response \(y\) is \(r(x,y) = \beta\log\big(\pi_\theta(y\mid x)/\pi_{\text{ref}}(y\mid x)\big)\), a log ratio of the policy and reference probabilities of the full sequence.[3]
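As a rough illustration of the formula above, the per-pair DPO loss can be computed from four summed sequence log probabilities. The sketch below is a minimal pure-Python version under stated assumptions: the function names `log_sigmoid` and `dpo_loss` and the example log-probability values are illustrative, not taken from the DPO paper or from any library.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(policy_logp_w: float, policy_logp_l: float,
             ref_logp_w: float, ref_logp_l: float, beta: float = 0.1) -> float:
    """Per-pair DPO loss from summed sequence log probabilities."""
    # Implicit rewards: beta times the log ratio of policy to reference.
    reward_w = beta * (policy_logp_w - ref_logp_w)
    reward_l = beta * (policy_logp_l - ref_logp_l)
    return -log_sigmoid(reward_w - reward_l)


# Made-up log probabilities. At initialization the policy equals the reference.
loss_at_init = dpo_loss(-42.0, -57.0, ref_logp_w=-42.0, ref_logp_l=-57.0)

# The policy has raised the chosen response and lowered the rejected one.
loss_after_update = dpo_loss(-40.0, -60.0, ref_logp_w=-42.0, ref_logp_l=-57.0)
```

When the policy and the reference agree, both implicit rewards are zero, so the loss starts at \(\log 2\); it falls as the policy raises the chosen response's log ratio relative to the rejected one. Note that every evaluation needs the reference log probabilities as inputs.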

### The SimPO loss

SimPO replaces this implicit reward with the **average per-token log probability** of the response under the policy alone:

\[ r_{\text{SimPO}}(x, y) = \frac{\beta}{|y|}\log\pi_\theta(y\mid x) = \frac{\beta}{|y|}\sum_{i=1}^{|y|}\log\pi_\theta(y_i\mid x, y_{<i}) \]

and inserts a constant target margin \(\gamma > 0\) into a Bradley-Terry ranking objective:

\[ \mathcal{L}_{\text{SimPO}}(\pi_\theta) = -\mathbb{E}_{(x,y_w,y_l)\sim\mathcal{D}}\left[\log\sigma\left(\frac{\beta}{|y_w|}\log\pi_\theta(y_w\mid x) - \frac{\beta}{|y_l|}\log\pi_\theta(y_l\mid x) - \gamma\right)\right] \]

where \(|y|\) is the token length of the response and \(\gamma\) is the target reward margin.[1][7] The argument of the sigmoid becomes positive only once the \(\beta\)-scaled gap between the average per-token log probabilities of the chosen and rejected responses exceeds \(\gamma\), that is, once the unscaled gap exceeds \(\gamma/\beta\); pairs that fall short of this margin incur a larger loss.[1]
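The SimPO objective can be sketched in a few lines once per-token log probabilities are available. The following is a minimal pure-Python illustration, not the official implementation: the helper names and the toy log-probability lists are assumptions made for the example.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def simpo_reward(token_logps: list[float], beta: float) -> float:
    """Length-normalized reward: (beta / |y|) * sum of per-token log probs."""
    return beta * sum(token_logps) / len(token_logps)


def simpo_loss(chosen_logps: list[float], rejected_logps: list[float],
               beta: float = 2.0, gamma: float = 1.0) -> float:
    """Per-pair SimPO loss: -log sigmoid(r_w - r_l - gamma)."""
    reward_gap = simpo_reward(chosen_logps, beta) - simpo_reward(rejected_logps, beta)
    return -log_sigmoid(reward_gap - gamma)


# Toy per-token log probabilities (made-up values); the responses differ in length.
chosen = [-0.2, -0.4, -0.3]           # average -0.3
rejected = [-0.9, -1.1, -1.0, -1.0]   # average -1.0

loss_with_margin = simpo_loss(chosen, rejected, beta=2.0, gamma=1.0)
loss_without_margin = simpo_loss(chosen, rejected, beta=2.0, gamma=0.0)
```

No reference log probabilities appear anywhere in the computation, and for the same pair a positive \(\gamma\) yields a larger loss than \(\gamma = 0\).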

The authors motivate the two design choices as follows.[1] The average-log-probability reward matches the quantity that guides likelihood-based decoding, such as beam search, at inference, so the training reward and the generation criterion are aligned. The division by \(|y|\) that this entails decouples reward magnitude from sequence length; without it, the summed log probability depends on the token count, and the model can game the loss by lengthening or shortening responses.[1] The margin \(\gamma\) shifts the Bradley-Terry objective so that ties and small positive differences are still penalized, pushing the model to lift winning rewards above losing rewards by at least \(\gamma\) on the \(\beta\)-scaled average-log-probability scale.[1]
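One way to see what the margin does is to look at the loss and its gradient with respect to the reward gap \(\Delta = r_w - r_l\). The derivative of \(-\log\sigma(\Delta - \gamma)\) with respect to \(\Delta\) has magnitude \(\sigma(\gamma - \Delta)\), so a positive \(\gamma\) keeps the update strong until the gap clears the margin. The sketch below uses hypothetical helper names and illustrative values.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def pair_loss(reward_gap: float, gamma: float = 0.0) -> float:
    """-log sigmoid(reward_gap - gamma), written as log(1 + exp(-(gap - gamma)))."""
    return math.log1p(math.exp(-(reward_gap - gamma)))


def push_strength(reward_gap: float, gamma: float = 0.0) -> float:
    """Magnitude of d(loss)/d(reward_gap), which equals sigmoid(gamma - gap)."""
    return sigmoid(gamma - reward_gap)


# A tie (gap of zero) costs log 2 without a margin and more than that with one.
tie_loss_plain = pair_loss(0.0)
tie_loss_margin = pair_loss(0.0, gamma=1.0)

# At a gap of 1.0 the plain objective is already relaxing, while the margin
# version (gamma = 1.0) still pushes with strength 0.5.
strength_plain = push_strength(1.0)
strength_margin = push_strength(1.0, gamma=1.0)
```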

### Why does SimPO drop the reference model?

Because the SimPO reward \(r_{\text{SimPO}}(x,y)\) depends only on \(\pi_\theta\), the training loop never queries \(\pi_{\text{ref}}\). The authors note that this eliminates the second forward pass that DPO requires through the frozen reference for both \(y_w\) and \(y_l\) on every batch, and it eliminates the need to hold \(\pi_{\text{ref}}\) in GPU memory during training.[1] Empirically the paper reports that "SimPO cuts run time by roughly 20% and reduces GPU memory usage by about 10%, thanks to eliminating forward passes with the reference model."[7]

DPO's reference model is sometimes interpreted as providing implicit [KL divergence](/wiki/kl_divergence) regularization toward the SFT distribution; removing it raises the question of whether the trained policy will drift too far from the behavior of the SFT model. The SimPO paper addresses this empirically rather than theoretically, observing that the length-normalized average log probability reward and the explicit margin together produce policies whose response lengths and content remain comparable to SFT or DPO-trained baselines rather than collapsing or diverging.[7] The authors also report KL-divergence trajectories during training and argue that SimPO does not exhibit pathological drift in the regimes they evaluate.[7] Follow-up analyses, discussed in the limitations section below, examine whether this conclusion survives more aggressive hyperparameter exploration.[14]

### Does length normalization prevent length exploitation?

A central empirical claim of the paper is that the length normalization term is what prevents SimPO from devolving into length exploitation. Without normalization, the implicit reward of a longer response can grow purely as a function of its length, biasing the model toward longer outputs that may not be substantively better.[1] The paper reports that the Spearman correlation between response length and likelihood drops from 0.82 without length normalization to 0.34 with it, with plain DPO sitting at 0.59 on the same measure.[7] An ablation that removes length normalization from SimPO lowers the AlpacaEval 2 length-controlled (LC) win rate on Mistral-Base from 21.5 to 11.9 and the Arena-Hard win rate from 16.6 to 9.4; the same ablation lowers the Mistral-Instruct AlpacaEval 2 LC win rate from 32.1 to 19.1.[7] The authors report that removing length normalization "leads to the generation of long and repetitive patterns, substantially degrading the overall quality of the output."[7]
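The dependence on length is easy to see in a toy case. In the sketch below, two hypothetical responses have identical per-token log probability but different lengths; the values are made up for illustration. The summed log probability scales with the token count, whereas the average does not.

```python
def summed_logp(token_logps: list[float]) -> float:
    """Unnormalized sequence log probability."""
    return sum(token_logps)


def average_logp(token_logps: list[float]) -> float:
    """Length-normalized (per-token) log probability, as used by SimPO."""
    return sum(token_logps) / len(token_logps)


# Same per-token quality, different lengths (illustrative values).
short_response = [-0.5] * 10
long_response = [-0.5] * 40

summed = (summed_logp(short_response), summed_logp(long_response))     # (-5.0, -20.0)
averaged = (average_logp(short_response), average_logp(long_response)) # (-0.5, -0.5)
```

With the summed quantity, a pairwise loss sees a 15-nat difference between these two responses that comes from length alone; with the averaged quantity, the difference is zero.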

### Hyperparameters

SimPO introduces no architectural changes; tuning is concentrated in three scalars:[1][8]

- **Learning rate**: described by the official repository as the most critical hyperparameter, with grid searches over values like 3e-7, 5e-7, 6e-7, 8e-7, and 1e-6 recommended.
- **\(\beta\)**: the reward scaling temperature. In SimPO it is typically much larger than in DPO; the official repository's released recipes use values from 2.0 (Mistral-Base, Llama 3 8B-Base) up to 10 (Gemma 2 9B-it, Llama 3 8B-Instruct v0.2).
- **\(\gamma\)**: the target reward margin. The authors recommend tuning by the ratio \(\gamma/\beta\) on a normalized scale between 0.1 and 0.8.[8] The default `simpo_gamma` in the Hugging Face TRL implementation is 0.5.[10]

The official repository publishes the exact recipe used for each backbone:[8]

| Setting | \(\beta\) | \(\gamma/\beta\) | Learning rate |
|---|---|---|---|
| Mistral-Base | 2.0 | 0.8 | 3e-7 |
| Mistral-Instruct | 2.5 | 0.1 | 5e-7 |
| Llama3-Base | 2.0 | 0.5 | 6e-7 |
| Llama3-Instruct | 2.5 | 0.55 | 1e-6 |
| Llama3-Instruct v0.2 | 10 | 0.3 | 1e-6 |
| Gemma2-9B-it | 10 | 0.5 | 8e-7 |

The paper's general recommendation for new setups is \(\beta\) between 2.0 and 2.5 and \(\gamma\) between 0.5 and 1.5, with the caveat that performance is sensitive to these choices and that win rate is non-monotone in \(\gamma\): reward accuracy increases with \(\gamma\) while win rate first rises then falls, indicating an interior optimum.[7]
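Because the recipe table lists the ratio \(\gamma/\beta\) rather than \(\gamma\) itself, a small conversion is needed when an implementation expects the margin directly. The sketch below copies the \(\beta\) and \(\gamma/\beta\) columns from the table and multiplies them out; the dictionary and function names are hypothetical, and whether a given trainer expects \(\gamma\) or \(\gamma/\beta\) should be checked against its documentation.

```python
# (beta, gamma/beta) per setting, copied from the recipe table above.
RECIPES = {
    "Mistral-Base": (2.0, 0.8),
    "Mistral-Instruct": (2.5, 0.1),
    "Llama3-Base": (2.0, 0.5),
    "Llama3-Instruct": (2.5, 0.55),
    "Llama3-Instruct v0.2": (10.0, 0.3),
    "Gemma2-9B-it": (10.0, 0.5),
}


def margin_from_ratio(beta: float, gamma_beta_ratio: float) -> float:
    """Recover the absolute target margin gamma from beta and gamma/beta."""
    return beta * gamma_beta_ratio


# Absolute margins implied by each released recipe.
gammas = {name: margin_from_ratio(beta, ratio) for name, (beta, ratio) in RECIPES.items()}
```

For example, the Mistral-Base recipe (\(\beta = 2.0\), \(\gamma/\beta = 0.8\)) corresponds to \(\gamma = 1.6\), and the Gemma2-9B-it recipe (\(\beta = 10\), \(\gamma/\beta = 0.5\)) to \(\gamma = 5\).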

### How does SimPO differ from DPO?

| Property | DPO | SimPO |
|---|---|---|
| Reference model required at training | Yes | No |
| Implicit reward per response | \(\beta \log\pi_\theta(y\mid x)/\pi_{\text{ref}}(y\mid x)\) | \((\beta/|y|)\log\pi_\theta(y\mid x)\) |
| Length-normalized | No (by default) | Yes |
| Explicit reward margin | No | Yes, parameter \(\gamma\) |
| Typical \(\beta\) range | 0.01 to 0.1 | 2.0 to 10 |
| GPU memory during training | Two model copies | One model copy |
| Reported training run time | Baseline | About 20% lower than DPO |
| Reported GPU memory usage | Baseline | About 10% lower than DPO |

The table summarizes the relevant differences from the SimPO paper and accompanying repository.[1][7][8]

## How well does SimPO perform on benchmarks?

The paper compares SimPO against the SFT starting point and a set of preference-optimization baselines (DPO, IPO, KTO, [ORPO](/wiki/orpo), and R-DPO, plus RRHF and SLiC-HF in some settings) across four backbone configurations: Mistral 7B Base (with SFT on UltraChat-200k followed by alignment on UltraFeedback Binarized), Mistral 7B Instruct, [Llama 3](/wiki/llama_3) 8B Base, and Llama 3 8B Instruct.[1] Evaluation is on AlpacaEval 2 (length-controlled and raw win rates), Arena-Hard v0.1 win rate, and MT-Bench scored with GPT-4 and GPT-4-Turbo (GPT-4-Preview-1106) as judges.[1][7] A revised v3 of the paper extends the evaluation to Gemma 2 9B-it.[7]

### Mistral 7B Base

| Method | AlpacaEval 2 LC | AlpacaEval 2 WR | Arena-Hard WR | MT-Bench |
|---|---|---|---|---|
| SFT | 8.4% | 6.2% | 1.3% | 4.8 |
| DPO | 15.1% | 12.5% | 10.4% | 5.9 |
| IPO | 11.8% | 9.4% | 7.5% | 5.5 |
| KTO | 13.1% | 9.1% | 5.6% | 5.4 |
| ORPO | 14.7% | 12.2% | 7.0% | 5.8 |
| R-DPO | 17.4% | 12.8% | 8.0% | 5.9 |
| SimPO | 21.5% | 20.8% | 16.6% | 6.0 |

Source: SimPO paper, Table 4.[1][7]

### Mistral 7B Instruct

| Method | AlpacaEval 2 LC | AlpacaEval 2 WR | Arena-Hard WR | MT-Bench |
|---|---|---|---|---|
| SFT | 17.1% | 14.7% | 12.6% | 6.2 |
| DPO | 26.8% | 24.9% | 16.3% | 6.3 |
| IPO | 20.3% | 20.3% | 16.2% | 6.4 |
| KTO | 24.5% | 23.6% | 17.9% | 6.4 |
| ORPO | 24.5% | 24.9% | 20.8% | 6.4 |
| R-DPO | 27.3% | 24.5% | 16.1% | 6.2 |
| SimPO | 32.1% | 34.8% | 21.0% | 6.6 |

Source: SimPO paper, Table 4.[1][7]

### Llama 3 8B Base

| Method | AlpacaEval 2 LC | AlpacaEval 2 WR | Arena-Hard WR | MT-Bench |
|---|---|---|---|---|
| DPO | 18.2% | 15.5% | 15.9% | 7.7 |
| IPO | 14.4% | 14.2% | 17.8% | 7.4 |
| KTO | 14.2% | 12.4% | 12.5% | 7.8 |
| ORPO | 12.2% | 10.6% | 10.8% | 7.6 |
| SimPO | 22.0% | 20.3% | 23.4% | 7.7 |

Source: SimPO paper, Table 4.[1][7]

### Llama 3 8B Instruct

| Method | AlpacaEval 2 LC | AlpacaEval 2 WR | Arena-Hard WR | MT-Bench |
|---|---|---|---|---|
| DPO | 40.3% | 37.9% | 32.6% | 8.0 |
| IPO | 35.6% | 35.6% | 30.5% | 8.3 |
| KTO | 33.1% | 31.8% | 26.4% | 8.2 |
| ORPO | 28.5% | 27.4% | 25.8% | 8.0 |
| SimPO | 44.7% | 40.5% | 33.8% | 8.0 |

Source: SimPO paper, Table 4.[1][7]

### Headline gains and across-method ranking

The paper summarizes its main result as: "SimPO outperforms DPO by up to 6.4 points on AlpacaEval 2 and by up to 7.5 points on Arena-Hard."[1] More granularly, SimPO beats the best non-SimPO baseline by 3.6 to 4.8 points on AlpacaEval 2 LC win rate and by 0.2 to 6.2 points on Arena-Hard across the four backbone configurations the paper studies.[7] Across all four setups (Mistral 7B Base, Mistral 7B Instruct, Llama 3 8B Base, Llama 3 8B Instruct), SimPO ranks first on AlpacaEval 2 LC, AlpacaEval 2 raw WR, and Arena-Hard, and is comparable to the baselines on MT-Bench, where the differences between methods are small.[1][7]
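The "up to 6.4" and "up to 7.5" figures can be checked against the DPO and SimPO rows of the four tables above. The short script below does that arithmetic; the data are copied from those tables, and the variable and function names are illustrative.

```python
# (AlpacaEval 2 LC win rate, Arena-Hard win rate), copied from the tables above.
RESULTS = {
    "Mistral 7B Base": {"DPO": (15.1, 10.4), "SimPO": (21.5, 16.6)},
    "Mistral 7B Instruct": {"DPO": (26.8, 16.3), "SimPO": (32.1, 21.0)},
    "Llama 3 8B Base": {"DPO": (18.2, 15.9), "SimPO": (22.0, 23.4)},
    "Llama 3 8B Instruct": {"DPO": (40.3, 32.6), "SimPO": (44.7, 33.8)},
}

LC, ARENA_HARD = 0, 1


def gain(setting: str, column: int) -> float:
    """SimPO minus DPO for one setting and benchmark, rounded to one decimal."""
    dpo = RESULTS[setting]["DPO"][column]
    simpo = RESULTS[setting]["SimPO"][column]
    return round(simpo - dpo, 1)


max_lc_gain = max(gain(setting, LC) for setting in RESULTS)
max_arena_gain = max(gain(setting, ARENA_HARD) for setting in RESULTS)
```

The largest AlpacaEval 2 LC gain comes from Mistral 7B Base (21.5 versus 15.1) and the largest Arena-Hard gain from Llama 3 8B Base (23.4 versus 15.9), matching the quoted 6.4 and 7.5 points.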

The paper also reports that the SimPO gains do not come at the cost of inflated response lengths: SimPO outputs are comparable in length to those of the SFT model and to DPO-trained baselines, indicating that the length-normalized reward is not silently rewarding longer responses.[7] The 44.7% AlpacaEval 2 LC score on Llama 3 8B Instruct was, at the time of v2 of the preprint (July 2024), the highest reported score on the AlpacaEval 2 leaderboard among 8B-class open-source models, surpassing some closed models including the reported number for Claude 3 Opus on the same leaderboard.[1][9] The model checkpoint backing that number was released as `princeton-nlp/Llama-3-Instruct-8B-SimPO` on [Hugging Face](/wiki/hugging_face).[9]

### Gemma 2 9B

A later revision applied SimPO to `google/gemma-2-9b-it` and released `princeton-nlp/gemma-2-9b-it-SimPO`. The reported numbers are 72.4% AlpacaEval 2 LC, 65.9% raw win rate, and 59.1% on Arena-Hard. The model also ranked first on Chatbot Arena among models under 10 billion parameters as of 16 September 2024, as recorded in v3 of the paper.[7][8] The baseline `gemma-2-9b-it` model is reported at 51.1% AlpacaEval 2 LC, so SimPO adds more than 20 absolute LC points on this backbone.[8] According to the Princeton blog, human raters on Chatbot Arena preferred the SimPO-tuned Gemma model over several much larger systems, and the checkpoint reached the top of its size class on real user votes.[2]

## Implementations and adoption

### Official code and checkpoints

The reference implementation is released at `github.com/princeton-nlp/SimPO` under the MIT license, built on top of the [Hugging Face](/wiki/hugging_face) alignment-handbook scaffolding and trained on UltraFeedback Binarized with UltraChat-200k for SFT in the "Base" setting.[8] The released training configurations target 4xH100 GPUs with DeepSpeed ZeRO-3 and a total batch size of 128.[8]

Released checkpoints on [Hugging Face](/wiki/hugging_face) include:[8][9]

- `princeton-nlp/Llama-3-Instruct-8B-SimPO` (v0.1)
- `princeton-nlp/Llama-3-Instruct-8B-SimPO-v0.2`
- `princeton-nlp/Mistral-7B-Base-SimPO`
- `princeton-nlp/Mistral-7B-Instruct-SimPO`
- `princeton-nlp/Llama-3-Base-8B-SFT-SimPO`
- `princeton-nlp/gemma-2-9b-it-SimPO`

Sibling repositories under the same v0.2 release provide DPO, IPO, [KTO](/wiki/kto), [ORPO](/wiki/orpo), CPO, RRHF, SLiC-HF, and R-DPO checkpoints trained under matched conditions for fair comparison, which makes the SimPO release one of the more thorough open benchmarks of preference optimization methods.[8]

### Hugging Face TRL integration

SimPO is implemented inside the [Hugging Face](/wiki/hugging_face) TRL library as a loss option on the CPOTrainer. The user enables SimPO by setting `loss_type="simpo"`, `cpo_alpha=0.0`, and a target `simpo_gamma` (default 0.5) in `CPOConfig`.[10] The TRL documentation explains: "SimPO is an alternative loss that adds a reward margin, allows for length normalization, and does not use BC regularization."[10] A hybrid CPO-SimPO mode is also supported by keeping `cpo_alpha` nonzero alongside the SimPO loss; the project at `github.com/fe1ixxu/CPO_SIMPO` documents this combination.[10]
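The settings described above can be collected as keyword arguments. The fragment below is a sketch rather than a tested training script: the three field names come from the TRL documentation cited here, while the helper `is_pure_simpo`, the hybrid `cpo_alpha` value of 1.0, and the commented usage (including the import path, which may differ between TRL versions) are assumptions for illustration.

```python
# Settings named in the TRL documentation for selecting the SimPO loss.
simpo_kwargs = {
    "loss_type": "simpo",  # use the SimPO loss inside CPOTrainer
    "cpo_alpha": 0.0,      # zero weight on the BC regularizer gives pure SimPO
    "simpo_gamma": 0.5,    # target reward margin; 0.5 is the documented default
}

# Hybrid CPO-SimPO keeps the SimPO loss but uses a nonzero cpo_alpha
# (1.0 here is an illustrative value, not a recommendation).
cpo_simpo_kwargs = {**simpo_kwargs, "cpo_alpha": 1.0}


def is_pure_simpo(kwargs: dict) -> bool:
    """Hypothetical helper: True when the SimPO loss is on and the BC term is off."""
    return kwargs["loss_type"] == "simpo" and kwargs["cpo_alpha"] == 0.0


# Assumed usage with TRL installed (check the import path for your TRL version):
#   from trl import CPOConfig
#   config = CPOConfig(output_dir="simpo-run", **simpo_kwargs)
```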

### Quantized and downstream redistributions

Community-distributed quantizations of the SimPO Llama 3 8B checkpoint appear on [Hugging Face](/wiki/hugging_face) in formats such as GGUF (for example `bartowski/Llama-3-Instruct-8B-SimPO-GGUF`), enabling local inference through runners like [llama.cpp](/wiki/llama_cpp) and [Ollama](/wiki/ollama).[11] The Princeton release also contributed checkpoints to chatbot-arena style head-to-head leaderboards where the Gemma-2-9B-it-SimPO entry ranked at the top of its size class.[8]

## Variants and related methods

SimPO sits inside a family of direct alignment algorithms that, like DPO, optimize a loss over preference pairs without an explicit reward model or on-policy [RL](/wiki/rlhf) rollouts.[1] The following table summarizes how the closest neighbors differ.

| Method | Reference model | Reward form | Distinguishing feature |
|---|---|---|---|
| DPO | Required | Log policy/reference ratio of full sequence | KL-style implicit constraint to reference |
| IPO | Required | Same as DPO with squared loss | Avoids overfitting via bounded loss; averaged over tokens |
| [KTO](/wiki/kto) | Required | Prospect-theory-derived utility | Trains on unpaired desirable/undesirable signals, not pairs |
| [ORPO](/wiki/orpo) | Not required | Log odds ratio combined with SFT NLL | Joint SFT + odds-ratio penalty |
| CPO | Not required | DPO-style reward with SFT regularizer | Approximates DPO without reference; used for translation |
| R-DPO | Required | DPO reward with length regularizer | Adds an explicit length penalty term |
| SimPO | Not required | Length-normalized average log probability | Adds explicit margin \(\gamma\); reference-free |

Sources: the cited SimPO paper and the TRL CPOTrainer documentation, which catalogs these losses as configurable options.[1][10]

A direct successor is **AlphaPO** (Gupta et al., January 2025), which leaves the SimPO loss structure intact but applies a parametric transformation \(r=(1-p^{-\alpha})/\alpha\) to reshape the reward function. The AlphaPO authors describe SimPO and DPO as both suffering from "likelihood displacement, a phenomenon by which the probabilities of preferred responses are often reduced undesirably," and argue that the reward shape, not just its functional form, controls how strongly this happens.[12] AlphaPO reports "about 7% to 10% relative improvement in alignment performance for the instruct versions of Mistral-7B and Llama3-8B" over SimPO, and it is integrated into the same TRL CPOTrainer.[10][12]
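The transformation \(r=(1-p^{-\alpha})/\alpha\) contains the logarithmic reward as a limiting case: since \(p^{-\alpha} = e^{-\alpha\log p} \approx 1 - \alpha\log p\) for small \(\alpha\), the expression tends to \(\log p\) as \(\alpha \to 0\). The sketch below checks this numerically. It assumes \(p\) is a likelihood in \((0, 1]\), which this article does not spell out, and the function name is hypothetical.

```python
import math


def alpha_transform(p: float, alpha: float) -> float:
    """r = (1 - p**(-alpha)) / alpha, with the alpha -> 0 limit log(p)."""
    if alpha == 0.0:
        return math.log(p)
    return (1.0 - p ** (-alpha)) / alpha


p = 0.4  # illustrative likelihood value

log_reward = alpha_transform(p, 0.0)       # plain log-likelihood reward
near_log_reward = alpha_transform(p, 1e-6) # almost identical to log(p)
reshaped_reward = alpha_transform(p, 1.0)  # 1 - 1/p, a differently shaped reward
```

For \(\alpha > 0\) the transformed reward lies below \(\log p\) and falls off faster as \(p\) shrinks, which is the sense in which \(\alpha\) changes the shape of the reward rather than its ordering.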

Other follow-ups include **\(\alpha\)-DPO** (Wu et al., 2024), which generalizes SimPO's fixed target margin to an adaptive, instance-specific margin,[17] and **SimPER** (Xiao et al., ICLR 2025), which removes hyperparameters entirely from SimPO-style training by optimizing the inverse perplexity of chosen and rejected responses.[18] Reference-free multi-preference variants such as **REFA** (Gupta et al., December 2024) extend reference-free, length-normalized alignment to settings with more than two ranked responses per prompt.[13]

## What are the limitations and criticisms of SimPO?

### Hyperparameter sensitivity

The SimPO paper and its companion repository both flag that SimPO is sensitive to its three main hyperparameters (learning rate, \(\beta\), \(\gamma\)) and that values that work well on one base model do not transfer to others.[1][8] Released recipes use \(\beta\) values that vary by a factor of five across configurations (2.0 for Mistral-Base, up to 10 for Gemma and Llama-3-Instruct v0.2), and \(\gamma/\beta\) ratios from 0.1 to 0.8.[8] Tuning therefore requires more search than DPO, where a single \(\beta\) around 0.01 to 0.1 is often adequate.[8]

### Are SimPO's gains really from length normalization?

The most substantive critique is that SimPO's gains over DPO may be attributable largely to length normalization rather than to dropping the reference model. The paper *Understanding Reference Policies in Direct Preference Optimization* (Liu, Liu, and Cohan, July 2024) reports that the optimal KL-constraint strength for DPO is far smaller than the values used in prior work: at \(\beta = 0.01\) their Mistral-7B DPO run scores 16.25 on AlpacaEval 2 LC versus 13.42 at \(\beta = 0.1\), and at that setting "DPO outperforms the reference-policy-free ORPO method."[14] The same authors caution that reference-freeness is not a free lunch: "other forms of regularization are still required in these methods," noting that ORPO leans on an SFT maximum-likelihood term while SimPO relies on length normalization.[14]

A related line of work introduces **LN-DPO**, a length-normalized variant of DPO, and reports that comparing "reference-free (i.e., SimPO) and reference-dependent (i.e., DPO and LN-DPO) methods reveals that they perform similarly at their peak."[15] The implication is that length normalization, rather than reference freeness or the explicit margin, accounts for much of the gap that SimPO opens over plain DPO.[15] The open GitHub issue *Length normalization in DPO and other variants* on the Princeton SimPO repository explicitly raises this question without a public resolution.[16]

### Likelihood displacement

AlphaPO and contemporaneous work observe that, like DPO, SimPO can drive down the absolute probability of preferred responses during training even while the relative margin to dispreferred responses grows. The shape of the implicit reward influences how strongly this happens, and the SimPO log-probability reward is not optimal in this respect.[12] In domains where preserving the policy's likelihood of good responses matters (for example, reasoning chains where exact phrasings matter), this can hurt downstream performance.
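A toy calculation shows how the pairwise loss can improve while the chosen response becomes less likely. The numbers below are invented for illustration, and the function is a plain restatement of the per-pair SimPO loss on average log probabilities, not code from any of the cited papers.

```python
import math


def simpo_pair_loss(avg_logp_w: float, avg_logp_l: float,
                    beta: float = 2.0, gamma: float = 1.0) -> float:
    """-log sigmoid(beta * (avg_logp_w - avg_logp_l) - gamma)."""
    z = beta * (avg_logp_w - avg_logp_l) - gamma
    return math.log1p(math.exp(-z))


# Hypothetical average per-token log probabilities before and after training.
before = {"chosen": -1.0, "rejected": -1.2}
after = {"chosen": -1.5, "rejected": -2.5}

loss_before = simpo_pair_loss(before["chosen"], before["rejected"])
loss_after = simpo_pair_loss(after["chosen"], after["rejected"])

# The gap grew from 0.2 to 1.0, so the loss fell, even though the chosen
# response's own average log probability dropped from -1.0 to -1.5.
chosen_likelihood_dropped = after["chosen"] < before["chosen"]
```

The loss depends only on the difference between the two rewards, so nothing in it prevents both log probabilities from falling as long as the rejected one falls faster.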

### Benchmark caveats

The headline AlpacaEval 2 and Arena-Hard numbers come from automatic LLM-as-judge benchmarks scored by GPT-4-class judges. The SimPO paper itself notes that MT-Bench scores cluster tightly across methods because of MT-Bench's small scale and single-instance scoring protocol, limiting its discriminative power.[7] More broadly, AlpacaEval 2's length-controlled win rate corrects for some length bias but not all, and the SimPO authors acknowledge that their models' benchmark "evaluations have limitations, including restricted query space and potential biases from model-based evaluations."[7]

### Reproducibility

Reproducing the published numbers requires pinning specific package versions, notably `alpaca-eval==0.6.2` (the repository notes that versions 0.6.3 and later changed scoring in ways that cause discrepancies).[8] The repository also notes that exact results vary with hardware and CUDA versions, a common caveat that is still worth flagging.[8] The released training scripts target 4xH100 nodes; running on smaller hardware requires scaling down per-device batch size while keeping the total batch size at 128 through gradient accumulation, which can subtly alter optimization dynamics.[8]
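Keeping the total batch size at 128 on different hardware comes down to the identity total = GPUs × per-device batch × accumulation steps. The helper below is a small sketch of that arithmetic; the function name and the per-device batch sizes in the examples are hypothetical, not values from the repository.

```python
def grad_accum_steps(total_batch: int, num_gpus: int, per_device_batch: int) -> int:
    """Gradient accumulation steps needed to reach the target total batch size."""
    per_step = num_gpus * per_device_batch
    if total_batch % per_step != 0:
        raise ValueError(
            f"total batch {total_batch} is not a multiple of {per_step} "
            "(num_gpus * per_device_batch)"
        )
    return total_batch // per_step


# Illustrative setups that all preserve a total batch size of 128.
four_gpu_steps = grad_accum_steps(128, num_gpus=4, per_device_batch=2)    # 16
single_gpu_steps = grad_accum_steps(128, num_gpus=1, per_device_batch=2)  # 64
eight_gpu_steps = grad_accum_steps(128, num_gpus=8, per_device_batch=4)   # 4
```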

### Reference-free is not unconditionally simpler

A subtler concern is that "removing the reference model" is sometimes presented as a strict simplification, but SimPO compensates by introducing the margin hyperparameter \(\gamma\), widening the range of \(\beta\) values in use (a factor of five across the released recipes), and demanding more careful learning-rate tuning.[8] Where DPO has effectively one alignment-specific hyperparameter (\(\beta\)), SimPO has three (\(\beta\), \(\gamma\), and an alignment learning rate that often differs from the SFT learning rate). For practitioners with limited compute for hyperparameter search, this can offset the per-step memory and runtime savings.[8]

## Significance

SimPO is one of the clearest demonstrations that direct alignment can be simplified beyond DPO without obviously sacrificing quality. The combination of dropping the reference model, normalizing by length, and adding an explicit margin reduces each training step to forward passes through a single set of model weights, while keeping the loss in the same Bradley-Terry family that DPO and its variants use.[1] That has practical consequences: a smaller GPU memory footprint and faster steps make alignment feasible on more constrained hardware, and the [Hugging Face](/wiki/hugging_face) TRL integration makes the algorithm accessible through a few configuration settings.[10]

The wider research conversation that followed SimPO sharpened the question of *why* preference optimization works, isolating the contributions of (a) reference-model regularization, (b) length normalization, and (c) explicit margin terms. Subsequent work that introduces length-normalized DPO variants, parametric transformations of the implicit reward (AlphaPO), and hyperparameter-free analogs (SimPER) treats SimPO as the central reference point for that decomposition, even when the conclusion is that several of SimPO's design choices interact and that pure ablation results depend on careful hyperparameter retuning of each baseline.[12][14][15][18]

In open-source instruction tuning, SimPO checkpoints became, briefly, frontier-quality entries on AlpacaEval 2 for their size class: `Llama-3-Instruct-8B-SimPO` was the top 8B open model on AlpacaEval 2 LC at release, and `gemma-2-9b-it-SimPO` topped Chatbot Arena among sub-10B models in mid-September 2024.[7][8] Those rankings were quickly disputed and overtaken by later checkpoints and by methodology revisions to the benchmarks themselves, but the SimPO recipe (length-normalized average log probability, explicit margin, no reference) is now a standard option in the alignment toolkit.[10]

## Comparison

| Concept | Relationship to SimPO |
|---|---|
| [DPO](/wiki/direct_preference_optimization_dpo) | Direct ancestor; SimPO drops the reference model and adds length normalization and a margin |
| [KTO](/wiki/kto) | Direct alignment using prospect-theory utility, not paired preferences |
| [ORPO](/wiki/orpo) | Reference-free competitor combining odds ratio with SFT loss |
| [RLHF](/wiki/rlhf) | Broader paradigm SimPO and DPO both simplify away from |
| [RLAIF](/wiki/rlaif) | Variant of RLHF using AI-generated preference labels; orthogonal to SimPO's loss design |
| [Constitutional AI](/wiki/constitutional_ai) | Provides preference labels via principle-based AI feedback that can feed SimPO training |
| [Llama 3](/wiki/llama_3) | Backbone for the Llama 3 8B Base and Instruct experiments, including the 44.7% AlpacaEval 2 LC result |
| [Mistral 7B](/wiki/mistral_7b) | Second backbone used for SimPO experiments |
| [Gemma 2](/wiki/gemma_2) | Backbone for the v3 SimPO update and the Chatbot Arena sub-10B leader |
| [AlpacaEval](/wiki/alpacaeval) | Primary benchmark on which SimPO's headline gains over DPO are reported |
| [Arena-Hard](/wiki/arena_hard) | Second benchmark used to verify SimPO's gains |
| [MT-Bench](/wiki/mt_bench) | Third benchmark; results are tight across methods due to small scale |
| [KL Divergence](/wiki/kl_divergence) | DPO implicitly regularizes the policy toward the reference under a KL constraint; SimPO removes this constraint |
| [Hugging Face](/wiki/hugging_face) | Hosts the official SimPO checkpoints and the TRL implementation |
| [Hugging Face Transformers](/wiki/transformers_library) | Underlies the SimPO training stack |
| [Instruction Tuning](/wiki/instruction_tuning) | SFT phase preceding SimPO in the "Base" experimental setting |
| [Supervised fine-tuning](/wiki/supervised_fine-tuning) | The SFT step that precedes SimPO when starting from a base model |
| [Claude 3 Opus](/wiki/claude_3_opus) | Reference closed model that SimPO Llama 3 8B Instruct surpassed on AlpacaEval 2 LC at release |

## See also

- [Direct Preference Optimization (DPO)](/wiki/direct_preference_optimization_dpo)
- [DPO](/wiki/dpo)
- [KTO](/wiki/kto)
- [ORPO](/wiki/orpo)
- [Reinforcement Learning from Human Feedback (RLHF)](/wiki/rlhf)
- [RLAIF](/wiki/rlaif)
- [Constitutional AI](/wiki/constitutional_ai)
- [Llama 3](/wiki/llama_3)
- [Mistral 7B](/wiki/mistral_7b)
- [Gemma 2](/wiki/gemma_2)
- [AlpacaEval](/wiki/alpacaeval)
- [Arena-Hard](/wiki/arena_hard)
- [MT-Bench](/wiki/mt_bench)
- [KL Divergence](/wiki/kl_divergence)
- [Hugging Face](/wiki/hugging_face)
- [Instruction Tuning](/wiki/instruction_tuning)
- [Supervised fine-tuning](/wiki/supervised_fine-tuning)
- [Large Language Model](/wiki/large_language_model)
- [Claude 3 Opus](/wiki/claude_3_opus)

## References

1. Meng, Yu; Xia, Mengzhou; Chen, Danqi. "SimPO: Simple Preference Optimization with a Reference-Free Reward". arXiv:2405.14734, 2024-05-23. https://arxiv.org/abs/2405.14734. Accessed 2026-07-12.
2. Princeton Language and Intelligence. "SimPO: A New Way to Teach AI Models to Follow Human Preferences". Princeton University, 2024. https://pli.princeton.edu/blog/2024/simpo-new-way-teach-ai-models-follow-human-preferences. Accessed 2026-07-12.
3. Rafailov, Rafael; Sharma, Archit; Mitchell, Eric; Ermon, Stefano; Manning, Christopher D.; Finn, Chelsea. "Direct Preference Optimization: Your Language Model is Secretly a Reward Model". arXiv:2305.18290, 2023-05-29. https://arxiv.org/abs/2305.18290. Accessed 2026-05-20.
4. Xia, Mengzhou. "Personal Website". xiamengzhou.github.io, 2025. https://xiamengzhou.github.io/. Accessed 2026-05-20.
5. arXiv. "[2405.14734v2] SimPO: Simple Preference Optimization with a Reference-Free Reward". 2024-07-08. https://arxiv.org/abs/2405.14734v2. Accessed 2026-05-20.
6. NeurIPS. "SimPO: Simple Preference Optimization with a Reference-Free Reward". NeurIPS 2024 Proceedings, 2024. https://papers.nips.cc/paper_files/paper/2024/hash/e099c1c9699814af0be873a175361713-Abstract-Conference.html. Accessed 2026-05-20.
7. Meng, Yu; Xia, Mengzhou; Chen, Danqi. "SimPO: Simple Preference Optimization with a Reference-Free Reward (HTML version)". arXiv:2405.14734v3, 2024-11-01. https://arxiv.org/html/2405.14734v3. Accessed 2026-07-12.
8. Meng, Yu; Xia, Mengzhou; Chen, Danqi. "princeton-nlp/SimPO: [NeurIPS 2024] Code repository". GitHub, 2024. https://github.com/princeton-nlp/SimPO. Accessed 2026-07-12.
9. Meng, Yu; Xia, Mengzhou; Chen, Danqi. "princeton-nlp/Llama-3-Instruct-8B-SimPO". Hugging Face, 2024-05-23. https://huggingface.co/princeton-nlp/Llama-3-Instruct-8B-SimPO. Accessed 2026-05-20.
10. Hugging Face. "CPO Trainer (TRL documentation)". huggingface.co/docs/trl, 2024 onward. https://huggingface.co/docs/trl/main/en/cpo_trainer. Accessed 2026-07-12.
11. Bartowski. "Llama-3-Instruct-8B-SimPO-GGUF". Hugging Face, 2024. https://huggingface.co/bartowski/Llama-3-Instruct-8B-SimPO-GGUF. Accessed 2026-05-20.
12. Gupta, Aman; Tang, Shao; Song, Qingquan; et al. "AlphaPO: Reward Shape Matters for LLM Alignment". arXiv:2501.03884, 2025-01-07. https://arxiv.org/abs/2501.03884. Accessed 2026-07-12.
13. Gupta, Taneesh; Madhavan, Rahul; Zhang, Xuchao; Bansal, Chetan; Rajmohan, Saravan. "REFA: Reference Free Alignment for Multi-Preference Optimization". arXiv:2412.16378, 2024-12-20. https://arxiv.org/abs/2412.16378. Accessed 2026-07-12.
14. Liu, Yixin; Liu, Pengfei; Cohan, Arman. "Understanding Reference Policies in Direct Preference Optimization". arXiv:2407.13709, 2024-07-18. https://arxiv.org/abs/2407.13709. Accessed 2026-07-12.
15. Ahrabian, Kian; Lin, Xihui; Patra, Barun; Chaudhary, Vishrav; Benhaim, Alon; Pujara, Jay; Song, Xia. "A Practical Analysis of Human Alignment with *PO". arXiv:2407.15229, 2024-07-21. https://arxiv.org/abs/2407.15229. Accessed 2026-07-12.
16. yakazimir. "Length normalization in DPO and other variants (Issue #20)". GitHub: princeton-nlp/SimPO, 2024-06-05. https://github.com/princeton-nlp/SimPO/issues/20. Accessed 2026-05-20.
17. Wu, Junkang; et al. "α-DPO: Adaptive Reward Margin is What Direct Preference Optimization Needs". arXiv:2410.10148, 2024-10. https://arxiv.org/abs/2410.10148. Accessed 2026-07-12.
18. Xiao, Teng; et al. "SimPER: A Minimalist Approach to Preference Alignment without Hyperparameters". ICLR 2025; arXiv:2502.00883, 2025-02. https://arxiv.org/abs/2502.00883. Accessed 2026-07-12.