SimPO

SimPO (Simple Preference Optimization) is an offline preference learning algorithm for aligning [[large_language_model|large language models]] with human preferences. It was introduced in May 2024 by Yu Meng, Mengzhou Xia, and Danqi Chen in the paper SimPO: Simple Preference Optimization with a Reference-Free Reward, accepted to NeurIPS 2024.[^1] Building on [[direct_preference_optimization_dpo|Direct Preference Optimization]] (DPO), SimPO modifies the loss in two ways: it replaces the reference-model-relative reward of DPO with the length-normalized average log probability of a response under the policy, and it introduces a target reward margin parameter that explicitly widens the gap between preferred and rejected responses. Removing the reference model lowers training memory and runtime, and the authors report that SimPO outperforms DPO and several variants on AlpacaEval 2, Arena-Hard, and MT-Bench across Mistral 7B, Llama 3 8B, and Gemma 2 9B configurations.[^1][^2]

Background

By 2024, fine-tuning [[instruction_tuning|instruction-tuned]] language models with human preference data had largely shifted from full [[rlhf|reinforcement learning from human feedback]] pipelines to direct alignment algorithms. The dominant such algorithm, DPO (Rafailov et al., 2023), reparameterizes the standard [[rlhf|RLHF]] objective so that the reward is implicitly defined by a log ratio between the policy and a fixed reference model.[^3] DPO removes the need to fit a separate reward model and the need to run on-policy [[rlhf|RL]] rollouts, but it still requires holding two copies of the model in memory at training time: the trainable policy and the frozen reference.[^3]

Several variants of DPO appeared in 2023 and 2024, including Identity Preference Optimization (IPO), Kahneman-Tversky Optimization ([[kto|KTO]]), Sequence Likelihood Calibration with Human Feedback (SLiC-HF), Rank Responses to align Human Feedback (RRHF), Contrastive Preference Optimization (CPO), length-regularized DPO (R-DPO), and Odds Ratio Preference Optimization ([[orpo|ORPO]]). These methods variously modify the loss to address overfitting, length bias, or reference-model dependence.[^1] SimPO belongs to this wave of post-DPO algorithms. Its authors argue that the implicit reward used during DPO training is mismatched with the average-log-probability quantity that actually drives generation at inference time, and that this mismatch is partly responsible for length exploitation and inconsistent reward margins.[^1]

The work was produced at Princeton University, where Mengzhou Xia and Danqi Chen are affiliated with Princeton NLP and Princeton Language and Intelligence (PLI), with Yu Meng now at the University of Virginia.[^2][^4] The original arXiv preprint appeared on 23 May 2024, with subsequent revisions on 8 July 2024 and 1 November 2024 adding new baselines, Gemma 2 results, and expanded discussion of length normalization and KL regularization.[^1][^5] The paper was presented at the 38th Conference on Neural Information Processing Systems (NeurIPS 2024) in Vancouver in December 2024.[^6]

Technical Details

The DPO loss as a starting point

For a preference dataset (\mathcal{D} = \{(x, y_w, y_l)\}) of prompts (x), preferred responses (y_w), and dispreferred responses (y_l), the DPO loss is:

[ \mathcal{L}_{\text{DPO}}(\pi_\theta;\pi_{\text{ref}}) = -\mathbb{E}_{(x,y_w,y_l)\sim\mathcal{D}}\left[\log\sigma\left(\beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\text{ref}}(y_w\mid x)} - \beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\text{ref}}(y_l\mid x)}\right)\right] ]

where (\pi_\theta) is the trainable policy, (\pi_{\text{ref}}) is a frozen reference policy (typically the post-SFT checkpoint), (\beta) is a temperature hyperparameter, and (\sigma) is the logistic sigmoid.[^3] The implicit DPO reward for a response (y) is (r(x,y) = \beta\log\big(\pi_\theta(y\mid x)/\pi_{\text{ref}}(y\mid x)\big)), a log ratio of the policy and reference probabilities of the full sequence.[^3]
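The DPO loss above can be sketched for a single preference pair as follows. This is a minimal illustration, not the authors' implementation: it assumes the summed sequence log-probabilities under the policy and the reference have already been computed, and the function names, argument names, and example numbers are hypothetical.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(policy_logp_w: float, policy_logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    """Per-pair DPO loss from summed sequence log-probabilities.

    Each implicit reward is beta * log(pi_theta(y|x) / pi_ref(y|x)).
    """
    reward_w = beta * (policy_logp_w - ref_logp_w)
    reward_l = beta * (policy_logp_l - ref_logp_l)
    return -log_sigmoid(reward_w - reward_l)


# Hypothetical log-probabilities. When the policy equals the reference,
# both implicit rewards are zero and the loss is log(2).
loss_at_init = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# Raising the policy's log-probability of the chosen response lowers the loss.
loss_after_update = dpo_loss(-8.0, -12.0, -10.0, -12.0)
```

Note that all four log-probabilities are needed per pair, which is why the reference model has to be evaluated alongside the policy.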

The SimPO loss

SimPO replaces this implicit reward with the average per-token log probability of the response under the policy alone:

[ r_{\text{SimPO}}(x, y) = \frac{\beta}{|y|}\log\pi_\theta(y\mid x) = \frac{\beta}{|y|}\sum_{i=1}^{|y|}\log\pi_\theta(y_i\mid x, y_{<i}) ]

and inserts a constant target margin (\gamma > 0) into a Bradley-Terry ranking objective:

[ \mathcal{L}_{\text{SimPO}}(\pi_\theta) = -\mathbb{E}_{(x,y_w,y_l)\sim\mathcal{D}}\left[\log\sigma\left(\frac{\beta}{|y_w|}\log\pi_\theta(y_w\mid x) - \frac{\beta}{|y_l|}\log\pi_\theta(y_l\mid x) - \gamma\right)\right] ]

where (|y|) is the token length of the response and (\gamma) is the target reward margin.[^1][^7] The argument of the sigmoid turns positive only once the (\beta)-scaled gap between the average per-token log probabilities of the chosen and rejected responses exceeds (\gamma); equivalently, the gap in average log probability must exceed (\gamma/\beta).[^1]
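As a rough sketch, the SimPO objective for one preference pair can be written directly from the formula above. The function and variable names are hypothetical, the inputs are assumed to be summed sequence log-probabilities under the policy plus token counts, and the default values of beta and gamma are illustrative rather than recommended.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def simpo_loss(logp_w: float, len_w: int, logp_l: float, len_l: int,
               beta: float = 2.0, gamma: float = 1.0) -> float:
    """Per-pair SimPO loss; only the policy's log-probabilities are needed."""
    reward_w = beta * logp_w / len_w   # length-normalized reward, chosen
    reward_l = beta * logp_l / len_l   # length-normalized reward, rejected
    return -log_sigmoid(reward_w - reward_l - gamma)


# A tie in average log-probability (-1.0 per token for both responses).
tie_no_margin = simpo_loss(-20.0, 20, -10.0, 10, gamma=0.0)    # log(2)
tie_with_margin = simpo_loss(-20.0, 20, -10.0, 10, gamma=1.0)  # log(1 + e)
```

With a positive margin the tie is penalized more heavily than under the plain Bradley-Terry objective, which is the effect the margin term is meant to have.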

The authors motivate these design choices as follows.[^1] First, the average-log-probability reward matches the quantity that beam search and likelihood-based decoding actually optimize at inference, so the training reward and the generation criterion are aligned. Second, length normalization (dividing by (|y|)) decouples reward magnitude from sequence length; without it, the model could game the loss by lengthening or shortening responses to shift their cumulative log probability.[^1] Finally, the margin (\gamma) generalizes the Bradley-Terry objective so that ties (and small positive differences) are penalized, forcing the model to push winning rewards above losing rewards by at least (\gamma) on the (\beta)-scaled average-log-prob scale.[^1]

Why no reference model

Because the SimPO reward (r_{\text{SimPO}}(x,y)) depends only on (\pi_\theta), the training loop never queries (\pi_{\text{ref}}). The authors note that this eliminates the second forward pass that DPO requires through the frozen reference for both (y_w) and (y_l) on every batch, and it eliminates the need to hold (\pi_{\text{ref}}) in GPU memory during training.[^1] Empirically the paper reports that SimPO cuts training run time by roughly 20% and reduces GPU memory by about 10% relative to DPO at matched batch size on the authors' configuration.[^7]

DPO's reference model is sometimes interpreted as providing implicit [[kl_divergence|KL divergence]] regularization toward the SFT distribution; removing it raises the question of whether the trained policy will drift too far from the supervised fine-tuned (SFT) behavior. The SimPO paper addresses this empirically rather than theoretically, observing that the length-normalized average log probability reward and the explicit margin together produce policies whose response lengths and content remain comparable to SFT or DPO-trained baselines rather than collapsing or diverging.[^7] The authors also report KL-divergence trajectories during training and argue that SimPO does not exhibit pathological drift in the regimes they evaluate.[^7] Follow-up analyses, discussed in the Limitations section below, examine whether this conclusion survives more aggressive hyperparameter exploration.[^14]

Length normalization and length exploitation

A central empirical claim of the paper is that the length normalization term is what prevents SimPO from devolving into length exploitation. Without normalization, the implicit reward of a longer response can grow purely as a function of its length, biasing the model toward longer outputs that may not be substantively better.[^1] The paper reports that the Spearman correlation between response length and likelihood drops from 0.82 without length normalization to 0.34 with it, and that an ablation removing length normalization from SimPO drops AlpacaEval 2 length-controlled (LC) win rate on Mistral-Base from 21.5 to 11.9 and Arena-Hard from 16.6 to 9.4.[^7] The same ablation drops Mistral-Instruct AlpacaEval 2 LC from 32.1 to 19.1.[^7] The authors describe the length-normalization-removed variant as producing "long and repetitive patterns" rather than substantively better responses.[^7]
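A toy calculation illustrates why normalization matters. The two token-level log-probability lists below are hypothetical: both responses have the same per-token quality, so any difference in the summed score comes from length alone.

```python
def summed_logp(token_logps: list[float]) -> float:
    """Cumulative log-probability of a response (no length normalization)."""
    return sum(token_logps)


def average_logp(token_logps: list[float]) -> float:
    """Length-normalized log-probability, as used in the SimPO reward."""
    return sum(token_logps) / len(token_logps)


# Hypothetical responses with identical per-token log-probability.
short_response = [-0.5] * 10
long_response = [-0.5] * 40

# Unnormalized scores differ purely because of length.
summed_gap = summed_logp(short_response) - summed_logp(long_response)

# Normalized scores are identical, so length alone cannot move the reward.
average_gap = average_logp(short_response) - average_logp(long_response)
```

Here the summed scores are -5.0 and -20.0 while both averages are -0.5, so only the unnormalized variant gives the optimizer a length-driven signal to exploit.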

Hyperparameters

SimPO introduces no architectural changes; tuning is concentrated in three scalars:[^1][^8]

Learning rate: described by the official repository as the most critical hyperparameter, with grid searches over values like 3e-7, 5e-7, 8e-7, and 1e-6 recommended.
(\beta): the reward scaling temperature. In SimPO it is typically much larger than in DPO; the official repository's released recipes use values from 2.0 (Mistral-Base, Llama 3 8B-Base) up to 10 (Gemma 2 9B-it, Llama 3 8B-Instruct v0.2).
(\gamma): the target reward margin. The authors recommend tuning it through the ratio (\gamma/\beta), searched between 0.1 and 0.8; the default value of simpo_gamma in the Hugging Face TRL implementation is 0.5.[^8]

The paper's general recommendation for new setups is (\beta) between 2.0 and 2.5 and (\gamma) between 0.5 and 1.5, with the caveat that performance is sensitive to these choices and that win rate is non-monotone in (\gamma): reward accuracy increases with (\gamma) while win rate first rises then falls, indicating an interior optimum.[^7]
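One way to organize such a search is to enumerate the ratio (\gamma/\beta) and derive (\gamma) from it. The sketch below assumes the learning rates listed above and the paper's suggested (\beta) range; the specific ratio points and the helper name build_grid are hypothetical choices for illustration.

```python
from itertools import product

learning_rates = [3e-7, 5e-7, 8e-7, 1e-6]   # values named by the repository
betas = [2.0, 2.5]                           # paper's suggested starting range
gamma_beta_ratios = [0.1, 0.3, 0.5, 0.8]     # hypothetical points in 0.1 to 0.8


def build_grid(lrs, beta_values, ratios):
    """Return one config dict per (learning rate, beta, gamma/beta ratio)."""
    grid = []
    for lr, beta, ratio in product(lrs, beta_values, ratios):
        grid.append({
            "learning_rate": lr,
            "beta": beta,
            "gamma": round(beta * ratio, 6),  # gamma derived from the ratio
        })
    return grid


grid = build_grid(learning_rates, betas, gamma_beta_ratios)
```

Because win rate is reported to be non-monotone in (\gamma), a grid over the ratio is more likely to bracket the interior optimum than a single default value.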

Comparison to DPO at a glance

Property	DPO	SimPO
Reference model required at training	Yes	No
Implicit reward per response	(\beta \log\big(\pi_\theta(y\mid x)/\pi_{\text{ref}}(y\mid x)\big))	((\beta/|y|)\log\pi_\theta(y\mid x))
Length-normalized	No (by default)	Yes
Explicit reward margin	No	Yes, parameter (\gamma)
Typical (\beta) range	0.01 to 0.1	2.0 to 10
GPU memory during training	Two model copies	One model copy
Reported training runtime	Baseline	About 20% lower than DPO
Reported GPU memory use	Baseline	About 10% lower than DPO

The table summarizes the relevant differences from the SimPO paper and accompanying repository.[^1][^7][^8]

Empirical Results

The paper compares SimPO against the SFT starting point and a set of preference-optimization baselines (DPO, IPO, KTO, [[orpo|ORPO]], R-DPO, plus RRHF and SLiC-HF in some settings) across four backbone configurations: Mistral 7B Base (with SFT on UltraChat-200k followed by alignment on UltraFeedback Binarized), Mistral 7B Instruct, [[llama_3|Llama 3]] 8B Base, and Llama 3 8B Instruct.[^1] Evaluation is on AlpacaEval 2 (length-controlled and raw win rates), Arena-Hard v0.1 win rate, and MT-Bench scored with GPT-4.[^1] A revised v3 of the paper extends the evaluation to Gemma 2 9B-it.[^7]

Mistral 7B Base

Method	AlpacaEval 2 LC	AlpacaEval 2 WR	Arena-Hard WR	MT-Bench
SFT	8.4%	6.2%	1.3%	4.8
DPO	15.1%	12.5%	10.4%	5.9
IPO	11.8%	9.4%	7.5%	5.5
KTO	13.1%	9.1%	5.6%	5.4
ORPO	14.7%	12.2%	7.0%	5.8
R-DPO	17.4%	12.8%	8.0%	5.9
SimPO	21.5%	20.8%	16.6%	6.0

Source: SimPO paper, Table 4.[^1][^7]

Mistral 7B Instruct

Method	AlpacaEval 2 LC	AlpacaEval 2 WR	Arena-Hard WR	MT-Bench
SFT	17.1%	14.7%	12.6%	6.2
DPO	26.8%	24.9%	16.3%	6.3
IPO	20.3%	20.3%	16.2%	6.4
KTO	24.5%	23.6%	17.9%	6.4
ORPO	24.5%	24.9%	20.8%	6.4
R-DPO	27.3%	24.5%	16.1%	6.2
SimPO	32.1%	34.8%	21.0%	6.6

Source: SimPO paper, Table 4.[^1][^7]

Llama 3 8B Base

Method	AlpacaEval 2 LC	AlpacaEval 2 WR	Arena-Hard WR	MT-Bench
DPO	18.2%	15.5%	15.9%	7.7
IPO	14.4%	14.2%	17.8%	7.4
KTO	14.2%	12.4%	12.5%	7.8
ORPO	12.2%	10.6%	10.8%	7.6
SimPO	22.0%	20.3%	23.4%	7.7

Source: SimPO paper, Table 4.[^1][^7]

Llama 3 8B Instruct

Method	AlpacaEval 2 LC	AlpacaEval 2 WR	Arena-Hard WR	MT-Bench
DPO	40.3%	37.9%	32.6%	8.0
IPO	35.6%	35.6%	30.5%	8.3
KTO	33.1%	31.8%	26.4%	8.2
ORPO	28.5%	27.4%	25.8%	8.0
SimPO	44.7%	40.5%	33.8%	8.0

Source: SimPO paper, Table 4.[^1][^7]

Headline gains and across-method ranking

The paper summarizes its main result as: "SimPO outperforms DPO by up to 6.4 points on AlpacaEval 2 and by up to 7.5 points on Arena-Hard."[^1] More granularly, SimPO beats the best non-SimPO baseline by 3.6 to 4.8 points on AlpacaEval 2 LC win rate and by 0.2 to 6.2 points on Arena-Hard across the four backbone configurations the paper studies.[^7] Across all four setups (Mistral 7B Base, Mistral 7B Instruct, Llama 3 8B Base, Llama 3 8B Instruct), SimPO ranks first on AlpacaEval 2 LC, AlpacaEval 2 raw WR, and Arena-Hard, and is comparable on MT-Bench, where the differences between methods are small.[^1][^7]

The paper also reports that the SimPO gains do not come at the cost of inflated response lengths: SimPO outputs are comparable in length to those of the SFT model and to DPO-trained baselines, indicating that the length-normalized reward is not silently rewarding longer responses.[^7] The 44.7% AlpacaEval 2 LC score on Llama 3 8B Instruct was, at the time of v2 of the preprint (July 2024), the highest reported score on the AlpacaEval 2 leaderboard among 8B-class open-source models, and it exceeded the leaderboard score reported for some closed models, including Claude 3 Opus.[^1][^9] The model checkpoint backing that number was released as princeton-nlp/Llama-3-Instruct-8B-SimPO on [[hugging_face|Hugging Face]].[^9]

Gemma 2 9B

A later revision applied SimPO to google/gemma-2-9b-it and released princeton-nlp/gemma-2-9b-it-SimPO. The reported numbers are 72.4% AlpacaEval 2 LC, 65.9% raw win rate, and 59.1% on Arena-Hard; the model also ranked first on Chatbot Arena among models under 10 billion parameters as of 16 September 2024 (as recorded in v3 of the paper).[^7][^8] The baseline gemma-2-9b-it model is reported at 51.1% AlpacaEval 2 LC, so SimPO adds more than 20 absolute LC points on this backbone.[^7]

Implementations and Adoption

Official code and checkpoints

The reference implementation is released at github.com/princeton-nlp/SimPO under the MIT license, built on top of the [[hugging_face|Hugging Face]] alignment-handbook scaffolding and trained on UltraFeedback Binarized with UltraChat-200k for SFT in the "Base" setting.[^8] Training in the paper used 4xH100 GPUs with DeepSpeed ZeRO-3 and a total batch size of 128.[^8]

Released checkpoints on [[hugging_face|Hugging Face]] include:[^8][^9]

princeton-nlp/Llama-3-Instruct-8B-SimPO (v0.1)
princeton-nlp/Llama-3-Instruct-8B-SimPO-v0.2
princeton-nlp/Mistral-7B-Base-SimPO
princeton-nlp/Mistral-7B-Instruct-SimPO
princeton-nlp/Llama-3-Base-8B-SFT-SimPO
princeton-nlp/gemma-2-9b-it-SimPO

Sibling repositories under the same v0.2 release provide DPO, IPO, [[kto|KTO]], [[orpo|ORPO]], CPO, RRHF, SLiC-HF, and R-DPO checkpoints trained under matched conditions for fair comparison, which makes the SimPO release one of the more thorough open benchmarks of preference optimization methods.[^8]

Hugging Face TRL integration

SimPO is implemented inside the [[hugging_face|Hugging Face]] TRL library as a loss option on the CPOTrainer. The user enables SimPO by setting loss_type="simpo", cpo_alpha=0.0, and a target simpo_gamma (default 0.5) in CPOConfig.[^10] The TRL documentation explains: "SimPO is an alternative loss that adds a reward margin, allows for length normalization, and does not use BC regularization."[^10] A hybrid CPO-SimPO mode is also supported by keeping cpo_alpha nonzero alongside the SimPO loss; the project at github.com/fe1ixxu/CPO_SIMPO documents this combination.[^10]
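The settings named above can be collected as keyword arguments. This is a sketch only: the dictionary names simpo_settings and hybrid_settings are hypothetical, the nonzero cpo_alpha value is an arbitrary illustration, and the snippet deliberately does not import TRL, so the field names should be checked against the installed TRL version before they are passed to CPOConfig.

```python
# Settings the TRL documentation names for enabling the SimPO loss
# on CPOTrainer (to be passed to CPOConfig as keyword arguments).
simpo_settings = {
    "loss_type": "simpo",   # select the SimPO loss
    "cpo_alpha": 0.0,       # switch off CPO's behavior-cloning term
    "simpo_gamma": 0.5,     # target reward margin (stated TRL default)
}

# Hybrid CPO-SimPO: same loss, but cpo_alpha kept nonzero.
# The value 1.0 is an assumption chosen only for illustration.
hybrid_settings = {**simpo_settings, "cpo_alpha": 1.0}
```

Keeping the two variants as separate dictionaries makes it explicit that the only difference between pure SimPO and the hybrid mode is whether the behavior-cloning weight is zero.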

Quantized and downstream redistributions

Community-distributed quantizations of the SimPO Llama 3 8B checkpoint appear on [[hugging_face|Hugging Face]] in formats such as GGUF (for example bartowski/Llama-3-Instruct-8B-SimPO-GGUF), enabling local inference through runners like [[llama_cpp|llama.cpp]] and [[ollama|Ollama]].[^11] The Princeton release also contributed checkpoints to chatbot-arena style head-to-head leaderboards where the Gemma-2-9B-it-SimPO entry ranked at the top of its size class.[^8]

Variants and Related Methods

SimPO belongs to a family of direct alignment algorithms that, like DPO, optimize a loss over preference pairs without an explicit reward model or on-policy [[rlhf|RL]] rollouts.[^1] The following table summarizes how the closest neighbors differ.

Method	Reference model	Reward form	Distinguishing feature
DPO	Required	Log policy/reference ratio of full sequence	KL-style implicit constraint to reference
IPO	Required	Same as DPO with squared loss	Avoids overfitting via bounded loss; averaged over tokens
[[kto|KTO]]	Required	Prospect-theory-derived utility
[[orpo|ORPO]]	Not required	Log odds ratio combined with SFT NLL
CPO	Not required	DPO-style reward with SFT regularizer	Approximates DPO without reference; used for translation
R-DPO	Required	DPO reward with length regularizer	Adds an explicit length penalty term
SimPO	Not required	Length-normalized average log probability	Adds explicit margin (\gamma); reference-free

Sources: the cited SimPO paper and the TRL CPOTrainer documentation, which catalogs these losses as configurable options.[^1][^10]

A direct successor is AlphaPO (Gupta et al., January 2025), which leaves the SimPO loss structure intact but applies a parametric transformation (r=(1-p^{-\alpha})/\alpha) to reshape the reward function. The AlphaPO authors describe SimPO and DPO as both suffering from "likelihood displacement" (where the absolute probability of the chosen response can fall during training) and argue that the reward shape, not just its functional form, controls how strongly this happens.[^10] AlphaPO is integrated into the same TRL CPOTrainer and reports 7-10% relative gains over SimPO on Mistral 7B Instruct and Llama 3 8B Instruct.[^10]
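The transformation quoted above can be explored numerically. The sketch below is an illustration, not the AlphaPO implementation: it treats (p) as a probability in (0, 1], leaves open how (p) is obtained from the policy, and uses a hypothetical function name. It shows that as (\alpha) approaches zero the transformed reward approaches (\log p), the log-probability form used by SimPO.

```python
import math


def alpha_reward(p: float, alpha: float) -> float:
    """r = (1 - p**(-alpha)) / alpha, with alpha == 0 treated as the log limit."""
    if alpha == 0.0:
        return math.log(p)
    return (1.0 - p ** (-alpha)) / alpha


p = 0.5  # hypothetical probability

reward_alpha_one = alpha_reward(p, 1.0)     # 1 - 1/p = -1.0
reward_near_zero = alpha_reward(p, 1e-6)    # close to log(0.5)
reward_log_limit = alpha_reward(p, 0.0)     # exactly log(0.5)
```

The limit follows from the expansion (p^{-\alpha} = e^{-\alpha\log p} \approx 1 - \alpha\log p) for small (\alpha), so (\alpha) acts as a knob that bends the reward away from the plain logarithm.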

Other follow-ups include (\alpha)-DPO (Wu et al., 2024), which generalizes SimPO's fixed margin to an adaptive instance-specific margin, and SimPER (Xiao et al., ICLR 2025), which removes hyperparameters from SimPO-style training.[^12] Reference-free multi-preference variants such as REFA (December 2024) extend the SimPO recipe to settings with more than two ranked responses per prompt.[^13]

Limitations and Criticisms

Hyperparameter sensitivity

The SimPO paper and its companion repository both flag that SimPO is sensitive to its three main hyperparameters (learning rate, (\beta), (\gamma)) and that values that work well on one base model do not transfer to others.[^1][^8] Released recipes use (\beta) values that vary by a factor of five across configurations (2.0 for Mistral-Base, up to 10 for Gemma and Llama-3-Instruct v0.2), and (\gamma/\beta) ratios from 0.1 to 0.8.[^8] Tuning therefore requires more search than DPO, where a single (\beta) around 0.01 to 0.1 is often adequate.[^8]

Are the gains really from length normalization

The most substantive critique is that SimPO's gains over DPO may be attributable largely to length normalization rather than to dropping the reference model. The paper Understanding Reference Policies in Direct Preference Optimization (Liu, Liu, and Cohan, July 2024) argues that DPO's KL constraint can be configured with a much smaller (\beta) (around 0.01) than the values reported by some SimPO baselines, and at that setting DPO becomes competitive with [[orpo|ORPO]] and other reference-free methods. The authors note that other forms of regularization remain necessary even in reference-free methods.[^14]

A related line of work introduces LN-DPO, a length-normalized variant of DPO, and reports that the reference-free SimPO and reference-dependent LN-DPO "perform similarly at their peak" once each is tuned.[^15] The implication is that length normalization, rather than reference freeness or the explicit margin, accounts for much of the gap that SimPO opens over plain DPO.[^15] The open GitHub issue Length normalization in DPO and other variants on the Princeton SimPO repository explicitly raises this question without a public resolution.[^16]

Likelihood displacement

AlphaPO and contemporaneous work observe that, like DPO, SimPO can drive down the absolute probability of preferred responses during training even while the relative margin to dispreferred responses grows. The shape of the implicit reward influences how strongly this happens, and the SimPO log-probability reward is not optimal in this respect.[^10] In domains where preserving the policy's likelihood of good responses matters (for example, reasoning chains where exact phrasings matter), this can hurt downstream performance.
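A small numeric illustration of likelihood displacement follows. All numbers are hypothetical average per-token log-probabilities, chosen only to show that the reward margin can grow while the chosen response becomes less likely in absolute terms.

```python
import math


def simpo_margin(avg_logp_w: float, avg_logp_l: float, beta: float = 2.0) -> float:
    """Beta-scaled gap between chosen and rejected average log-probabilities."""
    return beta * (avg_logp_w - avg_logp_l)


# Hypothetical values before and after some amount of training.
before = {"chosen": -1.0, "rejected": -1.2}
after = {"chosen": -1.5, "rejected": -2.5}

margin_before = simpo_margin(before["chosen"], before["rejected"])
margin_after = simpo_margin(after["chosen"], after["rejected"])

# Per-token probability of the chosen response (geometric mean).
chosen_prob_before = math.exp(before["chosen"])
chosen_prob_after = math.exp(after["chosen"])
```

In this example the margin rises from about 0.4 to 2.0, so the pairwise loss improves, yet the per-token probability of the chosen response falls from roughly 0.37 to 0.22.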

Benchmark caveats

The headline AlpacaEval 2 and Arena-Hard numbers come from automatic LLM-as-judge benchmarks scored by GPT-4-class judges. The SimPO paper itself notes that MT-Bench scores cluster tightly across methods because of MT-Bench's small scale and single-instance scoring protocol, limiting its discriminative power.[^7] More broadly, AlpacaEval 2's length-controlled win rate corrects for some length bias but not all, and the SimPO authors acknowledge that "benchmark evaluations have limitations, including restricted query space and potential biases from model-based evaluations."[^7]

Reproducibility

Reproducing the published numbers requires pinning specific package versions, notably alpaca-eval==0.6.2 (the repository notes that versions 0.6.3 and later changed scoring in ways that cause discrepancies).[^8] The repository also notes that exact results vary with hardware and CUDA versions, a common source of run-to-run variance.[^8] The released training scripts target 4xH100 nodes; running on smaller hardware requires scaling down per-device batch size while keeping the total batch size at 128 through gradient accumulation, which can subtly alter optimization dynamics.[^8]
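The batch-size bookkeeping can be sketched as below. The helper name and the per-device batch sizes are hypothetical; only the total batch size of 128 and the 4-GPU setup come from the text above.

```python
def grad_accum_steps(total_batch: int, num_gpus: int, per_device_batch: int) -> int:
    """Gradient-accumulation steps needed to reach a target total batch size."""
    per_step = num_gpus * per_device_batch
    if total_batch % per_step != 0:
        raise ValueError("total batch size must be divisible by num_gpus * per_device_batch")
    return total_batch // per_step


# Hypothetical per-device batch sizes on different hardware budgets.
steps_four_gpus = grad_accum_steps(128, num_gpus=4, per_device_batch=2)
steps_two_gpus = grad_accum_steps(128, num_gpus=2, per_device_batch=2)
steps_one_gpu = grad_accum_steps(128, num_gpus=1, per_device_batch=2)
```

Halving the GPU count at a fixed per-device batch size doubles the accumulation steps, which preserves the total batch size but changes how often optimizer steps occur relative to wall-clock time.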

Reference-free is not unconditionally simpler

A subtler concern is that "removing the reference model" is sometimes presented as a strict simplification, but SimPO compensates by introducing the margin hyperparameter (\gamma), enlarging the effective (\beta) range (which spans a factor of five across the released recipes), and demanding more careful learning-rate tuning.[^8] Where DPO has effectively one alignment-specific hyperparameter ((\beta)), SimPO has three ((\beta), (\gamma), and an alignment learning rate that often differs from the SFT learning rate). For practitioners with limited compute for hyperparameter search, this can offset the per-step memory and runtime savings.[^8]

Significance

SimPO is one of the clearest demonstrations that direct alignment can be simplified beyond DPO without obviously sacrificing quality. The combination of dropping the reference model, normalizing by length, and adding an explicit margin reduces each training step to forward passes through a single set of model weights, while keeping the loss in the same Bradley-Terry family that DPO and its variants use.[^1] That has practical consequences: a smaller GPU memory footprint and faster steps make alignment feasible on more constrained hardware, and the [[hugging_face|Hugging Face]] TRL integration makes the algorithm accessible through a few configuration options.[^10]

The wider research conversation that followed SimPO sharpened the question of why preference optimization works, isolating the contributions of (a) reference-model regularization, (b) length normalization, and (c) explicit margin terms. Subsequent work that introduces length-normalized DPO variants, parametric transformations of the implicit reward (AlphaPO), and hyperparameter-free analogs (SimPER) treats SimPO as the central reference point for that decomposition, even when the conclusion is that several of SimPO's design choices interact and that pure ablation results depend on careful hyperparameter retuning of each baseline.[^14][^15][^10][^12]

In open-source instruction tuning, SimPO checkpoints became, briefly, frontier-quality entries on AlpacaEval 2 for their size class: Llama-3-Instruct-8B-SimPO was the top 8B open model on AlpacaEval 2 LC at release, and gemma-2-9b-it-SimPO topped Chatbot Arena among sub-10B models in mid-September 2024.[^7][^8] Those rankings were quickly disputed and overtaken by later checkpoints and by methodology revisions to the benchmarks themselves, but the SimPO recipe (length-normalized average log probability, explicit margin, no reference) is now a standard option in the alignment toolkit.[^10]

Comparison

Concept	Relationship to SimPO
[[direct_preference_optimization_dpo|DPO]]
[[dpo|DPO]] (short slug)
[[kto|KTO]]
[[orpo|ORPO]]
[[rlhf|RLHF]]
[[rlaif|RLAIF]]
[[constitutional_ai|Constitutional AI]]
[[llama_3|Llama 3]]
[[mistral_7b|Mistral 7B]]
[[gemma_2|Gemma 2]]
[[alpacaeval|AlpacaEval]]
[[arena_hard|Arena-Hard]]
[[mt_bench|MT-Bench]]
[[kl_divergence|KL Divergence]]
[[hugging_face|Hugging Face]]
[[transformers_library|Hugging Face Transformers]]
[[instruction_tuning|Instruction Tuning]]
[[supervised_fine-tuning|Supervised fine-tuning]]
[[claude_3_opus|Claude 3 Opus]]
