Direct Preference Optimization (DPO) is a method for aligning large language models with human preferences without requiring explicit reward model training or reinforcement learning. Introduced by Rafailov et al. in 2023, DPO reparameterizes the standard reinforcement learning from human feedback (RLHF) objective so that the optimal policy can be extracted in closed form, reducing the alignment problem to a simple classification loss over preference pairs. The method was published at NeurIPS 2023 under the title "Direct Preference Optimization: Your Language Model is Secretly a Reward Model."
DPO has become one of the most widely adopted post-training alignment techniques in modern natural language processing, used in the training pipelines of models such as Llama 3, Mixtral, and Phi-3.
Imagine you are training a dog by comparing tricks. Each time the dog performs two tricks, you point at one and say "this one is better." Eventually the dog learns which kinds of tricks you like more. DPO works the same way: instead of giving a language model a score for every single answer (which is complicated), you show it pairs of answers and tell it which one you prefer. The model then adjusts itself to give more answers like the ones you picked. It skips the step of building a separate "scoring machine" and learns directly from your choices.
Reinforcement learning from human feedback (RLHF) is the standard approach for aligning language models with human preferences. The conventional RLHF pipeline consists of three stages:

1. Supervised fine-tuning (SFT): a pretrained model is fine-tuned on demonstration data, producing the reference policy π_ref.
2. Reward modeling: a separate reward model r_φ is trained on human preference pairs, typically under the Bradley-Terry model.
3. RL optimization: the policy is optimized against the learned reward with an algorithm such as PPO, subject to a KL penalty that keeps it close to π_ref.
While RLHF has been used successfully in systems like ChatGPT and Claude, the pipeline has several practical drawbacks:

- It requires training and keeping in memory multiple models (policy, reference, reward model, and value function).
- PPO introduces many RL-specific hyperparameters and is sensitive to their settings.
- Training can be unstable, and the policy may exploit flaws in the learned reward model (reward hacking).
- Online sampling and reward evaluation during training make the pipeline computationally expensive.
DPO was designed to address these issues by collapsing the reward modeling and RL optimization stages into a single supervised learning step.
The standard RLHF objective seeks a policy π_θ that maximizes expected reward while remaining close to the reference policy π_ref:
max_π E_{x~D, y~π(y|x)} [r(x, y)] - β · D_KL[π(y|x) || π_ref(y|x)]
Here, β > 0 controls the strength of the KL constraint. A larger β keeps the policy closer to the reference, while a smaller β allows more aggressive optimization of the reward.
The Bradley-Terry model expresses the probability that completion y_1 is preferred over y_2 given prompt x as:
p*(y_1 ≻ y_2 | x) = σ(r*(x, y_1) - r*(x, y_2))
where σ is the sigmoid function and r* is the true (latent) reward function. This model assumes that preferences depend only on the difference in rewards between the two completions.
The reward model r_φ is typically trained by minimizing the negative log-likelihood under this model:
L_R(r_φ, D) = -E_{(x, y_w, y_l) ~ D} [log σ(r_φ(x, y_w) - r_φ(x, y_l))]
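A minimal numeric sketch of the Bradley-Terry probability and this loss, using illustrative scalar rewards (the specific values are made up for the example):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bt_preference_prob(r_w, r_l):
    # Bradley-Terry: P(y_w preferred over y_l) = sigma(r_w - r_l)
    return sigmoid(r_w - r_l)

def reward_model_nll(pairs):
    # pairs: list of (reward for chosen, reward for rejected) tuples
    # L_R = -mean log sigma(r(x, y_w) - r(x, y_l))
    return -sum(math.log(bt_preference_prob(rw, rl)) for rw, rl in pairs) / len(pairs)

# A chosen completion scoring 2.0 against a rejected one scoring 1.0
# is preferred with probability sigma(1.0), roughly 73%.
p = bt_preference_prob(2.0, 1.0)
loss = reward_model_nll([(2.0, 1.0), (0.5, -0.5)])
```

Driving the loss toward zero pushes the reward gap r(x, y_w) − r(x, y_l) toward +∞, which is why reward models are usually regularized or trained for few epochs.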
The central insight of DPO comes from solving the KL-constrained optimization problem analytically. The optimal policy for the RLHF objective has a closed-form solution:
π*(y|x) = (1/Z(x)) · π_ref(y|x) · exp((1/β) · r(x, y))
where Z(x) = Σ_y π_ref(y|x) · exp((1/β) · r(x, y)) is the partition function (a normalizing constant that depends only on x). This result follows from the calculus of variations and the Gibbs distribution.
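As a sanity check, the closed-form policy can be evaluated on a toy problem with three candidate completions (the reference distribution, rewards, and β below are illustrative):

```python
import math

beta = 0.5
pi_ref = [0.5, 0.3, 0.2]       # reference policy over three candidate completions
rewards = [1.0, 0.0, -1.0]     # reward assigned to each completion

# pi*(y|x) = (1/Z(x)) * pi_ref(y|x) * exp(r(x, y) / beta)
unnorm = [p * math.exp(r / beta) for p, r in zip(pi_ref, rewards)]
Z = sum(unnorm)                # partition function Z(x)
pi_star = [u / Z for u in unnorm]

def objective(pi):
    # the RLHF objective: expected reward minus beta * KL(pi || pi_ref)
    reward_term = sum(p * r for p, r in zip(pi, rewards))
    kl_term = sum(p * math.log(p / q) for p, q in zip(pi, pi_ref))
    return reward_term - beta * kl_term

# pi* is normalized, shifts mass toward the high-reward completion,
# and attains the known optimum value beta * log Z(x)
j_star = objective(pi_star)
j_ref = objective(pi_ref)
```

Substituting π* back into the objective shows its optimal value is exactly β·log Z(x), which the numerical check confirms.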
By rearranging the optimal policy expression, the reward function can be written in terms of the policy:
r(x, y) = β · log(π_θ(y|x) / π_ref(y|x)) + β · log Z(x)
This is the reparameterization step: instead of learning an explicit reward function, the reward is defined implicitly through the ratio of the current policy to the reference policy.
Substituting the reparameterized reward into the Bradley-Terry model yields:
p(y_w ≻ y_l | x) = σ(β · log(π_θ(y_w|x) / π_ref(y_w|x)) - β · log(π_θ(y_l|x) / π_ref(y_l|x)))
The partition function Z(x) cancels out because it appears identically in both terms. This cancellation is what makes DPO tractable: the intractable normalizing constant disappears from the objective.
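The cancellation can be checked numerically: computing the preference probability from the full reparameterized rewards, including an arbitrary β·log Z(x) term, gives the same result as using the log-ratios alone. The sequence probabilities below are illustrative:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

beta = 0.1
# toy per-sequence probabilities under the policy and the reference
pi_w, ref_w = 0.020, 0.010    # chosen completion
pi_l, ref_l = 0.004, 0.008    # rejected completion
log_Z = 3.7                   # arbitrary value of log Z(x): it should not matter

# full reparameterized rewards, r = beta * log(pi/ref) + beta * log Z(x)
r_w = beta * math.log(pi_w / ref_w) + beta * log_Z
r_l = beta * math.log(pi_l / ref_l) + beta * log_Z
p_full = sigmoid(r_w - r_l)

# same probability computed without ever touching Z(x)
p_ratio = sigmoid(beta * (math.log(pi_w / ref_w) - math.log(pi_l / ref_l)))
```

Changing `log_Z` to any other value leaves `p_full` unchanged, because the term appears identically in both rewards and subtracts out.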
The final DPO training loss is the negative log-likelihood of the observed preferences:
L_DPO(π_θ; π_ref) = -E_{(x, y_w, y_l) ~ D} [log σ(β · (log(π_θ(y_w|x)/π_ref(y_w|x)) - log(π_θ(y_l|x)/π_ref(y_l|x))))]
This can be interpreted as increasing the log-probability ratio of preferred completions relative to dispreferred completions, adjusted by how much the reference policy already favors each completion.
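A minimal reference implementation of this loss, operating on precomputed sequence log-probabilities (the numeric values in the example are illustrative):

```python
import math

def log_sigmoid(z):
    # numerically stable log(sigmoid(z)) = -log(1 + exp(-z))
    return -math.log1p(math.exp(-z)) if z >= 0 else z - math.log1p(math.exp(z))

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Per-example DPO loss from summed token log-probabilities.

    logp_*     : log pi_theta(y|x) for the chosen (w) / rejected (l) completion
    ref_logp_* : log pi_ref(y|x) for the same completions
    """
    ratio_w = logp_w - ref_logp_w      # log(pi_theta/pi_ref), chosen
    ratio_l = logp_l - ref_logp_l      # log(pi_theta/pi_ref), rejected
    return -log_sigmoid(beta * (ratio_w - ratio_l))

# At initialization pi_theta == pi_ref, so both ratios vanish and the
# loss equals -log sigma(0) = log 2 for every pair.
initial = dpo_loss(-12.0, -15.0, -12.0, -15.0)
# Raising the chosen log-prob and lowering the rejected one reduces the loss.
improved = dpo_loss(-11.0, -16.0, -12.0, -15.0)
```

In practice the log-probabilities are sums of per-token log-probs from two forward passes (policy and frozen reference), and the loss is averaged over a batch.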
The gradient of the DPO loss with respect to the policy parameters θ has an intuitive form. It consists of three components:

1. A term that increases the log-probability of the preferred completion y_w.
2. A term that decreases the log-probability of the dispreferred completion y_l.
3. A per-example weight, σ(r̂_θ(x, y_l) − r̂_θ(x, y_w)), where r̂_θ(x, y) = β · log(π_θ(y|x)/π_ref(y|x)) is the implicit reward. The weight is large precisely when the implicit reward model currently ranks the pair incorrectly, so wrongly ordered examples receive stronger updates.
This weighting mechanism is essential for stability. Without it (as in naive "unlikelihood training"), the model tends to degenerate by uniformly suppressing all outputs rather than learning fine-grained preferences.
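A small sketch of this example-level weight, σ(r̂_θ(x, y_l) − r̂_θ(x, y_w)) with r̂_θ = β · log(π_θ/π_ref), using illustrative log-ratio values:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

beta = 0.1

def gradient_weight(ratio_w, ratio_l):
    # sigma(r_hat_l - r_hat_w): large when the implicit reward currently
    # ranks the rejected completion above the chosen one
    return sigmoid(beta * ratio_l - beta * ratio_w)

# pair the model already orders correctly -> small weight, gentle update
w_easy = gradient_weight(ratio_w=5.0, ratio_l=-5.0)
# pair the model orders incorrectly -> weight near 1, strong update
w_hard = gradient_weight(ratio_w=-5.0, ratio_l=5.0)
```

Because the weight shrinks toward zero on pairs the model already gets right, training does not keep pushing rejected completions down indefinitely the way an unweighted unlikelihood term would.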
The practical DPO training procedure involves the following steps:

1. Start from a supervised fine-tuned (SFT) model and keep a frozen copy of it as the reference policy π_ref.
2. Collect or obtain a preference dataset of triples (x, y_w, y_l).
3. For each batch, compute the log-probabilities of y_w and y_l under both the policy and the reference model.
4. Minimize the DPO loss with a standard optimizer and learning-rate schedule.
| Hyperparameter | Typical range | Effect |
|---|---|---|
| β (beta) | 0.1 to 0.5 | Controls divergence from reference policy. Lower values allow more aggressive adaptation; higher values keep the model closer to the SFT baseline. |
| Learning rate | 1e-7 to 5e-6 | Standard learning rate for fine-tuning. DPO typically uses lower rates than SFT. |
| Batch size | 32 to 128 | Larger batches provide more stable gradient estimates for the preference loss. |
| Epochs | 1 to 3 | DPO usually requires few epochs. Overtraining can lead to overfitting on the preference dataset. |
| Label smoothing | 0.0 to 0.1 | Optional smoothing applied to the preference labels to mitigate noise in human judgments. |
DPO is straightforward to implement using standard deep learning frameworks. The Hugging Face TRL (Transformer Reinforcement Learning) library provides a `DPOTrainer` class that handles the training loop:

```python
from trl import DPOConfig, DPOTrainer

# model, ref_model, tokenizer, and preference_dataset are assumed to be
# loaded beforehand (e.g., via the transformers and datasets libraries)
training_args = DPOConfig(
    beta=0.1,
    learning_rate=5e-7,
    per_device_train_batch_size=4,
    num_train_epochs=1,
    output_dir="./dpo_output",
)

trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,
    args=training_args,
    processing_class=tokenizer,
    train_dataset=preference_dataset,
)
trainer.train()
```
The following table summarizes the main differences between DPO and the standard RLHF pipeline using PPO:
| Aspect | RLHF with PPO | DPO |
|---|---|---|
| Reward model | Requires training a separate explicit reward model | Implicit; reward is defined through the policy itself |
| Optimization method | Reinforcement learning (PPO) | Supervised learning (gradient descent) |
| Models in memory | Four (policy, reference, reward model, value function) | Two (policy and reference) |
| Data generation | Online: generates new samples during training | Offline: uses a fixed preference dataset |
| Hyperparameter sensitivity | High; many RL-specific hyperparameters (clip ratio, GAE lambda, etc.) | Low; primarily β and standard training hyperparameters |
| Implementation complexity | High; requires RL infrastructure | Low; standard supervised training loop |
| Training stability | Can be unstable; reward hacking is common | Generally more stable, though overfitting is possible |
| Computational cost | Expensive (sampling, reward evaluation, value estimation) | Cheaper (forward passes through two models only) |
| Theoretical optimality | Optimal policy under certain conditions | Equivalent to RLHF when Bradley-Terry model holds and data is sufficient |
Rafailov et al. (2023) evaluated DPO against PPO-based RLHF on three tasks:
| Task | DPO performance vs. PPO |
|---|---|
| Sentiment control (IMDb) | DPO exceeded PPO in controlling generation sentiment |
| Summarization (TL;DR) | DPO matched or slightly improved summarization quality |
| Single-turn dialogue (Anthropic HH) | DPO matched or improved response quality |
Across all tasks, DPO achieved comparable or better performance while being simpler to implement and more computationally efficient.
Rafailov et al. proved that when the Bradley-Terry model perfectly fits the true preference distribution and the preference dataset has sufficient coverage, the global optimum of the DPO objective coincides with the global optimum of the RLHF objective. In other words, DPO and RLHF converge to the same optimal policy under ideal conditions.
The DPO policy implicitly defines a reward function:
r_implicit(x, y) = β · log(π_θ(y|x) / π_ref(y|x))
This implicit reward can be extracted and used for evaluation or as a reward signal for other purposes, which is the basis of the paper's subtitle: "Your Language Model is Secretly a Reward Model."
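Extracting this implicit reward requires only the two sequence log-probabilities. A minimal sketch (the log-probability values are illustrative):

```python
def implicit_reward(logp_policy, logp_ref, beta=0.1):
    # r_implicit(x, y) = beta * log(pi_theta(y|x) / pi_ref(y|x)),
    # computed from summed token log-probabilities
    return beta * (logp_policy - logp_ref)

# A completion the DPO-trained policy likes more than the reference did
# receives a positive implicit reward; one it downweighted gets a negative one.
r_up = implicit_reward(-10.0, -14.0)    # policy raised this completion's log-prob
r_down = implicit_reward(-18.0, -13.0)  # policy lowered this completion's log-prob
```

Scoring candidate responses this way turns the aligned policy itself into a pairwise reward model, which is what the paper's subtitle refers to.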
The KL-constrained RLHF objective is a special case of the maximum entropy reinforcement learning framework, where the entropy bonus is replaced by a KL penalty relative to a reference policy. The closed-form solution for the optimal policy in this setting is a Gibbs (Boltzmann) distribution, which is well studied in statistical physics and Bayesian inference.
Since the publication of DPO, numerous variants have been proposed to address its limitations or extend its capabilities. The following table summarizes the most notable ones:
| Method | Authors | Year | Venue | Key innovation |
|---|---|---|---|---|
| IPO (Identity Preference Optimization) | Azar et al. | 2023 | arXiv | Removes the Bradley-Terry assumption; uses a general preference objective (ΨPO) based directly on pairwise preference probabilities |
| KTO (Kahneman-Tversky Optimization) | Ethayarajh et al. | 2024 | arXiv | Uses unpaired binary feedback (desirable/undesirable) instead of paired preferences; grounded in prospect theory |
| ORPO (Odds Ratio Preference Optimization) | Hong et al. | 2024 | EMNLP 2024 | Monolithic method that combines SFT and preference alignment in a single step; no reference model needed |
| SimPO (Simple Preference Optimization) | Meng et al. | 2024 | NeurIPS 2024 | Uses average log-probability as implicit reward; eliminates reference model; adds target reward margin |
| RSO (Rejection Sampling Optimization) | Liu et al. | 2023 | ICLR 2024 | Sources preference data from the target optimal policy using rejection sampling for better estimation |
| Online DPO | Guo et al. | 2024 | Various | Generates new preference pairs on-the-fly during training to reduce distribution shift |
| Self-Play Preference Optimization (SPPO) | Wu et al. | 2024 | Various | Treats alignment as a two-player game; iteratively improves through self-play |
Azar et al. (2023) identified that DPO relies on the Bradley-Terry assumption, which may not hold for real human preferences. They proposed a more general objective called ΨPO that expresses the loss directly in terms of pairwise preference probabilities rather than pointwise rewards. IPO is a specific instance of ΨPO using the identity mapping, which bypasses both the reward model approximation and the Bradley-Terry assumption. IPO has demonstrated empirical improvements over DPO in settings where the Bradley-Terry model is a poor fit for the data.
Ethayarajh et al. (2024) proposed KTO, which draws on Kahneman and Tversky's prospect theory from behavioral economics. Unlike DPO, which requires paired preference data (y_w, y_l), KTO works with unpaired binary signals: each response is independently labeled as either desirable or undesirable. This makes KTO applicable to datasets where paired comparisons are unavailable. The loss function applies asymmetric penalties, reflecting the human tendency to weigh losses more heavily than gains. KTO has matched or exceeded DPO performance at scales from 1B to 30B parameters.
Hong et al. (2024) introduced ORPO as a monolithic approach that eliminates both the reference model and the separate SFT stage. ORPO appends a log odds ratio term to the standard negative log-likelihood loss, applying a weak penalty to rejected responses and a strong adaptation signal to chosen responses. This allows alignment to be performed during the SFT phase in a single training run.
Meng et al. (2024) proposed SimPO, which replaces the log-likelihood ratio used in DPO with the average log-probability of the sequence. This change serves two purposes: it eliminates the need for a reference model (reducing memory requirements), and it better aligns the training objective with how models actually generate text (via average token-level probabilities rather than total sequence log-probability). SimPO also introduces a target reward margin γ to encourage a larger gap between winning and losing responses. In experiments with Llama 3 and Gemma 2, SimPO outperformed DPO by up to 6.4 points on AlpacaEval 2 and up to 7.5 points on Arena-Hard.
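Based on the description above, the SimPO objective can be sketched as follows; the β and γ values are illustrative placeholders, not the paper's tuned settings:

```python
import math

def log_sigmoid(z):
    # numerically stable log(sigmoid(z))
    return -math.log1p(math.exp(-z)) if z >= 0 else z - math.log1p(math.exp(z))

def simpo_loss(logp_w, len_w, logp_l, len_l, beta=2.0, gamma=0.5):
    """SimPO loss from summed log-probabilities and sequence lengths.

    The implicit reward is the length-averaged log-probability scaled by beta;
    no reference model appears anywhere. gamma is the target reward margin.
    """
    r_w = beta * logp_w / len_w
    r_l = beta * logp_l / len_l
    return -log_sigmoid(r_w - r_l - gamma)

# chosen: average logp -2.0 over 10 tokens; rejected: average logp -3.0 over 12
loss = simpo_loss(-20.0, 10, -36.0, 12)
```

Compared with the DPO loss, the reference log-probabilities are gone and the per-token average makes long and short completions directly comparable.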
Despite its popularity, DPO has several known limitations:
DPO trains on a fixed, pre-collected preference dataset. If the distribution of responses in the training data differs significantly from the responses the policy generates at inference time, the learned preferences may not transfer well. This distribution shift problem is less severe in PPO-based RLHF because PPO generates new samples during training (on-policy learning).
Even though DPO does not train an explicit reward model, it is still susceptible to overoptimization. Empirical studies have observed a "hump-shaped" curve: as the policy diverges further from the reference (at higher KL budgets), the true quality of generations initially improves but then degrades, while the implicit reward continues to increase monotonically. This pattern mirrors the overoptimization behavior seen in RLHF with explicit reward models.
Rashidinejad and Tian (2025) identified two types of reward hacking in DPO. Type I arises from over-optimizing out-of-distribution actions due to partial data coverage. Type II stems from performance degradation of the initial model when preference data poorly covers high-reward actions. Maintaining a small KL divergence from the reference model is insufficient to prevent Type I reward hacking.
Models trained with DPO tend to generate increasingly longer responses throughout training. This length increase does not always correspond to improved quality, and in many cases the win-rate plateaus or decreases while response length continues to grow. Length-Desensitized DPO (LD-DPO) has been proposed to separate explicit length preference from other implicit preferences.
Yan et al. (2025) identified what they call the "3D-Properties" of DPO's implicit reward modeling: (1) drastic drop in rejected response likelihood, (2) degradation into response suppression rather than genuine preference learning, and (3) dispersion effects on unseen responses. These properties can cause the model to learn shallow heuristics rather than meaningful preference distinctions.
DPO's derivation assumes that human preferences follow the Bradley-Terry model. Real human preferences can be intransitive, context-dependent, or influenced by factors not captured by a scalar reward. When this assumption is violated, DPO's theoretical guarantees weaken. IPO was specifically designed to address this limitation.
DPO has been adopted across a wide range of language model training pipelines:
| Application | Description |
|---|---|
| Llama 3 post-training | Meta used DPO alongside SFT, rejection sampling, and PPO in the Llama 3 post-training pipeline |
| Mixtral 8x7B | Mistral AI applied DPO to the Mixtral mixture-of-experts model for instruction-following alignment |
| Phi-3 | Microsoft used DPO in the optimization of its small language model series |
| Gemma fine-tuning | Google's Gemma models can be aligned using SFT followed by DPO |
| Zephyr | The Zephyr model series used DPO on UltraFeedback data as a key alignment step |
| Research and open-source models | Widely used in the open-source community via Hugging Face TRL, LLaMA Factory, and PyTorch Torchtune |
Beyond text generation, DPO has been applied to other modalities, including text-to-image diffusion models, code generation, and mathematical reasoning. The simplicity of the DPO loss function makes it adaptable to any setting where paired preference data is available.
Several open-source frameworks provide production-ready implementations of DPO:
| Framework | Description |
|---|---|
| Hugging Face TRL | DPOTrainer with full integration into the Hugging Face ecosystem, supporting LoRA, quantization, and multi-GPU training |
| LLaMA Factory | Unified fine-tuning framework supporting DPO for Llama, Mistral, Gemma, and 100+ models |
| PyTorch Torchtune | Native PyTorch recipes for DPO training with Llama 2, Llama 3, Mistral, and Gemma |
| Axolotl | Community-driven fine-tuning tool with DPO support |
The following table traces the development of DPO and related preference optimization methods:
| Date | Event |
|---|---|
| 2017 | Schulman et al. introduce Proximal Policy Optimization (PPO), which becomes the standard RL algorithm for RLHF |
| March 2022 | Ouyang et al. publish InstructGPT, demonstrating RLHF with PPO for aligning GPT-3 |
| May 2023 | Rafailov et al. publish the DPO paper on arXiv |
| October 2023 | Azar et al. publish IPO, generalizing beyond the Bradley-Terry assumption |
| December 2023 | DPO is presented at NeurIPS 2023 |
| February 2024 | Ethayarajh et al. publish KTO, enabling alignment with unpaired feedback |
| March 2024 | Hong et al. publish ORPO, a monolithic reference-free approach |
| April 2024 | Meta releases Llama 3, which includes DPO in its post-training pipeline |
| May 2024 | Meng et al. publish SimPO, achieving state-of-the-art results without a reference model |
| 2024-2025 | Continued research into online DPO, self-play methods, and hybrid approaches |