# Direct Preference Optimization (DPO)

> Source: https://aiwiki.ai/wiki/direct_preference_optimization_dpo
> Updated: 2026-06-10
> Categories: AI Alignment, Deep Learning, Machine Learning, Natural Language Processing

**Direct Preference Optimization (DPO)** is a method for aligning [large language models](/wiki/large_language_model) with human preferences without requiring explicit [reward model](/wiki/reward_hacking) training or [reinforcement learning](/wiki/reinforcement_learning_rl). Introduced by Rafailov et al. in 2023, DPO reparameterizes the standard reinforcement learning from human feedback (RLHF) objective so that the optimal policy can be extracted in closed form, reducing the alignment problem to a simple classification [loss function](/wiki/loss_function) over preference pairs.[1] The method was published at [NeurIPS](/wiki/neurips) 2023 under the title "Your Language Model is Secretly a Reward Model."[1]

DPO has become one of the most widely adopted post-training alignment techniques in modern [natural language processing](/wiki/natural_language_understanding), used in the training pipelines of models such as [Llama 3](/wiki/llama_3), [Mixtral](/wiki/mixtral), and [Phi-3](/wiki/phi).[14]

## ELI5 (Explain like I'm 5)

Imagine you are teaching a dog a new trick. Each time, the dog tries the trick two different ways and you say "this one is better." Eventually the dog learns which way you like more. DPO works the same way: instead of giving a language model a score for every single answer (which is complicated), you just show it pairs of answers and tell it which one you prefer. The model then adjusts itself to give more answers like the ones you picked. It skips the step of building a separate "scoring machine" and learns directly from your choices.

## Background and motivation

### The RLHF pipeline

[Reinforcement learning from human feedback](/wiki/reinforcement_learning) (RLHF) is the standard approach for aligning language models with human preferences.[2] The conventional RLHF pipeline consists of three stages:

1. **Supervised [fine-tuning](/wiki/fine_tuning) (SFT):** A pretrained language model is fine-tuned on high-quality demonstration data to produce a reference policy, denoted π_ref.[2]
2. **Reward model training:** A separate reward model r_φ(x, y) is trained on a dataset of human preference comparisons using the [Bradley-Terry model](/wiki/logistic_regression). Given a prompt x and two completions y_w (preferred) and y_l (dispreferred), the reward model is trained to assign higher reward to the preferred completion.[2]
3. **RL optimization:** The language model policy π_θ is optimized via a reinforcement learning algorithm, typically [Proximal Policy Optimization](/wiki/reinforcement_learning_rl) (PPO), to maximize the learned reward while staying close to the reference policy through a KL divergence penalty.[3]

### Problems with RLHF

While RLHF has been used successfully in systems like [ChatGPT](/wiki/chatgpt) and [Claude](/wiki/claude), the pipeline has several practical drawbacks:[1]

- **Complexity:** The three-stage pipeline requires training and maintaining multiple models (policy, reward model, value function, reference policy), which is difficult to implement and debug.[1]
- **Computational cost:** PPO-based training requires keeping four copies of the language model in memory simultaneously: the active policy, the reference policy, the reward model, and the value function.[1]
- **Instability:** PPO is sensitive to hyperparameter choices and can exhibit training instabilities, making reproducible results difficult to achieve.[3]
- **Reward hacking:** The policy can learn to exploit weaknesses in the reward model rather than genuinely improving quality, a phenomenon known as [reward hacking](/wiki/reward_hacking).

DPO was designed to address these issues by collapsing the reward modeling and RL optimization stages into a single supervised learning step.[1]

## Mathematical formulation

### The RLHF objective

The standard RLHF objective seeks a policy π_θ that maximizes expected reward while remaining close to the reference policy π_ref:[1]

```
max_π  E_{x~D, y~π(y|x)} [r(x, y)] - β · D_KL[π(y|x) || π_ref(y|x)]
```

Here, β > 0 controls the strength of the KL constraint. A larger β keeps the policy closer to the reference, while a smaller β allows more aggressive optimization of the reward.
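
To make the trade-off concrete, the sketch below evaluates the objective for a single prompt with three candidate completions. The toy distributions, the reward values, and the function name `kl_regularized_objective` are illustrative assumptions, not taken from the original paper.

```python
import math

def kl_regularized_objective(policy, ref_policy, rewards, beta):
    """Expected reward minus beta * KL(policy || ref_policy) for one prompt.

    All inputs are lists indexed by completion; this toy setting assumes
    a small, enumerable set of completions.
    """
    expected_reward = sum(p * r for p, r in zip(policy, rewards))
    kl = sum(p * math.log(p / q) for p, q in zip(policy, ref_policy) if p > 0)
    return expected_reward - beta * kl

ref = [0.5, 0.3, 0.2]        # pi_ref(y|x)
rewards = [0.0, 1.0, 2.0]    # r(x, y)
greedy = [0.01, 0.01, 0.98]  # puts almost all mass on the top-reward completion

# With a small beta the greedy policy scores higher than the reference;
# with a large beta the KL penalty dominates and the reference scores higher.
for beta in (0.1, 5.0):
    print(beta,
          kl_regularized_objective(greedy, ref, rewards, beta),
          kl_regularized_objective(ref, ref, rewards, beta))
```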

### The Bradley-Terry preference model

The [Bradley-Terry model](/wiki/logistic_regression) expresses the probability that completion y_1 is preferred over y_2 given prompt x as:

```
p*(y_1 ≻ y_2 | x) = σ(r*(x, y_1) - r*(x, y_2))
```

where σ is the [sigmoid function](/wiki/sigmoid_function) and r* is the true (latent) reward function. This model assumes that preferences depend only on the difference in rewards between the two completions.[1]

The reward model r_φ is typically trained by minimizing the negative log-likelihood under this model:

```
L_R(r_φ, D) = -E_{(x, y_w, y_l) ~ D} [log σ(r_φ(x, y_w) - r_φ(x, y_l))]
```
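
A minimal sketch of the preference probability and the reward-model loss, using scalar rewards in place of a neural reward model. The function names and the example reward values are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def preference_probability(reward_1, reward_2):
    """Bradley-Terry probability that completion 1 is preferred over completion 2."""
    return sigmoid(reward_1 - reward_2)

def reward_model_loss(reward_pairs):
    """Mean negative log-likelihood over (reward_preferred, reward_dispreferred) pairs."""
    losses = [-math.log(preference_probability(r_w, r_l)) for r_w, r_l in reward_pairs]
    return sum(losses) / len(losses)

# Only the reward difference matters: shifting both rewards changes nothing.
print(preference_probability(2.0, 1.0), preference_probability(12.0, 11.0))

# Pairs scored in the right order give a lower loss than the same pairs reversed.
print(reward_model_loss([(2.0, 0.5), (1.0, -1.0)]))
print(reward_model_loss([(0.5, 2.0), (-1.0, 1.0)]))
```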

### Deriving the optimal policy

The central insight of DPO comes from solving the KL-constrained optimization problem analytically. The optimal policy for the RLHF objective has a closed-form solution:[1]

```
π*(y|x) = (1/Z(x)) · π_ref(y|x) · exp((1/β) · r(x, y))
```

where Z(x) = Σ_y π_ref(y|x) · exp((1/β) · r(x, y)) is the partition function (a normalizing constant that depends only on x). The result follows by rewriting the objective as a KL divergence between π and this Gibbs distribution, which is minimized exactly when the two are equal.[1]
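
The closed form can be checked numerically when the completion set is small enough to enumerate. The sketch below is a toy verification under that assumption; the distributions, rewards, and helper names are illustrative only.

```python
import math

def optimal_policy(ref_policy, rewards, beta):
    """Return (pi_star, Z) for one prompt with an enumerable completion set."""
    weights = [q * math.exp(r / beta) for q, r in zip(ref_policy, rewards)]
    z = sum(weights)
    return [w / z for w in weights], z

def objective(policy, ref_policy, rewards, beta):
    """Expected reward minus beta * KL(policy || ref_policy)."""
    reward = sum(p * r for p, r in zip(policy, rewards))
    kl = sum(p * math.log(p / q) for p, q in zip(policy, ref_policy) if p > 0)
    return reward - beta * kl

ref = [0.5, 0.3, 0.2]
rewards = [0.0, 1.0, 2.0]
beta = 0.5

pi_star, z = optimal_policy(ref, rewards, beta)
best = objective(pi_star, ref, rewards, beta)

# No hand-picked alternative policy should beat the closed-form solution.
candidates = [ref, [0.1, 0.2, 0.7], [0.0, 0.0, 1.0], [1 / 3, 1 / 3, 1 / 3]]
print(all(objective(c, ref, rewards, beta) <= best + 1e-12 for c in candidates))

# At the optimum the objective value equals beta * log Z(x).
print(best, beta * math.log(z))
```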

### Reparameterizing the reward

Taking the logarithm of the optimal policy expression and solving for r expresses the reward in terms of its own optimal policy. DPO parameterizes that policy as π_θ, which gives:[1]

```
r(x, y) = β · log(π_θ(y|x) / π_ref(y|x)) + β · log Z(x)
```

This is the reparameterization step: instead of learning an explicit reward function, the reward is defined implicitly through the ratio of the current policy to the reference policy.[1]

### The DPO loss function

Substituting the reparameterized reward into the Bradley-Terry model yields:[1]

```
p(y_w ≻ y_l | x) = σ(β · log(π_θ(y_w|x) / π_ref(y_w|x)) - β · log(π_θ(y_l|x) / π_ref(y_l|x)))
```

The partition function Z(x) cancels out because it appears identically in both terms. This cancellation is what makes DPO tractable: the intractable normalizing constant disappears from the objective.[1]
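
The cancellation can be seen numerically on a toy problem. The sketch below assumes an enumerable completion set and made-up rewards; it builds the optimal policy, drops the β · log Z(x) term from the implicit reward, and confirms that the preference probability is unchanged.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

beta = 0.5
ref = [0.5, 0.3, 0.2]           # pi_ref(y|x) for three completions
true_rewards = [0.0, 1.0, 2.0]  # latent rewards r*(x, y), chosen arbitrarily

# Closed-form optimal policy for these rewards.
weights = [q * math.exp(r / beta) for q, r in zip(ref, true_rewards)]
z = sum(weights)
pi_star = [w / z for w in weights]

# Implicit rewards computed from log-ratios alone, without beta * log Z(x).
implicit = [beta * math.log(p / q) for p, q in zip(pi_star, ref)]

# Each implicit reward is off from the true reward by the same constant, beta * log Z(x),
offsets = [r - s for r, s in zip(true_rewards, implicit)]
print(offsets, beta * math.log(z))

# so preference probabilities computed from either version agree.
print(sigmoid(true_rewards[2] - true_rewards[0]), sigmoid(implicit[2] - implicit[0]))
```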

The final DPO training loss is the negative log-likelihood of the observed preferences:[1]

```
L_DPO(π_θ; π_ref) = -E_{(x, y_w, y_l) ~ D} [log σ(β · (log(π_θ(y_w|x)/π_ref(y_w|x)) - log(π_θ(y_l|x)/π_ref(y_l|x))))]
```

This can be interpreted as increasing the log-probability ratio of preferred completions relative to dispreferred completions, adjusted by how much the reference policy already favors each completion.
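
A minimal per-example sketch of this loss, assuming the four sequence log-probabilities have already been computed. The function names and numeric values are illustrative; real implementations operate on batched tensors.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    return -math.log1p(math.exp(-z)) if z >= 0 else z - math.log1p(math.exp(z))

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one (x, y_w, y_l) triplet given sequence log-probabilities."""
    margin = (policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l)
    return -log_sigmoid(beta * margin)

# When the policy equals the reference, the margin is 0 and the loss is log 2.
print(dpo_loss(-10.0, -12.0, -10.0, -12.0), math.log(2))

# Raising log pi(y_w|x) and lowering log pi(y_l|x) relative to the reference reduces the loss.
print(dpo_loss(-9.0, -13.0, -10.0, -12.0))
```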

### Gradient analysis

The gradient of the DPO loss with respect to the policy parameters θ has an intuitive form. It consists of three components:[1]

1. **A weighting coefficient** that scales the gradient based on how wrong the implicit reward estimate currently is. Examples where the model assigns incorrect relative rankings receive larger gradient updates.
2. **A positive term** that increases the likelihood of the preferred completion y_w.
3. **A negative term** that decreases the likelihood of the dispreferred completion y_l.

This weighting mechanism is essential for stability. Without it (as in naive "unlikelihood training"), the model tends to degenerate by uniformly suppressing all outputs rather than learning fine-grained preferences.[1]
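
The weighting coefficient is σ(r̂(x, y_l) − r̂(x, y_w)), where r̂ is the implicit reward β · log(π_θ/π_ref). The sketch below computes it for two hypothetical examples and checks by finite differences that the derivative of the loss with respect to log π_θ(y_w|x) is −β times this weight. All numeric values are made up for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta):
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -math.log(sigmoid(beta * margin))

def gradient_weight(logp_w, logp_l, ref_logp_w, ref_logp_l, beta):
    """sigma(r_hat(y_l) - r_hat(y_w)): close to 1 when the implicit ranking is wrong."""
    implicit_w = beta * (logp_w - ref_logp_w)
    implicit_l = beta * (logp_l - ref_logp_l)
    return sigmoid(implicit_l - implicit_w)

beta = 0.1
correct = (-8.0, -20.0, -10.0, -12.0)  # policy favors y_w more than the reference does
wrong = (-20.0, -8.0, -10.0, -12.0)    # policy favors y_l instead

print(gradient_weight(*correct, beta))  # below 0.5
print(gradient_weight(*wrong, beta))    # above 0.5

# Finite-difference check: d loss / d log pi(y_w|x) = -beta * weight.
eps = 1e-6
numeric = (dpo_loss(wrong[0] + eps, *wrong[1:], beta)
           - dpo_loss(wrong[0] - eps, *wrong[1:], beta)) / (2 * eps)
print(numeric, -beta * gradient_weight(*wrong, beta))
```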

## Training procedure

The practical DPO training procedure involves the following steps:

1. **Prepare a reference model:** Start with a supervised fine-tuned (SFT) model, which serves as π_ref. This model is frozen and used only for computing reference log-probabilities.[1]
2. **Collect preference data:** Obtain a dataset of triplets (x, y_w, y_l), where x is a prompt, y_w is the preferred completion, and y_l is the dispreferred completion. This data can come from human annotators, AI feedback, or existing preference datasets like Anthropic HH-RLHF or UltraFeedback.
3. **Compute log-probabilities:** For each training example, compute log π_θ(y_w|x), log π_θ(y_l|x), log π_ref(y_w|x), and log π_ref(y_l|x).
4. **Minimize the DPO loss:** Update the policy parameters θ using standard [gradient descent](/wiki/gradient_descent) (e.g., [AdamW](/wiki/optimizer)) to minimize L_DPO.
5. **Evaluate:** Assess alignment quality on held-out data using metrics such as win rate against the reference policy, reward model scores, or human evaluation.
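
Steps 3 and 4 can be sketched without a deep learning framework, assuming the per-token log-probabilities have already been produced by forward passes through the policy and the reference model. The dictionary keys, mask layout, and numbers below are hypothetical.

```python
import math

def sequence_logp(token_logps, completion_mask):
    """Sum per-token log-probabilities over completion tokens only (mask == 1)."""
    return sum(lp for lp, keep in zip(token_logps, completion_mask) if keep)

def dpo_batch_loss(batch, beta=0.1):
    """Mean DPO loss over examples with precomputed token log-probabilities."""
    total = 0.0
    for ex in batch:
        mask_w, mask_l = ex["mask_w"], ex["mask_l"]
        margin = (
            sequence_logp(ex["policy_w"], mask_w) - sequence_logp(ex["ref_w"], mask_w)
        ) - (
            sequence_logp(ex["policy_l"], mask_l) - sequence_logp(ex["ref_l"], mask_l)
        )
        total += math.log1p(math.exp(-beta * margin))  # equals -log(sigmoid(beta * margin))
    return total / len(batch)

# One toy example: two prompt tokens (masked out) followed by completion tokens.
batch = [{
    "mask_w": [0, 0, 1, 1, 1], "mask_l": [0, 0, 1, 1],
    "policy_w": [-1.0, -0.5, -0.2, -0.3, -0.1], "ref_w": [-1.0, -0.5, -0.4, -0.5, -0.3],
    "policy_l": [-1.0, -0.5, -1.5, -1.2], "ref_l": [-1.0, -0.5, -0.9, -0.8],
}]
print(dpo_batch_loss(batch))
```

Masking out prompt tokens matters because the loss is defined on log π(y|x), the probability of the completion given the prompt, not on the probability of the prompt itself.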

### Key hyperparameters

| Hyperparameter | Typical range | Effect |
|---|---|---|
| β (beta) | 0.1 to 0.5 | Controls divergence from [reference policy](/wiki/fine_tuning). Lower values allow more aggressive adaptation; higher values keep the model closer to the SFT baseline. |
| Learning rate | 1e-7 to 5e-6 | Step size for the optimizer. DPO typically uses lower rates than SFT. |
| Batch size | 32 to 128 | Larger batches provide more stable gradient estimates for the preference loss. |
| Epochs | 1 to 3 | DPO usually requires few epochs. Overtraining can lead to overfitting on the preference dataset. |
| Label smoothing | 0.0 to 0.1 | Optional smoothing applied to the preference labels to mitigate noise in human judgments. |
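
One common way to apply label smoothing, sometimes called conservative DPO, treats each preference label as flipped with probability ε and mixes the loss for both orderings. The sketch below assumes that formulation; the function name and values are illustrative.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    return -math.log1p(math.exp(-z)) if z >= 0 else z - math.log1p(math.exp(z))

def smoothed_dpo_loss(margin, beta=0.1, label_smoothing=0.0):
    """DPO loss assuming each label is wrong with probability label_smoothing.

    margin is the policy-vs-reference log-ratio of y_w minus that of y_l.
    """
    z = beta * margin
    return -(1.0 - label_smoothing) * log_sigmoid(z) - label_smoothing * log_sigmoid(-z)

# Without smoothing, larger margins always lower the loss. With smoothing,
# very large margins are penalized, which discourages overconfidence.
for margin in (0.0, 20.0, 200.0):
    print(margin, smoothed_dpo_loss(margin), smoothed_dpo_loss(margin, label_smoothing=0.1))
```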

### Implementation

DPO is straightforward to implement using standard [deep learning](/wiki/deep_learning) frameworks. The Hugging Face TRL (Transformer Reinforcement Learning) library provides a DPOTrainer class that handles the training loop:

```python
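from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

# Placeholder checkpoint and dataset; substitute your own SFT model and preference data.
model_name = "Qwen/Qwen2-0.5B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_name)      # trainable policy π_θ
ref_model = AutoModelForCausalLM.from_pretrained(model_name)  # frozen reference π_ref
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Each row needs "chosen" and "rejected" completions (and optionally a "prompt").
preference_dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

training_args = DPOConfig(
    beta=0.1,  # strength of the KL constraint
    learning_rate=5e-7,
    per_device_train_batch_size=4,
    num_train_epochs=1,
    output_dir="./dpo_output",
)

trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,
    args=training_args,
    processing_class=tokenizer,  # named `tokenizer` in older TRL releases
    train_dataset=preference_dataset,
)
trainer.train()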
```

## Comparison with RLHF and PPO

The following table summarizes the main differences between DPO and the standard RLHF pipeline using PPO:

| Aspect | RLHF with PPO | DPO |
|---|---|---|
| Reward model | Requires training a separate explicit reward model | Implicit; reward is defined through the policy itself |
| Optimization method | Reinforcement learning (PPO) | Supervised learning ([gradient descent](/wiki/gradient_descent)) |
| Models in memory | Four (policy, reference, reward model, value function) | Two (policy and reference) |
| Data generation | Online: generates new samples during training | Offline: uses a fixed preference dataset |
| Hyperparameter sensitivity | High; many RL-specific hyperparameters (clip ratio, GAE lambda, etc.) | Low; primarily β and standard training hyperparameters |
| Implementation complexity | High; requires RL infrastructure | Low; standard supervised training loop |
| Training stability | Can be unstable; reward hacking is common | Generally more stable, though overfitting is possible |
| Computational cost | Expensive (sampling, reward evaluation, value estimation) | Cheaper (forward passes through two models only) |
| Theoretical optimality | Optimal policy under certain conditions | Equivalent to RLHF when Bradley-Terry model holds and data is sufficient |

### Experimental results from the original paper

Rafailov et al. (2023) evaluated DPO against PPO-based RLHF on three tasks:[1]

| Task | DPO performance vs. PPO |
|---|---|
| Sentiment control (IMDb) | DPO exceeded PPO in controlling generation sentiment |
| Summarization (TL;DR) | DPO matched or slightly improved summarization quality |
| Single-turn dialogue (Anthropic HH) | DPO matched or improved response quality |

Across all tasks, DPO achieved comparable or better performance while being simpler to implement and more computationally efficient.[1]

## Theoretical properties

### Equivalence to RLHF

Rafailov et al. proved that when the Bradley-Terry model perfectly fits the true preference distribution and the preference dataset has sufficient coverage, the global optimum of the DPO objective coincides with the global optimum of the RLHF objective.[1] In other words, DPO and RLHF converge to the same optimal policy under ideal conditions.

### Implicit reward model

The DPO policy implicitly defines a reward function:[1]

```
r_implicit(x, y) = β · log(π_θ(y|x) / π_ref(y|x))
```

This implicit reward can be extracted and used for evaluation or as a reward signal for other purposes, which is the basis of the paper's subtitle: "Your Language Model is Secretly a Reward Model."[1]
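
As a sketch of how the implicit reward might be used for evaluation, the snippet below computes the fraction of held-out pairs in which the preferred completion receives the higher implicit reward. The log-probability values and function names are hypothetical.

```python
def implicit_reward(policy_logp, ref_logp, beta=0.1):
    """beta * log(pi_theta(y|x) / pi_ref(y|x)) from sequence log-probabilities."""
    return beta * (policy_logp - ref_logp)

def reward_accuracy(pairs, beta=0.1):
    """Fraction of pairs where y_w gets a higher implicit reward than y_l.

    Each pair is (policy_logp_w, ref_logp_w, policy_logp_l, ref_logp_l).
    """
    correct = sum(
        implicit_reward(pw, rw, beta) > implicit_reward(pl, rl, beta)
        for pw, rw, pl, rl in pairs
    )
    return correct / len(pairs)

held_out = [
    (-9.0, -10.0, -13.0, -12.0),   # ranked correctly
    (-11.0, -10.0, -11.5, -12.0),  # ranked incorrectly
    (-7.0, -9.0, -9.0, -9.0),      # ranked correctly
    (-10.0, -10.0, -14.0, -12.0),  # ranked correctly
]
print(reward_accuracy(held_out))  # 0.75
```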

### Connection to maximum entropy RL

The KL-constrained RLHF objective is a special case of the maximum entropy reinforcement learning framework, where the entropy bonus is replaced by a KL penalty relative to a reference policy. The closed-form solution for the optimal policy in this setting is a Gibbs (Boltzmann) distribution, which is well studied in statistical physics and [Bayesian inference](/wiki/bayesian_neural_network).[1]

## Variants and extensions

Since the publication of DPO, numerous variants have been proposed to address its limitations or extend its capabilities.[10] The following table summarizes the most notable ones:

| Method | Authors | Year | Venue | Key innovation |
|---|---|---|---|---|
| [IPO](/wiki/direct_preference_optimization_dpo) (Identity Preference Optimization) | Azar et al. | 2023 | arXiv | Removes the Bradley-Terry assumption; uses a general preference objective (ΨPO) based directly on pairwise preference probabilities |
| [KTO](/wiki/kahneman_tversky_optimization) (Kahneman-Tversky Optimization) | Ethayarajh et al. | 2024 | arXiv | Uses unpaired binary feedback (desirable/undesirable) instead of paired preferences; grounded in prospect theory |
| [ORPO](/wiki/direct_preference_optimization_dpo) (Odds Ratio Preference Optimization) | Hong et al. | 2024 | EMNLP 2024 | Monolithic method that combines SFT and preference alignment in a single step; no reference model needed |
| [SimPO](/wiki/simpo) (Simple Preference Optimization) | Meng et al. | 2024 | NeurIPS 2024 | Uses average log-probability as implicit reward; eliminates reference model; adds target reward margin |
| RSO (Rejection Sampling Optimization) | Liu et al. | 2023 | ICLR 2024 | Sources preference data from the target optimal policy using rejection sampling for better estimation |
| Online DPO | Guo et al. | 2024 | Various | Generates new preference pairs on-the-fly during training to reduce distribution shift |
| Self-Play Preference Optimization (SPPO) | Wu et al. | 2024 | Various | Treats alignment as a two-player game; iteratively improves through self-play |

### IPO (Identity Preference Optimization)

Azar et al. (2023) identified that DPO relies on the Bradley-Terry assumption, which may not hold for real human preferences.[4] They proposed a more general objective called ΨPO that expresses the loss directly in terms of pairwise preference probabilities rather than pointwise rewards.[4] IPO is a specific instance of ΨPO using the identity mapping, which bypasses both the reward model approximation and the Bradley-Terry assumption.[4] IPO has demonstrated empirical improvements over DPO in settings where the Bradley-Terry model is a poor fit for the data.[4]
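
A minimal sketch of the IPO loss in the squared-error form given by Azar et al., where the log-ratio margin is regressed toward the fixed target 1/(2τ). The function name and numeric values are illustrative.

```python
def ipo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, tau=0.1):
    """Squared distance between the log-ratio margin and the target 1 / (2 * tau)."""
    margin = (policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l)
    return (margin - 1.0 / (2.0 * tau)) ** 2

# With tau = 0.1 the target margin is 5. Unlike the DPO loss, which keeps
# decreasing as the margin grows, the IPO loss rises again past the target.
for policy_logp_w in (-10.0, -5.0, 0.0):
    print(policy_logp_w, ipo_loss(policy_logp_w, -12.0, -10.0, -12.0))
```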

### KTO (Kahneman-Tversky Optimization)

Ethayarajh et al. (2024) proposed KTO, which draws on [Kahneman and Tversky's prospect theory](https://en.wikipedia.org/wiki/Prospect_theory) from behavioral economics.[5] Unlike DPO, which requires paired preference data (y_w, y_l), KTO works with unpaired binary signals: each response is independently labeled as either desirable or undesirable.[5] This makes KTO applicable to datasets where paired comparisons are unavailable. The loss function applies asymmetric penalties, reflecting the human tendency to weigh losses more heavily than gains.[5] KTO has matched or exceeded DPO performance at scales from 1B to 30B parameters.[5]
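
The following is a simplified per-example sketch of a KTO-style loss. It assumes the reference point is supplied as a constant, whereas the paper estimates it from a KL term over the batch, and the weights `lambda_d` and `lambda_u` are arbitrary illustrative values.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def kto_loss(policy_logp, ref_logp, desirable, reference_point=0.0,
             beta=0.1, lambda_d=1.0, lambda_u=1.0):
    """Simplified KTO-style loss for one response with a binary label.

    reference_point stands in for the batch-level KL estimate (an assumption here).
    """
    log_ratio = policy_logp - ref_logp
    if desirable:
        return lambda_d * (1.0 - sigmoid(beta * (log_ratio - reference_point)))
    return lambda_u * (1.0 - sigmoid(beta * (reference_point - log_ratio)))

# A desirable response is rewarded for a higher log-ratio; an undesirable one for a lower log-ratio.
print(kto_loss(-8.0, -10.0, desirable=True), kto_loss(-12.0, -10.0, desirable=True))
print(kto_loss(-12.0, -10.0, desirable=False), kto_loss(-8.0, -10.0, desirable=False))

# Asymmetric weights penalize a bad response more than they credit a good one.
print(kto_loss(-10.0, -10.0, desirable=False, lambda_u=1.5),
      kto_loss(-10.0, -10.0, desirable=True, lambda_d=1.0))
```

Because each response is scored on its own, no paired (y_w, y_l) comparison is needed to compute the loss.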

### ORPO (Odds Ratio Preference Optimization)

Hong et al. (2024) introduced ORPO as a monolithic approach that eliminates both the reference model and the separate SFT stage.[6] ORPO appends a log odds ratio term to the standard negative log-likelihood loss, applying a weak penalty to rejected responses and a strong adaptation signal to chosen responses.[6] This allows alignment to be performed during the SFT phase in a single [training run](/wiki/training_run).[6]
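
A minimal per-example sketch of an ORPO-style loss, assuming the likelihood of a response is taken as the exponential of its average token log-probability and that the odds-ratio term is weighted by a hypothetical coefficient `lam`. Function names and values are illustrative.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    return -math.log1p(math.exp(-z)) if z >= 0 else z - math.log1p(math.exp(z))

def log_odds(avg_logp):
    """log(p / (1 - p)) for p = exp(avg_logp); requires avg_logp < 0."""
    return avg_logp - math.log1p(-math.exp(avg_logp))

def orpo_loss(avg_logp_w, avg_logp_l, lam=0.1):
    """SFT negative log-likelihood on y_w plus a weighted log odds ratio penalty."""
    sft_term = -avg_logp_w
    odds_ratio_term = -log_sigmoid(log_odds(avg_logp_w) - log_odds(avg_logp_l))
    return sft_term + lam * odds_ratio_term

# No reference model appears anywhere: only the policy's own log-probabilities are used.
print(orpo_loss(-0.5, -1.5))  # chosen response more likely than rejected
print(orpo_loss(-1.5, -0.5))  # rejected response more likely than chosen
```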

### SimPO (Simple Preference Optimization)

Meng et al. (2024) proposed SimPO, which replaces the log-likelihood ratio used in DPO with the average log-probability of the sequence.[7] This change serves two purposes: it eliminates the need for a reference model (reducing memory requirements), and it better aligns the training objective with how models actually generate text (via average token-level probabilities rather than total sequence log-probability).[7] SimPO also introduces a target reward margin γ to encourage a larger gap between winning and losing responses.[7] In experiments with [Llama 3](/wiki/llama_3) and [Gemma 2](/wiki/gemma), SimPO outperformed DPO by up to 6.4 points on AlpacaEval 2 and up to 7.5 points on Arena-Hard.[7]
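
A minimal per-example sketch of the SimPO loss as described above: length-normalized log-probabilities serve as rewards and a margin γ is subtracted before the sigmoid. The default values for `beta` and `gamma` here are placeholders, not recommended settings.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    return -math.log1p(math.exp(-z)) if z >= 0 else z - math.log1p(math.exp(z))

def simpo_loss(sum_logp_w, len_w, sum_logp_l, len_l, beta=2.0, gamma=0.5):
    """SimPO loss from total sequence log-probabilities and token counts."""
    reward_w = beta * sum_logp_w / len_w
    reward_l = beta * sum_logp_l / len_l
    return -log_sigmoid(reward_w - reward_l - gamma)

# Length normalization: doubling the length at the same per-token
# log-probability leaves the loss unchanged.
print(simpo_loss(-5.0, 10, -12.0, 10), simpo_loss(-10.0, 20, -12.0, 10))

# A larger target margin gamma demands a bigger gap and raises the loss.
print(simpo_loss(-5.0, 10, -12.0, 10, gamma=0.0), simpo_loss(-5.0, 10, -12.0, 10, gamma=1.0))
```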

## Limitations and criticisms

Despite its popularity, DPO has several known limitations:

### Offline data and distribution shift

DPO trains on a fixed, pre-collected preference dataset. If the distribution of responses in the training data differs significantly from the responses the policy generates at inference time, the learned preferences may not transfer well. This distribution shift problem is less severe in PPO-based RLHF because PPO generates new samples during training (on-policy learning).[10]

### Overoptimization

Even though DPO does not train an explicit reward model, it is still susceptible to overoptimization. Empirical studies have observed a "hump-shaped" curve: as the policy diverges further from the reference (at higher KL budgets), the true quality of generations initially improves but then degrades, while the implicit reward continues to increase monotonically.[10] This pattern mirrors the overoptimization behavior seen in RLHF with explicit reward models.

### Reward hacking

Rashidinejad and Tian (2025) identified two types of reward hacking in DPO.[11] Type I arises from over-optimizing out-of-distribution actions due to partial data coverage. Type II stems from performance degradation of the initial model when preference data poorly covers high-reward actions. Maintaining a small KL divergence from the reference model is insufficient to prevent Type I reward hacking.[11]

### Verbosity bias

Models trained with DPO tend to generate increasingly long responses as training progresses. This length increase does not always correspond to improved quality, and in many cases the win rate plateaus or decreases while response length continues to grow.[7] Length-Desensitized DPO (LD-DPO) has been proposed to separate explicit length preference from other implicit preferences.

### Implicit reward degradation

Yan et al. (2025) identified what they call the "3D-Properties" of DPO's implicit reward modeling: (1) drastic drop in rejected response likelihood, (2) degradation into response suppression rather than genuine preference learning, and (3) dispersion effects on unseen responses.[12] These properties can cause the model to learn shallow heuristics rather than meaningful preference distinctions.[12]

### Dependence on the Bradley-Terry assumption

DPO's derivation assumes that human preferences follow the Bradley-Terry model.[1] Real human preferences can be intransitive, context-dependent, or influenced by factors not captured by a scalar reward. When this assumption is violated, DPO's theoretical guarantees weaken. IPO was specifically designed to address this limitation.[4]

## Applications

DPO has been adopted across a wide range of language model training pipelines:

| Application | Description |
|---|---|
| [Llama 3](/wiki/llama_3) post-training | Meta used DPO alongside SFT, rejection sampling, and PPO in the Llama 3 post-training pipeline |
| [Mixtral 8x7B](/wiki/mixtral) | Mistral AI applied DPO to the Mixtral mixture-of-experts model for instruction-following alignment |
| [Phi-3](/wiki/phi) | Microsoft used DPO in the optimization of its small language model series |
| Gemma fine-tuning | Google's [Gemma](/wiki/gemma) models can be aligned using SFT followed by DPO |
| [Zephyr](/wiki/zephyr) | The Zephyr model series used DPO on UltraFeedback data as a key alignment step[13] |
| Research and open-source models | Widely used in the open-source community via Hugging Face TRL, LLaMA Factory, and PyTorch Torchtune |

Beyond text generation, DPO has been applied to other modalities, including text-to-image [diffusion models](/wiki/diffusion_models), code generation, and mathematical reasoning. The simplicity of the DPO loss function makes it adaptable to any setting where paired preference data is available.

## Tooling and frameworks

Several open-source frameworks provide production-ready implementations of DPO:

| Framework | Description |
|---|---|
| Hugging Face [TRL](https://github.com/huggingface/trl) | DPOTrainer with full integration into the Hugging Face ecosystem, supporting LoRA, quantization, and multi-GPU training |
| [LLaMA Factory](https://github.com/hiyouga/LLaMA-Factory) | Unified fine-tuning framework supporting DPO for [Llama](/wiki/llama), [Mistral](/wiki/mistral), Gemma, and 100+ models |
| [PyTorch](/wiki/pytorch) Torchtune | Native [PyTorch](/wiki/pytorch) recipes for DPO training with Llama 2, Llama 3, Mistral, and Gemma |
| [Axolotl](https://github.com/OpenAccess-AI-Collective/axolotl) | Community-driven fine-tuning tool with DPO support |

## Timeline

The following table traces the development of DPO and related preference optimization methods:

| Date | Event |
|---|---|
| 2017 | Schulman et al. introduce [Proximal Policy Optimization](/wiki/reinforcement_learning_rl) (PPO), which becomes the standard RL algorithm for RLHF |
| March 2022 | Ouyang et al. publish InstructGPT, demonstrating RLHF with PPO for aligning [GPT-3](/wiki/gpt3) |
| May 2023 | Rafailov et al. publish the DPO paper on arXiv |
| October 2023 | Azar et al. publish IPO, generalizing beyond the Bradley-Terry assumption |
| December 2023 | DPO is presented at NeurIPS 2023 |
| February 2024 | Ethayarajh et al. publish KTO, enabling alignment with unpaired feedback |
| March 2024 | Hong et al. publish ORPO, a monolithic reference-free approach |
| April 2024 | Meta releases Llama 3, which includes DPO in its post-training pipeline |
| May 2024 | Meng et al. publish SimPO, achieving state-of-the-art results without a reference model |
| 2024-2025 | Continued research into online DPO, self-play methods, and hybrid approaches |

## See also

- [Reinforcement learning](/wiki/reinforcement_learning_rl)
- [Fine-tuning](/wiki/fine_tuning)
- [Large language model](/wiki/large_language_model)
- [Loss function](/wiki/loss_function)
- [Reward hacking](/wiki/reward_hacking)
- [Gradient descent](/wiki/gradient_descent)
- [Sigmoid function](/wiki/sigmoid_function)

## References

1. Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C. D., & Finn, C. (2023). "Direct Preference Optimization: Your Language Model is Secretly a Reward Model." *Proceedings of the 37th Conference on Neural Information Processing Systems (NeurIPS 2023)*. https://arxiv.org/abs/2305.18290
2. Ouyang, L., Wu, J., Jiang, X., et al. (2022). "Training language models to follow instructions with human feedback." *Advances in Neural Information Processing Systems, 35*. https://arxiv.org/abs/2203.02155
3. Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). "Proximal Policy Optimization Algorithms." *arXiv preprint arXiv:1707.06347*. https://arxiv.org/abs/1707.06347
4. Azar, M. G., Rowland, M., Piot, B., Guo, D., Calandriello, D., Valko, M., & Munos, R. (2023). "A General Theoretical Paradigm to Understand Learning from Human Preferences." *arXiv preprint arXiv:2310.12036*. https://arxiv.org/abs/2310.12036
5. Ethayarajh, K., Xu, W., Muennighoff, N., Jurafsky, D., & Kiela, D. (2024). "KTO: Model Alignment as Prospect Theoretic Optimization." *arXiv preprint arXiv:2402.01306*. https://arxiv.org/abs/2402.01306
6. Hong, J., Lee, N., & Thorne, J. (2024). "ORPO: Monolithic Preference Optimization without Reference Model." *Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP 2024)*. https://arxiv.org/abs/2403.07691
7. Meng, Y., Xia, M., & Chen, D. (2024). "SimPO: Simple Preference Optimization with a Reference-Free Reward." *Proceedings of the 38th Conference on Neural Information Processing Systems (NeurIPS 2024)*. https://arxiv.org/abs/2405.14734
8. Liu, T., Zhao, Y., Joshi, R., Khalman, M., Saleh, M., Liu, P. J., & Liu, J. (2023). "Statistical Rejection Sampling Improves Preference Optimization." *Proceedings of the Twelfth International Conference on Learning Representations (ICLR 2024)*. https://arxiv.org/abs/2309.06657
9. Wu, Y., Sun, Z., Yuan, H., Ji, K., Yang, Y., & Gu, Q. (2024). "Self-Play Preference Optimization for Language Model Alignment." *arXiv preprint arXiv:2405.00675*. https://arxiv.org/abs/2405.00675
10. Lambert, N. (2024). "Direct Alignment Algorithms." *RLHF Book*, Chapter 12. https://rlhfbook.com/c/12-direct-alignment
11. Rashidinejad, P. & Tian, A. (2025). "Understanding Reward Hacking in Direct Alignment Algorithms." *Proceedings of the Thirteenth International Conference on Learning Representations (ICLR 2025)*.
12. Yan, Y., et al. (2025). "3D-Properties of DPO: Drastic Drop, Degradation, and Dispersion." *arXiv preprint*.
13. Tunstall, L., Beeching, E., Lambert, N., et al. (2023). "Zephyr: Direct Distillation of LM Alignment." *arXiv preprint arXiv:2310.16944*. https://arxiv.org/abs/2310.16944
14. Dubey, A., et al. (2024). "The Llama 3 Herd of Models." *arXiv preprint arXiv:2407.21783*. https://arxiv.org/abs/2407.21783
