Direct Preference Optimization (DPO) is a method for aligning large language models with human preferences without requiring explicit reward model training or reinforcement learning. Introduced by Rafailov et al. in 2023, DPO reparameterizes the standard reinforcement learning from human feedback (RLHF) objective so that the optimal policy can be extracted in closed form, reducing the alignment problem to a simple classification loss over preference pairs. The method was published at NeurIPS 2023 under the title "Direct Preference Optimization: Your Language Model is Secretly a Reward Model."
DPO has become one of the most widely adopted post-training alignment techniques in modern natural language processing, used in the training pipelines of models such as Llama 3, Mixtral, and Phi-3.
Imagine you are training a dog by comparing tricks. Each time the dog performs two tricks, you point at one and say "this one is better." Eventually the dog learns which kinds of tricks you like more. DPO works the same way: instead of giving a language model a score for every single answer (which is complicated), you show it pairs of answers and tell it which one you prefer. The model then adjusts itself to give more answers like the ones you picked. It skips the step of building a separate "scoring machine" and learns directly from your choices.
Reinforcement learning from human feedback (RLHF) is the standard approach for aligning language models with human preferences. The conventional RLHF pipeline consists of three stages:

1. Supervised fine-tuning (SFT): a pretrained model is fine-tuned on demonstration data, producing the reference policy π_ref.
2. Reward modeling: a separate reward model r_φ is trained on human preference pairs, typically under the Bradley-Terry model.
3. RL optimization: the policy is optimized against the learned reward with an algorithm such as PPO, subject to a KL penalty that keeps it close to π_ref.
While RLHF has been used successfully in systems like ChatGPT and Claude, the pipeline has several practical drawbacks:

- It requires training and keeping in memory multiple models (policy, reference, reward model, and value function).
- PPO introduces many RL-specific hyperparameters and is sensitive to their settings.
- Training can be unstable, and the policy may exploit flaws in the learned reward model (reward hacking).
- Online sampling and reward evaluation during training make the pipeline computationally expensive.
DPO was designed to address these issues by collapsing the reward modeling and RL optimization stages into a single supervised learning step.
The standard RLHF objective seeks a policy π_θ that maximizes expected reward while remaining close to the reference policy π_ref:
max_π E_{x~D, y~π(y|x)} [r(x, y)] - β · D_KL[π(y|x) || π_ref(y|x)]
Here, β > 0 controls the strength of the KL constraint. A larger β keeps the policy closer to the reference, while a smaller β allows more aggressive optimization of the reward.
The Bradley-Terry model expresses the probability that completion y_1 is preferred over y_2 given prompt x as:
p*(y_1 ≻ y_2 | x) = σ(r*(x, y_1) - r*(x, y_2))
where σ is the sigmoid function and r* is the true (latent) reward function. This model assumes that preferences depend only on the difference in rewards between the two completions.
The reward model r_φ is typically trained by minimizing the negative log-likelihood under this model:
L_R(r_φ, D) = -E_{(x, y_w, y_l) ~ D} [log σ(r_φ(x, y_w) - r_φ(x, y_l))]
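A minimal numeric sketch of the Bradley-Terry probability and this loss, using illustrative scalar rewards (the specific values are made up for the example):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bt_preference_prob(r_w, r_l):
    # Bradley-Terry: P(y_w preferred over y_l) = sigma(r_w - r_l)
    return sigmoid(r_w - r_l)

def reward_model_nll(pairs):
    # pairs: list of (reward for chosen, reward for rejected) tuples
    # L_R = -mean log sigma(r(x, y_w) - r(x, y_l))
    return -sum(math.log(bt_preference_prob(rw, rl)) for rw, rl in pairs) / len(pairs)

# A chosen completion scoring 2.0 against a rejected one scoring 1.0
# is preferred with probability sigma(1.0), roughly 73%.
p = bt_preference_prob(2.0, 1.0)
loss = reward_model_nll([(2.0, 1.0), (0.5, -0.5)])
```

Driving the loss toward zero pushes the reward gap r(x, y_w) − r(x, y_l) toward +∞, which is why reward models are usually regularized or trained for few epochs.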
The central insight of DPO comes from solving the KL-constrained optimization problem analytically. The optimal policy for the RLHF objective has a closed-form solution:
π*(y|x) = (1/Z(x)) · π_ref(y|x) · exp((1/β) · r(x, y))
where Z(x) = Σ_y π_ref(y|x) · exp((1/β) · r(x, y)) is the partition function (a normalizing constant that depends only on x). This result follows from the calculus of variations and the Gibbs distribution.
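As a sanity check, the closed-form policy can be evaluated on a toy problem with three candidate completions (the reference distribution, rewards, and β below are illustrative):

```python
import math

beta = 0.5
pi_ref = [0.5, 0.3, 0.2]       # reference policy over three candidate completions
rewards = [1.0, 0.0, -1.0]     # reward assigned to each completion

# pi*(y|x) = (1/Z(x)) * pi_ref(y|x) * exp(r(x, y) / beta)
unnorm = [p * math.exp(r / beta) for p, r in zip(pi_ref, rewards)]
Z = sum(unnorm)                # partition function Z(x)
pi_star = [u / Z for u in unnorm]

def objective(pi):
    # the RLHF objective: expected reward minus beta * KL(pi || pi_ref)
    reward_term = sum(p * r for p, r in zip(pi, rewards))
    kl_term = sum(p * math.log(p / q) for p, q in zip(pi, pi_ref))
    return reward_term - beta * kl_term

# pi* is normalized, shifts mass toward the high-reward completion,
# and attains the known optimum value beta * log Z(x)
j_star = objective(pi_star)
j_ref = objective(pi_ref)
```

Substituting π* back into the objective shows its optimal value is exactly β·log Z(x), which the numerical check confirms.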
By rearranging the optimal policy expression, the reward function can be written in terms of the policy:
r(x, y) = β · log(π_θ(y|x) / π_ref(y|x)) + β · log Z(x)
This is the reparameterization step: instead of learning an explicit reward function, the reward is defined implicitly through the ratio of the current policy to the reference policy.
Substituting the reparameterized reward into the Bradley-Terry model yields:
p(y_w ≻ y_l | x) = σ(β · log(π_θ(y_w|x) / π_ref(y_w|x)) - β · log(π_θ(y_l|x) / π_ref(y_l|x)))
The partition function Z(x) cancels out because it appears identically in both terms. This cancellation is what makes DPO tractable: the intractable normalizing constant disappears from the objective.
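The cancellation can be checked numerically: computing the preference probability from the full reparameterized rewards, including an arbitrary β·log Z(x) term, gives the same result as using the log-ratios alone. The sequence probabilities below are illustrative:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

beta = 0.1
# toy per-sequence probabilities under the policy and the reference
pi_w, ref_w = 0.020, 0.010    # chosen completion
pi_l, ref_l = 0.004, 0.008    # rejected completion
log_Z = 3.7                   # arbitrary value of log Z(x): it should not matter

# full reparameterized rewards, r = beta * log(pi/ref) + beta * log Z(x)
r_w = beta * math.log(pi_w / ref_w) + beta * log_Z
r_l = beta * math.log(pi_l / ref_l) + beta * log_Z
p_full = sigmoid(r_w - r_l)

# same probability computed without ever touching Z(x)
p_ratio = sigmoid(beta * (math.log(pi_w / ref_w) - math.log(pi_l / ref_l)))
```

Changing `log_Z` to any other value leaves `p_full` unchanged, because the term appears identically in both rewards and subtracts out.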
The final DPO training loss is the negative log-likelihood of the observed preferences:
L_DPO(π_θ; π_ref) = -E_{(x, y_w, y_l) ~ D} [log σ(β · (log(π_θ(y_w|x)/π_ref(y_w|x)) - log(π_θ(y_l|x)/π_ref(y_l|x))))]
This can be interpreted as increasing the log-probability ratio of preferred completions relative to dispreferred completions, adjusted by how much the reference policy already favors each completion.
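A minimal reference implementation of this loss, operating on precomputed sequence log-probabilities (the numeric values in the example are illustrative):

```python
import math

def log_sigmoid(z):
    # numerically stable log(sigmoid(z)) = -log(1 + exp(-z))
    return -math.log1p(math.exp(-z)) if z >= 0 else z - math.log1p(math.exp(z))

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Per-example DPO loss from summed token log-probabilities.

    logp_*     : log pi_theta(y|x) for the chosen (w) / rejected (l) completion
    ref_logp_* : log pi_ref(y|x) for the same completions
    """
    ratio_w = logp_w - ref_logp_w      # log(pi_theta/pi_ref), chosen
    ratio_l = logp_l - ref_logp_l      # log(pi_theta/pi_ref), rejected
    return -log_sigmoid(beta * (ratio_w - ratio_l))

# At initialization pi_theta == pi_ref, so both ratios vanish and the
# loss equals -log sigma(0) = log 2 for every pair.
initial = dpo_loss(-12.0, -15.0, -12.0, -15.0)
# Raising the chosen log-prob and lowering the rejected one reduces the loss.
improved = dpo_loss(-11.0, -16.0, -12.0, -15.0)
```

In practice the log-probabilities are sums of per-token log-probs from two forward passes (policy and frozen reference), and the loss is averaged over a batch.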
The gradient of the DPO loss with respect to the policy parameters θ has an intuitive form. It consists of three components:

1. A term that increases the log-probability of the preferred completion y_w.
2. A term that decreases the log-probability of the dispreferred completion y_l.
3. A per-example weight, σ(r̂_θ(x, y_l) − r̂_θ(x, y_w)), where r̂_θ(x, y) = β · log(π_θ(y|x)/π_ref(y|x)) is the implicit reward. The weight is large precisely when the implicit reward model currently ranks the pair incorrectly, so wrongly ordered examples receive stronger updates.
This weighting mechanism is essential for stability. Without it (as in naive "unlikelihood training"), the model tends to degenerate by uniformly suppressing all outputs rather than learning fine-grained preferences.
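A small sketch of this example-level weight, σ(r̂_θ(x, y_l) − r̂_θ(x, y_w)) with r̂_θ = β · log(π_θ/π_ref), using illustrative log-ratio values:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

beta = 0.1

def gradient_weight(ratio_w, ratio_l):
    # sigma(r_hat_l - r_hat_w): large when the implicit reward currently
    # ranks the rejected completion above the chosen one
    return sigmoid(beta * ratio_l - beta * ratio_w)

# pair the model already orders correctly -> small weight, gentle update
w_easy = gradient_weight(ratio_w=5.0, ratio_l=-5.0)
# pair the model orders incorrectly -> weight near 1, strong update
w_hard = gradient_weight(ratio_w=-5.0, ratio_l=5.0)
```

Because the weight shrinks toward zero on pairs the model already gets right, training does not keep pushing rejected completions down indefinitely the way an unweighted unlikelihood term would.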
The practical DPO training procedure involves the following steps:

1. Start from a supervised fine-tuned (SFT) model and keep a frozen copy of it as the reference policy π_ref.
2. Collect or obtain a preference dataset of triples (x, y_w, y_l).
3. For each batch, compute the log-probabilities of y_w and y_l under both the policy and the reference model.
4. Minimize the DPO loss with a standard optimizer and learning-rate schedule.
| Hyperparameter | Typical range | Effect |
|---|---|---|
| β (beta) | 0.1 to 0.5 | Controls divergence from reference policy. Lower values allow more aggressive adaptation; higher values keep the model closer to the SFT baseline. |
| Learning rate | 1e-7 to 5e-6 | Standard learning rate for fine-tuning. DPO typically uses lower rates than SFT. |
| Batch size | 32 to 128 | Larger batches provide more stable gradient estimates for the preference loss. |
| Epochs | 1 to 3 | DPO usually requires few epochs. Overtraining can lead to overfitting on the preference dataset. |
| Label smoothing | 0.0 to 0.1 | Optional smoothing applied to the preference labels to mitigate noise in human judgments. |
DPO is straightforward to implement using standard deep learning frameworks. The Hugging Face TRL (Transformer Reinforcement Learning) library provides a `DPOTrainer` class that handles the training loop:

```python
from trl import DPOConfig, DPOTrainer

# model, ref_model, tokenizer, and preference_dataset are assumed to be
# loaded beforehand (e.g., via the transformers and datasets libraries)
training_args = DPOConfig(
    beta=0.1,
    learning_rate=5e-7,
    per_device_train_batch_size=4,
    num_train_epochs=1,
    output_dir="./dpo_output",
)

trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,
    args=training_args,
    processing_class=tokenizer,
    train_dataset=preference_dataset,
)
trainer.train()
```
The following table summarizes the main differences between DPO and the standard RLHF pipeline using PPO:
| Aspect | RLHF with PPO | DPO |
|---|---|---|
| Reward model | Requires training a separate explicit reward model | Implicit; reward is defined through the policy itself |
| Optimization method | Reinforcement learning (PPO) | Supervised learning (gradient descent) |
| Models in memory | Four (policy, reference, reward model, value function) | Two (policy and reference) |
| Data generation | Online: generates new samples during training | Offline: uses a fixed preference dataset |
| Hyperparameter sensitivity | High; many RL-specific hyperparameters (clip ratio, GAE lambda, etc.) | Low; primarily β and standard training hyperparameters |
| Implementation complexity | High; requires RL infrastructure | Low; standard supervised training loop |
| Training stability | Can be unstable; reward hacking is common | Generally more stable, though overfitting is possible |
| Computational cost | Expensive (sampling, reward evaluation, value estimation) | Cheaper (forward passes through two models only) |
| Theoretical optimality | Optimal policy under certain conditions | Equivalent to RLHF when Bradley-Terry model holds and data is sufficient |
Rafailov et al. (2023) evaluated DPO against PPO-based RLHF on three tasks:
| Task | DPO performance vs. PPO |
|---|---|
| Sentiment control (IMDb) | DPO exceeded PPO in controlling generation sentiment |
| Summarization (TL;DR) | DPO matched or slightly improved summarization quality |
| Single-turn dialogue (Anthropic HH) | DPO matched or improved response quality |
Across all tasks, DPO achieved comparable or better performance while being simpler to implement and more computationally efficient.
Rafailov et al. proved that when the Bradley-Terry model perfectly fits the true preference distribution and the preference dataset has sufficient coverage, the global optimum of the DPO objective coincides with the global optimum of the RLHF objective. In other words, DPO and RLHF converge to the same optimal policy under ideal conditions.
The DPO policy implicitly defines a reward function:
r_implicit(x, y) = β · log(π_θ(y|x) / π_ref(y|x))
This implicit reward can be extracted and used for evaluation or as a reward signal for other purposes, which is the basis of the paper's subtitle: "Your Language Model is Secretly a Reward Model."
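Extracting this implicit reward requires only the two sequence log-probabilities. A minimal sketch (the log-probability values are illustrative):

```python
def implicit_reward(logp_policy, logp_ref, beta=0.1):
    # r_implicit(x, y) = beta * log(pi_theta(y|x) / pi_ref(y|x)),
    # computed from summed token log-probabilities
    return beta * (logp_policy - logp_ref)

# A completion the DPO-trained policy likes more than the reference did
# receives a positive implicit reward; one it downweighted gets a negative one.
r_up = implicit_reward(-10.0, -14.0)    # policy raised this completion's log-prob
r_down = implicit_reward(-18.0, -13.0)  # policy lowered this completion's log-prob
```

Scoring candidate responses this way turns the aligned policy itself into a pairwise reward model, which is what the paper's subtitle refers to.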
The KL-constrained RLHF objective is a special case of the maximum entropy reinforcement learning framework, where the entropy bonus is replaced by a KL penalty relative to a reference policy. The closed-form solution for the optimal policy in this setting is a Gibbs (Boltzmann) distribution, which is well studied in statistical physics and Bayesian inference.
Since the publication of DPO, numerous variants have been proposed to address its limitations or extend its capabilities. The following table summarizes the most notable ones:
| Method | Authors | Year | Venue | Key innovation |
|---|---|---|---|---|
| IPO (Identity Preference Optimization) | Azar et al. | 2023 | arXiv | Removes the Bradley-Terry assumption; uses a general preference objective (ΨPO) based directly on pairwise preference probabilities |
| KTO (Kahneman-Tversky Optimization) | Ethayarajh et al. | 2024 | arXiv | Uses unpaired binary feedback (desirable/undesirable) instead of paired preferences; grounded in prospect theory |
| ORPO (Odds Ratio Preference Optimization) | Hong et al. | 2024 | EMNLP 2024 | Monolithic method that combines SFT and preference alignment in a single step; no reference model needed |
| SimPO (Simple Preference Optimization) | Meng et al. | 2024 | NeurIPS 2024 | Uses average log-probability as implicit reward; eliminates reference model; adds target reward margin |
| RSO (Rejection Sampling Optimization) | Liu et al. | 2023 | ICLR 2024 | Sources preference data from the target optimal policy using rejection sampling for better estimation |
| Online DPO | Guo et al. | 2024 | Various | Generates new preference pairs on-the-fly during training to reduce distribution shift |
| Self-Play Preference Optimization (SPPO) | Wu et al. | 2024 | Various | Treats alignment as a two-player game; iteratively improves through self-play |
Azar et al. (2023) identified that DPO relies on the Bradley-Terry assumption, which may not hold for real human preferences. They proposed a more general objective called ΨPO that expresses the loss directly in terms of pairwise preference probabilities rather than pointwise rewards. IPO is a specific instance of ΨPO using the identity mapping, which bypasses both the reward model approximation and the Bradley-Terry assumption. IPO has demonstrated empirical improvements over DPO in settings where the Bradley-Terry model is a poor fit for the data.
Ethayarajh et al. (2024) proposed KTO, which draws on Kahneman and Tversky's prospect theory from behavioral economics. Unlike DPO, which requires paired preference data (y_w, y_l), KTO works with unpaired binary signals: each response is independently labeled as either desirable or undesirable. This makes KTO applicable to datasets where paired comparisons are unavailable. The loss function applies asymmetric penalties, reflecting the human tendency to weigh losses more heavily than gains. KTO has matched or exceeded DPO performance at scales from 1B to 30B parameters.
Hong et al. (2024) introduced ORPO as a monolithic approach that eliminates both the reference model and the separate SFT stage. ORPO appends a log odds ratio term to the standard negative log-likelihood loss, applying a weak penalty to rejected responses and a strong adaptation signal to chosen responses. This allows alignment to be performed during the SFT phase in a single training run.
Meng et al. (2024) proposed SimPO, which replaces the log-likelihood ratio used in DPO with the average log-probability of the sequence. This change serves two purposes: it eliminates the need for a reference model (reducing memory requirements), and it better aligns the training objective with how models actually generate text (via average token-level probabilities rather than total sequence log-probability). SimPO also introduces a target reward margin γ to encourage a larger gap between winning and losing responses. In experiments with Llama 3 and Gemma 2, SimPO outperformed DPO by up to 6.4 points on AlpacaEval 2 and up to 7.5 points on Arena-Hard.
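Based on the description above, the SimPO objective can be sketched as follows; the β and γ values are illustrative placeholders, not the paper's tuned settings:

```python
import math

def log_sigmoid(z):
    # numerically stable log(sigmoid(z))
    return -math.log1p(math.exp(-z)) if z >= 0 else z - math.log1p(math.exp(z))

def simpo_loss(logp_w, len_w, logp_l, len_l, beta=2.0, gamma=0.5):
    """SimPO loss from summed log-probabilities and sequence lengths.

    The implicit reward is the length-averaged log-probability scaled by beta;
    no reference model appears anywhere. gamma is the target reward margin.
    """
    r_w = beta * logp_w / len_w
    r_l = beta * logp_l / len_l
    return -log_sigmoid(r_w - r_l - gamma)

# chosen: average logp -2.0 over 10 tokens; rejected: average logp -3.0 over 12
loss = simpo_loss(-20.0, 10, -36.0, 12)
```

Compared with the DPO loss, the reference log-probabilities are gone and the per-token average makes long and short completions directly comparable.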
Despite its popularity, DPO has several known limitations:
DPO trains on a fixed, pre-collected preference dataset. If the distribution of responses in the training data differs significantly from the responses the policy generates at inference time, the learned preferences may not transfer well. This distribution shift problem is less severe in PPO-based RLHF because PPO generates new samples during training (on-policy learning).
Even though DPO does not train an explicit reward model, it is still susceptible to overoptimization. Empirical studies have observed a "hump-shaped" curve: as the policy diverges further from the reference (at higher KL budgets), the true quality of generations initially improves but then degrades, while the implicit reward continues to increase monotonically. This pattern mirrors the overoptimization behavior seen in RLHF with explicit reward models.
Rashidinejad and Tian (2025) identified two types of reward hacking in DPO. Type I arises from over-optimizing out-of-distribution actions due to partial data coverage. Type II stems from performance degradation of the initial model when preference data poorly covers high-reward actions. Maintaining a small KL divergence from the reference model is insufficient to prevent Type I reward hacking.
Models trained with DPO tend to generate increasingly longer responses throughout training. This length increase does not always correspond to improved quality, and in many cases the win-rate plateaus or decreases while response length continues to grow. Length-Desensitized DPO (LD-DPO) has been proposed to separate explicit length preference from other implicit preferences.
Yan et al. (2025) identified what they call the "3D-Properties" of DPO's implicit reward modeling: (1) drastic drop in rejected response likelihood, (2) degradation into response suppression rather than genuine preference learning, and (3) dispersion effects on unseen responses. These properties can cause the model to learn shallow heuristics rather than meaningful preference distinctions.
DPO's derivation assumes that human preferences follow the Bradley-Terry model. Real human preferences can be intransitive, context-dependent, or influenced by factors not captured by a scalar reward. When this assumption is violated, DPO's theoretical guarantees weaken. IPO was specifically designed to address this limitation.
DPO has been adopted across a wide range of language model training pipelines:
| Application | Description |
|---|---|
| Llama 3 post-training | Meta used DPO alongside SFT, rejection sampling, and PPO in the Llama 3 post-training pipeline |
| Mixtral 8x7B | Mistral AI applied DPO to the Mixtral mixture-of-experts model for instruction-following alignment |
| Phi-3 | Microsoft used DPO in the optimization of its small language model series |
| Gemma fine-tuning | Google's Gemma models can be aligned using SFT followed by DPO |
| Zephyr | The Zephyr model series used DPO on UltraFeedback data as a key alignment step |
| Research and open-source models | Widely used in the open-source community via Hugging Face TRL, LLaMA Factory, and PyTorch Torchtune |
Beyond text generation, DPO has been applied to other modalities, including text-to-image diffusion models, code generation, and mathematical reasoning. The simplicity of the DPO loss function makes it adaptable to any setting where paired preference data is available.
Several open-source frameworks provide production-ready implementations of DPO:
| Framework | Description |
|---|---|
| Hugging Face TRL | DPOTrainer with full integration into the Hugging Face ecosystem, supporting LoRA, quantization, and multi-GPU training |
| LLaMA Factory | Unified fine-tuning framework supporting DPO for Llama, Mistral, Gemma, and 100+ models |
| PyTorch Torchtune | Native PyTorch recipes for DPO training with Llama 2, Llama 3, Mistral, and Gemma |
| Axolotl | Community-driven fine-tuning tool with DPO support |
The following table traces the development of DPO and related preference optimization methods:
| Date | Event |
|---|---|
| 2017 | Schulman et al. introduce Proximal Policy Optimization (PPO), which becomes the standard RL algorithm for RLHF |
| March 2022 | Ouyang et al. publish InstructGPT, demonstrating RLHF with PPO for aligning GPT-3 |
| May 2023 | Rafailov et al. publish the DPO paper on arXiv |
| October 2023 | Azar et al. publish IPO, generalizing beyond the Bradley-Terry assumption |
| December 2023 | DPO is presented at NeurIPS 2023 |
| February 2024 | Ethayarajh et al. publish KTO, enabling alignment with unpaired feedback |
| March 2024 | Hong et al. publish ORPO, a monolithic reference-free approach |
| April 2024 | Meta releases Llama 3, which includes DPO in its post-training pipeline |
| May 2024 | Meng et al. publish SimPO, achieving state-of-the-art results without a reference model |
| 2024-2025 | Continued research into online DPO, self-play methods, and hybrid approaches |