DPO (Direct Preference Optimization) is an alignment technique for large language models that directly optimizes a language model policy from human preference data, without training a separate reward model or using reinforcement learning algorithms like Proximal Policy Optimization (PPO). The method was introduced in May 2023 by Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D. Manning, Stefano Ermon, and Chelsea Finn at Stanford University. DPO reformulates the reinforcement learning from human feedback (RLHF) objective into a simple binary cross-entropy loss over preference pairs. The key insight behind DPO is that the constrained reward maximization problem underlying RLHF admits a closed-form solution, allowing the reward function to be reparameterized in terms of the optimal policy itself. This eliminates the need for explicit reward modeling and reinforcement learning entirely, resulting in a training pipeline that is simpler, more stable, and less computationally expensive than traditional RLHF [1].
Since its publication, DPO has become one of the most widely adopted alignment methods in the open-source machine learning community and has been used by major organizations including Meta for the Llama model family, Mistral AI for Mixtral, Microsoft for Phi-3, and Alibaba for the Qwen family. The original paper was presented at NeurIPS 2023, where it received an Outstanding Paper Runner-up award, and has spawned a large family of variant algorithms collectively referred to as Direct Alignment Algorithms (DAAs) [2].
Aligning language models with human preferences is a central challenge in modern AI development. Without alignment, a pretrained language model will generate text that reflects the statistical patterns in its training data, which may include harmful, unhelpful, or undesirable content. The dominant approach to alignment prior to DPO was RLHF, a multi-stage pipeline that involves collecting human preference data, training a reward model on that data, and then fine-tuning the language model using a reinforcement learning algorithm (typically PPO) to maximize the learned reward [3].
RLHF was popularized by OpenAI's 2022 paper on InstructGPT, which used the technique to convert GPT-3 into a more helpful instruction-following assistant, and by Anthropic's work on Claude and Constitutional AI. The same recipe powered ChatGPT and most early production chatbots. While effective, RLHF has several well-documented practical difficulties. The pipeline requires training and maintaining a separate reward model, which introduces its own failure modes such as reward hacking, where the policy learns to exploit artifacts in the reward model rather than genuinely improving. PPO is notoriously sensitive to hyperparameters and can be unstable during training. The entire process requires sampling from the language model during training, which is computationally expensive. Running the full RLHF pipeline typically consumes 30 to 50 percent more compute than simpler fine-tuning approaches and demands four model copies in memory simultaneously: the active policy, a frozen reference policy, the reward model, and a value network used by PPO [3].
These difficulties motivated the search for simpler alignment methods. DPO emerged from the observation that the mathematical structure of the RLHF objective allows the reward model to be eliminated entirely from the optimization process. Rather than learning a reward function and then optimizing against it, the DPO authors showed that the policy itself can be treated as an implicit reward model, collapsing two stages into one.
The paper "Direct Preference Optimization: Your Language Model is Secretly a Reward Model" was posted to arXiv on May 29, 2023, and later presented at the 37th Conference on Neural Information Processing Systems (NeurIPS 2023), where it was selected as one of two Outstanding Main Track Runner-Up papers [1][2]. The authors were all affiliated with Stanford University at the time of publication, with Rafailov and Sharma as joint first authors and Finn and Ermon as senior authors.
The paper's title captures its central insight: a language model being optimized with DPO implicitly defines a reward model through the ratio of its output probabilities to those of a reference model. There is no need to explicitly parameterize or train a separate reward function because the policy itself encodes the reward.
The reference implementation was released as open-source code on GitHub [4], enabling rapid adoption by the research community. Within months, third-party reimplementations appeared in the Hugging Face TRL library, NVIDIA NeMo, Allen AI's open-instruct, and Axolotl, which made the algorithm accessible to anyone with a single GPU [5][6].
The mathematical derivation of DPO proceeds in several steps, beginning from the standard RLHF objective and arriving at a loss function that can be optimized directly with gradient descent.
The starting point is the KL-constrained reward maximization objective used in RLHF:
max_pi E_{x ~ D, y ~ pi(y|x)} [r(x, y)] - beta * D_KL[pi(y|x) || pi_ref(y|x)]
Here, pi is the policy (language model) being optimized, pi_ref is a reference policy (typically the supervised fine-tuned model), r(x, y) is the reward function, beta is a parameter controlling the strength of the KL constraint, and D_KL is the Kullback-Leibler divergence. The KL penalty prevents the optimized policy from deviating too far from the reference model, which helps maintain generation quality and diversity.
Rafailov et al. showed that this optimization problem has a closed-form solution. The optimal policy pi* satisfies:
pi*(y|x) = (1 / Z(x)) * pi_ref(y|x) * exp((1/beta) * r(x, y))
where Z(x) is a partition function (normalizing constant) that ensures the probabilities sum to one. This result was already known in the reinforcement learning literature as the solution to KL-regularized policy improvement and is closely related to the Boltzmann or Gibbs distribution in statistical physics. The DPO authors made a novel use of it by inverting the relationship.
By rearranging the expression for the optimal policy, the reward function can be expressed in terms of the policy and the reference model:
r(x, y) = beta * log(pi*(y|x) / pi_ref(y|x)) + beta * log Z(x)
This is the key reparameterization. It shows that the reward for any response y given prompt x can be recovered from the log-ratio of the optimal policy to the reference policy, plus a prompt-dependent constant.
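The closed-form solution and its inversion can be checked on a toy example. The sketch below uses made-up reference probabilities and rewards over three candidate responses and is purely illustrative:

```python
import math

# Toy example over three candidate responses to one prompt (values are arbitrary).
beta = 0.1
pi_ref = {"a": 0.5, "b": 0.3, "c": 0.2}    # reference probabilities
reward = {"a": 0.2, "b": 0.5, "c": -0.1}   # assumed "true" rewards

# Closed-form optimal policy: pi*(y) is proportional to pi_ref(y) * exp(r(y) / beta).
unnorm = {y: pi_ref[y] * math.exp(reward[y] / beta) for y in pi_ref}
Z = sum(unnorm.values())                   # partition function Z(x)
pi_star = {y: w / Z for y, w in unnorm.items()}

# Inverting the relationship recovers the reward up to the constant beta * log Z(x).
recovered = {y: beta * math.log(pi_star[y] / pi_ref[y]) + beta * math.log(Z) for y in pi_ref}
assert all(abs(recovered[y] - reward[y]) < 1e-9 for y in reward)
```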
Human preferences are modeled using the Bradley-Terry model, a standard framework for pairwise comparisons developed by Ralph Bradley and Milton Terry in 1952. Given a prompt x and two responses y_w (preferred, or "winner") and y_l (dispreferred, or "loser"), the probability that a human prefers y_w over y_l is:
P(y_w > y_l | x) = sigma(r(x, y_w) - r(x, y_l))
where sigma is the logistic sigmoid function. This model assumes that preferences depend only on the difference in latent rewards between the two completions and that comparisons are independent.
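For illustration, a minimal sketch of the Bradley-Terry probability; the reward values in the example calls are arbitrary:

```python
import math

def bradley_terry_prob(r_w, r_l):
    # Probability that the response with latent reward r_w is preferred over r_l.
    return 1.0 / (1.0 + math.exp(-(r_w - r_l)))

print(bradley_terry_prob(1.0, 1.0))  # 0.5: equal rewards give an even preference
print(bradley_terry_prob(1.5, 0.5))  # ~0.73: a one-point reward gap
```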
Substituting the reparameterized reward into the Bradley-Terry model yields the DPO objective. The partition function Z(x) cancels out because it depends only on the prompt and not on the response, giving:
L_DPO(pi_theta; pi_ref) = -E_{(x, y_w, y_l) ~ D} [log sigma(beta * (log(pi_theta(y_w|x) / pi_ref(y_w|x)) - log(pi_theta(y_l|x) / pi_ref(y_l|x))))]
This is a binary cross-entropy loss. For each preference pair in the dataset, the loss encourages the model to assign a higher log-probability ratio (relative to the reference model) to the preferred response than to the dispreferred response. The parameter beta controls how sharply the model should differentiate between preferred and dispreferred outputs. A smaller beta leads to more aggressive optimization, while a larger beta keeps the policy closer to the reference model.
The gradient of the DPO loss has an intuitive interpretation. It increases the probability of preferred responses and decreases the probability of dispreferred responses, weighted by how wrong the current model is. Specifically, examples where the model already strongly prefers the correct response contribute less to the gradient, while examples where the model incorrectly assigns higher probability to the dispreferred response contribute more. This implicit importance weighting is a natural consequence of the loss function's structure and helps prevent the model from overfitting to easy examples [1].
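The loss and its gradient weighting can be written compactly. The PyTorch sketch below assumes precomputed per-sequence log-probabilities and is illustrative rather than a reference implementation:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Per-example DPO loss from summed sequence log-probabilities.

    Each argument is a tensor of shape (batch,) holding log pi(y|x) summed
    over the response tokens for the policy or the frozen reference model.
    """
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    margin = beta * (chosen_logratio - rejected_logratio)
    loss = -F.logsigmoid(margin)
    # sigmoid(-margin) is the implicit gradient weight: large when the model
    # still prefers the rejected response, small once the pair is ordered correctly.
    grad_weight = torch.sigmoid(-margin)
    return loss.mean(), grad_weight
```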
DPO and RLHF both use human preference data to align language models, but they differ fundamentally in how they convert preferences into training updates.
The RLHF pipeline consists of three stages: (1) supervised fine-tuning (SFT) of the base model, (2) training a reward model on preference data, and (3) optimizing the SFT model against the reward model using PPO. DPO simplifies this to two stages: (1) SFT, and (2) direct optimization on preference data using the DPO loss. The reward model training and PPO stages are both eliminated.
| Aspect | RLHF (with PPO) | DPO |
|---|---|---|
| Number of models during training | 4 (policy, reference, reward, value) | 2 (policy, reference) |
| Reward model required | Yes (explicit) | No (implicit) |
| RL algorithm required | Yes (PPO) | No (supervised) |
| Sampling during training | Yes (on-policy generation) | No (offline, fixed dataset) |
| Hyperparameter sensitivity | High (PPO has many hyperparameters) | Low (mainly beta) |
| Computational cost | Higher (4 forward passes plus sampling) | Lower (2 forward passes) |
| Implementation complexity | Significant (full RL stack) | Modest (a few hundred lines) |
| Memory footprint | Very high | Moderate |
| Distribution shift handling | Inherent (on-policy) | Requires iterative variants |
| Reward hacking susceptibility | Direct (via reward model) | Indirect (via implicit reward) |
The simplicity of the DPO approach has driven much of its adoption. DPO reduces the alignment pipeline to a single supervised learning problem. There is no need to implement or debug a reinforcement learning loop, maintain a separate reward model, or deal with the engineering challenges of on-policy sampling during training. Practitioners report that a working DPO training run can be set up in an afternoon using TRL or Axolotl, while a comparable PPO pipeline often takes weeks of debugging.
Stability is another significant benefit. PPO training is prone to instabilities, including reward hacking, mode collapse, and divergence. DPO avoids these issues because it does not use an explicit reward model that can be exploited, and its loss function is a well-behaved classification objective. The optimization dynamics of cross-entropy losses are well understood, while PPO's clipping ratios, value function coefficients, and generalized advantage estimation introduce many interacting moving parts.
Lower compute requirements are critical for open-source adoption. Because DPO does not require sampling from the model during training or maintaining four separate models in memory simultaneously, it uses substantially less GPU memory and wall-clock time. Meta reported that for Llama 3, DPO required less compute than PPO for large-scale alignment and performed better on instruction-following benchmarks like IFEval [7].
DPO can be implemented in a few dozen lines of code on top of standard language model training infrastructure. Reference implementations are available in popular libraries including Hugging Face's TRL (Transformer Reinforcement Learning) library, allowing teams without dedicated RL engineers to perform alignment research [5].
Despite DPO's practical benefits, RLHF with PPO retains advantages in certain settings. Because PPO is an on-policy algorithm, it generates fresh samples from the current policy during training, which helps it adapt to the evolving distribution of the model. DPO, by contrast, is an offline algorithm that trains on a fixed dataset of preferences. This means DPO can suffer from distribution shift: as the model improves during training, the preference data, collected from a different earlier model, may become less relevant.
Research published in 2025 found that in approximate optimization settings, RLHF requires significantly fewer samples than DPO to recover an effective reward model, highlighting a statistical advantage of the two-stage learning approach in some regimes [8]. RLHF can also handle more nuanced feedback signals beyond simple pairwise preferences, which makes it more suitable for high-stakes applications requiring fine-grained alignment. For frontier labs with the engineering resources to operate full RL stacks, PPO and its successors (such as Group Relative Policy Optimization, GRPO) remain attractive options.
The practical DPO pipeline involves four stages, the first three shared with conventional RLHF: (1) pretraining or selecting a base model, (2) supervised fine-tuning (SFT) on demonstration data, (3) collecting pairwise preference data, and (4) optimizing the SFT model with the DPO loss, which replaces RLHF's reward modeling and PPO stages.
The SFT step is critical because the reference policy anchors the KL divergence term. Skipping SFT, or using a poorly fit SFT model, generally produces unstable DPO training. Some variants such as ORPO eliminate the separate SFT stage by combining both objectives into a single training run.
DPO has fewer hyperparameters than PPO, but a small number of choices have outsized effects on the final model.
| Hyperparameter | Typical range | Notes |
|---|---|---|
| beta | 0.01 to 0.5 | Controls divergence from reference. Llama 3 used 0.1 [7]. Smaller values allow more aggressive adaptation; larger values stay close to SFT |
| Learning rate | 1e-7 to 1e-5 | Llama 3 used 1e-5; Zephyr used 5e-7. Lower than SFT learning rates by an order of magnitude |
| Batch size | 32 to 128 | Larger batches stabilize the preference loss; gradient accumulation often used |
| Epochs | 1 to 3 | Few epochs typical; longer training risks overfitting and reward hacking |
| Optimizer | AdamW | Standard choice with weight decay around 0.01 to 0.1 |
| Label smoothing | 0.0 to 0.1 | Optional; mitigates noisy preference labels (used in cDPO variant) |
| LR schedule | Cosine or linear with warmup | Warmup over 10 percent of steps is common |
| Sequence length | 1024 to 8192 | Longer contexts increase memory requirements substantially |
Meta's Llama 3 team also added a small auxiliary negative log-likelihood loss on chosen responses with a coefficient of 0.2, which prevented the log-probability of preferred completions from collapsing during training [7]. They masked special formatting tokens (chat template headers and end-of-turn markers) from both chosen and rejected responses to prevent the model from learning superficial cues about preference labels.
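As a sketch only, such an auxiliary term can be folded into the loss roughly as follows; dpo_loss is the illustrative helper sketched earlier, and the per-token normalization of the NLL term is an assumption rather than a detail reported by Meta:

```python
def dpo_with_nll(policy_chosen_logps, policy_rejected_logps,
                 ref_chosen_logps, ref_rejected_logps,
                 chosen_token_counts, beta=0.1, nll_coeff=0.2):
    # Preference term: the standard DPO loss sketched earlier.
    preference_loss, _ = dpo_loss(policy_chosen_logps, policy_rejected_logps,
                                  ref_chosen_logps, ref_rejected_logps, beta)
    # Auxiliary NLL on the chosen responses (normalized per token here),
    # which keeps their absolute log-probability from collapsing during training.
    nll_loss = -(policy_chosen_logps / chosen_token_counts).mean()
    return preference_loss + nll_coeff * nll_loss
```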
The success and limitations of DPO have inspired a large family of variant algorithms, sometimes referred to collectively as Direct Alignment Algorithms (DAAs). A 2024 survey catalogued more than 30 distinct methods, each addressing specific shortcomings or adapting the core idea to different settings [9]. The table below summarizes the most influential variants.
| Method | Authors and year | Venue | Key innovation | Pairing required | Reference model |
|---|---|---|---|---|---|
| DPO | Rafailov et al. 2023 | NeurIPS 2023 | Reparameterizes reward through policy log-ratios; original method | Yes | Yes |
| IPO | Azar et al. 2023 | AISTATS 2024 | Replaces sigmoid with identity loss to bound the implicit reward gap | Yes | Yes |
| KTO | Ethayarajh et al. 2024 | ICML 2024 | Uses unpaired binary feedback; grounded in Kahneman-Tversky prospect theory | No | Yes |
| ORPO | Hong et al. 2024 | EMNLP 2024 | Combines SFT and preference optimization in one stage; no reference model | Yes | No |
| SimPO | Meng et al. 2024 | NeurIPS 2024 | Length-normalized average log-probability reward; no reference model | Yes | No |
| CPO | Xu et al. 2024 | ICML 2024 | Contrastive loss with behavior cloning regularizer; designed for translation | Yes | No |
| RSO | Liu et al. 2023 | ICLR 2024 | Rejection sampling from optimal policy estimate | Yes | Yes |
| sDPO | Kim et al. 2024 | arXiv | Stepwise DPO with progressively updated reference | Yes | Yes |
| cDPO | Mitchell 2023 | Blog | Conservative DPO with label smoothing for noisy data | Yes | Yes |
| f-DPO | Wang et al. 2024 | ICML 2024 | Generalizes KL constraint to arbitrary f-divergences | Yes | Yes |
| TDPO | Zeng et al. 2024 | ICML 2024 | Token-level DPO with per-token KL regularization | Yes | Yes |
| Online DPO | Guo et al. 2024 (DeepMind) | arXiv | Generates fresh preference pairs each round using AI feedback | Yes | Yes |
| Iterative DPO | RLHFlow, Meta 2024 | Various | Multi-round DPO using current model to label new pairs | Yes | Yes |
| Self-Rewarding | Yuan et al. 2024 | ICML 2024 | Model judges its own outputs to create preference pairs | Yes | Yes |
| SPPO | Wu et al. 2024 | NeurIPS 2024 | Self-play game-theoretic formulation; asymmetric updates | Yes | Yes |
| NPO | Zhang et al. 2024 | arXiv | Negative Preference Optimization for unlearning | No | Yes |
| Plackett-Luce DPO | Multiple | Various | Generalizes Bradley-Terry to ranked lists of more than two items | Yes | Yes |
| Rainbow PO | Zhao et al. 2024 | ICLR 2025 | Unifies improvements from multiple DAAs into a single framework | Yes | Yes |
IPO was introduced by Mohammad Gheshlaghi Azar and colleagues at Google DeepMind in October 2023 [10]. The paper presented a more general theoretical framework called PsiPO that expresses the loss directly in terms of pairwise preference probabilities rather than requiring a Bradley-Terry pointwise reward. IPO is the specific case of PsiPO that uses an identity mapping in place of the sigmoid log-likelihood. This change bounds the implicit reward gap between chosen and rejected responses, which prevents the policy from assigning extreme probability ratios that can cause overconfidence and degenerate behavior on long training runs. IPO is particularly useful when the Bradley-Terry assumption is violated, for instance when human preferences are intransitive or when annotators frequently disagree.
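A sketch of the IPO loss, assuming precomputed sequence log-probabilities as tensors (common implementations average the log-probabilities per token for IPO; the sketch takes them as given):

```python
def ipo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # IPO regresses the log-ratio gap toward 1/(2*beta) with a squared loss,
    # so the implicit reward margin stays bounded instead of growing indefinitely.
    margin = ((policy_chosen_logps - ref_chosen_logps)
              - (policy_rejected_logps - ref_rejected_logps))
    return ((margin - 1.0 / (2.0 * beta)) ** 2).mean()
```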
KTO was introduced by Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, and Douwe Kiela in February 2024 [11]. The method draws on prospect theory from behavioral economics, specifically the observation by Daniel Kahneman and Amos Tversky that humans weigh losses more heavily than equivalent gains. KTO applies asymmetric penalties to outputs rated as bad versus rewards for outputs rated as good. Crucially, KTO eliminates the requirement for paired preference data. While DPO requires pairs of (preferred, dispreferred) responses to the same prompt, KTO works with unpaired binary feedback such as simple thumbs-up or thumbs-down labels on individual responses. This makes it practical in settings where collecting matched pairs is expensive or infeasible, including production telemetry from chat applications. KTO matched or exceeded DPO performance at scales from 1 billion to 30 billion parameters.
ORPO, by Jiwoo Hong, Noah Lee, and James Thorne at KAIST, was published at EMNLP 2024 [12]. It takes a more radical departure from the DPO framework by eliminating the reference model entirely and combining supervised fine-tuning with preference optimization into a single training objective. ORPO appends a log odds-ratio term to the standard negative log-likelihood loss, applying a weak penalty to rejected responses and a strong adaptation signal to chosen responses. This monolithic design allows alignment to be performed during the SFT phase in a single training run, which both saves compute and avoids the memory cost of holding a reference model. Fine-tuning Phi-2, Llama 2, and Mistral-7B with ORPO on UltraFeedback alone surpassed the performance of much larger Llama-2-Chat and Zephyr models on AlpacaEval 2.0 and MT-Bench [12].
SimPO, developed at Princeton by Yu Meng, Mengzhou Xia, and Danqi Chen, was published at NeurIPS 2024 [13]. It eliminates the reference model but takes a different approach than ORPO. SimPO uses average log-probability of a sequence as the implicit reward, rather than the log-ratio used by DPO. This length normalization directly addresses the length bias problem and aligns the training signal with the way models actually generate text (per-token probabilities at inference). SimPO also introduces a target reward margin gamma that helps maintain a consistent gap between preferred and dispreferred responses. In benchmarks with Llama 3 and Gemma 2 backbones, SimPO outperformed DPO by up to 6.4 points on AlpacaEval 2 and 7.5 points on Arena-Hard. The Princeton team's Gemma-2-9B-it variant ranked first on Chatbot Arena among models under 10 billion parameters at the time of release [13].
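A sketch of the SimPO objective; the default values of beta and gamma here are illustrative only, since the paper tunes them per model:

```python
import torch.nn.functional as F

def simpo_loss(policy_chosen_logps, policy_rejected_logps,
               chosen_lengths, rejected_lengths, beta=2.0, gamma=0.5):
    # Length-normalized log-probability serves as the implicit reward; no reference model.
    reward_chosen = beta * policy_chosen_logps / chosen_lengths
    reward_rejected = beta * policy_rejected_logps / rejected_lengths
    # Require the chosen reward to exceed the rejected reward by a margin gamma.
    return -F.logsigmoid(reward_chosen - reward_rejected - gamma).mean()
```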
CPO, by Haoran Xu and colleagues at Microsoft and Johns Hopkins, was developed primarily for machine translation but applies more broadly [14]. CPO removes the reference model term by proving an upper bound on DPO's objective and incorporating a behavior cloning regularizer that anchors the model to the chosen distribution. Applied to the ALMA translation model with only 22,000 parallel sentences and 12 million parameters of additional training, CPO produced ALMA-R, which matched or exceeded WMT competition winners and GPT-4 on WMT'21, WMT'22, and WMT'23 test sets.
Several additional methods address specific weaknesses of vanilla DPO. RSO (Rejection Sampling Optimization) by Tianqi Liu and colleagues sources preference data from an estimate of the optimal policy via rejection sampling, which improves the match between training data distribution and the model being optimized [15]. cDPO (Conservative DPO), proposed by Eric Mitchell in late 2023, adds label smoothing to the binary cross-entropy loss to handle noisy preference annotations. NPO (Negative Preference Optimization) adapts the framework for machine unlearning, where the goal is to make a model forget specific outputs rather than learn new ones. Plackett-Luce DPO generalizes the pairwise Bradley-Terry assumption to ranked lists, allowing more than two responses to be compared simultaneously. f-DPO replaces the KL divergence with a general f-divergence, providing a knob to trade off between mode-seeking and mode-covering behavior.
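For example, a sketch of the cDPO label-smoothed loss, where margin is beta times the difference of policy-to-reference log-ratios (as in the DPO loss above) and eps is an assumed label-flip rate:

```python
import torch.nn.functional as F

def cdpo_loss(margin, eps=0.1):
    # Label smoothing treats each preference label as wrong with probability eps,
    # softening the loss on pairs that annotators may have mislabeled.
    return (-(1.0 - eps) * F.logsigmoid(margin) - eps * F.logsigmoid(-margin)).mean()
```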
The original DPO is offline: training proceeds against a fixed dataset of preferences. Iterative DPO addresses the resulting distribution shift by alternating between policy training and fresh preference data generation. After each round of DPO training, the updated model produces new candidate responses that are then ranked (by humans, an AI judge, or a separate reward model), forming new preference pairs for the next round. Meta used this approach for Llama 3, performing six rounds of SFT followed by DPO on preference data labeled with the best model from the previous round [7]. The RLHFlow project at Salesforce released open weights for Llama-3-Iterative-DPO based on this recipe.
Online DPO, sometimes called Online AI Feedback (OAIF) and described by Shangmin Guo and colleagues at DeepMind in 2024, generates preference pairs on the fly during training using an external LLM judge, blurring the line between offline DPO and on-policy RLHF. Self-Rewarding Language Models, introduced by Weizhe Yuan and collaborators at Meta and NYU in early 2024, take this idea further: the model itself acts as the LLM-as-a-judge to score its own candidate outputs, removing the need for an external reward signal entirely [16].
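In outline, one iterative round can be sketched as below; generate_responses, rank_pair, and train_dpo are hypothetical helpers standing in for generation, judging, and the DPO update, and re-anchoring the reference to the latest policy each round is one common choice rather than a fixed rule:

```python
import copy

def iterative_dpo(policy, reference, prompts, num_rounds=3):
    for _ in range(num_rounds):
        pairs = []
        for prompt in prompts:
            # Sample two candidates from the current policy (hypothetical helper).
            a, b = generate_responses(policy, prompt, n=2)
            # A judge (human, reward model, or LLM) orders them (hypothetical helper).
            chosen, rejected = rank_pair(prompt, a, b)
            pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
        # One common choice: re-anchor the KL constraint to the latest policy.
        reference = copy.deepcopy(policy)
        policy = train_dpo(policy, reference, pairs)  # hypothetical helper
    return policy
```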
DPO has become a standard ingredient in the post-training stack for both proprietary and open-source LLMs. The table below lists notable models that report using DPO or a close variant.
| Model | Organization | Year | Notes |
|---|---|---|---|
| Zephyr-7B | Hugging Face H4 | 2023 | Built on Mistral 7B; used distilled DPO on UltraFeedback; first viral DPO success |
| Tulu 2 (7B/13B/70B) | Allen AI | 2023 | DPO on UltraFeedback; Llama-2 backbone; open codebase |
| Mixtral 8x7B Instruct | Mistral AI | 2023 | DPO-aligned mixture of experts; MT-Bench score 8.30 at release |
| Llama 3 (8B/70B) | Meta | 2024 | Six rounds of SFT plus DPO; learning rate 1e-5, beta 0.1 |
| Llama 3.1 (8B/70B/405B) | Meta | 2024 | Same DPO recipe extended to 405B parameters |
| Llama 3.2 (1B/3B/11B/90B) | Meta | 2024 | Multimodal Llama variants; DPO retained for chat alignment |
| Llama 3.3 70B | Meta | 2024 | DPO-aligned chat model with improved instruction following |
| Phi-3 (3.8B/14B) | Microsoft | 2024 | DPO in post-training stack for the Phi series |
| Qwen2 (0.5B to 72B) | Alibaba | 2024 | Both offline and online DPO; Online Merging Optimizer for alignment tax |
| Qwen 2.5 / Qwen 3 | Alibaba | 2024 to 2025 | DPO retained as core alignment method across the family |
| Gemma 2 | Google | 2024 | Supports DPO and SimPO via Hugging Face TRL |
| Nous-Hermes-2 | Nous Research | 2024 | DPO fine-tunes of various Mistral and Llama bases |
| OpenHermes-2.5 DPO | Teknium / Nous | 2024 | Community DPO fine-tunes |
| Notus 7B | Argilla | 2023 | DPO fine-tune of Zephyr with curated preference data |
| Starling 7B | Berkeley | 2023 | RLAIF-trained with a reward model on the Nectar dataset; includes DPO and PPO comparisons |
| Llama-3-Tulu-3 | Allen AI | 2024 | Uses DPO and RLVR (RL with Verifiable Rewards) |
| ALMA-R | Johns Hopkins / Microsoft | 2024 | Translation model trained with CPO |
Closed proprietary frontier models such as GPT-4 and Claude 3 do not publicly disclose their full alignment recipes, though Anthropic and OpenAI have both published research describing direct preference learning techniques that resemble DPO in spirit. The conventional wisdom in the open-source community is that frontier labs use proprietary blends of supervised fine-tuning, DPO-style direct alignment, RLHF with PPO, and constitutional or rule-based methods such as RLAIF.
DPO requires a dataset of preference pairs. Several public datasets have become standard benchmarks for the method.
| Dataset | Source | Size | Notes |
|---|---|---|---|
| UltraFeedback | Cui et al. (OpenBMB) 2023 | 64k prompts, 4 responses each | GPT-4 ranked across instruction-following, honesty, helpfulness, truthfulness [17] |
| HuggingFaceH4/ultrafeedback_binarized | Hugging Face H4 | 61k pairs | Binarized version used for Zephyr |
| Anthropic HH-RLHF | Anthropic 2022 | 170k pairs | Helpful and harmless preferences across base, RS, and online tranches [18] |
| HelpSteer / HelpSteer 2 | NVIDIA 2023 to 2024 | 37k responses | Multi-attribute ratings (helpful, correct, coherent, complex, verbose) |
| PKU-SafeRLHF | Peking University 2023 | 30k+ pairs | Safety-focused preferences over helpful responses |
| OpenAI WebGPT comparisons | OpenAI 2021 | 19k pairs | Early public RLHF dataset using web search |
| StackExchange Preferences | StackExchange / community | ~10M pairs | Implicit upvote signals from Q&A site |
| Argilla DPO Mix | Argilla 2024 | Varied | Curated combination of UltraFeedback and other sources |
| Capybara DPO | LDJnr 2024 | | Multi-turn DPO data from open-source models |
| Distilabel preference datasets | Argilla 2024 to 2025 | | AI-judge-labeled pairs from synthetic generation pipelines |
Many production deployments also use proprietary preference data collected from user interactions, such as thumbs-up and thumbs-down feedback in chat applications. KTO is particularly suited to this setting because it accepts unpaired binary signals without requiring matched comparisons.
DPO is supported in essentially every modern LLM post-training framework. The Hugging Face TRL library provides DPOTrainer, a subclass of the Transformers Trainer that handles tokenization, batching, log-probability computation for both policy and reference models, and optional integration with PEFT for LoRA-based DPO. TRL also ships an OnlineDPOTrainer for iterative variants and a DPOConfig dataclass that exposes all hyperparameters [5][6].
Other production frameworks include Axolotl, a community-driven YAML-configured fine-tuning tool; LLaMA-Factory, which supports DPO and most variants across more than a hundred model architectures; PyTorch Torchtune, with native DPO recipes for Llama 2, Llama 3, Mistral, and Gemma; NVIDIA NeMo Aligner, used internally for Nemotron and supporting DPO, RPO, and IPO; OpenRLHF, a Ray-based distributed framework geared toward larger-scale runs; and Allen AI's open-instruct codebase used for the Tulu series. NVIDIA's TensorRT-LLM and Hugging Face Inference Endpoints add deployment paths for DPO-trained models.
A minimal TRL example illustrates how compact a DPO training script can be; the checkpoint path and preference file below are placeholders for an SFT model and a dataset with prompt, chosen, and rejected columns:
```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

# The SFT checkpoint serves as both the trainable policy and the frozen reference.
policy = AutoModelForCausalLM.from_pretrained("path/to/sft-model")
reference = AutoModelForCausalLM.from_pretrained("path/to/sft-model")
tokenizer = AutoTokenizer.from_pretrained("path/to/sft-model")

# Preference data with "prompt", "chosen", and "rejected" columns.
preference_dataset = load_dataset("json", data_files="preferences.jsonl", split="train")

args = DPOConfig(
    beta=0.1,                        # KL strength; see the hyperparameter table above
    learning_rate=5e-7,
    per_device_train_batch_size=4,
    num_train_epochs=1,
    output_dir="./dpo_output",
)

trainer = DPOTrainer(
    model=policy,
    ref_model=reference,
    args=args,
    processing_class=tokenizer,
    train_dataset=preference_dataset,
)
trainer.train()
```
DPO has several known limitations that have motivated the development of variant algorithms.
Because DPO optimizes directly on preference pairs, the quality of those pairs is critical. Noisy or inconsistent preference labels degrade performance significantly. In RLHF, the reward model provides a layer of smoothing that can partially absorb label noise. DPO has no such buffer, so errors in the preference data translate directly into errors in the policy gradient [9]. Annotator agreement rates of 60 to 80 percent are typical for subjective tasks, which means a substantial fraction of the training signal is effectively random.
DPO trains on a fixed dataset of preference pairs, typically generated by a previous version of the model or a different model entirely. As training progresses and the policy changes, the training data may no longer reflect the distribution of outputs the model actually produces. This distribution shift can lead the model to find biased solutions that exploit out-of-distribution responses. Several researchers have noted that DPO can sometimes produce policies that look good on the training data but behave unexpectedly on novel inputs [8]. Iterative and online DPO variants address this by periodically generating new preference data from the current policy.
DPO requires a reference model pi_ref throughout training, and the quality of the final policy depends on the quality of this reference. If the reference model is poor, the KL constraint anchors the optimized policy to a weak baseline. The reference model must also be kept in memory during training, which increases hardware requirements relative to methods that do not need one. ORPO, SimPO, and CPO eliminate the reference model entirely, trading some theoretical guarantees for memory savings and simpler infrastructure.
As DPO training progresses, the implicit reward gap between chosen and rejected responses can grow without bound. This means the policy may assign extreme probability ratios that do not reflect the actual strength of human preferences. In practice this manifests as overconfident or degenerate behavior, particularly with long training runs. Empirical studies have observed a hump-shaped quality curve: as the policy diverges further from the reference at higher KL budgets, true generation quality initially improves but then degrades while the implicit reward continues to climb. IPO and Rainbow PO were designed in part to bound this growth [10][19].
DPO has been observed to develop a bias toward generating longer responses, since longer responses tend to have more tokens over which the log-probability ratio can accumulate. Park and colleagues studied this phenomenon in depth in 2024, finding that uncontrolled DPO training can increase response length by 50 to 100 percent without commensurate quality gains [20]. SimPO addresses this directly by normalizing the implicit reward by sequence length. Length-Desensitized DPO (LD-DPO) and Down-Sampled KL Divergence approaches offer alternative mitigations.
A 2025 analysis by Yan and collaborators identified what they called the 3D-Properties of DPO's implicit reward modeling: a Drastic drop in rejected response likelihood, Degradation into general response suppression rather than genuine preference learning, and Dispersion effects that spread negative signal to unseen responses [9]. These properties can cause the model to learn shallow heuristics rather than meaningful preference distinctions. The fix involves regularizing the chosen-response likelihood (as Llama 3 does with its NLL auxiliary loss) and using iterative variants that refresh the data distribution.
DPO's derivation assumes that human preferences follow the Bradley-Terry model. Real human preferences can be intransitive (A preferred to B, B preferred to C, but C preferred to A), context-dependent, or influenced by factors not captured by a scalar reward. When this assumption is violated, DPO's theoretical guarantees weaken. IPO was specifically designed to address this limitation by using a more general PsiPO objective.
Rafailov et al. proved that when the Bradley-Terry model perfectly fits the true preference distribution and the preference dataset has sufficient coverage, the global optimum of the DPO objective coincides with the global optimum of the RLHF objective. In other words, DPO and RLHF converge to the same optimal policy under ideal conditions. The methods differ in their finite-sample behavior and in how they handle distribution shift, not in their ultimate target.
The DPO policy implicitly defines a reward function:
r_implicit(x, y) = beta * log(pi_theta(y|x) / pi_ref(y|x))
This implicit reward can be extracted and used for evaluation, for ranking candidate responses, or as a reward signal for other purposes. The paper's subtitle, "Your Language Model is Secretly a Reward Model," refers to this property. Researchers have shown that the implicit reward correlates well with separately trained explicit reward models on benchmarks such as RewardBench, suggesting that DPO-trained models can serve as drop-in replacements for traditional reward models in some settings.
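A sketch of extracting the implicit reward and using it to rank candidates, assuming per-candidate sequence log-probabilities have already been computed for the same prompt:

```python
import torch

def implicit_reward(policy_logps, ref_logps, beta=0.1):
    # Implicit DPO reward, up to the prompt-dependent constant beta * log Z(x).
    return beta * (policy_logps - ref_logps)

def rank_candidates(policy_logps, ref_logps, beta=0.1):
    # Order candidate responses by their implicit reward, highest first.
    scores = implicit_reward(policy_logps, ref_logps, beta)
    return torch.argsort(scores, descending=True)
```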
The KL-constrained RLHF objective is a special case of the maximum entropy reinforcement learning framework, where the entropy bonus is replaced by a KL penalty relative to a reference policy. The closed-form solution for the optimal policy in this setting is a Gibbs (Boltzmann) distribution, well studied in statistical physics, Bayesian inference, and earlier RL work on soft Q-learning. DPO inherits these connections, which has helped researchers transfer ideas between alignment and other branches of machine learning.
Meta used DPO extensively in the alignment of its Llama 3 model family, released in 2024. According to the Llama 3 technical report, Meta performed six rounds of post-training, with each round involving supervised fine-tuning followed by DPO on preference data collected via human annotation or generated synthetically. Meta explored on-policy algorithms such as PPO but found that DPO required less compute for large-scale models and performed better on instruction-following benchmarks. For Llama 3, Meta used a learning rate of 1e-5 and set the beta hyperparameter to 0.1 [7]. The team added several stability tricks: masking special formatting tokens from the loss, adding an auxiliary NLL term on chosen responses with coefficient 0.2, and ensuring that each new round of preference data came from the strongest model produced so far.
The Llama 3.1 release scaled this recipe to 405 billion parameters, the largest open-weights LLM at the time. Llama 3.2 extended the approach to multimodal models including the 11B and 90B vision-language variants, and Llama 3.3 70B continued the iterative DPO recipe with refreshed preference data.
Zephyr-7B, developed by HuggingFace's H4 team in October 2023, demonstrated that DPO could produce highly capable aligned models even at small scale. Built on Mistral-7B, Zephyr used a variant called dDPO (distilled DPO) that trained on synthetic preference data generated by GPT-4 ranking responses from an ensemble of teacher models. Despite its 7B size, Zephyr competed with much larger models including Llama-2-Chat 70B on MT-Bench, validating the effectiveness of DPO for resource-constrained alignment [21]. The Zephyr training recipe became a widely copied template for community DPO fine-tunes throughout 2024.
Mistral AI used DPO for the instruction-tuned version of Mixtral 8x7B, its sparse mixture-of-experts model. The Mixtral 8x7B Instruct model was optimized through supervised fine-tuning and DPO, achieving a score of 8.30 on MT-Bench, making it the highest-performing open-source model at the time of its December 2023 release [22].
Allen AI's Tulu 2 model series, released in November 2023, was among the first large open-weights models trained end-to-end with DPO. Tulu 2 used a JAX-based DPO trainer built on EasyLM, with UltraFeedback as the preference dataset. The 7B, 13B, and 70B Tulu-2-DPO variants were released alongside the open-instruct training codebase, providing a reproducible reference for the community [23]. Tulu 3, released in late 2024, extended the recipe with verifier-based RL (RLVR) alongside DPO.
Microsoft's Phi-3 small language model series used DPO in its post-training pipeline. Despite being only 3.8 billion parameters, Phi-3-mini achieved benchmark performance comparable to much larger contemporaries, with DPO contributing to the chat tuning of the Phi-3-mini-instruct, Phi-3-small, and Phi-3-medium variants.
The Qwen2 family released by Alibaba in mid-2024 underwent both offline DPO and an online DPO stage in which the model sampled multiple responses, a separate reward model selected the best and worst, and the resulting pairs were used for DPO updates within each episode [24]. The Qwen team developed an Online Merging Optimizer to mitigate the alignment tax (the regression on capability benchmarks that often accompanies preference tuning). Qwen 2.5 and Qwen 3 retained DPO as a central alignment method.
DPO has become a standard tool in the open-source alignment toolkit. The Hugging Face TRL library provides a production-ready DPO trainer, and major fine-tuning frameworks including Axolotl, LLaMA-Factory, OpenRLHF, NVIDIA NeMo Aligner, and Allen AI open-instruct all support DPO out of the box. By 2025, DPO and its variants were the default choice for alignment in the majority of open-source model releases. PPO-based RLHF remains in use primarily at frontier labs with the engineering resources to run it at scale, and even there it is increasingly combined with or replaced by direct alignment methods [3].
As of early 2026, DPO remains a foundational method in language model alignment, though the landscape has evolved significantly since 2023.
Iterative and online variants have addressed key weaknesses. The original DPO's offline nature was its most significant practical limitation. Iterative DPO, which generates fresh preference data between rounds of optimization, has become the standard practice. Meta's six-round DPO pipeline for Llama 3 exemplified this approach, using the best-performing model from each round to generate new preference data for the next [7].
The post-training stack has matured. Rather than relying on a single alignment method, practitioners in 2025 and 2026 typically combine multiple techniques in sequence. A common recipe involves SFT as a foundation, followed by one or more rounds of preference optimization (using DPO, SimPO, or a variant), sometimes combined with rejection sampling or best-of-n filtering, and increasingly augmented by reinforcement learning from verifiable rewards (RLVR) for reasoning tasks. The choice of method depends on the specific goals: SimPO for broad chat alignment, KTO for applications where binary feedback is more natural, ORPO for compute-constrained single-stage training, and standard DPO for fine-grained preference tuning.
Research continues to refine the theoretical understanding. A 2025 study from Columbia University explored the performance gap between RLHF and DPO, finding that each method has distinct advantages depending on the optimization regime and data characteristics [8]. This line of work suggests that DPO and RLHF are complementary rather than strictly competitive approaches.
Industry surveys indicate that by 2025, approximately 70 percent of enterprises using LLMs employed some form of preference optimization (RLHF or DPO) for output alignment, up from roughly 25 percent in 2023. DPO adoption specifically grew by an estimated 45 percent in 2024, driven by lower compute requirements and simpler implementation [3]. Direct alignment has effectively become the default for any team that does not already have an in-house RL infrastructure team.
New directions in 2025 and 2026 push beyond the basic DPO framework. These include combining DPO with reasoning optimization for models trained to produce chain-of-thought traces, applying DPO to multimodal models including vision-language and text-to-image diffusion systems, and developing theoretical frameworks that unify the growing family of preference optimization methods. The Rainbow PO framework, published at ICLR 2025, represents one attempt at such unification, combining insights from multiple DPO variants into a single tunable approach [19]. Token-level extensions such as TDPO and step-level methods such as Step-DPO target reasoning chains where intermediate steps matter as much as final answers.
| Date | Event |
|---|---|
| 1952 | Bradley and Terry publish the pairwise comparison model that underlies DPO's preference formulation |
| 2017 | Schulman et al. introduce PPO, the standard RL algorithm for RLHF |
| March 2022 | Ouyang et al. publish InstructGPT, demonstrating RLHF with PPO for aligning GPT-3 |
| April 2022 | Bai et al. (Anthropic) release the HH-RLHF dataset |
| May 2023 | Rafailov et al. publish DPO on arXiv |
| October 2023 | Cui et al. release UltraFeedback; Azar et al. publish IPO |
| October 2023 | Tunstall et al. release Zephyr-7B with dDPO |
| November 2023 | Allen AI releases Tulu 2 with open-instruct DPO codebase |
| December 2023 | DPO presented at NeurIPS 2023 with Outstanding Runner-Up award; Mistral releases Mixtral 8x7B Instruct |
| January 2024 | Xu et al. publish CPO for machine translation |
| February 2024 | Ethayarajh et al. publish KTO based on prospect theory |
| March 2024 | Hong et al. publish ORPO; Park et al. analyze DPO length bias |
| April 2024 | Meta releases Llama 3 with iterative DPO; Self-Rewarding LM paper |
| May 2024 | Meng et al. publish SimPO; Wu et al. publish SPPO |
| July 2024 | Meta releases Llama 3.1 405B; Alibaba releases Qwen2 |
| September 2024 | Meta releases Llama 3.2; Alibaba releases Qwen 2.5 |
| October 2024 | Comprehensive DPO survey by Mr-Loevan group |
| November 2024 | Allen AI releases Tulu 3 with DPO and RLVR |
| December 2024 | Meta releases Llama 3.3 70B |
| 2025 | Rainbow PO at ICLR 2025; continued research on online and iterative variants; Qwen 3 family |