Direct Preference Optimization (DPO) is an alignment technique for large language models that directly optimizes a language model policy from human preference data, without training a separate reward model or using reinforcement learning algorithms like Proximal Policy Optimization (PPO). Introduced in May 2023 by Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D. Manning, Stefano Ermon, and Chelsea Finn at Stanford University, DPO reformulates the reinforcement learning from human feedback (RLHF) objective into a simple binary cross-entropy loss over preference pairs. The key insight behind DPO is that the constrained reward maximization problem underlying RLHF admits a closed-form solution, allowing the reward function to be reparameterized in terms of the optimal policy itself. This eliminates the need for explicit reward modeling and reinforcement learning entirely, resulting in a training pipeline that is simpler, more stable, and less computationally expensive than traditional RLHF [1].
Since its publication, DPO has become one of the most widely adopted alignment methods in the open-source machine learning community and has been used by major organizations including Meta for its LLaMA model family. The paper was presented at NeurIPS 2023 and has spawned a large family of variant algorithms.
Aligning language models with human preferences is a central challenge in modern AI development. Without alignment, a pretrained language model will generate text that reflects the statistical patterns in its training data, which may include harmful, unhelpful, or undesirable content. The dominant approach to alignment prior to DPO was RLHF, a multi-stage pipeline that involves collecting human preference data, training a reward model on that data, and then fine-tuning the language model using a reinforcement learning algorithm (typically PPO) to maximize the learned reward [2].
While effective, RLHF has several well-documented practical difficulties. The pipeline requires training and maintaining a separate reward model, which introduces its own failure modes such as reward hacking (where the policy learns to exploit artifacts in the reward model rather than genuinely improving). PPO is notoriously sensitive to hyperparameters and can be unstable during training. The entire process requires sampling from the language model during training, which is computationally expensive. Running the full RLHF pipeline typically consumes 30 to 50 percent more compute than simpler fine-tuning approaches [3].
These difficulties motivated the search for simpler alignment methods. DPO emerged from the observation that the mathematical structure of the RLHF objective allows the reward model to be eliminated entirely from the optimization process.
The paper "Direct Preference Optimization: Your Language Model is Secretly a Reward Model" was posted to arXiv on May 29, 2023, and later presented at the 37th Conference on Neural Information Processing Systems (NeurIPS 2023) [1]. The authors were all affiliated with Stanford University at the time of publication.
The paper's title captures its central insight: a language model being optimized with DPO implicitly defines a reward model through the ratio of its output probabilities to those of a reference model. There is no need to explicitly parameterize or train a separate reward function because the policy itself encodes the reward.
The reference implementation was released as open-source code on GitHub [4], enabling rapid adoption by the research community.
The mathematical derivation of DPO proceeds in several steps, beginning from the standard RLHF objective and arriving at a loss function that can be optimized directly with gradient descent.
The starting point is the KL-constrained reward maximization objective used in RLHF:
max_pi E_{x ~ D, y ~ pi(y|x)} [r(x, y)] - beta * D_KL[pi(y|x) || pi_ref(y|x)]
Here, pi is the policy (language model) being optimized, pi_ref is a reference policy (typically the supervised fine-tuned model), r(x, y) is the reward function, beta is a parameter controlling the strength of the KL constraint, and D_KL is the Kullback-Leibler divergence. The KL penalty prevents the optimized policy from deviating too far from the reference model, which helps maintain generation quality and diversity.
Rafailov et al. showed that this optimization problem has a closed-form solution. The optimal policy pi* satisfies:
pi*(y|x) = (1 / Z(x)) * pi_ref(y|x) * exp((1/beta) * r(x, y))
where Z(x) is a partition function (normalizing constant) that ensures the probabilities sum to one. This result was already known in the reinforcement learning literature, but the DPO authors put it to novel use.
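On a small discrete problem, this closed form is easy to compute directly. The sketch below uses hypothetical reference probabilities and reward values for three candidate responses to a single prompt:

```python
import math

# Toy discrete example (all numbers hypothetical): three candidate
# responses to one prompt, with reference probabilities and rewards.
pi_ref = [0.5, 0.3, 0.2]      # reference policy pi_ref(y|x)
rewards = [1.0, 2.0, 0.5]     # reward r(x, y) for each response
beta = 0.5                    # KL-constraint strength

# Unnormalized optimal policy: pi_ref(y|x) * exp(r(x, y) / beta)
unnorm = [p * math.exp(r / beta) for p, r in zip(pi_ref, rewards)]

# Z(x) is the partition function that makes the result a distribution.
Z = sum(unnorm)
pi_star = [u / Z for u in unnorm]

# The result is a valid distribution that upweights higher-reward
# responses while staying anchored to pi_ref.
print([round(p, 4) for p in pi_star])
```

Note how beta acts as a temperature: as beta grows, exp(r/beta) flattens toward 1 and pi* approaches the reference policy, matching the role of the KL penalty in the objective above.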
By rearranging the expression for the optimal policy, the reward function can be expressed in terms of the policy and the reference model:
r(x, y) = beta * log(pi*(y|x) / pi_ref(y|x)) + beta * log Z(x)
This is the key reparameterization. It shows that the reward for any response y given prompt x can be recovered from the log-ratio of the optimal policy to the reference policy, plus a prompt-dependent constant.
Human preferences are modeled using the Bradley-Terry model, a standard framework for pairwise comparisons. Given a prompt x and two responses y_w (preferred, or "winner") and y_l (dispreferred, or "loser"), the probability that a human prefers y_w over y_l is:
P(y_w > y_l | x) = sigma(r(x, y_w) - r(x, y_l))
where sigma is the logistic sigmoid function.
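As a concrete check of the Bradley-Terry model, the sketch below computes the preference probability for two hypothetical reward values, and verifies that any additive, prompt-dependent constant (such as the beta * log Z(x) term in the reparameterized reward) leaves it unchanged:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical rewards for two responses to the same prompt
r_w, r_l = 1.2, 0.4
p = sigmoid(r_w - r_l)   # P(y_w > y_l | x) under Bradley-Terry
print(round(p, 4))

# A prompt-dependent constant c(x) cancels in the difference, so it
# cannot affect the preference probability.
c = 3.7
assert abs(sigmoid((r_w + c) - (r_l + c)) - p) < 1e-12
```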
Substituting the reparameterized reward into the Bradley-Terry model yields the DPO objective. The partition function Z(x) cancels out (since it depends only on the prompt, not the response), giving:
L_DPO(pi_theta; pi_ref) = -E_{(x, y_w, y_l) ~ D} [log sigma(beta * (log(pi_theta(y_w|x) / pi_ref(y_w|x)) - log(pi_theta(y_l|x) / pi_ref(y_l|x))))]
This is a binary cross-entropy loss. For each preference pair in the dataset, the loss encourages the model to assign a higher log-probability ratio (relative to the reference model) to the preferred response than to the dispreferred response. The parameter beta controls how sharply the model should differentiate between preferred and dispreferred outputs. A smaller beta leads to more aggressive optimization, while a larger beta keeps the policy closer to the reference model.
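A minimal sketch of this loss (not the reference implementation) computes it directly from sequence log-probabilities; all numbers below are hypothetical:

```python
import math

def dpo_loss(logps, beta=0.1):
    """Mean DPO loss over a batch of preference pairs.

    Each element of `logps` is a tuple
    (logp_w, logp_ref_w, logp_l, logp_ref_l): sequence log-probabilities
    of the preferred (w) and dispreferred (l) responses under the policy
    being trained and the frozen reference model.
    """
    total = 0.0
    for lw, lrw, ll, lrl in logps:
        # beta * (log-ratio of winner minus log-ratio of loser)
        margin = beta * ((lw - lrw) - (ll - lrl))
        # -log sigma(margin), written in a numerically stable form
        total += math.log1p(math.exp(-margin))
    return total / len(logps)

# Hypothetical batch of two preference pairs
batch = [(-12.0, -14.0, -15.0, -14.5),
         (-20.0, -19.0, -18.0, -17.0)]
print(round(dpo_loss(batch, beta=0.1), 4))
```

When the two log-ratios are equal the margin is zero and the per-pair loss is log 2, the usual cross-entropy value for a 50/50 prediction; the loss falls toward zero only as the winner's log-ratio pulls ahead of the loser's.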
The gradient of the DPO loss has an intuitive interpretation. It increases the probability of preferred responses and decreases the probability of dispreferred responses, weighted by how "wrong" the current model is. Specifically, examples where the model already strongly prefers the correct response contribute less to the gradient, while examples where the model incorrectly assigns higher probability to the dispreferred response contribute more. This implicit importance weighting is a natural consequence of the loss function's structure and helps prevent the model from overfitting to easy examples [1].
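This weighting can be checked numerically. In the sketch below, the gradient weight is the sigmoid of the difference of implicit rewards (the log-ratios are hypothetical): an example the model already gets right receives a small weight, while an example it gets wrong receives a large one.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

beta = 0.1

def grad_weight(logratio_w, logratio_l):
    # sigma(r_hat_l - r_hat_w), where r_hat = beta * log-ratio:
    # large when the model wrongly favours the dispreferred response.
    return sigmoid(beta * logratio_l - beta * logratio_w)

easy = grad_weight(logratio_w=5.0, logratio_l=-5.0)   # model already correct
hard = grad_weight(logratio_w=-5.0, logratio_l=5.0)   # model currently wrong
print(round(easy, 3), round(hard, 3))
```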
DPO and RLHF both use human preference data to align language models, but they differ fundamentally in how they convert preferences into training updates.
The RLHF pipeline consists of three stages: (1) supervised fine-tuning (SFT) of the base model, (2) training a reward model on preference data, and (3) optimizing the SFT model against the reward model using PPO. DPO simplifies this to two stages: (1) SFT, and (2) direct optimization on preference data using the DPO loss. The reward model training and PPO stages are both eliminated.
| Aspect | RLHF (with PPO) | DPO |
|---|---|---|
| Number of models during training | 4 (policy, reference, reward, value) | 2 (policy, reference) |
| Reward model required | Yes | No (implicit) |
| RL algorithm required | Yes (PPO) | No |
| Sampling during training | Yes (on-policy generation) | No (offline, uses fixed dataset) |
| Hyperparameter sensitivity | High (PPO has many hyperparameters) | Low (mainly beta) |
| Computational cost | Higher | Lower |
| Implementation complexity | Significant | Modest |
Simplicity. DPO reduces the alignment pipeline to a single supervised learning problem. There is no need to implement or debug a reinforcement learning loop, maintain a separate reward model, or deal with the engineering challenges of on-policy sampling during training.
Stability. PPO training is prone to instabilities, including reward hacking, mode collapse, and divergence. DPO avoids these issues because it does not use an explicit reward model that can be exploited, and its loss function is a well-behaved classification objective.
Lower compute. Because DPO does not require sampling from the model during training or maintaining four separate models in memory simultaneously, it uses substantially less GPU memory and wall-clock time. Meta reported that for their LLaMA 3 models, DPO required less compute than PPO for large-scale alignment and performed better on instruction-following benchmarks like IFEval [5].
Ease of implementation. DPO can be implemented in a few dozen lines of code on top of standard language model training infrastructure. Reference implementations are available in popular libraries including Hugging Face's TRL (Transformer Reinforcement Learning) library [4].
Despite DPO's practical benefits, RLHF with PPO retains advantages in certain settings. Because PPO is an on-policy algorithm, it generates fresh samples from the current policy during training, which helps it adapt to the evolving distribution of the model. DPO, by contrast, is an offline algorithm that trains on a fixed dataset of preferences. This means DPO can suffer from distribution shift: as the model improves during training, the preference data (collected from a different, earlier model) may become less relevant.
Research published in 2025 found that in approximate optimization settings, RLHF requires significantly fewer samples than DPO to recover an effective reward model, highlighting a statistical advantage of the two-stage learning approach in some regimes [6]. RLHF can also handle more nuanced feedback signals beyond simple pairwise preferences, which makes it more suitable for high-stakes applications requiring fine-grained alignment.
DPO offers several concrete advantages that have driven its widespread adoption.
Training stability. The DPO loss is a standard cross-entropy objective, which is well understood and numerically stable. Unlike PPO, which requires careful tuning of learning rates, clipping parameters, value function coefficients, and entropy bonuses, DPO has essentially one key hyperparameter: beta.
Reproducibility. Because DPO trains on a fixed offline dataset with no on-policy sampling, runs are easier to reproduce given the same data and initialization. This is valuable for both research and production settings.
Accessibility. The lower computational requirements and simpler implementation make DPO accessible to researchers and organizations without access to the large GPU clusters needed for stable PPO training. This has been a significant factor in its adoption by the open-source community.
Theoretical grounding. DPO has a clean mathematical derivation from the RLHF objective. It does not introduce approximations in the optimization (the reparameterization is exact), although it does inherit the assumptions of the Bradley-Terry preference model and the KL-constrained formulation.
DPO has several known limitations that have motivated the development of variant algorithms.
Because DPO optimizes directly on preference pairs, the quality of those pairs is critical. Noisy or inconsistent preference labels can degrade performance significantly. In RLHF, the reward model provides a layer of smoothing that can partially absorb label noise. DPO has no such buffer; errors in the preference data translate directly into errors in the policy gradient [7].
DPO trains on a fixed dataset of preference pairs, typically generated by a previous version of the model or a different model entirely. As training progresses and the policy changes, the training data may no longer reflect the distribution of outputs the model actually produces. This distribution shift can lead the model to find biased solutions that exploit out-of-distribution responses. Several researchers have noted that DPO can sometimes produce policies that look good on the training data but behave unexpectedly on novel inputs [6].
DPO requires a reference model (pi_ref) throughout training, and the quality of the final policy depends on the quality of this reference. If the reference model is poor, the KL constraint anchors the optimized policy to a weak baseline. The reference model must also be kept in memory during training, which increases hardware requirements relative to methods that do not need one.
As DPO training progresses, the implicit reward gap between chosen and rejected responses can grow without bound. This means the policy may assign extreme probability ratios that do not reflect the actual strength of human preferences. In practice, this can manifest as overconfident or degenerate behavior, particularly with long training runs [8].
DPO has been observed to develop a bias toward generating longer responses, since longer responses tend to have more tokens over which the log-probability ratio can accumulate. This is a known issue in several preference optimization methods, though some variants (notably SimPO) have proposed solutions [9].
The success and limitations of DPO have inspired a large family of variant algorithms, each addressing specific shortcomings or adapting the core idea to different settings.
IPO was introduced to address the overfitting and reward overoptimization issues in DPO. It adds a regularization term that prevents the implicit reward gap between preferred and rejected responses from growing unboundedly. By bounding the gap, IPO produces more calibrated preference judgments and avoids the extreme probability assignments that can occur with standard DPO [8].
KTO draws inspiration from prospect theory in behavioral economics, specifically the observation that humans weigh losses more heavily than equivalent gains. KTO applies heavier penalties to outputs rated as bad than rewards for outputs rated as good. Crucially, KTO eliminates the requirement for paired preference data. While DPO requires pairs of (preferred, dispreferred) responses to the same prompt, KTO works with unpaired binary feedback (simple thumbs-up or thumbs-down labels on individual responses). This makes it practical in settings where collecting matched pairs is expensive or infeasible [10].
ORPO takes a more radical departure from the DPO framework by eliminating the reference model entirely. It combines supervised fine-tuning with preference optimization into a single training objective by using an odds-ratio formulation that normalizes the preference signal. This eliminates the need to keep a separate reference model in memory and decouples the preference signal from sampling bias. ORPO has been shown to be effective in settings where compute is limited or where the reference model is unavailable [11].
SimPO, developed at Princeton and published at NeurIPS 2024, also eliminates the reference model but takes a different approach than ORPO. SimPO uses average log-probability as the implicit reward, rather than the log-ratio used by DPO. This normalization by sequence length directly addresses the length bias problem. SimPO also introduces a target reward margin that helps maintain a consistent gap between preferred and dispreferred responses. In benchmarks, SimPO has shown strong performance as a general-purpose alignment method [9].
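The length-normalization idea can be illustrated with a small sketch. All token-level log-probabilities below are hypothetical, and beta stands in for the scaling parameter; SimPO's target reward margin is omitted for brevity:

```python
# Contrast of the implicit rewards used by DPO and SimPO on two
# responses of identical per-token quality but different lengths.
# All token-level log-probabilities are hypothetical.
beta = 2.0

def dpo_reward(token_logps, token_ref_logps):
    # DPO: beta * summed log-ratio relative to the reference model
    return beta * (sum(token_logps) - sum(token_ref_logps))

def simpo_reward(token_logps):
    # SimPO: beta * length-averaged log-probability, no reference model
    return beta * sum(token_logps) / len(token_logps)

short, ref_short = [-1.0] * 10, [-1.2] * 10   # 10-token response
long, ref_long = [-1.0] * 50, [-1.2] * 50     # 50-token response

# Under DPO's summed log-ratio, the longer response accumulates a larger
# implicit reward despite identical per-token quality...
print(dpo_reward(short, ref_short), dpo_reward(long, ref_long))
# ...while SimPO's length-averaged reward treats them identically.
assert simpo_reward(short) == simpo_reward(long)
```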
SPPO frames language model alignment as a two-player constant-sum game and uses a self-play framework to approximate the Nash equilibrium policy. Unlike DPO's symmetric pairwise loss, SPPO uses an asymmetric objective that can effectively increase the log-likelihood of preferred responses while decreasing that of rejected ones. (Standard DPO tends to primarily decrease the loser's likelihood, with the winner's likelihood changing relatively little.) SPPO achieved state-of-the-art results on several benchmarks, including a length-controlled win rate of 28.53% against GPT-4-Turbo on AlpacaEval 2.0, using only 60,000 prompts and a small preference model [12].
| Variant | Key innovation | Year |
|---|---|---|
| cDPO (Conservative DPO) | Adds label smoothing to handle noisy preference data | 2023 |
| RSO (Rejection Sampling Optimization) | Combines rejection sampling with DPO-style optimization | 2024 |
| Online DPO / Iterative DPO | Generates new preference pairs during training to reduce distribution shift | 2024 |
| GRPO (Group Relative Policy Optimization) | Uses group-based scoring instead of pairwise preferences | 2024 |
| Rainbow PO | Unified framework combining improvements from multiple DPO variants (ICLR 2025) | 2024 |
A comprehensive survey published in October 2024 catalogued dozens of DPO variants, documenting the rapid proliferation of methods built on the original framework [7].
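As one example from this family, the label-smoothing idea behind cDPO (listed in the table above) admits a compact sketch; the smoothing level eps and the margin value are illustrative assumptions:

```python
import math

def cdpo_loss(margin, eps=0.1):
    """Conservative DPO loss for one pair, with label smoothing eps.

    `margin` is beta * (policy-vs-reference log-ratio of the chosen
    response minus that of the rejected response). With probability eps
    the preference label is assumed flipped, so the loss mixes both
    directions instead of trusting the label completely.
    """
    nls = lambda m: math.log1p(math.exp(-m))   # -log sigma(m), stable form
    return (1 - eps) * nls(margin) + eps * nls(-margin)

# Even at a confidently "correct" margin, smoothing keeps the loss
# bounded away from zero, limiting overfitting to possibly-noisy labels.
print(round(cdpo_loss(5.0, eps=0.1), 4))
```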
The following table summarizes key characteristics of major alignment methods as of 2025.
| Feature | RLHF (PPO) | DPO | KTO | ORPO |
|---|---|---|---|---|
| Reward model required | Yes | No (implicit) | No | No |
| Reference model required | Yes | Yes | Yes | No |
| Data format | Pairwise preferences | Pairwise preferences | Unpaired binary feedback | Pairwise preferences |
| RL algorithm required | Yes (PPO) | No | No | No |
| On-policy sampling | Yes | No | No | No |
| Combined with SFT | No (separate stage) | No (separate stage) | No (separate stage) | Yes (single stage) |
| Compute cost | High | Moderate | Moderate | Low |
| Implementation complexity | High | Low | Low | Low |
| Sensitivity to data quality | Moderate (smoothed by RM) | High | Moderate | Moderate |
| Length bias | Moderate | Notable | Low | Low |
| Key hyperparameter | Multiple (PPO params) | beta | beta, asymmetry param | lambda |
Meta used DPO extensively in the alignment of its LLaMA 3 model family, released in 2024. According to the LLaMA 3 technical report, Meta performed several rounds of post-training, with each round involving supervised fine-tuning followed by DPO on preference data collected via human annotation or generated synthetically. Meta explored on-policy algorithms such as PPO but found that DPO required less compute for large-scale models and performed better on instruction-following benchmarks. For LLaMA 3, Meta used a learning rate of 1e-5 and set the beta hyperparameter to 0.1 [5].
The LLaMA 3.2 release continued this approach, using multiple rounds of SFT, rejection sampling, and DPO to produce the final chat models [5].
Zephyr-7B, developed by Hugging Face's H4 team, demonstrated that DPO could produce highly capable aligned models even at small scale. Built on Mistral-7B, Zephyr used a variant called dDPO (distilled DPO) that trained on synthetic preference data generated by a larger teacher model. Despite its relatively small size, Zephyr-7B competed with much larger models on chat benchmarks, validating the effectiveness of DPO for resource-constrained alignment [13].
Mistral AI used DPO for the instruction-tuned version of Mixtral 8x7B, its mixture-of-experts model. The Mixtral 8x7B Instruct model was optimized through supervised fine-tuning and DPO, achieving a score of 8.30 on MT-Bench, which made it the highest-performing open-source model at the time of its release [14].
DPO has become a standard tool in the open-source alignment toolkit. The Hugging Face TRL library provides a production-ready DPO trainer, and major fine-tuning frameworks including Axolotl, LLaMA-Factory, and OpenRLHF all support DPO out of the box. By 2025, DPO and its variants were the default choice for alignment in the majority of open-source model releases, with PPO-based RLHF reserved primarily for frontier labs with the engineering resources to run it at scale [3].
As of early 2026, DPO remains a foundational method in language model alignment, though the landscape has evolved significantly since 2023.
Iterative and online variants have addressed key weaknesses. The original DPO's offline nature was its most significant practical limitation. Iterative DPO, which generates fresh preference data between rounds of optimization, has become the standard practice. Meta's multi-round DPO pipeline for LLaMA 3 exemplified this approach, using the best-performing model from each round to generate new preference data for the next [5].
The post-training stack has matured. Rather than relying on a single alignment method, practitioners in 2025 and 2026 typically combine multiple techniques in sequence. A common recipe involves SFT as a foundation, followed by one or more rounds of preference optimization (using DPO, SimPO, or a variant), sometimes combined with rejection sampling or best-of-n filtering. The choice of method depends on the specific goals: SimPO for broad alignment, KTO for applications where binary feedback is more natural, and DPO for fine-grained preference tuning [15].
Research continues to refine the theoretical understanding. A 2025 study from Columbia University explored the performance gap between RLHF and DPO, finding that each method has distinct advantages depending on the optimization regime and data characteristics. This line of work suggests that DPO and RLHF are complementary rather than strictly competitive approaches [6].
Scale of adoption. Industry surveys indicate that by 2025, approximately 70% of enterprises using LLMs employed some form of preference optimization (RLHF or DPO) for output alignment, up from roughly 25% in 2023. DPO adoption specifically grew by an estimated 45% in 2024, driven by its lower compute requirements and simpler implementation [3].
New directions. Research in 2025 and 2026 has pushed beyond the basic DPO framework in several directions. These include combining DPO with reasoning optimization (for models trained to produce chain-of-thought traces), applying DPO to multimodal models, and developing theoretical frameworks that unify the growing family of preference optimization methods. The Rainbow PO framework, published at ICLR 2025, represents one attempt at such unification, combining insights from multiple DPO variants into a single approach [16].