ORPO (Odds Ratio Preference Optimization) is a preference alignment algorithm for large language models that combines supervised fine-tuning and preference alignment into a single training stage, eliminating the need for a reference model. Introduced by Jiwoo Hong, Noah Lee, and James Thorne of KAIST AI in March 2024, ORPO appends an odds ratio penalty term to the standard negative log-likelihood loss, allowing the model to simultaneously learn the target domain and suppress undesired response styles. The paper was published at EMNLP 2024.
Unlike DPO, which requires a frozen reference model to stabilize training, or RLHF, which requires a separate reward model and a reinforcement learning phase, ORPO completes alignment in a single training run with roughly half the model-weight memory footprint of DPO.
Aligning language models to human preferences has traditionally required at minimum two separate training stages. The first stage is supervised fine-tuning (SFT), which adapts a pretrained base model to the target domain by training on curated examples. The second stage applies a preference alignment algorithm to make the model prefer high-quality responses over low-quality ones.
RLHF as described in the InstructGPT paper (Ouyang et al., 2022) actually requires three stages: SFT, training a reward model on human preference comparisons, and then optimizing the policy with Proximal Policy Optimization (PPO). This pipeline is expensive in both compute and engineering overhead. The reward model must be trained separately, the PPO loop requires generating samples at inference time, and the entire pipeline needs careful hyperparameter tuning to remain stable.
DPO, introduced by Rafailov et al. in 2023, simplified the alignment stage by showing that the reward model in RLHF can be expressed as a closed-form function of the policy itself. DPO eliminates the explicit reward model and PPO, replacing them with a binary cross-entropy objective over chosen and rejected response pairs. However, DPO still requires a reference model, which is typically the SFT checkpoint. The reference model's log probabilities anchor the training signal and prevent the policy from drifting too far from the SFT initialization. This means DPO still requires two sequential stages: run SFT first, save the checkpoint, freeze it as the reference, then run DPO.
The ORPO authors identified a less-discussed flaw in the standard SFT training phase. During SFT, the model is trained on the chosen (preferred) responses from a preference dataset, but it has no mechanism to penalize the rejected (disfavored) responses. Because chosen and rejected responses share vocabulary, style, and subject matter, the cross-entropy loss on chosen examples inadvertently increases the log probability of token sequences that resemble rejected responses as well.
Put differently: SFT trains the model to generate responses in a particular style and domain. If a dataset contains both helpful and unhelpful responses, running SFT on just the helpful ones still nudges the model toward generating the kinds of tokens that appear in unhelpful ones, simply because they share vocabulary, syntax, and topic distribution. The model learns the domain but not the distinction between quality levels within it.
This problem is not unique to any particular SFT dataset. It is a structural consequence of using maximum likelihood training on positive examples alone, with no contrastive signal distinguishing wanted from unwanted outputs.
ORPO was introduced in "ORPO: Monolithic Preference Optimization without Reference Model" by Jiwoo Hong, Noah Lee, and James Thorne of the KAIST AI lab. The preprint appeared on arXiv (2403.07691) on March 12, 2024, and was accepted at the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP 2024), held in Miami, Florida. The conference proceedings are at pages 11170 to 11189, with DOI 10.18653/v1/2024.emnlp-main.626.
The official implementation is available at github.com/xfactlab/orpo. The authors released two fine-tuned checkpoints: Mistral-ORPO-alpha (7B) and Mistral-ORPO-beta (7B), both available on the Hugging Face Hub under the kaist-ai organization.
The central claim of the paper is that a minor penalty on rejected responses, applied during SFT itself, is sufficient for preference-aligned fine-tuning. The authors show both empirically and theoretically that the odds ratio is a well-motivated choice for computing this penalty, and that folding the penalty into the SFT loss removes the need for a reference model entirely.
The odds ratio is a concept borrowed from statistics and probability theory. For a binary event, the odds of an outcome are defined as the probability of that outcome divided by the probability of the complementary outcome. If a model assigns probability P to generating a sequence y given input x, the odds of that sequence are:
odds(y | x) = P(y | x) / (1 - P(y | x))
The odds ratio between two sequences y_w (the preferred or "winning" response) and y_l (the rejected or "losing" response) is then:
OR(y_w, y_l | x) = odds(y_w | x) / odds(y_l | x)
Expanding this expression:
OR(y_w, y_l | x) = [P(y_w | x) / (1 - P(y_w | x))] / [P(y_l | x) / (1 - P(y_l | x))]
This can be rearranged as the ratio of the two probabilities multiplied by an additional correction factor:
OR = [P(y_w | x) / P(y_l | x)] * [(1 - P(y_l | x)) / (1 - P(y_w | x))]
The correction factor (1 - P(y_l | x)) / (1 - P(y_w | x)) makes the odds ratio more sensitive to the model's confidence than a bare probability ratio. As the probability of the chosen response grows, the factor grows with it and widens the contrast against the rejected response; when both probabilities are small, the odds ratio stays close to the plain probability ratio. The paper argues this sensitivity profile is better calibrated to preference learning than a bare probability ratio.
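A small numerical check makes this concrete. The probability values below are invented for illustration; they are not taken from the paper.

```python
# Compare the odds ratio with a bare probability ratio at two confidence levels.
def odds(p):
    return p / (1.0 - p)

for p_chosen, p_rejected in [(0.30, 0.05), (0.03, 0.005)]:
    prob_ratio = p_chosen / p_rejected
    odds_ratio = odds(p_chosen) / odds(p_rejected)
    print(f"P(y_w)={p_chosen:.3f}, P(y_l)={p_rejected:.3f}: "
          f"prob ratio={prob_ratio:.2f}, odds ratio={odds_ratio:.2f}")

# P(y_w)=0.300, P(y_l)=0.050: prob ratio=6.00, odds ratio=8.14
# P(y_w)=0.030, P(y_l)=0.005: prob ratio=6.00, odds ratio=6.15
# With moderate confidence in the chosen response the correction factor widens
# the contrast; when both probabilities are small (typical for long sequences),
# the odds ratio stays close to the plain probability ratio.
```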
The paper also contrasts odds ratios with the approach used in DPO, which operates on log probability differences. Because log probabilities span a wide range and can be driven very negative during training, the probability-based contrast in DPO can produce overly strong suppression of rejected responses in some regimes. The odds ratio formulation applies a more moderate and self-regulating penalty.
The ORPO objective combines two components:
L_ORPO = E[L_SFT + lambda * L_OR]
The SFT component is the standard negative log-likelihood loss on the chosen responses:
L_SFT = -log P_theta(y_w | x)
This is exactly the cross-entropy loss used in conventional SFT. It maximizes the log probability of the preferred completion given the prompt.
The odds ratio component is:
L_OR = -log sigmoid(log OR(y_w, y_l | x))
= -log sigmoid(log[odds(y_w | x) / odds(y_l | x)])
This term applies a log-sigmoid transformation to the log odds ratio. Minimizing L_OR is equivalent to maximizing the log odds ratio, which increases the odds of the chosen response relative to the rejected one. The sigmoid ensures the loss is bounded and numerically stable.
The scalar lambda controls the relative weight of the odds ratio penalty. In the original paper and in the TRL implementation, the recommended default is lambda = 0.1. The TRL library refers to this parameter as beta in the ORPOConfig class, following a naming convention consistent with other preference optimization trainers in the library.
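Taken together, the objective can be written as a short PyTorch-style sketch. This is an illustration of the formulas above, not the TRL implementation; it assumes the per-example sequence log probabilities (the paper uses length-averaged token log probabilities) have already been computed.

```python
import torch
import torch.nn.functional as F

def orpo_loss(logp_chosen, logp_rejected, lam=0.1):
    """Sketch of L_ORPO = L_SFT + lambda * L_OR for a batch of examples.

    logp_chosen, logp_rejected: tensors of shape [batch] holding
    log P_theta(y_w | x) and log P_theta(y_l | x) (length-averaged).
    """
    # log odds(y | x) = log P - log(1 - P), computed from log P
    log_odds_chosen = logp_chosen - torch.log1p(-torch.exp(logp_chosen))
    log_odds_rejected = logp_rejected - torch.log1p(-torch.exp(logp_rejected))

    # L_OR = -log sigmoid(log OR(y_w, y_l | x))
    l_or = -F.logsigmoid(log_odds_chosen - log_odds_rejected)

    # L_SFT = -log P_theta(y_w | x): the usual NLL on the chosen response
    l_sft = -logp_chosen

    return (l_sft + lam * l_or).mean()

# Hypothetical length-averaged log probabilities for a batch of two examples
loss = orpo_loss(torch.tensor([-1.2, -0.8]), torch.tensor([-1.5, -2.0]))
```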
The paper provides a closed-form decomposition of the gradient of L_OR:
nabla_theta L_OR = delta(d) * h(d)
where d = log OR(y_w, y_l | x) is the log odds ratio at the current parameter values.
The term delta(d) acts as a penalty coefficient. It is bounded between 0 and 1 and approaches 0 as the odds ratio in favor of the chosen response grows large. This means the penalty diminishes automatically as the model correctly distinguishes the preferred from the rejected response. There is a built-in stopping behavior: once the model has learned the preference, the gradient contribution from the odds ratio term fades.
The term h(d) captures the direction of the update. It increases the log probability of tokens in the chosen response while decreasing the log probability of tokens in the rejected response. The magnitude of h(d) scales with the current probability assigned to each response type, ensuring the update is proportional to how uncertain the model currently is.
Together, delta and h implement a form of adaptive curriculum: the model is pushed hard to distinguish preferred from rejected responses when the odds ratio is near 1 (roughly equal probabilities), and the push diminishes once the model has developed a clear preference.
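The damping can be made concrete with a small numerical check. Since L_OR = -log sigmoid(d), the magnitude of its derivative with respect to d is sigmoid(-d), which matches the bounded, fading coefficient described above; the values below are for illustration only.

```python
import torch

# Penalty coefficient sigmoid(-d) at a few values of the log odds ratio d.
for d in [-2.0, 0.0, 2.0, 6.0]:
    coeff = torch.sigmoid(torch.tensor(-d)).item()
    print(f"log odds ratio d = {d:+.1f}  ->  coefficient ~ {coeff:.3f}")

# d = -2.0 -> 0.881  (model currently prefers the rejected response: strong push)
# d =  0.0 -> 0.500  (odds roughly equal: moderate push)
# d = +2.0 -> 0.119  (preference emerging: push weakens)
# d = +6.0 -> 0.002  (preference learned: the odds ratio term fades out)
```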
The paper includes an ablation comparing the odds ratio formulation against a direct probability ratio. The results show that the odds ratio is more consistent across model sizes and training steps. The probability ratio formulation is more prone to instability at larger model scales, where small changes in log probability can translate into large gradient updates. The odds ratio's correction factor acts as a stabilizer.
The key architectural decision in ORPO is that both loss terms act simultaneously on the same mini-batch, in the same forward and backward pass.
A standard training batch for ORPO contains triplets of (prompt, chosen response, rejected response). The forward pass computes token-level log probabilities for both the chosen and rejected responses. The SFT loss operates on the chosen response only, maximizing its likelihood. The odds ratio loss operates on both, penalizing the rejected response relative to the chosen one.
This is different from DPO in a subtle but important way. In DPO, the reference model's log probabilities (computed on the fly or precomputed and cached) are used during training to normalize the policy's log probability ratios. The reference model serves as a fixed baseline that anchors the training signal. In ORPO, there is no baseline to anchor against. The odds ratio compares the current policy's assignment of probability to chosen versus rejected responses against itself. The contrast is entirely internal to the current model state.
This self-referential contrast works because the odds ratio measures the relative confidence of the model across two outputs at a given training step. It does not require knowing where the model started. The SFT loss ensures the model is still moving in the direction of the target domain; the odds ratio loss ensures it is also developing a preference ordering within that domain.
The result is a training procedure that can start from any pretrained base model, without first running a separate SFT phase to produce a reference checkpoint.
Removing the reference model has a direct practical benefit. In DPO training, both the policy model and the reference model must be loaded into GPU memory simultaneously. For a 7B parameter model in bfloat16 precision, the weights alone occupy roughly 14 GB per model, or roughly 28 GB total, before accounting for optimizer states, gradients, and activation memory.
With ORPO, only one model is resident in memory. This roughly halves the model weight memory requirement. Combined with the fact that there is no need to run a separate SFT phase and store its checkpoint, ORPO can make preference alignment accessible on hardware configurations that would otherwise require model parallelism or gradient checkpointing to fit a DPO training run.
The paper notes that ORPO theoretically requires roughly half the number of forward passes per training batch compared to DPO with a reference model. DPO requires a forward pass through both the policy and the reference model for each training example; ORPO requires only one forward pass through the policy.
The field of preference alignment has produced a number of DPO variants and alternatives, each addressing different perceived shortcomings.
| Method | Reference model | Preference data format | Training stages | Key mechanism |
|---|---|---|---|---|
| RLHF (PPO) | Frozen SFT model (plus a separate reward model) | Ranked pairs | 3 (SFT, RM, RL) | Policy gradient with reward signal |
| DPO | Frozen SFT model | Chosen / rejected pairs | 2 (SFT, then DPO) | Log probability ratio |
| IPO | Frozen SFT model | Chosen / rejected pairs | 2 (SFT, then IPO) | Regularized probability ratio |
| KTO | Frozen SFT model | Individual thumbs-up / thumbs-down | 2 (SFT, then KTO) | Kahneman-Tversky prospect theory loss |
| ORPO | None | Chosen / rejected pairs | 1 (SFT + alignment combined) | Odds ratio appended to SFT loss |
| SimPO | None | Chosen / rejected pairs | 1 or 2 | Length-normalized sequence probability |
DPO (Rafailov et al., 2023, NeurIPS 2023) reformulates the RLHF objective as a supervised binary classification problem, eliminating the need for an explicit reward model and PPO. It uses a reference model to normalize the policy's probability ratios, which prevents the model from collapsing to trivial solutions. The main drawbacks of DPO are the two-stage training requirement and the doubled memory footprint. DPO has become the most widely adopted preference alignment method and serves as the primary baseline for ORPO comparisons. The ORPO paper reports that Mistral-ORPO-beta achieves win rates above 70% against Mistral fine-tuned with DPO on the HH-RLHF dataset.
IPO (Azar et al., 2024) was developed partly to address a theoretical overfitting concern with DPO. The DPO loss can drive the policy to assign very high probability to chosen responses and very low probability to rejected ones, potentially overfitting on the preference labels. IPO adds a regularization term that penalizes large probability ratios, keeping the policy closer to the reference model's distribution. In practice, IPO tends to perform comparably to DPO on standard benchmarks and shares the same two-stage training requirement.
KTO (Kahneman-Tversky Optimization, Ethayarajh et al., 2024) draws from behavioral economics to construct a loss function that does not require paired chosen/rejected examples. Instead, KTO works with individually labeled responses (thumbs-up or thumbs-down), which makes it applicable to datasets where responses are not naturally paired. KTO still requires a reference model and is typically run as a second stage after SFT. Its main advantage over DPO and ORPO is the more flexible data format.
SimPO (Meng et al., 2024) is another reference-free method that normalizes log probabilities by sequence length before computing the preference objective. This addresses a known tendency of DPO-style methods to favor shorter responses. SimPO shares the reference-free property with ORPO but uses a different loss formulation and typically performs a separate SFT phase before alignment.
ORPO is most appropriate when compute or memory budgets are constrained, when the training pipeline should be kept simple, or when starting from a general pretrained base model without a dedicated SFT checkpoint. The single-stage training is particularly convenient for rapid prototyping and for fine-tuning on consumer hardware with limited VRAM.
DPO or IPO may be preferable when a high-quality SFT checkpoint already exists and the goal is to apply further preference tuning on top of it. KTO may be preferable when the available preference data consists of individually labeled responses rather than head-to-head comparisons.
The HuggingFace TRL (Transformer Reinforcement Learning) library added support for ORPO through the ORPOTrainer class. TRL is the standard library for post-training LLMs in the HuggingFace ecosystem, providing implementations of SFT, DPO, PPO, and a range of other alignment methods.
The ORPO implementation in TRL was contributed by Kashif Rasul, Lewis Tunstall, and Alvaro Bartolome.
The ORPOTrainer expects a preference dataset with three fields per example:
- prompt: the input prompt or conversation history
- chosen: the preferred response
- rejected: the disfavored response

This format is identical to the DPO trainer's expected format. Both conversational (chat template) and standard (raw text) dataset formats are supported; a sketch of each is shown below. When given a conversational dataset, the trainer automatically applies the model's chat template to format the inputs.
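The sketch below shows what one example might look like in each format. The field names match the trainer's expectations; the example content itself is invented.

```python
# Standard (raw text) format
standard_example = {
    "prompt": "What is the capital of France?",
    "chosen": "The capital of France is Paris.",
    "rejected": "France is a country in Europe.",
}

# Conversational format: the trainer applies the model's chat template
conversational_example = {
    "prompt": [{"role": "user", "content": "What is the capital of France?"}],
    "chosen": [{"role": "assistant", "content": "The capital of France is Paris."}],
    "rejected": [{"role": "assistant", "content": "France is a country in Europe."}],
}
```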
The recommended dataset for getting started is trl-lib/ultrafeedback_binarized, which is a processed version of the UltraFeedback dataset with chosen and rejected pairs ready for training.
The ORPOConfig class extends HuggingFace's standard TrainingArguments with ORPO-specific parameters:
- beta (default 0.1): the lambda hyperparameter from the paper, controlling the weight of the odds ratio loss relative to the SFT loss
- max_length (default 1024): the maximum sequence length for prompt plus completion
- max_completion_length: optional cap on completion length, useful for encoder-decoder models
- disable_dropout (default True): disabling dropout during training improves stability and is recommended
- generate_during_eval (default False): if enabled, the trainer samples completions and logs them to Weights and Biases or Comet during evaluation

The ORPOConfig also sets some defaults that differ from standard TrainingArguments. The default learning rate is 1e-6 (versus 5e-5 in TrainingArguments), which is more appropriate for fine-tuning pretrained language models. Gradient checkpointing defaults to True, and bfloat16 defaults to True when float16 is not set, reducing memory usage. A configuration sketch using these parameters follows below.
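The sketch uses the parameters described above; the non-ORPO values (batch size, accumulation steps, logging interval) are illustrative choices, not tuned recommendations.

```python
from trl import ORPOConfig

training_args = ORPOConfig(
    output_dir="my-orpo-run",          # hypothetical output directory
    beta=0.1,                          # lambda: weight of the odds ratio loss
    max_length=1024,                   # prompt + completion token budget
    learning_rate=1e-6,                # ORPOConfig's lower default, stated explicitly
    per_device_train_batch_size=4,     # illustrative; depends on available VRAM
    gradient_accumulation_steps=4,
    logging_steps=10,
)
```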
A minimal training script looks like this:
from datasets import load_dataset
from trl import ORPOConfig, ORPOTrainer
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
train_dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")
training_args = ORPOConfig(output_dir="Qwen2-0.5B-ORPO")
trainer = ORPOTrainer(
model=model,
args=training_args,
processing_class=tokenizer,
train_dataset=train_dataset
)
trainer.train()
The trainer logs several metrics during training, including rewards/chosen, rewards/rejected, rewards/accuracies (the fraction of training steps where the chosen reward exceeds the rejected reward), rewards/margins, log_odds_chosen, log_odds_ratio, and nll_loss.
For Mixture of Experts models, TRL also supports passing the auxiliary router loss through the ORPO objective by setting output_router_logits=True in the model config.
A full example script is available at examples/scripts/orpo.py in the TRL repository, along with a command-line launcher using accelerate launch for multi-GPU training.
The ORPOTrainer accepts a peft_config argument that wraps the model in a PEFT (Parameter-Efficient Fine-Tuning) adapter. This makes it straightforward to combine ORPO with LoRA or QLoRA for training on a single consumer GPU. Because ORPO already avoids loading a reference model, QLoRA plus ORPO can fit a 7B parameter training run on a single GPU with 24 GB of VRAM.
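A sketch of that combination follows, assuming a 4-bit quantized base model via bitsandbytes; the model name, LoRA rank, and target modules are illustrative assumptions rather than values from the paper or the TRL documentation.

```python
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from trl import ORPOConfig, ORPOTrainer

model_id = "mistralai/Mistral-7B-v0.1"  # any causal LM checkpoint works here

# Load the base model in 4-bit NF4 for QLoRA-style training
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token  # Mistral's tokenizer defines no pad token

# Illustrative LoRA hyperparameters
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

trainer = ORPOTrainer(
    model=model,
    args=ORPOConfig(output_dir="mistral-orpo-qlora", beta=0.1),
    processing_class=tokenizer,
    train_dataset=load_dataset("trl-lib/ultrafeedback_binarized", split="train"),
    peft_config=peft_config,  # the trainer wraps the quantized model in a LoRA adapter
)
trainer.train()
```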
The paper evaluates ORPO across three model families (Phi-2 at 2.7B, Llama-2 at 7B, and Mistral at 7B) on two preference datasets (Anthropic's HH-RLHF and the binarized UltraFeedback dataset). The baselines include standard SFT, RLHF with PPO, and DPO.
Key results on the binarized UltraFeedback dataset:
| Model | AlpacaEval 1.0 | AlpacaEval 2.0 | MT-Bench | IFEval (instr. loose) |
|---|---|---|---|---|
| Llama-2 Chat 7B | 71.34% | -- | 6.27 | -- |
| Llama-2 ORPO (7B) | 81.26% | -- | -- | -- |
| Mistral-ORPO-alpha (7B) | -- | 11.33% | 7.23 | -- |
| Mistral-ORPO-beta (7B) | -- | 12.20% | 7.32 | 66.19% |
| Zephyr-beta (7B, DPO-based) | -- | 10.99% | 7.34 | -- |
Mistral-ORPO-beta achieves 12.20% on AlpacaEval 2.0, which is competitive with or better than several models exceeding 7B parameters at the time of publication. On MT-Bench, it scores 7.32, comparable to Zephyr-beta despite the latter being trained with a multi-stage DPO pipeline. On IFEval, Mistral-ORPO-beta reaches 66.19% on instruction-level loose accuracy.
The paper also reports win rates against DPO-trained baselines using reward model scoring on HH-RLHF. At the OPT-125M scale, ORPO achieves a 64.3% win rate against DPO. At OPT-1.3B, the win rate rises to 70.9%. This scaling behavior suggests that ORPO's advantage over DPO grows as model capacity increases.
One of the more interesting empirical findings in the paper is a direct measurement of the SFT penalty problem described in the background section. The authors train a model with standard SFT on only the chosen responses from a preference dataset, then evaluate the log probabilities of both chosen and rejected responses during training.
They find that the log probabilities of rejected responses rise during SFT training, alongside the log probabilities of chosen responses. The model is inadvertently becoming more likely to generate both kinds of responses. This validates the motivation for adding a contrastive term to the SFT loss.
When ORPO is used instead, the log probabilities of chosen and rejected responses diverge during training. The chosen response log probability rises while the rejected response log probability falls. The model learns both the domain and the preference ordering simultaneously.
After the paper's release in March 2024, ORPO saw rapid adoption in the open-source community, particularly for fine-tuning the Llama-3 family of models released by Meta in April 2024. A widely cited tutorial by Maxime Labonne demonstrated fine-tuning Llama-3 8B with ORPO using TRL and QLoRA, helping establish ORPO as a practical alternative to DPO for resource-constrained settings.
ORPO support was also added to Axolotl, a popular configuration-driven fine-tuning framework, and to LLaMA-Factory, another widely used fine-tuning toolkit. This broad framework support lowered the barrier to trying ORPO for researchers and practitioners who prefer configuration-based workflows over writing training scripts from scratch.
The single-stage design of ORPO has influenced subsequent work on efficient preference alignment. The Triple Preference Optimization (TPO) paper (2024) cites ORPO as a motivation for combining alignment objectives rather than staging them sequentially. The RainbowPO paper (ICLR 2025) surveys a family of DPO variants and analyzes ORPO as a representative of the reference-free single-stage class, alongside SimPO.
Researchers have also applied ORPO to multimodal settings. A Stanford CS231n project applied both ORPO and DPO to visual question answering preference alignment, finding that ORPO produced competitive results on VQA benchmarks with less setup overhead.
The ORPO paper explicitly used Mistral 7B as a primary evaluation model and released Mistral-ORPO-alpha and Mistral-ORPO-beta checkpoints. Community fine-tuning extended this to Llama-3 (8B and 70B) and Phi-2 (2.7B). While neither Microsoft's Phi-3 nor Meta's Llama-3 official releases used ORPO in their reported training pipelines, ORPO became one of the most commonly cited methods for efficient community fine-tuning of both model families following the paper's release.
Although the paper recommends lambda = 0.1 as a robust default, practitioners have reported that the optimal value is dataset-dependent. With values too low, the odds ratio penalty provides negligible contrastive signal and the training effectively reduces to SFT. With values too high, the odds ratio term can dominate the SFT loss, causing the model to overfit on the preference labels at the expense of general language modeling quality.
The absence of a reference model is ORPO's main computational advantage, but it is also a source of theoretical concern. Reference models in DPO and IPO serve as regularizers that keep the policy from moving too far from the SFT distribution. Without this anchor, ORPO relies entirely on the self-referential odds ratio to prevent distributional collapse.
In practice, the SFT loss component of the ORPO objective provides some regularization by continuously optimizing for the chosen responses. However, there is no formal guarantee analogous to the KL divergence constraint in RLHF or the reference model normalization in DPO. For tasks requiring careful preservation of specific capabilities from the base model, DPO with a strong SFT reference may be more conservative and reliable.
Like DPO and IPO, ORPO requires paired preference data (each training example must have both a chosen and a rejected response for the same prompt). This is a more demanding data format than KTO, which can work with individually labeled responses. Constructing paired preference data requires either human annotation or using a reward model or judge to rank multiple completions per prompt, which adds data preparation overhead.
Most of the paper's evaluations are at 7B parameters or below. Preference optimization at larger scales (30B, 70B, or beyond) introduces more complex training dynamics, and it is less clear whether ORPO's self-referential odds ratio remains stable as model capacity and batch size grow. Multi-GPU training with gradient accumulation can alter the effective batch distribution in ways that interact with the odds ratio computation. The TRL documentation notes specific recommendations for Mixture of Experts models, but systematic evaluation at scales above 7B with ORPO specifically is limited in the literature.
Practitioners have noted that the reward margin (the difference between rewards/chosen and rewards/rejected in TRL's logged metrics) can be slow to increase during early training, sometimes remaining near zero for the first several hundred steps. This can make it difficult to diagnose whether training is proceeding correctly. Unlike DPO, where the log probability difference provides an intuitive signal, the odds ratio metric is less familiar and its expected trajectory during training is less documented.