ORPO
Last reviewed
May 17, 2026
Sources
16 citations
Review status
Source-backed
Revision
v2 ยท 6,500 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
May 17, 2026
Sources
16 citations
Review status
Source-backed
Revision
v2 ยท 6,500 words
Add missing citations, update stale details, or suggest a clearer explanation.
ORPO (Odds Ratio Preference Optimization) is a preference alignment algorithm for large language models that combines supervised fine-tuning and preference alignment into a single training stage, eliminating the need for a reference model. Introduced by Jiwoo Hong, Noah Lee, and James Thorne of KAIST AI in March 2024, ORPO appends an odds ratio penalty term to the standard negative log-likelihood loss, allowing the model to simultaneously learn the target domain and suppress undesired response styles. The paper was published at EMNLP 2024.
Unlike DPO, which requires a frozen reference model to stabilize training, or RLHF, which requires a separate reward model and reinforcement learning phase, ORPO completes alignment in one pass with roughly half the GPU memory footprint. The method became one of the most widely adopted lightweight alternatives to DPO for community fine-tuning of open-weight language models during 2024 and 2025, with broad support in mainstream training frameworks including TRL, Axolotl, and LLaMA-Factory.
Aligning language models to human preferences has traditionally required at minimum two separate training stages. The first stage is supervised fine-tuning (SFT), which adapts a pretrained base model to the target domain by training on curated examples. The second stage applies a preference alignment algorithm to make the model prefer high-quality responses over low-quality ones.
RLHF as described in the InstructGPT paper (Ouyang et al., 2022) actually requires three stages: SFT, training a reward model on human preference comparisons, and then optimizing the policy with Proximal Policy Optimization (PPO). This pipeline is expensive in both compute and engineering overhead. The reward model must be trained separately, the PPO loop requires generating samples at inference time, and the entire pipeline needs careful hyperparameter tuning to remain stable.
DPO, introduced by Rafailov et al. in 2023, simplified the alignment stage by showing that the reward model in RLHF can be expressed as a closed-form function of the policy itself. DPO eliminates the explicit reward model and PPO, replacing them with a binary cross-entropy objective over chosen and rejected response pairs. However, DPO still requires a reference model, which is typically the SFT checkpoint. The reference model's log probabilities anchor the training signal and prevent the policy from drifting too far from the SFT initialization. This means DPO still requires two sequential stages: run SFT first, save the checkpoint, freeze it as the reference, then run DPO.
The ORPO authors identified a less-discussed flaw in the standard SFT training phase. During SFT, the model is trained on the chosen (preferred) responses from a preference dataset, but it has no mechanism to penalize the rejected (disfavored) responses. Because the model architecture is shared across all response types, the cross-entropy loss on chosen examples inadvertently increases the log probability of token sequences that resemble rejected responses as well.
Put differently: SFT trains the model to generate responses in a particular style and domain. If a dataset contains both helpful and unhelpful responses, running SFT on just the helpful ones still nudges the model toward generating the kinds of tokens that appear in unhelpful ones, simply because they share vocabulary, syntax, and topic distribution. The model learns the domain but not the distinction between quality levels within it.
This problem is not unique to any particular SFT dataset. It is a structural consequence of using maximum likelihood training on positive examples alone, with no contrastive signal distinguishing wanted from unwanted outputs. Hong et al. demonstrate the effect quantitatively in the paper's appendix: when a base model is fine-tuned on chosen responses from HH-RLHF using standard cross-entropy, the log probability assigned to corresponding rejected responses also rises monotonically across training steps. The SFT phase increases the model's likelihood of producing both desirable and undesirable continuations, leaving the subsequent alignment phase to clean up the contrast that should have been preserved from the start.
The phrase monolithic preference optimization in the paper title refers to compressing the entire post-training pipeline into a single optimization objective and a single training run. Rather than treating SFT and preference alignment as separate phases with distinct loss functions, data formats, and reference checkpoints, ORPO frames them as components of one composite loss. The technical insight is that the contrastive signal needed for preference alignment can be supplied through the same forward pass that performs SFT, provided the contrastive term is computed from the policy itself rather than from an external reward model.
ORPO was introduced in "ORPO: Monolithic Preference Optimization without Reference Model" by Jiwoo Hong, Noah Lee, and James Thorne of the KAIST AI lab. The preprint appeared on arXiv (2403.07691) on March 12, 2024, and was accepted at the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP 2024), held in Miami, Florida. The conference proceedings are at pages 11170 to 11189, with DOI 10.18653/v1/2024.emnlp-main.626.
The official implementation is available at github.com/xfactlab/orpo. The authors released two fine-tuned checkpoints: Mistral-ORPO-alpha (7B) and Mistral-ORPO-beta (7B), both available on the Hugging Face Hub under the kaist-ai organization. The xfactlab repository is named after the eXtended Foundations of AI and Computing Together lab at KAIST, headed by James Thorne, where the work originated.
The central claim of the paper is that a minor penalty on rejected responses, applied during SFT itself, is sufficient for preference-aligned fine-tuning. The authors show both empirically and theoretically that the odds ratio is a well-motivated choice for computing this penalty, and that folding the penalty into the SFT loss removes the need for a reference model entirely.
Version 1 of the arXiv preprint was titled "Reference-free Monolithic Preference Optimization with Odds Ratio," which is occasionally cited in early survey papers. Version 2, dated March 14, 2024, adopts the more familiar "ORPO: Monolithic Preference Optimization without Reference Model" title and includes an expanded evaluation set.
The odds ratio is a concept borrowed from statistics and probability theory. For a binary event, the odds of an outcome are defined as the probability of that outcome divided by the probability of the complementary outcome. If a model assigns probability P to generating a sequence y given input x, the odds of that sequence are:
odds(y | x) = P(y | x) / (1 - P(y | x))
The odds ratio between two sequences y_w (the preferred or "winning" response) and y_l (the rejected or "losing" response) is then:
OR(y_w, y_l | x) = odds(y_w | x) / odds(y_l | x)
Expanding this expression:
OR(y_w, y_l | x) = [P(y_w | x) / (1 - P(y_w | x))] / [P(y_l | x) / (1 - P(y_l | x))]
This can be rearranged as the ratio of the two probabilities multiplied by an additional correction factor:
OR = [P(y_w | x) / P(y_l | x)] * [(1 - P(y_l | x)) / (1 - P(y_w | x))]
The correction factor (1 - P(y_l | x)) / (1 - P(y_w | x)) makes the odds ratio more sensitive to changes in model confidence than a bare probability ratio. When the model assigns near-zero probability to a rejected response, the odds ratio amplifies the signal; when the model is nearly certain about a chosen response, the signal diminishes. The paper argues this sensitivity profile is better calibrated to preference learning than a plain probability ratio.
The paper also contrasts odds ratios with the approach used in DPO, which operates on log probability differences. Because log probabilities span a wide range and can be driven very negative during training, the probability-based contrast in DPO can produce overly strong suppression of rejected responses in some regimes. The odds ratio formulation applies a more moderate and self-regulating penalty.
The ORPO objective combines two components:
L_ORPO = E[L_SFT + lambda * L_OR]
The SFT component is the standard negative log-likelihood loss on the chosen responses:
L_SFT = -log P_theta(y_w | x)
This is exactly the cross-entropy loss used in conventional SFT. It maximizes the log probability of the preferred completion given the prompt.
The odds ratio component is:
L_OR = -log sigmoid(log OR(y_w, y_l | x))
= -log sigmoid(log[odds(y_w | x) / odds(y_l | x)])
This term applies a log-sigmoid transformation to the log odds ratio. Minimizing L_OR is equivalent to maximizing the log odds ratio, which increases the odds of the chosen response relative to the rejected one. The sigmoid ensures the loss is bounded and numerically stable.
The scalar lambda controls the relative weight of the odds ratio penalty. In the original paper and in the TRL implementation, the recommended default is lambda = 0.1. The TRL library refers to this parameter as beta in the ORPOConfig class, following a naming convention consistent with other preference optimization trainers in the library.
A practical implementation detail not always emphasized in summaries of ORPO is that the log probabilities used in the odds ratio are length-normalized averages over the response tokens, not raw sums. Without normalization, longer responses accumulate more negative log-likelihood mass simply by being longer, which would bias the odds ratio toward shorter sequences. The TRL implementation averages the log probabilities over the number of completion tokens, mirroring the convention used by DPO and IPO trainers in the same library. This averaging is what makes the odds ratio quantity comparable across chosen and rejected responses of differing lengths.
This length normalization is also why the ORPO loss is sometimes described as operating on the geometric mean of token probabilities rather than the joint probability of the full sequence. The distinction matters for understanding how the gradient flows through individual tokens: every token in the completion contributes to the loss in roughly equal measure, rather than the early tokens dominating.
The paper provides a closed-form decomposition of the gradient of L_OR:
nabla_theta L_OR = delta(d) * h(d)
where d = log OR(y_w, y_l | x) is the log odds ratio at the current parameter values.
The term delta(d) acts as a penalty coefficient. It is bounded between 0 and 1 and approaches 0 as the odds ratio in favor of the chosen response grows large. This means the penalty diminishes automatically as the model correctly distinguishes the preferred from the rejected response. There is a built-in stopping behavior: once the model has learned the preference, the gradient contribution from the odds ratio term fades.
The term h(d) captures the direction of the update. It increases the log probability of tokens in the chosen response while decreasing the log probability of tokens in the rejected response. The magnitude of h(d) scales with the current probability assigned to each response type, ensuring the update is proportional to how uncertain the model currently is.
Together, delta and h implement a form of adaptive curriculum: the model is pushed hard to distinguish preferred from rejected responses when the odds ratio is near 1 (roughly equal probabilities), and the push diminishes once the model has developed a clear preference.
The paper includes an ablation comparing the odds ratio formulation against a direct probability ratio. The results show that the odds ratio is more consistent across model sizes and training steps. The probability ratio formulation is more prone to instability at larger model scales, where small changes in log probability can translate into large gradient updates. The odds ratio's correction factor acts as a stabilizer.
The theoretical justification rests on the observation that odds (P / (1 - P)) and probability (P) behave differently as P approaches its boundaries. When P is very small, the two quantities are nearly identical. When P is close to 1, odds grow without bound while probability is capped. For preference learning, the interesting regime is when chosen probabilities should rise toward 1 and rejected probabilities should fall toward 0. The odds ratio's unbounded growth on the high side and finite penalty on the low side gives the loss a useful asymmetry that the bare probability ratio lacks.
The key architectural decision in ORPO is that both loss terms act simultaneously on the same mini-batch, in the same forward and backward pass.
A standard training batch for ORPO contains triplets of (prompt, chosen response, rejected response). The forward pass computes token-level log probabilities for both the chosen and rejected responses. The SFT loss operates on the chosen response only, maximizing its likelihood. The odds ratio loss operates on both, penalizing the rejected response relative to the chosen one.
This is different from DPO in a subtle but important way. In DPO, the reference model's log probabilities are precomputed and stored, then used during training to normalize the policy's log probability ratios. The reference model serves as a fixed baseline that anchors the training signal. In ORPO, there is no baseline to anchor against. The odds ratio compares the current policy's assignment of probability to chosen versus rejected responses against itself. The contrast is entirely internal to the current model state.
This self-referential contrast works because the odds ratio measures the relative confidence of the model across two outputs at a given training step. It does not require knowing where the model started. The SFT loss ensures the model is still moving in the direction of the target domain; the odds ratio loss ensures it is also developing a preference ordering within that domain.
The result is a training procedure that can start from any pretrained base model, without first running a separate SFT phase to produce a reference checkpoint.
Removing the reference model has a direct practical benefit. In DPO training, both the policy model and the reference model must be loaded into GPU memory simultaneously. For a 7B parameter model in bfloat16 precision, this means approximately 28 GB per model, or roughly 56 GB total before accounting for optimizer states, gradients, and activation memory.
With ORPO, only one model is resident in memory. This roughly halves the model weight memory requirement. Combined with the fact that there is no need to run a separate SFT phase and store its checkpoint, ORPO can make preference alignment accessible on hardware configurations that would otherwise require model parallelism or gradient checkpointing to fit a DPO training run.
The paper notes that ORPO theoretically requires roughly half the number of forward passes per training batch compared to DPO with a reference model. DPO requires a forward pass through both the policy and the reference model for each training example; ORPO requires only one forward pass through the policy. In practice, DPO implementations often precompute reference log probabilities once before training begins and cache them on disk, which reduces the per-step compute overhead at the expense of disk I/O. Even with this optimization, ORPO retains the advantage of avoiding the precomputation step entirely.
The field of preference alignment has produced a number of DPO variants and alternatives, each addressing different perceived shortcomings.
| Method | Reference model | Preference data format | Training stages | Key mechanism |
|---|---|---|---|---|
| RLHF (PPO) | Reward model | Ranked pairs | 3 (SFT, RM, RL) | Policy gradient with reward signal |
| DPO | Frozen SFT model | Chosen / rejected pairs | 2 (SFT, then DPO) | Log probability ratio |
| IPO | Frozen SFT model | Chosen / rejected pairs | 2 (SFT, then IPO) | Regularized probability ratio |
| KTO | Frozen SFT model | Individual thumbs-up / thumbs-down | 2 (SFT, then KTO) | Kahneman-Tversky prospect theory loss |
| ORPO | None | Chosen / rejected pairs | 1 (SFT + alignment combined) | Odds ratio appended to SFT loss |
| SimPO | None | Chosen / rejected pairs | 1 or 2 | Length-normalized sequence probability |
| CPO | None or frozen | Chosen / rejected pairs | 1 or 2 | Sequence-level cross-entropy contrast |
DPO (Rafailov et al., 2023, NeurIPS 2023) reformulates the RLHF objective as a supervised binary classification problem, eliminating the need for an explicit reward model and PPO. It uses a reference model to normalize the policy's probability ratios, which prevents the model from collapsing to trivial solutions. The main drawbacks of DPO are the two-stage training requirement and the doubled memory footprint. DPO has become the most widely adopted preference alignment method and serves as the primary baseline for ORPO comparisons. The ORPO paper reports that Mistral-ORPO-beta achieves win rates above 70% against Mistral fine-tuned with DPO on the HH-RLHF dataset.
One difference often noted by practitioners is the training stability profile. DPO is generally robust to learning rate choice within a range of roughly 1e-7 to 1e-6, while ORPO's combined loss is more sensitive to the interaction between learning rate and the lambda (beta) hyperparameter. Practitioners commonly report that ORPO benefits from a learning rate one to two orders of magnitude higher than DPO, in the 1e-6 to 8e-6 range, because the SFT-like signal dominates the loss and requires more aggressive updates.
IPO (Azar et al., 2024) was developed partly to address a theoretical overfitting concern with DPO. The DPO loss can drive the policy to assign very high probability to chosen responses and very low probability to rejected ones, potentially overfitting on the preference labels. IPO adds a regularization term that penalizes large probability ratios, keeping the policy closer to the reference model's distribution. In practice, IPO tends to perform comparably to DPO on standard benchmarks and shares the same two-stage training requirement.
KTO (Kahneman-Tversky Optimization, Ethayarajh et al., 2024) draws from behavioral economics to construct a loss function that does not require paired chosen/rejected examples. Instead, KTO works with individually labeled responses (thumbs-up or thumbs-down), which makes it applicable to datasets where responses are not naturally paired. KTO still requires a reference model and is typically run as a second stage after SFT. Its main advantage over DPO and ORPO is the more flexible data format.
SimPO (Meng et al., 2024) is another reference-free method that normalizes log probabilities by sequence length before computing the preference objective. This addresses a known tendency of DPO-style methods to favor shorter responses. SimPO shares the reference-free property with ORPO but uses a different loss formulation and typically performs a separate SFT phase before alignment. The key distinction is that SimPO does not contain an SFT component within its loss: it expects the model to already be a strong SFT checkpoint, while ORPO is designed to be applied directly to a pretrained base model. In community comparisons, SimPO and ORPO are often described as complementary rather than competing: SimPO for fine alignment on top of a strong SFT, ORPO for one-shot domain-and-preference adaptation.
Contrastive Preference Optimization (CPO), introduced for machine translation by Xu et al. (2024), shares structural similarity with ORPO. CPO also uses a log-sigmoid-based contrast between chosen and rejected sequence log probabilities and optionally drops the reference model. The principal difference is that CPO uses a raw log-probability ratio rather than a log-odds ratio, and was developed for translation quality optimization rather than chat alignment. CPO appeared roughly contemporaneously with ORPO and is often listed alongside it in surveys of reference-free preference methods.
ORPO is most appropriate when compute or memory budgets are constrained, when the training pipeline should be kept simple, or when starting from a general pretrained base model without a dedicated SFT checkpoint. The single-stage training is particularly convenient for rapid prototyping and for fine-tuning on consumer hardware with limited VRAM.
DPO or IPO may be preferable when a high-quality SFT checkpoint already exists and the goal is to apply further preference tuning on top of it. KTO may be preferable when the available preference data consists of individually labeled responses rather than head-to-head comparisons. The modern post-training stack used by frontier labs and enterprise teams in 2025 is often a combination rather than a single method: SimPO for stability on top of strong SFT, ORPO for robustness in single-shot pipelines, KTO for asymmetric thumbs-up/thumbs-down data, and DPO as a final polish step. Treating ORPO as the only tool is rarely optimal; treating it as one tool in a portfolio matches contemporary practice.
The HuggingFace TRL (Transformer Reinforcement Learning) library added support for ORPO through the ORPOTrainer class. TRL is the standard library for post-training LLMs in the HuggingFace ecosystem, providing implementations of SFT, DPO, PPO, and a range of other alignment methods.
The ORPO implementation in TRL was contributed by Kashif Rasul, Lewis Tunstall, and Alvaro Bartolome. As of TRL v1.0 (released in 2025), ORPOTrainer is marked as experimental in the library's trainer status hierarchy, alongside other community-contributed preference optimization trainers. The stable surface of TRL includes the SFT, DPO, Reward Modeling, RLOO, and GRPO trainers, with ORPO and several others listed as experimental but supported.
The ORPOTrainer expects a preference dataset with three fields per example:
prompt: the input prompt or conversation historychosen: the preferred responserejected: the disfavored responseThis format is identical to the DPO trainer's expected format. Both conversational (chat template) and standard (raw text) dataset formats are supported. When given a conversational dataset, the trainer automatically applies the model's chat template to format the inputs.
The recommended dataset for getting started is trl-lib/ultrafeedback_binarized, which is a processed version of the UltraFeedback dataset with chosen and rejected pairs ready for training. Other popular ORPO-compatible datasets include argilla/dpo-mix-7k, argilla/distilabel-capybara-dpo-7k-binarized, and mlabonne/orpo-dpo-mix-40k.
The ORPOConfig class extends HuggingFace's standard TrainingArguments with ORPO-specific parameters:
| Parameter | Default | Description |
|---|---|---|
beta | 0.1 | The lambda hyperparameter from the paper, controlling the weight of the odds ratio loss relative to the SFT loss |
max_length | 1024 | Maximum sequence length for prompt plus completion |
max_prompt_length | 512 | Maximum prompt length when prompt and completion are tokenized separately |
max_completion_length | None | Optional cap on completion length, useful for encoder-decoder models |
disable_dropout | True | Disabling dropout during training improves stability and is recommended |
generate_during_eval | False | If enabled, the trainer samples completions and logs them to Weights and Biases or Comet during evaluation |
learning_rate | 1e-6 | Lower than standard TrainingArguments (5e-5), appropriate for fine-tuning pretrained language models |
The ORPOConfig also sets some different defaults from standard TrainingArguments. Gradient checkpointing defaults to True and bfloat16 defaults to True when float16 is not set, reducing memory usage.
A minimal training script looks like this:
from datasets import load_dataset
from trl import ORPOConfig, ORPOTrainer
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
train_dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")
training_args = ORPOConfig(output_dir="Qwen2-0.5B-ORPO")
trainer = ORPOTrainer(
model=model,
args=training_args,
processing_class=tokenizer,
train_dataset=train_dataset
)
trainer.train()
The trainer logs several metrics during training, including rewards/chosen, rewards/rejected, rewards/accuracies (the fraction of training steps where the chosen reward exceeds the rejected reward), rewards/margins, log_odds_chosen, log_odds_ratio, and nll_loss. Reading these metrics is the standard way to diagnose ORPO training health: the log_odds_ratio should rise steadily across training, rewards/accuracies should climb above 0.5 within the first few hundred steps, and nll_loss should decrease smoothly.
For Mixture of Experts models, TRL also supports passing the auxiliary router loss through the ORPO objective by setting output_router_logits=True in the model config.
A full example script is available at examples/scripts/orpo.py in the TRL repository, along with a command-line launcher using accelerate launch for multi-GPU training.
The ORPOTrainer accepts a peft_config argument that wraps the model in a PEFT (Parameter-Efficient Fine-Tuning) adapter. This makes it straightforward to combine ORPO with LoRA or QLoRA for training on a single consumer GPU. Because ORPO already avoids loading a reference model, QLoRA plus ORPO can fit a 7B parameter training run on a single GPU with 24 GB of VRAM.
A representative QLoRA + ORPO configuration used in community recipes loads the model in 4-bit NF4 quantization with double quantization, wraps it in a LoRA adapter with rank 16 and alpha 32, targets the standard attention and MLP projection modules (q_proj, k_proj, v_proj, o_proj, up_proj, down_proj, gate_proj), and trains at a learning rate of 8e-6 with beta 0.1, max length 1024, and max prompt length 512. This recipe became widely shared after Maxime Labonne's April 2024 tutorial showed it could fine-tune Llama-3 8B on roughly 1,000 preference examples in about two hours on a single Nvidia L4 GPU.
ORPO is implemented in several training frameworks beyond TRL, reflecting its broad adoption in the open-source post-training ecosystem.
LLaMA-Factory added ORPO support on March 31, 2024, about three weeks after the paper's release. LLaMA-Factory is a unified efficient fine-tuning framework that exposes a YAML or web UI configuration interface across pretraining, SFT, reward modeling, PPO, DPO, KTO, ORPO, and several other training stages. Its ORPO trainer follows the same data format and hyperparameter conventions as TRL but integrates with LLaMA-Factory's data registry and template system, allowing the same preference dataset to be reused across multiple training methods with minimal configuration changes.
Axolotl supports ORPO through its preference tuning configuration block. Axolotl is a configuration-driven fine-tuning framework that emphasizes flexibility and rapid adoption of new techniques. Its preference tuning documentation covers DPO, KTO, ORPO, and SimPO as first-class methods, all sharing a similar YAML schema for dataset specification, model loading, and hyperparameter tuning. Axolotl has become a common choice for cluster-based training workflows because of its built-in support for DeepSpeed, FSDP, and multi-node Accelerate launches.
The paper evaluates ORPO across three model families (Phi-2 at 2.7B, Llama-2 at 7B, and Mistral at 7B) on two preference datasets (Anthropic's HH-RLHF and the binarized UltraFeedback dataset). The baselines include standard SFT, RLHF with PPO, and DPO.
Key results on the binarized UltraFeedback dataset:
| Model | AlpacaEval 2.0 | MT-Bench | IFEval (instr. loose) |
|---|---|---|---|
| Llama-2 Chat 7B | 71.34% | 6.27 | -- |
| Llama-2 ORPO (7B) | 81.26% | -- | -- |
| Mistral-ORPO-alpha (7B) | 11.33% | 7.23 | -- |
| Mistral-ORPO-beta (7B) | 12.20% | 7.32 | 66.19% |
| Zephyr-beta (7B, DPO-based) | 10.99% | 7.34 | -- |
Mistral-ORPO-beta achieves 12.20% on AlpacaEval 2.0, which is competitive with or better than several models exceeding 7B parameters at the time of publication. On MT-Bench, it scores 7.32, comparable to Zephyr-beta despite the latter being trained with a multi-stage DPO pipeline. On IFEval, Mistral-ORPO-beta reaches 66.19% on instruction-level loose accuracy.
The paper also reports win rates against DPO-trained baselines using reward model scoring on HH-RLHF. At the OPT-125M scale, ORPO achieves a 64.3% win rate against DPO. At OPT-1.3B, the win rate rises to 70.9%. This scaling behavior suggests that ORPO's advantage over DPO grows as model capacity increases.
One of the more interesting empirical findings in the paper is a direct measurement of the SFT penalty problem described in the background section. The authors train a model with standard SFT on only the chosen responses from a preference dataset, then evaluate the log probabilities of both chosen and rejected responses during training.
They find that the log probabilities of rejected responses rise during SFT training, alongside the log probabilities of chosen responses. The model is inadvertently becoming more likely to generate both kinds of responses. This validates the motivation for adding a contrastive term to the SFT loss.
When ORPO is used instead, the log probabilities of chosen and rejected responses diverge during training. The chosen response log probability rises while the rejected response log probability falls. The model learns both the domain and the preference ordering simultaneously.
A secondary set of metrics tracked in the paper measures the evolution of the log odds ratio and reward margin across training. The log odds ratio starts near zero (chosen and rejected responses are nearly equally probable in a fresh base model) and rises steadily through training, with a steeper slope when lambda is larger. The reward margin shows a similar trajectory but with more noise, reflecting the fact that reward in the ORPO objective is implicit rather than directly optimized.
The paper notes that ORPO converges to a stable preference state in roughly the same number of total optimizer steps as an SFT-then-DPO pipeline, but achieves this in a single training run rather than two consecutive ones. The wall-clock saving is therefore approximately the full duration of the SFT phase that ORPO eliminates.
Beyond the official Mistral-ORPO-alpha and Mistral-ORPO-beta checkpoints from the KAIST authors, several other ORPO-aligned models have been released by the community.
Alvaro Bartolome, a co-implementer of the TRL ORPOTrainer, released alvarobartt/Mistral-7B-v0.1-ORPO as a demonstration fine-tune of Mistral 7B base on alvarobartt/dpo-mix-7k-simplified, a simplified variant of argilla/dpo-mix-7k with the prompt column detached for direct chat template application. A PEFT/LoRA variant, alvarobartt/Mistral-7B-v0.1-ORPO-PEFT, was released alongside as a low-memory alternative. The accompanying dataset, argilla/dpo-mix-7k, is a small high-quality cocktail compiled by Argilla using their distilabel pipeline, combining several DPO datasets and filtering for chosen responses with high quality ratings. Despite its modest size of about 7,000 examples, it became a popular ORPO benchmark dataset because its quality filtering reduces noise and makes preference signals easier to learn.
Following the release of Llama-3 8B by Meta in April 2024, ORPO became one of the most popular alignment methods for community fine-tunes. Maxime Labonne's tutorial fine-tuning Llama-3 8B with ORPO using TRL and QLoRA on Google Colab was widely cited and replicated. The associated model mlabonne/OrpoLlama-3-8B was one of the earliest Llama-3 ORPO fine-tunes published on the Hugging Face Hub. Many derivative models built on this recipe, including 70B-scale ORPO fine-tunes assembled by community contributors with access to multi-GPU infrastructure.
After the paper's release in March 2024, ORPO saw rapid adoption in the open-source community, particularly for fine-tuning the Llama-3 family of models released by Meta in April 2024. A widely cited tutorial by Maxime Labonne demonstrated fine-tuning Llama-3 8B with ORPO using TRL and QLoRA, helping establish ORPO as a practical alternative to DPO for resource-constrained settings.
ORPO support was also added to Axolotl, a popular configuration-driven fine-tuning framework, and to LLaMA-Factory, another widely used fine-tuning toolkit. This broad framework support lowered the barrier to trying ORPO for researchers and practitioners who prefer configuration-based workflows over writing training scripts from scratch.
On the dataset side, the period after ORPO's release saw a wave of preference datasets optimized for monolithic training. The mlabonne/orpo-dpo-mix-40k dataset, a 40,000-example mix of high-quality DPO datasets, was widely used as a one-stop preference corpus for ORPO fine-tunes. Argilla's distilabel pipeline produced several refined preference datasets in this period, including argilla/distilabel-capybara-dpo-7k-binarized and argilla/dpo-mix-7k, which were heavily used in community ORPO experiments.
The single-stage design of ORPO has influenced subsequent work on efficient preference alignment. The Triple Preference Optimization (TPO) paper (2024) cites ORPO as a motivation for combining alignment objectives rather than staging them sequentially. The RainbowPO paper (ICLR 2025) surveys a family of DPO variants and analyzes ORPO as a representative of the reference-free single-stage class, alongside SimPO.
The Comprehensive Survey of Direct Preference Optimization (Xiao et al., arXiv 2410.15595, 2024) classifies ORPO as a reference-free method within the broader DPO variant family, examining its loss formulation in relation to dozens of other variants. Surveys published in 2025 typically place ORPO alongside SimPO and CPO as the canonical examples of reference-free preference optimization.
A 2025 line of research on Explicit Preference Optimization (arXiv 2506.07492) builds on the observation that ORPO's odds ratio can be interpreted as an implicit reward function and explores variations that make this reward explicit, potentially recovering some of the regularization benefits of reference-based methods without reintroducing the second model.
Researchers have also applied ORPO to multimodal settings. A Stanford CS231n project applied both ORPO and DPO to visual question answering preference alignment, finding that ORPO produced competitive results on VQA benchmarks with less setup overhead.
The ORPO paper explicitly used Mistral 7B as a primary evaluation model and released Mistral-ORPO-alpha and Mistral-ORPO-beta checkpoints. Community fine-tuning extended this to Llama-3 (8B and 70B) and Phi-2 (2.7B). While neither Microsoft's Phi-3 nor Meta's Llama-3 official releases used ORPO in their reported training pipelines, ORPO became one of the most commonly cited methods for efficient community fine-tuning of both model families following the paper's release.
Although the paper recommends lambda = 0.1 as a robust default, practitioners have reported that the optimal value is dataset-dependent. With values too low, the odds ratio penalty provides negligible contrastive signal and the training effectively reduces to SFT. With values too high, the odds ratio term can dominate the SFT loss, causing the model to overfit on the preference labels at the expense of general language modeling quality.
The interaction between lambda and learning rate is particularly important. Community recipes that work well at one learning rate may produce unstable training when the learning rate is changed without also retuning lambda. A common rule of thumb is to scale lambda inversely with learning rate, although no formal scaling law has been published. In practice, lambda values between 0.05 and 0.2 cover most reported successful configurations.
The absence of a reference model is ORPO's main computational advantage, but it is also a source of theoretical concern. Reference models in DPO and IPO serve as regularizers that keep the policy from moving too far from the SFT distribution. Without this anchor, ORPO relies entirely on the self-referential odds ratio to prevent distributional collapse.
In practice, the SFT loss component of the ORPO objective provides some regularization by continuously optimizing for the chosen responses. However, there is no formal guarantee analogous to the KL divergence constraint in RLHF or the reference model normalization in DPO. For tasks requiring careful preservation of specific capabilities from the base model, DPO with a strong SFT reference may be more conservative and reliable. Some 2025 follow-up work has investigated hybrid methods that recover a soft reference signal without adding a second model to GPU memory, but these remain less mature than the core ORPO formulation.
Like DPO and IPO, ORPO requires paired preference data (each training example must have both a chosen and a rejected response for the same prompt). This is a more demanding data format than KTO, which can work with individually labeled responses. Constructing paired preference data requires either human annotation or using a reward model or judge to rank multiple completions per prompt, which adds data preparation overhead.
Most of the paper's evaluations are at 7B parameters or below. Preference optimization at larger scales (30B, 70B, or beyond) introduces more complex training dynamics, and it is less clear whether ORPO's self-referential odds ratio remains stable as model capacity and batch size grow. Multi-GPU training with gradient accumulation can alter the effective batch distribution in ways that interact with the odds ratio computation. The TRL documentation notes specific recommendations for Mixture of Experts models, but systematic evaluation at scales above 7B with ORPO specifically is limited in the literature.
Follow-up surveys in 2025 explicitly flag ORPO's scalability to models above 13 billion parameters and to open-ended generation domains as open questions, alongside the absence of a formal proof of global optimality (only local convergence and gradient boundedness are established in the original paper).
Because ORPO collapses SFT and preference alignment into a single loss, the quality of the chosen and rejected responses jointly determines both signals. Poorly defined or noisy preference pairs can simultaneously degrade the SFT signal (the chosen response is no longer a clean training target) and the alignment signal (the chosen versus rejected contrast is weak). DPO is somewhat more robust to noisy preferences because the SFT phase is decoupled and can be performed on a separately curated high-quality corpus. ORPO practitioners therefore tend to invest more effort in data cleaning, often using LLM-as-judge filtering or human-in-the-loop annotation to ensure the chosen responses are independently high quality.
Practitioners have noted that the reward margin (the difference between rewards/chosen and rewards/rejected in TRL's logged metrics) can be slow to increase during early training, sometimes remaining near zero for the first several hundred steps. This can make it difficult to diagnose whether training is proceeding correctly. Unlike DPO, where the log probability difference provides an intuitive signal, the odds ratio metric is less familiar and its expected trajectory during training is less documented.