KTO

KTO (Kahneman-Tversky Optimization) is a method for aligning large language models with human feedback using only binary signals indicating whether a model output is desirable or undesirable. Introduced in February 2024 by Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, and Douwe Kiela of Stanford University and Contextual AI, KTO does not require paired preference comparisons and can instead learn directly from the kind of thumbs-up/thumbs-down feedback that organizations already collect in the ordinary course of deploying AI products.

The method is grounded in Kahneman and Tversky's prospect theory, a framework from behavioral economics that describes how humans perceive gains and losses asymmetrically relative to a reference point. The paper, titled "KTO: Model Alignment as Prospect Theoretic Optimization" (arXiv:2402.01306), was accepted at the International Conference on Machine Learning (ICML) 2024 as a spotlight presentation, placing it in the top 3.5% of accepted papers. The official implementation is available through ContextualAI/HALOs on GitHub, and TRL (Transformer Reinforcement Learning) provides a KTOTrainer for practical use.

KTO sits within a broader family of objectives the paper christens Human-Aware Losses (HALOs). Within that family, the authors place DPO and a clipped version of PPO alongside KTO itself, showing that each method already encodes some of the cognitive biases Kahneman and Tversky originally documented in human subjects. The framing turns the design of alignment objectives into an empirical question about which inductive biases best match a given feedback distribution rather than a search for the single "correct" loss.

Background: the problem with pairwise preference data

The dominant paradigm for aligning language models before KTO was RLHF (Reinforcement Learning from Human Feedback), followed by offline alternatives such as DPO (Direct Preference Optimization). Both approaches depend on a specific kind of training data: pairs of outputs for the same prompt, one labeled as preferred over the other. A human annotator is shown two model responses and asked to indicate which one is better. This pairwise comparison structure is what the Bradley-Terry model (and its derivatives used in RLHF and DPO) was designed to learn from.

Pairwise preference data is expensive and difficult to collect at scale. Building it requires:

Challenge	Description
Comparative annotation	Annotators must read and evaluate two full responses, not just one
Agreement overhead	Pairs where annotators disagree are often discarded
Data sourcing difficulty	Most real-world logs contain only a single model response per request, not two
Limited reuse	Preference data collected for one model version may not transfer cleanly to another
Intransitivity	Annotator preferences may cycle (A > B, B > C, C > A), violating Bradley-Terry assumptions
Cognitive load	Reading and weighing two long responses is slower than tagging a single one

Organizations that deploy chatbots, writing assistants, or code generation tools typically collect signals such as user thumbs-up/thumbs-down ratings, whether a session ended in a completed task, or whether a user deleted and rewrote a model output. These signals indicate that a given response was good or bad, but they do not involve a comparison between two candidate responses. Converting this kind of data into the pairwise format requires either fabricating a comparison or discarding much of the signal.

There is also a sampling asymmetry that pairwise methods struggle with. In a deployed system, the policy that produced response A is the only policy that existed at the moment that user interacted with the system. A second response would have to be generated post hoc from a different policy, which means the resulting pair conflates differences in quality with differences in the generative distribution. Annotation pipelines for academic preference datasets sidestep the issue by sampling both candidates from the same model in a controlled setting, but this is not how production logs accumulate.

KTO was designed to use this type of singleton feedback directly, without forcing a conversion step.

Kahneman-Tversky prospect theory

The theoretical foundation of KTO comes from a 1979 paper by Daniel Kahneman and Amos Tversky, "Prospect Theory: An Analysis of Decision under Risk," published in Econometrica. The paper is among the most cited in all of social science and was cited when Kahneman received the Nobel Memorial Prize in Economics in 2002.

Prospect theory describes how people actually evaluate uncertain outcomes, as opposed to how rational utility maximization says they should. Three properties of the theory are particularly relevant to KTO:

Reference dependence. People evaluate outcomes relative to a reference point (often the status quo), not in absolute terms. A gain of $50 feels different depending on whether you expected $0 or expected $100.

Loss aversion. Losses hurt more than equivalent gains feel good. In Kahneman and Tversky's experiments, people typically required a potential gain roughly twice the size of a potential loss before accepting a coin-flip bet. The value function for losses is steeper than for gains.

Diminishing sensitivity. The marginal impact of an additional gain or loss decreases as the magnitude grows. Moving from $0 to $10 feels larger than moving from $100 to $110. The value function is concave for gains and convex for losses.

The resulting value function has a characteristic S-shape: steep and approximately linear near the reference point, flattening out in both directions, and asymmetric (steeper on the loss side). In the gain region, concavity implies risk aversion, since a guaranteed gain is preferred to a coin-flip with the same expected value. In the loss region, convexity implies risk seeking, since a coin-flip that might avoid a loss is preferred to a guaranteed equivalent loss. This sign-dependent risk attitude is one of the empirical phenomena prospect theory was constructed to capture.
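The three properties can be made concrete with the parametric value function that Tversky and Kahneman fitted in their later work on cumulative prospect theory. The sketch below is illustrative only: the exponent 0.88 and loss-aversion coefficient 2.25 are the median estimates from that 1992 follow-up, not values given in this article, and the function name `kt_value` is hypothetical.

```python
def kt_value(z, alpha=0.88, lam=2.25):
    """Prospect-theory value of an outcome z, measured relative to the reference point.

    alpha < 1 produces diminishing sensitivity; lam > 1 produces loss aversion.
    The defaults are assumed here for illustration (Tversky & Kahneman, 1992).
    """
    if z >= 0:
        return z ** alpha             # concave in gains
    return -lam * ((-z) ** alpha)     # convex and steeper in losses


# Loss aversion: a $50 loss weighs more than a $50 gain.
gain, loss = kt_value(50), kt_value(-50)
print(f"v(+50) = {gain:.2f}, v(-50) = {loss:.2f}")

# Diminishing sensitivity: the step from 0 to 10 counts for more than 100 to 110.
first_step = kt_value(10) - kt_value(0)
later_step = kt_value(110) - kt_value(100)
print(f"0 -> 10: {first_step:.2f}, 100 -> 110: {later_step:.2f}")
```

With these parameters the ratio between the value of a loss and the value of an equal-sized gain is exactly the loss-aversion coefficient, which is roughly the factor of two reported in the coin-flip experiments.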

The KTO paper argues that this framework is directly applicable to language model alignment. When a model produces a response to a prompt, the human evaluating it is not operating in a vacuum; they have a prior expectation of quality. Whether the response clears that bar or falls short shapes how its value should be calculated during training. Methods that ignore this reference-point structure are leaving useful inductive bias on the table.

There is a second connection that the paper makes explicit. The KL-divergence regularization term that appears in both RLHF and DPO can be read as imposing a soft reference point: the reference model defines what the policy is implicitly being compared against. Prospect theory provides a principled vocabulary for that comparison, allowing the loss function to encode loss aversion and diminishing sensitivity rather than treating positive and negative deviations symmetrically.

The KTO paper

Ethayarajh and colleagues make two central contributions in the paper.

The first is a theoretical analysis showing that existing alignment objectives, including DPO and a clipped version of PPO, implicitly belong to a broader class of loss functions called Human-Aware Losses (HALOs). A HALO is defined as any loss function of the form:

f(π_θ, π_ref) = E_{x,y~D}[a_{x,y} · v(r_θ(x,y) - E_Q[r_θ(x,y')])] + C_D

where a_{x,y} is +1 or -1 depending on whether the sample is labeled desirable, v is a value function that is non-decreasing and concave in gains, and Q provides a reference distribution from which the expected reward is drawn. The paper proves (Theorem 3.5) that both DPO and PPO-Clip satisfy these conditions. This places the success of methods like DPO in a new light: part of what makes them work is that they are effectively encoding the same asymmetric, reference-dependent utility structure that Kahneman and Tversky documented in human subjects.

The second contribution is KTO itself, a HALO that is directly derived from the Kahneman-Tversky utility model rather than arriving at prospect-theoretic structure by accident. Because the derivation starts from a utility function over individual outputs rather than a likelihood over preference pairs, the resulting objective only requires knowing whether each output is desirable or undesirable, not which of two outputs is better.

Several minor results in the paper round out the analysis. The authors show that the cross-entropy loss used in plain supervised fine-tuning is not a HALO, which helps explain why SFT alone tends to under-perform on chat benchmarks. They show that the Bradley-Terry preference structure underlying DPO is a specific instance of the HALO family with a particular choice of value function. They also show that the HALO family is not closed under composition, which has implications for stacking multiple alignment objectives in a single training run.

The paper was accepted at ICML 2024 as a spotlight presentation. The authors released 56 aligned model checkpoints under the name Archangel, spanning multiple base models (Llama, Pythia) at scales from 1B to 30B parameters, each aligned with a different combination of method and dataset. Archangel was designed to enable head-to-head comparison of alignment methods under controlled conditions, since the base model, SFT data, and preference data are held fixed across method variants.

Algorithm

KTO training follows a familiar structure: start from a supervised fine-tuned (SFT) base model and a frozen reference copy of that model, then update the policy model to increase the utility of desirable outputs and decrease the utility of undesirable ones. The reference copy is held fixed throughout training. As in other preference optimization algorithms, the log-probability gap between policy and reference acts as the implicit reward signal.

The training data takes a simple form. Each example consists of:

A prompt x
A completion y
A binary label: whether y is desirable or undesirable given x

There is no requirement that each prompt have both a desirable and an undesirable example, though the training works best when both types appear in each batch. In practice, a dataset might come from customer support logs where successful resolutions are labeled desirable and unsuccessful ones undesirable, from user ratings on a deployed chatbot, or from any other source that provides a binary quality signal on individual outputs.

At each training step:

The policy model and the reference model both compute log probabilities over the completion tokens.
The implied reward for each example is computed as the log-probability ratio between the policy and the reference: r_θ(x,y) = log[π_θ(y|x) / π_ref(y|x)].
A reference point z_0 is estimated using the KL divergence between the policy and the reference on a batch of other completions.
The KTO value function maps each (reward, label) pair to a scalar utility.
The loss is the difference between target utility and achieved utility, averaged over the batch.
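The first two steps above can be sketched in a few lines. This is a minimal illustration with made-up per-token log-probabilities, not the TRL or HALOs implementation; `sequence_logp` and `implied_reward` are hypothetical names.

```python
import math


def sequence_logp(token_logps):
    """Log-probability of a completion: the sum of its per-token log-probabilities."""
    return sum(token_logps)


def implied_reward(policy_token_logps, ref_token_logps):
    """r_theta(x, y) = log pi_theta(y|x) - log pi_ref(y|x)."""
    return sequence_logp(policy_token_logps) - sequence_logp(ref_token_logps)


# Made-up numbers for a three-token completion scored by both models.
policy_logps = [-0.5, -1.0, -0.2]
ref_logps = [-0.7, -1.1, -0.6]

r = implied_reward(policy_logps, ref_logps)
print(f"implied reward: {r:.2f}")
print(f"probability ratio pi_theta / pi_ref: {math.exp(r):.2f}")
```

A positive reward means the policy assigns the completion more probability than the reference does; a negative reward means it assigns less.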

The reference point z_0 is what anchors the prospect-theoretic framing. Rather than evaluating whether a reward is high or low in an absolute sense, KTO evaluates whether it is above or below the expected reward across the current batch. This is analogous to the reference point in prospect theory: what matters is not the absolute value, but the deviation from what the model currently expects to receive.

In practice, the KL estimate is computed using a microbatch-shifting heuristic rather than explicit sampling, keeping the computational overhead manageable:

ẑ_0 = max(0, (1/m) Σ log[π_θ(y_j|x_i) / π_ref(y_j|x_i)])

Here each prompt x_i is paired with a completion y_j taken from a different example in the same microbatch of size m. The max(0, ...) ensures the reference point stays non-negative, as a true KL divergence must be. The within-batch nature of the estimate means that a single example does not change its own reference point; it shifts the reference for the other examples in the same step, providing a contrastive signal that does not require sampling fresh completions during training.
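A rough sketch of the estimate follows. Shifting completions by exactly one position is an assumption made here for illustration; actual implementations may build the mismatched pairs differently. The function names and the log-probability values are hypothetical.

```python
def shifted_pairs(prompts, completions):
    """Pair each prompt x_i with the completion of the next example in the batch.

    The shift by one position (wrapping around) is an assumed instance of the
    microbatch-shifting heuristic.
    """
    m = len(prompts)
    return [(prompts[i], completions[(i + 1) % m]) for i in range(m)]


def estimate_z0(policy_logps, ref_logps):
    """z0_hat = max(0, mean(log pi_theta(y_j|x_i) - log pi_ref(y_j|x_i)))."""
    diffs = [p - r for p, r in zip(policy_logps, ref_logps)]
    return max(0.0, sum(diffs) / len(diffs))


pairs = shifted_pairs(["x0", "x1", "x2"], ["y0", "y1", "y2"])
print(pairs)  # every prompt is matched with another example's completion

# Made-up sequence log-probabilities of the mismatched pairs under each model.
z0 = estimate_z0(policy_logps=[-12.0, -9.5, -14.0], ref_logps=[-12.5, -9.0, -14.6])
print(f"z0 estimate: {z0:.3f}")
```

When the averaged log-ratio is negative, the clamp returns zero rather than a negative reference point.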

KTO loss function

The KTO loss is defined as the expected gap between a target weight and the value function:

L_KTO(π_θ, π_ref) = E_{x,y~D}[λ_y - v(x,y)]

The value function v(x,y) has two branches, one for desirable outputs and one for undesirable ones:

v(x,y) = λ_D · σ(β · (r_θ(x,y) - z_0))    if y is desirable
v(x,y) = λ_U · σ(β · (z_0 - r_θ(x,y)))    if y is undesirable

where σ is the logistic sigmoid function, β is a hyperparameter controlling how strongly the model is penalized for deviating from the reference, and λ_D and λ_U are weights for the desirable and undesirable loss terms respectively.
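The definitions above translate fairly directly into code. The following is a minimal scalar sketch that takes precomputed implied rewards and a fixed z_0; it is not the TRL or HALOs implementation, and the function names and example numbers are hypothetical.

```python
import math


def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def kto_value(reward, desirable, z0, beta=0.1, lambda_d=1.0, lambda_u=1.0):
    """v(x, y): one branch for desirable outputs, one for undesirable outputs."""
    if desirable:
        return lambda_d * sigmoid(beta * (reward - z0))
    return lambda_u * sigmoid(beta * (z0 - reward))


def kto_loss(rewards, labels, z0, beta=0.1, lambda_d=1.0, lambda_u=1.0):
    """Batch mean of lambda_y - v(x, y)."""
    total = 0.0
    for reward, desirable in zip(rewards, labels):
        lam = lambda_d if desirable else lambda_u
        total += lam - kto_value(reward, desirable, z0, beta, lambda_d, lambda_u)
    return total / len(rewards)


# Made-up implied rewards: two desirable and two undesirable completions.
rewards = [2.0, -1.0, 3.0, -4.0]
labels = [True, True, False, False]
loss = kto_loss(rewards, labels, z0=0.5)
print(f"KTO loss: {loss:.4f}")
```

An example whose reward sits exactly at the reference point contributes half of its weight to the loss, since the sigmoid evaluates to 0.5 there.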

The two branches encode the prospect-theoretic asymmetry:

For a desirable output, utility increases when the policy assigns higher probability than the reference (positive reward minus reference point). The model is rewarded for generating outputs the human found good at a higher rate than baseline.
For an undesirable output, utility increases when the policy assigns lower probability than the reference (negative reward relative to reference point). The model is penalized for generating outputs the human found bad.

The sigmoid substitutes for the exponentiated power function in Kahneman and Tversky's original formulation, trading exact prospect-theoretic shape for numerical stability during gradient-based training. The result preserves the qualitative S-shape of the value function while making the gradients well-behaved across the full real line, which matters when the policy and reference can occasionally disagree by very large log-odds.

The λ_D and λ_U parameters play the role of the loss-aversion coefficient. By default they are both set to 1. When the training dataset is imbalanced, the recommendation from the paper and TRL documentation is to adjust them so that the ratio (λ_D × n_desirable) / (λ_U × n_undesirable) falls between 1:1 and 4:3, where n_desirable and n_undesirable are the counts of positive and negative examples in the dataset.
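As a worked example of the imbalance rule, the helper below picks weights so that the ratio lands on a chosen target inside [1, 4/3]. Keeping one weight at 1 and rescaling the other is a design choice made for this sketch, not something the paper prescribes, and the function names are hypothetical.

```python
def weight_ratio(n_desirable, n_undesirable, lambda_d=1.0, lambda_u=1.0):
    """(lambda_D * n_desirable) / (lambda_U * n_undesirable)."""
    return (lambda_d * n_desirable) / (lambda_u * n_undesirable)


def suggest_weights(n_desirable, n_undesirable, target=1.0):
    """Return (lambda_D, lambda_U) such that the weighted ratio equals `target`.

    Assumption for illustration: one weight stays at 1 and the other is raised.
    """
    if not 1.0 <= target <= 4.0 / 3.0:
        raise ValueError("target ratio should lie in [1, 4/3]")
    raw = n_desirable / n_undesirable
    if raw < target:
        # Too few desirable examples: up-weight them.
        return target / raw, 1.0
    # Too many desirable examples: up-weight the undesirable ones.
    return 1.0, raw / target


# 1,000 thumbs-up and 4,000 thumbs-down examples.
lambda_d, lambda_u = suggest_weights(1000, 4000)
print(lambda_d, lambda_u, weight_ratio(1000, 4000, lambda_d, lambda_u))
```

With a four-to-one surplus of undesirable examples, raising λ_D to 4 brings the weighted ratio back to 1:1.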

The KTO loss for a desirable output is minimized when the policy assigns high probability to that output relative to the reference. The loss for an undesirable output is minimized when the policy assigns low probability relative to the reference. Both terms share a common reference point z_0, which prevents the model from gaming the loss by simply scaling all probabilities up or down uniformly.

The gradient of the loss has a particularly clean form. Differentiating with respect to the policy parameters gives a term proportional to σ(βz)(1 - σ(βz)), where z = r_θ(x,y) - z_0 is the distance of the implied reward from the reference point. This sigmoid product peaks at z = 0 and falls off to zero at both extremes. The result is that updates concentrate on examples whose current implied reward is close to the reference point, where the policy is most uncertain about how to behave, and largely ignore examples that are already classified strongly either way.
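The shape of that scaling factor is easy to check numerically. The sketch below evaluates only the sigmoid product, with z taken as the implied reward minus z_0; it omits the constant factors (β and the λ weight) and the gradient of the log-probability itself. The function names and reward values are hypothetical.

```python
import math


def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def gradient_scale(reward, z0, beta=0.1):
    """sigma(beta * z) * (1 - sigma(beta * z)) with z = reward - z0."""
    s = sigmoid(beta * (reward - z0))
    return s * (1.0 - s)


for reward in (-100.0, -10.0, 0.0, 10.0, 100.0):
    scale = gradient_scale(reward, z0=0.0)
    print(f"reward = {reward:+7.1f}   scale = {scale:.6f}")
```

The factor reaches its maximum of 0.25 when the reward equals the reference point and shrinks symmetrically as the reward moves away in either direction.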

Noise robustness

The paper proves (Proposition 4.1) that KTO naturally down-weights examples with extreme implied rewards. If an example carries an implausibly large positive or negative reward, the sigmoid saturates and contributes little gradient. In practice, this means mislabeled or ambiguous examples in real-world feedback data have less influence on training than they would under a method that optimizes log-likelihood directly. DPO, by contrast, can overfit to noise in the preference labels because it maximizes the probability of the chosen response unconditionally.

This robustness property has practical consequences for production datasets. Thumbs-up and thumbs-down ratings collected from real users are noisy: users sometimes click the wrong button, sometimes thumbs down a response for reasons unrelated to its quality (such as a UI annoyance), and sometimes give positive feedback to outputs that an expert would judge harmful. KTO's saturation behavior means that a small fraction of bad labels has limited effect on the resulting policy, while clean labels in the middle of the difficulty distribution drive most of the learning.

Majority preference recovery

The paper also proves (Theorem 4.3) that when humans disagree about whether an output is good, KTO trained on the majority label deterministically recovers the majority-preferred output. DPO can fail in the worst case under contradictory preference scenarios, selecting minority-preferred outputs depending on how the pairs are constructed. This makes KTO particularly attractive in settings where annotator agreement is low, which is common when subjective qualities such as creativity, tone, or political balance are being evaluated.

Connection to f-divergence estimators

A subsequent line of analysis, including the BCO paper discussed below, observes that KTO is implicitly estimating a particular f-divergence between the desirable and undesirable conditional distributions. KTO's specific functional form corresponds to a total variation (TV) distance estimator, while BCO's corresponds to a Jensen-Shannon (JS) divergence estimator. Both belong to the same family of unpaired alignment methods, but the choice of f-divergence affects which kinds of distribution shift between the positive and negative subsets the method can correct.

Single-sample binary feedback

The core practical distinction between KTO and methods like DPO or RLHF is what the training data looks like.

RLHF requires: a prompt, two candidate responses, a human preference label indicating which response is better, a separately trained reward model, and a reinforcement learning loop (typically PPO) to update the policy against the reward signal.

DPO requires: a prompt, a chosen response, and a rejected response. The preference pair is the basic unit of data.

KTO requires: a prompt, a single response, and a binary label (desirable or undesirable).

This difference has significant practical consequences. Most deployed AI systems generate a single response per user turn. Collecting a second response for the same prompt to construct a preference pair requires deliberately sampling a second output (adding inference cost and latency), using a synthetic negative from a different model, or pairing the current output against a historical output produced under different conditions. None of these approaches directly reflects how users interact with a deployed system.

Binary feedback, on the other hand, maps naturally onto data that organizations already collect: thumbs up/down ratings, whether a user accepted or regenerated a suggestion, whether a customer service interaction ended in resolution, whether a code suggestion was accepted or deleted. The paper notes that "every company has customer interaction data that can be marked as desirable (e.g., sale made) or undesirable (e.g., no sale made)." KTO is designed to use that data without a conversion step.

Another advantage: when KTO is given the same data as DPO in paired form, it can decompose each pair into two singleton examples, giving it up to twice the training examples from the same annotation budget. Empirically, breaking preference pairs into binary singletons for KTO training often matches or exceeds DPO performance on the same data.

A related advantage is robustness to annotator drift. If a feedback pipeline switches from one set of annotators to another over time, or if an organization gradually changes the criteria it uses to label outputs, KTO can incorporate the resulting data without re-pairing it. Pairwise methods are sensitive to the precise notion of preference encoded in each pair, since two annotators with different definitions of "better" may produce pairs that the model cannot reconcile.

Comparison with DPO and PPO

The following table summarizes the key differences between KTO, DPO, and PPO-based RLHF.

Property	KTO	DPO	PPO (RLHF)
Data format	Single response + binary label	Preference pair (chosen/rejected)	Preference pairs + reward model
Requires paired comparisons	No	Yes	Yes
Reward model required	No	No	Yes
Online sampling required	No	No	Yes
Reference model required	Yes	Yes	Yes (as KL penalty)
Training stability	High	High	Lower (RL instability)
Loss function type	Prospect-theoretic utility	Log-likelihood of preferences	PPO clipped policy gradient
Theoretical basis	Kahneman-Tversky utility model	Bradley-Terry preference model	Reward maximization with KL constraint
Implicit divergence estimated	Total variation between desirable and undesirable	Bradley-Terry-induced KL surrogate	Reward-shaped KL
SFT prerequisite	Optional for large models	Usually required	Usually required
Noise robustness	High (sigmoid saturation)	Moderate	Depends on reward model quality
Performance at 1B to 30B scale	Matches or exceeds DPO	Baseline	Variable

The paper's experimental results show KTO matching or exceeding DPO at all tested scales from 1B to 30B parameters. On the UltraFeedback dataset with a Zephyr-based model, KTO improved GSM8K (mathematical reasoning) accuracy by 13.5 percentage points over DPO, with additional gains on MMLU, HumanEval, and BigBench-Hard. On the OpenAssistant dataset with Llama-7B, KTO aligned with a single desirable or undesirable output per prompt still outperforms DPO trained on full preference pairs, even though this setup reduces the raw training data volume by 72%.

For larger models (13B+), the paper finds that KTO can sometimes skip the supervised fine-tuning stage entirely and still produce well-aligned models. DPO applied directly to a base model without SFT tends to produce outputs that ramble and hallucinate. KTO applied without prior SFT shows more stable behavior, likely because the prospect-theoretic loss function has stronger regularization properties.

Versus PPO-based RLHF, both KTO and DPO are substantially simpler to implement and more stable to train. RLHF requires a separately trained reward model, an online generation loop, and careful tuning of PPO hyperparameters. KTO and DPO are offline methods that use fixed datasets, removing the feedback loop that makes RLHF hard to debug. Among the offline methods, KTO has the additional advantage of not requiring the paired data format.

The paper also reports that on noisy public preference datasets such as SHP, OpenAssistant, and UltraFeedback, KTO outperforms DPO more decisively than on cleaner academic preference sets. The authors attribute the gap to intransitive preferences (where a > b, b > c, c > a) and to label noise from synthetic AI judges; KTO's reference-dependent loss tolerates both better than the Bradley-Terry objective.

Implementation in HuggingFace TRL

HuggingFace's TRL (Transformer Reinforcement Learning) library provides a KTOTrainer class that wraps the KTO training procedure for use with any causal language model from the Transformers library.

As of TRL v1.0, KTOTrainer and KTOConfig were moved to the trl.experimental.kto module while a refactor to align KTO with TRL's standard core trainer architecture was in progress. The API is functional but subject to change.

A minimal training script looks like:

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl.experimental.kto import KTOConfig, KTOTrainer

# Policy model to be aligned, and its tokenizer.
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")

# Unpaired dataset with prompt, completion, and boolean label columns.
train_dataset = load_dataset("trl-lib/kto-mix-14k", split="train")

# Default hyperparameters; checkpoints are written to output_dir.
training_args = KTOConfig(output_dir="Qwen2-0.5B-KTO")
trainer = KTOTrainer(
    model=model,
    args=training_args,
    processing_class=tokenizer,
    train_dataset=train_dataset,
)
trainer.train()

Dataset format

KTOTrainer expects an unpaired preference dataset where each example has three fields:

prompt: the input prompt as a string or list of chat messages
completion: the model's response
label: a boolean, True for desirable and False for undesirable

The trainer also accepts paired preference datasets (chosen/rejected format) and automatically converts them to unpaired binary examples by splitting each pair into two rows.
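The relationship between the two formats can be illustrated without any library code. The sketch below shows the idea of the conversion on plain dictionaries; it is not TRL's internal routine, and the helper name `unpair` and the example row are hypothetical.

```python
def unpair(example):
    """Split one {prompt, chosen, rejected} row into two unpaired rows."""
    return [
        {"prompt": example["prompt"], "completion": example["chosen"], "label": True},
        {"prompt": example["prompt"], "completion": example["rejected"], "label": False},
    ]


paired = [
    {"prompt": "What is 2 + 2?", "chosen": "4", "rejected": "5"},
]

# Flatten: every preference pair yields one desirable and one undesirable row.
unpaired = [row for example in paired for row in unpair(example)]
for row in unpaired:
    print(row)
```

Each preference pair becomes one desirable row and one undesirable row that share the same prompt.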

Recommended hyperparameters

Parameter	Recommended range	Notes
Learning rate	5e-7 to 5e-6	Default 1e-6; do not exceed 1e-6 for β=0.1
β (beta)	0.05 to 0.10	Controls KL penalty strength
Per-step batch size	At least 4	Smaller batches produce poor KL estimates
Effective batch size	16 to 128	Use gradient accumulation if needed
λ_D (desirable weight)	Adjust for imbalance	Target ratio (λ_D×n_D) / (λ_U×n_U) ∈ [1, 4/3]
λ_U (undesirable weight)	Adjust for imbalance	Default 1
Epochs	1 to 3	Prefer more epochs over higher LR if data is scarce
Gradient clipping	1.0	Standard value; reduces spikes from rare large rewards

The learning rate constraint is tighter than typical fine-tuning. The TRL documentation warns that exceeding the recommended range for a given β value degrades performance noticeably. If more iterations of training are needed with a small dataset, increasing epochs is preferable to increasing the learning rate.

The batch size recommendation comes from the KL estimate: z_0 is computed from the other examples in the same batch, so very small batches produce noisy reference points and destabilize training. The TRL implementation checks the batch size at initialization and raises a warning when it is too small for this estimate to be well-defined.
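The ranges in the table above can be turned into a quick pre-flight check. The helper below is a hypothetical convenience written for this article, not part of TRL; it only encodes the table's recommendations and assumes the effective batch size is the per-device batch size times the gradient accumulation steps times the number of devices.

```python
import math


def effective_batch_size(per_device_batch, grad_accum_steps, num_devices=1):
    """Number of examples contributing to each optimizer step."""
    return per_device_batch * grad_accum_steps * num_devices


def check_kto_settings(per_device_batch, grad_accum_steps, num_devices=1,
                       learning_rate=1e-6, beta=0.1):
    """Return warnings for settings outside the recommended ranges."""
    warnings = []
    if per_device_batch < 4:
        warnings.append("per-step batch size below 4: KL estimate may be poor")
    eff = effective_batch_size(per_device_batch, grad_accum_steps, num_devices)
    if not 16 <= eff <= 128:
        warnings.append(f"effective batch size {eff} is outside 16 to 128")
    if not 5e-7 <= learning_rate <= 5e-6:
        warnings.append("learning rate is outside 5e-7 to 5e-6")
    if math.isclose(beta, 0.1) and learning_rate > 1e-6:
        warnings.append("learning rate above 1e-6 with beta = 0.1")
    return warnings


print(check_kto_settings(per_device_batch=4, grad_accum_steps=8))   # no warnings
print(check_kto_settings(per_device_batch=2, grad_accum_steps=4,
                         learning_rate=5e-6))
```

Gradient accumulation raises the effective batch size but not the per-step batch size, so it does not help with the KL estimate itself.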

Mixture of experts models

For MoE architectures such as Mixtral, enabling the auxiliary load-balancing loss during KTO training is recommended. This is done by setting output_router_logits=True in the model config and optionally adjusting router_aux_loss_coef. Without the auxiliary loss, the KTO objective can starve some experts of gradient, leading to routing collapse.

Memory and throughput

Because KTO retains a frozen reference model alongside the trainable policy, GPU memory consumption is roughly twice that of plain supervised fine-tuning at the same batch size. Practitioners often use LoRA or QLoRA adapters on the policy side and share the base weights with the reference, eliminating the duplication. The TRL KTOTrainer supports the PEFT library out of the box for this purpose.

Use cases

Production interaction logs

The most direct application of KTO is aligning a model on data that comes from a real deployment. When users interact with a chatbot, a writing assistant, or a code completion tool, their actions generate implicit binary signals: accepting a suggestion, clicking thumbs up, completing a task without regenerating, or conversely, ignoring a suggestion, regenerating, or reporting a problem. These signals are available at scale without requiring dedicated annotation.

KTO can consume this kind of data directly. An organization that logs which responses led to successful user sessions and which did not can train a KTO-aligned model on those logs, gradually improving the policy to generate more responses like the successful ones and fewer like the unsuccessful ones.

Helpfulness signals from downstream outcomes

Another class of KTO-compatible data comes from downstream task outcomes. A customer service platform might label interactions as desirable if the customer's issue was resolved (measured by whether they refrained from submitting another ticket within 24 hours) and undesirable otherwise. A coding assistant might label a suggestion as desirable if the developer accepted and committed it. A document summarization tool might label outputs as desirable if the user saved the summary or undesirable if they discarded it.

In each case, the label comes from the outcome of the interaction rather than a human evaluation of response quality. This kind of proxy feedback is imperfect but abundant, and KTO's noise robustness (the sigmoid saturation property) provides some protection against mislabeled examples.

Bootstrapping from rating data

Many products collect Likert-scale ratings (1 to 5 stars, or similar). Converting these to binary labels for KTO is straightforward: ratings above a threshold are desirable, ratings below are undesirable. The threshold can be set at the median or at a natural quality boundary.
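A minimal sketch of the thresholding step is shown below. Dropping ratings that fall exactly on the threshold is an assumption made here, since the text leaves that case open; the function name, field names, and example rows are hypothetical.

```python
def likert_to_kto(rows, threshold=3):
    """Map 1-5 star ratings to unpaired rows with boolean labels.

    Assumption: ratings equal to the threshold are treated as uninformative
    and skipped rather than forced into either class.
    """
    converted = []
    for row in rows:
        if row["rating"] == threshold:
            continue
        converted.append({
            "prompt": row["prompt"],
            "completion": row["completion"],
            "label": row["rating"] > threshold,
        })
    return converted


ratings = [
    {"prompt": "Summarize the report.", "completion": "Summary A", "rating": 5},
    {"prompt": "Summarize the report.", "completion": "Summary B", "rating": 3},
    {"prompt": "Draft a reply.", "completion": "Reply C", "rating": 1},
]

kto_rows = likert_to_kto(ratings)
for row in kto_rows:
    print(row)
```

The output rows carry the same three fields (prompt, completion, label) that the unpaired dataset format uses.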

This is simpler than the conversion needed for DPO. Converting rating data to DPO-format preferences requires pairing responses that received different ratings for the same prompt, which means either collecting multiple responses per prompt (expensive) or matching responses across different users who saw the same prompt (messy and potentially confounded by context differences).

Continued alignment of deployed models

Because KTO can be applied without prior SFT at larger scales, it is also a candidate for continued alignment of models that are already deployed and receiving user feedback. A model that has been SFT-trained and then deployed can accumulate binary feedback from users and then be re-aligned using KTO on that feedback, in a continuous loop that improves the model over time without requiring the construction of a fresh preference dataset for each update cycle.

Safety alignment with minimal data

Capital One's Enterprise AI team has reported using KTO in combination with SFT and DPO for safety-critical objectives, reporting that the stack improved attack detection rates by over 50% across several open-source models while using only modest annotation budgets. The case is illustrative because safety teams often have a long tail of single-example reports (one specific jailbreak, one specific refusal failure) rather than balanced preference pairs, which is precisely the regime where KTO has a structural advantage over pairwise methods.

Models trained with KTO

Several public models illustrate how KTO fits into a contemporary post-training stack.

Model	Base	Notes
Archangel KTO checkpoints	Llama, Pythia	56-model suite released alongside the paper, spanning 1B to 30B parameters
Contextual_KTO_Mistral_PairRM	Mistral 7B Instruct v0.2	Three iterations of KTO on the Snorkel-Mistral-PairRM-DPO-Dataset, with each previous iteration used as the reference for the next
KTO-aligned Zephyr variants	Mistral 7B	Replacing DPO with KTO in the Zephyr training pipeline, with GSM8K gains of roughly 13 points reported in the original paper
Various Llama derivatives	Llama family	Community-trained checkpoints on the kto-mix-14k dataset and on private production logs

The Contextual_KTO_Mistral_PairRM model is particularly notable. Released in March 2024, it reached a score of 33.23 on the verified AlpacaEval 2.0 leaderboard, ranking second at the time. The training procedure took the Snorkel-Mistral-PairRM-DPO-Dataset (originally constructed as a preference dataset using PairRM to rank five sampled completions per prompt) and converted it into binary labels by treating the top-ranked completion as desirable and the bottom-ranked one as undesirable. KTO was then applied iteratively, with the previous iteration's model serving as the reference for the next. The result outperformed the matching DPO recipe on the same data, providing direct evidence that the choice of objective can dominate the choice of paired-vs-unpaired data format.

The Snorkel-Mistral-PairRM-DPO model, which served as the upstream baseline, reached 30.22 on AlpacaEval 2.0 using the same dataset trained with DPO. Comparing the two side by side gives a clean head-to-head: same base model, same data, different loss, with KTO winning by about three points.

Limitations

Weaker signal per example. A binary desirable/undesirable label contains less information than a preference pair. A preference pair tells the model not just that one output is good and another is bad, but also their relative quality and (implicitly) the dimensions on which they differ. KTO training does not receive this relative signal. In settings where high-quality preference data is available, DPO may learn faster from the same annotation budget.

Reference point estimation quality. The z_0 reference point is approximated from the batch, not computed exactly. With small batches (fewer than 4 per step), the estimate is too noisy to be useful, and the loss becomes less stable. This places a floor on the minimum effective batch size that does not exist for DPO. The constraint matters for memory-bound setups where large batches are difficult to fit, particularly for long-context models.

Imbalanced data sensitivity. The KTO loss is sensitive to the ratio of desirable to undesirable examples. If one class dominates heavily, the model may learn to reduce the probability of all outputs or to increase it indiscriminately. The λ_D and λ_U parameters exist to compensate, but tuning them correctly requires knowing the dataset composition. In production logs, positive feedback is usually rarer than implicit negative feedback (most users do not bother to rate at all, and those who do tend to be unhappy), so calibrating the weight ratio is an early step in any practical pipeline.
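The weight calibration can be sketched as a small helper. It assumes the guideline usually quoted for KTO, namely keeping (λ_D × number of desirable examples) / (λ_U × number of undesirable examples) between 1 and 4/3. The function name and the choice to pin the majority class at weight 1.0 are assumptions made for illustration.

```python
def balance_weights(
    n_desirable: int,
    n_undesirable: int,
    target_ratio: float = 1.0,
) -> tuple[float, float]:
    """Return (lambda_d, lambda_u) such that
    (lambda_d * n_desirable) / (lambda_u * n_undesirable) == target_ratio.

    The majority class keeps weight 1.0 and the minority class is upweighted.
    """
    if n_desirable <= 0 or n_undesirable <= 0:
        raise ValueError("both classes need at least one example")
    if not 1.0 <= target_ratio <= 4.0 / 3.0:
        raise ValueError("target_ratio is assumed to lie in [1, 4/3]")

    # First try pinning lambda_u = 1 and solving for lambda_d.
    lambda_d = target_ratio * n_undesirable / n_desirable
    if lambda_d >= 1.0:
        return lambda_d, 1.0
    # Desirable examples are the majority: pin lambda_d = 1 instead.
    return 1.0, n_desirable / (target_ratio * n_undesirable)


# Hypothetical production log: 1,000 thumbs-up and 4,000 thumbs-down.
lambda_d, lambda_u = balance_weights(1000, 4000)
```

With the hypothetical counts above, desirable examples receive four times the weight of undesirable ones, which brings the weighted class totals into balance.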

Experimental API status. As of TRL v1.0, the KTOTrainer is in the trl.experimental module and the API may change. Users building production pipelines on KTO should pin their TRL version and watch for breaking changes.

No direct theoretical guarantee of Pareto improvement over DPO. The paper shows that KTO matches or exceeds DPO empirically across many settings, but also states that "there is no universally superior HALO; optimal choice depends on setting-specific inductive biases." In settings with clean, abundant preference data and good annotator agreement, DPO may still be the right choice.

Memory overhead from the reference model. Like DPO, KTO requires a frozen reference model to compute the implied reward. This roughly doubles GPU memory at the same batch size unless adapters such as LoRA are used to share base weights. Reference-free alternatives such as SimPO eliminate this cost but lose some of the regularization benefits of the KL anchor.

Bias from biased feedback. KTO inherits the biases of whatever process generated the binary labels. If thumbs-up ratings are systematically correlated with response length, sycophancy, or other surface features rather than genuine quality, KTO will faithfully amplify those correlations. The method does not, on its own, solve the problem of poorly-defined or poorly-collected feedback.

Variants and successors

BCO (Binary Classifier Optimization) was introduced in April 2024 by Jung et al. (arXiv:2404.04656) and presented at ACL 2025. BCO also trains on binary feedback signals but uses a classification-based objective rather than a prospect-theoretic value function. The method estimates Jensen-Shannon divergence between the desirable and undesirable conditional distributions, whereas KTO can be read as estimating total variation distance. On paired preference datasets, BCO matches DPO and KTO. On real-world Likert-scale annotation data, where the underlying distributions for thumbs-up and thumbs-down subsets diverge, BCO outperforms both DPO and KTO across multiple base models and datasets. BCO and KTO represent two different points in the design space of unpaired alignment methods, each with theoretical motivations and empirical strengths.

Mo-KTO (Multi-Objective KTO) extends KTO to multi-objective settings where multiple distinct human preferences need to be balanced simultaneously. Introduced in a 2025 SSRN paper by Xie, Hu, and Zhang, Mo-KTO adapts the KTO value function to handle competing desirability criteria.

TKTO (TS Kahneman-Tversky Optimization) is a Thompson-sampling variant proposed for sequential decision settings, where the model collects fresh binary feedback during training rather than working only from a fixed dataset. The construction maintains the prospect-theoretic loss but pairs it with a sampling rule designed to balance exploration and exploitation.

ORPO (Odds Ratio Preference Optimization) is a related method that eliminates the reference model entirely by incorporating a preference signal directly into the supervised fine-tuning loss via an odds ratio term. ORPO requires paired data like DPO but avoids the computational cost of maintaining a reference model. It is sometimes grouped with KTO in surveys because both target the same broad goal of simplifying the alignment pipeline.
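The odds ratio term can be illustrated with a short sketch. It assumes the usual formulation in which each response's probability p is taken as the exponential of its average token log-probability, the odds are p / (1 - p), and the penalty is the negative log-sigmoid of the log-odds gap between the chosen and rejected responses. The function names are hypothetical, and the supervised fine-tuning term and its mixing weight are omitted.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def log_odds(avg_logp: float) -> float:
    """log(p / (1 - p)) for p = exp(avg_logp); requires avg_logp < 0."""
    if avg_logp >= 0.0:
        raise ValueError("average log-probability must be negative")
    return avg_logp - math.log1p(-math.exp(avg_logp))


def odds_ratio_term(avg_logp_chosen: float, avg_logp_rejected: float) -> float:
    """-log sigmoid(log-odds(chosen) - log-odds(rejected))."""
    gap = log_odds(avg_logp_chosen) - log_odds(avg_logp_rejected)
    return softplus(-gap)


# The penalty is small when the chosen response is the more likely one.
penalty_good = odds_ratio_term(-0.5, -2.0)
penalty_bad = odds_ratio_term(-2.0, -0.5)
```

Because the term depends only on the policy's own probabilities, no reference model appears anywhere in the computation.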

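IPO (Identity Preference Optimization) was introduced to address DPO's tendency to overfit preference datasets. It replaces DPO's log-sigmoid objective with a squared-error loss that pulls the gap between the chosen and rejected log-likelihood ratios toward a fixed target, so the regularization toward the reference model does not vanish when preferences are nearly deterministic. IPO requires paired data but provides stronger theoretical guarantees against overfitting.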

SimPO (Simple Preference Optimization) removes the explicit reference log-ratio term from the loss entirely, using the policy's own length-normalized log probability as the implicit reward. SimPO is reference-free and tends to be more stable than DPO under noisy labels, though it loses the KL anchor that prevents the policy from drifting too far from the SFT distribution.
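The length-normalized reward is easy to show in code. The sketch below assumes the standard SimPO formulation, where the reward is the average token log-probability scaled by β and the loss is the negative log-sigmoid of the reward gap minus a target margin γ. The function names and the default values of β and γ are illustrative assumptions, not recommended settings.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def simpo_reward(token_logps: list[float], beta: float = 2.0) -> float:
    """Implicit reward: beta times the average token log-probability."""
    if not token_logps:
        raise ValueError("response must contain at least one token")
    return beta * sum(token_logps) / len(token_logps)


def simpo_loss(
    chosen_logps: list[float],
    rejected_logps: list[float],
    beta: float = 2.0,
    gamma: float = 0.5,
) -> float:
    """-log sigmoid(reward(chosen) - reward(rejected) - gamma)."""
    margin = simpo_reward(chosen_logps, beta) - simpo_reward(rejected_logps, beta) - gamma
    return softplus(-margin)


# Length normalization: a long and a short response with the same
# per-token log-probability receive the same reward.
reward_long = simpo_reward([-1.0, -1.0, -1.0, -1.0])
reward_short = simpo_reward([-1.0, -1.0])
```

Only policy log-probabilities enter the loss, which is why no reference model needs to be kept in memory.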

SLiC (Sequence Likelihood Calibration) combines a max-margin loss on preferences with a standard language modeling loss. Like DPO and IPO, SLiC requires paired preferences.

Within the HALO framework introduced by the KTO paper, DPO, PPO-Clip, KTO, and other methods can all be understood as instances of the same general family. Future work may derive new HALOs with different inductive biases or stronger theoretical properties for specific application settings. A common pattern in the literature since 2024 is to stack multiple alignment objectives (for example SFT, then DPO, then KTO on a thin layer of production feedback), treating each method as a tool for a different stage of the pipeline rather than a single answer to the alignment problem.

Adoption

KTO attracted significant attention after the paper's publication in February 2024, partly because it addressed a practical bottleneck (the need for pairwise preference data) that practitioners had been working around in various ways. Contextual AI presented the work at NVIDIA GTC 2024 under the title "Better, Cheaper, Faster LLM Alignment with KTO."

HuggingFace integrated KTO into TRL shortly after the paper's release, making the training procedure accessible to practitioners using the standard Transformers ecosystem. The kto-mix-14k dataset on HuggingFace Hub provides a ready-to-use unpaired binary feedback dataset for experimenting with the method. Community implementations and tutorials accumulated rapidly through 2024 and 2025, with KTO becoming a standard entry in surveys of post-training methods alongside DPO, IPO, and SimPO.

The HALOs GitHub repository released the Archangel model suite, which comprises 56 model checkpoints aligned with different methods (DPO, KTO, PPO, and others) across different base models and scales. This set of checkpoints enabled direct empirical comparison of alignment methods under controlled conditions and has been used in subsequent research as a standardized benchmark for evaluating new alignment objectives.

Kawin Ethayarajh, the first author, describes KTO as "the industry standard for aligning LLMs on offline binary feedback," reflecting the method's uptake in production settings where preference data is not available but binary feedback is abundant. Industry adoption has been visible in alignment work at Capital One (safety alignment with mixed feedback types), Snorkel AI (preference distillation pipelines that produce both DPO and KTO-compatible datasets), and a long tail of smaller teams that integrate KTO into post-training scripts for domain-specific assistants.

The prospect-theoretic framing in the KTO paper also influenced a broader discussion in the alignment research community about whether alignment objectives should be derived from descriptive models of human psychology (how humans actually evaluate outputs) rather than normative models (how a rational agent would rank outputs). The HALO framework provides a mathematical vocabulary for this discussion by characterizing which existing methods already encode which human biases, implicitly or explicitly.

A separate downstream effect is on dataset curation practice. Because KTO can split a preference pair into two singleton examples and still learn effectively, dataset releases since 2024 have increasingly shipped both paired and unpaired views of the same data, with documentation that explains how to feed each format into different trainers. The kto-mix-14k dataset is one early example of this convention; the Snorkel preference distillation pipeline produces output usable in both formats by default.

References

Ethayarajh, K., Xu, W., Muennighoff, N., Jurafsky, D., & Kiela, D. (2024). KTO: Model Alignment as Prospect Theoretic Optimization. *Proceedings of the 41st International Conference on Machine Learning (ICML 2024)*. arXiv:2402.01306.
Kahneman, D., & Tversky, A. (1979). Prospect Theory: An Analysis of Decision under Risk. *Econometrica, 47*(2), 263-291.
Rafailov, R., Sharma, A., Mitchell, E., Manning, C. D., Ermon, S., & Finn, C. (2023). Direct Preference Optimization: Your Language Model is Secretly a Reward Model. *NeurIPS 2023*. arXiv:2305.18290.
Ouyang, L., et al. (2022). Training language models to follow instructions with human feedback. *NeurIPS 2022*. arXiv:2203.02155.
Jung, S., Han, G., Nam, D. W., & On, K.-W. (2024). Binary Classifier Optimization for Large Language Model Alignment. arXiv:2404.04656. ACL 2025.
HuggingFace TRL Documentation. KTO Trainer. https://huggingface.co/docs/trl/kto_trainer
ContextualAI. HALOs: A library with extensible implementations of DPO, KTO, PPO, ORPO, and other human-aware loss functions. https://github.com/ContextualAI/HALOs
Contextual AI. (2024). Better, Cheaper, Faster LLM Alignment with KTO. https://contextual.ai/better-cheaper-faster-llm-alignment-with-kto/
Azar, M. G., et al. (2023). A General Theoretical Paradigm to Understand Learning from Human Feedback. arXiv:2310.12036. (IPO)
Zhao, Y., et al. (2023). SLiC-HF: Sequence Likelihood Calibration with Human Feedback. arXiv:2305.10425.
Meng, Y., Xia, M., & Chen, D. (2024). SimPO: Simple Preference Optimization with a Reference-Free Reward. arXiv:2405.14734.
Hong, J., Lee, N., & Thorne, J. (2024). ORPO: Monolithic Preference Optimization without Reference Model. arXiv:2403.07691.
Tunstall, L., et al. (2023). Zephyr: Direct Distillation of LM Alignment. arXiv:2310.16944.
Snorkel AI. Snorkel-Mistral-PairRM-DPO. https://huggingface.co/snorkelai/Snorkel-Mistral-PairRM-DPO
Contextual AI. Contextual_KTO_Mistral_PairRM. https://huggingface.co/ContextualAI/Contextual_KTO_Mistral_PairRM
