KTO (Kahneman-Tversky Optimization) is a method for aligning large language models with human feedback using only binary signals indicating whether a model output is desirable or undesirable. Introduced in February 2024 by Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, and Douwe Kiela of Stanford University and Contextual AI, KTO does not require paired preference comparisons and can instead learn directly from the kind of thumbs-up/thumbs-down feedback that organizations already collect in the ordinary course of deploying AI products.
The method is grounded in Kahneman and Tversky's prospect theory, a framework from behavioral economics that describes how humans perceive gains and losses asymmetrically relative to a reference point. The paper, titled "KTO: Model Alignment as Prospect Theoretic Optimization" (arXiv:2402.01306), was accepted at the International Conference on Machine Learning (ICML) 2024 as a spotlight presentation, placing it in the top 3.5% of accepted papers. The official implementation is available through ContextualAI/HALOs on GitHub, and TRL (Transformer Reinforcement Learning) provides a KTOTrainer for practical use.
The dominant paradigm for aligning language models before KTO was RLHF (Reinforcement Learning from Human Feedback), followed by offline alternatives such as DPO (Direct Preference Optimization). Both approaches depend on a specific kind of training data: pairs of outputs for the same prompt, one labeled as preferred over the other. A human annotator is shown two model responses and asked to indicate which one is better. This pairwise comparison structure is what the Bradley-Terry model (and its derivatives used in RLHF and DPO) was designed to learn from.
Pairwise preference data is expensive and difficult to collect at scale. Building it requires:
| Challenge | Description |
|---|---|
| Comparative annotation | Annotators must read and evaluate two full responses, not just one |
| Agreement overhead | Pairs where annotators disagree are often discarded |
| Data sourcing difficulty | Most real-world logs contain only a single model response per request, not two |
| Limited reuse | Preference data collected for one model version may not transfer cleanly to another |
Organizations that deploy chatbots, writing assistants, or code generation tools typically collect signals such as user thumbs-up/thumbs-down ratings, whether a session ended in a completed task, or whether a user deleted and rewrote a model output. These signals indicate that a given response was good or bad, but they do not involve a comparison between two candidate responses. Converting this kind of data into the pairwise format requires either fabricating a comparison or discarding much of the signal.
KTO was designed to use this type of singleton feedback directly.
The theoretical foundation of KTO comes from a 1979 paper by Daniel Kahneman and Amos Tversky, "Prospect Theory: An Analysis of Decision under Risk," published in Econometrica. The paper is among the most cited in all of social science, and prospect theory was the work cited when Kahneman received the Nobel Memorial Prize in Economic Sciences in 2002.
Prospect theory describes how people actually evaluate uncertain outcomes, as opposed to how rational utility maximization says they should. Three properties of the theory are particularly relevant to KTO:
Reference dependence. People evaluate outcomes relative to a reference point (often the status quo), not in absolute terms. A gain of $50 feels different depending on whether you expected $0 or expected $100.
Loss aversion. Losses hurt more than equivalent gains feel good. In Kahneman and Tversky's experiments, people typically required a potential gain roughly twice the size of a potential loss before accepting a coin-flip bet. The value function for losses is steeper than for gains.
Diminishing sensitivity. The marginal impact of an additional gain or loss decreases as the magnitude grows. Moving from $0 to $10 feels larger than moving from $100 to $110. The value function is concave for gains and convex for losses.
The resulting value function has a characteristic S-shape: steep and approximately linear near the reference point, flattening out in both directions, and asymmetric (steeper on the loss side).
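For reference, the parameterization Tversky and Kahneman estimated in their 1992 follow-up work is commonly written as a piecewise power function:

v(z) = z^α if z ≥ 0 (gains)
v(z) = -λ · (-z)^α if z < 0 (losses)

with median estimates of α ≈ 0.88 and a loss-aversion coefficient λ ≈ 2.25. (These symbols are unrelated to the β, λ_D, and λ_U hyperparameters that KTO introduces later.)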
The KTO paper argues that this framework is directly applicable to language model alignment. When a model produces a response to a prompt, the human evaluating it is not operating in a vacuum; they have a prior expectation of quality. Whether the response clears that bar or falls short shapes how its value should be calculated during training. Methods that ignore this reference-point structure are leaving useful inductive bias on the table.
Ethayarajh and colleagues make two central contributions in the paper.
The first is a theoretical analysis showing that existing alignment objectives, including DPO and a clipped version of PPO, implicitly belong to a broader class of loss functions called Human-Aware Losses (HALOs). A HALO is defined as any loss function of the form:
f(π_θ, π_ref) = E_{x,y~D}[a_{x,y} · v(r_θ(x,y) - E_Q[r_θ(x,y')])] + C_D
where a_{x,y} is +1 or -1 depending on whether the sample is labeled desirable, v is a value function that is non-decreasing and concave in gains, and Q provides a reference distribution from which the expected reward is drawn. The paper proves (Theorem 3.5) that both DPO and PPO-Clip satisfy these conditions. This places the success of methods like DPO in a new light: part of what makes them work is that they are effectively encoding the same asymmetric, reference-dependent utility structure that Kahneman and Tversky documented in human subjects.
The second contribution is KTO itself, a HALO that is directly derived from the Kahneman-Tversky utility model rather than arriving at prospect-theoretic structure by accident. Because the derivation starts from a utility function over individual outputs rather than a likelihood over preference pairs, the resulting objective only requires knowing whether each output is desirable or undesirable, not which of two outputs is better.
The paper was accepted at ICML 2024. The authors released 56 aligned model checkpoints under the name Archangel, spanning multiple base models (Llama, Pythia) at scales from 1B to 30B parameters, each aligned with a different combination of method and dataset.
KTO training follows a familiar structure: start from a supervised fine-tuned (SFT) base model and a frozen reference copy of that model, then update the policy model to increase the utility of desirable outputs and decrease the utility of undesirable ones.
The training data takes a simple form. Each example consists of:
- a prompt x
- a completion y
- a binary label indicating whether y is desirable or undesirable given x

There is no requirement that each prompt have both a desirable and an undesirable example, though training works best when both types appear in each batch. In practice, a dataset might come from customer support logs where successful resolutions are labeled desirable and unsuccessful ones undesirable, from user ratings on a deployed chatbot, or from any other source that provides a binary quality signal on individual outputs.
At each training step:
1. The implied reward for each example is computed as r_θ(x,y) = log[π_θ(y|x) / π_ref(y|x)].
2. The reference point z_0 is estimated using the KL divergence between the policy and the reference on a batch of other completions.
3. The value function and loss (defined below) are computed from the gap between each example's reward and z_0, given its desirable/undesirable label.

The reference point z_0 is what anchors the prospect-theoretic framing. Rather than evaluating whether a reward is high or low in an absolute sense, KTO evaluates whether it is above or below the expected reward across the current batch. This is analogous to the reference point in prospect theory: what matters is not the absolute value, but the deviation from what the model currently expects to receive.
In practice, the KL estimate is computed using a microbatch-shifting heuristic rather than explicit sampling, keeping the computational overhead manageable:
ẑ_0 = max(0, (1/m) Σ log[π_θ(y_j|x_i) / π_ref(y_j|x_i)])
The max(0, ...) ensures the reference point stays non-negative.
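A minimal sketch of that estimate (not the TRL implementation), assuming the sequence-level log-probabilities log π(y_j|x_i) for the mismatched prompt/completion pairs have already been computed:

```python
import torch

def estimate_z0(policy_logps_mismatched, ref_logps_mismatched):
    """Batch-level reference point from mismatched (x_i, y_j) pairs.

    Inputs are 1-D tensors of sequence-level log-probabilities for
    completions shifted onto non-matching prompts within the microbatch
    (assumed to be precomputed by a hypothetical helper).
    """
    # Mean log-ratio over the m mismatched pairs.
    kl_estimate = (policy_logps_mismatched - ref_logps_mismatched).mean()
    # Clamp at zero, per max(0, ...), and treat the estimate as a constant
    # (no gradient is propagated through the reference point).
    return torch.clamp(kl_estimate, min=0.0).detach()
```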
The KTO loss is defined as the expected gap between a target weight and the value function:
L_KTO(π_θ, π_ref) = E_{x,y~D}[λ_y - v(x,y)]
The value function v(x,y) has two branches, one for desirable outputs and one for undesirable ones:
v(x,y) = λ_D · σ(β · (r_θ(x,y) - z_0)) if y is desirable
v(x,y) = λ_U · σ(β · (z_0 - r_θ(x,y))) if y is undesirable
where σ is the logistic sigmoid function, β is a hyperparameter controlling how strongly the model is penalized for deviating from the reference, and λ_D and λ_U are weights for the desirable and undesirable loss terms respectively.
The two branches encode the prospect-theoretic asymmetry:
- For a desirable output, value is gained only to the extent that its implied reward rises above the reference point z_0.
- For an undesirable output, value is gained only to the extent that its implied reward falls below z_0.
- The λ_D and λ_U weights allow undesirable outputs (losses) to be penalized more heavily than desirable outputs (gains) are rewarded, mirroring loss aversion.
The sigmoid substitutes for the exponentiated power function in Kahneman and Tversky's original formulation, trading exact prospect-theoretic shape for numerical stability during gradient-based training.
The λ_D and λ_U parameters play the role of the loss-aversion coefficient. By default both are set to 1. When the training dataset is imbalanced, the recommendation from the paper and the TRL documentation is to adjust them so that the ratio (λ_D × n_desirable) / (λ_U × n_undesirable) falls between 1 and 4/3, where n_desirable and n_undesirable are the counts of positive and negative examples in the dataset.
The KTO loss for a desirable output is minimized when the policy assigns high probability to that output relative to the reference. The loss for an undesirable output is minimized when the policy assigns low probability relative to the reference. Both terms share a common reference point z_0, which prevents the model from gaming the loss by simply scaling all probabilities up or down uniformly.
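A minimal per-example sketch of this loss, assuming sequence-level log-probabilities and a detached batch-level z_0 are already available as scalar tensors (an illustrative helper, not the TRL implementation):

```python
import torch

def kto_example_loss(policy_logp, ref_logp, z0, desirable,
                     beta=0.1, lambda_d=1.0, lambda_u=1.0):
    """Loss for one example, following the two-branch value function above.

    policy_logp, ref_logp: scalar tensors holding log pi_theta(y|x) and
    log pi_ref(y|x); z0 is the detached batch-level reference point.
    """
    # Implied reward: log-ratio of policy to reference probability.
    reward = policy_logp - ref_logp
    if desirable:
        # v(x, y) = lambda_D * sigmoid(beta * (reward - z0))
        value = lambda_d * torch.sigmoid(beta * (reward - z0))
        return lambda_d - value
    else:
        # v(x, y) = lambda_U * sigmoid(beta * (z0 - reward))
        value = lambda_u * torch.sigmoid(beta * (z0 - reward))
        return lambda_u - value
```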
The paper proves (Proposition 4.1) that KTO naturally down-weights examples with extreme implied rewards. If an example carries an implausibly large positive or negative reward, the sigmoid saturates and contributes little gradient. In practice, this means mislabeled or ambiguous examples in real-world feedback data have less influence on training than they would under a method that optimizes log-likelihood directly. DPO, by contrast, can overfit to noise in the preference labels because it maximizes the probability of the chosen response unconditionally.
The paper also proves (Theorem 4.3) that when humans disagree about whether an output is good, KTO trained on the majority label deterministically recovers the majority-preferred output. DPO can fail in the worst case under contradictory preference scenarios, selecting minority-preferred outputs depending on how the pairs are constructed.
The core practical distinction between KTO and methods like DPO or RLHF is what the training data looks like.
RLHF requires: a prompt, two candidate responses, a human preference label indicating which response is better, a separately trained reward model, and a reinforcement learning loop (typically PPO) to update the policy against the reward signal.
DPO requires: a prompt, a chosen response, and a rejected response. The preference pair is the basic unit of data.
KTO requires: a prompt, a single response, and a binary label (desirable or undesirable).
This difference has significant practical consequences. Most deployed AI systems generate a single response per user turn. Collecting a second response for the same prompt to construct a preference pair either requires deliberately sampling a second output (adding inference cost and latency), using a synthetic negative from a different model, or pairing the current output against a historical output under different conditions. None of these approaches directly reflects how users interact with a deployed system.
Binary feedback, on the other hand, maps naturally onto data that organizations already collect: thumbs up/down ratings, whether a user accepted or regenerated a suggestion, whether a customer service interaction ended in resolution, whether a code suggestion was accepted or deleted. The paper notes: "every company has customer interaction data that can be marked as desirable (e.g., sale made) or undesirable (e.g., no sale made)." KTO is designed to use that data without a conversion step.
Another advantage: when KTO is given the same data as DPO in paired form, it can decompose each pair into two singleton examples, yielding up to twice as many training examples from the same annotation budget. Empirically, KTO trained on preference pairs broken into binary singletons often matches or exceeds DPO trained on the same data.
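A minimal sketch of that decomposition, assuming paired rows with prompt, chosen, and rejected fields:

```python
def pairs_to_kto_rows(paired_rows):
    """Split each chosen/rejected pair into two unpaired KTO examples."""
    rows = []
    for ex in paired_rows:
        # The chosen response becomes a desirable singleton...
        rows.append({"prompt": ex["prompt"], "completion": ex["chosen"], "label": True})
        # ...and the rejected response becomes an undesirable one.
        rows.append({"prompt": ex["prompt"], "completion": ex["rejected"], "label": False})
    return rows
```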
The following table summarizes the key differences between KTO, DPO, and PPO-based RLHF.
| Property | KTO | DPO | PPO (RLHF) |
|---|---|---|---|
| Data format | Single response + binary label | Preference pair (chosen/rejected) | Preference pairs + reward model |
| Requires paired comparisons | No | Yes | Yes |
| Reward model required | No | No | Yes |
| Online sampling required | No | No | Yes |
| Reference model required | Yes | Yes | Yes (as KL penalty) |
| Training stability | High | High | Lower (RL instability) |
| Loss function type | Prospect-theoretic utility | Log-likelihood of preferences | PPO clipped policy gradient |
| Theoretical basis | Kahneman-Tversky utility model | Bradley-Terry preference model | Reward maximization with KL constraint |
| SFT prerequisite | Optional for large models | Usually required | Usually required |
| Noise robustness | High (sigmoid saturation) | Moderate | Depends on reward model quality |
| Performance across 1B-30B scales | Matches or exceeds DPO | Baseline | Variable |
The paper's experimental results show KTO matching or exceeding DPO at all tested scales from 1B to 30B parameters. On the UltraFeedback dataset with a Zephyr-based model, KTO improved GSM8K (mathematical reasoning) accuracy by 13.5 percentage points over DPO. On the OpenAssistant dataset with Llama-7B, KTO aligned with a single desirable or undesirable output per prompt still outperformed DPO trained on full preference pairs, even though this setup reduced the raw training data volume by 72%.
For larger models (13B+), the paper finds that KTO can sometimes skip the supervised fine-tuning stage entirely and still produce well-aligned models. DPO applied directly to a base model without SFT tends to produce outputs that ramble and hallucinate. KTO applied without prior SFT shows more stable behavior, likely because the prospect-theoretic loss function has stronger regularization properties.
Versus PPO-based RLHF, both KTO and DPO are substantially simpler to implement and more stable to train. RLHF requires a separately trained reward model, an online generation loop, and careful tuning of PPO hyperparameters. KTO and DPO are offline methods that use fixed datasets, removing the feedback loop that makes RLHF hard to debug. Among the offline methods, KTO has the additional advantage of not requiring the paired data format.
HuggingFace's TRL (Transformer Reinforcement Learning) library provides a KTOTrainer class that wraps the KTO training procedure for use with any causal language model from the Transformers library.
As of TRL v1.0, KTOTrainer and KTOConfig were moved to the trl.experimental.kto module while a refactor to align KTO with TRL's standard core trainer architecture was in progress. The API is functional but subject to change.
A minimal training script looks like:
```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl.experimental.kto import KTOConfig, KTOTrainer

# Load the policy model and its tokenizer.
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")

# Unpaired binary-feedback dataset in the prompt/completion/label format.
train_dataset = load_dataset("trl-lib/kto-mix-14k", split="train")

training_args = KTOConfig(output_dir="Qwen2-0.5B-KTO")
trainer = KTOTrainer(
    model=model,
    args=training_args,
    processing_class=tokenizer,
    train_dataset=train_dataset,
)
trainer.train()
```
KTOTrainer expects an unpaired preference dataset where each example has three fields:
- prompt: the input prompt, as a string or a list of chat messages
- completion: the model's response
- label: a boolean, True for desirable and False for undesirable

The trainer also accepts paired preference datasets (chosen/rejected format) and automatically converts them to unpaired binary examples by splitting each pair into two rows.
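A hand-built dataset in the unpaired format might be constructed as follows (illustrative values):

```python
from datasets import Dataset

# Illustrative unpaired examples in the prompt/completion/label format.
train_dataset = Dataset.from_list([
    {"prompt": "Summarize the ticket in one sentence.",
     "completion": "The customer cannot reset their password on mobile.",
     "label": True},   # desirable
    {"prompt": "Summarize the ticket in one sentence.",
     "completion": "Passwords are an important part of account security.",
     "label": False},  # undesirable
])
```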
| Parameter | Recommended range | Notes |
|---|---|---|
| Learning rate | 5e-7 to 5e-6 | Default 1e-6; do not exceed 1e-6 for β=0.1 |
| β (beta) | 0.05 to 0.10 | Controls KL penalty strength |
| Per-step batch size | At least 4 | Smaller batches produce poor KL estimates |
| Effective batch size | 16 to 128 | Use gradient accumulation if needed |
| λ_D (desirable weight) | Adjust for imbalance | Target ratio λ_D×n_D / λ_U×n_U ∈ [1, 4/3] |
| λ_U (undesirable weight) | Adjust for imbalance | Default 1 |
The learning rate constraint is tighter than typical fine-tuning. The TRL documentation warns that exceeding the recommended range for a given β value degrades performance noticeably. If more iterations of training are needed with a small dataset, increasing epochs is preferable to increasing the learning rate.
The batch size recommendation comes from the KL estimate: z_0 is computed from the other examples in the same batch, so very small batches produce noisy reference points and destabilize training.
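Putting these recommendations together, a configuration along the following lines stays within the suggested ranges (field names per the TRL documentation; values illustrative). The desirable_weight here assumes a hypothetical dataset with roughly three undesirable examples for every desirable one:

```python
from trl.experimental.kto import KTOConfig

training_args = KTOConfig(
    output_dir="Qwen2-0.5B-KTO",
    learning_rate=1e-6,              # default; kept at or below 1e-6 for beta = 0.1
    beta=0.1,                        # KL penalty strength
    per_device_train_batch_size=4,   # at least 4 per step for a usable z_0 estimate
    gradient_accumulation_steps=8,   # effective batch size of 32
    desirable_weight=3.0,            # lambda_D for a ~1:3 desirable:undesirable split (ratio = 1)
    undesirable_weight=1.0,          # lambda_U
)
```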
For MoE architectures such as Mixtral, enabling the auxiliary load-balancing loss during KTO training is recommended. This is done by setting output_router_logits=True in the model config and optionally adjusting router_aux_loss_coef.
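A sketch of that setup, assuming a Mixtral checkpoint (model name and coefficient value are illustrative):

```python
from transformers import AutoModelForCausalLM

# Load a Mixtral-style MoE model with the auxiliary load-balancing loss enabled.
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-Instruct-v0.1",
    output_router_logits=True,    # add the router's load-balancing loss to the total loss
    router_aux_loss_coef=0.001,   # weight of the load-balancing term
)
```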
The most direct application of KTO is aligning a model on data that comes from a real deployment. When users interact with a chatbot, a writing assistant, or a code completion tool, their actions generate implicit binary signals: accepting a suggestion, clicking thumbs up, completing a task without regenerating, or conversely, ignoring a suggestion, regenerating, or reporting a problem. These signals are available at scale without requiring dedicated annotation.
KTO can consume this kind of data directly. An organization that logs which responses led to successful user sessions and which did not can train a KTO-aligned model on those logs, gradually improving the policy to generate more responses like the successful ones and fewer like the unsuccessful ones.
Another class of KTO-compatible data comes from downstream task outcomes. A customer service platform might label interactions as desirable if the customer's issue was resolved (measured by whether they submitted another ticket within 24 hours) and undesirable otherwise. A coding assistant might label a suggestion as desirable if the developer accepted and committed it. A document summarization tool might label outputs as desirable if the user saved the summary or undesirable if they discarded it.
In each case, the label comes from the outcome of the interaction rather than a human evaluation of response quality. This kind of proxy feedback is imperfect but abundant, and KTO's noise robustness (the sigmoid saturation property) provides some protection against mislabeled examples.
Many products collect Likert-scale ratings (1 to 5 stars, or similar). Converting these to binary labels for KTO is straightforward: ratings above a threshold are desirable, ratings below are undesirable. The threshold can be set at the median or at a natural quality boundary.
This is simpler than the conversion needed for DPO. Converting rating data to DPO-format preferences requires pairing responses that received different ratings for the same prompt, which means either collecting multiple responses per prompt (expensive) or matching responses across different users who saw the same prompt (messy and potentially confounded by context differences).
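A minimal sketch of the conversion, assuming a hypothetical log of rows with prompt, completion, and a 1-to-5 stars field, with the threshold set at 4:

```python
def likert_to_kto(rows, threshold=4):
    """Map 1-5 star ratings to binary KTO labels.

    Rows rated at or above the threshold become desirable; the rest
    become undesirable. Field names are illustrative.
    """
    return [
        {"prompt": r["prompt"],
         "completion": r["completion"],
         "label": r["stars"] >= threshold}
        for r in rows
    ]
```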
Because KTO can be applied without prior SFT at larger scales, it is also a candidate for continued alignment of models that are already deployed and receiving user feedback. A model that has been SFT-trained and then deployed can accumulate binary feedback from users and then be re-aligned using KTO on that feedback, in a continuous loop that improves the model over time without requiring the construction of a fresh preference dataset for each update cycle.
Weaker signal per example. A binary desirable/undesirable label contains less information than a preference pair. A preference pair tells the model not just that one output is good and another is bad, but also their relative quality and (implicitly) the dimensions on which they differ. KTO training does not receive this relative signal. In settings where high-quality preference data is available, DPO may learn faster from the same annotation budget.
Reference point estimation quality. The z_0 reference point is approximated from the batch, not computed exactly. With small batches (fewer than 4 per step), the estimate is too noisy to be useful, and the loss becomes less stable. This places a floor on the minimum effective batch size that does not exist for DPO.
Imbalanced data sensitivity. The KTO loss is sensitive to the ratio of desirable to undesirable examples. If one type dominates heavily, the model may learn to reduce the probability of all outputs or increase it indiscriminately. The λ_D and λ_U parameters exist to compensate, but tuning them correctly requires some knowledge of the dataset composition.
Experimental API status. As of TRL v1.0, the KTOTrainer is in the trl.experimental module and the API may change. Users building production pipelines on KTO should pin their TRL version and watch for breaking changes.
No direct theoretical guarantee of Pareto improvement over DPO. The paper shows that KTO matches or exceeds DPO empirically across many settings, but also states that "there is no universally superior HALO; optimal choice depends on setting-specific inductive biases." In settings with clean, abundant preference data and good annotator agreement, DPO may still be the right choice.
BCO (Binary Classifier Optimization) was introduced in April 2024 by Jung et al. (arXiv:2404.04656). BCO also trains on binary feedback signals but uses a classification-based objective rather than a prospect-theoretic value function. On paired preference datasets, BCO surpasses KTO and performs comparably to DPO. On real-world Likert-scale annotation data, BCO outperforms both DPO and KTO. BCO and KTO represent two different approaches to the same problem of learning from binary feedback without pairwise comparisons.
Mo-KTO (Multi-Objective KTO) extends KTO to multi-objective settings where multiple distinct human preferences need to be balanced simultaneously. Introduced in a 2025 SSRN paper by Xie, Hu, and Zhang, Mo-KTO adapts the KTO value function to handle competing desirability criteria.
ORPO (Odds Ratio Preference Optimization) is a related method that eliminates the reference model entirely by incorporating a preference signal directly into the supervised fine-tuning loss via an odds ratio term. ORPO requires paired data like DPO but avoids the computational cost of maintaining a reference model.
IPO (Identity Preference Optimization) was introduced to address DPO's tendency to overfit preference datasets by replacing the log-likelihood objective with a bounded function that does not saturate. IPO requires paired data but provides stronger theoretical guarantees against overfitting.
SLiC (Sequence Likelihood Calibration) combines a max-margin loss on preferences with a standard language modeling loss. Like DPO and IPO, SLiC requires paired preferences.
Within the HALO framework introduced by the KTO paper, DPO, PPO-Clip, KTO, and other methods can all be understood as instances of the same general family. Future work may derive new HALOs with different inductive biases or stronger theoretical properties for specific application settings.
KTO attracted significant attention after the paper's publication in February 2024, partly because it addressed a practical bottleneck (the need for pairwise preference data) that practitioners had been working around in various ways. Contextual AI presented the work at NVIDIA GTC 2024 under the title "Better, Cheaper, Faster LLM Alignment with KTO."
HuggingFace integrated KTO into TRL shortly after the paper's release, making the training procedure accessible to practitioners using the standard Transformers ecosystem. The kto-mix-14k dataset on HuggingFace Hub provides a ready-to-use unpaired binary feedback dataset for experimenting with the method.
The HALOs GitHub repository released the Archangel model suite, which comprises 56 model checkpoints aligned with different methods (DPO, KTO, PPO, and others) across different base models and scales. This set of checkpoints enabled direct empirical comparison of alignment methods under controlled conditions and has been used in subsequent research.
Kawin Ethayarajh, the first author, describes KTO as "the industry standard for aligning LLMs on offline binary feedback," reflecting the method's uptake in production settings where preference data is not available but binary feedback is abundant.
The prospect-theoretic framing in the KTO paper also influenced a broader discussion in the alignment research community about whether alignment objectives should be derived from descriptive models of human psychology (how humans actually evaluate outputs) rather than normative models (how a rational agent would rank outputs). The HALO framework provides a mathematical vocabulary for this discussion by characterizing which existing methods already encode which human biases, implicitly or explicitly.