Contrastive decoding
Last reviewed
Jun 8, 2026
Sources
8 citations
Review status
Source-backed
Revision
v1 · 2,032 words
Contrastive decoding (CD) is a decoding strategy for text generation from a large language model that selects tokens by contrasting two models of different sizes. It computes the difference between the log-probabilities of a large, capable "expert" model and a small, weaker "amateur" model, and prefers tokens that the expert favors much more than the amateur does. The intuition is that systematic failure modes of language models, such as repetition, genericness, and incoherence, are more pronounced in the smaller model, so subtracting the amateur's predictions cancels out those shared errors while preserving the knowledge that is unique to the expert. A plausibility filter prevents the contrast from elevating tokens the expert considers unlikely. The method was introduced in "Contrastive Decoding: Open-ended Text Generation as Optimization" by Xiang Lisa Li and collaborators at Stanford University, Meta AI, the University of Washington, Carnegie Mellon University, and Johns Hopkins University, first posted in October 2022 and published at ACL 2023. [1][7]
Contrastive decoding requires no additional training and adds only the cost of running a small second model. It began as a way to improve open-ended generation, but a 2023 follow-up showed it also raises accuracy on reasoning benchmarks, and a related technique, DoLa, applies the same contrastive idea within a single model by contrasting its layers to reduce hallucination. [2][3]
Decoding from a language model means turning its next-token probability distributions into an actual sequence of tokens, and the choice of method has a large effect on quality. Maximizing probability through greedy decoding or beam search is a poor objective for open-ended generation: it tends to produce short, dull, and repetitive text that loops on itself. This degeneration was documented by Ari Holtzman and colleagues in 2019, in the same work that introduced nucleus sampling; Holtzman is also a co-author of the contrastive decoding paper. [4] Stochastic methods such as nucleus (top-p) and top-k sampling restore diversity, but because they draw tokens at random they can still pick low-probability tokens that cause topic drift, factual errors, and incoherence. Open-ended decoding therefore faces a tension between coherence and diversity that neither extreme resolves well. [1]
Contrastive decoding attacks this from a different angle, using a second model as a foil instead of reweighting one model's distribution by a fixed rule. Because small language models exhibit the same undesirable tendencies as large ones, only more strongly, the amateur serves as a model of what to avoid. Tokens where the two models agree are usually either trivially correct or shared mistakes, whereas tokens where the expert is far more confident than the amateur tend to reflect the expert's superior fluency and world knowledge. Scoring tokens on this gap keeps generation coherent without collapsing into repetition. [1]
The original paper frames decoding as optimization: search for the output string that maximizes a contrastive objective, defined as the difference between the sequence log-likelihood under the expert and under the amateur. For a candidate token w in context c, the per-token score is
score(w; c) = log p_expert(w | c) - log p_amateur(w | c),
which is the log of the ratio between the two models' probabilities. Summed over a sequence, maximizing this quantity favors continuations the expert finds far more likely than the amateur. [1]
Used alone, this log-ratio misbehaves. The amateur is often right about the easy, high-probability next token (for example, completing a common phrase), and the raw difference would penalize exactly those correct choices. It can also reward bizarre tokens that both models consider unlikely but the amateur dislikes slightly more. To prevent this, CD applies an adaptive plausibility constraint that first restricts the candidate set to tokens the expert already finds credible. Only tokens whose expert probability is at least a fraction alpha of the most probable token's probability remain eligible:
V_valid(c) = { w : p_expert(w | c) >= alpha * max_w' p_expert(w' | c) }.
The contrast is then applied only within this set. The default alpha is 0.1, meaning any token with less than one-tenth of the top token's probability is discarded before contrasting. This keeps the expert in control of plausibility, while the amateur is used only to rank the options the expert already considers reasonable. In the original experiments the expert and amateur came from the same family, such as OPT 13B contrasted with OPT 125M, or GPT-2 1.5B contrasted with a smaller GPT-2. Across Wikipedia, news, and story domains, CD significantly outperformed nucleus and top-k sampling on automatic metrics and human evaluations of fluency, coherence, and diversity. [1]
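The interaction between the raw log-ratio and the plausibility constraint can be illustrated with a minimal sketch. The function name `contrastive_scores`, the four-token vocabulary, and all probabilities below are made up for illustration and do not come from the paper.

```python
import math

def contrastive_scores(p_expert, p_amateur, alpha=0.1):
    """Score tokens by log p_expert - log p_amateur, keeping only tokens
    whose expert probability is at least alpha times the expert's maximum."""
    threshold = alpha * max(p_expert.values())
    return {
        tok: math.log(p_expert[tok]) - math.log(p_amateur[tok])
        for tok in p_expert
        if p_expert[tok] >= threshold
    }

# Hypothetical next-token distributions (illustrative numbers only).
p_expert = {"Paris": 0.60, "the": 0.30, "London": 0.0999, "zxq": 0.0001}
p_amateur = {"Paris": 0.30, "the": 0.50, "London": 0.199999, "zxq": 0.000001}

# Raw log-ratio with no constraint: the implausible token "zxq" wins,
# because the amateur dislikes it even more than the expert does.
raw = {t: math.log(p_expert[t]) - math.log(p_amateur[t]) for t in p_expert}
raw_choice = max(raw, key=raw.get)

# With the plausibility constraint, "zxq" is removed before contrasting.
scores = contrastive_scores(p_expert, p_amateur, alpha=0.1)
cd_choice = max(scores, key=scores.get)

print(raw_choice, cd_choice)
```

In this toy case the unconstrained log-ratio selects the nonsense token, while the constrained version selects the token the expert favors over the amateur among plausible candidates.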
A 2023 reformulation by Sean O'Brien and Mike Lewis (UC San Diego and Meta AI) recast the method at the level of logits, which is how most implementations now work. The combined score for a token i is
s_CD(i) = (1 + beta) * s_expert(i) - beta * s_amateur(i)
for tokens that pass the alpha mask, and negative infinity otherwise, where s denotes a model's logits. The strength hyperparameter beta controls how aggressively the amateur is subtracted; beta = 0 recovers the expert alone. Their default settings were alpha = 0.1 and beta = 0.5. [2]
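A minimal sketch of the logit-level formula, assuming both models share one vocabulary and that the alpha mask is computed from the expert's softmax probabilities as described above. The helper names and the three-token logits are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cd_logits(expert_logits, amateur_logits, alpha=0.1, beta=0.5):
    """(1 + beta) * s_expert - beta * s_amateur for tokens passing the
    alpha mask; negative infinity for every other token."""
    probs = softmax(expert_logits)
    cutoff = alpha * max(probs)
    combined = []
    for s_e, s_a, p in zip(expert_logits, amateur_logits, probs):
        if p >= cutoff:
            combined.append((1 + beta) * s_e - beta * s_a)
        else:
            combined.append(-math.inf)
    return combined

# Hypothetical logits over a three-token vocabulary (illustrative only).
expert_logits = [2.0, 1.5, -3.0]
amateur_logits = [2.5, 0.5, -3.0]

scores = cd_logits(expert_logits, amateur_logits, alpha=0.1, beta=0.5)
expert_only = cd_logits(expert_logits, amateur_logits, alpha=0.1, beta=0.0)
print(scores)
print(expert_only)
```

Here the expert alone would pick token 0, but the amateur is even more confident in that token, so the contrast shifts the choice to token 1. Token 2 falls below the alpha cutoff and is masked out. Setting beta to 0 returns the expert's logits on the surviving tokens.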
The reasoning study by O'Brien and Lewis paired a LLaMA 65B expert with a 1.5-billion-parameter LLaMA amateur trained on the same data, and found that contrastive decoding improves not just stylistic quality but task accuracy. With CD applied on top of chain-of-thought prompting, LLaMA 65B reached 57.7 on the GSM8K grade-school math benchmark and 88.0 on HellaSwag commonsense reasoning, in both cases surpassing substantially larger or more expensively trained systems. The method raised GSM8K accuracy by up to roughly 8 absolute points across LLaMA model sizes. [2]
| System | GSM8K accuracy (%) | HellaSwag accuracy (%) |
|---|---|---|
| LLaMA 65B with contrastive decoding | 57.7 | 88.0 |
| GPT-3.5 | 57.1 | 85.5 |
| LLaMA 2 | 56.8 | 85.3 |
| PaLM 540B (GSM8K) / PaLM 2-Large (HellaSwag) | 56.5 | 86.8 |
The gains were not universal. The authors report that CD helps most on multi-step reasoning expressed through generated chains of thought, where the amateur's tendency toward shortcuts and copying is what gets suppressed. It gave little benefit on the harder MATH dataset, could hurt some commonsense tasks for smaller experts, and slightly degraded performance on tasks that depend on raw factual recall, such as parts of TriviaQA and OpenBookQA, because down-weighting the amateur can also down-weight memorized facts that both models know. CD also provided no consistent improvement when chain-of-thought prompting was absent. [2]
DoLa, short for "Decoding by Contrasting Layers," adapts the contrastive idea to operate inside a single model rather than across two models. Introduced by Yung-Sung Chuang and colleagues at MIT and Microsoft (posted September 2023, published at ICLR 2024), it targets factuality and hallucination. [3] The method builds on the observation that a transformer refines its predictions layer by layer, with lower-level linguistic information tending to be resolved in earlier layers and factual knowledge emerging in later ones. DoLa treats the final layer as a "mature" distribution and an earlier layer as a "premature" one, projecting each layer's hidden state through the model's output head (an early-exit or logit-lens read-out) to obtain a next-token distribution per layer. The decoding score is the log-ratio between the mature and premature distributions, which amplifies the knowledge that emerges in the upper layers and suppresses the more superficial patterns already present lower down.
Rather than fixing which earlier layer to use, DoLa selects the premature layer dynamically at each step, choosing from a candidate set the layer whose distribution is most divergent from the final layer, measured by Jensen-Shannon divergence. It applies the same style of adaptive plausibility constraint as contrastive decoding, plus a repetition penalty. On the TruthfulQA benchmark, DoLa improved the truthfulness of LLaMA models by roughly 12 to 17 absolute points, and it also helped on factual reasoning tasks such as FACTOR and StrategyQA, all without retrieval or fine-tuning. [3]
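The dynamic layer selection and the resulting score can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes the per-layer next-token distributions have already been read out, omits the repetition penalty, and uses hypothetical layer indices and probabilities.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) for strictly positive q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence between two distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def dola_scores(final_probs, candidate_layer_probs, alpha=0.1):
    """Pick the candidate layer most divergent from the final layer, then
    score tokens by log(final) - log(premature) under the plausibility mask."""
    premature = max(
        candidate_layer_probs,
        key=lambda layer: jsd(final_probs, candidate_layer_probs[layer]),
    )
    early = candidate_layer_probs[premature]
    cutoff = alpha * max(final_probs)
    scores = [
        math.log(p) - math.log(q) if p >= cutoff else -math.inf
        for p, q in zip(final_probs, early)
    ]
    return premature, scores

# Hypothetical read-outs over a four-token vocabulary (illustrative only).
final_probs = [0.70, 0.20, 0.09, 0.01]
candidate_layer_probs = {
    8: [0.40, 0.35, 0.20, 0.05],   # hypothetical early layer
    16: [0.65, 0.22, 0.11, 0.02],  # hypothetical middle layer
}

premature, scores = dola_scores(final_probs, candidate_layer_probs)
print(premature, scores)
```

In this toy case layer 8 is selected because it diverges most from the final layer, the last token is masked out for falling below the cutoff, and the first token receives the highest score.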
Contrastive decoding belongs to a broader family of inference-time techniques that adjust a model's output distribution by combining it with another distribution, a product-of-experts style of reasoning in which the amateur acts as a negative expert.
Classifier-free guidance (CFG) is the closest relative. Originally a diffusion-model technique, CFG was adapted to language models by Guillaume Sanchez and colleagues in 2023. [5] It uses the same logit-arithmetic shape as CD: (1 + gamma) times the conditional logits minus gamma times an unconditional or weakly conditioned term. The difference is what plays the role of the subtracted distribution: CFG contrasts the same model prompted versus unprompted, whereas CD contrasts a strong model against a separate weak model. If the amateur in CD were replaced by a less-conditioned copy of the expert, the two methods would essentially coincide, apart from CD's plausibility mask.
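The shared arithmetic can be made concrete with a short sketch. The function name `guided_logits` and the logits are hypothetical, the strength follows the (1 + gamma) form used in this article (published formulations may parameterize it differently), and no plausibility mask is applied.

```python
def guided_logits(strong, weak, strength):
    """Return (1 + strength) * strong - strength * weak, token by token."""
    return [(1 + strength) * s - strength * w for s, w in zip(strong, weak)]

# Hypothetical logits over a three-token vocabulary (illustrative only).
with_prompt = [1.0, 2.0, 0.5]     # same model, conditioned on the prompt
without_prompt = [0.0, 3.0, 0.5]  # same model, prompt removed

# CFG-style use: subtract the model's own unconditioned prediction.
cfg = guided_logits(with_prompt, without_prompt, strength=1.0)
print(cfg)
```

Token 1 is the most likely with the prompt, but it is even more likely without it, so the guided logits favor token 0, the token whose score the prompt actually raised. Passing expert and amateur logits to the same function would give the unmasked contrastive decoding score.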
Speculative decoding also pairs a large model with a small one, but with the opposite goal and effect. Speculative decoding uses the small model to draft tokens that the large model then verifies, accelerating inference while provably preserving the large model's exact output distribution. [6] Contrastive decoding deliberately changes the output distribution to improve it, and it adds cost rather than saving it, since both models must score every step. The shared two-model setup nonetheless invites combination: Speculative Contrastive Decoding, proposed by Hongyi Yuan and colleagues in 2023, reuses a single small model both to draft tokens for speed and to provide the amateur signal for quality, getting acceleration and improved generation from the same auxiliary model. [8]
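Why speculative decoding leaves the output distribution unchanged can be checked numerically for a single token position. The sketch below assumes the standard accept-or-resample rule from the speculative sampling literature: a drafted token x is accepted with probability min(1, p(x)/q(x)), and otherwise a token is resampled from the normalized positive part of p - q. The function name and the distributions are hypothetical.

```python
def speculative_output_distribution(p_large, q_draft):
    """Exact distribution of the emitted token at one position under the
    accept-or-resample rule, computed analytically rather than by sampling."""
    # Probability that token x is drafted and accepted: q(x) * min(1, p/q) = min(p, q).
    accepted = [min(p, q) for p, q in zip(p_large, q_draft)]
    reject_mass = 1.0 - sum(accepted)
    # On rejection, resample from the normalized positive part of p - q.
    residual = [max(0.0, p - q) for p, q in zip(p_large, q_draft)]
    total = sum(residual)
    if total > 0:
        resampled = [reject_mass * r / total for r in residual]
    else:
        resampled = [0.0] * len(p_large)
    return [a + r for a, r in zip(accepted, resampled)]

# Hypothetical large-model and draft-model distributions (illustrative only).
p_large = [0.6, 0.3, 0.1]
q_draft = [0.2, 0.5, 0.3]

out = speculative_output_distribution(p_large, q_draft)
print(out)
```

The result matches the large model's distribution up to floating-point error even though the draft model disagrees with it, which is the sense in which the output is preserved. Contrastive decoding, by contrast, is designed to produce a different distribution from the expert's.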
| Method | Auxiliary signal | Goal | Effect on output |
|---|---|---|---|
| Contrastive decoding | Separate small amateur model | Higher quality | Changes output |
| DoLa | Earlier layer of the same model | Less hallucination | Changes output |
| Classifier-free guidance | Same model, weaker conditioning | Stronger prompt adherence | Changes output |
| Speculative decoding | Separate small draft model | Faster inference | Preserves output exactly |
Contrastive decoding has clear boundaries. It requires access to a second model whose vocabulary and tokenizer align with the expert's, and running that amateur adds latency and memory, even though the amateur is small. Output quality is sensitive to the choice of amateur and to the hyperparameters alpha and beta: too weak a contrast does little, while too strong a contrast can push generation toward odd tokens despite the plausibility filter. As the reasoning study showed, the method is not uniformly beneficial. It can reduce accuracy on tasks that hinge on factual recall, because suppressing the amateur also suppresses facts both models have memorized, and its reasoning gains depend on chain-of-thought style generation. [1][2] DoLa avoids the need for a second model but introduces its own assumption, since contrasting layers presumes that later layers are reliably more factual than earlier ones, which does not hold for every model or every token. [3] More broadly, contrastive decoding reshapes a model's existing distribution rather than adding new knowledge, so it cannot fix errors the expert has no signal about.