Contrastive decoding

Updated Jul 13, 2026

Overview

Contrastive decoding (CD) is a decoding strategy for text generation from a large language model that selects tokens by contrasting two models of different sizes. It computes the difference between the log-probabilities of a large, capable "expert" model and a small, weaker "amateur" model, and prefers tokens that the expert favors much more than the amateur does. The intuition is that systematic failure modes of language models, such as repetition, genericness, and incoherence, are more pronounced in the smaller model, so subtracting the amateur's predictions cancels out those shared errors while preserving the knowledge that is unique to the expert. A plausibility filter prevents the contrast from elevating tokens the expert considers unlikely. The method was introduced in "Contrastive Decoding: Open-ended Text Generation as Optimization" by Xiang Lisa Li and collaborators at Stanford University, Meta AI, the University of Washington, Carnegie Mellon University, and Johns Hopkins University, first posted in October 2022 and published at ACL 2023. ^[1]^[7]

Contrastive decoding requires no additional training and adds only the cost of running a small second model. It began as a way to improve open-ended generation, but a 2023 follow-up showed it also raises accuracy on reasoning benchmarks, and a related technique, DoLa, applies the same contrastive idea within a single model by contrasting its layers to reduce hallucination. ^[2]^[3]

Motivation

Decoding from a language model means turning its next-token probability distributions into an actual sequence of tokens, and the choice of method has a large effect on quality. Maximizing probability through greedy decoding or beam search is a poor objective for open-ended generation: it tends to produce short, dull, and repetitive text that loops on itself. This degeneration was documented by Ari Holtzman and colleagues in 2019, the same work that introduced nucleus sampling, and Holtzman is also a co-author of contrastive decoding. ^[4] Stochastic methods such as nucleus (top-p) and top-k sampling restore diversity, but because they sample from the tail of the distribution they can pick low-probability tokens that cause topic drift, factual errors, and incoherence. Open-ended decoding therefore faces a tension between coherence and diversity that neither extreme resolves well. ^[1]

Contrastive decoding attacks this from a different angle, using a second model as a foil instead of reweighting one model's distribution by a fixed rule. Because small language models exhibit the same undesirable tendencies as large ones, only more strongly, the amateur serves as a model of what to avoid. Tokens where the two models agree are usually either trivially correct or shared mistakes, whereas tokens where the expert is far more confident than the amateur tend to reflect the expert's superior fluency and world knowledge. Scoring tokens on this gap keeps generation coherent without collapsing into repetition. ^[1]

How contrastive decoding works

The original paper frames decoding as optimization: search for the output string that maximizes a contrastive objective, defined as the difference between the sequence log-likelihood under the expert and under the amateur. For a candidate token $w$ in context $c$, the per-token score is

$$\text{score}(w; c) = \log p_{\text{expert}}(w \mid c) - \log p_{\text{amateur}}(w \mid c)$$

which is the log of the ratio between the two models' probabilities. Summed over a sequence, maximizing this quantity favors continuations the expert finds far more likely than the amateur. ^[1]
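
Written out, the sequence-level objective decomposes by the chain rule into a sum of the per-token scores. The notation here is introduced only for illustration: $x_1, \dots, x_n$ are the generated tokens and $x_{<i}$ stands for everything before position $i$, prompt included.

```latex
\begin{aligned}
\mathcal{L}_{\text{CD}}(x_{1:n})
  &= \log p_{\text{expert}}(x_{1:n} \mid \text{prompt})
   - \log p_{\text{amateur}}(x_{1:n} \mid \text{prompt}) \\
  &= \sum_{i=1}^{n} \left[ \log p_{\text{expert}}(x_i \mid x_{<i})
   - \log p_{\text{amateur}}(x_i \mid x_{<i}) \right] \\
  &= \sum_{i=1}^{n} \text{score}(x_i; x_{<i})
\end{aligned}
```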

Used alone, this log-ratio misbehaves. The amateur is often right about the easy, high-probability next token (for example, completing a common phrase), and the raw difference would penalize exactly those correct choices. It can also reward bizarre tokens that both models consider unlikely but the amateur dislikes slightly more. To prevent this, CD applies an adaptive plausibility constraint that first restricts the candidate set to tokens the expert already finds credible. Only tokens whose expert probability is at least a fraction $\alpha$ of the most probable token's probability remain eligible:

$$V_{\text{valid}}(c) = \left\{ w : p_{\text{expert}}(w \mid c) \ge \alpha \max_{w'} p_{\text{expert}}(w' \mid c) \right\}$$

The contrast is then applied only within this set. The default $\alpha$ is 0.1, meaning any token whose expert probability is less than one-tenth of the top token's expert probability is discarded before contrasting. This keeps the expert in control of plausibility, while the amateur is used only to rerank options the expert already considers reasonable. In the original experiments the expert and amateur came from the same family, such as OPT 13B contrasted with OPT 125M, or GPT-2 1.5B contrasted with a smaller GPT-2. Across Wikipedia, news, and story domains, CD significantly outperformed nucleus and top-k sampling on automatic metrics and human evaluations of fluency, coherence, and diversity. ^[1]
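
The two pieces described above, the plausibility filter and the log-probability contrast, can be sketched for a single greedy decoding step. This is a minimal illustration in plain Python, not the authors' implementation: the function name `contrastive_step`, the four-token vocabulary, and all probabilities are invented for the example, and a real system would read the two distributions from the expert and amateur models.

```python
import math

def contrastive_step(expert_probs, amateur_probs, alpha=0.1):
    """Choose one token from toy next-token distributions (dicts token -> prob)."""
    # Adaptive plausibility constraint: keep tokens whose expert probability
    # is at least alpha times the expert's highest probability.
    cutoff = alpha * max(expert_probs.values())
    valid = [tok for tok, p in expert_probs.items() if p >= cutoff]

    # Contrastive score, computed only inside the valid set:
    # log p_expert(w | c) - log p_amateur(w | c).
    scores = {
        tok: math.log(expert_probs[tok]) - math.log(amateur_probs[tok])
        for tok in valid
    }
    best = max(scores, key=scores.get)
    return best, scores

# Invented probabilities; each distribution sums to 1.
expert = {"Honolulu": 0.60, "Hawaii": 0.25, "the": 0.1499, "zxq": 0.0001}
amateur = {"Honolulu": 0.20, "Hawaii": 0.45, "the": 0.34999, "zxq": 0.00001}

with_filter, _ = contrastive_step(expert, amateur)                # "Honolulu"
without_filter, _ = contrastive_step(expert, amateur, alpha=0.0)  # "zxq"
```

With the default `alpha` the sketch picks "Honolulu". Setting `alpha` to 0 disables the filter, and the raw log-ratio instead picks "zxq", a token the expert assigns a probability of only 0.0001 but the amateur rates ten times lower still.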

A 2023 reformulation by Sean O'Brien and Mike Lewis (UC San Diego and Meta AI) recast the method at the level of logits, which is how most implementations now work. The combined score for a token $i$ is

$$s_{\text{CD}}(i) = (1 + \beta) s_{\text{expert}}(i) - \beta s_{\text{amateur}}(i)$$

for tokens that pass the $\alpha$ mask, and $-\infty$ otherwise, where $s$ denotes a model's logits. The strength hyperparameter $\beta$ controls how aggressively the amateur is subtracted; at $\beta = 0$ the score reduces to the expert's own logits, restricted to the tokens that pass the mask. Their default settings were $\alpha = 0.1$ and $\beta = 0.5$. ^[2]
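
A minimal sketch of this logit-level rule follows, again in plain Python and under stated assumptions: `cd_logits` is a hypothetical helper, the logits are made-up numbers over a three-token vocabulary, and both models are assumed to share that vocabulary. The mask is applied directly on logits, which is equivalent to the probability form because $p_i \ge \alpha \, p_{\max}$ holds exactly when $s_i \ge s_{\max} + \log \alpha$.

```python
import math

def cd_logits(expert_logits, amateur_logits, alpha=0.1, beta=0.5):
    """Return contrastive scores for two equal-length lists of logits."""
    # Alpha mask on the expert, expressed in logit space (requires alpha > 0).
    cutoff = max(expert_logits) + math.log(alpha)
    return [
        (1 + beta) * s_e - beta * s_a if s_e >= cutoff else -math.inf
        for s_e, s_a in zip(expert_logits, amateur_logits)
    ]

# Made-up logits for a three-token vocabulary.
expert_logits = [2.0, 1.5, -3.0]
amateur_logits = [2.5, 0.5, -4.0]

scores = cd_logits(expert_logits, amateur_logits)
# scores == [1.75, 2.0, -inf]: the third token fails the mask, and the
# second token overtakes the first because the amateur likes it far less.
expert_only = cd_logits(expert_logits, amateur_logits, beta=0.0)
# expert_only == [2.0, 1.5, -inf]: the expert's logits, masked.
```

In this toy case the expert alone would pick the first token, while the contrast moves the choice to the second. The scores can be passed through a softmax for sampling or used directly for greedy selection.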

Effects on reasoning and factuality

The reasoning study by O'Brien and Lewis paired a LLaMA 65B expert with a 1.5-billion-parameter LLaMA amateur trained on the same data, and found that contrastive decoding improves not just stylistic quality but task accuracy. With CD applied on top of chain-of-thought prompting, LLaMA 65B reached 57.7 on the GSM8K grade-school math benchmark and 88.0 on HellaSwag commonsense reasoning, in both cases surpassing substantially larger or more expensively trained systems. The method raised GSM8K accuracy by up to roughly 8 absolute points across LLaMA model sizes. ^[2]

System	GSM8K accuracy (%)	HellaSwag accuracy (%)
LLaMA 65B with contrastive decoding	57.7	88.0
GPT-3.5	57.1	85.5
LLaMA 2	56.8	85.3
PaLM 540B (GSM8K) / PaLM 2-Large (HellaSwag)	56.5	86.8

The gains were not universal. The authors report that CD helps most on multi-step reasoning expressed through generated chains of thought, where the amateur's tendency toward shortcuts and copying is what gets suppressed. It gave little benefit on the harder MATH dataset, could hurt some commonsense tasks for smaller experts, and slightly degraded performance on tasks that depend on raw factual recall, such as parts of TriviaQA and OpenBookQA, because down-weighting the amateur can also down-weight memorized facts that both models know. CD also provided no consistent improvement when chain-of-thought prompting was absent. ^[2]

DoLa (contrasting layers)

DoLa, short for "Decoding by Contrasting Layers," adapts the contrastive idea to operate inside a single model rather than across two models. Introduced by Yung-Sung Chuang and colleagues at MIT and Microsoft (posted September 2023, published at ICLR 2024), it targets factuality and hallucination. ^[3] The method builds on the observation that a transformer refines its predictions layer by layer, with low-level linguistic information tending to settle in earlier layers and factual knowledge emerging in later ones. DoLa treats the final layer as a "mature" distribution and an earlier layer as a "premature" one, projecting each layer's hidden state through the model's output head (an early-exit or logit-lens read-out) to obtain a next-token distribution per layer. The decoding score is the log-ratio between the mature and premature distributions, which amplifies the knowledge that emerges in the upper layers and suppresses the more superficial patterns already present lower down.

Rather than fixing which earlier layer to use, DoLa selects the premature layer dynamically at each step, choosing from a candidate set the layer whose distribution is most divergent from the final layer, measured by Jensen-Shannon divergence. It applies the same style of adaptive plausibility constraint as contrastive decoding, plus a repetition penalty. On the TruthfulQA benchmark, DoLa improved the truthfulness of LLaMA models by roughly 12 to 17 absolute points, and it also helped on factual reasoning tasks such as FACTOR and StrategyQA, all without retrieval or fine-tuning. ^[3]
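
The layer-selection and contrast steps can be sketched as follows. This is a simplified illustration, not the DoLa implementation: it starts from invented per-layer next-token distributions instead of projecting hidden states through an output head, it omits the repetition penalty, and the names `js_divergence` and `dola_step` are hypothetical. All toy probabilities are assumed to be strictly positive.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence between two probability lists (natural log)."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        return sum(a * math.log(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def dola_step(layer_probs, candidate_layers, alpha=0.1):
    """layer_probs[-1] is the final (mature) layer; candidates index earlier layers."""
    mature = layer_probs[-1]
    # Dynamic selection: the candidate layer most divergent from the final layer.
    premature_idx = max(
        candidate_layers, key=lambda j: js_divergence(mature, layer_probs[j])
    )
    premature = layer_probs[premature_idx]
    # Same style of plausibility constraint as contrastive decoding,
    # applied to the mature distribution.
    cutoff = alpha * max(mature)
    scores = [
        math.log(pm) - math.log(pp) if pm >= cutoff else -math.inf
        for pm, pp in zip(mature, premature)
    ]
    return premature_idx, scores

# Invented distributions over a three-token vocabulary.
layer_probs = [
    [0.5, 0.3, 0.2],  # early layer 0
    [0.4, 0.4, 0.2],  # early layer 1
    [0.3, 0.6, 0.1],  # final layer
]
chosen_layer, scores = dola_step(layer_probs, candidate_layers=[0, 1])
```

Here layer 0 is selected because it diverges most from the final layer, and the second token receives the highest score since its probability doubles between that layer and the output.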

Relationship to other methods

Contrastive decoding belongs to a broader family of inference-time techniques that adjust a model's output distribution by combining it with another distribution, a product-of-experts style of reasoning in which the amateur acts as a negative expert.

Classifier-free guidance (CFG) is the closest relative. Originally a diffusion model technique, CFG was adapted to language models by Guillaume Sanchez and colleagues in 2023. ^[5] It uses the same logit-arithmetic shape as CD: $(1 + \gamma)$ times the conditional logits minus $\gamma$ times an unconditional or weakly conditioned term. The difference is what plays the role of the subtracted distribution: CFG contrasts the same model prompted versus unprompted, whereas CD contrasts a strong model against a separate weak model. If the amateur in CD were replaced by a less-conditioned copy of the expert, the logit arithmetic of the two methods would coincide.
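
The shared shape can be made concrete with one small function that serves both methods, differing only in what is passed as the weaker signal. This is an illustrative sketch using the parametrization given in the paragraph above; `guided_logits` and the example logits are hypothetical, and the sketch leaves out the $\alpha$ plausibility mask that contrastive decoding applies on top.

```python
def guided_logits(strong, weak, weight):
    """Elementwise (1 + weight) * strong - weight * weak."""
    return [(1 + weight) * s - weight * w for s, w in zip(strong, weak)]

# Made-up logits over a two-token vocabulary.
conditional = [2.0, 1.5]    # same model, with the prompt
unconditional = [2.5, 0.5]  # same model, prompt removed or weakened
expert = [2.0, 1.5]         # large model
amateur = [2.5, 0.5]        # separate small model

cfg = guided_logits(conditional, unconditional, weight=0.5)  # gamma = 0.5
cd = guided_logits(expert, amateur, weight=0.5)              # beta = 0.5
# With identical inputs the two calls give identical outputs: [1.75, 2.0].
```

The arithmetic is the same in both calls; what changes between the methods is where the second list of logits comes from.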

Speculative decoding also pairs a large model with a small one, but with the opposite goal and effect. Speculative decoding uses the small model to draft tokens that the large model then verifies, accelerating inference while provably preserving the large model's exact output distribution. ^[6] Contrastive decoding deliberately changes the output distribution to improve it, and it adds cost rather than saving it, since both models must score every step. The shared two-model setup nonetheless invites combination: Speculative Contrastive Decoding, proposed by Hongyi Yuan and colleagues in 2023, reuses a single small model both to draft tokens for speed and to provide the amateur signal for quality, getting acceleration and improved generation from the same auxiliary model. ^[8]

Method	Auxiliary signal	Goal	Effect on output
Contrastive decoding	Separate small amateur model	Higher quality	Changes output
DoLa	Earlier layer of the same model	Less hallucination	Changes output
Classifier-free guidance	Same model, weaker conditioning	Stronger prompt adherence	Changes output
Speculative decoding	Separate small draft model	Faster inference	Preserves output exactly

Limitations

Contrastive decoding has clear boundaries. It requires access to a second model whose vocabulary and tokenizer align with the expert's, and running that amateur adds latency and memory, even though the amateur is small. Output quality is sensitive to the choice of amateur and to the hyperparameters $\alpha$ and $\beta$: too weak a contrast does little, while too strong a contrast can push generation toward odd tokens despite the plausibility filter. As the reasoning study showed, the method is not uniformly beneficial. It can reduce accuracy on tasks that hinge on factual recall, because suppressing the amateur also suppresses facts both models have memorized, and its reasoning gains depend on chain-of-thought style generation. ^[1]^[2] DoLa avoids the need for a second model but introduces its own assumption, since contrasting layers presumes that later layers are reliably more factual than earlier ones, which does not hold for every model or every token. ^[3] More broadly, contrastive decoding reshapes a model's existing distribution rather than adding new knowledge, so it cannot fix errors the expert has no signal about.

References

1. Xiang Lisa Li, Ari Holtzman, Daniel Fried, Percy Liang, Jason Eisner, Tatsunori Hashimoto, Luke Zettlemoyer, Mike Lewis. "Contrastive Decoding: Open-ended Text Generation as Optimization." arXiv:2210.15097, October 2022. https://arxiv.org/abs/2210.15097
2. Sean O'Brien, Mike Lewis. "Contrastive Decoding Improves Reasoning in Large Language Models." arXiv:2309.09117, September 2023. https://arxiv.org/abs/2309.09117
3. Yung-Sung Chuang, Yujia Xie, Hongyin Luo, Yoon Kim, James Glass, Pengcheng He. "DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models." ICLR 2024. arXiv:2309.03883, September 2023. https://arxiv.org/abs/2309.03883
4. Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, Yejin Choi. "The Curious Case of Neural Text Degeneration." ICLR 2020. arXiv:1904.09751, April 2019. https://arxiv.org/abs/1904.09751
5. Guillaume Sanchez, Honglu Fan, Alexander Spangher, Elad Levi, Pawan Sasanka Ammanamanchi, Stella Biderman. "Stay on topic with Classifier-Free Guidance." ICML 2024. arXiv:2306.17806, June 2023. https://arxiv.org/abs/2306.17806
6. Yaniv Leviathan, Matan Kalman, Yossi Matias. "Fast Inference from Transformers via Speculative Decoding." ICML 2023. arXiv:2211.17192, November 2022. https://arxiv.org/abs/2211.17192
7. Xiang Lisa Li et al. "Contrastive Decoding: Open-ended Text Generation as Optimization." ACL 2023 (Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics). https://aclanthology.org/2023.acl-long.687/
8. Hongyi Yuan, Keming Lu, Fei Huang, Zheng Yuan, Chang Zhou. "Speculative Contrastive Decoding." arXiv:2311.08981, November 2023. https://arxiv.org/abs/2311.08981
