Decoding strategies

Updated Jul 11, 2026

Decoding strategies are the algorithms that select output tokens from a language model's next-token probability distribution during text generation. At each step an autoregressive model produces a probability distribution over its vocabulary, and the decoding strategy decides which token to emit. Given the same model and prompt, this choice alone can produce drastically different output. The major families are deterministic methods (greedy decoding, beam search, and contrastive search) and stochastic sampling methods (temperature scaling, top-k sampling, top-p (nucleus) sampling, min-p sampling, and typical sampling). A central finding of the field is that maximization-based decoding, which always chooses the most probable token or sequence, tends to produce text that is bland and repetitive, while unconstrained sampling produces text that is incoherent, so practical decoding navigates a tradeoff between quality and diversity.^[2]^[3]

Decoding strategies are used by large language models (LLMs) and other autoregressive neural networks. Beyond the classic deterministic and sampling methods, more recent work targets either output quality (contrastive search, contrastive decoding) or generation speed (speculative decoding, which accelerates inference without changing the output distribution).^[3]^[4]^[5]

How does autoregressive generation work?

Modern language models based on the Transformer architecture generate text one token at a time, from left to right. At each time step $t$, the model takes the sequence of previously generated tokens (plus any input prompt) and produces a probability distribution $P(x_t \mid x_1, \ldots, x_{t-1})$ over the entire vocabulary $V$.

The decoding strategy is the algorithm that selects $x_t$ from this distribution. Once $x_t$ is selected, it is appended to the sequence, and the process repeats until an end-of-sequence token is generated or a maximum length is reached. Because each token requires its own forward pass and each pass depends on the previously chosen token, generating $K$ tokens conventionally takes $K$ serial runs of the model, a constraint that motivates acceleration methods like speculative decoding.^[5]

The raw model outputs before the softmax function are called logits. Various preprocessing steps (temperature scaling, repetition penalties) may be applied to the logits before the decoding strategy makes its selection.
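As a minimal sketch of the logits-to-probabilities step, the following converts a list of logits into a distribution with a numerically stable softmax. The four logit values are hypothetical and stand in for a real model's output.

```python
import math

def softmax(logits):
    """Convert raw logits into a probability distribution."""
    # Subtracting the largest logit avoids overflow in exp() without
    # changing the resulting probabilities.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a four-token vocabulary.
probs = softmax([2.0, 1.0, 0.1, -1.0])
print([round(p, 3) for p in probs])
```

Any logit preprocessing (temperature, penalties) would be applied to the list before it is passed to `softmax`.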

Greedy decoding

Greedy decoding is the simplest deterministic decoding strategy. At each step, it selects the single token with the highest probability:

$$x_t = \arg\max_{v \in V} P(v \mid x_1, \ldots, x_{t-1})$$

How it works

The model computes probabilities over the vocabulary.
The token with the highest probability is selected.
The selected token is appended to the sequence.
Repeat until done.
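The steps above can be sketched on a toy example. The bigram table `TOY_MODEL` is a hypothetical stand-in for a real language model: it maps the last token to next-token probabilities, which is enough to show the argmax loop.

```python
# Hypothetical bigram "model": last token -> next-token probabilities.
TOY_MODEL = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.3, "</s>": 0.2},
    "a": {"dog": 0.9, "</s>": 0.1},
    "cat": {"</s>": 0.7, "the": 0.3},
    "dog": {"</s>": 0.8, "the": 0.2},
}

def greedy_decode(model, start="<s>", eos="</s>", max_len=10):
    """Append the most probable next token until EOS or max_len."""
    sequence = [start]
    for _ in range(max_len):
        next_probs = model[sequence[-1]]
        best = max(next_probs, key=next_probs.get)  # argmax over the vocabulary
        sequence.append(best)
        if best == eos:
            break
    return sequence

print(greedy_decode(TOY_MODEL))  # ['<s>', 'the', 'cat', '</s>']
```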

Advantages

Extremely fast: requires only one forward pass per token with no additional computation.
Deterministic: produces the same output every time for the same input.
Simple to implement.

Limitations

Repetition: Greedy decoding is highly prone to generating repetitive loops, especially for longer outputs. Once the model enters a repetitive pattern, the most probable next token is often the one that continues the pattern.
Suboptimal sequences: Choosing the locally best token at each step does not guarantee the globally best sequence. A slightly less probable token at step t might lead to a much better sequence overall.
Bland output: Greedy decoding tends to produce generic, "safe" text because it always picks the most likely continuation, avoiding the surprising or creative choices that characterize natural human language. Holtzman et al. found that "using likelihood as a decoding objective leads to text that is bland and strangely repetitive," the phenomenon they named neural text degeneration.^[2]

Greedy decoding is suitable for tasks where determinism and speed are priorities, such as simple classification or structured extraction, but is generally unsuitable for open-ended text generation.

Beam search

Beam search is a more sophisticated deterministic strategy that addresses greedy decoding's tendency to find locally optimal but globally suboptimal sequences. Instead of keeping only the single best token at each step, beam search maintains multiple candidate sequences (called "beams") in parallel.

How it works

At each step, each of the B current beams is extended by every possible next token.
This produces $B \cdot |V|$ candidate sequences (where $|V|$ is the vocabulary size).
The top B candidates (by cumulative log-probability) are kept.
The process repeats until all beams reach an end-of-sequence token or the maximum length.
The beam with the highest overall score is selected as the final output.
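A minimal sketch of this procedure, again on a hypothetical bigram table rather than a real model, shows how a wider beam can recover a sequence that greedy decoding misses. The table is constructed so that the locally best first token ("a") leads to a worse overall sequence than "b".

```python
import math

# Hypothetical bigram "model": last token -> next-token probabilities.
TOY_MODEL = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"c": 0.4, "d": 0.3, "e": 0.3},
    "b": {"c": 0.9, "d": 0.1},
    "c": {"</s>": 1.0},
    "d": {"</s>": 1.0},
    "e": {"</s>": 1.0},
}

def beam_search(model, beam_width, start="<s>", eos="</s>", max_len=10):
    beams = [([start], 0.0)]  # (tokens, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens[-1] == eos:
                candidates.append((tokens, score))  # finished beams carry over
                continue
            for token, prob in model[tokens[-1]].items():
                candidates.append((tokens + [token], score + math.log(prob)))
        # Keep only the B highest-scoring candidates.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if all(tokens[-1] == eos for tokens, _ in beams):
            break
    return beams[0]

greedy_tokens, greedy_score = beam_search(TOY_MODEL, beam_width=1)
beam_tokens, beam_score = beam_search(TOY_MODEL, beam_width=2)
print(greedy_tokens, round(math.exp(greedy_score), 2))  # a, c: probability 0.24
print(beam_tokens, round(math.exp(beam_score), 2))      # b, c: probability 0.36
```

This sketch scores by raw cumulative log-probability; length normalization is deliberately left out.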

Beam width

The beam width $B$ is the key hyperparameter. It controls the trade-off between quality and computational cost.

Beam width	Behavior
$B = 1$	Equivalent to greedy decoding
$B = 2\text{-}5$	Common in machine translation and summarization
$B = 5\text{-}10$	Used when higher quality is needed
$B = \infty$	Equivalent to exhaustive (breadth-first) search; intractable because the number of candidates grows exponentially with sequence length

Increasing the beam width generally improves output quality up to a point, after which returns diminish and computational cost increases linearly.

Length penalty

Beam search has a natural bias toward shorter sequences because log-probabilities are negative and accumulate with each step. Longer sequences have lower (more negative) cumulative scores even if they are of higher quality. Length normalization (or length penalty) addresses this by dividing the cumulative log-probability by a function of the sequence length:

$$\mathrm{score}(Y) = \frac{\log P(Y)}{\mathrm{length}(Y)^\alpha}$$

The parameter $\alpha$ controls the strength of the penalty. When $\alpha = 0$, there is no length penalty. When $\alpha = 1$, the score is normalized by length. A typical value is $\alpha = 0.6$ to $0.8$, which encourages longer, more complete outputs without overly penalizing short ones.
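A small worked example may help. The two candidates below are hypothetical: a 4-token sequence with log-probability -4.0 and a 10-token sequence with log-probability -6.0. Without normalization the short one wins; with normalization the ranking flips.

```python
def length_normalized_score(log_prob, length, alpha):
    """Cumulative log-probability divided by length ** alpha."""
    return log_prob / (length ** alpha)

short = (-4.0, 4)    # (cumulative log-probability, length in tokens)
long_ = (-6.0, 10)

for alpha in (0.0, 0.7, 1.0):
    s = length_normalized_score(*short, alpha=alpha)
    l = length_normalized_score(*long_, alpha=alpha)
    winner = "short" if s > l else "long"
    print(f"alpha={alpha}: short={s:.3f} long={l:.3f} -> {winner}")
```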

Why does beam search produce bland, repetitive text?

Repetition: Like greedy decoding, beam search can produce repetitive text, though n-gram blocking (preventing the generation of repeated n-grams) partially mitigates this.
Lack of diversity: Standard beam search tends to produce beams that are very similar to each other, offering little diversity.
Blandness: Beam search still prefers high-probability sequences, which tend to be generic. Su et al. observe that "maximization-based decoding methods (e.g., beam search) of neural language models often lead to degenerate solutions, the generated text is unnatural and contains undesirable repetitions."^[3]
Computational cost: Every beam must be run through the model at each step (usually as one batch), so beam search costs roughly B times as much compute as greedy decoding.

Diverse beam search

Diverse beam search (DBS), introduced by Vijayakumar et al. (2016), addresses the lack of diversity in standard beam search.^[8] DBS divides the beams into groups and adds a diversity penalty that discourages beams in different groups from being too similar. At each step, a Hamming diversity term penalizes tokens that have been selected by beams in other groups. This produces a set of outputs that are both high-quality and distinct from each other.

Sampling-based methods

Sampling-based decoding methods introduce stochasticity by drawing tokens from the probability distribution rather than always selecting the most likely token. This produces more diverse, natural-sounding text at the cost of some consistency.

Pure (ancestral) sampling

Pure sampling draws the next token directly from the full probability distribution without any filtering. While this produces maximum diversity, it frequently selects very low-probability tokens, leading to incoherent or nonsensical text. The motivation for truncated samplers (top-k, top-p, min-p) is that the unreliable low-probability "tail" of the distribution is the main source of this incoherence.^[2]
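As a minimal sketch, pure sampling is a single weighted draw over the whole vocabulary. The four-token distribution is hypothetical; the last entry plays the role of the low-probability tail, and the seed is fixed only to make the sketch reproducible.

```python
import random

def sample_token(probs, rng=random):
    """Draw one token index from the full distribution, with no truncation."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

rng = random.Random(0)  # seeded only for reproducibility
probs = [0.5, 0.3, 0.15, 0.05]
draws = [sample_token(probs, rng) for _ in range(1000)]

# Share of draws that landed on the least likely token.
tail_share = draws.count(3) / len(draws)
print(f"tail token chosen in {tail_share:.1%} of draws")
```

With a real vocabulary of tens of thousands of tokens, the tail is made up of many such entries, so its combined share of draws is far from negligible.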

Temperature sampling

Temperature scaling modifies the probability distribution before sampling by dividing the logits by a temperature parameter T:

$$P(x_i) = \frac{\exp(\mathrm{logit}_i / T)}{\sum_j \exp(\mathrm{logit}_j / T)}$$

Temperature	Effect
$T < 1$	Sharpens distribution; more deterministic
$T = 1$	Original distribution
$T > 1$	Flattens distribution; more random
$T \to 0$	Converges to greedy decoding
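The effect of $T$ can be checked numerically. This is a minimal sketch with hypothetical logits; it applies the formula above and compares the probability of the top token at three temperatures.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by T, then apply a numerically stable softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; use argmax for the T -> 0 limit")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # hypothetical values
sharp = softmax_with_temperature(logits, 0.5)
base = softmax_with_temperature(logits, 1.0)
flat = softmax_with_temperature(logits, 2.0)
print(round(sharp[0], 3), round(base[0], 3), round(flat[0], 3))
```

The top token's probability rises as $T$ falls below 1 and shrinks as $T$ grows above 1, matching the table.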

Top-k sampling

Top-k sampling restricts sampling to the $k$ most probable tokens. All other tokens have their probabilities set to zero, and the remaining probabilities are renormalized. Top-k was popularized by Fan et al. (2018) for neural story generation, which sampled from a fixed pool of the $k = 10$ most likely candidate tokens at each step; the frequently quoted setting of $k = 40$ comes from later GPT-2 work.^[1]

The limitation of top-k is that the fixed value of k does not adapt to the shape of the distribution. In high-confidence contexts, k may include many irrelevant tokens. In low-confidence contexts, k may exclude plausible alternatives.
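A minimal sketch of the filter, using a hypothetical five-token distribution:

```python
def top_k_filter(probs, k):
    """Zero out all but the k most probable tokens, then renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    kept_mass = sum(probs[i] for i in keep)
    filtered = [0.0] * len(probs)
    for i in keep:
        filtered[i] = probs[i] / kept_mass
    return filtered

probs = [0.5, 0.3, 0.1, 0.06, 0.04]
top2 = top_k_filter(probs, k=2)
print([round(p, 3) for p in top2])  # [0.625, 0.375, 0.0, 0.0, 0.0]
```

Note that `k` stays at 2 whatever the distribution looks like, which is exactly the inflexibility described above.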

Top-p (nucleus) sampling

Top-p sampling (Holtzman et al., 2019) addresses top-k's inflexibility by dynamically adjusting the candidate pool. It keeps the smallest set of tokens whose cumulative probability reaches a threshold p, the "nucleus" of the distribution, and renormalizes over that set. The number of candidates therefore grows when the model is uncertain and shrinks when the model is confident. The method works because, as the authors put it, "the vast majority of probability mass at each time step is concentrated in the nucleus, a small subset of the vocabulary that tends to range between one and a thousand candidates."^[2] The paper concluded that nucleus sampling was the best overall decoding strategy in their human and automatic evaluations.^[2]
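The adaptive behavior can be seen in a minimal sketch. Both distributions are hypothetical: one is peaked (a confident model) and one is nearly flat (an uncertain model).

```python
def top_p_filter(probs, p):
    """Keep the smallest high-probability set whose mass reaches p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, cumulative = [], 0.0
    for i in order:
        nucleus.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    filtered = [0.0] * len(probs)
    for i in nucleus:
        filtered[i] = probs[i] / cumulative
    return filtered

def nucleus_size(filtered):
    return sum(1 for q in filtered if q > 0)

confident = [0.95, 0.03, 0.01, 0.01]
uncertain = [0.3, 0.25, 0.25, 0.2]
print(nucleus_size(top_p_filter(confident, 0.9)))  # 1
print(nucleus_size(top_p_filter(uncertain, 0.9)))  # 4
```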

Min-p sampling

Min-p sampling (Nguyen et al., 2024) sets a dynamic threshold based on the top token's probability. A token is included only if its probability is at least min_p times the probability of the most likely token, so the cutoff scales with model confidence. The original paper reported that human evaluations "show a clear preference for min-p sampling, in both text quality and creativity," especially at higher temperatures, and the work was selected for an Oral presentation at ICLR 2025.^[9] A later critical re-analysis (Schaeffer et al., 2025) disputed these results, arguing that the original human evaluations omitted data and that on reanalysis min-p did not clearly outperform top-p, illustrating that decoding-quality claims remain actively contested.^[12]
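A minimal sketch of the threshold rule, on two hypothetical distributions. It keeps tokens whose probability is at least `min_p` times the top probability; whether the comparison is strict or not is an implementation detail assumed here.

```python
def min_p_filter(probs, min_p):
    """Keep tokens with probability >= min_p * max(probs), then renormalize."""
    threshold = min_p * max(probs)
    kept = [q if q >= threshold else 0.0 for q in probs]
    total = sum(kept)
    return [q / total for q in kept]

confident = [0.95, 0.03, 0.01, 0.01]   # threshold = 0.095, one survivor
uncertain = [0.3, 0.25, 0.25, 0.2]     # threshold = 0.03, all survive

print(min_p_filter(confident, 0.1))
print(min_p_filter(uncertain, 0.1))
```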

Contrastive search

Contrastive search, introduced by Su et al. (2022), is a decoding method designed to overcome the common pitfalls of both greedy/beam search (repetition, blandness) and sampling methods (incoherence).^[3] The paper traces degeneration to a structural cause: "an underlying reason for model degeneration is the anisotropic distribution of token representations."^[3] Contrastive search selects each token by balancing two objectives:

Model confidence: The token should have high probability under the model.
Degeneration penalty: The token's hidden representation should be dissimilar to the hidden representations of previously generated tokens.

The selection criterion is:

$$x_t = \arg\max_v \left[ (1 - \alpha) P(v) - \alpha \max_{x_j} \mathrm{sim}(h_v, h_{x_j}) \right]$$

for all candidate tokens $v$ and all previously generated tokens $x_j$, where $h$ denotes hidden representations, $\mathrm{sim}$ is cosine similarity, and $\alpha$ is a hyperparameter (typically 0.6) that balances the two terms. In practice the candidates $v$ are restricted to the $k$ most probable next tokens.
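The selection criterion can be sketched for a single step. Everything concrete here is an assumption for illustration: the two-dimensional hidden states, the candidate probabilities, and the function name `contrastive_search_step`. In a real system the hidden state of each candidate comes from running the model on the context plus that candidate.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def contrastive_search_step(candidates, previous_states, alpha=0.6):
    """candidates: list of (token, probability, hidden_state) tuples."""
    def score(candidate):
        _, prob, hidden = candidate
        # Degeneration penalty: similarity to the closest earlier token.
        penalty = max(cosine(hidden, h) for h in previous_states)
        return (1 - alpha) * prob - alpha * penalty
    return max(candidates, key=score)[0]

previous_states = [[1.0, 0.0], [0.9, 0.1]]
candidates = [
    ("again", 0.6, [1.0, 0.0]),  # most probable, but identical to an earlier state
    ("new", 0.3, [0.0, 1.0]),    # less probable, but points in a fresh direction
]
print(contrastive_search_step(candidates, previous_states))             # new
print(contrastive_search_step(candidates, previous_states, alpha=0.0))  # again
```

Setting `alpha=0.0` removes the penalty and reduces the step to greedy selection among the candidates.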

How does contrastive search prevent degeneration?

Repetitive text is characterized by hidden representations that cluster tightly together. By penalizing new tokens whose representations are similar to previous ones, contrastive search naturally steers away from repetition while staying close to the model's confident predictions. Empirical evaluations show that contrastive search produces text that is more coherent than sampling methods and more diverse than beam search, achieving a strong balance between the two.^[3]

Property	Greedy	Beam search	Top-p sampling	Contrastive search
Deterministic	Yes	Yes	No	Yes
Repetition risk	High	Moderate	Low	Low
Diversity	Very low	Low	High	Moderate-high
Coherence	High (short text)	High	Variable	High
Computational cost	Very low	B times greedy	Low	Moderate

Contrastive decoding

Contrastive decoding, introduced by Li et al. (2023), is a related but distinct method from contrastive search.^[4] It uses two models: a strong "expert" model and a weaker "amateur" model. The next token is chosen to maximize the difference between the expert's and amateur's log-probabilities:

$$x_t = \arg\max_v \left[ \log P_{\text{expert}}(v) - \log P_{\text{amateur}}(v) \right]$$

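The intuition is that tokens where the expert strongly outperforms the amateur are more likely to be high-quality, while tokens where both models agree are more likely to be generic. In the original formulation the maximization is restricted to tokens that the expert itself considers plausible, which prevents very rare tokens from winning merely because the amateur rates them even lower.^[4] Contrastive decoding thus encourages text that captures the expert model's unique capabilities.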
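A single-step sketch of the objective follows. The two distributions and the function name are hypothetical. The `plausibility` cutoff is an illustration of the constraint used by Li et al., which only considers tokens whose expert probability is within a fixed fraction of the expert's top probability; the value 0.1 is an assumed setting.

```python
import math

def contrastive_decoding_step(expert, amateur, plausibility=0.1):
    """Pick the plausible token with the largest expert-minus-amateur log-probability."""
    cutoff = plausibility * max(expert.values())
    plausible = [tok for tok, p in expert.items() if p >= cutoff]
    return max(plausible, key=lambda tok: math.log(expert[tok]) - math.log(amateur[tok]))

# Hypothetical next-token distributions from the two models.
expert = {"the": 0.40, "Paris": 0.35, "banana": 0.001, "a": 0.249}
amateur = {"the": 0.50, "Paris": 0.05, "banana": 0.0001, "a": 0.4499}

print(contrastive_decoding_step(expert, amateur))                    # Paris
print(contrastive_decoding_step(expert, amateur, plausibility=0.0))  # banana
```

With the cutoff disabled, the very unlikely token "banana" wins because its probability ratio is the largest, which is the failure mode the plausibility constraint guards against.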

Speculative decoding

Speculative decoding is a technique for accelerating inference without changing the output distribution. It addresses a fundamental bottleneck in autoregressive generation, which Leviathan et al. summarize as the fact that "decoding K tokens takes K serial runs of the model": each token requires a full forward pass, and these passes must happen sequentially because each depends on the previous token.^[5]

How does speculative decoding work?

A small, fast draft model generates several candidate tokens in quick succession.
The large target model verifies all candidate tokens in a single forward pass (which can process multiple positions in parallel).
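Drafted tokens are checked in order: each is accepted with a probability that reflects how well the target model's distribution agrees with the draft model's. The first rejected token (if any) is replaced by a sample from a corrected distribution derived from the target and draft probabilities, and the remaining drafted tokens are discarded. This correction is what keeps the output distribution identical to that of the target model.^[5]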

Because the verification step processes multiple tokens in parallel, speculative decoding can generate multiple tokens per forward pass of the large model. Leviathan et al. demonstrated the method on T5-XXL and reported a 2X-3X acceleration compared to the standard T5X implementation with identical outputs, meaning the output distribution, and therefore the output quality, is unchanged.^[5]^[6]
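The accept-or-resample rule can be sketched as follows. This is a simplified illustration of the speculative sampling rule described in the cited papers, not a full implementation: the per-position distributions are passed in as plain dictionaries (a hypothetical interface), and the extra token that the target model contributes when every draft is accepted is omitted.

```python
import random

def speculative_accept(draft_tokens, draft_probs, target_probs, rng=random):
    """draft_probs[i] and target_probs[i] map token -> probability at position i."""
    output = []
    for i, token in enumerate(draft_tokens):
        q = draft_probs[i][token]              # draft model's probability
        p = target_probs[i].get(token, 0.0)    # target model's probability
        if rng.random() < min(1.0, p / q):
            output.append(token)               # accepted: keep the drafted token
            continue
        # Rejected: resample from the residual max(0, p - q), renormalized
        # implicitly by the weighted draw, then stop.
        residual = {t: max(0.0, pt - draft_probs[i].get(t, 0.0))
                    for t, pt in target_probs[i].items()}
        tokens, weights = zip(*residual.items())
        output.append(rng.choices(tokens, weights=weights, k=1)[0])
        break
    return output

same = [{"x": 0.5, "y": 0.5}] * 2
print(speculative_accept(["x", "y"], same, same))                        # ['x', 'y']
print(speculative_accept(["a", "a"], [{"a": 1.0}] * 2, [{"b": 1.0}] * 2))  # ['b']
```

When the two models agree exactly, every drafted token is accepted; when the target assigns zero probability to a draft, it is always rejected and replaced.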

Variants of speculative decoding

Variant	Draft mechanism	Key feature
Standard speculative decoding (Leviathan et al., 2023; Chen et al., 2023)	Separate small model	Original formulation; mathematically preserves output distribution
Medusa (Cai et al., 2024)	Lightweight heads on target model	No separate draft model needed; parameter-efficient
EAGLE (Li et al., 2024)	Feature-level prediction	Uses hidden states rather than token probabilities for drafting
Self-speculative decoding	Target model with layer skipping	Reuses the target model itself for drafting
Lookahead decoding	N-gram cache from generation history	No separate model; uses observed patterns

Medusa

Medusa, introduced by Cai et al. (2024), avoids the need for a separate draft model by attaching multiple lightweight prediction heads to the target model.^[7] Each head predicts a token at a different future position. During inference, these heads produce candidate token sequences, which are then verified using a tree-based attention mechanism. The paper reports that Medusa-1 (heads fine-tuned on a frozen backbone) achieves over 2.2x speedup without compromising generation quality, while Medusa-2 (heads fine-tuned together with the backbone) reaches 2.3x to 3.6x.^[7]

Comparison of decoding strategies

The following table provides a comprehensive comparison of the major decoding strategies.

Strategy	Type	Quality	Diversity	Speed	Repetition handling	Best for
Greedy	Deterministic	Moderate	None	Fastest	Poor	Simple tasks, classification
Beam search	Deterministic	High	Very low	Moderate	Moderate (with n-gram blocking)	Translation, summarization
Diverse beam search	Deterministic	High	Moderate	Moderate	Good	Generating multiple distinct outputs
Top-k sampling	Stochastic	Good	High	Fast	Good	Creative text generation
Top-p (nucleus) sampling	Stochastic	Good	High	Fast	Good	General-purpose generation
Min-p sampling	Stochastic	Very good	High	Fast	Good	Creative generation at high temperature
Contrastive search	Hybrid	Very high	Moderate-high	Moderate	Very good	Open-ended generation
Contrastive decoding	Hybrid	Very high	Moderate	Slow (two models)	Very good	High-quality generation
Speculative decoding	Acceleration	Same as target model	Same as target model	2-3x faster	Same as target model	Inference acceleration

How do LLMs actually generate text in practice?

In deployed systems, text generation involves several layers of processing beyond the basic decoding strategy.

Preprocessing pipeline

Before the decoding strategy selects a token, the raw logits undergo several transformations:

Repetition penalty: Logits of previously generated tokens are reduced.
Frequency/presence penalty: Additional penalties based on token occurrence counts.
Temperature scaling: Logits are divided by the temperature parameter.
Top-k filtering: Only the top k logits are kept.
Top-p filtering: Tokens outside the smallest set whose cumulative probability reaches p are removed.
Min-p filtering: Tokens below the dynamic threshold are removed.
Renormalization: Remaining probabilities are rescaled to sum to 1.
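The penalty stages at the start of this pipeline can be sketched as one function. The formulas are assumptions based on common practice: a multiplicative repetition penalty that shrinks positive logits and pushes negative ones lower, and additive frequency and presence penalties. Exact formulas and ordering vary between libraries, and `apply_penalties` is a hypothetical name.

```python
from collections import Counter

def apply_penalties(logits, generated_ids, repetition_penalty=1.0,
                    frequency_penalty=0.0, presence_penalty=0.0):
    """Return a copy of logits with penalties applied to already generated tokens."""
    counts = Counter(generated_ids)
    adjusted = list(logits)
    for token_id, count in counts.items():
        value = adjusted[token_id]
        # Multiplicative penalty: always moves the logit toward "less likely".
        value = value / repetition_penalty if value > 0 else value * repetition_penalty
        # Additive penalties: one grows with the count, one is a flat charge.
        value -= frequency_penalty * count + presence_penalty
        adjusted[token_id] = value
    return adjusted

logits = [2.0, -1.0, 0.5]   # hypothetical three-token vocabulary
history = [0, 0, 1]         # token 0 generated twice, token 1 once
print(apply_penalties(logits, history, repetition_penalty=2.0,
                      frequency_penalty=0.1, presence_penalty=0.5))
```

The adjusted logits would then flow through temperature scaling and the truncation filters before the final draw.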

Stopping conditions

Generation stops when any of the following conditions are met:

An end-of-sequence (EOS) token is generated.
The maximum token limit (max_tokens) is reached.
A stop sequence (a predefined string) appears in the output.
The user or system interrupts generation (streaming cancellation).
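The first three conditions can be checked after every generated token, as in this minimal sketch. The function name, the returned reason strings, and the default values are hypothetical; user interruption is handled outside the decoding loop and is not modeled here.

```python
def should_stop(tokens, text, eos_token="</s>", max_tokens=256, stop_sequences=()):
    """Return the reason generation should stop, or None to keep going."""
    if tokens and tokens[-1] == eos_token:
        return "eos"
    if len(tokens) >= max_tokens:
        return "max_tokens"
    if any(stop in text for stop in stop_sequences):
        return "stop_sequence"
    return None

print(should_stop(["Hi", "</s>"], "Hi"))                                  # eos
print(should_stop(["a", "b", "c"], "abc", max_tokens=3))                  # max_tokens
print(should_stop(["Hi", "\nUser:"], "Hi\nUser:", stop_sequences=("User:",)))  # stop_sequence
```

Stop sequences are matched against the decoded text rather than token IDs because a stop string may span several tokens.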

Streaming

Most modern LLM deployments use streaming to send tokens to the user as they are generated rather than waiting for the full response. This significantly improves perceived responsiveness because the user begins reading almost immediately. The decoding strategy operates identically in streaming mode; the only difference is that tokens are transmitted incrementally.

What decoding parameters do LLM APIs expose?

LLM APIs expose various generation parameters that control the decoding process. The most universal are temperature (distribution sharpness) and top_p (nucleus threshold); OpenAI documents temperature on a 0 to 2 scale and top_p on a 0 to 1 scale, and recommends altering one or the other but not both.^[13]

Parameter	Provider support	Typical range	Effect
temperature	OpenAI, Anthropic, Google, Mistral	0.0 - 2.0 (upper bound varies by provider)	Controls distribution sharpness
top_p	OpenAI, Anthropic, Google, Mistral	0.0 - 1.0	Nucleus sampling threshold
top_k	Anthropic, Google, HuggingFace	1 - vocabulary size	Fixed candidate pool size
max_tokens	All providers	1 - context limit	Maximum output length
stop	OpenAI, Anthropic	Strings	Generation stops on match
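presence_penalty	OpenAI	-2.0 to 2.0	Flat penalty on any token that has already appeared, encouraging new topics
frequency_penalty	OpenAI	-2.0 to 2.0	Penalty that grows with how often a token has already appeared, discouraging verbatim repetition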

When did the major decoding strategies appear?

Decoding strategies have evolved alongside language model architectures.

Period	Key development
Pre-2017	Beam search dominates in machine translation and captioning; greedy decoding used for speed-critical applications
2018	Fan et al. introduce top-k sampling for story generation, showing that sampling produces more natural text than beam search^[1]
2019	Holtzman et al. propose nucleus (top-p) sampling, identifying "neural text degeneration" as a core problem^[2]
2021	Basu et al. introduce Mirostat, a perplexity-controlled decoding method^[11]
2022	Su et al. propose contrastive search; Meister et al. propose locally typical sampling^[3]^[10]
2023	Speculative decoding gains attention as an inference acceleration technique (Leviathan et al., Chen et al.)^[5]^[6]
2023	Li et al. introduce contrastive decoding using expert-amateur model pairs^[4]
2024	Min-p sampling proposed; Medusa and EAGLE offer new speculative decoding variants^[7]^[9]

Which decoding strategy should you choose?

The optimal decoding strategy depends on the application requirements.

Requirement	Recommended strategy
Maximum speed, minimal cost	Greedy decoding (optionally with speculative decoding for acceleration)
High quality for translation or summarization	Beam search with length penalty and n-gram blocking
Natural, diverse text for chatbots	Top-p sampling (p = 0.9) with temperature 0.7-1.0
Creative writing	Top-p or min-p sampling with higher temperature (1.0-1.2)
Avoiding repetition without randomness	Contrastive search ($\alpha = 0.6$)
Multiple distinct outputs	Diverse beam search
Reducing inference latency	Speculative decoding or Medusa

References

1. Fan, A., Lewis, M., & Dauphin, Y. (2018). "Hierarchical Neural Story Generation." ACL 2018. https://aclanthology.org/P18-1082/
2. Holtzman, A., Buys, J., Du, L., Forbes, M., & Choi, Y. (2019). "The Curious Case of Neural Text Degeneration." ICLR 2020. arXiv:1904.09751. https://arxiv.org/abs/1904.09751
3. Su, Y., Lan, T., Wang, Y., Yogatama, D., Kong, L., & Collier, N. (2022). "A Contrastive Framework for Neural Text Generation." NeurIPS 2022 (Spotlight). arXiv:2202.06417. https://arxiv.org/abs/2202.06417
4. Li, X. L., Holtzman, A., Fried, D., Liang, P., Eisner, J., Hashimoto, T., Zettlemoyer, L., & Lewis, M. (2023). "Contrastive Decoding: Open-ended Text Generation as Optimization." ACL 2023. arXiv:2210.15097. https://arxiv.org/abs/2210.15097
5. Leviathan, Y., Kalman, M., & Matias, Y. (2023). "Fast Inference from Transformers via Speculative Decoding." ICML 2023. arXiv:2211.17192. https://arxiv.org/abs/2211.17192
6. Chen, C., Borgeaud, S., Irving, G., Lespiau, J.-B., Sifre, L., & Jumper, J. (2023). "Accelerating Large Language Model Decoding with Speculative Sampling." arXiv:2302.01318. https://arxiv.org/abs/2302.01318
7. Cai, T., Li, Y., Geng, Z., Peng, H., Lee, J. D., Chen, D., & Dao, T. (2024). "Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads." ICML 2024. arXiv:2401.10774. https://arxiv.org/abs/2401.10774
8. Vijayakumar, A. K., Cogswell, M., Selvaraju, R. R., Sun, Q., Lee, S., Crandall, D., & Batra, D. (2016). "Diverse Beam Search: Decoding Diverse Solutions from Neural Sequence Models." arXiv:1610.02424. https://arxiv.org/abs/1610.02424
9. Nguyen, M. N., Baker, A., Neo, C., Roush, A., Kirsch, A., & Shwartz-Ziv, R. (2024). "Turning Up the Heat: Min-p Sampling for Creative and Coherent LLM Outputs." ICLR 2025 (Oral). arXiv:2407.01082. https://arxiv.org/abs/2407.01082
10. Meister, C., Pimentel, T., Wiher, G., & Cotterell, R. (2023). "Locally Typical Sampling." Transactions of the Association for Computational Linguistics, 11, 102-121. https://aclanthology.org/2023.tacl-1.7/
11. Basu, S., Ramachandran, G. S., Keskar, N. S., & Varshney, L. R. (2021). "Mirostat: A Neural Text Decoding Algorithm that Directly Controls Perplexity." ICLR 2021. arXiv:2007.14966. https://arxiv.org/abs/2007.14966
12. Schaeffer, R., Kazdan, J., & Denisov-Blanch, Y. (2025). "Turning Down the Heat: A Critical Analysis of Min-p Sampling in Language Models." arXiv:2506.13681. https://arxiv.org/abs/2506.13681
13. OpenAI. "API Reference: Chat Completions (temperature and top_p parameters)." OpenAI Platform Documentation. https://platform.openai.com/docs/api-reference/chat
