Decoding strategies
Last reviewed
Sources
13 citations
Review status
Source-backed
Revision
v4 ยท 3,588 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
Sources
13 citations
Review status
Source-backed
Revision
v4 ยท 3,588 words
Add missing citations, update stale details, or suggest a clearer explanation.
Decoding strategies are the algorithms that select output tokens from a language model's next-token probability distribution during text generation. At each step an autoregressive model produces a probability distribution over its vocabulary, and the decoding strategy decides which token to emit, an algorithmic choice that, given the same model and prompt, can produce drastically different output. The major families are deterministic methods, greedy decoding and beam search, and stochastic sampling methods, including temperature scaling, top-k sampling, top-p (nucleus) sampling, min-p sampling, typical sampling, and contrastive search. A central finding of the field is that maximization-based decoding, always choosing the most probable token or sequence, tends to produce text that is bland and repetitive, while unconstrained sampling produces text that is incoherent, so practical decoding navigates a tradeoff between quality and diversity.[2][3]
Decoding strategies are used by large language models (LLMs) and other autoregressive neural networks. They range from simple deterministic methods like greedy decoding and beam search to stochastic sampling techniques like top-k and top-p sampling. More recent methods, including contrastive search and speculative decoding, aim to improve quality and speed further. Understanding these strategies is essential for anyone working with LLMs, as the same model can produce drastically different outputs depending on how tokens are selected.[2]
Modern language models based on the Transformer architecture generate text one token at a time, from left to right. At each time step t, the model takes the sequence of previously generated tokens (plus any input prompt) and produces a probability distribution P(x_t | x_1, ..., x_{t-1}) over the entire vocabulary V.
The decoding strategy is the algorithm that selects x_t from this distribution. Once x_t is selected, it is appended to the sequence, and the process repeats until an end-of-sequence token is generated or a maximum length is reached. Because each token requires its own forward pass and each pass depends on the previously chosen token, generating K tokens conventionally takes K serial runs of the model, a constraint that motivates acceleration methods like speculative decoding.[5]
The raw model outputs before the softmax function are called logits. Various preprocessing steps (temperature scaling, repetition penalties) may be applied to the logits before the decoding strategy makes its selection.
Greedy decoding is the simplest deterministic decoding strategy. At each step, it selects the single token with the highest probability:
x_t = argmax P(x_t | x_1, ..., x_{t-1})
Greedy decoding is suitable for tasks where determinism and speed are priorities, such as simple classification or structured extraction, but is generally unsuitable for open-ended text generation.
Beam search is a more sophisticated deterministic strategy that addresses greedy decoding's tendency to find locally optimal but globally suboptimal sequences. Instead of keeping only the single best token at each step, beam search maintains multiple candidate sequences (called "beams") in parallel.
The beam width B is the key hyperparameter. It controls the trade-off between quality and computational cost.
| Beam width | Behavior |
|---|---|
| B = 1 | Equivalent to greedy decoding |
| B = 2-5 | Common in machine translation and summarization |
| B = 5-10 | Used when higher quality is needed |
| B = infinity | Equivalent to exhaustive (best-first) search; impractical for large vocabularies |
Increasing the beam width generally improves output quality up to a point, after which returns diminish and computational cost increases linearly.
Beam search has a natural bias toward shorter sequences because log-probabilities are negative and accumulate with each step. Longer sequences have lower (more negative) cumulative scores even if they are of higher quality. Length normalization (or length penalty) addresses this by dividing the cumulative log-probability by a function of the sequence length:
score(Y) = log P(Y) / length(Y)^alpha
The parameter alpha controls the strength of the penalty. When alpha = 0, there is no length penalty. When alpha = 1, the score is normalized by length. A typical value is alpha = 0.6 to 0.8, which encourages longer, more complete outputs without overly penalizing short ones.
Diverse beam search (DBS), introduced by Vijayakumar et al. (2016), addresses the lack of diversity in standard beam search.[8] DBS divides the beams into groups and adds a diversity penalty that discourages beams in different groups from being too similar. At each step, a Hamming diversity term penalizes tokens that have been selected by beams in other groups. This produces a set of outputs that are both high-quality and distinct from each other.
Sampling-based decoding methods introduce stochasticity by drawing tokens from the probability distribution rather than always selecting the most likely token. This produces more diverse, natural-sounding text at the cost of some consistency.
Pure sampling draws the next token directly from the full probability distribution without any filtering. While this produces maximum diversity, it frequently selects very low-probability tokens, leading to incoherent or nonsensical text. The motivation for truncated samplers (top-k, top-p, min-p) is that the unreliable low-probability "tail" of the distribution is the main source of this incoherence.[2]
Temperature scaling modifies the probability distribution before sampling by dividing the logits by a temperature parameter T:
P(x_i) = exp(logit_i / T) / sum(exp(logit_j / T))
| Temperature | Effect |
|---|---|
| T < 1 | Sharpens distribution; more deterministic |
| T = 1 | Original distribution |
| T > 1 | Flattens distribution; more random |
| T -> 0 | Converges to greedy decoding |
Top-k sampling restricts sampling to the k most probable tokens. All other tokens have their probabilities set to zero, and the remaining probabilities are renormalized. Top-k was popularized by Fan et al. (2018) for neural story generation, which used a fixed pool of k = 40 candidate tokens.[1]
The limitation of top-k is that the fixed value of k does not adapt to the shape of the distribution. In high-confidence contexts, k may include many irrelevant tokens. In low-confidence contexts, k may exclude plausible alternatives.
Top-p sampling (Holtzman et al., 2019) addresses top-k's inflexibility by dynamically adjusting the candidate pool. It keeps the smallest set of tokens whose cumulative probability exceeds a threshold p, the "nucleus" of the distribution. This means the number of candidates grows when the model is uncertain and shrinks when the model is confident. The method works because, as the authors put it, "the vast majority of probability mass at each time step is concentrated in the nucleus, a small subset of the vocabulary that tends to range between one and a thousand candidates."[2] The paper concluded that nucleus sampling was the best overall decoding strategy in their human and automatic evaluations.[2]
Min-p sampling (Nguyen et al., 2024) sets a dynamic threshold based on the top token's probability. A token is included only if its probability exceeds min_p times the probability of the most likely token. This scales naturally with model confidence. The original paper reported that human evaluations "show a clear preference for min-p sampling, in both text quality and creativity," especially at higher temperatures, and the work was selected for an Oral presentation at ICLR 2025.[9] A later critical re-analysis (Hwang et al., 2025) disputed these results, arguing that the original human evaluations omitted data and that on reanalysis min-p did not clearly outperform top-p, illustrating that decoding-quality claims remain actively contested.[12]
Contrastive search, introduced by Su et al. (2022), is a decoding method designed to overcome the common pitfalls of both greedy/beam search (repetition, blandness) and sampling methods (incoherence).[3] The paper traces degeneration to a structural cause: "an underlying reason for model degeneration is the anisotropic distribution of token representations."[3] Contrastive search selects each token by balancing two objectives:
The selection criterion is:
x_t = argmax [(1 - alpha) * P(v) - alpha * max(sim(h_v, h_{x_j}))]
for all candidate tokens v and all previously generated tokens x_j, where h denotes hidden representations, sim is cosine similarity, and alpha is a hyperparameter (typically 0.6) that balances the two terms.
Repetitive text is characterized by hidden representations that cluster tightly together. By penalizing new tokens whose representations are similar to previous ones, contrastive search naturally steers away from repetition while staying close to the model's confident predictions. Empirical evaluations show that contrastive search produces text that is more coherent than sampling methods and more diverse than beam search, achieving a strong balance between the two.[3]
| Property | Greedy | Beam search | Top-p sampling | Contrastive search |
|---|---|---|---|---|
| Deterministic | Yes | Yes | No | Yes |
| Repetition risk | High | Moderate | Low | Low |
| Diversity | Very low | Low | High | Moderate-high |
| Coherence | High (short text) | High | Variable | High |
| Computational cost | Very low | B times greedy | Low | Moderate |
Contrastive decoding, introduced by Li et al. (2023), is a related but distinct method from contrastive search.[4] It uses two models: a strong "expert" model and a weaker "amateur" model. The next token is chosen to maximize the difference between the expert's and amateur's log-probabilities:
x_t = argmax [log P_expert(v) - log P_amateur(v)]
The intuition is that tokens where the expert strongly outperforms the amateur are more likely to be high-quality, while tokens where both models agree are more likely to be generic. Contrastive decoding encourages the generation of text that captures the expert model's unique capabilities.
Speculative decoding is a technique for accelerating inference without changing the output distribution. It addresses a fundamental bottleneck in autoregressive generation, which Leviathan et al. summarize as the fact that "decoding K tokens takes K serial runs of the model": each token requires a full forward pass, and these passes must happen sequentially because each depends on the previous token.[5]
Because the verification step processes multiple tokens in parallel, speculative decoding can generate multiple tokens per forward pass of the large model. Leviathan et al. demonstrated the method on T5-XXL and reported a 2X-3X acceleration compared to the standard T5X implementation, with identical outputs, that is, no change to the output quality.[5][6]
| Variant | Draft mechanism | Key feature |
|---|---|---|
| Standard speculative decoding (Leviathan et al., 2023; Chen et al., 2023) | Separate small model | Original formulation; mathematically preserves output distribution |
| Medusa (Cai et al., 2024) | Lightweight heads on target model | No separate draft model needed; parameter-efficient |
| EAGLE (Li et al., 2024) | Feature-level prediction | Uses hidden states rather than token probabilities for drafting |
| Self-speculative decoding | Target model with layer skipping | Reuses the target model itself for drafting |
| Lookahead decoding | N-gram cache from generation history | No separate model; uses observed patterns |
Medusa, introduced by Cai et al. (2024), avoids the need for a separate draft model by attaching multiple lightweight prediction heads to the target model.[7] Each head predicts a token at a different future position. During inference, these heads produce candidate token sequences, which are then verified using a tree-based attention mechanism. The paper reports that Medusa-1 (heads fine-tuned on a frozen backbone) achieves over 2.2x speedup without compromising generation quality, while Medusa-2 (heads fine-tuned together with the backbone) reaches 2.3x to 2.8x.[7]
The following table provides a comprehensive comparison of the major decoding strategies.
| Strategy | Type | Quality | Diversity | Speed | Repetition handling | Best for |
|---|---|---|---|---|---|---|
| Greedy | Deterministic | Moderate | None | Fastest | Poor | Simple tasks, classification |
| Beam search | Deterministic | High | Very low | Moderate | Moderate (with n-gram blocking) | Translation, summarization |
| Diverse beam search | Deterministic | High | Moderate | Moderate | Good | Generating multiple distinct outputs |
| Top-k sampling | Stochastic | Good | High | Fast | Good | Creative text generation |
| Top-p (nucleus) sampling | Stochastic | Good | High | Fast | Good | General-purpose generation |
| Min-p sampling | Stochastic | Very good | High | Fast | Good | Creative generation at high temperature |
| Contrastive search | Hybrid | Very high | Moderate-high | Moderate | Very good | Open-ended generation |
| Contrastive decoding | Hybrid | Very high | Moderate | Slow (two models) | Very good | High-quality generation |
| Speculative decoding | Acceleration | Same as target model | Same as target model | 2-3x faster | Same as target model | Inference acceleration |
In deployed systems, text generation involves several layers of processing beyond the basic decoding strategy.
Before the decoding strategy selects a token, the raw logits undergo several transformations:
Generation stops when any of the following conditions are met:
Most modern LLM deployments use streaming to send tokens to the user as they are generated rather than waiting for the full response. This significantly improves perceived responsiveness because the user begins reading almost immediately. The decoding strategy operates identically in streaming mode; the only difference is that tokens are transmitted incrementally.
LLM APIs expose various generation parameters that control the decoding process. The most universal are temperature (distribution sharpness) and top_p (nucleus threshold); OpenAI documents temperature on a 0 to 2 scale and top_p on a 0 to 1 scale, and recommends altering one or the other but not both.[13]
| Parameter | Provider support | Typical range | Effect |
|---|---|---|---|
| temperature | OpenAI, Anthropic, Google, Mistral | 0.0 - 2.0 | Controls distribution sharpness |
| top_p | OpenAI, Anthropic, Google, Mistral | 0.0 - 1.0 | Nucleus sampling threshold |
| top_k | Anthropic, Google, HuggingFace | 1 - vocabulary size | Fixed candidate pool size |
| max_tokens | All providers | 1 - context limit | Maximum output length |
| stop | OpenAI, Anthropic | Strings | Generation stops on match |
| presence_penalty | OpenAI | -2.0 to 2.0 | Penalizes repeated topics |
| frequency_penalty | OpenAI | -2.0 to 2.0 | Penalizes repeated tokens |
Decoding strategies have evolved alongside language model architectures.
| Period | Key development |
|---|---|
| Pre-2017 | Beam search dominates in machine translation and captioning; greedy decoding used for speed-critical applications |
| 2018 | Fan et al. introduce top-k sampling for story generation, showing that sampling produces more natural text than beam search[1] |
| 2019 | Holtzman et al. propose nucleus (top-p) sampling, identifying "neural text degeneration" as a core problem[2] |
| 2021 | Basu et al. introduce Mirostat, a perplexity-controlled decoding method[11] |
| 2022 | Su et al. propose contrastive search; Meister et al. propose locally typical sampling[3][10] |
| 2023 | Speculative decoding gains attention as an inference acceleration technique (Leviathan et al., Chen et al.)[5][6] |
| 2023 | Li et al. introduce contrastive decoding using expert-amateur model pairs[4] |
| 2024 | Min-p sampling proposed; Medusa and EAGLE offer new speculative decoding variants[7][9] |
The optimal decoding strategy depends on the application requirements.
| Requirement | Recommended strategy |
|---|---|
| Maximum speed, minimal cost | Greedy decoding (optionally with speculative decoding for acceleration) |
| High quality for translation or summarization | Beam search with length penalty and n-gram blocking |
| Natural, diverse text for chatbots | Top-p sampling (p = 0.9) with temperature 0.7-1.0 |
| Creative writing | Top-p or min-p sampling with higher temperature (1.0-1.2) |
| Avoiding repetition without randomness | Contrastive search (alpha = 0.6) |
| Multiple distinct outputs | Diverse beam search |
| Reducing inference latency | Speculative decoding or Medusa |