Decoding strategies are the algorithms used to select output tokens during text generation by large language models (LLMs) and other autoregressive neural networks. When a language model generates text, it produces a probability distribution over its vocabulary at each step. The decoding strategy determines which token from that distribution is chosen as the next output. The choice of decoding strategy profoundly affects the quality, diversity, fluency, and computational cost of the generated text.
Decoding strategies range from simple deterministic methods like greedy decoding and beam search to stochastic sampling techniques like top-k and top-p sampling. More recent methods, including contrastive search and speculative decoding, aim to improve quality and speed further. Understanding these strategies is essential for anyone working with LLMs, as the same model can produce drastically different outputs depending on how tokens are selected.
Modern language models based on the Transformer architecture generate text one token at a time, from left to right. At each time step t, the model takes the sequence of previously generated tokens (plus any input prompt) and produces a probability distribution P(x_t | x_1, ..., x_{t-1}) over the entire vocabulary V.
The decoding strategy is the algorithm that selects x_t from this distribution. Once x_t is selected, it is appended to the sequence, and the process repeats until an end-of-sequence token is generated or a maximum length is reached.
The raw model outputs before the softmax function are called logits. Various preprocessing steps (temperature scaling, repetition penalties) may be applied to the logits before the decoding strategy makes its selection.
Greedy decoding is the simplest deterministic decoding strategy. At each step, it selects the single token with the highest probability:
x_t = argmax P(x_t | x_1, ..., x_{t-1})
Greedy decoding is suitable for tasks where determinism and speed are priorities, such as simple classification or structured extraction, but is generally unsuitable for open-ended text generation.
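As a minimal sketch, greedy decoding is a loop of argmax selections. Here `next_token_logits` is a hypothetical stand-in for a model forward pass returning a logit vector over the vocabulary:

```python
import numpy as np

def greedy_decode(next_token_logits, prompt, eos_id, max_len=20):
    """Greedy decoding: at each step, append the argmax token."""
    seq = list(prompt)
    for _ in range(max_len):
        logits = next_token_logits(seq)
        tok = int(np.argmax(logits))   # single highest-probability token
        seq.append(tok)
        if tok == eos_id:              # stop on end-of-sequence
            break
    return seq

# Toy "model": always prefers token (last + 1) mod 5; EOS is token 4.
toy = lambda seq: np.eye(5)[(seq[-1] + 1) % 5] * 10.0
print(greedy_decode(toy, [0], eos_id=4))  # [0, 1, 2, 3, 4]
```

Because every step is a deterministic argmax, rerunning this with the same prompt always yields the same output.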
Beam search is a more sophisticated deterministic strategy that addresses greedy decoding's tendency to find locally optimal but globally suboptimal sequences. Instead of keeping only the single best token at each step, beam search maintains multiple candidate sequences (called "beams") in parallel.
The beam width B is the key hyperparameter. It controls the trade-off between quality and computational cost.
| Beam width | Behavior |
|---|---|
| B = 1 | Equivalent to greedy decoding |
| B = 2-5 | Common in machine translation and summarization |
| B = 5-10 | Used when higher quality is needed |
| B = infinity | Equivalent to exhaustive (breadth-first) search; impractical for large vocabularies |
Increasing the beam width generally improves output quality up to a point, after which returns diminish and computational cost increases linearly.
Beam search has a natural bias toward shorter sequences because log-probabilities are negative and accumulate with each step. Longer sequences have lower (more negative) cumulative scores even if they are of higher quality. Length normalization (or length penalty) addresses this by dividing the cumulative log-probability by a function of the sequence length:
score(Y) = log P(Y) / length(Y)^alpha
The parameter alpha controls the strength of the penalty. When alpha = 0, there is no length penalty. When alpha = 1, the score is normalized by length. A typical value is alpha = 0.6 to 0.8, which encourages longer, more complete outputs without overly penalizing short ones.
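The beam search loop with length normalization can be sketched as follows. `next_token_logprobs` is a hypothetical stand-in for a model forward pass returning log-probabilities; tie-breaking, batching, and caching are ignored:

```python
import numpy as np

def beam_search(next_token_logprobs, prompt, eos_id, beam_width=3,
                max_len=10, alpha=0.6):
    """Beam search; final ranking uses log P(Y) / len(Y)^alpha,
    where len(Y) counts generated tokens only."""
    beams = [(list(prompt), 0.0)]          # (sequence, cumulative log P)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, logp in beams:
            lp = next_token_logprobs(seq)
            # Expand each beam with its beam_width best continuations.
            for tok in np.argsort(lp)[-beam_width:]:
                candidates.append((seq + [int(tok)], logp + lp[tok]))
        # Keep the overall top beam_width candidates by cumulative score.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, logp in candidates[:beam_width]:
            (finished if seq[-1] == eos_id else beams).append((seq, logp))
        if not beams:                      # all surviving beams hit EOS
            break
    pool = finished or beams
    gen_len = lambda seq: len(seq) - len(prompt)
    return max(pool, key=lambda c: c[1] / gen_len(c[0]) ** alpha)[0]

# Toy "model": puts 0.9 on the chain 0 -> 1 -> 2 -> 3 (EOS = 3).
def toy(seq):
    probs = np.full(4, 0.1 / 3)
    probs[(seq[-1] + 1) % 4] = 0.9
    return np.log(probs)

print(beam_search(toy, [0], eos_id=3, beam_width=2))  # [0, 1, 2, 3]
```

With alpha = 0.6, the completed chain beats the shorter early-EOS hypotheses despite its more negative cumulative log-probability.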
Diverse beam search (DBS), introduced by Vijayakumar et al. (2016), addresses the lack of diversity in standard beam search. DBS divides the beams into groups and adds a diversity penalty that discourages beams in different groups from being too similar. At each step, a Hamming diversity term penalizes tokens that have been selected by beams in other groups. This produces a set of outputs that are both high-quality and distinct from each other.
Sampling-based decoding methods introduce stochasticity by drawing tokens from the probability distribution rather than always selecting the most likely token. This produces more diverse, natural-sounding text at the cost of some consistency.
Pure sampling draws the next token directly from the full probability distribution without any filtering. While this produces maximum diversity, it frequently selects very low-probability tokens, leading to incoherent or nonsensical text.
Temperature scaling modifies the probability distribution before sampling by dividing the logits by a temperature parameter T:
P(x_i) = exp(logit_i / T) / sum(exp(logit_j / T))
| Temperature | Effect |
|---|---|
| T < 1 | Sharpens distribution; more deterministic |
| T = 1 | Original distribution |
| T > 1 | Flattens distribution; more random |
| T -> 0 | Converges to greedy decoding |
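The temperature formula above amounts to dividing the logits by T before the softmax. A minimal, numerically stable sketch:

```python
import numpy as np

def softmax_with_temperature(logits, T):
    """Divide logits by T, then apply a numerically stable softmax."""
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()                       # stabilize exp() against overflow
    p = np.exp(z)
    return p / p.sum()

logits = [2.0, 1.0, 0.1]
print(softmax_with_temperature(logits, 1.0))  # original distribution
print(softmax_with_temperature(logits, 0.5))  # sharper: more mass on argmax
print(softmax_with_temperature(logits, 2.0))  # flatter: closer to uniform
```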
Top-k sampling restricts sampling to the k most probable tokens. All other tokens have their probabilities set to zero, and the remaining probabilities are renormalized. Top-k was popularized by Fan et al. (2018) for neural story generation.
The limitation of top-k is that the fixed value of k does not adapt to the shape of the distribution. In high-confidence contexts, k may include many irrelevant tokens. In low-confidence contexts, k may exclude plausible alternatives.
Top-p sampling (Holtzman et al., 2019) addresses top-k's inflexibility by dynamically adjusting the candidate pool. It keeps the smallest set of tokens whose cumulative probability exceeds a threshold p. This means the number of candidates grows when the model is uncertain and shrinks when the model is confident.
Min-p sampling (Nguyen et al., 2024) sets a dynamic threshold based on the top token's probability. A token is included only if its probability exceeds min_p times the probability of the most likely token. This scales naturally with model confidence and has been shown to outperform top-p at higher temperatures.
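The three filters can be sketched as a single masking function over a probability vector. This is a simplified version of the logit/probability processors found in generation libraries; the function name and defaults here are illustrative:

```python
import numpy as np

def filter_distribution(probs, top_k=None, top_p=None, min_p=None):
    """Zero out tokens excluded by top-k / top-p / min-p, renormalize."""
    probs = np.asarray(probs, dtype=float)
    keep = np.ones_like(probs, dtype=bool)
    if top_k is not None:
        # Keep the k most probable tokens (ties at the boundary survive).
        keep &= probs >= np.sort(probs)[-top_k]
    if top_p is not None:
        # Smallest set whose cumulative probability reaches p.
        order = np.argsort(probs)[::-1]
        cutoff = np.searchsorted(np.cumsum(probs[order]), top_p) + 1
        mask = np.zeros_like(keep)
        mask[order[:cutoff]] = True
        keep &= mask
    if min_p is not None:
        # Keep tokens with at least min_p times the top probability.
        keep &= probs >= min_p * probs.max()
    filtered = np.where(keep, probs, 0.0)
    return filtered / filtered.sum()

probs = [0.5, 0.3, 0.15, 0.05]
print(filter_distribution(probs, top_k=2))    # tokens 2, 3 zeroed; rest renormalized
print(filter_distribution(probs, top_p=0.9))  # keeps {0, 1, 2}: cumulative 0.95 > 0.9
print(filter_distribution(probs, min_p=0.5))  # keeps tokens with prob >= 0.25
```

Note how top-p and min-p adapt to the shape of the distribution, while top-k always keeps a fixed number of tokens.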
Contrastive search, introduced by Su et al. (2022), is a decoding method designed to overcome the common pitfalls of both greedy/beam search (repetition, blandness) and sampling methods (incoherence). It selects each token by balancing two objectives: model confidence (the probability the model assigns to the candidate) and a degeneration penalty (the similarity of the candidate's hidden representation to those of tokens already generated).
The selection criterion is:
x_t = argmax [(1 - alpha) * P(v) - alpha * max(sim(h_v, h_{x_j}))]
for all candidate tokens v and all previously generated tokens x_j, where h denotes hidden representations, sim is cosine similarity, and alpha is a hyperparameter (typically 0.6) that balances the two terms.
Repetitive text is characterized by hidden representations that cluster tightly together. By penalizing new tokens whose representations are similar to previous ones, contrastive search naturally steers away from repetition while staying close to the model's confident predictions. Empirical evaluations show that contrastive search produces text that is more coherent than sampling methods and more diverse than beam search, achieving a strong balance between the two.
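A toy single-step sketch of the selection criterion, with made-up probabilities and two-dimensional vectors standing in for real model hidden representations:

```python
import numpy as np

def contrastive_step(probs, cand_hidden, prev_hidden, alpha=0.6, k=4):
    """One contrastive-search step: among the top-k candidates, balance
    model confidence against similarity to already-generated tokens."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    best, best_score = None, -np.inf
    for v in np.argsort(probs)[-k:]:        # top-k candidate tokens
        # Degeneration penalty: max similarity to any previous token.
        penalty = max(cos(cand_hidden[v], h) for h in prev_hidden)
        score = (1 - alpha) * probs[v] - alpha * penalty
        if score > best_score:
            best, best_score = int(v), score
    return best

probs = np.array([0.4, 0.35, 0.15, 0.1])
cand_hidden = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [0.5, -0.5]])
prev_hidden = np.array([[1.0, 0.0]])        # representations generated so far
# Token 0 is the most probable but duplicates the prior context's
# representation, so the degeneration penalty hands the step to token 1.
print(contrastive_step(probs, cand_hidden, prev_hidden))  # 1
```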
| Property | Greedy | Beam search | Top-p sampling | Contrastive search |
|---|---|---|---|---|
| Deterministic | Yes | Yes | No | Yes |
| Repetition risk | High | Moderate | Low | Low |
| Diversity | Very low | Low | High | Moderate-high |
| Coherence | High (short text) | High | Variable | High |
| Computational cost | Very low | B times greedy | Low | Moderate |
Contrastive decoding, introduced by Li et al. (2023), is a related but distinct method from contrastive search. It uses two models: a strong "expert" model and a weaker "amateur" model. The next token is chosen to maximize the difference between the expert's and amateur's log-probabilities:
x_t = argmax [log P_expert(v) - log P_amateur(v)]
The intuition is that tokens where the expert strongly outperforms the amateur are more likely to be high-quality, while tokens where both models agree are more likely to be generic. To prevent the amateur's very low probabilities from promoting implausible tokens, the maximization is restricted to candidates to which the expert assigns at least a fixed fraction of its top token's probability (an adaptive plausibility constraint). Contrastive decoding encourages the generation of text that captures the expert model's unique capabilities.
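A toy single-step sketch. Candidates are first restricted to tokens to which the expert assigns at least a fraction `plaus` of its top probability (the adaptive plausibility constraint from the paper; the `plaus` default here is illustrative):

```python
import numpy as np

def contrastive_decode_step(p_expert, p_amateur, plaus=0.1):
    """Pick argmax of log p_expert - log p_amateur over tokens passing
    the adaptive plausibility constraint (expert prob >= plaus * max)."""
    p_expert = np.asarray(p_expert, dtype=float)
    p_amateur = np.asarray(p_amateur, dtype=float)
    valid = p_expert >= plaus * p_expert.max()
    scores = np.where(valid, np.log(p_expert) - np.log(p_amateur), -np.inf)
    return int(np.argmax(scores))

# Token 0 is generic (both models agree); token 1 is where the expert
# clearly outperforms the amateur, so it wins.
print(contrastive_decode_step([0.5, 0.4, 0.05, 0.05], [0.5, 0.1, 0.3, 0.1]))  # 1

# Without the constraint, the huge log-ratio of the implausible token 2
# would dominate; the plausibility filter excludes it.
print(contrastive_decode_step([0.6, 0.39, 0.01], [0.5, 0.4, 1e-6]))  # 0
```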
Speculative decoding is a technique for accelerating inference without changing the output distribution. It addresses a fundamental bottleneck in autoregressive generation: each token requires a full forward pass through the model, and these passes must happen sequentially because each depends on the previous token. Speculative decoding sidesteps this bottleneck by having a small, fast draft model propose several tokens ahead, which the large target model then verifies in a single parallel forward pass; each drafted token is accepted or rejected by a rejection-sampling rule that provably preserves the target model's output distribution.
Because the verification step processes multiple tokens in parallel, speculative decoding can generate multiple tokens per forward pass of the large model, achieving significant speedups (often 2-3x) without any change to the output quality.
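The verification rule can be sketched as follows. This is a simplified single-block version: the extra "bonus" token sampled from the target's next-position distribution when every draft is accepted is omitted:

```python
import numpy as np

def verify_draft(draft_tokens, q_draft, p_target, rng):
    """Speculative-sampling verification of one drafted block.

    q_draft[i]  -- draft model's distribution at drafted position i
    p_target[i] -- target model's distribution there, obtained for all
                   positions at once in a single parallel forward pass
    Each draft token x is accepted with probability min(1, p[x] / q[x]);
    on the first rejection, a replacement is sampled from the residual
    distribution max(0, p - q) and verification stops.
    """
    accepted = []
    for i, x in enumerate(draft_tokens):
        p, q = p_target[i], q_draft[i]
        if rng.random() < min(1.0, p[x] / q[x]):
            accepted.append(x)                     # keep the drafted token
        else:
            residual = np.maximum(p - q, 0.0)
            residual /= residual.sum()
            accepted.append(int(rng.choice(len(p), p=residual)))
            break
    return accepted

rng = np.random.default_rng(0)
# When draft and target distributions agree exactly, every token is accepted:
same = [np.array([0.2, 0.8]), np.array([0.5, 0.5])]
print(verify_draft([1, 0], same, same, rng))       # [1, 0]
```

The accept/resample rule is what makes the procedure lossless: marginally, each emitted token is distributed exactly as if the target model had sampled it directly.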
| Variant | Draft mechanism | Key feature |
|---|---|---|
| Standard speculative decoding (Leviathan et al., 2023; Chen et al., 2023) | Separate small model | Original formulation; mathematically preserves output distribution |
| Medusa (Cai et al., 2024) | Lightweight heads on target model | No separate draft model needed; parameter-efficient |
| EAGLE (Li et al., 2024) | Feature-level prediction | Uses hidden states rather than token probabilities for drafting |
| Self-speculative decoding | Target model with layer skipping | Reuses the target model itself for drafting |
| Lookahead decoding | N-gram cache from generation history | No separate model; uses observed patterns |
Medusa, introduced by Cai et al. (2024), avoids the need for a separate draft model by attaching multiple lightweight prediction heads to the target model. Each head predicts a token at a different future position. During inference, these heads produce candidate token sequences, which are then verified using a tree-based attention mechanism. Medusa can achieve 2-3x speedup over standard autoregressive decoding.
The following table provides a comprehensive comparison of the major decoding strategies.
| Strategy | Type | Quality | Diversity | Speed | Repetition handling | Best for |
|---|---|---|---|---|---|---|
| Greedy | Deterministic | Moderate | None | Fastest | Poor | Simple tasks, classification |
| Beam search | Deterministic | High | Very low | Moderate | Moderate (with n-gram blocking) | Translation, summarization |
| Diverse beam search | Deterministic | High | Moderate | Moderate | Good | Generating multiple distinct outputs |
| Top-k sampling | Stochastic | Good | High | Fast | Good | Creative text generation |
| Top-p (nucleus) sampling | Stochastic | Good | High | Fast | Good | General-purpose generation |
| Min-p sampling | Stochastic | Very good | High | Fast | Good | Creative generation at high temperature |
| Contrastive search | Hybrid | Very high | Moderate-high | Moderate | Very good | Open-ended generation |
| Contrastive decoding | Hybrid | Very high | Moderate | Slow (two models) | Very good | High-quality generation |
| Speculative decoding | Acceleration | Same as target model | Same as target model | 2-3x faster | Same as target model | Inference acceleration |
In deployed systems, text generation involves several layers of processing beyond the basic decoding strategy.
Before the decoding strategy selects a token, the raw logits typically undergo several transformations:

- Temperature scaling divides the logits by T, sharpening or flattening the distribution.
- Repetition, presence, and frequency penalties lower the logits of tokens or topics that have already appeared.
- Filtering (top-k, top-p, or min-p) removes excluded tokens, after which the remaining probabilities are renormalized.
- Disallowed tokens (for example, during constrained or structured generation) are masked out entirely.
Generation stops when any of the following conditions are met:

- An end-of-sequence (EOS) token is generated.
- The maximum output length (max_tokens) is reached.
- A user-specified stop sequence appears in the output.
Most modern LLM deployments use streaming to send tokens to the user as they are generated rather than waiting for the full response. This significantly improves perceived responsiveness because the user begins reading almost immediately. The decoding strategy operates identically in streaming mode; the only difference is that tokens are transmitted incrementally.
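The streaming pattern can be sketched as a Python generator: the decoding logic is unchanged, but each token is yielded as soon as it is selected. The toy stand-in model below is hypothetical:

```python
import numpy as np

def stream_generate(next_token_logits, prompt, eos_id, max_len=50):
    """Greedy generation as a generator: each token is yielded the
    moment it is selected, rather than after the full sequence is
    complete -- the streaming pattern used by LLM serving APIs."""
    seq = list(prompt)
    for _ in range(max_len):
        tok = int(np.argmax(next_token_logits(seq)))
        seq.append(tok)
        yield tok                      # sent to the client immediately
        if tok == eos_id:
            break

# Toy stand-in "model": always prefers token (last + 1) mod 5; EOS = 4.
toy = lambda seq: np.eye(5)[(seq[-1] + 1) % 5]
for tok in stream_generate(toy, [0], eos_id=4):
    print(tok, end=" ")                # 1 2 3 4, printed one at a time
```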
LLM APIs expose various generation parameters that control the decoding process.
| Parameter | Provider support | Typical range | Effect |
|---|---|---|---|
| temperature | OpenAI, Anthropic, Google, Mistral | 0.0 - 2.0 | Controls distribution sharpness |
| top_p | OpenAI, Anthropic, Google, Mistral | 0.0 - 1.0 | Nucleus sampling threshold |
| top_k | Anthropic, Google, HuggingFace | 1 - vocabulary size | Fixed candidate pool size |
| max_tokens | All providers | 1 - context limit | Maximum output length |
| stop | OpenAI, Anthropic | Strings | Generation stops on match |
| presence_penalty | OpenAI | -2.0 to 2.0 | Penalizes repeated topics |
| frequency_penalty | OpenAI | -2.0 to 2.0 | Penalizes repeated tokens |
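For illustration, typical parameter presets using the OpenAI-style names from the table above might look like the following. The values are common starting points rather than requirements, and the stop string is hypothetical:

```python
# Balanced settings for a chatbot.
chat_params = {
    "temperature": 0.7,        # fluent but not rigid
    "top_p": 0.9,              # nucleus sampling threshold
    "max_tokens": 512,         # hard cap on output length
    "stop": ["\n\nUser:"],     # hypothetical stop sequence for a chat format
    "frequency_penalty": 0.3,  # mildly discourage repeated tokens
}

# Creative writing: hotter sampling, no repetition penalty.
creative_params = {**chat_params, "temperature": 1.1, "frequency_penalty": 0.0}

# Near-deterministic output for extraction or classification tasks.
deterministic_params = {"temperature": 0.0, "max_tokens": 256}
```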
Decoding strategies have evolved alongside language model architectures.
| Period | Key development |
|---|---|
| Pre-2017 | Beam search dominates in machine translation and captioning; greedy decoding used for speed-critical applications |
| 2018 | Fan et al. introduce top-k sampling for story generation, showing that sampling produces more natural text than beam search |
| 2019 | Holtzman et al. propose nucleus (top-p) sampling, identifying "neural text degeneration" as a core problem |
| 2021 | Basu et al. introduce Mirostat, a perplexity-controlled decoding method |
| 2022 | Su et al. propose contrastive search; Meister et al. propose locally typical sampling |
| 2023 | Speculative decoding gains attention as an inference acceleration technique (Leviathan et al., Chen et al.) |
| 2023 | Li et al. introduce contrastive decoding using expert-amateur model pairs |
| 2024 | Min-p sampling proposed; Medusa and EAGLE offer new speculative decoding variants |
The optimal decoding strategy depends on the application requirements.
| Requirement | Recommended strategy |
|---|---|
| Maximum speed, minimal cost | Greedy decoding (optionally with speculative decoding for acceleration) |
| High quality for translation or summarization | Beam search with length penalty and n-gram blocking |
| Natural, diverse text for chatbots | Top-p sampling (p = 0.9) with temperature 0.7-1.0 |
| Creative writing | Top-p or min-p sampling with higher temperature (1.0-1.2) |
| Avoiding repetition without randomness | Contrastive search (alpha = 0.6) |
| Multiple distinct outputs | Diverse beam search |
| Reducing inference latency | Speculative decoding or Medusa |