# Greedy decoding

> Source: https://aiwiki.ai/wiki/greedy_decoding
> Updated: 2026-06-23
> Categories: Large Language Models, Natural Language Processing
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

**Greedy decoding** (also called **greedy search** or **argmax decoding**) is the simplest text-generation strategy used by autoregressive [language models](/wiki/language_model): at every step it picks the single highest-probability next token (the argmax of the model's output distribution) and appends it to the sequence, with no sampling, no alternatives tracked, and no lookahead. The loop repeats until the model produces an end-of-sequence token or hits a length limit. Greedy decoding is deterministic, the cheapest decoding rule available, and equivalent to the limit of sampling as [temperature](/wiki/temperature) goes to zero. It is also myopic (locally optimal, not globally optimal) and prone to repetition loops and bland, generic prose, which is why open-ended generation usually uses stochastic sampling instead and translation usually uses [beam search](/wiki/beam_search).

Greedy decoding is the decoder behind the popular `temperature=0` setting on the [OpenAI](/wiki/openai_api) and [Anthropic](/wiki/anthropic_api) APIs, and behind `top_k=1` on stacks that expose a top-$k$ parameter. It is also available on Hugging Face `generate` (where `do_sample=False` with `num_beams=1` is the default decoding strategy), on [vLLM](/wiki/vllm), and on virtually every other inference stack [14] [15].

The trade-off is that greedy decoding is locally optimal but not globally optimal. By committing to the highest-probability token at every step, the decoder routinely walks into low-probability sequences and into well-documented failure modes such as repetition loops and dull generic prose [5]. For tasks with a single correct answer (math, code generation, classification, function calling, structured output) greedy is usually the right default. For creative writing, dialogue, and translation, it usually is not.

## What is greedy decoding, formally?

Let $V$ be the model vocabulary and $x_{<t} = (x_1, x_2, \dots, x_{t-1})$ be the tokens already generated together with the prompt. An [autoregressive](/wiki/autoregressive) language model defines a conditional distribution

$$
P_\theta(x_t \mid x_{<t})
$$

over the next token. Greedy decoding produces the next token by taking the [argmax](/wiki/argmax) of this distribution:

$$
x_t = \arg\max_{v \in V} P_\theta(v \mid x_{<t}).
$$

The full sequence is built by repeating this rule:

$$
x_1 = \arg\max_v P_\theta(v \mid x_{<1}), \quad x_2 = \arg\max_v P_\theta(v \mid x_{<2}), \quad \dots
$$

until the special end-of-sequence token (often `</s>`, `<|endoftext|>`, or `<|eot_id|>` depending on the [tokenizer](/wiki/token)) is emitted, or until a `max_tokens` budget is reached.

The decision is made on the post-[softmax](/wiki/softmax) distribution, but because softmax is monotonic in the [logits](/wiki/logits), the same token is selected by argmax over raw logits. Production implementations skip the softmax for greedy decoding and run argmax on logits directly.
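The shortcut can be checked on a toy vector. The sketch below uses made-up logit values and plain Python; the helper names `softmax` and `argmax` are illustrative and not taken from any library.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    """Index of the largest value; the lowest index wins on ties."""
    return max(range(len(values)), key=lambda i: values[i])

logits = [1.2, -0.3, 4.1, 0.0, 3.9]   # hypothetical next-token logits
probs = softmax(logits)

# Softmax preserves the ordering of the logits, so both calls pick token 2.
print(argmax(logits), argmax(probs))
```

Skipping the softmax saves an exponentiation and a normalisation over the whole vocabulary at every step, at no cost to the result.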

## The algorithm

The full procedure is roughly five lines of pseudocode:

```
input: prompt tokens x_<1
output: completion tokens x_1, x_2, ..., x_T

t = 0
repeat:
    t = t + 1
    logits = model.forward(x_<t)
    x_t = argmax(logits[-1])         # last position only
    append x_t to the sequence
until x_t == EOS or t == max_tokens
```

The `model.forward` call returns logits for every position in the sequence, but greedy decoding only needs the last row. Modern inference engines exploit a [KV cache](/wiki/kv_cache) so that step $t$ runs only the new token through the attention layers and reuses cached keys and values for the previous positions. With the cache, generation is linear in the output length rather than quadratic.
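As a runnable counterpart to the pseudocode, here is a minimal sketch in plain Python. The "model" is a hypothetical bigram lookup table (`BIGRAM_LOGITS`) standing in for a real forward pass, and the vocabulary and logit values are invented for illustration; there is no KV cache because the toy model only looks at the last token.

```python
VOCAB = ["<eos>", "the", "cat", "sat", "down"]
EOS = 0

# Hypothetical bigram "model": next-token logits given only the last token.
BIGRAM_LOGITS = {
    "the":  [-2.0, -1.0, 3.0, 0.5, 0.0],
    "cat":  [-1.0, -2.0, -1.0, 2.5, 0.0],
    "sat":  [0.5, -1.0, -2.0, -1.0, 2.0],
    "down": [3.0, 0.0, -1.0, -1.0, -2.0],
}

def next_token_logits(tokens):
    """Stand-in for model.forward(tokens)[-1]: logits at the last position."""
    return BIGRAM_LOGITS[VOCAB[tokens[-1]]]

def greedy_decode(prompt, max_tokens=10):
    tokens = list(prompt)
    completion = []
    for _ in range(max_tokens):
        logits = next_token_logits(tokens)
        x_t = max(range(len(logits)), key=lambda i: logits[i])  # argmax
        tokens.append(x_t)
        completion.append(x_t)
        if x_t == EOS:          # stop on end-of-sequence
            break
    return completion

out = greedy_decode([VOCAB.index("the")])
print([VOCAB[i] for i in out])  # ['cat', 'sat', 'down', '<eos>']
```

Running the function twice on the same prompt returns the same completion, which is the determinism property discussed later in the article.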

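Ties in the argmax are rare but possible. PyTorch `torch.argmax` and NumPy `argmax` both return the lowest index among tied values. A more common source of divergence is floating-point arithmetic: GPU kernels can accumulate sums in a different order depending on batch size, hardware, or kernel choice, so two nearly tied logits can swap places. This is one reason that two systems running "the same" greedy decoding can produce different completions.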
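One way such divergence can arise is floating-point reduction order. The sketch below is a deliberately tiny illustration in plain Python, not a reproduction of any GPU kernel: the same three made-up partial sums are added in two different orders, and the argmax changes.

```python
def argmax(values):
    """Lowest index wins on exact ties."""
    return max(range(len(values)), key=lambda i: values[i])

# Token 0 has a fixed logit. Token 1's logit is a sum of three partial terms,
# as it would be inside a matrix multiply. All values are made up.
parts = [0.1, 0.2, 0.3]
logit_0 = 0.6

left_to_right = (parts[0] + parts[1]) + parts[2]   # 0.6000000000000001
right_to_left = parts[0] + (parts[1] + parts[2])   # 0.6

print(argmax([logit_0, left_to_right]))  # 1: token 1 wins by one rounding error
print(argmax([logit_0, right_to_left]))  # 0: exact tie, lowest index wins
```

Once a single token differs, every later step is conditioned on a different prefix, so the two completions can diverge entirely.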

## How does greedy decoding relate to other decoding strategies?

Greedy is a degenerate case of two more general families.

It is **[beam search](/wiki/beam_search) with width 1**. Beam search keeps the top-$k$ partial hypotheses at each step, expands each by every vocabulary token, scores the resulting candidates by joint log-probability, and prunes back to the top $k$. With $k = 1$ only one hypothesis is alive and the procedure collapses to greedy.

It is also **[temperature](/wiki/temperature)-zero sampling**. Temperature $\tau$ rescales logits before the softmax, giving $P_\tau(v) \propto \exp(\ell_v / \tau)$. As $\tau \to 0^+$ the distribution concentrates entirely on the highest-logit token. The decoder is also equivalent to **top-1 sampling** ([top-k](/wiki/top_p_sampling) with $k = 1$).
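The temperature limit is easy to see numerically. This is a minimal sketch with invented logits; `softmax_with_temperature` is an illustrative helper, not a library function.

```python
import math

def softmax_with_temperature(logits, tau):
    """P_tau(v) proportional to exp(logit_v / tau), computed stably."""
    scaled = [x / tau for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.5, 0.5, -1.0]  # hypothetical next-token logits

# As tau shrinks, the probability mass piles onto the highest-logit token.
for tau in (1.0, 0.5, 0.1, 0.01):
    probs = softmax_with_temperature(logits, tau)
    print(tau, [round(p, 4) for p in probs])
```

Note that $\tau = 0$ itself cannot be plugged into the formula (it divides by zero), which is why implementations treat it as a special case and take the argmax directly.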

| Strategy | Description | Determinism | Greedy is recovered when |
|---|---|---|---|
| Greedy | Pick argmax at every step | Deterministic | n/a |
| Beam search | Keep top-$k$ partial sequences | Deterministic | $k = 1$ |
| Temperature sampling | Sample from softmax with temperature $\tau$ | Stochastic | $\tau \to 0$ |
| Top-$k$ sampling | Sample from the $k$ highest-probability tokens (Fan et al. 2018) | Stochastic | $k = 1$ |
| Top-$p$ / nucleus sampling | Sample from the smallest set with cumulative probability $\ge p$ (Holtzman et al. 2020) | Stochastic | $p \to 0$ |
| Min-$p$ sampling | Sample from tokens with probability $\ge p \cdot p_{\max}$ (Nguyen et al. 2024) | Stochastic | $p \to 1$ |
| Typical sampling | Sample tokens with information content close to the distribution's entropy (Meister et al. 2023) | Stochastic | does not directly recover greedy |
| [Speculative decoding](/wiki/speculative_decoding) | Draft model proposes tokens, target verifies in parallel (Leviathan et al. 2022; Chen et al. 2023) | Matches underlying decoder | Underlying decoder is greedy |

Greedy is therefore both a member of the deterministic family (with beam) and the limit case of most of the stochastic samplers above, typical sampling being the exception. The top-$k$ sampling family was introduced by Fan, Lewis, and Dauphin at ACL 2018 for open-ended story generation [4], and top-$p$ (nucleus) sampling by Holtzman et al. at ICLR 2020 [5]. Most practical decoders combine one of these strategies with engineering tweaks (repetition penalty, presence penalty, banned tokens, logit bias).
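The limiting cases of the truncation samplers can be checked on a toy distribution. The following sketch implements only the "which tokens survive" step of top-$k$, top-$p$, and min-$p$ under their usual definitions; the function names and the probabilities are illustrative assumptions. For min-$p$ it uses $p = 1$, which keeps only tokens at least as probable as the top token.

```python
def top_k_keep(probs, k):
    """Indices of the k most probable tokens."""
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    return set(order[:k])

def top_p_keep(probs, p):
    """Smallest set of top tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, mass = set(), 0.0
    for i in order:
        kept.add(i)
        mass += probs[i]
        if mass >= p:
            break
    return kept

def min_p_keep(probs, p):
    """Tokens whose probability is at least p times the top probability."""
    threshold = p * max(probs)
    return {i for i, q in enumerate(probs) if q >= threshold}

probs = [0.05, 0.50, 0.30, 0.15]  # hypothetical next-token distribution

print(top_p_keep(probs, 0.9), min_p_keep(probs, 0.5))   # ordinary settings
# At the extreme settings only the argmax token (index 1) survives.
print(top_k_keep(probs, 1), top_p_keep(probs, 0.01), min_p_keep(probs, 1.0))
```

When a single token survives the filter, sampling from the renormalised distribution has only one possible outcome, so the sampler behaves exactly like greedy.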

## Properties

**Determinism.** Given the same prompt, the same model weights, and the same numerical kernels, greedy decoding always produces the same output. This is its single most useful property. Reproducibility is essential for evaluation harnesses, regression tests, and any production system that needs the same input to map to the same output. Sampling decoders introduce a random seed that has to be controlled, and numerical non-determinism on GPUs can still leak in.

**Local optimality.** Greedy maximises $P_\theta(x_t \mid x_{<t})$ at each step, but the joint probability of the whole sequence is $P_\theta(x_1, \dots, x_T) = \prod_{t=1}^T P_\theta(x_t \mid x_{<t})$, and the highest-probability prefix at step $t$ is not in general a prefix of the highest-probability complete sequence. A token that scores 0.51 at step 1 might force the model into a region where every continuation is poor, while a token scoring 0.49 might lead to a much higher joint probability. Beam search and exact search fix this at the cost of compute; greedy does not.
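The 0.51 versus 0.49 scenario can be made concrete with a two-step toy model. Everything here is hypothetical: the tables `STEP1` and `STEP2` are invented probabilities, and `beam_search` is a bare-bones sketch without length normalisation or end-of-sequence handling. Width 1 is greedy.

```python
import math

# Hypothetical model: P(first token), then P(second token | first token).
STEP1 = {"A": 0.51, "B": 0.49}
STEP2 = {"A": {"x": 0.55, "y": 0.45},
         "B": {"x": 0.90, "y": 0.10}}

def next_probs(prefix):
    return STEP1 if not prefix else STEP2[prefix[0]]

def beam_search(width, length=2):
    beams = [((), 0.0)]  # (prefix, joint log-probability)
    for _ in range(length):
        candidates = [
            (prefix + (tok,), score + math.log(p))
            for prefix, score in beams
            for tok, p in next_probs(prefix).items()
        ]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:width]          # prune to the top `width`
    prefix, score = beams[0]
    return prefix, math.exp(score)

print(beam_search(width=1))  # greedy: ('A', 'x'), joint probability about 0.2805
print(beam_search(width=2))  # ('B', 'x'), joint probability about 0.441
```

Greedy commits to `A` because it wins the first step, and can never recover the sequence `B x`, whose joint probability is more than half again as large.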

**Compute cost.** Per generated token, greedy adds nothing on top of the model's forward pass beyond a single `argmax` over the vocabulary, which is $O(|V|)$ and dwarfed by the forward pass itself. Beam search with width $k$ costs roughly $k$ times more memory and compute and requires a top-$k$ on a vector of size $k|V|$. Top-$p$ sampling needs a sort or partial sort of the vocabulary, which is $O(|V| \log |V|)$ but again negligible compared with the forward pass. In wall-clock terms, greedy and the standard sampling decoders are essentially the same speed; beam search is the one that pays a real cost.

## What are the failure modes of greedy decoding?

The most extensive analysis of greedy decoding's failure modes is Holtzman, Buys, Du, Forbes, and Choi's 2020 ICLR paper *The Curious Case of Neural Text Degeneration* (arXiv:1904.09751) [5]. Working with GPT-2 Large (762M parameters), they show that maximisation-based decoders, both greedy and beam search, produce systematically degenerate text on open-ended generation tasks even when applied to strong base language models. As the paper puts it, maximisation-based decoding produces "output text that is bland, incoherent, or gets stuck in repetitive loops" [5].

**Repetition loops.** The most visible failure is that greedy falls into repetitive cycles. After a phrase appears once, the model assigns it slightly higher probability the next time around, reinforcing the loop until the same fragment repeats indefinitely. Holtzman et al. trace this to a self-amplifying feedback dynamic: the highest-probability continuation is the one that has just occurred [5]. The effect is robust across model scales and is one of the main reasons GPT-2 and earlier open-ended generators looked so bad at long-form generation under maximisation-based decoding. The same paper reports that the problem does not vanish with a wider beam: beam search at width 32 still produces degenerate repetition, and at beam widths of 64 or more GPT-2 Large and XL tend to stop generating almost immediately after the prompt [5].

**Mode collapse to high-frequency tokens.** Even when greedy does not loop, it tends to collapse onto bland, high-frequency tokens. The same pattern shows up in machine translation as a preference for short safe sentences and in dialogue as the "I don't know" attractor familiar from neural chatbots.

**Lack of diversity.** Because the decoder is deterministic, every prompt produces a single completion. Greedy cannot explore the space of plausible answers, which is fatal for creative writing and brainstorming.

**Brittleness on long horizons.** Local optimality compounds. A small mistake at token 30 forces the model into an awkward region for the next few hundred tokens.

Holtzman et al. propose [nucleus sampling](/wiki/top_p_sampling) (top-$p$) as an alternative that preserves the high-probability tokens the model is confident about while still admitting enough randomness to break the repetition feedback loop [5]. Min-$p$ sampling (Nguyen et al. 2024) [7] and typical sampling (Meister et al. 2023) [6] attack the same problem from slightly different angles.

## When should you use greedy decoding?

Despite the failure modes, greedy is the right choice for a large class of tasks and the default in most evaluation harnesses.

**Tasks with a single correct answer.** Math problems, multiple-choice questions, code generation against a specification, classification, span extraction, and most agentic tool use have a target output that is right or wrong. The model's job is to put high probability on the right tokens; greedy reads them off. Many evaluations on code benchmarks such as [MBPP](/wiki/mbpp), HumanEval, and APPS report `pass@1` using greedy decoding and `pass@k` for larger $k$ using temperature sampling.

**Reproducibility.** A regression test that runs at temperature 0 and pins exact output strings is load-bearing infrastructure for many production LLM systems. Sampling outputs are essentially impossible to test this way without controlling every layer of randomness.

**Function calling and structured output.** When the model is expected to emit JSON, XML, a function-call schema, or a SQL query, you almost always want temperature 0. Sampling adds the risk of a syntactic error that breaks the consumer. "Strict" or "JSON mode" provider settings are commonly paired with temperature 0, although providers do not necessarily require it.

**Constrained decoding.** In constrained generation (Outlines, JSONFormer, grammar-constrained decoding) the decoder masks invalid tokens to $-\infty$ before the argmax, giving deterministic, schema-correct output. This is the dominant pattern for production tool-use pipelines.

**LLM-as-judge and grading.** When one model grades another, the verdict needs to be stable across reruns. Most LLM-as-judge protocols specify temperature 0.

## When should you not use greedy decoding?

**Open-ended generation.** Stories, poems, marketing copy, brainstorming, and chat with a user who expects a varied tone all suffer from greedy's blandness and repetition. Top-$p$ sampling with $p \approx 0.9$ and temperature around 0.7 to 1.0 is the conventional setting.

**Dialogue.** Conversational agents that always produce the same response to the same prompt feel mechanical. A small amount of randomness is enough to make the agent feel responsive.

**[Machine translation](/wiki/machine_translation).** NMT systems have used beam search since the original seq2seq and attention papers. Sutskever, Vinyals, and Le (2014) used a left-to-right beam search and reported results up to a beam of 12, although they found a beam of 2 already captures most of the gain [1]. Bahdanau et al. (2015) used beam search [2], and the original [Transformer](/wiki/transformer) paper (Vaswani et al. 2017) used a beam of 4 with length penalty $\alpha = 0.6$, reaching 28.4 BLEU on WMT'14 English-to-German and 41.8 BLEU on English-to-French [3]. Translation has enough word-order and word-choice variation that beam search is meaningfully better than greedy on BLEU.

**Benchmarks with `pass@k` for $k > 1$.** If the metric samples $k$ candidates and rewards "any one passes", greedy throws away the benefit. Code benchmarks evaluate `pass@1` with greedy and `pass@10` or `pass@100` with sampling at temperature 0.6 to 0.8.
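A back-of-the-envelope calculation shows why. The sketch below assumes, purely for illustration, that each sampled candidate passes independently with a fixed probability `q`; real samples from one model are not independent, so treat the numbers as an idealisation.

```python
def pass_at_k_independent(q, k):
    """Chance that at least one of k independent samples passes."""
    return 1.0 - (1.0 - q) ** k

q = 0.3  # hypothetical per-sample pass rate on one problem
for k in (1, 10, 100):
    print(k, round(pass_at_k_independent(q, k), 4))

# Greedy returns the same candidate every time, so drawing it k times adds
# nothing: on a given problem its pass@k is either 0 or 1, whatever k is.
```

Under this idealisation a 30% per-sample pass rate becomes roughly 97% at $k = 10$, a gain that is only available to a decoder that can produce different candidates.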

## API knobs: how do you turn on greedy decoding?

Most LLM APIs do not expose a literal "greedy" switch. They expose `temperature` and `top_p`, and you reach greedy by setting temperature to 0 and leaving top_p at 1 or unspecified.

| Provider | Greedy approximation |
|---|---|
| OpenAI Chat Completions | `temperature=0`, optional `top_p=1`, optional `seed` |
| Anthropic Messages | `temperature=0`, optional `top_p` and `top_k` |
| Hugging Face `generate` | `do_sample=False` (the literal switch, and the default) |
| vLLM `SamplingParams` | `temperature=0` or `top_k=1` |
| Google Gemini | `temperature=0`, optional `top_k=1` |
| llama.cpp | `--temp 0` or `--top-k 1` |

Temperature 0 in a hosted API is not always literally greedy. Some providers add a small floor to avoid division by zero, route requests across hardware that breaks ties differently, or apply tie-breaking rules that are not byte-identical across runs. OpenAI's documentation explicitly does not guarantee bit-exact reproducibility even with `seed` and `temperature=0` [16]. Anthropic's documentation makes a similar disclaimer [17]. Outputs are usually stable across many runs but can drift over months as model snapshots roll out behind the scenes. Hugging Face `generate` is the cleanest case: when `do_sample=False` and `num_beams=1` (the default), the decoder is a literal local argmax, and the output depends only on the prompt, the weights, and the floating-point determinism of the underlying kernels [14].

## Implementation notes

A minimal Hugging Face Transformers example:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B", device_map="auto")

inputs = tok("The capital of France is", return_tensors="pt").to(model.device)

out = model.generate(
    **inputs,
    max_new_tokens=8,
    do_sample=False,      # this is greedy
    num_beams=1,          # default; explicit for clarity
)
print(tok.decode(out[0], skip_special_tokens=True))
```

In `transformers`, `do_sample=False` plus `num_beams=1` is the default decoding strategy and dispatches to the greedy search implementation, which is a thin wrapper around the loop sketched earlier [14]. The `LogitsProcessor` chain (repetition penalty, no-repeat-n-gram, bad-words-mask, prefix-allowed-tokens) is applied to the logits before the argmax, so greedy combined with a repetition penalty is a common recipe for taming the worst loops without giving up determinism. In vLLM, the equivalent is `SamplingParams(temperature=0)` or `SamplingParams(top_k=1)`; vLLM implements the `temperature=0` path as a true argmax rather than a numerical limit [15]. For `llama.cpp` the corresponding CLI flags are `--temp 0` or `--top-k 1`.
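The greedy-plus-repetition-penalty recipe can be sketched without any library. The rule below (divide positive logits by the penalty, multiply negative ones) is modelled on the commonly used CTRL-style penalty; it is an assumption of this sketch that it matches a given library's processor, so check the library documentation before relying on the exact behaviour. Logits and history are made up.

```python
def apply_repetition_penalty(logits, seen_tokens, penalty=1.3):
    """Down-weight tokens that already appear in the sequence.

    Positive logits are divided by the penalty and negative logits are
    multiplied by it, so a repeated token always becomes less likely.
    """
    out = list(logits)
    for t in set(seen_tokens):
        out[t] = out[t] / penalty if out[t] > 0 else out[t] * penalty
    return out

def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])

logits = [2.0, 1.8, -0.5]   # hypothetical; token 0 narrowly wins
history = [0, 0, 2]         # token 0 has already been emitted twice

print(argmax(logits))                                      # 0
print(argmax(apply_repetition_penalty(logits, history)))   # 1
```

The penalty is a deterministic function of the history, so the decoder remains reproducible while being nudged out of a loop.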

## Cost compared with other decoders

| Strategy | Memory overhead | Per-token compute on top of forward pass | Wall clock vs greedy |
|---|---|---|---|
| Greedy | 1 KV cache | argmax over $\|V\|$ logits | 1.0x |
| Beam search, width $k$ | $k$ KV caches | top-$k$ over $k \cdot \|V\|$ scores | roughly $k$x |
| Top-$p$ / top-$k$ sampling | 1 KV cache | sort or top-$k$ over $\|V\|$ logits, plus an RNG draw | very close to 1.0x |
| Speculative decoding | target + draft KV caches | draft forward + verification | typically 1.5x to 3.0x speedup over greedy |

Speculative decoding is unusual: it is not a different decoding rule but a way of executing a chosen rule (greedy, sampling) faster by using a small draft model to propose tokens that the larger target model verifies in parallel (Leviathan, Kalman, and Matias 2022 [8]; Chen, Borgeaud, Irving, Lespiau, Sifre, and Jumper 2023 [9]). When the underlying decoder is greedy, speculative decoding produces the same tokens as plain greedy on the target model, up to floating-point differences introduced by the batched verification pass, just faster.
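The equivalence with plain greedy can be illustrated with a toy sketch. Both "models" below are hypothetical functions from a token prefix to a next token, and the verification loop calls the target once per position, whereas a real engine would score all drafted positions in one batched forward pass. The parameter name `gamma` (number of drafted tokens per round) is an assumption of this sketch.

```python
def target_next(tokens):
    """Hypothetical target model's greedy token: count upward modulo 7."""
    return (tokens[-1] + 1) % 7

def draft_next(tokens):
    """Hypothetical cheap draft model: agrees with the target except after a 3."""
    return 0 if tokens[-1] == 3 else (tokens[-1] + 1) % 7

def greedy(prompt, n):
    tokens = list(prompt)
    for _ in range(n):
        tokens.append(target_next(tokens))
    return tokens[len(prompt):]

def speculative_greedy(prompt, n, gamma=4):
    tokens = list(prompt)
    while len(tokens) - len(prompt) < n:
        # 1. The draft model proposes gamma tokens autoregressively.
        draft = []
        for _ in range(gamma):
            draft.append(draft_next(tokens + draft))
        # 2. The target checks each proposed position in turn.
        for i in range(gamma + 1):
            expected = target_next(tokens + draft[:i])
            if i == gamma or draft[i] != expected:
                # 3. Keep the agreed prefix plus the target's own token.
                tokens.extend(draft[:i] + [expected])
                break
    return tokens[len(prompt):len(prompt) + n]

print(greedy([0], 12))
print(speculative_greedy([0], 12))
```

Every accepted token is one the target would have chosen itself, and the first disagreement is replaced by the target's choice, so the draft model affects only speed and never the output.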

## Special cases and extensions

**Constrained decoding.** Greedy plus a token mask is the standard recipe for grammar-correct generation. The mask sets disallowed tokens to $-\infty$ before the argmax, so the highest-probability *valid* token is chosen at each step. This gives JSON-mode, regex-constrained, and CFG-constrained outputs that are guaranteed to parse.
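A minimal sketch of the masking step, with a made-up vocabulary and logits. In a real system the set of allowed token ids would come from a grammar or schema engine; here it is hard-coded as the hypothetical variable `allowed`.

```python
import math

VOCAB = ["{", "}", '"key"', ":", "hello", "42"]

def constrained_argmax(logits, allowed):
    """Greedy pick restricted to the token ids in `allowed`."""
    masked = [x if i in allowed else -math.inf for i, x in enumerate(logits)]
    return max(range(len(masked)), key=lambda i: masked[i])

logits = [0.1, 0.3, 1.0, 0.2, 2.5, 1.7]   # hypothetical; "hello" scores highest

# Hypothetical grammar state 1: a JSON object must start with "{".
print(VOCAB[constrained_argmax(logits, {0})])      # {
# Hypothetical grammar state 2: after "{", only "}" or a key is valid.
print(VOCAB[constrained_argmax(logits, {1, 2})])   # "key"
```

The unconstrained argmax would have emitted `hello`, which no JSON parser accepts at that position; the mask changes which token wins without changing the decoding rule.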

**Classifier-free guidance for language models.** CFG, originally a diffusion-model trick, has been adapted to language models (Sanchez et al. 2023) [10]. The decoder mixes logits from a conditional and an unconditional pass, $\ell_{\text{cfg}} = (1 + w)\ell_{\text{cond}} - w \ell_{\text{uncond}}$, and takes the argmax from $\ell_{\text{cfg}}$. With greedy this stays deterministic and cheap, but requires two forward passes per token, which doubles the compute.
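The mixing formula translates directly into code. The logits below are invented, and `cfg_logits` is an illustrative helper that implements only the formula above, not any particular library's guidance implementation.

```python
def cfg_logits(cond, uncond, w):
    """l_cfg = (1 + w) * l_cond - w * l_uncond, applied per token."""
    return [(1 + w) * c - w * u for c, u in zip(cond, uncond)]

def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])

cond = [2.0, 2.2, 0.1]     # hypothetical logits given the prompt
uncond = [0.5, 2.0, 0.0]   # hypothetical logits without the prompt

print(argmax(cfg_logits(cond, uncond, w=0.0)))   # 1: plain greedy on cond
print(argmax(cfg_logits(cond, uncond, w=1.5)))   # 0: guidance flips the choice
```

Token 1 is likely with or without the prompt, while token 0 is likely only because of the prompt, so guidance promotes token 0. With $w = 0$ the formula reduces to ordinary greedy decoding on the conditional logits.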

**Reasoning models.** OpenAI's [o1](/wiki/o1) and [DeepSeek R1](/wiki/deepseek_r1) return a single visible answer per prompt, but that answer follows a long chain of thought that is typically generated with sampling rather than greedy decoding. Even when the externally visible decoding looks deterministic, the hidden reasoning trace usually is not.

## A short historical note

Greedy predates neural sequence models. It is the natural decoding rule for any locally-scored probabilistic generator and was the baseline against which beam search (used in speech recognition since the 1970s, in statistical machine translation through the 2000s) was compared. The modern case against greedy for open-ended text dates to 2018 and 2019: Fan, Lewis, and Dauphin's *Hierarchical Neural Story Generation* (2018) introduced top-$k$ sampling as an alternative [4], and Holtzman et al.'s *The Curious Case of Neural Text Degeneration* (arXiv 2019, published at ICLR 2020) crystallised the case against maximisation-based decoders for open-ended tasks [5]. Neural machine translation has stayed with beam search throughout, because the failure modes that bite open-ended generation are less severe when the conditioning is tight.

## Summary table

| Use case | Recommended decoder | Why |
|---|---|---|
| Code generation `pass@1`, math, classification, JSON mode, LLM-as-judge | Greedy | Single right answer; deterministic; cheapest |
| Code generation `pass@k`, [chain-of-thought](/wiki/chain_of_thought) self-consistency | Sampling, $\tau \approx 0.6$, top-$p \approx 0.95$ | Need diverse candidates |
| Translation | Beam search, width 4 to 12 | Higher BLEU, narrow target distribution |
| Open-ended generation, dialogue, brainstorming | Top-$p$ sampling, $p \approx 0.9$, $\tau \approx 0.7$ to $1.0$ | Avoids degeneration, gives variety |
| Schema-constrained output | Greedy plus token mask | Guarantees valid syntax |
| Latency-sensitive deployment | Speculative decoding with greedy verification | Same tokens as greedy on the target (up to floating-point effects), just faster |

## See also

- [Beam search](/wiki/beam_search)
- [Temperature](/wiki/temperature)
- [Top-p and top-k sampling](/wiki/top_p_sampling)
- [Speculative decoding](/wiki/speculative_decoding)
- [Softmax](/wiki/softmax)
- [Logits](/wiki/logits)
- [Transformer](/wiki/transformer)
- [Language model](/wiki/language_model)
- [Large language models](/wiki/llm)
- [Autoregressive](/wiki/autoregressive)

## References

1. Sutskever, I., Vinyals, O., and Le, Q. V. (2014). *Sequence to Sequence Learning with Neural Networks*. NeurIPS 2014. arXiv:1409.3215.
2. Bahdanau, D., Cho, K., and Bengio, Y. (2015). *Neural Machine Translation by Jointly Learning to Align and Translate*. ICLR 2015. arXiv:1409.0473.
3. Vaswani, A. et al. (2017). *Attention Is All You Need*. NeurIPS 2017. arXiv:1706.03762.
4. Fan, A., Lewis, M., and Dauphin, Y. (2018). *Hierarchical Neural Story Generation*. ACL 2018. arXiv:1805.04833.
5. Holtzman, A., Buys, J., Du, L., Forbes, M., and Choi, Y. (2020). *The Curious Case of Neural Text Degeneration*. ICLR 2020. arXiv:1904.09751.
6. Meister, C., Pimentel, T., Wiher, G., and Cotterell, R. (2023). *Locally Typical Sampling*. TACL 2023. arXiv:2202.00666.
7. Nguyen, M. et al. (2024). *Min-P Sampling: Balancing Creativity and Coherence at High Temperature*. arXiv:2407.01082.
8. Leviathan, Y., Kalman, M., and Matias, Y. (2022). *Fast Inference from Transformers via Speculative Decoding*. arXiv:2211.17192.
9. Chen, C., Borgeaud, S., Irving, G., Lespiau, J.-B., Sifre, L., and Jumper, J. (2023). *Accelerating Large Language Model Decoding with Speculative Sampling*. arXiv:2302.01318.
10. Sanchez, G., Spangher, A., Fan, H., Levi, E., and Biderman, S. (2023). *Stay on Topic with Classifier-Free Guidance*. arXiv:2306.17806.
11. Jurafsky, D., and Martin, J. H. *Speech and Language Processing*, 3rd edition draft.
12. Goodfellow, I., Bengio, Y., and Courville, A. (2016). *Deep Learning*. MIT Press.
13. Eisenstein, J. (2019). *Introduction to Natural Language Processing*. MIT Press.
14. Hugging Face Transformers documentation, *Text generation strategies*. https://huggingface.co/docs/transformers/generation_strategies
15. vLLM documentation, *Sampling parameters*. https://docs.vllm.ai/en/latest/api/inference_params.html
16. OpenAI API reference, *Chat completions*. https://platform.openai.com/docs/api-reference/chat
17. Anthropic API reference, *Messages*. https://docs.anthropic.com/en/api/messages

