Greedy decoding
Last reviewed
May 2, 2026
Sources
17 citations
Review status
Source-backed
Revision
v1 · 3,042 words
Greedy decoding (also called greedy search or argmax decoding) is the simplest decoding strategy used by autoregressive language models. At every step the model emits a probability distribution over its vocabulary, and greedy decoding picks the highest-probability token and appends it to the sequence. Nothing is sampled, no alternatives are tracked, and no future tokens are considered. The loop repeats until the model produces an end-of-sequence token or hits a length limit.
It is the cheapest decoding rule available and the one most people reach for when they want reproducible output. It is what the popular temperature=0 setting on the OpenAI and Anthropic APIs approximates, and what do_sample=False in Hugging Face generate and temperature=0 or top_k=1 in vLLM select. Virtually every other inference stack offers an equivalent setting.
The trade-off is that greedy decoding is locally optimal but not globally optimal. By committing to the highest-probability token at every step, the decoder routinely walks into low-probability sequences and into well-documented failure modes such as repetition loops and dull generic prose. For tasks with a single correct answer (math, code generation, classification, function calling, structured output) greedy is usually the right default. For creative writing, dialogue, and translation, it usually is not.
Let $V$ be the model vocabulary and $x_{<t} = (x_1, x_2, \dots, x_{t-1})$ be the tokens already generated together with the prompt. An autoregressive language model defines a conditional distribution
$$ P_\theta(x_t \mid x_{<t}) $$
over the next token. Greedy decoding produces the next token by taking the argmax of this distribution:
$$ x_t = \arg\max_{v \in V} P_\theta(v \mid x_{<t}). $$
The full sequence is built by repeating this rule:
$$ x_1 = \arg\max_v P_\theta(v \mid x_{<1}), \quad x_2 = \arg\max_v P_\theta(v \mid x_{<2}), \quad \dots $$
until the special end-of-sequence token (often </s>, <|endoftext|>, or <|eot_id|> depending on the tokenizer) is emitted, or until a max_tokens budget is reached.
The decision is made on the post-softmax distribution, but because softmax is monotonic in the logits, the same token is selected by argmax over raw logits. Production implementations skip the softmax for greedy decoding and run argmax on logits directly.
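The equivalence between argmax over logits and argmax over probabilities can be checked directly. The following is a minimal sketch in plain Python; the logits are made-up numbers for a five-token vocabulary, and the helper names softmax and argmax are illustrative rather than taken from any library.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    """Index of the largest value; the first index wins on exact ties."""
    return max(range(len(values)), key=lambda i: values[i])

# Hypothetical logits for a 5-token vocabulary.
logits = [2.0, -1.0, 0.5, 3.2, 3.1]
probs = softmax(logits)

# Softmax preserves the ordering of the logits, so both argmaxes agree.
same_token = argmax(logits) == argmax(probs)
print(argmax(logits), argmax(probs), same_token)
```

Because the ordering is preserved, the normalisation step buys nothing for greedy decoding, which is why it can be skipped.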
The full procedure is roughly five lines of pseudocode:
```
input:  prompt tokens x_<1
output: completion tokens x_1, x_2, ..., x_T

repeat:
    logits = model.forward(x_<t)
    x_t = argmax(logits[-1])      # last position only
    append x_t to the sequence
until x_t == EOS or t == max_tokens
```
The model.forward call returns logits for every position in the sequence, but greedy decoding only needs the last row. Modern inference engines exploit a KV cache so that step $t$ runs only the new token through the network and reuses cached keys and values for the previous positions. With the cache, the number of token positions processed grows linearly with the output length rather than quadratically, although each new token still attends over the whole cached context.
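The loop can be made runnable with a stand-in for the model. The sketch below assumes a hypothetical toy "model", toy_forward, that looks only at the last token and returns hard-coded logits; the vocabulary, the table, and the function names are all invented for illustration. There is no KV cache here, only the decoding rule.

```python
EOS = 0
VOCAB = ["<eos>", "the", "cat", "sat", "down"]

# Hypothetical next-token logits, keyed by the id of the last token.
TOY_LOGITS = {
    0: [1.0, 0.0, 0.0, 0.0, 0.0],
    1: [0.0, 0.1, 2.0, 0.5, 0.2],   # after "the"
    2: [0.1, 0.0, 0.2, 1.5, 0.3],   # after "cat"
    3: [0.4, 0.2, 0.0, 0.1, 1.8],   # after "sat"
    4: [2.5, 0.3, 0.1, 0.0, 0.2],   # after "down"
}

def toy_forward(tokens):
    """Stand-in for model.forward: logits for the last position only."""
    return TOY_LOGITS[tokens[-1]]

def greedy_decode(forward, prompt, eos_id, max_tokens):
    """Append the argmax token until EOS is emitted or the budget runs out."""
    tokens = list(prompt)
    completion = []
    for _ in range(max_tokens):
        logits = forward(tokens)
        next_id = max(range(len(logits)), key=lambda i: logits[i])
        tokens.append(next_id)
        completion.append(next_id)
        if next_id == eos_id:
            break
    return completion

completion = greedy_decode(toy_forward, prompt=[1], eos_id=EOS, max_tokens=10)
words = [VOCAB[i] for i in completion]
print(words)
```

Swapping toy_forward for a real model's forward pass (and taking the last row of its logits) gives the same control flow that production decoders implement.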
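Ties in the argmax are rare but possible. PyTorch's torch.argmax and NumPy's argmax both return the first (lowest) index among tied maxima, but GPU kernels can accumulate floating-point sums in a different order depending on batch size and hardware, so two logits that are nearly tied can come out in a different order from one run to the next. This is one reason that two systems running "the same" greedy decoding can produce different completions.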
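Both effects can be reproduced without a GPU. The sketch below is an illustration in plain Python, not a model of any particular kernel: it shows first-index tie-breaking, and then how summing the same three numbers in a different order changes the last bit of a logit and flips the argmax against a competitor.

```python
def argmax(values):
    """Index of the largest value; the first index wins on exact ties."""
    return max(range(len(values)), key=lambda i: values[i])

# Exact tie between indices 1 and 2: the lower index is returned.
tied = [1.0, 3.0, 3.0]
tie_winner = argmax(tied)

# The same three terms, accumulated in two different orders.
# Floating-point addition is not associative, so the sums differ slightly.
a = (0.1 + 0.2) + 0.3
b = 0.1 + (0.2 + 0.3)

# Token 0 has a fixed logit; token 1's logit is the accumulated sum.
winner_order_a = argmax([0.6, a])
winner_order_b = argmax([0.6, b])
print(tie_winner, a == b, winner_order_a, winner_order_b)
```

In a real model the gap between the top two logits is usually far larger than rounding error, which is why such flips are rare; once one occurs, though, every later token is conditioned on a different prefix.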
Greedy is a degenerate case of two more general families.
It is beam search with width 1. Beam search keeps the top-$k$ partial hypotheses at each step, expands each by every vocabulary token, scores the resulting candidates by joint log-probability, and prunes back to the top $k$. With $k = 1$ only one hypothesis is alive and the procedure collapses to greedy.
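The collapse to greedy at width 1 can be checked on a toy model. The sketch below is a deliberately simple beam search with no length penalty, written for illustration only; the table TABLE and the functions forward, greedy, and beam_search are hypothetical names, and the "model" conditions only on the last token.

```python
import math

EOS = 0
# Hypothetical next-token logits, keyed by the id of the last token.
TABLE = {
    0: [0.0, 0.0, 0.0, 0.0],
    1: [0.1, 0.2, 1.2, 1.0],
    2: [0.3, 0.1, 0.2, 2.0],
    3: [2.0, 0.5, 0.1, 0.3],
}

def forward(tokens):
    return TABLE[tokens[-1]]

def log_softmax(logits):
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def greedy(forward, prompt, eos_id, max_tokens):
    tokens = list(prompt)
    for _ in range(max_tokens):
        logits = forward(tokens)
        tokens.append(max(range(len(logits)), key=lambda i: logits[i]))
        if tokens[-1] == eos_id:
            break
    return tokens[len(prompt):]

def beam_search(forward, prompt, width, eos_id, max_tokens):
    # Each hypothesis is (joint log-probability, tokens, finished flag).
    beams = [(0.0, list(prompt), False)]
    for _ in range(max_tokens):
        candidates = []
        for score, tokens, done in beams:
            if done:
                candidates.append((score, tokens, True))
                continue
            for v, lp in enumerate(log_softmax(forward(tokens))):
                candidates.append((score + lp, tokens + [v], v == eos_id))
        # Stable sort: on equal scores the lower token id stays first.
        candidates.sort(key=lambda c: -c[0])
        beams = candidates[:width]
        if all(done for _, _, done in beams):
            break
    return beams[0][1][len(prompt):]

greedy_out = greedy(forward, [1], EOS, 10)
beam1_out = beam_search(forward, [1], 1, EOS, 10)
print(greedy_out, beam1_out)
```

With width 1 the pruning step keeps only the single best one-token extension, which is exactly the argmax, so the two outputs coincide.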
It is also temperature-zero sampling. Temperature $\tau$ rescales logits before the softmax, giving $P_\tau(v) \propto \exp(\ell_v / \tau)$. As $\tau \to 0^+$ the distribution concentrates entirely on the highest-logit token. The decoder is also equivalent to top-1 sampling (top-k with $k = 1$).
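The concentration of mass as $\tau \to 0^+$ is easy to see numerically. The following sketch uses made-up logits and a hypothetical helper, softmax_with_temperature, to print the distribution at a few temperatures.

```python
import math

def softmax_with_temperature(logits, tau):
    """Softmax of logits / tau, computed stably; tau must be > 0."""
    scaled = [x / tau for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits; the last token has the highest logit.
logits = [1.0, 2.0, 2.5]

top_prob = {}
for tau in (1.0, 0.5, 0.1, 0.01):
    probs = softmax_with_temperature(logits, tau)
    top_prob[tau] = probs[2]
    print(tau, [round(p, 4) for p in probs])
```

The function divides by tau, so tau = 0 itself is undefined here. That is the reason implementations tend to special-case temperature 0 as an argmax instead of evaluating the limit numerically.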
| Strategy | Description | Determinism | Greedy is recovered when |
|---|---|---|---|
| Greedy | Pick argmax at every step | Deterministic | n/a |
| Beam search | Keep top-$k$ partial sequences | Deterministic | $k = 1$ |
| Temperature sampling | Sample from softmax with temperature $\tau$ | Stochastic | $\tau \to 0$ |
| Top-$k$ sampling | Sample from the $k$ highest-probability tokens (Fan et al. 2018) | Stochastic | $k = 1$ |
| Top-$p$ / nucleus sampling | Sample from the smallest set with cumulative probability $\ge p$ (Holtzman et al. 2020) | Stochastic | $p \to 0$ |
| Min-$p$ sampling | Sample from tokens with probability $\ge p \cdot p_{\max}$ (Nguyen et al. 2024) | Stochastic | $p = 1$ |
| Typical sampling | Sample tokens with information content close to the distribution's entropy (Meister et al. 2023) | Stochastic | does not directly recover greedy |
| Speculative decoding | Draft model proposes tokens, target verifies in parallel (Leviathan et al. 2022; Chen et al. 2023) | Matches underlying decoder | Underlying decoder is greedy |
Greedy is therefore both a member of the deterministic family (with beam search) and the limit case of most stochastic samplers. Most practical decoders combine one of these strategies with engineering tweaks (repetition penalty, presence penalty, banned tokens, logit bias).
Determinism. Given the same prompt, the same model weights, and the same numerical kernels, greedy decoding always produces the same output. This is its single most useful property. Reproducibility is essential for evaluation harnesses, regression tests, and any production system that needs the same input to map to the same output. Sampling decoders introduce a random seed that has to be controlled, and numerical non-determinism on GPUs can still leak in.
Local optimality. Greedy maximises $P_\theta(x_t \mid x_{<t})$ at each step, but the joint probability of the whole sequence is $P_\theta(x_1, \dots, x_T) = \prod_{t=1}^T P_\theta(x_t \mid x_{<t})$, and the highest-probability prefix at step $t$ is not in general a prefix of the highest-probability complete sequence. A token that scores 0.51 at step 1 might force the model into a region where every continuation is poor, while a token scoring 0.49 might lead to a much higher joint probability. Beam search and exact search fix this at the cost of compute; greedy does not.
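The 0.51 versus 0.49 scenario can be worked through on a two-step toy distribution. The numbers in STEP1 and STEP2 below are invented for illustration; exhaustive enumeration stands in for exact search, which is only feasible because the toy space has six sequences.

```python
from itertools import product

# Hypothetical two-step model: P(first) and P(second | first).
STEP1 = {"A": 0.51, "B": 0.49}
STEP2 = {
    "A": {"x": 0.40, "y": 0.30, "z": 0.30},
    "B": {"x": 0.90, "y": 0.05, "z": 0.05},
}

def joint(first, second):
    """Joint probability of the two-token sequence."""
    return STEP1[first] * STEP2[first][second]

# Greedy: commit to the best first token, then the best second token.
g_first = max(STEP1, key=STEP1.get)
g_second = max(STEP2[g_first], key=STEP2[g_first].get)
greedy_seq = (g_first, g_second)

# Exact search: score every complete sequence.
best_seq = max(product(STEP1, ["x", "y", "z"]), key=lambda s: joint(*s))

print(greedy_seq, joint(*greedy_seq))
print(best_seq, joint(*best_seq))
```

Greedy commits to "A" because 0.51 > 0.49 and can then do no better than 0.51 × 0.40, whereas the sequence starting with "B" reaches 0.49 × 0.90, more than twice the joint probability.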
Compute cost. Per generated token, greedy adds nothing on top of the model's forward pass beyond a single argmax over the vocabulary, which is $O(|V|)$ and dwarfed by the forward pass itself. Beam search with width $k$ costs roughly $k$ times more memory and compute and requires a top-$k$ on a vector of size $k|V|$. Top-$p$ sampling needs a sort or partial sort of the vocabulary, which is $O(|V| \log |V|)$ but again negligible compared with the forward pass. In wall-clock terms, greedy and the standard sampling decoders are essentially the same speed; beam search is the one that pays a real cost.
The most extensive analysis of greedy decoding's failure modes is Holtzman, Buys, Du, Forbes, and Choi's 2020 ICLR paper The Curious Case of Neural Text Degeneration (arXiv:1904.09751). They show that maximisation-based decoders, both greedy and beam search, produce systematically degenerate text on open-ended generation tasks even when applied to strong base language models such as GPT-2.
Repetition loops. The most visible failure is that greedy falls into repetitive cycles. After a phrase appears once, the model assigns it slightly higher probability the next time around, reinforcing the loop until the same fragment repeats indefinitely. Holtzman et al. trace this to a self-amplifying feedback dynamic: the highest-probability continuation is the one that has just occurred. The effect is robust across model scales and is one of the main reasons GPT-2 and earlier open-ended generators looked so bad at long-form generation under maximisation-based decoding.
Mode collapse to high-frequency tokens. Even when greedy does not loop, it tends to collapse onto bland, high-frequency tokens. The same pattern shows up in machine translation as a preference for short safe sentences and in dialogue as the "I don't know" attractor familiar from neural chatbots.
Lack of diversity. Because the decoder is deterministic, every prompt produces a single completion. Greedy cannot explore the space of plausible answers, which is fatal for creative writing and brainstorming.
Brittleness on long horizons. Local optimality compounds. A small mistake at token 30 forces the model into an awkward region for the next few hundred tokens.
Holtzman et al. propose nucleus sampling (top-$p$) as an alternative that preserves the high-probability tokens the model is confident about while still admitting enough randomness to break the repetition feedback loop. Min-$p$ sampling (Nguyen et al. 2024) and typical sampling (Meister et al. 2023) attack the same problem from slightly different angles.
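The two truncation rules named above, nucleus (top-$p$) and min-$p$, differ only in how they choose the set of tokens to keep before sampling. The sketch below follows the definitions given in this article; the function names and the example distribution are illustrative assumptions, not any library's API.

```python
import random

def top_p_filter(probs, p):
    """Indices of the smallest high-probability set with cumulative mass >= p."""
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return kept

def min_p_filter(probs, p):
    """Indices of tokens whose probability is at least p times the maximum."""
    threshold = p * max(probs)
    return [i for i, q in enumerate(probs) if q >= threshold]

def sample_from(probs, kept, rng):
    """Sample one kept index, with weights proportional to its probability."""
    weights = [probs[i] for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]

# Hypothetical next-token distribution over five tokens.
probs = [0.5, 0.3, 0.15, 0.04, 0.01]

nucleus = top_p_filter(probs, 0.9)
min_p_set = min_p_filter(probs, 0.5)
rng = random.Random(0)
token = sample_from(probs, nucleus, rng)
print(nucleus, min_p_set, token)
```

Both filters drop the low-probability tail while leaving several candidates alive, which is what lets sampling break a repetition loop. Pushing $p$ toward 0 in top_p_filter, or to 1 in min_p_filter, shrinks the kept set to the argmax token alone.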
Despite the failure modes, greedy is the right choice for a large class of tasks and the default in most evaluation harnesses.
Tasks with a single correct answer. Math problems, multiple-choice questions, code generation against a specification, classification, span extraction, and most agentic tool use have a target output that is right or wrong. The model's job is to put high probability on the right tokens; greedy reads them off. On code benchmarks such as MBPP, HumanEval, and APPS, pass@1 is commonly reported with greedy decoding and pass@k with temperature sampling.
Reproducibility. A regression test that runs at temperature 0 and pins exact output strings is load-bearing infrastructure for many production LLM systems. Sampling outputs are essentially impossible to test this way without controlling every layer of randomness.
Function calling and structured output. When the model is expected to emit JSON, XML, a function-call schema, or a SQL query, you almost always want temperature 0. Sampling adds the risk of a syntactic error that breaks the consumer. "Strict" or "JSON mode" provider settings usually imply or require greedy decoding.
Constrained decoding. In constrained generation (Outlines, JSONFormer, grammar-constrained decoding) the decoder masks invalid tokens to $-\infty$ before the argmax, giving deterministic, schema-correct output. This is the dominant pattern for production tool-use pipelines.
LLM-as-judge and grading. When one model grades another, the verdict needs to be stable across reruns. Most LLM-as-judge protocols specify temperature 0.
Greedy is a poor fit in several other settings.
Open-ended generation. Stories, poems, marketing copy, brainstorming, and chat with a user who expects a varied tone all suffer from greedy's blandness and repetition. Top-$p$ sampling with $p \approx 0.9$ and temperature around 0.7 to 1.0 is the conventional setting.
Dialogue. Conversational agents that always produce the same response to the same prompt feel mechanical. A small amount of randomness is enough to make the agent feel responsive.
Machine translation. NMT systems have used beam search since the original seq2seq and attention papers. Sutskever, Vinyals, and Le (2014) used a beam of 12, Bahdanau et al. (2015) used beam search, and the original Transformer paper (Vaswani et al. 2017) used a beam of 4 with length penalty $\alpha = 0.6$ for the WMT'14 baseline. Translation has enough word-order and word-choice variation that beam search is meaningfully better than greedy on BLEU.
Benchmarks with pass@k for $k > 1$. If the metric samples $k$ candidates and rewards "any one passes", greedy throws away the benefit, because it can only ever produce one distinct candidate. Code benchmarks typically evaluate pass@1 with greedy and pass@10 or pass@100 with sampling at temperature 0.6 to 0.8.
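For reference, pass@k is usually computed from $n \ge k$ samples, $c$ of which pass, with the unbiased estimator $1 - \binom{n-c}{k} / \binom{n}{k}$. That formula comes from the Codex evaluation paper (Chen et al. 2021), which this article does not otherwise cite, so treat the sketch below as background rather than as part of the source material. The function name pass_at_k is an illustrative choice.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples with c passing (requires k <= n)."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples fail, giving pass@k = 1.
    return 1.0 - comb(n - c, k) / comb(n, k)

# With a single greedy completion, n = k = 1 and the score is 0 or 1.
greedy_pass = pass_at_k(1, 1, 1)
greedy_fail = pass_at_k(1, 0, 1)

# With 10 sampled completions of which 3 pass:
sampled_pass_at_1 = pass_at_k(10, 3, 1)
sampled_pass_at_5 = pass_at_k(10, 3, 5)
print(greedy_pass, greedy_fail, sampled_pass_at_1, sampled_pass_at_5)
```

The estimator makes the trade-off concrete: a deterministic decoder contributes one sample per problem, so any gain at larger $k$ has to come from sampling diverse candidates.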
Most LLM APIs do not expose a literal "greedy" switch. They expose temperature and top_p, and you reach greedy by setting temperature to 0 and leaving top_p at 1 or unspecified.
| Provider | Greedy approximation |
|---|---|
| OpenAI Chat Completions | temperature=0, optional top_p=1, optional seed |
| Anthropic Messages | temperature=0, optional top_p and top_k |
| Hugging Face generate | do_sample=False (the literal switch, and the default) |
| vLLM SamplingParams | temperature=0 or top_k=1 |
| Google Gemini | temperature=0, optional top_k=1 |
| llama.cpp | --temp 0 or --top-k 1 |
Temperature 0 in a hosted API is not always literally greedy. Some providers add a small floor to avoid division by zero, route requests across hardware that breaks ties differently, or apply tie-breaking rules that are not byte-identical across runs. OpenAI's documentation explicitly does not guarantee bit-exact reproducibility even with seed and temperature=0. Anthropic's documentation makes a similar disclaimer. Outputs are usually stable across many runs but can drift over months as model snapshots roll out behind the scenes. Hugging Face generate is the cleanest case: when do_sample=False and num_beams=1, the decoder is a literal local argmax, and the output depends only on the prompt, the weights, and the floating-point determinism of the underlying kernels.
A minimal Hugging Face Transformers example:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Tokenize the prompt and move the tensors to the model's device.
inputs = tok("The capital of France is", return_tensors="pt").to(model.device)

out = model.generate(
    **inputs,
    max_new_tokens=8,
    do_sample=False,  # this is greedy
    num_beams=1,      # default; explicit for clarity
)

# generate returns a batch of sequences; decode the first (and only) one.
print(tok.decode(out[0], skip_special_tokens=True))
```
In transformers, do_sample=False plus num_beams=1 selects the library's greedy search path (greedy_search in older releases; the internal method name has changed across versions), which is a thin wrapper around the loop sketched earlier. The LogitsProcessor chain (repetition penalty, no-repeat-n-gram, bad-words-mask, prefix-allowed-tokens) is applied to the logits before the argmax, so greedy combined with a repetition penalty is a common recipe for taming the worst loops without giving up determinism.
In vLLM, the equivalent is SamplingParams(temperature=0) or SamplingParams(top_k=1). vLLM implements the temperature=0 path as a true argmax rather than a numerical limit. For llama.cpp the corresponding CLI flags are --temp 0 or --top-k 1.
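To show how a logits processor interacts with the argmax, here is a plain-Python sketch of a repetition penalty applied before token selection. It follows the commonly used rule of dividing positive logits and multiplying negative logits of already-seen tokens by the penalty; the function name and the numbers are illustrative, and this is not the transformers implementation itself.

```python
def apply_repetition_penalty(logits, previous_tokens, penalty):
    """Return a copy of logits with already-generated tokens made less likely."""
    out = list(logits)
    for t in set(previous_tokens):
        # Dividing a positive logit or multiplying a negative one both lower it.
        out[t] = out[t] * penalty if out[t] < 0 else out[t] / penalty
    return out

def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])

# Hypothetical logits: token 1 is the favourite and was already generated.
logits = [1.0, 2.0, 1.8]
previous_tokens = [1]

plain_choice = argmax(logits)
penalised = apply_repetition_penalty(logits, previous_tokens, penalty=1.2)
penalised_choice = argmax(penalised)
print(plain_choice, penalised_choice)
```

The penalty is a deterministic function of the tokens generated so far, so the decoder stays reproducible while the repeated token loses its edge.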
| Strategy | Memory overhead | Per-token compute on top of forward pass | Wall clock vs greedy |
|---|---|---|---|
| Greedy | 1 KV cache | argmax over $|V|$ logits | 1.0x |
| Beam search, width $k$ | $k$ KV caches | top-$k$ over $k \cdot |V|$ scores | roughly $k$x |
| Top-$p$ / top-$k$ sampling | 1 KV cache | sort or top-$k$ over $|V|$ logits, plus an RNG draw | very close to 1.0x |
| Speculative decoding | target + draft KV caches | draft forward + verification | typically 1.5x to 3.0x speedup over greedy |
Speculative decoding is unusual: it is not a different decoding rule but a way of executing a chosen rule (greedy, sampling) faster by using a small draft model to propose tokens that the larger target model verifies in parallel (Leviathan, Kalman, and Matias 2022; Chen, Borgeaud, Irving, Lespiau, Sifre, and Jumper 2023). When the underlying decoder is greedy, speculative decoding produces bit-identical outputs to plain greedy on the target model, just faster.
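The claim that greedy speculative decoding reproduces plain greedy output can be checked on toy models. The sketch below assumes two hypothetical last-token lookup tables, TARGET and DRAFT, that disagree in one state. The verification loop calls the target one position at a time for readability, whereas a real engine would score all drafted positions in a single batched forward pass, which is where the speedup comes from.

```python
EOS = 0
# Hypothetical next-token logits keyed by the last token id.
TARGET = {
    0: [1.0, 0.0, 0.0, 0.0, 0.0],
    1: [0.0, 0.1, 2.0, 0.5, 0.2],
    2: [0.1, 0.0, 0.2, 1.5, 0.3],
    3: [0.4, 0.2, 0.0, 0.1, 1.8],
    4: [2.5, 0.3, 0.1, 0.0, 0.2],
}
DRAFT = dict(TARGET)
DRAFT[3] = [0.4, 1.9, 0.0, 0.1, 0.2]   # the draft disagrees after token 3

def target(tokens):
    return TARGET[tokens[-1]]

def draft(tokens):
    return DRAFT[tokens[-1]]

def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])

def greedy(forward, prompt, eos_id, max_tokens):
    tokens = list(prompt)
    for _ in range(max_tokens):
        tokens.append(argmax(forward(tokens)))
        if tokens[-1] == eos_id:
            break
    return tokens[len(prompt):]

def speculative_greedy(target, draft, prompt, eos_id, max_tokens, gamma=3):
    tokens = list(prompt)
    limit = len(prompt) + max_tokens
    while len(tokens) < limit:
        # 1. The draft model proposes gamma tokens greedily.
        proposal, ctx = [], list(tokens)
        for _ in range(gamma):
            ctx.append(argmax(draft(ctx)))
            proposal.append(ctx[-1])
        # 2. The target checks each proposed position, plus one bonus position.
        ctx = list(tokens)
        for proposed in proposal + [None]:
            chosen = argmax(target(ctx))
            ctx.append(chosen)          # always keep the target's own choice
            if chosen != proposed or chosen == eos_id or len(ctx) >= limit:
                break                   # mismatch, end of sequence, or budget
        tokens = ctx
        if tokens[-1] == eos_id:
            break
    return tokens[len(prompt):]

plain = greedy(target, [1], EOS, 10)
speculative = speculative_greedy(target, draft, [1], EOS, 10)
print(plain, speculative)
```

Every token appended in step 2 is the target's own argmax, so the draft only influences how many positions are confirmed per round and never which tokens are emitted.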
Constrained decoding. Greedy plus a token mask is the standard recipe for grammar-correct generation. The mask sets disallowed tokens to $-\infty$ before the argmax, so the highest-probability valid token is chosen at each step. This gives JSON-mode, regex-constrained, and context-free-grammar-constrained outputs that are guaranteed to parse, provided generation is not cut off by the token budget before the structure is closed.
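The masking step itself is tiny. The sketch below assumes a hypothetical four-token vocabulary and a schema that permits only "yes" or "no" at the current position; in a real system the allowed set would be recomputed at every step from a grammar or schema state machine, which is not shown here.

```python
import math

VOCAB = ["yes", "no", "maybe", "<eos>"]

def masked_argmax(logits, allowed):
    """Argmax restricted to the allowed token ids (allowed must be non-empty)."""
    masked = [x if i in allowed else -math.inf for i, x in enumerate(logits)]
    return max(range(len(masked)), key=lambda i: masked[i])

# Hypothetical logits: the unconstrained favourite is "maybe".
logits = [1.0, 2.0, 3.5, 0.1]
allowed = {0, 1}   # the schema permits only "yes" or "no" at this position

unconstrained = VOCAB[max(range(len(logits)), key=lambda i: logits[i])]
constrained = VOCAB[masked_argmax(logits, allowed)]
print(unconstrained, constrained)
```

The mask never changes the relative order of the permitted tokens, so the result is still the model's most probable valid choice.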
Classifier-free guidance for language models. CFG, originally a diffusion-model trick, has been adapted to language models (Sanchez et al. 2023). The decoder mixes logits from a conditional and an unconditional pass, $\ell_{\text{cfg}} = (1 + w)\ell_{\text{cond}} - w \ell_{\text{uncond}}$, and takes the argmax from $\ell_{\text{cfg}}$. With greedy this stays deterministic and cheap, but requires two forward passes per token, which doubles the compute.
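The mixing formula translates directly into code. In the sketch below the conditional and unconditional logits are made-up numbers standing in for the two forward passes, and cfg_logits is an illustrative helper name.

```python
def cfg_logits(cond, uncond, w):
    """Classifier-free guidance mix: (1 + w) * cond - w * uncond, per token."""
    return [(1 + w) * c - w * u for c, u in zip(cond, uncond)]

def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])

# Hypothetical logits from the conditional and unconditional passes.
cond = [2.0, 1.9, 0.5]
uncond = [2.0, 1.0, 0.5]

plain_choice = argmax(cond)
guided = cfg_logits(cond, uncond, w=1.5)
guided_choice = argmax(guided)
print(plain_choice, guided_choice, guided)
```

Token 1 is the one the prompt raises most relative to the unconditional pass, so guidance promotes it above token 0. Setting w = 0 returns the conditional logits unchanged and recovers ordinary greedy decoding.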
Reasoning models. Models such as OpenAI's o1 and DeepSeek R1 return a single visible answer per prompt but first generate a long chain of thought, which is typically sampled rather than decoded greedily. Even when the externally visible decoding looks deterministic, the hidden reasoning trace usually is not.
Greedy predates neural sequence models. It is the natural decoding rule for any locally-scored probabilistic generator and was the baseline against which beam search (used in speech recognition since the 1970s, in statistical machine translation through the 2000s) was compared. The modern case against greedy for open-ended text dates to 2018 and 2019: Fan, Lewis, and Dauphin's Hierarchical Neural Story Generation (2018) introduced top-$k$ sampling as an alternative, and Holtzman et al.'s The Curious Case of Neural Text Degeneration (arXiv 2019, published at ICLR 2020) crystallised the case against maximisation-based decoders for open-ended tasks. Neural machine translation has stayed with beam search throughout, because the failure modes that bite open-ended generation are less severe when the conditioning is tight.
| Use case | Recommended decoder | Why |
|---|---|---|
| Code generation pass@1, math, classification, JSON mode, LLM-as-judge | Greedy | Single right answer; deterministic; cheapest |
| Code generation pass@k, chain-of-thought self-consistency | Sampling, $\tau \approx 0.6$, top-$p \approx 0.95$ | Need diverse candidates |
| Translation | Beam search, width 4 to 12 | Higher BLEU, narrow target distribution |
| Open-ended generation, dialogue, brainstorming | Top-$p$ sampling, $p \approx 0.9$, $\tau \approx 0.7$ to $1.0$ | Avoids degeneration, gives variety |
| Schema-constrained output | Greedy plus token mask | Guarantees valid syntax |
| Latency-sensitive deployment | Speculative decoding with greedy verification | Bit-identical to greedy on the target, just faster |