Greedy decoding

Greedy decoding (also called greedy search or argmax decoding) is the simplest decoding strategy used by autoregressive language models. At every step the model emits a probability distribution over its vocabulary, and greedy decoding picks the highest-probability token and appends it to the sequence. Nothing is sampled, no alternatives are tracked, and no future tokens are considered. The loop repeats until the model produces an end-of-sequence token or hits a length limit.

It is the cheapest decoding rule available and the one most people reach for when they want reproducible output. It is what the popular temperature=0 setting (and, where it is offered, top_k=1) selects, exactly or approximately, on hosted APIs such as OpenAI's and Anthropic's, in Hugging Face generate, in vLLM, and in virtually every other inference stack.

The trade-off is that greedy decoding is locally optimal but not globally optimal. By committing to the highest-probability token at every step, the decoder routinely walks into low-probability sequences and into well-documented failure modes such as repetition loops and dull generic prose. For tasks with a single correct answer (math, code generation, classification, function calling, structured output) greedy is usually the right default. For creative writing, dialogue, and translation, it usually is not.

Definition

Let $V$ be the model vocabulary and $x_{<t} = (x_1, x_2, \dots, x_{t-1})$ be the tokens already generated together with the prompt. An autoregressive language model defines a conditional distribution

$$ P_\theta(x_t \mid x_{<t}) $$

over the next token. Greedy decoding produces the next token by taking the argmax of this distribution:

$$ x_t = \arg\max_{v \in V} P_\theta(v \mid x_{<t}). $$

The full sequence is built by repeating this rule:

$$ x_1 = \arg\max_v P_\theta(v \mid x_{<1}), \quad x_2 = \arg\max_v P_\theta(v \mid x_{<2}), \quad \dots $$

until the special end-of-sequence token (often </s>, <|endoftext|>, or <|eot_id|> depending on the tokenizer) is emitted, or until a max_tokens budget is reached.

The decision is made on the post-softmax distribution, but because softmax is monotonic in the logits, the same token is selected by argmax over raw logits. Production implementations skip the softmax for greedy decoding and run argmax on logits directly.
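The equivalence between argmax over probabilities and argmax over logits can be checked directly. The snippet below is a minimal sketch in plain Python; the logit values are made up for illustration and the helper names are not taken from any library.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    """Index of the largest value; the lowest index wins on ties."""
    return max(range(len(values)), key=lambda i: values[i])

logits = [1.2, -0.7, 3.4, 3.1, 0.0]   # made-up logits for a 5-token vocabulary
probs = softmax(logits)

# Softmax preserves the ordering of the logits, so both argmaxes agree.
print(argmax(logits), argmax(probs))
```

Because the ordering is preserved, skipping the softmax changes nothing about which token is chosen; it only saves the exponentials and the normalisation.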

The algorithm

The full procedure is roughly five lines of pseudocode:

input: prompt tokens x_<1
output: completion tokens x_1, x_2, ..., x_T

repeat:
    logits = model.forward(x_<t)
    x_t = argmax(logits[-1])         # last position only
    append x_t to the sequence
until x_t == EOS or t == max_tokens

The model.forward call returns logits for every position in the sequence, but greedy decoding only needs the last row. Modern inference engines exploit a KV cache so that step $t$ runs only the new token through the network and reuses cached keys and values for the previous positions. With the cache, each step processes one token instead of re-running the whole prefix, so the number of token-level forward computations is linear in the output length rather than quadratic. Attention over the cached keys still grows with the context length.
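The loop above can be made concrete with a toy stand-in for the model. This is a sketch under stated assumptions: TOY_LOGITS, toy_forward, and the EOS id of 0 are hypothetical, and the "model" looks only at the last token, so no KV cache is involved.

```python
EOS = 0  # hypothetical end-of-sequence token id

# Hypothetical "model": next-token logits depend only on the last token.
# Row i holds the logits over a 4-token vocabulary after seeing token i.
TOY_LOGITS = {
    0: [0.0, 0.0, 0.0, 0.0],
    1: [0.1, 0.2, 2.0, 0.3],   # after 1, token 2 scores highest
    2: [0.5, 0.1, 0.2, 1.5],   # after 2, token 3 scores highest
    3: [2.5, 0.3, 0.1, 0.2],   # after 3, EOS scores highest
}

def toy_forward(tokens):
    """Stand-in for model.forward: returns logits for the last position only."""
    return TOY_LOGITS[tokens[-1]]

def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])

def greedy_decode(prompt, forward, max_tokens=16, eos=EOS):
    tokens = list(prompt)
    completion = []
    for _ in range(max_tokens):
        next_token = argmax(forward(tokens))   # pick the highest-scoring token
        tokens.append(next_token)              # feed it back in as context
        completion.append(next_token)
        if next_token == eos:                  # stop on end-of-sequence
            break
    return completion

print(greedy_decode([1], toy_forward))  # [2, 3, 0]
```

Swapping toy_forward for a real model's forward pass leaves the loop unchanged, which is why greedy is usually the first decoder an inference stack implements.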

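Exact ties in the argmax are rare but possible. torch.argmax and numpy.argmax both return the lowest index among tied maxima, but near-ties are fragile: floating-point reductions in GPU kernels can accumulate in a different order depending on batch size, kernel choice, and hardware, so logits that differ only in their last bits can swap order. This is one reason that two systems running "the same" greedy decoding can produce different completions.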
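As a small illustration of how accumulation order alone can change the selected token, the sketch below computes the same three-term sum in two orders. The numbers are contrived so that one order produces a value one rounding step above 0.6 and the other produces exactly 0.6; real logits are the result of far longer reductions, but the mechanism is the same.

```python
def argmax(values):
    # Python's max returns the first maximal element, so the lowest index wins ties.
    return max(range(len(values)), key=lambda i: values[i])

# The logit of token 1 is the same three-term sum, accumulated in two orders.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

logits_a = [0.6, left_to_right]   # token 1 is one rounding step above token 0
logits_b = [0.6, right_to_left]   # exact tie, so the lowest index is returned

print(left_to_right == right_to_left)       # False
print(argmax(logits_a), argmax(logits_b))   # 1 0
```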

Relation to other decoding strategies

Greedy is a degenerate case of two more general families.

It is beam search with width 1. Beam search keeps the top-$k$ partial hypotheses at each step, expands each by every vocabulary token, scores the resulting candidates by joint log-probability, and prunes back to the top $k$. With $k = 1$ only one hypothesis is alive and the procedure collapses to greedy.
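The collapse of beam search to greedy at $k = 1$ can be seen on a toy model. The following is a minimal sketch, not a production beam search: TOY_PROBS is a hypothetical table of next-token probabilities that depend only on the last token, token 0 plays the role of EOS, and there is no length normalisation.

```python
import math

EOS = 0

# Hypothetical next-token probabilities, conditioned on the last token only.
TOY_PROBS = {
    1: [0.05, 0.10, 0.45, 0.40],
    2: [0.10, 0.20, 0.30, 0.40],
    3: [0.90, 0.03, 0.03, 0.04],
}

def next_log_probs(tokens):
    return [math.log(p) for p in TOY_PROBS[tokens[-1]]]

def beam_search(prompt, k, max_tokens=8):
    beams = [(0.0, [])]   # each hypothesis: (joint log-probability, generated tokens)
    for _ in range(max_tokens):
        candidates = []
        for score, out in beams:
            if out and out[-1] == EOS:        # finished hypotheses carry over
                candidates.append((score, out))
                continue
            for tok, lp in enumerate(next_log_probs(prompt + out)):
                candidates.append((score + lp, out + [tok]))
        # Stable sort by descending score, then prune back to the top k.
        candidates.sort(key=lambda c: -c[0])
        beams = candidates[:k]
        if all(out[-1] == EOS for _, out in beams):
            break
    return beams[0][1]

def greedy(prompt, max_tokens=8):
    out = []
    for _ in range(max_tokens):
        lps = next_log_probs(prompt + out)
        out.append(max(range(len(lps)), key=lambda i: lps[i]))
        if out[-1] == EOS:
            break
    return out

print(greedy([1]))             # [2, 3, 0]
print(beam_search([1], k=1))   # [2, 3, 0]
print(beam_search([1], k=2))   # [3, 0]
```

With width 2 the search keeps the second-best first token alive long enough to discover a complete sequence with higher joint probability (0.40 × 0.90 = 0.36 against 0.45 × 0.40 × 0.90 = 0.162), which width 1 cannot do.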

It is also temperature-zero sampling. Temperature $\tau$ rescales logits before the softmax, giving $P_\tau(v) \propto \exp(\ell_v / \tau)$. As $\tau \to 0^+$ the distribution concentrates entirely on the highest-logit token. The decoder is also equivalent to top-1 sampling (top-k with $k = 1$).
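The temperature limit can be observed numerically. The sketch below uses made-up logits and a hypothetical helper, softmax_with_temperature, to show probability mass moving onto the highest-logit token as $\tau$ shrinks.

```python
import math

def softmax_with_temperature(logits, tau):
    """P_tau(v) proportional to exp(logit_v / tau), computed stably."""
    scaled = [l / tau for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]   # made-up logits; token 0 has the highest logit

for tau in (1.0, 0.5, 0.1, 0.01):
    probs = softmax_with_temperature(logits, tau)
    print(tau, [round(p, 4) for p in probs])
```

The function is undefined at exactly $\tau = 0$ (division by zero), which is why implementations typically special-case that value as an argmax instead of evaluating the formula.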

Strategy	Description	Determinism	Greedy is recovered when
Greedy	Pick argmax at every step	Deterministic	n/a
Beam search	Keep top-$k$ partial sequences	Deterministic	$k = 1$
Temperature sampling	Sample from softmax with temperature $\tau$	Stochastic	$\tau \to 0$
Top-$k$ sampling	Sample from the $k$ highest-probability tokens (Fan et al. 2018)	Stochastic	$k = 1$
Top-$p$ / nucleus sampling	Sample from the smallest set with cumulative probability $\ge p$ (Holtzman et al. 2020)	Stochastic	$p \to 0$
Min-$p$ sampling	Sample from tokens with probability $\ge p \cdot p_{\max}$ (Nguyen et al. 2024)	Stochastic	$p = 1$
Typical sampling	Sample tokens with information content close to the distribution's entropy (Meister et al. 2023)	Stochastic	does not directly recover greedy
Speculative decoding	Draft model proposes tokens, target verifies in parallel (Leviathan et al. 2022; Chen et al. 2023)	Matches underlying decoder	Underlying decoder is greedy

Greedy is therefore both a member of the deterministic family (with beam) and the limit case of most stochastic samplers, typical sampling being the exception in the table above. Most practical decoders combine one of these strategies with engineering tweaks (repetition penalty, presence penalty, banned tokens, logit bias).

Properties

Determinism. Given the same prompt, the same model weights, and the same numerical kernels, greedy decoding always produces the same output. This is its single most useful property. Reproducibility is essential for evaluation harnesses, regression tests, and any production system that needs the same input to map to the same output. Sampling decoders introduce a random seed that has to be controlled, and numerical non-determinism on GPUs can still leak in.

Local optimality. Greedy maximises $P_\theta(x_t \mid x_{<t})$ at each step, but the joint probability of the whole sequence is $P_\theta(x_1, \dots, x_T) = \prod_{t=1}^T P_\theta(x_t \mid x_{<t})$, and the highest-probability prefix at step $t$ is not in general a prefix of the highest-probability complete sequence. A token that scores 0.51 at step 1 might force the model into a region where every continuation is poor, while a token scoring 0.49 might lead to a much higher joint probability. Beam search mitigates this and exact search removes it, both at the cost of compute; greedy does neither.
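The 0.51 versus 0.49 scenario can be worked through on a two-step toy model. The first-step probabilities come from the paragraph above; the second-step conditionals are invented for illustration.

```python
from itertools import product

# Hypothetical two-step model over tokens "A" and "B".
STEP1 = {"A": 0.51, "B": 0.49}          # P(first token)
STEP2 = {                               # P(second token | first token)
    "A": {"A": 0.55, "B": 0.45},        # after "A", every continuation is mediocre
    "B": {"A": 0.90, "B": 0.10},        # after "B", one continuation is very likely
}

def greedy_sequence():
    """Commit to the best token at each step."""
    first = max(STEP1, key=STEP1.get)
    second = max(STEP2[first], key=STEP2[first].get)
    return (first, second), STEP1[first] * STEP2[first][second]

def best_sequence():
    """Exhaustively score all four sequences by joint probability."""
    scored = {
        (a, b): STEP1[a] * STEP2[a][b]
        for a, b in product(STEP1, repeat=2)
    }
    best = max(scored, key=scored.get)
    return best, scored[best]

print(greedy_sequence())   # picks ("A", "A"), joint probability about 0.28
print(best_sequence())     # picks ("B", "A"), joint probability about 0.44
```

Exhaustive scoring is only feasible here because the toy space has four sequences; for a real vocabulary the space grows as $|V|^T$, which is why practical systems settle for greedy or beam search.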

Compute cost. Per generated token, greedy adds nothing on top of the model's forward pass beyond a single argmax over the vocabulary, which is $O(|V|)$ and dwarfed by the forward pass itself. Beam search with width $k$ costs roughly $k$ times more memory and compute and requires a top-$k$ on a vector of size $k|V|$. Top-$p$ sampling needs a sort or partial sort of the vocabulary, which is $O(|V| \log |V|)$ but again negligible compared with the forward pass. In wall-clock terms, greedy and the standard sampling decoders are essentially the same speed; beam search is the one that pays a real cost.

Failure modes

The most extensive analysis of greedy decoding's failure modes is Holtzman, Buys, Du, Forbes, and Choi's 2020 ICLR paper The Curious Case of Neural Text Degeneration (arXiv:1904.09751). They show that maximisation-based decoders, both greedy and beam search, produce systematically degenerate text on open-ended generation tasks even when applied to strong base language models such as GPT-2.

Repetition loops. The most visible failure is that greedy falls into repetitive cycles. After a phrase appears once, the model assigns it slightly higher probability the next time around, reinforcing the loop until the same fragment repeats indefinitely. Holtzman et al. trace this to a self-amplifying feedback dynamic: the highest-probability continuation is the one that has just occurred. The effect is robust across model scales and is one of the main reasons GPT-2 and earlier open-ended generators looked so bad at long-form generation under maximisation-based decoding.

Mode collapse to high-frequency tokens. Even when greedy does not loop, it tends to collapse onto bland, high-frequency tokens. The same pattern shows up in machine translation as a preference for short safe sentences and in dialogue as the "I don't know" attractor familiar from neural chatbots.

Lack of diversity. Because the decoder is deterministic, every prompt produces a single completion. Greedy cannot explore the space of plausible answers, which is fatal for creative writing and brainstorming.

Brittleness on long horizons. Local optimality compounds. A small mistake at token 30 forces the model into an awkward region for the next few hundred tokens.

Holtzman et al. propose nucleus sampling (top-$p$) as an alternative that preserves the high-probability tokens the model is confident about while still admitting enough randomness to break the repetition feedback loop. Min-$p$ sampling (Nguyen et al. 2024) and typical sampling (Meister et al. 2023) attack the same problem from slightly different angles.

When to use greedy decoding

Despite the failure modes, greedy is the right choice for a large class of tasks and the default in most evaluation harnesses.

Tasks with a single correct answer. Math problems, multiple-choice questions, code generation against a specification, classification, span extraction, and most agentic tool use have a target output that is right or wrong. The model's job is to put high probability on the right tokens; greedy reads them off. Results on code benchmarks such as MBPP, HumanEval, and APPS are often reported as pass@1 under greedy (or near-greedy, low-temperature) decoding, with pass@k for larger $k$ measured under temperature sampling.

Reproducibility. A regression test that runs at temperature 0 and pins exact output strings is load-bearing infrastructure for many production LLM systems. Sampling outputs are essentially impossible to test this way without controlling every layer of randomness.

Function calling and structured output. When the model is expected to emit JSON, XML, a function-call schema, or a SQL query, you almost always want temperature 0. Sampling adds the risk of a syntactic error that breaks the consumer. "Strict" or "JSON mode" provider settings constrain which tokens are valid but do not necessarily force greedy decoding, so they are commonly paired with temperature 0.

Constrained decoding. In constrained generation (Outlines, JSONFormer, grammar-constrained decoding) the decoder masks invalid tokens to $-\infty$ before the argmax, giving deterministic, schema-correct output. This is the dominant pattern for production tool-use pipelines.

LLM-as-judge and grading. When one model grades another, the verdict needs to be stable across reruns. Most LLM-as-judge protocols specify temperature 0.

When not to use greedy decoding

Open-ended generation. Stories, poems, marketing copy, brainstorming, and chat with a user who expects a varied tone all suffer from greedy's blandness and repetition. Top-$p$ sampling with $p \approx 0.9$ and temperature around 0.7 to 1.0 is the conventional setting.

Dialogue. Conversational agents that always produce the same response to the same prompt feel mechanical. A small amount of randomness is enough to make the agent feel responsive.

Machine translation. NMT systems have used beam search since the original seq2seq and attention papers. Sutskever, Vinyals, and Le (2014) reported results with beams of up to 12, Bahdanau et al. (2015) used beam search, and the original Transformer paper (Vaswani et al. 2017) used a beam of 4 with length penalty $\alpha = 0.6$ in its WMT 2014 translation experiments. Translation has enough word-order and word-choice variation that beam search is meaningfully better than greedy on BLEU.

Benchmarks with pass@k for $k > 1$. If the metric samples $k$ candidates and rewards "any one passes", greedy throws away the benefit, because it can only ever produce one distinct candidate. Code benchmarks typically evaluate pass@1 with greedy or low-temperature decoding and pass@10 or pass@100 with sampling at temperature 0.6 to 0.8.

API knobs

Most LLM APIs do not expose a literal "greedy" switch. They expose temperature and top_p, and you reach greedy by setting temperature to 0 and leaving top_p at 1 or unspecified.

Provider	Greedy approximation
OpenAI Chat Completions	`temperature=0`, optional `top_p=1`, optional `seed`
Anthropic Messages	`temperature=0`, optional `top_p` and `top_k`
Hugging Face `generate`	`do_sample=False` (the literal switch, and the default)
vLLM `SamplingParams`	`temperature=0` or `top_k=1`
Google Gemini	`temperature=0`, optional `top_k=1`
llama.cpp	`--temp 0` or `--top-k 1`

Temperature 0 in a hosted API is not always literally greedy. Some providers add a small floor to avoid division by zero, route requests across hardware that breaks ties differently, or apply tie-breaking rules that are not byte-identical across runs. OpenAI's documentation explicitly does not guarantee bit-exact reproducibility even with seed and temperature=0. Anthropic's documentation makes a similar disclaimer. Outputs are usually stable across many runs but can drift over months as model snapshots roll out behind the scenes. Hugging Face generate is the cleanest case: when do_sample=False and num_beams=1, the decoder is a literal local argmax, and the output depends only on the prompt, the weights, and the floating-point determinism of the underlying kernels.

Implementation notes

A minimal Hugging Face Transformers example:

from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B", device_map="auto")

inputs = tok("The capital of France is", return_tensors="pt").to(model.device)

out = model.generate(
    **inputs,
    max_new_tokens=8,
    do_sample=False,      # this is greedy
    num_beams=1,          # default; explicit for clarity
)
print(tok.decode(out[0], skip_special_tokens=True))

In transformers, do_sample=False plus num_beams=1 selects the library's greedy search path (a dedicated greedy_search method in older releases; newer releases may route it through a shared generation loop), which is a thin wrapper around the loop sketched earlier. The LogitsProcessor chain (repetition penalty, no-repeat n-gram, bad-words mask, prefix-allowed tokens) is applied to the logits before the argmax, so greedy combined with a repetition penalty is a common recipe for taming the worst loops without giving up determinism. In vLLM, the equivalent is SamplingParams(temperature=0) or SamplingParams(top_k=1). vLLM implements the temperature=0 path as a true argmax rather than a numerical limit. For llama.cpp the corresponding CLI flags are --temp 0 or --top-k 1.
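The effect of a repetition penalty applied before the argmax can be sketched without any library. The function below is an illustrative assumption, not the transformers implementation: it follows the common rule of dividing positive logits and multiplying negative logits by the penalty for tokens that have already appeared, and the logits and penalty value are made up.

```python
def apply_repetition_penalty(logits, previous_tokens, penalty=1.3):
    """Lower the score of tokens that were already generated.

    Positive logits are divided by the penalty and non-positive logits are
    multiplied by it, so the adjustment never raises a score.
    """
    adjusted = list(logits)
    for tok in set(previous_tokens):
        if adjusted[tok] > 0:
            adjusted[tok] /= penalty
        else:
            adjusted[tok] *= penalty
    return adjusted

def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])

logits = [0.2, 2.0, 1.8, -1.0]   # made-up logits; token 1 leads narrowly
history = [1, 1, 1]              # token 1 has already been emitted three times

print(argmax(logits))                                      # 1
print(argmax(apply_repetition_penalty(logits, history)))   # 2
```

The decoder is still deterministic: the penalty is a pure function of the history, so the same prompt yields the same output while the loop on token 1 is broken.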

Cost compared with other decoders

Strategy	Memory overhead	Per-token compute on top of forward pass	Wall clock vs greedy
Greedy	1 KV cache	argmax over $|V|$ logits	1.0x
Beam search, width $k$	$k$ KV caches	top-$k$ over $k \cdot |V|$ scores	roughly $k$x
Top-$p$ / top-$k$ sampling	1 KV cache	sort or top-$k$ over $|V|$ logits, plus an RNG draw	very close to 1.0x
Speculative decoding	target + draft KV caches	draft forward + verification	typically 1.5x to 3.0x speedup over greedy

Speculative decoding is unusual: it is not a different decoding rule but a way of executing a chosen rule (greedy, sampling) faster by using a small draft model to propose tokens that the larger target model verifies in parallel (Leviathan, Kalman, and Matias 2022; Chen, Borgeaud, Irving, Lespiau, Sifre, and Jumper 2023). When the underlying decoder is greedy, speculative decoding produces the same tokens as plain greedy on the target model, just faster. In practice the match holds up to floating-point differences between batched verification and step-by-step decoding.
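Greedy verification can be sketched with two toy lookup tables standing in for the models. Everything here is hypothetical: TARGET and DRAFT map the last token to logits, token 0 is EOS, and the target is queried position by position where a real system would score all proposed positions in one batched forward pass. The sketch also omits the extra "bonus" token a real implementation emits when every proposal is accepted.

```python
EOS = 0

def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])

# Hypothetical models: logits over a 5-token vocabulary given the last token.
TARGET = {
    1: [0.0, 0.1, 2.0, 0.3, 0.2],
    2: [0.2, 0.1, 0.3, 1.5, 0.4],
    3: [0.1, 0.2, 0.3, 0.4, 1.0],
    4: [3.0, 0.1, 0.1, 0.1, 0.1],
}
DRAFT = {
    1: [0.0, 0.0, 1.0, 0.0, 0.0],
    2: [0.0, 0.0, 0.0, 1.0, 0.0],
    3: [0.0, 0.0, 0.0, 1.0, 0.5],   # disagrees with the target after token 3
    4: [1.0, 0.0, 0.0, 0.0, 0.0],
}

def target_next(tokens):
    return argmax(TARGET[tokens[-1]])

def draft_next(tokens):
    return argmax(DRAFT[tokens[-1]])

def greedy(prompt, max_tokens=16):
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_tokens:
        tokens.append(target_next(tokens))
        if tokens[-1] == EOS:
            break
    return tokens[len(prompt):]

def speculative_greedy(prompt, gamma=3, max_tokens=16):
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_tokens:
        # 1. The draft model proposes up to gamma tokens autoregressively.
        proposal = []
        for _ in range(gamma):
            if proposal and proposal[-1] == EOS:
                break
            proposal.append(draft_next(tokens + proposal))
        # 2. The target checks each proposed position against its own argmax.
        accepted = []
        for i, tok in enumerate(proposal):
            expected = target_next(tokens + proposal[:i])
            accepted.append(expected)          # the target's choice is what is kept
            if expected != tok or expected == EOS:
                break                          # first mismatch (or EOS) ends the round
        tokens.extend(accepted)
        if tokens[-1] == EOS:
            break
    return tokens[len(prompt):][:max_tokens]

print(greedy([1]))               # [2, 3, 4, 0]
print(speculative_greedy([1]))   # [2, 3, 4, 0]
```

Every kept token is the target's own argmax given the tokens before it, so the output matches plain greedy for any draft model; a poor draft only reduces how many tokens are accepted per round.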

Special cases and extensions

Constrained decoding. Greedy plus a token mask is the standard recipe for grammar-correct generation. The mask sets disallowed tokens to $-\infty$ before the argmax, so the highest-probability valid token is chosen at each step. This gives JSON-mode, regex-constrained, and context-free-grammar-constrained outputs that are guaranteed to parse, provided generation is not cut off by the length limit before the structure is closed.
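The masking step itself is small. The sketch below assumes a hypothetical six-token vocabulary and a hand-written set of allowed ids; in a real constrained decoder the allowed set would be computed at each step from the grammar or schema state.

```python
import math

VOCAB = ["{", "}", '"key"', ":", "hello", "42"]   # hypothetical tiny vocabulary

def masked_argmax(logits, allowed_ids):
    """Set every disallowed logit to -inf, then take the argmax."""
    masked = [
        logit if i in allowed_ids else -math.inf
        for i, logit in enumerate(logits)
    ]
    return max(range(len(masked)), key=lambda i: masked[i])

logits = [0.3, 0.1, 0.2, 0.4, 2.5, 1.1]   # the model prefers "hello"
allowed = {0}                              # but a JSON object must start with "{"

print(VOCAB[masked_argmax(logits, allowed)])   # {
```

The mask never changes the relative order of the tokens that remain valid, so among the allowed tokens the decoder still behaves exactly like unconstrained greedy.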

Classifier-free guidance for language models. CFG, originally a diffusion-model trick, has been adapted to language models (Sanchez et al. 2023). The decoder mixes logits from a conditional and an unconditional pass, $\ell_{\text{cfg}} = (1 + w)\ell_{\text{cond}} - w \ell_{\text{uncond}}$, and takes the argmax from $\ell_{\text{cfg}}$. With greedy this stays deterministic and cheap, but requires two forward passes per token, which doubles the compute.
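The logit mixing in the formula above is a per-token linear combination followed by the usual argmax. The sketch below uses made-up logits for the two passes; cfg_logits is a hypothetical helper name.

```python
def cfg_logits(cond, uncond, w):
    """l_cfg = (1 + w) * l_cond - w * l_uncond, applied per token."""
    return [(1 + w) * c - w * u for c, u in zip(cond, uncond)]

def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])

cond = [1.0, 1.2, 0.5]     # made-up logits from the conditional (prompted) pass
uncond = [0.2, 1.1, 0.4]   # made-up logits from the unconditional pass

print(argmax(cfg_logits(cond, uncond, w=0.0)))   # 1: w = 0 is plain greedy on cond
print(argmax(cfg_logits(cond, uncond, w=1.5)))   # 0: guidance favours what the prompt adds
```

Token 1 scores well with or without the prompt, so guidance discounts it; token 0 gains the most from the prompt and wins once $w$ is large enough.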

Reasoning models. OpenAI's o1 and DeepSeek R1 return a single visible answer per prompt but first generate a long chain of thought, which is typically sampled rather than decoded greedily. Even when the externally visible decoding looks deterministic, the hidden reasoning trace usually is not.

A short historical note

Greedy predates neural sequence models. It is the natural decoding rule for any locally-scored probabilistic generator and was the baseline against which beam search (used in speech recognition since the 1970s, in statistical machine translation through the 2000s) was compared. The modern case against greedy for open-ended text dates to 2018 and 2019: Fan, Lewis, and Dauphin's Hierarchical Neural Story Generation (2018) introduced top-$k$ sampling as an alternative, and Holtzman et al.'s The Curious Case of Neural Text Degeneration (posted to arXiv in 2019, published at ICLR 2020) crystallised the case against maximisation-based decoders for open-ended tasks. Neural machine translation has stayed with beam search throughout, because the failure modes that bite open-ended generation are less severe when the conditioning is tight.

Summary table

Use case	Recommended decoder	Why
Code generation `pass@1`, math, classification, JSON mode, LLM-as-judge	Greedy	Single right answer; deterministic; cheapest
Code generation `pass@k`, chain-of-thought self-consistency	Sampling, $\tau \approx 0.6$, top-$p \approx 0.95$	Need diverse candidates
Translation	Beam search, width 4 to 12	Higher BLEU, narrow target distribution
Open-ended generation, dialogue, brainstorming	Top-$p$ sampling, $p \approx 0.9$, $\tau \approx 0.7$ to $1.0$	Avoids degeneration, gives variety
Schema-constrained output	Greedy plus token mask	Guarantees valid syntax
Latency-sensitive deployment	Speculative decoding with greedy verification	Same tokens as greedy on the target, just faster

References

Sutskever, I., Vinyals, O., and Le, Q. V. (2014). *Sequence to Sequence Learning with Neural Networks*. NeurIPS 2014. arXiv:1409.3215.
Bahdanau, D., Cho, K., and Bengio, Y. (2015). *Neural Machine Translation by Jointly Learning to Align and Translate*. ICLR 2015. arXiv:1409.0473.
Vaswani, A. et al. (2017). *Attention Is All You Need*. NeurIPS 2017. arXiv:1706.03762.
Fan, A., Lewis, M., and Dauphin, Y. (2018). *Hierarchical Neural Story Generation*. ACL 2018. arXiv:1805.04833.
Holtzman, A., Buys, J., Du, L., Forbes, M., and Choi, Y. (2020). *The Curious Case of Neural Text Degeneration*. ICLR 2020. arXiv:1904.09751.
Meister, C., Pimentel, T., Wiher, G., and Cotterell, R. (2023). *Locally Typical Sampling*. TACL 2023. arXiv:2202.00666.
Nguyen, M. et al. (2024). *Min-P Sampling: Balancing Creativity and Coherence at High Temperature*. arXiv:2407.01082.
Leviathan, Y., Kalman, M., and Matias, Y. (2022). *Fast Inference from Transformers via Speculative Decoding*. arXiv:2211.17192.
Chen, C., Borgeaud, S., Irving, G., Lespiau, J.-B., Sifre, L., and Jumper, J. (2023). *Accelerating Large Language Model Decoding with Speculative Sampling*. arXiv:2302.01318.
Sanchez, G., Spangher, A., Fan, H., Levi, E., and Biderman, S. (2023). *Stay on Topic with Classifier-Free Guidance*. arXiv:2306.17806.
Jurafsky, D., and Martin, J. H. *Speech and Language Processing*, 3rd edition draft.
Goodfellow, I., Bengio, Y., and Courville, A. (2016). *Deep Learning*. MIT Press.
Eisenstein, J. (2019). *Introduction to Natural Language Processing*. MIT Press.
Hugging Face Transformers documentation, *Text generation strategies*.
vLLM documentation, *Sampling parameters*.
OpenAI API reference, *Chat completions*.
Anthropic API reference, *Messages*.
