# Greedy decoding

> Source: https://aiwiki.ai/wiki/greedy_decoding
> Updated: 2026-06-23
> Categories: Large Language Models, Natural Language Processing
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

**Greedy decoding** (also called **greedy search** or **argmax decoding**) is the simplest text-generation strategy used by autoregressive [language models](/wiki/language_model): at every step it picks the single highest-probability next token (the argmax of the model's output distribution) and appends it to the sequence, with no sampling, no alternatives tracked, and no lookahead. It is deterministic, the cheapest decoding rule available, and equivalent to sampling in the limit of [temperature](/wiki/temperature) zero, but it is myopic (locally optimal, not globally optimal) and prone to repetition loops and bland generic prose, which is why translation usually uses [beam search](/wiki/beam_search) and open-ended generation usually uses stochastic sampling instead. Generation repeats this step until the model produces an end-of-sequence token or hits a length limit.

Greedy decoding is the decoder behind the popular `temperature=0` and `top_k=1` settings on the [OpenAI](/wiki/openai_api) and [Anthropic](/wiki/anthropic_api) APIs, on Hugging Face `generate` (where `do_sample=False` with `num_beams=1` is the default decoding strategy), on [vLLM](/wiki/vllm), and on virtually every other inference stack [14] [15].

The trade-off is that greedy decoding is locally optimal but not globally optimal. By committing to the highest-probability token at every step, the decoder can end up with a sequence whose total probability is lower than that of the best available sequence, and it is prone to well-documented failure modes such as repetition loops and dull generic prose [5].
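
The gap between locally and globally optimal choices can be seen on a toy model. The sketch below is illustrative only: `TOY_MODEL` and its probabilities are made-up values, and `best_sequence` enumerates every sequence, which is feasible only at this scale.

```python
# Hypothetical next-token model: maps a prefix (tuple of tokens) to a
# probability distribution over the next token. Numbers are invented.
TOY_MODEL = {
    (): {"A": 0.6, "B": 0.4},
    ("A",): {"x": 0.4, "y": 0.3, "z": 0.3},
    ("B",): {"x": 0.9, "y": 0.1},
}


def greedy_decode(model, max_steps=2):
    """Pick the argmax token at every step; return (sequence, probability)."""
    prefix, prob = (), 1.0
    for _ in range(max_steps):
        dist = model[prefix]
        token = max(dist, key=dist.get)  # local argmax, no lookahead
        prob *= dist[token]
        prefix += (token,)
    return prefix, prob


def best_sequence(model, max_steps=2):
    """Score every possible sequence and return the most probable one."""
    frontier = [((), 1.0)]
    for _ in range(max_steps):
        frontier = [
            (prefix + (tok,), prob * p)
            for prefix, prob in frontier
            for tok, p in model[prefix].items()
        ]
    return max(frontier, key=lambda item: item[1])


greedy_seq, greedy_prob = greedy_decode(TOY_MODEL)
best_seq, best_prob = best_sequence(TOY_MODEL)
print(greedy_seq, round(greedy_prob, 2))  # ('A', 'x') 0.24
print(best_seq, round(best_prob, 2))      # ('B', 'x') 0.36
```

Greedy commits to `A` because it is the most likely first token, but every continuation of `A` is weak; the sequence starting with the less likely `B` ends up more probable overall.
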
For tasks with a single correct answer (math, code generation, classification, function calling, structured output) greedy is usually the right default. For creative writing, dialogue, and translation, it usually is not.

## What is greedy decoding, formally?

Let $V$ be the model vocabulary and $x_{<t}$ the tokens available before step $t$ (the prompt plus everything generated so far). At each step greedy decoding selects $x_t = \arg\max_{v \in V} p(v \mid x_{<t})$, appends $x_t$ to the sequence, and repeats until an end-of-sequence token (`</s>`, `<|endoftext|>`, or `<|eot_id|>` depending on the [tokenizer](/wiki/token)) is emitted, or until a `max_tokens` budget is reached.

The decision is made on the post-[softmax](/wiki/softmax) distribution, but because softmax preserves the ordering of the [logits](/wiki/logits), the same token is selected by argmax over raw logits. Production implementations skip the softmax for greedy decoding and run argmax on logits directly.

## The algorithm

The full procedure is roughly five lines of pseudocode:

```
input:  prompt tokens x_<1
output: completion tokens x_1, x_2, ..., x_T

repeat:
    logits = model.forward(x_<t)
    x_t = argmax(logits)
    append x_t to the sequence
    stop if x_t is the end-of-sequence token or t reaches max_tokens
```

[…]

**… $k > 1$.** If the metric samples $k$ candidates and rewards "any one passes", greedy throws away the benefit. Code benchmarks evaluate `pass@1` with greedy and `pass@10` or `pass@100` with sampling at temperature 0.6 to 0.8.

## API knobs: how do you turn on greedy decoding?

Most LLM APIs do not expose a literal "greedy" switch. They expose `temperature` and `top_p`, and you reach greedy by setting `temperature` to 0 and leaving `top_p` at 1 or unspecified.

| Provider | Greedy approximation |
|---|---|
| OpenAI Chat Completions | `temperature=0`, optional `top_p=1`, optional `seed` |
| Anthropic Messages | `temperature=0`, optional `top_p` and `top_k` |
| Hugging Face `generate` | `do_sample=False` (the literal switch, and the default) |
| vLLM `SamplingParams` | `temperature=0` or `top_k=1` |
| Google Gemini | `temperature=0`, optional `top_k=1` |
| llama.cpp | `--temp 0` or `--top-k 1` |

Temperature 0 in a hosted API is not always literally greedy.
Some providers add a small floor to avoid division by zero, route requests across hardware that breaks ties differently, or apply tie-breaking rules that do not give byte-identical results across runs. OpenAI's documentation explicitly does not guarantee bit-exact reproducibility even with `seed` and `temperature=0` [16]. Anthropic's documentation makes a similar disclaimer [17]. Outputs are usually stable across many runs but can drift over months as model snapshots roll out behind the scenes.

Hugging Face `generate` is the cleanest case: when `do_sample=False` and `num_beams=1` (the default), the decoder is a literal local argmax, and the output depends only on the prompt, the weights, and the floating-point determinism of the underlying kernels [14].

## Implementation notes

A minimal Hugging Face Transformers example:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B", device_map="auto")

inputs = tok("The capital of France is", return_tensors="pt").to(model.device)
out = model.generate(
    **inputs,
    max_new_tokens=8,
    do_sample=False,  # this is greedy
    num_beams=1,      # default; explicit for clarity
)
print(tok.decode(out[0], skip_special_tokens=True))
```

In `transformers`, `do_sample=False` plus `num_beams=1` is the default decoding strategy and dispatches to the greedy search implementation, which is a thin wrapper around the loop sketched earlier [14]. The `LogitsProcessor` chain (repetition penalty, no-repeat-n-gram, bad-words-mask, prefix-allowed-tokens) is applied to the logits before the argmax, so greedy combined with a repetition penalty is a common recipe for taming the worst loops without giving up determinism.

In vLLM, the equivalent is `SamplingParams(temperature=0)` or `SamplingParams(top_k=1)`; vLLM implements the `temperature=0` path as a true argmax rather than a numerical limit [15].
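
The difference between a true argmax and a numerical limit can be sketched in plain Python. This is an illustration under assumed values, not the code of vLLM or any provider: `softmax_with_temperature`, `greedy_token`, and the logits are hypothetical names and numbers.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax of logits / temperature; undefined when temperature is 0."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_token(logits):
    """True argmax on raw logits: no softmax and no division involved."""
    return max(range(len(logits)), key=lambda i: logits[i])


logits = [2.0, 1.5, -1.0]  # hypothetical scores for a 3-token vocabulary

# As the temperature shrinks, probability mass piles onto the argmax token.
for t in (1.0, 0.5, 0.1, 0.01):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 4) for p in probs])

# Probability of the greedy token at a very low temperature.
top_prob_cold = softmax_with_temperature(logits, 0.01)[greedy_token(logits)]
```

Calling `softmax_with_temperature(logits, 0)` raises `ZeroDivisionError`, which is why an implementation has to either special-case zero as an argmax or substitute a small positive floor.
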
For `llama.cpp` the corresponding CLI flags are `--temp 0` or `--top-k 1`.

## Cost compared with other decoders

| Strategy | Memory overhead | Per-token compute on top of forward pass | Wall-clock time relative to greedy |
|---|---|---|---|
| Greedy | 1 KV cache | argmax over $\|V\|$ logits | 1.0x |
| Beam search, width $k$ | $k$ KV caches | top-$k$ over $k \cdot \|V\|$ scores | roughly $k$x |
| Top-$p$ / top-$k$ sampling | 1 KV cache | sort or top-$k$ over $\|V\|$ logits, plus an RNG draw | very close to 1.0x |
| Speculative decoding | target + draft KV caches | draft forward + verification | typically 1.5x to 3.0x faster than greedy |

Speculative decoding is unusual: it is not a different decoding rule but a way of executing a chosen rule (greedy, sampling) faster by using a small draft model to propose tokens that the larger target model verifies in parallel (Leviathan, Kalman, and Matias 2022 [8]; Chen, Borgeaud, Irving, Lespiau, Sifre, and Jumper 2023 [9]). When the underlying decoder is greedy, speculative decoding produces bit-identical outputs to plain greedy on the target model, just faster.

## Special cases and extensions

**Constrained decoding.** Greedy plus a token mask is the standard recipe for grammar-correct generation. The mask sets the logits of disallowed tokens to $-\infty$ before the argmax, so the highest-probability *valid* token is chosen at each step. This gives JSON-mode, regex-constrained, and context-free-grammar-constrained outputs that are guaranteed to parse.

**Classifier-free guidance for language models.** Classifier-free guidance (CFG), originally a diffusion-model trick, has been adapted to language models (Sanchez et al. 2023) [10]. The decoder mixes logits from a conditional and an unconditional pass, $\ell_{\text{cfg}} = (1 + w)\ell_{\text{cond}} - w \ell_{\text{uncond}}$, and takes the argmax of $\ell_{\text{cfg}}$. With greedy this stays deterministic, but it requires two forward passes per token, which doubles the compute.
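
Both extensions above act on the logits before the argmax, which a few lines of plain Python can illustrate. This is a minimal sketch under assumed values: the function names, the 4-token vocabulary, and the logits are hypothetical, and a real system would compute the allowed set from a grammar and the two logit vectors from two model passes.

```python
import math


def argmax(scores):
    """Index of the largest score (the first index wins ties)."""
    return max(range(len(scores)), key=lambda i: scores[i])


def masked_greedy(logits, allowed):
    """Greedy step under a token mask: disallowed logits become -inf."""
    masked = [x if i in allowed else -math.inf for i, x in enumerate(logits)]
    return argmax(masked)


def cfg_greedy(cond_logits, uncond_logits, w):
    """Greedy step on guided logits (1 + w) * cond - w * uncond."""
    mixed = [(1 + w) * c - w * u for c, u in zip(cond_logits, uncond_logits)]
    return argmax(mixed)


# Hypothetical vocabulary and logits, for illustration only.
vocab = ["{", "hello", "}", "<eos>"]
cond = [1.0, 2.0, 1.8, 0.1]
uncond = [0.5, 2.5, 0.5, 0.0]

plain = argmax(cond)                                # unconstrained greedy
constrained = masked_greedy(cond, allowed={0, 2})   # only braces are valid
guided = cfg_greedy(cond, uncond, w=1.0)            # guidance weight w = 1
print(vocab[plain], vocab[constrained], vocab[guided])  # hello } }
```

With `w = 0` the guided logits reduce to the conditional logits, so `cfg_greedy` falls back to plain greedy.
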
**Reasoning models.** Reasoning models such as OpenAI's [o1](/wiki/o1) and [DeepSeek R1](/wiki/deepseek_r1) return a single final answer per prompt, but the chain of thought that precedes it is typically generated with sampling rather than greedy decoding. Even when the externally visible decoding looks deterministic, the reasoning trace usually is not.

## A short historical note

Greedy predates neural sequence models. It is the natural decoding rule for any locally-scored probabilistic generator and was the baseline against which beam search (used in speech recognition since the 1970s, in statistical machine translation through the 2000s) was compared.

The modern case against greedy for open-ended text dates to 2018 and 2019: Fan, Lewis, and Dauphin's *Hierarchical Neural Story Generation* (2018) introduced top-$k$ sampling as an alternative [4], and Holtzman et al.'s *The Curious Case of Neural Text Degeneration* (arXiv 2019, ICLR 2020) crystallised the case against maximisation-based decoders for open-ended tasks [5]. Neural machine translation has stayed with beam search throughout, because the failure modes that bite open-ended generation are less severe when the conditioning is tight.
## Summary table

| Use case | Recommended decoder | Why |
|---|---|---|
| Code generation `pass@1`, math, classification, JSON mode, LLM-as-judge | Greedy | Single right answer; deterministic; cheapest |
| Code generation `pass@k`, [chain-of-thought](/wiki/chain_of_thought) self-consistency | Sampling, $\tau \approx 0.6$, top-$p \approx 0.95$ | Need diverse candidates |
| Translation | Beam search, width 4 to 12 | Higher BLEU, narrow target distribution |
| Open-ended generation, dialogue, brainstorming | Top-$p$ sampling, $p \approx 0.9$, $\tau \approx 0.7$ to $1.0$ | Avoids degeneration, gives variety |
| Schema-constrained output | Greedy plus token mask | Guarantees valid syntax |
| Latency-sensitive deployment | Speculative decoding with greedy verification | Bit-identical to greedy on the target, just faster |

## See also

- [Beam search](/wiki/beam_search)
- [Temperature](/wiki/temperature)
- [Top-p and top-k sampling](/wiki/top_p_sampling)
- [Speculative decoding](/wiki/speculative_decoding)
- [Softmax](/wiki/softmax)
- [Logits](/wiki/logits)
- [Transformer](/wiki/transformer)
- [Language model](/wiki/language_model)
- [Large language models](/wiki/llm)
- [Autoregressive](/wiki/autoregressive)

## References

1. Sutskever, I., Vinyals, O., and Le, Q. V. (2014). *Sequence to Sequence Learning with Neural Networks*. NeurIPS 2014. arXiv:1409.3215.
2. Bahdanau, D., Cho, K., and Bengio, Y. (2015). *Neural Machine Translation by Jointly Learning to Align and Translate*. ICLR 2015. arXiv:1409.0473.
3. Vaswani, A. et al. (2017). *Attention Is All You Need*. NeurIPS 2017. arXiv:1706.03762.
4. Fan, A., Lewis, M., and Dauphin, Y. (2018). *Hierarchical Neural Story Generation*. ACL 2018. arXiv:1805.04833.
5. Holtzman, A., Buys, J., Du, L., Forbes, M., and Choi, Y. (2020). *The Curious Case of Neural Text Degeneration*. ICLR 2020. arXiv:1904.09751.
6. Meister, C., Pimentel, T., Wiher, G., and Cotterell, R. (2023). *Locally Typical Sampling*. TACL 2023. arXiv:2202.00666.
7. Nguyen, M. et al. (2024). *Min-P Sampling: Balancing Creativity and Coherence at High Temperature*. arXiv:2407.01082.
8. Leviathan, Y., Kalman, M., and Matias, Y. (2022). *Fast Inference from Transformers via Speculative Decoding*. arXiv:2211.17192.
9. Chen, C., Borgeaud, S., Irving, G., Lespiau, J.-B., Sifre, L., and Jumper, J. (2023). *Accelerating Large Language Model Decoding with Speculative Sampling*. arXiv:2302.01318.
10. Sanchez, G., Spangher, A., Fan, H., Levi, E., and Biderman, S. (2023). *Stay on Topic with Classifier-Free Guidance*. arXiv:2306.17806.
11. Jurafsky, D., and Martin, J. H. *Speech and Language Processing*, 3rd edition draft.
12. Goodfellow, I., Bengio, Y., and Courville, A. (2016). *Deep Learning*. MIT Press.
13. Eisenstein, J. (2019). *Introduction to Natural Language Processing*. MIT Press.
14. Hugging Face Transformers documentation, *Text generation strategies*. https://huggingface.co/docs/transformers/generation_strategies
15. vLLM documentation, *Sampling parameters*. https://docs.vllm.ai/en/latest/api/inference_params.html
16. OpenAI API reference, *Chat completions*. https://platform.openai.com/docs/api-reference/chat
17. Anthropic API reference, *Messages*. https://docs.anthropic.com/en/api/messages