Top-p sampling and top-k sampling are decoding strategies used in large language models (LLMs) and other autoregressive neural networks to control the randomness and quality of generated text. These methods filter the vocabulary at each generation step, restricting the pool of candidate tokens before sampling. They are fundamental tools for balancing creativity, coherence, and diversity in text generation across applications such as chatbots, creative writing assistants, and code generators.
Top-k sampling was popularized by Fan et al. (2018), while top-p sampling (also called nucleus sampling) was introduced by Holtzman et al. in their 2019 paper "The Curious Case of Neural Text Degeneration," published at ICLR 2020. Since then, newer methods such as min-p sampling have emerged to address remaining limitations.
At each step of autoregressive text generation, a language model produces a probability distribution over its entire vocabulary. The model assigns a probability to every token, and the generation method determines which token is actually selected. The simplest approach, greedy decoding, always picks the token with the highest probability. While deterministic and fast, greedy decoding tends to produce repetitive, generic text.
Sampling-based methods introduce randomness by drawing a token from the probability distribution rather than always picking the most likely one. Pure random sampling (drawing from the full distribution) can produce incoherent text because it sometimes selects very unlikely tokens. Top-k and top-p sampling address this by truncating the distribution, keeping only the most plausible candidates before sampling.
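The contrast between greedy decoding and pure sampling can be sketched in a few lines of plain Python (the function names and toy distribution are illustrative, not from any library):

```python
import random

def greedy_decode_step(probs):
    """Greedy decoding: always pick the single most probable token."""
    return max(probs, key=probs.get)

def sample_step(probs, rng):
    """Pure sampling: draw a token from the full distribution, with no truncation."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]

probs = {"the": 0.5, "a": 0.3, "banana": 0.2}
rng = random.Random(0)
greedy_decode_step(probs)   # always returns "the"
sample_step(probs, rng)     # may occasionally return the unlikely "banana"
```

Because `sample_step` can select any token, including low-probability ones, repeated calls will eventually produce implausible continuations; truncation methods exist to shrink its candidate pool first.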
Temperature is a scaling parameter applied to the logits (raw model outputs) before converting them to probabilities via the softmax function. It controls how "peaked" or "flat" the probability distribution is.
| Temperature value | Effect on distribution | Behavior |
|---|---|---|
| T = 1.0 | No change; uses the raw distribution | Standard model output |
| T < 1.0 (e.g., 0.3) | Sharpens the distribution; high-probability tokens become more dominant | More deterministic, focused output |
| T > 1.0 (e.g., 1.5) | Flattens the distribution; low-probability tokens get relatively higher chances | More random, creative, potentially incoherent output |
| T approaching 0 | Distribution collapses to a spike on the top token | Equivalent to greedy decoding |
Mathematically, the probability of token i at temperature T is computed as:
P(i) = exp(logit_i / T) / Σ_j exp(logit_j / T)
where the sum in the denominator runs over every token j in the vocabulary.
Temperature is typically applied before top-k or top-p filtering. It is not a sampling method itself but a preprocessing step that modifies the distribution before other truncation methods are applied.
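The temperature-scaled softmax above can be implemented directly with the standard library (a minimal sketch; the function name is illustrative):

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw logits to probabilities, dividing by the temperature first."""
    scaled = [l / temperature for l in logits]
    # Subtract the max before exponentiating for numerical stability.
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]
sharp = softmax_with_temperature(logits, 0.3)  # low T: distribution more peaked
flat = softmax_with_temperature(logits, 1.5)   # high T: distribution flatter
```

Comparing `sharp` and `flat` shows the effect from the table: lowering T increases the top token's share of the probability mass, while raising T spreads mass toward the tail.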
Top-k sampling, popularized by Fan et al. (2018) for story generation, is the simpler of the two main truncation methods. At each generation step, the model:

1. Sorts the vocabulary by predicted probability.
2. Keeps only the k most probable tokens and discards the rest.
3. Renormalizes the probabilities of the remaining tokens so they sum to 1.
4. Samples the next token from this truncated distribution.
Suppose a model's top predictions for the next token are:
| Token | Probability |
|---|---|
| "the" | 0.30 |
| "a" | 0.20 |
| "this" | 0.15 |
| "my" | 0.10 |
| "our" | 0.08 |
| "that" | 0.05 |
| (remaining tokens) | 0.12 |
With k = 4, the model keeps only "the," "a," "this," and "my." Their probabilities are renormalized to sum to 1 (0.40, 0.27, 0.20, 0.13), and one token is sampled from this restricted set.
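This filtering and renormalization step can be sketched in plain Python (the function name is illustrative; real implementations operate on logit tensors rather than dictionaries):

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize their probabilities."""
    kept = dict(sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k])
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}

probs = {"the": 0.30, "a": 0.20, "this": 0.15, "my": 0.10, "our": 0.08, "that": 0.05}
filtered = top_k_filter(probs, k=4)
# "the": 0.30 / 0.75 = 0.40; "a": ~0.27; "this": 0.20; "my": ~0.13
```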
Top-k sampling effectively prevents the selection of very unlikely tokens, reducing incoherence. However, it has a fundamental limitation: the fixed value of k does not adapt to the shape of the probability distribution. In some contexts, the model is highly confident and concentrates most probability mass on just 2 or 3 tokens. In other contexts, probability is spread more evenly across dozens of plausible continuations. A fixed k of, say, 40 would include many irrelevant tokens in the first case and might still miss plausible tokens in the second. This inflexibility motivated the development of top-p sampling.
Top-p sampling, or nucleus sampling, was proposed by Holtzman et al. (2019) to address top-k's inability to adapt to varying confidence levels. Instead of keeping a fixed number of tokens, top-p keeps the smallest set of top-ranked tokens whose cumulative probability reaches or exceeds a threshold p.
At each generation step, the model:

1. Sorts the vocabulary by predicted probability in descending order.
2. Accumulates probabilities from the top until the cumulative sum reaches p.
3. Discards every token outside this "nucleus" and renormalizes the rest.
4. Samples the next token from the renormalized nucleus.
Using the same probability distribution as above, with p = 0.75:
| Token | Probability | Cumulative probability | Included? |
|---|---|---|---|
| "the" | 0.30 | 0.30 | Yes |
| "a" | 0.20 | 0.50 | Yes |
| "this" | 0.15 | 0.65 | Yes |
| "my" | 0.10 | 0.75 | Yes |
| "our" | 0.08 | 0.83 | No |
| "that" | 0.05 | 0.88 | No |
The nucleus contains 4 tokens in this case. If the model were more confident (e.g., "the" had probability 0.80), the nucleus might contain only 1 token. If the model were less confident, the nucleus might contain 20 or more tokens. This adaptive behavior is the key advantage of top-p over top-k.
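The nucleus construction above can be sketched as follows (an illustrative implementation over a probability dictionary; production code applies the same logic to sorted logit tensors):

```python
def top_p_filter(probs, p):
    """Keep the smallest top-ranked set whose cumulative probability reaches p,
    then renormalize the survivors."""
    nucleus, cumulative = {}, 0.0
    for tok, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        nucleus[tok] = prob
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(nucleus.values())
    return {tok: prob / total for tok, prob in nucleus.items()}

probs = {"the": 0.30, "a": 0.20, "this": 0.15, "my": 0.10, "our": 0.08, "that": 0.05}
nucleus = top_p_filter(probs, p=0.75)  # keeps "the", "a", "this", "my"
```

Lowering p shrinks the nucleus: with p = 0.5 on the same distribution, only "the" and "a" survive.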
Holtzman et al.'s paper identified a critical problem with standard beam search and greedy decoding: even high-quality language models produce text that is "boring" and repetitive when decoded with maximization-based methods. Human language, by contrast, has a degree of unpredictability that makes it engaging. The authors showed that the probability distributions produced by neural language models assign non-trivial probability to a "nucleus" of plausible tokens, while scattering the remaining probability across a long tail of unlikely tokens. By sampling only from the nucleus, top-p sampling captures the natural variability of language while avoiding the incoherence that comes from sampling unlikely tokens.
In practice, these parameters are often used together. A typical generation pipeline applies them in the following order:

1. Apply temperature scaling to the logits.
2. Apply top-k filtering, if enabled.
3. Apply top-p filtering, if enabled.
4. Renormalize the surviving probabilities and sample a token.
When both top-k and top-p are applied, top-k acts as a hard upper bound on the number of candidates, while top-p provides a soft, distribution-dependent filter. For example, setting k = 50 and p = 0.9 means the model considers at most 50 tokens, but may consider fewer if 90% of the probability is concentrated in just a handful of tokens.
The following table shows typical parameter configurations for different use cases.
| Use case | Temperature | Top-k | Top-p | Notes |
|---|---|---|---|---|
| Factual Q&A | 0.0-0.3 | N/A | N/A | Low temperature or greedy decoding for accuracy |
| General chatbot | 0.7-1.0 | 40-50 | 0.9-0.95 | Balanced creativity and coherence |
| Creative writing | 1.0-1.2 | 50-100 | 0.95-1.0 | Higher temperature for more diverse output |
| Code generation | 0.0-0.4 | N/A | 0.9 | Low temperature for correctness; top-p to avoid nonsense |
| Brainstorming | 1.0-1.5 | 100 | 0.95-1.0 | Maximizes diversity at the cost of some coherence |
Provider APIs typically expose these parameters. OpenAI's API, for instance, supports temperature and top_p. Anthropic's API for Claude supports temperature and top_k/top_p. The Hugging Face Transformers library exposes all three parameters plus additional options.
Min-p sampling is a newer truncation method that addresses a limitation shared by both top-k and top-p: neither directly considers the confidence level of the model's top prediction when deciding which tokens to keep.
Introduced by Minh Nguyen and collaborators, min-p works as follows:

1. Find the probability of the most likely token, P_max.
2. Compute a truncation threshold as min_p × P_max.
3. Discard every token whose probability falls below the threshold.
4. Renormalize the surviving probabilities and sample from them.
The key insight is that the threshold scales with the model's confidence:
| Scenario | P_max | Threshold (min_p = 0.1) | Effect |
|---|---|---|---|
| High confidence | 0.90 | 0.09 | Only tokens with >= 9% probability survive; very few candidates |
| Moderate confidence | 0.30 | 0.03 | Tokens with >= 3% probability survive; moderate candidate pool |
| Low confidence | 0.05 | 0.005 | Tokens with >= 0.5% probability survive; many candidates allowed |
When the model is highly confident, min-p aggressively filters, keeping only the strongest candidates. When the model is uncertain, min-p relaxes, allowing a wider range of plausible continuations. This behavior is more principled than top-p's fixed cumulative threshold, which can include too many low-quality tokens when the model is uncertain.
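The confidence-scaled threshold is short enough to sketch directly (an illustrative implementation; the toy distributions mirror the high- and low-confidence scenarios in the table):

```python
def min_p_filter(probs, min_p):
    """Min-p: keep tokens with probability >= min_p * P_max, then renormalize."""
    threshold = min_p * max(probs.values())
    kept = {tok: pr for tok, pr in probs.items() if pr >= threshold}
    total = sum(kept.values())
    return {tok: pr / total for tok, pr in kept.items()}

# High confidence: P_max = 0.90, so the threshold is 0.09 and few tokens survive.
confident = {"the": 0.90, "a": 0.05, "this": 0.03, "my": 0.02}
# Low confidence: P_max = 0.30, so the threshold drops to 0.03 and most survive.
uncertain = {"the": 0.30, "a": 0.25, "this": 0.20, "my": 0.15, "our": 0.08, "that": 0.02}
```

Running `min_p_filter` with min_p = 0.1 keeps only "the" in the confident case but five of the six candidates in the uncertain case, exactly the adaptive behavior described above.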
Nguyen et al. demonstrated that min-p sampling outperforms top-p across multiple model families (Mistral, Llama 3) and model sizes (1B to 123B parameters), particularly at higher temperatures where top-p tends to produce incoherent outputs. Min-p maintains both quality and diversity more effectively than top-p, especially in creative generation tasks. The paper was accepted as an oral presentation at ICLR 2025.
Min-p is supported in major inference frameworks including llama.cpp, vLLM, Hugging Face Transformers, Ollama, ExLlamaV2, KoboldCpp, and text-generation-webui.
Several additional sampling methods have been proposed to improve text generation quality.
Locally typical sampling, introduced by Meister et al. (2022), selects tokens whose information content (negative log probability) is close to the conditional entropy of the distribution. The intuition is that "typical" tokens are neither too predictable nor too surprising, aligning with information-theoretic properties of natural language.
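The idea can be sketched as follows: rank tokens by how closely their surprisal matches the distribution's entropy, then keep tokens until a target probability mass tau is covered. This is a simplified, illustrative version; the tau parameter and function name are assumptions, not the paper's exact formulation:

```python
import math

def typical_filter(probs, tau):
    """Locally typical sampling sketch: prefer tokens whose information content
    (-log p) is closest to the distribution's entropy, up to cumulative mass tau."""
    entropy = -sum(p * math.log(p) for p in probs.values() if p > 0)
    # Rank tokens by the distance between their surprisal and the entropy.
    ranked = sorted(probs.items(), key=lambda kv: abs(-math.log(kv[1]) - entropy))
    kept, cumulative = {}, 0.0
    for tok, p in ranked:
        kept[tok] = p
        cumulative += p
        if cumulative >= tau:
            break
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}

probs = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}
kept_typical = typical_filter(probs, tau=0.5)
```

Note that the most probable token "a" is not ranked first here: "b", whose surprisal sits closest to the entropy, is, which is precisely how typical sampling differs from probability-ordered truncation.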
Eta sampling (Hewitt et al., 2022) uses both absolute and relative probability thresholds to truncate the distribution. It removes tokens in the tail of the distribution whose probabilities fall below a threshold derived from the distribution's entropy.
Mirostat (Basu et al., 2021) is a sampling method that dynamically adjusts the truncation to maintain a target perplexity (surprise level) throughout generation. Rather than using a fixed threshold like top-k or top-p, Mirostat uses a feedback control loop to keep the text at a consistent level of predictability.
| Method | Year | Truncation criterion | Adaptive? | Key property |
|---|---|---|---|---|
| Top-k | 2018 | Fixed number of tokens | No | Simple; does not adapt to confidence |
| Top-p (nucleus) | 2019 | Cumulative probability threshold | Partially | Adapts candidate pool size to distribution shape |
| Typical sampling | 2022 | Information content near entropy | Yes | Information-theoretically motivated |
| Eta sampling | 2022 | Entropy-based threshold | Yes | Removes low-probability tail tokens |
| Mirostat | 2021 | Target perplexity feedback loop | Yes | Maintains consistent surprise level |
| Min-p | 2024 | Fraction of top token's probability | Yes | Scales threshold with model confidence |
In addition to truncation methods, most generation systems apply repetition penalties to discourage the model from repeating the same tokens or phrases. These penalties reduce the logit of tokens that have already appeared in the generated text, with the penalty typically increasing with the number of prior occurrences.
OpenAI's API provides two related parameters: presence_penalty (penalizes tokens that have appeared at all) and frequency_penalty (penalizes tokens proportionally to how often they have appeared). These work alongside temperature and top-p to control output quality.
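The penalty scheme OpenAI documents subtracts from each repeated token's logit a term proportional to its occurrence count (frequency) plus a flat term for having appeared at all (presence). A minimal sketch, with illustrative names and toy values:

```python
def apply_penalties(logits, counts, presence_penalty, frequency_penalty):
    """Reduce the logits of tokens already generated:
    logit -= count * frequency_penalty + presence_penalty (if count > 0)."""
    adjusted = dict(logits)
    for tok, count in counts.items():
        if tok in adjusted and count > 0:
            adjusted[tok] -= count * frequency_penalty + presence_penalty
    return adjusted

logits = {"the": 2.0, "cat": 1.5, "sat": 1.0}
counts = {"the": 3, "cat": 1}  # occurrence counts in the text generated so far
penalized = apply_penalties(logits, counts, presence_penalty=0.5, frequency_penalty=0.2)
# "the": 2.0 - 3*0.2 - 0.5 = 0.9; "cat": 1.5 - 1*0.2 - 0.5 = 0.8; "sat" unchanged
```

The penalized logits are then passed through the usual temperature/truncation/softmax pipeline, so frequently repeated tokens become progressively less likely to be sampled again.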
Top-k and top-p sampling are implemented efficiently in modern inference libraries. The key computational steps (sorting logits, computing cumulative sums, masking) add minimal overhead compared to the model's forward pass. In the Hugging Face Transformers library, sampling parameters are passed through the GenerationConfig object:
```python
from transformers import GenerationConfig

config = GenerationConfig(
    do_sample=True,
    temperature=0.8,
    top_k=50,
    top_p=0.9,
    max_new_tokens=256,
)
```
For API-based access, OpenAI's Chat Completions API accepts temperature and top_p as parameters. The API documentation recommends altering either temperature or top_p but not both simultaneously, though in practice both can be used together.
The development of sampling methods for language models is rooted in information theory and probability theory. When a language model generates text, each step can be viewed as sampling from a categorical distribution over the vocabulary. The quality of the generated text depends on how well the sampling strategy navigates the trade-off between two failure modes:
Degeneration: When the model always picks the most probable tokens (greedy or beam search), the output becomes repetitive and generic. Holtzman et al. measured this quantitatively, showing that human text has a much higher variance in per-token probability than text generated by maximization-based methods.
Incoherence: When the model samples from the full distribution without truncation, it occasionally selects tokens from the long tail of the distribution, producing nonsensical or contradictory text.
Top-k, top-p, and min-p can all be understood as different approaches to identifying and removing the unreliable tail of the distribution while preserving the informative nucleus. From an information-theoretic perspective, typical sampling goes even further by directly targeting the "typical set" of sequences, those whose information content per token is close to the model's entropy.
Understanding how sampling parameters interact is important for achieving desired generation behavior. The interaction can be subtle, and combining parameters does not always produce intuitive results.
| Combination | Interaction | Practical note |
|---|---|---|
| Low temperature + top-p | Temperature sharpens the distribution before top-p filters. The nucleus becomes very small, often containing just 1-3 tokens. | Behaves almost like greedy decoding. top-p has little effect. |
| High temperature + top-p | Temperature flattens the distribution, spreading probability across many tokens. top-p then truncates the long tail. | top-p does most of the heavy lifting in preventing incoherence. |
| Top-k + top-p together | top-k sets a hard ceiling on candidates; top-p may further reduce the set. | Useful when you want an absolute maximum on candidate count. |
| Min-p + high temperature | High temperature flattens the distribution, but min-p scales its threshold relative to the still-highest token. | Min-p remains effective because it adapts to the post-temperature distribution. |
| Min-p + top-p | Both filters apply. In practice, one typically dominates. | Most practitioners use one or the other, not both. |
OpenAI's API documentation specifically notes that users should generally alter either temperature or top_p, not both. This guidance reflects the fact that both parameters affect the effective size of the candidate pool, and combining them without careful tuning can produce unexpected behavior.
The choice of sampling parameters has significant effects on the behavior of deployed LLM applications.
| Application | Preferred strategy | Reason |
|---|---|---|
| Customer support bots | Low temperature (0.1-0.3), no top-k/top-p | Consistency and accuracy are paramount |
| AI coding assistants | Low temperature, top-p = 0.9 | Correct code with some variation for alternative approaches |
| Story generation | High temperature (1.0+), top-p = 0.95 | Creativity and unpredictability are desired |
| Search-augmented generation (RAG) | Low temperature (0.0-0.2) | Faithfulness to retrieved context |
| Translation | Temperature = 0.3-0.5, top-p = 0.9 | Balance between fluency and accuracy |
The evolution of sampling strategies tracks the broader development of neural language models.
| Year | Development |
|---|---|
| Pre-2018 | Beam search dominates in sequence-to-sequence tasks. Temperature sampling used informally. |
| 2018 | Fan et al. popularize top-k sampling for hierarchical story generation, demonstrating that stochastic decoding produces more engaging narratives than beam search. |
| 2019 | Holtzman et al. publish "The Curious Case of Neural Text Degeneration," introducing nucleus (top-p) sampling. The paper appears at ICLR 2020 and becomes one of the most cited works on text generation. |
| 2021 | Basu et al. propose Mirostat, using control theory to maintain a target perplexity during generation. |
| 2022 | Meister et al. formalize locally typical sampling based on information-theoretic principles. Hewitt et al. introduce eta sampling. |
| 2024 | Nguyen et al. propose min-p sampling, which scales the truncation threshold with model confidence. The paper gains rapid adoption in open-source inference frameworks. |
| 2025 | Min-p accepted as an oral at ICLR 2025. Sampling methods continue to evolve alongside new model architectures. |
The trend across this timeline is clear: sampling methods have become progressively more adaptive, moving from fixed thresholds (top-k) to distribution-dependent thresholds (top-p) to confidence-scaled thresholds (min-p). Each generation of methods better approximates the ideal of including exactly the plausible continuations while excluding the implausible ones.