Top-p sampling, also called nucleus sampling, is a stochastic decoding method for text generation in which the model samples from the smallest possible set of tokens whose cumulative probability mass exceeds a threshold p. The technique was introduced by Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi in the 2019 paper "The Curious Case of Neural Text Degeneration" (published at ICLR 2020). It addressed a persistent problem in open-ended generation: deterministic decoding methods like greedy search and beam search produce text that is repetitive, generic, and ultimately degenerate, while naive stochastic sampling can produce incoherent output by drawing from the unreliable tail of the probability distribution [1].
Top-p sampling has since become one of the most widely used generation parameters in large language model APIs. OpenAI, Anthropic, Google, Meta, and virtually every other LLM provider expose a top_p parameter, making it a foundational concept for anyone working with generative AI.
Before nucleus sampling was proposed, the two dominant approaches to generating text from neural language models were deterministic decoding (greedy search, beam search) and simple stochastic sampling, sometimes combined with temperature scaling or top-k sampling.
Greedy decoding selects the single most probable token at every step. It is fast and deterministic but tends to get trapped in repetitive loops. Beam search extends this by tracking the k most probable sequences simultaneously, but Holtzman et al. demonstrated that it suffers from the same fundamental problem: maximization-based decoding is simply inappropriate for open-ended text generation. The text it produces is bland and repetitive, bearing little resemblance to natural human writing, which exhibits far more variability and surprise [1].
One of the key insights of the paper was that human-written text does not follow a maximum-likelihood trajectory through the probability space. At any given point, humans frequently choose words that are plausible but not the single most probable continuation. A good decoding method needs to capture this stochastic quality.
Top-k sampling, which restricts sampling to the k most probable tokens at each step, partially addresses the issue. By removing low-probability tokens from consideration, it prevents the model from producing highly improbable (and often incoherent) outputs. However, top-k has a critical limitation: it uses a fixed number of candidate tokens regardless of the shape of the probability distribution.
Consider two scenarios. In the first, the model is highly confident about the next word: perhaps the phrase "The capital of France is" gives nearly all probability mass to the token "Paris." Here, even k = 5 might include irrelevant tokens. In the second scenario, the model faces genuine ambiguity: after "My favorite food is," dozens or even hundreds of tokens could be reasonable continuations. With a fixed k = 5, the model is forced to ignore many perfectly valid options.
This mismatch between a fixed vocabulary size and the dynamic shape of the probability distribution is what nucleus sampling was designed to solve.
The mechanism of top-p sampling is straightforward. At each generation step, the model produces a probability distribution over the entire vocabulary. The algorithm then proceeds as follows:

1. Sort the tokens by probability in descending order.
2. Accumulate probabilities down the sorted list until the running total reaches or exceeds the threshold p; the tokens accumulated so far form the nucleus.
3. Discard all tokens outside the nucleus.
4. Renormalize the nucleus probabilities so they sum to 1.
5. Sample the next token from this truncated, renormalized distribution.
The result is a dynamically sized candidate set that expands when the model is uncertain (spreading probability across many tokens) and contracts when the model is confident (concentrating probability on a few tokens). This adaptive behavior is the defining advantage of top-p over top-k.
Suppose a language model produces the following probability distribution for the next token after the prompt "The weather today is":
| Token | Probability | Cumulative Probability |
|---|---|---|
| sunny | 0.35 | 0.35 |
| warm | 0.20 | 0.55 |
| nice | 0.15 | 0.70 |
| beautiful | 0.10 | 0.80 |
| cloudy | 0.07 | 0.87 |
| cold | 0.05 | 0.92 |
| perfect | 0.03 | 0.95 |
| terrible | 0.02 | 0.97 |
| ... | ... | ... |
With p = 0.9, the top 5 tokens (sunny, warm, nice, beautiful, cloudy) reach a cumulative probability of only 0.87, which falls short of 0.9, so the next token, "cold," is also included, bringing the cumulative sum to 0.92. The nucleus therefore contains 6 tokens. All remaining tokens are discarded, and the probabilities of the 6 nucleus tokens are renormalized to sum to 1 before sampling.
Now consider a more confident distribution after "The capital of France is":
| Token | Probability | Cumulative Probability |
|---|---|---|
| Paris | 0.96 | 0.96 |
| the | 0.01 | 0.97 |
| a | 0.005 | 0.975 |
| ... | ... | ... |
With p = 0.9, the nucleus contains just a single token, "Paris," because its probability alone exceeds the threshold. The method effectively becomes greedy in this case, which is exactly the right behavior.
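Both worked examples can be checked in a few lines of Python. The sketch below uses the probability lists from the tables (truncated to the tokens shown):

```python
from itertools import accumulate

def nucleus_size(probs, p):
    """Number of most-probable tokens needed to reach cumulative mass >= p.
    `probs` must already be sorted in descending order."""
    for i, cum in enumerate(accumulate(probs)):
        if cum >= p:
            return i + 1
    return len(probs)  # p not reached within the listed tokens

# "The weather today is": broad distribution
weather = [0.35, 0.20, 0.15, 0.10, 0.07, 0.05, 0.03, 0.02]
print(nucleus_size(weather, p=0.9))  # -> 6 (sunny through cold)

# "The capital of France is": confident distribution
capital = [0.96, 0.01, 0.005]
print(nucleus_size(capital, p=0.9))  # -> 1 (Paris alone)
```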
Formally, let V denote the vocabulary of the language model, and let P(x | x1:t-1) denote the probability the model assigns to token x given the preceding context. At each time step t, top-p sampling defines the nucleus V(p) as:
V(p) = argmin_{V′ ⊆ V} |V′|   subject to   Σ_{x ∈ V′} P(x | x1:t-1) ≥ p
where the tokens in V' are those with the highest probabilities. In other words, it is the smallest set of the most probable tokens whose cumulative mass reaches p.
The sampling distribution is then:
P′(x | x1:t-1) = P(x | x1:t-1) / Z  if x ∈ V(p),  and 0 otherwise
where Z = Σ_{x ∈ V(p)} P(x | x1:t-1) is the renormalization constant.
The parameter p ranges from 0 to 1. When p = 1, the nucleus includes the entire vocabulary and top-p sampling reduces to standard (untruncated) sampling. When p approaches 0, the nucleus shrinks to the single most probable token and the method converges to greedy decoding.
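The two limiting cases are easy to verify numerically. A toy sketch with a complete four-token distribution (values assumed for illustration):

```python
from itertools import accumulate

def nucleus_size(probs, p):
    """Tokens needed (in descending-probability order) to reach mass p."""
    for i, cum in enumerate(accumulate(probs)):
        if cum >= p:
            return i + 1
    return len(probs)

probs = [0.5, 0.3, 0.15, 0.05]  # complete toy distribution

# p near 1 keeps (essentially) the entire vocabulary
print(nucleus_size(probs, p=0.999))  # -> 4

# p near 0 reduces to greedy decoding
print(nucleus_size(probs, p=0.01))   # -> 1
```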
Top-p sampling sits within a broader family of decoding strategies. Understanding how these methods relate helps clarify when and why to use each one.
| Aspect | Top-p (Nucleus) | Top-k |
|---|---|---|
| Candidate set size | Dynamic, varies per step | Fixed at k tokens |
| Adapts to confidence | Yes, contracts when confident, expands when uncertain | No, always considers exactly k tokens |
| Risk of including irrelevant tokens | Low, only high-probability tokens are included | Higher when k is large relative to the nucleus |
| Risk of excluding valid tokens | Low when p is well-chosen | Higher when the distribution is flat and many tokens are plausible |
| Typical default | p = 0.9 to 1.0 | k = 40 to 50 (varies by provider) |
The fundamental difference is that top-k imposes a hard boundary on the number of candidates, while top-p adapts to the model's confidence. When the model assigns 95% probability to a single token, top-p naturally restricts to that one token, while top-k still considers k alternatives. When the model distributes probability evenly across 200 tokens, top-p includes all 200 if necessary, while top-k with k = 50 arbitrarily excludes 150 of them.
In practice, top-k and top-p can be combined. Many APIs allow setting both parameters simultaneously, in which case the intersection of the two candidate sets is used.
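The contrast between the two methods can be made concrete with two synthetic distributions. The sketch below (toy numbers, assumed for illustration) compares the candidate-set size of top-p (p = 0.9) against a fixed k = 50:

```python
from itertools import accumulate

def top_p_size(probs, p):
    """Candidate-set size for top-p; `probs` sorted in descending order."""
    for i, cum in enumerate(accumulate(probs)):
        if cum >= p:
            return i + 1
    return len(probs)

k = 50

# Peaked: one token holds 96% of the mass, the rest spread thinly
peaked = [0.96] + [0.04 / 999] * 999
print(top_p_size(peaked, 0.9), "vs k =", k)  # top-p keeps 1 token

# Flat: probability spread evenly over 128 plausible tokens
flat = [1 / 128] * 128
print(top_p_size(flat, 0.9), "vs k =", k)  # top-p keeps 116 tokens
```

With the peaked distribution, top-k would still sample among 50 candidates, 49 of them nearly irrelevant; with the flat one, top-k with k = 50 would arbitrarily exclude dozens of comparably plausible tokens.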
Temperature and top-p are often confused because both affect the randomness of generated text, but they operate differently.
Temperature modifies the probability distribution before any truncation. Given logits zi (the raw model outputs before softmax), the probability of token i at temperature T is:
P(xi) = exp(zi / T) / Σj exp(zj / T)
A temperature less than 1 sharpens the distribution (making probable tokens more probable and unlikely tokens less likely). A temperature greater than 1 flattens it (making the distribution more uniform). Temperature = 1 uses the model's raw probabilities unchanged.
Top-p operates after the softmax (and after temperature scaling, if applied). It truncates the distribution rather than reshaping it. The two parameters serve complementary roles: temperature controls the "shape" of the distribution, while top-p controls where the distribution is "cut off."
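The sharpening and flattening effects can be seen directly by evaluating the temperature-scaled softmax at several temperatures (toy logits, assumed for illustration):

```python
import math

def softmax_T(logits, T):
    """Temperature-scaled softmax: P(x_i) = exp(z_i/T) / sum_j exp(z_j/T)."""
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]
for T in (0.5, 1.0, 2.0):
    probs = softmax_T(logits, T)
    # Lower T -> larger max probability (sharper distribution)
    print(f"T={T}: max prob = {max(probs):.3f}")
```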
| Aspect | Temperature | Top-p |
|---|---|---|
| What it modifies | Shape of probability distribution | Which tokens are eligible for sampling |
| When applied | To the logits, before the softmax | After softmax and temperature scaling |
| Effect of increasing | Flatter distribution, more randomness | Larger nucleus, more candidate tokens |
| Effect of decreasing | Sharper distribution, more determinism | Smaller nucleus, fewer candidate tokens |
Because temperature is applied before top-p, the two parameters interact in important ways. A high temperature flattens the probability distribution, which means more tokens are needed to reach the cumulative threshold p, resulting in a larger nucleus and more diverse outputs. Conversely, a low temperature sharpens the distribution, concentrating mass on fewer tokens, so the nucleus shrinks even at the same p value.
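This interaction is easy to demonstrate: with the same logits and the same p, a higher temperature yields a larger nucleus. A sketch with made-up logits:

```python
import math
from itertools import accumulate

def nucleus_size(logits, p, T):
    """Nucleus size after temperature scaling and softmax."""
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    probs = sorted((e / total for e in exps), reverse=True)
    for i, cum in enumerate(accumulate(probs)):
        if cum >= p:
            return i + 1
    return len(probs)

logits = list(range(10, 0, -1))  # 10 tokens, logits 10 down to 1
print(nucleus_size(logits, p=0.9, T=0.5))  # sharp distribution -> 2
print(nucleus_size(logits, p=0.9, T=2.0))  # flat distribution  -> 5
```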
This interaction means that the "effective" randomness of generation depends on both settings together. Most API providers recommend adjusting either temperature or top-p, but not both simultaneously, to avoid unpredictable compounding effects. OpenAI's documentation, for example, suggests: "We generally recommend altering this or temperature but not both" [2].
Greedy decoding (always pick the most probable token) and beam search (track the top b sequences by cumulative log-probability) are deterministic methods that maximize some form of likelihood. They work well for tasks with a clear "correct" output, such as machine translation or summarization, where fidelity to a source is paramount. However, for open-ended generation (creative writing, dialogue, brainstorming), deterministic methods produce degenerate text. Nucleus sampling was specifically designed as an alternative for these open-ended settings [1].
Different providers set different default values for the top_p parameter, and recommended values vary by use case.
| Provider | Default top_p | Range | Notes |
|---|---|---|---|
| OpenAI | 1.0 | 0.0 to 1.0 | Default means no truncation; relies on temperature for randomness control [2] |
| Anthropic (Claude) | Not specified (left to model defaults) | 0.0 to 1.0 | Anthropic recommends adjusting only temperature for most use cases [3] |
| Google (Gemini) | 0.94 (varies by model) | 0.0 to 1.0 | Default varies across Gemini model versions [4] |
| Hugging Face | 1.0 | 0.0 to 1.0 | Default in the transformers library; no truncation unless explicitly set |
Note that a default of 1.0 effectively disables top-p filtering, since the nucleus includes the entire vocabulary. This means the model relies entirely on temperature (and optionally top-k) to control output randomness.
| Use Case | Suggested top_p | Temperature | Rationale |
|---|---|---|---|
| Factual Q&A, retrieval | 0.1 to 0.5 | 0.0 to 0.3 | Narrow nucleus keeps answers precise |
| Code generation | 0.9 to 0.95 | 0.2 to 0.4 | Slightly wider nucleus for syntactic variation; low temperature for correctness |
| Creative writing | 0.9 to 1.0 | 0.7 to 1.0 | Wide nucleus and higher temperature for diverse, surprising outputs |
| Dialogue / chatbots | 0.9 to 0.95 | 0.5 to 0.8 | Balance between coherence and natural variety |
| Structured output (JSON, XML) | 0.1 to 0.3 | 0.0 to 0.2 | Tight nucleus for syntactic correctness |
These are guidelines, not hard rules. The optimal settings depend on the specific model, the quality of the prompt, and the application's tolerance for variability.
The original motivation for nucleus sampling arose from careful empirical analysis of neural text degeneration. Holtzman et al. [1] identified several key observations:
Maximization-based decoding is fundamentally flawed for open-ended generation. Beam search and greedy decoding produce text that scores high in likelihood but low in quality by human judgment. The most probable sequence is not the most human-like sequence.
Language model probability distributions have an unreliable tail. Modern language models assign small but nonzero probability to an enormous number of tokens at each step. Many of these low-probability tokens are nonsensical in context. Sampling from this tail introduces incoherence.
The "nucleus" of the distribution is where the reliable probability mass lives. The authors showed that at each generation step, the vast majority of the probability mass is concentrated in a relatively small subset of the vocabulary, typically ranging from a single token to around a thousand tokens. This subset, the nucleus, captures the tokens the model is genuinely "considering."
The nucleus size varies dynamically. When the model is confident, the nucleus is tiny. When the model faces genuine ambiguity, the nucleus is large. A good decoding strategy should respect this variation rather than imposing a fixed candidate set.
These observations led directly to the design of top-p sampling: truncate the unreliable tail by including only enough tokens to cover a fraction p of the total probability mass, then sample from that truncated distribution.
"The Curious Case of Neural Text Degeneration" [1] was posted to arXiv in April 2019 and published at ICLR 2020. The authors were affiliated with the University of Washington and the Allen Institute for Artificial Intelligence (AI2).
The paper's contributions include:
Empirical analysis of text degeneration. The authors demonstrated that beam search produces text with abnormally low perplexity, high repetition rates, and poor human evaluations. They showed that the probability of human text under a language model is much lower than the probability of beam search output, revealing a fundamental mismatch.
Analysis of probability distributions. The paper examined how probability mass is distributed across tokens at each generation step, showing that the tail of the distribution is unreliable and that the nucleus (the high-probability core) varies dramatically in size from step to step.
Proposal of nucleus sampling. The method was shown to produce text that is more diverse, more coherent, and more closely resembling human writing than text produced by beam search, greedy decoding, top-k sampling, or untruncated sampling with temperature.
Human evaluation. Human raters consistently preferred text generated by nucleus sampling over text from other decoding methods in open-ended generation tasks.
The paper has been highly influential, accumulating hundreds of citations, and nucleus sampling has been adopted as a standard parameter in essentially all major LLM APIs.
In 2024, Finlayson et al. published "Closing the Curious Case of Neural Text Degeneration" at ICLR 2024 [5]. This follow-up paper provided a theoretical explanation for why truncation sampling methods like nucleus sampling work so well. The authors showed that neural language models implicitly learn a "smoothed" version of the true language distribution, and that truncation sampling effectively "desmooths" this distribution, recovering something closer to the true distribution.
Specifically, the paper demonstrated that language models tend to spread probability mass too thinly across unlikely tokens (a consequence of the softmax function and cross-entropy training). Truncation methods like top-p remove this excess tail mass, producing a distribution that more closely matches the actual distribution of human language. This theoretical grounding validated what practitioners had observed empirically for years: cutting off the tail improves generation quality.
In 2024, researchers proposed min-p sampling as an alternative to nucleus sampling, addressing a subtle flaw in top-p's behavior at higher temperatures [6]. Min-p was presented at ICLR 2025 in the paper "Turning Up the Heat: Min-p Sampling for Creative and Coherent LLM Outputs."
Instead of using a fixed cumulative probability threshold, min-p sets a dynamic cutoff based on the probability of the most likely token. Given a min-p parameter m (typically between 0.05 and 0.1), a token is included in the candidate set if and only if its probability is at least m times the probability of the most likely token:
P(x) ≥ m · max_{x′ ∈ V} P(x′)
For example, if the most probable token has probability 0.4 and m = 0.1, then any token with probability >= 0.04 is included. If the most probable token has probability 0.01 (a very flat distribution), the threshold drops to 0.001, automatically including more candidates.
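The min-p rule can be sketched in a few lines (pure illustration; real implementations such as those in vLLM or llama.cpp operate on batched logits):

```python
def min_p_filter(probs, m):
    """Keep tokens whose probability is at least m times the max, renormalized."""
    threshold = m * max(probs)
    kept = [p for p in probs if p >= threshold]
    total = sum(kept)
    return [p / total for p in kept]

# Example from the text: top token 0.4, m = 0.1 -> cutoff at 0.04
probs = [0.4, 0.2, 0.1, 0.05, 0.03, 0.01]
kept = min_p_filter(probs, m=0.1)
print(len(kept))  # -> 4 tokens survive (0.4, 0.2, 0.1, 0.05)
```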
Top-p has a known weakness at high temperatures. When temperature flattens the probability distribution, the cumulative threshold p is reached only after including a very large number of tokens, many of which are low-quality. Min-p avoids this problem because its threshold scales with the maximum probability: when the distribution is flat (all probabilities are low), the threshold drops proportionally, but it still excludes tokens that are much less probable than the best option.
Empirical evaluations showed that min-p produces more diverse outputs than top-p at higher temperatures while maintaining coherence. Human evaluators preferred min-p outputs for both quality and creativity [6].
As of early 2026, min-p is supported in several open-source inference frameworks, including vLLM, llama.cpp, and Hugging Face's text generation inference server. However, major commercial APIs (OpenAI, Anthropic, Google) have not yet added min-p as an exposed parameter. For practitioners using open-source models, min-p with values between 0.05 and 0.1 is increasingly recommended as a replacement for top-p, particularly for creative generation tasks.
Top-p sampling exists within a broader taxonomy of decoding methods. The following table summarizes the major approaches:
| Strategy | Type | Key Property | Best For |
|---|---|---|---|
| Greedy decoding | Deterministic | Always picks most probable token | Simple, low-stakes generation |
| Beam search | Deterministic | Tracks top b sequences by cumulative probability | Translation, summarization |
| Temperature sampling | Stochastic | Reshapes distribution; samples from full vocabulary | General-purpose randomness control |
| Top-k sampling | Stochastic | Samples from the k most probable tokens | Simple truncation; used when distribution shape is stable |
| Top-p (nucleus) sampling | Stochastic | Samples from the smallest set covering probability mass p | Open-ended generation where confidence varies |
| Min-p sampling | Stochastic | Includes tokens above a fraction of the max probability | Creative generation, especially at high temperatures |
| Eta sampling | Stochastic | Truncation based on conditional entropy | Entropy-aware generation |
| Mirostat | Stochastic | Targets a specific perplexity level | Controlling perceived quality/complexity |
| Contrastive decoding | Hybrid | Penalizes tokens that a weaker model also favors | Reducing generic/repetitive outputs |
| Speculative decoding | Optimization | Uses a draft model to speed up generation | Inference acceleration (does not change output distribution) |
In practice, these methods are often combined. A typical configuration might apply temperature scaling first, then top-p truncation, and finally sample from the resulting distribution. Some inference frameworks allow stacking top-k and top-p together, taking the intersection of both candidate sets.
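The stacking of top-k and top-p can be sketched as follows. Because both filters keep the highest-probability tokens, each candidate set is a prefix of the descending-sorted token list, so their intersection is simply the shorter of the two prefixes (toy probabilities, assumed for illustration):

```python
from itertools import accumulate

def combined_filter(probs, k, p):
    """Candidate token indices after applying both top-k and top-p.
    Both filters keep a prefix of the descending-sorted tokens, so the
    intersection is the shorter of the two prefixes."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    sorted_probs = [probs[i] for i in order]
    # Length of the top-p prefix
    n_p = len(probs)
    for i, cum in enumerate(accumulate(sorted_probs)):
        if cum >= p:
            n_p = i + 1
            break
    n = min(k, n_p)  # intersection of the two candidate sets
    return order[:n]

probs = [0.35, 0.20, 0.15, 0.10, 0.07, 0.05, 0.03, 0.02, 0.02, 0.01]
print(combined_filter(probs, k=4, p=0.9))   # top-k binds: [0, 1, 2, 3]
print(combined_filter(probs, k=50, p=0.9))  # top-p binds: 6 tokens
```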
Implementing top-p sampling is straightforward. A minimal NumPy version (for illustration; production implementations operate on batched tensors):

```python
import numpy as np

def top_p_sample(logits, p, temperature=1.0, rng=None):
    """Sample one token index using top-p (nucleus) sampling."""
    rng = rng if rng is not None else np.random.default_rng()
    # Apply temperature to the raw logits
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    # Convert to probabilities with a numerically stable softmax
    scaled -= scaled.max()
    probs = np.exp(scaled)
    probs /= probs.sum()
    # Sort in descending order
    order = np.argsort(probs)[::-1]
    sorted_probs = probs[order]
    # Cutoff: first index where the cumulative probability reaches p
    # (clamped so that p = 1.0 keeps the whole vocabulary despite rounding)
    cumulative = np.cumsum(sorted_probs)
    cutoff = min(int(np.searchsorted(cumulative, p)), len(probs) - 1)
    # Keep the nucleus (indices 0..cutoff) and renormalize
    nucleus = sorted_probs[: cutoff + 1]
    nucleus /= nucleus.sum()
    # Sample from the truncated, renormalized distribution
    return int(order[rng.choice(cutoff + 1, p=nucleus)])
```
The computational overhead compared to standard sampling is minimal: sorting the vocabulary is O(|V| log |V|) where |V| is the vocabulary size, and in practice efficient partial sorting or selection algorithms reduce this further. Modern GPU implementations in libraries like vLLM and Hugging Face Transformers handle this efficiently even for vocabulary sizes exceeding 200,000 tokens.
Here are concrete recommendations for practitioners working with top-p sampling:
Start with provider defaults. If you are using a commercial API, the default values are usually well-tuned. OpenAI defaults to top_p = 1.0, relying on temperature alone. Anthropic recommends adjusting only temperature for most tasks.
Adjust one parameter at a time. Changing temperature and top-p simultaneously makes it difficult to understand the effect of either change. Pick one to tune first.
Use lower top-p for precision tasks. When generating structured output (JSON, SQL, code with strict syntax), a lower top-p (0.1 to 0.5) combined with low temperature reduces the chance of syntactic errors.
Use higher top-p for creative tasks. For open-ended generation, storytelling, or brainstorming, top-p values of 0.9 to 0.95 with moderate temperature (0.7 to 1.0) encourage diverse and interesting outputs.
Be aware of the temperature interaction. If you set a high temperature (e.g., 1.5) and a high top-p (e.g., 0.95), the combination may produce incoherent text because the flattened distribution includes many low-quality tokens in the nucleus. In such cases, consider using min-p instead, if your inference framework supports it.
Test empirically. The optimal settings vary by model, task, and prompt. There is no universal "best" configuration. Systematic evaluation on a representative sample of inputs is the most reliable way to find good settings.
Top-p sampling remains the dominant truncation method in commercial LLM APIs. Every major provider (OpenAI, Anthropic, Google, Cohere, Mistral) exposes top_p as a generation parameter. The method is well-understood, easy to implement, and effective across a wide range of tasks.
Several trends are shaping its future:
Min-p is gaining ground in open-source. As demonstrated by its acceptance at ICLR 2025 as an oral presentation, min-p is increasingly viewed as a superior alternative, particularly for high-temperature creative generation [6]. Its adoption in open-source inference frameworks (vLLM, llama.cpp, Hugging Face TGI) suggests it may eventually be added to commercial APIs as well.
Reasoning models use different strategies. Models optimized for chain-of-thought reasoning, such as OpenAI's o-series and DeepSeek R1, often use locked or constrained decoding parameters. For these models, the provider may override user-specified temperature and top-p settings to ensure consistent reasoning quality.
Research continues on adaptive sampling. Methods like eta sampling (which uses conditional entropy to set the truncation point) and Mirostat (which targets a specific perplexity level) represent ongoing efforts to build samplers that require less manual tuning. The trajectory of research points toward methods that automatically adapt to context without requiring users to specify parameters like p or k at all.
Theoretical understanding is deepening. The 2024 work by Finlayson et al. [5] on truncation sampling as "desmoothing" provided the first rigorous theoretical framework for understanding why nucleus sampling works. This kind of theoretical grounding may inform the design of even better sampling methods in the future.
Despite these developments, top-p sampling is unlikely to disappear anytime soon. Its simplicity, effectiveness, and universal availability across APIs make it the practical default for most generation tasks. For the foreseeable future, understanding top-p remains essential knowledge for anyone building applications on top of large language models.