Top-p sampling, also called nucleus sampling, is a stochastic decoding method for text generation in which the model samples from the smallest possible set of tokens whose cumulative probability mass exceeds a threshold p. The technique was introduced by Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi in the 2019 paper "The Curious Case of Neural Text Degeneration" (published at ICLR 2020). It addressed a persistent problem in open-ended generation: deterministic decoding methods like greedy search and beam search produce text that is repetitive, generic, and ultimately degenerate, while naive stochastic sampling can produce incoherent output by drawing from the unreliable tail of the probability distribution [1].
Top-p sampling has since become one of the most widely used generation parameters in large language model APIs. OpenAI, Anthropic, Google, Meta, and virtually every other LLM provider expose a top_p parameter, making it a foundational concept for anyone working with generative AI.
Before nucleus sampling was proposed, the two dominant approaches to generating text from neural language models were deterministic decoding (greedy search, beam search) and simple stochastic sampling, sometimes combined with temperature scaling or top-k sampling.
Greedy decoding selects the single most probable token at every step. It is fast and deterministic but tends to get trapped in repetitive loops. Beam search extends this by tracking the k most probable sequences simultaneously, but Holtzman et al. demonstrated that it suffers from the same fundamental problem: maximization-based decoding is simply inappropriate for open-ended text generation. The text it produces is bland and repetitive, bearing little resemblance to natural human writing, which exhibits far more variability and surprise [1].
One of the key insights of the paper was that human-written text does not follow a maximum-likelihood trajectory through the probability space. At any given point, humans frequently choose words that are plausible but not the single most probable continuation. A good decoding method needs to capture this stochastic quality.
Top-k sampling, which restricts sampling to the k most probable tokens at each step, partially addresses the issue. By removing low-probability tokens from consideration, it prevents the model from producing highly improbable (and often incoherent) outputs. However, top-k has a critical limitation: it uses a fixed number of candidate tokens regardless of the shape of the probability distribution.
Consider two scenarios. In the first, the model is highly confident about the next word: perhaps the phrase "The capital of France is" gives nearly all probability mass to the token "Paris." Here, even k = 5 might include irrelevant tokens. In the second scenario, the model faces genuine ambiguity: after "My favorite food is," dozens or even hundreds of tokens could be reasonable continuations. With a fixed k = 5, the model is forced to ignore many perfectly valid options.
This mismatch between a fixed vocabulary size and the dynamic shape of the probability distribution is what nucleus sampling was designed to solve.
The mechanism of top-p sampling is straightforward. At each generation step, the model produces a probability distribution over the entire vocabulary. The algorithm then proceeds as follows:

1. Sort the tokens by probability in descending order.
2. Accumulate probabilities down the sorted list until the running total reaches or exceeds the threshold p; the tokens accumulated so far form the nucleus.
3. Discard all tokens outside the nucleus.
4. Renormalize the nucleus probabilities so they sum to 1.
5. Sample the next token from this truncated, renormalized distribution.
The result is a dynamically sized candidate set that expands when the model is uncertain (spreading probability across many tokens) and contracts when the model is confident (concentrating probability on a few tokens). This adaptive behavior is the defining advantage of top-p over top-k.
Suppose a language model produces the following probability distribution for the next token after the prompt "The weather today is":
| Token | Probability | Cumulative Probability |
|---|---|---|
| sunny | 0.35 | 0.35 |
| warm | 0.20 | 0.55 |
| nice | 0.15 | 0.70 |
| beautiful | 0.10 | 0.80 |
| cloudy | 0.07 | 0.87 |
| cold | 0.05 | 0.92 |
| perfect | 0.03 | 0.95 |
| terrible | 0.02 | 0.97 |
| ... | ... | ... |
With p = 0.9, the top 5 tokens (sunny, warm, nice, beautiful, cloudy) reach a cumulative probability of only 0.87, which falls short of 0.9, so the next token, "cold," is also included, bringing the cumulative sum to 0.92. The nucleus therefore contains 6 tokens. All remaining tokens are discarded, and the probabilities of the 6 nucleus tokens are renormalized to sum to 1 before sampling.
Now consider a more confident distribution after "The capital of France is":
| Token | Probability | Cumulative Probability |
|---|---|---|
| Paris | 0.96 | 0.96 |
| the | 0.01 | 0.97 |
| a | 0.005 | 0.975 |
| ... | ... | ... |
With p = 0.9, the nucleus contains just a single token, "Paris," because its probability alone exceeds the threshold. The method effectively becomes greedy in this case, which is exactly the right behavior.
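Both worked examples can be checked in a few lines of Python. The sketch below uses the probability lists from the tables (truncated to the tokens shown):

```python
from itertools import accumulate

def nucleus_size(probs, p):
    """Number of most-probable tokens needed to reach cumulative mass >= p.
    `probs` must already be sorted in descending order."""
    for i, cum in enumerate(accumulate(probs)):
        if cum >= p:
            return i + 1
    return len(probs)  # p not reached within the listed tokens

# "The weather today is": broad distribution
weather = [0.35, 0.20, 0.15, 0.10, 0.07, 0.05, 0.03, 0.02]
print(nucleus_size(weather, p=0.9))  # -> 6 (sunny through cold)

# "The capital of France is": confident distribution
capital = [0.96, 0.01, 0.005]
print(nucleus_size(capital, p=0.9))  # -> 1 (Paris alone)
```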
Formally, let V denote the vocabulary of the language model, and let P(x | x1:t-1) denote the probability the model assigns to token x given the preceding context. At each time step t, top-p sampling defines the nucleus V(p) as:
V(p) = argmin_{V′ ⊆ V} |V′|   subject to   Σ_{x ∈ V′} P(x | x1:t-1) ≥ p
where the tokens in V' are those with the highest probabilities. In other words, it is the smallest set of the most probable tokens whose cumulative mass reaches p.
The sampling distribution is then:
P′(x | x1:t-1) = P(x | x1:t-1) / Z  if x ∈ V(p),  and 0 otherwise
where Z = Σ_{x ∈ V(p)} P(x | x1:t-1) is the renormalization constant.
The parameter p ranges from 0 to 1. When p = 1, the nucleus includes the entire vocabulary and top-p sampling reduces to standard (untruncated) sampling. When p approaches 0, the nucleus shrinks to the single most probable token and the method converges to greedy decoding.
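The two limiting cases are easy to verify numerically. A toy sketch with a complete four-token distribution (values assumed for illustration):

```python
from itertools import accumulate

def nucleus_size(probs, p):
    """Tokens needed (in descending-probability order) to reach mass p."""
    for i, cum in enumerate(accumulate(probs)):
        if cum >= p:
            return i + 1
    return len(probs)

probs = [0.5, 0.3, 0.15, 0.05]  # complete toy distribution

# p near 1 keeps (essentially) the entire vocabulary
print(nucleus_size(probs, p=0.999))  # -> 4

# p near 0 reduces to greedy decoding
print(nucleus_size(probs, p=0.01))   # -> 1
```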
Top-p sampling sits within a broader family of decoding strategies. Understanding how these methods relate helps clarify when and why to use each one.
| Aspect | Top-p (Nucleus) | Top-k |
|---|---|---|
| Candidate set size | Dynamic, varies per step | Fixed at k tokens |
| Adapts to confidence | Yes, contracts when confident, expands when uncertain | No, always considers exactly k tokens |
| Risk of including irrelevant tokens | Low, only high-probability tokens are included | Higher when k is large relative to the nucleus |
| Risk of excluding valid tokens | Low when p is well-chosen | Higher when the distribution is flat and many tokens are plausible |
| Typical default | p = 0.9 to 1.0 | k = 40 to 50 (varies by provider) |
The fundamental difference is that top-k imposes a hard boundary on the number of candidates, while top-p adapts to the model's confidence. When the model assigns 95% probability to a single token, top-p naturally restricts to that one token, while top-k still considers k alternatives. When the model distributes probability evenly across 200 tokens, top-p includes all 200 if necessary, while top-k with k = 50 arbitrarily excludes 150 of them.
In practice, top-k and top-p can be combined. Many APIs allow setting both parameters simultaneously, in which case the intersection of the two candidate sets is used.
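The contrast between the two methods can be made concrete with two synthetic distributions. The sketch below (toy numbers, assumed for illustration) compares the candidate-set size of top-p (p = 0.9) against a fixed k = 50:

```python
from itertools import accumulate

def top_p_size(probs, p):
    """Candidate-set size for top-p; `probs` sorted in descending order."""
    for i, cum in enumerate(accumulate(probs)):
        if cum >= p:
            return i + 1
    return len(probs)

k = 50

# Peaked: one token holds 96% of the mass, the rest spread thinly
peaked = [0.96] + [0.04 / 999] * 999
print(top_p_size(peaked, 0.9), "vs k =", k)  # top-p keeps 1 token

# Flat: probability spread evenly over 128 plausible tokens
flat = [1 / 128] * 128
print(top_p_size(flat, 0.9), "vs k =", k)  # top-p keeps 116 tokens
```

With the peaked distribution, top-k would still sample among 50 candidates, 49 of them nearly irrelevant; with the flat one, top-k with k = 50 would arbitrarily exclude dozens of comparably plausible tokens.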
Temperature and top-p are often confused because both affect the randomness of generated text, but they operate differently.
Temperature modifies the probability distribution before any truncation. Given logits zi (the raw model outputs before softmax), the probability of token i at temperature T is:
P(xi) = exp(zi / T) / Σj exp(zj / T)
A temperature less than 1 sharpens the distribution (making probable tokens more probable and unlikely tokens less likely). A temperature greater than 1 flattens it (making the distribution more uniform). Temperature = 1 uses the model's raw probabilities unchanged.
Top-p operates after the softmax (and after temperature scaling, if applied). It truncates the distribution rather than reshaping it. The two parameters serve complementary roles: temperature controls the "shape" of the distribution, while top-p controls where the distribution is "cut off."
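The sharpening and flattening effects can be seen directly by evaluating the temperature-scaled softmax at several temperatures (toy logits, assumed for illustration):

```python
import math

def softmax_T(logits, T):
    """Temperature-scaled softmax: P(x_i) = exp(z_i/T) / sum_j exp(z_j/T)."""
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]
for T in (0.5, 1.0, 2.0):
    probs = softmax_T(logits, T)
    # Lower T -> larger max probability (sharper distribution)
    print(f"T={T}: max prob = {max(probs):.3f}")
```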
| Aspect | Temperature | Top-p |
|---|---|---|
| What it modifies | Shape of probability distribution | Which tokens are eligible for sampling |
| When applied | To the logits, before the softmax | After softmax and temperature scaling |
| Effect of increasing | Flatter distribution, more randomness | Larger nucleus, more candidate tokens |
| Effect of decreasing | Sharper distribution, more determinism | Smaller nucleus, fewer candidate tokens |
Because temperature is applied before top-p, the two parameters interact in important ways. A high temperature flattens the probability distribution, which means more tokens are needed to reach the cumulative threshold p, resulting in a larger nucleus and more diverse outputs. Conversely, a low temperature sharpens the distribution, concentrating mass on fewer tokens, so the nucleus shrinks even at the same p value.
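This interaction is easy to demonstrate: with the same logits and the same p, a higher temperature yields a larger nucleus. A sketch with made-up logits:

```python
import math
from itertools import accumulate

def nucleus_size(logits, p, T):
    """Nucleus size after temperature scaling and softmax."""
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    probs = sorted((e / total for e in exps), reverse=True)
    for i, cum in enumerate(accumulate(probs)):
        if cum >= p:
            return i + 1
    return len(probs)

logits = list(range(10, 0, -1))  # 10 tokens, logits 10 down to 1
print(nucleus_size(logits, p=0.9, T=0.5))  # sharp distribution -> 2
print(nucleus_size(logits, p=0.9, T=2.0))  # flat distribution  -> 5
```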
This interaction means that the "effective" randomness of generation depends on both settings together. Most API providers recommend adjusting either temperature or top-p, but not both simultaneously, to avoid unpredictable compounding effects. OpenAI's documentation, for example, suggests: "We generally recommend altering this or temperature but not both" [2].
Greedy decoding (always pick the most probable token) and beam search (track the top b sequences by cumulative log-probability) are deterministic methods that maximize some form of likelihood. They work well for tasks with a clear "correct" output, such as machine translation or summarization, where fidelity to a source is paramount. However, for open-ended generation (creative writing, dialogue, brainstorming), deterministic methods produce degenerate text. Nucleus sampling was specifically designed as an alternative for these open-ended settings [1].
Different providers set different default values for the top_p parameter, and recommended values vary by use case.
| Provider | Default top_p | Range | Notes |
|---|---|---|---|
| OpenAI | 1.0 | 0.0 to 1.0 | Default means no truncation; relies on temperature for randomness control [2] |
| Anthropic (Claude) | Not specified (left to model defaults) | 0.0 to 1.0 | Anthropic recommends adjusting only temperature for most use cases [3] |
| Google (Gemini) | 0.94 (varies by model) | 0.0 to 1.0 | Default varies across Gemini model versions [4] |
| Hugging Face | 1.0 | 0.0 to 1.0 | Default in the transformers library; no truncation unless explicitly set |
Note that a default of 1.0 effectively disables top-p filtering, since the nucleus includes the entire vocabulary. This means the model relies entirely on temperature (and optionally top-k) to control output randomness.
| Use Case | Suggested top_p | Temperature | Rationale |
|---|---|---|---|
| Factual Q&A, retrieval | 0.1 to 0.5 | 0.0 to 0.3 | Narrow nucleus keeps answers precise |
| Code generation | 0.9 to 0.95 | 0.2 to 0.4 | Slightly wider nucleus for syntactic variation; low temperature for correctness |
| Creative writing | 0.9 to 1.0 | 0.7 to 1.0 | Wide nucleus and higher temperature for diverse, surprising outputs |
| Dialogue / chatbots | 0.9 to 0.95 | 0.5 to 0.8 | Balance between coherence and natural variety |
| Structured output (JSON, XML) | 0.1 to 0.3 | 0.0 to 0.2 | Tight nucleus for syntactic correctness |
These are guidelines, not hard rules. The optimal settings depend on the specific model, the quality of the prompt, and the application's tolerance for variability.
The original motivation for nucleus sampling arose from careful empirical analysis of neural text degeneration. Holtzman et al. [1] identified several key observations:
Maximization-based decoding is fundamentally flawed for open-ended generation. Beam search and greedy decoding produce text that scores high in likelihood but low in quality by human judgment. The most probable sequence is not the most human-like sequence.
Language model probability distributions have an unreliable tail. Modern language models assign small but nonzero probability to an enormous number of tokens at each step. Many of these low-probability tokens are nonsensical in context. Sampling from this tail introduces incoherence.
The "nucleus" of the distribution is where the reliable probability mass lives. The authors showed that at each generation step, the vast majority of the probability mass is concentrated in a relatively small subset of the vocabulary, typically ranging from a single token to around a thousand tokens. This subset, the nucleus, captures the tokens the model is genuinely "considering."
The nucleus size varies dynamically. When the model is confident, the nucleus is tiny. When the model faces genuine ambiguity, the nucleus is large. A good decoding strategy should respect this variation rather than imposing a fixed candidate set.
These observations led directly to the design of top-p sampling: truncate the unreliable tail by including only enough tokens to cover a fraction p of the total probability mass, then sample from that truncated distribution.
"The Curious Case of Neural Text Degeneration" [1] was posted to arXiv in April 2019 and published at ICLR 2020. The authors were affiliated with the University of Washington and the Allen Institute for Artificial Intelligence (AI2).
The paper's contributions include:
Empirical analysis of text degeneration. The authors demonstrated that beam search produces text with abnormally low perplexity, high repetition rates, and poor human evaluations. They showed that the probability of human text under a language model is much lower than the probability of beam search output, revealing a fundamental mismatch.
Analysis of probability distributions. The paper examined how probability mass is distributed across tokens at each generation step, showing that the tail of the distribution is unreliable and that the nucleus (the high-probability core) varies dramatically in size from step to step.
Proposal of nucleus sampling. The method was shown to produce text that is more diverse, more coherent, and more closely resembling human writing than text produced by beam search, greedy decoding, top-k sampling, or untruncated sampling with temperature.
Human evaluation. Human raters consistently preferred text generated by nucleus sampling over text from other decoding methods in open-ended generation tasks.
The paper has been highly influential, accumulating hundreds of citations, and nucleus sampling has been adopted as a standard parameter in essentially all major LLM APIs.
In 2024, Finlayson et al. published "Closing the Curious Case of Neural Text Degeneration" at ICLR 2024 [5]. This follow-up paper provided a theoretical explanation for why truncation sampling methods like nucleus sampling work so well. The authors showed that neural language models implicitly learn a "smoothed" version of the true language distribution, and that truncation sampling effectively "desmooths" this distribution, recovering something closer to the true distribution.
Specifically, the paper demonstrated that language models tend to spread probability mass too thinly across unlikely tokens (a consequence of the softmax function and cross-entropy training). Truncation methods like top-p remove this excess tail mass, producing a distribution that more closely matches the actual distribution of human language. This theoretical grounding validated what practitioners had observed empirically for years: cutting off the tail improves generation quality.
In 2024, researchers proposed min-p sampling as an alternative to nucleus sampling, addressing a subtle flaw in top-p's behavior at higher temperatures [6]. Min-p was presented at ICLR 2025 in the paper "Turning Up the Heat: Min-p Sampling for Creative and Coherent LLM Outputs."
Instead of using a fixed cumulative probability threshold, min-p sets a dynamic cutoff based on the probability of the most likely token. Given a min-p parameter m (typically between 0.05 and 0.1), a token is included in the candidate set if and only if its probability is at least m times the probability of the most likely token:
P(x) ≥ m · max_{x′ ∈ V} P(x′)
For example, if the most probable token has probability 0.4 and m = 0.1, then any token with probability >= 0.04 is included. If the most probable token has probability 0.01 (a very flat distribution), the threshold drops to 0.001, automatically including more candidates.
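The min-p rule can be sketched in a few lines (pure illustration; real implementations such as those in vLLM or llama.cpp operate on batched logits):

```python
def min_p_filter(probs, m):
    """Keep tokens whose probability is at least m times the max, renormalized."""
    threshold = m * max(probs)
    kept = [p for p in probs if p >= threshold]
    total = sum(kept)
    return [p / total for p in kept]

# Example from the text: top token 0.4, m = 0.1 -> cutoff at 0.04
probs = [0.4, 0.2, 0.1, 0.05, 0.03, 0.01]
kept = min_p_filter(probs, m=0.1)
print(len(kept))  # -> 4 tokens survive (0.4, 0.2, 0.1, 0.05)
```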
Top-p has a known weakness at high temperatures. When temperature flattens the probability distribution, the cumulative threshold p is reached only after including a very large number of tokens, many of which are low-quality. Min-p avoids this problem because its threshold scales with the maximum probability: when the distribution is flat (all probabilities are low), the threshold drops proportionally, but it still excludes tokens that are much less probable than the best option.
Empirical evaluations showed that min-p produces more diverse outputs than top-p at higher temperatures while maintaining coherence. Human evaluators preferred min-p outputs for both quality and creativity [6].
As of early 2026, min-p is supported in several open-source inference frameworks, including vLLM, llama.cpp, and Hugging Face's text generation inference server. However, major commercial APIs (OpenAI, Anthropic, Google) have not yet added min-p as an exposed parameter. For practitioners using open-source models, min-p with values between 0.05 and 0.1 is increasingly recommended as a replacement for top-p, particularly for creative generation tasks.
Top-p sampling exists within a broader taxonomy of decoding methods. The following table summarizes the major approaches:
| Strategy | Type | Key Property | Best For |
|---|---|---|---|
| Greedy decoding | Deterministic | Always picks most probable token | Simple, low-stakes generation |
| Beam search | Deterministic | Tracks top b sequences by cumulative probability | Translation, summarization |
| Temperature sampling | Stochastic | Reshapes distribution; samples from full vocabulary | General-purpose randomness control |
| Top-k sampling | Stochastic | Samples from the k most probable tokens | Simple truncation; used when distribution shape is stable |
| Top-p (nucleus) sampling | Stochastic | Samples from the smallest set covering probability mass p | Open-ended generation where confidence varies |
| Min-p sampling | Stochastic | Includes tokens above a fraction of the max probability | Creative generation, especially at high temperatures |
| Eta sampling | Stochastic | Truncation based on conditional entropy | Entropy-aware generation |
| Mirostat | Stochastic | Targets a specific perplexity level | Controlling perceived quality/complexity |
| Contrastive decoding | Hybrid | Penalizes tokens that a weaker model also favors | Reducing generic/repetitive outputs |
| Speculative decoding | Optimization | Uses a draft model to speed up generation | Inference acceleration (does not change output distribution) |
In practice, these methods are often combined. A typical configuration might apply temperature scaling first, then top-p truncation, and finally sample from the resulting distribution. Some inference frameworks allow stacking top-k and top-p together, taking the intersection of both candidate sets.
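The stacking of top-k and top-p can be sketched as follows. Because both filters keep the highest-probability tokens, each candidate set is a prefix of the descending-sorted token list, so their intersection is simply the shorter of the two prefixes (toy probabilities, assumed for illustration):

```python
from itertools import accumulate

def combined_filter(probs, k, p):
    """Candidate token indices after applying both top-k and top-p.
    Both filters keep a prefix of the descending-sorted tokens, so the
    intersection is the shorter of the two prefixes."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    sorted_probs = [probs[i] for i in order]
    # Length of the top-p prefix
    n_p = len(probs)
    for i, cum in enumerate(accumulate(sorted_probs)):
        if cum >= p:
            n_p = i + 1
            break
    n = min(k, n_p)  # intersection of the two candidate sets
    return order[:n]

probs = [0.35, 0.20, 0.15, 0.10, 0.07, 0.05, 0.03, 0.02, 0.02, 0.01]
print(combined_filter(probs, k=4, p=0.9))   # top-k binds: [0, 1, 2, 3]
print(combined_filter(probs, k=50, p=0.9))  # top-p binds: 6 tokens
```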
Implementing top-p sampling is straightforward. A minimal NumPy version (for illustration; production implementations operate on batched tensors):

```python
import numpy as np

def top_p_sample(logits, p, temperature=1.0, rng=None):
    """Sample one token index using top-p (nucleus) sampling."""
    rng = rng if rng is not None else np.random.default_rng()
    # Apply temperature to the raw logits
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    # Convert to probabilities with a numerically stable softmax
    scaled -= scaled.max()
    probs = np.exp(scaled)
    probs /= probs.sum()
    # Sort in descending order
    order = np.argsort(probs)[::-1]
    sorted_probs = probs[order]
    # Cutoff: first index where the cumulative probability reaches p
    # (clamped so that p = 1.0 keeps the whole vocabulary despite rounding)
    cumulative = np.cumsum(sorted_probs)
    cutoff = min(int(np.searchsorted(cumulative, p)), len(probs) - 1)
    # Keep the nucleus (indices 0..cutoff) and renormalize
    nucleus = sorted_probs[: cutoff + 1]
    nucleus /= nucleus.sum()
    # Sample from the truncated, renormalized distribution
    return int(order[rng.choice(cutoff + 1, p=nucleus)])
```
The computational overhead compared to standard sampling is minimal: sorting the vocabulary is O(|V| log |V|) where |V| is the vocabulary size, and in practice efficient partial sorting or selection algorithms reduce this further. Modern GPU implementations in libraries like vLLM and Hugging Face Transformers handle this efficiently even for vocabulary sizes exceeding 200,000 tokens.
Here are concrete recommendations for practitioners working with top-p sampling:
Start with provider defaults. If you are using a commercial API, the default values are usually well-tuned. OpenAI defaults to top_p = 1.0, relying on temperature alone. Anthropic recommends adjusting only temperature for most tasks.
Adjust one parameter at a time. Changing temperature and top-p simultaneously makes it difficult to understand the effect of either change. Pick one to tune first.
Use lower top-p for precision tasks. When generating structured output (JSON, SQL, code with strict syntax), a lower top-p (0.1 to 0.5) combined with low temperature reduces the chance of syntactic errors.
Use higher top-p for creative tasks. For open-ended generation, storytelling, or brainstorming, top-p values of 0.9 to 0.95 with moderate temperature (0.7 to 1.0) encourage diverse and interesting outputs.
Be aware of the temperature interaction. If you set a high temperature (e.g., 1.5) and a high top-p (e.g., 0.95), the combination may produce incoherent text because the flattened distribution includes many low-quality tokens in the nucleus. In such cases, consider using min-p instead, if your inference framework supports it.
Test empirically. The optimal settings vary by model, task, and prompt. There is no universal "best" configuration. Systematic evaluation on a representative sample of inputs is the most reliable way to find good settings.
Top-p sampling remains the dominant truncation method in commercial LLM APIs. Every major provider (OpenAI, Anthropic, Google, Cohere, Mistral) exposes top_p as a generation parameter. The method is well-understood, easy to implement, and effective across a wide range of tasks.
Several trends are shaping its future:
Min-p is gaining ground in open-source. As demonstrated by its acceptance at ICLR 2025 as an oral presentation, min-p is increasingly viewed as a superior alternative, particularly for high-temperature creative generation [6]. Its adoption in open-source inference frameworks (vLLM, llama.cpp, Hugging Face TGI) suggests it may eventually be added to commercial APIs as well.
Reasoning models use different strategies. Models optimized for chain-of-thought reasoning, such as OpenAI's o-series and DeepSeek R1, often use locked or constrained decoding parameters. For these models, the provider may override user-specified temperature and top-p settings to ensure consistent reasoning quality.
Research continues on adaptive sampling. Methods like eta sampling (which uses conditional entropy to set the truncation point) and Mirostat (which targets a specific perplexity level) represent ongoing efforts to build samplers that require less manual tuning. The trajectory of research points toward methods that automatically adapt to context without requiring users to specify parameters like p or k at all.
Theoretical understanding is deepening. The 2024 work by Finlayson et al. [5] on truncation sampling as "desmoothing" provided the first rigorous theoretical framework for understanding why nucleus sampling works. This kind of theoretical grounding may inform the design of even better sampling methods in the future.
Despite these developments, top-p sampling is unlikely to disappear anytime soon. Its simplicity, effectiveness, and universal availability across APIs make it the practical default for most generation tasks. For the foreseeable future, understanding top-p remains essential knowledge for anyone building applications on top of large language models.