# Top-p and top-k sampling

> Source: https://aiwiki.ai/wiki/top_p_sampling
> Updated: 2026-06-23
> Categories: Large Language Models, Machine Learning, Natural Language Processing
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

**[Top-p sampling](/wiki/top_p)**, also called **nucleus sampling**, is a text-generation decoding method for [large language models](/wiki/large_language_model) (LLMs). At each step it samples the next token from the smallest set of highest-probability tokens whose cumulative probability is at least a threshold **p** (for example p = 0.9 or p = 0.95). It was introduced by Holtzman et al. in "The Curious Case of Neural Text Degeneration" (arXiv 2019, published at ICLR 2020). Its defining property is that the size of this candidate set, the "nucleus," expands and contracts with the shape of the distribution, unlike [top-k sampling](/wiki/top_k_sampling), which always keeps the same number of tokens. [1]

**Top-k sampling** is the related, simpler method that keeps a fixed number k of the highest-probability tokens. Top-k was popularized by Fan et al. (2018) for story generation, while top-p was proposed to fix top-k's inability to adapt to how confident the model is at each step. [1][2] Both are decoding strategies used in LLMs and other autoregressive [neural networks](/wiki/neural_network) to control the randomness and quality of generated text: they filter the vocabulary at each generation step, restricting the pool of candidate tokens before sampling. They are fundamental tools for balancing creativity, coherence, and diversity across applications such as chatbots, creative writing assistants, and code generators. Since their introduction, newer methods such as **min-p sampling** have emerged to address remaining limitations.

## What problem does top-p sampling solve?

Holtzman et al.'s paper identified a critical problem with [greedy decoding](/wiki/greedy_decoding) and [beam search](/wiki/beam_search): even high-quality language models produce text that is bland and repetitive when decoded with maximization-based methods. The paper states that "using likelihood as a decoding objective leads to text that is bland and strangely repetitive," and shows that "decoding strategies alone can dramatically effect the quality of machine text, even when generated from exactly the same neural language model." [1]

The authors' key observation is distributional: neural language models concentrate the "vast majority of probability mass at each time step" in a nucleus, "a small subset of the vocabulary that tends to range between one and a thousand candidates," while scattering the remaining probability across a long, unreliable tail. [1] Top-p sampling captures the natural variability of human language by drawing only from this nucleus, described in the paper as "sampling text from the dynamic nucleus of the probability distribution, which allows for diversity while effectively truncating the less reliable tail of the distribution." [1] In experiments the authors used **p = 0.95** as their main setting. [1]

Human language, by contrast with maximization-based output, has a degree of unpredictability that makes it engaging. Holtzman et al. measured this quantitatively, showing that human text has a much higher variance in per-token probability than text generated by greedy or beam search. [1]

## Overview of token sampling

At each step of autoregressive text generation, a language model produces a probability distribution over its entire vocabulary. The model assigns a probability to every token, and the generation method determines which token is actually selected. The simplest approach, [greedy decoding](/wiki/greedy_decoding), always picks the token with the highest probability. While deterministic and fast, greedy decoding tends to produce repetitive, generic text. [1]

Sampling-based methods introduce randomness by drawing a token from the probability distribution rather than always picking the most likely one. Pure random sampling (drawing from the full distribution) can produce incoherent text because it sometimes selects very unlikely tokens. Top-k and top-p sampling address this by truncating the distribution, keeping only the most plausible candidates before sampling.
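The difference between greedy decoding and pure sampling can be sketched in a few lines. The token names and probabilities below are made up for illustration, and `greedy_pick` and `sample_pick` are hypothetical helper names, not part of any library.

```python
import random

# Hypothetical next-token distribution (illustrative values only).
probs = {"the": 0.50, "a": 0.30, "this": 0.15, "zebra": 0.04, "qux": 0.01}

def greedy_pick(dist):
    """Greedy decoding: always return the single most probable token."""
    return max(dist, key=dist.get)

def sample_pick(dist, rng):
    """Pure sampling: draw one token in proportion to its probability."""
    tokens = list(dist)
    weights = [dist[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

rng = random.Random(0)  # seeded so repeated runs give the same draws
print(greedy_pick(probs))                            # always "the"
print([sample_pick(probs, rng) for _ in range(10)])  # varies with the seed
```

Because pure sampling draws from the whole distribution, the unlikely tokens ("zebra", "qux") can occasionally be selected, which is the behavior that truncation methods are meant to suppress.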

## How does temperature interact with sampling?

[Temperature](/wiki/temperature) is a scaling parameter applied to the logits (raw model outputs) before converting them to probabilities via the softmax function. It controls how "peaked" or "flat" the probability distribution is. The parameter traces back to statistical mechanics and was used in neural networks as early as Boltzmann machines (Ackley, Hinton, & Sejnowski, 1985). [7]

| Temperature value | Effect on distribution | Behavior |
|-------------------|----------------------|----------|
| T = 1.0 | No change; uses the raw distribution | Standard model output |
| T < 1.0 (e.g., 0.3) | Sharpens the distribution; high-probability tokens become more dominant | More deterministic, focused output |
| T > 1.0 (e.g., 1.5) | Flattens the distribution; low-probability tokens get relatively higher chances | More random, creative, potentially incoherent output |
| T approaching 0 | Distribution collapses to a spike on the top token | Equivalent to greedy decoding |

Mathematically, the probability of token i at temperature T is computed as:

**P(i) = exp(logit_i / T) / sum_j exp(logit_j / T)**

where the sum runs over every token j in the vocabulary.

Temperature is typically applied before top-k or top-p filtering. It is not a sampling method itself but a preprocessing step that modifies the distribution before other truncation methods are applied.
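A minimal sketch of temperature scaling, assuming plain Python lists of logits. The function name `softmax_with_temperature` and the example logits are illustrative and not taken from any particular library.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; use argmax for T = 0")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # hypothetical logits for three tokens
for t in (0.3, 1.0, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Running this shows the top token's share growing as T falls below 1 and shrinking as T rises above 1, matching the table above.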

## How does top-k sampling work?

Top-k sampling, popularized by Fan et al. (2018) for story generation, is the simpler of the two main truncation methods. [2] At each generation step, the model:

1. Computes the probability distribution over the full vocabulary.
2. Sorts tokens by probability in descending order.
3. Keeps only the **k** tokens with the highest probabilities.
4. Sets the probabilities of all other tokens to zero.
5. Renormalizes the remaining probabilities so they sum to 1.
6. Samples a token from this truncated distribution.

### Example

Suppose a model's top predictions for the next token are:

| Token | Probability |
|-------|-------------|
| "the" | 0.30 |
| "a" | 0.20 |
| "this" | 0.15 |
| "my" | 0.10 |
| "our" | 0.08 |
| "that" | 0.05 |
| (remaining tokens) | 0.12 |

With k = 4, the model keeps only "the," "a," "this," and "my." Their probabilities are renormalized to sum to 1 (0.40, 0.27, 0.20, 0.13), and one token is sampled from this restricted set.
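The worked example can be reproduced with a short sketch. It assumes the distribution is given as a dictionary and that each of the unnamed remaining tokens has probability below 0.10, so none of them can enter the top four; `top_k_filter` is a hypothetical helper name.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize them to sum to 1."""
    if k < 1:
        raise ValueError("k must be at least 1")
    kept = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}

# Named tokens from the table above; the remaining tokens are omitted
# because, by assumption, none of them ranks in the top four.
probs = {"the": 0.30, "a": 0.20, "this": 0.15, "my": 0.10,
         "our": 0.08, "that": 0.05}

filtered = top_k_filter(probs, k=4)
print({token: round(p, 2) for token, p in filtered.items()})
# {'the': 0.4, 'a': 0.27, 'this': 0.2, 'my': 0.13}
```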

### Why is a fixed k a limitation?

Top-k sampling effectively prevents the selection of very unlikely tokens, reducing incoherence. However, it has a fundamental limitation: the fixed value of k does not adapt to the shape of the probability distribution. In some contexts, the model is highly confident and concentrates most probability mass on just 2 or 3 tokens. In other contexts, probability is spread more evenly across dozens of plausible continuations. A fixed k of, say, 40 would include many irrelevant tokens in the first case and might still miss plausible tokens in the second. This inflexibility motivated the development of top-p sampling. [1]

## How does top-p (nucleus) sampling work?

Top-p sampling, or nucleus sampling, was proposed by Holtzman et al. (2020) to address top-k's inability to adapt to varying confidence levels. Instead of keeping a fixed number of tokens, top-p keeps the smallest set of tokens whose cumulative probability reaches or exceeds a threshold **p**. Formally, the paper defines the top-p vocabulary V(p) as "the smallest set such that the sum of P(x | x_1:i-1) over x in V(p) is at least p," then renormalizes that set for sampling. [1]

At each generation step, the model:

1. Computes the probability distribution over the full vocabulary.
2. Sorts tokens by probability in descending order.
3. Computes the cumulative sum of probabilities from highest to lowest.
4. Includes tokens until the cumulative probability reaches or exceeds **p**.
5. Sets the probabilities of all excluded tokens to zero.
6. Renormalizes and samples from the remaining tokens.

### Example

Using the same probability distribution as above, with p = 0.75:

| Token | Probability | Cumulative probability | Included? |
|-------|-------------|----------------------|----------|
| "the" | 0.30 | 0.30 | Yes |
| "a" | 0.20 | 0.50 | Yes |
| "this" | 0.15 | 0.65 | Yes |
| "my" | 0.10 | 0.75 | Yes |
| "our" | 0.08 | 0.83 | No |
| "that" | 0.05 | 0.88 | No |

The nucleus contains 4 tokens in this case. If the model were more confident (e.g., "the" had probability 0.80), the nucleus might contain only 1 token. If the model were less confident, the nucleus might contain 20 or more tokens. In the original paper the nucleus is reported to range "between one and a thousand candidates" depending on the step. [1] This adaptive behavior is the key advantage of top-p over top-k.
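A minimal sketch of the nucleus selection, assuming a dictionary of token probabilities. The helper name `top_p_filter` and the small tolerance `eps` (added so that floating-point rounding does not push a cumulative sum of exactly p just below the threshold) are choices made for this illustration.

```python
def top_p_filter(probs, p, eps=1e-12):
    """Keep the smallest high-probability prefix whose mass is at least p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    nucleus, cumulative = [], 0.0
    for token, prob in ranked:
        nucleus.append((token, prob))
        cumulative += prob
        if cumulative >= p - eps:  # tolerance for float rounding
            break
    total = sum(prob for _, prob in nucleus)
    return {token: prob / total for token, prob in nucleus}

# Named tokens from the table above.
example = {"the": 0.30, "a": 0.20, "this": 0.15, "my": 0.10,
           "our": 0.08, "that": 0.05}
# A hypothetical, more confident distribution.
peaked = {"the": 0.80, "a": 0.10, "this": 0.05, "my": 0.05}

print(list(top_p_filter(example, p=0.75)))  # ['the', 'a', 'this', 'my']
print(list(top_p_filter(peaked, p=0.75)))   # ['the']
```

The same threshold yields four candidates for the flatter distribution and one for the peaked distribution, which is the adaptive behavior described above.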

## Combining temperature, top-k, and top-p

In practice, these parameters are often used together. A typical generation pipeline applies them in the following order:

1. The model produces raw logits.
2. **Temperature** scaling is applied to the logits.
3. Logits are converted to probabilities via softmax.
4. **Top-k** filtering removes all but the top k tokens (if top-k is enabled).
5. **Top-p** filtering then keeps only the smallest set of remaining tokens whose cumulative probability reaches p (if top-p is enabled).
6. The remaining distribution is renormalized, and a token is sampled.

When both top-k and top-p are applied, top-k acts as a hard upper bound on the number of candidates, while top-p provides a soft, distribution-dependent filter. For example, setting k = 50 and p = 0.9 means the model considers at most 50 tokens, but may consider fewer if 90% of the probability is concentrated in just a handful of tokens.
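The pipeline above can be sketched as one function. This is an illustration under stated assumptions rather than any library's implementation: `sample_next_token` is a hypothetical name, `top_k=0` and `top_p=1.0` are treated as "disabled", and the top-p mass is measured after renormalizing the top-k survivors, which is one of several reasonable conventions.

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_k=0, top_p=1.0, rng=random):
    """Return the index of a token sampled after temperature, top-k, top-p."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Steps 1-3: temperature-scaled softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    ranked = sorted(((e / total, i) for i, e in enumerate(exps)), reverse=True)
    # Step 4: top-k keeps at most k candidates (0 disables the filter).
    if top_k > 0:
        ranked = ranked[:top_k]
    # Step 5: top-p keeps the smallest prefix whose renormalized mass
    # reaches top_p (1.0 disables the filter).
    if top_p < 1.0:
        mass = sum(prob for prob, _ in ranked)
        kept, cumulative = [], 0.0
        for prob, index in ranked:
            kept.append((prob, index))
            cumulative += prob / mass
            if cumulative >= top_p:
                break
        ranked = kept
    # Step 6: random.choices normalizes the weights implicitly.
    weights = [prob for prob, _ in ranked]
    indices = [index for _, index in ranked]
    return rng.choices(indices, weights=weights, k=1)[0]

rng = random.Random(0)
print(sample_next_token([2.0, 1.5, 0.2, -1.0], temperature=0.8,
                        top_k=3, top_p=0.9, rng=rng))
```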

### What are typical top-p values?

Top-p has become a standard generation default, with practitioners typically setting p in the 0.9 to 0.95 range. [8] The following table shows typical parameter configurations for different use cases.

| Use case | Temperature | Top-k | Top-p | Notes |
|----------|------------|-------|-------|-------|
| Factual Q&A | 0.0-0.3 | N/A | N/A | Low temperature or greedy decoding for accuracy |
| General chatbot | 0.7-1.0 | 40-50 | 0.9-0.95 | Balanced creativity and coherence |
| Creative writing | 1.0-1.2 | 50-100 | 0.95-1.0 | Higher temperature for more diverse output |
| Code generation | 0.0-0.4 | N/A | 0.9 | Low temperature for correctness; top-p to avoid nonsense |
| Brainstorming | 1.0-1.5 | 100 | 0.95-1.0 | Maximizes diversity at the cost of some coherence |

Libraries and provider APIs expose these parameters with documented defaults. In the [Hugging Face](/wiki/hugging_face) Transformers `GenerationConfig`, the documented defaults are temperature = 1.0, top_k = 50, and top_p = 1.0, with greedy decoding used unless `do_sample=True` is set. [9] [OpenAI](/wiki/openai)'s API supports temperature and top_p; [Anthropic](/wiki/anthropic)'s API for [Claude](/wiki/claude) supports temperature, top_k, and top_p, and Anthropic changed the default top_p in the Messages API from 0.999 to 0.99 across models. [10][11]

## How does min-p sampling differ from top-p?

Min-p sampling is a newer truncation method that addresses a limitation shared by both top-k and top-p: neither directly considers the confidence level of the model's top prediction when deciding which tokens to keep.

Introduced by Minh Nhat Nguyen and collaborators in "Turning Up the Heat: Min-p Sampling for Creative and Coherent LLM Outputs" (arXiv:2407.01082), min-p works as follows: [3]

1. Find the probability of the most likely token, P_max.
2. Compute a dynamic threshold: **threshold = P_max * min_p**, where min_p is a hyperparameter (e.g., 0.1).
3. Discard every token whose probability falls below this threshold.
4. Renormalize the remaining probabilities and sample.
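The four steps above can be sketched as follows. The helper name `min_p_filter` and both example distributions are hypothetical and chosen only to show the two regimes.

```python
def min_p_filter(probs, min_p=0.1):
    """Drop tokens below min_p times the top probability, then renormalize."""
    threshold = max(probs.values()) * min_p
    kept = {token: p for token, p in probs.items() if p >= threshold}
    total = sum(kept.values())
    return {token: p / total for token, p in kept.items()}

# Hypothetical distributions (illustrative values only).
confident = {"the": 0.90, "a": 0.05, "this": 0.03, "my": 0.02}
uncertain = {"the": 0.30, "a": 0.25, "this": 0.20, "my": 0.15, "our": 0.10}

print(list(min_p_filter(confident)))  # threshold 0.09: only "the" survives
print(list(min_p_filter(uncertain)))  # threshold 0.03: all five survive
```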

### How min-p adapts to confidence

The key insight is that the threshold scales with the model's confidence:

| Scenario | P_max | Threshold (min_p = 0.1) | Effect |
|----------|-------|------------------------|--------|
| High confidence | 0.90 | 0.09 | Only tokens with >= 9% probability survive; very few candidates |
| Moderate confidence | 0.30 | 0.03 | Tokens with >= 3% probability survive; moderate candidate pool |
| Low confidence | 0.05 | 0.005 | Tokens with >= 0.5% probability survive; many candidates allowed |

When the model is highly confident, min-p filters aggressively, keeping only the strongest candidates. When the model is uncertain, min-p relaxes, allowing a wider range of plausible continuations. The cutoff therefore tracks the model's confidence more directly than top-p's fixed cumulative threshold, which can admit many low-quality tokens when the model is uncertain.

### Empirical results

Nguyen et al. demonstrated that min-p sampling improves both the quality and diversity of generated text over top-p across multiple model families ([Mistral](/wiki/mistral) and [Llama](/wiki/llama) 3) and model sizes (1B to 123B parameters), particularly at higher temperatures where top-p tends to produce incoherent outputs. [3] The evaluation spanned reasoning and creative benchmarks including GPQA, GSM8K, and AlpacaEval Creative Writing, and human evaluators showed a clear preference for min-p in both quality and creativity. [3] The paper was the 18th highest-scoring submission to ICLR 2025 and was accepted as an oral presentation. [3]

Min-p is supported in major inference frameworks including [llama.cpp](/wiki/llama_cpp), [vLLM](/wiki/vllm), Hugging Face Transformers, [Ollama](/wiki/ollama), ExLlamaV2, KoboldCpp, and text-generation-webui.

## Other sampling methods

Several additional sampling methods have been proposed to improve text generation quality.

### Typical sampling

Locally typical sampling, introduced by Meister et al. (arXiv 2022, published in TACL 2023), selects tokens whose information content (negative log probability) is close to the conditional entropy of the distribution. [4] The intuition is that "typical" tokens are neither too predictable nor too surprising, aligning with information-theoretic properties of natural language.
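A rough sketch of the selection rule: rank tokens by how far their information content is from the entropy, then keep the closest ones until a chosen probability mass is covered. The helper name `typical_filter`, the `mass` parameter, and the example distribution are assumptions made for this illustration.

```python
import math

def typical_filter(probs, mass=0.9):
    """Keep tokens whose surprisal is closest to the entropy, up to `mass`."""
    entropy = -sum(p * math.log(p) for p in probs.values() if p > 0)
    # Rank tokens by the distance between their surprisal and the entropy.
    ranked = sorted(
        ((abs(-math.log(p) - entropy), token, p)
         for token, p in probs.items() if p > 0),
        key=lambda item: item[0],
    )
    kept, cumulative = [], 0.0
    for _, token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= mass:
            break
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}

# Hypothetical distribution (illustrative values only).
probs = {"the": 0.60, "a": 0.20, "this": 0.10, "my": 0.05, "our": 0.05}
print(list(typical_filter(probs, mass=0.7)))  # ['a', 'the']
```

In this example the most typical token is "a" rather than the most probable token "the", because the surprisal of "a" lies closer to the entropy of the distribution.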

### Eta sampling

Eta sampling (Hewitt et al., 2022) uses both absolute and relative probability thresholds to truncate the distribution. [5] It removes tokens in the tail of the distribution whose probabilities fall below a threshold derived from the distribution's entropy.
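A minimal sketch of an entropy-dependent cutoff, assuming the threshold takes the form eta = min(epsilon, sqrt(epsilon) * exp(-entropy)). The helper name `eta_filter` is hypothetical, and the default `epsilon` here is deliberately large so that the effect is visible on a four-token toy vocabulary.

```python
import math

def eta_filter(probs, epsilon=0.01):
    """Drop tokens below eta = min(epsilon, sqrt(epsilon) * exp(-entropy))."""
    entropy = -sum(p * math.log(p) for p in probs.values() if p > 0)
    eta = min(epsilon, math.sqrt(epsilon) * math.exp(-entropy))
    top = max(probs, key=probs.get)  # always keep the most probable token
    kept = {t: p for t, p in probs.items() if p >= eta or t == top}
    total = sum(kept.values())
    return {t: p / total for t, p in kept.items()}

# Hypothetical distribution (illustrative values only).
probs = {"the": 0.70, "a": 0.25, "this": 0.045, "qux": 0.005}
print(list(eta_filter(probs)))  # "qux" falls below the cutoff and is dropped
```

The absolute bound `epsilon` applies when the distribution is peaked; for a high-entropy distribution the relative term becomes smaller and the cutoff loosens.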

### Mirostat

Mirostat (Basu et al., 2021) is a sampling method that dynamically adjusts the truncation to maintain a target perplexity (surprise level) throughout generation. [6] Rather than using a fixed threshold like top-k or top-p, Mirostat uses a feedback control loop to keep the text at a consistent level of predictability.
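The feedback idea can be illustrated with a deliberately simplified loop. This is a sketch in the spirit of Mirostat, not the published algorithm: the function name `mirostat_step`, the parameters `target` and `learning_rate`, and the surprise cap `mu` are illustrative, and all probabilities are assumed to be positive.

```python
import math
import random

def mirostat_step(probs, mu, target=3.0, learning_rate=0.1, rng=random):
    """One decoding step: truncate by surprise, sample, then update mu."""
    # Keep tokens whose surprise (-log2 p) does not exceed the current cap.
    allowed = {t: p for t, p in probs.items() if p > 0 and -math.log2(p) <= mu}
    if not allowed:  # fall back to the most probable token
        top = max(probs, key=probs.get)
        allowed = {top: probs[top]}
    tokens = list(allowed)
    token = rng.choices(tokens, weights=[allowed[t] for t in tokens], k=1)[0]
    # Observed surprise under the renormalized, truncated distribution.
    total = sum(allowed.values())
    observed = -math.log2(allowed[token] / total)
    # Feedback: lower the cap after a surprising token, raise it otherwise.
    mu -= learning_rate * (observed - target)
    return token, mu

# Hypothetical distribution reused at every step for simplicity.
probs = {"the": 0.5, "a": 0.25, "this": 0.125, "my": 0.125}
rng = random.Random(0)
mu = 6.0
for _ in range(5):
    token, mu = mirostat_step(probs, mu, rng=rng)
    print(token, round(mu, 3))
```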

The following table compares the truncation methods discussed in this article.

| Method | Year | Truncation criterion | Adaptive? | Key property |
|--------|------|---------------------|-----------|-------------|
| Top-k | 2018 | Fixed number of tokens | No | Simple; does not adapt to confidence |
| Top-p (nucleus) | 2019/2020 | Cumulative probability threshold | Partially | Adapts candidate pool size to distribution shape |
| Mirostat | 2021 | Target perplexity feedback loop | Yes | Maintains consistent surprise level |
| Typical sampling | 2022 | Information content near entropy | Yes | Information-theoretically motivated |
| Eta sampling | 2022 | Entropy-based threshold | Yes | Removes low-probability tail tokens |
| Min-p | 2024 | Fraction of top token's probability | Yes | Scales threshold with model confidence |

## Repetition penalty and frequency penalty

In addition to truncation methods, many generation systems apply **repetition penalties** to discourage the model from repeating the same tokens or phrases. These penalties lower the logits of tokens that have already appeared in the generated text; in some variants the penalty grows with the number of prior occurrences.

[OpenAI](/wiki/openai)'s API provides two related parameters: **presence_penalty** (penalizes tokens that have appeared at all) and **frequency_penalty** (penalizes tokens proportionally to how often they have appeared). [10] These work alongside temperature and top-p to control output quality.
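The two penalties can be sketched as additive adjustments to the logits. This assumes a simple additive form (a flat deduction for any token already seen, plus a deduction proportional to its count); a provider's exact implementation may differ, and `apply_penalties` is a hypothetical helper name.

```python
from collections import Counter

def apply_penalties(logits, generated, presence_penalty=0.0, frequency_penalty=0.0):
    """Return a copy of `logits` with additive penalties for seen tokens."""
    counts = Counter(generated)
    adjusted = dict(logits)
    for token, count in counts.items():
        if token in adjusted:
            # Presence: flat, once per seen token. Frequency: scales with count.
            adjusted[token] -= presence_penalty + frequency_penalty * count
    return adjusted

# Hypothetical logits and generation history.
logits = {"cat": 2.0, "dog": 1.0, "fish": 0.5}
history = ["cat", "cat", "dog"]
print(apply_penalties(logits, history, presence_penalty=0.5, frequency_penalty=0.25))
# {'cat': 1.0, 'dog': 0.25, 'fish': 0.5}
```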

## Implementation details

Top-k and top-p sampling are implemented efficiently in modern inference libraries. The key computational steps (sorting logits, computing cumulative sums, masking) add minimal overhead compared to the model's forward pass. In the Hugging Face Transformers library, sampling parameters are passed through the `GenerationConfig` object:

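```python
from transformers import GenerationConfig

# Sampling settings; pass this object to model.generate() via generation_config.
config = GenerationConfig(
    do_sample=True,      # sample instead of using greedy decoding
    temperature=0.8,     # sharpen the distribution slightly
    top_k=50,            # keep at most the 50 most likely tokens
    top_p=0.9,           # then keep the smallest set reaching 90% of the mass
    max_new_tokens=256,  # cap on the number of generated tokens
)
```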

For API-based access, [OpenAI](/wiki/openai)'s Chat Completions API accepts temperature and top_p as parameters. The API documentation recommends altering either temperature or top_p but not both simultaneously, though in practice both can be used together. [10]

## Theoretical foundations

The development of sampling methods for language models is rooted in information theory and probability theory. When a language model generates text, each step can be viewed as sampling from a categorical distribution over the vocabulary. The quality of the generated text depends on how well the sampling strategy navigates the trade-off between two failure modes:

1. **Degeneration:** When the model always picks the most probable tokens (greedy or beam search), the output becomes repetitive and generic. Holtzman et al. measured this quantitatively, showing that human text has a much higher variance in per-token probability than text generated by maximization-based methods. [1]

2. **Incoherence:** When the model samples from the full distribution without truncation, it occasionally selects tokens from the long tail of the distribution, producing nonsensical or contradictory text.

Top-k, top-p, and min-p can all be understood as different approaches to identifying and removing the unreliable tail of the distribution while preserving the informative nucleus. From an information-theoretic perspective, typical sampling goes even further by directly targeting the "typical set" of sequences, those whose information content per token is close to the model's entropy. [4]

## How do sampling parameters interact?

Understanding how sampling parameters interact is important for achieving desired generation behavior. The interaction can be subtle, and combining parameters does not always produce intuitive results.

| Combination | Interaction | Practical note |
|-------------|-------------|----------------|
| Low temperature + top-p | Temperature sharpens the distribution before top-p filters. The nucleus becomes very small, often containing just 1-3 tokens. | Behaves almost like greedy decoding. Top-p has little effect. |
| High temperature + top-p | Temperature flattens the distribution, spreading probability across many tokens. Top-p then truncates the long tail. | Top-p does most of the heavy lifting in preventing incoherence. |
| Top-k + top-p together | Top-k sets a hard ceiling on candidates; top-p may further reduce the set. | Useful when you want an absolute maximum on candidate count. |
| Min-p + high temperature | High temperature flattens the distribution, but min-p sets its threshold relative to the top token's post-temperature probability. | Min-p remains effective because it adapts to the post-temperature distribution. |
| Min-p + top-p | Both filters apply. In practice, one typically dominates. | Most practitioners use one or the other, not both. |

[OpenAI](/wiki/openai)'s API documentation specifically notes that users should generally alter either temperature or top_p, not both. [10] This guidance reflects the fact that both parameters affect the effective size of the candidate pool, and combining them without careful tuning can produce unexpected behavior.
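The interaction between temperature and top-p can be observed by counting how many tokens fall inside the nucleus at different temperatures. The sketch below uses hypothetical, evenly spaced logits, and `nucleus_size` is an illustrative helper name.

```python
import math

def nucleus_size(logits, temperature, top_p):
    """Count the tokens in the top-p nucleus after temperature scaling."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = sorted((e / total for e in exps), reverse=True)
    cumulative = 0.0
    for size, prob in enumerate(probs, start=1):
        cumulative += prob
        if cumulative >= top_p:
            return size
    return len(probs)

# Hypothetical logits: evenly spaced so the effect is easy to see.
logits = [5.0 - 0.5 * i for i in range(20)]
for t in (0.2, 1.0, 2.0):
    print(t, nucleus_size(logits, t, top_p=0.9))
```

With the same top_p, the nucleus shrinks to a single token at the lowest temperature and grows as the temperature rises, which is why changing both parameters at once can be hard to reason about.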

## Impact on LLM applications

The choice of sampling parameters has significant effects on the behavior of deployed LLM applications.

| Application | Preferred strategy | Reason |
|-------------|-------------------|--------|
| Customer support bots | Low temperature (0.1-0.3), no top-k/top-p | Consistency and accuracy are paramount |
| [AI coding assistants](/wiki/ai_coding_agent) | Low temperature, top-p = 0.9 | Correct code with some variation for alternative approaches |
| Story generation | High temperature (1.0+), top-p = 0.95 | Creativity and unpredictability are desired |
| Retrieval-augmented generation (RAG) | Low temperature (0.0-0.2) | Faithfulness to retrieved context |
| Translation | Temperature = 0.3-0.5, top-p = 0.9 | Balance between fluency and accuracy |

## Historical development

The evolution of sampling strategies tracks the broader development of neural language models.

| Year | Development |
|------|-------------|
| Pre-2018 | [Beam search](/wiki/beam_search) dominates in sequence-to-sequence tasks. Temperature sampling used informally. |
| 2018 | Fan et al. popularize top-k sampling for hierarchical story generation, demonstrating that stochastic decoding produces more engaging narratives than beam search. [2] |
| 2019 | Holtzman et al. publish "The Curious Case of Neural Text Degeneration" on arXiv, introducing nucleus (top-p) sampling. The paper appears at ICLR 2020 and becomes one of the most cited works on text generation. [1] |
| 2021 | Basu et al. propose Mirostat, using control theory to maintain a target perplexity during generation. [6] |
| 2022 | Meister et al. formalize locally typical sampling based on information-theoretic principles. Hewitt et al. introduce eta sampling. [4][5] |
| 2024 | Nguyen et al. propose min-p sampling, which scales the truncation threshold with model confidence. The paper gains rapid adoption in open-source inference frameworks. [3] |
| 2025 | Min-p accepted as an oral at ICLR 2025. Sampling methods continue to evolve alongside new model architectures. [3] |

The trend across this timeline is clear: sampling methods have become progressively more adaptive, moving from fixed thresholds (top-k) to distribution-dependent thresholds (top-p) to confidence-scaled thresholds (min-p). Each generation of methods better approximates the ideal of including exactly the plausible continuations while excluding the implausible ones.

## References

1. Holtzman, A., Buys, J., Du, L., Forbes, M., & Choi, Y. (2020). "The Curious Case of Neural Text Degeneration." ICLR 2020. arXiv:1904.09751. https://arxiv.org/abs/1904.09751
2. Fan, A., Lewis, M., & Dauphin, Y. (2018). "Hierarchical Neural Story Generation." ACL 2018. https://arxiv.org/abs/1805.04833
3. Nguyen, M. N., Baker, A., Neo, C., Roush, A., Kirsch, A., & Shwartz-Ziv, R. (2024). "Turning Up the Heat: Min-p Sampling for Creative and Coherent LLM Outputs." ICLR 2025 (oral). arXiv:2407.01082. https://arxiv.org/abs/2407.01082
4. Meister, C., Pimentel, T., Wiher, G., & Cotterell, R. (2023). "Locally Typical Sampling." Transactions of the Association for Computational Linguistics, 11, 102-121.
5. Hewitt, J., Manning, C. D., & Liang, P. (2022). "Truncation Sampling as Language Model Desmoothing." EMNLP 2022. arXiv:2210.15191.
6. Basu, S., Ramachandran, G. S., Keskar, N. S., & Varshney, L. R. (2021). "Mirostat: A Neural Text Decoding Algorithm that Directly Controls Perplexity." ICLR 2021. arXiv:2007.14966.
7. Ackley, D. H., Hinton, G. E., & Sejnowski, T. J. (1985). "A Learning Algorithm for Boltzmann Machines." Cognitive Science, 9(1), 147-169. (Original description of the temperature parameter in neural networks.)
8. Huyen, C. (2024). "Generation configurations: temperature, top-k, top-p, and test time compute." https://huyenchip.com/2024/01/16/sampling.html
9. Hugging Face. "Generation - GenerationConfig." Transformers documentation. https://huggingface.co/docs/transformers/main_classes/text_generation
10. OpenAI. "API Reference: Chat Completions (temperature, top_p, presence_penalty, frequency_penalty)." https://platform.openai.com/docs/api-reference/chat
11. Anthropic. "Claude API release notes / Messages API parameters (top_p, temperature, top_k)." https://docs.anthropic.com/en/release-notes/api

