Top-p sampling and top-k sampling are decoding strategies used in large language models (LLMs) and other autoregressive neural networks to control the randomness and quality of generated text. These methods filter the vocabulary at each generation step, restricting the pool of candidate tokens before sampling. They are fundamental tools for balancing creativity, coherence, and diversity in text generation across applications such as chatbots, creative writing assistants, and code generators.
Top-k sampling was popularized by Fan et al. (2018), while top-p sampling (also called nucleus sampling) was introduced by Holtzman et al. in their 2019 paper "The Curious Case of Neural Text Degeneration," published at ICLR 2020. Since then, newer methods such as min-p sampling have emerged to address remaining limitations.
At each step of autoregressive text generation, a language model produces a probability distribution over its entire vocabulary. The model assigns a probability to every token, and the generation method determines which token is actually selected. The simplest approach, greedy decoding, always picks the token with the highest probability. While deterministic and fast, greedy decoding tends to produce repetitive, generic text.
Sampling-based methods introduce randomness by drawing a token from the probability distribution rather than always picking the most likely one. Pure random sampling (drawing from the full distribution) can produce incoherent text because it sometimes selects very unlikely tokens. Top-k and top-p sampling address this by truncating the distribution, keeping only the most plausible candidates before sampling.
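The contrast between greedy decoding and pure sampling can be sketched in a few lines of plain Python (the function names and toy distribution are illustrative, not from any library):

```python
import random

def greedy_decode_step(probs):
    """Greedy decoding: always pick the single most probable token."""
    return max(probs, key=probs.get)

def sample_step(probs, rng):
    """Pure sampling: draw a token from the full distribution, with no truncation."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]

probs = {"the": 0.5, "a": 0.3, "banana": 0.2}
rng = random.Random(0)
greedy_decode_step(probs)   # always returns "the"
sample_step(probs, rng)     # may occasionally return the unlikely "banana"
```

Because `sample_step` can select any token, including low-probability ones, repeated calls will eventually produce implausible continuations; truncation methods exist to shrink its candidate pool first.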
Temperature is a scaling parameter applied to the logits (raw model outputs) before converting them to probabilities via the softmax function. It controls how "peaked" or "flat" the probability distribution is.
| Temperature value | Effect on distribution | Behavior |
|---|---|---|
| T = 1.0 | No change; uses the raw distribution | Standard model output |
| T < 1.0 (e.g., 0.3) | Sharpens the distribution; high-probability tokens become more dominant | More deterministic, focused output |
| T > 1.0 (e.g., 1.5) | Flattens the distribution; low-probability tokens get relatively higher chances | More random, creative, potentially incoherent output |
| T approaching 0 | Distribution collapses to a spike on the top token | Equivalent to greedy decoding |
Mathematically, the probability of token i at temperature T is computed as:
P(i) = exp(logit_i / T) / Σ_j exp(logit_j / T)
where the sum in the denominator runs over every token j in the vocabulary.
Temperature is typically applied before top-k or top-p filtering. It is not a sampling method itself but a preprocessing step that modifies the distribution before other truncation methods are applied.
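The temperature-scaled softmax above can be implemented directly with the standard library (a minimal sketch; the function name is illustrative):

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw logits to probabilities, dividing by the temperature first."""
    scaled = [l / temperature for l in logits]
    # Subtract the max before exponentiating for numerical stability.
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]
sharp = softmax_with_temperature(logits, 0.3)  # low T: distribution more peaked
flat = softmax_with_temperature(logits, 1.5)   # high T: distribution flatter
```

Comparing `sharp` and `flat` shows the effect from the table: lowering T increases the top token's share of the probability mass, while raising T spreads mass toward the tail.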
Top-k sampling, popularized by Fan et al. (2018) for story generation, is the simpler of the two main truncation methods. At each generation step, the model:

1. Sorts the vocabulary by predicted probability.
2. Keeps only the k most probable tokens and discards the rest.
3. Renormalizes the probabilities of the remaining tokens so they sum to 1.
4. Samples the next token from this truncated distribution.
Suppose a model's top predictions for the next token are:
| Token | Probability |
|---|---|
| "the" | 0.30 |
| "a" | 0.20 |
| "this" | 0.15 |
| "my" | 0.10 |
| "our" | 0.08 |
| "that" | 0.05 |
| (remaining tokens) | 0.12 |
With k = 4, the model keeps only "the," "a," "this," and "my." Their probabilities are renormalized to sum to 1 (0.40, 0.27, 0.20, 0.13), and one token is sampled from this restricted set.
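This filtering and renormalization step can be sketched in plain Python (the function name is illustrative; real implementations operate on logit tensors rather than dictionaries):

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize their probabilities."""
    kept = dict(sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k])
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}

probs = {"the": 0.30, "a": 0.20, "this": 0.15, "my": 0.10, "our": 0.08, "that": 0.05}
filtered = top_k_filter(probs, k=4)
# "the": 0.30 / 0.75 = 0.40; "a": ~0.27; "this": 0.20; "my": ~0.13
```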
Top-k sampling effectively prevents the selection of very unlikely tokens, reducing incoherence. However, it has a fundamental limitation: the fixed value of k does not adapt to the shape of the probability distribution. In some contexts, the model is highly confident and concentrates most probability mass on just 2 or 3 tokens. In other contexts, probability is spread more evenly across dozens of plausible continuations. A fixed k of, say, 40 would include many irrelevant tokens in the first case and might still miss plausible tokens in the second. This inflexibility motivated the development of top-p sampling.
Top-p sampling, or nucleus sampling, was proposed by Holtzman et al. (2019) to address top-k's inability to adapt to varying confidence levels. Instead of keeping a fixed number of tokens, top-p keeps the smallest set of top-ranked tokens whose cumulative probability reaches or exceeds a threshold p.
At each generation step, the model:

1. Sorts the vocabulary by predicted probability in descending order.
2. Accumulates probabilities from the top until the cumulative sum reaches p.
3. Discards every token outside this "nucleus" and renormalizes the rest.
4. Samples the next token from the renormalized nucleus.
Using the same probability distribution as above, with p = 0.75:
| Token | Probability | Cumulative probability | Included? |
|---|---|---|---|
| "the" | 0.30 | 0.30 | Yes |
| "a" | 0.20 | 0.50 | Yes |
| "this" | 0.15 | 0.65 | Yes |
| "my" | 0.10 | 0.75 | Yes |
| "our" | 0.08 | 0.83 | No |
| "that" | 0.05 | 0.88 | No |
The nucleus contains 4 tokens in this case. If the model were more confident (e.g., "the" had probability 0.80), the nucleus might contain only 1 token. If the model were less confident, the nucleus might contain 20 or more tokens. This adaptive behavior is the key advantage of top-p over top-k.
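The nucleus construction above can be sketched as follows (an illustrative implementation over a probability dictionary; production code applies the same logic to sorted logit tensors):

```python
def top_p_filter(probs, p):
    """Keep the smallest top-ranked set whose cumulative probability reaches p,
    then renormalize the survivors."""
    nucleus, cumulative = {}, 0.0
    for tok, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        nucleus[tok] = prob
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(nucleus.values())
    return {tok: prob / total for tok, prob in nucleus.items()}

probs = {"the": 0.30, "a": 0.20, "this": 0.15, "my": 0.10, "our": 0.08, "that": 0.05}
nucleus = top_p_filter(probs, p=0.75)  # keeps "the", "a", "this", "my"
```

Lowering p shrinks the nucleus: with p = 0.5 on the same distribution, only "the" and "a" survive.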
Holtzman et al.'s paper identified a critical problem with standard beam search and greedy decoding: even high-quality language models produce text that is "boring" and repetitive when decoded with maximization-based methods. Human language, by contrast, has a degree of unpredictability that makes it engaging. The authors showed that the probability distributions produced by neural language models assign non-trivial probability to a "nucleus" of plausible tokens, while scattering the remaining probability across a long tail of unlikely tokens. By sampling only from the nucleus, top-p sampling captures the natural variability of language while avoiding the incoherence that comes from sampling unlikely tokens.
In practice, these parameters are often used together. A typical generation pipeline applies them in the following order:

1. Apply temperature scaling to the logits.
2. Apply top-k filtering, if enabled.
3. Apply top-p filtering, if enabled.
4. Renormalize the surviving probabilities and sample a token.
When both top-k and top-p are applied, top-k acts as a hard upper bound on the number of candidates, while top-p provides a soft, distribution-dependent filter. For example, setting k = 50 and p = 0.9 means the model considers at most 50 tokens, but may consider fewer if 90% of the probability is concentrated in just a handful of tokens.
The following table shows typical parameter configurations for different use cases.
| Use case | Temperature | Top-k | Top-p | Notes |
|---|---|---|---|---|
| Factual Q&A | 0.0-0.3 | N/A | N/A | Low temperature or greedy decoding for accuracy |
| General chatbot | 0.7-1.0 | 40-50 | 0.9-0.95 | Balanced creativity and coherence |
| Creative writing | 1.0-1.2 | 50-100 | 0.95-1.0 | Higher temperature for more diverse output |
| Code generation | 0.0-0.4 | N/A | 0.9 | Low temperature for correctness; top-p to avoid nonsense |
| Brainstorming | 1.0-1.5 | 100 | 0.95-1.0 | Maximizes diversity at the cost of some coherence |
Provider APIs typically expose these parameters. OpenAI's API, for instance, supports temperature and top_p. Anthropic's API for Claude supports temperature and top_k/top_p. The Hugging Face Transformers library exposes all three parameters plus additional options.
Min-p sampling is a newer truncation method that addresses a limitation shared by both top-k and top-p: neither directly considers the confidence level of the model's top prediction when deciding which tokens to keep.
Introduced by Minh Nguyen and collaborators, min-p works as follows:

1. Find the probability of the most likely token, P_max.
2. Compute a truncation threshold as min_p × P_max.
3. Discard every token whose probability falls below the threshold.
4. Renormalize the surviving probabilities and sample from them.
The key insight is that the threshold scales with the model's confidence:
| Scenario | P_max | Threshold (min_p = 0.1) | Effect |
|---|---|---|---|
| High confidence | 0.90 | 0.09 | Only tokens with >= 9% probability survive; very few candidates |
| Moderate confidence | 0.30 | 0.03 | Tokens with >= 3% probability survive; moderate candidate pool |
| Low confidence | 0.05 | 0.005 | Tokens with >= 0.5% probability survive; many candidates allowed |
When the model is highly confident, min-p aggressively filters, keeping only the strongest candidates. When the model is uncertain, min-p relaxes, allowing a wider range of plausible continuations. This behavior is more principled than top-p's fixed cumulative threshold, which can include too many low-quality tokens when the model is uncertain.
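The confidence-scaled threshold is short enough to sketch directly (an illustrative implementation; the toy distributions mirror the high- and low-confidence scenarios in the table):

```python
def min_p_filter(probs, min_p):
    """Min-p: keep tokens with probability >= min_p * P_max, then renormalize."""
    threshold = min_p * max(probs.values())
    kept = {tok: pr for tok, pr in probs.items() if pr >= threshold}
    total = sum(kept.values())
    return {tok: pr / total for tok, pr in kept.items()}

# High confidence: P_max = 0.90, so the threshold is 0.09 and few tokens survive.
confident = {"the": 0.90, "a": 0.05, "this": 0.03, "my": 0.02}
# Low confidence: P_max = 0.30, so the threshold drops to 0.03 and most survive.
uncertain = {"the": 0.30, "a": 0.25, "this": 0.20, "my": 0.15, "our": 0.08, "that": 0.02}
```

Running `min_p_filter` with min_p = 0.1 keeps only "the" in the confident case but five of the six candidates in the uncertain case, exactly the adaptive behavior described above.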
Nguyen et al. demonstrated that min-p sampling outperforms top-p across multiple model families (Mistral, Llama 3) and model sizes (1B to 123B parameters), particularly at higher temperatures where top-p tends to produce incoherent outputs. Min-p maintains both quality and diversity more effectively than top-p, especially in creative generation tasks. The paper was accepted as an oral presentation at ICLR 2025.
Min-p is supported in major inference frameworks including llama.cpp, vLLM, Hugging Face Transformers, Ollama, ExLlamaV2, KoboldCpp, and text-generation-webui.
Several additional sampling methods have been proposed to improve text generation quality.
Locally typical sampling, introduced by Meister et al. (2022), selects tokens whose information content (negative log probability) is close to the conditional entropy of the distribution. The intuition is that "typical" tokens are neither too predictable nor too surprising, aligning with information-theoretic properties of natural language.
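The idea can be sketched as follows: rank tokens by how closely their surprisal matches the distribution's entropy, then keep tokens until a target probability mass tau is covered. This is a simplified, illustrative version; the tau parameter and function name are assumptions, not the paper's exact formulation:

```python
import math

def typical_filter(probs, tau):
    """Locally typical sampling sketch: prefer tokens whose information content
    (-log p) is closest to the distribution's entropy, up to cumulative mass tau."""
    entropy = -sum(p * math.log(p) for p in probs.values() if p > 0)
    # Rank tokens by the distance between their surprisal and the entropy.
    ranked = sorted(probs.items(), key=lambda kv: abs(-math.log(kv[1]) - entropy))
    kept, cumulative = {}, 0.0
    for tok, p in ranked:
        kept[tok] = p
        cumulative += p
        if cumulative >= tau:
            break
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}

probs = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}
kept_typical = typical_filter(probs, tau=0.5)
```

Note that the most probable token "a" is not ranked first here: "b", whose surprisal sits closest to the entropy, is, which is precisely how typical sampling differs from probability-ordered truncation.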
Eta sampling (Hewitt et al., 2022) uses both absolute and relative probability thresholds to truncate the distribution. It removes tokens in the tail of the distribution whose probabilities fall below a threshold derived from the distribution's entropy.
Mirostat (Basu et al., 2021) is a sampling method that dynamically adjusts the truncation to maintain a target perplexity (surprise level) throughout generation. Rather than using a fixed threshold like top-k or top-p, Mirostat uses a feedback control loop to keep the text at a consistent level of predictability.
| Method | Year | Truncation criterion | Adaptive? | Key property |
|---|---|---|---|---|
| Top-k | 2018 | Fixed number of tokens | No | Simple; does not adapt to confidence |
| Top-p (nucleus) | 2019 | Cumulative probability threshold | Partially | Adapts candidate pool size to distribution shape |
| Typical sampling | 2022 | Information content near entropy | Yes | Information-theoretically motivated |
| Eta sampling | 2022 | Entropy-based threshold | Yes | Removes low-probability tail tokens |
| Mirostat | 2021 | Target perplexity feedback loop | Yes | Maintains consistent surprise level |
| Min-p | 2024 | Fraction of top token's probability | Yes | Scales threshold with model confidence |
In addition to truncation methods, most generation systems apply repetition penalties to discourage the model from repeating the same tokens or phrases. These penalties reduce the logit of tokens that have already appeared in the generated text, with the penalty typically increasing with the number of prior occurrences.
OpenAI's API provides two related parameters: presence_penalty (penalizes tokens that have appeared at all) and frequency_penalty (penalizes tokens proportionally to how often they have appeared). These work alongside temperature and top-p to control output quality.
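The penalty scheme OpenAI documents subtracts from each repeated token's logit a term proportional to its occurrence count (frequency) plus a flat term for having appeared at all (presence). A minimal sketch, with illustrative names and toy values:

```python
def apply_penalties(logits, counts, presence_penalty, frequency_penalty):
    """Reduce the logits of tokens already generated:
    logit -= count * frequency_penalty + presence_penalty (if count > 0)."""
    adjusted = dict(logits)
    for tok, count in counts.items():
        if tok in adjusted and count > 0:
            adjusted[tok] -= count * frequency_penalty + presence_penalty
    return adjusted

logits = {"the": 2.0, "cat": 1.5, "sat": 1.0}
counts = {"the": 3, "cat": 1}  # occurrence counts in the text generated so far
penalized = apply_penalties(logits, counts, presence_penalty=0.5, frequency_penalty=0.2)
# "the": 2.0 - 3*0.2 - 0.5 = 0.9; "cat": 1.5 - 1*0.2 - 0.5 = 0.8; "sat" unchanged
```

The penalized logits are then passed through the usual temperature/truncation/softmax pipeline, so frequently repeated tokens become progressively less likely to be sampled again.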
Top-k and top-p sampling are implemented efficiently in modern inference libraries. The key computational steps (sorting logits, computing cumulative sums, masking) add minimal overhead compared to the model's forward pass. In the Hugging Face Transformers library, sampling parameters are passed through the GenerationConfig object:
```python
from transformers import GenerationConfig

config = GenerationConfig(
    do_sample=True,
    temperature=0.8,
    top_k=50,
    top_p=0.9,
    max_new_tokens=256,
)
```
For API-based access, OpenAI's Chat Completions API accepts temperature and top_p as parameters. The API documentation recommends altering either temperature or top_p but not both simultaneously, though in practice both can be used together.
The development of sampling methods for language models is rooted in information theory and probability theory. When a language model generates text, each step can be viewed as sampling from a categorical distribution over the vocabulary. The quality of the generated text depends on how well the sampling strategy navigates the trade-off between two failure modes:
Degeneration: When the model always picks the most probable tokens (greedy or beam search), the output becomes repetitive and generic. Holtzman et al. measured this quantitatively, showing that human text has a much higher variance in per-token probability than text generated by maximization-based methods.
Incoherence: When the model samples from the full distribution without truncation, it occasionally selects tokens from the long tail of the distribution, producing nonsensical or contradictory text.
Top-k, top-p, and min-p can all be understood as different approaches to identifying and removing the unreliable tail of the distribution while preserving the informative nucleus. From an information-theoretic perspective, typical sampling goes even further by directly targeting the "typical set" of sequences, those whose information content per token is close to the model's entropy.
Understanding how sampling parameters interact is important for achieving desired generation behavior. The interaction can be subtle, and combining parameters does not always produce intuitive results.
| Combination | Interaction | Practical note |
|---|---|---|
| Low temperature + top-p | Temperature sharpens the distribution before top-p filters. The nucleus becomes very small, often containing just 1-3 tokens. | Behaves almost like greedy decoding. top-p has little effect. |
| High temperature + top-p | Temperature flattens the distribution, spreading probability across many tokens. top-p then truncates the long tail. | top-p does most of the heavy lifting in preventing incoherence. |
| Top-k + top-p together | top-k sets a hard ceiling on candidates; top-p may further reduce the set. | Useful when you want an absolute maximum on candidate count. |
| Min-p + high temperature | High temperature flattens the distribution, but min-p scales its threshold relative to the still-highest token. | Min-p remains effective because it adapts to the post-temperature distribution. |
| Min-p + top-p | Both filters apply. In practice, one typically dominates. | Most practitioners use one or the other, not both. |
OpenAI's API documentation specifically notes that users should generally alter either temperature or top_p, not both. This guidance reflects the fact that both parameters affect the effective size of the candidate pool, and combining them without careful tuning can produce unexpected behavior.
The choice of sampling parameters has significant effects on the behavior of deployed LLM applications.
| Application | Preferred strategy | Reason |
|---|---|---|
| Customer support bots | Low temperature (0.1-0.3), no top-k/top-p | Consistency and accuracy are paramount |
| AI coding assistants | Low temperature, top-p = 0.9 | Correct code with some variation for alternative approaches |
| Story generation | High temperature (1.0+), top-p = 0.95 | Creativity and unpredictability are desired |
| Search-augmented generation (RAG) | Low temperature (0.0-0.2) | Faithfulness to retrieved context |
| Translation | Temperature = 0.3-0.5, top-p = 0.9 | Balance between fluency and accuracy |
The evolution of sampling strategies tracks the broader development of neural language models.
| Year | Development |
|---|---|
| Pre-2018 | Beam search dominates in sequence-to-sequence tasks. Temperature sampling used informally. |
| 2018 | Fan et al. popularize top-k sampling for hierarchical story generation, demonstrating that stochastic decoding produces more engaging narratives than beam search. |
| 2019 | Holtzman et al. publish "The Curious Case of Neural Text Degeneration," introducing nucleus (top-p) sampling. The paper appears at ICLR 2020 and becomes one of the most cited works on text generation. |
| 2021 | Basu et al. propose Mirostat, using control theory to maintain a target perplexity during generation. |
| 2022 | Meister et al. formalize locally typical sampling based on information-theoretic principles. Hewitt et al. introduce eta sampling. |
| 2024 | Nguyen et al. propose min-p sampling, which scales the truncation threshold with model confidence. The paper gains rapid adoption in open-source inference frameworks. |
| 2025 | Min-p accepted as an oral at ICLR 2025. Sampling methods continue to evolve alongside new model architectures. |
The trend across this timeline is clear: sampling methods have become progressively more adaptive, moving from fixed thresholds (top-k) to distribution-dependent thresholds (top-p) to confidence-scaled thresholds (min-p). Each generation of methods better approximates the ideal of including exactly the plausible continuations while excluding the implausible ones.