Temperature sampling is a technique that controls the randomness of token generation in large language models and other generative neural networks by scaling the model's output logits before the softmax function is applied. The technique introduces a single hyperparameter, conventionally written as T, that reshapes the probability distribution over the next token. A lower temperature concentrates probability mass on the highest-scoring tokens and produces more deterministic, focused output. A higher temperature flattens the distribution and produces more diverse, creative, and sometimes incoherent output. At T = 1 the distribution is left unchanged, at T close to 0 the procedure becomes equivalent to argmax over the logits (also known as greedy decoding), and as T tends toward infinity the distribution approaches uniform.
Temperature is one of the most commonly adjusted parameters in LLM-based applications and is fundamental to controlling the behavior of models such as GPT-4, Claude, Gemini, LLaMA, and other transformer-based architectures. It is the simplest member of a broader family of stochastic decoding strategies that includes top-k sampling, top-p sampling (also called nucleus sampling), min-p sampling, Mirostat, locally typical sampling, and various combinations of these. As a companion technique to beam search, which performs deterministic best-first exploration, temperature sampling supplies the controlled randomness that has become standard for open-ended generation in modern chat-style models.
The concept of temperature in sampling has its roots in statistical mechanics and thermodynamics. In physics, temperature governs the probability of a system occupying different energy states according to the Boltzmann distribution:
P(state_i) = exp(-E_i / (k_B * T)) / Z
where E_i is the energy of state i, k_B is the Boltzmann constant, T is the absolute temperature, and Z is the partition function (a normalizing constant). At low temperatures, the system overwhelmingly occupies the lowest-energy state. At high temperatures, the system explores many states more uniformly.
This physical analogy carries directly over to machine learning. In the neural network context, logits play the role of negative energies: tokens with higher logits (lower "energy") are more probable. The temperature parameter T controls how aggressively the model favors high-logit tokens over low-logit ones, just as physical temperature controls how strongly a system prefers low-energy configurations.
The earliest uses of temperature in neural network sampling trace back to Boltzmann machines, introduced by Geoffrey Hinton and Terrence Sejnowski in the 1980s [5]. In these stochastic neural networks, temperature controlled the randomness of neuron activation, and simulated annealing (gradually lowering temperature) was used to find optimal configurations [11]. The same temperature trick reappeared in the soft-attention literature, in policy gradient reinforcement learning under the name Boltzmann exploration, and eventually in the autoregressive sequence models of the mid-2010s, where it became the standard knob for trading off diversity against fidelity in character-level RNNs and later in transformer language models.
During text generation, a language model predicts the next token by computing a score (called a logit) for every token in its vocabulary. These raw logits are then converted into a probability distribution using the softmax function. Temperature modifies this process by dividing each logit by the temperature value T before applying softmax.
Without temperature scaling, the standard softmax function converts logits z = (z_1, z_2, ..., z_V) for a vocabulary of size V into probabilities:
P(token_i) = exp(z_i) / sum from j=1 to V of exp(z_j)
Each logit is exponentiated, and the results are normalized so that all probabilities sum to 1. The exponential function amplifies differences between logits: a logit of 5 produces a value roughly 150 times larger than a logit of 0, so even modest gaps in raw scores translate into very large gaps in probability.
When temperature T is introduced, the logits are divided by T before exponentiation:
P(token_i) = exp(z_i / T) / sum from j=1 to V of exp(z_j / T)
This single modification has a significant effect on the output distribution. Because softmax converts logit differences into probability ratios, dividing by T < 1 amplifies the gap between the top token and its competitors, while dividing by T > 1 compresses that gap.
| Temperature range | Effect on logits | Effect on distribution | Result |
|---|---|---|---|
| T approaching 0 | Divides logits by a very small number, making them very large in magnitude | Distribution becomes extremely peaked (one-hot) | The highest-logit token gets probability near 1 |
| T = 1 | Logits remain unchanged | Original distribution as learned during training | Standard behavior |
| T > 1 | Divides logits by a number greater than 1, compressing them toward zero | Distribution becomes flatter (more uniform) | Lower-probability tokens gain probability mass |
| T approaching infinity | All logits approach zero | Distribution approaches uniform (1/V for each token) | All tokens become equally likely |
To understand why temperature works the way it does, consider what dividing logits by T does to the differences between them. Suppose two tokens have logits 5 and 3, giving a difference of 2. At T = 0.5, the effective logits become 10 and 6 (difference of 4, amplified). At T = 2, the effective logits become 2.5 and 1.5 (difference of 1, compressed). Because the softmax function converts differences in logits into ratios of probabilities, amplifying differences makes the distribution more peaked, while compressing differences makes it flatter.
A useful mental picture is that temperature stretches or compresses the model's confidence horizontally on the logit scale. The shape of the distribution stays "in the same family," but its sharpness varies. This is qualitatively different from top-k or top-p, which physically cut the tail off the distribution and renormalize what remains. Temperature still allows the model to occasionally pick tail tokens; it just makes that less or more likely than the model originally thought.
Consider a simple vocabulary of four tokens with the following logits:
| Token | Logit (z) |
|---|---|
| "the" | 5.0 |
| "a" | 3.0 |
| "one" | 1.0 |
| "some" | 0.5 |
The probabilities at different temperatures:
| Token | T = 0.25 | T = 0.5 | T = 1.0 | T = 2.0 | T = 5.0 |
|---|---|---|---|---|---|
| "the" | 0.9997 | 0.9820 | 0.8360 | 0.5220 | 0.3150 |
| "a" | 0.0003 | 0.0177 | 0.1131 | 0.2363 | 0.2548 |
| "one" | ~0 | 0.0002 | 0.0153 | 0.1056 | 0.2192 |
| "some" | ~0 | 0.0001 | 0.0093 | 0.0796 | 0.2110 |
At T = 0.25, "the" has a near-certain probability of 99.97%. At T = 5.0, the distribution is much more even, and the model might plausibly select any of the four tokens. The probability of "some" (the least likely token at T = 1) increases from under 1% to over 21% as temperature rises from 1.0 to 5.0. In a real vocabulary of 100,000 or more tokens, this redistribution effect is what causes very high temperatures to produce gibberish: at T = 5 even quite ungrammatical or off-topic tokens accumulate enough probability mass to be sampled occasionally, and over a long generation those bad picks compound.
Setting temperature to 0 is a special case. Dividing by zero is undefined, but in the limit as T approaches 0 the softmax assigns all probability mass to the token with the highest logit. In practice, most LLM implementations handle T = 0 by switching to greedy decoding (also called argmax decoding), which always selects the most probable token at each step.
Characteristics of T = 0 / greedy decoding:
| Property | Description |
|---|---|
| Output determinism | Nearly deterministic (same input produces same output) |
| Diversity | Minimal; always picks the single most likely token |
| Creativity | Very low |
| Risk of repetition | High; can get stuck in repetitive loops |
| Use cases | Factual questions, code generation, math problems, structured outputs |
A well-known problem with greedy decoding is repetition degeneracy: the model can enter loops where the same phrase is generated repeatedly. Holtzman et al. (2019) documented this effect at length, showing that even strong autoregressive models like GPT-2 produce dull, repetitive text under maximization-style decoding [1]. The cause is that once a token is generated, it influences the context, making the same token likely again, and the model has no built-in mechanism for breaking out of the loop. This is one reason why some amount of sampling randomness (T > 0) is often preferred even for tasks where accuracy is prioritized.
Note that even with T = 0, outputs may not be perfectly deterministic. Floating-point arithmetic differences across hardware, batching effects, parallel reductions in matrix multiplications, and non-deterministic GPU operations can all cause runs to diverge. Anthropic's documentation explicitly warns that even with temperature 0.0, the results will not be fully deterministic [12]. Reproducibility of greedy outputs across hardware generations is a known engineering problem, and major labs use deterministic CUDA kernels and fixed seeds when they need bit-exact runs.
Some implementations also distinguish between "temperature 0" and a true argmax mode. In a few stacks T = 0 is silently clamped to a small positive value like 1e-5 and sampling proceeds normally, which means tied logits can still resolve differently across runs. If exact reproducibility matters, the safer choice is to call a documented greedy mode rather than relying on T = 0.
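A sampler that treats T = 0 as an explicit argmax branch, rather than clamping it to a tiny positive value, might look like the following sketch (NumPy; the epsilon cutoff and function name are illustrative assumptions):

```python
import numpy as np

def sample_token(logits: np.ndarray, temperature: float,
                 rng: np.random.Generator) -> int:
    # Treat very small temperatures as an explicit greedy/argmax mode,
    # so tied-logit behavior does not depend on floating-point noise.
    if temperature < 1e-6:
        return int(np.argmax(logits))
    scaled = logits / temperature
    scaled -= scaled.max()                      # numerical stability
    probs = np.exp(scaled)
    probs /= probs.sum()
    return int(rng.choice(len(logits), p=probs))
```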
At T = 1, the logits pass through softmax without modification. The resulting probability distribution reflects the model's learned distribution from training. This is the default setting for many LLM APIs.
Characteristics of T = 1:
| Property | Description |
|---|---|
| Output determinism | Moderate randomness |
| Diversity | Moderate |
| Creativity | Balanced between coherence and variety |
| Use cases | General-purpose conversation, summarization, most standard tasks |
Many commercial LLM providers use T = 1 as the API default. OpenAI's GPT models default to T = 1 in API calls [13], though the ChatGPT interface reportedly uses a value around 0.7 internally. Anthropic's Claude API defaults to T = 1 with a range of 0.0 to 1.0 [12]. Google's Gemini API also defaults to 1.0 with a range of 0.0 to 2.0, and Google explicitly recommends keeping T = 1 for Gemini 3 reasoning tasks because lowering it can cause looping or degraded performance [14].
It is worth noting that T = 1 does not mean the model generates "randomly." Because the model's learned distribution strongly favors coherent, grammatical text, most tokens will still receive very low probabilities even at T = 1. The model will still mostly select high-probability tokens, but with enough variation to avoid repetitive patterns. The question of how "random" T = 1 actually is depends on how confident the model is on a given token. On a deterministic continuation like "The capital of France is" the next-token distribution is so peaked on "Paris" that even T = 1 sampling will essentially always pick the same word. On a wide-open continuation like "My favorite hobby is" the distribution is naturally flat, and T = 1 sampling can produce highly varied outputs.
Setting temperature above 1 flattens the probability distribution, giving more weight to less likely tokens. This increases the diversity and unpredictability of generated text.
Characteristics of high temperature:
| Property | Description |
|---|---|
| Output determinism | Low; outputs vary significantly between runs |
| Diversity | High; the model explores a wider range of vocabulary |
| Creativity | High; more surprising and novel word choices |
| Coherence risk | Can produce incoherent, grammatically incorrect, or nonsensical text |
| Use cases | Creative writing, brainstorming, poetry, generating diverse options |
Most LLM APIs cap the temperature at 2.0 (OpenAI, Gemini) or 1.0 (Anthropic). Setting the temperature too high can make outputs effectively random and unusable. In practice, values above 1.5 frequently produce text with grammatical errors, nonsensical phrases, or abrupt topic changes. Open-source inference stacks like vLLM, llama.cpp, and Text Generation Inference allow values above 2.0, but at that point useful output usually requires combining temperature with truncation samplers (top-p, top-k, or min-p) that strip out the long, unreliable tail of the distribution before sampling.
The Min-p paper (Nguyen et al., 2024) makes the case that pairing very high temperatures (T = 1.5 to 3.0) with min-p truncation can produce text that is both creative and coherent, because min-p removes the implausible tail tokens that high temperature would otherwise allow [3]. This combination has become a popular preset in roleplay and creative-writing setups built on local models.
For most production applications, temperatures between 0.0 and 1.0 are the most commonly used range. Values in this range keep outputs coherent while permitting different degrees of variation:
| Temperature | Behavior | Example application |
|---|---|---|
| 0.0 | Deterministic; always picks the top token | JSON extraction, classification |
| 0.1 to 0.3 | Nearly deterministic with slight variation | Code generation, factual Q&A |
| 0.3 to 0.5 | Minor variation; mostly follows the most likely path | Summarization, translation |
| 0.5 to 0.7 | Moderate variation; natural-sounding diversity | Chatbots, email drafting |
| 0.7 to 0.9 | Noticeable variation; occasionally surprising word choices | Creative writing, story generation |
| 0.9 to 1.0 | Full model distribution; maximum variety without over-randomness | Brainstorming, poetry, exploratory prompts |
The sub-1.0 range can be thought of as "sharpening" the model's distribution: the model still follows its learned patterns but with less willingness to deviate from the most probable path. Production deployments that prioritize a consistent voice (chat assistants for customer support, internal copilots) typically settle in the 0.3 to 0.7 range, while deployments that need creative variety (story writing, marketing copy generation) push higher into 0.8 to 1.0.
Temperature is one decoding strategy in a larger family that has grown rapidly since 2018. Each method controls a different aspect of how candidate tokens are filtered or shaped before a token is finally drawn. Understanding the family is essential for choosing the right combination for a given task.
| Method | Year introduced | Core idea | Key parameter |
|---|---|---|---|
| Greedy decoding | Classical | Always pick the highest-probability token | None |
| Pure sampling | Classical | Sample directly from the model's softmax distribution | None |
| Temperature sampling | 1980s in Boltzmann machines, ubiquitous since 2015 | Scale logits by 1/T before softmax | T (temperature) |
| Top-k sampling | Fan et al. 2018 [2] | Truncate to the k tokens with highest probability | k (cutoff count) |
| Top-p (nucleus) sampling | Holtzman et al. 2019 [1] | Truncate to the smallest set whose cumulative probability exceeds p | p (cumulative threshold) |
| Mirostat | Basu et al. 2020 [4] | Dynamically adjust truncation to target a perplexity setpoint | tau (target surprise), eta (learning rate) |
| Locally typical sampling | Meister et al. 2022 [6] | Sample tokens whose information content is close to the conditional entropy | tau (typicality threshold) |
| Min-p sampling | Nguyen et al. 2024 [3] | Keep tokens with probability at least min_p times the top token's probability | min_p (relative threshold) |
| Dynamic temperature | Open-source community 2023+ | Adjust T per step based on entropy of the current distribution | min_T, max_T, exponent |
| XTC (Exclude Top Choices) | Open-source community 2024 [16] | Probabilistically remove the top tokens to force tail picks | xtc_threshold, xtc_probability |
| DRY (Don't Repeat Yourself) | Open-source community 2024 [16] | Penalize tokens that would extend a sequence already seen in the context | multiplier, base, allowed_length |
| Speculative decoding | Leviathan et al. 2023 [17] | Draft with a small model, verify with the large model | Draft model, acceptance rule |
Speculative decoding is conceptually orthogonal to the others: it does not change which distribution is sampled from, only how that sampling is implemented. The other methods change the effective distribution and so directly affect output quality.
Temperature is most often used in combination with the truncation samplers, which filter which tokens are eligible for selection. The three most common combinations are temperature alone, temperature plus top-k, and temperature plus top-p.
Top-k sampling restricts the set of candidate tokens to the k tokens with the highest probabilities. After filtering, the probabilities are renormalized and a token is sampled from this reduced set. The method was popularized by Fan, Lewis, and Dauphin (2018) in their hierarchical neural story generation paper, which used k = 10 to k = 100 paired with a temperature around 0.7 [2].
Key settings:
| Setting | Typical value / meaning |
|---|---|
| k = 1 | Equivalent to greedy decoding, regardless of temperature |
| k = 10 to 100 | Range used by Fan et al. [2]; common in practice |
| k = 50 | Default in Hugging Face Transformers |
| k = V (vocabulary size) | Equivalent to pure sampling |
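In code, top-k filtering reduces to sorting (or partially sorting) the logits and masking everything below the k-th largest. A minimal sketch with an illustrative name (real implementations use partial sorts for efficiency):

```python
import numpy as np

def top_k_filter(logits: np.ndarray, k: int) -> np.ndarray:
    """Mask everything below the k-th largest logit to -inf so softmax
    assigns those tokens zero probability. Ties at the k-th value may
    admit slightly more than k tokens."""
    kth_largest = np.sort(logits)[-k]
    return np.where(logits >= kth_largest, logits, -np.inf)
```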
A central limitation of top-k is that it uses a fixed number of candidates regardless of how the probability is distributed. When the model is confident (probability concentrated on a few tokens), k = 50 may include many irrelevant low-probability tokens. When the model is uncertain (probability spread widely), k = 50 may exclude viable candidates. This shape-blindness is what motivated nucleus sampling.
Top-p sampling, introduced by Holtzman et al. (2019) in "The Curious Case of Neural Text Degeneration," dynamically selects the smallest set of tokens whose cumulative probability exceeds a threshold p [1]. Unlike top-k, which always considers a fixed number of tokens, top-p adapts the number of candidates based on the shape of the distribution.
Top-p is generally preferred over top-k because of its adaptive nature. When the model is confident, only a few tokens are needed to reach the cumulative threshold. When the model is uncertain, more tokens are automatically included. The Holtzman paper shows that nucleus sampling closely matches the diversity statistics of human-written text on a range of generation tasks, while pure sampling at T = 1 is too noisy and beam search is too repetitive.
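A corresponding sketch of the nucleus filtering step (again with illustrative names):

```python
import numpy as np

def top_p_filter(logits: np.ndarray, p: float) -> np.ndarray:
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]               # most probable first
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1   # first index where mass >= p
    filtered = np.full_like(logits, -np.inf)
    filtered[order[:cutoff]] = logits[order[:cutoff]]
    return filtered
```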
Temperature and top-p/top-k operate at different stages of the sampling pipeline:
| Parameter | What it controls | When it acts | Effect |
|---|---|---|---|
| Temperature | Shape of the probability distribution | Before filtering | Changes the relative probabilities of all tokens |
| Top-k | Maximum number of candidate tokens | After temperature | Hard cutoff on number of tokens |
| Top-p | Cumulative probability threshold | After temperature (and optionally after top-k) | Adaptive cutoff based on probability mass |
| Min-p | Relative probability threshold | After temperature | Adaptive cutoff based on the top token's probability |
| Repetition penalty | Penalize previously generated tokens | Modifies logits before temperature | Discourages literal repetition |
OpenAI's API documentation recommends altering either temperature or top-p, but not both simultaneously, as their combined effect can be unpredictable. Despite this guidance, many practitioners find that using moderate temperature (0.5 to 0.8) with top-p around 0.9 to 0.95 produces good results, and most production stacks ship sensible defaults for both.
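Putting the stages together in the order shown in the table, a full sampling step might look like the following sketch (the penalty formula mirrors the common CTRL-style implementation in open-source stacks; all names are illustrative assumptions):

```python
import numpy as np

def sample_next_token(logits, generated_ids, temperature=0.7,
                      repetition_penalty=1.1, top_k=50, top_p=0.95, rng=None):
    rng = rng or np.random.default_rng()
    logits = logits.astype(np.float64).copy()

    # 1. Repetition penalty acts on raw logits, before temperature.
    for t in set(generated_ids):
        if logits[t] > 0:
            logits[t] /= repetition_penalty
        else:
            logits[t] *= repetition_penalty

    # 2. Temperature reshapes the whole distribution.
    logits /= temperature

    # 3. Top-k: hard cutoff on the number of candidates.
    if top_k < len(logits):
        logits[logits < np.sort(logits)[-top_k]] = -np.inf

    # 4. Top-p: adaptive cutoff on cumulative probability mass.
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]
    cutoff = np.searchsorted(np.cumsum(probs[order]), top_p) + 1
    keep = np.zeros(len(logits), dtype=bool)
    keep[order[:cutoff]] = True
    probs = np.where(keep, probs, 0.0)
    probs /= probs.sum()

    # 5. Draw the token from whatever survives.
    return int(rng.choice(len(logits), p=probs))
```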
Min-p sampling, introduced by Nguyen et al. in "Turning Up the Heat: Min-p Sampling for Creative and Coherent LLM Outputs" (2024), sets a minimum probability threshold relative to the top token's probability [3]. A token is only considered if its probability is at least (min_p times the probability of the top token). Unlike top-k, which uses a fixed count, and top-p, which uses cumulative probability, min-p uses a relative threshold.
This approach handles varying distribution shapes well because the threshold automatically adjusts based on how confident the model is about its top choice. When the top token has 90% probability and min_p = 0.05, only tokens with at least 4.5% probability survive, which is a very tight cutoff. When the top token has 10% probability, the same min_p admits anything above 0.5%, keeping more options open.
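The rule is essentially a one-liner on top of the softmax probabilities. A minimal sketch with illustrative names:

```python
import numpy as np

def min_p_filter(logits: np.ndarray, min_p: float = 0.05) -> np.ndarray:
    """Keep tokens whose probability is at least min_p times the top token's."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return np.where(probs >= min_p * probs.max(), logits, -np.inf)
```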
The Min-p paper claims that the method enables aggressive temperature settings (T > 1.5) without producing incoherent text, because the relative threshold strips out implausible tokens even when temperature has flattened the distribution. The paper was selected as an Oral presentation at ICLR 2025 and the technique has been adopted in Hugging Face Transformers, vLLM, llama.cpp, and many open-source inference stacks. A 2025 critical analysis by Schaeffer et al. challenged some of the claimed benefits of min-p over top-p, arguing that reanalyzed human evaluations did not show the original paper's reported quality and diversity gains [18]. The debate over min-p's empirical value is ongoing, but it remains a widely available sampler in current frameworks.
Mirostat (Basu et al., 2020) takes a different approach: rather than fixing a static cutoff, it dynamically tunes the truncation level at every step to keep the cross-entropy of the generated text close to a target perplexity setpoint [4]. The idea is that very low perplexity correlates with the "boredom trap" of repetitive generation, while very high perplexity correlates with the "confusion trap" of incoherent text. Mirostat picks an operating point between these two failure modes and adjusts top-k or top-p adaptively to stay there.
Mirostat exposes two parameters: tau (target surprise per token, in nats) and eta (learning rate of the feedback loop). Common defaults are tau = 5.0 and eta = 0.1. Mirostat is implemented in llama.cpp, KoboldAI, and several front-ends for local LLM use, but it has not seen widespread adoption in commercial APIs, partly because temperature plus top-p is already adequate for most use cases.
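A simplified sketch of the Mirostat v2 feedback loop (surprisal measured in bits, as in the paper; initialization and bookkeeping details vary across implementations, so treat this as an assumption-laden outline):

```python
import numpy as np

def mirostat_v2_step(logits, mu, tau=5.0, eta=0.1, rng=None):
    """One decoding step: truncate tokens more surprising than mu, sample,
    then nudge mu so running surprise tracks the target tau.
    Returns (token, updated_mu); mu is conventionally initialized to 2 * tau."""
    rng = rng or np.random.default_rng()
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    surprisal = -np.log2(probs)                 # information content in bits
    keep = surprisal <= mu
    keep[np.argmax(probs)] = True               # always keep the top token
    truncated = np.where(keep, probs, 0.0)
    truncated /= truncated.sum()
    token = int(rng.choice(len(probs), p=truncated))
    mu -= eta * (surprisal[token] - tau)        # feedback toward target surprise
    return token, mu
```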
Locally typical sampling (Meister, Pimentel, Wiher, and Cotterell, 2022) draws on information theory rather than direct probability cutoffs [6]. The method keeps tokens whose negative log-probability (their "surprisal" or "information content") is close to the model's conditional entropy at that step. The intuition is that natural human language tends to convey information at a roughly steady rate, neither too predictable nor too surprising, and that decoding should target this typical regime rather than always picking the most probable continuation.
The paper shows that locally typical sampling matches or exceeds top-p in human evaluation on summarization and story generation, while consistently reducing degenerate repetition. The technique is implemented in Hugging Face Transformers as typical_p, with values around 0.95 commonly recommended.
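A sketch of the typicality filter along the lines of what typical_p activates (illustrative helper name):

```python
import numpy as np

def typical_filter(logits: np.ndarray, typical_p: float = 0.95) -> np.ndarray:
    """Keep tokens whose surprisal is closest to the distribution's entropy,
    accumulating probability mass until typical_p is reached."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    surprisal = -np.log(probs)
    entropy = float(np.sum(probs * surprisal))       # conditional entropy H
    order = np.argsort(np.abs(surprisal - entropy))  # most "typical" tokens first
    cutoff = np.searchsorted(np.cumsum(probs[order]), typical_p) + 1
    filtered = np.full_like(logits, -np.inf)
    filtered[order[:cutoff]] = logits[order[:cutoff]]
    return filtered
```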
A frequent question is when to use temperature sampling versus beam search. The two methods belong to different decoding philosophies and serve different goals.
| Aspect | Temperature sampling | Beam search |
|---|---|---|
| Determinism | Stochastic; same prompt yields different outputs | Deterministic; no sampling involved |
| Optimization target | Sample from the model's distribution | Approximate the most-probable sequence |
| Diversity | High; scales with T | Low; tends to produce minor variants of the same hypothesis |
| Memory cost | Low; one beam per request | High; K beams per request, scales with K |
| Latency cost | Low; one forward pass per token | Roughly K times higher per request |
| Quality on translation, ASR | Worse than beam at metrics like BLEU and WER | Standard production decoder |
| Quality on open-ended chat | Standard production decoder | Often produces dull, repetitive output |
| Failure mode | Incoherence at high T | Repetition, length bias, the "empty translation" problem |
| Compatibility with KV cache batching | Excellent; standard LLM serving stacks optimize for it | Awkward; K-fold KV expansion complicates continuous batching |
The rough rule of thumb is that beam search remains the default for short, high-fidelity outputs where there is essentially one right answer (machine translation, automatic speech recognition, structured code generation with grammar constraints). Temperature sampling, usually combined with top-p or min-p, is the default for long, open-ended outputs where one right answer does not exist (chat, creative writing, brainstorming). The Holtzman paper laid out the case that beam search is inappropriate for open-ended generation because the model's most-likely sequence tends to be repetitive and bland [1], and the LLM ecosystem has moved decisively in this direction since around 2020.
Different LLM providers implement temperature with varying ranges and defaults, and a few constrain the parameter for specific model families.
| Model / provider | Temperature range | Default | Notes |
|---|---|---|---|
| OpenAI GPT-4, GPT-4o | 0.0 to 2.0 | 1.0 | ChatGPT interface may use ~0.7 internally |
| OpenAI o1, o3 reasoning models | Fixed at 1.0 | 1.0 | Temperature, top_p, n cannot be changed; reasoning_effort controls thinking depth instead [15] |
| Anthropic Claude | 0.0 to 1.0 | 1.0 | Range capped at 1.0; extended thinking modes typically require T = 1 [12] |
| Google Gemini | 0.0 to 2.0 | 1.0 | Google explicitly recommends keeping T = 1.0 on Gemini 3 reasoning models [14] |
| Meta LLaMA | 0.0 to 2.0+ | Typically 0.6 to 0.8 | Open-weight; users can set any value |
| Mistral | 0.0 to 1.5 | 0.7 | Recommended values vary by task |
| Cohere Command R | 0.0 to 1.0 | 0.3 | Lower default reflects preference for precision |
| DeepSeek-V3, DeepSeek-R1 | 0.0 to 2.0 | Varies | Documentation suggests T = 0.6 for general chat, T = 0 for math/code |
Some providers also offer a "deterministic" or "greedy" mode that is functionally equivalent to T = 0 but may be implemented differently at the infrastructure level. Reasoning models like OpenAI's o1 and o3, Google's Gemini Thinking variants, and Anthropic's Claude with extended thinking generally restrict temperature changes because their internal chain-of-thought sampling is tuned at training time and altering it can damage benchmark performance [15]. Instead of a temperature dial these models expose a reasoning depth or budget control.
Fine-tuning can change how a model responds to temperature. A model fine-tuned on a narrow, specific task may produce high-quality outputs at T = 0 because its learned distribution is already sharply focused on the correct patterns. A general-purpose model may benefit from moderate temperature to explore its broader distribution.
Models that have undergone reinforcement learning from human feedback (RLHF) or related preference-tuning methods like DPO behave differently with temperature than their base versions. RLHF tends to sharpen the model's distribution toward preferred outputs, which means the effective behavior at a given temperature is less random than for the base model at the same temperature. Practitioners often find that fine-tuned chat models give serviceable output across a wider T range than the corresponding base model.
Choosing the right temperature depends on the task, the desired output characteristics, and the specific model being used. The values below are starting points, not commitments; empirical sweeps over a small validation set are the most reliable way to settle on a final number.
| Task | Recommended temperature | Reasoning |
|---|---|---|
| Factual question answering | 0.0 to 0.3 | Accuracy is paramount; minimize randomness |
| Code generation | 0.0 to 0.2 | Code must be syntactically and semantically correct |
| SQL generation, data extraction | 0.0 | Deterministic output needed to match expected formats |
| Summarization | 0.3 to 0.5 | Some variety in phrasing is acceptable, but fidelity matters |
| General conversation | 0.5 to 0.8 | Balance between coherence and natural-sounding variety |
| Translation | 0.0 to 0.3 (or use beam) | Accuracy matters; beam is often the better choice |
| Creative writing | 0.7 to 1.2 | Encourage diverse and surprising word choices |
| Brainstorming | 0.8 to 1.5 | Maximize diversity of ideas |
| Poetry and fiction | 1.0 to 1.5 | High creativity; unusual word combinations are desirable |
| Roleplay / character dialogue | 0.8 to 1.2 with min-p | Need diversity but coherent persona |
| Math / reasoning with self-consistency | 0.5 to 0.7, sample N times | Some randomness so samples disagree, then pick majority [7] |
| Math / reasoning, single sample | 0.0 | Treat as a deterministic search |
| Tool use / function calling | 0.0 to 0.2 | Argument JSON must be exact |
Several pitfalls recur when tuning temperature in practice:
| Pitfall | Why it happens | How to fix |
|---|---|---|
| Repetitive outputs | Temperature too low; greedy decoding gets stuck | Increase temperature slightly (0.1 to 0.3) or add top-p |
| Incoherent or nonsensical text | Temperature too high without truncation | Lower temperature; add top-p or min-p filtering |
| Inconsistent behavior across runs | High temperature causes variance | Lower temperature for more consistent outputs; or fix a seed |
| Good first sentence, bad rest | Temperature effect compounds over long sequences | Use lower temperature for longer outputs |
| Model ignores instructions | Very high temperature causes random token selection | Reduce temperature; critical instructions should not rely on high-temperature generation |
| T = 0 still varies | Floating-point noise, parallel reductions, batching effects | Pin a seed; use deterministic kernels; or accept small variance |
| top_k = 1 with high T does nothing | top-k = 1 always picks the single top token regardless of T | Use top-k > 1 if you want any temperature effect |
| Slow code with beam plus high T | Combining beam search with sampling is rarely useful | Use one strategy; sampling for chat, beam for translation |
One of the most influential modern uses of temperature sampling is the self-consistency technique introduced by Wang et al. (2022) in "Self-Consistency Improves Chain of Thought Reasoning in Language Models" [7]. The idea is simple: instead of sampling a single chain-of-thought reasoning path with greedy decoding, sample multiple chains at temperature greater than zero, then pick the final answer that appears most often across the samples.
Formally, given a prompt that elicits step-by-step reasoning, self-consistency samples N completions (typically N = 10 to 40) at T around 0.5 to 0.7 and aggregates their final answers by majority vote. The intuition is that a complex reasoning problem usually admits multiple valid reasoning paths that all converge to the same correct answer, while incorrect reasoning paths tend to produce inconsistent answers. Wang et al. report large gains on arithmetic and commonsense reasoning benchmarks, including +17.9% on GSM8K, +11.0% on SVAMP, +12.2% on AQuA, +6.4% on StrategyQA, and +3.9% on ARC-Challenge over greedy chain-of-thought prompting.
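At the code level the procedure is just a sampling loop plus a vote. In the sketch below, sample_answer is an assumed helper that samples one chain of thought at the given temperature and returns the extracted final answer:

```python
from collections import Counter

def self_consistency(prompt: str, sample_answer, n: int = 20,
                     temperature: float = 0.6) -> str:
    """Sample n reasoning chains at moderate temperature and return the
    majority final answer, in the style of Wang et al. (2022)."""
    answers = [sample_answer(prompt, temperature=temperature) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]
```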
Self-consistency was a pivotal change in how decoding was thought about for reasoning. Before this work, greedy decoding was the default for math and code because randomness was assumed to hurt accuracy. After, sampling at moderate T became standard for reasoning evaluations, and the same ideas underpin modern reasoning models that internally sample many parallel rollouts and pick the best one. The reasoning model trend culminating in OpenAI's o1 and o3, Google's Gemini Thinking, DeepSeek-R1, and Anthropic's Claude with extended thinking is in large part a story about exploiting sample-and-verify pipelines that depend on temperature sampling rather than greedy or beam decoding.
Variants of self-consistency include weighted voting (where the answer probabilities of each sample contribute weighted by their likelihood), universal self-consistency (which uses the LLM itself to judge consistency for free-form answers), and verifier-based reranking (where a separate model scores each candidate and the highest-scoring one is returned).
The practical tension at the heart of all sampling work is the tradeoff between quality and diversity.
This tradeoff is sometimes formalized as a quality-diversity Pareto frontier. Different decoding strategies sit at different points on the frontier. Greedy decoding sits at the high-quality, low-diversity extreme; unfiltered sampling at very high temperature sits at the high-diversity, low-quality extreme. Top-p, min-p, and locally typical sampling all attempt to push the frontier outward by removing the worst-quality tail tokens while preserving diversity in the middle of the distribution.
The right operating point depends on whether you are evaluating a single output or a population of outputs. For a chatbot returning one reply, you usually want high quality. For a creative tool generating ten candidate captions, you want diversity even at the cost of some bad candidates because the user (or a downstream reranker) will throw away the failures.
Although most commonly associated with LLMs, the concept of temperature scaling appears in several other areas of machine learning.
In knowledge distillation, temperature is used to "soften" the probability distribution of a teacher model's outputs. Hinton, Vinyals, and Dean (2015) introduced this technique, in which a large teacher model's soft predictions (generated with high temperature) are used to train a smaller student model [8]. The high temperature reveals more information about the teacher's learned relationships between classes than hard (one-hot) labels would. For example, a teacher model's output at T = 1 might assign 90% probability to "cat" and 5% each to "dog" and "tiger." At T = 5, the distribution softens to something like 50% "cat," 25% "dog," 25% "tiger," revealing that the model considers "dog" and "tiger" more similar to "cat" than to unrelated classes.
A technical detail from the Hinton paper is that, when training the student, gradients must be scaled by T squared to maintain proper gradient magnitudes when temperature is greater than 1. Without this scaling, the soft-target loss would dominate or be dominated by the hard-target loss in unintended ways.
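A sketch of the resulting loss in PyTorch, with the T-squared factor applied to the soft term (alpha, the blend weight between soft and hard losses, is an illustrative choice):

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      T: float = 4.0, alpha: float = 0.5):
    """Hinton-style loss: temperature-softened KL term scaled by T^2,
    blended with the ordinary hard-label cross-entropy."""
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    log_student = F.log_softmax(student_logits / T, dim=-1)
    soft = F.kl_div(log_student, soft_targets, reduction="batchmean") * T ** 2
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```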
In diffusion models used for image generation (such as Stable Diffusion and DALL-E), classifier-free guidance scale plays a similar role to temperature. Higher guidance scale values produce outputs that more closely match the prompt (analogous to lower temperature), while lower values allow more variation. Some text-to-image stacks also expose a separate sampling temperature that perturbs the noise schedule itself, but the guidance scale is the more commonly tuned dial.
In reinforcement learning, particularly in the Boltzmann exploration strategy, temperature controls the tradeoff between exploitation (choosing the best-known action) and exploration (trying other actions). Lower temperature favors exploitation; higher temperature favors exploration. This is mathematically identical to temperature-scaled softmax applied to action values (Q-values). Algorithms such as soft actor-critic (SAC) make this explicit, learning a temperature parameter alongside the policy to balance reward maximization against entropy.
In contrastive representation learning (SimCLR, CLIP, and the InfoNCE family of losses), a temperature parameter scales the similarity scores between embeddings before softmax. Lower temperature emphasizes hard negatives more strongly; higher temperature treats positives and negatives more uniformly. Tuning this temperature is a key hyperparameter in contrastive pretraining recipes.
A related but distinct use of "temperature" in machine learning is temperature scaling for model calibration, introduced by Guo, Pleiss, Sun, and Weinberger (2017) in "On Calibration of Modern Neural Networks" [9]. In this context, a single scalar temperature parameter T is learned on a held-out validation set and applied to a trained model's logits to improve the calibration of its probability estimates. This has nothing to do with sampling randomness; instead, it adjusts the model's confidence levels so that a predicted probability of 0.9 actually corresponds to 90% accuracy.
Guo et al. found that modern deep classifiers are systematically overconfident: their predicted probabilities are too high relative to their empirical accuracy. Dividing logits by a learned T > 1 (typically between 1.5 and 3) softens the distribution and brings predicted probabilities back into agreement with observed correctness rates. The technique is a single-parameter variant of Platt scaling, costs essentially nothing to apply at inference, and remains a strong baseline for post-hoc calibration in classification settings.
The mathematical form is identical to temperature sampling, but the goal is the opposite: sampling temperature changes the operational distribution to control sampling randomness, while calibration temperature changes the reported probabilities to make them honest. Both reuse the same softmax-with-temperature equation.
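A minimal fitting routine under these definitions, using SciPy to minimize validation NLL over the single scalar (function and variable names are illustrative):

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import logsumexp

def fit_calibration_temperature(val_logits: np.ndarray,
                                val_labels: np.ndarray) -> float:
    """Learn one scalar T on held-out (N, C) logits and (N,) integer labels
    by minimizing negative log-likelihood, as in Guo et al. (2017)."""
    def nll(T: float) -> float:
        log_probs = val_logits / T - logsumexp(val_logits / T,
                                               axis=1, keepdims=True)
        return -log_probs[np.arange(len(val_labels)), val_labels].mean()
    # T > 1 softens an overconfident model; fitted values often land in 1.5 to 3.
    return minimize_scalar(nll, bounds=(0.05, 20.0), method="bounded").x
```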
The entropy of the probability distribution increases monotonically with temperature. At T = 0, entropy is 0 (all probability on one token, a degenerate distribution). At T = 1, entropy equals that of the model's learned distribution. As T increases toward infinity, entropy approaches log(V), the maximum possible entropy for a vocabulary of size V. This monotonic relationship means temperature provides a smooth control knob for the amount of randomness in the output, and it is the basis for entropy-targeting decoders like Mirostat.
A useful identity follows from log P(token_i) = z_i / T - log Z(T): the entropy of the temperature-scaled distribution is H = log Z(T) - (1/T) * E[z], where Z(T) is the partition function (the sum over j of exp(z_j / T)) and the expectation is taken under the scaled distribution. How quickly entropy rises with T has no simple closed form; it depends on the spread of the logits, which is one reason entropy-targeting decoders adjust their truncation by feedback rather than solving for T analytically.
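The monotonic relationship is easy to check numerically (a sketch over a mock vocabulary):

```python
import numpy as np

def entropy_at(logits: np.ndarray, T: float) -> float:
    p = np.exp(logits / T - (logits / T).max())
    p /= p.sum()
    return float(-np.sum(p * np.log(p + 1e-12)))

logits = np.random.default_rng(0).normal(size=1000)  # mock 1000-token vocabulary
for T in (0.1, 0.5, 1.0, 2.0, 10.0, 100.0):
    # Entropy climbs monotonically toward log(1000), about 6.91 nats.
    print(f"T={T:>5}: H={entropy_at(logits, T):.3f}")
```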
The Kullback-Leibler (KL) divergence between the T = 1 distribution and a temperature-scaled distribution increases as T moves further from 1 in either direction. Both very low and very high temperatures produce distributions that are significantly different from the model's learned distribution. This is useful to remember when reasoning about why high T can produce gibberish: the actually-sampled distribution at T = 5 is far from the model's training-time output distribution, and the model has no signal that its predictions remain valid in this regime.
The temperature-scaled softmax can be viewed through the lens of energy-based models. Each logit z_i represents the negative energy of token i, and the softmax converts energies into probabilities via the Boltzmann distribution. Temperature controls how sharply the model distinguishes between low-energy (preferred) and high-energy (dispreferred) tokens. This perspective is useful for connecting LLM decoding to the broader literature on Markov chain Monte Carlo, where temperature schedules play a central role in algorithms like simulated annealing and parallel tempering.
During training of a standard classifier, temperature does not affect gradient computation, since training uses the standard softmax or cross-entropy loss without temperature. During inference with temperature-scaled sampling, the choice of temperature can be seen as defining a different inference-time distribution over the model's output space. In knowledge distillation, where temperature is used during training, gradients must be scaled by T squared to maintain proper gradient magnitudes, as Hinton et al. (2015) noted [8].
Temperature is implemented in essentially every modern LLM serving stack and training framework, with minor variations in parameter names and combinations.
Hugging Face Transformers exposes temperature through the generate() method:
```python
outputs = model.generate(
    input_ids,
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
    top_k=50,
    max_new_tokens=512,
)
```
Setting do_sample=False disables sampling and uses greedy decoding regardless of the temperature value; Transformers expects a strictly positive temperature when do_sample=True, so greedy decoding is requested through do_sample=False rather than by passing temperature=0. The typical_p parameter activates locally typical sampling, and repetition_penalty and no_repeat_ngram_size modify logits before sampling [16]. (Frequency and presence penalties, by contrast, are parameters of OpenAI-style APIs rather than of generate().)
vLLM uses a SamplingParams object with the same conceptual parameters:
```python
from vllm import LLM, SamplingParams

params = SamplingParams(
    temperature=0.7,
    top_p=0.95,
    top_k=50,
    max_tokens=512,
)
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
outputs = llm.generate(prompts, params)
```
Text Generation Inference (TGI), TensorRT-LLM, and SGLang follow similar conventions. llama.cpp exposes a richer set of samplers including Mirostat, dynamic temperature, and the open-source XTC and DRY samplers, configurable on the command line or in its server API.
OpenAI, Anthropic, and Google's APIs accept temperature and top_p as JSON parameters to their chat-completion endpoints. The provider applies the parameters internally; the client only sees the final sampled tokens.
Temperature is set per request at inference time, not baked into the model. This makes it straightforward to use different temperatures for different calls in an agentic system. A common pattern in modern agent stacks is:
- T near 0 for tool calls, routing decisions, and structured-output steps, where formats must be exact;
- moderate T (roughly 0.5 to 0.8) for user-facing prose such as chat replies and explanations;
- higher T (0.8 and up) for brainstorming or candidate-generation steps whose outputs a later step filters or reranks.
Frameworks like LangChain, LlamaIndex, and DSPy let practitioners attach different sampling configurations to different chain steps. Some agent frameworks also implement "temperature schedules" that lower T over the course of a long agent run to converge on a final answer, mirroring the simulated annealing tradition.
Speculative decoding accelerates inference by drafting tokens with a small model and verifying them with the large model. The Leviathan et al. (2023) paper proves that, with the correct rejection-sampling rule, speculative decoding produces samples from exactly the same distribution as direct sampling from the large model [17]. Temperature is preserved end-to-end: the draft and target models both apply the requested temperature, and the acceptance probability accounts for any disagreement between their distributions.
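The per-token rejection step from the paper can be sketched as follows; here p and q are the target and draft next-token distributions after any temperature scaling has been applied to both:

```python
import numpy as np

def accept_or_resample(x: int, p: np.ndarray, q: np.ndarray,
                       rng: np.random.Generator) -> int:
    """Leviathan et al. acceptance rule for one drafted token x. The returned
    token is distributed exactly according to the target distribution p."""
    if rng.uniform() < min(1.0, p[x] / q[x]):
        return x                               # accept the draft token
    residual = np.maximum(p - q, 0.0)          # resample from the leftover mass
    residual /= residual.sum()
    return int(rng.choice(len(p), p=residual))
```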
A practical observation, documented in temperature-centric studies of speculative decoding, is that higher temperatures tend to increase the acceptance rate of draft tokens because the target distribution is flatter and thus closer to the draft distribution [10]. This means speculative decoding gives the largest speedups for sampling workloads that already use moderate-to-high T, and somewhat smaller speedups for greedy or near-greedy decoding.
The basic temperature parameter has been stable for a decade, but the surrounding sampler ecosystem has continued to evolve. Several directions are active areas of research and engineering as of 2026.
Dynamic temperature schedules adjust T per token based on the current entropy or top-token probability. The intuition is that the model should commit hard when it is confident (low T) and explore widely when it is uncertain (high T). Open-source frameworks like KoboldCpp and llama.cpp offer dynamic temperature presets that scale T between configured minimum and maximum values according to a smoothing exponent.
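A sketch of the entropy-to-temperature mapping behind these presets (parameter names loosely follow llama.cpp's dynamic-temperature options and should be treated as assumptions):

```python
import numpy as np

def dynamic_temperature(logits: np.ndarray, min_T: float = 0.3,
                        max_T: float = 1.8, exponent: float = 1.0) -> float:
    """Map the normalized entropy of the current next-token distribution
    into [min_T, max_T]: confident steps get low T, uncertain steps high T."""
    p = np.exp(logits - logits.max())
    p /= p.sum()
    entropy = -np.sum(p * np.log(p + 1e-12))
    norm = (entropy / np.log(len(logits))) ** exponent   # in [0, 1]
    return float(min_T + (max_T - min_T) * norm)
```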
Token-level adaptive sampling extends this idea to richer policies, sometimes learned. Recent papers have proposed using a small auxiliary network to predict the right sampler settings for each step, or training the model itself to output distributions that are well-calibrated for downstream sampling.
New samplers like XTC (Exclude Top Choices) and DRY (Don't Repeat Yourself) were proposed by the open-source community in 2024 to address specific failure modes [16]. XTC probabilistically removes the top tokens to force the model into less obvious continuations, breaking writing cliches. DRY penalizes tokens that would extend the input into a sequence that has already appeared in context, dramatically reducing exact-string repetition without the heavy hand of a global repetition penalty.
Temperature-aware speculative decoding tries to design draft models or distillation losses that maximize acceptance at the temperature settings used in production [10]. This has practical impact because most chat workloads run at T near 1, where the gap between draft and target is largest.
Uncertainty-driven decoding uses temperature in concert with uncertainty estimates to detect hallucinations. The premise is that a token sampled at low confidence is a likely site of factual error, and several monitoring systems use this signal to flag risky outputs.
Finally, the rise of reasoning models has shifted attention from per-token sampling to sequence-level sampling. When a model emits a long internal chain of thought before answering, the relevant decoding decisions are about how many parallel chains to sample, what temperature to sample them at, and how to aggregate their answers. Self-consistency, best-of-N reranking, and tree-of-thought search all extend the basic temperature-sampling idea to operate at the level of full reasoning trajectories rather than individual tokens.
As of 2026, temperature sampling combined with top-p (or increasingly min-p) is the de facto default decoding configuration for chat, creative writing, and most open-ended language tasks across the major commercial LLM APIs and open-source serving stacks. Greedy decoding remains the default for code generation, structured-output extraction, and tool-call argument formatting, and beam search retains a strong foothold in machine translation, automatic speech recognition, and grammar-constrained generation.
The reasoning-model era has changed the role of temperature in important ways. Models like OpenAI's o1 and o3 fix temperature internally and expose a reasoning-effort dial instead of a per-call temperature [15]. Anthropic's Claude with extended thinking similarly constrains temperature when the thinking mode is enabled [12]. Google's guidance for Gemini 3 strongly recommends T = 1 for reasoning workloads [14]. These choices reflect the fact that the model providers have already tuned the right temperature into the model's training and post-training, and exposing the dial to end users mostly creates ways to break the model rather than improve it.
For non-reasoning chat workloads, the field has converged on a small set of practical recipes: T around 0.7 with top-p around 0.95 for general chat; T = 0 for code and structured output; T around 1.0 with min-p around 0.05 for creative writing on local models; sample-and-vote with T around 0.5 to 0.7 for math and code reasoning. These recipes are not universal, but they represent the current consensus across the open-source and commercial communities. Temperature sampling itself is unlikely to disappear; the abstractions built on top of it (samplers, schedules, parallel rollouts, voting) are where most of the new work is happening.