Temperature sampling is a technique that controls the randomness of token generation in large language models and other generative neural networks by scaling the model's output logits before the softmax function is applied. The technique introduces a single hyperparameter, conventionally written as T, that reshapes the probability distribution over the next token. A lower temperature concentrates probability mass on the highest-scoring tokens and produces more deterministic, focused output. A higher temperature flattens the distribution and produces more diverse, creative, and sometimes incoherent output. At T = 1 the distribution is left unchanged, at T close to 0 the procedure becomes equivalent to argmax over the logits (also known as greedy decoding), and as T tends toward infinity the distribution approaches uniform.
Temperature is one of the most commonly adjusted parameters in LLM-based applications and is fundamental to controlling the behavior of models such as GPT-4, Claude, Gemini, LLaMA, and other transformer-based architectures. It is the simplest member of a broader family of stochastic decoding strategies that includes top-k sampling, top-p sampling (also called nucleus sampling), min-p sampling, Mirostat, locally typical sampling, and various combinations of these. As a companion technique to beam search, which performs deterministic best-first exploration, temperature sampling supplies the controlled randomness that has become standard for open-ended generation in modern chat-style models.
The concept of temperature in sampling has its roots in statistical mechanics and thermodynamics. In physics, temperature governs the probability of a system occupying different energy states according to the Boltzmann distribution:
P(state_i) = exp(-E_i / (k_B * T)) / Z
where E_i is the energy of state i, k_B is the Boltzmann constant, T is the absolute temperature, and Z is the partition function (a normalizing constant). At low temperatures, the system overwhelmingly occupies the lowest-energy state. At high temperatures, the system explores many states more uniformly.
This physical analogy carries directly over to machine learning. In the neural network context, logits play the role of negative energies: tokens with higher logits (lower "energy") are more probable. The temperature parameter T controls how aggressively the model favors high-logit tokens over low-logit ones, just as physical temperature controls how strongly a system prefers low-energy configurations.
The earliest uses of temperature in neural network sampling trace back to Boltzmann machines, introduced by Geoffrey Hinton and Terrence Sejnowski in the 1980s [5]. In these stochastic neural networks, temperature controlled the randomness of neuron activation, and simulated annealing (gradually lowering temperature) was used to find optimal configurations [11]. The same temperature trick reappeared in the soft-attention literature, in policy gradient reinforcement learning under the name Boltzmann exploration, and eventually in the autoregressive sequence models of the mid-2010s, where it became the standard knob for trading off diversity against fidelity in character-level RNNs and later in transformer language models.
During text generation, a language model predicts the next token by computing a score (called a logit) for every token in its vocabulary. These raw logits are then converted into a probability distribution using the softmax function. Temperature modifies this process by dividing each logit by the temperature value T before applying softmax.
Without temperature scaling, the standard softmax function converts logits z = (z_1, z_2, ..., z_V) for a vocabulary of size V into probabilities:
P(token_i) = exp(z_i) / sum from j=1 to V of exp(z_j)
Each logit is exponentiated, and the results are normalized so that all probabilities sum to 1. The exponential function amplifies differences between logits: a logit of 5 produces a value roughly 150 times larger than a logit of 0, so even modest gaps in raw scores translate into very large gaps in probability.
When temperature T is introduced, the logits are divided by T before exponentiation:
P(token_i) = exp(z_i / T) / sum from j=1 to V of exp(z_j / T)
This single modification has a significant effect on the output distribution. Because softmax converts logit differences into probability ratios, dividing by T < 1 amplifies the gap between the top token and its competitors, while dividing by T > 1 compresses that gap.
| Temperature range | Effect on logits | Effect on distribution | Result |
|---|---|---|---|
| T approaching 0 | Divides logits by a very small number, making them very large in magnitude | Distribution becomes extremely peaked (one-hot) | The highest-logit token gets probability near 1 |
| T = 1 | Logits remain unchanged | Original distribution as learned during training | Standard behavior |
| T > 1 | Divides logits by a number greater than 1, compressing them toward zero | Distribution becomes flatter (more uniform) | Lower-probability tokens gain probability mass |
| T approaching infinity | All logits approach zero | Distribution approaches uniform (1/V for each token) | All tokens become equally likely |
To understand why temperature works the way it does, consider what dividing logits by T does to the differences between them. Suppose two tokens have logits 5 and 3, giving a difference of 2. At T = 0.5, the effective logits become 10 and 6 (difference of 4, amplified). At T = 2, the effective logits become 2.5 and 1.5 (difference of 1, compressed). Because the softmax function converts differences in logits into ratios of probabilities, amplifying differences makes the distribution more peaked, while compressing differences makes it flatter.
A useful mental picture is that temperature stretches or compresses the model's confidence horizontally on the logit scale. The shape of the distribution stays "in the same family," but its sharpness varies. This is qualitatively different from top-k or top-p, which physically cut the tail off the distribution and renormalize what remains. Temperature still allows the model to occasionally pick tail tokens; it just makes that less or more likely than the model originally thought.
Consider a simple vocabulary of four tokens with the following logits:
| Token | Logit (z) |
|---|---|
| "the" | 5.0 |
| "a" | 3.0 |
| "one" | 1.0 |
| "some" | 0.5 |
The probabilities at different temperatures:
| Token | T = 0.25 | T = 0.5 | T = 1.0 | T = 2.0 | T = 5.0 |
|---|---|---|---|---|---|
| "the" | 0.9997 | 0.9820 | 0.8360 | 0.5220 | 0.3150 |
| "a" | 0.0003 | 0.0177 | 0.1131 | 0.2363 | 0.2548 |
| "one" | ~0 | 0.0002 | 0.0153 | 0.1056 | 0.2192 |
| "some" | ~0 | 0.0001 | 0.0093 | 0.0796 | 0.2110 |
At T = 0.25, "the" has a near-certain probability of 99.97%. At T = 5.0, the distribution is much more even, and the model might plausibly select any of the four tokens. The probability of "some" (the least likely token at T = 1) increases from under 1% to over 21% as temperature rises from 1.0 to 5.0. In a real vocabulary of 100,000 or more tokens, this redistribution effect is what causes very high temperatures to produce gibberish: at T = 5 even quite ungrammatical or off-topic tokens accumulate enough probability mass to be sampled occasionally, and over a long generation those bad picks compound.
Setting temperature to 0 is a special case. Dividing by zero is undefined, but in the limit as T approaches 0 the softmax assigns all probability mass to the token with the highest logit. In practice, most LLM implementations handle T = 0 by switching to greedy decoding (also called argmax decoding), which always selects the most probable token at each step.
Characteristics of T = 0 / greedy decoding:
| Property | Description |
|---|---|
| Output determinism | Nearly deterministic (same input produces same output) |
| Diversity | Minimal; always picks the single most likely token |
| Creativity | Very low |
| Risk of repetition | High; can get stuck in repetitive loops |
| Use cases | Factual questions, code generation, math problems, structured outputs |
A well-known problem with greedy decoding is repetition degeneracy: the model can enter loops where the same phrase is generated repeatedly. Holtzman et al. (2019) documented this effect at length, showing that even strong autoregressive models like GPT-2 produce dull, repetitive text under maximization-style decoding [1]. The cause is that once a token is generated, it influences the context, making the same token likely again, and the model has no built-in mechanism for breaking out of the loop. This is one reason why some amount of sampling randomness (T > 0) is often preferred even for tasks where accuracy is prioritized.
Note that even with T = 0, outputs may not be perfectly deterministic. Floating-point arithmetic differences across hardware, batching effects, parallel reductions in matrix multiplications, and non-deterministic GPU operations can all cause runs to diverge. Anthropic's documentation explicitly warns that even with temperature 0.0, the results will not be fully deterministic [12]. Reproducibility of greedy outputs across hardware generations is a known engineering problem, and major labs use deterministic CUDA kernels and fixed seeds when they need bit-exact runs.
Some implementations also distinguish between "temperature 0" and a true argmax mode. In a few stacks T = 0 is silently clamped to a small positive value like 1e-5 and sampling proceeds normally, which means tied logits can still resolve differently across runs. If exact reproducibility matters, the safer choice is to call a documented greedy mode rather than relying on T = 0.
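A sampler that treats T = 0 as an explicit argmax branch, rather than clamping it to a tiny positive value, might look like the following sketch (NumPy; the epsilon cutoff and function name are illustrative assumptions):

```python
import numpy as np

def sample_token(logits: np.ndarray, temperature: float,
                 rng: np.random.Generator) -> int:
    # Treat very small temperatures as an explicit greedy/argmax mode,
    # so tied-logit behavior does not depend on floating-point noise.
    if temperature < 1e-6:
        return int(np.argmax(logits))
    scaled = logits / temperature
    scaled -= scaled.max()                      # numerical stability
    probs = np.exp(scaled)
    probs /= probs.sum()
    return int(rng.choice(len(logits), p=probs))
```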
At T = 1, the logits pass through softmax without modification. The resulting probability distribution reflects the model's learned distribution from training. This is the default setting for many LLM APIs.
Characteristics of T = 1:
| Property | Description |
|---|---|
| Output determinism | Moderate randomness |
| Diversity | Moderate |
| Creativity | Balanced between coherence and variety |
| Use cases | General-purpose conversation, summarization, most standard tasks |
Many commercial LLM providers use T = 1 as the API default. OpenAI's GPT models default to T = 1 in API calls [13], though the ChatGPT interface reportedly uses a value around 0.7 internally. Anthropic's Claude API defaults to T = 1 with a range of 0.0 to 1.0 [12]. Google's Gemini API also defaults to 1.0 with a range of 0.0 to 2.0, and Google explicitly recommends keeping T = 1 for Gemini 3 reasoning tasks because lowering it can cause looping or degraded performance [14].
It is worth noting that T = 1 does not mean the model generates "randomly." Because the model's learned distribution strongly favors coherent, grammatical text, most tokens will still receive very low probabilities even at T = 1. The model will still mostly select high-probability tokens, but with enough variation to avoid repetitive patterns. The question of how "random" T = 1 actually is depends on how confident the model is on a given token. On a deterministic continuation like "The capital of France is" the next-token distribution is so peaked on "Paris" that even T = 1 sampling will essentially always pick the same word. On a wide-open continuation like "My favorite hobby is" the distribution is naturally flat, and T = 1 sampling can produce highly varied outputs.
Setting temperature above 1 flattens the probability distribution, giving more weight to less likely tokens. This increases the diversity and unpredictability of generated text.
Characteristics of high temperature:
| Property | Description |
|---|---|
| Output determinism | Low; outputs vary significantly between runs |
| Diversity | High; the model explores a wider range of vocabulary |
| Creativity | High; more surprising and novel word choices |
| Coherence risk | Can produce incoherent, grammatically incorrect, or nonsensical text |
| Use cases | Creative writing, brainstorming, poetry, generating diverse options |
Most LLM APIs cap the temperature at 2.0 (OpenAI, Gemini) or 1.0 (Anthropic). Setting the temperature too high can make outputs effectively random and unusable. In practice, values above 1.5 frequently produce text with grammatical errors, nonsensical phrases, or abrupt topic changes. Open-source inference stacks like vLLM, llama.cpp, and Text Generation Inference allow values above 2.0, but at that point useful output usually requires combining temperature with truncation samplers (top-p, top-k, or min-p) that strip out the long, unreliable tail of the distribution before sampling.
The Min-p paper (Nguyen et al., 2024) makes the case that pairing very high temperatures (T = 1.5 to 3.0) with min-p truncation can produce text that is both creative and coherent, because min-p removes the implausible tail tokens that high temperature would otherwise allow [3]. This combination has become a popular preset in roleplay and creative-writing setups built on local models.
For most production applications, temperatures between 0.0 and 1.0 are the most commonly used range. Values in this range keep outputs coherent while permitting different degrees of variation:
| Temperature | Behavior | Example application |
|---|---|---|
| 0.0 | Deterministic; always picks the top token | JSON extraction, classification |
| 0.1 to 0.3 | Nearly deterministic with slight variation | Code generation, factual Q&A |
| 0.3 to 0.5 | Minor variation; mostly follows the most likely path | Summarization, translation |
| 0.5 to 0.7 | Moderate variation; natural-sounding diversity | Chatbots, email drafting |
| 0.7 to 0.9 | Noticeable variation; occasionally surprising word choices | Creative writing, story generation |
| 0.9 to 1.0 | Full model distribution; maximum variety without over-randomness | Brainstorming, poetry, exploratory prompts |
The sub-1.0 range can be thought of as "sharpening" the model's distribution: the model still follows its learned patterns but with less willingness to deviate from the most probable path. Production deployments that prioritize a consistent voice (chat assistants for customer support, internal copilots) typically settle in the 0.3 to 0.7 range, while deployments that need creative variety (story writing, marketing copy generation) push higher into 0.8 to 1.0.
Temperature is one decoding strategy in a larger family that has grown rapidly since 2018. Each method controls a different aspect of how candidate tokens are filtered or shaped before a token is finally drawn. Understanding the family is essential for choosing the right combination for a given task.
| Method | Year introduced | Core idea | Key parameter |
|---|---|---|---|
| Greedy decoding | Classical | Always pick the highest-probability token | None |
| Pure sampling | Classical | Sample directly from the model's softmax distribution | None |
| Temperature sampling | 1980s in Boltzmann machines, ubiquitous since 2015 | Scale logits by 1/T before softmax | T (temperature) |
| Top-k sampling | Fan et al. 2018 [2] | Truncate to the k tokens with highest probability | k (cutoff count) |
| Top-p (nucleus) sampling | Holtzman et al. 2019 [1] | Truncate to the smallest set whose cumulative probability exceeds p | p (cumulative threshold) |
| Mirostat | Basu et al. 2020 [4] | Dynamically adjust truncation to target a perplexity setpoint | tau (target surprise), eta (learning rate) |
| Locally typical sampling | Meister et al. 2022 [6] | Sample tokens whose information content is close to the conditional entropy | tau (typicality threshold) |
| Min-p sampling | Nguyen et al. 2024 [3] | Keep tokens with probability at least min_p times the top token's probability | min_p (relative threshold) |
| Dynamic temperature | Open-source community 2023+ | Adjust T per step based on entropy of the current distribution | min_T, max_T, exponent |
| XTC (Exclude Top Choices) | Open-source community 2024 [16] | Probabilistically remove the top tokens to force tail picks | xtc_threshold, xtc_probability |
| DRY (Don't Repeat Yourself) | Open-source community 2024 [16] | Penalize tokens that would extend a sequence already seen in the context | multiplier, base, allowed_length |
| Speculative decoding | Leviathan et al. 2023 [17] | Draft with a small model, verify with the large model | Draft model, acceptance rule |
Speculative decoding is conceptually orthogonal to the others: it does not change which distribution is sampled from, only how that sampling is implemented. The other methods change the effective distribution and so directly affect output quality.
Temperature is most often used in combination with the truncation samplers, which filter which tokens are eligible for selection. The three most common combinations are temperature alone, temperature plus top-k, and temperature plus top-p.
Top-k sampling restricts the set of candidate tokens to the k tokens with the highest probabilities. After filtering, the probabilities are renormalized and a token is sampled from this reduced set. The method was popularized by Fan, Lewis, and Dauphin (2018) in their hierarchical neural story generation paper, which used k = 10 to k = 100 paired with a temperature around 0.7 [2].
Key settings:
| Setting | Typical value / meaning |
|---|---|
| k = 1 | Equivalent to greedy decoding, regardless of temperature |
| k = 10 to 100 | Range used by Fan et al. [2]; common in practice |
| k = 50 | Default in Hugging Face Transformers |
| k = V (vocabulary size) | Equivalent to pure sampling |
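In code, top-k filtering reduces to sorting (or partially sorting) the logits and masking everything below the k-th largest. A minimal sketch with an illustrative name (real implementations use partial sorts for efficiency):

```python
import numpy as np

def top_k_filter(logits: np.ndarray, k: int) -> np.ndarray:
    """Mask everything below the k-th largest logit to -inf so softmax
    assigns those tokens zero probability. Ties at the k-th value may
    admit slightly more than k tokens."""
    kth_largest = np.sort(logits)[-k]
    return np.where(logits >= kth_largest, logits, -np.inf)
```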
A central limitation of top-k is that it uses a fixed number of candidates regardless of how the probability is distributed. When the model is confident (probability concentrated on a few tokens), k = 50 may include many irrelevant low-probability tokens. When the model is uncertain (probability spread widely), k = 50 may exclude viable candidates. This shape-blindness is what motivated nucleus sampling.
Top-p sampling, introduced by Holtzman et al. (2019) in "The Curious Case of Neural Text Degeneration," dynamically selects the smallest set of tokens whose cumulative probability exceeds a threshold p [1]. Unlike top-k, which always considers a fixed number of tokens, top-p adapts the number of candidates based on the shape of the distribution.
Top-p is generally preferred over top-k because of its adaptive nature. When the model is confident, only a few tokens are needed to reach the cumulative threshold. When the model is uncertain, more tokens are automatically included. The Holtzman paper shows that nucleus sampling closely matches the diversity statistics of human-written text on a range of generation tasks, while pure sampling at T = 1 is too noisy and beam search is too repetitive.
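A corresponding sketch of the nucleus filtering step (again with illustrative names):

```python
import numpy as np

def top_p_filter(logits: np.ndarray, p: float) -> np.ndarray:
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]               # most probable first
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1   # first index where mass >= p
    filtered = np.full_like(logits, -np.inf)
    filtered[order[:cutoff]] = logits[order[:cutoff]]
    return filtered
```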
Temperature and top-p/top-k operate at different stages of the sampling pipeline:
| Parameter | What it controls | When it acts | Effect |
|---|---|---|---|
| Temperature | Shape of the probability distribution | Before filtering | Changes the relative probabilities of all tokens |
| Top-k | Maximum number of candidate tokens | After temperature | Hard cutoff on number of tokens |
| Top-p | Cumulative probability threshold | After temperature (and optionally after top-k) | Adaptive cutoff based on probability mass |
| Min-p | Relative probability threshold | After temperature | Adaptive cutoff based on the top token's probability |
| Repetition penalty | Penalize previously generated tokens | Modifies logits before temperature | Discourages literal repetition |
OpenAI's API documentation recommends altering either temperature or top-p, but not both simultaneously, as their combined effect can be unpredictable. Despite this guidance, many practitioners find that using moderate temperature (0.5 to 0.8) with top-p around 0.9 to 0.95 produces good results, and most production stacks ship sensible defaults for both.
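Putting the stages together in the order shown in the table, a full sampling step might look like the following sketch (the penalty formula mirrors the common CTRL-style implementation in open-source stacks; all names are illustrative assumptions):

```python
import numpy as np

def sample_next_token(logits, generated_ids, temperature=0.7,
                      repetition_penalty=1.1, top_k=50, top_p=0.95, rng=None):
    rng = rng or np.random.default_rng()
    logits = logits.astype(np.float64).copy()

    # 1. Repetition penalty acts on raw logits, before temperature.
    for t in set(generated_ids):
        if logits[t] > 0:
            logits[t] /= repetition_penalty
        else:
            logits[t] *= repetition_penalty

    # 2. Temperature reshapes the whole distribution.
    logits /= temperature

    # 3. Top-k: hard cutoff on the number of candidates.
    if top_k < len(logits):
        logits[logits < np.sort(logits)[-top_k]] = -np.inf

    # 4. Top-p: adaptive cutoff on cumulative probability mass.
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]
    cutoff = np.searchsorted(np.cumsum(probs[order]), top_p) + 1
    keep = np.zeros(len(logits), dtype=bool)
    keep[order[:cutoff]] = True
    probs = np.where(keep, probs, 0.0)
    probs /= probs.sum()

    # 5. Draw the token from whatever survives.
    return int(rng.choice(len(logits), p=probs))
```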
Min-p sampling, introduced by Nguyen et al. in "Turning Up the Heat: Min-p Sampling for Creative and Coherent LLM Outputs" (2024), sets a minimum probability threshold relative to the top token's probability [3]. A token is only considered if its probability is at least (min_p times the probability of the top token). Unlike top-k, which uses a fixed count, and top-p, which uses cumulative probability, min-p uses a relative threshold.
This approach handles varying distribution shapes well because the threshold automatically adjusts based on how confident the model is about its top choice. When the top token has 90% probability and min_p = 0.05, only tokens with at least 4.5% probability survive, which is a very tight cutoff. When the top token has 10% probability, the same min_p admits anything above 0.5%, keeping more options open.
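The rule is essentially a one-liner on top of the softmax probabilities. A minimal sketch with illustrative names:

```python
import numpy as np

def min_p_filter(logits: np.ndarray, min_p: float = 0.05) -> np.ndarray:
    """Keep tokens whose probability is at least min_p times the top token's."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return np.where(probs >= min_p * probs.max(), logits, -np.inf)
```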
The Min-p paper claims that the method enables aggressive temperature settings (T > 1.5) without producing incoherent text, because the relative threshold strips out implausible tokens even when temperature has flattened the distribution. The paper was selected as an Oral presentation at ICLR 2025 and the technique has been adopted in Hugging Face Transformers, vLLM, llama.cpp, and many open-source inference stacks. A 2025 critical analysis by Schaeffer et al. challenged some of the claimed benefits of min-p over top-p, arguing that reanalyzed human evaluations did not show the original paper's reported quality and diversity gains [18]. The debate over min-p's empirical value is ongoing, but it remains a widely available sampler in current frameworks.
Mirostat (Basu et al., 2020) takes a different approach: rather than fixing a static cutoff, it dynamically tunes the truncation level at every step to keep the cross-entropy of the generated text close to a target perplexity setpoint [4]. The idea is that very low perplexity correlates with the "boredom trap" of repetitive generation, while very high perplexity correlates with the "confusion trap" of incoherent text. Mirostat picks an operating point between these two failure modes and adjusts top-k or top-p adaptively to stay there.
Mirostat exposes two parameters: tau (target surprise per token, in nats) and eta (learning rate of the feedback loop). Common defaults are tau = 5.0 and eta = 0.1. Mirostat is implemented in llama.cpp, KoboldAI, and several front-ends for local LLM use, but it has not seen widespread adoption in commercial APIs, partly because temperature plus top-p is already adequate for most use cases.
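A simplified sketch of the Mirostat v2 feedback loop (surprisal measured in bits, as in the paper; initialization and bookkeeping details vary across implementations, so treat this as an assumption-laden outline):

```python
import numpy as np

def mirostat_v2_step(logits, mu, tau=5.0, eta=0.1, rng=None):
    """One decoding step: truncate tokens more surprising than mu, sample,
    then nudge mu so running surprise tracks the target tau.
    Returns (token, updated_mu); mu is conventionally initialized to 2 * tau."""
    rng = rng or np.random.default_rng()
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    surprisal = -np.log2(probs)                 # information content in bits
    keep = surprisal <= mu
    keep[np.argmax(probs)] = True               # always keep the top token
    truncated = np.where(keep, probs, 0.0)
    truncated /= truncated.sum()
    token = int(rng.choice(len(probs), p=truncated))
    mu -= eta * (surprisal[token] - tau)        # feedback toward target surprise
    return token, mu
```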
Locally typical sampling (Meister, Pimentel, Wiher, and Cotterell, 2022) draws on information theory rather than direct probability cutoffs [6]. The method keeps tokens whose negative log-probability (their "surprisal" or "information content") is close to the model's conditional entropy at that step. The intuition is that natural human language tends to convey information at a roughly steady rate, neither too predictable nor too surprising, and that decoding should target this typical regime rather than always picking the most probable continuation.
The paper shows that locally typical sampling matches or exceeds top-p in human evaluation on summarization and story generation, while consistently reducing degenerate repetition. The technique is implemented in Hugging Face Transformers as typical_p, with values around 0.95 commonly recommended.
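A sketch of the typicality filter along the lines of what typical_p activates (illustrative helper name):

```python
import numpy as np

def typical_filter(logits: np.ndarray, typical_p: float = 0.95) -> np.ndarray:
    """Keep tokens whose surprisal is closest to the distribution's entropy,
    accumulating probability mass until typical_p is reached."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    surprisal = -np.log(probs)
    entropy = float(np.sum(probs * surprisal))       # conditional entropy H
    order = np.argsort(np.abs(surprisal - entropy))  # most "typical" tokens first
    cutoff = np.searchsorted(np.cumsum(probs[order]), typical_p) + 1
    filtered = np.full_like(logits, -np.inf)
    filtered[order[:cutoff]] = logits[order[:cutoff]]
    return filtered
```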
A frequent question is when to use temperature sampling versus beam search. The two methods belong to different decoding philosophies and serve different goals.
| Aspect | Temperature sampling | Beam search |
|---|---|---|
| Determinism | Stochastic; same prompt yields different outputs | Deterministic; no sampling involved |
| Optimization target | Sample from the model's distribution | Approximate the most-probable sequence |
| Diversity | High; scales with T | Low; tends to produce minor variants of the same hypothesis |
| Memory cost | Low; one beam per request | High; K beams per request, scales with K |
| Latency cost | Low; one forward pass per token | Roughly K times higher per request |
| Quality on translation, ASR | Worse than beam at metrics like BLEU and WER | Standard production decoder |
| Quality on open-ended chat | Standard production decoder | Often produces dull, repetitive output |
| Failure mode | Incoherence at high T | Repetition, length bias, the "empty translation" problem |
| Compatibility with KV cache batching | Excellent; standard LLM serving stacks optimize for it | Awkward; K-fold KV expansion complicates continuous batching |
The rough rule of thumb is that beam search remains the default for short, high-fidelity outputs where there is essentially one right answer (machine translation, automatic speech recognition, structured code generation with grammar constraints). Temperature sampling, usually combined with top-p or min-p, is the default for long, open-ended outputs where one right answer does not exist (chat, creative writing, brainstorming). The Holtzman paper laid out the case that beam search is inappropriate for open-ended generation because the model's most-likely sequence tends to be repetitive and bland [1], and the LLM ecosystem has moved decisively in this direction since around 2020.
Different LLM providers implement temperature with varying ranges and defaults, and a few constrain the parameter for specific model families.
| Model / provider | Temperature range | Default | Notes |
|---|---|---|---|
| OpenAI GPT-4, GPT-4o | 0.0 to 2.0 | 1.0 | ChatGPT interface may use ~0.7 internally |
| OpenAI o1, o3 reasoning models | Fixed at 1.0 | 1.0 | Temperature, top_p, n cannot be changed; reasoning_effort controls thinking depth instead [15] |
| Anthropic Claude | 0.0 to 1.0 | 1.0 | Range capped at 1.0; extended thinking modes typically require T = 1 [12] |
| Google Gemini | 0.0 to 2.0 | 1.0 | Google explicitly recommends keeping T = 1.0 on Gemini 3 reasoning models [14] |
| Meta LLaMA | 0.0 to 2.0+ | Typically 0.6 to 0.8 | Open-weight; users can set any value |
| Mistral | 0.0 to 1.5 | 0.7 | Recommended values vary by task |
| Cohere Command R | 0.0 to 1.0 | 0.3 | Lower default reflects preference for precision |
| DeepSeek-V3, DeepSeek-R1 | 0.0 to 2.0 | Varies | Documentation suggests T = 0.6 for general chat, T = 0 for math/code |
Some providers also offer a "deterministic" or "greedy" mode that is functionally equivalent to T = 0 but may be implemented differently at the infrastructure level. Reasoning models like OpenAI's o1 and o3, Google's Gemini Thinking variants, and Anthropic's Claude with extended thinking generally restrict temperature changes because their internal chain-of-thought sampling is tuned at training time and altering it can damage benchmark performance [15]. Instead of a temperature dial these models expose a reasoning depth or budget control.
Fine-tuning can change how a model responds to temperature. A model fine-tuned on a narrow, specific task may produce high-quality outputs at T = 0 because its learned distribution is already sharply focused on the correct patterns. A general-purpose model may benefit from moderate temperature to explore its broader distribution.
Models that have undergone reinforcement learning from human feedback (RLHF) or related preference-tuning methods like DPO behave differently with temperature than their base versions. RLHF tends to sharpen the model's distribution toward preferred outputs, which means the effective behavior at a given temperature is less random than for the base model at the same temperature. Practitioners often find that fine-tuned chat models give serviceable output across a wider T range than the corresponding base model.
Choosing the right temperature depends on the task, the desired output characteristics, and the specific model being used. The values below are starting points, not commitments; empirical sweeps over a small validation set are the most reliable way to settle on a final number.
| Task | Recommended temperature | Reasoning |
|---|---|---|
| Factual question answering | 0.0 to 0.3 | Accuracy is paramount; minimize randomness |
| Code generation | 0.0 to 0.2 | Code must be syntactically and semantically correct |
| SQL generation, data extraction | 0.0 | Deterministic output needed to match expected formats |
| Summarization | 0.3 to 0.5 | Some variety in phrasing is acceptable, but fidelity matters |
| General conversation | 0.5 to 0.8 | Balance between coherence and natural-sounding variety |
| Translation | 0.0 to 0.3 (or use beam) | Accuracy matters; beam is often the better choice |
| Creative writing | 0.7 to 1.2 | Encourage diverse and surprising word choices |
| Brainstorming | 0.8 to 1.5 | Maximize diversity of ideas |
| Poetry and fiction | 1.0 to 1.5 | High creativity; unusual word combinations are desirable |
| Roleplay / character dialogue | 0.8 to 1.2 with min-p | Need diversity but coherent persona |
| Math / reasoning with self-consistency | 0.5 to 0.7, sample N times | Some randomness so samples disagree, then pick majority [7] |
| Math / reasoning, single sample | 0.0 | Treat as a deterministic search |
| Tool use / function calling | 0.0 to 0.2 | Argument JSON must be exact |
Several pitfalls recur when tuning temperature in practice:
| Pitfall | Why it happens | How to fix |
|---|---|---|
| Repetitive outputs | Temperature too low; greedy decoding gets stuck | Increase temperature slightly (0.1 to 0.3) or add top-p |
| Incoherent or nonsensical text | Temperature too high without truncation | Lower temperature; add top-p or min-p filtering |
| Inconsistent behavior across runs | High temperature causes variance | Lower temperature for more consistent outputs; or fix a seed |
| Good first sentence, bad rest | Temperature effect compounds over long sequences | Use lower temperature for longer outputs |
| Model ignores instructions | Very high temperature causes random token selection | Reduce temperature; critical instructions should not rely on high-temperature generation |
| T = 0 still varies | Floating-point noise, parallel reductions, batching effects | Pin a seed; use deterministic kernels; or accept small variance |
| top_k = 1 with high T does nothing | top-k = 1 always picks the single top token regardless of T | Use top-k > 1 if you want any temperature effect |
| Slow code with beam plus high T | Combining beam search with sampling is rarely useful | Use one strategy; sampling for chat, beam for translation |
One of the most influential modern uses of temperature sampling is the self-consistency technique introduced by Wang et al. (2022) in "Self-Consistency Improves Chain of Thought Reasoning in Language Models" [7]. The idea is simple: instead of sampling a single chain-of-thought reasoning path with greedy decoding, sample multiple chains at temperature greater than zero, then pick the final answer that appears most often across the samples.
Formally, given a prompt that elicits step-by-step reasoning, self-consistency samples N completions (typically N = 10 to 40) at T around 0.5 to 0.7 and aggregates their final answers by majority vote. The intuition is that a complex reasoning problem usually admits multiple valid reasoning paths that all converge to the same correct answer, while incorrect reasoning paths tend to produce inconsistent answers. Wang et al. report large gains on arithmetic and commonsense reasoning benchmarks, including +17.9% on GSM8K, +11.0% on SVAMP, +12.2% on AQuA, +6.4% on StrategyQA, and +3.9% on ARC-Challenge over greedy chain-of-thought prompting.
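At the code level the procedure is just a sampling loop plus a vote. In the sketch below, sample_answer is an assumed helper that samples one chain of thought at the given temperature and returns the extracted final answer:

```python
from collections import Counter

def self_consistency(prompt: str, sample_answer, n: int = 20,
                     temperature: float = 0.6) -> str:
    """Sample n reasoning chains at moderate temperature and return the
    majority final answer, in the style of Wang et al. (2022)."""
    answers = [sample_answer(prompt, temperature=temperature) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]
```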
Self-consistency was a pivotal change in how decoding was thought about for reasoning. Before this work, greedy decoding was the default for math and code because randomness was assumed to hurt accuracy. After, sampling at moderate T became standard for reasoning evaluations, and the same ideas underpin modern reasoning models that internally sample many parallel rollouts and pick the best one. The reasoning model trend culminating in OpenAI's o1 and o3, Google's Gemini Thinking, DeepSeek-R1, and Anthropic's Claude with extended thinking is in large part a story about exploiting sample-and-verify pipelines that depend on temperature sampling rather than greedy or beam decoding.
Variants of self-consistency include weighted voting (where the answer probabilities of each sample contribute weighted by their likelihood), universal self-consistency (which uses the LLM itself to judge consistency for free-form answers), and verifier-based reranking (where a separate model scores each candidate and the highest-scoring one is returned).
The practical tension at the heart of all sampling work is the tradeoff between quality and diversity.
This tradeoff is sometimes formalized as a quality-diversity Pareto frontier. Different decoding strategies sit at different points on the frontier. Greedy decoding sits at the high-quality, low-diversity extreme; unfiltered sampling at very high temperature sits at the high-diversity, low-quality extreme. Top-p, min-p, and locally typical sampling all attempt to push the frontier outward by removing the worst-quality tail tokens while preserving diversity in the middle of the distribution.
The right operating point depends on whether you are evaluating a single output or a population of outputs. For a chatbot returning one reply, you usually want high quality. For a creative tool generating ten candidate captions, you want diversity even at the cost of some bad candidates because the user (or a downstream reranker) will throw away the failures.
Although most commonly associated with LLMs, the concept of temperature scaling appears in several other areas of machine learning.
In knowledge distillation, temperature is used to "soften" the probability distribution of a teacher model's outputs. Hinton, Vinyals, and Dean (2015) introduced this technique, in which a large teacher model's soft predictions (generated with high temperature) are used to train a smaller student model [8]. The high temperature reveals more information about the teacher's learned relationships between classes than hard (one-hot) labels would. For example, a teacher model's output at T = 1 might assign 90% probability to "cat" and 5% each to "dog" and "tiger." At T = 5, the distribution softens to something like 50% "cat," 25% "dog," 25% "tiger," revealing that the model considers "dog" and "tiger" more similar to "cat" than to unrelated classes.
A technical detail from the Hinton paper is that, when training the student, gradients must be scaled by T squared to maintain proper gradient magnitudes when temperature is greater than 1. Without this scaling, the soft-target loss would dominate or be dominated by the hard-target loss in unintended ways.
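A sketch of the resulting loss in PyTorch, with the T-squared factor applied to the soft term (alpha, the blend weight between soft and hard losses, is an illustrative choice):

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      T: float = 4.0, alpha: float = 0.5):
    """Hinton-style loss: temperature-softened KL term scaled by T^2,
    blended with the ordinary hard-label cross-entropy."""
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    log_student = F.log_softmax(student_logits / T, dim=-1)
    soft = F.kl_div(log_student, soft_targets, reduction="batchmean") * T ** 2
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```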
In diffusion models used for image generation (such as Stable Diffusion and DALL-E), classifier-free guidance scale plays a similar role to temperature. Higher guidance scale values produce outputs that more closely match the prompt (analogous to lower temperature), while lower values allow more variation. Some text-to-image stacks also expose a separate sampling temperature that perturbs the noise schedule itself, but the guidance scale is the more commonly tuned dial.
In reinforcement learning, particularly in the Boltzmann exploration strategy, temperature controls the tradeoff between exploitation (choosing the best-known action) and exploration (trying other actions). Lower temperature favors exploitation; higher temperature favors exploration. This is mathematically identical to temperature-scaled softmax applied to action values (Q-values). Algorithms such as soft actor-critic (SAC) make this explicit, learning a temperature parameter alongside the policy to balance reward maximization against entropy.
In contrastive representation learning (SimCLR, CLIP, and the InfoNCE family of losses), a temperature parameter scales the similarity scores between embeddings before softmax. Lower temperature emphasizes hard negatives more strongly; higher temperature treats positives and negatives more uniformly. Tuning this temperature is a key hyperparameter in contrastive pretraining recipes.
A related but distinct use of "temperature" in machine learning is temperature scaling for model calibration, introduced by Guo, Pleiss, Sun, and Weinberger (2017) in "On Calibration of Modern Neural Networks" [9]. In this context, a single scalar temperature parameter T is learned on a held-out validation set and applied to a trained model's logits to improve the calibration of its probability estimates. This has nothing to do with sampling randomness; instead, it adjusts the model's confidence levels so that a predicted probability of 0.9 actually corresponds to 90% accuracy.
Guo et al. found that modern deep classifiers are systematically overconfident: their predicted probabilities are too high relative to their empirical accuracy. Dividing logits by a learned T > 1 (typically between 1.5 and 3) softens the distribution and brings predicted probabilities back into agreement with observed correctness rates. The technique is a single-parameter variant of Platt scaling, costs essentially nothing to apply at inference, and remains a strong baseline for post-hoc calibration in classification settings.
The mathematical form is identical to temperature sampling, but the goal is the opposite: sampling temperature changes the operational distribution to control sampling randomness, while calibration temperature changes the reported probabilities to make them honest. Both reuse the same softmax-with-temperature equation.
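A minimal fitting routine under these definitions, using SciPy to minimize validation NLL over the single scalar (function and variable names are illustrative):

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import logsumexp

def fit_calibration_temperature(val_logits: np.ndarray,
                                val_labels: np.ndarray) -> float:
    """Learn one scalar T on held-out (N, C) logits and (N,) integer labels
    by minimizing negative log-likelihood, as in Guo et al. (2017)."""
    def nll(T: float) -> float:
        log_probs = val_logits / T - logsumexp(val_logits / T,
                                               axis=1, keepdims=True)
        return -log_probs[np.arange(len(val_labels)), val_labels].mean()
    # T > 1 softens an overconfident model; fitted values often land in 1.5 to 3.
    return minimize_scalar(nll, bounds=(0.05, 20.0), method="bounded").x
```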
The entropy of the probability distribution increases monotonically with temperature. At T = 0, entropy is 0 (all probability on one token, a degenerate distribution). At T = 1, entropy equals that of the model's learned distribution. As T increases toward infinity, entropy approaches log(V), the maximum possible entropy for a vocabulary of size V. This monotonic relationship means temperature provides a smooth control knob for the amount of randomness in the output, and it is the basis for entropy-targeting decoders like Mirostat.
A useful identity follows from log P(token_i) = z_i / T - log Z(T): the entropy of the temperature-scaled distribution is H = log Z(T) - (1/T) * E[z], where Z(T) is the partition function (the sum over j of exp(z_j / T)) and the expectation is taken under the scaled distribution. How quickly entropy rises with T has no simple closed form; it depends on the spread of the logits, which is one reason entropy-targeting decoders adjust their truncation by feedback rather than solving for T analytically.
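The monotonic relationship is easy to check numerically (a sketch over a mock vocabulary):

```python
import numpy as np

def entropy_at(logits: np.ndarray, T: float) -> float:
    p = np.exp(logits / T - (logits / T).max())
    p /= p.sum()
    return float(-np.sum(p * np.log(p + 1e-12)))

logits = np.random.default_rng(0).normal(size=1000)  # mock 1000-token vocabulary
for T in (0.1, 0.5, 1.0, 2.0, 10.0, 100.0):
    # Entropy climbs monotonically toward log(1000), about 6.91 nats.
    print(f"T={T:>5}: H={entropy_at(logits, T):.3f}")
```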
The Kullback-Leibler (KL) divergence between the T = 1 distribution and a temperature-scaled distribution increases as T moves further from 1 in either direction. Both very low and very high temperatures produce distributions that are significantly different from the model's learned distribution. This is useful to remember when reasoning about why high T can produce gibberish: the actually-sampled distribution at T = 5 is far from the model's training-time output distribution, and the model has no signal that its predictions remain valid in this regime.
The temperature-scaled softmax can be viewed through the lens of energy-based models. Each logit z_i represents the negative energy of token i, and the softmax converts energies into probabilities via the Boltzmann distribution. Temperature controls how sharply the model distinguishes between low-energy (preferred) and high-energy (dispreferred) tokens. This perspective is useful for connecting LLM decoding to the broader literature on Markov chain Monte Carlo, where temperature schedules play a central role in algorithms like simulated annealing and parallel tempering.
During training of a standard classifier, temperature does not affect gradient computation, since training uses the standard softmax or cross-entropy loss without temperature. During inference with temperature-scaled sampling, the choice of temperature can be seen as defining a different inference-time distribution over the model's output space. In knowledge distillation, where temperature is used during training, gradients must be scaled by T squared to maintain proper gradient magnitudes, as Hinton et al. (2015) noted [8].
Temperature is implemented in essentially every modern LLM serving stack and training framework, with minor variations in parameter names and combinations.
Hugging Face Transformers exposes temperature through the generate() method:
```python
outputs = model.generate(
    input_ids,
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
    top_k=50,
    max_new_tokens=512,
)
```
Setting do_sample=False disables sampling and uses greedy decoding regardless of the temperature value; Transformers expects a strictly positive temperature when do_sample=True, so greedy decoding is requested through do_sample=False rather than by passing temperature=0. The typical_p parameter activates locally typical sampling, and repetition_penalty and no_repeat_ngram_size modify logits before sampling [16]. (Frequency and presence penalties, by contrast, are parameters of OpenAI-style APIs rather than of generate().)
vLLM uses a SamplingParams object with the same conceptual parameters:
```python
from vllm import LLM, SamplingParams

params = SamplingParams(
    temperature=0.7,
    top_p=0.95,
    top_k=50,
    max_tokens=512,
)
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
outputs = llm.generate(prompts, params)
```
Text Generation Inference (TGI), TensorRT-LLM, and SGLang follow similar conventions. llama.cpp exposes a richer set of samplers including Mirostat, dynamic temperature, and the open-source XTC and DRY samplers, configurable on the command line or in its server API.
OpenAI, Anthropic, and Google's APIs accept temperature and top_p as JSON parameters to their chat-completion endpoints. The provider applies the parameters internally; the client only sees the final sampled tokens.
Temperature is set per request at inference time, not baked into the model. This makes it straightforward to use different temperatures for different calls in an agentic system. A common pattern in modern agent stacks is:
- T near 0 for tool calls, routing decisions, and structured-output steps, where formats must be exact;
- moderate T (roughly 0.5 to 0.8) for user-facing prose such as chat replies and explanations;
- higher T (0.8 and up) for brainstorming or candidate-generation steps whose outputs a later step filters or reranks.
Frameworks like LangChain, LlamaIndex, and DSPy let practitioners attach different sampling configurations to different chain steps. Some agent frameworks also implement "temperature schedules" that lower T over the course of a long agent run to converge on a final answer, mirroring the simulated annealing tradition.
Speculative decoding accelerates inference by drafting tokens with a small model and verifying them with the large model. The Leviathan et al. (2023) paper proves that, with the correct rejection-sampling rule, speculative decoding produces samples from exactly the same distribution as direct sampling from the large model [17]. Temperature is preserved end-to-end: the draft and target models both apply the requested temperature, and the acceptance probability accounts for any disagreement between their distributions.
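The per-token rejection step from the paper can be sketched as follows; here p and q are the target and draft next-token distributions after any temperature scaling has been applied to both:

```python
import numpy as np

def accept_or_resample(x: int, p: np.ndarray, q: np.ndarray,
                       rng: np.random.Generator) -> int:
    """Leviathan et al. acceptance rule for one drafted token x. The returned
    token is distributed exactly according to the target distribution p."""
    if rng.uniform() < min(1.0, p[x] / q[x]):
        return x                               # accept the draft token
    residual = np.maximum(p - q, 0.0)          # resample from the leftover mass
    residual /= residual.sum()
    return int(rng.choice(len(p), p=residual))
```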
A practical observation, documented in temperature-centric studies of speculative decoding, is that higher temperatures tend to increase the acceptance rate of draft tokens because the target distribution is flatter and thus closer to the draft distribution [10]. This means speculative decoding gives the largest speedups for sampling workloads that already use moderate-to-high T, and somewhat smaller speedups for greedy or near-greedy decoding.
The basic temperature parameter has been stable for a decade, but the surrounding sampler ecosystem has continued to evolve. Several directions are active areas of research and engineering as of 2026.
Dynamic temperature schedules adjust T per token based on the current entropy or top-token probability. The intuition is that the model should commit hard when it is confident (low T) and explore widely when it is uncertain (high T). Open-source frameworks like KoboldCpp and llama.cpp offer dynamic temperature presets that scale T between configured minimum and maximum values according to a smoothing exponent.
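A sketch of the entropy-to-temperature mapping behind these presets (parameter names loosely follow llama.cpp's dynamic-temperature options and should be treated as assumptions):

```python
import numpy as np

def dynamic_temperature(logits: np.ndarray, min_T: float = 0.3,
                        max_T: float = 1.8, exponent: float = 1.0) -> float:
    """Map the normalized entropy of the current next-token distribution
    into [min_T, max_T]: confident steps get low T, uncertain steps high T."""
    p = np.exp(logits - logits.max())
    p /= p.sum()
    entropy = -np.sum(p * np.log(p + 1e-12))
    norm = (entropy / np.log(len(logits))) ** exponent   # in [0, 1]
    return float(min_T + (max_T - min_T) * norm)
```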
Token-level adaptive sampling extends this idea to richer policies, sometimes learned. Recent papers have proposed using a small auxiliary network to predict the right sampler settings for each step, or training the model itself to output distributions that are well-calibrated for downstream sampling.
New samplers like XTC (Exclude Top Choices) and DRY (Don't Repeat Yourself) were proposed by the open-source community in 2024 to address specific failure modes [16]. XTC probabilistically removes the top tokens to force the model into less obvious continuations, breaking writing cliches. DRY penalizes tokens that would extend the input into a sequence that has already appeared in context, dramatically reducing exact-string repetition without the heavy hand of a global repetition penalty.
Temperature-aware speculative decoding tries to design draft models or distillation losses that maximize acceptance at the temperature settings used in production [10]. This has practical impact because most chat workloads run at T near 1, where the gap between draft and target is largest.
Uncertainty-driven decoding uses temperature in concert with uncertainty estimates to detect hallucinations. The premise is that a token sampled at low confidence is a likely site of factual error, and several monitoring systems use this signal to flag risky outputs.
Finally, the rise of reasoning models has shifted attention from per-token sampling to sequence-level sampling. When a model emits a long internal chain of thought before answering, the relevant decoding decisions are about how many parallel chains to sample, what temperature to sample them at, and how to aggregate their answers. Self-consistency, best-of-N reranking, and tree-of-thought search all extend the basic temperature-sampling idea to operate at the level of full reasoning trajectories rather than individual tokens.
As of 2026, temperature sampling combined with top-p (or increasingly min-p) is the de facto default decoding configuration for chat, creative writing, and most open-ended language tasks across the major commercial LLM APIs and open-source serving stacks. Greedy decoding remains the default for code generation, structured-output extraction, and tool-call argument formatting, and beam search retains a strong foothold in machine translation, automatic speech recognition, and grammar-constrained generation.
The reasoning-model era has changed the role of temperature in important ways. Models like OpenAI's o1 and o3 fix temperature internally and expose a reasoning-effort dial instead of a per-call temperature [15]. Anthropic's Claude with extended thinking similarly constrains temperature when the thinking mode is enabled [12]. Google's guidance for Gemini 3 strongly recommends T = 1 for reasoning workloads [14]. These choices reflect the fact that the model providers have already tuned the right temperature into the model's training and post-training, and exposing the dial to end users mostly creates ways to break the model rather than improve it.
For non-reasoning chat workloads, the field has converged on a small set of practical recipes: T around 0.7 with top-p around 0.95 for general chat; T = 0 for code and structured output; T around 1.0 with min-p around 0.05 for creative writing on local models; sample-and-vote with T around 0.5 to 0.7 for math and code reasoning. These recipes are not universal, but they represent the current consensus across the open-source and commercial communities. Temperature sampling itself is unlikely to disappear; the abstractions built on top of it (samplers, schedules, parallel rollouts, voting) are where most of the new work is happening.