In the context of large language models (LLMs) and other generative models, temperature is a hyperparameter that controls the randomness of predictions during inference. It scales the model's output logits before the softmax function is applied, directly influencing the probability distribution over the next token. A lower temperature produces more deterministic and focused outputs, while a higher temperature produces more diverse and creative outputs.
Temperature is one of the most commonly adjusted parameters in LLM-based applications and is fundamental to controlling the behavior of models such as GPT-4, Claude, Gemini, LLaMA, and other transformer-based architectures.
The concept of temperature in sampling has its roots in statistical mechanics and thermodynamics. In physics, temperature governs the probability of a system occupying different energy states according to the Boltzmann distribution:
P(state_i) = exp(-E_i / (k_B * T)) / Z
Where E_i is the energy of state i, k_B is the Boltzmann constant, T is the absolute temperature, and Z is the partition function (a normalizing constant). At low temperatures, the system overwhelmingly occupies the lowest-energy state. At high temperatures, the system explores many states more uniformly.
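This behavior can be sketched numerically in natural units (setting k_B = 1; the energy levels below are arbitrary illustrative values):

```python
import math

def boltzmann(energies, T, k_B=1.0):
    """Occupation probabilities for a set of energy levels at temperature T."""
    weights = [math.exp(-E / (k_B * T)) for E in energies]
    Z = sum(weights)  # partition function: normalizes the weights
    return [w / Z for w in weights]

levels = [0.0, 1.0, 2.0]          # arbitrary energy levels, ground state first
print(boltzmann(levels, T=0.1))   # nearly all mass on the ground state
print(boltzmann(levels, T=10.0))  # close to uniform across all states
```

At low T the ground state dominates; at high T the three states become nearly equally likely, mirroring the behavior described above.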
This physical analogy directly carries over to machine learning. In the neural network context, logits play the role of negative energies: tokens with higher logits (lower "energy") are more probable. The temperature parameter T controls how aggressively the model favors high-logit tokens over low-logit ones, just as physical temperature controls how strongly a system prefers low-energy configurations.
The earliest uses of temperature in neural network sampling trace back to Boltzmann machines, introduced by Geoffrey Hinton and Terrence Sejnowski in the 1980s. In these stochastic neural networks, temperature controlled the randomness of neuron activation, and simulated annealing (gradually lowering temperature) was used to find optimal configurations.
During text generation, a language model predicts the next token by computing a score (called a logit) for every token in its vocabulary. These raw logits are then converted into a probability distribution using the softmax function. Temperature modifies this process by dividing each logit by the temperature value T before applying softmax.
Without temperature scaling, the standard softmax function converts logits z = (z_1, z_2, ..., z_V) for a vocabulary of size V into probabilities:
P(token_i) = exp(z_i) / sum from j=1 to V of exp(z_j)
Each logit is exponentiated, and the results are normalized so that all probabilities sum to 1. The exponential function amplifies differences between logits: a logit of 5 produces a value roughly 150 times larger than a logit of 0.
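The amplification can be verified directly; the ratio exp(5)/exp(0) is about 148 (the `softmax` helper below is an illustrative sketch, with the standard max-subtraction trick for numerical stability):

```python
import math

def softmax(logits):
    """Convert logits to probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability; result is unchanged
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

print(math.exp(5) / math.exp(0))  # ≈ 148.4: the amplification between logits 5 and 0
probs = softmax([5.0, 0.0])
print(probs)                      # ≈ [0.9933, 0.0067]
```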
When temperature T is introduced, the logits are divided by T before exponentiation:
P(token_i) = exp(z_i / T) / sum from j=1 to V of exp(z_j / T)
This single modification has a significant effect on the output distribution:
| Temperature Range | Effect on Logits | Effect on Distribution | Result |
|---|---|---|---|
| T approaching 0 | Divides logits by a very small number, making them very large in magnitude | Distribution becomes extremely peaked (one-hot) | The highest-logit token gets probability near 1 |
| T = 1 | Logits remain unchanged | Original distribution as learned during training | Standard behavior |
| T > 1 | Divides logits by a number greater than 1, compressing them toward zero | Distribution becomes flatter (more uniform) | Lower-probability tokens gain probability mass |
| T approaching infinity | All logits approach zero | Distribution approaches uniform (1/V for each token) | All tokens become equally likely |
To understand why temperature works the way it does, consider what dividing logits by T does to the differences between them. Suppose two tokens have logits 5 and 3, giving a difference of 2. At T = 0.5, the effective logits become 10 and 6 (difference of 4, amplified). At T = 2, the effective logits become 2.5 and 1.5 (difference of 1, compressed). Because the softmax function converts differences in logits into ratios of probabilities, amplifying differences makes the distribution more peaked, while compressing differences makes it flatter.
Consider a simple vocabulary of four tokens with the following logits:
| Token | Logit (z) |
|---|---|
| "the" | 5.0 |
| "a" | 3.0 |
| "one" | 1.0 |
| "some" | 0.5 |
The probabilities at different temperatures:
| Token | T = 0.25 | T = 0.5 | T = 1.0 | T = 2.0 | T = 5.0 |
|---|---|---|---|---|---|
| "the" | 0.9997 | 0.9816 | 0.8586 | 0.6216 | 0.3958 |
| "a" | 0.0003 | 0.0180 | 0.1162 | 0.2287 | 0.2653 |
| "one" | ~0 | 0.0003 | 0.0157 | 0.0841 | 0.1779 |
| "some" | ~0 | 0.0001 | 0.0095 | 0.0655 | 0.1609 |
At T = 0.25, "the" has a near-certain probability of 99.97%. At T = 5.0, the distribution is much more even, and the model might plausibly select any of the four tokens. Note how the probability of "some" (the least likely token at T = 1) rises from under 1% to about 16% as temperature increases from 1.0 to 5.0.
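These distributions can be recomputed in a few lines of Python (a self-contained sketch; `softmax_with_temperature` is an illustrative helper, not a library function):

```python
import math

def softmax_with_temperature(logits, T):
    """Softmax over logits divided by temperature T (requires T > 0)."""
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

tokens = ["the", "a", "one", "some"]
logits = [5.0, 3.0, 1.0, 0.5]
for T in (0.25, 0.5, 1.0, 2.0, 5.0):
    probs = softmax_with_temperature(logits, T)
    print(T, [round(p, 4) for p in probs])
```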
Setting temperature to 0 is a special case. Mathematically, dividing by zero is undefined, but the limit as T approaches 0 causes the softmax to assign all probability mass to the token with the highest logit. In practice, most LLM implementations handle T = 0 by switching to greedy decoding (also called argmax decoding), which always selects the most probable token at each step.
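The T = 0 special case is typically handled with an explicit branch rather than an actual division by zero; a sketch (function name is illustrative):

```python
import math
import random

def sample_token(logits, temperature):
    """Sample a token index from temperature-scaled logits.
    T = 0 falls back to greedy (argmax) decoding."""
    if temperature == 0:
        # Greedy decoding: always return the highest-logit token
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return random.choices(range(len(logits)), weights=probs, k=1)[0]

logits = [5.0, 3.0, 1.0, 0.5]
print(sample_token(logits, 0))    # always 0, the highest-logit token
print(sample_token(logits, 1.0))  # usually 0, occasionally another index
```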
Characteristics of T = 0 / greedy decoding:
| Property | Description |
|---|---|
| Output determinism | Nearly deterministic (same input produces same output) |
| Diversity | Minimal; always picks the single most likely token |
| Creativity | Very low |
| Risk of repetition | High; can get stuck in repetitive loops |
| Use cases | Factual questions, code generation, math problems, structured outputs |
A well-known problem with greedy decoding is repetition degeneracy: the model can enter loops where the same phrase is generated repeatedly. This happens because a phrase, once generated, becomes part of the context and raises the probability of generating it again, creating a self-reinforcing loop. This is one reason why some amount of sampling randomness (T > 0) is often preferred even for tasks where accuracy is prioritized.
Note that even with T = 0, outputs may not be perfectly deterministic due to floating-point arithmetic differences across hardware, batching effects, and non-deterministic GPU operations.
At T = 1, the logits pass through softmax without modification. The resulting probability distribution reflects the model's learned distribution from training. This is the default setting for many LLM APIs.
Characteristics of T = 1:
| Property | Description |
|---|---|
| Output determinism | Moderate randomness |
| Diversity | Moderate |
| Creativity | Balanced between coherence and variety |
| Use cases | General-purpose conversation, summarization, most standard tasks |
Many commercial LLM providers use T = 1 as the API default. OpenAI's GPT models default to T = 1 in API calls, though the ChatGPT interface reportedly uses a value around 0.7 internally. Anthropic's Claude API defaults to T = 1 with a range of 0.0 to 1.0.
It is worth noting that T = 1 does not mean the model generates "randomly." Because the model's learned distribution strongly favors coherent, grammatical text, most tokens will still receive very low probabilities even at T = 1. The model will still mostly select high-probability tokens, but with enough variation to avoid repetitive patterns.
Setting temperature above 1 flattens the probability distribution, giving more weight to less likely tokens. This increases the diversity and unpredictability of generated text.
Characteristics of high temperature:
| Property | Description |
|---|---|
| Output determinism | Low; outputs vary significantly between runs |
| Diversity | High; the model explores a wider range of vocabulary |
| Creativity | High; more surprising and novel word choices |
| Coherence risk | Can produce incoherent, grammatically incorrect, or nonsensical text |
| Use cases | Creative writing, brainstorming, poetry, generating diverse options |
Most LLM APIs cap the temperature at 2.0 (OpenAI) or 1.0 (Anthropic). Setting the temperature too high can make outputs effectively random and unusable. In practice, values above 1.5 frequently produce text with grammatical errors, nonsensical phrases, or abrupt topic changes.
For most production applications, temperatures between 0.0 and 1.0 form the most commonly used range. Values in this range keep the model's outputs coherent while allowing varying degrees of diversity:
| Temperature | Behavior | Example Application |
|---|---|---|
| 0.0 | Deterministic; always picks the top token | JSON extraction, classification |
| 0.1 to 0.3 | Nearly deterministic with slight variation | Code generation, factual Q&A |
| 0.3 to 0.5 | Minor variation; mostly follows the most likely path | Summarization, translation |
| 0.5 to 0.7 | Moderate variation; natural-sounding diversity | Chatbots, email drafting |
| 0.7 to 0.9 | Noticeable variation; occasionally surprising word choices | Creative writing, story generation |
| 0.9 to 1.0 | Full model distribution; maximum variety without over-randomness | Brainstorming, poetry, exploratory prompts |
The sub-1.0 range can be thought of as "sharpening" the model's distribution: the model still follows its learned patterns but with less willingness to deviate from the most probable path.
Temperature is often used in combination with other sampling strategies that filter which tokens are eligible for selection. The three main sampling parameters are temperature, top-k sampling, and top-p sampling (nucleus sampling).
Top-k sampling restricts the set of candidate tokens to the k tokens with the highest probabilities. After filtering, the probabilities are renormalized and a token is sampled from this reduced set.
Top-k was popularized by Fan et al. (2018) in their paper on hierarchical story generation. A key limitation of top-k is that it uses a fixed number of candidates regardless of how the probability is distributed. When the model is confident (probability concentrated on a few tokens), k = 50 may include many irrelevant low-probability tokens. When the model is uncertain (probability spread widely), k = 50 may exclude viable candidates.
Top-p sampling, introduced by Holtzman et al. (2019) in "The Curious Case of Neural Text Degeneration," dynamically selects the smallest set of tokens whose cumulative probability exceeds a threshold p. Unlike top-k, which always considers a fixed number of tokens, top-p adapts the number of candidates based on the shape of the distribution.
Top-p is generally preferred over top-k because of its adaptive nature. When the model is confident, only a few tokens will be needed to reach the cumulative threshold. When the model is uncertain, more tokens are automatically included.
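Both filters can be sketched in a few lines (function names are illustrative; probabilities are renormalized after filtering, as described above):

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def top_k_filter(probs, k):
    """Keep the k highest-probability tokens; renormalize; zero out the rest."""
    keep = set(sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k])
    mass = sum(probs[i] for i in keep)
    return [probs[i] / mass if i in keep else 0.0 for i in range(len(probs))]

def top_p_filter(probs, p):
    """Keep the smallest set of tokens (in descending probability order)
    whose cumulative probability reaches p; renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cum = set(), 0.0
    for i in order:
        keep.add(i)
        cum += probs[i]
        if cum >= p:
            break
    mass = sum(probs[i] for i in keep)
    return [probs[i] / mass if i in keep else 0.0 for i in range(len(probs))]

probs = softmax([5.0, 3.0, 1.0, 0.5])
print(top_k_filter(probs, 2))    # only the two most likely tokens survive
print(top_p_filter(probs, 0.9))  # just enough tokens to cover 90% of the mass
```

On this confident distribution, top-p with p = 0.9 happens to keep the same two tokens as top-k with k = 2; on a flatter distribution, top-p would automatically keep more.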
Temperature and top-p/top-k operate at different stages of the sampling pipeline:
| Parameter | What It Controls | When It Acts | Effect |
|---|---|---|---|
| Temperature | Shape of the probability distribution | Before filtering | Changes the relative probabilities of all tokens |
| Top-k | Maximum number of candidate tokens | After temperature | Hard cutoff on number of tokens |
| Top-p | Cumulative probability threshold | After temperature (and optionally after top-k) | Adaptive cutoff based on probability mass |
OpenAI's API documentation recommends altering either temperature or top-p, but not both simultaneously, as their combined effect can be unpredictable. However, many practitioners find that using moderate temperature (0.5 to 0.8) with top-p around 0.9 to 0.95 produces good results.
Min-p is a newer sampling strategy that sets a minimum probability threshold relative to the top token's probability. A token is only considered if its probability is at least (min_p * probability of the top token). Unlike top-k, which uses a fixed count, and top-p, which uses cumulative probability, min-p uses a relative threshold. This approach handles varying distribution shapes well because the threshold automatically adjusts based on how confident the model is about its top choice.
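A minimal min-p sketch showing the adaptive behavior (function names are illustrative):

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def min_p_filter(probs, min_p):
    """Keep tokens whose probability is at least min_p times the
    top token's probability; renormalize the survivors."""
    threshold = min_p * max(probs)
    mass = sum(q for q in probs if q >= threshold)
    return [q / mass if q >= threshold else 0.0 for q in probs]

# Confident distribution: the relative threshold prunes aggressively
confident = softmax([5.0, 3.0, 1.0, 0.5])
print(min_p_filter(confident, 0.1))  # tokens under 10% of the top probability are dropped

# Flat distribution: the same min_p keeps every token
flat = softmax([1.0, 0.9, 0.8, 0.7])
print(min_p_filter(flat, 0.1))
```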
| Strategy | Threshold Type | Adaptive? | Key Advantage | Key Disadvantage |
|---|---|---|---|---|
| Temperature only | Distribution shaping | N/A | Simple; affects entire distribution | No hard cutoff; very low-probability tokens can still be sampled |
| Top-k | Fixed count | No | Simple to understand and implement | Ignores distribution shape; may include irrelevant or exclude viable tokens |
| Top-p | Cumulative probability | Yes | Adapts to distribution shape | Can include long tail of low-probability tokens when distribution is flat |
| Min-p | Relative probability | Yes | Natural cutoff relative to best token | Newer; less widely supported |
| Temperature + Top-p | Combined | Partially | Balances distribution shaping with adaptive filtering | Two parameters to tune; interaction can be complex |
Different LLM providers implement temperature with varying ranges and defaults:
| Model / Provider | Temperature Range | Default | Notes |
|---|---|---|---|
| OpenAI GPT-4, GPT-4o | 0.0 to 2.0 | 1.0 | ChatGPT interface may use ~0.7 internally |
| Anthropic Claude | 0.0 to 1.0 | 1.0 | Range capped at 1.0 |
| Google Gemini | 0.0 to 2.0 | Varies by model | Different defaults for different Gemini variants |
| Meta LLaMA | 0.0 to 2.0+ | Typically 0.6 to 0.8 | Open-weight; users can set any value |
| Mistral | 0.0 to 1.0+ | 0.7 | Recommended values vary by task |
| Cohere Command R | 0.0 to 1.0 | 0.3 | Lower default reflects preference for precision |
Some providers also offer a "deterministic" or "greedy" mode that is functionally equivalent to T = 0 but may be implemented differently at the infrastructure level.
Fine-tuning can change how a model responds to temperature. A model fine-tuned on a narrow, specific task may produce high-quality outputs at T = 0 because its learned distribution is already sharply focused on the correct patterns. A general-purpose model may benefit from moderate temperature to explore its broader distribution.
Models that have undergone reinforcement learning from human feedback (RLHF) may also behave differently with temperature than their base versions. RLHF tends to sharpen the model's distribution toward preferred outputs, which means the effective behavior at a given temperature may be less random than for the base model at the same temperature.
Choosing the right temperature depends on the task, the desired output characteristics, and the specific model being used.
| Task | Recommended Temperature | Reasoning |
|---|---|---|
| Factual question answering | 0.0 to 0.3 | Accuracy is paramount; minimize randomness |
| Code generation | 0.0 to 0.2 | Code must be syntactically and semantically correct |
| Summarization | 0.3 to 0.5 | Some variety in phrasing is acceptable, but fidelity matters |
| General conversation | 0.5 to 0.8 | Balance between coherence and natural-sounding variety |
| Translation | 0.2 to 0.4 | Accuracy matters, but some flexibility is needed for fluency |
| Creative writing | 0.7 to 1.2 | Encourage diverse and surprising word choices |
| Brainstorming | 0.8 to 1.5 | Maximize diversity of ideas |
| Poetry and fiction | 1.0 to 1.5 | High creativity; unusual word combinations are desirable |
| Data extraction / structured output | 0.0 | Deterministic output needed to match expected formats |
| Pitfall | Why It Happens | How to Fix |
|---|---|---|
| Repetitive outputs | Temperature too low; greedy decoding gets stuck | Increase temperature slightly (0.1 to 0.3) or add top-p |
| Incoherent or nonsensical text | Temperature too high | Lower temperature; add top-p or top-k filtering |
| Inconsistent behavior across runs | High temperature causes variance | Lower temperature for more consistent outputs |
| Good first sentence, bad rest | Temperature effect compounds over long sequences | Use lower temperature for longer outputs |
| Model ignores instructions | Very high temperature causes random token selection | Reduce temperature; critical instructions should not rely on high-temperature generation |
Although most commonly associated with LLMs, the concept of temperature scaling appears in several other areas of machine learning.
In knowledge distillation, temperature is used to "soften" the probability distribution of a teacher model's outputs. Hinton, Vinyals, and Dean (2015) introduced this technique, where a large teacher model's soft predictions (generated with high temperature) are used to train a smaller student model. The high temperature reveals more information about the teacher's learned relationships between classes than hard (one-hot) labels would. For example, a teacher model's output at T = 1 might assign 90% probability to "cat" and 5% each to "dog" and "tiger." At T = 5, the distribution softens to roughly 47% "cat" and about 26% each for "dog" and "tiger," revealing that the model considers "dog" and "tiger" more similar to "cat" than to unrelated classes.
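Because logits are log-probabilities up to an additive constant, applying temperature T to the logits is equivalent to raising each probability to the power 1/T and renormalizing. A sketch of the softening (the 90/5/5 teacher distribution follows the example above; `soften` is an illustrative helper):

```python
import math

def soften(probs, T):
    """Apply temperature T to a probability distribution.
    Equivalent to dividing the underlying logits by T, since p^(1/T)
    is proportional to exp(log(p) / T)."""
    weights = [p ** (1.0 / T) for p in probs]
    s = sum(weights)
    return [w / s for w in weights]

teacher = [0.90, 0.05, 0.05]  # "cat", "dog", "tiger" at T = 1
print(soften(teacher, 5.0))   # ≈ [0.471, 0.264, 0.264]
```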
In diffusion models used for image generation (such as Stable Diffusion and DALL-E), a guidance scale parameter serves a similar purpose to temperature. Higher guidance scale values produce outputs that more closely match the prompt (analogous to lower temperature), while lower values allow more variation.
In reinforcement learning, particularly in the Boltzmann exploration strategy, temperature controls the tradeoff between exploitation (choosing the best-known action) and exploration (trying other actions). Lower temperature favors exploitation; higher temperature favors exploration. This is mathematically identical to temperature-scaled softmax applied to action values (Q-values).
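A sketch of Boltzmann exploration over hypothetical Q-values (the function name and values are illustrative):

```python
import math
import random

def boltzmann_action(q_values, T):
    """Select an action with probability proportional to exp(Q / T).
    Returns the chosen action index and the full action distribution."""
    m = max(q_values)  # subtract the max for numerical stability
    weights = [math.exp((q - m) / T) for q in q_values]
    total = sum(weights)
    probs = [w / total for w in weights]
    action = random.choices(range(len(q_values)), weights=probs, k=1)[0]
    return action, probs

q = [1.0, 0.5, 0.2]                    # hypothetical Q-values for three actions
_, exploit = boltzmann_action(q, 0.1)  # low T: almost always the best action
_, explore = boltzmann_action(q, 5.0)  # high T: near-uniform exploration
print(exploit)
print(explore)
```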
A related but distinct use of "temperature" in machine learning is temperature scaling for model calibration, introduced by Guo et al. (2017). In this context, a single scalar temperature parameter T is learned on a validation set and applied to a trained model's logits to improve the calibration of its probability estimates. This has nothing to do with sampling randomness; instead, it adjusts the model's confidence levels so that a predicted probability of 0.9 actually corresponds to 90% accuracy. Modern neural networks tend to be overconfident (their predicted probabilities are systematically too high), and temperature scaling is a simple, effective post-hoc correction.
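A toy sketch of post-hoc temperature scaling using a grid search over T (real implementations typically optimize T by gradient descent on the validation negative log-likelihood; the data and function names below are illustrative):

```python
import math

def nll(logits_list, labels, T):
    """Average negative log-likelihood of labels under temperature-scaled softmax."""
    total = 0.0
    for logits, y in zip(logits_list, labels):
        scaled = [z / T for z in logits]
        m = max(scaled)  # log-sum-exp with max subtraction for stability
        log_z = m + math.log(sum(math.exp(z - m) for z in scaled))
        total += log_z - scaled[y]
    return total / len(labels)

def fit_temperature(logits_list, labels):
    """Grid-search the single scalar T that minimizes validation NLL."""
    candidates = [0.5 + 0.05 * i for i in range(91)]  # T from 0.5 to 5.0
    return min(candidates, key=lambda T: nll(logits_list, labels, T))

# Toy overconfident binary classifier: large logit gaps, wrong 1 time in 4
val_logits = [[4.0, 0.0], [4.0, 0.0], [4.0, 0.0], [4.0, 0.0]]
val_labels = [0, 0, 0, 1]
T_star = fit_temperature(val_logits, val_labels)
print(T_star)  # T > 1: softening the overconfident logits improves calibration
```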
The entropy of the probability distribution increases monotonically with temperature. At T = 0, entropy is 0 (all probability on one token, a degenerate distribution). At T = 1, entropy equals that of the model's learned distribution. As T increases toward infinity, entropy approaches log(V), the maximum possible entropy for a vocabulary of size V. This monotonic relationship makes temperature a smooth control knob for the amount of randomness in the output.
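The monotonic relationship can be checked numerically (a sketch; logits reuse the four-token example from earlier in this article):

```python
import math

def entropy_at_temperature(logits, T):
    """Shannon entropy (in nats) of the temperature-scaled softmax distribution."""
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0)

logits = [5.0, 3.0, 1.0, 0.5]
entropies = [entropy_at_temperature(logits, T) for T in (0.25, 0.5, 1.0, 2.0, 5.0, 100.0)]
print(entropies)  # strictly increasing, approaching log(4) ≈ 1.386
```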
The Kullback-Leibler (KL) divergence between the T = 1 distribution and a temperature-scaled distribution increases as T moves further from 1 in either direction. This means that both very low and very high temperatures produce distributions that are significantly different from the model's learned distribution.
The temperature-scaled softmax can be viewed through the lens of energy-based models. Each logit z_i represents the negative energy of token i, and the softmax converts energies into probabilities via the Boltzmann distribution. Temperature controls how sharply the model distinguishes between low-energy (preferred) and high-energy (dispreferred) tokens.
During training, temperature does not typically affect gradient computation (since training uses the standard softmax or cross-entropy loss). However, during inference with temperature-scaled sampling, the choice of temperature can be seen as defining a different inference-time distribution over the model's output space. In knowledge distillation, where temperature is used during training, gradients must be scaled by T^2 to maintain proper gradient magnitudes, as Hinton et al. (2015) noted.