Temperature is a hyperparameter used in large language models and other neural networks to control the randomness of predictions during inference. It works by scaling the raw output scores (called logits) of a model before they are converted into probabilities through the softmax function. A low temperature produces more deterministic, focused outputs, while a high temperature yields more diverse and creative responses. The concept originates from the Boltzmann distribution in statistical mechanics, where temperature governs the probability of a system occupying different energy states [1].
Temperature has become one of the most commonly adjusted parameters in modern generative AI applications. Nearly every major API provider, including OpenAI, Anthropic, and Google, exposes temperature as a configurable setting. Understanding how temperature works, and when to adjust it, is essential for anyone building applications on top of language models.
To understand temperature, it helps to start with the softmax function. Given a vector of logits z = (z_1, z_2, ..., z_n) produced by the final layer of a neural network, the standard softmax function computes the probability of each token i as:
P(i) = exp(z_i) / sum_j(exp(z_j))
Temperature modifies this computation by dividing each logit by a scalar value T before applying softmax:
P(i) = exp(z_i / T) / sum_j(exp(z_j / T))
Here, T is the temperature parameter. This simple division has a profound effect on the resulting probability distribution [2].
When T = 1.0, the formula reduces to standard softmax, and the probability distribution is unchanged. When T is less than 1.0, dividing the logits by a value smaller than 1 scales them up, amplifying the differences between high and low logits. The resulting distribution becomes "sharper," concentrating more probability mass on the top-ranked tokens. In the limit as T approaches 0, the distribution collapses to a point mass on the single highest-scoring token, equivalent to greedy decoding.
Conversely, when T is greater than 1.0, dividing by a large number shrinks the logits toward zero, reducing the differences between them. The distribution becomes "flatter" and more uniform. In the extreme case as T approaches infinity, all tokens receive equal probability regardless of their original logit values.
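These limiting behaviors are easy to verify numerically. The sketch below (plain Python; the function name and example logits are illustrative) implements temperature-scaled softmax and shows the distribution collapsing toward the argmax at low T and flattening toward uniform at high T:

```python
import math

def softmax_with_temperature(logits, temperature):
    # P(i) = exp(z_i / T) / sum_j exp(z_j / T)
    scaled = [z / temperature for z in logits]
    m = max(scaled)                      # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [5.0, 3.0, 1.0]

# T -> 0: nearly all probability mass on the highest logit (greedy decoding).
print(softmax_with_temperature(logits, 0.01))

# T = 1: the standard softmax distribution.
print(softmax_with_temperature(logits, 1.0))

# T -> infinity: the distribution approaches uniform.
print(softmax_with_temperature(logits, 1000.0))
```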
A concrete example illustrates this. Suppose a model produces logits of [5.0, 3.0, 1.0] for three candidate tokens. At different temperatures, the resulting probabilities shift dramatically:
| Temperature | Token A (logit 5.0) | Token B (logit 3.0) | Token C (logit 1.0) | Distribution Character |
|---|---|---|---|---|
| 0.1 | ~1.000 | ~0.000 | ~0.000 | Nearly deterministic |
| 0.5 | 0.982 | 0.018 | 0.000 | Very peaked |
| 1.0 | 0.867 | 0.117 | 0.016 | Standard |
| 1.5 | 0.750 | 0.198 | 0.052 | Flatter |
| 2.0 | 0.665 | 0.245 | 0.090 | Much flatter |
The table shows how lower temperatures concentrate probability on Token A, while higher temperatures spread probability more evenly across all three tokens.
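These probabilities can be computed directly in a few lines of Python (a minimal sketch; the helper name is illustrative):

```python
import math

def softmax_t(logits, t):
    # Divide each logit by T, exponentiate, and normalize.
    exps = [math.exp(z / t) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [5.0, 3.0, 1.0]
for t in (0.1, 0.5, 1.0, 1.5, 2.0):
    probs = softmax_t(logits, t)
    print(f"T={t:<4}", "  ".join(f"{p:.3f}" for p in probs))
```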
The name "temperature" in machine learning is borrowed directly from statistical mechanics. Ludwig Boltzmann formulated the distribution that bears his name in 1868, describing the probability that a physical system occupies a particular energy state. The Boltzmann distribution has the form:
P(state_i) = exp(-E_i / kT) / Z
where E_i is the energy of state i, k is the Boltzmann constant, T is the physical temperature, and Z is the partition function (a normalization constant). This is mathematically identical to the temperature-scaled softmax, with logits playing the role of negative energies [3].
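The equivalence can be checked numerically: treating each logit as a negative energy (E_i = -z_i) and identifying kT with the sampling temperature T, the two formulas produce the same distribution. A small sketch with illustrative function names:

```python
import math

def softmax_t(logits, t):
    exps = [math.exp(z / t) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def boltzmann(energies, kT):
    # P(state_i) = exp(-E_i / kT) / Z, where Z is the partition function.
    weights = [math.exp(-e / kT) for e in energies]
    Z = sum(weights)
    return [w / Z for w in weights]

logits = [5.0, 3.0, 1.0]
energies = [-z for z in logits]    # logits play the role of negative energies

a = softmax_t(logits, 1.5)
b = boltzmann(energies, 1.5)
assert all(abs(x - y) < 1e-12 for x, y in zip(a, b))
```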
In a physical system, high temperature means particles have lots of thermal energy and are likely to occupy many different states, including high-energy ones. Low temperature means particles settle into the lowest-energy configurations. The analogy to language model sampling is direct: high temperature lets the model "explore" unlikely tokens, while low temperature forces it to stick with the most probable choices.
J. Willard Gibbs later formalized and popularized this framework in his 1902 textbook Elementary Principles in Statistical Mechanics, which is why the distribution is sometimes called the Gibbs distribution [3].
The temperature concept entered neural network research through the Boltzmann machine, a stochastic neural network developed by Geoffrey Hinton and Terry Sejnowski. In the 1985 paper "A Learning Algorithm for Boltzmann Machines," published in Cognitive Science, David Ackley, Hinton, and Sejnowski described a network whose units are activated probabilistically according to a Boltzmann distribution parameterized by temperature [4].
Their training procedure used simulated annealing, a technique borrowed from metallurgy. The network starts at a high temperature, where units flip states freely, and gradually cools to a low temperature, where the network settles into a low-energy (good solution) configuration. This cooling schedule directly parallels how physical materials are annealed: heated to remove defects, then slowly cooled to reach a stable crystalline structure [4].
Hinton and Hopfield received the 2024 Nobel Prize in Physics in part for this foundational work connecting statistical physics to machine learning [5].
The softmax function itself was introduced to the machine learning community by John S. Bridle in two conference papers in 1989, where he proposed using the normalized exponential as an output activation for classification networks [3]. The temperature-parameterized version naturally followed from its roots in statistical mechanics. Over the following decades, temperature scaling was applied in diverse contexts: training neural classifiers, calibrating model confidence, controlling exploration in reinforcement learning, and, eventually, sampling from language models.
The use of temperature during text generation gained broad visibility with the rise of GPT-2 and GPT-3 in 2019 and 2020, as researchers and developers experimented with generating text at various temperature settings. The OpenAI API exposed temperature as a user-configurable parameter from its earliest public release, establishing it as a standard interface convention that other providers followed.
Practitioners typically work with temperature values in the range of 0.0 to 2.0, though the exact range depends on the API. The effects can be grouped into several regimes.
At temperature 0, the model always selects the single token with the highest logit at each step. This is called greedy decoding. No sampling is involved, so in theory the output is deterministic.
In practice, however, temperature 0 does not guarantee perfectly identical outputs across repeated runs. Floating-point arithmetic on modern hardware is non-associative, meaning that the order of operations can produce slightly different numerical results. More significantly, modern inference servers use dynamic batching, where the batch size varies depending on how many requests are being processed simultaneously. Because batch size affects the order and grouping of floating-point operations, the same prompt can produce slightly different logit values under different server loads. These tiny differences occasionally change which token has the highest logit, causing the entire generation to diverge [6].
For applications that truly require reproducibility, some APIs offer a "seed" parameter alongside temperature 0 to make best-effort deterministic outputs, though even this is not an absolute guarantee.
Very low temperatures (roughly 0.1 to 0.3) produce outputs that are almost deterministic but retain a tiny amount of variability. The model strongly favors high-probability tokens while occasionally sampling alternatives. These settings are well suited for tasks where accuracy and consistency matter, such as code generation, factual question answering, and data extraction.
Moderate temperatures (roughly 0.4 to 0.7) offer a balance between consistency and variety. The model is still fairly predictable but can produce noticeably different phrasings across runs. Many production applications settle in this range for general-purpose chatbot interactions or business writing, where some variation is acceptable but hallucinations need to be minimized.
Higher temperatures (roughly 0.7 to 1.0) encourage the model to sample from a wider range of tokens, producing more diverse and surprising text. Creative writing, brainstorming, and storytelling tasks often benefit from temperatures in this range. The model is more likely to use unexpected word choices and varied sentence structures.
A temperature of 1.0 applies the softmax function without any scaling, using the model's learned probability distribution as-is. This is the default for most APIs and represents the distribution the model was trained to produce. For many general-purpose tasks, T = 1.0 works well without further tuning.
Temperatures above 1.0 flatten the distribution beyond what the model learned during training, introducing substantial randomness. While this can occasionally be useful for generating highly diverse candidate outputs (for example, when using self-consistency to sample many reasoning paths), it often degrades output quality. At T = 2.0, even low-probability tokens receive meaningful sampling weight, which can produce incoherent or grammatically incorrect text.
Different AI providers implement temperature with varying ranges and defaults. The table below summarizes the settings for major APIs as of early 2026.
| Provider | Model(s) | Default Temperature | Range | Notes |
|---|---|---|---|---|
| OpenAI | GPT-4o, GPT-4.1 | 1.0 | 0.0 to 2.0 | Reasoning models (o1, o3, o4-mini) have temperature locked at 1.0 [7] |
| Anthropic | Claude Sonnet 4, Claude Opus 4 | 1.0 | 0.0 to 1.0 | Maximum of 1.0; extended thinking models manage sampling internally [8] |
| Google | Gemini 2.5, Gemini 3 | 1.0 | 0.0 to 2.0 | Google strongly recommends keeping the default of 1.0 for Gemini 3; changing it may cause looping or degraded reasoning [9] |
| Meta | Llama 3, Llama 4 | 0.6 | 0.0 to 2.0 | Open-weight models; actual range depends on serving framework |
| Mistral | Mistral Large, Codestral | 0.7 | 0.0 to 1.5 | Lower default suited to instruction-following tasks |
One notable trend is that reasoning models (such as OpenAI o1/o3/o4-mini and Gemini 3 with thinking) often lock or strongly recommend a fixed temperature. Because these models perform internal chain-of-thought reasoning before producing output, the sampling dynamics are managed internally, and user-specified temperature adjustments can interfere with the reasoning process [7] [9].
The optimal temperature depends heavily on the task. The table below provides guidance based on common use cases.
| Use Case | Recommended Temperature | Rationale |
|---|---|---|
| Code generation | 0.0 to 0.2 | Correctness is paramount; low randomness reduces syntax errors and logical mistakes |
| Factual question answering | 0.0 to 0.3 | Accuracy matters most; model should stick with its highest-confidence answers |
| Data extraction / parsing | 0.0 | Output must conform to a strict schema; any deviation is an error |
| Summarization | 0.2 to 0.5 | Mostly faithful to source material with slight variation in phrasing |
| Translation | 0.2 to 0.4 | Should be accurate but not overly literal; slight flexibility helps |
| General chatbot | 0.5 to 0.7 | Conversational and natural; avoids sounding robotic while staying on topic |
| Creative writing | 0.7 to 1.0 | Encourages varied word choice, unexpected turns, and stylistic richness |
| Brainstorming / ideation | 0.8 to 1.2 | Maximizes diversity of ideas; some incoherence is acceptable |
| Poetry / experimental text | 0.9 to 1.5 | Unusual language combinations and surprising imagery can enhance the work |
| Diverse candidate sampling | 1.0 to 1.5 | Used with self-consistency or reranking; high diversity is needed for coverage |
These are starting points, not absolute rules. The best temperature for a given application often emerges through systematic evaluation on representative examples.
Temperature is one of several parameters that control how tokens are selected during text generation. It is typically applied first (scaling the logits), after which additional filtering methods may be used.
Top-k sampling restricts the candidate pool to the k tokens with the highest probabilities. After temperature scaling and softmax, only the top k tokens are kept, and their probabilities are renormalized to sum to 1. A token is then sampled from this reduced distribution.
Top-k provides a hard cutoff: exactly k tokens are considered regardless of their probability values. This means the method does not adapt to the model's confidence. When the model is very confident (one token dominates), top-k still retains k candidates, potentially including irrelevant ones. When the model is uncertain (many tokens have similar probabilities), the same k might exclude reasonable options. This inflexibility was the motivation for top-p sampling [10].
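A minimal top-k filter might look like this (a sketch in plain Python; the function name is illustrative, and real implementations operate on tensors):

```python
def top_k_filter(probs, k):
    # Keep the k most probable tokens, then renormalize so they sum to 1.
    ranked = sorted(enumerate(probs), key=lambda ip: ip[1], reverse=True)[:k]
    total = sum(p for _, p in ranked)
    return [(i, p / total) for i, p in ranked]

# With k=2, exactly two candidates survive, regardless of the model's confidence.
print(top_k_filter([0.70, 0.20, 0.08, 0.02], k=2))
```

Note the fixed cutoff: the filter keeps exactly k tokens whether the distribution is peaked or flat, which is the inflexibility described above.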
Top-p sampling, also called nucleus sampling, was introduced by Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi in their 2020 paper "The Curious Case of Neural Text Degeneration" [10]. Instead of fixing the number of candidate tokens, top-p dynamically selects the smallest set of tokens whose cumulative probability exceeds a threshold p.
For example, with top-p = 0.9, the model sorts tokens by probability and includes tokens from the top until their cumulative probability reaches 0.9. If the model is confident and the top token already has probability 0.85, only a few tokens are included. If the model is uncertain and probabilities are spread out, many tokens are included. This dynamic behavior makes top-p more adaptive than top-k [10].
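The truncation step of nucleus sampling can be sketched as follows (illustrative names; the example distributions are hypothetical):

```python
def top_p_filter(probs, p):
    # Keep the smallest set of probability-sorted tokens whose cumulative mass reaches p.
    ranked = sorted(enumerate(probs), key=lambda ip: ip[1], reverse=True)
    kept, cumulative = [], 0.0
    for i, prob in ranked:
        kept.append((i, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(q for _, q in kept)
    return [(i, q / total) for i, q in kept]

# Confident distribution: the nucleus is small (two tokens survive).
confident = top_p_filter([0.85, 0.10, 0.03, 0.02], p=0.9)
# Uncertain distribution: the nucleus widens automatically (four tokens survive).
uncertain = top_p_filter([0.35, 0.25, 0.20, 0.12, 0.08], p=0.9)
```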
Min-p is a newer sampling method that filters out tokens whose probability falls below a fraction of the top token's probability. For instance, with min-p = 0.1, any token whose probability is less than 10% of the most probable token is eliminated. Like top-p, min-p adapts to the model's confidence, but it does so in a way that is more intuitive and less sensitive to temperature changes [11].
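A min-p filter is only a few lines (a sketch with an illustrative function name and hypothetical probabilities):

```python
def min_p_filter(probs, min_p):
    # Drop tokens below min_p times the top token's probability, then renormalize.
    threshold = min_p * max(probs)
    kept = [(i, p) for i, p in enumerate(probs) if p >= threshold]
    total = sum(p for _, p in kept)
    return [(i, p / total) for i, p in kept]

# With min_p = 0.1, any token under 10% of the leader's probability is cut:
# the threshold here is 0.06, so the last two tokens are eliminated.
print(min_p_filter([0.60, 0.25, 0.10, 0.04, 0.01], min_p=0.1))
```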
Introduced by Tang et al. at ACL 2025, top-n-sigma addresses a subtle problem called temperature coupling: with probability-based truncation methods (top-p, min-p), changing the temperature also changes which tokens survive truncation, so randomness and candidate-set size cannot be tuned independently. Top-n-sigma instead filters in logit space, keeping tokens within n standard deviations of the maximum logit. Because the maximum logit and the standard deviation of the logits both scale identically when divided by temperature, this truncation is mathematically independent of temperature, allowing temperature to control randomness while the truncation controls diversity [12].
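The temperature invariance is easy to demonstrate. The sketch below (illustrative; the published method's exact standard-deviation convention may differ) applies a max - n·σ threshold in logit space and shows that the surviving set does not change when the logits are rescaled by temperature:

```python
import statistics

def top_n_sigma_mask(logits, n):
    # Keep tokens whose logit is within n standard deviations of the maximum logit.
    threshold = max(logits) - n * statistics.pstdev(logits)
    return [z >= threshold for z in logits]

logits = [5.0, 3.0, 1.0, -2.0, -4.0]
base_mask = top_n_sigma_mask(logits, n=1.0)

# Dividing by T rescales max(logits) and the standard deviation identically,
# so the threshold comparison, and thus the surviving set, is unchanged.
for T in (0.5, 1.0, 2.0):
    scaled = [z / T for z in logits]
    assert top_n_sigma_mask(scaled, n=1.0) == base_mask
```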
In a typical generation pipeline, these parameters are applied in sequence: the logits are first divided by the temperature, softmax converts them to probabilities, a truncation method (top-k, top-p, or min-p) filters the candidate pool, the surviving probabilities are renormalized, and a token is sampled from the result.
Because temperature is applied before the filtering steps, it affects which tokens survive the filters. A higher temperature flattens the distribution, causing more tokens to pass top-p or min-p thresholds. A lower temperature sharpens the distribution, causing fewer tokens to survive. This interaction means that temperature and top-p (or top-k) are not fully independent: changing one often requires adjusting the other to maintain the desired output characteristics.
Most practitioners recommend tuning one parameter at a time and keeping the others at defaults. A common combination is temperature 0.7 with top-p 0.9 for general-purpose tasks, or temperature 0.0 with no additional filtering for deterministic outputs [11].
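Putting the pieces together, a single temperature-then-top-p sampling step might be sketched as follows (plain Python with illustrative names; production decoders operate on tensors and handle edge cases such as T = 0 separately):

```python
import math
import random

def sample_token(logits, temperature=0.7, top_p=0.9, rng=None):
    rng = rng or random.Random(0)       # fixed seed here only for reproducibility
    # 1. Temperature-scale the logits and convert to probabilities.
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # 2. Top-p truncation: keep the smallest high-probability prefix reaching top_p.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    # 3. Renormalize the survivors and sample one token index.
    kept_total = sum(probs[i] for i in kept)
    weights = [probs[i] / kept_total for i in kept]
    return rng.choices(kept, weights=weights)[0]

# Near-zero temperature with top-p behaves like greedy decoding.
print(sample_token([5.0, 3.0, 1.0], temperature=0.01))  # prints 0, the top logit's index
```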
| Parameter | What It Controls | Adaptive? | Applied To | Introduced |
|---|---|---|---|---|
| Temperature | Overall distribution sharpness | No (fixed scalar) | Logits (before softmax) | Boltzmann, 1868 (physics); Hinton, 1985 (neural networks) |
| Top-k | Number of candidate tokens | No (fixed count) | Probabilities (after softmax) | Fan et al., 2018 |
| Top-p (nucleus) | Cumulative probability threshold | Yes (adapts to confidence) | Probabilities (after softmax) | Holtzman et al., 2020 |
| Min-p | Minimum relative probability | Yes (adapts to confidence) | Probabilities (after softmax) | Community-developed, 2023 |
| Top-n-sigma | Standard deviation threshold in logit space | Yes (temperature-invariant) | Logits (before softmax) | Tang et al., 2025 |
Beyond inference, temperature plays important roles during training and evaluation.
In knowledge distillation, a smaller "student" model is trained to mimic the outputs of a larger "teacher" model. Temperature is a key ingredient in this process: both models' logits are scaled by a high temperature (typically T = 2 to 20) before computing the softmax. This softening reveals the teacher's "dark knowledge," the relative probabilities it assigns to incorrect classes, which contains information about the structure of the problem that hard labels alone do not convey. Geoffrey Hinton, Oriol Vinyals, and Jeff Dean formalized this approach in their 2015 paper "Distilling the Knowledge in a Neural Network" [13].
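The softening step can be illustrated in a few lines (hypothetical teacher logits; this shows only the soft-target computation, not the full distillation loss):

```python
import math

def soften(logits, T):
    # Temperature-scaled softmax, used to produce soft targets for the student.
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

teacher_logits = [9.0, 4.0, 1.0]          # hypothetical teacher outputs

hard = soften(teacher_logits, T=1.0)       # nearly one-hot: dark knowledge hidden
soft = soften(teacher_logits, T=5.0)       # relative odds of wrong classes revealed
```

At T = 1 the second class receives well under 1% of the probability mass; at T = 5 it receives over 20%, exposing the teacher's view that the second class is more plausible than the third.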
Temperature scaling is also used for model calibration, where the goal is to ensure that a model's predicted probabilities match real-world frequencies. A single temperature parameter is learned on a validation set after training and applied to the logits at test time. Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger showed in 2017 that this simple post-processing step significantly improves the calibration of modern neural networks without affecting their accuracy [14].
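A minimal version of this procedure, using a hypothetical over-confident validation set and a simple grid search in place of the gradient-based optimization used in practice:

```python
import math

def avg_nll(logits_batch, labels, T):
    # Average negative log-likelihood of the true labels at temperature T.
    total = 0.0
    for logits, y in zip(logits_batch, labels):
        scaled = [z / T for z in logits]
        m = max(scaled)
        log_Z = m + math.log(sum(math.exp(s - m) for s in scaled))
        total += log_Z - scaled[y]
    return total / len(labels)

# Hypothetical validation logits: confidently wrong on the second example.
val_logits = [[4.0, 0.5, 0.0], [3.5, 1.0, 0.2], [0.4, 3.8, 0.1], [0.2, 0.1, 3.0]]
val_labels = [0, 1, 1, 2]

# Grid-search the single scalar T that minimizes validation NLL.
best_T = min((t / 10 for t in range(5, 50)),
             key=lambda T: avg_nll(val_logits, val_labels, T))
print(best_T)   # > 1.0: the model was over-confident, so calibration softens it
```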
In reinforcement learning, temperature controls the exploration-exploitation tradeoff when using a softmax policy over action values. A high temperature encourages the agent to explore diverse actions, while a low temperature favors exploiting the currently best-known action. Many RL algorithms use an annealing schedule that starts with high temperature and gradually reduces it as the agent learns, mirroring the simulated annealing approach of Boltzmann machines [4].
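A sketch of softmax (Boltzmann) action selection with hypothetical action-value estimates:

```python
import math
import random

def boltzmann_policy(q_values, temperature, rng):
    # Sample an action with P(a) proportional to exp(Q(a) / T).
    exps = [math.exp(q / temperature) for q in q_values]
    total = sum(exps)
    weights = [e / total for e in exps]
    return rng.choices(range(len(q_values)), weights=weights)[0]

rng = random.Random(42)
q = [1.0, 0.5, -0.2]                 # hypothetical action-value estimates

# High temperature explores broadly; low temperature exploits the best action.
# An annealing schedule interpolates from the first regime to the second.
hot = [boltzmann_policy(q, 10.0, rng) for _ in range(1000)]
cold = [boltzmann_policy(q, 0.05, rng) for _ in range(1000)]
```

At T = 10 the three actions are sampled almost uniformly, while at T = 0.05 the highest-valued action is chosen essentially every time.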
Based on current research and industry experience, several best practices have emerged for working with temperature.
Start with the default and adjust based on evaluation. Most APIs default to T = 1.0, which works reasonably well for general tasks. Only adjust temperature after evaluating your application's outputs and identifying specific issues (too repetitive, too random, too many hallucinations).
Use low temperature for tasks with verifiable correctness. Code generation, math problems, data extraction, and factual question answering all benefit from temperatures between 0.0 and 0.3. When there is a single correct answer, you want the model to commit to its best guess.
Use moderate to high temperature for creative and open-ended tasks. Creative writing, brainstorming, and conversational applications work well with temperatures between 0.7 and 1.0. Going above 1.0 is rarely necessary and often counterproductive.
Be cautious with temperature above 1.5. Very high temperatures degrade output quality rapidly. The model starts producing grammatically broken or semantically incoherent text. If you need high diversity, consider using moderate temperature combined with techniques like self-consistency (sampling multiple outputs and selecting the best) rather than pushing temperature to extreme values.
Do not rely on temperature 0 for determinism. If your application requires perfectly reproducible outputs, you need additional measures beyond setting temperature to 0. Use fixed seeds where available, control batch sizes, and implement output caching [6].
Consider temperature interactions with other parameters. If you are also using top-p or top-k, be aware that temperature changes the effective behavior of those filters. A standard combination for balanced outputs is T = 0.7 with top-p = 0.9. For deterministic outputs, use T = 0 and disable other sampling parameters.
Respect locked temperatures on reasoning models. Models like OpenAI o3, o4-mini, and Google Gemini 3 with thinking have fixed or strongly recommended temperature settings. These models perform internal reasoning that is calibrated for specific temperature values; overriding them can cause degraded performance, repetitive loops, or other unexpected behavior [7] [9].
Use systematic evaluation, not intuition. Rather than guessing the best temperature, run your prompts at several temperature settings and evaluate the outputs against your quality criteria. Tools like promptfoo and other LLM evaluation frameworks support temperature sweeps as part of their testing pipelines [15].
Consider temperature combined with min-p for open-weight models. For models served through frameworks like vLLM or llama.cpp, the community has increasingly favored pairing temperature with min-p sampling, as min-p is less sensitive to temperature coupling than top-p [12].
Several misunderstandings about temperature persist in practice.
"Temperature 0 is always deterministic." As discussed above, this is not strictly true. Non-determinism from hardware-level floating-point differences and dynamic batching can cause outputs to vary even at T = 0 [6].
"Higher temperature means better creativity." There is a sweet spot. Beyond approximately T = 1.0, output quality tends to decline. Truly creative output requires the model to make meaningful connections, not just random ones. Extremely high temperatures produce randomness, not creativity.
"Temperature and top-p do the same thing." They are complementary, not redundant. Temperature affects the shape of the entire probability distribution before sampling, while top-p truncates the distribution after it has been shaped. A flat distribution (high T) filtered by top-p = 0.9 behaves very differently from a sharp distribution (low T) filtered by the same top-p value.
"The default temperature is always best." Defaults are a reasonable starting point, but many applications benefit significantly from tuning. A code-generation pipeline running at T = 1.0 is likely producing more errors than it would at T = 0.2.