Temperature is a hyperparameter used in large language models and other neural networks to control the randomness of predictions during inference. It works by scaling the raw output scores (called logits) of a model before they are converted into probabilities through the softmax function. A low temperature produces more deterministic, focused outputs, while a high temperature yields more diverse and creative responses. The concept originates from the Boltzmann distribution in statistical mechanics, where temperature governs the probability of a system occupying different energy states [1].
Temperature has become one of the most commonly adjusted parameters in modern generative AI applications. Nearly every major API provider, including OpenAI, Anthropic, and Google, exposes temperature as a configurable setting. Understanding how temperature works, and when to adjust it, is essential for anyone building applications on top of language models.
To understand temperature, it helps to start with the softmax function. Given a vector of logits z = (z_1, z_2, ..., z_n) produced by the final layer of a neural network, the standard softmax function computes the probability of each token i as:
P(i) = exp(z_i) / sum_j(exp(z_j))
Temperature modifies this computation by dividing each logit by a scalar value T before applying softmax:
P(i) = exp(z_i / T) / sum_j(exp(z_j / T))
Here, T is the temperature parameter. This simple division has a profound effect on the resulting probability distribution [2].
When T = 1.0, the formula reduces to standard softmax, and the probability distribution is unchanged. When T is less than 1.0, dividing the logits by a value smaller than 1 scales them up, amplifying the differences between high and low logits. The resulting distribution becomes "sharper," concentrating more probability mass on the top-ranked tokens. In the limit as T approaches 0, the distribution collapses to a point mass on the single highest-scoring token, equivalent to greedy decoding.
Conversely, when T is greater than 1.0, dividing by a large number shrinks the logits toward zero, reducing the differences between them. The distribution becomes "flatter" and more uniform. In the extreme case as T approaches infinity, all tokens receive equal probability regardless of their original logit values.
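These limiting behaviors are easy to verify numerically. The sketch below (plain Python; the function name and example logits are illustrative) implements temperature-scaled softmax and shows the distribution collapsing toward the argmax at low T and flattening toward uniform at high T:

```python
import math

def softmax_with_temperature(logits, temperature):
    # P(i) = exp(z_i / T) / sum_j exp(z_j / T)
    scaled = [z / temperature for z in logits]
    m = max(scaled)                      # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [5.0, 3.0, 1.0]

# T -> 0: nearly all probability mass on the highest logit (greedy decoding).
print(softmax_with_temperature(logits, 0.01))

# T = 1: the standard softmax distribution.
print(softmax_with_temperature(logits, 1.0))

# T -> infinity: the distribution approaches uniform.
print(softmax_with_temperature(logits, 1000.0))
```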
A concrete example illustrates this. Suppose a model produces logits of [5.0, 3.0, 1.0] for three candidate tokens. At different temperatures, the resulting probabilities shift dramatically:
| Temperature | Token A (logit 5.0) | Token B (logit 3.0) | Token C (logit 1.0) | Distribution Character |
|---|---|---|---|---|
| 0.1 | ~1.000 | ~0.000 | ~0.000 | Nearly deterministic |
| 0.5 | 0.982 | 0.018 | 0.000 | Very peaked |
| 1.0 | 0.867 | 0.117 | 0.016 | Standard |
| 1.5 | 0.750 | 0.198 | 0.052 | Flatter |
| 2.0 | 0.665 | 0.245 | 0.090 | Much flatter |
The table shows how lower temperatures concentrate probability on Token A, while higher temperatures spread probability more evenly across all three tokens.
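These probabilities can be computed directly in a few lines of Python (a minimal sketch; the helper name is illustrative):

```python
import math

def softmax_t(logits, t):
    # Divide each logit by T, exponentiate, and normalize.
    exps = [math.exp(z / t) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [5.0, 3.0, 1.0]
for t in (0.1, 0.5, 1.0, 1.5, 2.0):
    probs = softmax_t(logits, t)
    print(f"T={t:<4}", "  ".join(f"{p:.3f}" for p in probs))
```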
The name "temperature" in machine learning is borrowed directly from statistical mechanics. Ludwig Boltzmann formulated the distribution that bears his name in 1868, describing the probability that a physical system occupies a particular energy state. The Boltzmann distribution has the form:
P(state_i) = exp(-E_i / kT) / Z
where E_i is the energy of state i, k is the Boltzmann constant, T is the physical temperature, and Z is the partition function (a normalization constant). This is mathematically identical to the temperature-scaled softmax, with logits playing the role of negative energies [3].
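The equivalence can be checked numerically: treating each logit as a negative energy (E_i = -z_i) and identifying kT with the sampling temperature T, the two formulas produce the same distribution. A small sketch with illustrative function names:

```python
import math

def softmax_t(logits, t):
    exps = [math.exp(z / t) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def boltzmann(energies, kT):
    # P(state_i) = exp(-E_i / kT) / Z, where Z is the partition function.
    weights = [math.exp(-e / kT) for e in energies]
    Z = sum(weights)
    return [w / Z for w in weights]

logits = [5.0, 3.0, 1.0]
energies = [-z for z in logits]    # logits play the role of negative energies

a = softmax_t(logits, 1.5)
b = boltzmann(energies, 1.5)
assert all(abs(x - y) < 1e-12 for x, y in zip(a, b))
```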
In a physical system, high temperature means particles have lots of thermal energy and are likely to occupy many different states, including high-energy ones. Low temperature means particles settle into the lowest-energy configurations. The analogy to language model sampling is direct: high temperature lets the model "explore" unlikely tokens, while low temperature forces it to stick with the most probable choices.
J. Willard Gibbs later formalized and popularized this framework in his 1902 textbook Elementary Principles in Statistical Mechanics, which is why the distribution is sometimes called the Gibbs distribution [3].
The temperature concept entered neural network research through the Boltzmann machine, a stochastic neural network developed by Geoffrey Hinton and Terry Sejnowski. In the 1985 paper "A Learning Algorithm for Boltzmann Machines," published in Cognitive Science, David Ackley, Hinton, and Sejnowski described a network whose units are activated probabilistically according to a Boltzmann distribution parameterized by temperature [4].
Their training procedure used simulated annealing, a technique borrowed from metallurgy. The network starts at a high temperature, where units flip states freely, and gradually cools to a low temperature, where the network settles into a low-energy (good solution) configuration. This cooling schedule directly parallels how physical materials are annealed: heated to remove defects, then slowly cooled to reach a stable crystalline structure [4].
Hinton and Hopfield received the 2024 Nobel Prize in Physics in part for this foundational work connecting statistical physics to machine learning [5].
The softmax function itself was introduced to the machine learning community by John S. Bridle in two conference papers in 1989, where he proposed using the normalized exponential as an output activation for classification networks [3]. The temperature-parameterized version naturally followed from its roots in statistical mechanics. Over the following decades, temperature scaling was applied in diverse contexts: training neural classifiers, calibrating model confidence, controlling exploration in reinforcement learning, and, eventually, sampling from language models.
The use of temperature during text generation gained broad visibility with the rise of GPT-2 and GPT-3 in 2019 and 2020, as researchers and developers experimented with generating text at various temperature settings. The OpenAI API exposed temperature as a user-configurable parameter from its earliest public release, establishing it as a standard interface convention that other providers followed.
Practitioners typically work with temperature values in the range of 0.0 to 2.0, though the exact range depends on the API. The effects can be grouped into several regimes.
At temperature 0, the model always selects the single token with the highest logit at each step. This is called greedy decoding. No sampling is involved, so in theory the output is deterministic.
In practice, however, temperature 0 does not guarantee perfectly identical outputs across repeated runs. Floating-point arithmetic on modern hardware is non-associative, meaning that the order of operations can produce slightly different numerical results. More significantly, modern inference servers use dynamic batching, where the batch size varies depending on how many requests are being processed simultaneously. Because batch size affects the order and grouping of floating-point operations, the same prompt can produce slightly different logit values under different server loads. These tiny differences occasionally change which token has the highest logit, causing the entire generation to diverge [6].
For applications that truly require reproducibility, some APIs offer a "seed" parameter alongside temperature 0 to make best-effort deterministic outputs, though even this is not an absolute guarantee.
Very low temperatures (roughly 0.1 to 0.3) produce outputs that are almost deterministic but retain a tiny amount of variability. The model strongly favors high-probability tokens while occasionally sampling alternatives. These settings are well suited for tasks where accuracy and consistency matter, such as code generation, factual question answering, and data extraction.
Moderate temperatures (roughly 0.4 to 0.7) offer a balance between consistency and variety. The model is still fairly predictable but can produce noticeably different phrasings across runs. Many production applications settle in this range for general-purpose chatbot interactions or business writing, where some variation is acceptable but hallucinations need to be minimized.
Higher temperatures (roughly 0.7 to 1.0) encourage the model to sample from a wider range of tokens, producing more diverse and surprising text. Creative writing, brainstorming, and storytelling tasks often benefit from temperatures in this range. The model is more likely to use unexpected word choices and varied sentence structures.
A temperature of 1.0 applies the softmax function without any scaling, using the model's learned probability distribution as-is. This is the default for most APIs and represents the distribution the model was trained to produce. For many general-purpose tasks, T = 1.0 works well without further tuning.
Temperatures above 1.0 flatten the distribution beyond what the model learned during training, introducing substantial randomness. While this can occasionally be useful for generating highly diverse candidate outputs (for example, when using self-consistency to sample many reasoning paths), it often degrades output quality. At T = 2.0, even low-probability tokens receive meaningful sampling weight, which can produce incoherent or grammatically incorrect text.
Different AI providers implement temperature with varying ranges and defaults. The table below summarizes the settings for major APIs as of early 2026.
| Provider | Model(s) | Default Temperature | Range | Notes |
|---|---|---|---|---|
| OpenAI | GPT-4o, GPT-4.1 | 1.0 | 0.0 to 2.0 | Reasoning models (o1, o3, o4-mini) have temperature locked at 1.0 [7] |
| Anthropic | Claude Sonnet 4, Claude Opus 4 | 1.0 | 0.0 to 1.0 | Maximum of 1.0; extended thinking models manage sampling internally [8] |
| Google | Gemini 2.5, Gemini 3 | 1.0 | 0.0 to 2.0 | Google strongly recommends keeping the default of 1.0 for Gemini 3; changing it may cause looping or degraded reasoning [9] |
| Meta | Llama 3, Llama 4 | 0.6 | 0.0 to 2.0 | Open-weight models; actual range depends on serving framework |
| Mistral | Mistral Large, Codestral | 0.7 | 0.0 to 1.5 | Lower default suited to instruction-following tasks |
One notable trend is that reasoning models (such as OpenAI o1/o3/o4-mini and Gemini 3 with thinking) often lock or strongly recommend a fixed temperature. Because these models perform internal chain-of-thought reasoning before producing output, the sampling dynamics are managed internally, and user-specified temperature adjustments can interfere with the reasoning process [7] [9].
The optimal temperature depends heavily on the task. The table below provides guidance based on common use cases.
| Use Case | Recommended Temperature | Rationale |
|---|---|---|
| Code generation | 0.0 to 0.2 | Correctness is paramount; low randomness reduces syntax errors and logical mistakes |
| Factual question answering | 0.0 to 0.3 | Accuracy matters most; model should stick with its highest-confidence answers |
| Data extraction / parsing | 0.0 | Output must conform to a strict schema; any deviation is an error |
| Summarization | 0.2 to 0.5 | Mostly faithful to source material with slight variation in phrasing |
| Translation | 0.2 to 0.4 | Should be accurate but not overly literal; slight flexibility helps |
| General chatbot | 0.5 to 0.7 | Conversational and natural; avoids sounding robotic while staying on topic |
| Creative writing | 0.7 to 1.0 | Encourages varied word choice, unexpected turns, and stylistic richness |
| Brainstorming / ideation | 0.8 to 1.2 | Maximizes diversity of ideas; some incoherence is acceptable |
| Poetry / experimental text | 0.9 to 1.5 | Unusual language combinations and surprising imagery can enhance the work |
| Diverse candidate sampling | 1.0 to 1.5 | Used with self-consistency or reranking; high diversity is needed for coverage |
These are starting points, not absolute rules. The best temperature for a given application often emerges through systematic evaluation on representative examples.
Temperature is one of several parameters that control how tokens are selected during text generation. It is typically applied first (scaling the logits), after which additional filtering methods may be used.
Top-k sampling restricts the candidate pool to the k tokens with the highest probabilities. After temperature scaling and softmax, only the top k tokens are kept, and their probabilities are renormalized to sum to 1. A token is then sampled from this reduced distribution.
Top-k provides a hard cutoff: exactly k tokens are considered regardless of their probability values. This means the method does not adapt to the model's confidence. When the model is very confident (one token dominates), top-k still retains k candidates, potentially including irrelevant ones. When the model is uncertain (many tokens have similar probabilities), the same k might exclude reasonable options. This inflexibility was the motivation for top-p sampling [10].
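A minimal top-k filter might look like this (a sketch in plain Python; the function name is illustrative, and real implementations operate on tensors):

```python
def top_k_filter(probs, k):
    # Keep the k most probable tokens, then renormalize so they sum to 1.
    ranked = sorted(enumerate(probs), key=lambda ip: ip[1], reverse=True)[:k]
    total = sum(p for _, p in ranked)
    return [(i, p / total) for i, p in ranked]

# With k=2, exactly two candidates survive, regardless of the model's confidence.
print(top_k_filter([0.70, 0.20, 0.08, 0.02], k=2))
```

Note the fixed cutoff: the filter keeps exactly k tokens whether the distribution is peaked or flat, which is the inflexibility described above.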
Top-p sampling, also called nucleus sampling, was introduced by Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi in their 2020 paper "The Curious Case of Neural Text Degeneration" [10]. Instead of fixing the number of candidate tokens, top-p dynamically selects the smallest set of tokens whose cumulative probability exceeds a threshold p.
For example, with top-p = 0.9, the model sorts tokens by probability and includes tokens from the top until their cumulative probability reaches 0.9. If the model is confident and the top token already has probability 0.85, only a few tokens are included. If the model is uncertain and probabilities are spread out, many tokens are included. This dynamic behavior makes top-p more adaptive than top-k [10].
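The truncation step of nucleus sampling can be sketched as follows (illustrative names; the example distributions are hypothetical):

```python
def top_p_filter(probs, p):
    # Keep the smallest set of probability-sorted tokens whose cumulative mass reaches p.
    ranked = sorted(enumerate(probs), key=lambda ip: ip[1], reverse=True)
    kept, cumulative = [], 0.0
    for i, prob in ranked:
        kept.append((i, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(q for _, q in kept)
    return [(i, q / total) for i, q in kept]

# Confident distribution: the nucleus is small (two tokens survive).
confident = top_p_filter([0.85, 0.10, 0.03, 0.02], p=0.9)
# Uncertain distribution: the nucleus widens automatically (four tokens survive).
uncertain = top_p_filter([0.35, 0.25, 0.20, 0.12, 0.08], p=0.9)
```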
Min-p is a newer sampling method that filters out tokens whose probability falls below a fraction of the top token's probability. For instance, with min-p = 0.1, any token whose probability is less than 10% of the most probable token is eliminated. Like top-p, min-p adapts to the model's confidence, but it does so in a way that is more intuitive and less sensitive to temperature changes [11].
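A min-p filter is only a few lines (a sketch with an illustrative function name and hypothetical probabilities):

```python
def min_p_filter(probs, min_p):
    # Drop tokens below min_p times the top token's probability, then renormalize.
    threshold = min_p * max(probs)
    kept = [(i, p) for i, p in enumerate(probs) if p >= threshold]
    total = sum(p for _, p in kept)
    return [(i, p / total) for i, p in kept]

# With min_p = 0.1, any token under 10% of the leader's probability is cut:
# the threshold here is 0.06, so the last two tokens are eliminated.
print(min_p_filter([0.60, 0.25, 0.10, 0.04, 0.01], min_p=0.1))
```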
Introduced by Tang et al. at ACL 2025, top-n-sigma addresses a subtle problem called temperature coupling: with probability-based truncation methods (top-p, min-p), changing the temperature also changes which tokens survive truncation, so randomness and candidate-set size cannot be tuned independently. Top-n-sigma instead filters in logit space, keeping tokens within n standard deviations of the maximum logit. Because the maximum logit and the standard deviation of the logits both scale identically when divided by temperature, this truncation is mathematically independent of temperature, allowing temperature to control randomness while the truncation controls diversity [12].
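The temperature invariance is easy to demonstrate. The sketch below (illustrative; the published method's exact standard-deviation convention may differ) applies a max - n·σ threshold in logit space and shows that the surviving set does not change when the logits are rescaled by temperature:

```python
import statistics

def top_n_sigma_mask(logits, n):
    # Keep tokens whose logit is within n standard deviations of the maximum logit.
    threshold = max(logits) - n * statistics.pstdev(logits)
    return [z >= threshold for z in logits]

logits = [5.0, 3.0, 1.0, -2.0, -4.0]
base_mask = top_n_sigma_mask(logits, n=1.0)

# Dividing by T rescales max(logits) and the standard deviation identically,
# so the threshold comparison, and thus the surviving set, is unchanged.
for T in (0.5, 1.0, 2.0):
    scaled = [z / T for z in logits]
    assert top_n_sigma_mask(scaled, n=1.0) == base_mask
```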
In a typical generation pipeline, these parameters are applied in sequence: the logits are first divided by the temperature, softmax converts them to probabilities, a truncation method (top-k, top-p, or min-p) filters the candidate pool, the surviving probabilities are renormalized, and a token is sampled from the result.
Because temperature is applied before the filtering steps, it affects which tokens survive the filters. A higher temperature flattens the distribution, causing more tokens to pass top-p or min-p thresholds. A lower temperature sharpens the distribution, causing fewer tokens to survive. This interaction means that temperature and top-p (or top-k) are not fully independent: changing one often requires adjusting the other to maintain the desired output characteristics.
Most practitioners recommend tuning one parameter at a time and keeping the others at defaults. A common combination is temperature 0.7 with top-p 0.9 for general-purpose tasks, or temperature 0.0 with no additional filtering for deterministic outputs [11].
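Putting the pieces together, a single temperature-then-top-p sampling step might be sketched as follows (plain Python with illustrative names; production decoders operate on tensors and handle edge cases such as T = 0 separately):

```python
import math
import random

def sample_token(logits, temperature=0.7, top_p=0.9, rng=None):
    rng = rng or random.Random(0)       # fixed seed here only for reproducibility
    # 1. Temperature-scale the logits and convert to probabilities.
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # 2. Top-p truncation: keep the smallest high-probability prefix reaching top_p.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    # 3. Renormalize the survivors and sample one token index.
    kept_total = sum(probs[i] for i in kept)
    weights = [probs[i] / kept_total for i in kept]
    return rng.choices(kept, weights=weights)[0]

# Near-zero temperature with top-p behaves like greedy decoding.
print(sample_token([5.0, 3.0, 1.0], temperature=0.01))  # prints 0, the top logit's index
```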
| Parameter | What It Controls | Adaptive? | Applied To | Introduced |
|---|---|---|---|---|
| Temperature | Overall distribution sharpness | No (fixed scalar) | Logits (before softmax) | Boltzmann, 1868 (physics); Hinton, 1985 (neural networks) |
| Top-k | Number of candidate tokens | No (fixed count) | Probabilities (after softmax) | Fan et al., 2018 |
| Top-p (nucleus) | Cumulative probability threshold | Yes (adapts to confidence) | Probabilities (after softmax) | Holtzman et al., 2020 |
| Min-p | Minimum relative probability | Yes (adapts to confidence) | Probabilities (after softmax) | Community-developed, 2023 |
| Top-n-sigma | Standard deviation threshold in logit space | Yes (temperature-invariant) | Logits (before softmax) | Tang et al., 2025 |
Beyond inference, temperature plays important roles during training and evaluation.
In knowledge distillation, a smaller "student" model is trained to mimic the outputs of a larger "teacher" model. Temperature is a key ingredient in this process: both models' logits are scaled by a high temperature (typically T = 2 to 20) before computing the softmax. This softening reveals the teacher's "dark knowledge," the relative probabilities it assigns to incorrect classes, which contains information about the structure of the problem that hard labels alone do not convey. Geoffrey Hinton, Oriol Vinyals, and Jeff Dean formalized this approach in their 2015 paper "Distilling the Knowledge in a Neural Network" [13].
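The softening step can be illustrated in a few lines (hypothetical teacher logits; this shows only the soft-target computation, not the full distillation loss):

```python
import math

def soften(logits, T):
    # Temperature-scaled softmax, used to produce soft targets for the student.
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

teacher_logits = [9.0, 4.0, 1.0]          # hypothetical teacher outputs

hard = soften(teacher_logits, T=1.0)       # nearly one-hot: dark knowledge hidden
soft = soften(teacher_logits, T=5.0)       # relative odds of wrong classes revealed
```

At T = 1 the second class receives well under 1% of the probability mass; at T = 5 it receives over 20%, exposing the teacher's view that the second class is more plausible than the third.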
Temperature scaling is also used for model calibration, where the goal is to ensure that a model's predicted probabilities match real-world frequencies. A single temperature parameter is learned on a validation set after training and applied to the logits at test time. Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger showed in 2017 that this simple post-processing step significantly improves the calibration of modern neural networks without affecting their accuracy [14].
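A minimal version of this procedure, using a hypothetical over-confident validation set and a simple grid search in place of the gradient-based optimization used in practice:

```python
import math

def avg_nll(logits_batch, labels, T):
    # Average negative log-likelihood of the true labels at temperature T.
    total = 0.0
    for logits, y in zip(logits_batch, labels):
        scaled = [z / T for z in logits]
        m = max(scaled)
        log_Z = m + math.log(sum(math.exp(s - m) for s in scaled))
        total += log_Z - scaled[y]
    return total / len(labels)

# Hypothetical validation logits: confidently wrong on the second example.
val_logits = [[4.0, 0.5, 0.0], [3.5, 1.0, 0.2], [0.4, 3.8, 0.1], [0.2, 0.1, 3.0]]
val_labels = [0, 1, 1, 2]

# Grid-search the single scalar T that minimizes validation NLL.
best_T = min((t / 10 for t in range(5, 50)),
             key=lambda T: avg_nll(val_logits, val_labels, T))
print(best_T)   # > 1.0: the model was over-confident, so calibration softens it
```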
In reinforcement learning, temperature controls the exploration-exploitation tradeoff when using a softmax policy over action values. A high temperature encourages the agent to explore diverse actions, while a low temperature favors exploiting the currently best-known action. Many RL algorithms use an annealing schedule that starts with high temperature and gradually reduces it as the agent learns, mirroring the simulated annealing approach of Boltzmann machines [4].
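A sketch of softmax (Boltzmann) action selection with hypothetical action-value estimates:

```python
import math
import random

def boltzmann_policy(q_values, temperature, rng):
    # Sample an action with P(a) proportional to exp(Q(a) / T).
    exps = [math.exp(q / temperature) for q in q_values]
    total = sum(exps)
    weights = [e / total for e in exps]
    return rng.choices(range(len(q_values)), weights=weights)[0]

rng = random.Random(42)
q = [1.0, 0.5, -0.2]                 # hypothetical action-value estimates

# High temperature explores broadly; low temperature exploits the best action.
# An annealing schedule interpolates from the first regime to the second.
hot = [boltzmann_policy(q, 10.0, rng) for _ in range(1000)]
cold = [boltzmann_policy(q, 0.05, rng) for _ in range(1000)]
```

At T = 10 the three actions are sampled almost uniformly, while at T = 0.05 the highest-valued action is chosen essentially every time.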
Based on current research and industry experience, several best practices have emerged for working with temperature.
Start with the default and adjust based on evaluation. Most APIs default to T = 1.0, which works reasonably well for general tasks. Only adjust temperature after evaluating your application's outputs and identifying specific issues (too repetitive, too random, too many hallucinations).
Use low temperature for tasks with verifiable correctness. Code generation, math problems, data extraction, and factual question answering all benefit from temperatures between 0.0 and 0.3. When there is a single correct answer, you want the model to commit to its best guess.
Use moderate to high temperature for creative and open-ended tasks. Creative writing, brainstorming, and conversational applications work well with temperatures between 0.7 and 1.0. Going above 1.0 is rarely necessary and often counterproductive.
Be cautious with temperature above 1.5. Very high temperatures degrade output quality rapidly. The model starts producing grammatically broken or semantically incoherent text. If you need high diversity, consider using moderate temperature combined with techniques like self-consistency (sampling multiple outputs and selecting the best) rather than pushing temperature to extreme values.
Do not rely on temperature 0 for determinism. If your application requires perfectly reproducible outputs, you need additional measures beyond setting temperature to 0. Use fixed seeds where available, control batch sizes, and implement output caching [6].
Consider temperature interactions with other parameters. If you are also using top-p or top-k, be aware that temperature changes the effective behavior of those filters. A standard combination for balanced outputs is T = 0.7 with top-p = 0.9. For deterministic outputs, use T = 0 and disable other sampling parameters.
Respect locked temperatures on reasoning models. Models like OpenAI o3, o4-mini, and Google Gemini 3 with thinking have fixed or strongly recommended temperature settings. These models perform internal reasoning that is calibrated for specific temperature values; overriding them can cause degraded performance, repetitive loops, or other unexpected behavior [7] [9].
Use systematic evaluation, not intuition. Rather than guessing the best temperature, run your prompts at several temperature settings and evaluate the outputs against your quality criteria. Tools like promptfoo and other LLM evaluation frameworks support temperature sweeps as part of their testing pipelines [15].
Consider temperature combined with min-p for open-weight models. For models served through frameworks like vLLM or llama.cpp, the community has increasingly favored pairing temperature with min-p sampling, as min-p is less sensitive to temperature coupling than top-p [12].
Several misunderstandings about temperature persist in practice.
"Temperature 0 is always deterministic." As discussed above, this is not strictly true. Non-determinism from hardware-level floating-point differences and dynamic batching can cause outputs to vary even at T = 0 [6].
"Higher temperature means better creativity." There is a sweet spot. Beyond approximately T = 1.0, output quality tends to decline. Truly creative output requires the model to make meaningful connections, not just random ones. Extremely high temperatures produce randomness, not creativity.
"Temperature and top-p do the same thing." They are complementary, not redundant. Temperature affects the shape of the entire probability distribution before sampling, while top-p truncates the distribution after it has been shaped. A flat distribution (high T) filtered by top-p = 0.9 behaves very differently from a sharp distribution (low T) filtered by the same top-p value.
"The default temperature is always best." Defaults are a reasonable starting point, but many applications benefit significantly from tuning. A code-generation pipeline running at T = 1.0 is likely producing more errors than it would at T = 0.2.