Logits
Last reviewed
May 9, 2026
Sources
20 citations
Review status
Source-backed
Revision
v3 · 5,711 words
In machine learning and statistics, logits refer to the raw, unnormalized scores output by a model before they are converted into probabilities. The term has two closely related but distinct meanings: (1) the output of the logit function, which maps a probability to a log-odds value, and (2) the raw output values from the final layer of a neural network before an activation function such as softmax or sigmoid is applied. Both usages are connected by the idea that these values live on an unbounded real-number scale and must be transformed to become interpretable as probabilities.
Logits sit at the boundary between the linear, gradient-friendly inner workings of a model and the bounded, interpretable probability space that humans use to evaluate predictions. Almost every modern classifier, from a simple logistic regression fitted on tabular data to a frontier large language model generating text, computes logits as an intermediate quantity. Loss functions, calibration techniques, sampling strategies, and interpretability methods all manipulate logits directly, which makes the concept indispensable for anyone reading deep learning code or research papers.
The word "logit" was coined by American statistician Joseph Berkson in his 1944 paper "Application of the Logistic Function to Bio-Assay," published in the Journal of the American Statistical Association. Berkson constructed the term as an abbreviation of "logistic unit," deliberately mirroring Chester Ittner Bliss's earlier term "probit" (probability unit), which had been introduced in 1934. Berkson proposed the logit model (later known as logistic regression) as a computationally simpler alternative to the probit model that was prevalent at the time. His approach used the logarithm of the odds as the response variable, and the technique gradually gained widespread adoption across biostatistics, economics, and eventually machine learning.
Berkson (1899 to 1982) was head of the Division of Biometry and Medical Statistics at the Mayo Clinic from 1934 to 1964. He held degrees in physics, medicine, and statistics, and is also remembered for Berkson's paradox, a form of selection bias in observational studies. The term "log-odds" was popularized somewhat later by G. A. Barnard in 1949, although Charles Sanders Peirce had used closely related ideas as early as the late nineteenth century.
The softmax function, the multi-class generalization that is now most commonly paired with logits in deep learning, was named by John S. Bridle in his 1989 paper "Probabilistic Interpretation of Feedforward Classification Network Outputs." Bridle described softmax as "a normalised exponential, multi-input generalisation of the logistic non-linearity" and explicitly framed it as a differentiable substitute for a hard winner-take-all operation. The same mathematical form had appeared more than a century earlier in Boltzmann's 1868 work on statistical mechanics (the Boltzmann distribution), connecting modern classifiers to a long tradition in physics.
The logit function maps a probability value p in the open interval (0, 1) to a real number on the entire number line:
logit(p) = ln(p / (1 - p))
Here, p / (1 - p) is known as the odds of the event, and taking the natural logarithm yields the log-odds. The logit function is therefore also called the log-odds function. Equivalent expressions include ln(p) - ln(1 - p) and -ln(1/p - 1).
The domain of the logit function is (0, 1), and its range is all real numbers from negative infinity to positive infinity.
The logit function is the inverse of the standard logistic function (also called the sigmoid function):
sigma(x) = 1 / (1 + e^(-x))
Applying the logistic function to a logit value recovers the original probability. Conversely, applying the logit function to a probability produced by the sigmoid yields the original unbounded score. This inverse relationship is central to logistic regression, where the model's linear predictor (a weighted sum of features) produces logit values, and the sigmoid function converts them into predicted probabilities.
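The inverse relationship can be checked numerically. The following is a minimal sketch using only Python's standard library; the function names `logit` and `sigmoid` are chosen here for illustration and are not taken from any particular framework.

```python
import math

def logit(p: float) -> float:
    """Log-odds of a probability p in the open interval (0, 1)."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    return math.log(p / (1.0 - p))

def sigmoid(x: float) -> float:
    """Standard logistic function, branching on the sign to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

# Round trip: probability -> logit -> probability.
for p in (0.01, 0.25, 0.5, 0.9):
    z = logit(p)
    print(f"p={p:<5} logit={z:+.3f} sigmoid(logit)={sigmoid(z):.3f}")
```

Branching on the sign of `x` is one common way to keep the exponential's argument non-positive; other formulations are equally valid.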
The following table shows how probabilities map to logits and odds. It illustrates how logits stretch the (0, 1) probability range across the entire real line, with a logit of 0 corresponding to a fair 50/50 outcome.
| Probability p | Odds p/(1-p) | Logit ln(p/(1-p)) |
|---|---|---|
| 0.01 | 0.0101 | -4.595 |
| 0.10 | 0.111 | -2.197 |
| 0.25 | 0.333 | -1.099 |
| 0.50 | 1.000 | 0.000 |
| 0.75 | 3.000 | 1.099 |
| 0.90 | 9.000 | 2.197 |
| 0.99 | 99.00 | 4.595 |
| 0.999 | 999.0 | 6.907 |
A shift of about 2.3 logit units (ln 10 is approximately 2.303) corresponds to a tenfold change in odds, which is why logit-space differences are often interpreted in terms of decibans or evidence weights in Bayesian writing.
In the context of deep learning, the term "logits" most commonly refers to the raw output values of a neural network's final layer before any probability-producing activation function is applied. These values result from a linear transformation of the previous layer's activations:
z = Wx + b
where W is the weight matrix of the output layer, x is the input from the preceding layer, and b is the bias vector. The resulting vector z contains one logit value per class. These logit values can be any real number, positive or negative, and do not sum to any particular value.
The logits are then passed through an activation function to produce a probability distribution over the output classes. The choice of activation function depends on the classification task.
In strict statistical terms, only the output of a binary classifier with a sigmoid head literally equals the log-odds of the positive class. The vector of pre-softmax scores in a multi-class classifier is not a vector of log-odds in the same sense; it is identifiable only up to an additive constant, since softmax is invariant to adding the same value to every entry. The deep learning community nonetheless adopted the word "logits" for these vectors by analogy with the binary case, which is the most common source of confusion when the term first appears.
In binary classification, a neural network typically produces a single logit value z. The sigmoid function then maps this value to a probability:
P(y = 1 | x) = sigma(z) = 1 / (1 + e^(-z))
The probability of the negative class is simply the complement: P(y = 0 | x) = 1 - sigma(z). A logit of 0 corresponds to a probability of 0.5 (equal odds for both classes). Positive logits indicate the model favors the positive class, while negative logits indicate it favors the negative class. The magnitude of the logit reflects the model's confidence: a logit of 5.0 corresponds to a probability of about 0.993, while a logit of -5.0 corresponds to about 0.007.
For problems with more than two mutually exclusive classes, the softmax function generalizes the sigmoid to a vector of logits:
P(y = k | x) = e^(z_k) / sum(e^(z_j) for j = 1 to K)
where z_k is the logit for class k and K is the total number of classes. The softmax function ensures that all output probabilities are positive and sum to 1. In the special case where K = 2, softmax reduces to the sigmoid applied to the difference of the two logits: P(y = 1 | x) = sigma(z_1 - z_2).
The following table summarizes the key differences between the two approaches:
| Property | Sigmoid | Softmax |
|---|---|---|
| Number of classes | 2 (binary) | K >= 2 (multi-class) |
| Output range | (0, 1) per logit | (0, 1) per class, sums to 1 |
| Class relationship | Independent (can also be used for multi-label) | Mutually exclusive |
| Formula | 1 / (1 + e^(-z)) | e^(z_k) / sum(e^(z_j)) |
| Typical loss function | Binary cross-entropy | Categorical cross-entropy |
| Use case example | Spam detection, sentiment polarity | ImageNet classification, language modeling |
A useful property of softmax is translation invariance: adding the same constant c to every logit leaves the resulting probabilities unchanged, since exp(z_k + c) / sum(exp(z_j + c)) = exp(z_k) / sum(exp(z_j)). Implementations exploit this by subtracting max(z) from all logits before exponentiation, which keeps intermediate values bounded above by 1 and avoids floating-point overflow without changing the answer.
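The max-subtraction trick and the two-class reduction to sigmoid can both be demonstrated in a few lines. This is a sketch in plain Python rather than any framework's implementation; the example logits are arbitrary.

```python
import math

def softmax(logits):
    """Numerically stable softmax: subtract the max logit before exponentiating."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]  # every term lies in (0, 1]
    total = sum(exps)
    return [e / total for e in exps]

z = [2.0, 1.0, -1.0]
shifted = [v + 1000.0 for v in z]  # a naive exp(1002.0) would overflow

p = softmax(z)
p_shifted = softmax(shifted)       # same probabilities as p

# Two-class case: softmax over [z1, z0] matches sigmoid(z1 - z0).
z1, z0 = 1.5, -0.5
two_class = softmax([z1, z0])[0]
sig = 1.0 / (1.0 + math.exp(-(z1 - z0)))
```

Here `p` and `p_shifted` agree to floating-point precision, and `two_class` matches `sig`, which is the sense in which only differences between logits matter.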
In multi-label classification, where each input can belong to multiple classes simultaneously, the sigmoid function is applied independently to each logit. Unlike softmax, this produces probabilities that do not need to sum to 1, since the classes are not mutually exclusive. For example, an image might be tagged as both "outdoor" and "sunny" at the same time. Common examples include scene tagging in computer vision, medical imaging where multiple findings may be present in the same scan, and text classification where a document can belong to several topics.
In practice, cross-entropy loss is almost always computed directly from logits rather than from probabilities, for reasons of numerical stability. When softmax probabilities are computed as a separate step and then passed to a logarithm, the computation can produce extremely small values that underflow to zero, causing the logarithm to return negative infinity. Computing the log-softmax in a single fused operation avoids this problem by using the log-sum-exp trick:
log_softmax(z_k) = z_k - log(sum(e^(z_j) for j = 1 to K))
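In practice the log-sum-exp term is evaluated as m + log(sum(e^(z_j - m) for j = 1 to K)), where m = max(z_j). Subtracting the maximum logit before exponentiation keeps all intermediate values in a numerically stable range without changing the result.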
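A minimal sketch of the fused computation, assuming a plain Python list of logits and an integer class index as the target. The helper names are illustrative only.

```python
import math

def log_softmax(logits):
    """log(softmax(z)) via log-sum-exp with the maximum logit factored out."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]

def cross_entropy_from_logits(logits, target):
    """Negative log-likelihood of the target class, computed from raw logits."""
    return -log_softmax(logits)[target]

# Extreme logits: a separate softmax step would underflow the probability of
# class 1 to zero, and taking its logarithm would then fail.
logits = [1000.0, 0.0, -1000.0]
log_probs = log_softmax(logits)
loss = cross_entropy_from_logits(logits, target=1)
```

With these inputs the loss stays finite, because the logarithm is never applied to an underflowed probability.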
For binary cross-entropy with logits, PyTorch and TensorFlow both use a reformulation that avoids overflow:
L = max(z, 0) - y * z + log(1 + e^(-|z|))
where z is the logit and y is the target label (0 or 1).
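The reformulation can be written directly from the expression above. This is a sketch of the formula itself, not the source code of either framework.

```python
import math

def bce_with_logits(z: float, y: float) -> float:
    """Binary cross-entropy from a raw logit z and a target y in {0, 1}."""
    # exp(-|z|) never receives a positive argument, so it cannot overflow.
    return max(z, 0.0) - y * z + math.log1p(math.exp(-abs(z)))

moderate = bce_with_logits(0.7, 1.0)
# In the naive form, sigmoid(-800) requires exp(800), which overflows a float.
extreme = bce_with_logits(-800.0, 1.0)
```

For moderate logits the result matches -log(sigma(z)) computed the naive way; for extreme logits the loss grows linearly in |z| instead of failing.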
The following table lists the primary loss functions that accept raw logits in popular deep learning frameworks:
| Framework | Loss function | Task |
|---|---|---|
| PyTorch | torch.nn.CrossEntropyLoss | Multi-class classification |
| PyTorch | torch.nn.BCEWithLogitsLoss | Binary / multi-label classification |
| TensorFlow | tf.nn.softmax_cross_entropy_with_logits | Multi-class classification |
| TensorFlow | tf.nn.sigmoid_cross_entropy_with_logits | Binary / multi-label classification |
| JAX / Flax | optax.softmax_cross_entropy_with_integer_labels | Multi-class classification |
| JAX / Flax | optax.sigmoid_binary_cross_entropy | Binary / multi-label classification |
PyTorch's CrossEntropyLoss internally combines LogSoftmax and NLLLoss (negative log-likelihood loss) into a single numerically stable operation. Passing pre-computed softmax probabilities to a separate log function is generally discouraged because it sacrifices this stability.
Production loss functions on logits often expose two additional knobs. Label smoothing replaces the one-hot target distribution with a softened version that keeps most of the probability on the correct class and spreads a small total mass epsilon over the alternatives (in the common uniform variant, epsilon / K goes to every class, so the correct class receives 1 - epsilon + epsilon / K). This regularizer, popularized by Szegedy et al. (2016) for ImageNet training, mitigates over-confidence by removing the incentive for the correct-class logit to grow without bound. Class weighting multiplies the per-sample loss by a value that depends on the class label, which is useful for imbalanced datasets. Both options are accepted by torch.nn.CrossEntropyLoss through its label_smoothing and weight arguments and act on the same logit tensor that the loss already consumes.
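The effect of both knobs can be sketched in plain Python. The version below assumes uniform smoothing over all K classes and applies the class weight as a single multiplier on the per-sample loss. That weighting is a simplification for illustration and is not claimed to reproduce any framework's exact reduction.

```python
import math

def log_softmax(logits):
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]

def smoothed_cross_entropy(logits, target, epsilon=0.1, class_weights=None):
    """Cross-entropy against a label-smoothed target, computed from raw logits."""
    k = len(logits)
    # Uniform smoothing: epsilon / k on every class, plus 1 - epsilon on the target.
    q = [epsilon / k] * k
    q[target] += 1.0 - epsilon
    loss = -sum(q_i * lp for q_i, lp in zip(q, log_softmax(logits)))
    # Simplified class weighting (assumption): scale the whole per-sample loss.
    if class_weights is not None:
        loss *= class_weights[target]
    return loss

confident = [10.0, 0.0, 0.0]  # hypothetical, very confident logits
plain = smoothed_cross_entropy(confident, target=0, epsilon=0.0)
smoothed = smoothed_cross_entropy(confident, target=0, epsilon=0.1)
weighted = smoothed_cross_entropy(confident, target=0, epsilon=0.1,
                                  class_weights=[2.0, 1.0, 1.0])
```

With smoothing enabled, the very confident prediction incurs a noticeably larger loss than without it, which is the pressure that discourages runaway logits.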
Temperature scaling is a technique that divides logits by a scalar parameter T (called the temperature) before applying the softmax function:
P(y = k | x) = e^(z_k / T) / sum(e^(z_j / T) for j = 1 to K)
The temperature parameter controls how "sharp" or "flat" the resulting probability distribution is:
| Temperature value | Effect on distribution | Behavior |
|---|---|---|
| T < 1 | Sharper (more peaked) | Model becomes more confident; highest-probability class dominates |
| T = 1 | Standard softmax | Default behavior with no modification |
| T > 1 | Flatter (more uniform) | Model becomes less confident; probability mass spreads across classes |
| T approaches 0 | Approaches argmax | Deterministic selection of the highest-logit class |
| T approaches infinity | Approaches uniform distribution | All classes become equally likely |
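
The limiting behaviors in the table can be observed by tracking the largest probability and the Shannon entropy as T varies. The sketch below uses arbitrary example logits and standard-library Python only.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax of logits / temperature, with the usual max subtraction."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

logits = [2.0, 1.0, 0.0]
for t in (0.01, 0.5, 1.0, 2.0, 1000.0):
    probs = softmax(logits, t)
    print(f"T={t:<7} max_prob={max(probs):.3f} entropy={entropy(probs):.3f}")
```

Entropy rises monotonically with T in this example, approaching ln(3) for a three-class uniform distribution at very high temperature and zero as T approaches 0.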
Temperature scaling is used in several important contexts.
In their 2015 paper "Distilling the Knowledge in a Neural Network," Geoffrey Hinton, Oriol Vinyals, and Jeff Dean introduced the use of high-temperature softmax outputs as "soft targets" for training smaller student models. When a large teacher model's logits are divided by a high temperature (typically between 2 and 20), the resulting probability distribution reveals the teacher's learned relationships between classes. For instance, a digit classifier might assign a small but meaningful probability to "7" when shown a "1," because the two digits share visual features. These soft targets carry more information than hard one-hot labels, allowing a smaller model to learn from the teacher's "dark knowledge." During distillation training, both the teacher and student use the same temperature value.
A typical knowledge distillation loss combines a standard hard-label cross-entropy term with a soft-target term computed at temperature T. The soft-target term is multiplied by T squared to keep gradient magnitudes comparable, since softening the distribution at temperature T divides gradients by T squared. The following table summarizes representative hyperparameter ranges from the original paper and follow-up work.
| Hyperparameter | Typical range | Role |
|---|---|---|
| Temperature T | 2 to 20 | Higher T exposes more relative ranking information among non-target classes |
| Soft-target weight alpha | 0.5 to 0.9 | Weight on the distillation loss relative to the hard-label loss |
| Gradient scaling factor | T^2 | Multiplied into the soft-target loss to compensate for the 1/T^2 gradient shrinkage |
| Batch size for distillation | Same as teacher pretraining or larger | Distillation often benefits from larger batches |
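
A per-example version of the combined loss might look like the sketch below. It uses the KL divergence from teacher to student as the soft-target term, which differs from the soft-target cross-entropy only by a constant that does not depend on the student. The function name, default values, and example logits are illustrative assumptions.

```python
import math

def log_softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    lse = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - lse for s in scaled]

def distillation_loss(student_logits, teacher_logits, target,
                      temperature=4.0, alpha=0.7):
    """alpha * T^2 * KL(teacher || student at T) + (1 - alpha) * hard-label CE."""
    t_logp = log_softmax(teacher_logits, temperature)
    s_logp = log_softmax(student_logits, temperature)
    # Soft-target term: both distributions are softened with the same T.
    soft = sum(math.exp(tl) * (tl - sl) for tl, sl in zip(t_logp, s_logp))
    # Hard-label term: ordinary cross-entropy at T = 1.
    hard = -log_softmax(student_logits)[target]
    return alpha * temperature ** 2 * soft + (1.0 - alpha) * hard

teacher = [6.0, 2.0, -1.0]  # hypothetical teacher logits
matched = distillation_loss(teacher, teacher, target=0)
mismatched = distillation_loss([1.0, 3.0, 0.0], teacher, target=0)
```

When the student's logits equal the teacher's, the soft term vanishes and only the weighted hard-label term remains; any mismatch increases the loss.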
Guo et al. (2017) demonstrated in "On Calibration of Modern Neural Networks" that modern deep networks tend to be poorly calibrated, meaning their predicted probabilities do not accurately reflect the true likelihood of correctness. Temperature scaling with a single learned parameter T, optimized on a validation set to minimize negative log-likelihood, can restore calibration with minimal computational cost. This post-processing technique simply divides all logits by T before applying softmax and has been shown to be surprisingly effective.
Temperature scaling sits within a wider family of post-hoc calibration methods that act on logits, including Platt scaling (a learned affine transform followed by sigmoid for binary models), histogram binning, isotonic regression, and Dirichlet calibration. All of these operate on the validation logits and probabilities of an already trained model, leaving the network weights untouched. Calibration quality is typically measured by expected calibration error (ECE) and reliability diagrams, both of which compare predicted confidence with empirical accuracy across binned probability ranges.
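Fitting the single temperature parameter is a one-dimensional optimization over held-out logits. The sketch below uses a coarse grid search on a small made-up validation set in which the model is confidently wrong on some examples; real implementations typically use a gradient-based optimizer and far more data.

```python
import math

def nll(logits, label, temperature):
    """Negative log-likelihood of `label` under softmax(logits / temperature)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    lse = m + math.log(sum(math.exp(s - m) for s in scaled))
    return lse - scaled[label]

def mean_nll(dataset, temperature):
    return sum(nll(z, y, temperature) for z, y in dataset) / len(dataset)

# Toy (logits, true label) pairs, invented for illustration: high confidence
# everywhere, but two of the six predictions are wrong.
validation = [
    ([4.0, 0.0, 0.0], 0),
    ([4.0, 0.0, 0.0], 1),
    ([0.0, 5.0, 0.0], 1),
    ([0.0, 0.0, 6.0], 2),
    ([3.0, 0.0, 0.0], 2),
    ([0.0, 4.0, 0.0], 1),
]

candidates = [0.5 + 0.05 * i for i in range(91)]  # T from 0.5 to 5.0
best_t = min(candidates, key=lambda t: mean_nll(validation, t))
```

Because the toy model is over-confident, the selected temperature is greater than 1, which flattens the predicted distributions. Accuracy is unaffected, since dividing by a positive T never changes the argmax.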
In autoregressive large language models (LLMs) such as GPT-4, Claude, and LLaMA, logits play a central role in next-token prediction. At each generation step, the model produces a vector of logits with one entry per token in its vocabulary (often 32,000 to 128,000 or more tokens). These logits represent the model's assessment of how likely each token is to come next given the preceding context.
The logits are then transformed into a probability distribution (typically via softmax with optional temperature scaling), and a token is sampled from this distribution. Several sampling strategies operate directly on the logit or probability space.
| Strategy | Description |
|---|---|
| Greedy decoding | Always selects the token with the highest logit (argmax). Deterministic but can produce repetitive output. |
| Temperature sampling | Divides logits by temperature T before softmax. Higher T increases diversity; lower T increases focus. |
| Top-k sampling | After computing probabilities from logits, keeps only the k most probable tokens and redistributes probability mass among them. |
| Top-p sampling (nucleus) | Sorts tokens by probability, selects the smallest set whose cumulative probability exceeds threshold p (typically 0.9 or 0.95), and samples from that set. Introduced by Holtzman et al. (2019). |
| Min-p sampling | Filters out tokens whose probability is below a fraction of the most probable token's probability. Adapts dynamically to the shape of the distribution. |
| Typical sampling | Keeps tokens whose information content (negative log-probability) is closest to the entropy of the distribution, suppressing both extreme outliers and overly dominant tokens. |
| Eta sampling | Dynamically sets a probability cutoff using a function of the distribution's entropy, generalizing epsilon and typical sampling. |
| Beam search | Maintains multiple candidate sequences (beams), extending each by the highest-logit tokens at each step. Common in machine translation. |
In practice, these strategies are often combined. A typical LLM inference pipeline applies temperature scaling first, then filters candidates with top-k or top-p, and finally samples from the remaining tokens. API providers such as OpenAI, Anthropic, and Google expose temperature, top-k, and top-p as user-configurable parameters, although the exact set of exposed knobs varies by provider.
Consider a tiny vocabulary of five tokens with logits [3.0, 2.5, 1.0, 0.5, -1.0]. The table below traces how different transformations reshape the resulting distribution.
| Step | Operation | Resulting probabilities (rounded) |
|---|---|---|
| 1 | Softmax with T = 1 | [0.543, 0.329, 0.073, 0.045, 0.010] |
| 2 | Softmax with T = 0.5 | [0.718, 0.264, 0.013, 0.005, 0.000] |
| 3 | Softmax with T = 2.0 | [0.389, 0.303, 0.143, 0.112, 0.053] |
| 4 | Top-k with k = 2 (after T=1) | [0.622, 0.378, 0, 0, 0] |
| 5 | Top-p with p = 0.9 (after T=1) | [0.574, 0.348, 0.078, 0, 0] (the top two tokens cover only 0.872, so the third is kept before renormalizing) |
The example shows that lowering temperature concentrates mass on the leading token, while top-k and top-p filtering simply zero out unlikely tokens before renormalizing.
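The transformations traced above can be recomputed with a short script. This is a sketch in plain Python; the helper names are illustrative, and the top-p rule keeps the smallest prefix of the sorted tokens whose cumulative probability reaches p.

```python
import math
import random

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def renormalize(probs, keep):
    """Zero out every index not in `keep`, then rescale to sum to 1."""
    kept = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    total = sum(kept)
    return [p / total for p in kept]

def top_k(probs, k):
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return renormalize(probs, set(order[:k]))

def top_p(probs, p):
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = set(), 0.0
    for i in order:
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= p:  # smallest set reaching the threshold
            break
    return renormalize(probs, keep)

logits = [3.0, 2.5, 1.0, 0.5, -1.0]
base = softmax(logits)
sharp = softmax(logits, temperature=0.5)
flat = softmax(logits, temperature=2.0)
top2 = top_k(base, 2)
nucleus = top_p(base, 0.9)

# Draw one token index from the filtered distribution (seeded for repeatability).
sampled = random.Random(0).choices(range(len(nucleus)), weights=nucleus)[0]
```

Filtering on probabilities and then renormalizing is equivalent to setting the discarded logits to negative infinity before the softmax, which is how many inference libraries express it.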
Some LLM APIs allow users to add a fixed bias to specific token logits before sampling. This can be used to encourage or discourage particular tokens. For example, adding a large negative bias to the logit for a profanity token effectively prevents the model from generating it, while a positive bias can steer the model toward a preferred vocabulary.
OpenAI's chat completions API exposes a logit_bias field that maps token IDs to additive biases. Acceptable values range from -100 to +100, with values near the extremes acting as effective bans and near-forced selections respectively. Anthropic's Claude Messages API does not expose a logit bias parameter. The Hugging Face Transformers library implements similar functionality through LogitsProcessor subclasses, which transform the logit tensor at each step of a generate() call.
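Mechanically, a logit bias is an element-wise addition before the softmax. The sketch below illustrates the idea on a hypothetical three-token vocabulary; `apply_logit_bias` is a name invented here and does not correspond to any provider's client library.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def apply_logit_bias(logits, bias):
    """Add a per-token offset; `bias` maps token index -> additive value."""
    return [z + bias.get(i, 0.0) for i, z in enumerate(logits)]

logits = [2.0, 1.0, 0.5]  # hypothetical three-token vocabulary
banned = softmax(apply_logit_bias(logits, {0: -100.0}))  # token 0 suppressed
forced = softmax(apply_logit_bias(logits, {2: 100.0}))   # token 2 dominates
```

A bias of -100 leaves the token with a vanishingly small but nonzero probability, which is why such values behave like bans in practice without being strict ones.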
The transformers library packages dozens of logit transformations under the LogitsProcessor abstraction. Each processor receives the current input IDs and the logits for the next token, mutates the logits, and passes them to the next processor in a LogitsProcessorList. Categories include warpers that reshape the distribution (temperature, top-k, top-p, min-p, typical, eta), repetition controls (RepetitionPenaltyLogitsProcessor, NoRepeatNGramLogitsProcessor), length controls (MinLengthLogitsProcessor, ExponentialDecayLengthPenalty), token restrictions (NoBadWordsLogitsProcessor, SuppressTokensLogitsProcessor, ForcedBOSTokenLogitsProcessor), structured-output controls (PrefixConstrainedLogitsProcessor), and watermarking (SynthIDTextWatermarkLogitsProcessor). Together they let users implement constrained decoding, JSON-mode generation, and bespoke sampling rules without retraining the underlying model.
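The chaining pattern can be illustrated without the library itself. The sketch below is a plain-Python analogue, not the transformers API: each processor is a function of the token history and the current logits, and the penalty rule shown (divide positive logits, multiply negative ones) is one commonly used formulation.

```python
def repetition_penalty(penalty):
    """Make already-generated tokens less likely (for penalty > 1)."""
    def process(input_ids, logits):
        out = list(logits)
        for token in set(input_ids):
            # Dividing a positive logit or multiplying a negative one
            # both move the value downward.
            out[token] = out[token] / penalty if out[token] > 0 else out[token] * penalty
        return out
    return process

def suppress_tokens(banned):
    """Force the listed token indices to negative infinity."""
    def process(input_ids, logits):
        return [float("-inf") if i in banned else z for i, z in enumerate(logits)]
    return process

def run_chain(processors, input_ids, logits):
    """Apply each processor in order, feeding its output to the next."""
    for processor in processors:
        logits = processor(input_ids, logits)
    return logits

chain = [repetition_penalty(1.5), suppress_tokens({3})]
history = [0, 2, 0]           # hypothetical token IDs generated so far
raw = [3.0, 1.0, -2.0, 4.0]   # hypothetical next-token logits
processed = run_chain(chain, history, raw)
```

Returning new lists rather than modifying the input keeps each step easy to test in isolation; library implementations may operate on tensors and make different choices.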
Speculative decoding, introduced by Chen et al. (2023) and Leviathan et al. (2023), accelerates LLM inference by drafting candidate tokens with a smaller model and then verifying them in a single batched forward pass through the larger target model. The verification step compares the draft model's logits to the target model's logits and accepts or rejects each draft token using a modified rejection sampling scheme that provably preserves the target model's distribution. Reported speedups range from 2x to 3x on production-scale models without any change to output quality.
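The acceptance rule for a single drafted token can be sketched as follows: accept with probability min(1, p/q), where p and q are the target and draft probabilities of that token, and on rejection resample from the normalized positive part of p - q. The distributions and the simulation below are toy values chosen for illustration.

```python
import random

def verify_draft_token(token, target_probs, draft_probs, rng):
    """Accept the drafted token or resample so that outputs follow target_probs."""
    accept = min(1.0, target_probs[token] / draft_probs[token])
    if rng.random() < accept:
        return token
    # Residual distribution: where the target puts more mass than the draft.
    residual = [max(0.0, p - q) for p, q in zip(target_probs, draft_probs)]
    return rng.choices(range(len(residual)), weights=residual)[0]

target = [0.6, 0.3, 0.1]  # hypothetical target-model distribution
draft = [0.3, 0.5, 0.2]   # hypothetical draft-model distribution
rng = random.Random(0)
trials = 20000
counts = [0, 0, 0]
for _ in range(trials):
    proposal = rng.choices(range(3), weights=draft)[0]
    counts[verify_draft_token(proposal, target, draft, rng)] += 1
empirical = [c / trials for c in counts]
```

Although every proposal comes from the draft distribution, the empirical frequencies converge to the target distribution, which is the property that lets the method leave the target model's output distribution unchanged.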
Contrastive decoding, proposed by Li et al. (2022) and accepted at ACL 2023, scores candidate tokens by the difference between an expert (large) model's log-probabilities and an amateur (small) model's log-probabilities. The intuition is that failure modes such as repetition and dullness are amplified in the smaller model, so subtracting its logits highlights tokens where the expert has genuine, non-trivial preferences. The method is implemented purely as a logit transformation: compute logits from both models, subtract, apply a plausibility constraint to filter out implausible candidates, and then sample as usual.
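A greedy variant of the scoring rule can be sketched in a few lines. The plausibility constraint below keeps tokens whose expert probability is at least alpha times the expert's maximum probability; the logits and the value alpha = 0.1 are illustrative assumptions.

```python
import math

def log_softmax(logits):
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]

def contrastive_scores(expert_logits, amateur_logits, alpha=0.1):
    """Expert minus amateur log-probability, restricted to plausible tokens."""
    expert_lp = log_softmax(expert_logits)
    amateur_lp = log_softmax(amateur_logits)
    # Plausibility: p_expert(token) >= alpha * max p_expert, in log space.
    cutoff = math.log(alpha) + max(expert_lp)
    return [e - a if e >= cutoff else float("-inf")
            for e, a in zip(expert_lp, amateur_lp)]

# Hypothetical vocabulary: token 0 is generic, token 1 is specific, token 2 is implausible.
expert = [2.0, 1.8, -3.0]
amateur = [2.0, 0.0, -3.5]
scores = contrastive_scores(expert, amateur)
greedy_choice = max(range(3), key=lambda i: expert[i])
contrastive_choice = max(range(3), key=lambda i: scores[i])
```

In this toy case the expert's greedy choice is the generic token, while the contrastive score prefers the token the amateur underrates. The plausibility mask prevents the implausible token from winning merely because the amateur dislikes it even more.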
| Manipulation | What it does to logits | Typical use |
|---|---|---|
| Temperature scaling | Divide by T | Tune randomness |
| Top-k truncation | Set all but top k logits to negative infinity | Cap low-quality tail |
| Top-p (nucleus) truncation | Mask out logits whose tokens fall outside the top probability mass p | Adapt cutoff to distribution shape |
| Repetition penalty | Scale down, or subtract a penalty from, the logits of previously generated tokens | Reduce loops |
| Logit bias | Add a fixed offset per token | Force or ban specific tokens |
| Speculative decoding | Compare draft and target logits | Accelerate inference |
| Contrastive decoding | Subtract amateur log-probabilities from expert log-probabilities | Suppress generic continuations |
Logits are also a primary target of mechanistic interpretability research, which studies the internal computations of neural networks at the level of individual circuits.
The logit lens is an interpretability technique introduced by an author writing under the name nostalgebraist on LessWrong on August 31, 2020. It applies the model's final unembedding matrix (the same matrix used to convert the last hidden state into output logits) to the residual stream activations at every intermediate layer of a transformer. This produces an "intermediate logit" distribution at each depth, allowing researchers to read off the model's evolving prediction for the next token as information flows through the network.
The usual qualitative findings are that early layers produce broad, often uninterpretable distributions; middle layers begin to assign mass to grammatically and semantically plausible tokens; and the final few layers refine the prediction toward the model's eventual choice. Early evidence suggests that residual streams quickly leave token space and operate in a predictive space throughout the network, which has been influential in shaping later interpretability theories.
The tuned lens, introduced by Belrose, Ostrovsky, McKinney, Furman, Smith, Halawi, Biderman, and Steinhardt (2023), is a successor to the logit lens that addresses several of its weaknesses. Instead of applying the unembedding matrix directly to intermediate residual streams, the tuned lens learns a small affine probe at each layer that maps the residual stream into a distribution over the vocabulary. The authors evaluated the technique on autoregressive language models up to 20B parameters and reported that it is more predictive, more reliable, and less biased than the original logit lens. They also demonstrated practical applications, including detecting prompt injections by examining the trajectory of intermediate predictions across layers.
Within the transformer circuits research program at Anthropic, a popular analytical tool is the logit difference: the difference between the logit assigned to one specific token and the logit assigned to another, both measured at the model's final layer. Because softmax is invariant to additive constants, the logit difference is the meaningful signed quantity that determines which of two candidates the model prefers.
Direct logit attribution decomposes a logit difference into additive contributions from each attention head, MLP layer, and individual residual stream component. The decomposition is exact for the linear unembedding step and reveals which circuit components push the prediction toward one token versus another. This technique was central to the discovery and analysis of induction heads, which complete repeated patterns of the form [A][B] ... [A] by raising the logit of token B whenever the current token A was followed by B earlier in the context.
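The additivity follows from linearity: if the final residual stream is a sum of component outputs, its dot product with the direction (unembedding row for A minus unembedding row for B) splits into one term per component. The toy sketch below uses made-up three-dimensional vectors and ignores the final layer normalization that real transformers apply before unembedding.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Hypothetical outputs written to a 3-dimensional residual stream.
component_outputs = {
    "embedding": [0.5, 0.0, -1.0],
    "attn_head_0": [1.0, 0.25, 0.0],
    "mlp_0": [-0.25, 0.5, 1.25],
}
# Hypothetical unembedding rows for two candidate tokens A and B.
unembed_a = [1.0, 0.0, 2.0]
unembed_b = [0.0, 1.0, 1.0]

# Direction in residual space whose dot product gives logit(A) - logit(B).
direction = [a - b for a, b in zip(unembed_a, unembed_b)]
contributions = {name: dot(vec, direction)
                 for name, vec in component_outputs.items()}

# The same quantity computed from the summed residual stream.
residual = [sum(vals) for vals in zip(*component_outputs.values())]
logit_difference = dot(residual, unembed_a) - dot(residual, unembed_b)
```

The per-component contributions sum exactly to the total logit difference, so a large positive entry identifies a component that pushes the prediction toward token A.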
In logistic regression, the logit appears in its original statistical sense. The model assumes that the log-odds of the positive class is a linear function of the input features:
logit(P(y = 1 | x)) = ln(P(y = 1 | x) / P(y = 0 | x)) = w^T x + b
Solving for the probability gives the familiar logistic function:
P(y = 1 | x) = 1 / (1 + e^(-(w^T x + b)))
Logistic regression can be viewed as a single-layer neural network with a sigmoid activation, making it the simplest case of using logits for classification. The connection between classical logistic regression and modern neural network output layers is direct: both compute a linear combination of inputs and interpret the result as a logit.
Maximum likelihood estimation for logistic regression maximizes the log-likelihood of the training labels under the logit-based model, which is equivalent to minimizing binary cross-entropy on the predicted probabilities. The Newton-Raphson and iteratively reweighted least squares (IRLS) algorithms are classical solvers for this objective; modern statistical software such as scikit-learn's LogisticRegression and statsmodels' Logit uses solvers such as L-BFGS, liblinear, SAGA, or IRLS, depending on the solver and regularization settings. Because logistic regression learns interpretable per-feature coefficients, the fitted weights are themselves log-odds ratios: a coefficient of 0.7 implies that a one-unit increase in the corresponding feature multiplies the odds by e^0.7, or approximately 2.01, all else equal. This interpretability is one reason logistic regression remains a workhorse in epidemiology, credit scoring, and clinical risk modeling.
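The odds-ratio reading of a coefficient can be verified numerically. The sketch below assumes a single-feature model with a hypothetical coefficient of 0.7 and an arbitrary intercept.

```python
import math

w, b = 0.7, -1.2  # hypothetical fitted coefficient and intercept

def predicted_probability(x):
    """P(y = 1 | x) under the logistic model with linear predictor w * x + b."""
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

def odds(p):
    return p / (1.0 - p)

# Ratio of odds after a one-unit increase in the feature (from 2.0 to 3.0).
odds_ratio = odds(predicted_probability(3.0)) / odds(predicted_probability(2.0))
```

The ratio equals e^0.7 regardless of the starting value of x or the intercept, because the intercept and the baseline cancel in log-odds space.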
Several conventions and best practices apply when working with logits in code:
- Loss functions such as CrossEntropyLoss in PyTorch expect raw logits. Applying softmax first and then passing the result leads to numerically unstable gradients and incorrect loss values.
- For binary and multi-label outputs in PyTorch, use BCEWithLogitsLoss instead of BCELoss with a manual sigmoid. In TensorFlow, use from_logits=True in the loss constructor.

The table below summarizes how to access raw logits from common interfaces. Logits exposure is sometimes restricted for security or product reasons.
| Interface | Access to logits | Notes |
|---|---|---|
| OpenAI Chat Completions API | logprobs field on completions | Returns top log-probabilities, not raw logits; influence via logit_bias (-100 to +100) |
| Anthropic Claude Messages API | Not exposed | Only sampling parameters such as temperature, top_p, and top_k are configurable |
| Hugging Face transformers (PyTorch) | model(...).logits | Returns the full vocabulary tensor; can be intercepted with a LogitsProcessor |
| vLLM and TensorRT-LLM | Custom logits processors and sampling parameters | Tuned for high-throughput serving |
| llama.cpp | llama_get_logits C API | Returns raw fp32 logits per token |
| scikit-learn LogisticRegression | decision_function(X) | Returns the raw linear score (logit) for binary problems |
| Term | Definition | Range | Relationship to logits |
|---|---|---|---|
| Logit (statistical) | log(p / (1 - p)), the log-odds of a probability | (-inf, +inf) | The original definition |
| Logit (neural network) | Raw output of the final dense layer before activation | (-inf, +inf) | Named by analogy; not always literal log-odds |
| Probability | Normalized value representing likelihood | [0, 1] | Obtained from logits via sigmoid or softmax |
| Log-probability | Natural log of a probability | (-inf, 0] | log(softmax(logits)) |
| Odds | p / (1 - p) | (0, +inf) | exp(logit) |
| Score | General term for a model's raw output | Varies | Logits are a specific type of score |
| Probit | Inverse of the standard normal CDF | (-inf, +inf) | Analogous quantity from probit regression, introduced by Bliss in 1934 |
| Decibans / nats | Units of evidence used in information theory | (-inf, +inf) | A logit can be expressed as a number of nats (natural log) of evidence |
Suppose a language model with a vocabulary of 50,000 tokens processes the prompt "The capital of France is" and produces a logit vector z of length 50,000. The token "Paris" has logit z_Paris = 9.4, and the next-best token "the" has logit z_the = 5.2, while the rest of the distribution lies below 4.0.
This example illustrates how a single scalar (the logit difference between the top two tokens) controls almost everything about the next-token distribution, and how each sampling parameter can be understood as a deterministic transformation of the logit vector.
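The role of the logit difference can be made concrete with the two stated values. The sketch below compares only "Paris" and "the" head to head and deliberately ignores the rest of the vocabulary, whose logits are not specified beyond lying below 4.0, so the numbers are upper bounds on the true next-token probability of "Paris".

```python
import math

z_paris, z_the = 9.4, 5.2
gap = z_paris - z_the  # logit difference between the top two tokens

def head_to_head(gap, temperature=1.0):
    """P(top token) when only the two leading tokens are compared."""
    return 1.0 / (1.0 + math.exp(-gap / temperature))

# How many times more likely "Paris" is than "the" at T = 1.
odds_ratio = math.exp(gap)

for t in (0.5, 1.0, 2.0):
    print(f"T={t}: P(Paris | Paris or the) = {head_to_head(gap, t):.4f}")
```

A gap of 4.2 makes "Paris" roughly 67 times more likely than "the" at T = 1; raising the temperature shrinks the effective gap to 4.2 / T and narrows that lead.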
Imagine you are trying to guess what flavor of ice cream your friend likes best. Before you figure out the actual chances (like "60% chance they like chocolate, 30% vanilla, 10% strawberry"), your brain first comes up with some rough numbers. Maybe chocolate gets a score of 3, vanilla gets 1, and strawberry gets -1. These rough scores are logits. They are just your gut feeling before you turn them into neat percentages.
To go from these rough scores to real percentages, you use a special math recipe called softmax. It takes your scores and squishes them so they are all between 0 and 1 and add up to 100%. The bigger the score, the bigger the percentage. That is essentially what a computer does when it uses logits to make predictions.
If you turn up a knob called temperature, the computer becomes less sure. The chocolate, vanilla, and strawberry chances become closer together, like 40%, 35%, 25%. Turn the knob down, and the computer becomes more confident, picking chocolate almost every time. Big language models work the same way. They produce a long list of logits, one per word in their dictionary, and the temperature knob lets you decide how surprising or boring their answers will be.