In machine learning and statistics, logits refer to the raw, unnormalized scores output by a model before they are converted into probabilities. The term has two closely related but distinct meanings: (1) the output of the logit function, which maps a probability to a log-odds value, and (2) the raw output values from the final layer of a neural network before an activation function such as softmax or sigmoid is applied. Both usages are connected by the idea that these values live on an unbounded real-number scale and must be transformed to become interpretable as probabilities.
The word "logit" was coined by American statistician Joseph Berkson in his 1944 paper "Application of the Logistic Function to Bio-Assay," published in the Journal of the American Statistical Association. Berkson constructed the term as an abbreviation of "logistic unit," deliberately mirroring Chester Ittner Bliss's earlier term "probit" (probability unit), which had been introduced in 1934. Berkson proposed the logit model (later known as logistic regression) as a computationally simpler alternative to the probit model that was prevalent at the time. His approach used the logarithm of the odds ratio as the response variable, and the technique gradually gained widespread adoption across biostatistics, economics, and eventually machine learning.
The logit function maps a probability value p in the open interval (0, 1) to a real number on the entire number line:
logit(p) = ln(p / (1 - p))
Here, p / (1 - p) is known as the odds of the event, and taking the natural logarithm yields the log-odds. The logit function is therefore also called the log-odds function. Equivalent expressions include ln(p) - ln(1 - p) and -ln(1/p - 1).
The domain of the logit function is (0, 1), and its range is all real numbers from negative infinity to positive infinity.
The logit function is the inverse of the standard logistic function (also called the sigmoid function):
sigma(x) = 1 / (1 + e^(-x))
Applying the logistic function to a logit value recovers the original probability. Conversely, applying the logit function to a probability produced by the sigmoid yields the original unbounded score. This inverse relationship is central to logistic regression, where the model's linear predictor (a weighted sum of features) produces logit values, and the sigmoid function converts them into predicted probabilities.
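A minimal sketch of this round trip in NumPy (the `sigmoid` and `logit` functions below are local helpers written out for illustration, not library calls):

```python
import numpy as np

def sigmoid(x):
    # Standard logistic function: maps any real number to (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

def logit(p):
    # Log-odds: maps a probability in (0, 1) back to the real line.
    return np.log(p / (1.0 - p))

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
p = sigmoid(x)
print(p)          # approximately [0.0067 0.2689 0.5 0.7311 0.9933]
print(logit(p))   # recovers the original values [-5. -1. 0. 1. 5.]
```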
In the context of deep learning, the term "logits" most commonly refers to the raw output values of a neural network's final layer before any probability-producing activation function is applied. These values result from a linear transformation of the previous layer's activations:
z = Wx + b
where W is the weight matrix of the output layer, x is the input from the preceding layer, and b is the bias vector. The resulting vector z contains one logit value per class. These logit values can be any real number, positive or negative, and do not sum to any particular value.
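As an illustrative sketch (the layer sizes and random values are arbitrary), the logits for a 3-class problem are just a matrix-vector product plus a bias:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical output layer for a 3-class problem with 4 input features.
W = rng.normal(size=(3, 4))   # weight matrix of the output layer
b = rng.normal(size=3)        # bias vector
x = rng.normal(size=4)        # activations from the preceding layer

z = W @ x + b                 # one raw, unbounded logit per class
print(z)                      # arbitrary real numbers; they need not sum to anything
```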
The logits are then passed through an activation function to produce a probability distribution over the output classes. The choice of activation function depends on the classification task.
In binary classification, a neural network typically produces a single logit value z. The sigmoid function then maps this value to a probability:
P(y = 1 | x) = sigma(z) = 1 / (1 + e^(-z))
The probability of the negative class is simply the complement: P(y = 0 | x) = 1 - sigma(z). A logit of 0 corresponds to a probability of 0.5 (equal odds for both classes). Positive logits indicate the model favors the positive class, while negative logits indicate it favors the negative class. The magnitude of the logit reflects the model's confidence: a logit of 5.0 corresponds to a probability of about 0.993, while a logit of -5.0 corresponds to about 0.007.
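A quick check of these numbers, using a hand-written sigmoid helper:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for z in (-5.0, 0.0, 5.0):
    p = sigmoid(z)
    print(f"logit {z:+.1f} -> P(y=1|x) = {p:.3f}, P(y=0|x) = {1 - p:.3f}")
# logit -5.0 -> P(y=1|x) = 0.007, P(y=0|x) = 0.993
# logit +0.0 -> P(y=1|x) = 0.500, P(y=0|x) = 0.500
# logit +5.0 -> P(y=1|x) = 0.993, P(y=0|x) = 0.007
```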
For problems with more than two mutually exclusive classes, the softmax function generalizes the sigmoid to a vector of logits:
P(y = k | x) = e^(z_k) / sum(e^(z_j) for j = 1 to K)
where z_k is the logit for class k and K is the total number of classes. The softmax function ensures that all output probabilities are positive and sum to 1. In the special case where K = 2, the softmax function reduces to the sigmoid function.
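A minimal softmax sketch over a vector of logits (plain NumPy rather than a library call; the logit values are arbitrary):

```python
import numpy as np

def softmax(z):
    # Exponentiate and normalize so the outputs are positive and sum to 1.
    e = np.exp(z - np.max(z))   # subtracting the max does not change the result
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])
p = softmax(z)
print(p)          # approximately [0.659 0.242 0.099]
print(p.sum())    # 1.0
```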
The following table summarizes the key differences between the two approaches:
| Property | Sigmoid | Softmax |
|---|---|---|
| Number of classes | 2 (binary) | K >= 2 (multi-class) |
| Output range | (0, 1) per logit | (0, 1) per class, sums to 1 |
| Class relationship | Independent (can also be used for multi-label) | Mutually exclusive |
| Formula | 1 / (1 + e^(-z)) | e^(z_k) / sum(e^(z_j)) |
| Typical loss function | Binary cross-entropy | Categorical cross-entropy |
| Use case example | Spam detection, sentiment polarity | ImageNet classification, language modeling |
In multi-label classification, where each input can belong to multiple classes simultaneously, the sigmoid function is applied independently to each logit. Unlike softmax, this produces probabilities that do not need to sum to 1, since the classes are not mutually exclusive. For example, an image might be tagged as both "outdoor" and "sunny" at the same time.
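A sketch of the multi-label case, where each logit is squashed independently and the results need not sum to 1 (the tag names and logit values are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

tags = ["outdoor", "sunny", "night"]      # hypothetical labels
logits = np.array([3.0, 1.5, -4.0])       # one logit per label

probs = sigmoid(logits)                   # each in (0, 1), independent of the others
for tag, p in zip(tags, probs):
    print(f"{tag}: {p:.3f}")
# outdoor: 0.953, sunny: 0.818, night: 0.018 -- the values do not sum to 1
```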
In practice, cross-entropy loss is almost always computed directly from logits rather than from probabilities, for reasons of numerical stability. When softmax probabilities are computed as a separate step and then passed to a logarithm, the computation can produce extremely small values that underflow to zero, causing the logarithm to return negative infinity. Computing the log-softmax in a single fused operation avoids this problem by using the log-sum-exp trick:
log_softmax(z_k) = z_k - log(sum(e^(z_j) for j = 1 to K))
In practice, the log-sum-exp term is itself computed stably by subtracting the maximum logit m = max_j z_j before exponentiation, using log(sum(e^(z_j))) = m + log(sum(e^(z_j - m))), which keeps all intermediate values in a numerically safe range.
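A sketch of this trick in NumPy (a hand-rolled log_softmax, not a framework function):

```python
import numpy as np

def log_softmax(z):
    # Log-sum-exp trick: shift by the maximum logit so exp() never overflows.
    m = np.max(z)
    return z - (m + np.log(np.sum(np.exp(z - m))))

z = np.array([1000.0, 999.0, 0.0])   # naive softmax followed by log would overflow here
print(log_softmax(z))                # approximately [-0.313 -1.313 -1000.313]
```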
For binary cross-entropy with logits, PyTorch and TensorFlow both use a reformulation that avoids overflow:
L = max(z, 0) - y * z + log(1 + e^(-|z|))
where z is the logit and y is the target label (0 or 1).
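A sketch of that reformulation in NumPy, mirroring the expression above (not the frameworks' actual source code):

```python
import numpy as np

def bce_with_logits(z, y):
    # Numerically stable binary cross-entropy computed directly from the logit z.
    return np.maximum(z, 0) - y * z + np.log1p(np.exp(-np.abs(z)))

print(bce_with_logits(np.array([100.0]), np.array([1.0])))    # ~0.0, no overflow
print(bce_with_logits(np.array([-100.0]), np.array([1.0])))   # ~100.0
print(bce_with_logits(np.array([0.0]), np.array([0.0])))      # ln(2) ~ 0.693
```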
The following table lists the primary loss functions that accept raw logits in popular deep learning frameworks:
| Framework | Loss function | Task |
|---|---|---|
| PyTorch | torch.nn.CrossEntropyLoss | Multi-class classification |
| PyTorch | torch.nn.BCEWithLogitsLoss | Binary / multi-label classification |
| TensorFlow | tf.nn.softmax_cross_entropy_with_logits | Multi-class classification |
| TensorFlow | tf.nn.sigmoid_cross_entropy_with_logits | Binary / multi-label classification |
PyTorch's CrossEntropyLoss internally combines LogSoftmax and NLLLoss (negative log-likelihood loss) into a single numerically stable operation. Passing pre-computed softmax probabilities to a separate log function is generally discouraged because it sacrifices this stability.
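A brief usage sketch with illustrative shapes and values: CrossEntropyLoss takes raw logits and integer class indices, and matches an explicit log_softmax followed by nll_loss.

```python
import torch
import torch.nn as nn

logits = torch.tensor([[2.0, 0.5, -1.0],     # raw scores for 2 samples, 3 classes
                       [0.1, 1.5,  0.3]])
targets = torch.tensor([0, 1])               # ground-truth class indices

loss = nn.CrossEntropyLoss()(logits, targets)   # softmax + log + NLL, fused internally
print(loss.item())

# Equivalent, but done in two explicit steps:
log_probs = nn.functional.log_softmax(logits, dim=1)
print(nn.functional.nll_loss(log_probs, targets).item())   # same value
```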
Temperature scaling is a technique that divides logits by a scalar parameter T (called the temperature) before applying the softmax function:
P(y = k | x) = e^(z_k / T) / sum(e^(z_j / T) for j = 1 to K)
The temperature parameter controls how "sharp" or "flat" the resulting probability distribution is:
| Temperature value | Effect on distribution | Behavior |
|---|---|---|
| T < 1 | Sharper (more peaked) | Model becomes more confident; highest-probability class dominates |
| T = 1 | Standard softmax | Default behavior with no modification |
| T > 1 | Flatter (more uniform) | Model becomes less confident; probability mass spreads across classes |
| T approaches 0 | Approaches argmax | Deterministic selection of the highest-logit class |
| T approaches infinity | Approaches uniform distribution | All classes become equally likely |
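A small sketch illustrating these effects on a fixed logit vector (the logits and temperatures chosen here are arbitrary):

```python
import numpy as np

def softmax_t(z, T):
    # Temperature-scaled softmax: divide the logits by T before normalizing.
    z = np.asarray(z, dtype=float) / T
    e = np.exp(z - z.max())
    return e / e.sum()

logits = [3.0, 1.0, -1.0]
for T in (0.5, 1.0, 5.0):
    print(T, np.round(softmax_t(logits, T), 3))
# T=0.5 -> [0.982 0.018 0.000]  (sharper)
# T=1.0 -> [0.867 0.117 0.016]  (standard softmax)
# T=5.0 -> [0.472 0.316 0.212]  (flatter)
```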
Temperature scaling is used in several important contexts.
In their 2015 paper "Distilling the Knowledge in a Neural Network," Geoffrey Hinton, Oriol Vinyals, and Jeff Dean introduced the use of high-temperature softmax outputs as "soft targets" for training smaller student models. When a large teacher model's logits are divided by a high temperature (typically between 2 and 20), the resulting probability distribution reveals the teacher's learned relationships between classes. For instance, a digit classifier might assign a small but meaningful probability to "7" when shown a "1," because the two digits share visual features. These soft targets carry more information than hard one-hot labels, allowing a smaller model to learn from the teacher's "dark knowledge." During distillation training, both the teacher and student use the same temperature value.
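A sketch of a distillation loss along these lines (the teacher and student logits are placeholders, and the T² factor follows the paper's gradient-scaling argument):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T=4.0):
    # Soften both distributions with the same temperature, then match them with KL divergence.
    soft_teacher = F.softmax(teacher_logits / T, dim=1)
    log_soft_student = F.log_softmax(student_logits / T, dim=1)
    # Multiply by T^2 so the gradient scale stays comparable to the hard-label loss.
    return F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * (T ** 2)

teacher_logits = torch.tensor([[8.0, 2.0, -1.0]])   # hypothetical teacher outputs
student_logits = torch.tensor([[3.0, 1.0,  0.0]])   # hypothetical student outputs
print(distillation_loss(student_logits, teacher_logits).item())
```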
Guo et al. (2017) demonstrated in "On Calibration of Modern Neural Networks" that modern deep networks tend to be poorly calibrated, meaning their predicted probabilities do not accurately reflect the true likelihood of correctness. Temperature scaling with a single learned parameter T, optimized on a validation set to minimize negative log-likelihood, can restore calibration with minimal computational cost. This post-processing technique simply divides all logits by T before applying softmax and has been shown to be surprisingly effective.
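A sketch of post-hoc calibration in that spirit: search for the single T that minimizes negative log-likelihood on held-out logits. The data below is synthetic and stands in for a model's validation-set outputs.

```python
import numpy as np

def log_softmax(z):
    m = z.max(axis=1, keepdims=True)
    return z - (m + np.log(np.exp(z - m).sum(axis=1, keepdims=True)))

def nll(logits, labels, T):
    # Average negative log-likelihood of the true labels under temperature-scaled softmax.
    lp = log_softmax(logits / T)
    return -lp[np.arange(len(labels)), labels].mean()

# Synthetic stand-ins for held-out logits and labels from an overconfident model:
# the true class usually has the largest score, and all logits are inflated.
rng = np.random.default_rng(0)
labels = rng.integers(0, 10, size=1000)
logits = rng.normal(size=(1000, 10))
logits[np.arange(1000), labels] += 2.0
logits *= 3.0

candidates = np.linspace(0.5, 5.0, 46)
best_T = min(candidates, key=lambda T: nll(logits, labels, T))
print(best_T)   # the single scalar by which all future logits are divided before softmax
```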
In autoregressive large language models (LLMs) such as GPT-4, Claude, and LLaMA, logits play a central role in next-token prediction. At each generation step, the model produces a vector of logits with one entry per token in its vocabulary (often 32,000 to 128,000 or more tokens). These logits represent the model's assessment of how likely each token is to come next given the preceding context.
The logits are then transformed into a probability distribution (typically via softmax with optional temperature scaling), and a token is sampled from this distribution. Several sampling strategies operate directly on the logit or probability space.
| Strategy | Description |
|---|---|
| Greedy decoding | Always selects the token with the highest logit (argmax). Deterministic but can produce repetitive output. |
| Temperature sampling | Divides logits by temperature T before softmax. Higher T increases diversity; lower T increases focus. |
| Top-k sampling | After computing probabilities from logits, keeps only the k most probable tokens and redistributes probability mass among them. |
| Top-p (nucleus) sampling | Sorts tokens by probability, selects the smallest set whose cumulative probability exceeds threshold p (typically 0.9 or 0.95), and samples from that set. |
| Min-p sampling | Filters out tokens whose probability is below a fraction of the most probable token's probability. Adapts dynamically to the shape of the distribution. |
| Beam search | Maintains multiple candidate sequences (beams), extending each at every step with the highest-scoring continuations ranked by cumulative log-probability. Common in machine translation. |
In practice, these strategies are often combined. A typical LLM inference pipeline applies temperature scaling first, then filters candidates with top-k or top-p, and finally samples from the remaining tokens. API providers such as OpenAI, Anthropic, and Google expose temperature, top-k, and top-p as user-configurable parameters.
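A sketch of such a pipeline in NumPy, applying temperature scaling, then top-k filtering, then sampling (the vocabulary size and parameter values are illustrative):

```python
import numpy as np

def sample_next_token(logits, temperature=0.8, top_k=50, rng=None):
    # 1. Temperature: rescale the logits to sharpen or flatten the distribution.
    z = np.asarray(logits, dtype=float) / temperature
    # 2. Top-k: keep only the k highest logits, mask out the rest.
    if top_k < len(z):
        cutoff = np.sort(z)[-top_k]
        z = np.where(z >= cutoff, z, -np.inf)
    # 3. Softmax over the surviving logits, then sample a token index.
    e = np.exp(z - z.max())
    probs = e / e.sum()
    if rng is None:
        rng = np.random.default_rng()
    return rng.choice(len(probs), p=probs)

vocab_size = 1000                        # toy vocabulary
logits = np.random.default_rng(0).normal(size=vocab_size)
print(sample_next_token(logits))         # index of the sampled next token
```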
Some LLM APIs allow users to add a fixed bias to specific token logits before sampling. This can be used to encourage or discourage particular tokens. For example, adding a large negative bias to the logit for a profanity token effectively prevents the model from generating it, while a positive bias can steer the model toward a preferred vocabulary.
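A minimal illustration of the idea (the token IDs and bias values are made up; real APIs expose this as a mapping from token ID to bias):

```python
import numpy as np

logits = np.random.default_rng(1).normal(size=100)   # toy vocabulary of 100 tokens

logit_bias = {17: -100.0, 42: +5.0}   # hypothetical: ban token 17, favor token 42
for token_id, bias in logit_bias.items():
    logits[token_id] += bias

# After softmax, token 17 has effectively zero probability and token 42 is boosted.
probs = np.exp(logits - logits.max())
probs /= probs.sum()
print(probs[17], probs[42])
```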
In logistic regression, the logit appears in its original statistical sense. The model assumes that the log-odds of the positive class is a linear function of the input features:
logit(P(y = 1 | x)) = ln(P(y = 1 | x) / P(y = 0 | x)) = w^T x + b
Solving for the probability gives the familiar logistic function:
P(y = 1 | x) = 1 / (1 + e^(-(w^T x + b)))
Logistic regression can be viewed as a single-layer neural network with a sigmoid activation, making it the simplest case of using logits for classification. The connection between classical logistic regression and modern neural network output layers is direct: both compute a linear combination of inputs and interpret the result as a logit.
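A brief sketch of that correspondence using scikit-learn on a toy dataset: decision_function returns the fitted model's logits (w^T x + b), and applying the sigmoid to them reproduces predict_proba for the positive class.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy binary dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

clf = LogisticRegression().fit(X, y)

z = clf.decision_function(X[:5])          # logits for the first 5 samples
p = 1.0 / (1.0 + np.exp(-z))              # sigmoid applied manually
print(p)
print(clf.predict_proba(X[:5])[:, 1])     # matches the manually computed probabilities
```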
Several conventions and best practices apply when working with logits in code:
- Loss functions such as CrossEntropyLoss in PyTorch expect raw logits. Applying softmax first and then passing the result leads to numerically unstable gradients and incorrect loss values.
- For binary and multi-label tasks, use BCEWithLogitsLoss instead of BCELoss with a manual sigmoid. In TensorFlow, pass from_logits=True to the loss constructor.

The following table distinguishes logits from closely related terms:

| Term | Definition | Range | Relationship to logits |
|---|---|---|---|
| Logit (statistical) | log(p / (1 - p)), the log-odds of a probability | (-inf, +inf) | The original definition |
| Logit (neural network) | Raw output of the final dense layer before activation | (-inf, +inf) | Named by analogy; not always literal log-odds |
| Probability | Normalized value representing likelihood | [0, 1] | Obtained from logits via sigmoid or softmax |
| Log-probability | Natural log of a probability | (-inf, 0] | log(softmax(logits)) |
| Odds | p / (1 - p) | (0, +inf) | exp(logit) |
| Score | General term for a model's raw output | Varies | Logits are a specific type of score |
Imagine you are trying to guess what flavor of ice cream your friend likes best. Before you figure out the actual chances (like "60% chance they like chocolate, 30% vanilla, 10% strawberry"), your brain first comes up with some rough numbers. Maybe chocolate gets a score of 3, vanilla gets 1, and strawberry gets -1. These rough scores are logits. They are just your gut feeling before you turn them into neat percentages.
To go from these rough scores to real percentages, you use a special math recipe called softmax. It takes your scores and squishes them so they are all between 0 and 1 and add up to 100%. The bigger the score, the bigger the percentage. That is essentially what a computer does when it uses logits to make predictions.