In machine learning and statistics, logits refer to the raw, unnormalized scores output by a model before they are converted into probabilities. The term has two closely related but distinct meanings: (1) the output of the logit function, which maps a probability to a log-odds value, and (2) the raw output values from the final layer of a neural network before an activation function such as softmax or sigmoid is applied. Both usages are connected by the idea that these values live on an unbounded real-number scale and must be transformed to become interpretable as probabilities.
The word "logit" was coined by American statistician Joseph Berkson in his 1944 paper "Application of the Logistic Function to Bio-Assay," published in the Journal of the American Statistical Association. Berkson constructed the term as an abbreviation of "logistic unit," deliberately mirroring Chester Ittner Bliss's earlier term "probit" (probability unit), which had been introduced in 1934. Berkson proposed the logit model (later known as logistic regression) as a computationally simpler alternative to the probit model that was prevalent at the time. His approach used the logarithm of the odds ratio as the response variable, and the technique gradually gained widespread adoption across biostatistics, economics, and eventually machine learning.
The logit function maps a probability value p in the open interval (0, 1) to a real number on the entire number line:
logit(p) = ln(p / (1 - p))
Here, p / (1 - p) is known as the odds of the event, and taking the natural logarithm yields the log-odds. The logit function is therefore also called the log-odds function. Equivalent expressions include ln(p) - ln(1 - p) and -ln(1/p - 1).
The domain of the logit function is (0, 1), and its range is all real numbers from negative infinity to positive infinity.
The logit function is the inverse of the standard logistic function (also called the sigmoid function):
sigma(x) = 1 / (1 + e^(-x))
Applying the logistic function to a logit value recovers the original probability. Conversely, applying the logit function to a probability produced by the sigmoid yields the original unbounded score. This inverse relationship is central to logistic regression, where the model's linear predictor (a weighted sum of features) produces logit values, and the sigmoid function converts them into predicted probabilities.
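A minimal sketch of this round trip in NumPy (the `sigmoid` and `logit` functions below are local helpers written out for illustration, not library calls):

```python
import numpy as np

def sigmoid(x):
    # Standard logistic function: maps any real number to (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

def logit(p):
    # Log-odds: maps a probability in (0, 1) back to the real line.
    return np.log(p / (1.0 - p))

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
p = sigmoid(x)
print(p)          # approximately [0.0067 0.2689 0.5 0.7311 0.9933]
print(logit(p))   # recovers the original values [-5. -1. 0. 1. 5.]
```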
In the context of deep learning, the term "logits" most commonly refers to the raw output values of a neural network's final layer before any probability-producing activation function is applied. These values result from a linear transformation of the previous layer's activations:
z = Wx + b
where W is the weight matrix of the output layer, x is the input from the preceding layer, and b is the bias vector. The resulting vector z contains one logit value per class. These logit values can be any real number, positive or negative, and do not sum to any particular value.
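As an illustrative sketch (the layer sizes and random values are arbitrary), the logits for a 3-class problem are just a matrix-vector product plus a bias:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical output layer for a 3-class problem with 4 input features.
W = rng.normal(size=(3, 4))   # weight matrix of the output layer
b = rng.normal(size=3)        # bias vector
x = rng.normal(size=4)        # activations from the preceding layer

z = W @ x + b                 # one raw, unbounded logit per class
print(z)                      # arbitrary real numbers; they need not sum to anything
```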
The logits are then passed through an activation function to produce a probability distribution over the output classes. The choice of activation function depends on the classification task.
In binary classification, a neural network typically produces a single logit value z. The sigmoid function then maps this value to a probability:
P(y = 1 | x) = sigma(z) = 1 / (1 + e^(-z))
The probability of the negative class is simply the complement: P(y = 0 | x) = 1 - sigma(z). A logit of 0 corresponds to a probability of 0.5 (equal odds for both classes). Positive logits indicate the model favors the positive class, while negative logits indicate it favors the negative class. The magnitude of the logit reflects the model's confidence: a logit of 5.0 corresponds to a probability of about 0.993, while a logit of -5.0 corresponds to about 0.007.
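A quick check of these numbers, using a hand-written sigmoid helper:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for z in (-5.0, 0.0, 5.0):
    p = sigmoid(z)
    print(f"logit {z:+.1f} -> P(y=1|x) = {p:.3f}, P(y=0|x) = {1 - p:.3f}")
# logit -5.0 -> P(y=1|x) = 0.007, P(y=0|x) = 0.993
# logit +0.0 -> P(y=1|x) = 0.500, P(y=0|x) = 0.500
# logit +5.0 -> P(y=1|x) = 0.993, P(y=0|x) = 0.007
```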
For problems with more than two mutually exclusive classes, the softmax function generalizes the sigmoid to a vector of logits:
P(y = k | x) = e^(z_k) / sum(e^(z_j) for j = 1 to K)
where z_k is the logit for class k and K is the total number of classes. The softmax function ensures that all output probabilities are positive and sum to 1. In the special case where K = 2, the softmax function reduces to the sigmoid function.
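A minimal softmax sketch over a vector of logits (plain NumPy rather than a library call; the logit values are arbitrary):

```python
import numpy as np

def softmax(z):
    # Exponentiate and normalize so the outputs are positive and sum to 1.
    e = np.exp(z - np.max(z))   # subtracting the max does not change the result
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])
p = softmax(z)
print(p)          # approximately [0.659 0.242 0.099]
print(p.sum())    # 1.0
```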
The following table summarizes the key differences between the two approaches:
| Property | Sigmoid | Softmax |
|---|---|---|
| Number of classes | 2 (binary) | K >= 2 (multi-class) |
| Output range | (0, 1) per logit | (0, 1) per class, sums to 1 |
| Class relationship | Independent (can also be used for multi-label) | Mutually exclusive |
| Formula | 1 / (1 + e^(-z)) | e^(z_k) / sum(e^(z_j)) |
| Typical loss function | Binary cross-entropy | Categorical cross-entropy |
| Use case example | Spam detection, sentiment polarity | ImageNet classification, language modeling |
In multi-label classification, where each input can belong to multiple classes simultaneously, the sigmoid function is applied independently to each logit. Unlike softmax, this produces probabilities that do not need to sum to 1, since the classes are not mutually exclusive. For example, an image might be tagged as both "outdoor" and "sunny" at the same time.
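A sketch of the multi-label case, where each logit is squashed independently and the results need not sum to 1 (the tag names and logit values are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

tags = ["outdoor", "sunny", "night"]      # hypothetical labels
logits = np.array([3.0, 1.5, -4.0])       # one logit per label

probs = sigmoid(logits)                   # each in (0, 1), independent of the others
for tag, p in zip(tags, probs):
    print(f"{tag}: {p:.3f}")
# outdoor: 0.953, sunny: 0.818, night: 0.018 -- the values do not sum to 1
```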
In practice, cross-entropy loss is almost always computed directly from logits rather than from probabilities, for reasons of numerical stability. When softmax probabilities are computed as a separate step and then passed to a logarithm, the computation can produce extremely small values that underflow to zero, causing the logarithm to return negative infinity. Computing the log-softmax in a single fused operation avoids this problem by using the log-sum-exp trick:
log_softmax(z_k) = z_k - log(sum(e^(z_j) for j = 1 to K))
In practice, the log-sum-exp term is itself computed stably by subtracting the maximum logit m = max_j z_j before exponentiation, using log(sum(e^(z_j))) = m + log(sum(e^(z_j - m))), which keeps all intermediate values in a numerically safe range.
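A sketch of this trick in NumPy (a hand-rolled log_softmax, not a framework function):

```python
import numpy as np

def log_softmax(z):
    # Log-sum-exp trick: shift by the maximum logit so exp() never overflows.
    m = np.max(z)
    return z - (m + np.log(np.sum(np.exp(z - m))))

z = np.array([1000.0, 999.0, 0.0])   # naive softmax followed by log would overflow here
print(log_softmax(z))                # approximately [-0.313 -1.313 -1000.313]
```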
For binary cross-entropy with logits, PyTorch and TensorFlow both use a reformulation that avoids overflow:
L = max(z, 0) - y * z + log(1 + e^(-|z|))
where z is the logit and y is the target label (0 or 1).
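A sketch of that reformulation in NumPy, mirroring the expression above (not the frameworks' actual source code):

```python
import numpy as np

def bce_with_logits(z, y):
    # Numerically stable binary cross-entropy computed directly from the logit z.
    return np.maximum(z, 0) - y * z + np.log1p(np.exp(-np.abs(z)))

print(bce_with_logits(np.array([100.0]), np.array([1.0])))    # ~0.0, no overflow
print(bce_with_logits(np.array([-100.0]), np.array([1.0])))   # ~100.0
print(bce_with_logits(np.array([0.0]), np.array([0.0])))      # ln(2) ~ 0.693
```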
The following table lists the primary loss functions that accept raw logits in popular deep learning frameworks:
| Framework | Loss function | Task |
|---|---|---|
| PyTorch | torch.nn.CrossEntropyLoss | Multi-class classification |
| PyTorch | torch.nn.BCEWithLogitsLoss | Binary / multi-label classification |
| TensorFlow | tf.nn.softmax_cross_entropy_with_logits | Multi-class classification |
| TensorFlow | tf.nn.sigmoid_cross_entropy_with_logits | Binary / multi-label classification |
PyTorch's CrossEntropyLoss internally combines LogSoftmax and NLLLoss (negative log-likelihood loss) into a single numerically stable operation. Passing pre-computed softmax probabilities to a separate log function is generally discouraged because it sacrifices this stability.
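A brief usage sketch with illustrative shapes and values: CrossEntropyLoss takes raw logits and integer class indices, and matches an explicit log_softmax followed by nll_loss.

```python
import torch
import torch.nn as nn

logits = torch.tensor([[2.0, 0.5, -1.0],     # raw scores for 2 samples, 3 classes
                       [0.1, 1.5,  0.3]])
targets = torch.tensor([0, 1])               # ground-truth class indices

loss = nn.CrossEntropyLoss()(logits, targets)   # softmax + log + NLL, fused internally
print(loss.item())

# Equivalent, but done in two explicit steps:
log_probs = nn.functional.log_softmax(logits, dim=1)
print(nn.functional.nll_loss(log_probs, targets).item())   # same value
```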
Temperature scaling is a technique that divides logits by a scalar parameter T (called the temperature) before applying the softmax function:
P(y = k | x) = e^(z_k / T) / sum(e^(z_j / T) for j = 1 to K)
The temperature parameter controls how "sharp" or "flat" the resulting probability distribution is:
| Temperature value | Effect on distribution | Behavior |
|---|---|---|
| T < 1 | Sharper (more peaked) | Model becomes more confident; highest-probability class dominates |
| T = 1 | Standard softmax | Default behavior with no modification |
| T > 1 | Flatter (more uniform) | Model becomes less confident; probability mass spreads across classes |
| T approaches 0 | Approaches argmax | Deterministic selection of the highest-logit class |
| T approaches infinity | Approaches uniform distribution | All classes become equally likely |
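A small sketch illustrating these effects on a fixed logit vector (the logits and temperatures chosen here are arbitrary):

```python
import numpy as np

def softmax_t(z, T):
    # Temperature-scaled softmax: divide the logits by T before normalizing.
    z = np.asarray(z, dtype=float) / T
    e = np.exp(z - z.max())
    return e / e.sum()

logits = [3.0, 1.0, -1.0]
for T in (0.5, 1.0, 5.0):
    print(T, np.round(softmax_t(logits, T), 3))
# T=0.5 -> [0.982 0.018 0.000]  (sharper)
# T=1.0 -> [0.867 0.117 0.016]  (standard softmax)
# T=5.0 -> [0.472 0.316 0.212]  (flatter)
```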
Temperature scaling is used in several important contexts.
In their 2015 paper "Distilling the Knowledge in a Neural Network," Geoffrey Hinton, Oriol Vinyals, and Jeff Dean introduced the use of high-temperature softmax outputs as "soft targets" for training smaller student models. When a large teacher model's logits are divided by a high temperature (typically between 2 and 20), the resulting probability distribution reveals the teacher's learned relationships between classes. For instance, a digit classifier might assign a small but meaningful probability to "7" when shown a "1," because the two digits share visual features. These soft targets carry more information than hard one-hot labels, allowing a smaller model to learn from the teacher's "dark knowledge." During distillation training, both the teacher and student use the same temperature value.
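A sketch of a distillation loss along these lines (the teacher and student logits are placeholders, and the T² factor follows the paper's gradient-scaling argument):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T=4.0):
    # Soften both distributions with the same temperature, then match them with KL divergence.
    soft_teacher = F.softmax(teacher_logits / T, dim=1)
    log_soft_student = F.log_softmax(student_logits / T, dim=1)
    # Multiply by T^2 so the gradient scale stays comparable to the hard-label loss.
    return F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * (T ** 2)

teacher_logits = torch.tensor([[8.0, 2.0, -1.0]])   # hypothetical teacher outputs
student_logits = torch.tensor([[3.0, 1.0,  0.0]])   # hypothetical student outputs
print(distillation_loss(student_logits, teacher_logits).item())
```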
Guo et al. (2017) demonstrated in "On Calibration of Modern Neural Networks" that modern deep networks tend to be poorly calibrated, meaning their predicted probabilities do not accurately reflect the true likelihood of correctness. Temperature scaling with a single learned parameter T, optimized on a validation set to minimize negative log-likelihood, can restore calibration with minimal computational cost. This post-processing technique simply divides all logits by T before applying softmax and has been shown to be surprisingly effective.
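A sketch of post-hoc calibration in that spirit: search for the single T that minimizes negative log-likelihood on held-out logits. The data below is synthetic and stands in for a model's validation-set outputs.

```python
import numpy as np

def log_softmax(z):
    m = z.max(axis=1, keepdims=True)
    return z - (m + np.log(np.exp(z - m).sum(axis=1, keepdims=True)))

def nll(logits, labels, T):
    # Average negative log-likelihood of the true labels under temperature-scaled softmax.
    lp = log_softmax(logits / T)
    return -lp[np.arange(len(labels)), labels].mean()

# Synthetic stand-ins for held-out logits and labels from an overconfident model:
# the true class usually has the largest score, and all logits are inflated.
rng = np.random.default_rng(0)
labels = rng.integers(0, 10, size=1000)
logits = rng.normal(size=(1000, 10))
logits[np.arange(1000), labels] += 2.0
logits *= 3.0

candidates = np.linspace(0.5, 5.0, 46)
best_T = min(candidates, key=lambda T: nll(logits, labels, T))
print(best_T)   # the single scalar by which all future logits are divided before softmax
```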
In autoregressive large language models (LLMs) such as GPT-4, Claude, and LLaMA, logits play a central role in next-token prediction. At each generation step, the model produces a vector of logits with one entry per token in its vocabulary (often 32,000 to 128,000 or more tokens). These logits represent the model's assessment of how likely each token is to come next given the preceding context.
The logits are then transformed into a probability distribution (typically via softmax with optional temperature scaling), and a token is sampled from this distribution. Several sampling strategies operate directly on the logit or probability space.
| Strategy | Description |
|---|---|
| Greedy decoding | Always selects the token with the highest logit (argmax). Deterministic but can produce repetitive output. |
| Temperature sampling | Divides logits by temperature T before softmax. Higher T increases diversity; lower T increases focus. |
| Top-k sampling | After computing probabilities from logits, keeps only the k most probable tokens and redistributes probability mass among them. |
| Top-p (nucleus) sampling | Sorts tokens by probability, selects the smallest set whose cumulative probability exceeds threshold p (typically 0.9 or 0.95), and samples from that set. |
| Min-p sampling | Filters out tokens whose probability is below a fraction of the most probable token's probability. Adapts dynamically to the shape of the distribution. |
| Beam search | Maintains multiple candidate sequences (beams), extending each at every step with the highest-scoring continuations ranked by cumulative log-probability. Common in machine translation. |
In practice, these strategies are often combined. A typical LLM inference pipeline applies temperature scaling first, then filters candidates with top-k or top-p, and finally samples from the remaining tokens. API providers such as OpenAI, Anthropic, and Google expose temperature, top-k, and top-p as user-configurable parameters.
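A sketch of such a pipeline in NumPy, applying temperature scaling, then top-k filtering, then sampling (the vocabulary size and parameter values are illustrative):

```python
import numpy as np

def sample_next_token(logits, temperature=0.8, top_k=50, rng=None):
    # 1. Temperature: rescale the logits to sharpen or flatten the distribution.
    z = np.asarray(logits, dtype=float) / temperature
    # 2. Top-k: keep only the k highest logits, mask out the rest.
    if top_k < len(z):
        cutoff = np.sort(z)[-top_k]
        z = np.where(z >= cutoff, z, -np.inf)
    # 3. Softmax over the surviving logits, then sample a token index.
    e = np.exp(z - z.max())
    probs = e / e.sum()
    if rng is None:
        rng = np.random.default_rng()
    return rng.choice(len(probs), p=probs)

vocab_size = 1000                        # toy vocabulary
logits = np.random.default_rng(0).normal(size=vocab_size)
print(sample_next_token(logits))         # index of the sampled next token
```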
Some LLM APIs allow users to add a fixed bias to specific token logits before sampling. This can be used to encourage or discourage particular tokens. For example, adding a large negative bias to the logit for a profanity token effectively prevents the model from generating it, while a positive bias can steer the model toward a preferred vocabulary.
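A minimal illustration of the idea (the token IDs and bias values are made up; real APIs expose this as a mapping from token ID to bias):

```python
import numpy as np

logits = np.random.default_rng(1).normal(size=100)   # toy vocabulary of 100 tokens

logit_bias = {17: -100.0, 42: +5.0}   # hypothetical: ban token 17, favor token 42
for token_id, bias in logit_bias.items():
    logits[token_id] += bias

# After softmax, token 17 has effectively zero probability and token 42 is boosted.
probs = np.exp(logits - logits.max())
probs /= probs.sum()
print(probs[17], probs[42])
```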
In logistic regression, the logit appears in its original statistical sense. The model assumes that the log-odds of the positive class is a linear function of the input features:
logit(P(y = 1 | x)) = ln(P(y = 1 | x) / P(y = 0 | x)) = w^T x + b
Solving for the probability gives the familiar logistic function:
P(y = 1 | x) = 1 / (1 + e^(-(w^T x + b)))
Logistic regression can be viewed as a single-layer neural network with a sigmoid activation, making it the simplest case of using logits for classification. The connection between classical logistic regression and modern neural network output layers is direct: both compute a linear combination of inputs and interpret the result as a logit.
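A brief sketch of that correspondence using scikit-learn on a toy dataset: decision_function returns the fitted model's logits (w^T x + b), and applying the sigmoid to them reproduces predict_proba for the positive class.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy binary dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

clf = LogisticRegression().fit(X, y)

z = clf.decision_function(X[:5])          # logits for the first 5 samples
p = 1.0 / (1.0 + np.exp(-z))              # sigmoid applied manually
print(p)
print(clf.predict_proba(X[:5])[:, 1])     # matches the manually computed probabilities
```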
Several conventions and best practices apply when working with logits in code:
- Loss functions such as CrossEntropyLoss in PyTorch expect raw logits. Applying softmax first and then passing the result leads to numerically unstable gradients and incorrect loss values.
- For binary and multi-label tasks, use BCEWithLogitsLoss instead of BCELoss with a manual sigmoid. In TensorFlow, pass from_logits=True to the loss constructor.

The following table distinguishes logits from closely related terms:

| Term | Definition | Range | Relationship to logits |
|---|---|---|---|
| Logit (statistical) | log(p / (1 - p)), the log-odds of a probability | (-inf, +inf) | The original definition |
| Logit (neural network) | Raw output of the final dense layer before activation | (-inf, +inf) | Named by analogy; not always literal log-odds |
| Probability | Normalized value representing likelihood | [0, 1] | Obtained from logits via sigmoid or softmax |
| Log-probability | Natural log of a probability | (-inf, 0] | log(softmax(logits)) |
| Odds | p / (1 - p) | (0, +inf) | exp(logit) |
| Score | General term for a model's raw output | Varies | Logits are a specific type of score |
Imagine you are trying to guess what flavor of ice cream your friend likes best. Before you figure out the actual chances (like "60% chance they like chocolate, 30% vanilla, 10% strawberry"), your brain first comes up with some rough numbers. Maybe chocolate gets a score of 3, vanilla gets 1, and strawberry gets -1. These rough scores are logits. They are just your gut feeling before you turn them into neat percentages.
To go from these rough scores to real percentages, you use a special math recipe called softmax. It takes your scores and squishes them so they are all between 0 and 1 and add up to 100%. The bigger the score, the bigger the percentage. That is essentially what a computer does when it uses logits to make predictions.