Log-odds, also known as the logit, is a mathematical transformation that converts a probability value between 0 and 1 into a real number spanning from negative infinity to positive infinity. Formally, the log-odds of a probability p is defined as the natural logarithm of the odds ratio: logit(p) = ln(p / (1 - p)). This transformation is foundational in logistic regression, Bayesian statistics, information theory, and modern deep learning, where it provides a mathematically convenient way to model binary outcomes, interpret classifier outputs, and update beliefs.
The term "logit" was coined in 1944 by the American statistician Joseph Berkson as a contraction of "logistic unit," paralleling the earlier "probit" (probability unit) introduced by Chester Ittner Bliss in 1934. The related term "log-odds" was popularized by G. A. Barnard in 1949. Despite being nearly 80 years old, the logit remains one of the most widely used functions in both classical statistics and contemporary machine learning.
Imagine you have a bag with 9 red marbles and 1 blue marble. The chance (probability) of picking a red marble is 90%, or 0.9. Now, the "odds" of picking red are 9 to 1, because for every 1 blue marble there are 9 red ones. Log-odds just means taking the logarithm of those odds, which turns the number 9 into about 2.2.
Why bother? Because probabilities are stuck between 0 and 1, which makes math tricky. Odds are stuck above 0, which is still limiting. But log-odds can be any number, positive or negative, which makes it much easier for a computer or a formula to work with. When the log-odds is 0, you have a 50-50 chance. Positive log-odds mean "more likely than not," and negative log-odds mean "less likely than not." The further the number is from 0, the more lopsided things are.
So log-odds is really just a clever trick to turn "how likely is something" into a number that plays nicely with addition, subtraction, and all the math that machine learning models need to do.
The logit function maps probabilities from the open interval (0, 1) to the entire real line (-infinity, +infinity). For a probability p:
logit(p) = ln(p / (1 - p)) = ln(p) - ln(1 - p)
Several equivalent expressions exist. For example, the logit can be written as a Taylor series expansion around p = 0.5:
logit(x) = 2 * sum of [(2x - 1)^(2n+1) / (2n + 1)] for n = 0, 1, 2, ...
The transformation proceeds in two stages. First, a probability p is converted to odds:
odds = p / (1 - p)
Then the natural logarithm is applied:
log-odds = ln(odds) = ln(p / (1 - p))
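As a quick illustration of the two-stage conversion, here is a minimal sketch using only Python's standard library (the helper names `to_odds` and `logit` are ours, not from any particular package):

```python
import math

def to_odds(p):
    """Convert a probability in (0, 1) to odds."""
    return p / (1 - p)

def logit(p):
    """Convert a probability in (0, 1) to log-odds (natural log of the odds)."""
    return math.log(to_odds(p))

print(to_odds(0.9))  # ~9      -> "9 to 1" odds
print(logit(0.9))    # ~2.197  -> positive: more likely than not
print(logit(0.5))    # 0.0     -> even odds
print(logit(0.1))    # ~-2.197 -> negative: less likely than not
```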
Each representation has a different range, which is summarized in the following table:
| Representation | Range | Key property |
|---|---|---|
| Probability | [0, 1] | Bounded on both sides |
| Odds | [0, +infinity) | Bounded on one side |
| Log-odds (logit) | (-infinity, +infinity) | Unbounded |
The following table illustrates how specific probabilities, odds, and log-odds relate to each other:
| Probability (p) | Odds (p / (1 - p)) | Log-odds (ln(odds)) | Interpretation |
|---|---|---|---|
| 0.01 | 0.0101 | -4.595 | Very unlikely |
| 0.05 | 0.0526 | -2.944 | Unlikely |
| 0.10 | 0.1111 | -2.197 | Unlikely |
| 0.20 | 0.2500 | -1.386 | Somewhat unlikely |
| 0.30 | 0.4286 | -0.847 | Somewhat unlikely |
| 0.40 | 0.6667 | -0.405 | Slightly below even |
| 0.50 | 1.0000 | 0.000 | Even odds |
| 0.60 | 1.5000 | 0.405 | Slightly above even |
| 0.70 | 2.3333 | 0.847 | Somewhat likely |
| 0.80 | 4.0000 | 1.386 | Likely |
| 0.90 | 9.0000 | 2.197 | Very likely |
| 0.95 | 19.000 | 2.944 | Very likely |
| 0.99 | 99.000 | 4.595 | Near certain |
The table highlights a key symmetry: log-odds values are symmetric around zero. A probability of 0.9 yields log-odds of approximately +2.197, while a probability of 0.1 yields log-odds of approximately -2.197.
The logit function has several mathematical properties that make it well suited for statistical modeling.
The domain of the logit function is the open interval (0, 1), excluding the endpoints: at p = 0 the logarithm diverges to negative infinity, and at p = 1 the odds are undefined because of division by zero. The range covers all real numbers from -infinity to +infinity. This unbounded range is one of the main reasons log-odds are preferred over raw probabilities in regression modeling, since a linear function of predictors naturally produces values across the entire real line.
The logit function is antisymmetric around p = 0.5:
logit(1 - p) = -logit(p)
This means that the log-odds of an event not happening is the negative of the log-odds of it happening. At p = 0.5, the logit equals zero, providing a natural reference point where neither outcome is favored. This symmetry also means that positive log-odds indicate the event is more likely than not, negative log-odds indicate it is less likely than not, and zero log-odds correspond to a 50-50 chance.
The logit is a strictly increasing function. As p increases from 0 to 1, logit(p) increases from -infinity to +infinity. This preserves the ordering of probabilities: if p_1 > p_2, then logit(p_1) > logit(p_2). The one-to-one mapping ensures that every probability corresponds to exactly one log-odds value and vice versa.
The derivative of the logit function is:
d/dp logit(p) = 1 / (p(1 - p))
This derivative is always positive on (0, 1), confirming monotonicity. The derivative is smallest at p = 0.5 (where it equals 4) and grows without bound as p approaches 0 or 1, meaning the logit "stretches" the probability scale more at the extremes.
Because the logarithm converts multiplication to addition, independent contributions to the odds can be combined by adding their log-odds rather than multiplying odds directly. This additive property is why linear models work naturally in log-odds space. In logistic regression, each predictor contributes an additive term to the log-odds, and the combined effect is the sum of all individual contributions.
The sampling distribution of the log-odds ratio is better approximated by a normal distribution than that of the odds ratio itself. This property makes the log-odds ratio especially useful for constructing confidence intervals and performing hypothesis tests in medical and social science research. It also means that the log-odds scale is more suitable for meta-analytic combination of results across studies.
The logit function is the negative of the derivative of the binary entropy function H(p) = -p ln(p) - (1 - p) ln(1 - p). This connection is significant in information theory and in the design of loss functions for classification tasks.
The inverse of the logit function is the sigmoid function (also called the logistic function or expit function):
sigma(x) = 1 / (1 + e^(-x)) = e^x / (e^x + 1)
Given a log-odds value x, the sigmoid function converts it back to a probability. The sigmoid can also be expressed as:
sigma(x) = (tanh(x/2) + 1) / 2
The logit and sigmoid functions form an inverse pair:
sigma(logit(p)) = p
logit(sigma(x)) = x
This inverse relationship is central to logistic regression, where the model assumes that the log-odds of the outcome is a linear function of the input features, and the sigmoid function converts those log-odds back into predicted probabilities.
| Property | Logit (log-odds) | Sigmoid (inverse logit) |
|---|---|---|
| Formula | ln(p / (1 - p)) | 1 / (1 + e^(-x)) |
| Input range | (0, 1) | (-infinity, +infinity) |
| Output range | (-infinity, +infinity) | (0, 1) |
| Midpoint | logit(0.5) = 0 | sigma(0) = 0.5 |
| Symmetry | logit(1 - p) = -logit(p) | sigma(-x) = 1 - sigma(x) |
| Use case | Transform probability to real line | Transform real line to probability |
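A quick numerical check of this inverse pair and of the symmetry in the table (a sketch with the Python standard library; `logit` and `sigmoid` are our own helper names):

```python
import math

def logit(p):
    return math.log(p / (1 - p))

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

for p in (0.1, 0.5, 0.9):
    x = logit(p)
    # sigmoid(logit(p)) recovers p, and logit(1 - p) = -logit(p)
    print(round(x, 3), round(sigmoid(x), 3), round(logit(1 - p), 3))
```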
Logistic regression is a generalized linear model for binary classification in which the logit serves as the canonical link function for the Bernoulli distribution. The model assumes that the log-odds of the positive class is a linear combination of the input features:
logit(p) = ln(p / (1 - p)) = B_0 + B_1*x_1 + B_2*x_2 + ... + B_n*x_n
Here, B_0 is the intercept (the log-odds when all predictors are zero) and each B_k represents the change in log-odds for a one-unit increase in x_k, holding all other variables constant.
The coefficients of a logistic regression model have a direct interpretation in terms of log-odds and odds ratios:
| Coefficient value | Effect on log-odds | Odds ratio (e^B) | Interpretation |
|---|---|---|---|
| B = 0 | No change | OR = 1.00 | Predictor has no effect |
| B = 0.5 | Increase by 0.5 | OR = 1.65 | 65% increase in odds per unit |
| B = 1.0 | Increase by 1.0 | OR = 2.72 | Odds multiply by about 2.7 per unit |
| B = -0.5 | Decrease by 0.5 | OR = 0.61 | 39% decrease in odds per unit |
| B = -1.0 | Decrease by 1.0 | OR = 0.37 | Odds cut to about one-third per unit |
Exponentiating a coefficient yields the odds ratio: OR = e^(B_k). An odds ratio greater than 1 indicates that higher values of x_k are associated with higher odds of the outcome, while an odds ratio less than 1 indicates lower odds. This straightforward interpretation is one of the main reasons logistic regression remains popular in applied research.
Modeling probabilities directly with a linear function creates problems. A linear function can produce values below 0 or above 1, which are invalid probabilities. The logit transformation eliminates this issue by mapping probabilities to an unbounded range where linear modeling is appropriate. Applying the sigmoid to the resulting linear combination guarantees that the predicted probability stays between 0 and 1.
Additionally, log-odds produce a model whose output is a continuous and differentiable function of the parameters, which is necessary for optimization methods like gradient descent and maximum likelihood estimation.
The parameters of a logistic regression model are typically estimated using maximum likelihood estimation (MLE). Unlike linear regression, there is no closed-form solution for the logistic regression coefficients. Instead, iterative optimization algorithms such as Newton-Raphson, iteratively reweighted least squares (IRLS), or gradient descent are used to find the parameter values that maximize the log-likelihood function.
The log-likelihood for logistic regression is:
L(B) = sum of [y_i * ln(sigma(x_i^T * B)) + (1 - y_i) * ln(1 - sigma(x_i^T * B))]
where y_i in {0, 1} are the observed labels and sigma is the sigmoid function. This is a concave function, guaranteeing that any local maximum is also the global maximum.
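A minimal sketch of maximizing this log-likelihood by plain gradient ascent in NumPy (the toy data, learning rate, and iteration count are illustrative assumptions; real implementations typically use Newton-Raphson or IRLS as noted above):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Toy data: intercept column plus one feature, with overlapping classes
X = np.array([[1, 0.5], [1, 1.0], [1, 1.5], [1, 2.0], [1, 2.5], [1, 3.0]])
y = np.array([0, 0, 1, 0, 1, 1])

B = np.zeros(X.shape[1])
for _ in range(5000):
    p = sigmoid(X @ B)        # current predicted probabilities
    grad = X.T @ (y - p)      # gradient of the log-likelihood with respect to B
    B += 0.1 * grad           # gradient ascent step

print(B)                      # fitted intercept and slope on the log-odds scale
```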
The log-odds representation plays a natural role in Bayesian inference, particularly when updating beliefs in light of new evidence.
Bayes' theorem can be expressed in odds form. If H is a hypothesis and D is observed data:
posterior odds = prior odds * Bayes factor
Taking the logarithm of both sides:
log(posterior odds) = log(prior odds) + log(Bayes factor)
More explicitly, for hypotheses H_1 and H_2:
log(P(H_1|D) / P(H_2|D)) = log(P(H_1) / P(H_2)) + log(P(D|H_1) / P(D|H_2))
This is an additive update rule: in the log-odds space, updating beliefs is simply a matter of adding the log Bayes factor to the prior log-odds. This is far more intuitive than multiplying probabilities, especially during sequential updating where multiple pieces of evidence arrive over time.
The log Bayes factor, sometimes called the "weight of evidence" (a term introduced by I. J. Good in his 1950 book Probability and the Weighing of Evidence), quantifies how much a piece of evidence supports one hypothesis over another:
W = ln(P(D|H_1) / P(D|H_2))
When W > 0, the evidence supports H_1. When W < 0, the evidence supports H_2. When W = 0, the evidence is neutral. This additive framework makes it straightforward to accumulate evidence from multiple independent sources by simply summing their individual weights.
Some authors measure evidence strength in decibans (tenths of a ban), analogous to decibels in acoustics. One ban corresponds to a likelihood ratio of 10 (using base-10 logarithms), while one deciban corresponds to a likelihood ratio of approximately 1.26.
| Evidence strength | Decibans | Likelihood ratio | Meaning |
|---|---|---|---|
| Barely worth mentioning | 0-5 | 1-3.2 | Very weak evidence |
| Substantial | 5-10 | 3.2-10 | Moderate support |
| Strong | 10-15 | 10-32 | Strong support |
| Very strong | 15-20 | 32-100 | Very strong support |
| Decisive | >20 | >100 | Overwhelming evidence |
This logarithmic scale makes it natural to compare and combine evidence from different sources. Sums are often more intuitive than products, which is one of the reasons thinking in log-odds is practical for working with cumulative evidence.
In many real-world applications, evidence arrives incrementally. The log-odds formulation is particularly convenient for sequential Bayesian updating: each new, independent piece of evidence simply adds its log Bayes factor to the running total, the current belief is stored as a single real number, and the repeated multiplication of small probabilities (with its risk of numerical underflow) is avoided.
These properties make log-odds the preferred representation in applications like robotics (occupancy grid mapping), medical diagnosis, and online learning systems.
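As an illustration, here is a minimal sketch of log-odds updating for a single occupancy-grid cell (the prior and the sensor likelihood ratios are made-up values for illustration):

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

log_odds = 0.0   # prior belief the cell is occupied: 50%, i.e. log-odds of 0

# Each independent sensor reading contributes its log likelihood ratio:
# "hit" readings (ratio 4) support occupancy, a "miss" (ratio 0.25) counts against it
for likelihood_ratio in (4, 4, 0.25, 4):
    log_odds += math.log(likelihood_ratio)

print(round(sigmoid(log_odds), 3))   # posterior occupancy probability, about 0.94
```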
In modern deep learning, the term "logits" refers to the raw, unnormalized output scores produced by the final layer of a neural network before any activation function is applied. Although this usage is technically an extension of the original statistical definition, it has become standard terminology in the field.
In a classification network, the final fully connected layer produces a vector of real-valued scores, one for each possible class. These scores are called logits. For a binary classification problem, a single logit is produced, and the sigmoid function converts it to a probability. For multi-class classification with K classes, a vector of K logits is produced, and the softmax function converts them to a probability distribution. In this multi-class context, individual logits are not strict log-odds in the binary sense, but they function analogously as unnormalized log-probabilities.
| Task | Number of logits | Activation function | Output |
|---|---|---|---|
| Binary classification | 1 | Sigmoid | Single probability |
| Multi-class classification | K (one per class) | Softmax | Probability distribution over K classes |
| Multi-label classification | K (one per label) | Sigmoid (per logit) | Independent probabilities |
Deep learning frameworks like PyTorch and TensorFlow prefer to work with logits rather than probabilities largely for reasons of numerical stability. Loss functions such as CrossEntropyLoss in PyTorch and tf.nn.softmax_cross_entropy_with_logits in TensorFlow combine the softmax and the log operation into a single numerically stable computation, avoiding the precision loss that occurs when small probabilities are computed first and then logged.

Temperature scaling is a technique that divides logits by a scalar T > 0 before applying the softmax function:
P(class = k) = e^(z_k / T) / sum_j e^(z_j / T)
where z_k is the logit for class k.
| Temperature | Effect | Use case |
|---|---|---|
| T < 1 | Sharpens distribution; model becomes more confident | Inference with high-confidence predictions |
| T = 1 | No change; recovers standard softmax | Default behavior |
| T > 1 | Flattens distribution; model becomes less confident | Model calibration, knowledge distillation |
| T approaches infinity | Approaches uniform distribution | Maximum entropy sampling, exploration |
| T approaches 0 | Approaches argmax (winner-take-all) | Greedy decoding |
Temperature scaling is widely used in large language models to control the randomness of text generation and in knowledge distillation to transfer knowledge from a large teacher model to a smaller student model.
Large language models (LLMs) such as GPT, Claude, and LLaMA produce logits as their primary output at each step of text generation. For every position in the output sequence, the model computes a logit for each token in its vocabulary (often 30,000 to 100,000+ tokens). These logits are divided by a temperature parameter and then passed through softmax to obtain the probability distribution for the next token:
adjusted_logits = logits / temperature
probabilities = softmax(adjusted_logits)
A temperature below 1.0 sharpens the distribution (making the most probable tokens even more likely), while a temperature above 1.0 flattens it (making less probable tokens more competitive). Sampling strategies such as top-k and top-p (nucleus sampling) operate on these adjusted logits to balance between deterministic and creative text generation.
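The following NumPy sketch shows temperature scaling of a toy logit vector followed by sampling (the vocabulary size, logit values, and temperature are illustrative assumptions; a real model would supply the logits):

```python
import numpy as np

rng = np.random.default_rng(0)

logits = np.array([2.0, 1.0, 0.5, -1.0])   # one raw score per token in a tiny vocabulary
temperature = 0.7

adjusted = logits / temperature
# numerically stable softmax: subtract the max before exponentiating
exp_scores = np.exp(adjusted - adjusted.max())
probs = exp_scores / exp_scores.sum()

next_token = rng.choice(len(logits), p=probs)
print(probs.round(3), next_token)
```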
Many LLM APIs provide access to "logprobs" (log probabilities), which are closely related to log-odds. Log probabilities are useful for gauging the model's confidence in individual tokens, comparing and reranking candidate completions, and computing evaluation metrics such as perplexity.
In gradient boosting algorithms for classification, such as XGBoost, LightGBM, and CatBoost, the model operates internally in log-odds space. The initial prediction is typically the log-odds of the positive class computed from the base rate in the training data. Each subsequent decision tree in the ensemble adds a correction to the current log-odds prediction, and the final prediction is the sum of all tree outputs. The log-odds total is then converted to a probability using the sigmoid function.
This approach works because the gradient of the binary cross-entropy loss with respect to the log-odds prediction has a clean mathematical form. Specifically, the gradient (pseudo-residual) at each step is simply the difference between the observed label and the current predicted probability, which is straightforward to compute and leads to efficient tree construction.
    initial prediction = log(count of positives / count of negatives)
    for each boosting round:
        residuals = observed labels - current predicted probabilities
        fit a decision tree to the residuals
        update log-odds predictions by adding tree outputs
    final probability = sigmoid(sum of all tree outputs)
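A minimal runnable version of this loop, using scikit-learn regression trees to fit the pseudo-residuals (the synthetic data, learning rate, and tree depth are illustrative assumptions, and adding raw tree outputs to the log-odds is a simplification; libraries such as XGBoost compute leaf values with a Newton step):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(float)

# Start from the log-odds of the base rate
log_odds = np.full(len(y), np.log(y.mean() / (1 - y.mean())))
trees, learning_rate = [], 0.3

for _ in range(20):
    residuals = y - sigmoid(log_odds)                     # pseudo-residuals
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    log_odds += learning_rate * tree.predict(X)           # additive log-odds update
    trees.append(tree)

print(sigmoid(log_odds)[:5].round(3))                     # final predicted probabilities
```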
In epidemiology and clinical research, the odds ratio (OR) is one of the most commonly reported measures of association between an exposure and an outcome, particularly in case-control studies and clinical trials. The log of the odds ratio is used extensively in meta-analysis and in logistic regression models applied to observational data.
| Odds ratio value | Log odds ratio | Interpretation |
|---|---|---|
| OR = 0.25 | ln(OR) = -1.386 | Strong protective association |
| OR = 0.5 | ln(OR) = -0.693 | Odds are halved |
| OR = 1 | ln(OR) = 0 | No association |
| OR > 1 | ln(OR) > 0 | Exposure associated with higher odds of outcome |
| OR = 2 | ln(OR) = 0.693 | Odds are doubled |
| OR = 4 | ln(OR) = 1.386 | Strong positive association |
A 95% confidence interval for the odds ratio that does not include 1.0 indicates statistical significance at the 5% level.
The log-odds ratio is preferred over the raw odds ratio in meta-analyses for several reasons: its sampling distribution is closer to normal, confidence intervals constructed on the log scale and then exponentiated are more accurate, and study-level effects can be combined additively before the pooled result is converted back to an odds ratio.
When the outcome of interest is rare (typically less than 10% incidence), the odds ratio approximates the relative risk (RR). This is known as the rare disease assumption and is one reason odds ratios are widely used in case-control studies, where relative risk cannot be directly estimated. However, when the outcome is common, the odds ratio can substantially overestimate the relative risk, and caution is needed when interpreting results.
The logit is not the only function that maps probabilities to the real line. Several alternatives exist, each derived from a different probability distribution.
| Link function | Formula | Underlying distribution | Common use case |
|---|---|---|---|
| Logit | ln(p / (1 - p)) | Logistic distribution | Logistic regression; default for binary outcomes |
| Probit | inverse of normal CDF | Normal distribution | Dose-response models; econometrics |
| Complementary log-log | ln(-ln(1 - p)) | Gumbel distribution | Survival analysis; asymmetric responses |
| Cauchit | tan(pi * (p - 0.5)) | Cauchy distribution | Heavy-tailed outcomes |
| Log-log | -ln(-ln(p)) | Reversed Gumbel | Rare events with right skew |
The logit and probit functions are both sigmoid-shaped transformations used in binary response models. They differ in their underlying distributional assumptions but produce nearly identical results for moderate probabilities. They diverge noticeably only in the extreme tails.
| Property | Logit model | Probit model |
|---|---|---|
| Link function | Logistic CDF inverse | Normal CDF inverse |
| Error distribution | Logistic distribution | Normal (Gaussian) distribution |
| Tail behavior | Heavier tails | Lighter tails |
| Coefficient interpretation | Log-odds change; exponentiate for odds ratio | Z-score change; no direct odds ratio |
| Computational cost | Closed-form sigmoid; faster | Requires numerical evaluation of normal CDF |
| Typical fields | Medicine, epidemiology, ML | Econometrics, psychometrics |
The logit model is generally preferred in applied work because the coefficients can be directly interpreted as log-odds changes, and the odds ratio interpretation (e^B) is intuitive and widely understood.
The choice of logarithmic base in the logit function determines the unit of measurement, connecting log-odds to information theory:
| Logarithmic base | Unit name | Symbol | Common usage |
|---|---|---|---|
| Base e (natural log) | Nat | nat | Statistics, machine learning |
| Base 2 | Shannon (or bit) | Sh | Information theory, digital systems |
| Base 10 | Hartley (or ban) | Hart | Engineering, some Bayesian analyses |
When using base-2 logarithms, the log-odds can be interpreted in terms of information content. A change of 1 bit in log-odds corresponds to a doubling or halving of the odds. Converting between bases is a simple scaling: log_b(x) = ln(x) / ln(b). The choice of base affects the numerical values but not the qualitative interpretation.
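Converting a log-odds value between bases is a one-line rescaling, as in this small sketch (the `logit` helper is ours):

```python
import math

def logit(p, base=math.e):
    """Log-odds of p in the given logarithmic base."""
    return math.log(p / (1 - p), base)

p = 0.8                      # odds of 4 to 1
print(logit(p))              # ~1.386 nats
print(logit(p, base=2))      # ~2.0 bits: two doublings of even odds
print(logit(p, base=10))     # ~0.602 bans, i.e. about 6 decibans
```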
This information-theoretic perspective also connects to the use of log probabilities in natural language processing. The negative expected value of log probabilities is the entropy of a distribution, a concept that underpins both cross-entropy loss functions and perplexity metrics.
Working with log-odds offers practical advantages in computation, especially when dealing with very small or very large probabilities.
When probabilities are extremely small (e.g., 10^-15), multiplying them together in floating-point arithmetic can lead to underflow (the result rounds to zero). In the log-odds domain, multiplications become additions, which are numerically stable:
log(p_1 * p_2 * ... * p_n) = log(p_1) + log(p_2) + ... + log(p_n)
This property is especially valuable in applications like hidden Markov models, Bayesian networks, naive Bayes classifiers, and occupancy grid mapping in robotics.
When converting back from log-odds to probabilities, the log-sum-exp trick prevents overflow and underflow:
log(e^a + e^b) = max(a, b) + log(1 + e^(-|a - b|))
This technique is used internally by deep learning frameworks when computing the softmax of logits and when evaluating the cross-entropy loss in a numerically stable way. All major frameworks, including PyTorch, TensorFlow, and JAX, implement this optimization.
The core numerical challenge arises from the fact that computers represent numbers with finite precision. Floating-point formats like FP32 have strict limits on how large or small a stored value can be. When computing exp(x), large positive values can exceed the maximum representable number and overflow to infinity, while large negative values can underflow to zero. By working with logits rather than probabilities, these extreme cases are handled more gracefully.
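A NumPy sketch of the stable pattern (the helper names are ours; libraries expose equivalents such as scipy.special.logsumexp):

```python
import numpy as np

def log_sum_exp(x):
    """log(sum(exp(x))) computed without overflow by factoring out the maximum."""
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

def stable_softmax(logits):
    """Softmax evaluated in the log domain: exp(logit - logsumexp(logits))."""
    return np.exp(logits - log_sum_exp(logits))

logits = np.array([1000.0, 999.0, 995.0])   # naive exp() would overflow here
print(stable_softmax(logits))               # ~[0.727, 0.268, 0.005], summing to 1
```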
The concept of log-odds has a long intellectual history spanning several centuries and disciplines.
| Year | Contributor | Contribution |
|---|---|---|
| 1838 | Pierre-Francois Verhulst | Proposed the logistic function for population growth modeling |
| 1860s | Charles Sanders Peirce | Used log-odds extensively in probability theory |
| 1934 | Chester Ittner Bliss | Introduced the probit function for dose-response analysis |
| 1944 | Joseph Berkson | Coined "logit" (logistic unit); proposed as alternative to probit for bioassay |
| 1949 | G. A. Barnard | Popularized the term "log-odds" |
| 1950 | I. J. Good | Formalized "weight of evidence" using log-odds in Bayesian theory |
| 1958 | David Cox | Developed the modern framework of logistic regression |
| 1970s-1980s | Daniel McFadden | Extended logit models to discrete choice; won Nobel Prize in Economics (2000) |
| 2010s | Deep learning era | "Logits" becomes standard terminology for raw neural network outputs |
Consider a logistic regression model for email spam detection with two features: the number of exclamation marks (x_1) and the number of suspicious links (x_2). Suppose the fitted model is:
logit(p) = -2.0 + 0.8 * x_1 + 1.5 * x_2
For an email with 3 exclamation marks and 2 suspicious links:
logit(p) = -2.0 + 0.8(3) + 1.5(2) = -2.0 + 2.4 + 3.0 = 3.4
p = sigma(3.4) = 1 / (1 + e^(-3.4)) = 0.968
The model predicts a 96.8% probability that this email is spam.
Interpreting the coefficients: each additional exclamation mark increases the log-odds of spam by 0.8, which corresponds to an odds ratio of e^0.8 = 2.23 (the odds of spam roughly double per exclamation mark). Each additional suspicious link increases the log-odds by 1.5, corresponding to an odds ratio of e^1.5 = 4.48 (the odds of spam roughly quadruple per link).
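The arithmetic is easy to check in a few lines (a quick sketch reusing the coefficients from the example):

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

b0, b1, b2 = -2.0, 0.8, 1.5
x1, x2 = 3, 2                               # exclamation marks, suspicious links

log_odds = b0 + b1 * x1 + b2 * x2           # 3.4
print(round(sigmoid(log_odds), 3))          # 0.968 -> predicted spam probability
print(round(math.exp(b1), 2), round(math.exp(b2), 2))   # odds ratios 2.23 and 4.48
```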
A doctor estimates a 10% prior probability that a patient has a disease (prior log-odds = ln(0.1/0.9) = -2.197). A test with a likelihood ratio of 15 (meaning the test is 15 times more likely to come back positive if the patient is sick than if they are healthy) is performed, and the result is positive:
log(Bayes factor) = ln(15) = 2.708
posterior log-odds = -2.197 + 2.708 = 0.511
posterior probability = sigma(0.511) = 1 / (1 + e^(-0.511)) = 0.625
After the positive test, the probability has risen from 10% to about 62.5%. If a second independent test (with a likelihood ratio of 8) is also positive, the update is another simple addition:
log(second Bayes factor) = ln(8) = 2.079
updated log-odds = 0.511 + 2.079 = 2.590
updated probability = sigma(2.590) = 0.930
After two positive tests, the probability has risen to about 93.0%.
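The same sequential update is a short computation per test (a quick check of the numbers above):

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

log_odds = math.log(0.1 / 0.9)        # prior log-odds, about -2.197
log_odds += math.log(15)              # first positive test (likelihood ratio 15)
print(round(sigmoid(log_odds), 3))    # about 0.625
log_odds += math.log(8)               # second positive test (likelihood ratio 8)
print(round(sigmoid(log_odds), 3))    # about 0.930
```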
In a dataset where 200 out of 1000 training examples belong to the positive class, a gradient boosting classifier initializes its prediction as:
initial log-odds = ln(200 / 800) = ln(0.25) = -1.386
initial probability = sigma(-1.386) = 0.20
Each subsequent tree then adjusts this log-odds value. If the first tree outputs +0.5 for a particular sample, the updated prediction becomes:
updated log-odds = -1.386 + 0.5 = -0.886
updated probability = sigma(-0.886) = 0.292