Log-odds, also known as the logit, is a mathematical transformation that converts a probability value between 0 and 1 into a real number spanning from negative infinity to positive infinity. Formally, the log-odds of a probability p is defined as the natural logarithm of the odds ratio: logit(p) = ln(p / (1 - p)). This transformation is foundational in logistic regression, Bayesian statistics, information theory, and modern deep learning, where it provides a mathematically convenient way to model binary outcomes, interpret classifier outputs, and update beliefs.
The term "logit" was coined in 1944 by the American statistician Joseph Berkson as a contraction of "logistic unit," paralleling the earlier "probit" (probability unit) introduced by Chester Ittner Bliss in 1934. The related term "log-odds" was popularized by G. A. Barnard in 1949. Despite being nearly 80 years old, the logit remains one of the most widely used functions in both classical statistics and contemporary machine learning.
Imagine you have a bag with 9 red marbles and 1 blue marble. The chance (probability) of picking a red marble is 90%, or 0.9. Now, the "odds" of picking red are 9 to 1, because for every 1 blue marble there are 9 red ones. Log-odds just means taking the logarithm of those odds, which turns the number 9 into about 2.2.
Why bother? Because probabilities are stuck between 0 and 1, which makes math tricky. Odds are stuck above 0, which is still limiting. But log-odds can be any number, positive or negative, which makes it much easier for a computer or a formula to work with. When the log-odds is 0, you have a 50-50 chance. Positive log-odds mean "more likely than not," and negative log-odds mean "less likely than not." The further the number is from 0, the more lopsided things are.
So log-odds is really just a clever trick to turn "how likely is something" into a number that plays nicely with addition, subtraction, and all the math that machine learning models need to do.
The logit function maps probabilities from the open interval (0, 1) to the entire real line (-infinity, +infinity). For a probability p:
logit(p) = ln(p / (1 - p)) = ln(p) - ln(1 - p)
Several equivalent expressions exist. For example, the logit can be written as a Taylor series expansion around p = 0.5:
logit(x) = 2 * sum of [(2x - 1)^(2n+1) / (2n + 1)] for n = 0, 1, 2, ...
The transformation proceeds in two stages. First, a probability p is converted to odds:
odds = p / (1 - p)
Then the natural logarithm is applied:
log-odds = ln(odds) = ln(p / (1 - p))
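As a quick illustration of the two-stage conversion, here is a minimal sketch using only Python's standard library (the helper names `to_odds` and `logit` are ours, not from any particular package):

```python
import math

def to_odds(p):
    """Convert a probability in (0, 1) to odds."""
    return p / (1 - p)

def logit(p):
    """Convert a probability in (0, 1) to log-odds (natural log of the odds)."""
    return math.log(to_odds(p))

print(to_odds(0.9))  # ~9      -> "9 to 1" odds
print(logit(0.9))    # ~2.197  -> positive: more likely than not
print(logit(0.5))    # 0.0     -> even odds
print(logit(0.1))    # ~-2.197 -> negative: less likely than not
```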
Each representation has a different range, which is summarized in the following table:
| Representation | Range | Key property |
|---|---|---|
| Probability | [0, 1] | Bounded on both sides |
| Odds | [0, +infinity) | Bounded on one side |
| Log-odds (logit) | (-infinity, +infinity) | Unbounded |
The following table illustrates how specific probabilities, odds, and log-odds relate to each other:
| Probability (p) | Odds (p / (1 - p)) | Log-odds (ln(odds)) | Interpretation |
|---|---|---|---|
| 0.01 | 0.0101 | -4.595 | Very unlikely |
| 0.05 | 0.0526 | -2.944 | Unlikely |
| 0.10 | 0.1111 | -2.197 | Unlikely |
| 0.20 | 0.2500 | -1.386 | Somewhat unlikely |
| 0.30 | 0.4286 | -0.847 | Somewhat unlikely |
| 0.40 | 0.6667 | -0.405 | Slightly below even |
| 0.50 | 1.0000 | 0.000 | Even odds |
| 0.60 | 1.5000 | 0.405 | Slightly above even |
| 0.70 | 2.3333 | 0.847 | Somewhat likely |
| 0.80 | 4.0000 | 1.386 | Likely |
| 0.90 | 9.0000 | 2.197 | Very likely |
| 0.95 | 19.000 | 2.944 | Very likely |
| 0.99 | 99.000 | 4.595 | Near certain |
The table highlights a key symmetry: log-odds values are symmetric around zero. A probability of 0.9 yields log-odds of approximately +2.197, while a probability of 0.1 yields log-odds of approximately -2.197.
The logit function has several mathematical properties that make it well suited for statistical modeling.
The domain of the logit function is the open interval (0, 1), excluding the endpoints: at p = 0 the logarithm diverges to negative infinity, and at p = 1 the odds are undefined because of division by zero. The range covers all real numbers from -infinity to +infinity. This unbounded range is one of the main reasons log-odds are preferred over raw probabilities in regression modeling, since a linear function of predictors naturally produces values across the entire real line.
The logit function is antisymmetric around p = 0.5:
logit(1 - p) = -logit(p)
This means that the log-odds of an event not happening is the negative of the log-odds of it happening. At p = 0.5, the logit equals zero, providing a natural reference point where neither outcome is favored. This symmetry also means that positive log-odds indicate the event is more likely than not, negative log-odds indicate it is less likely than not, and zero log-odds correspond to a 50-50 chance.
The logit is a strictly increasing function. As p increases from 0 to 1, logit(p) increases from -infinity to +infinity. This preserves the ordering of probabilities: if p_1 > p_2, then logit(p_1) > logit(p_2). The one-to-one mapping ensures that every probability corresponds to exactly one log-odds value and vice versa.
The derivative of the logit function is:
d/dp logit(p) = 1 / (p(1 - p))
This derivative is always positive on (0, 1), confirming monotonicity. The derivative is smallest at p = 0.5 (where it equals 4) and grows without bound as p approaches 0 or 1, meaning the logit "stretches" the probability scale more at the extremes.
Because the logarithm converts multiplication to addition, independent contributions to the odds can be combined by adding their log-odds rather than multiplying odds directly. This additive property is why linear models work naturally in log-odds space. In logistic regression, each predictor contributes an additive term to the log-odds, and the combined effect is the sum of all individual contributions.
The sampling distribution of the log-odds ratio is better approximated by a normal distribution than that of the odds ratio itself. This property makes the log-odds ratio especially useful for constructing confidence intervals and performing hypothesis tests in medical and social science research. It also means that the log-odds scale is more suitable for meta-analytic combination of results across studies.
The logit function is the negative of the derivative of the binary entropy function H(p) = -p ln(p) - (1 - p) ln(1 - p). This connection is significant in information theory and in the design of loss functions for classification tasks.
The inverse of the logit function is the sigmoid function (also called the logistic function or expit function):
sigma(x) = 1 / (1 + e^(-x)) = e^x / (e^x + 1)
Given a log-odds value x, the sigmoid function converts it back to a probability. The sigmoid can also be expressed as:
sigma(x) = (tanh(x/2) + 1) / 2
The logit and sigmoid functions form an inverse pair:
sigma(logit(p)) = p
logit(sigma(x)) = x
This inverse relationship is central to logistic regression, where the model assumes that the log-odds of the outcome is a linear function of the input features, and the sigmoid function converts those log-odds back into predicted probabilities.
| Property | Logit (log-odds) | Sigmoid (inverse logit) |
|---|---|---|
| Formula | ln(p / (1 - p)) | 1 / (1 + e^(-x)) |
| Input range | (0, 1) | (-infinity, +infinity) |
| Output range | (-infinity, +infinity) | (0, 1) |
| Midpoint | logit(0.5) = 0 | sigma(0) = 0.5 |
| Symmetry | logit(1 - p) = -logit(p) | sigma(-x) = 1 - sigma(x) |
| Use case | Transform probability to real line | Transform real line to probability |
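A quick numerical check of this inverse pair and of the symmetry in the table (a sketch with the Python standard library; `logit` and `sigmoid` are our own helper names):

```python
import math

def logit(p):
    return math.log(p / (1 - p))

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

for p in (0.1, 0.5, 0.9):
    x = logit(p)
    # sigmoid(logit(p)) recovers p, and logit(1 - p) = -logit(p)
    print(round(x, 3), round(sigmoid(x), 3), round(logit(1 - p), 3))
```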
Logistic regression is a generalized linear model for binary classification in which the logit serves as the canonical link function for the Bernoulli distribution. The model assumes that the log-odds of the positive class is a linear combination of the input features:
logit(p) = ln(p / (1 - p)) = B_0 + B_1*x_1 + B_2*x_2 + ... + B_n*x_n
Here, B_0 is the intercept (the log-odds when all predictors are zero) and each B_k represents the change in log-odds for a one-unit increase in x_k, holding all other variables constant.
The coefficients of a logistic regression model have a direct interpretation in terms of log-odds and odds ratios:
| Coefficient value | Effect on log-odds | Odds ratio (e^B) | Interpretation |
|---|---|---|---|
| B = 0 | No change | OR = 1.00 | Predictor has no effect |
| B = 0.5 | Increase by 0.5 | OR = 1.65 | 65% increase in odds per unit |
| B = 1.0 | Increase by 1.0 | OR = 2.72 | Odds multiply by about 2.7 per unit |
| B = -0.5 | Decrease by 0.5 | OR = 0.61 | 39% decrease in odds per unit |
| B = -1.0 | Decrease by 1.0 | OR = 0.37 | Odds cut to about one-third per unit |
Exponentiating a coefficient yields the odds ratio: OR = e^(B_k). An odds ratio greater than 1 indicates that higher values of x_k are associated with higher odds of the outcome, while an odds ratio less than 1 indicates lower odds. This straightforward interpretation is one of the main reasons logistic regression remains popular in applied research.
Modeling probabilities directly with a linear function creates problems. A linear function can produce values below 0 or above 1, which are invalid probabilities. The logit transformation eliminates this issue by mapping probabilities to an unbounded range where linear modeling is appropriate. Applying the sigmoid to the resulting linear combination guarantees that the predicted probability stays between 0 and 1.
Additionally, log-odds produce a model whose output is a continuous and differentiable function of the parameters, which is necessary for optimization methods like gradient descent and maximum likelihood estimation.
The parameters of a logistic regression model are typically estimated using maximum likelihood estimation (MLE). Unlike linear regression, there is no closed-form solution for the logistic regression coefficients. Instead, iterative optimization algorithms such as Newton-Raphson, iteratively reweighted least squares (IRLS), or gradient descent are used to find the parameter values that maximize the log-likelihood function.
The log-likelihood for logistic regression is:
L(B) = sum of [y_i * ln(sigma(x_i^T * B)) + (1 - y_i) * ln(1 - sigma(x_i^T * B))]
where y_i in {0, 1} are the observed labels and sigma is the sigmoid function. This is a concave function, guaranteeing that any local maximum is also the global maximum.
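A minimal sketch of maximizing this log-likelihood by plain gradient ascent in NumPy (the toy data, learning rate, and iteration count are illustrative assumptions; real implementations typically use Newton-Raphson or IRLS as noted above):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Toy data: intercept column plus one feature, with overlapping classes
X = np.array([[1, 0.5], [1, 1.0], [1, 1.5], [1, 2.0], [1, 2.5], [1, 3.0]])
y = np.array([0, 0, 1, 0, 1, 1])

B = np.zeros(X.shape[1])
for _ in range(5000):
    p = sigmoid(X @ B)        # current predicted probabilities
    grad = X.T @ (y - p)      # gradient of the log-likelihood with respect to B
    B += 0.1 * grad           # gradient ascent step

print(B)                      # fitted intercept and slope on the log-odds scale
```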
The log-odds representation plays a natural role in Bayesian inference, particularly when updating beliefs in light of new evidence.
Bayes' theorem can be expressed in odds form. If H is a hypothesis and D is observed data:
posterior odds = prior odds * Bayes factor
Taking the logarithm of both sides:
log(posterior odds) = log(prior odds) + log(Bayes factor)
More explicitly, for hypotheses H_1 and H_2:
log(P(H_1|D) / P(H_2|D)) = log(P(H_1) / P(H_2)) + log(P(D|H_1) / P(D|H_2))
This is an additive update rule: in the log-odds space, updating beliefs is simply a matter of adding the log Bayes factor to the prior log-odds. This is far more intuitive than multiplying probabilities, especially during sequential updating where multiple pieces of evidence arrive over time.
The log Bayes factor, sometimes called the "weight of evidence" (a term introduced by I. J. Good in his 1950 book Probability and the Weighing of Evidence), quantifies how much a piece of evidence supports one hypothesis over another:
W = ln(P(D|H_1) / P(D|H_2))
When W > 0, the evidence supports H_1. When W < 0, the evidence supports H_2. When W = 0, the evidence is neutral. This additive framework makes it straightforward to accumulate evidence from multiple independent sources by simply summing their individual weights.
Some authors measure evidence strength in decibans (tenths of a ban), analogous to decibels in acoustics. One ban corresponds to a likelihood ratio of 10 (using base-10 logarithms), while one deciban corresponds to a likelihood ratio of approximately 1.26.
| Evidence strength | Decibans | Likelihood ratio | Meaning |
|---|---|---|---|
| Barely worth mentioning | 0-5 | 1-3.2 | Very weak evidence |
| Substantial | 5-10 | 3.2-10 | Moderate support |
| Strong | 10-15 | 10-32 | Strong support |
| Very strong | 15-20 | 32-100 | Very strong support |
| Decisive | >20 | >100 | Overwhelming evidence |
This logarithmic scale makes it natural to compare and combine evidence from different sources. Sums are often more intuitive than products, which is one of the reasons thinking in log-odds is practical for working with cumulative evidence.
In many real-world applications, evidence arrives incrementally. The log-odds formulation is particularly convenient for sequential Bayesian updating: each new, independent piece of evidence simply adds its log Bayes factor to the running total, the current belief is stored as a single real number, and the repeated multiplication of small probabilities (with its risk of numerical underflow) is avoided.
These properties make log-odds the preferred representation in applications like robotics (occupancy grid mapping), medical diagnosis, and online learning systems.
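As an illustration, here is a minimal sketch of log-odds updating for a single occupancy-grid cell (the prior and the sensor likelihood ratios are made-up values for illustration):

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

log_odds = 0.0   # prior belief the cell is occupied: 50%, i.e. log-odds of 0

# Each independent sensor reading contributes its log likelihood ratio:
# "hit" readings (ratio 4) support occupancy, a "miss" (ratio 0.25) counts against it
for likelihood_ratio in (4, 4, 0.25, 4):
    log_odds += math.log(likelihood_ratio)

print(round(sigmoid(log_odds), 3))   # posterior occupancy probability, about 0.94
```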
In modern deep learning, the term "logits" refers to the raw, unnormalized output scores produced by the final layer of a neural network before any activation function is applied. Although this usage is technically an extension of the original statistical definition, it has become standard terminology in the field.
In a classification network, the final fully connected layer produces a vector of real-valued scores, one for each possible class. These scores are called logits. For a binary classification problem, a single logit is produced, and the sigmoid function converts it to a probability. For multi-class classification with K classes, a vector of K logits is produced, and the softmax function converts them to a probability distribution. In this multi-class context, individual logits are not strict log-odds in the binary sense, but they function analogously as unnormalized log-probabilities.
| Task | Number of logits | Activation function | Output |
|---|---|---|---|
| Binary classification | 1 | Sigmoid | Single probability |
| Multi-class classification | K (one per class) | Softmax | Probability distribution over K classes |
| Multi-label classification | K (one per label) | Sigmoid (per logit) | Independent probabilities |
Deep learning frameworks like PyTorch and TensorFlow prefer to work with logits rather than probabilities largely for reasons of numerical stability. Loss functions such as CrossEntropyLoss in PyTorch and tf.nn.softmax_cross_entropy_with_logits in TensorFlow combine the softmax and the log operation into a single numerically stable computation, avoiding the precision loss that occurs when small probabilities are computed first and then logged.

Temperature scaling is a technique that divides logits by a scalar T > 0 before applying the softmax function:
P(class = k) = e^(z_k / T) / sum_j e^(z_j / T)
where z_k is the logit for class k.
| Temperature | Effect | Use case |
|---|---|---|
| T < 1 | Sharpens distribution; model becomes more confident | Inference with high-confidence predictions |
| T = 1 | No change; recovers standard softmax | Default behavior |
| T > 1 | Flattens distribution; model becomes less confident | Model calibration, knowledge distillation |
| T approaches infinity | Approaches uniform distribution | Maximum entropy sampling, exploration |
| T approaches 0 | Approaches argmax (winner-take-all) | Greedy decoding |
Temperature scaling is widely used in large language models to control the randomness of text generation and in knowledge distillation to transfer knowledge from a large teacher model to a smaller student model.
Large language models (LLMs) such as GPT, Claude, and LLaMA produce logits as their primary output at each step of text generation. For every position in the output sequence, the model computes a logit for each token in its vocabulary (often 30,000 to 100,000+ tokens). These logits are divided by a temperature parameter and then passed through softmax to obtain the probability distribution for the next token:
adjusted_logits = logits / temperature
probabilities = softmax(adjusted_logits)
A temperature below 1.0 sharpens the distribution (making the most probable tokens even more likely), while a temperature above 1.0 flattens it (making less probable tokens more competitive). Sampling strategies such as top-k and top-p (nucleus sampling) operate on these adjusted logits to balance between deterministic and creative text generation.
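The following NumPy sketch shows temperature scaling of a toy logit vector followed by sampling (the vocabulary size, logit values, and temperature are illustrative assumptions; a real model would supply the logits):

```python
import numpy as np

rng = np.random.default_rng(0)

logits = np.array([2.0, 1.0, 0.5, -1.0])   # one raw score per token in a tiny vocabulary
temperature = 0.7

adjusted = logits / temperature
# numerically stable softmax: subtract the max before exponentiating
exp_scores = np.exp(adjusted - adjusted.max())
probs = exp_scores / exp_scores.sum()

next_token = rng.choice(len(logits), p=probs)
print(probs.round(3), next_token)
```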
Many LLM APIs provide access to "logprobs" (log probabilities), which are closely related to log-odds. Log probabilities are useful for gauging the model's confidence in individual tokens, comparing and reranking candidate completions, and computing evaluation metrics such as perplexity.
In gradient boosting algorithms for classification, such as XGBoost, LightGBM, and CatBoost, the model operates internally in log-odds space. The initial prediction is typically the log-odds of the positive class computed from the base rate in the training data. Each subsequent decision tree in the ensemble adds a correction to the current log-odds prediction, and the final prediction is the sum of all tree outputs. The log-odds total is then converted to a probability using the sigmoid function.
This approach works because the gradient of the binary cross-entropy loss with respect to the log-odds prediction has a clean mathematical form. Specifically, the gradient (pseudo-residual) at each step is simply the difference between the observed label and the current predicted probability, which is straightforward to compute and leads to efficient tree construction.
    initial prediction = log(count of positives / count of negatives)
    for each boosting round:
        residuals = observed labels - current predicted probabilities
        fit a decision tree to the residuals
        update log-odds predictions by adding tree outputs
    final probability = sigmoid(sum of all tree outputs)
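A minimal runnable version of this loop, using scikit-learn regression trees to fit the pseudo-residuals (the synthetic data, learning rate, and tree depth are illustrative assumptions, and adding raw tree outputs to the log-odds is a simplification; libraries such as XGBoost compute leaf values with a Newton step):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(float)

# Start from the log-odds of the base rate
log_odds = np.full(len(y), np.log(y.mean() / (1 - y.mean())))
trees, learning_rate = [], 0.3

for _ in range(20):
    residuals = y - sigmoid(log_odds)                     # pseudo-residuals
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    log_odds += learning_rate * tree.predict(X)           # additive log-odds update
    trees.append(tree)

print(sigmoid(log_odds)[:5].round(3))                     # final predicted probabilities
```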
In epidemiology and clinical research, the odds ratio (OR) is one of the most commonly reported measures of association between an exposure and an outcome, particularly in case-control studies and clinical trials. The log of the odds ratio is used extensively in meta-analysis and in logistic regression models applied to observational data.
| Odds ratio value | Log odds ratio | Interpretation |
|---|---|---|
| OR = 0.25 | ln(OR) = -1.386 | Strong protective association |
| OR = 0.5 | ln(OR) = -0.693 | Odds are halved |
| OR = 1 | ln(OR) = 0 | No association |
| OR > 1 | ln(OR) > 0 | Exposure associated with higher odds of outcome |
| OR = 2 | ln(OR) = 0.693 | Odds are doubled |
| OR = 4 | ln(OR) = 1.386 | Strong positive association |
A 95% confidence interval for the odds ratio that does not include 1.0 indicates statistical significance at the 5% level.
The log-odds ratio is preferred over the raw odds ratio in meta-analyses for several reasons: its sampling distribution is closer to normal, confidence intervals constructed on the log scale and then exponentiated are more accurate, and study-level effects can be combined additively before the pooled result is converted back to an odds ratio.
When the outcome of interest is rare (typically less than 10% incidence), the odds ratio approximates the relative risk (RR). This is known as the rare disease assumption and is one reason odds ratios are widely used in case-control studies, where relative risk cannot be directly estimated. However, when the outcome is common, the odds ratio can substantially overestimate the relative risk, and caution is needed when interpreting results.
The logit is not the only function that maps probabilities to the real line. Several alternatives exist, each derived from a different probability distribution.
| Link function | Formula | Underlying distribution | Common use case |
|---|---|---|---|
| Logit | ln(p / (1 - p)) | Logistic distribution | Logistic regression; default for binary outcomes |
| Probit | inverse of normal CDF | Normal distribution | Dose-response models; econometrics |
| Complementary log-log | ln(-ln(1 - p)) | Gumbel distribution | Survival analysis; asymmetric responses |
| Cauchit | tan(pi * (p - 0.5)) | Cauchy distribution | Heavy-tailed outcomes |
| Log-log | -ln(-ln(p)) | Reversed Gumbel | Rare events with right skew |
The logit and probit functions are both sigmoid-shaped transformations used in binary response models. They differ in their underlying distributional assumptions but produce nearly identical results for moderate probabilities. They diverge noticeably only in the extreme tails.
| Property | Logit model | Probit model |
|---|---|---|
| Link function | Logistic CDF inverse | Normal CDF inverse |
| Error distribution | Logistic distribution | Normal (Gaussian) distribution |
| Tail behavior | Heavier tails | Lighter tails |
| Coefficient interpretation | Log-odds change; exponentiate for odds ratio | Z-score change; no direct odds ratio |
| Computational cost | Closed-form sigmoid; faster | Requires numerical evaluation of normal CDF |
| Typical fields | Medicine, epidemiology, ML | Econometrics, psychometrics |
The logit model is generally preferred in applied work because the coefficients can be directly interpreted as log-odds changes, and the odds ratio interpretation (e^B) is intuitive and widely understood.
The choice of logarithmic base in the logit function determines the unit of measurement, connecting log-odds to information theory:
| Logarithmic base | Unit name | Symbol | Common usage |
|---|---|---|---|
| Base e (natural log) | Nat | nat | Statistics, machine learning |
| Base 2 | Shannon (or bit) | Sh | Information theory, digital systems |
| Base 10 | Hartley (or ban) | Hart | Engineering, some Bayesian analyses |
When using base-2 logarithms, the log-odds can be interpreted in terms of information content. A change of 1 bit in log-odds corresponds to a doubling or halving of the odds. Converting between bases is a simple scaling: log_b(x) = ln(x) / ln(b). The choice of base affects the numerical values but not the qualitative interpretation.
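Converting a log-odds value between bases is a one-line rescaling, as in this small sketch (the `logit` helper is ours):

```python
import math

def logit(p, base=math.e):
    """Log-odds of p in the given logarithmic base."""
    return math.log(p / (1 - p), base)

p = 0.8                      # odds of 4 to 1
print(logit(p))              # ~1.386 nats
print(logit(p, base=2))      # ~2.0 bits: two doublings of even odds
print(logit(p, base=10))     # ~0.602 bans, i.e. about 6 decibans
```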
This information-theoretic perspective also connects to the use of log probabilities in natural language processing. The negative expected value of log probabilities is the entropy of a distribution, a concept that underpins both cross-entropy loss functions and perplexity metrics.
Working with log-odds offers practical advantages in computation, especially when dealing with very small or very large probabilities.
When probabilities are extremely small (e.g., 10^-15), multiplying them together in floating-point arithmetic can lead to underflow (the result rounds to zero). In the log-odds domain, multiplications become additions, which are numerically stable:
log(p_1 * p_2 * ... * p_n) = log(p_1) + log(p_2) + ... + log(p_n)
This property is especially valuable in applications like hidden Markov models, Bayesian networks, naive Bayes classifiers, and occupancy grid mapping in robotics.
When converting back from log-odds to probabilities, the log-sum-exp trick prevents overflow and underflow:
log(e^a + e^b) = max(a, b) + log(1 + e^(-|a - b|))
This technique is used internally by deep learning frameworks when computing the softmax of logits and when evaluating the cross-entropy loss in a numerically stable way. All major frameworks, including PyTorch, TensorFlow, and JAX, implement this optimization.
The core numerical challenge arises from the fact that computers represent numbers with finite precision. Floating-point formats like FP32 have strict limits on how large or small a stored value can be. When computing exp(x), large positive values can exceed the maximum representable number and overflow to infinity, while large negative values can underflow to zero. By working with logits rather than probabilities, these extreme cases are handled more gracefully.
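A NumPy sketch of the stable pattern (the helper names are ours; libraries expose equivalents such as scipy.special.logsumexp):

```python
import numpy as np

def log_sum_exp(x):
    """log(sum(exp(x))) computed without overflow by factoring out the maximum."""
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

def stable_softmax(logits):
    """Softmax evaluated in the log domain: exp(logit - logsumexp(logits))."""
    return np.exp(logits - log_sum_exp(logits))

logits = np.array([1000.0, 999.0, 995.0])   # naive exp() would overflow here
print(stable_softmax(logits))               # ~[0.727, 0.268, 0.005], summing to 1
```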
The concept of log-odds has a long intellectual history spanning several centuries and disciplines.
| Year | Contributor | Contribution |
|---|---|---|
| 1838 | Pierre-Francois Verhulst | Proposed the logistic function for population growth modeling |
| 1860s | Charles Sanders Peirce | Used log-odds extensively in probability theory |
| 1934 | Chester Ittner Bliss | Introduced the probit function for dose-response analysis |
| 1944 | Joseph Berkson | Coined "logit" (logistic unit); proposed as alternative to probit for bioassay |
| 1949 | G. A. Barnard | Popularized the term "log-odds" |
| 1950 | I. J. Good | Formalized "weight of evidence" using log-odds in Bayesian theory |
| 1958 | David Cox | Developed the modern framework of logistic regression |
| 1970s-1980s | Daniel McFadden | Extended logit models to discrete choice; won Nobel Prize in Economics (2000) |
| 2010s | Deep learning era | "Logits" becomes standard terminology for raw neural network outputs |
Consider a logistic regression model for email spam detection with two features: the number of exclamation marks (x_1) and the number of suspicious links (x_2). Suppose the fitted model is:
logit(p) = -2.0 + 0.8 * x_1 + 1.5 * x_2
For an email with 3 exclamation marks and 2 suspicious links:
logit(p) = -2.0 + 0.8(3) + 1.5(2) = -2.0 + 2.4 + 3.0 = 3.4
p = sigma(3.4) = 1 / (1 + e^(-3.4)) = 0.968
The model predicts a 96.8% probability that this email is spam.
Interpreting the coefficients: each additional exclamation mark increases the log-odds of spam by 0.8, which corresponds to an odds ratio of e^0.8 = 2.23 (the odds of spam roughly double per exclamation mark). Each additional suspicious link increases the log-odds by 1.5, corresponding to an odds ratio of e^1.5 = 4.48 (the odds of spam roughly quadruple per link).
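The arithmetic is easy to check in a few lines (a quick sketch reusing the coefficients from the example):

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

b0, b1, b2 = -2.0, 0.8, 1.5
x1, x2 = 3, 2                               # exclamation marks, suspicious links

log_odds = b0 + b1 * x1 + b2 * x2           # 3.4
print(round(sigmoid(log_odds), 3))          # 0.968 -> predicted spam probability
print(round(math.exp(b1), 2), round(math.exp(b2), 2))   # odds ratios 2.23 and 4.48
```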
A doctor estimates a 10% prior probability that a patient has a disease (prior log-odds = ln(0.1/0.9) = -2.197). A test with a likelihood ratio of 15 (meaning the test is 15 times more likely to come back positive if the patient is sick than if they are healthy) is performed, and the result is positive:
log(Bayes factor) = ln(15) = 2.708
posterior log-odds = -2.197 + 2.708 = 0.511
posterior probability = sigma(0.511) = 1 / (1 + e^(-0.511)) = 0.625
After the positive test, the probability has risen from 10% to about 62.5%. If a second independent test (with a likelihood ratio of 8) is also positive, the update is another simple addition:
log(second Bayes factor) = ln(8) = 2.079
updated log-odds = 0.511 + 2.079 = 2.590
updated probability = sigma(2.590) = 0.930
After two positive tests, the probability has risen to about 93.0%.
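The same sequential update is a short computation per test (a quick check of the numbers above):

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

log_odds = math.log(0.1 / 0.9)        # prior log-odds, about -2.197
log_odds += math.log(15)              # first positive test (likelihood ratio 15)
print(round(sigmoid(log_odds), 3))    # about 0.625
log_odds += math.log(8)               # second positive test (likelihood ratio 8)
print(round(sigmoid(log_odds), 3))    # about 0.930
```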
In a dataset where 200 out of 1000 training examples belong to the positive class, a gradient boosting classifier initializes its prediction as:
initial log-odds = ln(200 / 800) = ln(0.25) = -1.386
initial probability = sigma(-1.386) = 0.20
Each subsequent tree then adjusts this log-odds value. If the first tree outputs +0.5 for a particular sample, the updated prediction becomes:
updated log-odds = -1.386 + 0.5 = -0.886
updated probability = sigma(-0.886) = 0.292