See also: Machine learning terms
Logistic regression is a statistical method used to model the probability of a binary outcome based on one or more independent variables. In the context of machine learning, logistic regression is a supervised learning algorithm widely used for binary classification problems. Despite the word "regression" in its name, logistic regression is fundamentally a classification model; it predicts the probability that an observation belongs to a particular class, then applies a decision threshold to assign a label.
The technique has deep historical roots. The underlying logistic function was first proposed by the Belgian mathematician Pierre François Verhulst in 1838 as a model of population growth. Verhulst named the function in 1845, but his work was largely forgotten after his early death. Raymond Pearl and Lowell Reed rediscovered the logistic function independently in 1920 while studying population dynamics at Johns Hopkins University. The application of the logistic function to statistical regression began with Joseph Berkson, who coined the term "logit" (from "logistic unit") in 1944 and introduced logistic regression as a tool for bioassay analysis. British statistician David Cox then developed and popularized the method as a general-purpose statistical model in his landmark 1958 paper, "The Regression Analysis of Binary Sequences." Cox later extended the model to multinomial outcomes in 1966. Since about 1970, logistic regression has been the most commonly used model for binary regression across the sciences.
Logistic regression is one of the most widely used algorithms in practice because of its simplicity, interpretability, and strong theoretical foundations. It serves as a building block for more complex models, including neural networks, where the sigmoid function used in logistic regression appears as a core component.
The foundation of logistic regression is the sigmoid function (also called the logistic function), which maps any real-valued number to a value between 0 and 1:
sigma(z) = 1 / (1 + e^(-z))
where z is the linear combination of input features and model weights:
z = beta_0 + beta_1 * x_1 + beta_2 * x_2 + ... + beta_p * x_p
The model therefore estimates the conditional probability of the positive class as:
P(y = 1 | x) = sigma(beta_0 + beta_1 * x_1 + ... + beta_p * x_p)
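As a brief illustration, the following Python sketch (using NumPy, with made-up coefficient values) computes z and the resulting probability for a single observation:
import numpy as np
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))  # maps any real number into (0, 1)
beta = np.array([-1.5, 0.8, 2.0])  # hypothetical coefficients [beta_0, beta_1, beta_2]
x = np.array([1.0, 0.5, 1.2])      # [1 for the intercept, x_1, x_2]
z = np.dot(beta, x)                # beta_0 + beta_1 * x_1 + beta_2 * x_2
p = sigmoid(z)                     # P(y = 1 | x), here about 0.79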
The sigmoid function has several important properties that make it suitable for modeling probabilities:
| Property | Description |
|---|---|
| Output range | The function output is always between 0 and 1, making it interpretable as a probability |
| S-shaped curve | The function transitions smoothly from 0 to 1 |
| Symmetry | sigma(z) + sigma(-z) = 1 for all values of z |
| Derivative | sigma'(z) = sigma(z) * (1 - sigma(z)), which simplifies gradient computation during training |
| Midpoint | At z = 0, the output is 0.5, providing a natural threshold for binary classification |
| Asymptotic behavior | Approaches 1 as z goes to positive infinity and 0 as z goes to negative infinity, but never reaches either value |
The logit is the canonical link function for the Bernoulli distribution, and the sigmoid is its inverse, which makes logistic regression a natural member of the generalized linear model (GLM) family.
The logistic regression model can be understood through the concept of log-odds (also called the logit). If p is the probability of the positive class, then the odds of the positive class are:
odds = p / (1 - p)
The logit function is the natural logarithm of the odds:
logit(p) = ln(p / (1 - p))
Logistic regression assumes that the logit of the probability is a linear function of the input features:
ln(p / (1 - p)) = beta_0 + beta_1 * x_1 + beta_2 * x_2 + ... + beta_p * x_p
This is a key insight: while the relationship between the features and the probability is nonlinear (due to the sigmoid function), the relationship between the features and the log-odds is linear. Each coefficient beta_j represents the change in the log-odds of the outcome for a one-unit increase in the corresponding feature x_j, holding all other features constant.
The coefficients in logistic regression have a direct interpretation in terms of odds ratios. The odds ratio for feature x_j is:
OR_j = e^(beta_j)
An odds ratio greater than 1 indicates that increasing the feature value increases the odds of the positive class, while an odds ratio less than 1 indicates a decrease. For example, if the coefficient for a feature is 0.7, the odds ratio is e^0.7 ≈ 2.01, meaning a one-unit increase in that feature roughly doubles the odds of the positive outcome.
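A quick numeric check of this interpretation, using the hypothetical coefficient of 0.7 from the example above:
import numpy as np
beta_j = 0.7                             # hypothetical coefficient for feature x_j
odds_ratio = np.exp(beta_j)              # about 2.01
p_before = 0.30                          # assumed baseline probability
odds_before = p_before / (1 - p_before)  # about 0.43
odds_after = odds_before * odds_ratio    # the odds are multiplied by the odds ratio
p_after = odds_after / (1 + odds_after)  # convert back to a probability, about 0.46
Note that the odds roughly double while the probability rises only from 0.30 to about 0.46; odds ratios and changes in probability are related but not interchangeable.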
| Coefficient value | Odds ratio | Interpretation |
|---|---|---|
| Positive (beta_j > 0) | Greater than 1 | Increases odds of positive class |
| Zero (beta_j = 0) | Equal to 1 | No effect on odds |
| Negative (beta_j < 0) | Less than 1 | Decreases odds of positive class |
While odds ratios are the standard way to interpret logistic regression coefficients, they can be unintuitive. An alternative interpretation uses marginal effects, which quantify the change in the predicted probability for a one-unit change in a predictor. Unlike the coefficient itself (which is constant), the marginal effect of a feature depends on the current values of all predictors, because the sigmoid function is nonlinear. The marginal effect of feature x_j is:
partial P / partial x_j = beta_j * sigma(z) * (1 - sigma(z))
Practitioners often report the "average marginal effect" (AME), computed by averaging the marginal effect over all observations in the dataset. Marginal effects provide a more intuitive measure of a variable's real-world impact on the predicted probability.
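A minimal sketch of computing the average marginal effect by hand with NumPy, assuming a fitted coefficient vector beta (including the intercept) and a design matrix X with a leading column of ones; statsmodels offers a comparable computation through the get_margeff() method on fitted results:
import numpy as np
def average_marginal_effect(beta, X, j):
    # X: (n, p + 1) design matrix with a leading column of ones; j: index of the feature of interest
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))     # predicted probabilities sigma(z)
    marginal_effects = beta[j] * p * (1 - p)  # beta_j * sigma(z) * (1 - sigma(z)) per observation
    return marginal_effects.mean()            # average over all observations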
Unlike linear regression, logistic regression cannot be trained effectively using ordinary least squares. The nonlinear sigmoid function makes the sum of squared errors a non-convex function of the parameters, meaning that gradient-based methods could get stuck in local minima. Instead, logistic regression uses maximum likelihood estimation (MLE) to find the optimal model parameters.
MLE aims to find the parameters that maximize the probability (likelihood) of observing the training data given the model. For a binary classification problem with n observations, the likelihood function is:
L(beta) = Product from i=1 to n of p_i^(y_i) * (1 - p_i)^(1 - y_i)
where y_i is the true label (0 or 1) for the i-th observation and p_i is the predicted probability for that observation.
In practice, maximizing the likelihood is equivalent to minimizing the negative log-likelihood (also called the cross-entropy loss or log loss):
J(beta) = -(1/n) * Sum from i=1 to n of [y_i * ln(p_i) + (1 - y_i) * ln(1 - p_i)]
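As an illustration, the log loss can be computed directly from the predicted probabilities and true labels; this sketch mirrors the formula above (scikit-learn provides the same computation as sklearn.metrics.log_loss):
import numpy as np
def log_loss(y_true, p_pred, eps=1e-15):
    p = np.clip(p_pred, eps, 1 - eps)  # avoid taking log(0)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
y_true = np.array([1, 0, 1, 1, 0])
p_pred = np.array([0.9, 0.2, 0.7, 0.6, 0.1])
print(log_loss(y_true, p_pred))  # approximately 0.26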
The negative log-likelihood is a convex function of the parameters, so any minimum reached by an optimization algorithm is the global minimum. This property holds because the log-likelihood of a GLM with a canonical link (here, the Bernoulli distribution with the logit link) is concave in the parameters.
Several optimization algorithms can minimize the log loss:
| Algorithm | Description | When to use |
|---|---|---|
| Gradient descent | Iteratively updates parameters using the gradient | General-purpose, easy to implement |
| Newton's method (IRLS) | Uses second-order derivatives for faster convergence | Small to medium datasets |
| L-BFGS | Quasi-Newton method, approximates the Hessian | Medium to large datasets |
| Stochastic gradient descent | Updates parameters using one sample (or mini-batch) at a time | Very large datasets, online learning |
| Coordinate descent | Optimizes one parameter at a time | Useful with L1 regularization |
Iteratively Reweighted Least Squares (IRLS), a form of Newton's method, is the classical algorithm for logistic regression and is used in R's glm() function. Modern implementations in libraries like scikit-learn typically use L-BFGS or coordinate descent. Unlike linear regression, there is no closed-form solution for the logistic regression parameters; iterative numerical methods are always required.
The gradient of the log loss with respect to each parameter beta_j is:
partial J / partial beta_j = (1/n) * Sum from i=1 to n of (p_i - y_i) * x_ij
where p_i = sigma(beta_0 + beta_1 * x_i1 + ... + beta_p * x_ip) is the predicted probability for observation i. This gradient has an elegant form: it is the average of the product of the prediction error (p_i - y_i) and the feature value x_ij, which is the same structure as the gradient for linear regression. The difference is that p_i is computed through the sigmoid function rather than being a direct linear combination.
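This gradient leads directly to a simple batch gradient descent trainer. The following is a minimal NumPy sketch, not an optimized implementation, assuming X is an (n, p) feature matrix and y a vector of 0/1 labels:
import numpy as np
def fit_logistic_gd(X, y, lr=0.1, n_iter=1000):
    Xb = np.column_stack([np.ones(len(X)), X])   # prepend a column of ones so beta[0] is the intercept
    beta = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(Xb @ beta)))   # predicted probabilities
        gradient = Xb.T @ (p - y) / len(y)       # (1/n) * sum of (p_i - y_i) * x_ij
        beta -= lr * gradient                    # gradient descent update
    return beta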
The decision boundary is the surface that separates the predicted classes. In binary logistic regression, the model predicts class 1 when p >= threshold and class 0 otherwise. The default threshold is 0.5.
At the decision boundary, the predicted probability equals the threshold. For a threshold of 0.5, this corresponds to z = 0, so the decision boundary is the set of points where:
beta_0 + beta_1 * x_1 + beta_2 * x_2 + ... + beta_p * x_p = 0
This is a linear decision boundary: a line in two dimensions, a plane in three dimensions, or a hyperplane in higher dimensions. This linear boundary is both a strength and a limitation of logistic regression. It works well when the classes are approximately linearly separable, but it cannot capture complex, nonlinear boundaries without feature engineering.
Although logistic regression produces a linear decision boundary by default, practitioners can create nonlinear boundaries by engineering polynomial or interaction features. For example, adding x_1^2, x_2^2, and x_1 * x_2 as additional features allows the decision boundary to take the shape of an ellipse, parabola, or other conic section. This approach is similar in spirit to the kernel trick used in support vector machines, though it requires explicit feature construction.
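A sketch of this idea with scikit-learn, where the polynomial degree is an arbitrary illustrative choice and X_train, y_train are assumed to be defined as in the implementation example later in this article:
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
# degree-2 features add x_1^2, x_2^2, and x_1 * x_2, allowing an elliptical decision boundary
model = make_pipeline(PolynomialFeatures(degree=2, include_bias=False), LogisticRegression())
model.fit(X_train, y_train)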
The default threshold of 0.5 is not always optimal. In many real-world applications, the costs of false positives and false negatives are different. For example, in medical diagnosis, missing a disease (false negative) may be far more costly than a false alarm (false positive). Adjusting the threshold allows practitioners to balance sensitivity and specificity according to the application's needs.
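Because the model outputs probabilities, changing the threshold is a post-processing step rather than a retraining step. A small sketch, assuming a fitted scikit-learn classifier named model and a held-out feature matrix X_test; the value 0.3 is arbitrary:
import numpy as np
threshold = 0.3                             # lower than the default 0.5 to favor recall over precision
y_prob = model.predict_proba(X_test)[:, 1]  # probability of the positive class
y_pred = (y_prob >= threshold).astype(int)  # apply the custom threshold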
Binary logistic regression is the most common form. The target variable has exactly two possible outcomes (for example, "spam" or "not spam," "disease" or "healthy"). The model estimates P(y = 1 | x) using the sigmoid function, and the probability of the other class is its complement: P(y = 0 | x) = 1 - P(y = 1 | x).
When the target variable has more than two unordered categories, multinomial logistic regression extends the binary model. There are two main approaches:
One-vs-Rest (OvR). This strategy trains one binary classifier per class, treating each class as the positive class and all other classes combined as the negative class. For K classes, this produces K separate binary logistic regression models. The class with the highest predicted probability is chosen as the final prediction.
Softmax (multinomial). The multinomial approach models all classes simultaneously using the softmax function:
P(y = k | x) = e^(z_k) / Sum from j=1 to K of e^(z_j)
where z_k = beta_0k + beta_1k * x_1 + ... + beta_pk * x_p is the linear combination for class k. The softmax function ensures that the probabilities for all classes sum to 1. The loss function for multinomial logistic regression is the categorical cross-entropy:
J = -(1/n) * Sum from i=1 to n of Sum from k=1 to K of y_ik * ln(p_ik)
where y_ik is 1 if observation i belongs to class k and 0 otherwise.
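A short NumPy sketch of the softmax computation for one observation with hypothetical per-class scores:
import numpy as np
def softmax(z):
    e = np.exp(z - np.max(z))  # subtracting the max improves numerical stability without changing the result
    return e / e.sum()
z = np.array([2.0, 1.0, -0.5])  # hypothetical linear scores z_k for K = 3 classes
print(softmax(z))               # probabilities summing to 1, approximately [0.69, 0.25, 0.06]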
When the categories have a natural order (for example, "low," "medium," "high"), ordinal logistic regression is more appropriate than multinomial logistic regression. The most common variant is the proportional odds model, which models cumulative probabilities using a single set of feature coefficients and multiple threshold (intercept) parameters. This preserves the ordering information and is more parsimonious than fitting separate models for each category.
Regularization adds a penalty term to the cost function to prevent overfitting, especially when the number of features is large relative to the number of training examples or when features are correlated.
L1 regularization adds the sum of the absolute values of the coefficients to the cost function:
J_regularized = J(beta) + lambda * Sum(|beta_j|)
L1 regularization can drive some coefficients to exactly zero, performing automatic feature selection. This is useful when many features are irrelevant or redundant. The resulting model is sparse, which aids interpretability.
L2 regularization adds the sum of the squared coefficients to the cost function:
J_regularized = J(beta) + lambda * Sum(beta_j^2)
L2 regularization shrinks all coefficients toward zero but does not set any to exactly zero. It is the default regularization in many logistic regression implementations, including scikit-learn's LogisticRegression class.
Elastic net combines L1 and L2 penalties, providing a balance between feature selection and coefficient shrinkage:
J_regularized = J(beta) + lambda_1 * Sum(|beta_j|) + lambda_2 * Sum(beta_j^2)
Elastic net is particularly useful when there are groups of correlated features; L1 alone tends to select one feature from each group arbitrarily, while elastic net can retain all members of the group.
The regularization strength is controlled by the parameter C in scikit-learn (where C = 1/lambda). Smaller values of C apply stronger regularization.
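In scikit-learn, the penalty type and strength are chosen when the model is constructed; the values below are illustrative rather than recommendations:
from sklearn.linear_model import LogisticRegression
l2_model = LogisticRegression(penalty='l2', C=1.0)  # ridge-style shrinkage (the default)
l1_model = LogisticRegression(penalty='l1', C=0.1, solver='saga')  # sparse coefficients, stronger penalty
enet_model = LogisticRegression(penalty='elasticnet', l1_ratio=0.5, C=1.0, solver='saga')  # mix of L1 and L2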
| Regularization | Penalty | Feature selection | Sparsity | Default in scikit-learn |
|---|---|---|---|---|
| L1 (Lasso) | Sum of absolute coefficients | Yes | Sparse | No (use penalty='l1') |
| L2 (Ridge) | Sum of squared coefficients | No | Dense | Yes (penalty='l2') |
| Elastic net | L1 + L2 combined | Partial | Moderate | No (use penalty='elasticnet') |
| None | No penalty | No | Dense | No (use penalty='none') |
A confusion matrix provides a detailed breakdown of the model's predictions:
| | Predicted positive | Predicted negative |
|---|---|---|
| Actual positive | True Positive (TP) | False Negative (FN) |
| Actual negative | False Positive (FP) | True Negative (TN) |
From the confusion matrix, all common classification metrics can be derived.
| Metric | Formula | Meaning |
|---|---|---|
| Accuracy | (TP + TN) / (TP + TN + FP + FN) | Overall proportion of correct predictions |
| Precision | TP / (TP + FP) | Proportion of positive predictions that are correct |
| Recall (sensitivity) | TP / (TP + FN) | Proportion of actual positives correctly identified |
| Specificity | TN / (TN + FP) | Proportion of actual negatives correctly identified |
| F1 score | 2 * Precision * Recall / (Precision + Recall) | Harmonic mean of precision and recall |
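These quantities can be computed directly with scikit-learn; a brief sketch, assuming y_test and y_pred are defined as in the implementation example later in this article:
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()  # counts in the order TN, FP, FN, TP
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
print("F1 score:", f1_score(y_test, y_pred))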
The ROC curve is a graphical tool for evaluating the performance of a binary classifier across all possible thresholds. It plots the True Positive Rate (sensitivity) against the False Positive Rate as the classification threshold varies from 0 to 1.
The Area Under the ROC Curve (AUC) summarizes the overall performance of the classifier in a single number between 0 and 1: a value of 0.5 corresponds to random guessing, while a value of 1.0 indicates a perfect classifier.
AUC is a threshold-independent metric, making it useful for comparing classifiers without committing to a specific threshold. It is particularly valuable when dealing with imbalanced datasets, where accuracy alone can be misleading.
For highly imbalanced datasets, the Precision-Recall (PR) curve is often more informative than the ROC curve. The PR curve plots precision against recall at various thresholds. The area under the PR curve (AUPRC) provides a summary metric that focuses on the performance of the classifier on the minority class.
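Both summary areas are available in scikit-learn; a minimal sketch, assuming y_test and the predicted positive-class probabilities y_prob are defined as in the implementation example later in this article:
from sklearn.metrics import roc_auc_score, average_precision_score
print("ROC AUC:", roc_auc_score(y_test, y_prob))          # area under the ROC curve
print("AUPRC:", average_precision_score(y_test, y_prob))  # average precision, a common summary of the PR curve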
Beyond classification metrics, logistic regression models can be assessed using statistical tools commonly found in software like statsmodels and R, such as the Wald test for individual coefficients, the likelihood ratio test for comparing nested models, pseudo-R-squared measures (for example, McFadden's R-squared), and the Hosmer-Lemeshow goodness-of-fit test.
Logistic regression makes several assumptions that should be verified before relying on the model: the outcome is binary (or appropriately coded for the multinomial and ordinal variants), the observations are independent of one another, the log-odds of the outcome is a linear function of the features, the features are not severely multicollinear, and the sample contains enough observations per class for stable coefficient estimates.
Notably, logistic regression does not assume normality of the features, homoscedasticity, or a linear relationship between features and the outcome (only between features and the log-odds).
Logistic regression is a special case of the generalized linear model (GLM) framework, formalized by John Nelder and Robert Wedderburn in 1972. In this framework, the response variable follows a distribution from the exponential family (here, the Bernoulli distribution), and the link function transforms the mean of the response to a linear predictor. For logistic regression, the link function is the logit. Other binary regression models within the GLM framework include the probit model (which uses the inverse normal CDF as the link function) and the complementary log-log model. The probit and logit models produce very similar predictions for most datasets and differ mainly in the tails of their probability distributions.
Logistic regression is mathematically equivalent to a neural network with no hidden layers and a sigmoid activation function in the output neuron. A single neuron receives several inputs, multiplies each by a weight, sums the results, adds a bias, and passes the sum through the sigmoid function. This is precisely the operation that logistic regression performs. The bias term corresponds to the intercept (beta_0), and the weights correspond to the feature coefficients.
From this perspective, training a logistic regression model with binary cross-entropy loss and gradient descent is identical to training a single-neuron neural network. Deep neural networks can be understood as stacking many such "logistic regression units" (neurons) in layers, with nonlinear activations between them, allowing the network to learn complex, nonlinear decision boundaries that a single logistic regression cannot capture.
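The equivalence is easy to see in code: the forward pass of a single sigmoid neuron is exactly the logistic regression prediction equation (the weights and bias below are arbitrary illustrative values):
import numpy as np
w = np.array([0.8, -1.2])     # weights, corresponding to beta_1 and beta_2
b = 0.5                       # bias, corresponding to the intercept beta_0
x = np.array([1.5, 0.3])      # one input observation
z = np.dot(w, x) + b          # weighted sum plus bias
p = 1.0 / (1.0 + np.exp(-z))  # sigmoid activation = P(y = 1 | x)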
One of the main advantages of logistic regression is the interpretability of its coefficients. Several approaches exist for assessing feature importance:
| Method | Description | When to use |
|---|---|---|
| Odds ratios (e^beta_j) | Multiplicative change in odds per unit change in feature | Default interpretation for logistic regression |
| Standardized coefficients | Coefficients after features are z-score normalized | Comparing importance across features with different scales |
| Average marginal effects | Average change in predicted probability per unit change | When probability-scale interpretation is desired |
| Wald test / p-values | Statistical significance of each coefficient | Hypothesis testing, explanatory modeling |
| L1 coefficient magnitude | Absolute value of coefficients after L1 regularization | Predictive modeling with automatic feature selection |
When features are standardized (mean 0, standard deviation 1), the magnitude of the coefficients directly reflects the relative importance of each feature. Without standardization, the coefficient magnitude depends on the scale of the feature and cannot be compared directly across features.
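A short sketch of obtaining comparable coefficient magnitudes by standardizing the features first, assuming X_train and y_train are defined as in the implementation example later in this article:
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
pipeline = make_pipeline(StandardScaler(), LogisticRegression())
pipeline.fit(X_train, y_train)
coefs = pipeline.named_steps['logisticregression'].coef_[0]  # coefficients on a common scale, comparable by absolute value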
| Strengths | Weaknesses |
|---|---|
| Highly interpretable (coefficients, odds ratios, marginal effects) | Cannot model nonlinear decision boundaries without feature engineering |
| Fast to train and predict | Assumes linearity of log-odds |
| Outputs well-calibrated probabilities | Sensitive to multicollinearity |
| Works well with small to medium datasets | May underperform on complex tasks compared to ensemble methods |
| Built-in regularization options (L1, L2, elastic net) | Requires careful handling of missing data and outliers |
| Low memory footprint | Needs sufficient samples per class for reliable estimates |
| Strong theoretical foundations in statistics | Performance degrades with many irrelevant features (without regularization) |
| Serves as an excellent baseline for any classification task | Cannot directly model interactions unless features are explicitly constructed |
| Classifier | Interpretability | Handles nonlinearity | Probabilistic output | Training speed | Best for |
|---|---|---|---|---|---|
| Logistic regression | High | No (linear boundary) | Yes (well-calibrated) | Fast | Baseline, interpretable models |
| Naive Bayes | Moderate | No | Yes (often poorly calibrated) | Very fast | Text classification, small datasets |
| Decision tree | High | Yes | Limited | Fast | Interpretable nonlinear models |
| Random forest | Low | Yes | Yes (averaged) | Moderate | General-purpose, robust |
| Support vector machine | Low | Yes (with kernels) | Not native (requires Platt scaling) | Moderate | High-dimensional, small-medium data |
| K-nearest neighbors | Low | Yes | Limited | Fast (lazy learner) | Non-parametric, small data |
| Neural network | Low | Yes | Yes (with softmax/sigmoid) | Slow | Complex patterns, large data |
| Gradient boosting | Low | Yes | Yes | Moderate | Competitions, tabular data |
Logistic regression is used across virtually every domain where binary or categorical outcomes must be predicted. Its simplicity, interpretability, and regulatory transparency make it especially popular in fields where model decisions must be explained to stakeholders or regulators.
| Application domain | Example use case | Target variable |
|---|---|---|
| Healthcare | Disease diagnosis, mortality risk prediction | Positive / Negative |
| Finance | Credit scoring, loan default prediction | Default / No default |
| Marketing | Customer churn prediction, ad click-through | Churned / Retained |
| Natural language processing | Sentiment analysis, spam detection | Positive / Negative |
| Fraud detection | Transaction fraud screening, insurance fraud | Fraudulent / Legitimate |
| Epidemiology | Disease risk factor analysis, clinical trials | Infected / Not infected |
| Criminal justice | Recidivism risk assessment | Reoffend / Not reoffend |
| Social sciences | Survey response modeling, voting behavior | Yes / No |
In credit scoring, logistic regression remains the industry standard in many regulatory environments because its coefficients can be directly translated into scorecards with transparent point assignments. In healthcare, logistic regression models for clinical risk prediction (such as the Framingham Risk Score for cardiovascular disease) have been validated over decades of use.
Scikit-learn provides a robust, production-ready implementation of logistic regression through the LogisticRegression class. It is designed primarily for predictive modeling and applies L2 regularization by default.
# X (feature matrix) and y (binary labels) are assumed to be loaded beforehand
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_auc_score
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression(penalty='l2', C=1.0, max_iter=100)  # max_iter=100 is the default; increase it if the solver warns about convergence
model.fit(X_train, y_train)
y_pred = model.predict(X_test)              # hard class labels (default threshold 0.5)
y_prob = model.predict_proba(X_test)[:, 1]  # predicted probability of the positive class
print("Accuracy:", accuracy_score(y_test, y_pred))
print("AUC:", roc_auc_score(y_test, y_prob))
Key parameters include penalty (regularization type), C (inverse regularization strength), solver (optimization algorithm), and multi_class (strategy for multiclass problems).
| Solver | Supports L1 | Supports L2 | Supports multinomial | Best for |
|---|---|---|---|---|
| lbfgs | No | Yes | Yes | Default, medium datasets |
| liblinear | Yes | Yes | No (OvR only) | Small datasets, L1 penalty |
| saga | Yes | Yes | Yes | Large datasets, all penalties |
| newton-cg | No | Yes | Yes | Medium datasets |
| newton-cholesky | No | Yes | No | Dense features, many samples |
Statsmodels is designed for explanatory (inferential) modeling and provides detailed statistical output, including p-values, confidence intervals, and hypothesis tests. It does not apply regularization by default.
import statsmodels.api as sm
X_with_const = sm.add_constant(X_train) # add intercept term
model = sm.Logit(y_train, X_with_const)
result = model.fit()
print(result.summary()) # detailed statistical summary
Statsmodels also supports logistic regression through its GLM interface:
model = sm.GLM(y_train, X_with_const, family=sm.families.Binomial())
result = model.fit()
In R, logistic regression is fit using the glm() function with family = binomial:
model <- glm(y ~ x1 + x2 + x3, data = train_data, family = binomial)
summary(model)
predictions <- predict(model, newdata = test_data, type = "response")
R's glm() uses IRLS (iteratively reweighted least squares) as its default optimizer. The summary() output includes coefficient estimates, standard errors, z-values, and p-values for each predictor.
| Feature | scikit-learn | statsmodels | R glm() |
|---|---|---|---|
| Primary purpose | Prediction | Inference | Inference |
| Default regularization | L2 (C=1.0) | None | None |
| p-values / confidence intervals | Not built-in | Yes | Yes |
| Supports cross-validation | Yes (with LogisticRegressionCV) | No (manual) | No (manual, or via caret) |
| Optimization algorithms | L-BFGS, liblinear, saga, etc. | Newton-Raphson | IRLS |
| Best suited for | Machine learning pipelines | Statistical analysis | Statistical analysis |
Imagine you want to figure out if a fruit is an apple or an orange based on its color and size. Logistic regression is like drawing a line between the apples and oranges on a chart. On one side of the line, the model says "probably an apple." On the other side, it says "probably an orange." The closer a fruit is to the line, the less sure the model is. The farther away it is, the more confident the prediction. Logistic regression learns where to draw this line by looking at lots of examples of apples and oranges you have already identified.