See also: Machine learning terms
Logistic regression is a statistical method used to model the probability of a binary outcome based on one or more independent variables. In the context of machine learning, logistic regression is a supervised learning algorithm widely used for binary classification problems. Despite the word "regression" in its name, logistic regression is fundamentally a classification model; it predicts the probability that an observation belongs to a particular class, then applies a decision threshold to assign a label.
The technique has deep historical roots. The underlying logistic function was first proposed by the Belgian mathematician Pierre François Verhulst in 1838 as a model of population growth. Verhulst named the function in 1845, but his work was largely forgotten after his early death. Raymond Pearl and Lowell Reed rediscovered the logistic function independently in 1920 while studying population dynamics at Johns Hopkins University. The application of the logistic function to statistical regression began with Joseph Berkson, who coined the term "logit" (from "logistic unit") in 1944 and introduced logistic regression as a tool for bioassay analysis. British statistician David Cox then developed and popularized the method as a general-purpose statistical model in his landmark 1958 paper, "The Regression Analysis of Binary Sequences." Cox later extended the model to multinomial outcomes in 1966. Since about 1970, logistic regression has been the most commonly used model for binary regression across the sciences.
Logistic regression is one of the most widely used algorithms in practice because of its simplicity, interpretability, and strong theoretical foundations. It serves as a building block for more complex models, including neural networks, where the sigmoid function used in logistic regression appears as a core component.
The foundation of logistic regression is the sigmoid function (also called the logistic function), which maps any real-valued number to a value between 0 and 1:
sigma(z) = 1 / (1 + e^(-z))
where z is the linear combination of input features and model weights:
z = beta_0 + beta_1 * x_1 + beta_2 * x_2 + ... + beta_p * x_p
The model therefore estimates the conditional probability of the positive class as:
P(y = 1 | x) = sigma(beta_0 + beta_1 * x_1 + ... + beta_p * x_p)
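As a brief illustration, the following Python sketch (using NumPy, with made-up coefficient values) computes z and the resulting probability for a single observation:
import numpy as np
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))  # maps any real number into (0, 1)
beta = np.array([-1.5, 0.8, 2.0])  # hypothetical coefficients [beta_0, beta_1, beta_2]
x = np.array([1.0, 0.5, 1.2])      # [1 for the intercept, x_1, x_2]
z = np.dot(beta, x)                # beta_0 + beta_1 * x_1 + beta_2 * x_2
p = sigmoid(z)                     # P(y = 1 | x), here about 0.79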
The sigmoid function has several important properties that make it suitable for modeling probabilities:
| Property | Description |
|---|---|
| Output range | The function output is always between 0 and 1, making it interpretable as a probability |
| S-shaped curve | The function transitions smoothly from 0 to 1 |
| Symmetry | sigma(z) + sigma(-z) = 1 for all values of z |
| Derivative | sigma'(z) = sigma(z) * (1 - sigma(z)), which simplifies gradient computation during training |
| Midpoint | At z = 0, the output is 0.5, providing a natural threshold for binary classification |
| Asymptotic behavior | Approaches 1 as z goes to positive infinity and 0 as z goes to negative infinity, but never reaches either value |
The logit is the canonical link function for the Bernoulli distribution, and the sigmoid is its inverse, which makes logistic regression a natural member of the generalized linear model (GLM) family.
The logistic regression model can be understood through the concept of log-odds (also called the logit). If p is the probability of the positive class, then the odds of the positive class are:
odds = p / (1 - p)
The logit function is the natural logarithm of the odds:
logit(p) = ln(p / (1 - p))
Logistic regression assumes that the logit of the probability is a linear function of the input features:
ln(p / (1 - p)) = beta_0 + beta_1 * x_1 + beta_2 * x_2 + ... + beta_p * x_p
This is a key insight: while the relationship between the features and the probability is nonlinear (due to the sigmoid function), the relationship between the features and the log-odds is linear. Each coefficient beta_j represents the change in the log-odds of the outcome for a one-unit increase in the corresponding feature x_j, holding all other features constant.
The coefficients in logistic regression have a direct interpretation in terms of odds ratios. The odds ratio for feature x_j is:
OR_j = e^(beta_j)
An odds ratio greater than 1 indicates that increasing the feature value increases the odds of the positive class, while an odds ratio less than 1 indicates a decrease. For example, if the coefficient for a feature is 0.7, the odds ratio is e^0.7 ≈ 2.01, meaning a one-unit increase in that feature roughly doubles the odds of the positive outcome.
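A quick numeric check of this interpretation, using the hypothetical coefficient of 0.7 from the example above:
import numpy as np
beta_j = 0.7                             # hypothetical coefficient for feature x_j
odds_ratio = np.exp(beta_j)              # about 2.01
p_before = 0.30                          # assumed baseline probability
odds_before = p_before / (1 - p_before)  # about 0.43
odds_after = odds_before * odds_ratio    # the odds are multiplied by the odds ratio
p_after = odds_after / (1 + odds_after)  # convert back to a probability, about 0.46
Note that the odds roughly double while the probability rises only from 0.30 to about 0.46; odds ratios and changes in probability are related but not interchangeable.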
| Coefficient value | Odds ratio | Interpretation |
|---|---|---|
| Positive (beta_j > 0) | Greater than 1 | Increases odds of positive class |
| Zero (beta_j = 0) | Equal to 1 | No effect on odds |
| Negative (beta_j < 0) | Less than 1 | Decreases odds of positive class |
While odds ratios are the standard way to interpret logistic regression coefficients, they can be unintuitive. An alternative interpretation uses marginal effects, which quantify the change in the predicted probability for a one-unit change in a predictor. Unlike the coefficient itself (which is constant), the marginal effect of a feature depends on the current values of all predictors, because the sigmoid function is nonlinear. The marginal effect of feature x_j is:
partial P / partial x_j = beta_j * sigma(z) * (1 - sigma(z))
Practitioners often report the "average marginal effect" (AME), computed by averaging the marginal effect over all observations in the dataset. Marginal effects provide a more intuitive measure of a variable's real-world impact on the predicted probability.
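A minimal sketch of computing the average marginal effect by hand with NumPy, assuming a fitted coefficient vector beta (including the intercept) and a design matrix X with a leading column of ones; statsmodels offers a comparable computation through the get_margeff() method on fitted results:
import numpy as np
def average_marginal_effect(beta, X, j):
    # X: (n, p + 1) design matrix with a leading column of ones; j: index of the feature of interest
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))     # predicted probabilities sigma(z)
    marginal_effects = beta[j] * p * (1 - p)  # beta_j * sigma(z) * (1 - sigma(z)) per observation
    return marginal_effects.mean()            # average over all observations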
Unlike linear regression, logistic regression cannot be trained effectively using ordinary least squares. The nonlinear sigmoid function makes the sum of squared errors a non-convex function of the parameters, meaning that gradient-based methods could get stuck in local minima. Instead, logistic regression uses maximum likelihood estimation (MLE) to find the optimal model parameters.
MLE aims to find the parameters that maximize the probability (likelihood) of observing the training data given the model. For a binary classification problem with n observations, the likelihood function is:
L(beta) = Product from i=1 to n of p_i^(y_i) * (1 - p_i)^(1 - y_i)
where y_i is the true label (0 or 1) for the i-th observation and p_i is the predicted probability for that observation.
In practice, maximizing the likelihood is equivalent to minimizing the negative log-likelihood (also called the cross-entropy loss or log loss):
J(beta) = -(1/n) * Sum from i=1 to n of [y_i * ln(p_i) + (1 - y_i) * ln(1 - p_i)]
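As an illustration, the log loss can be computed directly from the predicted probabilities and true labels; this sketch mirrors the formula above (scikit-learn provides the same computation as sklearn.metrics.log_loss):
import numpy as np
def log_loss(y_true, p_pred, eps=1e-15):
    p = np.clip(p_pred, eps, 1 - eps)  # avoid taking log(0)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
y_true = np.array([1, 0, 1, 1, 0])
p_pred = np.array([0.9, 0.2, 0.7, 0.6, 0.1])
print(log_loss(y_true, p_pred))  # approximately 0.26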
The negative log-likelihood is a convex function of the parameters, so any minimum reached by an optimization algorithm is the global minimum. This property holds because the log-likelihood of a GLM with a canonical link (here, the Bernoulli distribution with the logit link) is concave in the parameters.
Several optimization algorithms can minimize the log loss:
| Algorithm | Description | When to use |
|---|---|---|
| Gradient descent | Iteratively updates parameters using the gradient | General-purpose, easy to implement |
| Newton's method (IRLS) | Uses second-order derivatives for faster convergence | Small to medium datasets |
| L-BFGS | Quasi-Newton method, approximates the Hessian | Medium to large datasets |
| Stochastic gradient descent | Updates parameters using one sample (or mini-batch) at a time | Very large datasets, online learning |
| Coordinate descent | Optimizes one parameter at a time | Useful with L1 regularization |
Iteratively Reweighted Least Squares (IRLS), a form of Newton's method, is the classical algorithm for logistic regression and is used in R's glm() function. Modern implementations in libraries like scikit-learn typically use L-BFGS or coordinate descent. Unlike linear regression, there is no closed-form solution for the logistic regression parameters; iterative numerical methods are always required.
The gradient of the log loss with respect to each parameter beta_j is:
partial J / partial beta_j = (1/n) * Sum from i=1 to n of (p_i - y_i) * x_ij
where p_i = sigma(beta_0 + beta_1 * x_i1 + ... + beta_p * x_ip) is the predicted probability for observation i. This gradient has an elegant form: it is the average of the product of the prediction error (p_i - y_i) and the feature value x_ij, which is the same structure as the gradient for linear regression. The difference is that p_i is computed through the sigmoid function rather than being a direct linear combination.
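This gradient leads directly to a simple batch gradient descent trainer. The following is a minimal NumPy sketch, not an optimized implementation, assuming X is an (n, p) feature matrix and y a vector of 0/1 labels:
import numpy as np
def fit_logistic_gd(X, y, lr=0.1, n_iter=1000):
    Xb = np.column_stack([np.ones(len(X)), X])   # prepend a column of ones so beta[0] is the intercept
    beta = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(Xb @ beta)))   # predicted probabilities
        gradient = Xb.T @ (p - y) / len(y)       # (1/n) * sum of (p_i - y_i) * x_ij
        beta -= lr * gradient                    # gradient descent update
    return beta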
The decision boundary is the surface that separates the predicted classes. In binary logistic regression, the model predicts class 1 when p >= threshold and class 0 otherwise. The default threshold is 0.5.
At the decision boundary, the predicted probability equals the threshold. For a threshold of 0.5, this corresponds to z = 0, so the decision boundary is the set of points where:
beta_0 + beta_1 * x_1 + beta_2 * x_2 + ... + beta_p * x_p = 0
This is a linear decision boundary: a line in two dimensions, a plane in three dimensions, or a hyperplane in higher dimensions. This linear boundary is both a strength and a limitation of logistic regression. It works well when the classes are approximately linearly separable, but it cannot capture complex, nonlinear boundaries without feature engineering.
Although logistic regression produces a linear decision boundary by default, practitioners can create nonlinear boundaries by engineering polynomial or interaction features. For example, adding x_1^2, x_2^2, and x_1 * x_2 as additional features allows the decision boundary to take the shape of an ellipse, parabola, or other conic section. This approach is similar in spirit to the kernel trick used in support vector machines, though it requires explicit feature construction.
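A sketch of this idea with scikit-learn, where the polynomial degree is an arbitrary illustrative choice and X_train, y_train are assumed to be defined as in the implementation example later in this article:
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
# degree-2 features add x_1^2, x_2^2, and x_1 * x_2, allowing an elliptical decision boundary
model = make_pipeline(PolynomialFeatures(degree=2, include_bias=False), LogisticRegression())
model.fit(X_train, y_train)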
The default threshold of 0.5 is not always optimal. In many real-world applications, the costs of false positives and false negatives are different. For example, in medical diagnosis, missing a disease (false negative) may be far more costly than a false alarm (false positive). Adjusting the threshold allows practitioners to balance sensitivity and specificity according to the application's needs.
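Because the model outputs probabilities, changing the threshold is a post-processing step rather than a retraining step. A small sketch, assuming a fitted scikit-learn classifier named model and a held-out feature matrix X_test; the value 0.3 is arbitrary:
import numpy as np
threshold = 0.3                             # lower than the default 0.5 to favor recall over precision
y_prob = model.predict_proba(X_test)[:, 1]  # probability of the positive class
y_pred = (y_prob >= threshold).astype(int)  # apply the custom threshold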
Binary logistic regression is the most common form. The target variable has exactly two possible outcomes (for example, "spam" or "not spam," "disease" or "healthy"). The model estimates P(y = 1 | x) using the sigmoid function, and the probability of the other class is its complement: P(y = 0 | x) = 1 - P(y = 1 | x).
When the target variable has more than two unordered categories, multinomial logistic regression extends the binary model. There are two main approaches:
One-vs-Rest (OvR). This strategy trains one binary classifier per class, treating each class as the positive class and all other classes combined as the negative class. For K classes, this produces K separate binary logistic regression models. The class with the highest predicted probability is chosen as the final prediction.
Softmax (multinomial). The multinomial approach models all classes simultaneously using the softmax function:
P(y = k | x) = e^(z_k) / Sum from j=1 to K of e^(z_j)
where z_k = beta_0k + beta_1k * x_1 + ... + beta_pk * x_p is the linear combination for class k. The softmax function ensures that the probabilities for all classes sum to 1. The loss function for multinomial logistic regression is the categorical cross-entropy:
J = -(1/n) * Sum from i=1 to n of Sum from k=1 to K of y_ik * ln(p_ik)
where y_ik is 1 if observation i belongs to class k and 0 otherwise.
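A short NumPy sketch of the softmax computation for one observation with hypothetical per-class scores:
import numpy as np
def softmax(z):
    e = np.exp(z - np.max(z))  # subtracting the max improves numerical stability without changing the result
    return e / e.sum()
z = np.array([2.0, 1.0, -0.5])  # hypothetical linear scores z_k for K = 3 classes
print(softmax(z))               # probabilities summing to 1, approximately [0.69, 0.25, 0.06]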
When the categories have a natural order (for example, "low," "medium," "high"), ordinal logistic regression is more appropriate than multinomial logistic regression. The most common variant is the proportional odds model, which models cumulative probabilities using a single set of feature coefficients and multiple threshold (intercept) parameters. This preserves the ordering information and is more parsimonious than fitting separate models for each category.
Regularization adds a penalty term to the cost function to prevent overfitting, especially when the number of features is large relative to the number of training examples or when features are correlated.
L1 regularization adds the sum of the absolute values of the coefficients to the cost function:
J_regularized = J(beta) + lambda * Sum(|beta_j|)
L1 regularization can drive some coefficients to exactly zero, performing automatic feature selection. This is useful when many features are irrelevant or redundant. The resulting model is sparse, which aids interpretability.
L2 regularization adds the sum of the squared coefficients to the cost function:
J_regularized = J(beta) + lambda * Sum(beta_j^2)
L2 regularization shrinks all coefficients toward zero but does not set any to exactly zero. It is the default regularization in many logistic regression implementations, including scikit-learn's LogisticRegression class.
Elastic net combines L1 and L2 penalties, providing a balance between feature selection and coefficient shrinkage:
J_regularized = J(beta) + lambda_1 * Sum(|beta_j|) + lambda_2 * Sum(beta_j^2)
Elastic net is particularly useful when there are groups of correlated features; L1 alone tends to select one feature from each group arbitrarily, while elastic net can retain all members of the group.
The regularization strength is controlled by the parameter C in scikit-learn (where C = 1/lambda). Smaller values of C apply stronger regularization.
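In scikit-learn, the penalty type and strength are chosen when the model is constructed; the values below are illustrative rather than recommendations:
from sklearn.linear_model import LogisticRegression
l2_model = LogisticRegression(penalty='l2', C=1.0)  # ridge-style shrinkage (the default)
l1_model = LogisticRegression(penalty='l1', C=0.1, solver='saga')  # sparse coefficients, stronger penalty
enet_model = LogisticRegression(penalty='elasticnet', l1_ratio=0.5, C=1.0, solver='saga')  # mix of L1 and L2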
| Regularization | Penalty | Feature selection | Sparsity | Default in scikit-learn |
|---|---|---|---|---|
| L1 (Lasso) | Sum of absolute coefficients | Yes | Sparse | No (use penalty='l1') |
| L2 (Ridge) | Sum of squared coefficients | No | Dense | Yes (penalty='l2') |
| Elastic net | L1 + L2 combined | Partial | Moderate | No (use penalty='elasticnet') |
| None | No penalty | No | Dense | No (use penalty='none') |
A confusion matrix provides a detailed breakdown of the model's predictions:
| | Predicted positive | Predicted negative |
|---|---|---|
| Actual positive | True Positive (TP) | False Negative (FN) |
| Actual negative | False Positive (FP) | True Negative (TN) |
From the confusion matrix, all common classification metrics can be derived.
| Metric | Formula | Meaning |
|---|---|---|
| Accuracy | (TP + TN) / (TP + TN + FP + FN) | Overall proportion of correct predictions |
| Precision | TP / (TP + FP) | Proportion of positive predictions that are correct |
| Recall (sensitivity) | TP / (TP + FN) | Proportion of actual positives correctly identified |
| Specificity | TN / (TN + FP) | Proportion of actual negatives correctly identified |
| F1 score | 2 * Precision * Recall / (Precision + Recall) | Harmonic mean of precision and recall |
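These quantities can be computed directly with scikit-learn; a brief sketch, assuming y_test and y_pred are defined as in the implementation example later in this article:
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()  # counts in the order TN, FP, FN, TP
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
print("F1 score:", f1_score(y_test, y_pred))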
The ROC curve is a graphical tool for evaluating the performance of a binary classifier across all possible thresholds. It plots the True Positive Rate (sensitivity) against the False Positive Rate as the classification threshold varies from 0 to 1.
The Area Under the ROC Curve (AUC) summarizes the overall performance of the classifier in a single number between 0 and 1: a value of 0.5 corresponds to random guessing, while a value of 1.0 indicates a perfect classifier.
AUC is a threshold-independent metric, making it useful for comparing classifiers without committing to a specific threshold. It is particularly valuable when dealing with imbalanced datasets, where accuracy alone can be misleading.
For highly imbalanced datasets, the Precision-Recall (PR) curve is often more informative than the ROC curve. The PR curve plots precision against recall at various thresholds. The area under the PR curve (AUPRC) provides a summary metric that focuses on the performance of the classifier on the minority class.
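Both summary areas are available in scikit-learn; a minimal sketch, assuming y_test and the predicted positive-class probabilities y_prob are defined as in the implementation example later in this article:
from sklearn.metrics import roc_auc_score, average_precision_score
print("ROC AUC:", roc_auc_score(y_test, y_prob))          # area under the ROC curve
print("AUPRC:", average_precision_score(y_test, y_prob))  # average precision, a common summary of the PR curve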
Beyond classification metrics, logistic regression models can be assessed using statistical tools commonly found in software like statsmodels and R, such as the Wald test for individual coefficients, the likelihood ratio test for comparing nested models, pseudo-R-squared measures (for example, McFadden's R-squared), and the Hosmer-Lemeshow goodness-of-fit test.
Logistic regression makes several assumptions that should be verified before relying on the model: the outcome is binary (or appropriately coded for the multinomial and ordinal variants), the observations are independent of one another, the log-odds of the outcome is a linear function of the features, the features are not severely multicollinear, and the sample contains enough observations per class for stable coefficient estimates.
Notably, logistic regression does not assume normality of the features, homoscedasticity, or a linear relationship between features and the outcome (only between features and the log-odds).
Logistic regression is a special case of the generalized linear model (GLM) framework, formalized by John Nelder and Robert Wedderburn in 1972. In this framework, the response variable follows a distribution from the exponential family (here, the Bernoulli distribution), and the link function transforms the mean of the response to a linear predictor. For logistic regression, the link function is the logit. Other binary regression models within the GLM framework include the probit model (which uses the inverse normal CDF as the link function) and the complementary log-log model. The probit and logit models produce very similar predictions for most datasets and differ mainly in the tails of their probability distributions.
Logistic regression is mathematically equivalent to a neural network with no hidden layers and a sigmoid activation function in the output neuron. A single neuron receives several inputs, multiplies each by a weight, sums the results, adds a bias, and passes the sum through the sigmoid function. This is precisely the operation that logistic regression performs. The bias term corresponds to the intercept (beta_0), and the weights correspond to the feature coefficients.
From this perspective, training a logistic regression model with binary cross-entropy loss and gradient descent is identical to training a single-neuron neural network. Deep neural networks can be understood as stacking many such "logistic regression units" (neurons) in layers, with nonlinear activations between them, allowing the network to learn complex, nonlinear decision boundaries that a single logistic regression cannot capture.
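The equivalence is easy to see in code: the forward pass of a single sigmoid neuron is exactly the logistic regression prediction equation (the weights and bias below are arbitrary illustrative values):
import numpy as np
w = np.array([0.8, -1.2])     # weights, corresponding to beta_1 and beta_2
b = 0.5                       # bias, corresponding to the intercept beta_0
x = np.array([1.5, 0.3])      # one input observation
z = np.dot(w, x) + b          # weighted sum plus bias
p = 1.0 / (1.0 + np.exp(-z))  # sigmoid activation = P(y = 1 | x)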
One of the main advantages of logistic regression is the interpretability of its coefficients. Several approaches exist for assessing feature importance:
| Method | Description | When to use |
|---|---|---|
| Odds ratios (e^beta_j) | Multiplicative change in odds per unit change in feature | Default interpretation for logistic regression |
| Standardized coefficients | Coefficients after features are z-score normalized | Comparing importance across features with different scales |
| Average marginal effects | Average change in predicted probability per unit change | When probability-scale interpretation is desired |
| Wald test / p-values | Statistical significance of each coefficient | Hypothesis testing, explanatory modeling |
| L1 coefficient magnitude | Absolute value of coefficients after L1 regularization | Predictive modeling with automatic feature selection |
When features are standardized (mean 0, standard deviation 1), the magnitude of the coefficients directly reflects the relative importance of each feature. Without standardization, the coefficient magnitude depends on the scale of the feature and cannot be compared directly across features.
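A short sketch of obtaining comparable coefficient magnitudes by standardizing the features first, assuming X_train and y_train are defined as in the implementation example later in this article:
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
pipeline = make_pipeline(StandardScaler(), LogisticRegression())
pipeline.fit(X_train, y_train)
coefs = pipeline.named_steps['logisticregression'].coef_[0]  # coefficients on a common scale, comparable by absolute value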
| Strengths | Weaknesses |
|---|---|
| Highly interpretable (coefficients, odds ratios, marginal effects) | Cannot model nonlinear decision boundaries without feature engineering |
| Fast to train and predict | Assumes linearity of log-odds |
| Outputs well-calibrated probabilities | Sensitive to multicollinearity |
| Works well with small to medium datasets | May underperform on complex tasks compared to ensemble methods |
| Built-in regularization options (L1, L2, elastic net) | Requires careful handling of missing data and outliers |
| Low memory footprint | Needs sufficient samples per class for reliable estimates |
| Strong theoretical foundations in statistics | Performance degrades with many irrelevant features (without regularization) |
| Serves as an excellent baseline for any classification task | Cannot directly model interactions unless features are explicitly constructed |
| Classifier | Interpretability | Handles nonlinearity | Probabilistic output | Training speed | Best for |
|---|---|---|---|---|---|
| Logistic regression | High | No (linear boundary) | Yes (well-calibrated) | Fast | Baseline, interpretable models |
| Naive Bayes | Moderate | No | Yes (often poorly calibrated) | Very fast | Text classification, small datasets |
| Decision tree | High | Yes | Limited | Fast | Interpretable nonlinear models |
| Random forest | Low | Yes | Yes (averaged) | Moderate | General-purpose, robust |
| Support vector machine | Low | Yes (with kernels) | Not native (requires Platt scaling) | Moderate | High-dimensional, small-medium data |
| K-nearest neighbors | Low | Yes | Limited | Fast (lazy learner) | Non-parametric, small data |
| Neural network | Low | Yes | Yes (with softmax/sigmoid) | Slow | Complex patterns, large data |
| Gradient boosting | Low | Yes | Yes | Moderate | Competitions, tabular data |
Logistic regression is used across virtually every domain where binary or categorical outcomes must be predicted. Its simplicity, interpretability, and regulatory transparency make it especially popular in fields where model decisions must be explained to stakeholders or regulators.
| Application domain | Example use case | Target variable |
|---|---|---|
| Healthcare | Disease diagnosis, mortality risk prediction | Positive / Negative |
| Finance | Credit scoring, loan default prediction | Default / No default |
| Marketing | Customer churn prediction, ad click-through | Churned / Retained |
| Natural language processing | Sentiment analysis, spam detection | Positive / Negative |
| Fraud detection | Transaction fraud screening, insurance fraud | Fraudulent / Legitimate |
| Epidemiology | Disease risk factor analysis, clinical trials | Infected / Not infected |
| Criminal justice | Recidivism risk assessment | Reoffend / Not reoffend |
| Social sciences | Survey response modeling, voting behavior | Yes / No |
In credit scoring, logistic regression remains the industry standard in many regulatory environments because its coefficients can be directly translated into scorecards with transparent point assignments. In healthcare, logistic regression models for clinical risk prediction (such as the Framingham Risk Score for cardiovascular disease) have been validated over decades of use.
Scikit-learn provides a robust, production-ready implementation of logistic regression through the LogisticRegression class. It is designed primarily for predictive modeling and applies L2 regularization by default.
# X (feature matrix) and y (binary labels) are assumed to be loaded beforehand
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_auc_score
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression(penalty='l2', C=1.0, max_iter=100)  # max_iter=100 is the default; increase it if the solver warns about convergence
model.fit(X_train, y_train)
y_pred = model.predict(X_test)              # hard class labels (default threshold 0.5)
y_prob = model.predict_proba(X_test)[:, 1]  # predicted probability of the positive class
print("Accuracy:", accuracy_score(y_test, y_pred))
print("AUC:", roc_auc_score(y_test, y_prob))
Key parameters include penalty (regularization type), C (inverse regularization strength), solver (optimization algorithm), and multi_class (strategy for multiclass problems).
| Solver | Supports L1 | Supports L2 | Supports multinomial | Best for |
|---|---|---|---|---|
| lbfgs | No | Yes | Yes | Default, medium datasets |
| liblinear | Yes | Yes | No (OvR only) | Small datasets, L1 penalty |
| saga | Yes | Yes | Yes | Large datasets, all penalties |
| newton-cg | No | Yes | Yes | Medium datasets |
| newton-cholesky | No | Yes | No | Dense features, many samples |
Statsmodels is designed for explanatory (inferential) modeling and provides detailed statistical output, including p-values, confidence intervals, and hypothesis tests. It does not apply regularization by default.
import statsmodels.api as sm
X_with_const = sm.add_constant(X_train) # add intercept term
model = sm.Logit(y_train, X_with_const)
result = model.fit()
print(result.summary()) # detailed statistical summary
Statsmodels also supports logistic regression through its GLM interface:
model = sm.GLM(y_train, X_with_const, family=sm.families.Binomial())
result = model.fit()
In R, logistic regression is fit using the glm() function with family = binomial:
model <- glm(y ~ x1 + x2 + x3, data = train_data, family = binomial)
summary(model)
predictions <- predict(model, newdata = test_data, type = "response")
R's glm() uses IRLS (iteratively reweighted least squares) as its default optimizer. The summary() output includes coefficient estimates, standard errors, z-values, and p-values for each predictor.
| Feature | scikit-learn | statsmodels | R glm() |
|---|---|---|---|
| Primary purpose | Prediction | Inference | Inference |
| Default regularization | L2 (C=1.0) | None | None |
| p-values / confidence intervals | Not built-in | Yes | Yes |
| Supports cross-validation | Yes (with LogisticRegressionCV) | No (manual) | No (manual, or via caret) |
| Optimization algorithms | L-BFGS, liblinear, saga, etc. | Newton-Raphson | IRLS |
| Best suited for | Machine learning pipelines | Statistical analysis | Statistical analysis |
Imagine you want to figure out if a fruit is an apple or an orange based on its color and size. Logistic regression is like drawing a line between the apples and oranges on a chart. On one side of the line, the model says "probably an apple." On the other side, it says "probably an orange." The closer a fruit is to the line, the less sure the model is. The farther away it is, the more confident the prediction. Logistic regression learns where to draw this line by looking at lots of examples of apples and oranges you have already identified.