# Logistic Regression

> Source: https://aiwiki.ai/wiki/logistic_regression
> Updated: 2026-07-12
> Categories: Machine Learning, Statistics
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

*See also: [Machine learning terms](/wiki/machine_learning_terms)*

**Logistic regression** is a statistical method that models the probability of a binary (yes/no) outcome as a function of one or more input variables, by passing a linear combination of those inputs through the [sigmoid function](/wiki/sigmoid_function) to produce a value between 0 and 1. Despite its name, it is a [classification model](/wiki/classification_model), not a regression model for continuous values: it outputs a probability, then applies a decision threshold (0.5 by default) to assign a class label. It is one of the most widely used algorithms in both statistics and [machine learning](/wiki/machine_learning), valued for being fast, interpretable, and well calibrated, and it doubles as the mathematical building block of a single artificial [neuron](/wiki/neural_network).

The method was developed by British statistician David Cox in his 1958 paper "The Regression Analysis of Binary Sequences," which opens by framing the exact problem logistic regression solves: "A sequence of 0's and 1's is observed and it is suspected that the chance that a particular trial is a 1 depends on the value of one or more independent variables."[3] The name of its core quantity, the *logit*, was coined earlier by statistician Joseph Berkson in 1944, who wrote that he used the term "for ln p/q following Bliss, who called the analogous function which is linear on x for the normal curve 'probit.'"[2][9] Since about 1970, logistic regression has been the most commonly used model for binary regression across the sciences.[12]

## What is logistic regression?

Logistic regression is a statistical method used to model the probability of a binary outcome based on one or more independent variables. In the context of [machine learning](/wiki/machine_learning), logistic regression is a [supervised learning](/wiki/supervised_learning) algorithm widely used for [binary classification](/wiki/binary_classification) problems. Despite the word "regression" in its name, logistic regression is fundamentally a [classification model](/wiki/classification_model); it predicts the probability that an observation belongs to a particular class, then applies a decision threshold to assign a label.

The technique has deep historical roots. The underlying logistic function was first proposed by the Belgian mathematician Pierre François Verhulst in 1838 as a model of population growth.[1][9] Verhulst named the function in 1845, but his work was largely forgotten after his early death. Raymond Pearl and Lowell Reed rediscovered the logistic function independently in 1920 while studying population dynamics at Johns Hopkins University.[9] The application of the logistic function to statistical regression began with Joseph Berkson, who coined the term "logit" (from "logistic unit") in 1944 and introduced logistic regression as a tool for bioassay analysis.[2][9] British statistician David Cox then developed and popularized the method as a general-purpose statistical model in his landmark 1958 paper, "The Regression Analysis of Binary Sequences."[3] Cox later extended the model to multinomial outcomes in 1966.

Logistic regression is one of the most widely used algorithms in practice because of its simplicity, interpretability, and strong theoretical foundations.[5][6] It serves as a building block for more complex models, including [neural networks](/wiki/neural_network), where the [sigmoid function](/wiki/sigmoid_function) used in logistic regression appears as a core component.[7][11]

### When was logistic regression invented?

The ideas behind logistic regression accumulated over more than a century, and it is useful to separate the history of the logistic *function* from the history of logistic *regression* as a statistical model.

| Year | Contributor | Contribution |
|---|---|---|
| 1838 | Pierre François Verhulst | Proposed the logistic function as a model of constrained population growth[1][9] |
| 1845 | Pierre François Verhulst | Named the solutions of his growth equation "logistic" curves[9] |
| 1920 | Raymond Pearl and Lowell Reed | Rediscovered the logistic function while fitting U.S. census data[9] |
| 1944 | Joseph Berkson | Applied the logistic function to bioassay and coined the term "logit"[2][9] |
| 1958 | David Cox | Developed logistic regression as a general-purpose model for binary data[3] |
| 1966 | David Cox | Extended the model to multinomial (more than two category) outcomes |
| 1972 | John Nelder and Robert Wedderburn | Formalized logistic regression as a generalized linear model[4] |

## Mathematical formulation

### The logistic (sigmoid) function

The foundation of logistic regression is the **sigmoid function** (also called the logistic function), which maps any real-valued number to a value between 0 and 1:[6][7]

$$
\sigma(z) = \frac{1}{1 + e^{-z}}
$$

where z is the linear combination of input features and model weights:

$$
z = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p
$$

The model therefore estimates the conditional probability of the positive class as:

$$
P(y = 1 \mid x) = \sigma(\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p)
$$

### Properties of the sigmoid function

The sigmoid function has several important properties that make it suitable for modeling probabilities:

| Property | Description |
|---|---|
| Output range | The function output is always between 0 and 1, making it interpretable as a probability |
| S-shaped curve | The function transitions smoothly from 0 to 1 |
| Symmetry | $$\sigma(z) + \sigma(-z) = 1$$ for all values of $$z$$ |
| Derivative | $$\sigma'(z) = \sigma(z) (1 - \sigma(z))$$, which simplifies gradient computation during training |
| Midpoint | At $$z = 0$$, the output is 0.5, providing a natural threshold for binary classification |
| Asymptotic behavior | Approaches 1 as $$z$$ goes to positive infinity and 0 as $$z$$ goes to negative infinity, but never reaches either value |

The sigmoid function is the canonical link function for the Bernoulli distribution, making logistic regression a natural member of the generalized linear model (GLM) family.[4][8]
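
The formulas and properties above can be checked numerically. The following is a minimal sketch using only the Python standard library; the coefficient and input values are made up for illustration, and `predict_proba` here is a hypothetical helper, not a library function.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function 1 / (1 + e^{-z}), evaluated without overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z use the equivalent form e^z / (1 + e^z),
    # so exp() is never called with a large positive argument.
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_proba(x, beta0, beta):
    """P(y = 1 | x) for one observation, given intercept beta0 and weights beta."""
    z = beta0 + sum(b * xi for b, xi in zip(beta, x))
    return sigmoid(z)

# Illustrative (made-up) coefficients and inputs: z = -0.5 + 0.8*2.0 + 0.3*(-1.0) = 0.8
p = predict_proba([2.0, -1.0], beta0=-0.5, beta=[0.8, 0.3])
print("P(y=1|x) =", round(p, 4))

# Property checks from the table.
print("midpoint:", sigmoid(0.0))                      # 0.5
print("symmetry:", sigmoid(3.0) + sigmoid(-3.0))      # sums to 1 up to rounding

# Derivative identity sigma'(z) = sigma(z) * (1 - sigma(z)), via a central difference.
h = 1e-6
numeric = (sigmoid(0.3 + h) - sigmoid(0.3 - h)) / (2 * h)
analytic = sigmoid(0.3) * (1.0 - sigmoid(0.3))
print("derivative:", numeric, analytic)
```

One practical caveat: in floating-point arithmetic the output does round to exactly 0.0 or 1.0 for very large values of |z|, even though the mathematical function never reaches either limit.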

## Log-odds and the logit function

The logistic regression model can be understood through the concept of **log-odds** (also called the [logit](/wiki/logits)).[2][8] If p is the probability of the positive class, then the odds of the positive class are:

$$
\text{odds} = p / (1 - p)
$$

The **logit function** is the natural logarithm of the odds:

$$
\operatorname{logit}(p) = \ln(p / (1 - p))
$$

Logistic regression assumes that the logit of the probability is a linear function of the input features:

$$
\ln(p / (1 - p)) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p
$$

This is a key insight: while the relationship between the features and the probability is nonlinear (due to the sigmoid function), the relationship between the features and the log-odds is linear. Each coefficient $$\beta_j$$ represents the change in the log-odds of the outcome for a one-unit increase in the corresponding feature $$x_j$$, holding all other features constant.[5][8]
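
A short numerical sketch can make the linear-in-log-odds point concrete. It assumes a single feature with a hypothetical intercept of -1.0 and slope of 0.5; only the standard library is used.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def logit(p: float) -> float:
    """Log-odds ln(p / (1 - p)) for a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))

beta0, beta1 = -1.0, 0.5          # hypothetical intercept and slope

xs = [0.0, 1.0, 2.0, 3.0]
probs = [sigmoid(beta0 + beta1 * x) for x in xs]
log_odds = [logit(p) for p in probs]

for x, p, lo in zip(xs, probs, log_odds):
    print(f"x={x:.0f}  p={p:.3f}  log-odds={lo:+.2f}")
```

Each one-unit step in `x` adds exactly `beta1` to the log-odds, while the corresponding steps in the probability are unequal because of the sigmoid's curvature.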

### Interpreting coefficients via odds ratios

The coefficients in logistic regression have a direct interpretation in terms of odds ratios.[5][8] The odds ratio for feature $$x_j$$ is:

$$
\text{OR}_j = e^{\beta_j}
$$

An odds ratio greater than 1 indicates that increasing the feature value increases the odds of the positive class, while an odds ratio less than 1 indicates a decrease. For example, if the coefficient for a feature is 0.7, the odds ratio is $$e^{0.7} \approx 2.01$$, meaning a one-unit increase in that feature roughly doubles the odds of the positive outcome.
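
As a worked illustration of the arithmetic, the sketch below takes the example coefficient of 0.7 and applies the resulting odds ratio to three made-up starting probabilities. The helper names `odds` and `prob_from_odds` are introduced here for illustration only.

```python
import math

beta_j = 0.7                       # coefficient from the example in the text
odds_ratio = math.exp(beta_j)      # multiplicative change in odds per unit increase

def odds(p: float) -> float:
    return p / (1.0 - p)

def prob_from_odds(o: float) -> float:
    return o / (1.0 + o)

print(f"odds ratio = {odds_ratio:.4f}")

# The same odds ratio shifts the probability by different amounts,
# depending on where the probability starts.
shifts = {}
for p_before in (0.10, 0.50, 0.90):
    p_after = prob_from_odds(odds(p_before) * odds_ratio)
    shifts[p_before] = p_after
    print(f"p before = {p_before:.2f}  ->  p after = {p_after:.3f}")
```

Doubling the odds is not the same as doubling the probability: the shift in probability is largest near 0.5 and small near 0 or 1.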

| Coefficient value | Odds ratio | Interpretation |
|---|---|---|
| Positive ($$\beta_j > 0$$) | Greater than 1 | Increases odds of positive class |
| Zero ($$\beta_j = 0$$) | Equal to 1 | No effect on odds |
| Negative ($$\beta_j < 0$$) | Less than 1 | Decreases odds of positive class |

### Marginal effects

While odds ratios are the standard way to interpret logistic regression coefficients, they can be unintuitive. An alternative interpretation uses **marginal effects**, which quantify the change in the predicted probability for a one-unit change in a predictor. Unlike the coefficient itself (which is constant), the marginal effect of a feature depends on the current values of all predictors, because the sigmoid function is nonlinear. The marginal effect of feature $$x_j$$ is:

$$
\frac{\partial P}{\partial x_j} = \beta_j \sigma(z) (1 - \sigma(z))
$$

Practitioners often report the "average marginal effect" (AME), computed by averaging the marginal effect over all observations in the dataset. Marginal effects provide a more intuitive measure of a variable's real-world impact on the predicted probability.[5]
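
The marginal-effect formula and the AME can be sketched with NumPy as follows. The coefficients and the four-row dataset are invented for illustration; in practice they would come from a fitted model and the observed data.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical fitted coefficients: intercept, then two features.
beta0 = -0.5
beta = np.array([0.8, -0.4])

# Small made-up dataset (rows = observations, columns = features).
X = np.array([[0.0, 1.0],
              [1.0, 0.5],
              [2.0, 2.0],
              [4.0, 0.0]])

p = sigmoid(beta0 + X @ beta)                        # predicted probability per row
marginal = (p * (1.0 - p))[:, None] * beta[None, :]  # dP/dx_j for every row and feature
ame = marginal.mean(axis=0)                          # average marginal effect per feature

print("per-observation marginal effects:\n", np.round(marginal, 4))
print("average marginal effects:", np.round(ame, 4))
```

Because $$\sigma(z)(1 - \sigma(z))$$ is at most 0.25, the marginal effect of a feature can never exceed a quarter of its coefficient in absolute value.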

## Maximum likelihood estimation

### Why not least squares?

Unlike [linear regression](/wiki/linear_regression), logistic regression cannot be trained effectively using ordinary least squares. The nonlinear sigmoid function makes the sum of squared errors a non-convex function of the parameters, meaning that gradient-based methods could get stuck in local minima. Instead, logistic regression uses **maximum likelihood estimation** (MLE) to find the optimal model parameters.[6][7]

### The likelihood function

MLE aims to find the parameters that maximize the probability (likelihood) of observing the training data given the model. For a binary classification problem with n observations, the likelihood function is:

$$
L(\beta) = \prod_{i=1}^{n} p_i^{y_i} (1 - p_i)^{1 - y_i}
$$

where $$y_i$$ is the true label (0 or 1) for the i-th observation and $$p_i$$ is the predicted probability for that observation.

In practice, maximizing the likelihood is equivalent to minimizing the **negative log-likelihood** (also called the **[cross-entropy](/wiki/cross-entropy) loss** or **log loss**):[7]

$$
J(\beta) = -\frac{1}{n} \sum_{i=1}^{n} \left[y_i \ln(p_i) + (1 - y_i) \ln(1 - p_i)\right]
$$

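The negative log-likelihood is a convex function of the parameters, so any local minimum is also a global minimum and gradient-based optimizers cannot get trapped in a worse solution. This property holds because, for an exponential-family distribution paired with its canonical link (here the Bernoulli distribution with the logit link), the log-likelihood is concave in the parameters.[6][7] One caveat: when the classes are perfectly separable, the likelihood has no finite maximizer, and unregularized coefficient estimates grow without bound.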
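
The link between the likelihood and the log loss can be checked directly. The sketch below uses made-up labels and predictions; the clipping constant `eps` is an implementation choice assumed here to avoid `log(0)`, not part of the definition.

```python
import math

def likelihood(y_true, p_pred):
    """Product over observations of p_i^y_i * (1 - p_i)^(1 - y_i)."""
    value = 1.0
    for y, p in zip(y_true, p_pred):
        value *= (p ** y) * ((1.0 - p) ** (1 - y))
    return value

def log_loss(y_true, p_pred, eps=1e-15):
    """Average negative log-likelihood J for labels in {0, 1}."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)   # clip so log(0) never occurs
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)

y = [1, 0, 1, 1]
confident_right = [0.9, 0.1, 0.8, 0.95]
hedged = [0.5, 0.5, 0.5, 0.5]
confident_wrong = [0.1, 0.9, 0.2, 0.05]

for name, preds in [("confident and right", confident_right),
                    ("always 0.5", hedged),
                    ("confident and wrong", confident_wrong)]:
    print(f"{name:20s} L = {likelihood(y, preds):.4f}   J = {log_loss(y, preds):.4f}")
```

Higher likelihood always corresponds to lower log loss, since $$J = -\frac{1}{n} \ln L$$; predicting 0.5 everywhere gives $$J = \ln 2$$.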

### Optimization algorithms

Several optimization algorithms can minimize the log loss:

| Algorithm | Description | When to use |
|---|---|---|
| [Gradient descent](/wiki/gradient_descent) | Iteratively updates parameters using the gradient | General-purpose, easy to implement |
| Newton's method (IRLS) | Uses second-order derivatives for faster convergence | Small to medium datasets |
| L-BFGS | Quasi-Newton method, approximates the Hessian | Medium to large datasets |
| [Stochastic gradient descent](/wiki/stochastic_gradient_descent_sgd) | Updates parameters using one sample (or mini-batch) at a time | Very large datasets, online learning |
| Coordinate descent | Optimizes one parameter at a time | Useful with L1 regularization |

Iteratively Reweighted Least Squares (IRLS), a form of Newton's method, is the classical algorithm for logistic regression and is used in R's `glm()` function.[7] Modern implementations in libraries like [scikit-learn](/wiki/scikit-learn) typically use L-BFGS or coordinate descent; L-BFGS has been the default solver in scikit-learn's `LogisticRegression` class since version 0.22.[10] Unlike linear regression, there is no closed-form solution for the logistic regression parameters; iterative numerical methods are always required.[6]
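
To illustrate the Newton/IRLS idea, here is a minimal NumPy sketch for unregularized logistic regression on synthetic data. It is a teaching sketch under simple assumptions (no step-size control, no handling of separable data), not a reproduction of what `glm()` or scikit-learn do internally; `fit_irls` is a hypothetical name.

```python
import numpy as np

def fit_irls(X, y, n_iter=25, tol=1e-10):
    """Newton's method (IRLS) for unregularized logistic regression.

    X must already contain a leading column of ones for the intercept.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        w = p * (1.0 - p)                    # diagonal of the weight matrix W
        grad = X.T @ (p - y)                 # gradient of the negative log-likelihood
        hess = X.T @ (X * w[:, None])        # Hessian X^T W X
        step = np.linalg.solve(hess, grad)   # Newton step
        beta -= step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Synthetic data generated from known (made-up) coefficients.
rng = np.random.default_rng(0)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
true_beta = np.array([-0.5, 1.0, -2.0])
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-(X @ true_beta)))).astype(float)

beta_hat = fit_irls(X, y)
print("estimated coefficients:", np.round(beta_hat, 3))
```

Each iteration solves a weighted least-squares problem, which is why the method typically converges in a handful of steps on small problems but becomes expensive when the number of features is large.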

### Gradient of the log loss

The gradient of the log loss with respect to each parameter $$\beta_j$$ is:

$$
\frac{\partial J}{\partial \beta_j} = \frac{1}{n} \sum_{i=1}^{n} (p_i - y_i) x_{ij}
$$

where $$p_i = \sigma(\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip})$$ is the predicted probability for observation $$i$$, and $$x_{i0} = 1$$ for the intercept term. This gradient has an elegant form: it is the average of the product of the prediction error $$(p_i - y_i)$$ and the feature value $$x_{ij}$$, which is the same structure as the gradient for linear regression. The difference is that $$p_i$$ is computed through the sigmoid function rather than being a direct linear combination.[6][7]
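
The gradient formula translates almost line for line into a batch gradient descent loop. The sketch below uses NumPy and synthetic data; the learning rate, step count, and generating coefficients are arbitrary choices for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(X, y, beta):
    p = np.clip(sigmoid(X @ beta), 1e-12, 1.0 - 1e-12)
    return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

def gradient(X, y, beta):
    """(1/n) * sum_i (p_i - y_i) * x_ij for every j, written as one matrix product."""
    return X.T @ (sigmoid(X @ beta) - y) / len(y)

def fit_gd(X, y, lr=0.5, n_steps=2000):
    beta = np.zeros(X.shape[1])
    for _ in range(n_steps):
        beta -= lr * gradient(X, y, beta)   # step against the gradient
    return beta

rng = np.random.default_rng(1)
n = 400
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # column of ones = intercept
y = (rng.random(n) < sigmoid(X @ np.array([0.3, 1.5, -1.0]))).astype(float)

beta_hat = fit_gd(X, y)
loss_start = log_loss(X, y, np.zeros(3))
loss_end = log_loss(X, y, beta_hat)
print("coefficients:", np.round(beta_hat, 3))
print(f"log loss: {loss_start:.4f} -> {loss_end:.4f}")
```

Starting from all-zero coefficients, every prediction is 0.5 and the loss equals $$\ln 2$$; the loop then lowers it until the gradient is numerically zero.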

## Decision boundary

The **decision boundary** is the surface that separates the predicted classes. In binary logistic regression, the model predicts class 1 when $$p \ge \text{threshold}$$ and class 0 otherwise. The default threshold is 0.5.

At the decision boundary, the predicted probability equals the threshold. For a threshold of 0.5, this corresponds to $$z = 0$$, so the decision boundary is the set of points where:

$$
\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p = 0
$$

This is a **linear decision boundary**: a line in two dimensions, a plane in three dimensions, or a hyperplane in higher dimensions.[6][7] This linear boundary is both a strength and a limitation of logistic regression. It works well when the classes are approximately linearly separable, but it cannot capture complex, nonlinear boundaries without feature engineering.

### Extending to nonlinear decision boundaries

Although logistic regression produces a linear decision boundary by default, practitioners can create nonlinear boundaries by engineering polynomial or interaction features. For example, adding $$x_1^2$$, $$x_2^2$$, and $$x_1 x_2$$ as additional features allows the decision boundary to take the shape of an ellipse, parabola, or other conic section.[6] This approach is similar in spirit to the kernel trick used in [support vector machines](/wiki/support_vector_machine_svm), though it requires explicit feature construction.
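
A small experiment shows the effect of adding quadratic features. This sketch assumes scikit-learn is installed and uses its `make_circles` toy dataset (two concentric rings); the accuracies are measured on the training data purely to illustrate the shape of the boundary, not to estimate generalization.

```python
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Two concentric rings: no straight line can separate them.
X, y = make_circles(n_samples=400, noise=0.05, factor=0.4, random_state=0)

# Plain logistic regression on x1, x2 (linear boundary).
linear_model = LogisticRegression(max_iter=1000).fit(X, y)

# Degree-2 expansion adds x1^2, x1*x2 and x2^2 alongside x1 and x2.
quadratic_model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LogisticRegression(max_iter=1000),
).fit(X, y)

linear_acc = linear_model.score(X, y)
quadratic_acc = quadratic_model.score(X, y)
print(f"linear features:    accuracy = {linear_acc:.2f}")
print(f"quadratic features: accuracy = {quadratic_acc:.2f}")
```

The model is still linear in its (expanded) inputs; the boundary only looks curved when drawn in the original two-dimensional feature space.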

### Adjusting the threshold

The default threshold of 0.5 is not always optimal. In many real-world applications, the costs of false positives and false negatives are different. For example, in medical diagnosis, missing a disease (false negative) may be far more costly than a false alarm (false positive). Adjusting the threshold allows practitioners to balance sensitivity and specificity according to the application's needs.
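
The trade-off can be seen on a toy example. The labels and predicted probabilities below are invented, and `confusion_counts` is a hypothetical helper written for this sketch.

```python
def confusion_counts(y_true, p_pred, threshold):
    """Return (TP, FP, FN, TN) when predicting class 1 if p >= threshold."""
    tp = fp = fn = tn = 0
    for y, p in zip(y_true, p_pred):
        predicted = 1 if p >= threshold else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

# Made-up labels and predicted probabilities for ten patients.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
p_pred = [0.92, 0.71, 0.45, 0.30, 0.62, 0.40, 0.25, 0.15, 0.10, 0.05]

for threshold in (0.5, 0.3):
    tp, fp, fn, tn = confusion_counts(y_true, p_pred, threshold)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    print(f"threshold={threshold}: TP={tp} FP={fp} FN={fn} TN={tn} "
          f"sensitivity={sensitivity:.2f} specificity={specificity:.2f}")
```

Lowering the threshold from 0.5 to 0.3 catches every positive case in this toy data, at the price of one additional false alarm.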

## Types of logistic regression

### Binary logistic regression

Binary logistic regression is the most common form. The target variable has exactly two possible outcomes (for example, "spam" or "not spam," "disease" or "healthy"). The model estimates $$P(y = 1 \mid x)$$ using the sigmoid function, and the probability of the other class is its complement: $$P(y = 0 \mid x) = 1 - P(y = 1 \mid x)$$.[5][8]

### Multinomial logistic regression (softmax regression)

When the target variable has more than two unordered categories, **multinomial logistic regression** extends the binary model.[8] There are two main approaches:

**One-vs-Rest (OvR).** This strategy trains one binary classifier per class, treating each class as the positive class and all other classes combined as the negative class. For K classes, this produces K separate binary logistic regression models. The class with the highest predicted probability is chosen as the final prediction.

**[Softmax](/wiki/softmax) (multinomial).** The multinomial approach models all classes simultaneously using the softmax function:

$$
P(y = k \mid x) = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}
$$

where $$z_k = \beta_{0k} + \beta_{1k} x_1 + \cdots + \beta_{pk} x_p$$ is the linear combination for class k. The softmax function ensures that the probabilities for all classes sum to 1.[7] The loss function for multinomial logistic regression is the **categorical cross-entropy**:

$$
J = -\frac{1}{n} \sum_{i=1}^{n} \sum_{k=1}^{K} y_{ik} \ln(p_{ik})
$$

where $$y_{ik}$$ is 1 if observation i belongs to class k and 0 otherwise.
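
The softmax and categorical cross-entropy formulas can be sketched in NumPy as follows. The score matrix `Z` is made up, and true classes are given as integer indices rather than one-hot vectors, which selects the same $$\ln(p_{ik})$$ terms.

```python
import numpy as np

def softmax(Z):
    """Row-wise softmax; subtracting the row maximum avoids overflow
    and leaves the result unchanged."""
    Z = Z - Z.max(axis=1, keepdims=True)
    expZ = np.exp(Z)
    return expZ / expZ.sum(axis=1, keepdims=True)

def categorical_cross_entropy(P, y, eps=1e-12):
    """Mean of -ln(p_ik), where k is the true class of observation i."""
    n = len(y)
    true_class_probs = np.clip(P[np.arange(n), y], eps, 1.0)
    return -np.mean(np.log(true_class_probs))

# Hypothetical scores z_k for 3 observations and K = 3 classes.
Z = np.array([[2.0, 1.0, 0.1],
              [0.5, 2.5, 0.3],
              [1.0, 1.0, 1.0]])
y = np.array([0, 1, 2])          # true class index for each observation

P = softmax(Z)
loss = categorical_cross_entropy(P, y)
print("class probabilities:\n", np.round(P, 3))
print("categorical cross-entropy:", round(float(loss), 4))
```

With two classes and one score fixed at zero, softmax reduces to the sigmoid function, which is why binary logistic regression is a special case of the multinomial model.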

### Ordinal logistic regression

When the categories have a natural order (for example, "low," "medium," "high"), **ordinal logistic regression** is more appropriate than multinomial logistic regression. The most common variant is the **proportional odds model**, which models cumulative probabilities using a single set of feature coefficients and multiple threshold (intercept) parameters.[8] This preserves the ordering information and is more parsimonious than fitting separate models for each category.
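
A minimal sketch of how a proportional odds model turns one coefficient vector and several thresholds into class probabilities is given below. It assumes the common parameterization $$P(y \le k \mid x) = \sigma(\theta_k - x \cdot \beta)$$ (sign conventions differ between software packages), and all numeric values are hypothetical.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def ordinal_probs(x, beta, thresholds):
    """Class probabilities under a proportional odds model.

    Assumes P(y <= k | x) = sigmoid(theta_k - x . beta), with thresholds
    sorted in increasing order; K - 1 thresholds give K ordered classes.
    """
    eta = sum(b * xi for b, xi in zip(beta, x))
    cumulative = [sigmoid(theta - eta) for theta in thresholds] + [1.0]
    probs, previous = [], 0.0
    for c in cumulative:
        probs.append(c - previous)    # P(y = k) = P(y <= k) - P(y <= k - 1)
        previous = c
    return probs

beta = [1.2]                  # hypothetical single shared coefficient
thresholds = [-0.5, 1.0]      # two cut points -> classes "low", "medium", "high"

low_x = ordinal_probs([-1.0], beta, thresholds)
high_x = ordinal_probs([2.0], beta, thresholds)
print("x = -1:", [round(p, 3) for p in low_x])
print("x =  2:", [round(p, 3) for p in high_x])
```

Only the thresholds differ between categories; the shared coefficient shifts all cumulative probabilities together, which is what makes the model more parsimonious than a multinomial fit.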

## Regularization

[Regularization](/wiki/regularization) adds a penalty term to the cost function to prevent overfitting, especially when the number of features is large relative to the number of training examples or when features are correlated.[6]

### L1 regularization (Lasso)

L1 regularization adds the sum of the absolute values of the coefficients to the cost function:

$$
J_{\text{regularized}} = J(\beta) + \lambda \sum |\beta_j|
$$

L1 regularization can drive some coefficients to exactly zero, performing automatic [feature selection](/wiki/feature_selection).[6] This is useful when many features are irrelevant or redundant. The resulting model is sparse, which aids interpretability.

### L2 regularization (Ridge)

L2 regularization adds the sum of the squared coefficients to the cost function:

$$
J_{\text{regularized}} = J(\beta) + \lambda \sum \beta_j^2
$$

L2 regularization shrinks all coefficients toward zero but does not set any to exactly zero. It is the default regularization in many logistic regression implementations, including scikit-learn's `LogisticRegression` class.[6][10]
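
The shrinkage behaviour can be demonstrated with a small gradient descent sketch in NumPy. It minimizes $$J(\beta) + \lambda \sum \beta_j^2$$ on synthetic data and, as an assumption following common practice, leaves the intercept unpenalized; `fit_l2` and the chosen values of $$\lambda$$ are illustrative only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_l2(X, y, lam, lr=0.5, n_steps=3000):
    """Gradient descent on J(beta) + lam * sum_j beta_j^2 (intercept unpenalized)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_steps):
        grad = X.T @ (sigmoid(X @ beta) - y) / len(y)
        penalty_grad = 2.0 * lam * beta
        penalty_grad[0] = 0.0            # column 0 is the intercept
        beta -= lr * (grad + penalty_grad)
    return beta

rng = np.random.default_rng(2)
n = 300
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
y = (rng.random(n) < sigmoid(X @ np.array([0.0, 2.0, -1.0, 0.0]))).astype(float)

lams = (0.0, 0.01, 0.1, 1.0)
fits = {lam: fit_l2(X, y, lam) for lam in lams}
for lam in lams:
    print(f"lambda={lam:<5} weights={np.round(fits[lam][1:], 3)}")
```

As $$\lambda$$ grows, the weights move steadily toward zero, but none of them becomes exactly zero, in contrast to the sparsity that an L1 penalty can produce.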

### Elastic net

Elastic net combines L1 and L2 penalties, providing a balance between feature selection and coefficient shrinkage:

$$
J_{\text{regularized}} = J(\beta) + \lambda_1 \sum |\beta_j| + \lambda_2 \sum \beta_j^2
$$

Elastic net is particularly useful when there are groups of correlated features; L1 alone tends to select one feature from each group arbitrarily, while elastic net can retain all members of the group.

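In scikit-learn, the regularization strength is set through the parameter `C`, which is the inverse of the regularization strength ($$C \propto 1/\lambda$$; the exact constant depends on how the loss term is scaled). Smaller values of `C` apply stronger regularization.[10]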

| Regularization | Penalty | Feature selection | Sparsity | Default in scikit-learn |
|---|---|---|---|---|
| L1 (Lasso) | Sum of absolute coefficients | Yes | Sparse | No (use penalty='l1') |
| L2 (Ridge) | Sum of squared coefficients | No | Dense | Yes (penalty='l2') |
| Elastic net | L1 + L2 combined | Partial | Moderate | No (use penalty='elasticnet') |
| None | No penalty | No | Dense | No (use penalty=None; the string 'none' is no longer accepted in recent releases) |

## Evaluation metrics

### Confusion matrix

A [confusion matrix](/wiki/confusion_matrix) provides a detailed breakdown of the model's predictions:[5]

| | Predicted positive | Predicted negative |
|---|---|---|
| **Actual positive** | True Positive (TP) | False Negative (FN) |
| **Actual negative** | False Positive (FP) | True Negative (TN) |

From the confusion matrix, all common classification metrics can be derived.
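
As a sketch of that derivation, the function below computes accuracy, precision, recall, specificity, and F1 (each defined in the next table) from the four confusion-matrix counts. The counts are made up, and the function assumes no denominator is zero.

```python
def classification_metrics(tp, fp, fn, tn):
    """Derive common metrics from confusion-matrix counts.

    Assumes every denominator is non-zero (at least one predicted positive,
    one actual positive, and one actual negative).
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp),
        "f1": 2 * precision * recall / (precision + recall),
    }

# Made-up counts: 100 actual positives and 900 actual negatives.
metrics = classification_metrics(tp=80, fp=30, fn=20, tn=870)
for name, value in metrics.items():
    print(f"{name:12s} {value:.4f}")
```

In this imbalanced example accuracy is 0.95 while precision is only about 0.73, which illustrates why accuracy alone can hide weak performance on the positive class.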

### Common classification metrics

| Metric | Formula | Meaning |
|---|---|---|
| Accuracy | $$(\text{TP} + \text{TN}) / (\text{TP} + \text{TN} + \text{FP} + \text{FN})$$ | Overall proportion of correct predictions |
| Precision | $$\text{TP} / (\text{TP} + \text{FP})$$ | Proportion of positive predictions that are correct |
| Recall (sensitivity) | $$\text{TP} / (\text{TP} + \text{FN})$$ | Proportion of actual positives correctly identified |
| Specificity | $$\text{TN} / (\text{TN} + \text{FP})$$ | Proportion of actual negatives correctly identified |
| F1 score | $$\frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$ | Harmonic mean of precision and recall |

### ROC curve and AUC

The [ROC curve](/wiki/roc_receiver_operating_characteristic_curve) is a graphical tool for evaluating the performance of a binary classifier across all possible thresholds. It plots the True Positive Rate (sensitivity) against the False Positive Rate as the classification threshold varies from 0 to 1.[5]

The **Area Under the ROC Curve (AUC)** summarizes the overall performance of the classifier in a single number between 0 and 1:

- **AUC = 1.0:** Perfect classifier that separates all positives from negatives.
- **AUC = 0.5:** No better than random guessing (the ROC curve is a diagonal line).
- **AUC < 0.5:** Worse than random guessing (indicates the labels may be reversed).

AUC is a threshold-independent metric, making it useful for comparing classifiers without committing to a specific threshold. It is particularly valuable when dealing with [imbalanced datasets](/wiki/imbalanced_data), where accuracy alone can be misleading.[5][12]
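
One way to see why AUC does not depend on a threshold is its well-known ranking interpretation: it equals the probability that a randomly chosen positive receives a higher score than a randomly chosen negative. The sketch below computes AUC that way on made-up scores; it is a quadratic-time illustration, not how libraries compute it in practice.

```python
def auc_by_ranking(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Tied scores count as half a correct pair.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Made-up labels and model scores (3 positives, 4 negatives -> 12 pairs).
y_true = [1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1]

auc = auc_by_ranking(y_true, scores)
print(f"AUC = {auc:.4f}")   # 10 of the 12 pairs are ordered correctly
```

Negating every score reverses all pairwise comparisons and gives `1 - auc`, which matches the remark that an AUC below 0.5 suggests reversed labels.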

### Precision-recall curve

For highly imbalanced datasets, the **Precision-Recall (PR) curve** is often more informative than the ROC curve. The PR curve plots precision against recall at various thresholds. The area under the PR curve (AUPRC) provides a summary metric that focuses on the performance of the classifier on the minority class.

### Additional evaluation tools

Beyond classification metrics, logistic regression models can be assessed using statistical tests commonly found in software like statsmodels and R:[5]

- **Deviance and likelihood ratio tests:** Compare the fitted model against a null (intercept-only) model or nested models.
- **Pseudo-R-squared:** Analogues of R-squared for linear regression (for example, McFadden's R-squared), which measure how much the model improves over the null model.
- **Hosmer-Lemeshow test:** Assesses the goodness of fit by comparing observed and predicted event rates across groups of observations.[5]
- **Wald statistic:** Tests whether individual coefficients are significantly different from zero.

## Assumptions of logistic regression

Logistic regression makes several assumptions that should be verified before relying on the model:[5]

1. **Linearity of the logit.** The log-odds of the outcome is a linear function of the independent variables. This does not mean the features and the probability have a linear relationship.
2. **Independence of observations.** Each observation is independent of the others. Violations occur with clustered, matched, or time-series data.
3. **No multicollinearity.** The independent variables should not be highly correlated with each other. Strong multicollinearity inflates the variance of coefficient estimates.
4. **Large sample size.** Logistic regression generally requires a larger sample size than linear regression for reliable estimates, especially when the number of features is large. A widely cited rule of thumb (the "rule of ten") is that at least 10 to 20 events per predictor variable are needed for stable coefficient estimates.[5] This guideline traces to a 1996 Monte Carlo simulation study by Peduzzi and colleagues, published in the Journal of Clinical Epidemiology, which ran 500 simulated analyses at events-per-variable values of 2, 5, 10, 15, 20, and 25 and found that values below about 10 produced biased coefficients and unreliable confidence intervals.[13]
5. **No extreme outliers.** Outliers can disproportionately influence the estimated coefficients, particularly in small datasets.

Notably, logistic regression does **not** assume normality of the features, homoscedasticity, or a linear relationship between features and the outcome (only between features and the log-odds).[5]
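
The events-per-variable rule of thumb from the sample-size assumption is simple arithmetic, sketched below. The helper `events_per_variable` is hypothetical, and it assumes the "event" count is the size of the rarer outcome class, which is a common but not universal convention.

```python
def events_per_variable(y, n_predictors):
    """Events per variable, counting the rarer outcome as the 'event' (an assumption)."""
    n_ones = sum(y)
    n_rarer = min(n_ones, len(y) - n_ones)
    return n_rarer / n_predictors

# Made-up study: 500 observations, 60 events, 8 candidate predictors.
y = [1] * 60 + [0] * 440
epv = events_per_variable(y, n_predictors=8)
print(f"EPV = {epv:.1f}")
print("meets the 10-events-per-variable guideline:", epv >= 10)
```

Here 60 events spread over 8 predictors give an EPV of 7.5, below the guideline, so fewer predictors, more data, or regularization would be worth considering.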

## Relationship to other models

### Generalized linear models

Logistic regression is a special case of the **generalized linear model (GLM)** framework, formalized by John Nelder and Robert Wedderburn in 1972.[4] In this framework, the response variable follows a distribution from the exponential family (here, the Bernoulli distribution), and the link function transforms the mean of the response to a linear predictor. For logistic regression, the link function is the logit.[4][8] Other binary regression models within the GLM framework include the probit model (which uses the inverse normal CDF as the link function) and the complementary log-log model. The probit and logit models produce very similar predictions for most datasets and differ mainly in the tails of their probability distributions.[8]

### How does logistic regression relate to neural networks?

Logistic regression is mathematically equivalent to a neural network with no hidden layers and a sigmoid activation function in the output neuron.[7][11] A single neuron receives several inputs, multiplies each by a weight, adds a bias, sums the results, and passes the sum through the sigmoid function. This is precisely the operation that logistic regression performs. The bias term corresponds to the intercept ($$\beta_0$$), and the weights correspond to the feature coefficients.[11]

From this perspective, training a logistic regression model with binary cross-entropy loss and gradient descent is identical to training a single-neuron neural network. Deep neural networks can be understood as stacking many such "logistic regression units" (neurons) in layers, with nonlinear activations between them, allowing the network to learn complex, nonlinear decision boundaries that a single logistic regression cannot capture.[7][11]

## Feature importance and coefficient interpretation

One of the main advantages of logistic regression is the interpretability of its coefficients. Several approaches exist for assessing feature importance:

| Method | Description | When to use |
|---|---|---|
| Odds ratios ($$e^{\beta_j}$$) | Multiplicative change in odds per unit change in feature | Default interpretation for logistic regression |
| Standardized coefficients | Coefficients after features are z-score normalized | Comparing importance across features with different scales |
| Average marginal effects | Average change in predicted probability per unit change | When probability-scale interpretation is desired |
| Wald test / p-values | Statistical significance of each coefficient | Hypothesis testing, explanatory modeling |
| L1 coefficient magnitude | Absolute value of coefficients after L1 regularization | Predictive modeling with automatic feature selection |

When features are standardized (mean 0, standard deviation 1), the coefficients are on a common scale, so their magnitudes give a rough indication of each feature's relative importance, although correlated features can still share or mask one another's effects. Without standardization, the coefficient magnitude depends on the scale of the feature and cannot be compared directly across features.[5][6]

## Strengths and weaknesses

| Strengths | Weaknesses |
|---|---|
| Highly interpretable (coefficients, odds ratios, marginal effects) | Cannot model nonlinear decision boundaries without feature engineering |
| Fast to train and predict | Assumes linearity of log-odds |
| Outputs well-calibrated probabilities | Sensitive to multicollinearity |
| Works well with small to medium datasets | May underperform on complex tasks compared to ensemble methods |
| Built-in regularization options (L1, L2, elastic net) | Requires careful handling of missing data and outliers |
| Low memory footprint | Needs sufficient samples per class for reliable estimates |
| Strong theoretical foundations in statistics | Performance degrades with many irrelevant features (without regularization) |
| Serves as an excellent baseline for any classification task | Cannot directly model interactions unless features are explicitly constructed |

## How does logistic regression compare with other classifiers?

| Classifier | Interpretability | Handles nonlinearity | Probabilistic output | Training speed | Best for |
|---|---|---|---|---|---|
| Logistic regression | High | No (linear boundary) | Yes (well-calibrated) | Fast | Baseline, interpretable models |
| [Naive Bayes](/wiki/naive_bayes) | Moderate | No | Yes (often poorly calibrated) | Very fast | Text classification, small datasets |
| [Decision tree](/wiki/decision_tree) | High | Yes | Limited | Fast | Interpretable nonlinear models |
| [Random forest](/wiki/random_forest) | Low | Yes | Yes (averaged) | Moderate | General-purpose, robust |
| [Support vector machine](/wiki/support_vector_machine_svm) | Low | Yes (with kernels) | Not native (requires Platt scaling) | Moderate | High-dimensional, small-medium data |
| [K-nearest neighbors](/wiki/k_nearest_neighbors) | Low | Yes | Limited | Fast (lazy learner) | Non-parametric, small data |
| [Neural network](/wiki/neural_network) | Low | Yes | Yes (with softmax/sigmoid) | Slow | Complex patterns, large data |
| [Gradient boosting](/wiki/gradient_boosting) | Low | Yes | Yes | Moderate | Competitions, tabular data |

## What is logistic regression used for?

Logistic regression is used across virtually every domain where binary or categorical outcomes must be predicted. Its simplicity, interpretability, and regulatory transparency make it especially popular in fields where model decisions must be explained to stakeholders or regulators.

| Application domain | Example use case | Target variable |
|---|---|---|
| Healthcare | Disease diagnosis, mortality risk prediction | Positive / Negative |
| Finance | Credit scoring, loan default prediction | Default / No default |
| Marketing | Customer churn prediction, ad click-through | Churned / Retained |
| [Natural language processing](/wiki/natural_language_processing) | [Sentiment analysis](/wiki/sentiment_analysis), spam detection | Positive / Negative |
| Fraud detection | Transaction fraud screening, insurance fraud | Fraudulent / Legitimate |
| Epidemiology | Disease risk factor analysis, clinical trials | Infected / Not infected |
| Criminal justice | Recidivism risk assessment | Reoffend / Not reoffend |
| Social sciences | Survey response modeling, voting behavior | Yes / No |

In credit scoring, logistic regression remains the industry standard in many regulatory environments because its coefficients can be directly translated into scorecards with transparent point assignments.[12] In healthcare, logistic regression models for clinical risk prediction (such as the Framingham Risk Score for cardiovascular disease) have been validated over decades of use.[5]

## Implementation

### Python: scikit-learn

Scikit-learn provides a robust, production-ready implementation of logistic regression through the `LogisticRegression` class. It is designed primarily for predictive modeling and applies L2 regularization by default.[10]

```python
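from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary data so the example runs end to end; substitute your own X and y.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# L2 regularization is the default penalty; C is the inverse regularization strength.
model = LogisticRegression(C=1.0, max_iter=100)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)               # hard labels at the 0.5 threshold
y_prob = model.predict_proba(X_test)[:, 1]   # estimated P(y = 1 | x)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("AUC:", roc_auc_score(y_test, y_prob))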
```

Key parameters include `penalty` (regularization type), `C` (inverse regularization strength), and `solver` (optimization algorithm). Older releases also accept `multi_class` (strategy for multiclass problems), which has been deprecated since version 1.5. Since scikit-learn version 0.22, the default solver is `lbfgs` (L-BFGS), with L2 as the default penalty.[10]

| Solver | Supports L1 | Supports L2 | Supports multinomial | Best for |
|---|---|---|---|---|
| lbfgs | No | Yes | Yes | Default, medium datasets |
| liblinear | Yes | Yes | No (OvR only) | Small datasets, L1 penalty |
| saga | Yes | Yes | Yes | Large datasets, all penalties |
| newton-cg | No | Yes | Yes | Medium datasets |
| newton-cholesky | No | Yes | No | Dense features, many samples |

### Python: statsmodels

Statsmodels is designed for explanatory (inferential) modeling and provides detailed statistical output, including p-values, confidence intervals, and hypothesis tests. It does not apply regularization by default.

```python
import statsmodels.api as sm

X_with_const = sm.add_constant(X_train)  # add intercept term
model = sm.Logit(y_train, X_with_const)
result = model.fit()
print(result.summary())  # detailed statistical summary
```

Statsmodels also supports logistic regression through its GLM interface:

```python
model = sm.GLM(y_train, X_with_const, family=sm.families.Binomial())
result = model.fit()
```

### R: glm()

In R, logistic regression is fit using the `glm()` function with `family = binomial`:

```r
model <- glm(y ~ x1 + x2 + x3, data = train_data, family = binomial)
summary(model)
predictions <- predict(model, newdata = test_data, type = "response")
```

R's `glm()` uses IRLS (iteratively reweighted least squares) as its default optimizer.[7] The `summary()` output includes coefficient estimates, standard errors, z-values, and p-values for each predictor.

### Choosing between implementations

| Feature | scikit-learn | statsmodels | R glm() |
|---|---|---|---|
| Primary purpose | Prediction | Inference | Inference |
| Default regularization | L2 (C=1.0) | None | None |
| p-values / confidence intervals | Not built-in | Yes | Yes |
| Supports cross-validation | Yes (with `LogisticRegressionCV`) | No (manual) | No (manual) |
| Optimization algorithms | L-BFGS, liblinear, saga, etc. | Newton-Raphson | IRLS |
| Best suited for | Machine learning pipelines | Statistical analysis | Statistical analysis |

## Explain like I'm 5 (ELI5)

Imagine you want to figure out if a fruit is an apple or an orange based on its color and size. Logistic regression is like drawing a line between the apples and oranges on a chart. On one side of the line, the model says "probably an apple." On the other side, it says "probably an orange." The closer a fruit is to the line, the less sure the model is. The farther away it is, the more confident the prediction. Logistic regression learns where to draw this line by looking at lots of examples of apples and oranges you have already identified.

## References

1. Verhulst, P. F. (1838). "Notice sur la loi que la population suit dans son accroissement." *Correspondance Mathématique et Physique*, 10, 113-121.
2. Berkson, J. (1944). "Application of the logistic function to bio-assay." *Journal of the American Statistical Association*, 39(227), 357-365.
3. Cox, D. R. (1958). "The regression analysis of binary sequences." *Journal of the Royal Statistical Society, Series B*, 20(2), 215-242.
4. Nelder, J. A. & Wedderburn, R. W. M. (1972). "Generalized linear models." *Journal of the Royal Statistical Society, Series A*, 135(3), 370-384.
5. Hosmer, D. W., Lemeshow, S., & Sturdivant, R. X. (2013). *Applied Logistic Regression* (3rd ed.). Wiley.
6. Hastie, T., Tibshirani, R., & Friedman, J. (2009). *The Elements of Statistical Learning: Data Mining, Inference, and Prediction* (2nd ed.). Springer.
7. Bishop, C. M. (2006). *Pattern Recognition and Machine Learning*. Springer.
8. Agresti, A. (2002). *Categorical Data Analysis* (2nd ed.). Wiley.
9. Cramer, J. S. (2003). "The origins and development of the logit model." *Tinbergen Institute Discussion Paper*, No. 02-119/4.
10. Scikit-learn documentation. [Logistic Regression](https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression).
11. Raschka, S. (2018). "What is the relation between Logistic Regression and Neural Networks and when to use which?" [sebastianraschka.com](https://sebastianraschka.com/faq/docs/logisticregr-neuralnet.html).
12. Wikipedia contributors. [Logistic regression](https://en.wikipedia.org/wiki/Logistic_regression). Wikipedia, The Free Encyclopedia.
13. Peduzzi, P., Concato, J., Kemper, E., Holford, T. R., & Feinstein, A. R. (1996). "A simulation study of the number of events per variable in logistic regression analysis." *Journal of Clinical Epidemiology*, 49(12), 1373-1379.