A generalized linear model (GLM) is a flexible extension of ordinary linear regression that allows the response variable to follow any distribution from the exponential family, not just the normal distribution. GLMs unify several common statistical models, including linear regression, logistic regression, and Poisson regression, under a single theoretical framework. The framework was introduced by John Nelder and Robert Wedderburn in a landmark 1972 paper published in the Journal of the Royal Statistical Society [1]. GLMs have since become one of the most important tools in both classical statistics and machine learning, with applications spanning insurance pricing, ecological modeling, epidemiology, and many other fields.
Imagine you have a toy machine that predicts things. Regular linear regression is like a machine that can only predict smooth, continuous numbers (like temperature). But what if you want to predict something different, like whether it will rain (yes or no) or how many birds you will see in the park (a count: 0, 1, 2, 3...)? A generalized linear model is like an upgraded version of that machine. It has a special adapter (the "link function") that lets it handle all sorts of different predictions, not just smooth numbers. You plug in the right adapter depending on what kind of answer you need, and the machine works for many more types of questions.
Before 1972, statisticians treated models like linear regression, logistic regression, probit analysis, and Poisson regression as separate techniques, each with its own estimation procedure and theory. John Nelder and Robert Wedderburn recognized that these models shared a common mathematical structure and published "Generalized Linear Models" in the Journal of the Royal Statistical Society, Series A, Volume 135, pages 370 to 384 [1]. Their paper demonstrated that by specifying three components (a distribution, a linear predictor, and a link function), a wide class of models could be estimated using a single iterative algorithm.
Nelder and Wedderburn also proposed the iteratively reweighted least squares (IRLS) algorithm for maximum likelihood estimation, which remains the default fitting method in most statistical software today. Their work was later expanded by Peter McCullagh and John Nelder in the influential 1989 textbook Generalized Linear Models (second edition), which became a standard reference in the field [2].
Every generalized linear model is defined by three components that work together to specify how the response variable relates to the predictor variables.
The random component specifies the probability distribution of the response variable Y. In a GLM, this distribution must belong to the exponential family, which includes the normal, binomial, Poisson, gamma, and inverse Gaussian distributions (and the negative binomial distribution when its shape parameter is held fixed). The choice of distribution reflects the nature of the data. For example, binary outcomes call for a binomial distribution, count data call for a Poisson distribution, and continuous positive-valued data call for a gamma distribution.
The systematic component is the linear predictor, which combines the explanatory variables into a single value:
η = β₀ + β₁X₁ + β₂X₂ + ... + βₚXₚ
Here, β₀ is the intercept, β₁ through βₚ are regression coefficients, and X₁ through Xₚ are the predictor variables. Despite the name "generalized linear model," the linearity refers to this component: the model is linear in the parameters, even though the overall relationship between predictors and the response can be nonlinear.
The link function g(·) connects the random component to the systematic component by transforming the expected value of the response variable so that it equals the linear predictor:
g(μ) = η, where μ = E(Y)
The link function serves two purposes. First, it maps the expected value of Y onto the entire real line, which is necessary because the linear predictor η can take any value from negative infinity to positive infinity. Second, it ensures that predicted values remain within the valid range for the chosen distribution. For instance, the logit link maps probabilities (which must lie between 0 and 1) onto the real line, while the log link ensures that predicted counts remain positive.
Each exponential family distribution has a canonical link function that arises naturally from the mathematical form of the distribution. Using the canonical link simplifies estimation and guarantees certain desirable statistical properties, such as the equivalence of Newton's method and Fisher scoring.
The following table summarizes the most widely used GLMs, their distributions, link functions (canonical in every case except the negative binomial, for which the log link is the common convention), and typical applications.
| Distribution | Link Function | Link Formula | Mean Function | Typical Use |
|---|---|---|---|---|
| Normal (Gaussian) | Identity | g(μ) = μ | μ = Xβ | Continuous response data, standard linear regression |
| Binomial | Logit | g(μ) = log(μ / (1 - μ)) | μ = exp(Xβ) / (1 + exp(Xβ)) | Binary outcomes, logistic regression |
| Poisson | Log | g(μ) = log(μ) | μ = exp(Xβ) | Count data (event counts, frequencies) |
| Gamma | Inverse | g(μ) = 1/μ | μ = 1/(Xβ) | Positive continuous data (insurance claims, survival times) |
| Inverse Gaussian | Inverse squared | g(μ) = 1/μ² | μ = (Xβ)^(-1/2) | Positive continuous data with heavy right tail |
| Negative Binomial | Log | g(μ) = log(μ) | μ = exp(Xβ) | Overdispersed count data |
Alternative link functions can also be used. For binary response data, the probit link (the inverse of the standard normal CDF) and the complementary log-log link are common alternatives to the logit link, each carrying slightly different assumptions about the underlying data-generating process.
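As a minimal sketch of how an alternative link is specified, the following uses the statsmodels API introduced later in this article; the simulated data are placeholders, and the capitalized link class names follow recent statsmodels versions:

```python
import numpy as np
import statsmodels.api as sm

# Simulated binary data, for illustration only
rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(200, 1)))
y = rng.binomial(1, 0.5, size=200)

# The same binomial GLM fitted with three different links
for link in (sm.families.links.Logit(),
             sm.families.links.Probit(),
             sm.families.links.CLogLog()):
    fit = sm.GLM(y, X, family=sm.families.Binomial(link=link)).fit()
    print(type(link).__name__, fit.params)
```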
The mathematical foundation of GLMs rests on the exponential family of distributions. A distribution belongs to the exponential family if its probability density (or mass) function can be written in the form:
f(y | θ, φ) = a(y, φ) · exp[(yθ - b(θ)) / φ]
In this formulation:

- θ is the natural (canonical) parameter, a function of the mean μ;
- φ is the dispersion parameter, which scales the variance;
- b(θ) is the cumulant function, whose first derivative gives the mean, E(Y) = b'(θ);
- a(y, φ) is a normalizing term that does not depend on θ.
The variance of Y under this formulation is Var(Y) = φ · V(μ), where V(μ) = b''(θ) is called the variance function. This relationship means that in a GLM, the variance is allowed to depend on the mean, unlike in ordinary linear regression where the variance is assumed constant.
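For concreteness, here is how the Poisson distribution fits this template (a standard derivation). Its probability mass function can be rearranged as:

f(y | μ) = e^(-μ) · μ^y / y! = (1/y!) · exp[y·log(μ) - μ]

Matching terms with the exponential family form gives θ = log(μ), b(θ) = e^θ, a(y, φ) = 1/y!, and φ = 1, so E(Y) = b'(θ) = μ and Var(Y) = b''(θ) = μ. This recovers the familiar Poisson variance function V(μ) = μ.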
GLM parameters are estimated by maximizing the log-likelihood function. For a sample of n independent observations, the log-likelihood is the sum of the individual log-density contributions. Because GLMs generally do not have closed-form solutions for the maximum likelihood estimates (with the exception of the normal/identity case, which reduces to ordinary least squares), iterative numerical methods are required.
The two primary iterative approaches are:

- Newton-Raphson, which updates the estimates using the observed second derivatives (the Hessian) of the log-likelihood; and
- Fisher scoring, which replaces the observed Hessian with its expected value, the Fisher information. As noted above, the two methods coincide when the canonical link is used.
The iteratively reweighted least squares (IRLS) algorithm is the standard method for fitting GLMs. Proposed by Nelder and Wedderburn in their original 1972 paper, IRLS reformulates the Fisher scoring updates as a sequence of weighted least squares problems [1]. At each iteration, the algorithm:

1. computes the linear predictor η = Xβ and the fitted means μ = g⁻¹(η) from the current coefficient estimates;
2. forms the working response z = η + (y - μ) · g'(μ);
3. forms the working weights W = 1 / [V(μ) · g'(μ)²];
4. updates β by solving the weighted least squares problem with response z and weights W, and repeats until the estimates stabilize.
IRLS is numerically stable, converges reliably for well-specified models, and leverages efficient linear algebra routines for the weighted least squares step. McCullagh and Nelder showed that, under standard regularity conditions, the procedure converges to the maximum likelihood estimates [2].
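As a concrete illustration, the following is a minimal numpy sketch of IRLS for the Poisson/log-link case, written directly from the update steps above rather than taken from any library (the simulated data and convergence tolerance are illustrative assumptions):

```python
import numpy as np

def irls_poisson(X, y, tol=1e-8, max_iter=100):
    """Fit a Poisson GLM with the canonical log link by IRLS."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        eta = X @ beta                 # linear predictor
        mu = np.exp(eta)               # inverse link gives the fitted means
        z = eta + (y - mu) / mu        # working response: eta + (y - mu) * g'(mu)
        W = mu                         # weights: 1 / (V(mu) * g'(mu)^2) = mu here
        XtW = X.T * W                  # X'W (broadcasts the weights over rows)
        beta_new = np.linalg.solve(XtW @ X, XtW @ z)
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta

# Quick check on simulated data: estimates should be near (1.0, 0.5)
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(500), rng.normal(size=500)])
y = rng.poisson(np.exp(1.0 + 0.5 * X[:, 1]))
print(irls_poisson(X, y))
```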
Deviance is the GLM analogue of the residual sum of squares in ordinary linear regression. It measures the discrepancy between the fitted model and a saturated model (a model with as many parameters as observations that fits the data perfectly). The deviance is defined as:
D = 2 · [log L(saturated model) - log L(fitted model)]
Two types of deviance are commonly reported:

- Null deviance: the deviance of an intercept-only model, measuring how well the response is predicted with no explanatory variables.
- Residual deviance: the deviance of the fitted model, measuring the lack of fit that remains after the predictors are included.
To compare two nested GLMs (where one model is a special case of the other), the likelihood ratio test computes the difference in deviances between the two models. Under the null hypothesis that the simpler model is adequate, this difference follows an approximate chi-squared distribution with degrees of freedom equal to the number of additional parameters in the larger model [3].
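A minimal sketch of such a test with statsmodels (the simulated data are placeholders, constructed so that x2 truly has no effect; scipy is used only for the chi-squared tail probability):

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

# Simulated data in which x2 has no effect by construction
rng = np.random.default_rng(2)
x1, x2 = rng.normal(size=(2, 300))
y = rng.poisson(np.exp(0.2 + 0.4 * x1))

# Nested Poisson GLMs: the smaller model omits x2
small = sm.GLM(y, sm.add_constant(x1), family=sm.families.Poisson()).fit()
large = sm.GLM(y, sm.add_constant(np.column_stack([x1, x2])),
               family=sm.families.Poisson()).fit()

# The deviance difference is approximately chi-squared under the null
lr_stat = small.deviance - large.deviance
df = int(large.df_model - small.df_model)
p_value = stats.chi2.sf(lr_stat, df)
print(f"LR statistic = {lr_stat:.3f}, df = {df}, p = {p_value:.3f}")
```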
For comparing non-nested models, information criteria provide a useful alternative:

- Akaike information criterion (AIC): -2·log L + 2p, where p is the number of estimated parameters; lower values indicate a better balance of fit and complexity.
- Bayesian information criterion (BIC): -2·log L + p·log(n), which penalizes extra parameters more heavily as the sample size n grows.
A key assumption in many GLMs is that the variance is fully determined by the mean through the variance function V(μ). In practice, the observed variance often exceeds this theoretical expectation, a phenomenon called overdispersion. Overdispersion is particularly common with Poisson and binomial data.
Overdispersion can be detected by comparing the residual deviance to the residual degrees of freedom. If the ratio (residual deviance / degrees of freedom) is substantially greater than 1, overdispersion is likely present.
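A minimal sketch of this check with statsmodels (the counts are deliberately simulated from a negative binomial so that a Poisson fit is overdispersed; the data are assumptions for illustration):

```python
import numpy as np
import statsmodels.api as sm

# Counts drawn from a negative binomial but fitted as Poisson
rng = np.random.default_rng(3)
X = sm.add_constant(rng.normal(size=(300, 1)))
y = rng.negative_binomial(n=2, p=0.2, size=300)

results = sm.GLM(y, X, family=sm.families.Poisson()).fit()
ratio = results.deviance / results.df_resid
print(f"residual deviance / df = {ratio:.2f}")  # well above 1 for these data
```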
Several approaches address overdispersion:
| Approach | Description | When to Use |
|---|---|---|
| Quasi-likelihood | Uses the same mean-variance relationship but introduces a dispersion parameter φ to scale the variance. Standard errors and test statistics are adjusted accordingly, and F-tests replace chi-squared tests. | Mild to moderate overdispersion |
| Negative binomial regression | Adds an extra parameter to model the excess variance directly. The variance is μ + μ²/k, allowing it to exceed the Poisson mean. | Count data with variance much larger than the mean |
| Zero-inflated models | Combines a point mass at zero with a count distribution (Poisson or negative binomial) to handle excess zeros. | Data with more zeros than the count distribution predicts |
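A minimal statsmodels sketch of the first two remedies (the fixed alpha value and the Pearson-based scale option are illustrative choices, not recommendations):

```python
import numpy as np
import statsmodels.api as sm

# Overdispersed counts, as in the detection example above
rng = np.random.default_rng(4)
X = sm.add_constant(rng.normal(size=(300, 1)))
y = rng.negative_binomial(n=2, p=0.2, size=300)

# Quasi-Poisson: same Poisson fit, dispersion estimated from Pearson chi-squared
quasi = sm.GLM(y, X, family=sm.families.Poisson()).fit(scale="X2")
print(quasi.scale)   # estimated dispersion; values well above 1 signal overdispersion

# Negative binomial family with a fixed dispersion parameter alpha
nb = sm.GLM(y, X, family=sm.families.NegativeBinomial(alpha=1.0)).fit()
print(nb.params)
```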
Generalized linear models and neural networks are more closely related than they might first appear. A GLM can be viewed as a single-layer neural network with a specific activation function. Ordinary linear regression corresponds to a network with no activation function (the identity), while logistic regression corresponds to a network with a sigmoid activation. The link function in a GLM plays the same structural role as the activation function in a neural network: both transform the output of a linear combination of inputs [5].
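A tiny numpy illustration of this correspondence (the coefficients and input are made up): the forward pass of a one-unit network with a sigmoid activation is exactly the inverse-logit transform of a GLM linear predictor.

```python
import numpy as np

beta = np.array([0.5, -1.2, 0.8])   # intercept plus two "weights"
x = np.array([1.0, 0.3, 2.0])       # leading 1.0 plays the role of the bias input

eta = x @ beta                      # linear predictor = pre-activation
p = 1 / (1 + np.exp(-eta))          # sigmoid activation = inverse logit link
print(p)                            # predicted probability, as in logistic regression
```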
The key difference is depth and flexibility. GLMs are restricted to a single linear transformation followed by one link function, which limits their ability to capture complex nonlinear relationships. Neural networks stack multiple layers of linear transformations and nonlinear activations, enabling them to approximate arbitrarily complex functions. However, this added flexibility comes at a cost: neural networks require more data, are computationally more expensive to train, and produce models that are far harder to interpret.
In practice, GLMs remain the preferred choice when interpretability matters, when the sample size is moderate, or when the relationship between predictors and response is well approximated by a known link function. Neural networks are better suited to large-scale prediction problems with complex, high-dimensional inputs (such as images or text) where raw predictive accuracy outweighs the need for interpretable coefficients [5].
Recent research has explored hybrid approaches that combine both paradigms. For example, features extracted from deep neural networks can be used as inputs to a GLM, combining the representation power of deep learning with the statistical rigor of the GLM framework [6]. In the insurance industry, such "combined actuarial neural net" models have gained traction for pricing and reserving applications.
GLMs rely on several assumptions that should be checked:

- the observations are independent;
- the response follows the specified exponential family distribution;
- the link function is correctly specified, so the transformed mean is linear in the predictors;
- the variance function V(μ) correctly describes the mean-variance relationship (no unmodeled overdispersion);
- no individual observations exert undue influence on the fitted coefficients.
Common diagnostic tools include plots of deviance residuals against fitted values, leverage and influence measures (such as Cook's distance), and quantile-quantile (Q-Q) plots of deviance residuals. Because the model is fitted by minimizing the negative log-likelihood, the resulting deviance also doubles as an overall measure of model adequacy.
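One way to produce such a residual plot with statsmodels and matplotlib (a sketch on simulated data; the attribute names follow the statsmodels GLM results API):

```python
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

rng = np.random.default_rng(5)
X = sm.add_constant(rng.normal(size=(200, 1)))
y = rng.poisson(np.exp(0.3 + 0.5 * X[:, 1]))
results = sm.GLM(y, X, family=sm.families.Poisson()).fit()

# Deviance residuals against fitted values; look for trends or fanning
plt.scatter(results.fittedvalues, results.resid_deviance, s=10)
plt.axhline(0, color="gray", linewidth=1)
plt.xlabel("Fitted values")
plt.ylabel("Deviance residuals")
plt.show()
```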
GLMs are implemented in all major statistical and machine learning software packages. The following table compares the most common implementations.
| Software | Function / Class | Key Features |
|---|---|---|
| R | glm() (built-in) | Full family support (gaussian, binomial, poisson, Gamma, inverse.gaussian, quasi), detailed summary output with deviance, AIC, coefficient tests. The standard reference implementation. |
| Python (statsmodels) | sm.GLM() | Supports Gaussian, Binomial, Poisson, Gamma, Inverse Gaussian, Negative Binomial, and Tweedie families. Provides detailed summary tables with standard errors, confidence intervals, and deviance statistics [7]. |
| Python (scikit-learn) | PoissonRegressor, GammaRegressor, TweedieRegressor | Oriented toward prediction rather than inference. Includes regularization by default. Added in scikit-learn 0.23 [8]. |
| SAS | PROC GENMOD | Comprehensive GLM procedure with support for repeated measures, GEE estimation, and Type III tests. |
| Stata | glm command | Supports all standard families and links, with robust standard error options. |
Using statsmodels to fit a Poisson regression (the data here are simulated placeholders):

```python
import numpy as np
import statsmodels.api as sm

# Simulated example data; replace with your own predictors and counts
rng = np.random.default_rng(42)
predictor_data = rng.normal(size=(100, 2))
count_data = rng.poisson(np.exp(0.5 + predictor_data @ np.array([0.3, -0.2])))

# Prepare data: add an intercept column
X = sm.add_constant(predictor_data)
y = count_data

# Fit Poisson GLM with the canonical log link
model = sm.GLM(y, X, family=sm.families.Poisson())
results = model.fit()

# View coefficient estimates, deviance, and AIC
print(results.summary())
```
Using R to fit a logistic regression (again with simulated placeholder data):

```r
# Simulated example data; replace with your own data frame
set.seed(42)
my_data <- data.frame(predictor1 = rnorm(100), predictor2 = rnorm(100))
my_data$outcome <- rbinom(100, 1, plogis(my_data$predictor1 - 0.5 * my_data$predictor2))

# Fit logistic regression (binomial GLM with logit link)
model <- glm(outcome ~ predictor1 + predictor2,
             data = my_data,
             family = binomial(link = "logit"))
summary(model)
```
GLMs are used extensively across many disciplines:

- Insurance: Poisson or negative binomial models for claim frequency and gamma models for claim severity, used in pricing and reserving.
- Epidemiology: logistic regression for disease risk and Poisson regression for incidence rates.
- Ecology: models for species counts and presence/absence data.
- Many other fields in which the response is binary, a count, or a positively skewed quantity.