A generalized linear model (GLM) is a flexible extension of ordinary linear regression that allows the response variable to follow any distribution from the exponential family, not just the normal distribution. GLMs unify several common statistical models, including linear regression, logistic regression, and Poisson regression, under a single theoretical framework. The framework was introduced by John Nelder and Robert Wedderburn in a landmark 1972 paper published in the Journal of the Royal Statistical Society [1]. GLMs have since become one of the most important tools in both classical statistics and machine learning, with applications spanning insurance pricing, ecological modeling, epidemiology, and many other fields.
Imagine you have a toy machine that predicts things. Regular linear regression is like a machine that can only predict smooth, continuous numbers (like temperature). But what if you want to predict something different, like whether it will rain (yes or no) or how many birds you will see in the park (a count: 0, 1, 2, 3...)? A generalized linear model is like an upgraded version of that machine. It has a special adapter (the "link function") that lets it handle all sorts of different predictions, not just smooth numbers. You plug in the right adapter depending on what kind of answer you need, and the machine works for many more types of questions.
Before 1972, statisticians treated models like linear regression, logistic regression, probit analysis, and Poisson regression as separate techniques, each with its own estimation procedure and theory. John Nelder and Robert Wedderburn recognized that these models shared a common mathematical structure and published "Generalized Linear Models" in the Journal of the Royal Statistical Society, Series A, Volume 135, pages 370 to 384 [1]. Their paper demonstrated that by specifying three components (a distribution, a linear predictor, and a link function), a wide class of models could be estimated using a single iterative algorithm.
Nelder and Wedderburn also proposed the iteratively reweighted least squares (IRLS) algorithm for maximum likelihood estimation, which remains the default fitting method in most statistical software today. Their work was later expanded by Peter McCullagh and John Nelder in the influential 1989 textbook Generalized Linear Models (second edition), which became a standard reference in the field [2].
Every generalized linear model is defined by three components that work together to specify how the response variable relates to the predictor variables.
The random component specifies the probability distribution of the response variable Y. In a GLM, this distribution must belong to the exponential family, which includes the normal, binomial, Poisson, gamma, and inverse Gaussian distributions (and the negative binomial distribution when its shape parameter is held fixed). The choice of distribution reflects the nature of the data. For example, binary outcomes call for a binomial distribution, count data call for a Poisson distribution, and continuous positive-valued data call for a gamma distribution.
The systematic component is the linear predictor, which combines the explanatory variables into a single value:
η = β₀ + β₁X₁ + β₂X₂ + ... + βₚXₚ
Here, β₀ is the intercept, β₁ through βₚ are regression coefficients, and X₁ through Xₚ are the predictor variables. Despite the name "generalized linear model," the linearity refers to this component: the model is linear in the parameters, even though the overall relationship between predictors and the response can be nonlinear.
The link function g(·) connects the random component to the systematic component by transforming the expected value of the response variable so that it equals the linear predictor:
g(μ) = η, where μ = E(Y)
The link function serves two purposes. First, it maps the expected value of Y onto the entire real line, which is necessary because the linear predictor η can take any value from negative infinity to positive infinity. Second, it ensures that predicted values remain within the valid range for the chosen distribution. For instance, the logit link maps probabilities (which must lie between 0 and 1) onto the real line, while the log link ensures that predicted counts remain positive.
Each exponential family distribution has a canonical link function that arises naturally from the mathematical form of the distribution. Using the canonical link simplifies estimation and guarantees certain desirable statistical properties, such as the equivalence of Newton's method and Fisher scoring.
The following table summarizes the most widely used GLMs, their distributions, link functions (canonical in every case except the negative binomial, for which the log link is the common convention), and typical applications.
| Distribution | Link Function | Link Formula | Mean Function | Typical Use |
|---|---|---|---|---|
| Normal (Gaussian) | Identity | g(μ) = μ | μ = Xβ | Continuous response data, standard linear regression |
| Binomial | Logit | g(μ) = log(μ / (1 - μ)) | μ = exp(Xβ) / (1 + exp(Xβ)) | Binary outcomes, logistic regression |
| Poisson | Log | g(μ) = log(μ) | μ = exp(Xβ) | Count data (event counts, frequencies) |
| Gamma | Inverse | g(μ) = 1/μ | μ = 1/(Xβ) | Positive continuous data (insurance claims, survival times) |
| Inverse Gaussian | Inverse squared | g(μ) = 1/μ² | μ = (Xβ)^(-1/2) | Positive continuous data with heavy right tail |
| Negative Binomial | Log | g(μ) = log(μ) | μ = exp(Xβ) | Overdispersed count data |
Alternative link functions can also be used. For binary response data, the probit link (the inverse of the standard normal CDF) and the complementary log-log link are common alternatives to the logit link, each carrying slightly different assumptions about the underlying data-generating process.
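As a minimal sketch of how an alternative link is specified, the following uses the statsmodels API introduced later in this article; the simulated data are placeholders, and the capitalized link class names follow recent statsmodels versions:

```python
import numpy as np
import statsmodels.api as sm

# Simulated binary data, for illustration only
rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(200, 1)))
y = rng.binomial(1, 0.5, size=200)

# The same binomial GLM fitted with three different links
for link in (sm.families.links.Logit(),
             sm.families.links.Probit(),
             sm.families.links.CLogLog()):
    fit = sm.GLM(y, X, family=sm.families.Binomial(link=link)).fit()
    print(type(link).__name__, fit.params)
```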
The mathematical foundation of GLMs rests on the exponential family of distributions. A distribution belongs to the exponential family if its probability density (or mass) function can be written in the form:
f(y | θ, φ) = a(y, φ) · exp[(yθ - b(θ)) / φ]
In this formulation:

- θ is the natural (canonical) parameter, a function of the mean μ;
- φ is the dispersion parameter, which scales the variance;
- b(θ) is the cumulant function, whose first derivative gives the mean, E(Y) = b'(θ);
- a(y, φ) is a normalizing term that does not depend on θ.
The variance of Y under this formulation is Var(Y) = φ · V(μ), where V(μ) = b''(θ) is called the variance function. This relationship means that in a GLM, the variance is allowed to depend on the mean, unlike in ordinary linear regression where the variance is assumed constant.
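For concreteness, here is how the Poisson distribution fits this template (a standard derivation). Its probability mass function can be rearranged as:

f(y | μ) = e^(-μ) · μ^y / y! = (1/y!) · exp[y·log(μ) - μ]

Matching terms with the exponential family form gives θ = log(μ), b(θ) = e^θ, a(y, φ) = 1/y!, and φ = 1, so E(Y) = b'(θ) = μ and Var(Y) = b''(θ) = μ. This recovers the familiar Poisson variance function V(μ) = μ.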
GLM parameters are estimated by maximizing the log-likelihood function. For a sample of n independent observations, the log-likelihood is the sum of the individual log-density contributions. Because GLMs generally do not have closed-form solutions for the maximum likelihood estimates (with the exception of the normal/identity case, which reduces to ordinary least squares), iterative numerical methods are required.
The two primary iterative approaches are:

- Newton-Raphson, which updates the estimates using the observed second derivatives (the Hessian) of the log-likelihood; and
- Fisher scoring, which replaces the observed Hessian with its expected value, the Fisher information. As noted above, the two methods coincide when the canonical link is used.
The iteratively reweighted least squares (IRLS) algorithm is the standard method for fitting GLMs. Proposed by Nelder and Wedderburn in their original 1972 paper, IRLS reformulates the Fisher scoring updates as a sequence of weighted least squares problems [1]. At each iteration, the algorithm:

1. computes the linear predictor η = Xβ and the fitted means μ = g⁻¹(η) from the current coefficient estimates;
2. forms the working response z = η + (y - μ) · g'(μ);
3. forms the working weights W = 1 / [V(μ) · g'(μ)²];
4. updates β by solving the weighted least squares problem with response z and weights W, and repeats until the estimates stabilize.
IRLS is numerically stable, converges reliably for well-specified models, and leverages efficient linear algebra routines for the weighted least squares step. McCullagh and Nelder showed that, under standard regularity conditions, the procedure converges to the maximum likelihood estimates [2].
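As a concrete illustration, the following is a minimal numpy sketch of IRLS for the Poisson/log-link case, written directly from the update steps above rather than taken from any library (the simulated data and convergence tolerance are illustrative assumptions):

```python
import numpy as np

def irls_poisson(X, y, tol=1e-8, max_iter=100):
    """Fit a Poisson GLM with the canonical log link by IRLS."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        eta = X @ beta                 # linear predictor
        mu = np.exp(eta)               # inverse link gives the fitted means
        z = eta + (y - mu) / mu        # working response: eta + (y - mu) * g'(mu)
        W = mu                         # weights: 1 / (V(mu) * g'(mu)^2) = mu here
        XtW = X.T * W                  # X'W (broadcasts the weights over rows)
        beta_new = np.linalg.solve(XtW @ X, XtW @ z)
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta

# Quick check on simulated data: estimates should be near (1.0, 0.5)
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(500), rng.normal(size=500)])
y = rng.poisson(np.exp(1.0 + 0.5 * X[:, 1]))
print(irls_poisson(X, y))
```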
Deviance is the GLM analogue of the residual sum of squares in ordinary linear regression. It measures the discrepancy between the fitted model and a saturated model (a model with as many parameters as observations that fits the data perfectly). The deviance is defined as:
D = 2 · [log L(saturated model) - log L(fitted model)]
Two types of deviance are commonly reported:

- Null deviance: the deviance of an intercept-only model, measuring how well the response is predicted with no explanatory variables.
- Residual deviance: the deviance of the fitted model, measuring the lack of fit that remains after the predictors are included.
To compare two nested GLMs (where one model is a special case of the other), the likelihood ratio test computes the difference in deviances between the two models. Under the null hypothesis that the simpler model is adequate, this difference follows an approximate chi-squared distribution with degrees of freedom equal to the number of additional parameters in the larger model [3].
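A minimal sketch of such a test with statsmodels (the simulated data are placeholders, constructed so that x2 truly has no effect; scipy is used only for the chi-squared tail probability):

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

# Simulated data in which x2 has no effect by construction
rng = np.random.default_rng(2)
x1, x2 = rng.normal(size=(2, 300))
y = rng.poisson(np.exp(0.2 + 0.4 * x1))

# Nested Poisson GLMs: the smaller model omits x2
small = sm.GLM(y, sm.add_constant(x1), family=sm.families.Poisson()).fit()
large = sm.GLM(y, sm.add_constant(np.column_stack([x1, x2])),
               family=sm.families.Poisson()).fit()

# The deviance difference is approximately chi-squared under the null
lr_stat = small.deviance - large.deviance
df = int(large.df_model - small.df_model)
p_value = stats.chi2.sf(lr_stat, df)
print(f"LR statistic = {lr_stat:.3f}, df = {df}, p = {p_value:.3f}")
```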
For comparing non-nested models, information criteria provide a useful alternative:

- Akaike information criterion (AIC): -2·log L + 2p, where p is the number of estimated parameters; lower values indicate a better balance of fit and complexity.
- Bayesian information criterion (BIC): -2·log L + p·log(n), which penalizes extra parameters more heavily as the sample size n grows.
A key assumption in many GLMs is that the variance is fully determined by the mean through the variance function V(μ). In practice, the observed variance often exceeds this theoretical expectation, a phenomenon called overdispersion. Overdispersion is particularly common with Poisson and binomial data.
Overdispersion can be detected by comparing the residual deviance to the residual degrees of freedom. If the ratio (residual deviance / degrees of freedom) is substantially greater than 1, overdispersion is likely present.
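A minimal sketch of this check with statsmodels (the counts are deliberately simulated from a negative binomial so that a Poisson fit is overdispersed; the data are assumptions for illustration):

```python
import numpy as np
import statsmodels.api as sm

# Counts drawn from a negative binomial but fitted as Poisson
rng = np.random.default_rng(3)
X = sm.add_constant(rng.normal(size=(300, 1)))
y = rng.negative_binomial(n=2, p=0.2, size=300)

results = sm.GLM(y, X, family=sm.families.Poisson()).fit()
ratio = results.deviance / results.df_resid
print(f"residual deviance / df = {ratio:.2f}")  # well above 1 for these data
```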
Several approaches address overdispersion:
| Approach | Description | When to Use |
|---|---|---|
| Quasi-likelihood | Uses the same mean-variance relationship but introduces a dispersion parameter φ to scale the variance. Standard errors and test statistics are adjusted accordingly, and F-tests replace chi-squared tests. | Mild to moderate overdispersion |
| Negative binomial regression | Adds an extra parameter to model the excess variance directly. The variance is μ + μ²/k, allowing it to exceed the Poisson mean. | Count data with variance much larger than the mean |
| Zero-inflated models | Combines a point mass at zero with a count distribution (Poisson or negative binomial) to handle excess zeros. | Data with more zeros than the count distribution predicts |
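A minimal statsmodels sketch of the first two remedies (the fixed alpha value and the Pearson-based scale option are illustrative choices, not recommendations):

```python
import numpy as np
import statsmodels.api as sm

# Overdispersed counts, as in the detection example above
rng = np.random.default_rng(4)
X = sm.add_constant(rng.normal(size=(300, 1)))
y = rng.negative_binomial(n=2, p=0.2, size=300)

# Quasi-Poisson: same Poisson fit, dispersion estimated from Pearson chi-squared
quasi = sm.GLM(y, X, family=sm.families.Poisson()).fit(scale="X2")
print(quasi.scale)   # estimated dispersion; values well above 1 signal overdispersion

# Negative binomial family with a fixed dispersion parameter alpha
nb = sm.GLM(y, X, family=sm.families.NegativeBinomial(alpha=1.0)).fit()
print(nb.params)
```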
Generalized linear models and neural networks are more closely related than they might first appear. A GLM can be viewed as a single-layer neural network with a specific activation function. Ordinary linear regression corresponds to a network with no activation function (the identity), while logistic regression corresponds to a network with a sigmoid activation. The link function in a GLM plays the same structural role as the activation function in a neural network: both transform the output of a linear combination of inputs [5].
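A tiny numpy illustration of this correspondence (the coefficients and input are made up): the forward pass of a one-unit network with a sigmoid activation is exactly the inverse-logit transform of a GLM linear predictor.

```python
import numpy as np

beta = np.array([0.5, -1.2, 0.8])   # intercept plus two "weights"
x = np.array([1.0, 0.3, 2.0])       # leading 1.0 plays the role of the bias input

eta = x @ beta                      # linear predictor = pre-activation
p = 1 / (1 + np.exp(-eta))          # sigmoid activation = inverse logit link
print(p)                            # predicted probability, as in logistic regression
```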
The key difference is depth and flexibility. GLMs are restricted to a single linear transformation followed by one link function, which limits their ability to capture complex nonlinear relationships. Neural networks stack multiple layers of linear transformations and nonlinear activations, enabling them to approximate arbitrarily complex functions. However, this added flexibility comes at a cost: neural networks require more data, are computationally more expensive to train, and produce models that are far harder to interpret.
In practice, GLMs remain the preferred choice when interpretability matters, when the sample size is moderate, or when the relationship between predictors and response is well approximated by a known link function. Neural networks are better suited to large-scale prediction problems with complex, high-dimensional inputs (such as images or text) where raw predictive accuracy outweighs the need for interpretable coefficients [5].
Recent research has explored hybrid approaches that combine both paradigms. For example, features extracted from deep neural networks can be used as inputs to a GLM, combining the representation power of deep learning with the statistical rigor of the GLM framework [6]. In the insurance industry, such "combined actuarial neural net" models have gained traction for pricing and reserving applications.
GLMs rely on several assumptions that should be checked:

- the observations are independent;
- the response follows the specified exponential family distribution;
- the link function is correctly specified, so the transformed mean is linear in the predictors;
- the variance function V(μ) correctly describes the mean-variance relationship (no unmodeled overdispersion);
- no individual observations exert undue influence on the fitted coefficients.
Common diagnostic tools include plots of deviance residuals against fitted values, leverage and influence measures (such as Cook's distance), and quantile-quantile (Q-Q) plots of deviance residuals. Because the model is fitted by minimizing the negative log-likelihood, the resulting deviance also doubles as an overall measure of model adequacy.
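One way to produce such a residual plot with statsmodels and matplotlib (a sketch on simulated data; the attribute names follow the statsmodels GLM results API):

```python
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

rng = np.random.default_rng(5)
X = sm.add_constant(rng.normal(size=(200, 1)))
y = rng.poisson(np.exp(0.3 + 0.5 * X[:, 1]))
results = sm.GLM(y, X, family=sm.families.Poisson()).fit()

# Deviance residuals against fitted values; look for trends or fanning
plt.scatter(results.fittedvalues, results.resid_deviance, s=10)
plt.axhline(0, color="gray", linewidth=1)
plt.xlabel("Fitted values")
plt.ylabel("Deviance residuals")
plt.show()
```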
GLMs are implemented in all major statistical and machine learning software packages. The following table compares the most common implementations.
| Software | Function / Class | Key Features |
|---|---|---|
| R | glm() (built-in) | Full family support (gaussian, binomial, poisson, Gamma, inverse.gaussian, quasi), detailed summary output with deviance, AIC, coefficient tests. The standard reference implementation. |
| Python (statsmodels) | sm.GLM() | Supports Gaussian, Binomial, Poisson, Gamma, Inverse Gaussian, Negative Binomial, and Tweedie families. Provides detailed summary tables with standard errors, confidence intervals, and deviance statistics [7]. |
| Python (scikit-learn) | PoissonRegressor, GammaRegressor, TweedieRegressor | Oriented toward prediction rather than inference. Includes regularization by default. Added in scikit-learn 0.23 [8]. |
| SAS | PROC GENMOD | Comprehensive GLM procedure with support for repeated measures, GEE estimation, and Type III tests. |
| Stata | glm command | Supports all standard families and links, with robust standard error options. |
Using statsmodels to fit a Poisson regression (the data here are simulated placeholders):

```python
import numpy as np
import statsmodels.api as sm

# Simulated example data; replace with your own predictors and counts
rng = np.random.default_rng(42)
predictor_data = rng.normal(size=(100, 2))
count_data = rng.poisson(np.exp(0.5 + predictor_data @ np.array([0.3, -0.2])))

# Prepare data: add an intercept column
X = sm.add_constant(predictor_data)
y = count_data

# Fit Poisson GLM with the canonical log link
model = sm.GLM(y, X, family=sm.families.Poisson())
results = model.fit()

# View coefficient estimates, deviance, and AIC
print(results.summary())
```
Using R to fit a logistic regression (again with simulated placeholder data):

```r
# Simulated example data; replace with your own data frame
set.seed(42)
my_data <- data.frame(predictor1 = rnorm(100), predictor2 = rnorm(100))
my_data$outcome <- rbinom(100, 1, plogis(my_data$predictor1 - 0.5 * my_data$predictor2))

# Fit logistic regression (binomial GLM with logit link)
model <- glm(outcome ~ predictor1 + predictor2,
             data = my_data,
             family = binomial(link = "logit"))
summary(model)
```
GLMs are used extensively across many disciplines:

- Insurance: Poisson or negative binomial models for claim frequency and gamma models for claim severity, used in pricing and reserving.
- Epidemiology: logistic regression for disease risk and Poisson regression for incidence rates.
- Ecology: models for species counts and presence/absence data.
- Many other fields in which the response is binary, a count, or a positively skewed quantity.