# Generalized Linear Model

> Source: https://aiwiki.ai/wiki/generalized_linear_model
> Updated: 2026-06-24
> Categories: Machine Learning, Statistics
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

A **generalized linear model** (GLM) is a flexible extension of ordinary [linear regression](/wiki/linear_regression) that allows the response variable to follow any distribution from the exponential family, not just the normal distribution, by connecting the mean of the response to a linear predictor through a link function. GLMs unify several common statistical models, including [linear regression](/wiki/linear_regression), [logistic regression](/wiki/logistic_regression), and Poisson regression, under a single theoretical framework, and they are estimated by maximum likelihood using iteratively reweighted least squares (IRLS). The framework was introduced by John Nelder and Robert Wedderburn in a landmark paper published in the *Journal of the Royal Statistical Society, Series A (General)*, Volume 135, Issue 3, pages 370 to 384, in May 1972 [1]. GLMs have since become one of the most important tools in both classical statistics and [machine learning](/wiki/machine_learning), with applications spanning insurance pricing, ecological modeling, epidemiology, and many other fields.

*This article describes the statistical model. For the family of large language models from Zhipu AI that share the abbreviation "GLM," see [GLM-4.6](/wiki/glm_4_6).*

## Explain Like I'm 5 (ELI5)

Imagine you have a toy machine that predicts things. Regular linear regression is like a machine that can only predict smooth, continuous numbers (like temperature). But what if you want to predict something different, like whether it will rain (yes or no) or how many birds you will see in the park (a count: 0, 1, 2, 3...)? A generalized linear model is like an upgraded version of that machine. It has a special adapter (the "link function") that lets it handle all sorts of different predictions, not just smooth numbers. You plug in the right adapter depending on what kind of answer you need, and the machine works for many more types of questions.

## When was the generalized linear model introduced?

Before 1972, statisticians treated models like linear regression, logistic regression, probit analysis, and Poisson regression as separate techniques, each with its own estimation procedure and theory. John Nelder and Robert Wedderburn recognized that these models shared a common mathematical structure and published "Generalized Linear Models" in the *Journal of the Royal Statistical Society, Series A (General)*, Volume 135, Issue 3, pages 370 to 384, in May 1972 [1]. Their paper demonstrated that by specifying three components (a distribution, a linear predictor, and a link function), a wide class of models could be estimated using a single iterative algorithm. The original paper illustrated the framework with four distributions: the normal, the binomial (covering probit analysis), the Poisson (covering contingency tables), and the gamma [1].

The paper's opening sentence stated the central result directly: "The technique of iterative weighted linear regression can be used to obtain maximum likelihood estimates of the parameters with observations distributed according to some exponential family and systematic effects that can be made linear by a suitable transformation." [1] This iterative weighted regression procedure became known as iteratively reweighted least squares (IRLS), and it remains the default fitting method in most statistical software today.

Nelder and Wedderburn's work was later expanded by Peter McCullagh and John Nelder in the influential textbook *Generalized Linear Models*, whose second edition appeared in 1989 (Chapman and Hall, London; ISBN 0412317605) as part of the Chapman & Hall/CRC Monographs on Statistics and Applied Probability series. That book became the standard reference in the field and added topics such as conditional and marginal likelihood methods, estimating equations, and models for dispersion effects [2].

## What are the three components of a GLM?

Every generalized linear model is defined by three components that work together to specify how the response variable relates to the predictor variables.

### Random Component

The random component specifies the probability distribution of the response variable Y. In a GLM, this distribution must belong to the **exponential family**, which includes the normal, binomial, Poisson, gamma, and inverse Gaussian distributions, as well as the negative binomial distribution when its shape parameter is held fixed. The choice of distribution reflects the nature of the data: binary outcomes call for a binomial distribution, count data for a Poisson distribution, and continuous positive-valued data for a gamma distribution.

### Systematic Component

The systematic component is the linear predictor, which combines the explanatory variables into a single value:

**η = β₀ + β₁X₁ + β₂X₂ + ... + βₚXₚ**

Here, β₀ is the intercept, β₁ through βₚ are regression coefficients, and X₁ through Xₚ are the predictor variables. The word "linear" in "generalized linear model" refers to this component: the model is linear in the parameters, even though the relationship between the predictors and the mean of the response can be nonlinear.

### Link Function

The link function **g(·)** connects the random component to the systematic component by transforming the expected value of the response variable so that it equals the linear predictor:

**g(μ) = η**, where **μ = E(Y)**

Equivalently, the mean is recovered through the inverse link, μ = g⁻¹(η) [7]. The link function serves two related purposes. First, it maps the admissible range of the mean onto the real line, which matters because the linear predictor η can take any value from negative infinity to positive infinity. Second, a suitably chosen link keeps predicted means within the valid range for the chosen distribution. For instance, the logit link maps probabilities (which must lie between 0 and 1) onto the real line, while the log link ensures that predicted counts remain positive.
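
As a minimal sketch of how a link and its inverse work together, the following Python snippet computes a linear predictor and converts it to a mean with the inverse logit and inverse log links. The coefficient values and the single observation are made up for illustration, and the helper names (`logit`, `inv_logit`, `log_link`, `inv_log`) are not taken from any particular library.

```python
import math

def logit(mu):
    """Logit link: maps a probability in (0, 1) to the real line."""
    return math.log(mu / (1.0 - mu))

def inv_logit(eta):
    """Inverse logit: maps any real eta back to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

def log_link(mu):
    """Log link: maps a positive mean to the real line."""
    return math.log(mu)

def inv_log(eta):
    """Inverse log link: always returns a positive mean."""
    return math.exp(eta)

# Hypothetical coefficients and one observation, for illustration only.
beta = [-1.0, 0.8, 0.5]          # intercept, beta_1, beta_2
x = [1.0, 2.0, -1.0]             # leading 1.0 multiplies the intercept
eta = sum(b * xi for b, xi in zip(beta, x))   # linear predictor

p = inv_logit(eta)     # mean of a binary response, strictly between 0 and 1
rate = inv_log(eta)    # mean of a count response, strictly positive
```

Applying the link to either result recovers the same η, which is the sense in which g(μ) = η and μ = g⁻¹(η) are equivalent statements.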

Each exponential family distribution has a **canonical link function** that arises naturally from the mathematical form of the distribution. The canonical link is the function that maps the mean μ to the natural (canonical) parameter θ of the exponential family, so that when it is used the canonical parameter equals the linear predictor (θ = η) [3]. The canonical links for the three most common cases are the identity link for the normal distribution, the logit link for the binomial (Bernoulli) distribution, and the log link for the Poisson distribution [3]. Using the canonical link simplifies estimation and guarantees certain desirable statistical properties, such as the equivalence of Newton's method and Fisher scoring.

## How do common GLM types differ?

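The following table summarizes the most widely used GLMs, their distributions, customary link functions, and typical applications. The links shown are the canonical ones except for the negative binomial, where the log link is the usual practical choice rather than the canonical link.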

| Distribution | Link Function | Link Formula | Mean Function | Typical Use |
|---|---|---|---|---|
| Normal (Gaussian) | Identity | g(μ) = μ | μ = Xβ | Continuous response data, standard [linear regression](/wiki/linear_regression) |
| Binomial | Logit | g(μ) = log(μ / (1 - μ)) | μ = exp(Xβ) / (1 + exp(Xβ)) | Binary outcomes, [logistic regression](/wiki/logistic_regression) |
| Poisson | Log | g(μ) = log(μ) | μ = exp(Xβ) | Count data (event counts, frequencies) |
| Gamma | Inverse | g(μ) = 1/μ | μ = 1/(Xβ) | Positive continuous data (insurance claims, survival times) |
| Inverse Gaussian | Inverse squared | g(μ) = 1/μ² | μ = (Xβ)^(-1/2) | Positive continuous data with heavy right tail |
| Negative Binomial | Log | g(μ) = log(μ) | μ = exp(Xβ) | Overdispersed count data |

All three of the most familiar regression methods are therefore special cases of one framework: ordinary least squares [regression](/wiki/regression) is a GLM with the identity link and a normal distribution, [logistic regression](/wiki/logistic_regression) is a GLM with the logit link and a binomial distribution, and Poisson regression is a GLM with the log link and a Poisson distribution [1][3].

Alternative link functions can also be used. For binary response data, the **probit** link (the inverse of the standard normal CDF) and the **complementary log-log** link are common alternatives to the logit link, each carrying slightly different assumptions about the underlying data-generating process.
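
To make the difference between these binary-response links concrete, the sketch below evaluates the three inverse links at a few values of the linear predictor. It uses only the Python standard library; the function names are illustrative.

```python
import math

def inv_logit(eta):
    """Inverse of the logit link."""
    return 1.0 / (1.0 + math.exp(-eta))

def inv_probit(eta):
    """Inverse of the probit link: the standard normal CDF, via the error function."""
    return 0.5 * (1.0 + math.erf(eta / math.sqrt(2.0)))

def inv_cloglog(eta):
    """Inverse of the complementary log-log link g(mu) = log(-log(1 - mu))."""
    return 1.0 - math.exp(-math.exp(eta))

for eta in (-2.0, 0.0, 2.0):
    print(f"eta={eta:+.1f}  logit={inv_logit(eta):.3f}  "
          f"probit={inv_probit(eta):.3f}  cloglog={inv_cloglog(eta):.3f}")
```

The logit and probit curves are symmetric about a probability of 0.5, whereas the complementary log-log curve is not: at η = 0 it gives 1 − e⁻¹ ≈ 0.632 rather than 0.5.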

## The Exponential Family

The mathematical foundation of GLMs rests on the exponential family of distributions. A distribution belongs to the exponential family if its probability density (or mass) function can be written in the form:

**f(y | θ, φ) = a(y, φ) · exp[(yθ - b(θ)) / φ]**

In this formulation:

- **θ** is the natural (canonical) parameter, which is related to the mean of the distribution.
- **φ** is the dispersion parameter, which controls the variance.
- **b(θ)** is the cumulant function; its first derivative gives the mean (b'(θ) = μ), and its second derivative gives the variance function (b''(θ) = V(μ)).
- **a(y, φ)** is a normalizing function that depends on the data and the dispersion parameter.

The variance of Y under this formulation is **Var(Y) = φ · V(μ)**, where V(μ) = b''(θ) is called the variance function. This relationship means that in a GLM, the variance is allowed to depend on the mean, unlike in ordinary linear regression where the variance is assumed constant. In fact, the statsmodels GLM documentation notes that "a GLM is determined by link function g and variance function v(μ) alone" together with the predictors [7].
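
The identities b'(θ) = μ and b''(θ) = V(μ) can be checked numerically. The sketch below does so for the Poisson and Bernoulli cases with φ = 1, using finite differences instead of symbolic calculus. The step size and the value of θ are arbitrary choices made for illustration.

```python
import math

def derivative(f, t, h=1e-4):
    """Central finite-difference approximation to f'(t)."""
    return (f(t + h) - f(t - h)) / (2.0 * h)

# Cumulant functions b(theta), taking phi = 1.
def b_poisson(theta):
    return math.exp(theta)                    # Poisson: theta = log(mu)

def b_bernoulli(theta):
    return math.log(1.0 + math.exp(theta))    # Bernoulli: theta = logit(mu)

theta = 0.4

# Poisson: the mean and the variance function are both exp(theta) = mu.
mu_pois = derivative(b_poisson, theta)
var_pois = derivative(lambda t: derivative(b_poisson, t), theta)

# Bernoulli: the mean is the sigmoid of theta, the variance function is mu * (1 - mu).
mu_bern = derivative(b_bernoulli, theta)
var_bern = derivative(lambda t: derivative(b_bernoulli, t), theta)
```

Up to finite-difference error, `var_pois` matches `mu_pois` (variance equal to the mean) and `var_bern` matches `mu_bern * (1 - mu_bern)`.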

## How are GLM parameters estimated?

### Maximum Likelihood Estimation

GLM parameters are estimated by maximizing the log-likelihood function. For a sample of n independent observations, the log-likelihood is the sum of the individual log-density contributions. Because GLMs generally do not have closed-form solutions for the maximum likelihood estimates (with the exception of the normal/identity case, which reduces to ordinary least squares), iterative numerical methods are required. See [maximum likelihood estimation](/wiki/maximum_likelihood_estimation) for the general principle.

The two primary iterative approaches are:

- **Newton's method (Newton-Raphson):** Updates the parameter vector using the observed information matrix (the negative Hessian of the log-likelihood).
- **Fisher scoring:** Replaces the observed information matrix with its expected value (the Fisher information matrix). When the canonical link function is used, Newton's method and Fisher scoring produce identical updates [2].

### IRLS Algorithm

The iteratively reweighted least squares (IRLS) algorithm is the standard method for fitting GLMs. Proposed by Nelder and Wedderburn in their original 1972 paper, IRLS reformulates the Fisher scoring updates as a sequence of weighted least squares problems [1]. At each iteration, the algorithm:

1. Computes the current estimate of the mean μ from the current parameter estimates β.
2. Calculates a **working response** (also called the adjusted dependent variable), which linearizes the link function around the current estimates.
3. Calculates **working weights** based on the variance function and the derivative of the link function.
4. Solves a weighted least squares problem using the working response and working weights to obtain updated parameter estimates.
5. Repeats steps 1 through 4 until the parameter estimates converge (that is, until successive estimates change by less than a specified tolerance).

IRLS is numerically stable, converges reliably for well-specified models, and leverages efficient linear algebra routines for the weighted least squares step. McCullagh and Nelder proved that this procedure converges to the maximum likelihood estimates under standard regularity conditions [2].
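
The five steps above can be sketched in a few lines for the simplest non-trivial case: a Poisson response, a log link, and a single predictor. This is an illustrative sketch, not the implementation used by any statistical package. It assumes a starting value taken from the intercept-only fit, solves the 2×2 weighted normal equations directly, and uses a small made-up data set.

```python
import math

def irls_poisson(x, y, tol=1e-10, max_iter=50):
    """Fit log(mu) = b0 + b1 * x for Poisson counts by IRLS."""
    b0, b1 = math.log(sum(y) / len(y)), 0.0    # start from the intercept-only fit
    for _ in range(max_iter):
        # Step 1: current linear predictor and mean.
        eta = [b0 + b1 * xi for xi in x]
        mu = [math.exp(e) for e in eta]
        # Step 2: working response z = eta + (y - mu) * d(eta)/d(mu); here d(eta)/d(mu) = 1/mu.
        z = [e + (yi - m) / m for e, yi, m in zip(eta, y, mu)]
        # Step 3: working weights 1 / (V(mu) * (d(eta)/d(mu))**2), which equals mu here.
        w = mu
        # Step 4: weighted least squares of z on (1, x) via the 2x2 normal equations.
        sw = sum(w)
        swx = sum(wi * xi for wi, xi in zip(w, x))
        swxx = sum(wi * xi * xi for wi, xi in zip(w, x))
        swz = sum(wi * zi for wi, zi in zip(w, z))
        swxz = sum(wi * xi * zi for wi, xi, zi in zip(w, x, z))
        det = sw * swxx - swx * swx
        new_b0 = (swxx * swz - swx * swxz) / det
        new_b1 = (sw * swxz - swx * swz) / det
        # Step 5: stop once the estimates no longer change.
        converged = max(abs(new_b0 - b0), abs(new_b1 - b1)) < tol
        b0, b1 = new_b0, new_b1
        if converged:
            break
    return b0, b1

# Small made-up data set, for illustration only.
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [1, 2, 2, 5, 7, 12]
b0, b1 = irls_poisson(x, y)
fitted = [math.exp(b0 + b1 * xi) for xi in x]
```

Because the log link is canonical for the Poisson distribution, the converged fit satisfies the likelihood equations Σ(yᵢ − μᵢ) = 0 and Σxᵢ(yᵢ − μᵢ) = 0, which gives a simple check that the loop has reached the maximum likelihood estimates.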

## Deviance and Model Comparison

### Deviance

Deviance is the GLM analogue of the residual sum of squares in ordinary linear regression. It measures the discrepancy between the fitted model and a saturated model (a model with as many parameters as observations that fits the data perfectly). The deviance is defined as:

**D = 2 · [log L(saturated model) - log L(fitted model)]**

In the Gaussian (normal) case, the deviance reduces exactly to the residual sum of squares of ordinary linear regression [7]. Two types of deviance are commonly reported:

- **Null deviance:** The deviance of a model containing only the intercept. It measures the total variability in the response that remains when every observation is predicted by the overall mean alone.
- **Residual deviance:** The deviance of the fitted model. A large drop from null deviance to residual deviance indicates that the predictors improve the fit substantially.
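
For Poisson data the deviance has the closed form D = 2 · Σ[y · log(y/μ) − (y − μ)], which the sketch below uses to compare the null, residual, and saturated cases. The counts and the fitted means are made-up values chosen for illustration, not the output of a real fit.

```python
import math

def poisson_deviance(y, mu):
    """2 * [log L(saturated) - log L(fitted)] for Poisson counts."""
    total = 0.0
    for yi, mi in zip(y, mu):
        term = yi * math.log(yi / mi) if yi > 0 else 0.0   # y*log(y/mu) is 0 when y = 0
        total += term - (yi - mi)
    return 2.0 * total

y = [1, 2, 2, 5, 7, 12]                 # made-up counts

# Null model: every fitted mean equals the sample mean.
mean_y = sum(y) / len(y)
null_dev = poisson_deviance(y, [mean_y] * len(y))

# Hypothetical fitted means from a model with predictors (illustrative values only).
mu_hat = [1.1, 1.8, 2.9, 4.7, 7.6, 12.3]
resid_dev = poisson_deviance(y, mu_hat)

# The saturated model sets mu = y and has zero deviance by construction.
sat_dev = poisson_deviance(y, y)
```

With these numbers the residual deviance is far smaller than the null deviance, which is the pattern expected when the predictors explain most of the variation in the counts.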

### Likelihood Ratio Test

To compare two nested GLMs (where one model is a special case of the other), the likelihood ratio test computes the difference in deviances between the two models. Under the null hypothesis that the simpler model is adequate, this difference follows an approximate chi-squared distribution with degrees of freedom equal to the number of additional parameters in the larger model [3].
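
A minimal sketch of the test for the special case in which the larger model adds exactly one parameter, so that the reference distribution is chi-squared with 1 degree of freedom and its tail probability can be written with the complementary error function. It assumes a family with dispersion fixed at 1 (such as Poisson or binomial), and the two deviances are hypothetical numbers.

```python
import math

def lrt_pvalue_1df(deviance_small, deviance_large):
    """Likelihood ratio test when the larger model adds ONE parameter.

    Returns the test statistic (drop in deviance) and its p-value.
    Assumes deviance_small >= deviance_large, as holds for nested fits.
    """
    stat = deviance_small - deviance_large
    # For a chi-squared variable X with 1 degree of freedom,
    # P(X > stat) = erfc(sqrt(stat / 2)).
    return stat, math.erfc(math.sqrt(stat / 2.0))

# Hypothetical deviances from two nested fits (illustrative numbers only).
stat, p_value = lrt_pvalue_1df(deviance_small=17.14, deviance_large=13.30)
```

Here the drop in deviance is 3.84, which sits almost exactly at the conventional 5% critical value for 1 degree of freedom. For more than one added parameter a general chi-squared tail function from a statistics library would be needed instead.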

### Information Criteria

For comparing non-nested models, information criteria provide a useful alternative:

- **AIC (Akaike Information Criterion):** AIC = −2 · log L + 2k, where L is the maximized likelihood and k is the number of estimated parameters. For models that share the same saturated model, this equals Deviance + 2k up to an additive constant. Lower AIC values indicate a better trade-off between goodness of fit and model complexity [4].
- **BIC (Bayesian Information Criterion):** BIC = −2 · log L + k · log(n), where n is the sample size. BIC penalizes model complexity more heavily than AIC whenever log(n) > 2, that is, for any sample of 8 or more observations.
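
The sketch below computes both criteria from a maximized log-likelihood, using the standard definitions AIC = −2 · log L + 2k and BIC = −2 · log L + k · log(n). These differ from a deviance-based form only by a constant when the candidate models share the same saturated model. The log-likelihood values, parameter counts, and sample size are hypothetical.

```python
import math

def aic(log_lik, k):
    """Akaike Information Criterion from a maximized log-likelihood."""
    return -2.0 * log_lik + 2.0 * k

def bic(log_lik, k, n):
    """Bayesian Information Criterion from a maximized log-likelihood."""
    return -2.0 * log_lik + k * math.log(n)

# Hypothetical fits of two candidate models to the same n = 200 observations.
n = 200
models = {"small": (-412.7, 3), "large": (-409.9, 6)}   # name: (log-likelihood, k)

for name, (log_lik, k) in models.items():
    print(name, round(aic(log_lik, k), 1), round(bic(log_lik, k, n), 1))
```

In this made-up comparison the two AIC values are nearly tied, while BIC favors the smaller model by a wide margin, reflecting its heavier per-parameter penalty of log(200) ≈ 5.3 versus 2.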

## What is overdispersion in a GLM?

A key assumption in many GLMs is that the variance is fully determined by the mean through the variance function V(μ). In practice, the observed variance often exceeds this theoretical expectation, a phenomenon called **overdispersion**. Overdispersion is particularly common with Poisson and binomial data.

Overdispersion can be detected by comparing the residual deviance to the residual degrees of freedom. If the ratio (residual deviance / degrees of freedom) is substantially greater than 1, overdispersion is likely present.

Several approaches address overdispersion:

| Approach | Description | When to Use |
|---|---|---|
| Quasi-likelihood | Uses the same mean-variance relationship but introduces a dispersion parameter φ to scale the variance. Standard errors and test statistics are adjusted accordingly, and F-tests replace chi-squared tests. | Mild to moderate overdispersion |
| Negative binomial regression | Adds an extra parameter to model the excess variance directly. The variance is μ + μ²/k, allowing it to exceed the Poisson mean. | Count data with variance much larger than the mean |
| Zero-inflated models | Combines a point mass at zero with a count distribution (Poisson or negative binomial) to handle excess zeros. | Data with more zeros than the count distribution predicts |
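
A minimal sketch of the dispersion check and of how the first two remedies change the assumed variance. All numbers are hypothetical. The deviance-based ratio is used here as a rough estimate of φ; statistical software often bases the estimate on the Pearson statistic instead.

```python
import math

def dispersion_ratio(residual_deviance, n_obs, n_params):
    """Residual deviance divided by the residual degrees of freedom."""
    return residual_deviance / (n_obs - n_params)

# Hypothetical Poisson fit: 100 observations, 4 coefficients, residual deviance 288.
phi_hat = dispersion_ratio(288.0, n_obs=100, n_params=4)    # 288 / 96 = 3.0

# Quasi-likelihood: the variance becomes phi * mu, so standard errors grow by sqrt(phi).
poisson_se = 0.050                                # hypothetical Poisson standard error
quasi_se = poisson_se * math.sqrt(phi_hat)

# Variance assumed at a given mean under each model (k is the negative binomial shape).
def var_poisson(mu):
    return mu

def var_quasi_poisson(mu, phi):
    return phi * mu

def var_negative_binomial(mu, k):
    return mu + mu ** 2 / k
```

A ratio of 3 is well above 1, so the Poisson standard errors in this example would be too small by a factor of about 1.7. The two variance functions also differ in shape: the quasi-Poisson variance grows linearly with the mean, while the negative binomial variance grows quadratically.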

## How do GLMs relate to neural networks?

Generalized linear models and [neural networks](/wiki/neural_network) are more closely related than they might first appear. A GLM can be viewed as a single-layer neural network with a specific activation function. Ordinary [linear regression](/wiki/linear_regression) corresponds to a network with no activation function (the identity), while [logistic regression](/wiki/logistic_regression) corresponds to a network with a sigmoid activation. The inverse link function in a GLM plays the same structural role as the activation function in a neural network: both map a linear combination of the inputs to the predicted output [5].
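
The correspondence can be illustrated with a single hand-written unit. In the sketch below, a neuron with a sigmoid activation and the mean function of a logistic regression (the inverse of the logit link) return the same number when given the same weights. The inputs and coefficients are hypothetical.

```python
import math

def sigmoid(a):
    """Sigmoid activation, which is also the inverse of the logit link."""
    return 1.0 / (1.0 + math.exp(-a))

def neuron(x, weights, bias, activation):
    """One unit: an activation applied to a weighted sum of the inputs."""
    return activation(bias + sum(w * xi for w, xi in zip(weights, x)))

def logistic_glm_mean(x, beta0, beta):
    """Logistic-regression mean: mu = exp(eta) / (1 + exp(eta))."""
    eta = beta0 + sum(b * xi for b, xi in zip(beta, x))
    return math.exp(eta) / (1.0 + math.exp(eta))

x = [0.5, -1.2]                  # hypothetical inputs
coef, intercept = [0.9, 0.3], -0.2

nn_out = neuron(x, coef, intercept, sigmoid)             # sigmoid unit
glm_out = logistic_glm_mean(x, intercept, coef)          # same value as nn_out
linear_out = neuron(x, coef, intercept, lambda a: a)     # identity unit: linear regression
```

The weights and bias of the unit are the GLM's coefficients and intercept. What differs in practice is how they are fitted and interpreted, not the function being computed.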

The key difference is depth and flexibility. GLMs are restricted to a single linear transformation followed by one link function, which limits their ability to capture complex nonlinear relationships. Neural networks stack multiple layers of linear transformations and nonlinear activations, enabling them to approximate arbitrarily complex functions. However, this added flexibility comes at a cost: neural networks require more data, are computationally more expensive to train, and produce models that are far harder to interpret.

In practice, GLMs remain the preferred choice when interpretability matters, when the sample size is moderate, or when the relationship between predictors and response is well approximated by a known link function. Neural networks are better suited to large-scale prediction problems with complex, high-dimensional inputs (such as images or text) where raw predictive accuracy outweighs the need for interpretable coefficients [5].

Recent research has explored hybrid approaches that combine both paradigms. For example, features extracted from deep neural networks can be used as inputs to a GLM, combining the representation power of deep learning with the statistical rigor of the GLM framework [6]. In the insurance industry, such "combined actuarial neural net" models have gained traction for pricing and reserving applications.

## Assumptions and Diagnostics

GLMs rely on several assumptions that should be checked:

1. **Correct distribution:** The response variable follows (approximately) the specified distribution from the exponential family.
2. **Correct link function:** The chosen link function properly relates the mean response to the linear predictor.
3. **Independence:** Observations are independent of each other.
4. **Linearity in the link scale:** The relationship between predictors and the transformed mean is linear.
5. **No severe multicollinearity:** Predictor variables should not be too highly correlated with each other.

Common diagnostic tools include deviance residual plots, leverage and influence measures (Cook's distance), and quantile-quantile (Q-Q) plots of deviance residuals. The [loss function](/wiki/loss_function) used for fitting (the negative log-likelihood) also serves as the basis for assessing model adequacy.
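
As a sketch of what a deviance residual is, the snippet below computes them for a binary (Bernoulli) response: each residual is the signed square root of one observation's contribution to the deviance. The outcomes and fitted probabilities are made up for illustration.

```python
import math

def bernoulli_unit_deviance(y, p):
    """Deviance contribution of one 0/1 observation with fitted probability p.

    The saturated log-likelihood is 0 for binary data, so this is
    -2 times the observation's log-likelihood.
    """
    return -2.0 * (math.log(p) if y == 1 else math.log(1.0 - p))

def deviance_residual(y, p):
    """Signed square root of the unit deviance; the sign follows y - p."""
    return math.copysign(math.sqrt(bernoulli_unit_deviance(y, p)), y - p)

y = [1, 0, 1, 0, 1]                     # made-up binary outcomes
p_hat = [0.9, 0.2, 0.5, 0.7, 0.3]       # hypothetical fitted probabilities

residuals = [deviance_residual(yi, pi) for yi, pi in zip(y, p_hat)]
total_deviance = sum(r * r for r in residuals)
```

Observations that the model predicts poorly (such as the fourth, an outcome of 0 with fitted probability 0.7) receive residuals that are large in absolute value, and the squared residuals sum to the model's residual deviance.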

## What software implements GLMs?

GLMs are implemented in all major statistical and [machine learning](/wiki/machine_learning) software packages. The following table compares the most common implementations.

| Software | Function / Class | Key Features |
|---|---|---|
| R | `glm()` (built-in) | Full family support (gaussian, binomial, poisson, Gamma, inverse.gaussian, quasi), detailed summary output with deviance, AIC, coefficient tests. The standard reference implementation. |
| Python (statsmodels) | `sm.GLM()` | Supports Gaussian, Binomial, Poisson, Gamma, Inverse Gaussian, Negative Binomial, and Tweedie families. Provides detailed summary tables with standard errors, confidence intervals, and deviance statistics [7]. |
| Python (scikit-learn) | `PoissonRegressor`, `GammaRegressor`, `TweedieRegressor` | Oriented toward prediction rather than inference. Includes regularization by default. Added in scikit-learn 0.23, released in May 2020 [8]. |
| SAS | `PROC GENMOD` | Comprehensive GLM procedure with support for repeated measures, GEE estimation, and Type III tests. |
| Stata | `glm` command | Supports all standard families and links, with robust standard error options. |

In scikit-learn, `PoissonRegressor` and `GammaRegressor` are equivalent to `TweedieRegressor(power=1)` and `TweedieRegressor(power=2)` respectively, both using a built-in log link, and all three support an L2 penalty on the coefficients [8].

### Example: Fitting a GLM in Python

Using statsmodels to fit a Poisson regression:

```python
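import numpy as np
import statsmodels.api as sm

# Simulated data for illustration only; replace with real predictors and counts
rng = np.random.default_rng(0)
predictor_data = rng.normal(size=(200, 2))
true_coefs = np.array([0.3, -0.2])
count_data = rng.poisson(np.exp(0.5 + predictor_data @ true_coefs))

# Prepare data: add_constant prepends the intercept column
X = sm.add_constant(predictor_data)
y = count_data

# Fit Poisson GLM (the Poisson family uses the log link by default)
model = sm.GLM(y, X, family=sm.families.Poisson())
results = model.fit()

# View coefficients, standard errors, and deviance statistics
print(results.summary())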
```

Using R to fit a logistic regression:

```r
# Fit logistic regression (binomial GLM with logit link)
model <- glm(outcome ~ predictor1 + predictor2,
             data = my_data,
             family = binomial(link = "logit"))
summary(model)
```

## What is a GLM used for?

GLMs are used extensively across many disciplines:

- **Insurance and actuarial science:** GLMs are the industry standard for modeling claim frequency (Poisson or negative binomial regression) and claim severity (gamma or inverse Gaussian regression). Insurers use these models to set premiums, assess risk, and comply with regulatory requirements [9].
- **Epidemiology:** Poisson regression and logistic regression are used to model disease incidence rates, estimate relative risks, and identify risk factors while controlling for confounding variables [10].
- **Ecology:** Ecologists use Poisson and negative binomial GLMs to model species abundance, distribution patterns, and the effects of environmental variables on biodiversity.
- **Social sciences:** Logistic regression is widely used for modeling binary outcomes such as voting behavior, survey responses, and educational attainment.
- **Engineering:** Gamma regression models are applied to reliability analysis and time-to-failure data.
- **Finance:** Credit scoring models frequently use logistic regression, and count-based models are used for operational risk analysis.

## See Also

- [Linear Regression](/wiki/linear_regression)
- [Logistic Regression](/wiki/logistic_regression)
- [Regression](/wiki/regression)
- [Neural Network](/wiki/neural_network)
- [Loss Function](/wiki/loss_function)
- [Machine Learning](/wiki/machine_learning)
- [Maximum Likelihood Estimation](/wiki/maximum_likelihood_estimation)
- [Activation Function](/wiki/activation_function)

## References

1. Nelder, J. A., & Wedderburn, R. W. M. (1972). "Generalized Linear Models." *Journal of the Royal Statistical Society, Series A (General)*, 135(3), 370-384. https://academic.oup.com/jrsssa/article/135/3/370/7110572
2. McCullagh, P., & Nelder, J. A. (1989). *Generalized Linear Models* (2nd ed.). Chapman and Hall/CRC, London. ISBN 0412317605.
3. Agresti, A. (2015). *Foundations of Linear and Generalized Linear Models*. Wiley.
4. Akaike, H. (1974). "A new look at the statistical model identification." *IEEE Transactions on Automatic Control*, 19(6), 716-723.
5. Wuthrich, M. V. (2020). "From Generalized Linear Models to Neural Networks, and Back." *SSRN Electronic Journal*.
6. Schelldorfer, J., & Wuthrich, M. V. (2019). "Nesting classical actuarial models into neural networks." *SSRN Electronic Journal*.
7. Statsmodels documentation: Generalized Linear Models. https://www.statsmodels.org/stable/glm.html
8. Scikit-learn documentation: Generalized Linear Models. https://scikit-learn.org/stable/modules/linear_model.html
9. De Jong, P., & Heller, G. Z. (2008). *Generalized Linear Models for Insurance Data*. Cambridge University Press.
10. Dobson, A. J., & Barnett, A. G. (2018). *An Introduction to Generalized Linear Models* (4th ed.). CRC Press.