Maximum likelihood estimation (MLE)
Last reviewed
Apr 30, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 3,993 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
Apr 30, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 3,993 words
Add missing citations, update stale details, or suggest a clearer explanation.
Maximum likelihood estimation (MLE) is a method for estimating the parameters of a probability distribution from observed data. Given a parametric model with density (or mass) function p(x; θ) and a dataset x, the maximum likelihood estimate θ̂ is the value of θ that makes the observed data most probable under the model. It is the dominant frequentist parameter-estimation procedure in classical statistics and the workhorse training principle behind modern probabilistic machine learning, including logistic regression, generalised linear models, hidden Markov models, Gaussian mixture models, and the autoregressive language models that drive contemporary deep learning.
The procedure was developed by Ronald A. Fisher in a series of papers between 1912 and 1922, with the now-standard name and the modern formulation appearing in his 1922 paper "On the Mathematical Foundations of Theoretical Statistics" published in the Philosophical Transactions of the Royal Society. That single paper also introduced the words parameter, statistic, consistency, efficiency, and sufficiency into the statistical vocabulary, and it remains the historical anchor for almost all of frequentist estimation theory.[^fisher1922][^aldrich1997]
Let x = (x₁, ..., x_n) denote observed data and let p(x; θ) be a parametric statistical model indexed by a parameter θ that lies in some parameter space Θ ⊆ ℝ^k. The likelihood function is the joint density (or mass) of the data, viewed as a function of θ with x held fixed:
L(θ; x) = p(x; θ)
Fisher was emphatic that the likelihood is not a probability distribution over θ. Likelihood and probability are wholly distinct objects: one fixes the parameter and varies the data, the other fixes the data and varies the parameter. Fisher coined the term "likelihood" precisely to mark this distinction.[^fisher1922][^likelihood_wiki]
For independent and identically distributed (i.i.d.) samples the joint density factorises, so the likelihood is a product:
L(θ; x) = ∏_{i=1}^{n} p(x_i; θ)
In practice one almost always works with the log-likelihood ℓ(θ; x) = log L(θ; x), which converts the product into a sum:
ℓ(θ; x) = ∑_{i=1}^{n} log p(x_i; θ)
The logarithm is monotone, so any maximiser of L is also a maximiser of ℓ. Working in log space is computationally essential because individual densities can be vanishingly small in high dimensions, and a product of n such densities will underflow to zero in floating-point arithmetic for even modest n. The maximum likelihood estimator is then
θ̂_MLE = argmax_{θ ∈ Θ} ℓ(θ; x)
When ℓ is differentiable, the estimator satisfies the likelihood equations ∂ℓ/∂θ = 0 evaluated at θ̂. The vector of partial derivatives s(θ; x) = ∇_θ ℓ(θ; x) is called the score function, and its expected value is zero at the true parameter under standard regularity conditions.[^fisher_info_wiki]
The canonical examples below illustrate why MLE so often produces the "obvious" empirical estimator, and where it gives more interesting answers such as ordinary least squares.
| Model | Parameter | MLE | Closed form? |
|---|---|---|---|
| Bernoulli(p) | p ∈ [0,1] | sample proportion of successes, k/n | yes |
| Categorical(π_1, ..., π_K) | class probabilities | empirical class frequencies, n_k / n | yes |
| Normal(μ, σ²) | μ | sample mean x̄ | yes |
| Normal(μ, σ²) | σ² | (1/n) ∑ (x_i − x̄)², biased downward | yes |
| Linear regression with Gaussian noise | weight vector w | ordinary least squares, (XᵀX)⁻¹Xᵀy | yes |
| Logistic regression | weight vector w | requires iterative optimisation (IRLS, gradient descent) | no |
| Gaussian mixture model | mixture weights, means, covariances | local optima found by EM | no |
For n i.i.d. Bernoulli(p) trials with k successes, the log-likelihood is ℓ(p) = k log p + (n − k) log(1 − p). Differentiating and setting the derivative to zero gives p̂ = k/n. The MLE of a Bernoulli probability is just the empirical proportion of successes.[^wasserman_ch9][^mle_wiki]
For n i.i.d. samples x₁, ..., x_n from N(μ, σ²) the log-likelihood is
ℓ(μ, σ²) = −(n/2) log(2π) − (n/2) log σ² − (1/(2σ²)) ∑ (x_i − μ)²
Maximising over μ for fixed σ² yields the sample mean μ̂ = x̄. Substituting back and maximising over σ² yields σ̂²_MLE = (1/n) ∑ (x_i − x̄)². The mean estimator is unbiased; the variance estimator is biased downward by a factor (n − 1)/n, which is why the textbook "sample variance" with denominator n − 1 is preferred for unbiased estimation. The MLE is still consistent: the bias vanishes as n → ∞.[^bishop_prml][^mle_wiki]
For n i.i.d. samples drawn from a K-class categorical distribution with probabilities π = (π_1, ..., π_K), the MLE under the constraint ∑ π_k = 1 (using a Lagrange multiplier) is π̂_k = n_k / n, where n_k is the number of times class k was observed. Empirical frequencies maximise the categorical likelihood, which is why "counting" is the natural training rule for unsmoothed naive Bayes and n-gram language models.[^wasserman_ch9]
If y_i = xᵢᵀw + ε_i with ε_i ∼ N(0, σ²) and the inputs xᵢ are treated as fixed, the conditional log-likelihood factorises as a sum of Gaussian log-densities, and maximising over w reduces to minimising ∑ (y_i − xᵢᵀw)². The MLE is the ordinary least squares solution
ŵ_MLE = (XᵀX)⁻¹ Xᵀy
This identity is the formal reason that squared loss shows up everywhere in regression: it is exactly the negative log-likelihood under a homoscedastic Gaussian noise model, up to constants that do not depend on w.[^goodfellow_ch5][^bishop_prml]
For binary logistic regression with σ(z) = 1/(1+e^(-z)) and labels y_i ∈ {0,1}, the log-likelihood is
ℓ(w) = ∑ [y_i log σ(xᵢᵀw) + (1 − y_i) log(1 − σ(xᵢᵀw))]
The negative of this expression is the binary cross-entropy loss, also called the log loss. The likelihood equations have no closed-form solution because of the nonlinearity in σ, so the MLE is computed by iterative methods. The traditional choice is iteratively reweighted least squares (IRLS), which is Newton's method applied to the GLM likelihood with Fisher scoring. Each Newton step solves a weighted least squares problem in a working response, and the algorithm converges to the unique global maximum because the log-likelihood is strictly concave in w.[^irls_wiki][^goodfellow_ch5]
The extension to more than two classes is multi-class logistic regression, trained by minimising categorical cross-entropy.
MLE has a tightly interlocking set of large-sample guarantees that, taken together, make it the canonical "good" estimator in classical statistics.
| Property | Statement | Source |
|---|---|---|
| Consistency | θ̂_n → θ_true in probability as n → ∞ | Cramér 1946 |
| Asymptotic normality | √n (θ̂_n − θ_true) → N(0, I(θ)⁻¹) | Cramér 1946 |
| Asymptotic efficiency | Variance attains the Cramér-Rao lower bound asymptotically | Rao 1945, Cramér 1946 |
| Equivariance / invariance | If g is a one-to-one function, then the MLE of g(θ) is g(θ̂) | Casella and Berger ch. 7 |
| Asymptotic equivalence with Bayes | Posterior concentrates as N(θ̂, I(θ)⁻¹/n) for any well-behaved prior | Bernstein-von Mises theorem |
Under Cramér's regularity conditions, the maximum likelihood estimator is consistent: θ̂_n converges in probability to the true parameter θ_true as the sample size n grows without bound.[^cramer1946][^stanford_lec16] Intuitively, the empirical log-likelihood (1/n) ∑ log p(x_i; θ) converges by the law of large numbers to its expectation E_{x ∼ p_true}[log p(x; θ)], which is uniquely maximised at θ_true when the model is correctly specified. The argmax of the empirical objective therefore converges to the argmax of the population objective.
Under the same regularity conditions,
√n (θ̂_n − θ_true) → N(0, I(θ_true)⁻¹)
in distribution, where I(θ) is the Fisher information matrix
I(θ) = E[(∇_θ log p(X; θ))(∇_θ log p(X; θ))ᵀ] = −E[∇²_θ log p(X; θ)]
The two definitions agree under the regularity conditions that allow exchange of differentiation and integration. Fisher information is, intuitively, the curvature of the log-likelihood at the truth: a sharply peaked likelihood means a small variance for the estimator.[^fisher_info_wiki][^mit_18443]
The Cramér-Rao lower bound, proved by Harald Cramér and C. R. Rao in the 1940s, states that for any unbiased estimator T(X) of θ,
Var(T) ≥ I(θ)⁻¹
The MLE attains this bound asymptotically, in the sense that its asymptotic variance equals I(θ)⁻¹. No other consistent estimator can have lower asymptotic variance under the same model. This is what is meant by saying the MLE is asymptotically efficient.[^cramer1946][^stanford_stats200]
If θ̂ is the MLE of θ and g is a one-to-one transformation, then g(θ̂) is the MLE of g(θ). This is the invariance property of the MLE, due in its modern form to Zehna (1966) and stated as Theorem 7.2.10 in Casella and Berger's Statistical Inference.[^casella_berger] Bayesian estimators do not enjoy this property in general: the posterior mean of g(θ) is not g(posterior mean of θ).
The Bernstein-von Mises theorem states that under regularity conditions, the Bayesian posterior distribution converges in total variation distance to a Gaussian centred at the MLE with covariance given by the inverse Fisher information divided by n. As n → ∞, the prior is washed out, the posterior looks Gaussian, and the posterior mean and the MLE become indistinguishable to leading order. This is the deep reason that frequentist confidence intervals based on Fisher information and Bayesian credible intervals from "reasonable" priors agree in large samples.[^bvm_wiki]
The asymptotic guarantees above hold under what are usually called Cramér's regularity conditions, after Harald Cramér's 1946 textbook Mathematical Methods of Statistics.[^cramer1946][^stanford_lec16] Common formulations require:
When these conditions fail, MLE can still be defined but its behaviour can be pathological. A famous boundary case is estimating θ in Uniform(0, θ): the MLE is the sample maximum, which converges at rate 1/n rather than the usual 1/√n, and its limiting distribution is exponential rather than Gaussian.
Maximising the log-likelihood is, up to a constant that does not depend on θ, equivalent to minimising the Kullback-Leibler divergence from the empirical data distribution to the model. Let p̂_data denote the empirical distribution that puts mass 1/n on each observed sample. Then
D_KL(p̂_data ‖ p_θ) = E_{x ∼ p̂_data}[log p̂_data(x)] − E_{x ∼ p̂_data}[log p_θ(x)]
The first term does not depend on θ, so minimising the KL divergence is the same as maximising the second term, which is exactly (1/n) times the log-likelihood. Goodfellow, Bengio and Courville call this the unifying view of MLE: "we can see it as minimizing the dissimilarity between the empirical distribution defined by the training set and the model distribution, with the degree of dissimilarity between the two measured by the KL divergence".[^goodfellow_ch5]
An important corollary applies when the model is misspecified, meaning the true data-generating distribution g does not lie in the parametric family {p_θ}. Then the MLE converges to the pseudo-true parameter θ* that minimises D_KL(g ‖ p_θ), the KL projection of the truth onto the model. This is the basis of quasi-maximum likelihood estimation, formalised by Halbert White in his 1982 paper Maximum Likelihood Estimation of Misspecified Models. Standard inference no longer applies under misspecification because Var(θ̂) is not I(θ)⁻¹/n, and one has to use the so-called sandwich variance estimator instead.[^white1982]
For a categorical model with predicted probabilities p_θ(y | x), the negative log-likelihood of a single labelled example (x, y) is
−log p_θ(y | x) = H(δ_y, p_θ(· | x))
where δ_y is the one-hot distribution on the true label and H denotes cross-entropy. Summing over the dataset gives
−ℓ(θ) = ∑_{i=1}^{n} H(δ_{y_i}, p_θ(· | x_i))
This is exactly the cross-entropy loss used to train classification models. Every time a deep neural network is trained on a classification task with log loss, it is being trained by maximum likelihood under a categorical likelihood; "cross-entropy" and "negative log-likelihood" are different names for the same objective.[^goodfellow_ch5] The fact that virtually all modern classifiers, from logistic regression to ImageNet CNNs to GPT-style transformers, are trained by minimising cross-entropy is just the statement that they are trained by maximum likelihood.
For a regression model y = f_θ(x) + ε with ε ∼ N(0, σ²), the negative log-likelihood of a single example is
−log p_θ(y | x) = (1/(2σ²))(y − f_θ(x))² + (1/2) log(2πσ²)
Summing over the dataset and discarding terms that do not depend on θ, the MLE of θ is the minimiser of ∑ (y_i − f_θ(x_i))², which is the mean squared error. So MSE is MLE under a Gaussian noise assumption with constant variance. This explains why the textbook regression loss is squared error and why robust regression methods (Huber loss, L1 loss) implicitly assume non-Gaussian noise distributions.[^goodfellow_ch5]
For a handful of "nice" models, the likelihood equations can be solved in closed form. The Bernoulli, multinomial, Gaussian and ordinary linear regression cases above are the main examples. For most other models, the maximum has to be found numerically.
| Method | Typical use case |
|---|---|
| Closed-form solution | Exponential family with sufficient statistics, Gaussian, OLS regression |
| Newton-Raphson | Smooth concave log-likelihoods, small to medium parameter dimension |
| Fisher scoring | Generalised linear models; replaces observed Hessian with expected Fisher information |
| Iteratively reweighted least squares (IRLS) | Logistic regression and other GLMs |
| Quasi-Newton (BFGS, L-BFGS) | Medium-dimensional smooth problems where Hessian is expensive |
| Expectation-Maximisation (EM) | Latent-variable and missing-data models, Gaussian mixture models, HMMs |
| Stochastic gradient descent | Massive datasets and high-dimensional models, including deep neural networks |
The Expectation-Maximisation algorithm of Dempster, Laird and Rubin (1977) is the standard iterative method for MLE in models with latent variables. Each EM iteration alternates between an E-step that computes the expected complete-data log-likelihood given current parameters and an M-step that maximises this expected log-likelihood. The algorithm is guaranteed to monotonically increase the observed-data likelihood at every iteration, although it can converge to local optima.[^dempster1977]
For very large parameter spaces, second-order methods become infeasible, and the dominant tool is stochastic gradient descent on the negative log-likelihood, optionally with mini-batch updates and adaptive learning rates such as Adam. Modern deep learning is, almost without exception, MLE in disguise: the loss function is some flavour of negative log-likelihood, and the training loss curve is monitored to verify that ℓ is increasing.
MLE returns a single point estimate, the mode of the likelihood. Bayesian inference instead returns the full posterior distribution
p(θ | x) ∝ p(x | θ) p(θ)
where p(θ) is the prior. The mode of the posterior, called the maximum a posteriori (MAP) estimate, equals the MLE when the prior is flat. Adding a Gaussian prior on θ corresponds to ridge (L2) regularisation; adding a Laplace prior corresponds to lasso (L1) regularisation. So regularised MLE is a special case of MAP estimation, and MAP estimation is a special case of full Bayesian inference that throws away everything except the posterior mode.[^goodfellow_ch5]
| Estimator | Definition | Output | Uses prior? | Uncertainty quantification |
|---|---|---|---|---|
| MLE | argmax_θ p(x ; θ) | point estimate | no | only via asymptotic Fisher information |
| MAP | argmax_θ p(x ; θ) p(θ) | point estimate | yes | only via Laplace approximation around the mode |
| Bayesian posterior | p(θ | x) ∝ p(x ; θ) p(θ) | full distribution over θ | yes |
For large n, the Bernstein-von Mises theorem implies that the posterior under any "sensible" prior concentrates on the MLE with width 1/√n, so MLE, MAP and the posterior mean all become equivalent. For small n, or when the prior carries genuine information, the three estimators can give noticeably different answers. Bayesian inference is more honest about uncertainty but harder to compute. MLE is cheap and asymptotically efficient.
MLE is the default for good reason, but it has well-known failure modes that practitioners should keep in mind.
Finite-sample bias. The MLE is consistent but not unbiased in general. The Gaussian variance estimator with denominator n is the textbook example: it is biased downward by a factor (n − 1)/n. Bias corrections of order 1/n exist (jackknife, Barndorff-Nielsen's modified profile likelihood) but require extra work.[^mle_wiki]
Sensitivity to model misspecification. If the assumed family does not contain the true distribution, the MLE converges to the KL projection of the truth onto the family, not to anything that need correspond to a quantity of scientific interest. White's quasi-MLE framework gives correct standard errors via the sandwich estimator but cannot fix a fundamentally wrong model.[^white1982]
Sensitivity to outliers. Because the log-density goes to −∞ near zero, a single grossly outlying observation can drive the likelihood arbitrarily low. The Gaussian MLE of the mean is the sample mean, which has zero breakdown point. Robust alternatives (M-estimators, trimmed likelihoods) trade efficiency for resistance to outliers.
Multiple local optima. When the log-likelihood is non-concave, as in Gaussian mixture models, neural networks, and most latent-variable models, the optimisation surface has many local optima, and the EM algorithm or gradient descent can converge to any of them. Initialisation matters, and there is no general guarantee that the global maximum will be found.
Overfitting in high dimensions. When the parameter dimension is comparable to or larger than the sample size, the MLE will fit noise and generalise poorly. This is the entire reason ridge and lasso regression exist: they are MAP estimates with Gaussian or Laplace priors that shrink coefficients toward zero, regularising the unregularised MLE.
Numerical underflow. Multiplying many small probabilities underflows to zero in floating-point arithmetic. Always work in log space, summing log-probabilities rather than multiplying probabilities. Standard implementations use the log-sum-exp trick for numerically stable computation of normalised likelihoods.
Boundary and unbounded likelihoods. Some likelihoods are unbounded above. A two-component Gaussian mixture, for instance, can drive its likelihood to +∞ by centring one component on a single data point and shrinking its variance to zero. The naive MLE does not exist, and one has to use either constrained optimisation, a prior that penalises tiny variances, or the so-called penalised likelihood approach.
Calibration of high-capacity models. A deep network trained to convergence by maximum likelihood on cross-entropy will typically produce overconfident predictions on the training set. Label smoothing, introduced in Szegedy et al.'s 2016 Rethinking the Inception Architecture for Computer Vision, replaces the one-hot target with a softened distribution, equivalent to adding a KL divergence to the uniform distribution to the negative log-likelihood objective. It is a deliberate move away from the unregularised MLE, and it consistently improves calibration and generalisation.[^szegedy2016]
MLE is not just a textbook procedure; it is the explicit training principle behind most of modern probabilistic ML.
What MLE does not train are GANs (which use a minimax loss instead of a likelihood), Wasserstein models (which use a Wasserstein loss instead of KL), and hinge-loss SVMs (which use hinge loss instead of log loss). Each alternative is motivated by a specific failure mode of MLE: mode collapse, gradient saturation, or sensitivity to misspecification.
A short checklist for using MLE in practice. Always optimise the log-likelihood, not the likelihood itself, because floating-point underflow is otherwise certain. Check identifiability before optimising; an unidentifiable parameter produces a flat ridge in the likelihood and no unique answer. Use a second-order method when feasible: Newton-Raphson and IRLS converge in few iterations for well-behaved problems, and the Hessian at the optimum is the observed Fisher information, which gives standard errors essentially for free. For large n, switch to stochastic gradient descent, since computing the full-batch log-likelihood is wasteful when the empirical loss is a sum of n terms. Regularise: pure MLE in high dimensions overfits, so add a prior (i.e. switch to MAP), use early stopping, weight decay, or label smoothing. Diagnose misspecification: if model-based and sandwich standard errors disagree, the model is probably wrong. For non-concave problems, use multiple random restarts and report sensitivity to initialisation.
Bayes' theorem, Bayesian inference, Bayesian neural network, convex optimization, estimator, focal loss, Gaussian process, loss curve, loss function, loss surface, naive Bayes, training loss, validation loss.