Maximum likelihood estimation (MLE)

Maximum likelihood estimation (MLE) is a method for estimating the parameters of a probability distribution from observed data. Given a parametric model with density (or mass) function p(x; θ) and a dataset x, the maximum likelihood estimate θ̂ is the value of θ that makes the observed data most probable under the model. It is the dominant frequentist parameter-estimation procedure in classical statistics and the workhorse training principle behind modern probabilistic machine learning, including logistic regression, generalised linear models, hidden Markov models, Gaussian mixture models, and the autoregressive language models that drive contemporary deep learning.

The procedure was developed by Ronald A. Fisher in a series of papers between 1912 and 1922, with the now-standard name and the modern formulation appearing in his 1922 paper "On the Mathematical Foundations of Theoretical Statistics" published in the Philosophical Transactions of the Royal Society. That single paper also introduced the words parameter, statistic, consistency, efficiency, and sufficiency into the statistical vocabulary, and it remains the historical anchor for almost all of frequentist estimation theory.[^fisher1922][^aldrich1997]

Definition

Let x = (x₁, ..., x_n) denote observed data and let p(x; θ) be a parametric statistical model indexed by a parameter θ that lies in some parameter space Θ ⊆ ℝ^k. The likelihood function is the joint density (or mass) of the data, viewed as a function of θ with x held fixed:

L(θ; x) = p(x; θ)

Fisher was emphatic that the likelihood is not a probability distribution over θ. Likelihood and probability are wholly distinct objects: one fixes the parameter and varies the data, the other fixes the data and varies the parameter. Fisher coined the term "likelihood" precisely to mark this distinction.[^fisher1922][^likelihood_wiki]

For independent and identically distributed (i.i.d.) samples the joint density factorises, so the likelihood is a product:

L(θ; x) = ∏_{i=1}^{n} p(x_i; θ)

In practice one almost always works with the log-likelihood ℓ(θ; x) = log L(θ; x), which converts the product into a sum:

ℓ(θ; x) = ∑_{i=1}^{n} log p(x_i; θ)

The logarithm is monotone, so any maximiser of L is also a maximiser of ℓ. Working in log space is computationally essential because individual densities can be vanishingly small in high dimensions, and a product of n such densities will underflow to zero in floating-point arithmetic for even modest n. The maximum likelihood estimator is then

θ̂_MLE = argmax_{θ ∈ Θ} ℓ(θ; x)

When ℓ is differentiable, the estimator satisfies the likelihood equations ∂ℓ/∂θ = 0 evaluated at θ̂. The vector of partial derivatives s(θ; x) = ∇_θ ℓ(θ; x) is called the score function, and its expected value is zero at the true parameter under standard regularity conditions.[^fisher_info_wiki]

Worked examples

The canonical examples below illustrate why MLE so often produces the "obvious" empirical estimator, and where it gives more interesting answers such as ordinary least squares.

Model	Parameter	MLE	Closed form?
Bernoulli(p)	p ∈ [0,1]	sample proportion of successes, k/n	yes
Categorical(π_1, ..., π_K)	class probabilities	empirical class frequencies, n_k / n	yes
Normal(μ, σ²)	μ	sample mean x̄	yes
Normal(μ, σ²)	σ²	(1/n) ∑ (x_i − x̄)², biased downward	yes
Linear regression with Gaussian noise	weight vector w	ordinary least squares, (XᵀX)⁻¹Xᵀy	yes
Logistic regression	weight vector w	requires iterative optimisation (IRLS, gradient descent)	no
Gaussian mixture model	mixture weights, means, covariances	local optima found by EM	no

Bernoulli

For n i.i.d. Bernoulli(p) trials with k successes, the log-likelihood is ℓ(p) = k log p + (n − k) log(1 − p). Differentiating and setting the derivative to zero gives p̂ = k/n. The MLE of a Bernoulli probability is just the empirical proportion of successes.[^wasserman_ch9][^mle_wiki]

Gaussian

For n i.i.d. samples x₁, ..., x_n from N(μ, σ²) the log-likelihood is

ℓ(μ, σ²) = −(n/2) log(2π) − (n/2) log σ² − (1/(2σ²)) ∑ (x_i − μ)²

Maximising over μ for fixed σ² yields the sample mean μ̂ = x̄. Substituting back and maximising over σ² yields σ̂²_MLE = (1/n) ∑ (x_i − x̄)². The mean estimator is unbiased; the variance estimator is biased downward by a factor (n − 1)/n, which is why the textbook "sample variance" with denominator n − 1 is preferred for unbiased estimation. The MLE is still consistent: the bias vanishes as n → ∞.[^bishop_prml][^mle_wiki]

Categorical

For n i.i.d. samples drawn from a K-class categorical distribution with probabilities π = (π_1, ..., π_K), the MLE under the constraint ∑ π_k = 1 (using a Lagrange multiplier) is π̂_k = n_k / n, where n_k is the number of times class k was observed. Empirical frequencies maximise the categorical likelihood, which is why "counting" is the natural training rule for unsmoothed naive Bayes and n-gram language models.[^wasserman_ch9]

Linear regression with Gaussian noise

If y_i = xᵢᵀw + ε_i with ε_i ∼ N(0, σ²) and the inputs xᵢ are treated as fixed, the conditional log-likelihood factorises as a sum of Gaussian log-densities, and maximising over w reduces to minimising ∑ (y_i − xᵢᵀw)². The MLE is the ordinary least squares solution

ŵ_MLE = (XᵀX)⁻¹ Xᵀy

This identity is the formal reason that squared loss shows up everywhere in regression: it is exactly the negative log-likelihood under a homoscedastic Gaussian noise model, up to constants that do not depend on w.[^goodfellow_ch5][^bishop_prml]

Logistic regression

For binary logistic regression with σ(z) = 1/(1+e^(-z)) and labels y_i ∈ {0,1}, the log-likelihood is

ℓ(w) = ∑ [y_i log σ(xᵢᵀw) + (1 − y_i) log(1 − σ(xᵢᵀw))]

The negative of this expression is the binary cross-entropy loss, also called the log loss. The likelihood equations have no closed-form solution because of the nonlinearity in σ, so the MLE is computed by iterative methods. The traditional choice is iteratively reweighted least squares (IRLS), which is Newton's method applied to the GLM likelihood with Fisher scoring. Each Newton step solves a weighted least squares problem in a working response, and the algorithm converges to the unique global maximum because the log-likelihood is strictly concave in w.[^irls_wiki][^goodfellow_ch5]

The extension to more than two classes is multi-class logistic regression, trained by minimising categorical cross-entropy.

Theoretical properties

MLE has a tightly interlocking set of large-sample guarantees that, taken together, make it the canonical "good" estimator in classical statistics.

Property	Statement	Source
Consistency	θ̂_n → θ_true in probability as n → ∞	Cramér 1946
Asymptotic normality	√n (θ̂_n − θ_true) → N(0, I(θ)⁻¹)	Cramér 1946
Asymptotic efficiency	Variance attains the Cramér-Rao lower bound asymptotically	Rao 1945, Cramér 1946
Equivariance / invariance	If g is a one-to-one function, then the MLE of g(θ) is g(θ̂)	Casella and Berger ch. 7
Asymptotic equivalence with Bayes	Posterior concentrates as N(θ̂, I(θ)⁻¹/n) for any well-behaved prior	Bernstein-von Mises theorem

Consistency

Under Cramér's regularity conditions, the maximum likelihood estimator is consistent: θ̂_n converges in probability to the true parameter θ_true as the sample size n grows without bound.[^cramer1946][^stanford_lec16] Intuitively, the empirical log-likelihood (1/n) ∑ log p(x_i; θ) converges by the law of large numbers to its expectation E_{x ∼ p_true}[log p(x; θ)], which is uniquely maximised at θ_true when the model is correctly specified. The argmax of the empirical objective therefore converges to the argmax of the population objective.

Asymptotic normality

Under the same regularity conditions,

√n (θ̂_n − θ_true) → N(0, I(θ_true)⁻¹)

in distribution, where I(θ) is the Fisher information matrix

I(θ) = E[(∇_θ log p(X; θ))(∇_θ log p(X; θ))ᵀ] = −E[∇²_θ log p(X; θ)]

The two definitions agree under the regularity conditions that allow exchange of differentiation and integration. Fisher information is, intuitively, the curvature of the log-likelihood at the truth: a sharply peaked likelihood means a small variance for the estimator.[^fisher_info_wiki][^mit_18443]

Asymptotic efficiency and the Cramér-Rao bound

The Cramér-Rao lower bound, proved by Harald Cramér and C. R. Rao in the 1940s, states that for any unbiased estimator T(X) of θ,

Var(T) ≥ I(θ)⁻¹

The MLE attains this bound asymptotically, in the sense that its asymptotic variance equals I(θ)⁻¹. No other consistent estimator can have lower asymptotic variance under the same model. This is what is meant by saying the MLE is asymptotically efficient.[^cramer1946][^stanford_stats200]

Equivariance and invariance under reparameterisation

If θ̂ is the MLE of θ and g is a one-to-one transformation, then g(θ̂) is the MLE of g(θ). This is the invariance property of the MLE, due in its modern form to Zehna (1966) and stated as Theorem 7.2.10 in Casella and Berger's Statistical Inference.[^casella_berger] Bayesian estimators do not enjoy this property in general: the posterior mean of g(θ) is not g(posterior mean of θ).

Bernstein-von Mises

The Bernstein-von Mises theorem states that under regularity conditions, the Bayesian posterior distribution converges in total variation distance to a Gaussian centred at the MLE with covariance given by the inverse Fisher information divided by n. As n → ∞, the prior is washed out, the posterior looks Gaussian, and the posterior mean and the MLE become indistinguishable to leading order. This is the deep reason that frequentist confidence intervals based on Fisher information and Bayesian credible intervals from "reasonable" priors agree in large samples.[^bvm_wiki]

Regularity conditions

The asymptotic guarantees above hold under what are usually called Cramér's regularity conditions, after Harald Cramér's 1946 textbook Mathematical Methods of Statistics.[^cramer1946][^stanford_lec16] Common formulations require:

Identifiability. Different parameter values give different distributions: θ ≠ θ' implies p(·; θ) ≠ p(·; θ').
Common support. The support of p(x; θ) does not depend on θ. (Distributions like Uniform(0, θ) violate this and require separate treatment; the MLE is still consistent there but is not asymptotically normal in the usual sense.)
Interior point. The true parameter θ_true lies in the interior of the parameter space Θ, so the score equation can vanish there.
Smoothness and dominated derivatives. The log-density log p(x; θ) is twice continuously differentiable in θ for almost every x, and the partial derivatives are dominated by an integrable function so that differentiation under the integral sign is justified.
Positive definite information. The Fisher information matrix I(θ_true) is finite and positive definite.

When these conditions fail, MLE can still be defined but its behaviour can be pathological. A famous boundary case is estimating θ in Uniform(0, θ): the MLE is the sample maximum, which converges at rate 1/n rather than the usual 1/√n, and its limiting distribution is exponential rather than Gaussian.

Relationship with KL divergence

Maximising the log-likelihood is, up to a constant that does not depend on θ, equivalent to minimising the Kullback-Leibler divergence from the empirical data distribution to the model. Let p̂_data denote the empirical distribution that puts mass 1/n on each observed sample. Then

D_KL(p̂_data ‖ p_θ) = E_{x ∼ p̂_data}[log p̂_data(x)] − E_{x ∼ p̂_data}[log p_θ(x)]

The first term does not depend on θ, so minimising the KL divergence is the same as maximising the second term, which is exactly (1/n) times the log-likelihood. Goodfellow, Bengio and Courville call this the unifying view of MLE: "we can see it as minimizing the dissimilarity between the empirical distribution defined by the training set and the model distribution, with the degree of dissimilarity between the two measured by the KL divergence".[^goodfellow_ch5]

An important corollary applies when the model is misspecified, meaning the true data-generating distribution g does not lie in the parametric family {p_θ}. Then the MLE converges to the pseudo-true parameter θ* that minimises D_KL(g ‖ p_θ), the KL projection of the truth onto the model. This is the basis of quasi-maximum likelihood estimation, formalised by Halbert White in his 1982 paper Maximum Likelihood Estimation of Misspecified Models. Standard inference no longer applies under misspecification because Var(θ̂) is not I(θ)⁻¹/n, and one has to use the so-called sandwich variance estimator instead.[^white1982]

Relationship with cross-entropy

For a categorical model with predicted probabilities p_θ(y | x), the negative log-likelihood of a single labelled example (x, y) is

−log p_θ(y | x) = H(δ_y, p_θ(· | x))

where δ_y is the one-hot distribution on the true label and H denotes cross-entropy. Summing over the dataset gives

−ℓ(θ) = ∑_{i=1}^{n} H(δ_{y_i}, p_θ(· | x_i))

This is exactly the cross-entropy loss used to train classification models. Every time a deep neural network is trained on a classification task with log loss, it is being trained by maximum likelihood under a categorical likelihood; "cross-entropy" and "negative log-likelihood" are different names for the same objective.[^goodfellow_ch5] The fact that virtually all modern classifiers, from logistic regression to ImageNet CNNs to GPT-style transformers, are trained by minimising cross-entropy is just the statement that they are trained by maximum likelihood.

Relationship with mean squared error

For a regression model y = f_θ(x) + ε with ε ∼ N(0, σ²), the negative log-likelihood of a single example is

−log p_θ(y | x) = (1/(2σ²))(y − f_θ(x))² + (1/2) log(2πσ²)

Summing over the dataset and discarding terms that do not depend on θ, the MLE of θ is the minimiser of ∑ (y_i − f_θ(x_i))², which is the mean squared error. So MSE is MLE under a Gaussian noise assumption with constant variance. This explains why the textbook regression loss is squared error and why robust regression methods (Huber loss, L1 loss) implicitly assume non-Gaussian noise distributions.[^goodfellow_ch5]

Numerical methods

For a handful of "nice" models, the likelihood equations can be solved in closed form. The Bernoulli, multinomial, Gaussian and ordinary linear regression cases above are the main examples. For most other models, the maximum has to be found numerically.

Method	Typical use case
Closed-form solution	Exponential family with sufficient statistics, Gaussian, OLS regression
Newton-Raphson	Smooth concave log-likelihoods, small to medium parameter dimension
Fisher scoring	Generalised linear models; replaces observed Hessian with expected Fisher information
Iteratively reweighted least squares (IRLS)	Logistic regression and other GLMs
Quasi-Newton (BFGS, L-BFGS)	Medium-dimensional smooth problems where Hessian is expensive
Expectation-Maximisation (EM)	Latent-variable and missing-data models, Gaussian mixture models, HMMs
Stochastic gradient descent	Massive datasets and high-dimensional models, including deep neural networks

The Expectation-Maximisation algorithm of Dempster, Laird and Rubin (1977) is the standard iterative method for MLE in models with latent variables. Each EM iteration alternates between an E-step that computes the expected complete-data log-likelihood given current parameters and an M-step that maximises this expected log-likelihood. The algorithm is guaranteed to monotonically increase the observed-data likelihood at every iteration, although it can converge to local optima.[^dempster1977]

For very large parameter spaces, second-order methods become infeasible, and the dominant tool is stochastic gradient descent on the negative log-likelihood, optionally with mini-batch updates and adaptive learning rates such as Adam. Modern deep learning is, almost without exception, MLE in disguise: the loss function is some flavour of negative log-likelihood, and the training loss curve is monitored to verify that ℓ is increasing.

Connection to Bayesian inference

MLE returns a single point estimate, the mode of the likelihood. Bayesian inference instead returns the full posterior distribution

p(θ | x) ∝ p(x | θ) p(θ)

where p(θ) is the prior. The mode of the posterior, called the maximum a posteriori (MAP) estimate, equals the MLE when the prior is flat. Adding a Gaussian prior on θ corresponds to ridge (L2) regularisation; adding a Laplace prior corresponds to lasso (L1) regularisation. So regularised MLE is a special case of MAP estimation, and MAP estimation is a special case of full Bayesian inference that throws away everything except the posterior mode.[^goodfellow_ch5]

Estimator	Definition	Output	Uses prior?	Uncertainty quantification
MLE	argmax_θ p(x ; θ)	point estimate	no	only via asymptotic Fisher information
MAP	argmax_θ p(x ; θ) p(θ)	point estimate	yes	only via Laplace approximation around the mode
Bayesian posterior	p(θ	x) ∝ p(x ; θ) p(θ)	full distribution over θ	yes

For large n, the Bernstein-von Mises theorem implies that the posterior under any "sensible" prior concentrates on the MLE with width 1/√n, so MLE, MAP and the posterior mean all become equivalent. For small n, or when the prior carries genuine information, the three estimators can give noticeably different answers. Bayesian inference is more honest about uncertainty but harder to compute. MLE is cheap and asymptotically efficient.

Limitations and pitfalls

MLE is the default for good reason, but it has well-known failure modes that practitioners should keep in mind.

Finite-sample bias. The MLE is consistent but not unbiased in general. The Gaussian variance estimator with denominator n is the textbook example: it is biased downward by a factor (n − 1)/n. Bias corrections of order 1/n exist (jackknife, Barndorff-Nielsen's modified profile likelihood) but require extra work.[^mle_wiki]

Sensitivity to model misspecification. If the assumed family does not contain the true distribution, the MLE converges to the KL projection of the truth onto the family, not to anything that need correspond to a quantity of scientific interest. White's quasi-MLE framework gives correct standard errors via the sandwich estimator but cannot fix a fundamentally wrong model.[^white1982]

Sensitivity to outliers. Because the log-density goes to −∞ near zero, a single grossly outlying observation can drive the likelihood arbitrarily low. The Gaussian MLE of the mean is the sample mean, which has zero breakdown point. Robust alternatives (M-estimators, trimmed likelihoods) trade efficiency for resistance to outliers.

Multiple local optima. When the log-likelihood is non-concave, as in Gaussian mixture models, neural networks, and most latent-variable models, the optimisation surface has many local optima, and the EM algorithm or gradient descent can converge to any of them. Initialisation matters, and there is no general guarantee that the global maximum will be found.

Overfitting in high dimensions. When the parameter dimension is comparable to or larger than the sample size, the MLE will fit noise and generalise poorly. This is the entire reason ridge and lasso regression exist: they are MAP estimates with Gaussian or Laplace priors that shrink coefficients toward zero, regularising the unregularised MLE.

Numerical underflow. Multiplying many small probabilities underflows to zero in floating-point arithmetic. Always work in log space, summing log-probabilities rather than multiplying probabilities. Standard implementations use the log-sum-exp trick for numerically stable computation of normalised likelihoods.

Boundary and unbounded likelihoods. Some likelihoods are unbounded above. A two-component Gaussian mixture, for instance, can drive its likelihood to +∞ by centring one component on a single data point and shrinking its variance to zero. The naive MLE does not exist, and one has to use either constrained optimisation, a prior that penalises tiny variances, or the so-called penalised likelihood approach.

Calibration of high-capacity models. A deep network trained to convergence by maximum likelihood on cross-entropy will typically produce overconfident predictions on the training set. Label smoothing, introduced in Szegedy et al.'s 2016 Rethinking the Inception Architecture for Computer Vision, replaces the one-hot target with a softened distribution, equivalent to adding a KL divergence to the uniform distribution to the negative log-likelihood objective. It is a deliberate move away from the unregularised MLE, and it consistently improves calibration and generalisation.[^szegedy2016]

Use cases in modern machine learning

MLE is not just a textbook procedure; it is the explicit training principle behind most of modern probabilistic ML.

Classical statistics: linear and generalised linear models, survival analysis, mixed-effects models, factor analysis.
Latent-variable models: Gaussian mixture models, hidden Markov models, latent Dirichlet allocation. The standard fitting routine is the EM algorithm, which is MLE for incomplete data.[^dempster1977]
Neural network classifiers: every model trained on cross-entropy, from logistic regression to ResNets to GPT-style transformers, is trained by MLE under a categorical likelihood. The cross-entropy loss curve is the negative log-likelihood, normalised by batch size.
Neural network regressors: every model trained on mean squared error is trained by MLE under a homoscedastic Gaussian noise model.
Autoregressive language models: an n-token model factorises p(x_1, ..., x_n) = ∏t p(x_t | x<t), and training maximises ∑t log p(x_t | x<t). Next-token prediction is just MLE of an autoregressive model.
Normalising flows: invertible neural networks trained by exact MLE on continuous data via the change-of-variables formula.
Variational autoencoders: trained by maximising a lower bound on the log-likelihood (the ELBO), which is MLE plus a KL regularisation term.
Reinforcement learning fine-tuning: methods such as DPO can be cast as maximum likelihood on a preference dataset.

What MLE does not train are GANs (which use a minimax loss instead of a likelihood), Wasserstein models (which use a Wasserstein loss instead of KL), and hinge-loss SVMs (which use hinge loss instead of log loss). Each alternative is motivated by a specific failure mode of MLE: mode collapse, gradient saturation, or sensitivity to misspecification.

Practical considerations

A short checklist for using MLE in practice. Always optimise the log-likelihood, not the likelihood itself, because floating-point underflow is otherwise certain. Check identifiability before optimising; an unidentifiable parameter produces a flat ridge in the likelihood and no unique answer. Use a second-order method when feasible: Newton-Raphson and IRLS converge in few iterations for well-behaved problems, and the Hessian at the optimum is the observed Fisher information, which gives standard errors essentially for free. For large n, switch to stochastic gradient descent, since computing the full-batch log-likelihood is wasteful when the empirical loss is a sum of n terms. Regularise: pure MLE in high dimensions overfits, so add a prior (i.e. switch to MAP), use early stopping, weight decay, or label smoothing. Diagnose misspecification: if model-based and sandwich standard errors disagree, the model is probably wrong. For non-concave problems, use multiple random restarts and report sensitivity to initialisation.

References

Maximum likelihood estimation (MLE)

Definition

Worked examples

Bernoulli

Gaussian

Categorical

Linear regression with Gaussian noise

Logistic regression

Theoretical properties

Consistency

Asymptotic normality

Asymptotic efficiency and the Cramér-Rao bound

Equivariance and invariance under reparameterisation

Bernstein-von Mises

Regularity conditions

Relationship with KL divergence

Relationship with cross-entropy

Relationship with mean squared error

Numerical methods

Connection to Bayesian inference

Limitations and pitfalls

Use cases in modern machine learning

Practical considerations

See also

References

Improve this article

Related Articles

ARC-AGI 2

AUC-ROC

ARIMA

Machine learning terms/Clustering

Machine learning terms/Decision Forests

Machine learning terms/Fairness

Maximum likelihood estimation (MLE)

Definition

Worked examples

Bernoulli

Gaussian

Categorical

Linear regression with Gaussian noise

Logistic regression

Theoretical properties

Consistency

Asymptotic normality

Asymptotic efficiency and the Cramér-Rao bound

Equivariance and invariance under reparameterisation

Bernstein-von Mises

Regularity conditions

Relationship with KL divergence

Relationship with cross-entropy

Relationship with mean squared error

Numerical methods

Connection to Bayesian inference

Limitations and pitfalls

Use cases in modern machine learning

Practical considerations

See also

References

Related Articles

ARC-AGI 2

AUC-ROC

ARIMA

Machine learning terms/Clustering

Machine learning terms/Decision Forests

Machine learning terms/Fairness