# Maximum likelihood estimation (MLE)

> Source: https://aiwiki.ai/wiki/maximum_likelihood_estimation
> Updated: 2026-06-23
> Categories: Machine Learning, Statistics
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

**Maximum likelihood estimation (MLE)** is the method of choosing the parameters of a probability model so that they make the observed data as probable as possible: given a parametric model with density (or mass) function p(x; θ) and a dataset x, the maximum likelihood estimate θ̂ is the value of θ that maximizes the likelihood L(θ; x) = p(x; θ). It is the dominant frequentist parameter-estimation procedure in classical statistics and the explicit training principle behind almost all of modern probabilistic machine learning, including [logistic regression](/wiki/logistic_regression), [generalised linear models](/wiki/linear_regression), hidden Markov models, [Gaussian mixture models](/wiki/gaussian_mixture_model), and the autoregressive [large language models](/wiki/large_language_model) that drive contemporary deep learning. Training a neural network by minimizing [cross-entropy loss](/wiki/cross_entropy_loss) is exactly maximum likelihood estimation under a categorical model, and next-token prediction in models such as [GPT-3](/wiki/gpt_3) is MLE of an autoregressive language model.[^goodfellow_ch5][^brown2020]

The procedure was developed by Ronald A. Fisher in a series of papers between 1912 and 1922, with the now-standard name and the modern formulation appearing in his 1922 paper "On the Mathematical Foundations of Theoretical Statistics" published in the Philosophical Transactions of the Royal Society (Series A, volume 222, pages 309-368).[^fisher1922] According to historian John Aldrich, that single paper introduced the terms *parameter*, *statistic*, *consistency*, *efficiency*, and *sufficiency* into the statistical vocabulary, and Fisher used the word *parameter* in its modern sense 57 times in the article; it remains the historical anchor for almost all of frequentist estimation theory.[^aldrich1997]

## What is the formal definition of MLE?

Let x = (x₁, ..., x_n) denote observed data and let p(x; θ) be a parametric statistical model indexed by a parameter θ that lies in some parameter space Θ ⊆ ℝ^k. The **likelihood function** is the joint density (or mass) of the data, viewed as a function of θ with x held fixed:

```
L(θ; x) = p(x; θ)
```

Fisher was emphatic that the likelihood is *not* a probability distribution over θ. Likelihood and probability are wholly distinct objects: one fixes the parameter and varies the data, the other fixes the data and varies the parameter. Fisher coined the term "likelihood" precisely to mark this distinction, writing in 1922 that "the likelihood that any parameter (or set of parameters) should have any assigned value (or set of values) is proportional to the probability that if this were so, the totality of observations should be that observed."[^fisher1922][^likelihood_wiki]

For independent and identically distributed (i.i.d.) samples the joint density factorises, so the likelihood is a product:

```
L(θ; x) = ∏_{i=1}^{n} p(x_i; θ)
```

In practice one almost always works with the **log-likelihood** ℓ(θ; x) = log L(θ; x), which converts the product into a sum:

```
ℓ(θ; x) = ∑_{i=1}^{n} log p(x_i; θ)
```

The logarithm is monotone, so any maximiser of L is also a maximiser of ℓ. Working in log space is computationally essential because individual densities can be vanishingly small in high dimensions, and a product of n such densities will underflow to zero in floating-point arithmetic for even modest n. The **maximum likelihood estimator** is then

```
θ̂_MLE = argmax_{θ ∈ Θ} ℓ(θ; x)
```

When ℓ is differentiable, the estimator satisfies the **likelihood equations** ∂ℓ/∂θ = 0 evaluated at θ̂. The vector of partial derivatives s(θ; x) = ∇_θ ℓ(θ; x) is called the **score function**, and its expected value is zero at the true parameter under standard regularity conditions.[^fisher_info_wiki]

## Worked examples

The canonical examples below illustrate why MLE so often produces the "obvious" empirical estimator, and where it gives more interesting answers such as ordinary least squares.

| Model | Parameter | MLE | Closed form? |
|---|---|---|---|
| Bernoulli(p) | p ∈ [0,1] | sample proportion of successes, k/n | yes |
| Categorical(π_1, ..., π_K) | class probabilities | empirical class frequencies, n_k / n | yes |
| Normal(μ, σ²) | μ | sample mean x̄ | yes |
| Normal(μ, σ²) | σ² | (1/n) ∑ (x_i − x̄)², biased downward | yes |
| Linear regression with Gaussian noise | weight vector w | ordinary least squares, (XᵀX)⁻¹Xᵀy | yes |
| [Logistic regression](/wiki/logistic_regression) | weight vector w | requires iterative optimisation (IRLS, gradient descent) | no |
| [Gaussian mixture model](/wiki/gaussian_mixture_model) | mixture weights, means, covariances | local optima found by EM | no |

### Bernoulli

For n i.i.d. Bernoulli(p) trials with k successes, the log-likelihood is ℓ(p) = k log p + (n − k) log(1 − p). Differentiating and setting the derivative to zero gives p̂ = k/n. The MLE of a Bernoulli probability is just the empirical proportion of successes.[^wasserman_ch9][^mle_wiki]

### Gaussian

For n i.i.d. samples x₁, ..., x_n from N(μ, σ²) the log-likelihood is

```
ℓ(μ, σ²) = −(n/2) log(2π) − (n/2) log σ² − (1/(2σ²)) ∑ (x_i − μ)²
```

Maximising over μ for fixed σ² yields the sample mean μ̂ = x̄. Substituting back and maximising over σ² yields σ̂²_MLE = (1/n) ∑ (x_i − x̄)². The mean estimator is unbiased; the variance estimator is biased downward by a factor (n − 1)/n, which is why the textbook "sample variance" with denominator n − 1 is preferred for unbiased estimation. The MLE is still consistent: the bias vanishes as n → ∞.[^bishop_prml][^mle_wiki]

### Categorical

For n i.i.d. samples drawn from a K-class categorical distribution with probabilities π = (π_1, ..., π_K), the MLE under the constraint ∑ π_k = 1 (using a Lagrange multiplier) is π̂_k = n_k / n, where n_k is the number of times class k was observed. Empirical frequencies maximise the categorical likelihood, which is why "counting" is the natural training rule for unsmoothed [naive Bayes](/wiki/naive_bayes) and n-gram language models.[^wasserman_ch9]

### Linear regression with Gaussian noise

If y_i = xᵢᵀw + ε_i with ε_i ∼ N(0, σ²) and the inputs xᵢ are treated as fixed, the conditional log-likelihood factorises as a sum of Gaussian log-densities, and maximising over w reduces to minimising ∑ (y_i − xᵢᵀw)². The MLE is the [ordinary least squares](/wiki/linear_regression) solution

```
ŵ_MLE = (XᵀX)⁻¹ Xᵀy
```

This identity is the formal reason that [squared loss](/wiki/squared_loss) shows up everywhere in regression: it is exactly the negative log-likelihood under a homoscedastic Gaussian noise model, up to constants that do not depend on w.[^goodfellow_ch5][^bishop_prml]

### Logistic regression

For binary [logistic regression](/wiki/logistic_regression) with σ(z) = 1/(1+e^(-z)) and labels y_i ∈ {0,1}, the log-likelihood is

```
ℓ(w) = ∑ [y_i log σ(xᵢᵀw) + (1 − y_i) log(1 − σ(xᵢᵀw))]
```

The negative of this expression is the binary [cross-entropy loss](/wiki/cross_entropy_loss), also called the [log loss](/wiki/log_loss). The likelihood equations have no closed-form solution because of the nonlinearity in σ, so the MLE is computed by iterative methods. The traditional choice is **iteratively reweighted least squares (IRLS)**, which is Newton's method applied to the GLM likelihood with Fisher scoring. Each Newton step solves a weighted least squares problem in a working response, and the algorithm converges to the unique global maximum because the log-likelihood is strictly concave in w.[^irls_wiki][^goodfellow_ch5]

The extension to more than two classes is [multi-class logistic regression](/wiki/multi-class_logistic_regression), trained by minimising categorical cross-entropy.

## What are the theoretical properties of MLE?

MLE has a tightly interlocking set of large-sample guarantees that, taken together, make it the canonical "good" estimator in classical statistics.

| Property | Statement | Source |
|---|---|---|
| Consistency | θ̂_n → θ_true in probability as n → ∞ | Cramér 1946 |
| Asymptotic normality | √n (θ̂_n − θ_true) → N(0, I(θ)⁻¹) | Cramér 1946 |
| Asymptotic efficiency | Variance attains the Cramér-Rao lower bound asymptotically | Rao 1945, Cramér 1946 |
| Equivariance / invariance | If g is a one-to-one function, then the MLE of g(θ) is g(θ̂) | Casella and Berger ch. 7 |
| Asymptotic equivalence with Bayes | Posterior concentrates as N(θ̂, I(θ)⁻¹/n) for any well-behaved prior | Bernstein-von Mises theorem |

### Consistency

Under Cramér's regularity conditions, the maximum likelihood estimator is **consistent**: θ̂_n converges in probability to the true parameter θ_true as the sample size n grows without bound.[^cramer1946][^stanford_lec16] Intuitively, the empirical log-likelihood (1/n) ∑ log p(x_i; θ) converges by the law of large numbers to its expectation E_{x ∼ p_true}[log p(x; θ)], which is uniquely maximised at θ_true when the model is correctly specified. The argmax of the empirical objective therefore converges to the argmax of the population objective.

### Asymptotic normality

Under the same regularity conditions,

```
√n (θ̂_n − θ_true) → N(0, I(θ_true)⁻¹)
```

in distribution, where I(θ) is the **Fisher information matrix**

```
I(θ) = E[(∇_θ log p(X; θ))(∇_θ log p(X; θ))ᵀ] = −E[∇²_θ log p(X; θ)]
```

The two definitions agree under the regularity conditions that allow exchange of differentiation and integration. Fisher information is, intuitively, the curvature of the log-likelihood at the truth: a sharply peaked likelihood means a small variance for the estimator.[^fisher_info_wiki][^mit_18443]

### Asymptotic efficiency and the Cramér-Rao bound

The **Cramér-Rao lower bound** states that for any unbiased estimator T(X) of θ,

```
Var(T) ≥ I(θ)⁻¹
```

The bound was derived independently by several authors in the 1940s, including Maurice Fréchet (1943), Georges Darmois, C. R. Rao in his 1945 paper "Information and the Accuracy Attainable in the Estimation of Statistical Parameters," and Harald Cramér in 1946, which is why it is sometimes called the Fréchet-Darmois-Cramér-Rao inequality.[^crlb_wiki] The MLE attains this bound asymptotically, in the sense that its asymptotic variance equals I(θ)⁻¹. No other consistent estimator can have lower asymptotic variance under the same model. This is what is meant by saying the MLE is **asymptotically efficient**.[^cramer1946][^stanford_stats200]

### Equivariance and invariance under reparameterisation

If θ̂ is the MLE of θ and g is a one-to-one transformation, then g(θ̂) is the MLE of g(θ). This is the **invariance property of the MLE**, due in its modern form to Zehna (1966) and stated as Theorem 7.2.10 in Casella and Berger's *Statistical Inference*.[^casella_berger] Bayesian estimators do not enjoy this property in general: the posterior mean of g(θ) is not g(posterior mean of θ).

### Bernstein-von Mises

The **Bernstein-von Mises theorem** states that under regularity conditions, the Bayesian posterior distribution converges in total variation distance to a Gaussian centred at the MLE with covariance given by the inverse Fisher information divided by n. As n → ∞, the prior is washed out, the posterior looks Gaussian, and the posterior mean and the MLE become indistinguishable to leading order. This is the deep reason that frequentist confidence intervals based on Fisher information and Bayesian credible intervals from "reasonable" priors agree in large samples.[^bvm_wiki]

## What regularity conditions does MLE require?

The asymptotic guarantees above hold under what are usually called **Cramér's regularity conditions**, after Harald Cramér's 1946 textbook *Mathematical Methods of Statistics*.[^cramer1946][^stanford_lec16] Common formulations require:

1. **Identifiability.** Different parameter values give different distributions: θ ≠ θ' implies p(·; θ) ≠ p(·; θ').
2. **Common support.** The support of p(x; θ) does not depend on θ. (Distributions like Uniform(0, θ) violate this and require separate treatment; the MLE is still consistent there but is not asymptotically normal in the usual sense.)
3. **Interior point.** The true parameter θ_true lies in the interior of the parameter space Θ, so the score equation can vanish there.
4. **Smoothness and dominated derivatives.** The log-density log p(x; θ) is twice continuously differentiable in θ for almost every x, and the partial derivatives are dominated by an integrable function so that differentiation under the integral sign is justified.
5. **Positive definite information.** The Fisher information matrix I(θ_true) is finite and positive definite.

When these conditions fail, MLE can still be defined but its behaviour can be pathological. A famous boundary case is estimating θ in Uniform(0, θ): the MLE is the sample maximum, which converges at rate 1/n rather than the usual 1/√n, and its limiting distribution is exponential rather than Gaussian.

## How is MLE related to KL divergence?

Maximising the log-likelihood is, up to a constant that does not depend on θ, equivalent to **minimising the [Kullback-Leibler divergence](/wiki/kl_divergence) from the empirical data distribution to the model**. Let p̂_data denote the empirical distribution that puts mass 1/n on each observed sample. Then

```
D_KL(p̂_data ‖ p_θ) = E_{x ∼ p̂_data}[log p̂_data(x)] − E_{x ∼ p̂_data}[log p_θ(x)]
```

The first term does not depend on θ, so minimising the KL divergence is the same as maximising the second term, which is exactly (1/n) times the log-likelihood. Goodfellow, Bengio and Courville call this the unifying view of MLE: "One way to interpret maximum likelihood estimation is to view it as minimizing the dissimilarity between the empirical distribution defined by the training set and the model distribution, with the degree of dissimilarity between the two measured by the KL divergence."[^goodfellow_ch5]

An important corollary applies when the model is **misspecified**, meaning the true data-generating distribution g does not lie in the parametric family {p_θ}. Then the MLE converges to the **pseudo-true parameter** θ* that minimises D_KL(g ‖ p_θ), the KL projection of the truth onto the model. This is the basis of **quasi-maximum likelihood estimation**, formalised by Halbert White in his 1982 paper *Maximum Likelihood Estimation of Misspecified Models*. Standard inference no longer applies under misspecification because Var(θ̂) is not I(θ)⁻¹/n, and one has to use the so-called **sandwich variance estimator** instead.[^white1982]

## How is MLE related to cross-entropy?

For a categorical model with predicted probabilities p_θ(y | x), the negative log-likelihood of a single labelled example (x, y) is

```
−log p_θ(y | x) = H(δ_y, p_θ(· | x))
```

where δ_y is the one-hot distribution on the true label and H denotes [cross-entropy](/wiki/cross-entropy). Summing over the dataset gives

```
−ℓ(θ) = ∑_{i=1}^{n} H(δ_{y_i}, p_θ(· | x_i))
```

This is exactly the **cross-entropy loss** used to train classification models. Every time a deep neural network is trained on a classification task with [log loss](/wiki/log_loss), it is being trained by maximum likelihood under a categorical likelihood; "cross-entropy" and "negative log-likelihood" are different names for the same objective.[^goodfellow_ch5] The fact that virtually all modern classifiers, from logistic regression to ImageNet CNNs to GPT-style transformers, are trained by minimising cross-entropy is just the statement that they are trained by maximum likelihood.

## How is MLE related to mean squared error?

For a regression model y = f_θ(x) + ε with ε ∼ N(0, σ²), the negative log-likelihood of a single example is

```
−log p_θ(y | x) = (1/(2σ²))(y − f_θ(x))² + (1/2) log(2πσ²)
```

Summing over the dataset and discarding terms that do not depend on θ, the MLE of θ is the minimiser of ∑ (y_i − f_θ(x_i))², which is the **mean [squared error](/wiki/squared_loss)**. So MSE is MLE under a Gaussian noise assumption with constant variance. This explains why the textbook regression loss is squared error and why robust regression methods (Huber loss, [L1 loss](/wiki/l1_loss)) implicitly assume non-Gaussian noise distributions.[^goodfellow_ch5]

## What numerical methods solve for the MLE?

For a handful of "nice" models, the likelihood equations can be solved in closed form. The Bernoulli, multinomial, Gaussian and ordinary linear regression cases above are the main examples. For most other models, the maximum has to be found numerically.

| Method | Typical use case |
|---|---|
| Closed-form solution | Exponential family with sufficient statistics, Gaussian, OLS regression |
| Newton-Raphson | Smooth concave log-likelihoods, small to medium parameter dimension |
| Fisher scoring | Generalised linear models; replaces observed Hessian with expected Fisher information |
| Iteratively reweighted least squares (IRLS) | [Logistic regression](/wiki/logistic_regression) and other GLMs |
| Quasi-Newton (BFGS, L-BFGS) | Medium-dimensional smooth problems where Hessian is expensive |
| Expectation-Maximisation (EM) | Latent-variable and missing-data models, [Gaussian mixture models](/wiki/gaussian_mixture_model), HMMs |
| [Stochastic gradient descent](/wiki/stochastic_gradient_descent_sgd) | Massive datasets and high-dimensional models, including deep neural networks |

The **[Expectation-Maximisation algorithm](/wiki/expectation_maximization)** of Dempster, Laird and Rubin (1977) is the standard iterative method for MLE in models with latent variables. Each EM iteration alternates between an E-step that computes the expected complete-data log-likelihood given current parameters and an M-step that maximises this expected log-likelihood. The algorithm is guaranteed to monotonically increase the observed-data likelihood at every iteration, although it can converge to local optima.[^dempster1977]

For very large parameter spaces, second-order methods become infeasible, and the dominant tool is [stochastic gradient descent](/wiki/stochastic_gradient_descent_sgd) on the negative log-likelihood, optionally with [mini-batch](/wiki/mini-batch_stochastic_gradient_descent) updates and adaptive learning rates such as Adam. Modern deep learning is, almost without exception, MLE in disguise: the [loss function](/wiki/loss_function) is some flavour of negative log-likelihood, and the [training loss](/wiki/training_loss) curve is monitored to verify that ℓ is increasing.

## How does MLE differ from MAP and Bayesian inference?

MLE returns a single point estimate, the mode of the likelihood. **Bayesian inference** instead returns the full posterior distribution

```
p(θ | x) ∝ p(x | θ) p(θ)
```

where p(θ) is the prior. The mode of the posterior, called the **maximum a posteriori (MAP)** estimate, equals the MLE when the prior is flat. Adding a Gaussian prior on θ corresponds to ridge ([L2](/wiki/l2_loss)) regularisation; adding a Laplace prior corresponds to lasso ([L1](/wiki/l1_loss)) regularisation. So regularised MLE is a special case of MAP estimation, and MAP estimation is a special case of full [Bayesian inference](/wiki/bayesian_inference) that throws away everything except the posterior mode.[^goodfellow_ch5]

| Estimator | Definition | Output | Uses prior? | Uncertainty quantification |
|---|---|---|---|---|
| MLE | argmax_θ p(x ; θ) | point estimate | no | only via asymptotic Fisher information |
| MAP | argmax_θ p(x ; θ) p(θ) | point estimate | yes | only via Laplace approximation around the mode |
| Bayesian posterior | p(θ | x) ∝ p(x ; θ) p(θ) | full distribution over θ | yes | full posterior, exact credible intervals |

For large n, the Bernstein-von Mises theorem implies that the posterior under any "sensible" prior concentrates on the MLE with width 1/√n, so MLE, MAP and the posterior mean all become equivalent. For small n, or when the prior carries genuine information, the three estimators can give noticeably different answers. Bayesian inference is more honest about uncertainty but harder to compute. MLE is cheap and asymptotically efficient.

## What are the limitations and pitfalls of MLE?

MLE is the default for good reason, but it has well-known failure modes that practitioners should keep in mind.

**Finite-sample bias.** The MLE is consistent but not unbiased in general. The Gaussian variance estimator with denominator n is the textbook example: it is biased downward by a factor (n − 1)/n. Bias corrections of order 1/n exist (jackknife, Barndorff-Nielsen's modified profile likelihood) but require extra work.[^mle_wiki]

**Sensitivity to model misspecification.** If the assumed family does not contain the true distribution, the MLE converges to the KL projection of the truth onto the family, not to anything that need correspond to a quantity of scientific interest. White's quasi-MLE framework gives correct standard errors via the sandwich estimator but cannot fix a fundamentally wrong model.[^white1982]

**Sensitivity to outliers.** Because the log-density goes to −∞ near zero, a single grossly outlying observation can drive the likelihood arbitrarily low. The Gaussian MLE of the mean is the sample mean, which has zero breakdown point. Robust alternatives (M-estimators, trimmed likelihoods) trade efficiency for resistance to outliers.

**Multiple local optima.** When the log-likelihood is non-concave, as in [Gaussian mixture models](/wiki/gaussian_mixture_model), neural networks, and most latent-variable models, the optimisation surface has many local optima, and the EM algorithm or gradient descent can converge to any of them. Initialisation matters, and there is no general guarantee that the global maximum will be found.

**Overfitting in high dimensions.** When the parameter dimension is comparable to or larger than the sample size, the MLE will fit noise and generalise poorly. This is the entire reason ridge and lasso regression exist: they are MAP estimates with Gaussian or Laplace priors that shrink coefficients toward zero, regularising the unregularised MLE.

**Numerical underflow.** Multiplying many small probabilities underflows to zero in floating-point arithmetic. Always work in log space, summing log-probabilities rather than multiplying probabilities. Standard implementations use the log-sum-exp trick for numerically stable computation of normalised likelihoods.

**Boundary and unbounded likelihoods.** Some likelihoods are unbounded above. A two-component Gaussian mixture, for instance, can drive its likelihood to +∞ by centring one component on a single data point and shrinking its variance to zero. The naive MLE does not exist, and one has to use either constrained optimisation, a prior that penalises tiny variances, or the so-called penalised likelihood approach.

**Calibration of high-capacity models.** A deep network trained to convergence by maximum likelihood on cross-entropy will typically produce overconfident predictions on the training set. **Label smoothing**, introduced in Szegedy et al.'s 2016 *Rethinking the Inception Architecture for Computer Vision*, replaces the one-hot target with a softened distribution, equivalent to adding a KL divergence to the uniform distribution to the negative log-likelihood objective. It is a deliberate move *away* from the unregularised MLE, and it consistently improves calibration and generalisation.[^szegedy2016]

## Is MLE used to train large language models?

Yes. MLE is not just a textbook procedure; it is the explicit training principle behind most of modern probabilistic ML, including the largest neural networks ever trained.

- **Classical statistics**: linear and generalised linear models, survival analysis, mixed-effects models, factor analysis.
- **Latent-variable models**: Gaussian mixture models, hidden Markov models, latent Dirichlet allocation. The standard fitting routine is the EM algorithm, which is MLE for incomplete data.[^dempster1977]
- **Neural network classifiers**: every model trained on cross-entropy, from logistic regression to ResNets to GPT-style transformers, is trained by MLE under a categorical likelihood. The cross-entropy [loss](/wiki/loss) curve is the negative log-likelihood, normalised by batch size.
- **Neural network regressors**: every model trained on mean squared error is trained by MLE under a homoscedastic Gaussian noise model.
- **Autoregressive language models**: an n-token model factorises p(x_1, ..., x_n) = ∏_t p(x_t | x_<t), and training maximises ∑_t log p(x_t | x_<t). Next-token prediction is just MLE of an autoregressive model. [GPT-3](/wiki/gpt_3), an autoregressive [large language model](/wiki/large_language_model) with 175 billion parameters trained on roughly 300 billion tokens, is fit this way; the original 2020 paper reports the standard autoregressive cross-entropy objective and notes that "all models were trained for a total of 300 billion tokens."[^brown2020]
- **[Normalising flows](/wiki/variational_autoencoder)**: invertible neural networks trained by exact MLE on continuous data via the change-of-variables formula.
- **Variational autoencoders**: the [variational autoencoder](/wiki/variational_autoencoder) is trained by maximising a lower bound on the log-likelihood (the ELBO), which is MLE plus a KL regularisation term.
- **Reinforcement learning fine-tuning**: methods such as [DPO](/wiki/direct_preference_optimization_dpo) can be cast as maximum likelihood on a preference dataset.

What MLE does not train are GANs (which use a [minimax loss](/wiki/minimax_loss) instead of a likelihood), Wasserstein models (which use a [Wasserstein loss](/wiki/wasserstein_loss) instead of KL), and hinge-loss SVMs (which use [hinge loss](/wiki/hinge_loss) instead of log loss). Each alternative is motivated by a specific failure mode of MLE: mode collapse, gradient saturation, or sensitivity to misspecification.

## Practical considerations

A short checklist for using MLE in practice. Always optimise the log-likelihood, not the likelihood itself, because floating-point underflow is otherwise certain. Check identifiability before optimising; an unidentifiable parameter produces a flat ridge in the likelihood and no unique answer. Use a second-order method when feasible: Newton-Raphson and IRLS converge in few iterations for well-behaved problems, and the Hessian at the optimum is the observed Fisher information, which gives standard errors essentially for free. For large n, switch to [stochastic gradient descent](/wiki/stochastic_gradient_descent_sgd), since computing the full-batch log-likelihood is wasteful when the empirical loss is a sum of n terms. Regularise: pure MLE in high dimensions overfits, so add a prior (i.e. switch to MAP), use early stopping, weight decay, or label smoothing. Diagnose misspecification: if model-based and sandwich standard errors disagree, the model is probably wrong. For non-concave problems, use multiple random restarts and report sensitivity to initialisation.

## See also

[Bayes' theorem](/wiki/bayes_theorem), [Bayesian inference](/wiki/bayesian_inference), [Bayesian neural network](/wiki/bayesian_neural_network), [convex optimization](/wiki/convex_optimization), [estimator](/wiki/estimator), [focal loss](/wiki/focal_loss), [Gaussian process](/wiki/gaussian_process), [loss curve](/wiki/loss_curve), [loss function](/wiki/loss_function), [loss surface](/wiki/loss_surface), [naive Bayes](/wiki/naive_bayes), [training loss](/wiki/training_loss), [validation loss](/wiki/validation_loss).

## References

[^fisher1922]: Fisher, R. A. (1922). *On the Mathematical Foundations of Theoretical Statistics*. Philosophical Transactions of the Royal Society of London, Series A, 222, 309-368. [royalsocietypublishing.org/doi/10.1098/rsta.1922.0009](https://royalsocietypublishing.org/doi/10.1098/rsta.1922.0009)
[^aldrich1997]: Aldrich, J. (1997). R. A. Fisher and the Making of Maximum Likelihood 1912-1922. *Statistical Science*, 12(3), 162-176. [projecteuclid.org/journals/statistical-science/volume-12/issue-3/RA-Fisher-and-the-making-of-maximum-likelihood-1912-1922/10.1214/ss/1030037906.full](https://projecteuclid.org/journals/statistical-science/volume-12/issue-3/RA-Fisher-and-the-making-of-maximum-likelihood-1912-1922/10.1214/ss/1030037906.full)
[^cramer1946]: Cramér, H. (1946). *Mathematical Methods of Statistics*. Princeton University Press. (Original statement of the regularity conditions and the Cramér-Rao bound for MLE.)
[^crlb_wiki]: Wikipedia. *Cramér-Rao bound*. [en.wikipedia.org/wiki/Cram%C3%A9r%E2%80%93Rao_bound](https://en.wikipedia.org/wiki/Cram%C3%A9r%E2%80%93Rao_bound)
[^dempster1977]: Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum Likelihood from Incomplete Data via the EM Algorithm. *Journal of the Royal Statistical Society, Series B*, 39(1), 1-38.
[^white1982]: White, H. (1982). Maximum Likelihood Estimation of Misspecified Models. *Econometrica*, 50(1), 1-25.
[^szegedy2016]: Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. (2016). Rethinking the Inception Architecture for Computer Vision. *CVPR 2016*. [cv-foundation.org PDF](https://www.cv-foundation.org/openaccess/content_cvpr_2016/papers/Szegedy_Rethinking_the_Inception_CVPR_2016_paper.pdf)
[^brown2020]: Brown, T. B., et al. (2020). *Language Models are Few-Shot Learners*. NeurIPS 2020 / arXiv:2005.14165. (GPT-3: a 175-billion-parameter autoregressive language model trained on 300 billion tokens.) [arxiv.org/abs/2005.14165](https://arxiv.org/abs/2005.14165)
[^goodfellow_ch5]: Goodfellow, I., Bengio, Y., and Courville, A. (2016). *Deep Learning*, Chapter 5.5 "Maximum Likelihood Estimation". MIT Press. [deeplearningbook.org/contents/ml.html](https://www.deeplearningbook.org/contents/ml.html)
[^bishop_prml]: Bishop, C. M. (2006). *Pattern Recognition and Machine Learning*, Section 1.2.5 "Curve fitting re-visited" and Section 2.3.4. Springer.
[^wasserman_ch9]: Wasserman, L. (2004). *All of Statistics: A Concise Course in Statistical Inference*, Chapter 9 "Parametric Inference". Springer.
[^casella_berger]: Casella, G., and Berger, R. L. (2002). *Statistical Inference*, Chapter 7 "Point Estimation", 2nd edition. Duxbury.
[^mle_wiki]: Wikipedia. *Maximum likelihood estimation*. [en.wikipedia.org/wiki/Maximum_likelihood_estimation](https://en.wikipedia.org/wiki/Maximum_likelihood_estimation)
[^likelihood_wiki]: Wikipedia. *Likelihood function*. [en.wikipedia.org/wiki/Likelihood_function](https://en.wikipedia.org/wiki/Likelihood_function)
[^fisher_info_wiki]: Wikipedia. *Fisher information*. [en.wikipedia.org/wiki/Fisher_information](https://en.wikipedia.org/wiki/Fisher_information)
[^bvm_wiki]: Wikipedia. *Bernstein-von Mises theorem*. [en.wikipedia.org/wiki/Bernstein%E2%80%93von_Mises_theorem](https://en.wikipedia.org/wiki/Bernstein%E2%80%93von_Mises_theorem)
[^irls_wiki]: Wikipedia. *Iteratively reweighted least squares*. [en.wikipedia.org/wiki/Iteratively_reweighted_least_squares](https://en.wikipedia.org/wiki/Iteratively_reweighted_least_squares)
[^mit_18443]: MIT OpenCourseWare 18.443. *Statistics for Applications, Lecture 3: Consistency, Asymptotic Normality, Fisher Information*. [ocw.mit.edu PDF](https://ocw.mit.edu/courses/18-443-statistics-for-applications-fall-2006/03b407da8a94b3fe22d987453807ca46_lecture3.pdf)
[^stanford_stats200]: Stanford STATS 200, Lecture 15. *Fisher Information and the Cramer-Rao Bound*. [web.stanford.edu PDF](https://web.stanford.edu/class/stats200/Lecture15.pdf)
[^stanford_lec16]: Stanford STATS 200, Lecture 16. *MLE under Model Misspecification*. [web.stanford.edu PDF](https://web.stanford.edu/class/archive/stats/stats200/stats200.1172/Lecture16.pdf)

