The Expectation-Maximization (EM) algorithm is an iterative optimization method for finding maximum likelihood or maximum a posteriori (MAP) estimates of parameters in statistical models that involve latent (unobserved) variables or incomplete data. The algorithm alternates between two steps: an expectation step (E-step) that computes the expected value of the log-likelihood given the current parameter estimates and the observed data, and a maximization step (M-step) that updates the parameters to maximize this expected log-likelihood. EM is widely used across machine learning, statistics, natural language processing, computer vision, and computational biology, powering methods such as Gaussian mixture models, hidden Markov models, and topic models.
Imagine you have a bag of colored marbles, but the lights are dim and you can't see the colors clearly. You want to figure out how many marbles of each color there are. Here is what you do: first, you make your best guess about which color each marble is (that is the E-step). Then, based on those guesses, you recalculate how many of each color you think there are overall (that is the M-step). You repeat this process over and over, and each time your guesses get a little better. Eventually, your guesses stop changing much, and you have a good estimate of the true color counts. The EM algorithm works the same way: it guesses the hidden information, updates its model, and repeats until the answer stabilizes.
The EM algorithm was formally introduced and named by Arthur Dempster, Nan Laird, and Donald Rubin in their 1977 paper "Maximum Likelihood from Incomplete Data via the EM Algorithm," published in the Journal of the Royal Statistical Society, Series B. As of 2021, this paper had accumulated over 64,000 citations according to Google Scholar, making it one of the most cited papers in all of statistics.
However, the underlying ideas predated the 1977 paper by several decades. Cedric Smith had used a similar gene-counting method for estimating allele frequencies in genetics. H.O. Hartley published a related approach in 1958, and Hartley and Hocking expanded on it in 1971. Rolf Sundberg, working with Per Martin-Löf and Anders Martin-Löf, provided detailed treatments of EM-type methods for exponential family distributions in the early 1970s.
What Dempster, Laird, and Rubin accomplished in 1977 was to unify these scattered approaches into a single general framework and sketch a convergence analysis. However, their convergence proof contained flaws. C.F. Jeff Wu published a corrected and more rigorous convergence analysis in 1983 in The Annals of Statistics, establishing the algorithm's convergence properties beyond exponential families.
Another notable contribution came from the Baum-Welch algorithm, developed by Leonard Baum, Ted Petrie, George Soules, and Norman Weiss in a 1970 paper. The Baum-Welch algorithm is a special case of EM applied to hidden Markov models. The forward-backward procedure used in its E-step became foundational in speech recognition and sequence modeling.
The EM algorithm addresses the problem of maximum likelihood estimation when the data is incomplete or when the model contains latent variables. The setup involves three components: the observed data X, the latent (unobserved) variables Z, and the model parameters θ.
The complete-data likelihood is p(X, Z | θ), but since Z is unobserved, we can only work with the marginal (incomplete-data) likelihood:
p(X | θ) = Σ_Z p(X, Z | θ) (discrete case)
or
p(X | θ) = ∫ p(X, Z | θ) dZ (continuous case)
Direct maximization of the marginal likelihood is often intractable because the summation or integral over Z couples the parameters in complicated ways. EM sidesteps this by iteratively optimizing a more tractable surrogate.
In the E-step, given the current parameter estimate θ^(t), compute the expected value of the complete-data log-likelihood with respect to the conditional distribution of the latent variables given the observed data:
Q(θ | θ^(t)) = E_{Z|X,θ^(t)} [log p(X, Z | θ)]
This amounts to computing the posterior distribution p(Z | X, θ^(t)) and using it to average the complete-data log-likelihood over all possible values of the latent variables.
In the M-step, find the parameter values that maximize the Q-function:
θ^(t+1) = arg max_θ Q(θ | θ^(t))
For many models (especially those in the exponential family), this maximization has a closed-form solution, making the M-step computationally straightforward.
The E-step and M-step alternate until convergence. A common convergence criterion is that the change in log-likelihood or in the parameters falls below a specified threshold:
|log p(X | θ^(t+1)) - log p(X | θ^(t))| < ε
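The iteration and stopping rule can be sketched with a toy model: estimating the mean of a Gaussian with known variance when some entries of the sample are missing at random, so the missing values play the role of Z. The function name and the use of `None` for missing entries are illustrative choices of this sketch.

```python
import math

def em_missing_mean(x, sigma2=1.0, tol=1e-10, max_iter=1000):
    """EM estimate of the mean of a N(mu, sigma2) sample in which some
    entries are missing (represented here as None). Illustrative sketch."""
    obs = [v for v in x if v is not None]
    n_obs, n_mis, n = len(obs), len(x) - len(obs), len(x)
    mu = 0.0                      # initial guess
    prev_ll = -math.inf
    for _ in range(max_iter):
        # E-step: each missing value is replaced by its conditional
        # expectation given the current parameters, which is just mu.
        filled_sum = sum(obs) + n_mis * mu
        # M-step: maximize the expected complete-data log-likelihood.
        mu = filled_sum / n
        # Convergence check on the observed-data log-likelihood.
        ll = sum(-0.5 * math.log(2 * math.pi * sigma2)
                 - (v - mu) ** 2 / (2 * sigma2) for v in obs)
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    return mu
```

In this toy model each iteration shrinks the error in μ by the factor n_mis/n, a simple instance of the "fraction of missing information" that governs EM's convergence rate (discussed below).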
The mathematical justification of EM relies on a decomposition of the log-likelihood. For any distribution q(Z) over the latent variables:
log p(X | θ) = L(q, θ) + D_KL(q || p(Z | X, θ))
where L(q, θ) = Σ_Z q(Z) log [p(X, Z | θ) / q(Z)] is the evidence lower bound (ELBO) and D_KL(q || p(Z | X, θ)) is the Kullback-Leibler divergence between q and the true posterior.
Since D_KL ≥ 0 (with equality if and only if q(Z) = p(Z | X, θ)), the ELBO is always a lower bound on the log-likelihood:
log p(X | θ) ≥ L(q, θ)
The ELBO can also be derived through Jensen's inequality. Because the logarithm is a concave function:
log p(X | θ) = log Σ_Z p(X, Z | θ) = log Σ_Z q(Z) [p(X, Z | θ) / q(Z)] ≥ Σ_Z q(Z) log [p(X, Z | θ) / q(Z)]
This inequality becomes an equality when p(X, Z | θ) / q(Z) is constant with respect to Z, which happens precisely when q(Z) = p(Z | X, θ).
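The bound and its tightness condition can be checked numerically on a toy model with a single binary latent variable; the joint probabilities below are made up for illustration.

```python
import math

# Hypothetical joint probabilities p(X, Z | theta) for Z = 0, 1.
p_joint = [0.12, 0.28]
log_px = math.log(sum(p_joint))        # log marginal likelihood log p(X | theta)

def elbo(q):
    """ELBO for a distribution q = [q0, q1] over the binary latent Z."""
    return sum(qz * math.log(pj / qz)
               for qz, pj in zip(q, p_joint) if qz > 0)

# Any q gives a lower bound on log p(X | theta)...
assert elbo([0.5, 0.5]) <= log_px
# ...and the bound is tight exactly at the posterior q(Z) = p(Z | X, theta).
posterior = [pj / sum(p_joint) for pj in p_joint]
assert abs(elbo(posterior) - log_px) < 1e-12
```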
Neal and Hinton (1998) showed that EM can be understood as coordinate ascent on the ELBO F(q, θ) = L(q, θ): the E-step maximizes F over q with θ fixed (by setting q(Z) = p(Z | X, θ)), and the M-step maximizes F over θ with q fixed.
This viewpoint connects EM directly to variational inference and provides a clean framework for understanding convergence and generalizations.
The EM algorithm has several well-established convergence properties.
The most fundamental property of EM is that each iteration is guaranteed to increase (or leave unchanged) the observed-data log-likelihood:
log p(X | θ^(t+1)) ≥ log p(X | θ^(t))
This follows from the ELBO decomposition. In the E-step, setting q = p(Z | X, θ^(t)) makes the ELBO equal to the log-likelihood. In the M-step, maximizing the ELBO over θ produces a new θ^(t+1) with a higher (or equal) ELBO. Since the log-likelihood is always at least as large as the ELBO, the log-likelihood at θ^(t+1) is at least as large as at θ^(t).
Wu (1983) proved that under regularity conditions, the EM algorithm converges to a stationary point of the likelihood function. Specifically, if Q(θ | θ') is continuous in both arguments, then every limit point of an EM sequence is a stationary point of the likelihood, and the sequence of likelihood values converges monotonically to the likelihood value at such a point.
EM is not guaranteed to find the global maximum of the likelihood. It converges to a local maximum or saddle point, depending on the initialization. In practice, multiple random restarts are commonly used to mitigate this issue.
The convergence rate of EM is linear (first-order), governed by the fraction of missing information. Dempster, Laird, and Rubin (1977) showed that the rate of convergence near a fixed point is determined by the largest eigenvalue of the matrix:
J = I_oc^{-1} I_m
where I_oc is the expected complete-data information matrix and I_m is the missing-data information matrix; the eigenvalues of J measure the fraction of missing information. When the fraction of missing information is large, the convergence rate is slow. When it is small, convergence is fast.
Meng and Rubin (1991) further developed this connection in their SEM (Supplemented EM) algorithm, using the EM convergence rate to compute asymptotic variance-covariance matrices.
The Gaussian mixture model (GMM) is perhaps the most common application of the EM algorithm and serves as a standard pedagogical example.
A GMM models data as arising from a mixture of K Gaussian distributions:
p(x | θ) = Σ_{k=1}^{K} π_k N(x | μ_k, Σ_k)
where π_k are the mixing coefficients (Σ π_k = 1), and μ_k and Σ_k are the mean and covariance of the k-th component. The latent variable z_i indicates which component generated data point x_i.
In the E-step, compute the posterior probability (responsibility) that component k generated data point x_i:
r_{ik} = π_k N(x_i | μ_k, Σ_k) / Σ_{j=1}^{K} π_j N(x_i | μ_j, Σ_j)
This is a direct application of Bayes' theorem.
In the M-step, update the parameters using the responsibilities as soft assignments:
| Parameter | Update formula |
|---|---|
| Mixing coefficient π_k | π_k = (1/N) Σ_{i=1}^{N} r_{ik} |
| Mean μ_k | μ_k = Σ_{i=1}^{N} r_{ik} x_i / Σ_{i=1}^{N} r_{ik} |
| Covariance Σ_k | Σ_k = Σ_{i=1}^{N} r_{ik} (x_i - μ_k)(x_i - μ_k)^T / Σ_{i=1}^{N} r_{ik} |
These closed-form updates make the M-step computationally efficient. The algorithm alternates between recomputing responsibilities and updating the parameters until the log-likelihood converges.
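As a concrete sketch, the E-step and M-step updates above can be combined into a complete EM loop for a one-dimensional mixture. This is illustrative rather than a reference implementation: the function name, the quantile-based initialization, and the small `reg` term added to the variances (to guard against the degenerate collapse discussed later) are all choices of this sketch.

```python
import numpy as np

def em_gmm_1d(x, K=2, n_iter=200, reg=1e-6):
    """EM for a one-dimensional K-component Gaussian mixture (sketch)."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    pi = np.full(K, 1.0 / K)                       # mixing coefficients
    mu = np.quantile(x, (np.arange(K) + 0.5) / K)  # spread-out initial means
    var = np.full(K, x.var())                      # initial variances
    for _ in range(n_iter):
        # E-step: responsibilities r[i, k] = p(z_i = k | x_i, theta),
        # computed in log space for numerical stability.
        log_pdf = (-0.5 * np.log(2 * np.pi * var)
                   - (x[:, None] - mu) ** 2 / (2 * var))
        log_r = np.log(pi) + log_pdf
        log_r -= log_r.max(axis=1, keepdims=True)
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: the closed-form updates from the table above.
        Nk = r.sum(axis=0)
        pi = Nk / N
        mu = (r * x[:, None]).sum(axis=0) / Nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / Nk + reg
    return pi, mu, var
```

On well-separated bimodal data (e.g. samples from N(-5, 1) and N(5, 1)), the recovered means settle near the two cluster centers.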
K-means clustering can be viewed as a limiting case of EM for GMMs. When the covariance matrices are constrained to be σ²I and σ → 0, the soft assignments r_{ik} become hard assignments (0 or 1), and the EM algorithm reduces to the K-means algorithm.
The EM algorithm is used in a wide range of fields and applications.
| Application domain | Specific use | Latent variables |
|---|---|---|
| Clustering | Gaussian mixture models | Cluster assignments |
| Speech recognition | Baum-Welch algorithm for HMMs | Hidden states in acoustic models |
| Natural language processing | Inside-outside algorithm for PCFGs | Parse trees |
| Topic modeling | Latent Dirichlet Allocation | Topic assignments per word |
| Medical imaging | PET, SPECT, CT reconstruction | Emission/attenuation maps |
| Genetics and bioinformatics | Allele frequency estimation, motif finding | Population membership, motif positions |
| Recommender systems | Matrix factorization with missing entries | Unobserved ratings |
| Psychometrics | Item response theory | Latent ability parameters |
| Signal processing | Source separation, denoising | Source signals |
| Finance | Portfolio modeling with missing data | Unobserved market factors |
One of the original motivations for EM was handling missing data in statistical analyses. When data entries are missing at random, the EM algorithm can estimate model parameters by treating the missing values as latent variables. In the E-step, missing values are imputed using their expected values given the observed data and current parameter estimates. In the M-step, parameters are re-estimated using the completed data. This approach preserves the relationships between variables, unlike simpler methods such as mean imputation.
The Baum-Welch algorithm, developed in 1970, is the EM algorithm applied to hidden Markov models. The E-step uses the forward-backward algorithm to compute the posterior probabilities of hidden states given the observed sequence. The M-step updates the transition probabilities, emission probabilities, and initial state distribution. This algorithm was foundational in speech recognition systems throughout the 1980s and 1990s and remains relevant in computational biology for tasks like gene finding and protein structure prediction.
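The E-step described here can be sketched for a small discrete HMM. The scaled forward-backward recursions below compute the per-step state posteriors and the observed-data log-likelihood; the variable names are illustrative, and the M-step re-estimation is omitted for brevity.

```python
import numpy as np

def forward_backward(A, B, pi0, obs):
    """Posterior state probabilities p(z_t | observations) for a discrete HMM.
    A: transition matrix, B: emission matrix (states x symbols),
    pi0: initial state distribution, obs: observed symbol indices."""
    T, S = len(obs), len(pi0)
    alpha = np.zeros((T, S))          # scaled forward messages
    c = np.zeros(T)                   # per-step scaling factors
    alpha[0] = pi0 * B[:, obs[0]]
    c[0] = alpha[0].sum()
    alpha[0] /= c[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        c[t] = alpha[t].sum()
        alpha[t] /= c[t]
    beta = np.ones((T, S))            # scaled backward messages
    for t in range(T - 2, -1, -1):
        beta[t] = (A @ (B[:, obs[t + 1]] * beta[t + 1])) / c[t + 1]
    gamma = alpha * beta              # state posteriors p(z_t | obs)
    log_lik = np.log(c).sum()         # observed-data log-likelihood
    return gamma, log_lik
```

The scaling factors c[t] keep the recursions in a safe numerical range over long sequences, and their log-sum recovers the log-likelihood needed for convergence monitoring.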
In Latent Dirichlet Allocation (LDA) and related topic models, the EM algorithm (or variational approximations to it) is used to learn topic-word distributions and document-topic distributions from text corpora. The latent variables are the per-word topic assignments, and the observed data is the word counts in each document.
Several variants of the EM algorithm have been developed to address its limitations or adapt it to specific settings.
The generalized EM algorithm relaxes the requirement that the M-step fully maximizes the Q-function. Instead, the M-step only needs to find a θ^(t+1) such that Q(θ^(t+1) | θ^(t)) ≥ Q(θ^(t) | θ^(t)). This is useful when full maximization is computationally expensive. Wu (1983) showed that the convergence guarantees of EM also apply to GEM.
Proposed by Meng and Rubin (1993), the expectation-conditional maximization (ECM) algorithm replaces the M-step with a sequence of conditional maximization (CM) steps, each of which maximizes the Q-function over a subset of the parameters while holding the others fixed. This is especially useful when the joint maximization over all parameters is difficult but conditional maximizations are tractable. ECM retains the monotonic convergence property of standard EM.
Liu and Rubin (1994) extended ECM to create the ECME algorithm, which allows some CM-steps to maximize the actual observed-data likelihood rather than the Q-function, often achieving faster convergence while preserving monotonic likelihood increase.
When the E-step is analytically intractable (i.e., computing the posterior p(Z | X, θ) in closed form is not possible), Monte Carlo EM (MCEM) uses Monte Carlo sampling to approximate the expectation. Samples are drawn from p(Z | X, θ^(t)), and the Q-function is approximated by the sample average. Wei and Tanner (1990) introduced this approach, and it has been widely used in complex Bayesian models.
The stochastic EM algorithm replaces the E-step with a stochastic simulation: instead of computing the full posterior expectation, a single realization of Z is drawn from p(Z | X, θ^(t)), and the M-step proceeds using this simulated complete data. SEM can be viewed as MCEM with a Monte Carlo sample size of one. While individual iterations are noisier, SEM can help escape local optima.
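A hypothetical sketch of one stochastic-EM iteration for a one-dimensional Gaussian mixture, contrasting the sampled hard assignments with the soft responsibilities of standard EM (it assumes every component receives at least one sampled point):

```python
import numpy as np

def sem_gmm_1d_step(x, pi, mu, var, rng):
    """One stochastic-EM iteration for a 1-D Gaussian mixture: draw a single
    realization of each z_i from its posterior, then run the M-step on the
    resulting hard assignments. Illustrative sketch."""
    # Posterior responsibilities, as in standard EM.
    pdf = (np.exp(-0.5 * (x[:, None] - mu) ** 2 / var)
           / np.sqrt(2 * np.pi * var))
    r = pi * pdf
    r /= r.sum(axis=1, keepdims=True)
    # Stochastic E-step: sample z_i ~ Categorical(r_i) instead of averaging.
    K = len(pi)
    z = np.array([rng.choice(K, p=ri) for ri in r])
    # M-step on the simulated complete data (assumes no component is empty);
    # a small constant keeps the variances away from zero.
    new_pi = np.array([(z == k).mean() for k in range(K)])
    new_mu = np.array([x[z == k].mean() for k in range(K)])
    new_var = np.array([x[z == k].var() + 1e-6 for k in range(K)])
    return new_pi, new_mu, new_var
```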
Delyon, Lavielle, and Moulines (1999) proposed the stochastic approximation EM (SAEM) algorithm, which replaces the E-step with a stochastic approximation that computes a weighted average of the current simulation-based estimate and the previous approximations. This approach draws on the Robbins-Monro stochastic approximation framework and often converges more reliably than MCEM.
When the exact posterior p(Z | X, θ) is intractable, variational EM replaces the E-step with a variational approximation. Instead of computing the exact posterior, a simpler distribution q(Z) from a restricted family is chosen to minimize D_KL(q || p(Z | X, θ)). This approach is closely related to variational inference and is used in models like LDA where exact inference is not feasible.
Online EM processes data points one at a time (or in mini-batches) rather than using the entire dataset in each iteration. Cappé and Moulines (2009) developed a framework for online EM that is suited to large-scale datasets and streaming data. The algorithm maintains sufficient statistics that are incrementally updated as new data arrives.
Liu, Rubin, and Wu (1998) proposed parameter-expanded EM (PX-EM), which adds auxiliary parameters to create an expanded complete-data model. The expanded model leads to faster convergence by reducing the fraction of missing information while preserving the simplicity of each iteration.
| Variant | Key modification | When to use |
|---|---|---|
| GEM | Partial M-step (increase Q, don't maximize) | Full maximization is expensive |
| ECM | Sequential conditional maximization | Joint maximization over all parameters is hard |
| MCEM | Monte Carlo approximation of E-step | Posterior expectations are intractable |
| SEM | Single stochastic sample in E-step | Need to escape local optima |
| SAEM | Stochastic approximation of E-step | Complex models with intractable E-step |
| Variational EM | Approximate posterior in restricted family | Exact posterior is intractable |
| Online EM | Incremental updates from streaming data | Large datasets or streaming data |
| PX-EM | Expanded parameter space | Slow convergence due to high missing information |
The connection between EM and variational inference was formalized by Neal and Hinton (1998). Both methods can be understood as optimizing the ELBO, but they differ in what is treated as fixed and what is optimized.
In standard EM, the E-step computes the exact posterior p(Z | X, θ), which sets the KL divergence to zero. The M-step then optimizes θ. In variational inference, the posterior is approximated by a distribution from a restricted family, so the KL divergence is never exactly zero. Variational inference also typically treats θ as a random variable with its own posterior, rather than a point estimate.
Variational Bayes (VB) can be seen as an extension of EM from maximum likelihood estimation to fully Bayesian inference. In VB, both the latent variables Z and the parameters θ are treated as random variables, and the goal is to approximate the joint posterior p(Z, θ | X). The ELBO serves the same role, but it is optimized over a larger variational family.
The key relationships are:
| Method | What is optimized | Treatment of θ | Posterior approximation |
|---|---|---|---|
| Standard EM | Point estimate of θ | Fixed (optimized) | Exact |
| Variational EM | Point estimate of θ | Fixed (optimized) | Approximate (restricted family) |
| Variational Bayes | Approximate posterior of θ | Random variable | Approximate (restricted family) |
| MCMC | Samples from posterior | Random variable | Exact (asymptotically) |
Despite its wide use, the EM algorithm has several notable limitations.
Convergence to local optima. EM is only guaranteed to converge to a local maximum or saddle point, not the global maximum. For multimodal likelihoods, the solution depends on initialization. In Gaussian mixture models, for example, poor initialization can lead to degenerate solutions where one component collapses to a single data point with zero variance.
Slow convergence. The convergence rate is linear (first-order), which can be slow compared to second-order methods like Newton-Raphson. The rate is governed by the fraction of missing information: when the latent variables carry a large proportion of the total information, convergence can be very slow, particularly in high-dimensional settings.
Intractable E-step. For complex models, computing the exact posterior p(Z | X, θ) may be analytically or computationally intractable. This requires resorting to approximate methods such as variational EM or MCEM, which introduce their own approximation errors.
No standard error estimates. Standard EM does not directly produce standard errors or confidence intervals for the parameter estimates. Computing the observed information matrix requires additional work, such as the SEM algorithm of Meng and Rubin (1991) or Louis' method (1982).
Sensitivity to initialization. The final solution can depend heavily on the initial parameter values. Common strategies to address this include running EM multiple times from different random initializations, using K-means or hierarchical clustering to initialize mixture models, and applying deterministic annealing or simulated annealing techniques.
Singularities in the likelihood. In Gaussian mixture models, the likelihood function is unbounded: a component can achieve infinite likelihood by collapsing onto a single data point. Practical implementations use regularization (such as adding a small value to the diagonal of covariance matrices) to prevent this.
| Property | EM algorithm | Gradient descent | Newton-Raphson |
|---|---|---|---|
| Convergence rate | Linear (first-order) | Linear (first-order) | Quadratic (second-order) |
| Requires gradients | No | Yes | Yes (plus Hessian) |
| Monotonic likelihood increase | Yes | Not guaranteed (depends on step size) | Not guaranteed |
| Handles latent variables naturally | Yes | No (requires marginalization) | No (requires marginalization) |
| Memory requirements | Low to moderate | Low | High (stores Hessian) |
| Risk of divergence | Very low | Moderate (bad step size) | Moderate (non-positive-definite Hessian) |
Several practical issues arise when implementing the EM algorithm.
Initialization. Good initialization is important because EM converges to local optima. For mixture models, common approaches include random initialization, K-means initialization, and initialization from a hierarchical clustering.
Convergence monitoring. The log-likelihood should be computed at each iteration and monitored for convergence. Any decrease in the log-likelihood indicates an implementation error. Convergence is typically declared when the relative change in log-likelihood falls below a threshold (e.g., 10^{-6}).
Numerical stability. The E-step often involves computing products of many small probabilities. Using log-probabilities and the log-sum-exp trick can prevent numerical underflow.
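A minimal illustration of the trick: subtracting the maximum before exponentiating keeps at least one term at exp(0) = 1, so the sum never underflows to zero.

```python
import math

def log_sum_exp(log_vals):
    """Numerically stable log(sum(exp(v) for v in log_vals))."""
    m = max(log_vals)
    return m + math.log(sum(math.exp(v - m) for v in log_vals))

# Naive summation underflows: exp(-1000) is 0.0 in double precision,
# so the naive result is log(0). The stable version recovers
# log(e^-1000 + e^-1001) = -1000 + log(1 + e^-1).
naive_sum = sum(math.exp(v) for v in [-1000.0, -1001.0])  # 0.0 (underflow)
stable = log_sum_exp([-1000.0, -1001.0])
```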
Degeneracy prevention. In GMMs, components can degenerate if they collapse onto single data points. Adding a small regularization term to the covariance matrices (e.g., a small multiple of the identity matrix) prevents this.