The Expectation-Maximization (EM) algorithm is an iterative optimization method for finding maximum likelihood or maximum a posteriori (MAP) estimates of parameters in statistical models that involve latent (unobserved) variables or incomplete data. The algorithm alternates between two steps: an expectation step (E-step) that computes the expected value of the log-likelihood given the current parameter estimates and the observed data, and a maximization step (M-step) that updates the parameters to maximize this expected log-likelihood. EM is widely used across machine learning, statistics, natural language processing, computer vision, and computational biology, powering methods such as Gaussian mixture models, hidden Markov models, and topic models.
Imagine you have a bag of colored marbles, but the lights are dim and you can't see the colors clearly. You want to figure out how many marbles of each color there are. Here is what you do: first, you make your best guess about which color each marble is (that is the E-step). Then, based on those guesses, you recalculate how many of each color you think there are overall (that is the M-step). You repeat this process over and over, and each time your guesses get a little better. Eventually, your guesses stop changing much, and you have a good estimate of the true color counts. The EM algorithm works the same way: it guesses the hidden information, updates its model, and repeats until the answer stabilizes.
The EM algorithm was formally introduced and named by Arthur Dempster, Nan Laird, and Donald Rubin in their 1977 paper "Maximum Likelihood from Incomplete Data via the EM Algorithm," published in the Journal of the Royal Statistical Society, Series B. As of 2021, this paper had accumulated over 64,000 citations according to Google Scholar, making it one of the most cited papers in all of statistics.
However, the underlying ideas predated the 1977 paper by several decades. Cedric Smith had used a similar gene-counting method for estimating allele frequencies in genetics. H.O. Hartley published a related approach in 1958, and Hartley and Hocking expanded on it in 1971. Rolf Sundberg, working with Per Martin-Löf and Anders Martin-Löf, provided detailed treatments of EM-type methods for exponential family distributions in the early 1970s.
What Dempster, Laird, and Rubin accomplished in 1977 was to unify these scattered approaches into a single general framework and sketch a convergence analysis. However, their convergence proof contained flaws. C.F. Jeff Wu published a corrected and more rigorous convergence analysis in 1983 in The Annals of Statistics, establishing the algorithm's convergence properties beyond exponential families.
Another notable contribution came from the Baum-Welch algorithm, developed by Leonard Baum, Ted Petrie, George Soules, and Norman Weiss in a 1970 paper. The Baum-Welch algorithm is a special case of EM applied to hidden Markov models. The forward-backward procedure used in its E-step became foundational in speech recognition and sequence modeling.
The EM algorithm addresses the problem of maximum likelihood estimation when the data is incomplete or when the model contains latent variables. The setup involves three components: the observed data X, the latent (unobserved) variables Z, and the model parameters θ.
The complete-data likelihood is p(X, Z | θ), but since Z is unobserved, we can only work with the marginal (incomplete-data) likelihood:
p(X | θ) = Σ_Z p(X, Z | θ) (discrete case)
or
p(X | θ) = ∫ p(X, Z | θ) dZ (continuous case)
Direct maximization of the marginal likelihood is often intractable because the summation or integral over Z couples the parameters in complicated ways. EM sidesteps this by iteratively optimizing a more tractable surrogate.
In the E-step, given the current parameter estimate θ^(t), compute the expected value of the complete-data log-likelihood with respect to the conditional distribution of the latent variables given the observed data:
Q(θ | θ^(t)) = E_{Z|X,θ^(t)} [log p(X, Z | θ)]
This amounts to computing the posterior distribution p(Z | X, θ^(t)) and using it to average the complete-data log-likelihood over all possible values of the latent variables.
In the M-step, find the parameter values that maximize the Q-function:
θ^(t+1) = arg max_θ Q(θ | θ^(t))
For many models (especially those in the exponential family), this maximization has a closed-form solution, making the M-step computationally straightforward.
The E-step and M-step alternate until convergence. A common convergence criterion is that the change in log-likelihood or in the parameters falls below a specified threshold:
|log p(X | θ^(t+1)) - log p(X | θ^(t))| < ε
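The iteration and stopping rule can be sketched with a toy model: estimating the mean of a Gaussian with known variance when some entries of the sample are missing at random, so the missing values play the role of Z. The function name and the use of `None` for missing entries are illustrative choices of this sketch.

```python
import math

def em_missing_mean(x, sigma2=1.0, tol=1e-10, max_iter=1000):
    """EM estimate of the mean of a N(mu, sigma2) sample in which some
    entries are missing (represented here as None). Illustrative sketch."""
    obs = [v for v in x if v is not None]
    n_obs, n_mis, n = len(obs), len(x) - len(obs), len(x)
    mu = 0.0                      # initial guess
    prev_ll = -math.inf
    for _ in range(max_iter):
        # E-step: each missing value is replaced by its conditional
        # expectation given the current parameters, which is just mu.
        filled_sum = sum(obs) + n_mis * mu
        # M-step: maximize the expected complete-data log-likelihood.
        mu = filled_sum / n
        # Convergence check on the observed-data log-likelihood.
        ll = sum(-0.5 * math.log(2 * math.pi * sigma2)
                 - (v - mu) ** 2 / (2 * sigma2) for v in obs)
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    return mu
```

In this toy model each iteration shrinks the error in μ by the factor n_mis/n, a simple instance of the "fraction of missing information" that governs EM's convergence rate (discussed below).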
The mathematical justification of EM relies on a decomposition of the log-likelihood. For any distribution q(Z) over the latent variables:
log p(X | θ) = L(q, θ) + D_KL(q || p(Z | X, θ))
where L(q, θ) = Σ_Z q(Z) log [p(X, Z | θ) / q(Z)] is the evidence lower bound (ELBO) and D_KL(q || p(Z | X, θ)) is the Kullback-Leibler divergence between q and the true posterior.
Since D_KL ≥ 0 (with equality if and only if q(Z) = p(Z | X, θ)), the ELBO is always a lower bound on the log-likelihood:
log p(X | θ) ≥ L(q, θ)
The ELBO can also be derived through Jensen's inequality. Because the logarithm is a concave function:
log p(X | θ) = log Σ_Z p(X, Z | θ) = log Σ_Z q(Z) [p(X, Z | θ) / q(Z)] ≥ Σ_Z q(Z) log [p(X, Z | θ) / q(Z)]
This inequality becomes an equality when p(X, Z | θ) / q(Z) is constant with respect to Z, which happens precisely when q(Z) = p(Z | X, θ).
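The bound and its tightness condition can be checked numerically on a toy model with a single binary latent variable; the joint probabilities below are made up for illustration.

```python
import math

# Hypothetical joint probabilities p(X, Z | theta) for Z = 0, 1.
p_joint = [0.12, 0.28]
log_px = math.log(sum(p_joint))        # log marginal likelihood log p(X | theta)

def elbo(q):
    """ELBO for a distribution q = [q0, q1] over the binary latent Z."""
    return sum(qz * math.log(pj / qz)
               for qz, pj in zip(q, p_joint) if qz > 0)

# Any q gives a lower bound on log p(X | theta)...
assert elbo([0.5, 0.5]) <= log_px
# ...and the bound is tight exactly at the posterior q(Z) = p(Z | X, theta).
posterior = [pj / sum(p_joint) for pj in p_joint]
assert abs(elbo(posterior) - log_px) < 1e-12
```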
Neal and Hinton (1998) showed that EM can be understood as coordinate ascent on the ELBO F(q, θ) = L(q, θ): the E-step maximizes F over q with θ fixed (by setting q(Z) = p(Z | X, θ)), and the M-step maximizes F over θ with q fixed.
This viewpoint connects EM directly to variational inference and provides a clean framework for understanding convergence and generalizations.
The EM algorithm has several well-established convergence properties.
The most fundamental property of EM is that each iteration is guaranteed to increase (or leave unchanged) the observed-data log-likelihood:
log p(X | θ^(t+1)) ≥ log p(X | θ^(t))
This follows from the ELBO decomposition. In the E-step, setting q = p(Z | X, θ^(t)) makes the ELBO equal to the log-likelihood. In the M-step, maximizing the ELBO over θ produces a new θ^(t+1) with a higher (or equal) ELBO. Since the log-likelihood is always at least as large as the ELBO, the log-likelihood at θ^(t+1) is at least as large as at θ^(t).
Wu (1983) proved that under regularity conditions, the EM algorithm converges to a stationary point of the likelihood function. Specifically, if Q(θ | θ') is continuous in both arguments, then every limit point of an EM sequence is a stationary point of the likelihood, and the sequence of likelihood values converges monotonically to the likelihood value at such a point.
EM is not guaranteed to find the global maximum of the likelihood. It converges to a local maximum or saddle point, depending on the initialization. In practice, multiple random restarts are commonly used to mitigate this issue.
The convergence rate of EM is linear (first-order), governed by the fraction of missing information. Dempster, Laird, and Rubin (1977) showed that the rate of convergence near a fixed point is determined by the largest eigenvalue of the matrix:
J = I_oc^{-1} I_m
where I_oc is the expected complete-data information matrix and I_m is the missing-data information matrix; the eigenvalues of J measure the fraction of missing information. When the fraction of missing information is large, the convergence rate is slow. When it is small, convergence is fast.
Meng and Rubin (1991) further developed this connection in their SEM (Supplemented EM) algorithm, using the EM convergence rate to compute asymptotic variance-covariance matrices.
The Gaussian mixture model (GMM) is perhaps the most common application of the EM algorithm and serves as a standard pedagogical example.
A GMM models data as arising from a mixture of K Gaussian distributions:
p(x | θ) = Σ_{k=1}^{K} π_k N(x | μ_k, Σ_k)
where π_k are the mixing coefficients (Σ π_k = 1), and μ_k and Σ_k are the mean and covariance of the k-th component. The latent variable z_i indicates which component generated data point x_i.
In the E-step, compute the posterior probability (responsibility) that component k generated data point x_i:
r_{ik} = π_k N(x_i | μ_k, Σ_k) / Σ_{j=1}^{K} π_j N(x_i | μ_j, Σ_j)
This is a direct application of Bayes' theorem.
In the M-step, update the parameters using the responsibilities as soft assignments:
| Parameter | Update formula |
|---|---|
| Mixing coefficient π_k | π_k = (1/N) Σ_{i=1}^{N} r_{ik} |
| Mean μ_k | μ_k = Σ_{i=1}^{N} r_{ik} x_i / Σ_{i=1}^{N} r_{ik} |
| Covariance Σ_k | Σ_k = Σ_{i=1}^{N} r_{ik} (x_i - μ_k)(x_i - μ_k)^T / Σ_{i=1}^{N} r_{ik} |
These closed-form updates make the M-step computationally efficient. The algorithm alternates between recomputing responsibilities and updating the parameters until the log-likelihood converges.
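As a concrete sketch, the E-step and M-step updates above can be combined into a complete EM loop for a one-dimensional mixture. This is illustrative rather than a reference implementation: the function name, the quantile-based initialization, and the small `reg` term added to the variances (to guard against the degenerate collapse discussed later) are all choices of this sketch.

```python
import numpy as np

def em_gmm_1d(x, K=2, n_iter=200, reg=1e-6):
    """EM for a one-dimensional K-component Gaussian mixture (sketch)."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    pi = np.full(K, 1.0 / K)                       # mixing coefficients
    mu = np.quantile(x, (np.arange(K) + 0.5) / K)  # spread-out initial means
    var = np.full(K, x.var())                      # initial variances
    for _ in range(n_iter):
        # E-step: responsibilities r[i, k] = p(z_i = k | x_i, theta),
        # computed in log space for numerical stability.
        log_pdf = (-0.5 * np.log(2 * np.pi * var)
                   - (x[:, None] - mu) ** 2 / (2 * var))
        log_r = np.log(pi) + log_pdf
        log_r -= log_r.max(axis=1, keepdims=True)
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: the closed-form updates from the table above.
        Nk = r.sum(axis=0)
        pi = Nk / N
        mu = (r * x[:, None]).sum(axis=0) / Nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / Nk + reg
    return pi, mu, var
```

On well-separated bimodal data (e.g. samples from N(-5, 1) and N(5, 1)), the recovered means settle near the two cluster centers.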
K-means clustering can be viewed as a limiting case of EM for GMMs. When the covariance matrices are constrained to be σ²I and σ → 0, the soft assignments r_{ik} become hard assignments (0 or 1), and the EM algorithm reduces to the K-means algorithm.
The EM algorithm is used in a wide range of fields and applications.
| Application domain | Specific use | Latent variables |
|---|---|---|
| Clustering | Gaussian mixture models | Cluster assignments |
| Speech recognition | Baum-Welch algorithm for HMMs | Hidden states in acoustic models |
| Natural language processing | Inside-outside algorithm for PCFGs | Parse trees |
| Topic modeling | Latent Dirichlet Allocation | Topic assignments per word |
| Medical imaging | PET, SPECT, CT reconstruction | Emission/attenuation maps |
| Genetics and bioinformatics | Allele frequency estimation, motif finding | Population membership, motif positions |
| Recommender systems | Matrix factorization with missing entries | Unobserved ratings |
| Psychometrics | Item response theory | Latent ability parameters |
| Signal processing | Source separation, denoising | Source signals |
| Finance | Portfolio modeling with missing data | Unobserved market factors |
One of the original motivations for EM was handling missing data in statistical analyses. When data entries are missing at random, the EM algorithm can estimate model parameters by treating the missing values as latent variables. In the E-step, missing values are imputed using their expected values given the observed data and current parameter estimates. In the M-step, parameters are re-estimated using the completed data. This approach preserves the relationships between variables, unlike simpler methods such as mean imputation.
The Baum-Welch algorithm, developed in 1970, is the EM algorithm applied to hidden Markov models. The E-step uses the forward-backward algorithm to compute the posterior probabilities of hidden states given the observed sequence. The M-step updates the transition probabilities, emission probabilities, and initial state distribution. This algorithm was foundational in speech recognition systems throughout the 1980s and 1990s and remains relevant in computational biology for tasks like gene finding and protein structure prediction.
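The E-step described here can be sketched for a small discrete HMM. The scaled forward-backward recursions below compute the per-step state posteriors and the observed-data log-likelihood; the variable names are illustrative, and the M-step re-estimation is omitted for brevity.

```python
import numpy as np

def forward_backward(A, B, pi0, obs):
    """Posterior state probabilities p(z_t | observations) for a discrete HMM.
    A: transition matrix, B: emission matrix (states x symbols),
    pi0: initial state distribution, obs: observed symbol indices."""
    T, S = len(obs), len(pi0)
    alpha = np.zeros((T, S))          # scaled forward messages
    c = np.zeros(T)                   # per-step scaling factors
    alpha[0] = pi0 * B[:, obs[0]]
    c[0] = alpha[0].sum()
    alpha[0] /= c[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        c[t] = alpha[t].sum()
        alpha[t] /= c[t]
    beta = np.ones((T, S))            # scaled backward messages
    for t in range(T - 2, -1, -1):
        beta[t] = (A @ (B[:, obs[t + 1]] * beta[t + 1])) / c[t + 1]
    gamma = alpha * beta              # state posteriors p(z_t | obs)
    log_lik = np.log(c).sum()         # observed-data log-likelihood
    return gamma, log_lik
```

The scaling factors c[t] keep the recursions in a safe numerical range over long sequences, and their log-sum recovers the log-likelihood needed for convergence monitoring.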
In Latent Dirichlet Allocation (LDA) and related topic models, the EM algorithm (or variational approximations to it) is used to learn topic-word distributions and document-topic distributions from text corpora. The latent variables are the per-word topic assignments, and the observed data is the word counts in each document.
Several variants of the EM algorithm have been developed to address its limitations or adapt it to specific settings.
The generalized EM algorithm relaxes the requirement that the M-step fully maximizes the Q-function. Instead, the M-step only needs to find a θ^(t+1) such that Q(θ^(t+1) | θ^(t)) ≥ Q(θ^(t) | θ^(t)). This is useful when full maximization is computationally expensive. Wu (1983) showed that the convergence guarantees of EM also apply to GEM.
Proposed by Meng and Rubin (1993), the expectation-conditional maximization (ECM) algorithm replaces the M-step with a sequence of conditional maximization (CM) steps, each of which maximizes the Q-function over a subset of the parameters while holding the others fixed. This is especially useful when the joint maximization over all parameters is difficult but conditional maximizations are tractable. ECM retains the monotonic convergence property of standard EM.
Liu and Rubin (1994) extended ECM to create the ECME algorithm, which allows some CM-steps to maximize the actual observed-data likelihood rather than the Q-function, often achieving faster convergence while preserving monotonic likelihood increase.
When the E-step is analytically intractable (i.e., computing the posterior p(Z | X, θ) in closed form is not possible), Monte Carlo EM (MCEM) uses Monte Carlo sampling to approximate the expectation. Samples are drawn from p(Z | X, θ^(t)), and the Q-function is approximated by the sample average. Wei and Tanner (1990) introduced this approach, and it has been widely used in complex Bayesian models.
The stochastic EM algorithm replaces the E-step with a stochastic simulation: instead of computing the full posterior expectation, a single realization of Z is drawn from p(Z | X, θ^(t)), and the M-step proceeds using this simulated complete data. SEM can be viewed as MCEM with a Monte Carlo sample size of one. While individual iterations are noisier, SEM can help escape local optima.
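A hypothetical sketch of one stochastic-EM iteration for a one-dimensional Gaussian mixture, contrasting the sampled hard assignments with the soft responsibilities of standard EM (it assumes every component receives at least one sampled point):

```python
import numpy as np

def sem_gmm_1d_step(x, pi, mu, var, rng):
    """One stochastic-EM iteration for a 1-D Gaussian mixture: draw a single
    realization of each z_i from its posterior, then run the M-step on the
    resulting hard assignments. Illustrative sketch."""
    # Posterior responsibilities, as in standard EM.
    pdf = (np.exp(-0.5 * (x[:, None] - mu) ** 2 / var)
           / np.sqrt(2 * np.pi * var))
    r = pi * pdf
    r /= r.sum(axis=1, keepdims=True)
    # Stochastic E-step: sample z_i ~ Categorical(r_i) instead of averaging.
    K = len(pi)
    z = np.array([rng.choice(K, p=ri) for ri in r])
    # M-step on the simulated complete data (assumes no component is empty);
    # a small constant keeps the variances away from zero.
    new_pi = np.array([(z == k).mean() for k in range(K)])
    new_mu = np.array([x[z == k].mean() for k in range(K)])
    new_var = np.array([x[z == k].var() + 1e-6 for k in range(K)])
    return new_pi, new_mu, new_var
```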
Delyon, Lavielle, and Moulines (1999) proposed the stochastic approximation EM (SAEM) algorithm, which replaces the E-step with a stochastic approximation that computes a weighted average of the current simulation-based estimate and the previous approximations. This approach draws on the Robbins-Monro stochastic approximation framework and often converges more reliably than MCEM.
When the exact posterior p(Z | X, θ) is intractable, variational EM replaces the E-step with a variational approximation. Instead of computing the exact posterior, a simpler distribution q(Z) from a restricted family is chosen to minimize D_KL(q || p(Z | X, θ)). This approach is closely related to variational inference and is used in models like LDA where exact inference is not feasible.
Online EM processes data points one at a time (or in mini-batches) rather than using the entire dataset in each iteration. Cappé and Moulines (2009) developed a framework for online EM that is suited to large-scale datasets and streaming data. The algorithm maintains sufficient statistics that are incrementally updated as new data arrives.
Liu, Rubin, and Wu (1998) proposed parameter-expanded EM (PX-EM), which adds auxiliary parameters to create an expanded complete-data model. The expanded model leads to faster convergence by reducing the fraction of missing information while preserving the simplicity of each iteration.
| Variant | Key modification | When to use |
|---|---|---|
| GEM | Partial M-step (increase Q, don't maximize) | Full maximization is expensive |
| ECM | Sequential conditional maximization | Joint maximization over all parameters is hard |
| MCEM | Monte Carlo approximation of E-step | Posterior expectations are intractable |
| SEM | Single stochastic sample in E-step | Need to escape local optima |
| SAEM | Stochastic approximation of E-step | Complex models with intractable E-step |
| Variational EM | Approximate posterior in restricted family | Exact posterior is intractable |
| Online EM | Incremental updates from streaming data | Large datasets or streaming data |
| PX-EM | Expanded parameter space | Slow convergence due to high missing information |
The connection between EM and variational inference was formalized by Neal and Hinton (1998). Both methods can be understood as optimizing the ELBO, but they differ in what is treated as fixed and what is optimized.
In standard EM, the E-step computes the exact posterior p(Z | X, θ), which sets the KL divergence to zero. The M-step then optimizes θ. In variational inference, the posterior is approximated by a distribution from a restricted family, so the KL divergence is never exactly zero. Variational inference also typically treats θ as a random variable with its own posterior, rather than a point estimate.
Variational Bayes (VB) can be seen as an extension of EM from maximum likelihood estimation to fully Bayesian inference. In VB, both the latent variables Z and the parameters θ are treated as random variables, and the goal is to approximate the joint posterior p(Z, θ | X). The ELBO serves the same role, but it is optimized over a larger variational family.
The key relationships are:
| Method | What is optimized | Treatment of θ | Posterior approximation |
|---|---|---|---|
| Standard EM | Point estimate of θ | Fixed (optimized) | Exact |
| Variational EM | Point estimate of θ | Fixed (optimized) | Approximate (restricted family) |
| Variational Bayes | Approximate posterior of θ | Random variable | Approximate (restricted family) |
| MCMC | Samples from posterior | Random variable | Exact (asymptotically) |
Despite its wide use, the EM algorithm has several notable limitations.
Convergence to local optima. EM is only guaranteed to converge to a local maximum or saddle point, not the global maximum. For multimodal likelihoods, the solution depends on initialization. In Gaussian mixture models, for example, poor initialization can lead to degenerate solutions where one component collapses to a single data point with zero variance.
Slow convergence. The convergence rate is linear (first-order), which can be slow compared to second-order methods like Newton-Raphson. The rate is governed by the fraction of missing information: when the latent variables carry a large proportion of the total information, convergence can be very slow, particularly in high-dimensional settings.
Intractable E-step. For complex models, computing the exact posterior p(Z | X, θ) may be analytically or computationally intractable. This requires resorting to approximate methods such as variational EM or MCEM, which introduce their own approximation errors.
No standard error estimates. Standard EM does not directly produce standard errors or confidence intervals for the parameter estimates. Computing the observed information matrix requires additional work, such as the SEM algorithm of Meng and Rubin (1991) or Louis' method (1982).
Sensitivity to initialization. The final solution can depend heavily on the initial parameter values. Common strategies to address this include running EM multiple times from different random initializations, using K-means or hierarchical clustering to initialize mixture models, and applying deterministic annealing or simulated annealing techniques.
Singularities in the likelihood. In Gaussian mixture models, the likelihood function is unbounded: a component can achieve infinite likelihood by collapsing onto a single data point. Practical implementations use regularization (such as adding a small value to the diagonal of covariance matrices) to prevent this.
| Property | EM algorithm | Gradient descent | Newton-Raphson |
|---|---|---|---|
| Convergence rate | Linear (first-order) | Linear (first-order) | Quadratic (second-order) |
| Requires gradients | No | Yes | Yes (plus Hessian) |
| Monotonic likelihood increase | Yes | Not guaranteed (depends on step size) | Not guaranteed |
| Handles latent variables naturally | Yes | No (requires marginalization) | No (requires marginalization) |
| Memory requirements | Low to moderate | Low | High (stores Hessian) |
| Risk of divergence | Very low | Moderate (bad step size) | Moderate (non-positive-definite Hessian) |
Several practical issues arise when implementing the EM algorithm.
Initialization. Good initialization is important because EM converges to local optima. For mixture models, common approaches include random initialization, K-means initialization, and initialization from a hierarchical clustering.
Convergence monitoring. The log-likelihood should be computed at each iteration and monitored for convergence. Any decrease in the log-likelihood indicates an implementation error. Convergence is typically declared when the relative change in log-likelihood falls below a threshold (e.g., 10^{-6}).
Numerical stability. The E-step often involves computing products of many small probabilities. Using log-probabilities and the log-sum-exp trick can prevent numerical underflow.
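A minimal illustration of the trick: subtracting the maximum before exponentiating keeps at least one term at exp(0) = 1, so the sum never underflows to zero.

```python
import math

def log_sum_exp(log_vals):
    """Numerically stable log(sum(exp(v) for v in log_vals))."""
    m = max(log_vals)
    return m + math.log(sum(math.exp(v - m) for v in log_vals))

# Naive summation underflows: exp(-1000) is 0.0 in double precision,
# so the naive result is log(0). The stable version recovers
# log(e^-1000 + e^-1001) = -1000 + log(1 + e^-1).
naive_sum = sum(math.exp(v) for v in [-1000.0, -1001.0])  # 0.0 (underflow)
stable = log_sum_exp([-1000.0, -1001.0])
```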
Degeneracy prevention. In GMMs, components can degenerate if they collapse onto single data points. Adding a small regularization term to the covariance matrices (e.g., a small multiple of the identity matrix) prevents this.