See also: Bayesian inference, Monte Carlo methods, Probabilistic programming
Markov chain Monte Carlo (MCMC) is a class of algorithms that draw samples from a probability distribution by constructing a Markov chain whose stationary distribution matches the target distribution of interest. As the chain runs for a sufficient number of steps, the distribution of the generated samples converges to the target distribution, allowing practitioners to approximate expectations, marginal distributions, and credible intervals that would otherwise be intractable to compute analytically.
MCMC methods are most widely used in Bayesian inference, where the goal is to sample from a posterior distribution that is known only up to a normalizing constant. Because the posterior is proportional to the product of the likelihood and the prior, direct sampling is rarely feasible for models with more than a handful of parameters. MCMC sidesteps this difficulty by requiring only the ability to evaluate the unnormalized density at any given point, making it applicable to an enormous range of statistical models [1].
Since its introduction in the 1950s, MCMC has become one of the most widely used computational tools in statistics, machine learning, physics, computational biology, and many other scientific fields. The development of practical MCMC algorithms has been described as one of the ten most important algorithmic advances of the twentieth century [2].
Imagine you are blindfolded in a hilly park and you want to figure out which areas have the most grass (the "high probability" areas). You cannot see the whole park at once, but you can feel the grass under your feet right where you are standing.
Here is what you do: you take a step in a random direction. If you land on a spot with more grass than where you were, you stay there. If there is less grass, you might still move there, but only sometimes. Then you repeat, taking one random step at a time.
After many, many steps, you will have spent most of your time in the grassiest areas, even though you never saw the whole park. If someone asked "where is the most grass?", you could look back at everywhere you had been; the spots you visited most often would be the answer.
That is MCMC. The "Markov chain" part means each step depends only on where you are right now, not where you were before. The "Monte Carlo" part means you are using randomness to explore.
The roots of MCMC trace back to the Manhattan Project at Los Alamos National Laboratory. In the late 1940s, Stanislaw Ulam and John von Neumann developed the Monte Carlo method for simulating neutron transport in nuclear weapons. The name "Monte Carlo" was coined as a code name, referencing the Monte Carlo Casino in Monaco, reportedly because of Ulam's uncle's gambling habits [3].
Nicholas Metropolis arrived at Los Alamos in April 1943 as one of the original fifty scientists on the Manhattan Project. He led the group that designed and built the MANIAC I computer in 1952 and was deeply involved in early Monte Carlo computations, including rewiring the ENIAC computer in 1948 to perform nuclear core simulations [4].
In 1953, Metropolis, along with Arianna W. Rosenbluth, Marshall Rosenbluth, Augusta H. Teller, and Edward Teller, published the landmark paper "Equation of State Calculations by Fast Computing Machines." This paper introduced the Metropolis algorithm for sampling from the Boltzmann distribution in statistical mechanics simulations. The computations were performed on the MANIAC I computer at Los Alamos. Although the paper is credited to all five authors, later historical analysis suggests that Arianna Rosenbluth wrote the code and Marshall Rosenbluth designed the algorithm, while Metropolis provided access to the MANIAC I [1][5].
The original Metropolis algorithm assumed a symmetric proposal distribution. In 1970, W. Keith Hastings at the University of Toronto published a generalization that allowed asymmetric proposal distributions, introducing what is now called the Metropolis-Hastings algorithm. Hastings also introduced component-wise updating, a precursor to Gibbs sampling. Despite its significance, the paper received relatively little attention for nearly two decades [6].
In 1984, Stuart and Donald Geman introduced Gibbs sampling in the context of image restoration, naming it after the physicist Josiah Willard Gibbs. The method gained widespread attention in statistics after Alan Gelfand and Adrian Smith published their influential 1990 paper demonstrating that Gibbs sampling could be applied to a broad range of Bayesian problems. This triggered a revolution in Bayesian statistics, as researchers realized that MCMC provided a practical approach to problems that had been considered computationally intractable [7].
The 2000s and 2010s saw the development of more efficient MCMC algorithms. Radford Neal popularized Hamiltonian Monte Carlo (HMC) for statistical applications in 1996, building on work by Duane, Kennedy, Pendleton, and Roweth from 1987. In 2011, Matthew Hoffman and Andrew Gelman introduced the No-U-Turn Sampler (NUTS), which eliminated the need for hand-tuning HMC parameters. NUTS became the default sampler in the Stan probabilistic programming language, released in 2012 [8][9].
A Markov chain is a sequence of random variables X_0, X_1, X_2, ... where the distribution of X_{n+1} depends only on the current state X_n, not on the history of previous states. This property is called the Markov property. Formally:
P(X_{n+1} | X_n, X_{n-1}, ..., X_0) = P(X_{n+1} | X_n)
A distribution pi is called the stationary distribution (or invariant distribution) of a Markov chain if, once the chain reaches distribution pi, it remains in that distribution at all subsequent steps. For MCMC to work correctly, the chain must be:

- Irreducible: every state can be reached from every other state in a finite number of steps, so the chain can explore the full support of the target.
- Aperiodic: the chain does not cycle deterministically through a fixed sequence of states.
- Positive recurrent: the expected time to return to any state is finite, which guarantees that a proper stationary distribution exists.
When these conditions are satisfied, the ergodic theorem guarantees that the time average of any function along the chain converges to the expectation under the stationary distribution [1]:
(1/N) * sum_{n=1}^{N} f(X_n) -> E_pi[f(X)] as N -> infinity
A sufficient (but not necessary) condition for pi to be the stationary distribution is the detailed balance condition:
pi(x) * T(x -> x') = pi(x') * T(x' -> x)
where T(x -> x') is the transition probability from state x to state x'. This condition ensures that the probability flow between any two states is equal in both directions, which implies stationarity. Most MCMC algorithms are designed to satisfy detailed balance [1].
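The detailed balance condition is easy to verify numerically on a small discrete chain. The sketch below (an illustration, not from the source; the three-state target is arbitrary) builds a Metropolis-style transition matrix for a target pi, checks that the probability flow between states is symmetric, and iterates the chain to show that it settles into pi:

```python
import numpy as np

# Arbitrary target distribution over three states.
pi = np.array([0.2, 0.5, 0.3])

# Metropolis transition matrix with a uniform proposal over all states:
# move from x to x' with probability (1/3) * min(1, pi(x') / pi(x)).
n = len(pi)
T = np.zeros((n, n))
for x in range(n):
    for xp in range(n):
        if xp != x:
            T[x, xp] = (1.0 / n) * min(1.0, pi[xp] / pi[x])
    T[x, x] = 1.0 - T[x].sum()  # rejected proposals keep the chain at x

# Detailed balance: pi(x) * T(x -> x') == pi(x') * T(x' -> x).
flow = pi[:, None] * T
assert np.allclose(flow, flow.T)

# Stationarity: from any starting distribution, repeated transitions
# converge to pi.
dist = np.array([1.0, 0.0, 0.0])  # start deterministically in state 0
for _ in range(200):
    dist = dist @ T
print(dist)  # approaches [0.2, 0.5, 0.3]
```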
The Metropolis-Hastings (MH) algorithm is the most general and foundational MCMC method. It provides a framework for constructing Markov chains that converge to a specified target distribution pi(x) [5][6].
Given a current state x, each iteration proceeds as follows:

1. Propose a candidate x' by drawing from a proposal distribution q(x' | x).
2. Compute the acceptance probability:

alpha(x', x) = min(1, [pi(x') * q(x | x')] / [pi(x) * q(x' | x)])

3. Draw u ~ Uniform(0, 1). If u < alpha(x', x), accept the proposal and set the new state to x'; otherwise, remain at x.
The ratio pi(x') / pi(x) means that the algorithm only requires the target density up to a normalizing constant, since the constant cancels in the ratio. This property makes MH especially well suited to Bayesian inference, where the posterior is known only up to the marginal likelihood [5].
| Variant | Proposal distribution | Acceptance ratio | Key property |
|---|---|---|---|
| Metropolis (symmetric) | q(x' | x) = q(x | x'), e.g., Gaussian centered at x | min(1, pi(x') / pi(x)) | Simplest form; proposal ratio cancels |
| Random walk MH | x' = x + epsilon, epsilon ~ N(0, sigma^2 I) | Full MH ratio | Step size sigma requires tuning |
| Independent MH | q(x' | x) = q(x'), independent of current state | min(1, [pi(x') q(x)] / [pi(x) q(x')]) | Works well if proposal approximates target |
| Langevin-adjusted (MALA) | x' = x + (epsilon^2/2) nabla log pi(x) + epsilon z, z ~ N(0, I) | Full MH ratio with gradient correction | Uses gradient information for better proposals |
The efficiency of the MH algorithm depends heavily on the proposal distribution. If the step size is too small, the chain explores slowly (high acceptance rate but high autocorrelation). If the step size is too large, most proposals are rejected (low acceptance rate). Theoretical results suggest an optimal acceptance rate of approximately 23.4% for random walk Metropolis in high-dimensional problems with Gaussian targets, and approximately 57.4% for MALA [10].
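As a concrete sketch (the standard normal target, step size, and run length below are illustrative choices, not from the source), random walk Metropolis needs only the unnormalized log density:

```python
import numpy as np

rng = np.random.default_rng(0)

def log_target(x):
    # Unnormalized log density of a standard normal target.
    return -0.5 * x**2

def random_walk_metropolis(log_target, x0, n_samples, step_size):
    """Random walk Metropolis with a Gaussian proposal N(x, step_size^2)."""
    x = x0
    samples = np.empty(n_samples)
    n_accept = 0
    for i in range(n_samples):
        x_prop = x + step_size * rng.standard_normal()
        # Symmetric proposal: the q terms cancel, leaving pi(x') / pi(x).
        log_alpha = log_target(x_prop) - log_target(x)
        if np.log(rng.uniform()) < log_alpha:
            x = x_prop
            n_accept += 1
        samples[i] = x
    return samples, n_accept / n_samples

samples, accept_rate = random_walk_metropolis(
    log_target, x0=0.0, n_samples=50_000, step_size=2.4)
print(accept_rate)                    # roughly 0.4 for this step size
print(samples.mean(), samples.var())  # near 0 and 1
```

In one dimension, a step size of roughly 2.4 times the target's standard deviation gives an acceptance rate near the levels the theory recommends; in practice the step size is usually adapted during a warm-up phase.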
Gibbs sampling is a special case of the Metropolis-Hastings algorithm that is applicable when the joint distribution is difficult to sample from directly, but the full conditional distributions of each variable are known and easy to sample from [7].
For a vector of parameters theta = (theta_1, theta_2, ..., theta_d), each iteration updates every component in turn, drawing each from its full conditional distribution given the current values of all the others:

theta_1 ~ p(theta_1 | theta_2, theta_3, ..., theta_d)
theta_2 ~ p(theta_2 | theta_1, theta_3, ..., theta_d)
...
theta_d ~ p(theta_d | theta_1, theta_2, ..., theta_{d-1})
Each update draws from the exact full conditional distribution, so every proposal is accepted (the acceptance probability is always 1). This eliminates the need for tuning a proposal distribution.
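As an illustrative example (not from the source), a bivariate normal with correlation rho has univariate normal full conditionals, so one Gibbs sweep is just two normal draws:

```python
import numpy as np

rng = np.random.default_rng(1)

# Target: bivariate normal, zero means, unit variances, correlation rho.
# Full conditionals:
#   theta1 | theta2 ~ N(rho * theta2, 1 - rho^2)
#   theta2 | theta1 ~ N(rho * theta1, 1 - rho^2)
rho = 0.8
cond_sd = np.sqrt(1.0 - rho**2)

n_samples = 50_000
samples = np.empty((n_samples, 2))
theta1 = theta2 = 0.0
for i in range(n_samples):
    theta1 = rng.normal(rho * theta2, cond_sd)  # draw from p(theta1 | theta2)
    theta2 = rng.normal(rho * theta1, cond_sd)  # draw from p(theta2 | theta1)
    samples[i] = theta1, theta2

print(np.corrcoef(samples.T)[0, 1])  # close to rho = 0.8
```

Every draw is accepted, but with rho = 0.8 successive sweeps are strongly autocorrelated, a small-scale instance of the slow mixing under correlation noted in the table below.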
| Aspect | Description |
|---|---|
| No tuning required | Each step samples from an exact conditional, so there is no step size or proposal variance to set |
| 100% acceptance rate | Every drawn sample is accepted |
| Easy implementation | Straightforward when conditionals have standard forms (e.g., conjugate priors) |
| Slow mixing with correlations | When parameters are highly correlated, the component-wise updates cause slow exploration |
| Requires known conditionals | The full conditional distributions must be available in closed form or easy to sample from |
| Scales poorly in high dimensions | Updating one variable at a time becomes inefficient as dimensionality grows |
Gibbs sampling was the foundation of the BUGS (Bayesian inference Using Gibbs Sampling) software family, including WinBUGS, OpenBUGS, and JAGS (Just Another Gibbs Sampler). It remains widely used for models with conjugate structure, such as latent Dirichlet allocation and Gaussian mixture models [7].
Hamiltonian Monte Carlo (HMC), originally called hybrid Monte Carlo, uses concepts from Hamiltonian mechanics to generate proposals that can move long distances in parameter space while maintaining a high acceptance probability. It was introduced by Duane, Kennedy, Pendleton, and Roweth in 1987 for lattice quantum chromodynamics simulations, and later popularized for statistical applications by Radford Neal [8].
HMC augments the target distribution pi(theta) with auxiliary momentum variables p, defining a joint distribution through the Hamiltonian:
H(theta, p) = U(theta) + (1/2) p^T M^{-1} p
where U(theta) = -log pi(theta) is the "potential energy" and (1/2) p^T M^{-1} p is the "kinetic energy," with M being a mass matrix (typically the identity or a diagonal matrix).
Each iteration proceeds as follows:

1. Sample a fresh momentum p from N(0, M).
2. Simulate the Hamiltonian dynamics for L leapfrog steps of size epsilon, producing a proposed state (theta', p').
3. Accept the proposal with probability min(1, exp(H(theta, p) - H(theta', p'))); otherwise, retain the current theta.
The leapfrog integrator is symplectic, meaning it approximately preserves the Hamiltonian. This leads to high acceptance rates even for large moves through parameter space, dramatically reducing autocorrelation compared to random walk methods [8].
HMC avoids the random walk behavior that plagues the basic Metropolis algorithm in high dimensions. Where random walk Metropolis requires O(d^2) steps to produce an independent sample in d dimensions, HMC can achieve O(d^{5/4}) scaling under favorable conditions. This makes HMC far more practical for models with hundreds or thousands of parameters [8].
HMC requires setting two parameters: the step size epsilon and the number of leapfrog steps L. Poor choices lead to low acceptance rates (epsilon too large), slow exploration (epsilon too small), or wasted computation (L too large, causing the trajectory to loop back). Optimal tuning of these parameters has historically required extensive preliminary runs.
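A bare-bones HMC iteration for a standard normal target (all tuning values below are illustrative choices; M is taken to be the identity) shows the three ingredients: momentum refreshment, leapfrog integration, and a Metropolis correction:

```python
import numpy as np

rng = np.random.default_rng(2)

def U(theta):
    # Potential energy: -log pi(theta) for a standard normal, up to a constant.
    return 0.5 * theta**2

def grad_U(theta):
    return theta

def hmc_step(theta, eps, L):
    """One HMC iteration: refresh momentum, leapfrog, accept or reject."""
    p = rng.standard_normal()  # momentum ~ N(0, 1), i.e. M = I
    theta_new, p_new = theta, p
    # Leapfrog integration: half momentum step, alternating full steps,
    # final half momentum step.
    p_new -= 0.5 * eps * grad_U(theta_new)
    for _ in range(L - 1):
        theta_new += eps * p_new
        p_new -= eps * grad_U(theta_new)
    theta_new += eps * p_new
    p_new -= 0.5 * eps * grad_U(theta_new)
    # Metropolis correction for the leapfrog discretization error.
    H_old = U(theta) + 0.5 * p**2
    H_new = U(theta_new) + 0.5 * p_new**2
    if np.log(rng.uniform()) < H_old - H_new:
        return theta_new
    return theta

theta = 0.0
samples = np.empty(20_000)
for i in range(len(samples)):
    theta = hmc_step(theta, eps=0.2, L=10)
    samples[i] = theta

print(samples.mean(), samples.var())  # near 0 and 1
```

Because the leapfrog integrator nearly conserves H, almost every proposal is accepted here despite the trajectory moving a substantial distance each iteration.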
The No-U-Turn Sampler, introduced by Hoffman and Gelman and published in 2014, automates the tuning of HMC by adaptively selecting the number of leapfrog steps L during sampling. NUTS eliminates the most difficult tuning decision in HMC and is the default sampler in Stan [9].
NUTS builds a binary tree of leapfrog steps by doubling the trajectory length in both the forward and backward directions. At each doubling, it checks a "U-turn" condition:
(theta^+ - theta^-) . p^- < 0 or (theta^+ - theta^-) . p^+ < 0
where theta^+ and theta^- are the endpoints of the trajectory and p^+ and p^- are their momenta. When this condition is triggered, it indicates the trajectory is beginning to double back on itself, and further integration would be wasteful. A sample is then selected uniformly from the valid portion of the trajectory [9].
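The U-turn test itself is a pair of dot products. The helper below is a hypothetical illustration (the function name and example vectors are our own, not from the NUTS implementation):

```python
import numpy as np

def u_turn(theta_minus, theta_plus, p_minus, p_plus):
    """True when the NUTS stopping condition fires: the trajectory spanned
    by the backward endpoint theta_minus and the forward endpoint theta_plus
    has begun to double back on itself."""
    delta = theta_plus - theta_minus
    return np.dot(delta, p_minus) < 0 or np.dot(delta, p_plus) < 0

# Both endpoint momenta still point away from each other: keep integrating.
still_expanding = not u_turn(np.array([0.0, 0.0]), np.array([1.0, 1.0]),
                             np.array([0.5, 0.5]), np.array([0.5, 0.5]))

# The forward momentum now points back toward the other endpoint: stop.
turned = u_turn(np.array([0.0, 0.0]), np.array([1.0, 1.0]),
                np.array([0.5, 0.5]), np.array([-0.5, -0.5]))

print(still_expanding, turned)  # True True
```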
NUTS performs at least as efficiently as a well-tuned HMC implementation, and often more efficiently because it can adaptively vary the trajectory length across different regions of the parameter space. Recent theoretical work has established that NUTS is irreducible, aperiodic, and geometrically ergodic under conditions similar to those required for standard HMC [9].
Determining whether an MCMC chain has converged to its target distribution is one of the most challenging practical aspects of using these methods. Several diagnostic tools have been developed to help practitioners assess convergence [11].
The initial portion of an MCMC chain is typically discarded because the chain may not yet have reached its stationary distribution. This discarded portion is called the burn-in or warm-up period. The length of the burn-in depends on the starting point and the mixing properties of the chain. Common practice is to discard the first half of the chain, though this is a rough heuristic.
The R-hat statistic, introduced by Andrew Gelman and Donald Rubin in 1992, is one of the most widely used convergence diagnostics. It works by running multiple chains from different starting points and comparing the within-chain variance (W) to the between-chain variance (B) [11].
The potential scale reduction factor is computed as:
R-hat = sqrt([(n-1)/n * W + (1/n) * B] / W)
where n is the number of samples per chain and B is defined as n times the variance of the per-chain means. If the chains have converged to the same distribution, R-hat should be close to 1. The standard recommendation is that R-hat should be below 1.01, a threshold tightened from the original 1.1 [11].
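A minimal implementation of this formula (with B computed as n times the variance of the per-chain means, the convention used above) might look like the following sketch:

```python
import numpy as np

def r_hat(chains):
    """Gelman-Rubin R-hat for an array of shape (m, n): m chains, n draws."""
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    W = chains.var(axis=1, ddof=1).mean()  # within-chain variance
    B = n * chain_means.var(ddof=1)        # between-chain variance
    var_hat = (n - 1) / n * W + B / n
    return np.sqrt(var_hat / W)

rng = np.random.default_rng(3)

# Four chains drawn from the same N(0, 1): R-hat should be very close to 1.
good = rng.standard_normal((4, 2_000))
print(r_hat(good))  # ~1.00

# Shift one chain to a different location: R-hat rises well above 1.01.
bad = good.copy()
bad[0] += 3.0
print(r_hat(bad))
```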
Vehtari, Gelman, Simpson, Carpenter, and Bürkner (2021) proposed an improved rank-based R-hat that addresses failures of the traditional version when dealing with heavy-tailed distributions or chains with varying variance [11].
Because MCMC samples are autocorrelated, N successive samples contain less information than N independent samples. The effective sample size quantifies how many independent samples the chain is equivalent to:
ESS = N / (1 + 2 * sum_{k=1}^{infinity} rho_k)
where N is the total number of samples and rho_k is the autocorrelation at lag k. A low ESS relative to N indicates that the chain is mixing slowly and more iterations are needed. As a rule of thumb, an ESS of at least 400 per parameter is recommended for reliable posterior summaries [11].
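A simple ESS estimator truncates the autocorrelation sum at the first negative term, a crude stand-in for more careful truncation rules such as Geyer's initial sequence estimators (this sketch and its test chains are illustrative):

```python
import numpy as np

def effective_sample_size(x, max_lag=1_000):
    """ESS = N / (1 + 2 * sum of autocorrelations), with the sum truncated
    at the first negative empirical autocorrelation."""
    n = len(x)
    x = x - x.mean()
    var = np.dot(x, x) / n
    tau = 1.0  # estimate of the integrated autocorrelation time
    for k in range(1, min(max_lag, n)):
        rho_k = np.dot(x[:-k], x[k:]) / (n * var)
        if rho_k < 0:
            break
        tau += 2.0 * rho_k
    return n / tau

rng = np.random.default_rng(4)

# Independent draws: ESS is close to the nominal sample size.
iid = rng.standard_normal(10_000)
ess_iid = effective_sample_size(iid)

# A sticky AR(1) chain (lag-1 correlation 0.95): ESS is far smaller than N.
ar = np.empty(10_000)
ar[0] = 0.0
for i in range(1, len(ar)):
    ar[i] = 0.95 * ar[i - 1] + rng.standard_normal()
ess_ar = effective_sample_size(ar)

print(ess_iid, ess_ar)
```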
| Diagnostic | What it checks | Method |
|---|---|---|
| Geweke | Stationarity | Compares means of early and late portions of a single chain using a z-test |
| Heidelberger-Welch | Stationarity and precision | Uses spectral analysis and Cramer-von Mises statistics |
| Raftery-Lewis | Required run length | Estimates burn-in and total iterations needed for a desired quantile precision |
| Trace plots | Visual mixing | Plots parameter values against iteration number; healthy chains look like "fuzzy caterpillars" |
| Autocorrelation plots | Mixing speed | Plots autocorrelation versus lag; should decay rapidly to zero |
| Divergent transitions (HMC) | Numerical issues | Flags iterations where the leapfrog integrator diverged, indicating problematic posterior geometry |
MCMC methods are used across a remarkably wide range of scientific and engineering disciplines.
MCMC is the primary computational engine for Bayesian statistics. It enables calculation of posterior distributions, moments, credible intervals, and model comparison metrics (such as Bayes factors) for models ranging from simple linear regression to hierarchical models with thousands of parameters. In machine learning, MCMC has been applied to Bayesian neural networks, Gaussian processes, latent Dirichlet allocation, and mixture models [1].
The original application of the Metropolis algorithm was simulating the Boltzmann distribution for systems of interacting particles. MCMC remains central to computational physics for studying phase transitions, spin systems (such as the Ising model), lattice gauge theories, and molecular dynamics [5].
MCMC is widely used in phylogenetics (programs like MrBayes use MCMC to estimate phylogenetic trees), population genetics, protein structure prediction, and gene regulatory network inference. In epidemiology, MCMC-based models played a significant role in modeling disease transmission during the COVID-19 pandemic [1].
| Field | Example applications |
|---|---|
| Astronomy and cosmology | Parameter estimation for cosmological models, gravitational wave analysis |
| Finance | Option pricing, risk modeling, stochastic volatility models |
| Ecology | Species distribution modeling, population dynamics, mark-recapture analysis |
| Climate science | Uncertainty quantification in climate models, paleoclimate reconstruction |
| Signal processing | Bayesian source separation, channel estimation |
| Natural language processing | Topic models, probabilistic grammars |
Variational inference (VI) is an alternative to MCMC for approximate Bayesian inference. Rather than sampling from the posterior, VI frames inference as an optimization problem: it searches within a family of tractable distributions for the one that best approximates the posterior, typically by minimizing the Kullback-Leibler divergence [12].
| Criterion | MCMC | Variational inference |
|---|---|---|
| Approach | Sampling | Optimization |
| Asymptotic exactness | Yes, converges to true posterior | No, constrained by variational family |
| Speed | Slower, especially for large datasets | Faster, supports mini-batching and GPU acceleration |
| Posterior correlations | Captured naturally | Often missed (mean-field assumption) |
| Multimodality | Can capture multiple modes (with caveats) | Typically collapses to one mode |
| Variance estimation | Accurate | Tends to underestimate posterior variance |
| Scalability | Challenging for very large datasets | Scales well with stochastic optimization |
| Diagnostics | Well-developed (R-hat, ESS, trace plots) | Less mature (ELBO monitoring) |
| Best suited for | Small-to-moderate data; accurate uncertainty quantification | Large datasets; fast approximate inference |
Salimans, Kingma, and Welling (2015) proposed methods that bridge the gap between MCMC and VI by using MCMC transitions within variational objectives [12].
Several mature software packages implement MCMC algorithms for general-purpose probabilistic programming.
| Software | Language | Primary MCMC algorithm | Key features |
|---|---|---|---|
| Stan | C++ (interfaces for R, Python, Julia) | NUTS (HMC variant) | Automatic differentiation, extensive diagnostics, widely cited [13] |
| PyMC | Python | NUTS, Metropolis, Slice | Built on PyTensor/JAX, GPU support, variational inference also available [14] |
| JAGS | C++ (R interface) | Gibbs sampling, adaptive MH | Declarative model specification similar to BUGS |
| WinBUGS / OpenBUGS | Custom | Gibbs sampling | Pioneering software for Bayesian MCMC; less actively maintained |
| NIMBLE | R / C++ | Configurable (Gibbs, MH, HMC, particle MCMC) | Extensible; user-defined samplers |
| TensorFlow Probability | Python | HMC, NUTS, random walk MH | Integration with TensorFlow ecosystem, GPU support |
| NumPyro | Python (JAX) | NUTS | JAX-based, hardware accelerated, fast compilation |
| Turing.jl | Julia | NUTS, HMC, Gibbs, particle Gibbs | Native Julia, composable inference |
| emcee | Python | Affine-invariant ensemble sampler | Gradient-free, easy to use, popular in astrophysics |
A 2021 comparison by Merkle, Furr, and Rabe-Hesketh found that Stan generally achieved the highest effective sample size per unit time among Stan, JAGS, and NIMBLE, though the best choice depends on the specific model structure [13][14].
There is no general method to prove that an MCMC chain has converged. All diagnostics can only detect certain types of non-convergence; passing all diagnostics does not guarantee convergence. This is a fundamental limitation of the approach.
Standard MCMC algorithms struggle with multimodal distributions because they must traverse low-probability regions to move between modes. The mixing time for a simple mixture of two well-separated Gaussians grows exponentially with the separation distance. Specialized methods such as parallel tempering, replica exchange, and mode-jumping algorithms have been developed to address this problem, but they add complexity and computational cost [15].
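A toy parallel tempering sampler (all choices below, the bimodal target, temperature ladder, and step size, are illustrative) shows the mechanism: hot chains see a flattened target and cross the barrier freely, and adjacent-temperature swap moves pass those states down to the cold chain:

```python
import numpy as np

rng = np.random.default_rng(6)

def log_target(x):
    # Bimodal target: equal mixture of N(-5, 1) and N(5, 1), unnormalized.
    return np.logaddexp(-0.5 * (x - 5.0)**2, -0.5 * (x + 5.0)**2)

def parallel_tempering(n_iters, temps, step_size=1.0):
    """Random walk Metropolis at each temperature plus adjacent swaps."""
    K = len(temps)
    xs = np.zeros(K)  # one state per temperature; temps[0] == 1 is the target
    cold_trace = np.empty(n_iters)
    for t in range(n_iters):
        # Within-chain update: chain k targets pi(x)^(1 / temps[k]).
        for k in range(K):
            prop = xs[k] + step_size * rng.standard_normal()
            log_alpha = (log_target(prop) - log_target(xs[k])) / temps[k]
            if np.log(rng.uniform()) < log_alpha:
                xs[k] = prop
        # Swap move between a random adjacent pair of temperatures.
        k = rng.integers(K - 1)
        log_alpha = (log_target(xs[k + 1]) - log_target(xs[k])) * (
            1.0 / temps[k] - 1.0 / temps[k + 1])
        if np.log(rng.uniform()) < log_alpha:
            xs[k], xs[k + 1] = xs[k + 1], xs[k]
        cold_trace[t] = xs[0]
    return cold_trace

trace = parallel_tempering(30_000, temps=[1.0, 4.0, 16.0, 64.0])
frac_positive = (trace > 0).mean()
print(frac_positive)  # near 0.5: the cold chain visits both modes
```

A plain random walk Metropolis chain started in one of these modes would almost never reach the other, because it would have to cross the deep low-probability valley at x = 0.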
Although MCMC avoids the exponential scaling of grid-based methods, it still faces challenges in high dimensions. Random walk Metropolis requires O(d^2) iterations to produce one independent sample in d dimensions. HMC improves this to O(d^{5/4}), but even this scaling becomes impractical for very high-dimensional problems (e.g., millions of parameters in deep neural networks) [8].
Each MCMC iteration requires evaluating the target density (and, for HMC, its gradient). For large datasets, this means passing through the entire dataset at every step, which is expensive. Mini-batch MCMC methods have been proposed (e.g., stochastic gradient Langevin dynamics), but they introduce additional bias and their theoretical properties are less well understood.
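A sketch of stochastic gradient Langevin dynamics on a toy conjugate model (all model and tuning choices are illustrative; the exact posterior is available for comparison). Note how the extra mini-batch gradient noise inflates the chain's stationary variance, one source of the bias mentioned above:

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy model: x_i ~ N(theta, 1) with prior theta ~ N(0, 10).
N = 10_000
data = 2.0 + rng.standard_normal(N)

# Conjugate posterior, for reference: N(post_mean, post_var).
post_var = 1.0 / (1.0 / 10.0 + N)
post_mean = post_var * data.sum()

def sgld(data, n_iters, batch_size, eps):
    """Stochastic gradient Langevin dynamics with a fixed step size eps."""
    N = len(data)
    theta = 0.0
    trace = np.empty(n_iters)
    for t in range(n_iters):
        batch = rng.choice(data, size=batch_size, replace=False)
        # Mini-batch estimate of the gradient of the log posterior:
        # grad log prior + (N / batch_size) * sum of likelihood gradients.
        grad = -theta / 10.0 + (N / batch_size) * (batch - theta).sum()
        theta += 0.5 * eps * grad + np.sqrt(eps) * rng.standard_normal()
        trace[t] = theta
    return trace

trace = sgld(data, n_iters=5_000, batch_size=100, eps=1e-5)
print(trace[1_000:].mean(), post_mean)  # the two means roughly agree
print(trace[1_000:].var(), post_var)    # the SGLD variance is inflated
```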
MCMC samples are inherently autocorrelated, meaning that the effective number of independent samples is always less than the nominal number of iterations. Poorly tuned chains can have very high autocorrelation, requiring enormous numbers of iterations to obtain a reasonable effective sample size.
Several extensions of the basic MCMC framework address specific challenges.
| Method | Description | Use case |
|---|---|---|
| Parallel tempering | Runs multiple chains at different "temperatures"; high-temperature chains explore broadly, low-temperature chains concentrate on modes | Multimodal distributions |
| Reversible-jump MCMC | Allows proposals that change the dimensionality of the parameter space | Model selection, nonparametric Bayesian models |
| Sequential Monte Carlo (SMC) | Propagates a population of weighted particles through a sequence of distributions | Online inference, model comparison |
| Slice sampling | Samples uniformly from the region under the density curve, avoiding step size tuning | General-purpose, univariate conditionals |
| Adaptive MCMC | Modifies proposal parameters during sampling based on past chain behavior | Automated tuning |
| Pseudo-marginal MCMC | Replaces exact density evaluation with an unbiased estimate | Models with intractable likelihoods |
| Stochastic gradient MCMC | Uses mini-batch gradient estimates instead of full-data gradients | Large-scale Bayesian learning |