Markov Chain Monte Carlo (MCMC) is a class of algorithms for drawing samples from a probability distribution by constructing a Markov chain whose stationary distribution equals the target distribution [1]. After the chain has been simulated long enough to forget its starting point, the visited states behave as approximate samples from the target, and any expectation under the target can be estimated as a sample average over the chain. MCMC is the workhorse of modern computational statistics because it sidesteps the need to compute normalizing constants or perform analytic integration, both of which can be intractable in moderately complex models. The core idea is to design a chain whose transition kernel preserves the target distribution, then run the chain and harvest its visits.
MCMC sits at the intersection of probability, statistics, and physics. The original algorithm was conceived in 1953 at Los Alamos to compute equation-of-state properties of dense fluids on the MANIAC computer [2]. It crossed into mainstream statistics through the Hastings generalization of 1970 [3] and exploded in popularity after Geman and Geman introduced Gibbs sampling for image restoration in 1984 [4] and Gelfand and Smith showed in 1990 how to apply Gibbs sampling to a wide range of Bayesian inference problems. Today MCMC underlies probabilistic programming systems such as Stan, PyMC, and NumPyro, drives Bayesian estimation in econometrics and astronomy, and connects deeply to score-based generative models in machine learning through Langevin dynamics [5].
Many scientific problems reduce to estimating an expectation of a function f under a probability distribution pi on some space X. When pi is a multivariate Gaussian or a low-dimensional standard distribution, the expectation can often be computed analytically or by numerical quadrature. In realistic settings, however, pi is known only up to a normalizing constant. A Bayesian posterior, for example, is proportional to the likelihood times the prior, but the proportionality constant is the marginal likelihood of the data, an integral that is itself usually intractable. Even when pi is tractable, computing expectations by quadrature suffers from the curse of dimensionality: the number of grid points required for a fixed accuracy grows exponentially with the number of variables, making naive numerical integration hopeless beyond a handful of dimensions.
Monte Carlo methods replace deterministic integration with random sampling. If one can draw independent samples from pi, the law of large numbers guarantees that the sample average converges to the expectation, and the central limit theorem gives the standard error as a function of sample size. The challenge in modern problems is that drawing independent samples from pi is itself usually impossible. Rejection sampling and importance sampling work in low dimensions but degrade rapidly as the dimension grows because most of the probability mass lives in narrow ridges that random proposals never find efficiently.
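As a concrete illustration of the plain Monte Carlo estimator described above, the following sketch estimates E[X^2] under a standard normal using only the Python standard library; the function name `mc_estimate` is illustrative, not a library API. By the law of large numbers the sample average converges to the true value (here 1), and the central limit theorem gives the standard error, which shrinks like 1/sqrt(n):

```python
import math
import random

random.seed(0)

def mc_estimate(f, sampler, n):
    """Plain Monte Carlo: sample average of f and its CLT standard error."""
    draws = [f(sampler()) for _ in range(n)]
    mean = sum(draws) / n
    var = sum((d - mean) ** 2 for d in draws) / (n - 1)
    return mean, math.sqrt(var / n)  # estimate, standard error

# E[X^2] = 1 for X ~ N(0, 1); with n = 100,000 the standard error is small
est, se = mc_estimate(lambda x: x * x, lambda: random.gauss(0.0, 1.0), 100_000)
```

This works only because `random.gauss` draws independent samples directly; the rest of the article is about what to do when no such direct sampler exists.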
MCMC solves this problem by constructing a Markov chain that explores the space according to pi. Rather than producing independent samples, the chain produces correlated samples whose long-run frequency matches pi. The price for correlated samples is a slower effective sample rate; a chain of length N might be worth far fewer than N independent samples. The benefit is that the chain only needs to evaluate the unnormalized density at each visited state, sidestepping the normalizing constant entirely. This single property makes MCMC the dominant method for posterior inference in any model whose marginal likelihood is hard to compute.
A Markov chain on a state space X is a sequence of random variables X_0, X_1, X_2, ... such that the conditional distribution of X_{t+1} given the entire history depends only on X_t. The transition kernel K specifies the conditional distribution of X_{t+1} given X_t. A distribution pi is called stationary for K if drawing X_t from pi and then sampling X_{t+1} from K leaves the marginal distribution of X_{t+1} also equal to pi. If the chain is irreducible (it can reach any region of positive pi-mass from any starting point) and aperiodic (it does not return to states only at regular intervals), then the marginal distribution of X_t converges to pi regardless of the starting state. The chain forgets its initialization at a geometric rate under standard regularity conditions.
Most MCMC algorithms achieve stationarity by enforcing detailed balance. A transition kernel K satisfies detailed balance with respect to pi if pi(x) K(x, y) equals pi(y) K(y, x) for all states x and y. Detailed balance is a stronger condition than stationarity (it implies stationarity but not conversely), and it has the practical advantage that it can be checked locally rather than globally. The Metropolis-Hastings algorithm, Gibbs sampling, slice sampling, and Hamiltonian Monte Carlo all rely on detailed balance or its slightly weaker generalizations.
For a chain that has converged, ergodic averages converge to true expectations under pi by the ergodic theorem, an analog of the law of large numbers for Markov chains. The asymptotic variance of an ergodic average depends on the autocorrelation of the chain. Strongly correlated chains require longer simulations to reach a given Monte Carlo precision, which is why mixing time and effective sample size are central concerns in practical MCMC.
The Metropolis algorithm was published in 1953 by Nicholas Metropolis, Arianna Rosenbluth, Marshall Rosenbluth, Augusta Teller, and Edward Teller in the Journal of Chemical Physics under the title "Equation of State Calculations by Fast Computing Machines" [2]. The original problem was to compute pressure and other thermodynamic properties of two-dimensional fluids of hard disks on the MANIAC I computer at what was then the Los Alamos Scientific Laboratory. Direct simulation of the canonical ensemble required integrating over a high-dimensional configuration space weighted by the Boltzmann distribution exp(-E/kT), and the team needed a way to draw samples from this distribution without computing the partition function.
Marshall and Arianna Rosenbluth designed the acceptance rule that bears the Metropolis name. A trial move displaces a single particle by a small random amount and accepts the move with probability one if the new energy is lower, or with probability exp(-Delta E / kT) if the new energy is higher. The chain converges to the canonical ensemble, and time averages of any observable estimate its thermodynamic expectation. The 1953 paper is remarkable both for the algorithm itself and for being among the first published examples of stochastic computer simulation. The MANIAC computer had three thousand vacuum tubes and a few kilobytes of memory, yet the team computed several thousand Markov chain steps for systems of 224 particles, enough to extract sensible pressure-density curves.
The algorithm spent its first two decades almost entirely within statistical physics. It was used to study Lennard-Jones fluids, to estimate critical exponents in lattice spin models, and to compute properties of polymer chains. The statistics community was largely unaware of it. Edward Teller later remarked that the algorithm's translation from physics to statistics took longer than its discovery.
In 1970 the Canadian statistician W. K. Hastings published a paper in Biometrika titled "Monte Carlo Sampling Methods Using Markov Chains and Their Applications" [3]. Hastings generalized the Metropolis acceptance rule to allow asymmetric proposal distributions. The Metropolis algorithm assumed that the probability of proposing a move from x to y equals the probability of proposing the reverse move; Hastings replaced this with a more flexible acceptance ratio that includes the proposal density in both directions. The generalization, now universally called the Metropolis-Hastings algorithm, opened the door to a much wider range of proposal distributions and allowed practitioners to design proposals that match the local geometry of the target distribution.
Hastings also discussed component-wise updates in which one variable is updated at a time conditional on the others, foreshadowing the Gibbs sampler that Geman and Geman would name fourteen years later. Despite the importance of the paper in retrospect, it received little attention at the time. Hastings published only a handful of follow-up papers on the topic, and the wider statistics community did not begin to exploit the framework until the late 1980s.
The modern era of MCMC began with two papers in the 1980s. Stuart Geman and Donald Geman, working on Bayesian image restoration at Brown University, introduced Gibbs sampling in 1984 in IEEE Transactions on Pattern Analysis and Machine Intelligence [4]. Their paper named the algorithm after the Gibbs distribution from statistical mechanics and applied it to Markov random field models for noisy images. Although the Gemans were unaware of earlier work by Hastings and others, their algorithm is a special case of Metropolis-Hastings in which each proposal is the full conditional distribution of one component given the others, and the acceptance probability is always one.
The second pivotal paper was Tanner and Wong's 1987 paper in JASA on data augmentation, which used iterative simulation to handle missing data and latent variables in Bayesian models. Together with Gelfand and Smith's 1990 paper in JASA, which showed how to use Gibbs sampling for general posterior inference, these contributions ignited a rapid expansion of MCMC across the statistics literature. By the mid-1990s nearly every issue of major statistics journals contained at least one MCMC paper, and the BUGS software (Bayesian inference Using Gibbs Sampling) had made the method accessible to applied users.
Hamiltonian Monte Carlo originated in the lattice quantum chromodynamics community. Duane, Kennedy, Pendleton, and Roweth published the algorithm in Physics Letters B in 1987 under the name Hybrid Monte Carlo, designed to accelerate Markov chain simulation of lattice gauge theories [6]. The method augments the target with auxiliary momentum variables and uses Hamiltonian dynamics to propose long-range moves, leveraging gradient information to escape the random-walk behavior that plagues simple Metropolis methods. Radford Neal introduced HMC to the statistics community in a 1993 technical report and, more decisively, in his influential chapter "MCMC Using Hamiltonian Dynamics" in the 2011 Handbook of Markov Chain Monte Carlo [7]. Neal's chapter is the standard pedagogical reference for HMC and remains the entry point for most students of the method.
The No-U-Turn Sampler (NUTS) by Matthew Hoffman and Andrew Gelman, published in the Journal of Machine Learning Research in 2014, removed the last major obstacle to widespread HMC adoption [8]. HMC requires the user to specify a step size and a number of leapfrog steps per proposal, and these tuning parameters strongly influence efficiency. NUTS automatically adapts both. It builds a binary tree of leapfrog steps in both forward and backward time directions and stops as soon as the trajectory begins to double back on itself, eliminating the need to specify a path length. Combined with dual averaging for step-size adaptation, NUTS allows HMC to be used as a black-box sampler with essentially no manual tuning. NUTS is the default sampler in Stan and is also used in PyMC and NumPyro.
Reversible jump MCMC, introduced by Peter Green in 1995, generalized Metropolis-Hastings to spaces of varying dimension, allowing chains to move between models with different numbers of parameters. Reversible jump MCMC enabled fully Bayesian model selection in settings such as mixture models with an unknown number of components, regression with variable selection, and nonparametric tree models.
Metropolis-Hastings is the foundational MCMC algorithm and the parent of nearly every other method in the family. Given a target density pi (known up to a constant) and a proposal density q(y | x), the algorithm proceeds by drawing a candidate y from q(y | X_t), computing the acceptance probability alpha = min(1, [pi(y) q(X_t | y)] / [pi(X_t) q(y | X_t)]), and setting X_{t+1} equal to y with probability alpha or X_t otherwise. The acceptance ratio depends only on ratios of pi, so the normalizing constant cancels.
The choice of proposal density determines the chain's efficiency. A symmetric Gaussian proposal centered at the current state is the simplest option and recovers the original Metropolis algorithm. The proposal scale should be tuned so that the chain accepts roughly 23.4 percent of moves in high-dimensional Gaussian-like targets, the optimal-scaling result derived by Roberts, Gelman, and Gilks in 1997. Smaller proposals lead to higher acceptance but slower exploration, while larger proposals explore more aggressively but are rejected more often. The autocorrelation time depends sensitively on proposal scale, and adaptive Metropolis algorithms tune the scale automatically based on the empirical covariance of past samples.
Metropolis-Hastings is robust and easy to implement but suffers from random-walk behavior in high dimensions. The chain takes on the order of d^2 steps to traverse a d-dimensional Gaussian target, making it impractical for problems with hundreds or thousands of parameters unless combined with a smarter proposal mechanism such as Hamiltonian dynamics or component-wise updating.
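The accept/reject recipe above can be sketched in a few lines. The following is a minimal random-walk Metropolis sampler on a one-dimensional standard normal target, using only the unnormalized log density (the normalizing constant cancels in the ratio, so it is never needed); the helper names are illustrative, not a library API:

```python
import math
import random

random.seed(1)

def log_target(x):
    # Unnormalized log density of N(0, 1); the constant cancels in the ratio
    return -0.5 * x * x

def metropolis(log_pi, x0, scale, n_steps):
    """Random-walk Metropolis with a symmetric Gaussian proposal."""
    x, samples, accepted = x0, [], 0
    for _ in range(n_steps):
        y = x + random.gauss(0.0, scale)              # symmetric proposal
        # Accept with probability min(1, pi(y)/pi(x)), computed in log space
        if math.log(random.random()) < log_pi(y) - log_pi(x):
            x, accepted = y, accepted + 1
        samples.append(x)                             # rejections repeat x
    return samples, accepted / n_steps

samples, acc_rate = metropolis(log_target, x0=3.0, scale=2.4, n_steps=50_000)
```

Note that a rejected proposal still contributes a sample (the chain stays put); silently dropping rejections is a common implementation bug that biases the stationary distribution.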
Gibbs sampling updates one variable (or block of variables) at a time, drawing each from its conditional distribution given the current values of all the others [4]. For a target pi(x_1, ..., x_d), a sweep consists of drawing x_1 from pi(x_1 | x_2, ..., x_d), then x_2 from pi(x_2 | x_1, x_3, ..., x_d), and so on through x_d. Each conditional draw is itself a Metropolis-Hastings step with acceptance probability one, so Gibbs sampling can be viewed as a special case of Metropolis-Hastings.
Gibbs sampling is most useful when the conditional distributions have closed-form expressions. Conjugate Bayesian models such as Gaussian-Gaussian regression, Beta-Binomial models, Dirichlet-Multinomial mixtures, and many hierarchical models have tractable conditionals, which is why Gibbs sampling rose to prominence in Bayesian statistics during the 1990s. The BUGS software automated the construction of Gibbs samplers from a graphical model specification, dramatically lowering the barrier to entry.
Gibbs sampling can also be combined with Metropolis-Hastings updates for components without closed-form conditionals, a hybrid sometimes called Metropolis-within-Gibbs. The chain remains valid as long as each update preserves the joint target. The main weakness of Gibbs sampling is that highly correlated variables produce a chain that mixes slowly, since each component is updated along a coordinate axis even when the target's principal axes are not aligned with the coordinates. Block updates and reparameterizations can mitigate the problem.
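A sweep of exact conditional draws is easiest to see on a standard bivariate Gaussian with correlation rho, where each full conditional is itself Gaussian: x1 | x2 ~ N(rho * x2, 1 - rho^2), and symmetrically for x2. A minimal sketch (names are illustrative):

```python
import math
import random

random.seed(2)

def gibbs_bivariate_normal(rho, n_steps):
    """Gibbs sampler for a standard bivariate Gaussian with correlation rho.

    Each full conditional is Gaussian, so every update is an exact draw
    and the Metropolis-Hastings acceptance probability is one.
    """
    sd = math.sqrt(1.0 - rho * rho)
    x1, x2, samples = 0.0, 0.0, []
    for _ in range(n_steps):
        x1 = random.gauss(rho * x2, sd)  # draw x1 from pi(x1 | x2)
        x2 = random.gauss(rho * x1, sd)  # draw x2 given the NEW x1
        samples.append((x1, x2))
    return samples

samples = gibbs_bivariate_normal(rho=0.9, n_steps=50_000)
```

The example also exhibits the weakness noted above: as rho approaches one, each coordinate update moves the chain only a short distance along the target's long diagonal axis, and mixing slows dramatically.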
Slice sampling, introduced by Radford Neal in a 2003 Annals of Statistics paper, provides a black-box alternative to Metropolis-Hastings that avoids tuning a proposal scale. The basic idea is to introduce an auxiliary uniform variable u between zero and pi(x), then alternate between drawing u given x and drawing x uniformly from the slice {y : pi(y) > u}. The chain is provably correct and adapts naturally to the local scale of the target. The main implementation challenge is identifying the slice, which Neal solves with a stepping-out procedure that grows an interval until both endpoints lie outside the slice, followed by a shrinkage procedure that contracts the interval until a point inside the slice is sampled.
Slice sampling is particularly attractive for univariate components and for use within Gibbs sampling, where it can replace Metropolis-Hastings updates that would otherwise require careful proposal tuning. It is also the default sampler in many software packages for variables that lack tractable conditional forms.
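A simplified sketch of Neal's stepping-out and shrinkage procedures for a univariate target follows; it works in log space (drawing the auxiliary height u uniformly under pi(x) is equivalent to adding log of a uniform to log pi(x)) and adds a step budget as a safety measure not in the original paper:

```python
import math
import random

random.seed(3)

def slice_sample(log_pi, x, w=1.0, max_steps=50):
    """One univariate slice-sampling update (stepping out, then shrinkage)."""
    # Auxiliary height: u ~ Uniform(0, pi(x)), stored as log u
    log_u = log_pi(x) + math.log(random.random())
    # Stepping out: grow an interval of width w until both ends exit the slice
    left = x - w * random.random()
    right = left + w
    while max_steps and log_pi(left) > log_u:
        left -= w
        max_steps -= 1
    while max_steps and log_pi(right) > log_u:
        right += w
        max_steps -= 1
    # Shrinkage: sample uniformly, contracting the interval on each rejection
    while True:
        x_new = random.uniform(left, right)
        if log_pi(x_new) > log_u:
            return x_new
        if x_new < x:
            left = x_new
        else:
            right = x_new

x, draws = 0.0, []
for _ in range(20_000):
    x = slice_sample(lambda t: -0.5 * t * t, x)  # target: N(0, 1)
    draws.append(x)
```

Note that no proposal scale needed tuning: the width w only affects efficiency, not correctness, which is the property that makes slice sampling attractive as a drop-in component inside Gibbs sweeps.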
Hamiltonian Monte Carlo (HMC) augments the target distribution with auxiliary momentum variables p drawn from a multivariate Gaussian, then proposes new states by simulating Hamiltonian dynamics on the joint position-momentum space [7]. The Hamiltonian H(x, p) equals the negative log of the target plus a kinetic energy term (half the inner product of the momentum with itself, possibly preconditioned by a mass matrix). Hamiltonian dynamics conserves H exactly in continuous time and approximately under numerical integration, so a long deterministic simulation produces a far-away proposal that nonetheless has nearly the same density as the starting point.
HMC uses the leapfrog integrator, a symplectic scheme that alternates half-step momentum updates with full-step position updates. The leapfrog integrator is volume-preserving and time-reversible, two properties that allow the Metropolis-Hastings acceptance ratio to be computed exactly even though the proposal is generated by a complicated dynamical simulation. After running a fixed number of leapfrog steps, the proposal is accepted with probability min(1, exp(H_old - H_new)). Energy conservation under leapfrog is excellent in practice, so acceptance rates of 60 to 90 percent are typical, and the chain can take very long jumps in a single step.
The central advantage of HMC over random-walk Metropolis is that it suppresses the random-walk behavior that limits high-dimensional sampling. Under optimal scaling, the cost of random-walk Metropolis grows on the order of d^2 with dimension, while HMC's cost grows only on the order of d^{5/4} (its optimal step size shrinks as d^{-1/4}), a polynomial but dramatic improvement. The cost is that HMC requires gradients of the log density, which adds a per-step computational burden but is feasible for any differentiable model.
The No-U-Turn Sampler, introduced by Hoffman and Gelman in 2014, is the most widely used HMC variant in modern probabilistic programming [8]. The key innovation is a recursive doubling algorithm that builds a binary tree of leapfrog states forward and backward in fictitious time, doubling the path length at each iteration. The expansion stops as soon as the trajectory begins to retrace itself, detected by a U-turn condition that compares the displacement vector with the momentum at each tree boundary. Once the tree is complete, NUTS samples uniformly from the visited states subject to a slice-sampling constraint that ensures detailed balance.
NUTS also adapts the leapfrog step size during a warmup phase using a primal-dual averaging scheme that targets a user-specified average acceptance probability (typically 0.8). After warmup the step size is fixed and sampling proceeds without further adaptation. The combination of automatic tree-length adaptation and step-size adaptation makes NUTS essentially a black-box sampler. The user supplies a log-density function and its gradient, and NUTS handles the rest.
NUTS is the default backend in Stan, the algorithm of choice in PyMC's modern API, and the foundation of NumPyro's sampling stack. Its empirical efficiency, robustness, and absence of tunable knobs have made it the de facto standard for general-purpose Bayesian inference on continuous parameter spaces.
Reversible jump MCMC, due to Peter Green in 1995, extends Metropolis-Hastings to settings where the dimension of the parameter space is itself a random variable. A typical application is fitting a mixture model whose number of components is unknown. Each model in the family has a different parameter dimension, and the chain must move between dimensions in a way that preserves the joint target on the union of model spaces. The algorithm achieves this by introducing dimension-matching auxiliary variables and using a Jacobian factor in the acceptance ratio.
Reversible jump MCMC is more difficult to implement than fixed-dimension methods and requires careful design of trans-dimensional proposals. Software support is limited compared to Stan-style fixed-dimension samplers, and many practitioners use marginal likelihood estimation or variational methods for model selection rather than reversible jump MCMC. The algorithm remains the rigorous gold standard when fully Bayesian model averaging or model selection is required.
Parallel tempering, also known as replica exchange MCMC, runs multiple chains in parallel at different temperatures, where the t-th chain samples from a tempered version of the target proportional to pi(x)^{1/T_t}. Higher temperatures flatten the distribution, allowing the corresponding chain to traverse barriers between modes more easily. Periodically, neighboring chains attempt to swap states using a Metropolis-Hastings acceptance probability that preserves their respective stationary distributions. Information flows from the high-temperature chains down through the temperature ladder to the cold chain, which targets the original distribution.
Parallel tempering is the standard technique for sampling from multimodal distributions where vanilla MCMC would become trapped in a single mode. It is widely used in molecular dynamics, protein folding, and Bayesian phylogenetics, and has seen renewed interest in machine learning for sampling from energy-based models and posterior distributions of neural networks.
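The ladder-and-swap scheme can be sketched on a one-dimensional bimodal target, an equal mixture of N(-4, 1) and N(4, 1), whose modes a single random-walk chain rarely crosses. The temperature ladder and proposal scale below are arbitrary illustrative choices; the swap acceptance follows from the ratio of tempered densities pi^{1/T}:

```python
import math
import random

random.seed(5)

def log_target(x):
    # Bimodal target: equal mixture of N(-4, 1) and N(4, 1), unnormalized
    return math.log(math.exp(-0.5 * (x - 4) ** 2) +
                    math.exp(-0.5 * (x + 4) ** 2))

temps = [1.0, 4.0, 16.0]      # temperature ladder; chain 0 is the cold chain
xs = [0.0] * len(temps)
cold_draws = []
for step in range(30_000):
    # Within-temperature random-walk Metropolis move for each replica
    for i, T in enumerate(temps):
        y = xs[i] + random.gauss(0.0, 1.0)
        if math.log(random.random()) < (log_target(y) - log_target(xs[i])) / T:
            xs[i] = y
    # Attempt a state swap between one random pair of neighboring replicas
    i = random.randrange(len(temps) - 1)
    log_ratio = (log_target(xs[i + 1]) - log_target(xs[i])) * \
                (1.0 / temps[i] - 1.0 / temps[i + 1])
    if math.log(random.random()) < log_ratio:
        xs[i], xs[i + 1] = xs[i + 1], xs[i]
    cold_draws.append(xs[0])
```

The hottest replica sees an almost flat landscape and wanders freely between the modes; swaps then carry those mode changes down the ladder, so the cold chain visits both modes even though its own local moves almost never cross the barrier.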
The following table summarizes the main MCMC algorithms and their relative strengths.
| Algorithm | Year | Uses gradients | Tuning needed | Strengths | Weaknesses |
|---|---|---|---|---|---|
| Metropolis | 1953 | No | Proposal scale | Simple, robust, general | Random-walk behavior in high d |
| Metropolis-Hastings | 1970 | No | Proposal density | Flexible proposals, asymmetric jumps | Same scaling limits as Metropolis |
| Gibbs sampling | 1984 | No | Block structure | Closed-form when conditionals known | Slow with strong correlations |
| Slice sampling | 2003 | No | Almost none | Adapts to local scale | Sometimes slow per step |
| Hamiltonian Monte Carlo | 1987 | Yes | Step size, path length | Excellent high-d scaling | Needs differentiable target |
| NUTS | 2014 | Yes | None (after warmup) | Black-box HMC | Hard to extend to constrained spaces |
| Reversible jump | 1995 | Optional | Trans-dimensional proposals | Handles variable model dimension | Difficult to implement |
| Parallel tempering | 1991 | Optional | Temperature ladder | Crosses barriers between modes | Many chains needed |
| Particle MCMC | 2010 | No | Particle count | State-space models with latent paths | Computationally expensive |
MCMC samples are useful only after the chain has converged to its stationary distribution. Diagnosing convergence is a fundamentally hard problem because the chain's only output is the sample path itself, which gives limited information about the parts of the space that have not been visited. Practical MCMC therefore relies on a battery of diagnostics that detect common failure modes rather than guaranteeing convergence.
A trace plot displays the value of a single parameter (or a derived quantity such as the log posterior) against iteration index. A converged, well-mixing chain produces a stationary, hairy time series with no trends, no large excursions, and no obvious dependence on the starting value. Multiple chains run from different starting points should overlap and become indistinguishable after the warmup phase. Trace plots that drift, that show long sojourns in distinct regions, or that fail to overlap across chains are signs of poor mixing or unconverged chains. Modern recommendations sometimes replace trace plots with rank plots, which compare the empirical rank distribution of each chain across all chains and are more sensitive to subtle failures.
The R-hat statistic, introduced by Andrew Gelman and Donald Rubin in 1992, compares the variance within each chain to the variance between chains. It is computed by running m parallel chains of length n from overdispersed starting points, then forming W (the average within-chain variance) and B (the variance of the chain means scaled by n). The marginal posterior variance is estimated as a weighted average, and R-hat is the square root of the ratio of this estimate to W. If the chains have converged to the same stationary distribution, R-hat should be close to one. A common threshold is R-hat below 1.01 (the modern stricter cutoff popularized by Vehtari and colleagues) or 1.1 (the original looser cutoff). Values larger than the threshold indicate that the chains have not yet mixed.
The original R-hat has been refined repeatedly. Brooks and Gelman (1998) proposed a multivariate version. Vehtari et al. (2021) introduced a rank-normalized R-hat that is more reliable for chains with heavy tails or non-Gaussian distributions, and they recommend computing R-hat separately for the bulk and tails of the distribution. Many modern packages report this improved R-hat by default.
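The classic Gelman-Rubin construction (not the rank-normalized refinement) reduces to a few lines. The following sketch follows the within/between variance recipe described above; the function name is illustrative:

```python
import math
import random

random.seed(6)

def r_hat(chains):
    """Classic Gelman-Rubin R-hat from m chains of equal length n."""
    m, n = len(chains), len(chains[0])
    means = [sum(c) / n for c in chains]
    grand = sum(means) / m
    # B: between-chain variance (variance of chain means, scaled by n)
    b = n / (m - 1) * sum((mu - grand) ** 2 for mu in means)
    # W: average within-chain variance
    w = sum(sum((x - mu) ** 2 for x in c) / (n - 1)
            for c, mu in zip(chains, means)) / m
    var_hat = (n - 1) / n * w + b / n   # weighted posterior-variance estimate
    return math.sqrt(var_hat / w)

# Four chains sampling the same distribution: R-hat should be near 1
good = [[random.gauss(0, 1) for _ in range(1000)] for _ in range(4)]
# One chain stuck in a different region: R-hat should be well above 1
bad = good[:3] + [[random.gauss(5, 1) for _ in range(1000)]]
```

In practice one should use a library implementation (e.g. the rank-normalized R-hat reported by modern packages), which also splits each chain in half to catch within-chain drift that this simple version misses.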
The effective sample size (ESS) measures how many independent samples the correlated chain is worth. It is defined as the chain length divided by the integrated autocorrelation time tau = 1 + 2(rho_1 + rho_2 + ...), where rho_k is the chain's autocorrelation at lag k. A chain with N iterations and integrated autocorrelation time tau yields an effective sample size of N/tau. Strong autocorrelations make ESS much smaller than N. A typical recommendation is to require an effective sample size of at least a few hundred for any quantity of interest before declaring inference reliable. Vehtari et al. (2021) introduced a more reliable ESS estimator that splits the chain into halves to detect non-stationarity and computes separate ESS values for the bulk and tails.
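A simple ESS estimator truncates the autocorrelation sum once the correlations become negligible; the cutoff rule below is a crude stand-in for the principled truncation (Geyer's initial-sequence pairing) used by production estimators. As a check, an AR(1) chain with coefficient phi has a known integrated autocorrelation time tau = (1 + phi) / (1 - phi):

```python
import random

random.seed(7)

def effective_sample_size(chain):
    """ESS = N / tau, with tau = 1 + 2 * (truncated sum of autocorrelations)."""
    n = len(chain)
    mean = sum(chain) / n
    var = sum((x - mean) ** 2 for x in chain) / n
    tau, lag = 1.0, 1
    while lag < n // 2:
        acf = sum((chain[t] - mean) * (chain[t + lag] - mean)
                  for t in range(n - lag)) / (n * var)
        if acf < 0.05:          # crude truncation once correlations die out
            break
        tau += 2.0 * acf
        lag += 1
    return n / tau

# AR(1) chain with phi = 0.9: tau = 1.9 / 0.1 = 19, so ESS is roughly N / 19
phi, x, chain = 0.9, 0.0, []
for _ in range(20_000):
    x = phi * x + random.gauss(0.0, 1.0)
    chain.append(x)
```

For this chain the estimator returns on the order of a thousand effective samples out of twenty thousand iterations, which is exactly the penalty for correlation that the text describes.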
Monte Carlo standard errors compare the precision of the MCMC estimate to the desired precision. If the standard error is small relative to the posterior standard deviation, the chain has produced enough samples for the practical purpose at hand. The Heidelberger-Welch test detects non-stationarity in a single chain. The Geweke test compares the means of the early and late portions of a chain to detect drift. None of these methods can prove convergence; they can only detect failures.
MCMC is the standard method for posterior inference in Bayesian statistics, with applications in essentially every quantitative discipline. The following table illustrates the breadth of MCMC use across domains.
| Domain | Typical use case | Common method |
|---|---|---|
| Bayesian regression | Hierarchical regression, generalized linear models | NUTS, Gibbs sampling |
| Phylogenetics | Inferring evolutionary trees from sequence data | Metropolis-coupled MCMC, parallel tempering |
| Astronomy | Fitting cosmological models, exoplanet orbit estimation | Affine-invariant ensemble samplers, HMC |
| Econometrics | Stochastic volatility, dynamic factor models | Particle MCMC, Gibbs sampling |
| Epidemiology | Compartmental disease models, contact tracing | Particle MCMC, Metropolis-Hastings |
| Bayesian deep learning | Posterior over neural network weights | Stochastic gradient HMC, Langevin dynamics |
| Generative modeling | Sampling from energy-based models, diffusion models | Langevin dynamics, parallel tempering |
| Mixture models | Inference in Gaussian mixture models | Gibbs sampling, reversible jump |
| Image processing | Markov random fields, image restoration | Gibbs sampling, simulated annealing |
| Topic modeling | Latent Dirichlet allocation, hierarchical Dirichlet processes | Collapsed Gibbs sampling |
The canonical application of MCMC is Bayesian posterior inference. Given a prior p(theta) and a likelihood p(data | theta), Bayes's theorem states that the posterior is proportional to the product of the two. The marginal likelihood in the denominator is usually intractable, but MCMC sidesteps it because the chain only needs the unnormalized posterior. Hierarchical models, in which parameters at one level are drawn from distributions parameterized at a higher level, are particularly well suited to Gibbs sampling because the conditional distributions often have closed forms. Stan, PyMC, and NumPyro have made hierarchical Bayesian modeling routine in fields ranging from political science to ecology.
Bayesian deep learning treats the weights of a neural network as random variables and aims to infer the posterior distribution over weights given training data. The posterior captures epistemic uncertainty about the model and supports principled prediction intervals. Posterior inference for deep learning models is challenging because the parameter space has hundreds of millions of dimensions and the posterior is highly non-Gaussian.
Full-batch HMC has been shown by Izmailov and colleagues (2021) to give the gold-standard posterior estimate for moderately sized Bayesian neural networks (BNNs), but it requires running through the full dataset for every leapfrog step. Stochastic gradient HMC, stochastic gradient Langevin dynamics, and related methods scale to larger datasets by replacing exact gradients with mini-batch estimates and adding a noise term to maintain the correct stationary distribution. The cost is that the noise injected by the mini-batch estimate biases the stationary distribution slightly, and recovering the exact posterior requires careful step-size scheduling. In practice, stochastic gradient MCMC is often combined with cold posterior tempering or with variational approximations to balance scalability and accuracy.
Symmetric splitting HMC (Cobb and Jalaian, 2021) has enabled exact full-batch HMC for neural networks with millions of parameters by exploiting the structure of the gradient computation. Combined with modern accelerator hardware, these methods are bringing full Bayesian inference within reach for production-scale neural networks.
Langevin dynamics, the continuous-time stochastic process whose stationary distribution is a target density pi, is the bridge between MCMC and modern score-based generative modeling [5]. The discretized Langevin update steps in the direction of the score (the gradient of the log density) and adds Gaussian noise calibrated to the step size. In the small-step-size limit, the discretization converges to a sample from pi.
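The discretized update is one line: step half the squared step size along the score, add Gaussian noise scaled by the step size. The sketch below is the unadjusted Langevin algorithm (no Metropolis correction, so the stationary distribution carries a small discretization bias that vanishes as the step size shrinks); names are illustrative:

```python
import math
import random

random.seed(8)

def langevin_step(grad_log_pi, x, step_size):
    """Unadjusted Langevin update: drift along the score plus matched noise."""
    noise = random.gauss(0.0, 1.0)
    return x + 0.5 * step_size * grad_log_pi(x) + math.sqrt(step_size) * noise

# Target N(0, 1), whose score is grad log pi(x) = -x; small steps give
# approximate samples even starting far from the mass at x = 5
x, draws = 5.0, []
for _ in range(50_000):
    x = langevin_step(lambda t: -t, x, step_size=0.05)
    draws.append(x)
```

Adding a Metropolis accept/reject step to this update yields the Metropolis-adjusted Langevin algorithm (MALA), which removes the discretization bias; score-based generative models instead keep the unadjusted form because the learned score makes the acceptance ratio unavailable.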
Score-based generative models, introduced by Yang Song and Stefano Ermon in 2019 and unified with diffusion models in subsequent work, train a neural network to estimate the score of a noise-perturbed data distribution at multiple noise scales. Sampling proceeds by annealed Langevin dynamics: starting from pure noise, the model runs Langevin updates at progressively smaller noise scales, with each scale using a different score estimate. The resulting algorithm is essentially MCMC sampling from a sequence of distributions that interpolate between an easy-to-sample base distribution and the data distribution. This perspective unifies diffusion models, score matching, and MCMC, and recent work continues to exploit MCMC ideas to improve sampler efficiency in generative models.
The critically-damped Langevin diffusion approach by NVIDIA augments the data variable with auxiliary momentum, mirroring the construction in HMC and yielding faster mixing for high-dimensional generative models. The exchange of ideas between the MCMC and generative modeling communities continues to be an active research frontier.
Particle filters, also known as sequential Monte Carlo (SMC) methods, are not strictly MCMC algorithms but are closely related and often combined with MCMC. They estimate the posterior distribution over latent states in nonlinear non-Gaussian state-space models by maintaining a population of weighted particles that evolve through time according to the state dynamics and are reweighted by their likelihood under each new observation. Particle MCMC, introduced by Andrieu, Doucet, and Holenstein in 2010, embeds a particle filter inside a Metropolis-Hastings or Gibbs sampler to perform fully Bayesian inference for parameters of state-space models with latent state trajectories. The combined algorithm has become a standard tool in epidemiology, ecology, and finance.
The following table compares the major probabilistic programming and MCMC libraries.
| Software | Language | Default sampler | Strengths | Notes |
|---|---|---|---|---|
| Stan | Stan DSL, R, Python | NUTS | Industry standard for HMC, mature ecosystem | Curly-brace block syntax |
| PyMC | Python | NUTS (with JAX backend optional) | Pythonic API, broad model library | Successor to PyMC3; rebuilt on PyTensor |
| NumPyro | Python (JAX) | NUTS | JAX-based, fast on GPU | Lightweight, JAX-native |
| Pyro | Python (PyTorch) | NUTS, HMC | Deep universal probabilistic programming | Designed for variational inference too |
| TensorFlow Probability | Python (TF) | HMC, NUTS | Integrates with TensorFlow ecosystem | Lower-level than Stan or PyMC |
| Turing.jl | Julia | NUTS, HMC, particle MCMC | Composable, Julia-native | Strong performance for hierarchical models |
| BUGS / WinBUGS | BUGS DSL | Gibbs | Original Bayesian inference engine | Largely superseded by Stan and PyMC |
| JAGS | BUGS-like DSL | Gibbs, slice sampling | Successor to BUGS | Used in R via rjags |
| emcee | Python | Affine-invariant ensemble | Popular in astronomy | Gradient-free |
| BlackJAX | Python (JAX) | NUTS, HMC, SMC, others | Modular sampler library | Used as backend by other PPLs |
Stan, developed by Andrew Gelman, Bob Carpenter, and colleagues at Columbia, was released in 2012 and remains the industry standard for HMC-based Bayesian inference [9]. Models are written in a domain-specific language with curly-brace blocks for data, parameters, transformed parameters, and the model, then compiled to C++. The default sampler is dynamic NUTS with several modern improvements, including multinomial sampling of trajectories and improved metric adaptation.
PyMC began life in 2009 as a pure-Python Metropolis-Hastings library and evolved through PyMC3 (which used Theano for autodiff) into the modern PyMC built on the PyTensor backend [10]. The default sampler is NUTS, with the option to use external samplers from BlackJAX or NumPyro for faster execution on GPUs. PyMC's API is designed to feel like idiomatic Python, with random variables declared inside a model context manager and probabilistic programs expressed as ordinary Python code.
NumPyro, developed by the Pyro team at Uber AI Labs, uses JAX as its computational backend and offers one of the fastest NUTS implementations available, particularly on GPU and TPU hardware. Benchmarks on Bayesian item response models and other hierarchical structures have shown NumPyro outperforming PyStan and PyMC by factors of two to ten, depending on the problem and hardware.
For scientists working with state-space models, particle MCMC implementations are available in libraries such as particles (Python) and pomp (R). For machine learning researchers exploring stochastic gradient MCMC, libraries such as bayes-torch and the Bayesian inference modules in TensorFlow Probability provide implementations of stochastic gradient Langevin dynamics, stochastic gradient HMC, and related methods.
MCMC chains are usually started from arbitrary initial values that are not draws from the target distribution. The early portion of the chain is therefore not representative of the target and is typically discarded as burn-in (in the Metropolis tradition) or warmup (in the Stan tradition). Stan and other modern HMC implementations use the warmup period to adapt the step size, the mass matrix, and other tuning parameters, after which the chain runs without further adaptation. Choosing the burn-in length is a judgment call. Trace plots, R-hat, and ESS diagnostics provide guidance, but no single rule fits all problems. A common default in HMC is 1000 warmup iterations followed by 1000 sampling iterations, but harder problems may require ten or one hundred times more.
Thinning means discarding all but every k-th sample from the chain to reduce storage and autocorrelation. While intuitively appealing, thinning is statistically inefficient unless storage is the binding constraint, since it discards information that could be used in the ergodic average. Most modern guidance recommends keeping all samples and reporting effective sample size rather than relying on thinning to produce nominally independent draws.
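The point is easy to check numerically. The sketch below uses an AR(1) chain as a stand-in for MCMC output and a simple Geyer-style initial-positive-sequence estimator of effective sample size (both choices illustrative); the thinned chain's ESS never exceeds the full chain's:

```python
import numpy as np

def ess(chain, max_lag=1000):
    """Effective sample size via Geyer's initial positive sequence:
    sum paired autocorrelations (rho_1+rho_2), (rho_3+rho_4), ...
    until a pair goes negative."""
    n = len(chain)
    c = chain - chain.mean()
    denom = np.dot(c, c)
    tau, k = 1.0, 1
    while k + 1 <= max_lag:
        rho1 = np.dot(c[:-k], c[k:]) / denom
        rho2 = np.dot(c[:-(k + 1)], c[k + 1:]) / denom
        if rho1 + rho2 < 0:
            break
        tau += 2 * (rho1 + rho2)
        k += 2
    return n / tau

rng = np.random.default_rng(3)
x = np.empty(50_000)
x[0] = 0.0
for t in range(1, len(x)):          # AR(1) with lag-1 correlation 0.9
    x[t] = 0.9 * x[t - 1] + rng.standard_normal()

full_ess = ess(x)
thinned_ess = ess(x[::10])          # keep every 10th draw
print(full_ess, thinned_ess)        # thinning does not increase the ESS
```

For this chain the true integrated autocorrelation time is (1 + 0.9)/(1 - 0.9) = 19, so the full chain carries roughly 50,000/19 effective draws; the thinned chain has far less autocorrelation per retained draw but fewer draws, and comes out behind.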
Most MCMC algorithms are local in the sense that proposals concentrate near the current state. When the target has multiple well-separated modes, the chain may become trapped in one mode for a long time, producing a sample that misrepresents the global distribution. Parallel tempering, simulated annealing, and population-based methods are designed to address multimodality. In Bayesian deep learning, multimodality is the rule rather than the exception, and recent work explores ways to combine local MCMC with cycle-based methods or with explicit mode-jumping proposals.
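To make the tempering idea concrete, here is a minimal sketch (target, temperature ladder, and step sizes all illustrative): each chain runs random-walk Metropolis on a tempered target pi(x)^beta_k, and adjacent temperatures occasionally exchange states, letting mode jumps made by the hot chains propagate down to the cold one:

```python
import numpy as np

def logp(x):
    # equal mixture of N(-4, 1) and N(4, 1), up to an additive constant
    return np.logaddexp(-0.5 * (x + 4) ** 2, -0.5 * (x - 4) ** 2)

rng = np.random.default_rng(6)
betas = np.array([1.0, 0.5, 0.25, 0.1])     # beta = 1 is the target chain
x = np.full(len(betas), -4.0)               # start every chain in one mode
cold = []
for t in range(20_000):
    # within-chain random-walk Metropolis on each tempered target
    prop = x + rng.standard_normal(len(betas))
    accept = np.log(rng.random(len(betas))) < betas * (logp(prop) - logp(x))
    x = np.where(accept, prop, x)
    # propose a swap between a random adjacent pair of temperatures
    k = rng.integers(len(betas) - 1)
    log_ratio = (betas[k] - betas[k + 1]) * (logp(x[k + 1]) - logp(x[k]))
    if np.log(rng.random()) < log_ratio:
        x[k], x[k + 1] = x[k + 1], x[k]
    cold.append(x[0])
cold = np.asarray(cold)
print((cold > 0).mean())    # both modes visited despite starting in one
```

A single chain at beta = 1 started at -4 would almost never cross the low-density barrier at zero; here the beta = 0.1 chain crosses freely and swaps carry the discovery down the ladder.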
The efficiency of MCMC algorithms can depend dramatically on the parameterization of the model. Centered parameterizations of hierarchical models, where group-level parameters are drawn directly from population distributions, often produce funnel-shaped posteriors with extreme curvature that slow down NUTS. Non-centered parameterizations, where group-level parameters are written as a population mean plus a scaled standard normal deviate, often sample much more efficiently. Reparameterization is a routine part of model debugging in modern Bayesian workflows and is documented extensively in the Stan user guide.
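The funnel geometry and its cure can be illustrated with Neal's funnel, a standard test case: v ~ N(0, 3^2) and x_i | v ~ N(0, e^v). The sketch below (a plain random-walk Metropolis sampler with illustrative step sizes, standing in for NUTS) compares the centered parameterization with the non-centered one, under which the target becomes a product of independent normals:

```python
import numpy as np

def rwm(logp, x0, step, n_steps, rng):
    """Random-walk Metropolis with an isotropic Gaussian proposal."""
    x = np.array(x0, dtype=float)
    lp = logp(x)
    out = np.empty((n_steps, x.size))
    for t in range(n_steps):
        prop = x + step * rng.standard_normal(x.size)
        lp_prop = logp(prop)
        if np.log(rng.random()) < lp_prop - lp:
            x, lp = prop, lp_prop
        out[t] = x
    return out

def logp_centered(q):
    # q = (v, x_1..x_9): v ~ N(0, 9), x_i | v ~ N(0, e^v)
    v, x = q[0], q[1:]
    return -v**2 / 18.0 - 0.5 * np.exp(-v) * np.sum(x**2) - 4.5 * v

def logp_noncentered(q):
    # q = (v, z_1..z_9) with x = exp(v/2) * z: a product of normals
    v, z = q[0], q[1:]
    return -v**2 / 18.0 - 0.5 * np.sum(z**2)

rng = np.random.default_rng(5)
cen = rwm(logp_centered, np.zeros(10), 0.4, 50_000, rng)
non = rwm(logp_noncentered, np.zeros(10), 0.4, 50_000, rng)
# how far each chain descends into the neck of the funnel (very negative v)
print(cen[:, 0].min(), non[:, 0].min())
```

In the centered coordinates the sampler must shrink the x's and v together to enter the funnel's neck, so it rarely reaches very negative v; in the non-centered coordinates the neck disappears entirely.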
MCMC is a remarkably general framework but has well-known limitations. Sampling is fundamentally serial within a chain: producing N samples requires N sequential evaluations of the target density, and although independent chains parallelize trivially, each chain must still converge on its own. This typically makes MCMC slower than variational inference for problems with many parameters or large datasets. Even with HMC's improved high-dimensional scaling, posterior inference for state-of-the-art deep learning models remains computationally prohibitive without approximations.
Multimodality remains a persistent challenge. While parallel tempering and ensemble samplers help in moderate-dimensional problems, no general-purpose method efficiently handles posteriors with many widely separated modes in very high dimensions. Combining MCMC with optimization or variational methods to identify modes, then using local MCMC to characterize each mode, is a common compromise.
The theoretical foundations of MCMC for non-reversible chains, for chains with adaptive proposals, and for stochastic gradient variants continue to be active research areas. Non-reversible MCMC, which violates detailed balance but preserves stationarity, can sometimes mix faster than reversible methods. Adaptive MCMC, which tunes proposal distributions based on past samples, can match the efficiency of well-tuned chains but requires careful design to maintain ergodicity. Stochastic gradient MCMC for machine learning introduces bias that depends on step size, batch size, and gradient noise, and characterizing this bias remains an open problem.
The interplay between MCMC and modern AI is still being worked out. As diffusion models increasingly dominate generative modeling and as Bayesian methods make inroads into deep learning, MCMC ideas continue to find new applications even as the techniques themselves continue to evolve. The 2011 Handbook of Markov Chain Monte Carlo, edited by Brooks, Gelman, Jones, and Meng, remains the most comprehensive reference for the field [1], and Robert and Casella's Monte Carlo Statistical Methods is the standard graduate-level text [11].