Markov chain
Last reviewed
May 1, 2026
Sources
24 citations
Review status
Source-backed
Revision
v1 · 4,201 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
May 1, 2026
Sources
24 citations
Review status
Source-backed
Revision
v1 · 4,201 words
Add missing citations, update stale details, or suggest a clearer explanation.
A Markov chain is a stochastic process in which the conditional probability of the next state depends only on the current state and not on the sequence of states that preceded it. Formally, a discrete-time Markov chain is a sequence of random variables X_0, X_1, X_2, ... taking values in a state space S such that P(X_{n+1} = x | X_0, X_1, ..., X_n) = P(X_{n+1} = x | X_n) for every n and every x. This conditional independence is called the Markov property, and it is the defining feature that makes the family tractable. A Markov chain is the simplest non-trivial generalisation of an independent and identically distributed sequence, allowing memory of length one while preserving most of the analytic structure of independent sequences.
Markov chains were introduced by the Russian mathematician Andrey Markov in a 1906 paper that extended the law of large numbers to sequences of dependent random variables. They are the foundation for hidden Markov models, Markov decision processes, Markov chain Monte Carlo sampling, the Google PageRank algorithm, n-gram language models, queuing theory, population genetics, and a long list of recent constructions including diffusion-based generative models. Almost every problem in reinforcement learning is, at heart, a problem about a Markov chain or its decision-making cousin.
Andrey Markov (1856 to 1922) studied under Pafnuty Chebyshev at Saint Petersburg University and worked on probability and number theory. Chebyshev had proved a law of large numbers for independent random variables; Markov set out to show that the future need only be conditionally independent of the past given the present. The 1906 paper "Расширение закона больших чисел на величины, зависящие друг от друга" ("Extension of the law of large numbers to dependent variables") established this generalisation and is the founding document of Markov-chain theory.
Markov was responding in part to a public dispute with Pavel Nekrasov, a Moscow University mathematician who had argued on theological grounds that the law of large numbers required statistical independence and therefore proved the existence of free will. Markov constructed his chains specifically to refute this: here was a class of dependent processes for which the law of large numbers still held.
In 1913 Markov gave a now-famous application of his theory at an address to the Imperial Academy of Sciences in Saint Petersburg. He took the first 20,000 letters of Alexander Pushkin's verse novel Eugene Onegin, stripped out the spaces and punctuation, and classified each letter as a vowel or a consonant. About 43 percent of the letters were vowels and 57 percent consonants. Counting transitions, he found 1,104 vowel-to-vowel pairs, 3,827 consonant-to-consonant pairs, and 15,069 mixed pairs. The transition counts departed clearly from what one would expect under independence, which Markov used to estimate the parameters of a two-state chain on the alphabet. The Eugene Onegin study was one of the first probabilistic analyses of natural language. Claude Shannon would draw directly on Markov's framework in his 1948 paper A Mathematical Theory of Communication, generating English-like text from n-gram statistics.
Andrey Kolmogorov introduced the continuous-time framework in 1931; Sydney Chapman had independently derived the equation bearing both their names in 1928. Postwar work generalised the theory further: Kemeny and Snell's 1960 monograph Finite Markov Chains gave the modern textbook treatment of finite chains; Doob's Stochastic Processes (1953) embedded Markov processes in the broader theory of measurable processes; texts by Norris (1997), Grimmett and Stirzaker (2001), and Levin, Peres, and Wilmer (2008) became standard graduate references.
A discrete-time Markov chain is specified by a state space S, a transition function p(i, j), and an initial distribution mu_0. When S is finite or countable, the transition probabilities form a matrix P with P_{ij} = P(X_{n+1} = j | X_n = i). Each row of P is non-negative and sums to one, which makes P a row-stochastic matrix. The distribution of X_n is mu_0 P^n. The (i, j) entry of P^n is the probability of moving from i to j in exactly n steps, and the relation generalises to the Chapman-Kolmogorov equation P^{m+n} = P^m P^n.
Three illustrative examples: a two-state weather chain (P_{ss} = 0.8, P_{sr} = 0.2, P_{rs} = 0.5, P_{rr} = 0.5) settles to a long-run distribution roughly 71 percent sunny; the simple random walk on Z with P_{i, i+1} = p, P_{i, i-1} = 1 - p is recurrent if and only if p = 1/2; the Ehrenfest urn model places N balls in two boxes and moves a uniformly chosen ball at each step, giving a birth-death chain on {0, 1, ..., N} with reflecting boundaries. For a finite chain, P is a finite-dimensional linear operator and most questions reduce to linear algebra.
A continuous-time Markov chain is a process (X_t){t >= 0} on a discrete state space S such that, for s < t, the conditional distribution of X_t given the history up to s depends only on X_s. It is described by a generator matrix Q whose off-diagonal entries q{ij} are non-negative transition rates from i to j and whose diagonal entries q_{ii} = -sum_{j != i} q_{ij} make rows sum to zero.
The holding time in state i is exponentially distributed with rate -q_{ii}, and on leaving state i the chain jumps to j with probability q_{ij} / (-q_{ii}). The transition semigroup P(t) satisfies Kolmogorov's forward and backward equations dP/dt = PQ and dP/dt = QP, with solution P(t) = exp(Qt) for finite state spaces. Continuous-time chains are the natural setting for queuing theory (M/M/1 and M/M/c queues), reliability analysis, and continuous-time models of molecular evolution.
Two states i and j communicate if each can be reached from the other in finitely many steps; communication is an equivalence relation whose classes are called communicating classes. A chain is irreducible if there is only one communicating class.
| Property | Definition | Class property? |
|---|---|---|
| Communicating | i and j can each reach the other | Yes (defines classes) |
| Recurrent | Chain returns to state i with probability 1 | Yes |
| Transient | Positive probability of never returning | Yes |
| Positive recurrent | Expected return time to i is finite | Yes |
| Null recurrent | Recurrent but expected return time is infinite | Yes |
| Periodic with period d | Returns to i possible only at multiples of d > 1 | Yes |
| Aperiodic | Period equals 1 | Yes |
| Absorbing | P_{ii} = 1, no transitions out of i | Trivially |
| Ergodic | Aperiodic and positive recurrent | Yes |
For a finite chain every state is transient or positive recurrent, and a finite irreducible chain is automatically positive recurrent. For infinite chains the picture is richer. The simple symmetric random walk on Z is null recurrent: it returns with probability 1 but the expected time to return is infinite. The asymmetric random walk on Z (p not equal to 1/2) is transient and drifts to infinity.
George Polya (1921) proved that the simple symmetric random walk on Z^d is recurrent for d = 1 and d = 2 and transient for d >= 3. The dimensional cutoff is sometimes summarised by the phrase "a drunk man finds his way home but a drunk bird may get lost forever".
A distribution pi on S is stationary (or invariant) for the chain with transition matrix P if pi P = pi, that is sum_i pi_i P_{ij} = pi_j for every j. Equivalently, pi is a left eigenvector of P with eigenvalue 1. If the chain starts in pi it remains in pi forever: P(X_n = j) = pi_j for every n.
The central convergence result is the ergodic theorem: for an irreducible, aperiodic, positive recurrent chain on a countable state space, the stationary distribution pi exists and is unique, and for every initial distribution mu_0 the distribution mu_0 P^n converges to pi in total variation distance. For finite chains, existence of pi follows from the Perron-Frobenius theorem: an irreducible non-negative matrix has a largest eigenvalue of multiplicity one with a strictly positive eigenvector.
A chain is reversible with respect to pi if the detailed balance equations pi_i P_{ij} = pi_j P_{ji} hold for every pair (i, j). Detailed balance is sufficient (but not necessary) for stationarity: sum the equations over i to get pi P = pi. The local equations are usually much easier to solve than the global stationarity equations, and almost every MCMC algorithm constructs a reversible chain whose stationary distribution is the target.
For a finite irreducible chain there is also a clean characterisation: the stationary probability of state i equals 1 / E_i[T_i], where T_i is the first return time to i (the Kac formula).
The ergodic theorem says mu_0 P^n converges to pi but not how fast. Quantifying the speed of convergence is the subject of mixing-time theory, an active area since the 1980s. The standard measure of distance to stationarity is the total variation distance d(n) = max_{mu_0} || mu_0 P^n - pi ||_TV; the mixing time with tolerance epsilon is t_mix(epsilon) = min{n : d(n) <= epsilon}, with epsilon = 1/4 a common choice.
For a reversible chain the eigenvalues of P are real and lie in [-1, 1], and the rate of convergence is governed by the second-largest eigenvalue in absolute value, lambda_. The spectral gap gamma = 1 - lambda_ bounds the mixing time: roughly t_mix is on the order of (1/gamma) log(1/pi_min), where pi_min is the smallest stationary probability. Lower bounds use conductance (the Cheeger inequality) and canonical paths.
The cutoff phenomenon, studied since the work of Aldous, Diaconis, and Shahshahani in the early 1980s, occurs in many natural families of chains: the distance to stationarity stays close to 1 for a long time and then drops abruptly to nearly 0 in a window much shorter than the mixing time itself. The classical example is the riffle-shuffle Markov chain, where seven shuffles suffice to mix a 52-card deck (Bayer and Diaconis, 1992).
For a state j the hitting time T_j is the first time the chain visits j; hitting times satisfy a linear system solvable efficiently for finite chains. For an absorbing chain, arrange P = [[Q, R], [0, I]] with Q the transition matrix among transient states and R from transient to absorbing states. The fundamental matrix N = (I - Q)^{-1} has entry N_{ij} equal to the expected number of visits to transient state j starting from i, and absorption probabilities are given by NR. The classical gambler's ruin, in which a gambler with fortune k bets on fair coin tosses until reaching 0 (ruin) or N (target), is the canonical example: the probability of reaching N before 0 is k / N for fair play.
| Chain | State space | Transition rule | Notable property |
|---|---|---|---|
| Random walk on Z | Integers | +1 or -1 each step | Recurrent if symmetric, transient otherwise |
| Random walk on Z^2 | Integer lattice in plane | +-1 in a coordinate | Null recurrent (Polya) |
| Random walk on Z^d, d >= 3 | Integer lattice | +-1 in a coordinate | Transient (Polya) |
| Gambler's ruin | {0, 1, ..., N} | +-1, absorbed at 0 and N | Absorption probability = k / N when fair |
| Ehrenfest urn | {0, 1, ..., N} | Move random ball | Reversible, stationary is binomial(N, 1/2) |
| Birth-death chain | Z or {0, 1, 2, ...} | +-1 with state-dependent rates | Always reversible |
| Branching process (Galton-Watson) | Non-negative integers | Each individual has random offspring | Extinct iff mean offspring <= 1 |
| Wright-Fisher model | {0, 1, ..., N} | Genetic drift in haploid population | Absorbing chain; eventually fixates |
| PageRank random surfer | Web pages | Follow link or teleport | Stationary distribution is PageRank |
| Discrete card-shuffling chains | Permutations | Apply random shuffle | Mixing-time poster children |
The Ehrenfest urn was introduced by Paul and Tatiana Ehrenfest in 1907 to clarify the apparent paradox between microscopic reversibility and macroscopic entropy increase.
| Area | Markov-chain construction | Notes |
|---|---|---|
| Hidden Markov models | Latent state is a Markov chain; observations are emissions | Speech recognition (HTK era), POS tagging, gene-finding |
| Markov decision processes | Chain plus actions and rewards | Foundation of reinforcement learning |
| Partially observable MDPs | MDP with hidden state | Robotics under uncertainty, dialogue systems |
| MCMC sampling | Construct chain whose stationary is the target | Markov Chain Monte Carlo (MCMC) for Bayesian inference |
| PageRank | Random surfer on the web graph | Original Google ranking algorithm |
| n-gram language models | Words form a Markov chain of order n - 1 | Pre-neural NLP standard |
| Diffusion generative models | Forward chain adds noise; reverse chain learned | DDPM (Ho et al., 2020); extends Sohl-Dickstein et al. (2015) |
| Particle filters | Sequential Monte Carlo on a state-space chain | Tracking, robot localisation |
| Markov random fields | Undirected analogue with local conditional independence | Image denoising, segmentation |
| Wright-Fisher and coalescent | Chains modelling genetic drift | Population genetics |
| M/M/1 and M/M/c queues | Continuous-time birth-death chains | Performance analysis, networking |
| Reliability theory | Chains over component states | Systems engineering |
These are variations on a common theme: model the system as a Markov chain (possibly with actions, rewards, or partial observations) and then analyse it. Most stochastic processes can be rendered Markovian by enlarging the state space to include enough of the recent history.
Markov chain Monte Carlo (MCMC) is the application that has done the most to spread Markov-chain theory beyond probabilists. The idea, originating with Metropolis and others at Los Alamos in the early 1950s and generalised by Hastings (1970), is to construct a Markov chain whose stationary distribution is some target pi (often the posterior in a Bayesian model) and run the chain long enough to draw approximate samples. The Gibbs sampler (Geman and Geman, 1984) and the Metropolis-Hastings algorithm are the workhorses; Hamiltonian Monte Carlo (Duane et al., 1987) and the No-U-Turn Sampler (Hoffman and Gelman, 2014) underlie Stan and PyMC. Robert and Casella's Monte Carlo Statistical Methods (2004) is the standard reference. Almost every MCMC algorithm relies on detailed balance to certify that the chain has the right stationary distribution.
Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd introduced PageRank in 1998 as the ranking core of the Google search engine. The algorithm models a "random surfer" who at each step either follows a uniformly random outgoing link from the current page (with probability d, the damping factor, typically 0.85) or teleports to a uniformly random page (with probability 1 - d). This is a Markov chain on the set of web pages. Teleportation ensures the chain is irreducible and aperiodic, so the stationary distribution exists and is unique by the ergodic theorem; that stationary distribution is PageRank. Computationally PageRank reduces to power iteration on a matrix with billions of rows.
A Markov decision process generalises a Markov chain by allowing an agent to choose actions at each state that influence the transition probabilities and produce a reward. Richard Bellman introduced MDPs in 1957 in the context of dynamic programming; Howard's 1960 monograph Dynamic Programming and Markov Processes gave the first textbook treatment. Sutton and Barto's Reinforcement Learning: An Introduction (1998, second edition 2018) made MDPs the central object of modern RL. Value iteration, policy iteration, Q-learning, SARSA, actor-critic, and PPO are all algorithms for learning or computing optimal policies on an MDP. When the state is only partially observed, the model becomes a partially observable MDP (POMDP), studied since Astrom (1965) and Smallwood and Sondik (1973).
Diffusion-based generative models, which now dominate text-to-image generation, are built on Markov chains in time. Sohl-Dickstein, Weiss, Maheswaranathan, and Ganguli (2015) introduced deep unsupervised learning using nonequilibrium thermodynamics, framing image generation as the reverse of a Markov chain that gradually adds Gaussian noise to data. Ho, Jain, and Abbeel (2020) refined this into the Denoising Diffusion Probabilistic Model (DDPM), which trains a neural network to predict the noise added at each step of the forward chain and then runs the learned reverse chain to sample new images. Both forward (noising) and reverse (denoising) processes are explicit Markov chains in the time index; DDPM's loss derives from a variational bound on the Markov-chain likelihood. Score-based generative models (Song and Ermon, 2019; Song et al., 2021) are equivalent in continuous time and connect the construction to stochastic differential equations.
The entropy rate of a stationary Markov chain is H(X_1 | X_0) = -sum_i pi_i sum_j P_{ij} log P_{ij}, the average uncertainty about the next state given the current one; this is the long-run information rate of the chain in the sense of Shannon. The asymptotic equipartition property and the Shannon-McMillan-Breiman theorem extend from independent sequences to ergodic Markov chains, with the entropy rate playing the role of per-symbol entropy. In data compression, n-gram models (themselves Markov chains over symbols) appear as workhorses of pre-neural language modelling.
Many of the concepts above extend to continuous state spaces, with sums replaced by integrals and matrices replaced by integral kernels. The general framework is the theory of Markov processes, which subsumes:
| Process | State space | Time | Notes |
|---|---|---|---|
| Discrete-time Markov chain | Countable | Discrete | This article's main subject |
| Continuous-time Markov chain | Countable | Continuous | Generator matrix Q |
| Brownian motion | Real numbers (R or R^d) | Continuous | Limit of random walks; foundational continuous-time process |
| Diffusion process | R^d | Continuous | Solution of an SDE; continuous-state generalisation |
| Levy process | R^d | Continuous | Independent stationary increments; includes jumps |
| Markov process on Polish space | Polish | Discrete or continuous | General theoretical framework |
Brownian motion, introduced by Einstein (1905) and rigorously constructed by Wiener (1923), is the canonical continuous-state continuous-time Markov process and is the limit of suitably scaled simple random walks. Stochastic differential equations dX_t = b(X_t) dt + sigma(X_t) dW_t define diffusion processes, which are Markov by construction. Levy processes (Levy 1937) generalise further to allow jumps.
| Concept | State | Time | Hidden? | Actions? | Notes |
|---|---|---|---|---|---|
| Markov chain | Discrete | Discrete | Observed | None | Defining object |
| Continuous-time Markov chain | Discrete | Continuous | Observed | None | Generator Q |
| Hidden Markov model | Discrete | Discrete | Hidden | None | Observation conditional on state |
| Markov decision process (MDP) | Discrete (typically) | Discrete | Observed | Yes | Adds rewards and actions |
| Partially observable MDP (POMDP) | Discrete | Discrete | Hidden | Yes | Belief-state MDP under the hood |
| Markov random field | Graph nodes | None | Observed | None | Undirected graphical model |
| Brownian motion | Continuous (R^d) | Continuous | Observed | None | Continuous-state Markov process |
| Diffusion model | Continuous (R^d) | Discrete (typically) | Observed | None | Markov chain in time used for generation |
| Independent sequence | Any | Discrete | Observed | None | Special case (memoryless beyond zeroth order) |
The Markov chain sits at the centre of this taxonomy, with related models extending it: hide the state, add actions, take a continuous-time or continuous-state limit, or drop the directionality. Almost every probabilistic model used in modern machine learning either is a Markov chain, contains one, or is a close graphical-model relative.
| Library or tool | Language | Capabilities |
|---|---|---|
| numpy / scipy | Python | Transition-matrix arithmetic, eigendecompositions |
| hmmlearn | Python | Discrete and Gaussian HMMs |
| pomegranate | Python | HMMs, Bayesian networks, mixture models |
| markovchain | R | Discrete Markov chains, stationary distributions |
| depmixS4 | R | HMMs and dependent mixture models |
| Stan | C++ / R / Python | Hamiltonian Monte Carlo |
| PyMC | Python | NUTS, Metropolis, Gibbs sampling |
| NumPyro | Python (JAX) | NUTS, Hamiltonian Monte Carlo |
| PRISM | Standalone | Probabilistic model checker for DTMCs and CTMCs |
| Stable-Baselines3, RLlib, Tianshou | Python | Reinforcement learning over MDPs |
| Diffusers (Hugging Face) | Python | Implementation of diffusion-model Markov chains |
For small chains numpy is sufficient; for continuous-time chains scipy.linalg computes P(t) = exp(Qt) directly; for very large chains (as in PageRank) sparse matrices and the power method are standard.
| Result | Year | What it says |
|---|---|---|
| Markov's law of large numbers for dependent variables | 1906 | The LLN extends to chains, not just independent sequences |
| Perron-Frobenius theorem | 1907 (Perron), 1912 (Frobenius) | Existence and uniqueness of the principal eigenvector for non-negative matrices |
| Polya's recurrence theorem | 1921 | Simple symmetric random walk is recurrent for d <= 2, transient for d >= 3 |
| Chapman-Kolmogorov equation | 1928 / 1931 | Composition law P^{m+n} = P^m P^n for transition kernels |
| Doeblin's coupling | 1938 | Quantitative ergodic theorem via coupling |
| Birkhoff's ergodic theorem | 1931 | Time averages equal space averages for ergodic systems |
| Kemeny-Snell fundamental matrix | 1960 | N = (I - Q)^{-1} for absorbing chains |
| Coupling from the past | 1996 (Propp and Wilson) | Exact sampling from the stationary distribution |
| Cutoff phenomenon for shuffling | 1992 (Bayer and Diaconis) | Seven riffle shuffles suffice for a 52-card deck |
| Markov chain central limit theorem | various, 20th century | Gaussian fluctuations of time averages for ergodic chains |