Markov chain

A Markov chain is a stochastic process in which the conditional probability of the next state depends only on the current state and not on the sequence of states that preceded it. Formally, a discrete-time Markov chain is a sequence of random variables X_0, X_1, X_2, ... taking values in a state space S such that P(X_{n+1} = x | X_0, X_1, ..., X_n) = P(X_{n+1} = x | X_n) for every n and every x. This conditional independence is called the Markov property, and it is the defining feature that makes the family tractable. A Markov chain is the simplest non-trivial generalisation of an independent and identically distributed sequence, allowing memory of length one while preserving most of the analytic structure of independent sequences.

Markov chains were introduced by the Russian mathematician Andrey Markov in a 1906 paper that extended the law of large numbers to sequences of dependent random variables. They are the foundation for hidden Markov models, Markov decision processes, Markov chain Monte Carlo sampling, the Google PageRank algorithm, n-gram language models, queuing theory, population genetics, and a long list of recent constructions including diffusion-based generative models. Almost every problem in reinforcement learning is, at heart, a problem about a Markov chain or its decision-making cousin.

History

Andrey Markov (1856 to 1922) studied under Pafnuty Chebyshev at Saint Petersburg University and worked on probability and number theory. Chebyshev had proved a law of large numbers for independent random variables; Markov set out to show that the future need only be conditionally independent of the past given the present. The 1906 paper "Расширение закона больших чисел на величины, зависящие друг от друга" ("Extension of the law of large numbers to dependent variables") established this generalisation and is the founding document of Markov-chain theory.

Markov was responding in part to a public dispute with Pavel Nekrasov, a Moscow University mathematician who had argued on theological grounds that the law of large numbers required statistical independence and therefore proved the existence of free will. Markov constructed his chains specifically to refute this: here was a class of dependent processes for which the law of large numbers still held.

In 1913 Markov gave a now-famous application of his theory at an address to the Imperial Academy of Sciences in Saint Petersburg. He took the first 20,000 letters of Alexander Pushkin's verse novel Eugene Onegin, stripped out the spaces and punctuation, and classified each letter as a vowel or a consonant. About 43 percent of the letters were vowels and 57 percent consonants. Counting transitions, he found 1,104 vowel-to-vowel pairs, 3,827 consonant-to-consonant pairs, and 15,069 mixed pairs. The transition counts departed clearly from what one would expect under independence, which Markov used to estimate the parameters of a two-state chain on the alphabet. The Eugene Onegin study was one of the first probabilistic analyses of natural language. Claude Shannon would draw directly on Markov's framework in his 1948 paper A Mathematical Theory of Communication, generating English-like text from n-gram statistics.

Andrey Kolmogorov introduced the continuous-time framework in 1931; Sydney Chapman had independently derived the equation bearing both their names in 1928. Postwar work generalised the theory further: Kemeny and Snell's 1960 monograph Finite Markov Chains gave the modern textbook treatment of finite chains; Doob's Stochastic Processes (1953) embedded Markov processes in the broader theory of measurable processes; texts by Norris (1997), Grimmett and Stirzaker (2001), and Levin, Peres, and Wilmer (2008) became standard graduate references.

Discrete-time Markov chains

A discrete-time Markov chain is specified by a state space S, a transition function p(i, j), and an initial distribution mu_0. When S is finite or countable, the transition probabilities form a matrix P with P_{ij} = P(X_{n+1} = j | X_n = i). Each row of P is non-negative and sums to one, which makes P a row-stochastic matrix. The distribution of X_n is mu_0 P^n. The (i, j) entry of P^n is the probability of moving from i to j in exactly n steps, and the relation generalises to the Chapman-Kolmogorov equation P^{m+n} = P^m P^n.

Three illustrative examples: a two-state weather chain (P_{ss} = 0.8, P_{sr} = 0.2, P_{rs} = 0.5, P_{rr} = 0.5) settles to a long-run distribution roughly 71 percent sunny; the simple random walk on Z with P_{i, i+1} = p, P_{i, i-1} = 1 - p is recurrent if and only if p = 1/2; the Ehrenfest urn model places N balls in two boxes and moves a uniformly chosen ball at each step, giving a birth-death chain on {0, 1, ..., N} with reflecting boundaries. For a finite chain, P is a finite-dimensional linear operator and most questions reduce to linear algebra.

Continuous-time Markov chains

A continuous-time Markov chain is a process (X_t){t >= 0} on a discrete state space S such that, for s < t, the conditional distribution of X_t given the history up to s depends only on X_s. It is described by a generator matrix Q whose off-diagonal entries q{ij} are non-negative transition rates from i to j and whose diagonal entries q_{ii} = -sum_{j != i} q_{ij} make rows sum to zero.

The holding time in state i is exponentially distributed with rate -q_{ii}, and on leaving state i the chain jumps to j with probability q_{ij} / (-q_{ii}). The transition semigroup P(t) satisfies Kolmogorov's forward and backward equations dP/dt = PQ and dP/dt = QP, with solution P(t) = exp(Qt) for finite state spaces. Continuous-time chains are the natural setting for queuing theory (M/M/1 and M/M/c queues), reliability analysis, and continuous-time models of molecular evolution.

Classification of states

Two states i and j communicate if each can be reached from the other in finitely many steps; communication is an equivalence relation whose classes are called communicating classes. A chain is irreducible if there is only one communicating class.

Property	Definition	Class property?
Communicating	i and j can each reach the other	Yes (defines classes)
Recurrent	Chain returns to state i with probability 1	Yes
Transient	Positive probability of never returning	Yes
Positive recurrent	Expected return time to i is finite	Yes
Null recurrent	Recurrent but expected return time is infinite	Yes
Periodic with period d	Returns to i possible only at multiples of d > 1	Yes
Aperiodic	Period equals 1	Yes
Absorbing	P_{ii} = 1, no transitions out of i	Trivially
Ergodic	Aperiodic and positive recurrent	Yes

For a finite chain every state is transient or positive recurrent, and a finite irreducible chain is automatically positive recurrent. For infinite chains the picture is richer. The simple symmetric random walk on Z is null recurrent: it returns with probability 1 but the expected time to return is infinite. The asymmetric random walk on Z (p not equal to 1/2) is transient and drifts to infinity.

George Polya (1921) proved that the simple symmetric random walk on Z^d is recurrent for d = 1 and d = 2 and transient for d >= 3. The dimensional cutoff is sometimes summarised by the phrase "a drunk man finds his way home but a drunk bird may get lost forever".

Stationary distribution

A distribution pi on S is stationary (or invariant) for the chain with transition matrix P if pi P = pi, that is sum_i pi_i P_{ij} = pi_j for every j. Equivalently, pi is a left eigenvector of P with eigenvalue 1. If the chain starts in pi it remains in pi forever: P(X_n = j) = pi_j for every n.

The central convergence result is the ergodic theorem: for an irreducible, aperiodic, positive recurrent chain on a countable state space, the stationary distribution pi exists and is unique, and for every initial distribution mu_0 the distribution mu_0 P^n converges to pi in total variation distance. For finite chains, existence of pi follows from the Perron-Frobenius theorem: an irreducible non-negative matrix has a largest eigenvalue of multiplicity one with a strictly positive eigenvector.

A chain is reversible with respect to pi if the detailed balance equations pi_i P_{ij} = pi_j P_{ji} hold for every pair (i, j). Detailed balance is sufficient (but not necessary) for stationarity: sum the equations over i to get pi P = pi. The local equations are usually much easier to solve than the global stationarity equations, and almost every MCMC algorithm constructs a reversible chain whose stationary distribution is the target.

For a finite irreducible chain there is also a clean characterisation: the stationary probability of state i equals 1 / E_i[T_i], where T_i is the first return time to i (the Kac formula).

Convergence and mixing times

The ergodic theorem says mu_0 P^n converges to pi but not how fast. Quantifying the speed of convergence is the subject of mixing-time theory, an active area since the 1980s. The standard measure of distance to stationarity is the total variation distance d(n) = max_{mu_0} || mu_0 P^n - pi ||_TV; the mixing time with tolerance epsilon is t_mix(epsilon) = min{n : d(n) <= epsilon}, with epsilon = 1/4 a common choice.

For a reversible chain the eigenvalues of P are real and lie in [-1, 1], and the rate of convergence is governed by the second-largest eigenvalue in absolute value, lambda_. The spectral gap gamma = 1 - lambda_ bounds the mixing time: roughly t_mix is on the order of (1/gamma) log(1/pi_min), where pi_min is the smallest stationary probability. Lower bounds use conductance (the Cheeger inequality) and canonical paths.

The cutoff phenomenon, studied since the work of Aldous, Diaconis, and Shahshahani in the early 1980s, occurs in many natural families of chains: the distance to stationarity stays close to 1 for a long time and then drops abruptly to nearly 0 in a window much shorter than the mixing time itself. The classical example is the riffle-shuffle Markov chain, where seven shuffles suffice to mix a 52-card deck (Bayer and Diaconis, 1992).

Hitting times and absorption probabilities

For a state j the hitting time T_j is the first time the chain visits j; hitting times satisfy a linear system solvable efficiently for finite chains. For an absorbing chain, arrange P = [[Q, R], [0, I]] with Q the transition matrix among transient states and R from transient to absorbing states. The fundamental matrix N = (I - Q)^{-1} has entry N_{ij} equal to the expected number of visits to transient state j starting from i, and absorption probabilities are given by NR. The classical gambler's ruin, in which a gambler with fortune k bets on fair coin tosses until reaching 0 (ruin) or N (target), is the canonical example: the probability of reaching N before 0 is k / N for fair play.

Examples

Chain	State space	Transition rule	Notable property
Random walk on Z	Integers	+1 or -1 each step	Recurrent if symmetric, transient otherwise
Random walk on Z^2	Integer lattice in plane	+-1 in a coordinate	Null recurrent (Polya)
Random walk on Z^d, d >= 3	Integer lattice	+-1 in a coordinate	Transient (Polya)
Gambler's ruin	{0, 1, ..., N}	+-1, absorbed at 0 and N	Absorption probability = k / N when fair
Ehrenfest urn	{0, 1, ..., N}	Move random ball	Reversible, stationary is binomial(N, 1/2)
Birth-death chain	Z or {0, 1, 2, ...}	+-1 with state-dependent rates	Always reversible
Branching process (Galton-Watson)	Non-negative integers	Each individual has random offspring	Extinct iff mean offspring <= 1
Wright-Fisher model	{0, 1, ..., N}	Genetic drift in haploid population	Absorbing chain; eventually fixates
PageRank random surfer	Web pages	Follow link or teleport	Stationary distribution is PageRank
Discrete card-shuffling chains	Permutations	Apply random shuffle	Mixing-time poster children

The Ehrenfest urn was introduced by Paul and Tatiana Ehrenfest in 1907 to clarify the apparent paradox between microscopic reversibility and macroscopic entropy increase.

Applications in AI and machine learning

Area	Markov-chain construction	Notes
Hidden Markov models	Latent state is a Markov chain; observations are emissions	Speech recognition (HTK era), POS tagging, gene-finding
Markov decision processes	Chain plus actions and rewards	Foundation of reinforcement learning
Partially observable MDPs	MDP with hidden state	Robotics under uncertainty, dialogue systems
MCMC sampling	Construct chain whose stationary is the target	Markov Chain Monte Carlo (MCMC) for Bayesian inference
PageRank	Random surfer on the web graph	Original Google ranking algorithm
n-gram language models	Words form a Markov chain of order n - 1	Pre-neural NLP standard
Diffusion generative models	Forward chain adds noise; reverse chain learned	DDPM (Ho et al., 2020); extends Sohl-Dickstein et al. (2015)
Particle filters	Sequential Monte Carlo on a state-space chain	Tracking, robot localisation
Markov random fields	Undirected analogue with local conditional independence	Image denoising, segmentation
Wright-Fisher and coalescent	Chains modelling genetic drift	Population genetics
M/M/1 and M/M/c queues	Continuous-time birth-death chains	Performance analysis, networking
Reliability theory	Chains over component states	Systems engineering

These are variations on a common theme: model the system as a Markov chain (possibly with actions, rewards, or partial observations) and then analyse it. Most stochastic processes can be rendered Markovian by enlarging the state space to include enough of the recent history.

MCMC and Bayesian inference

Markov chain Monte Carlo (MCMC) is the application that has done the most to spread Markov-chain theory beyond probabilists. The idea, originating with Metropolis and others at Los Alamos in the early 1950s and generalised by Hastings (1970), is to construct a Markov chain whose stationary distribution is some target pi (often the posterior in a Bayesian model) and run the chain long enough to draw approximate samples. The Gibbs sampler (Geman and Geman, 1984) and the Metropolis-Hastings algorithm are the workhorses; Hamiltonian Monte Carlo (Duane et al., 1987) and the No-U-Turn Sampler (Hoffman and Gelman, 2014) underlie Stan and PyMC. Robert and Casella's Monte Carlo Statistical Methods (2004) is the standard reference. Almost every MCMC algorithm relies on detailed balance to certify that the chain has the right stationary distribution.

PageRank

Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd introduced PageRank in 1998 as the ranking core of the Google search engine. The algorithm models a "random surfer" who at each step either follows a uniformly random outgoing link from the current page (with probability d, the damping factor, typically 0.85) or teleports to a uniformly random page (with probability 1 - d). This is a Markov chain on the set of web pages. Teleportation ensures the chain is irreducible and aperiodic, so the stationary distribution exists and is unique by the ergodic theorem; that stationary distribution is PageRank. Computationally PageRank reduces to power iteration on a matrix with billions of rows.

Reinforcement learning and MDPs

A Markov decision process generalises a Markov chain by allowing an agent to choose actions at each state that influence the transition probabilities and produce a reward. Richard Bellman introduced MDPs in 1957 in the context of dynamic programming; Howard's 1960 monograph Dynamic Programming and Markov Processes gave the first textbook treatment. Sutton and Barto's Reinforcement Learning: An Introduction (1998, second edition 2018) made MDPs the central object of modern RL. Value iteration, policy iteration, Q-learning, SARSA, actor-critic, and PPO are all algorithms for learning or computing optimal policies on an MDP. When the state is only partially observed, the model becomes a partially observable MDP (POMDP), studied since Astrom (1965) and Smallwood and Sondik (1973).

Diffusion models

Diffusion-based generative models, which now dominate text-to-image generation, are built on Markov chains in time. Sohl-Dickstein, Weiss, Maheswaranathan, and Ganguli (2015) introduced deep unsupervised learning using nonequilibrium thermodynamics, framing image generation as the reverse of a Markov chain that gradually adds Gaussian noise to data. Ho, Jain, and Abbeel (2020) refined this into the Denoising Diffusion Probabilistic Model (DDPM), which trains a neural network to predict the noise added at each step of the forward chain and then runs the learned reverse chain to sample new images. Both forward (noising) and reverse (denoising) processes are explicit Markov chains in the time index; DDPM's loss derives from a variational bound on the Markov-chain likelihood. Score-based generative models (Song and Ermon, 2019; Song et al., 2021) are equivalent in continuous time and connect the construction to stochastic differential equations.

Connection to information theory

The entropy rate of a stationary Markov chain is H(X_1 | X_0) = -sum_i pi_i sum_j P_{ij} log P_{ij}, the average uncertainty about the next state given the current one; this is the long-run information rate of the chain in the sense of Shannon. The asymptotic equipartition property and the Shannon-McMillan-Breiman theorem extend from independent sequences to ergodic Markov chains, with the entropy rate playing the role of per-symbol entropy. In data compression, n-gram models (themselves Markov chains over symbols) appear as workhorses of pre-neural language modelling.

Continuous-state and continuous-time generalisations

Many of the concepts above extend to continuous state spaces, with sums replaced by integrals and matrices replaced by integral kernels. The general framework is the theory of Markov processes, which subsumes:

Process	State space	Time	Notes
Discrete-time Markov chain	Countable	Discrete	This article's main subject
Continuous-time Markov chain	Countable	Continuous	Generator matrix Q
Brownian motion	Real numbers (R or R^d)	Continuous	Limit of random walks; foundational continuous-time process
Diffusion process	R^d	Continuous	Solution of an SDE; continuous-state generalisation
Levy process	R^d	Continuous	Independent stationary increments; includes jumps
Markov process on Polish space	Polish	Discrete or continuous	General theoretical framework

Brownian motion, introduced by Einstein (1905) and rigorously constructed by Wiener (1923), is the canonical continuous-state continuous-time Markov process and is the limit of suitably scaled simple random walks. Stochastic differential equations dX_t = b(X_t) dt + sigma(X_t) dW_t define diffusion processes, which are Markov by construction. Levy processes (Levy 1937) generalise further to allow jumps.

Concept	State	Time	Hidden?	Actions?	Notes
Markov chain	Discrete	Discrete	Observed	None	Defining object
Continuous-time Markov chain	Discrete	Continuous	Observed	None	Generator Q
Hidden Markov model	Discrete	Discrete	Hidden	None	Observation conditional on state
Markov decision process (MDP)	Discrete (typically)	Discrete	Observed	Yes	Adds rewards and actions
Partially observable MDP (POMDP)	Discrete	Discrete	Hidden	Yes	Belief-state MDP under the hood
Markov random field	Graph nodes	None	Observed	None	Undirected graphical model
Brownian motion	Continuous (R^d)	Continuous	Observed	None	Continuous-state Markov process
Diffusion model	Continuous (R^d)	Discrete (typically)	Observed	None	Markov chain in time used for generation
Independent sequence	Any	Discrete	Observed	None	Special case (memoryless beyond zeroth order)

The Markov chain sits at the centre of this taxonomy, with related models extending it: hide the state, add actions, take a continuous-time or continuous-state limit, or drop the directionality. Almost every probabilistic model used in modern machine learning either is a Markov chain, contains one, or is a close graphical-model relative.

Software

Library or tool	Language	Capabilities
numpy / scipy	Python	Transition-matrix arithmetic, eigendecompositions
hmmlearn	Python	Discrete and Gaussian HMMs
pomegranate	Python	HMMs, Bayesian networks, mixture models
markovchain	R	Discrete Markov chains, stationary distributions
depmixS4	R	HMMs and dependent mixture models
Stan	C++ / R / Python	Hamiltonian Monte Carlo
PyMC	Python	NUTS, Metropolis, Gibbs sampling
NumPyro	Python (JAX)	NUTS, Hamiltonian Monte Carlo
PRISM	Standalone	Probabilistic model checker for DTMCs and CTMCs
Stable-Baselines3, RLlib, Tianshou	Python	Reinforcement learning over MDPs
Diffusers (Hugging Face)	Python	Implementation of diffusion-model Markov chains

For small chains numpy is sufficient; for continuous-time chains scipy.linalg computes P(t) = exp(Qt) directly; for very large chains (as in PageRank) sparse matrices and the power method are standard.

Famous results

Result	Year	What it says
Markov's law of large numbers for dependent variables	1906	The LLN extends to chains, not just independent sequences
Perron-Frobenius theorem	1907 (Perron), 1912 (Frobenius)	Existence and uniqueness of the principal eigenvector for non-negative matrices
Polya's recurrence theorem	1921	Simple symmetric random walk is recurrent for d <= 2, transient for d >= 3
Chapman-Kolmogorov equation	1928 / 1931	Composition law P^{m+n} = P^m P^n for transition kernels
Doeblin's coupling	1938	Quantitative ergodic theorem via coupling
Birkhoff's ergodic theorem	1931	Time averages equal space averages for ergodic systems
Kemeny-Snell fundamental matrix	1960	N = (I - Q)^{-1} for absorbing chains
Coupling from the past	1996 (Propp and Wilson)	Exact sampling from the stationary distribution
Cutoff phenomenon for shuffling	1992 (Bayer and Diaconis)	Seven riffle shuffles suffice for a 52-card deck
Markov chain central limit theorem	various, 20th century	Gaussian fluctuations of time averages for ergodic chains

References

Markov, A. A. (1906). "Распространение закона больших чисел на величины, зависящие друг от друга" (Extension of the law of large numbers to dependent variables). *Izvestiya Fiziko-Matematicheskogo Obshchestva pri Kazanskom Universitete*, 2nd series, 15, 135-156.
Markov, A. A. (1913). "Пример статистического исследования над текстом 'Евгения Онегина', иллюстрирующий связь испытаний в цепь" (An example of statistical investigation of the text Eugene Onegin concerning the connection of samples in chains). *Bulletin of the Imperial Academy of Sciences*, Saint Petersburg.
Norris, J. R. (1997). *Markov Chains*. Cambridge University Press.
Kemeny, J. G., and Snell, J. L. (1960). *Finite Markov Chains*. Van Nostrand; reprinted by Springer.
Grimmett, G. R., and Stirzaker, D. R. (2001). *Probability and Random Processes*, third edition. Oxford University Press.
Levin, D. A., and Peres, Y. (2017). *Markov Chains and Mixing Times*, second edition (with E. L. Wilmer). American Mathematical Society.
Robert, C. P., and Casella, G. (2004). *Monte Carlo Statistical Methods*, second edition. Springer.
Sutton, R. S., and Barto, A. G. (2018). *Reinforcement Learning: An Introduction*, second edition. MIT Press.
Page, L., Brin, S., Motwani, R., and Winograd, T. (1999). "The PageRank Citation Ranking: Bringing Order to the Web". Stanford InfoLab Technical Report 1999-66.
Brin, S., and Page, L. (1998). "The anatomy of a large-scale hypertextual Web search engine". *Computer Networks and ISDN Systems*, 30(1-7), 107-117.
Rabiner, L. R. (1989). "A tutorial on hidden Markov models and selected applications in speech recognition". *Proceedings of the IEEE*, 77(2), 257-286.
Bellman, R. (1957). *Dynamic Programming*. Princeton University Press.
Howard, R. A. (1960). *Dynamic Programming and Markov Processes*. MIT Press.
Hastings, W. K. (1970). "Monte Carlo sampling methods using Markov chains and their applications". *Biometrika*, 57(1), 97-109.
Geman, S., and Geman, D. (1984). "Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images". *IEEE Transactions on Pattern Analysis and Machine Intelligence*, 6(6), 721-741.
Sohl-Dickstein, J., Weiss, E. A., Maheswaranathan, N., and Ganguli, S. (2015). "Deep unsupervised learning using nonequilibrium thermodynamics". *Proceedings of the 32nd International Conference on Machine Learning*.
Ho, J., Jain, A., and Abbeel, P. (2020). "Denoising diffusion probabilistic models". *Advances in Neural Information Processing Systems* 33.
Propp, J. G., and Wilson, D. B. (1996). "Exact sampling with coupled Markov chains and applications to statistical mechanics". *Random Structures and Algorithms*, 9(1-2), 223-252.
Bayer, D., and Diaconis, P. (1992). "Trailing the dovetail shuffle to its lair". *Annals of Applied Probability*, 2(2), 294-313.
Polya, G. (1921). "Uber eine Aufgabe der Wahrscheinlichkeitsrechnung betreffend die Irrfahrt im Strassennetz". *Mathematische Annalen*, 84, 149-160.
Shannon, C. E. (1948). "A mathematical theory of communication". *Bell System Technical Journal*, 27, 379-423 and 623-656.
Doob, J. L. (1953). *Stochastic Processes*. Wiley.
Kolmogorov, A. N. (1931). "Uber die analytischen Methoden in der Wahrscheinlichkeitsrechnung". *Mathematische Annalen*, 104, 415-458.
Hayes, B. (2013). "First Links in the Markov Chain". *American Scientist*, 101(2), 92.

History

Discrete-time Markov chains

Continuous-time Markov chains

Classification of states

Stationary distribution

Convergence and mixing times

Hitting times and absorption probabilities

Examples

Applications in AI and machine learning

MCMC and Bayesian inference

PageRank

Reinforcement learning and MDPs

Diffusion models

Connection to information theory

Continuous-state and continuous-time generalisations

Comparison with related concepts

Software

Famous results

References

Improve this article

Related Articles

Markov property

Statistical learning theory

Bayesian statistics

Inductive bias

History

Discrete-time Markov chains

Continuous-time Markov chains

Classification of states

Stationary distribution

Convergence and mixing times

Hitting times and absorption probabilities

Examples

Applications in AI and machine learning

MCMC and Bayesian inference

PageRank

Reinforcement learning and MDPs

Diffusion models

Connection to information theory

Continuous-state and continuous-time generalisations

Comparison with related concepts

Software

Famous results

References

Related Articles

Markov property

Statistical learning theory

Bayesian statistics

Inductive bias