# Markov chain

> Source: https://aiwiki.ai/wiki/markov_chain
> Updated: 2026-06-27
> Categories: Mathematics, Statistics
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

A **Markov chain** is a stochastic process in which the probability of the next state depends only on the current state and not on the sequence of states that came before it. This defining feature, the conditional independence of the future from the past given the present, is called the [Markov property](/wiki/markov_property), and it is what makes the family analytically tractable [1][3]. Formally, a discrete-time Markov chain is a sequence of random variables X_0, X_1, X_2, ... taking values in a state space S such that P(X_{n+1} = x | X_0, X_1, ..., X_n) = P(X_{n+1} = x | X_n) for every n and every x. A Markov chain is the simplest non-trivial generalisation of an independent and identically distributed sequence: it allows memory of length one while preserving most of the analytic structure of independent sequences. As J. R. Norris opens his standard graduate text, "Markov chains are the simplest mathematical models for random phenomena evolving in time" [3].

Markov chains were introduced by the Russian mathematician Andrey Markov in a 1906 paper that extended the law of large numbers to sequences of dependent random variables [1][24]. They are the foundation for [hidden Markov models](/wiki/hidden_markov_model), [Markov decision processes](/wiki/markov_decision_process_mdp), [Markov chain Monte Carlo](/wiki/markov_chain_monte_carlo) (MCMC) sampling, the Google PageRank algorithm, n-gram language models, queuing theory, population genetics, and a long list of recent constructions including diffusion-based generative models [9][11][16][17]. Almost every problem in [reinforcement learning](/wiki/reinforcement_learning) is, at heart, a problem about a Markov chain or its decision-making cousin, the Markov decision process [8].

## ELI5: what is a Markov chain?

Imagine a board game where the only thing that matters for your next move is the square you are standing on right now, not how you got there. A Markov chain is exactly that kind of "memoryless" rule: the future depends only on where you are, never on the path you took to arrive. Weather that is just "sunny or rainy, and tomorrow depends only on today" is a Markov chain. So is a frog hopping between lily pads at random, or the way a phone keyboard guesses your next word from the current one. Because the rule is so simple, mathematicians can predict the long-run behaviour: what fraction of days are sunny, how long until the frog visits every pad, and where things settle down over time.

## When were Markov chains invented?

Andrey Markov (1856 to 1922) studied under Pafnuty Chebyshev at Saint Petersburg University and worked on probability and number theory. Chebyshev had proved a law of large numbers for independent random variables; Markov set out to show that the future need only be conditionally independent of the past given the present. The 1906 paper "Distribution of the law of large numbers for quantities depending on one another" established this generalisation and is the founding document of Markov-chain theory [1][24].

Markov was responding in part to a public dispute with Pavel Nekrasov, a Moscow University mathematician who had argued on theological grounds that the law of large numbers required statistical independence and therefore proved the existence of free will. Markov constructed his chains specifically to refute this: here was a class of dependent processes for which the law of large numbers still held [24].

In 1913 Markov gave a now-famous application of his theory at an address to the Imperial Academy of Sciences in Saint Petersburg. He took the first 20,000 letters of Alexander Pushkin's verse novel *Eugene Onegin*, stripped out the spaces and punctuation, and classified each letter as a vowel or a consonant. About 43 percent of the letters were vowels and 57 percent consonants. Counting transitions, he found 1,104 vowel-to-vowel pairs, 3,827 consonant-to-consonant pairs, and 15,069 mixed pairs [2][24]. The transition counts departed clearly from what one would expect under independence, which Markov used to estimate the parameters of a two-state chain on the alphabet. The Eugene Onegin study was one of the first probabilistic analyses of natural language. Claude Shannon drew directly on Markov's framework in his 1948 paper *A Mathematical Theory of Communication*, generating English-like text from n-gram statistics [21].

Andrey Kolmogorov introduced the continuous-time framework in 1931; Sydney Chapman had independently derived the equation bearing both their names in 1928 [23]. Postwar work generalised the theory further: Kemeny and Snell's 1960 monograph *Finite Markov Chains* gave the modern textbook treatment of finite chains [4]; Doob's *Stochastic Processes* (1953) embedded Markov processes in the broader theory of measurable processes [22]; texts by Norris (1997), Grimmett and Stirzaker (2001), and Levin, Peres, and Wilmer (2008) became standard graduate references [3][5][6].

## What is the Markov property?

The Markov property is the requirement that, given the present state, the future of the process is conditionally independent of its entire past. In Norris's words, the characteristic property of a Markov process is that it retains "no memory of where it has been in the past" [3]: the past and the future are independent given the present. This single assumption is what reduces an otherwise intractable joint distribution over an infinite sequence to a product of one-step transition probabilities. Enlarging the state space to include enough recent history can make almost any process Markovian, which is why the property is so widely usable in modelling [3][5].

## What is a discrete-time Markov chain?

A discrete-time Markov chain is specified by a state space S, a transition function p(i, j), and an initial distribution mu_0. When S is finite or countable, the transition probabilities form a matrix P with P_{ij} = P(X_{n+1} = j | X_n = i). Each row of P is non-negative and sums to one, which makes P a row-stochastic matrix [3]. The distribution of X_n is mu_0 P^n. The (i, j) entry of P^n is the probability of moving from i to j in exactly n steps, and the relation generalises to the Chapman-Kolmogorov equation P^{m+n} = P^m P^n [23].

Three illustrative examples: a two-state weather chain (P_{ss} = 0.8, P_{sr} = 0.2, P_{rs} = 0.5, P_{rr} = 0.5) settles to a long-run distribution roughly 71 percent sunny; the simple random walk on Z with P_{i, i+1} = p, P_{i, i-1} = 1 - p is recurrent if and only if p = 1/2; the Ehrenfest urn model places N balls in two boxes and moves a uniformly chosen ball at each step, giving a birth-death chain on {0, 1, ..., N} with reflecting boundaries [3]. For a finite chain, P is a finite-dimensional linear operator and most questions reduce to linear algebra.

## What is a continuous-time Markov chain?

A continuous-time Markov chain is a process (X_t)_{t >= 0} on a discrete state space S such that, for s < t, the conditional distribution of X_t given the history up to s depends only on X_s. It is described by a generator matrix Q whose off-diagonal entries q_{ij} are non-negative transition rates from i to j and whose diagonal entries q_{ii} = -sum_{j != i} q_{ij} make rows sum to zero [3].

The holding time in state i is exponentially distributed with rate -q_{ii}, and on leaving state i the chain jumps to j with probability q_{ij} / (-q_{ii}). The transition semigroup P(t) satisfies Kolmogorov's forward and backward equations dP/dt = PQ and dP/dt = QP, with solution P(t) = exp(Qt) for finite state spaces [23]. Continuous-time chains are the natural setting for queuing theory (M/M/1 and M/M/c queues), reliability analysis, and continuous-time models of molecular evolution.

## How are the states of a Markov chain classified?

Two states i and j *communicate* if each can be reached from the other in finitely many steps; communication is an equivalence relation whose classes are called communicating classes. A chain is *irreducible* if there is only one communicating class [3].

| Property | Definition | Class property? |
|----------|------------|-----------------|
| Communicating | i and j can each reach the other | Yes (defines classes) |
| Recurrent | Chain returns to state i with probability 1 | Yes |
| Transient | Positive probability of never returning | Yes |
| Positive recurrent | Expected return time to i is finite | Yes |
| Null recurrent | Recurrent but expected return time is infinite | Yes |
| Periodic with period d | Returns to i possible only at multiples of d > 1 | Yes |
| Aperiodic | Period equals 1 | Yes |
| Absorbing | P_{ii} = 1, no transitions out of i | Trivially |
| Ergodic | Aperiodic and positive recurrent | Yes |

For a finite chain every state is transient or positive recurrent, and a finite irreducible chain is automatically positive recurrent [3]. For infinite chains the picture is richer. The simple symmetric random walk on Z is null recurrent: it returns with probability 1 but the expected time to return is infinite. The asymmetric random walk on Z (p not equal to 1/2) is transient and drifts to infinity.

George Polya (1921) proved that the simple symmetric random walk on Z^d is recurrent for d = 1 and d = 2 and transient for d >= 3 [20]. The dimensional cutoff is sometimes summarised by the phrase "a drunk man finds his way home but a drunk bird may get lost forever".

## What is a stationary distribution?

A distribution pi on S is *stationary* (or invariant) for the chain with transition matrix P if pi P = pi, that is sum_i pi_i P_{ij} = pi_j for every j. Equivalently, pi is a left eigenvector of P with eigenvalue 1. If the chain starts in pi it remains in pi forever: P(X_n = j) = pi_j for every n [3][4].

The central convergence result is the *ergodic theorem*: for an irreducible, aperiodic, positive recurrent chain on a countable state space, the stationary distribution pi exists and is unique, and for every initial distribution mu_0 the distribution mu_0 P^n converges to pi in total variation distance [3][6]. For finite chains, existence of pi follows from the Perron-Frobenius theorem: an irreducible non-negative matrix has a largest eigenvalue of multiplicity one with a strictly positive eigenvector.

A chain is *reversible* with respect to pi if the *detailed balance* equations pi_i P_{ij} = pi_j P_{ji} hold for every pair (i, j). Detailed balance is sufficient (but not necessary) for stationarity: sum the equations over i to get pi P = pi. The local equations are usually much easier to solve than the global stationarity equations, and almost every MCMC algorithm constructs a reversible chain whose stationary distribution is the target [7][14].

For a finite irreducible chain there is also a clean characterisation: the stationary probability of state i equals 1 / E_i[T_i], where T_i is the first return time to i (the *Kac formula*) [6].

## How fast does a Markov chain converge (mixing times)?

The ergodic theorem says mu_0 P^n converges to pi but not how fast. Quantifying the speed of convergence is the subject of *mixing-time theory*, an active area since the 1980s [6]. The standard measure of distance to stationarity is the total variation distance d(n) = max_{mu_0} || mu_0 P^n - pi ||_TV; the *mixing time* with tolerance epsilon is t_mix(epsilon) = min{n : d(n) <= epsilon}, with epsilon = 1/4 a common choice.

For a reversible chain the eigenvalues of P are real and lie in [-1, 1], and the rate of convergence is governed by the second-largest eigenvalue in absolute value, lambda_*. The *spectral gap* gamma = 1 - lambda_* bounds the mixing time: roughly t_mix is on the order of (1/gamma) log(1/pi_min), where pi_min is the smallest stationary probability [6]. Lower bounds use conductance (the Cheeger inequality) and canonical paths.

The *cutoff phenomenon*, studied since the work of Aldous, Diaconis, and Shahshahani in the early 1980s, occurs in many natural families of chains: the distance to stationarity stays close to 1 for a long time and then drops abruptly to nearly 0 in a window much shorter than the mixing time itself [6]. The classical example is the riffle-shuffle Markov chain, where seven shuffles suffice to mix a 52-card deck (Bayer and Diaconis, 1992) [19].

## Hitting times and absorption probabilities

For a state j the *hitting time* T_j is the first time the chain visits j; hitting times satisfy a linear system solvable efficiently for finite chains [3]. For an absorbing chain, arrange P = `[[Q, R], [0, I]]` with Q the transition matrix among transient states and R from transient to absorbing states. The fundamental matrix N = (I - Q)^{-1} has entry N_{ij} equal to the expected number of visits to transient state j starting from i, and absorption probabilities are given by NR [4]. The classical *gambler's ruin*, in which a gambler with fortune k bets on fair coin tosses until reaching 0 (ruin) or N (target), is the canonical example: the probability of reaching N before 0 is k / N for fair play.

## Examples

| Chain | State space | Transition rule | Notable property |
|-------|-------------|-----------------|------------------|
| Random walk on Z | Integers | +1 or -1 each step | Recurrent if symmetric, transient otherwise |
| Random walk on Z^2 | Integer lattice in plane | +-1 in a coordinate | Null recurrent (Polya) |
| Random walk on Z^d, d >= 3 | Integer lattice | +-1 in a coordinate | Transient (Polya) |
| Gambler's ruin | {0, 1, ..., N} | +-1, absorbed at 0 and N | Absorption probability = k / N when fair |
| Ehrenfest urn | {0, 1, ..., N} | Move random ball | Reversible, stationary is binomial(N, 1/2) |
| Birth-death chain | Z or {0, 1, 2, ...} | +-1 with state-dependent rates | Always reversible |
| Branching process (Galton-Watson) | Non-negative integers | Each individual has random offspring | Extinct iff mean offspring <= 1 |
| Wright-Fisher model | {0, 1, ..., N} | Genetic drift in haploid population | Absorbing chain; eventually fixates |
| PageRank random surfer | Web pages | Follow link or teleport | Stationary distribution is PageRank |
| Discrete card-shuffling chains | Permutations | Apply random shuffle | Mixing-time poster children |

The Ehrenfest urn was introduced by Paul and Tatiana Ehrenfest in 1907 to clarify the apparent paradox between microscopic reversibility and macroscopic entropy increase.

## How are Markov chains used in AI and machine learning?

Markov chains underpin a striking share of probabilistic machine learning: hide the state and you get a [hidden Markov model](/wiki/hidden_markov_model); add actions and rewards and you get a [Markov decision process](/wiki/markov_decision_process_mdp), the formal foundation of [reinforcement learning](/wiki/reinforcement_learning); construct a chain whose long-run distribution is a target and you get [Markov chain Monte Carlo](/wiki/markov_chain_monte_carlo); run a chain on the web graph and its stationary distribution is PageRank [8][9][11][14].

| Area | Markov-chain construction | Notes |
|------|--------------------------|-------|
| Hidden Markov models | Latent state is a Markov chain; observations are emissions | Speech recognition (HTK era), POS tagging, gene-finding [11] |
| Markov decision processes | Chain plus actions and rewards | Foundation of [reinforcement learning](/wiki/reinforcement_learning) [8][12][13] |
| Partially observable MDPs | MDP with hidden state | Robotics under uncertainty, dialogue systems |
| MCMC sampling | Construct chain whose stationary is the target | [Markov chain Monte Carlo](/wiki/markov_chain_monte_carlo) for Bayesian inference [7][14][15] |
| PageRank | Random surfer on the web graph | Original Google ranking algorithm [9][10] |
| n-gram language models | Words form a Markov chain of order n - 1 | Pre-neural NLP standard [21] |
| Diffusion generative models | Forward chain adds noise; reverse chain learned | DDPM (Ho et al., 2020); extends Sohl-Dickstein et al. (2015) [16][17] |
| Particle filters | Sequential Monte Carlo on a state-space chain | Tracking, robot localisation |
| Markov random fields | Undirected analogue with local conditional independence | Image denoising, segmentation [15] |
| Wright-Fisher and coalescent | Chains modelling genetic drift | Population genetics |
| M/M/1 and M/M/c queues | Continuous-time birth-death chains | Performance analysis, networking |
| Reliability theory | Chains over component states | Systems engineering |

These are variations on a common theme: model the system as a Markov chain (possibly with actions, rewards, or partial observations) and then analyse it. Most stochastic processes can be rendered Markovian by enlarging the state space to include enough of the recent history [3].

### What is Markov chain Monte Carlo (MCMC)?

Markov chain Monte Carlo (MCMC) is the application that has done the most to spread Markov-chain theory beyond probabilists. The idea, originating with Metropolis and others at Los Alamos in the early 1950s and generalised by Hastings (1970), is to construct a Markov chain whose stationary distribution is some target pi (often the posterior in a Bayesian model) and run the chain long enough to draw approximate samples [14]. The Gibbs sampler (Geman and Geman, 1984) and the Metropolis-Hastings algorithm are the workhorses [15]; Hamiltonian Monte Carlo (Duane et al., 1987) and the No-U-Turn Sampler (Hoffman and Gelman, 2014) underlie Stan and PyMC. Robert and Casella's *Monte Carlo Statistical Methods* (2004) is the standard reference [7]. Almost every MCMC algorithm relies on detailed balance to certify that the chain has the right stationary distribution.

### How is PageRank a Markov chain?

Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd introduced PageRank in 1998 as the ranking core of the Google search engine [9][10]. The algorithm models a "random surfer" who at each step either follows a uniformly random outgoing link from the current page (with probability d, the *damping factor*, typically 0.85) or teleports to a uniformly random page (with probability 1 - d) [9]. This is a Markov chain on the set of web pages: in the words of the canonical description, PageRank "can be understood as a Markov chain in which the states are pages, and the transitions are the links between pages" [9]. Teleportation ensures the chain is irreducible and aperiodic, so the stationary distribution exists and is unique by the ergodic theorem; that stationary distribution is PageRank. Computationally PageRank reduces to power iteration on a matrix with billions of rows.

### How do Markov chains relate to reinforcement learning and MDPs?

A Markov decision process generalises a Markov chain by allowing an agent to choose actions at each state that influence the transition probabilities and produce a reward. Richard Bellman introduced MDPs in 1957 in the context of dynamic programming [12]; Howard's 1960 monograph *Dynamic Programming and Markov Processes* gave the first textbook treatment [13]. Sutton and Barto's *Reinforcement Learning: An Introduction* (1998, second edition 2018) made MDPs the central object of modern RL [8]. Value iteration, policy iteration, Q-learning, SARSA, actor-critic, and PPO are all algorithms for learning or computing optimal policies on an MDP. When the state is only partially observed, the model becomes a partially observable MDP (POMDP), studied since Astrom (1965) and Smallwood and Sondik (1973).

### How do diffusion models use Markov chains?

Diffusion-based generative models, which now dominate text-to-image generation, are built on Markov chains in time. Sohl-Dickstein, Weiss, Maheswaranathan, and Ganguli (2015) introduced *deep unsupervised learning using nonequilibrium thermodynamics*, framing image generation as the reverse of a Markov chain that gradually adds Gaussian noise to data [16]. Ho, Jain, and Abbeel (2020) refined this into the Denoising Diffusion Probabilistic Model (DDPM), which trains a neural network to predict the noise added at each step of the forward chain and then runs the learned reverse chain to sample new images; on unconditional CIFAR-10 it reported an Inception score of 9.46 and a FID of 3.17 [17]. Both forward (noising) and reverse (denoising) processes are explicit Markov chains in the time index; DDPM's loss derives from a variational bound on the Markov-chain likelihood. Score-based generative models (Song and Ermon, 2019; Song et al., 2021) are equivalent in continuous time and connect the construction to stochastic differential equations.

## How do Markov chains connect to information theory?

The *entropy rate* of a stationary Markov chain is H(X_1 | X_0) = -sum_i pi_i sum_j P_{ij} log P_{ij}, the average uncertainty about the next state given the current one; this is the long-run information rate of the chain in the sense of Shannon [21]. The asymptotic equipartition property and the Shannon-McMillan-Breiman theorem extend from independent sequences to ergodic Markov chains, with the entropy rate playing the role of per-symbol entropy. In data compression, n-gram models (themselves Markov chains over symbols) appear as workhorses of pre-neural language modelling [21].

## Continuous-state and continuous-time generalisations

Many of the concepts above extend to continuous state spaces, with sums replaced by integrals and matrices replaced by integral kernels. The general framework is the theory of *Markov processes*, which subsumes:

| Process | State space | Time | Notes |
|---------|-------------|------|-------|
| Discrete-time Markov chain | Countable | Discrete | This article's main subject |
| Continuous-time Markov chain | Countable | Continuous | Generator matrix Q |
| Brownian motion | Real numbers (R or R^d) | Continuous | Limit of random walks; foundational continuous-time process |
| Diffusion process | R^d | Continuous | Solution of an SDE; continuous-state generalisation |
| Levy process | R^d | Continuous | Independent stationary increments; includes jumps |
| Markov process on Polish space | Polish | Discrete or continuous | General theoretical framework |

Brownian motion, introduced by Einstein (1905) and rigorously constructed by Wiener (1923), is the canonical continuous-state continuous-time Markov process and is the limit of suitably scaled simple random walks. Stochastic differential equations dX_t = b(X_t) dt + sigma(X_t) dW_t define diffusion processes, which are Markov by construction. Levy processes (Levy 1937) generalise further to allow jumps.

## How does a Markov chain differ from related models?

| Concept | State | Time | Hidden? | Actions? | Notes |
|---------|-------|------|---------|----------|-------|
| [Markov chain](/wiki/markov_chain) | Discrete | Discrete | Observed | None | Defining object |
| Continuous-time Markov chain | Discrete | Continuous | Observed | None | Generator Q |
| [Hidden Markov model](/wiki/hidden_markov_model) | Discrete | Discrete | Hidden | None | Observation conditional on state |
| [Markov decision process (MDP)](/wiki/markov_decision_process_mdp) | Discrete (typically) | Discrete | Observed | Yes | Adds rewards and actions |
| Partially observable MDP (POMDP) | Discrete | Discrete | Hidden | Yes | Belief-state MDP under the hood |
| Markov random field | Graph nodes | None | Observed | None | Undirected graphical model |
| Brownian motion | Continuous (R^d) | Continuous | Observed | None | Continuous-state Markov process |
| [Diffusion model](/wiki/diffusion_model) | Continuous (R^d) | Discrete (typically) | Observed | None | Markov chain in time used for generation |
| Independent sequence | Any | Discrete | Observed | None | Special case (memoryless beyond zeroth order) |

The Markov chain sits at the centre of this taxonomy, with related models extending it: hide the state, add actions, take a continuous-time or continuous-state limit, or drop the directionality. Almost every probabilistic model used in modern machine learning either is a Markov chain, contains one, or is a close graphical-model relative.

## What software implements Markov chains?

| Library or tool | Language | Capabilities |
|-----------------|----------|-------------|
| numpy / scipy | Python | Transition-matrix arithmetic, eigendecompositions |
| hmmlearn | Python | Discrete and Gaussian HMMs |
| pomegranate | Python | HMMs, Bayesian networks, mixture models |
| markovchain | R | Discrete Markov chains, stationary distributions |
| depmixS4 | R | HMMs and dependent mixture models |
| Stan | C++ / R / Python | Hamiltonian Monte Carlo |
| PyMC | Python | NUTS, Metropolis, Gibbs sampling |
| NumPyro | Python (JAX) | NUTS, Hamiltonian Monte Carlo |
| PRISM | Standalone | Probabilistic model checker for DTMCs and CTMCs |
| Stable-Baselines3, RLlib, Tianshou | Python | Reinforcement learning over MDPs |
| Diffusers (Hugging Face) | Python | Implementation of diffusion-model Markov chains |

For small chains numpy is sufficient; for continuous-time chains scipy.linalg computes P(t) = exp(Qt) directly; for very large chains (as in PageRank) sparse matrices and the power method are standard.

## Famous results

| Result | Year | What it says |
|--------|------|---------------|
| Markov's law of large numbers for dependent variables | 1906 | The LLN extends to chains, not just independent sequences |
| Perron-Frobenius theorem | 1907 (Perron), 1912 (Frobenius) | Existence and uniqueness of the principal eigenvector for non-negative matrices |
| Polya's recurrence theorem | 1921 | Simple symmetric random walk is recurrent for d <= 2, transient for d >= 3 |
| Chapman-Kolmogorov equation | 1928 / 1931 | Composition law P^{m+n} = P^m P^n for transition kernels |
| Birkhoff's ergodic theorem | 1931 | Time averages equal space averages for ergodic systems |
| Doeblin's coupling | 1938 | Quantitative ergodic theorem via coupling |
| Kemeny-Snell fundamental matrix | 1960 | N = (I - Q)^{-1} for absorbing chains |
| Cutoff phenomenon for shuffling | 1992 (Bayer and Diaconis) | Seven riffle shuffles suffice for a 52-card deck |
| Coupling from the past | 1996 (Propp and Wilson) | Exact sampling from the stationary distribution |
| Markov chain central limit theorem | various, 20th century | Gaussian fluctuations of time averages for ergodic chains |

## See also

- [Markov property](/wiki/markov_property)
- [Hidden Markov model](/wiki/hidden_markov_model)
- [Markov decision process (MDP)](/wiki/markov_decision_process_mdp)
- [Markov chain Monte Carlo](/wiki/markov_chain_monte_carlo)
- [Reinforcement learning](/wiki/reinforcement_learning)
- [Diffusion model](/wiki/diffusion_model)

## References

1. Markov, A. A. (1906). "Distribution of the law of large numbers for quantities depending on one another" (Extension of the law of large numbers to dependent variables). *Izvestiya Fiziko-Matematicheskogo Obshchestva pri Kazanskom Universitete*, 2nd series, 15, 135-156.
2. Markov, A. A. (1913). "An example of statistical investigation of the text Eugene Onegin concerning the connection of samples in chains". *Bulletin of the Imperial Academy of Sciences*, Saint Petersburg.
3. Norris, J. R. (1997). *Markov Chains*. Cambridge University Press.
4. Kemeny, J. G., and Snell, J. L. (1960). *Finite Markov Chains*. Van Nostrand; reprinted by Springer.
5. Grimmett, G. R., and Stirzaker, D. R. (2001). *Probability and Random Processes*, third edition. Oxford University Press.
6. Levin, D. A., and Peres, Y. (2017). *Markov Chains and Mixing Times*, second edition (with E. L. Wilmer). American Mathematical Society.
7. Robert, C. P., and Casella, G. (2004). *Monte Carlo Statistical Methods*, second edition. Springer.
8. Sutton, R. S., and Barto, A. G. (2018). *Reinforcement Learning: An Introduction*, second edition. MIT Press.
9. Page, L., Brin, S., Motwani, R., and Winograd, T. (1999). "The PageRank Citation Ranking: Bringing Order to the Web". Stanford InfoLab Technical Report 1999-66.
10. Brin, S., and Page, L. (1998). "The anatomy of a large-scale hypertextual Web search engine". *Computer Networks and ISDN Systems*, 30(1-7), 107-117.
11. Rabiner, L. R. (1989). "A tutorial on hidden Markov models and selected applications in speech recognition". *Proceedings of the IEEE*, 77(2), 257-286.
12. Bellman, R. (1957). *Dynamic Programming*. Princeton University Press.
13. Howard, R. A. (1960). *Dynamic Programming and Markov Processes*. MIT Press.
14. Hastings, W. K. (1970). "Monte Carlo sampling methods using Markov chains and their applications". *Biometrika*, 57(1), 97-109.
15. Geman, S., and Geman, D. (1984). "Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images". *IEEE Transactions on Pattern Analysis and Machine Intelligence*, 6(6), 721-741.
16. Sohl-Dickstein, J., Weiss, E. A., Maheswaranathan, N., and Ganguli, S. (2015). "Deep unsupervised learning using nonequilibrium thermodynamics". *Proceedings of the 32nd International Conference on Machine Learning*.
17. Ho, J., Jain, A., and Abbeel, P. (2020). "Denoising diffusion probabilistic models". *Advances in Neural Information Processing Systems* 33.
18. Propp, J. G., and Wilson, D. B. (1996). "Exact sampling with coupled Markov chains and applications to statistical mechanics". *Random Structures and Algorithms*, 9(1-2), 223-252.
19. Bayer, D., and Diaconis, P. (1992). "Trailing the dovetail shuffle to its lair". *Annals of Applied Probability*, 2(2), 294-313.
20. Polya, G. (1921). "Uber eine Aufgabe der Wahrscheinlichkeitsrechnung betreffend die Irrfahrt im Strassennetz". *Mathematische Annalen*, 84, 149-160.
21. Shannon, C. E. (1948). "A mathematical theory of communication". *Bell System Technical Journal*, 27, 379-423 and 623-656.
22. Doob, J. L. (1953). *Stochastic Processes*. Wiley.
23. Kolmogorov, A. N. (1931). "Uber die analytischen Methoden in der Wahrscheinlichkeitsrechnung". *Mathematische Annalen*, 104, 415-458.
24. Hayes, B. (2013). "First Links in the Markov Chain". *American Scientist*, 101(2), 92.
