The Markov property is a condition on a stochastic process that says the future of the process depends only on its present state, not on the path the process took to reach that state. A process with this property is sometimes described as memoryless. The Markov property is the formal foundation for Markov chains, hidden Markov models, Markov decision processes, and Markov chain Monte Carlo, all of which are central tools in probability, statistics, machine learning, and reinforcement learning.
The property is named after Andrey Andreyevich Markov (1856-1922), a Russian mathematician who introduced what are now called Markov chains in a 1906 paper and demonstrated their empirical use in 1913 by analyzing letter sequences in Alexander Pushkin's verse novel Eugene Onegin. Markov's work showed that long sequences of dependent events could be modeled rigorously without assuming independence and without tracking the entire history of the system.
The simplest way to think about the Markov property is in terms of a board game. Imagine playing Snakes and Ladders. To predict where you will be after your next roll, all you need to know is which square you currently occupy and the rules for the dice. It does not matter whether you reached square 47 by climbing a long ladder or by slogging through a series of small rolls: the next move is decided entirely by the present square and the next dice outcome. The history is summarized by the position; nothing earlier needs to be remembered.
Most real-world processes are not literally memoryless, but many can be made approximately Markov by enlarging the state to include enough recent context. A weather model that uses only today's temperature and humidity is at best crudely Markov; a model whose state also includes the past three days, the season, and the current synoptic pattern comes much closer to genuinely Markov, even though it carries more variables. In language modeling, the same trick converts a model that conditions on one previous word (a bigram) into one that conditions on the previous three words (a 4-gram). The state grows, but the property holds.
Let (X_t)_{t in T} be a stochastic process indexed by a totally ordered set T (typically the non-negative integers for discrete time or the non-negative real numbers for continuous time), with values in a state space S. The process satisfies the Markov property if, for all times s < t in T and every measurable subset A of S,
P(X_t in A | F_s) = P(X_t in A | X_s)
where F_s is the sigma-algebra generated by (X_u)_{u <= s}, that is, all the information about the process up to and including time s. Equivalently, the conditional distribution of X_t given the entire past depends on the past only through the value of X_s.
For a discrete-time process taking values in a countable state space, the property reduces to the more familiar form:
P(X_{t+1} = x_{t+1} | X_t = x_t, X_{t-1} = x_{t-1}, ..., X_0 = x_0) = P(X_{t+1} = x_{t+1} | X_t = x_t)
for every history x_0, x_1, ..., x_t in S and every candidate next state x_{t+1}. The left side conditions on the full history; the right side conditions only on the most recent observation. The Markov property says the two are equal.
A process with this property is called a Markov process. When the index set is discrete and the state space is also discrete (finite or countable), the process is called a Markov chain.
A discrete-time Markov chain is time-homogeneous if the one-step transition probability P(X_{t+1} = j | X_t = i) does not depend on t. In that case the dynamics are entirely captured by an initial distribution and a single transition matrix P with entries P_{ij} = P(X_{t+1} = j | X_t = i). The matrix P is row-stochastic: every row sums to 1, and every entry lies in [0, 1]. The n-step transition probability P(X_{t+n} = j | X_t = i) is the (i, j) entry of the matrix power P^n, by the Chapman-Kolmogorov equation.
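As a concrete sketch of the Chapman-Kolmogorov relation, n-step transition probabilities come from a single matrix power. The two-state weather chain below, and all of its probabilities, are invented for illustration:

```python
import numpy as np

# Invented two-state chain: state 0 = sunny, state 1 = rainy.
# Row i is the distribution of tomorrow's state given today's state i.
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])

# Chapman-Kolmogorov: the n-step transition matrix is the n-th matrix power.
P7 = np.linalg.matrix_power(P, 7)
print(P7[0, 1])   # P(X_{t+7} = rainy | X_t = sunny)
```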
The Markov property as stated above conditions on a fixed time s. The strong Markov property strengthens this by allowing s to be a random time, provided the random time is a stopping time with respect to the filtration F. A random time tau is a stopping time if for every t the event {tau <= t} is determined by F_t, that is, the decision to stop by time t depends only on information observed by time t.
The strong Markov property says that, given X_tau and given the event {tau < infinity}, the post-tau process (X_{tau+u})_{u >= 0} is independent of F_tau and has the same distribution as the original process started from X_tau. For discrete-time chains the strong Markov property follows from the ordinary Markov property and is usually treated as an immediate corollary. For continuous-time processes, including Brownian motion, the strong Markov property is a non-trivial theorem and is the workhorse of many proofs about hitting times, excursions, and recurrence.
A process satisfies the order-k Markov property (also called a k-th order Markov chain) if the next state depends on the previous k states rather than only the most recent one:
P(X_{t+1} = x_{t+1} | X_t, X_{t-1}, ..., X_0) = P(X_{t+1} = x_{t+1} | X_t, X_{t-1}, ..., X_{t-k+1})
Any order-k Markov chain over a state space S can be rewritten as an order-1 Markov chain over the larger state space S^k by defining the new state at time t to be the tuple (X_{t-k+1}, ..., X_t). This is the standard trick used to prove that the order-k formulation does not extend the expressive power of Markov models, only their convenience.
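A minimal sketch of that re-encoding (the helper name and the toy sequence are invented for illustration):

```python
def to_first_order(sequence, k):
    """Re-encode a sequence so that order-k dependence becomes order-1.

    The new state at each time is the tuple of the last k observations,
    so a model that needed k steps of history now needs only the current
    (tuple-valued) state.
    """
    seq = list(sequence)
    return [tuple(seq[i - k + 1 : i + 1]) for i in range(k - 1, len(seq))]

# Order-2 view of a toy letter sequence.
print(to_first_order("abcab", 2))
# [('a', 'b'), ('b', 'c'), ('c', 'a'), ('a', 'b')]
```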
Andrey Markov was a student of Pafnuty Chebyshev at St. Petersburg University and worked primarily in number theory and probability. In 1906 he published a paper introducing what he called "chains" of dependent random variables and proved a version of the law of large numbers that did not require independence. Earlier limit theorems, including Chebyshev's and Bernoulli's, had assumed independent observations. Markov wanted to show that statistical regularities could emerge from dependent sequences too, which made his chains the first general mathematical model of dependent random events.
In 1913 Markov published an empirical study to demonstrate the practical relevance of his theory. He took the first 20,000 letters of Pushkin's Eugene Onegin, stripped out punctuation and spaces, and arranged the letters in a 200-by-100 grid. He counted vowels and consonants and found that 43 percent were vowels and 57 percent consonants. He then counted pairs of consecutive letters and recorded 1,104 vowel-vowel pairs, 3,827 consonant-consonant pairs, and 15,069 mixed pairs. Under the assumption that consecutive letters were independent, the predicted number of vowel-vowel pairs would be much higher than the observed count. The discrepancy showed that the letters were not independent: a consonant tended to be followed by a vowel and vice versa. Markov used the data to fit a two-state chain and verify that his theorems held. The paper, titled "An Example of Statistical Investigation of the Text Eugene Onegin Concerning the Connection of Samples in Chains," was the first empirical application of what we now call a Markov chain. It was not translated into English until 2006.
The theory was extended to continuous time and continuous state spaces in the 1930s by Andrey Kolmogorov, whose 1931 paper "Analytical methods in the theory of probability" established the partial-differential-equation framework still used for diffusion processes. Joseph Doob's 1953 book Stochastic Processes gave the modern measure-theoretic treatment of Markov processes and the strong Markov property. By the 1950s and 1960s the theory had been absorbed into the toolkit of operations research, queueing theory, statistical mechanics, and population genetics.
A Markov chain is the simplest concrete instance of the Markov property and the one most likely to appear in introductory texts. The state space is finite or countably infinite; uncountable state spaces belong to the general theory of Markov processes. The dynamics are encoded in a transition matrix (or a transition kernel in the general case), and the long-run behavior is described by a stationary distribution.
A probability distribution pi on the state space is stationary for a transition matrix P if pi P = pi, that is, applying one step of the dynamics to pi yields pi again. If the chain is started in its stationary distribution, the distribution of X_t is pi for every t. A stationary distribution is also called an invariant or equilibrium distribution.
Not every Markov chain has a stationary distribution, and chains can have multiple stationary distributions. Existence and uniqueness depend on structural properties summarized in the next subsection.
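When a unique stationary distribution does exist (the ergodic case summarized in the table below), it can be computed numerically as a left eigenvector of P with eigenvalue 1. A minimal sketch, reusing an invented two-state matrix:

```python
import numpy as np

# Invented ergodic transition matrix (rows sum to 1).
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])

# pi P = pi says pi is a left eigenvector of P with eigenvalue 1,
# i.e. a right eigenvector of P transpose.
eigvals, eigvecs = np.linalg.eig(P.T)
pi = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
pi = pi / pi.sum()   # normalize into a probability distribution
print(pi)            # [0.8333..., 0.1666...]
print(pi @ P)        # unchanged: one step of the dynamics leaves pi fixed
```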
Three structural conditions on a Markov chain control its long-run behavior; a chain satisfying all three at once is called ergodic:
| Property | Definition | Consequence |
|---|---|---|
| Irreducible | For every pair of states i and j there exists n >= 1 with (P^n)_{ij} > 0; that is, every state is reachable from every other state. | Guarantees that a stationary distribution, if one exists, is unique. |
| Aperiodic | For every state i, the greatest common divisor of {n >= 1 : (P^n)_{ii} > 0} is 1. | Rules out chains that revisit states only on a fixed cycle. |
| Positive recurrent | The expected return time to each state is finite. | Together with irreducibility, guarantees existence of a unique stationary distribution. |
| Ergodic | Irreducible, aperiodic, and positive recurrent. | The distribution of X_t converges to the unique stationary distribution pi as t goes to infinity, regardless of the initial distribution. |
The ergodic theorem says that for an ergodic chain, time averages along a single sample path converge almost surely to the corresponding spatial averages under pi. This is the result that justifies estimating expectations under pi by simulating a long trajectory of the chain, which is the entire conceptual basis of Markov chain Monte Carlo.
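A quick simulation sketch of this idea, using the same invented two-state matrix as above: the fraction of time a single long trajectory spends in each state approaches the stationary probabilities.

```python
import random

P = [[0.9, 0.1],   # invented two-state chain from the sketch above
     [0.5, 0.5]]

state, visits = 0, [0, 0]
for _ in range(100_000):
    # One step of the chain: sample the next state from row `state`.
    state = 0 if random.random() < P[state][0] else 1
    visits[state] += 1

# Time averages along one path approach pi (about [0.833, 0.167] here).
print([v / sum(visits) for v in visits])
```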
A random walk on a graph is a Markov chain whose state is a vertex and whose transitions choose a neighbor uniformly at random. Random walks on the integers, on the d-dimensional lattice, and on weighted graphs are central examples in probability theory. The simple symmetric random walk on the d-dimensional integer lattice is recurrent in dimensions 1 and 2 and transient in dimension 3 and higher, a result due to George Polya in 1921. Random walks on graphs underpin algorithms for clustering, community detection, and ranking, including PageRank.
A hidden Markov model (HMM) is a doubly stochastic process in which an underlying Markov chain over hidden states (X_t) is observed only indirectly through an emission process (Y_t). The hidden states satisfy the Markov property; the observations are conditionally independent given the hidden states. Formally, an HMM is specified by an initial distribution over hidden states, a transition matrix among hidden states, and an emission distribution P(Y_t | X_t) for each hidden state.
Three canonical algorithms operate on HMMs and are summarized below:
| Algorithm | Authors and year | Problem solved | Method |
|---|---|---|---|
| Forward-backward | Baum and colleagues, 1960s | Compute the marginal posterior P(X_t \| Y_1, ..., Y_T) over hidden states given an observation sequence. | Dynamic programming: a forward pass scores prefixes of the observations, a backward pass scores suffixes, and their product yields each marginal posterior. |
| Viterbi | Andrew Viterbi, 1967 | Find the most likely sequence of hidden states given the observations, that is, the MAP path through the trellis. | Dynamic programming that retains only the best predecessor at each (state, time) cell; see the sketch after this table. |
| Baum-Welch | Leonard Baum, Ted Petrie, George Soules, Norman Weiss, 1970 | Estimate the transition and emission parameters from an unlabeled observation sequence (unsupervised learning). | Expectation-maximization specialized to HMMs, using forward-backward to compute the E-step. |
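As an illustration, here is a compact sketch of the Viterbi recursion for a discrete HMM. The two-state weather model and all of its parameters are invented for the example:

```python
import numpy as np

def viterbi(obs, pi0, A, B):
    """Most likely hidden-state sequence for a discrete HMM.

    pi0[i]  = P(X_0 = i)                  initial distribution
    A[i, j] = P(X_{t+1} = j | X_t = i)    transition matrix
    B[i, o] = P(Y_t = o | X_t = i)        emission matrix
    Works in log space to avoid underflow on long sequences.
    """
    T, N = len(obs), len(pi0)
    logd = np.log(pi0) + np.log(B[:, obs[0]])  # best log-prob ending in each state
    back = np.zeros((T, N), dtype=int)         # best predecessor per (time, state)
    for t in range(1, T):
        scores = logd[:, None] + np.log(A)     # scores[i, j]: best path into i, then i -> j
        back[t] = scores.argmax(axis=0)
        logd = scores.max(axis=0) + np.log(B[:, obs[t]])
    path = [int(logd.argmax())]                # best final state...
    for t in range(T - 1, 0, -1):              # ...traced back to time 0
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Invented two-state weather HMM: hidden 0 = sunny, 1 = rainy;
# observations 0 = dry, 1 = wet.
A = np.array([[0.8, 0.2], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
print(viterbi([0, 1, 1, 0], np.array([0.5, 0.5]), A, B))  # prints [0, 1, 1, 0]
```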
Viterbi originally proposed his algorithm for decoding convolutional codes in noisy communication channels, in his 1967 paper "Error bounds for convolutional codes and an asymptotically optimum decoding algorithm." The same algorithm was later recognized to solve the most-probable-state-sequence problem for any HMM and became a standard tool in speech recognition, bioinformatics, and natural language processing.
HMMs were the dominant statistical model for automatic speech recognition from the 1970s through the early 2010s, with seminal contributions from groups at IBM, Bell Labs, and Carnegie Mellon. They were also widely used for part-of-speech tagging, gene-finding (programs such as GENSCAN), profile alignment in bioinformatics (HMMER), and handwriting recognition. Deep learning models such as recurrent neural networks and transformers have largely superseded HMMs in speech and NLP, but HMMs remain important in computational biology and as pedagogical examples of probabilistic graphical models.
A Markov decision process (MDP) extends a Markov chain by adding actions and rewards, providing the standard mathematical framework for sequential decision making under uncertainty and the formal setting for reinforcement learning. An MDP is specified by the tuple (S, A, P, R, gamma): a set of states S, a set of actions A, a transition kernel P(s' | s, a) giving the probability of reaching state s' when action a is taken in state s, a reward function R(s, a, s'), and a discount factor gamma in [0, 1) that weights future rewards.
The Markov property in this context says that the transition kernel and the reward depend only on the current state and action, not on the trajectory by which the agent reached them. A policy pi is a (possibly stochastic) mapping from states to actions. The state-value function under policy pi is
V^pi(s) = E[ sum_{t=0}^{infinity} gamma^t R(S_t, A_t, S_{t+1}) | S_0 = s, A_t ~ pi ]
and the action-value function is
Q^pi(s, a) = E[ sum_{t=0}^{infinity} gamma^t R(S_t, A_t, S_{t+1}) | S_0 = s, A_0 = a, A_t ~ pi for t >= 1 ].
Both value functions satisfy a recursive consistency condition called the Bellman equation, named after Richard Bellman, who introduced dynamic programming in the 1950s. For the state-value function:
V^pi(s) = sum_{a} pi(a | s) sum_{s'} P(s' | s, a) [ R(s, a, s') + gamma V^pi(s') ]
The Bellman optimality equation characterizes the value of an optimal policy:
V*(s) = max_{a} sum_{s'} P(s' | s, a) [ R(s, a, s') + gamma V*(s') ]
Classical solution methods for MDPs include value iteration, policy iteration, and linear programming, all of which exploit the Bellman recursion. When the state and action spaces are too large to enumerate, or when P and R are unknown, model-free reinforcement-learning algorithms such as Q-learning, SARSA, and policy gradient methods estimate the value functions or policy directly from sampled experience. The textbook by Richard Sutton and Andrew Barto, Reinforcement Learning: An Introduction, is the standard reference and treats MDPs in chapters 3 and 4.
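A minimal value-iteration sketch; the two-state, two-action MDP and all of its numbers are invented for illustration:

```python
import numpy as np

# Invented MDP: 2 states, 2 actions.
# P[a][s][s'] = transition probability, R[a][s][s'] = reward.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],    # action 0
              [[0.1, 0.9], [0.7, 0.3]]])   # action 1
R = np.array([[[1.0, 0.0], [0.0, 2.0]],
              [[0.0, 1.0], [2.0, 0.0]]])
gamma = 0.9

V = np.zeros(2)
for _ in range(1000):
    # Bellman optimality backup: Q[a, s] = sum_s' P(s'|s,a) [R + gamma V(s')].
    Q = (P * (R + gamma * V)).sum(axis=2)
    V_new = Q.max(axis=0)
    if np.max(np.abs(V_new - V)) < 1e-10:   # stop at a fixed point of the backup
        break
    V = V_new

print(V)                    # optimal state values
print(Q.argmax(axis=0))     # a greedy (optimal) action for each state
```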
In computational linguistics, an n-gram language model assigns a probability to a sequence of words by factoring it as a product of conditional probabilities and applying the order-k Markov assumption to the conditioning context. The general factorization is
P(w_1, w_2, ..., w_T) = prod_{t=1}^{T} P(w_t | w_1, ..., w_{t-1})
and the n-gram approximation truncates the conditioning history to the previous n - 1 words:
P(w_t | w_1, ..., w_{t-1}) approx P(w_t | w_{t-n+1}, ..., w_{t-1})
This is exactly the order-(n - 1) Markov assumption applied to a sequence of words. The relationship between n-gram terminology and Markov-order terminology is summarized below.
| n-gram name | Conditioning context | Markov order |
|---|---|---|
| Unigram | None (independence assumption) | Order 0 |
| Bigram | Previous 1 word | Order 1 |
| Trigram | Previous 2 words | Order 2 |
| 4-gram | Previous 3 words | Order 3 |
| n-gram | Previous n - 1 words | Order n - 1 |
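A minimal sketch of maximum-likelihood bigram estimation; the toy corpus is invented, and real systems add the smoothing discussed below:

```python
from collections import Counter

corpus = "the cat sat on the mat the cat ran".split()

# Count bigrams and the unigram contexts they condition on.
bigrams  = Counter(zip(corpus, corpus[1:]))
contexts = Counter(corpus[:-1])

def p(word, prev):
    # Maximum-likelihood estimate of P(word | prev). Unseen pairs get
    # probability zero, which is exactly why smoothing is needed.
    return bigrams[(prev, word)] / contexts[prev]

print(p("cat", "the"))   # 2/3: "the" is followed by "cat" twice, "mat" once
```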
Claude Shannon introduced n-gram models in his 1948 paper "A Mathematical Theory of Communication" as a way to estimate the entropy of English text. He used hand-counted bigram and trigram statistics to show that a Markov model could generate text that became progressively more English-like as the order increased. Shannon's experiment is often cited as the first attempt at statistical language modeling, and it is the direct conceptual ancestor of every probabilistic language model that followed, including modern large language models. The smoothing techniques developed for n-gram models in the 1980s and 1990s (Good-Turing, Katz back-off, Kneser-Ney) were the dominant approach to language modeling for several decades, until neural language models began to outperform them around 2010.
Markov chain Monte Carlo (MCMC) is a class of algorithms for sampling from probability distributions that are difficult to sample from directly. The idea is to construct an ergodic Markov chain whose stationary distribution is the target distribution pi. After running the chain long enough for it to approach equilibrium, the sample path can be treated as approximate samples from pi, and time averages along the path estimate expectations under pi.
The three foundational MCMC algorithms are listed below.
| Algorithm | Year | Authors | Idea |
|---|---|---|---|
| Metropolis | 1953 | Nicholas Metropolis, Arianna Rosenbluth, Marshall Rosenbluth, Augusta Teller, Edward Teller | Propose moves from a symmetric distribution and accept with probability min(1, pi(new) / pi(current)). |
| Metropolis-Hastings | 1970 | Wilfred Hastings | Generalizes the Metropolis algorithm to asymmetric proposal distributions, accepting with probability that includes the proposal-distribution ratio. |
| Gibbs sampler | 1984 | Stuart Geman and Donald Geman | Update one coordinate at a time by sampling from its full conditional distribution given the others; especially useful for high-dimensional posteriors with tractable conditionals. |
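A minimal random-walk Metropolis sketch; the standard-normal target and the step size are chosen arbitrarily for illustration:

```python
import math, random

def log_target(x):
    # Unnormalized log density of the target: a standard normal.
    return -0.5 * x * x

x, samples = 0.0, []
for _ in range(50_000):
    proposal = x + random.gauss(0.0, 1.0)   # symmetric random-walk proposal
    # Metropolis acceptance min(1, pi(new)/pi(current)), done in log space.
    if math.log(random.random()) < log_target(proposal) - log_target(x):
        x = proposal
    samples.append(x)

kept = samples[10_000:]                     # discard burn-in
print(sum(kept) / len(kept))                # approx 0 (target mean)
print(sum(v * v for v in kept) / len(kept)) # approx 1 (target variance)
```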
The original Metropolis paper, written at Los Alamos in the context of statistical mechanics, computed equilibrium properties of interacting particle systems. Hastings's 1970 generalization was published in Biometrika but went largely unnoticed for two decades. Geman and Geman applied the Gibbs sampler to image restoration in 1984 and, a few years later, statisticians realized that the same machinery could solve Bayesian inference problems that had been computationally intractable. The 1990 paper by Alan Gelfand and Adrian Smith catalyzed a Bayesian revolution: instead of fighting analytically intractable posteriors with conjugate priors and approximations, statisticians could simulate from them. Modern probabilistic programming systems such as Stan, PyMC, and JAGS implement Hamiltonian Monte Carlo, the No-U-Turn Sampler, and other MCMC variants, and they are the everyday tool of applied Bayesian statistics.
PageRank is one of the most consequential applications of Markov chain theory in computer science. Sergey Brin and Larry Page introduced it in their 1998 Stanford paper "The PageRank Citation Ranking: Bringing Order to the Web," which became the foundation of the original Google search engine.
The model is a random surfer wandering the web. From the current page, the surfer either follows an outgoing link uniformly at random, with probability d (the damping factor, typically around 0.85), or teleports to a uniformly random page, with probability 1 - d. The dynamics define a Markov chain on the set of web pages whose transition matrix mixes the link structure with the uniform teleport distribution. The teleport keeps the chain irreducible and aperiodic, so a unique stationary distribution exists by the Perron-Frobenius theorem. The PageRank of a page is its probability under the stationary distribution: pages that the random surfer spends more time at are ranked higher.
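A minimal power-iteration sketch on a tiny invented four-page link graph (every page has at least one outgoing link, so no dangling-page correction is needed):

```python
# Invented link graph: page -> pages it links to.
links = {0: [1, 2], 1: [2], 2: [0], 3: [2]}
n, d = len(links), 0.85                 # d = probability of following a link

rank = [1.0 / n] * n
for _ in range(100):
    new = [(1 - d) / n] * n             # teleport mass, spread uniformly
    for page, outs in links.items():
        for target in outs:
            new[target] += d * rank[page] / len(outs)
    rank = new

print(rank)   # stationary probabilities; page 2 ranks highest here
```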
Brin and Page did not initially frame PageRank in Markov-chain language; the connection was made explicit in subsequent papers by Langville, Meyer, and others. The teleport term, originally motivated by a need to avoid dangling pages and rank sinks, is exactly the device that makes the chain ergodic and the stationary distribution unique. PageRank-style random walk algorithms are now used widely in network analysis, citation ranking, and recommendation systems.
In a probabilistic graphical model, the Markov blanket of a variable X is a minimal set of other variables such that, conditional on the blanket, X is independent of every variable outside the blanket. The concept formalizes the idea that the blanket carries all the information about X that the rest of the model contains. In a Bayesian network, the Markov blanket of a node consists of its parents, its children, and the other parents of its children (its co-parents); in a Markov random field, it is simply the node's neighbors in the graph.
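A minimal sketch computing blankets from parent lists in an invented Bayesian network (a rain-sprinkler structure made up for the example):

```python
# Invented Bayesian network, given as child -> list of parents.
parents = {
    "rain":      [],
    "sprinkler": ["rain"],
    "wet":       ["rain", "sprinkler"],
    "slippery":  ["wet"],
}

def markov_blanket(node):
    # Parents, children, and the children's other parents (co-parents).
    children = {c for c, ps in parents.items() if node in ps}
    coparents = {p for c in children for p in parents[c]}
    return (set(parents[node]) | children | coparents) - {node}

print(markov_blanket("rain"))   # {'sprinkler', 'wet'}
```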
Judea Pearl introduced and popularized the concept in his 1988 book Probabilistic Reasoning in Intelligent Systems, which established Bayesian networks as a workhorse of uncertain reasoning in artificial intelligence. The Markov blanket is the local generalization of the Markov property: the property says the future depends on the past only through the present, while the blanket says any variable depends on the rest of the model only through its blanket. Markov blankets play a central role in causal discovery, feature selection, and the variational free-energy framework in computational neuroscience.
A conditional random field (CRF) is a discriminative undirected graphical model used for structured prediction, especially sequence labeling. CRFs were introduced by John Lafferty, Andrew McCallum, and Fernando Pereira in their 2001 paper "Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data," presented at ICML.
The motivation was to address two limitations of HMMs and maximum-entropy Markov models when used for sequence labeling. First, HMMs are generative and require modeling P(observations | labels), which forces strong independence assumptions on the observations to keep training tractable. CRFs model the conditional distribution P(labels | observations) directly and can incorporate arbitrary, overlapping features of the observations without violating any independence assumptions. Second, MEMMs suffer from the so-called label-bias problem, in which states with few outgoing transitions become disproportionately likely; CRFs use a global normalization that avoids this bias.
A linear-chain CRF retains the order-1 Markov assumption on the labels (each label depends only on its predecessor) but completely relaxes the assumption on observations. This combination dominated supervised sequence-labeling benchmarks for tasks such as named entity recognition, shallow parsing, and bioinformatics annotation through the late 2000s and 2010s, and CRFs are still used as the output layer of many neural sequence models.
The Markov property is an assumption, not a fact about the world, and many real systems violate it in important ways. Long-range dependencies, latent state, and non-stationarity are the usual culprits. Three concrete failure modes are worth noting.
First, language has structure that spans much further than any fixed-order Markov model can capture. Subject-verb agreement across long relative clauses, anaphora resolution, discourse coherence, and document-level topical consistency all involve dependencies on words that may be hundreds or thousands of tokens earlier. Classical n-gram language models are explicitly Markov and cannot represent these dependencies; in practice, this is one of the main reasons n-gram language models were displaced first by recurrent neural networks (LSTMs, GRUs) and then by transformer-based language models, which condition on the full context window with self-attention.
Second, hidden state can make a fully observed process appear non-Markov even when an underlying joint process is Markov. The remedy is usually to expand the state to include the hidden variables, but this is only feasible when the hidden state is known and finite. HMMs and partially observable Markov decision processes (POMDPs) provide one way to handle hidden state, at the cost of substantially harder inference and learning.
Third, the world changes. A transition kernel that is correct today may not be correct next year, violating the time-homogeneity that most clean theory assumes. Non-stationary chains, change-point models, and continual-learning techniques try to address this, but the simple time-homogeneous form of the theory does not.
Despite these limitations, the Markov property remains one of the most useful modeling assumptions in applied probability. It is the right level of abstraction for reinforcement-learning environments, for many physical and biological dynamical systems, and for any setting where the state can plausibly summarize the relevant past. Even modern deep models that violate the Markov property in spirit often respect it in implementation: a transformer language model is not Markov in token space, but if you treat the entire context window plus key-value cache as the state, the next-token distribution depends only on that state, recovering a Markov-like structure at the level of the architecture rather than the data.
Imagine you are playing hide-and-seek in a big house. To decide where to go next, you only need to know what room you are in right now. It does not really matter whether you sneaked there from the kitchen or ran in from the garden. The next move depends on the room you are in. That is the Markov property: the next thing that happens depends only on right now, not on how you got here.
Markov chains are like board games where each square tells you the chances of going to other squares. Hidden Markov models are like a board game where you cannot see the squares; you only see hints, and you have to guess the hidden squares from the hints. Markov decision processes are board games where you also get to choose what to do on each square, and you get points for good choices. All of these ideas use the same trick: the future depends only on the present.