A Hidden Markov Model (HMM) is a statistical model in which the system being modeled is assumed to be a Markov process with unobserved (hidden) states. The model produces a sequence of observable outputs that depend probabilistically on the hidden state at each time step, but the states themselves are never directly seen by the observer. HMMs combine two coupled stochastic processes: a hidden state sequence that follows the Markov property, and an observation sequence whose distribution at each step is determined by the current hidden state.
From the late 1960s through roughly 2010, HMMs were one of the dominant tools in machine learning for sequential data. They underpinned almost all production speech recognition systems, served as the workhorse for part-of-speech tagging, named entity recognition, and shallow parsing in natural language processing, and became standard tools in computational biology for gene finding and protein family modeling. Although deep neural sequence models such as the recurrent neural network, the LSTM, and the transformer have largely superseded HMMs in modern speech and language pipelines, HMMs remain widely used in bioinformatics, signal processing, finance, and as interpretable building blocks within larger systems.
An HMM models a doubly stochastic process. There is an underlying Markov chain over a finite set of hidden states, and at each step the chain emits an observation drawn from a state-dependent distribution. Formally, an HMM with N hidden states and M discrete observation symbols is specified by the tuple lambda = (A, B, pi). The hidden state set is S = {s_1, ..., s_N}, and the observation alphabet is V = {v_1, ..., v_M} for discrete-output HMMs, or a continuous space such as R^d for continuous HMMs.
The transition probability matrix A is an N by N matrix where a_{ij} = P(q_{t+1} = s_j | q_t = s_i) is the probability of moving from state s_i at time t to state s_j at time t+1; the rows of A sum to one. The emission distribution B describes how observations are produced: for a discrete HMM, B is an N by M matrix where b_j(k) = P(o_t = v_k | q_t = s_j); for a continuous HMM, b_j(o_t) is a probability density, often modeled as a Gaussian or Gaussian mixture model (GMM) per state. The initial state distribution pi is a vector of length N where pi_i = P(q_1 = s_i) is the probability that the chain starts in state s_i.
Given these parameters, the joint probability of a state sequence Q = (q_1, ..., q_T) and an observation sequence O = (o_1, ..., o_T) is the product of the initial probability and the chains of transitions and emissions:
P(O, Q | lambda) = pi_{q_1} * b_{q_1}(o_1) * prod_{t=2}^{T} a_{q_{t-1} q_t} * b_{q_t}(o_t)
The two key conditional independence assumptions of an HMM are: (1) the next state depends only on the current state (the Markov assumption), and (2) the current observation depends only on the current state, not on past states or observations. These assumptions make exact inference tractable but are also the main source of HMM modeling limitations.
A standard pedagogical example involves an observer who cannot see the weather but can see whether a roommate carries an umbrella. The hidden states are {Rainy, Sunny}, and the observations are {Umbrella, NoUmbrella}. The transition matrix encodes that rainy days tend to follow rainy days and sunny days tend to follow sunny days. The emission matrix encodes that the roommate is much more likely to carry an umbrella when it is rainy. Given a sequence of umbrella observations, the HMM lets you compute the most likely weather sequence, the probability of seeing that observation pattern, and, given many such sequences, learn the matrices A and B from data. Another classic example, due to Rabiner, is the urn-and-ball model: a hidden Markov chain decides which of several urns to draw from at each step, but the observer sees only the color of the ball drawn.
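To make the umbrella model concrete, here is a minimal Python sketch; the specific probabilities are illustrative assumptions, not canonical values. It encodes pi, A, and B and evaluates the joint probability formula above for one candidate weather sequence. The toy parameters defined here are reused by the code sketches in the following sections.

```python
import numpy as np

# Umbrella toy model; the specific probabilities are illustrative assumptions.
states = ["Rainy", "Sunny"]           # hidden state set S
symbols = ["Umbrella", "NoUmbrella"]  # observation alphabet V

pi = np.array([0.6, 0.4])             # pi_i = P(q_1 = s_i)
A = np.array([[0.7, 0.3],             # a_ij = P(q_{t+1} = s_j | q_t = s_i); rows sum to one
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],             # b_j(k) = P(o_t = v_k | q_t = s_j)
              [0.2, 0.8]])

def joint_probability(Q, O):
    """P(O, Q | lambda) = pi_{q_1} * b_{q_1}(o_1) * prod_t a_{q_{t-1} q_t} * b_{q_t}(o_t)."""
    p = pi[Q[0]] * B[Q[0], O[0]]
    for t in range(1, len(O)):
        p *= A[Q[t - 1], Q[t]] * B[Q[t], O[t]]
    return p

# P(Rainy, Rainy, Sunny and Umbrella, Umbrella, NoUmbrella | lambda)
print(joint_probability([0, 0, 1], [0, 0, 1]))
```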
In his foundational 1989 tutorial in the Proceedings of the IEEE, Lawrence Rabiner formalized three canonical problems that any HMM application must solve. These three problems and the algorithms that address them form the operational core of the HMM framework.
| Problem | Question | Algorithm | Complexity |
|---|---|---|---|
| 1. Likelihood (Evaluation) | Given lambda and O, compute P(O \| lambda) | Forward algorithm | O(N^2 T) |
| 2. Decoding | Given lambda and O, find the most likely Q | Viterbi algorithm | O(N^2 T) |
| 3. Learning | Given O (and N), estimate lambda | Baum-Welch (Forward-Backward EM) | O(N^2 T) per iter |
A naive approach to any of these problems would require summing or maximizing over all N^T possible state sequences, which is exponential and intractable for any realistic T. Each of the three algorithms uses dynamic programming to fold this exponential sum into a quadratic-in-N computation, which is what makes HMMs practical.
The Forward algorithm computes the total probability of an observation sequence under the model by marginalizing over all possible hidden state sequences. It defines a forward variable alpha_t(i) = P(o_1, o_2, ..., o_t, q_t = s_i | lambda), the joint probability of the partial observation sequence up to time t and being in state s_i at time t.
The recursion proceeds as follows. Initialize alpha_1(i) = pi_i * b_i(o_1) for each state i. Then for t from 2 to T, update alpha_t(j) = [sum over i of alpha_{t-1}(i) * a_{ij}] * b_j(o_t). Finally, terminate by summing alpha_T(i) over all states i to obtain P(O | lambda).
This turns an exponential sum into a series of N matrix-vector products, costing O(N^2 T) operations. The Forward algorithm is the workhorse for likelihood-based HMM applications such as speaker identification, language identification, and any system that scores candidate models against an observation sequence.
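The recursion translates almost line for line into NumPy. The sketch below reuses the toy pi, A, and B from the umbrella example above and is meant as an illustration rather than a production implementation (it omits the rescaling discussed later).

```python
def forward(O, pi, A, B):
    """Forward algorithm: returns alpha (T x N) and P(O | lambda)."""
    N, T = A.shape[0], len(O)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, O[0]]                      # alpha_1(i) = pi_i * b_i(o_1)
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, O[t]]  # sum over predecessors, then emit
    return alpha, alpha[-1].sum()                   # P(O | lambda) = sum_i alpha_T(i)

alpha, likelihood = forward([0, 0, 1], pi, A, B)    # Umbrella, Umbrella, NoUmbrella
```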
The Viterbi algorithm, introduced by Andrew Viterbi in 1967 in the context of decoding convolutional codes, finds the single state sequence Q* that maximizes P(Q | O, lambda). It is structurally similar to the Forward algorithm but replaces the sum over predecessors with a maximum, and stores backpointers so the optimal sequence can be recovered. Define delta_t(i) = max over q_1, ..., q_{t-1} of P(q_1, ..., q_{t-1}, q_t = s_i, o_1, ..., o_t | lambda), the probability of the best path ending in state i at time t. Initialize delta_1(i) = pi_i * b_i(o_1). For each subsequent t and j, compute delta_t(j) = max_i [delta_{t-1}(i) * a_{ij}] * b_j(o_t), and store psi_t(j) = argmax_i [delta_{t-1}(i) * a_{ij}] as a backpointer. After processing all T observations, the most likely final state is q*_T = argmax_i delta_T(i), and the full state sequence is recovered by following backpointers in reverse.
Viterbi is one of the most widely used decoding algorithms in all of engineering. Beyond HMMs, it is used in mobile telephony, deep-space communications, and any application that needs to recover a maximum-likelihood path through a trellis. In NLP, Viterbi decoding produces the output of an HMM part-of-speech tagger or named entity recognizer; in speech recognition, beam-pruned variants of Viterbi are used to search the decoding graphs of HMM-GMM systems. A practical concern with both Forward and Viterbi is numerical underflow: long products of small probabilities collapse to zero in floating point, so implementations either rescale the alpha values at each step or work in log space, where products become sums and Viterbi remains a max-plus operation.
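A log-space version of Viterbi, continuing the umbrella example above, might look like the following sketch (it assumes all parameters are strictly positive so the logarithms are finite).

```python
def viterbi_log(O, pi, A, B):
    """Viterbi decoding in log space: returns the most likely hidden state sequence."""
    N, T = A.shape[0], len(O)
    log_pi, log_A, log_B = np.log(pi), np.log(A), np.log(B)
    delta = np.zeros((T, N))            # delta[t, i]: best log score of a path ending in state i
    psi = np.zeros((T, N), dtype=int)   # psi[t, j]: best predecessor of state j at time t
    delta[0] = log_pi + log_B[:, O[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A        # scores[i, j] = delta_{t-1}(i) + log a_ij
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_B[:, O[t]]
    path = [int(delta[-1].argmax())]                  # most likely final state
    for t in range(T - 1, 0, -1):                     # follow backpointers in reverse
        path.append(int(psi[t, path[-1]]))
    return path[::-1]

print(viterbi_log([0, 0, 1], pi, A, B))   # prints [0, 0, 1]: Rainy, Rainy, Sunny for these parameters
```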
The Baum-Welch algorithm, also known as the Forward-Backward algorithm, learns the parameters lambda = (A, B, pi) from one or more observation sequences when the hidden states are not labeled. It is a special case of the expectation-maximization (EM) algorithm and was developed by Leonard Baum and colleagues at the Institute for Defense Analyses in the late 1960s, predating the general EM formulation by Dempster, Laird, and Rubin in 1977.
Baum-Welch alternates two steps. The E-step uses the current parameters to compute posterior probabilities of the hidden state sequence: gamma_t(i) = P(q_t = s_i | O, lambda), the probability of being in state i at time t, and xi_t(i, j) = P(q_t = s_i, q_{t+1} = s_j | O, lambda), the probability of being in state i at time t and state j at time t+1. These are computed efficiently using both the Forward variables alpha and a complementary set of Backward variables beta_t(i) = P(o_{t+1}, ..., o_T | q_t = s_i, lambda). The M-step then re-estimates the parameters from these expected counts: the new initial distribution is the expected number of times each state occurs at time 1, the new transition probability a_{ij} is the expected number of transitions from i to j divided by the expected number of transitions out of i, and the new emission probability b_j(k) is the expected number of times state j is occupied while symbol v_k is observed, divided by the expected total time spent in state j.
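A compact sketch of one such iteration on a single discrete sequence, reusing the forward() function and toy parameters above; it omits the rescaling that a real implementation needs for long sequences, so treat it as illustrative only.

```python
def backward(O, A, B):
    """Backward variables beta_t(i) = P(o_{t+1}, ..., o_T | q_t = s_i, lambda)."""
    N, T = A.shape[0], len(O)
    beta = np.zeros((T, N))
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, O[t + 1]] * beta[t + 1])
    return beta

def baum_welch_step(O, pi, A, B):
    """One unscaled EM iteration on a single observation sequence."""
    N, T = A.shape[0], len(O)
    alpha, likelihood = forward(O, pi, A, B)
    beta = backward(O, A, B)
    gamma = alpha * beta / likelihood                 # gamma_t(i) = P(q_t = s_i | O, lambda)
    xi = np.zeros((T - 1, N, N))                      # xi_t(i, j) = P(q_t = s_i, q_{t+1} = s_j | O)
    for t in range(T - 1):
        xi[t] = alpha[t][:, None] * A * (B[:, O[t + 1]] * beta[t + 1])[None, :] / likelihood
    new_pi = gamma[0]                                 # expected occupancy at time 1
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B = np.zeros_like(B)
    O = np.asarray(O)
    for k in range(B.shape[1]):
        new_B[:, k] = gamma[O == k].sum(axis=0) / gamma.sum(axis=0)
    return new_pi, new_A, new_B
```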
Baum and his colleagues proved that each re-estimation step does not decrease the likelihood P(O | lambda), so the algorithm converges to a local maximum. It is not guaranteed to find the global maximum and is sensitive to initialization, which is why practitioners often run Baum-Welch from many random starts, or initialize from a flat-start segmentation in speech, or from a labeled subset of the data.
The theoretical foundations of HMMs were laid in the mid-1960s by Leonard E. Baum and his collaborators Ted Petrie, George Soules, and Norman Weiss in a series of papers in the Annals of Mathematical Statistics. The 1966 paper by Baum and Petrie introduced the basic framework, and the subsequent papers in 1968, 1970, and 1972 developed the forward-backward parameter re-estimation procedure. The work was carried out at the Institute for Defense Analyses in Princeton, motivated in part by classified problems in cryptanalysis and signal processing. The Baum-Welch algorithm was actually a precursor of the general EM framework: when Dempster, Laird, and Rubin published their canonical EM paper in 1977, they recognized Baum-Welch as a special case. Andrew Viterbi independently developed his decoding algorithm in 1967 for digital communication, and only later was it recognized as the natural decoder for HMMs.
Through the 1970s and 1980s, Frederick Jelinek, James Baker, and the IBM Continuous Speech Recognition Group adopted HMMs as the foundational framework for large-vocabulary speech recognition, replacing template-matching and dynamic time warping. James and Janet Baker founded Dragon Systems and used HMM technology to build Dragon NaturallySpeaking, the first widely successful consumer dictation product. By the late 1980s, HMMs were the dominant paradigm in speech recognition research worldwide. The most-cited single document on HMMs is Lawrence Rabiner's 1989 paper "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition," published in the Proceedings of the IEEE. Rabiner's tutorial brought together the mathematical foundations, the three classic problems, and a wealth of practical advice for implementers, and it remains one of the most cited papers in all of natural language processing and signal processing.
In parallel, the bioinformatics community adopted HMMs starting in the early 1990s. David Haussler and his group at UC Santa Cruz, along with Anders Krogh and Sean Eddy, introduced profile HMMs for protein family modeling and multiple sequence alignment, leading to tools such as HMMER and the Pfam database that remain in active use today.
HMMs have been applied across an unusually diverse range of domains, anywhere a hidden discrete process generates an observable signal. The table below summarizes the major application areas.
| Domain | Hidden States | Observations | Notable Tools and Systems |
|---|---|---|---|
| Speech recognition | Phones, sub-phones, triphone tied states | MFCC or filterbank acoustic feature vectors | Sphinx, HTK, Kaldi (HMM-GMM era) |
| Part-of-speech tagging | POS tags (noun, verb, etc.) | Words | Brill tagger, TnT, Stanford POS Tagger |
| Named entity recognition | Entity tags (PER, LOC, ORG, O) | Words | IdentiFinder, Nymble |
| Bioinformatics: gene finding | Exon, intron, intergenic, splice site states | DNA bases A, C, G, T | GENSCAN, GeneMark, Augustus |
| Bioinformatics: protein families | Match, insert, delete columns of a profile | Amino acid residues | HMMER, Pfam, SAM |
| Gesture and activity recognition | Discrete motion phases | Pose feature vectors, accelerometer readings | Various research systems |
| Financial regime detection | Bull, bear, transition regimes | Returns, volatilities | hmmlearn-based regime models |
| Channel modeling and decoding | Channel state, error patterns | Received signal | GSM equalizers, Viterbi decoders |
| Handwriting recognition | Letter or stroke states | Pen trajectory features | HTK-based handwriting systems |
From the mid-1980s until roughly 2012, virtually every production automatic speech recognition (ASR) system was built on HMMs. The standard architecture used HMM-GMM acoustic models, in which each phone (or context-dependent triphone) was modeled by a small left-to-right HMM, typically with three states for the beginning, middle, and end of the phone. The emission distribution at each state was a Gaussian mixture model over MFCC (mel-frequency cepstral coefficient) feature vectors. At recognition time, the system composed the acoustic HMMs with a pronunciation lexicon and a statistical n-gram language model into a giant weighted finite-state transducer, then ran beam-pruned Viterbi search to find the most likely word sequence. Open-source toolkits such as CMU Sphinx, the Cambridge HTK toolkit, and later Kaldi (the most influential HMM-era toolkit) implemented this pipeline.
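To make the three-state left-to-right topology concrete, the sketch below writes out the transition structure of one such phone model; the self-loop and advance probabilities are illustrative assumptions.

```python
# Three-state left-to-right phone HMM plus an exit state used when phone models are
# chained together; the 0.6 / 0.4 self-loop and advance probabilities are illustrative.
A_phone = np.array([
    [0.6, 0.4, 0.0, 0.0],   # begin  -> stay in begin, or advance to middle
    [0.0, 0.6, 0.4, 0.0],   # middle -> stay in middle, or advance to end
    [0.0, 0.0, 0.6, 0.4],   # end    -> stay in end, or exit toward the next phone
    [0.0, 0.0, 0.0, 1.0],   # exit   -> absorbing here; replaced by the next phone's entry state
])
```

Because each state either loops or advances, the time spent in each state follows a geometric distribution, which is the duration limitation noted in the limitations discussion later in this section.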
Deep learning began to displace HMM-GMM acoustic models around 2012, when deep neural networks were used to estimate state posteriors in hybrid HMM-DNN systems, achieving 20 to 30 percent relative error reduction on benchmark tasks. End-to-end neural systems based on the LSTM with connectionist temporal classification, and later transformer and conformer architectures, eventually replaced the HMM framework entirely in most state-of-the-art ASR systems by the late 2010s. The HMM-based pipeline remains in production use in many embedded and low-resource speech systems where the small footprint and well-understood properties of HMMs are valuable.
For part-of-speech tagging, the hidden states are POS tags and the observations are words. The transition matrix encodes tag-to-tag bigram probabilities, and the emission matrix encodes tag-to-word probabilities. Trained on a labeled corpus such as the Penn Treebank using simple maximum-likelihood counts (no Baum-Welch needed when the states are observed), HMM taggers achieve roughly 96 to 97 percent token accuracy on standard English data. Notable systems include TnT by Thorsten Brants (2000), which used a trigram HMM and remained competitive into the 2000s.
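A sketch of the supervised count-based estimation just described, on a tiny invented corpus (the sentences and tag names are assumptions for illustration; a real tagger would also need smoothing for unseen words and tag bigrams).

```python
from collections import defaultdict

# Tiny tagged corpus; each sentence is a list of (word, tag) pairs.
corpus = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("a", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
]

trans = defaultdict(lambda: defaultdict(int))   # counts for tag -> next-tag transitions
emit = defaultdict(lambda: defaultdict(int))    # counts for tag -> word emissions
for sent in corpus:
    tags = ["<s>"] + [tag for _, tag in sent]
    for prev, curr in zip(tags, tags[1:]):
        trans[prev][curr] += 1
    for word, tag in sent:
        emit[tag][word] += 1

def mle(table, given, outcome):
    """Maximum-likelihood estimate of P(outcome | given) from raw counts."""
    total = sum(table[given].values())
    return table[given][outcome] / total if total else 0.0

print(mle(trans, "DET", "NOUN"))   # P(NOUN | DET) = 1.0 on this toy corpus
print(mle(emit, "NOUN", "dog"))    # P(dog | NOUN) = 0.5
```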
For named entity recognition, HMMs were used in classic systems such as BBN's IdentiFinder and Nymble, where the hidden states represent entity tags (PERSON, LOCATION, ORGANIZATION, OTHER) and the model is trained on annotated newswire. HMM-based NER systems were the state of the art in the late 1990s, before being overtaken by maximum entropy Markov models and then conditional random fields in the 2000s. HMMs have also been applied to chunking and shallow parsing, word-sense disambiguation, statistical machine translation alignment (the HMM-based alignment model of Vogel, Ney, and Tillmann from 1996 was widely used), and sequence segmentation more generally.
Bioinformatics is the area where HMMs remain most central in modern practice. Profile HMMs, introduced by Krogh, Brown, Mian, Sjolander, and Haussler in 1994, are HMMs whose state structure mirrors the columns of a multiple sequence alignment, with separate match, insert, and delete states at each column. Trained on a known family of related protein or nucleic acid sequences, a profile HMM captures position-specific patterns of residue conservation and indels and can detect distant homologs in a sequence database. The HMMER software package by Sean Eddy is the standard tool for searching with profile HMMs and powers the Pfam database of protein families; the companion Dfam database uses profile HMMs to model transposable elements in genomes. GENSCAN by Burge and Karlin (1997) used HMMs to predict gene structure in vertebrate genomic sequences, modeling exons, introns, splice sites, promoter regions, and intergenic DNA as a single integrated HMM. Subsequent gene finders such as GeneMark and Augustus extended this framework. HMMs are also used in metagenomics, transmembrane protein topology prediction (TMHMM), long-read sequencing alignment, and many other genomics pipelines.
In gesture and activity recognition, HMMs model the temporal structure of human motion. Hidden states correspond to discrete phases of a gesture (such as the preparation, stroke, and recovery phases of a sign language sign), and observations are pose or accelerometer features. HMMs were the dominant approach in this area through the 2000s before deep convolutional and recurrent models took over. In finance, HMMs are used for regime detection, where the hidden states represent unobserved market regimes such as low-volatility bull, high-volatility bear, and transitional states, and the observations are returns or other financial time series. These models support strategies in quantitative trading, asset allocation, and risk management. In digital communications, the original Viterbi algorithm decodes convolutional codes against a channel HMM, and HMMs model burst-error channels in wireless and storage systems. In handwriting recognition, HMMs were the standard approach for online and offline handwriting through the 1990s and 2000s.
When observations are real-valued vectors rather than discrete symbols, the emission distribution b_j(o_t) is a probability density. The most common choice is a multivariate Gaussian or a Gaussian mixture model with several components per state, which is flexible enough to model the distributions of features such as MFCCs in speech. The Forward, Backward, Viterbi, and Baum-Welch updates all carry over, and Baum-Welch additionally re-estimates the means, covariances, and mixture weights.
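As one concrete route, the hmmlearn library listed in the software table below implements Gaussian-emission HMMs behind a scikit-learn style API. The sketch below fits a two-state GaussianHMM to synthetic data; the data, state count, and hyperparameters are assumptions chosen for illustration.

```python
import numpy as np
from hmmlearn import hmm   # assumes hmmlearn is installed (pip install hmmlearn)

rng = np.random.default_rng(0)
# Synthetic 2-D feature vectors standing in for real-valued observations such as MFCCs.
X = np.vstack([rng.normal(0.0, 1.0, size=(200, 2)),
               rng.normal(5.0, 1.5, size=(200, 2))])

# Two hidden states, one full-covariance Gaussian per state, trained with Baum-Welch (EM).
model = hmm.GaussianHMM(n_components=2, covariance_type="full", n_iter=50, random_state=0)
model.fit(X)                  # X is treated as a single sequence of length 400
states = model.predict(X)     # Viterbi decoding of the most likely state sequence
loglik = model.score(X)       # log P(O | lambda) from the (scaled) forward algorithm
```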
A standard HMM uses a first-order Markov assumption; higher-order HMMs let the next state depend on the last k states, which can be reformulated as a first-order HMM over an expanded state space. Auto-regressive HMMs let the next observation depend on past observations as well as the current state. Input-output HMMs condition transitions and emissions on an additional input sequence. Factorial HMMs (Ghahramani and Jordan, 1997) decompose the hidden state into several independent chains that jointly produce the observations, modeling multiple latent factors more compactly than a single combined state space. Hierarchical HMMs model nested temporal structure with sub-HMMs entered as production rules, and hidden semi-Markov models (HSMMs) relax the geometric duration distribution implicit in standard HMMs by letting each state explicitly emit a duration drawn from an arbitrary distribution.
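The expanded-state-space reformulation mentioned above can be made concrete: a second-order HMM over N states is equivalent to a first-order HMM whose states are ordered pairs (q_{t-1}, q_t). A sketch, with a uniform second-order transition tensor as a placeholder assumption:

```python
import itertools
import numpy as np

N = 3
pair_states = list(itertools.product(range(N), range(N)))   # pair state (q_{t-1}, q_t)

# a2[i, j, k] = P(q_{t+1} = k | q_{t-1} = i, q_t = j); uniform placeholder values here.
a2 = np.full((N, N, N), 1.0 / N)

# First-order transitions over pair states: (i, j) can only move to a pair (j', k)
# whose first component agrees with j, and it does so with probability a2[i, j, k].
A_pair = np.zeros((N * N, N * N))
for idx, (i, j) in enumerate(pair_states):
    for jdx, (jp, k) in enumerate(pair_states):
        if jp == j:
            A_pair[idx, jdx] = a2[i, j, k]

assert np.allclose(A_pair.sum(axis=1), 1.0)   # each row is still a valid distribution
```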
Classic HMMs are generative models that learn the joint distribution P(O, Q | lambda) and use Bayes' rule to obtain the conditional P(Q | O). For tagging tasks, this is wasteful because at decoding time you only need P(Q | O). Maximum Entropy Markov Models (MEMMs), introduced by McCallum, Freitag, and Pereira in 2000, are a discriminative alternative that directly models P(q_t | q_{t-1}, o_t) using a log-linear form over arbitrary features. MEMMs allow rich feature engineering but suffer from the label bias problem, where states with few outgoing transitions concentrate probability and bias decoding. Conditional Random Fields (CRFs), introduced by Lafferty, McCallum, and Pereira in 2001, fix the label bias problem by globally normalizing the conditional distribution P(Q | O) rather than locally per step. Linear-chain CRFs replaced HMMs as the standard model for POS tagging, NER, and shallow parsing in NLP throughout the 2000s and early 2010s, until they were in turn largely replaced by neural sequence models.
HMMs have been substantially displaced from speech and NLP applications by neural sequence models, but they have not disappeared. The table below contrasts HMMs with the dominant modern alternatives.
| Property | HMM | RNN / LSTM | Transformer |
|---|---|---|---|
| Hidden representation | Discrete (one of N states) | Continuous vector h_t | Continuous vectors per position |
| Inference | Exact via Forward / Viterbi | Approximate, sequential | Approximate, parallel over positions |
| Training | Baum-Welch (EM) or supervised counts | Backpropagation through time | Backpropagation, parallel |
| Long-range dependencies | Markov assumption limits | Better with gates, still hard | Direct via attention |
| Interpretability | High (states are meaningful) | Low | Low to medium (attention) |
| Data efficiency | Good with little data | Needs more data | Needs much more data |
| Parallelism in training | Limited | Limited (sequential) | High |
| Typical 2020s use | Bioinformatics, finance, embedded | Some sequence tasks | NLP, speech, vision dominant |
Despite this displacement, HMMs retain meaningful advantages. Their states are interpretable, they require less training data than deep neural models, they support exact probabilistic inference rather than approximate gradient-based decoding, and they have a clean probabilistic semantics that makes them easy to combine with other generative models. In bioinformatics, HMMs remain dominant for profile-based homology search and gene structure prediction, partly because the underlying biology has natural state structure (codons, exons, introns, secondary structure elements) and partly because labeled biological data is often scarce. In modern hybrid systems, HMMs sometimes appear as components within larger neural pipelines: a neural speech recognizer may use a transformer encoder to produce per-frame state posteriors that are then decoded with a Viterbi-like search over an HMM topology, and HMMs are sometimes used to provide weak labels or alignments for training neural models when frame-level supervision is unavailable.
Several mature software libraries support HMM training and inference.
| Tool | Language | Primary Use | Status |
|---|---|---|---|
| hmmlearn | Python | General-purpose HMMs (discrete and Gaussian) | Active, scikit-learn style API |
| pomegranate | Python | HMMs, mixture models, Bayesian networks | Active |
| HMMER | C | Profile HMMs for biological sequences | Active (v3), standard in bioinformatics |
| Kaldi | C++ | Speech recognition (HMM-GMM and hybrid HMM-DNN) | Maintained, widely used |
| HTK | C | Speech recognition toolkit | Legacy, academic use |
| CMU Sphinx | Java / C | Open-source speech recognition | Legacy |
For probabilistic modeling beyond pure HMMs, libraries such as Pyro, NumPyro, and Stan support HMM-like models within broader probabilistic programming frameworks, often with gradient-based inference rather than classical Baum-Welch.
The Markov assumption that the next state depends only on the current state is a strong restriction. Many real-world sequences have long-range dependencies that no finite-state model can capture exactly. The geometric distribution over state durations implicit in standard HMMs is also unrealistic for many phenomena, motivating semi-Markov extensions. The assumption that observations are conditionally independent given the state is similarly strong: in speech, consecutive frames are highly correlated, which motivated the use of temporal context windows, delta features, and ultimately neural acoustic models that learn richer dependencies; in NLP, the conditional independence of words given the tag misses obvious cross-word dependencies.
As a generative model, an HMM must allocate parameters to explaining the observations as well as the labels, which is wasteful when the goal is labeling. The discrete state space limits expressiveness, since each state must summarize all relevant history about the past. Finally, Baum-Welch is an EM algorithm and only finds local maxima of the likelihood. Initialization matters, the algorithm can be slow to converge, and the number of states must be chosen by the user (or by model selection criteria such as BIC), rather than learned from data.
An HMM is a particular kind of dynamic Bayesian network with one discrete latent variable per time step. The Kalman filter is the continuous-state, linear-Gaussian analog of an HMM, linear-chain CRFs are the discriminative analog, and stochastic context-free grammars generalize HMMs from finite-state to context-free structure. Probabilistic finite automata and weighted finite-state transducers form the broader formal language framework in which HMM decoding is typically implemented at scale. For a sequence-to-sequence task such as machine translation or speech recognition, an HMM expresses one particular factorization of the joint distribution over inputs and outputs; modern encoder-decoder neural models such as the transformer express different factorizations and use distributed continuous representations rather than discrete states, but they are tackling the same underlying problem.