A sequence model is a machine learning model designed to process, generate, or make predictions from ordered data in which the position and context of each element matter. Unlike models that treat inputs as independent observations, sequence models capture dependencies between elements in a series, making them suitable for tasks involving text, speech, time series, biological sequences, and other temporally or sequentially structured data.
Sequence models have evolved from early statistical approaches like hidden Markov models to neural network-based architectures such as recurrent neural networks (RNNs), long short-term memory (LSTM) networks, and Transformers. These models form the backbone of modern natural language processing, speech recognition, and many other sequential data applications.
Imagine you are reading a story. You understand each sentence because you remember what happened in the sentences before it. A sequence model works the same way. It reads things one piece at a time (like words in a sentence or notes in a song) and remembers what came before so it can understand what comes next. If the story says "The cat sat on the...", the model remembers "cat" and "sat" and guesses the next word might be "mat" or "couch." Some newer sequence models are even smarter: instead of reading one word at a time, they can look at all the words at once and figure out which ones are most related to each other, like being able to see the whole page of a book instead of reading it letter by letter.
The study of sequential data processing has roots stretching back decades, drawing on both neuroscience and statistical theory.
Markov chains, introduced by Andrey Markov in 1906, provided the first mathematical framework for modeling sequences where the probability of each state depends only on the preceding state. In the 1960s, Leonard Baum and colleagues at the Institute for Defense Analyses developed the theory of hidden Markov models (HMMs), which extended Markov chains by introducing unobservable (hidden) states that generate observed outputs through probabilistic emission functions. HMMs became the dominant approach for speech recognition and biological sequence analysis for several decades.
The idea of recurrence in neural networks has biological origins. Santiago Ramón y Cajal observed "recurrent semicircles" in cerebellar cortex structures as early as 1901, and Donald Hebb proposed "reverberating circuits" as short-term memory mechanisms in the 1940s.
In 1986, Michael I. Jordan introduced the Jordan network, one of the first recurrent neural architectures for sequential processing. In 1990, Jeffrey Elman published "Finding Structure in Time," describing the Elman network (also called a simple recurrent network), which introduced recurrent connections between hidden units and context (memory) units. These two architectures demonstrated that neural networks could learn temporal structure in data.
John Hopfield's 1982 application of spin glass theory to neural networks with binary activations also contributed to the foundations of recurrent computation, although Hopfield networks were primarily used for associative memory rather than sequence processing.
Sepp Hochreiter's 1991 diploma thesis formally identified the vanishing gradient problem, explaining why training RNNs on long sequences was so difficult. As gradients propagate backward through many time steps, they tend to shrink exponentially, making it nearly impossible for the network to learn long-range dependencies. This problem motivated the development of gated architectures.
In 1997, Hochreiter and Jürgen Schmidhuber introduced the LSTM, which used a system of gates to control the flow of information and maintain a stable cell state over many time steps. LSTM became the default RNN architecture for the next two decades.
Kyunghyun Cho and colleagues introduced the gated recurrent unit (GRU) in 2014, offering a simpler alternative to LSTM with comparable performance on many tasks.
In 2014, Dzmitry Bahdanau, Cho, and Yoshua Bengio introduced the attention mechanism for neural machine translation, allowing the decoder to focus on different parts of the input sequence at each output step rather than relying on a single fixed-length context vector. This resolved the information bottleneck that plagued earlier encoder-decoder models.
In 2017, Vaswani et al. published "Attention Is All You Need," introducing the Transformer architecture, which replaced recurrence entirely with self-attention and enabled full parallelization of sequence processing. The Transformer has since become the foundation of virtually all modern large language models and many other sequence processing systems.
Sequence models span a wide range of architectures, from classical statistical models to modern deep learning systems.
A hidden Markov model (HMM) is a probabilistic model that represents a system as a series of transitions between hidden states, each of which produces an observable output according to an emission probability distribution. The model makes the Markov assumption: the probability of transitioning to the next state depends only on the current state, not on any earlier states.
An HMM is defined by three components: an initial distribution over the hidden states, a matrix of transition probabilities between hidden states, and an emission distribution for each hidden state over the possible observations.
Three classical algorithms solve the main inference and learning problems for HMMs:
| Algorithm | Problem solved | Description |
|---|---|---|
| Forward algorithm | Evaluation | Computes the probability of an observed sequence given the model |
| Viterbi algorithm | Decoding | Finds the most likely sequence of hidden states for a given observation sequence |
| Baum-Welch algorithm | Learning | Estimates model parameters from observed data using expectation-maximization |
HMMs were the dominant approach in speech recognition from the mid-1970s through the 2010s and remain widely used in bioinformatics for tasks like gene finding and protein family classification.
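As an illustration, the forward algorithm from the table above can be written in a few lines of NumPy; the toy parameters and variable names below are placeholders, not drawn from any particular HMM application:

```python
import numpy as np

def forward(pi, A, B, obs):
    """Probability of an observation sequence under an HMM (forward algorithm).

    pi  -- initial state distribution, shape (S,)
    A   -- transition matrix, A[i, j] = P(next state j | current state i), shape (S, S)
    B   -- emission matrix, B[i, k] = P(symbol k | state i), shape (S, K)
    obs -- observed symbol indices, length T
    """
    alpha = pi * B[:, obs[0]]            # joint probability of the first symbol and each state
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]    # propagate one step, then weight by the emission
    return alpha.sum()                   # marginalize over the final hidden state

# Toy model: 2 hidden states, 3 observable symbols
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1],
              [0.1, 0.3, 0.6]])
print(forward(pi, A, B, obs=[0, 1, 2]))  # likelihood of the sequence 0, 1, 2
```

The Viterbi algorithm has the same structure, with the sum over previous states replaced by a maximum (plus back-pointers to recover the most likely state path).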
A recurrent neural network (RNN) is a neural network that processes sequential data by maintaining a hidden state vector that is updated at each time step. At each step t, the network takes the current input x_t and the previous hidden state h_(t-1) to produce a new hidden state h_t:
h_t = f(W_hh * h_(t-1) + W_xh * x_t + b)
where W_hh is the recurrent weight matrix, W_xh is the input weight matrix, b is a bias term, and f is a nonlinear activation function (typically tanh or ReLU).
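A minimal NumPy sketch of this recurrence, with arbitrary placeholder dimensions and randomly initialized weights:

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b):
    """One step of a vanilla RNN: h_t = tanh(W_hh @ h_prev + W_xh @ x_t + b)."""
    return np.tanh(W_hh @ h_prev + W_xh @ x_t + b)

rng = np.random.default_rng(0)
W_xh = rng.normal(scale=0.1, size=(8, 4))   # input-to-hidden weights (hidden=8, input=4)
W_hh = rng.normal(scale=0.1, size=(8, 8))   # hidden-to-hidden (recurrent) weights
b = np.zeros(8)

h = np.zeros(8)                              # initial hidden state
for x_t in rng.normal(size=(10, 4)):         # a sequence of 10 input vectors
    h = rnn_step(x_t, h, W_xh, W_hh, b)      # the same weights are reused at every time step
```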
RNNs are trained using backpropagation through time (BPTT), which unrolls the network across time steps and applies the chain rule to compute gradients. However, standard RNNs suffer from the vanishing and exploding gradient problems, which limit their ability to learn dependencies over long sequences.
Bidirectional RNNs, introduced by Schuster and Paliwal in 1997, process the input sequence in both the forward and backward directions simultaneously. The hidden states from both directions are concatenated at each time step, giving the model access to both past and future context. Bidirectional architectures are useful when the full sequence is available at inference time, as in text classification or named entity recognition, but are unsuitable for real-time or autoregressive tasks where future inputs are not yet known.
Deep RNNs stack multiple recurrent layers on top of each other. The output of one recurrent layer serves as the input to the next, allowing the network to learn increasingly abstract representations of the sequential input. Deep RNNs with two to four layers are common in practice; very deep stacks tend to be difficult to train without residual connections or other stabilization techniques.
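Deep-learning libraries typically expose both options as configuration flags. A PyTorch sketch with illustrative sizes:

```python
import torch
import torch.nn as nn

# A 3-layer bidirectional LSTM; layer count and dimensions are illustrative.
rnn = nn.LSTM(input_size=16, hidden_size=32, num_layers=3,
              bidirectional=True, batch_first=True)

x = torch.randn(4, 25, 16)      # batch of 4 sequences of length 25
out, _ = rnn(x)                 # out: (4, 25, 64) -- forward and backward states concatenated
```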
LSTM networks, introduced by Hochreiter and Schmidhuber in 1997 and refined by Felix Gers and colleagues in 2000 (who added the forget gate), address the vanishing gradient problem through a carefully designed gating mechanism and a dedicated memory cell.
An LSTM cell contains four main components:
| Component | Function | Equation concept |
|---|---|---|
| Forget gate | Decides what information to discard from the cell state | Sigmoid layer outputs values between 0 (discard) and 1 (keep) |
| Input gate | Determines which new information to store in the cell state | Sigmoid layer selects values; tanh layer creates candidate values |
| Cell state | Carries information across time steps with minimal transformation | Updated by pointwise addition after forget and input gate operations |
| Output gate | Controls what part of the cell state to expose as the hidden state output | Sigmoid layer filters the cell state passed through tanh |
The cell state acts as a "conveyor belt" that runs through the entire chain, with the gates adding or removing information. Because the cell state is updated through additive operations rather than multiplicative ones, gradients can flow through many time steps without vanishing, allowing LSTMs to learn dependencies spanning hundreds of time steps.
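The gating logic in the table can be expressed compactly. The sketch below stacks the four gate computations into single matrices W, U, and b, which is one common but not universal parameterization:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step; W, U, b stack the parameters of all four gates/layers."""
    z = W @ x_t + U @ h_prev + b            # shape (4 * hidden,)
    f, i, g, o = np.split(z, 4)
    f = sigmoid(f)                           # forget gate: 0 = discard, 1 = keep
    i = sigmoid(i)                           # input gate: which candidate values to write
    g = np.tanh(g)                           # candidate values for the cell state
    o = sigmoid(o)                           # output gate: what to expose as h_t
    c_t = f * c_prev + i * g                 # additive "conveyor belt" update of the cell state
    h_t = o * np.tanh(c_t)                   # hidden state passed to the next step/layer
    return h_t, c_t

rng = np.random.default_rng(0)
d_in, d_h = 4, 8
W = rng.normal(scale=0.1, size=(4 * d_h, d_in))
U = rng.normal(scale=0.1, size=(4 * d_h, d_h))
b = np.zeros(4 * d_h)
h, c = np.zeros(d_h), np.zeros(d_h)
for x_t in rng.normal(size=(10, d_in)):      # run over a short random sequence
    h, c = lstm_step(x_t, h, c, W, U, b)
```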
Common LSTM variants include the peephole LSTM (whose gates also receive the cell state as input), bidirectional LSTMs, stacked (deep) LSTMs, and the convolutional LSTM (ConvLSTM) used for spatiotemporal data.
The gated recurrent unit (GRU), introduced by Cho et al. in 2014, simplifies the LSTM architecture by combining the forget and input gates into a single update gate and merging the cell state and hidden state into one vector. The GRU uses two gates:
| Gate | Function |
|---|---|
| Update gate (z_t) | Controls how much of the previous hidden state to retain versus how much to replace with new candidate information |
| Reset gate (r_t) | Determines how much of the previous hidden state to incorporate when computing the new candidate hidden state |
GRUs have approximately 25% fewer parameters than LSTMs at the same hidden size, making them faster to train and more memory-efficient. Empirical evaluations by Chung et al. (2014) found that GRUs match or outperform LSTMs on many tasks with sequences under 200 to 300 steps, though LSTMs tend to have an advantage on very long sequences (beyond 500 steps).
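For comparison with the LSTM sketch above, here is a minimal GRU step under the same stacked-parameter convention (the names W, U, b are an assumption of this sketch, not a fixed standard):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W, U, b):
    """One GRU step; W, U, b stack the update gate, reset gate, and candidate parameters."""
    Wz, Wr, Wh = np.split(W, 3)
    Uz, Ur, Uh = np.split(U, 3)
    bz, br, bh = np.split(b, 3)
    z = sigmoid(Wz @ x_t + Uz @ h_prev + bz)               # update gate: keep vs. replace
    r = sigmoid(Wr @ x_t + Ur @ h_prev + br)               # reset gate: how much history to use
    h_tilde = np.tanh(Wh @ x_t + Uh @ (r * h_prev) + bh)   # candidate hidden state
    return (1 - z) * h_prev + z * h_tilde                  # single merged state, no separate cell
```

It can be driven by the same kind of loop as the LSTM sketch, with the stacked matrices holding three blocks instead of four.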
The encoder-decoder architecture, also known as sequence-to-sequence (seq2seq), was introduced independently by Sutskever, Vinyals, and Le (2014) and by Cho et al. (2014). It maps a variable-length input sequence to a variable-length output sequence using two separate networks: an encoder that reads the input sequence and compresses it into a context representation, and a decoder that generates the output sequence from that representation.
The original seq2seq models used stacked LSTMs for both the encoder and decoder. On the WMT 2014 English-to-French translation task, Sutskever et al. achieved a BLEU score of 34.8, demonstrating that end-to-end neural approaches could compete with phrase-based statistical systems.
A significant limitation of the basic seq2seq architecture is the information bottleneck: the entire input sequence must be compressed into a single fixed-size vector, causing performance to degrade on long input sequences. The attention mechanism, described below, was designed to address this limitation.
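A minimal PyTorch sketch of the basic, attention-free encoder-decoder (with illustrative dimensions) makes the bottleneck visible: the decoder sees the source only through the encoder's final state:

```python
import torch
import torch.nn as nn

encoder = nn.LSTM(input_size=16, hidden_size=64, batch_first=True)
decoder = nn.LSTM(input_size=16, hidden_size=64, batch_first=True)

src = torch.randn(2, 12, 16)      # source sequence: batch of 2, length 12
tgt = torch.randn(2, 9, 16)       # target-side inputs (teacher-forced), length 9

_, (h, c) = encoder(src)          # the entire source is compressed into the fixed-size (h, c)
dec_out, _ = decoder(tgt, (h, c)) # the decoder conditions only on that single summary state
```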
The attention mechanism allows a decoder to access all encoder hidden states rather than relying on a single summary vector. At each decoding step, the model computes a weighted sum of encoder states, where the weights (attention scores) indicate the relevance of each input position to the current output position.
There are several variants of attention:
| Variant | Year | Key idea |
|---|---|---|
| Bahdanau (additive) attention | 2014 | Alignment scores computed via a learned feed-forward network over concatenated encoder and decoder states |
| Luong (multiplicative) attention | 2015 | Alignment scores computed via dot product or bilinear transformation between encoder and decoder states |
| Self-attention | 2017 | Each position in a sequence attends to all other positions in the same sequence |
| Multi-head attention | 2017 | Multiple attention functions run in parallel on different learned projections, then concatenated |
| Cross-attention | 2017 | Attention between two different sequences (e.g., encoder output and decoder input) |
Bahdanau attention resolved the bottleneck problem by allowing the decoder to dynamically focus on different input positions at each output step. For example, when translating a sentence from English to French, the model can attend to the English word "cat" when generating the French word "chat," regardless of sentence length.
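The core computation is small. A sketch of Luong-style dot-product attention for a single decoder step, assuming the encoder and decoder states share the same dimensionality:

```python
import numpy as np

def dot_product_attention(decoder_state, encoder_states):
    """decoder_state: shape (d,); encoder_states: shape (T, d).
    Returns the context vector and the attention weights over input positions."""
    scores = encoder_states @ decoder_state            # alignment score for each input position
    weights = np.exp(scores - scores.max())            # numerically stable softmax
    weights /= weights.sum()
    context = weights @ encoder_states                 # weighted sum of encoder states
    return context, weights
```

Bahdanau (additive) attention replaces the dot product with a small feed-forward network over the concatenated states, but the weighted-sum step is the same.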
The Transformer, introduced by Vaswani et al. in 2017 in "Attention Is All You Need," dispenses with recurrence entirely and relies solely on attention mechanisms and feedforward layers. The architecture uses multi-head self-attention, position-wise feedforward networks, positional encodings to represent token order, and residual connections with layer normalization around each sublayer.
The original Transformer used a stack of 6 encoder layers and 6 decoder layers; the base configuration had roughly 65 million parameters and the larger variant roughly 213 million. The larger model achieved a BLEU score of 28.4 on the WMT 2014 English-to-German translation benchmark, surpassing all previously published models, including ensembles.
The self-attention operation has O(n^2) complexity with respect to sequence length n, which becomes expensive for very long sequences. Various approaches have been proposed to address this, including sparse attention patterns, linear attention approximations, and the structured state space models discussed below.
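A single-head self-attention sketch in NumPy shows both the mechanism and the source of the quadratic cost: the score matrix has one entry per pair of positions. The projection matrices here are placeholders for learned parameters:

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """X: input sequence, shape (n, d_model); W_q, W_k, W_v: learned projections."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # (n, n): every position vs. every other
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # softmax over key positions
    return weights @ V                                    # new representation for each position

rng = np.random.default_rng(0)
n, d = 6, 16
X = rng.normal(size=(n, d))
out = self_attention(X, *(rng.normal(scale=0.1, size=(d, d)) for _ in range(3)))
```

Multi-head attention runs several such projections in parallel and concatenates the results; a decoder additionally masks the score matrix so that each position cannot attend to later positions.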
Transformer-based models now dominate sequence modeling across domains. Encoder-only variants like BERT are used for classification and understanding tasks. Decoder-only variants like GPT are used for text generation. Encoder-decoder variants like T5 handle sequence-to-sequence tasks.
Temporal convolutional networks (TCNs) apply one-dimensional causal convolutions to sequential data. A causal convolution ensures that the output at time t depends only on inputs at time t and earlier, preserving the temporal ordering. TCNs use dilated convolutions to exponentially increase the receptive field without increasing the number of parameters, allowing them to capture long-range dependencies efficiently.
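A sketch of a causal, dilated 1-D convolution over a scalar sequence; the filter length and dilation are illustrative:

```python
import numpy as np

def causal_dilated_conv1d(x, w, dilation=1):
    """x: input sequence, shape (T,); w: filter taps, shape (k,).
    Output at time t depends only on x[t], x[t - d], ..., x[t - (k-1) * d]."""
    k = len(w)
    pad = (k - 1) * dilation
    x_padded = np.concatenate([np.zeros(pad), x])   # left-pad only, so no future values leak in
    return np.array([
        sum(w[j] * x_padded[t + pad - j * dilation] for j in range(k))
        for t in range(len(x))
    ])

y = causal_dilated_conv1d(np.arange(10.0), w=np.array([0.5, 0.5]), dilation=2)

# Stacking layers with dilations 1, 2, 4, 8, ... grows the receptive field exponentially
# while the number of parameters grows only linearly with depth.
```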
Bai, Kolter, and Koltun (2018) demonstrated that TCNs outperform LSTMs and GRUs on several standard sequence modeling benchmarks, particularly on tasks requiring very long memory spans. TCNs also offer advantages in parallelization, since all output positions can be computed simultaneously during training (unlike the sequential computation required by RNNs).
WaveNet, introduced by DeepMind in 2016, is a notable TCN-based architecture originally designed for raw audio generation. It uses stacked dilated causal convolutions and has been applied to text-to-speech synthesis and music generation.
Structured state space models (SSMs) represent a newer class of sequence models that draw on continuous-time dynamical systems theory. The Structured State Space Sequence model (S4), introduced by Albert Gu et al. in 2021, parameterizes a linear state space system and uses a special HiPPO (High-order Polynomial Projection Operator) initialization to efficiently capture long-range dependencies with linear complexity in sequence length.
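At its core, an SSM layer applies a linear recurrence to each input channel. The sketch below uses fixed, randomly chosen matrices purely for illustration; S4's contribution is the structured, HiPPO-initialized parameterization and an efficient way to evaluate this recurrence as a convolution:

```python
import numpy as np

def linear_ssm(A, B, C, u):
    """Run a discretized linear state space model over a scalar input sequence u.

    x_k = A @ x_{k-1} + B * u_k   (state update)
    y_k = C @ x_k                 (output)
    """
    x = np.zeros(A.shape[0])
    ys = []
    for u_k in u:                 # linear recurrence; equivalently a long 1-D convolution
        x = A @ x + B * u_k
        ys.append(C @ x)
    return np.array(ys)

rng = np.random.default_rng(0)
N = 8                             # state dimension
A = rng.normal(scale=0.1, size=(N, N))
B = rng.normal(size=N)
C = rng.normal(size=N)
y = linear_ssm(A, B, C, u=rng.normal(size=100))
```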
Mamba, introduced by Gu and Tri Dao in December 2023, extends S4 by making the state space parameters input-dependent (selective), allowing the model to filter information based on content rather than using fixed dynamics. Mamba achieves 5x higher inference throughput than similarly sized Transformers and scales linearly with sequence length. On language modeling benchmarks, Mamba-3B matches or outperforms Transformers twice its size.
Hybrid architectures combining SSM layers with attention layers (such as Jamba by AI21 Labs) have emerged as a way to combine the efficiency of SSMs with the strong in-context learning capabilities of Transformers.
| Architecture | Year introduced | Key mechanism | Sequence length handling | Parallelizable | Parameters (relative) | Main limitations |
|---|---|---|---|---|---|---|
| HMM | 1960s | Markov transitions + emission probabilities | Fixed-order Markov assumption | Yes (forward algorithm) | Low | Cannot model long-range dependencies; requires independence assumptions |
| Vanilla RNN | 1986/1990 | Recurrent hidden state | Poor on long sequences | No | Low | Vanishing/exploding gradients |
| LSTM | 1997 | Gated memory cell | Good (hundreds of steps) | No | Medium | Sequential processing; slow training on long sequences |
| GRU | 2014 | Update and reset gates | Good (moderate sequences) | No | Medium (fewer than LSTM) | Sequential processing; may underperform LSTM on very long sequences |
| Seq2seq (with attention) | 2014-2015 | Encoder-decoder + attention | Improved over basic seq2seq | Partially | Medium-High | Still relies on sequential encoder/decoder |
| Transformer | 2017 | Self-attention | Quadratic cost, but handles long-range well | Yes | High | O(n^2) memory and compute in sequence length |
| TCN | 2016-2018 | Dilated causal convolutions | Good (via dilation) | Yes | Medium | Fixed receptive field; less flexible than attention |
| SSM (S4/Mamba) | 2021-2023 | Structured state space | Linear complexity | Yes | Medium | Relatively new; less mature ecosystem |
Several specialized techniques are used when training sequence models.
BPTT is the standard algorithm for computing gradients in recurrent networks. It works by unrolling the recurrent computation graph across all time steps and then applying standard backpropagation. The computational and memory costs of BPTT grow linearly with sequence length.
Truncated BPTT addresses scalability by dividing the sequence into shorter segments and computing gradients within each segment. This introduces a bias (the model cannot learn dependencies longer than the segment length) but makes training practical for long sequences.
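A minimal PyTorch sketch of truncated BPTT; the model, the segment length of 50, and the random data are placeholders. The key step is detaching the hidden state at each segment boundary so gradients do not flow further back:

```python
import torch
import torch.nn as nn

model = nn.RNN(input_size=16, hidden_size=32, batch_first=True)
readout = nn.Linear(32, 1)
optimizer = torch.optim.Adam(list(model.parameters()) + list(readout.parameters()))

sequence = torch.randn(8, 1000, 16)     # (batch, time, features), placeholder data
targets = torch.randn(8, 1000, 1)
segment_len = 50

h = None
for start in range(0, sequence.size(1), segment_len):
    x_seg = sequence[:, start:start + segment_len]
    y_seg = targets[:, start:start + segment_len]
    out, h = model(x_seg, h)
    loss = nn.functional.mse_loss(readout(out), y_seg)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    h = h.detach()                       # truncate: no gradient flows across segment boundaries
```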
During training of autoregressive sequence models, teacher forcing feeds the ground-truth output at each time step as input to the next step, rather than using the model's own predictions. This stabilizes training and speeds convergence but can cause a mismatch between training (where the model sees correct inputs) and inference (where it sees its own potentially incorrect predictions). Techniques like scheduled sampling (Bengio et al., 2015) gradually transition from teacher forcing to model predictions during training to reduce this discrepancy.
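A sketch of a decoding loop that interpolates between teacher forcing and feeding back the model's own predictions; the embedding, GRU cell, and output projection are illustrative stand-ins for a real decoder:

```python
import random
import torch
import torch.nn as nn

vocab_size, hidden_size = 100, 32
embed = nn.Embedding(vocab_size, hidden_size)
decoder_cell = nn.GRUCell(hidden_size, hidden_size)
out_proj = nn.Linear(hidden_size, vocab_size)

def decode(target_tokens, h, teacher_forcing_prob=1.0):
    """target_tokens: (T,) ground-truth token ids; h: (hidden_size,) initial decoder state."""
    inp = target_tokens[0]
    logits_per_step = []
    for t in range(1, len(target_tokens)):
        h = decoder_cell(embed(inp).unsqueeze(0), h.unsqueeze(0)).squeeze(0)
        logits = out_proj(h)
        logits_per_step.append(logits)
        if random.random() < teacher_forcing_prob:
            inp = target_tokens[t]       # teacher forcing: feed the ground-truth token
        else:
            inp = logits.argmax()        # scheduled sampling: feed the model's own prediction
    return torch.stack(logits_per_step)

tokens = torch.randint(0, vocab_size, (12,))
logits = decode(tokens, torch.zeros(hidden_size), teacher_forcing_prob=0.75)
```

Scheduled sampling anneals `teacher_forcing_prob` from 1.0 toward a lower value as training progresses.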
Gradient clipping limits the magnitude of gradients during training to prevent the exploding gradient problem. When the norm of the gradient vector exceeds a threshold, it is rescaled to the threshold value. Pascanu, Mikolov, and Bengio (2013) showed that gradient clipping is effective for stabilizing RNN training.
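In PyTorch, clipping the global gradient norm is a single call between the backward pass and the optimizer step; the model, data, and threshold of 1.0 below are placeholders:

```python
import torch
import torch.nn as nn

model = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(4, 100, 8)              # placeholder batch
out, _ = model(x)
loss = out.pow(2).mean()                # placeholder loss

optimizer.zero_grad()
loss.backward()
total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()                        # the gradients applied here have norm at most 1.0
```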
CTC, introduced by Graves et al. in 2006, is a training criterion for sequence labeling problems where the alignment between input and output sequences is unknown. CTC marginalizes over all possible alignments, allowing the model to learn directly from unsegmented data. It has been widely adopted in speech recognition and handwriting recognition systems. Unlike HMM-based approaches, CTC does not require pre-segmented training data or an external alignment model.
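A sketch of CTC training with PyTorch's built-in loss; the shapes (50 input frames, 10-token labels, 28 classes with the blank at index 0) are illustrative:

```python
import torch
import torch.nn as nn

T, batch, num_classes = 50, 4, 28       # e.g. 26 letters + space, plus the CTC blank at index 0
log_probs = torch.randn(T, batch, num_classes, requires_grad=True).log_softmax(dim=-1)
targets = torch.randint(1, num_classes, (batch, 10))          # label sequences contain no blanks
input_lengths = torch.full((batch,), T, dtype=torch.long)
target_lengths = torch.full((batch,), 10, dtype=torch.long)

ctc = nn.CTCLoss(blank=0)               # marginalizes over all alignments of labels to frames
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()                          # in practice, log_probs would come from the network
```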
In curriculum learning, training examples are presented in order of increasing difficulty, starting with shorter or simpler sequences and progressing to longer or more complex ones. This approach can improve convergence speed and final model performance, particularly for sequence models that struggle with long-range dependencies early in training.
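A minimal length-based curriculum sketch; the number of stages and the use of sequence length as the difficulty measure are assumptions of this example:

```python
def curriculum_stages(examples, num_stages=4):
    """examples: list of (sequence, label) pairs.
    Yields progressively larger training pools, adding longer sequences at each stage."""
    ordered = sorted(examples, key=lambda ex: len(ex[0]))   # shortest (easiest) first
    stage_size = max(1, len(ordered) // num_stages)
    for stage in range(1, num_stages + 1):
        yield ordered[: stage * stage_size]                 # early stages see only short sequences
```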
Sequence models are applied across a broad range of domains.
Natural language processing (NLP) is the most prominent application area for sequence models. Key tasks include machine translation, language modeling and text generation, text classification and sentiment analysis, question answering, named entity recognition, and summarization.
Speech recognition systems convert spoken language to text. The field transitioned from HMM-based systems to deep learning approaches over the 2010s:
Text-to-speech synthesis has similarly benefited from sequence models, with systems like WaveNet (2016) and Tacotron (2017) producing natural-sounding speech from text.
Sequence models are used to predict future values in time series data, including:
LSTMs and GRUs have been standard approaches for time series forecasting. More recently, Transformer-based models and foundation models for time series (such as Amazon Chronos, 2024) have shown competitive results, treating time series forecasting as a token prediction problem analogous to language modeling.
Sequence models process biological sequences in multiple ways:
Sequence models generate music and audio by learning patterns in sequential data:
In robotics and autonomous systems, sequence models process sequential sensor data and generate action sequences:
Sequence models are evaluated on domain-specific benchmarks:
| Domain | Benchmark | Metric | Description |
|---|---|---|---|
| Machine translation | WMT | BLEU | Measures n-gram overlap between machine and reference translations |
| Language modeling | WikiText-103, The Pile | Perplexity | Lower perplexity indicates better prediction of held-out text |
| Speech recognition | LibriSpeech | Word Error Rate (WER) | Percentage of incorrectly transcribed words |
| Sentiment analysis | SST-2, IMDb | Accuracy | Correct classification rate on positive/negative reviews |
| Question answering | SQuAD | F1 / Exact Match | Overlap between predicted and ground-truth answer spans |
| Time series | M4, M5 | MASE, sMAPE | Error metrics for forecasting accuracy |
| Long-range | Long Range Arena | Accuracy | Tests model performance on tasks requiring dependencies over 1,000+ steps |
The Long Range Arena benchmark (Tay et al., 2020) was specifically designed to compare sequence models on their ability to handle long-range dependencies, providing standardized tasks across different sequence lengths and modalities.
Several trends are shaping the development of sequence models as of 2025-2026: