Machine learning terms/Sequence Models
Last reviewed
May 9, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v2 · 3,614 words
See also: Machine learning terms
Sequence models are a class of machine learning systems designed to process inputs or produce outputs that have a meaningful order. The order of the elements carries information that models which treat each element independently cannot easily capture. Typical sequence inputs include natural language text, speech audio, video frames, biological sequences such as DNA and proteins, sensor readings over time, financial price history, log streams, and user clickstreams. Many of the most influential systems in modern artificial intelligence, including large language models, machine translation systems, speech recognizers, and time series forecasters, are built on sequence modeling foundations.
Formally, a sequence model defines a probability distribution or a function over sequences. A common formulation factorizes the joint distribution of a sequence x_1, x_2, ..., x_T using the chain rule of probability, so that p(x_1, ..., x_T) = ∏_{t=1}^{T} p(x_t | x_1, ..., x_{t-1}). This autoregressive factorization underlies N-gram language models, classical recurrent neural networks, and decoder-only Transformer models such as the GPT family. Other sequence formulations include conditional models p(y | x) for tasks such as translation, sequence labeling models for part-of-speech tagging or named entity recognition, and bidirectional encoders such as BERT that capture context from both directions.
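As a minimal sketch of the chain-rule factorization, the snippet below scores a short sequence by summing log conditionals. The function `cond_prob` and its probability table are hypothetical toy values invented for illustration (this toy happens to look only at the last symbol); any function returning p(x_t | x_1, ..., x_{t-1}) could take its place.

```python
import math

# Hypothetical toy conditional model, invented for illustration.
# It only inspects the last symbol of the history, but the chain-rule
# loop below works for any conditional distribution.
def cond_prob(symbol, history):
    if not history:
        return {"a": 0.5, "b": 0.5}[symbol]
    table = {"a": {"a": 0.9, "b": 0.1}, "b": {"a": 0.4, "b": 0.6}}
    return table[history[-1]][symbol]

def sequence_log_prob(seq):
    """Sum log p(x_t | x_1, ..., x_{t-1}) over all positions t."""
    return sum(math.log(cond_prob(x, seq[:t])) for t, x in enumerate(seq))

log_p = sequence_log_prob(["a", "a", "b"])
```

Under these toy numbers the sequence a, a, b has probability 0.5 × 0.9 × 0.1 = 0.045, and working in log space avoids underflow on long sequences.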
Over several decades the dominant architectures for sequence modeling have shifted from probabilistic graphical models such as hidden Markov models and conditional random fields, to recurrent neural networks such as the LSTM and GRU, to attention based encoder-decoders, to the Transformer and its many variants. More recently, structured state space models such as Mamba have re-introduced linear time recurrences that compete with attention on long contexts. This article surveys these families with references to seminal papers and links to related wiki entries.
In many real tasks the meaning of an input depends on the order of its parts. For example, the sentences "the dog bit the man" and "the man bit the dog" have identical bag-of-words representations but very different meanings. Time series such as electrocardiogram signals, stock prices, and weather measurements only make sense as ordered samples. Speech audio is a one dimensional sequence of pressure values. Even tabular data with a timestep index, such as patient records collected over visits, becomes a sequence problem.
A sequence model must therefore represent dependencies among elements that may be near in time, far in time, or both. Architectures differ in how they model these dependencies, in how their compute and memory scale with sequence length, and in how easy they are to train with gradient descent.
Sequence problems come in several shapes, often described in terms of the relationship between input and output sequences:
| pattern | example tasks | typical models |
|---|---|---|
| one to many | image captioning | CNN encoder with RNN or Transformer decoder |
| many to one | sentiment classification, audio classification | RNN, Transformer encoder |
| many to many synced | part of speech tagging, frame level video labeling | bidirectional RNN, Transformer encoder |
| many to many unsynced | machine translation, summarization | encoder-decoder Transformer |
| autoregressive generation | language modeling, text generation | decoder-only Transformer, RNN |
| sequence to scalar over time | next click prediction, time series forecasting | RNN, Transformer, state space model |
A related distinction is causal versus non-causal modeling. Causal models predict element t using only elements 1 through t-1, which is required for autoregressive generation. Non-causal models are free to look at the full sequence, which can improve representation learning when generation is not required.
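The causal versus non-causal distinction can be sketched as a rule for which positions a model may read when predicting one element. The helper name `visible_positions` is hypothetical, and the non-causal branch assumes a masked-prediction style setup in which only the target itself is hidden.

```python
def visible_positions(t, length, causal):
    """Indices a model may read when predicting the element at index t (0-based)."""
    if causal:
        # Causal: strictly earlier elements only.
        return list(range(t))
    # Non-causal (assumed masked-prediction setup): everything except the target.
    return [s for s in range(length) if s != t]

past_only = visible_positions(2, 5, causal=True)
both_sides = visible_positions(2, 5, causal=False)
```

In attention-based models the same rule is usually applied as a triangular mask over the attention scores rather than by slicing the input.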
The simplest non-trivial sequence model is the Markov chain, in which the next state depends only on the current state. A first order Markov model assumes p(x_t | x_1, ..., x_{t-1}) = p(x_t | x_{t-1}). Higher order Markov models extend the dependency to a fixed window.
Hidden Markov models, popularized for speech recognition by Lawrence Rabiner's 1989 tutorial, model an unobserved Markov state that emits observable outputs. HMMs were the dominant approach to acoustic modeling in automatic speech recognition from the 1980s through the early 2010s, often combined with Gaussian mixture models for emission distributions. Standard HMM algorithms include the forward-backward algorithm for inference, the Viterbi algorithm for finding the most likely state sequence, and Baum-Welch for parameter estimation. HMMs remain useful for low resource sequence labeling and bioinformatics.
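To make the Viterbi algorithm concrete, here is a minimal sketch in plain Python. The two-state weather-style HMM (states, probabilities, and observation names) is a toy example chosen for illustration and is not taken from any system described in this article. Real implementations typically work in log space to avoid underflow.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    # best[t][s]: probability of the best state path ending in s at step t
    best = [{s: start_p[s] * emit_p[s][observations[0]] for s in states}]
    back = [{}]
    for obs in observations[1:]:
        prev = best[-1]
        scores, pointers = {}, {}
        for s in states:
            # Pick the predecessor that maximizes the path probability into s.
            r = max(states, key=lambda q: prev[q] * trans_p[q][s])
            scores[s] = prev[r] * trans_p[r][s] * emit_p[s][obs]
            pointers[s] = r
        best.append(scores)
        back.append(pointers)
    # Trace back from the most probable final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for pointers in reversed(back[1:]):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[-1][last]

# Toy parameters (illustrative assumptions).
states = ["Healthy", "Fever"]
start_p = {"Healthy": 0.6, "Fever": 0.4}
trans_p = {"Healthy": {"Healthy": 0.7, "Fever": 0.3},
           "Fever": {"Healthy": 0.4, "Fever": 0.6}}
emit_p = {"Healthy": {"normal": 0.5, "cold": 0.4, "dizzy": 0.1},
          "Fever": {"normal": 0.1, "cold": 0.3, "dizzy": 0.6}}

path, prob = viterbi(["normal", "cold", "dizzy"], states, start_p, trans_p, emit_p)
```

With these toy numbers the most likely state sequence is Healthy, Healthy, Fever, with joint probability 0.01512.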
Conditional random fields, introduced by John Lafferty, Andrew McCallum, and Fernando Pereira in 2001, are discriminative undirected graphical models for sequence labeling. Linear chain CRFs directly model p(y | x) and avoid the label bias problem that affects locally normalized models. CRFs were widely used for named entity recognition, part-of-speech tagging, and shallow parsing through the 2000s. They are still useful as the final layer of neural sequence taggers, where a bidirectional LSTM or Transformer produces emission scores and a CRF layer enforces transition consistency.
N-gram language models estimate p(w_t | w_{t-n+1}, ..., w_{t-1}) by counting word sequences in a corpus. A bigram model conditions on the previous word, while a trigram model conditions on two preceding words. Smoothing methods such as Kneser-Ney and Good-Turing reduce the impact of unseen sequences. N-gram models powered statistical machine translation systems, query auto-completion, and many speech systems for decades. Although they have been largely replaced by neural language models, they remain valuable for fast scoring, low memory budgets, and as features inside larger systems.
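The counting estimate can be sketched for the bigram case as follows. Note the swap: for brevity this sketch uses add-alpha (Laplace) smoothing, which is much simpler than the Kneser-Ney and Good-Turing methods named above. The function names and the five-word corpus are illustrative assumptions.

```python
from collections import Counter

def train_bigram(tokens):
    contexts = Counter(tokens[:-1])             # how often each word appears as a context
    bigrams = Counter(zip(tokens, tokens[1:]))  # counts of adjacent word pairs
    vocab = set(tokens)
    return contexts, bigrams, vocab

def bigram_prob(prev, word, contexts, bigrams, vocab, alpha=1.0):
    """Add-alpha smoothed estimate of p(word | prev)."""
    return (bigrams[(prev, word)] + alpha) / (contexts[prev] + alpha * len(vocab))

contexts, bigrams, vocab = train_bigram("the dog bit the man".split())
p_seen = bigram_prob("the", "dog", contexts, bigrams, vocab)    # pair observed once
p_unseen = bigram_prob("the", "bit", contexts, bigrams, vocab)  # pair never observed
```

Smoothing gives the unseen pair a small non-zero probability, and the estimates for a fixed context still sum to one over the vocabulary.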
A recurrent neural network processes a sequence one element at a time while maintaining a hidden state that summarizes the history. At each timestep t, the network computes h_t = phi(W_h h_{t-1} + W_x x_t + b), where phi is a nonlinearity such as tanh. Outputs may be produced at every step or only at the end.
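The recurrence above can be written out directly. This is a minimal sketch with plain Python lists and phi = tanh; the weight values are arbitrary illustrative numbers, and practical implementations would use a tensor library instead.

```python
import math

def rnn_step(h_prev, x, W_h, W_x, b):
    """One step of h_t = tanh(W_h h_{t-1} + W_x x_t + b)."""
    h = []
    for i in range(len(b)):
        pre = b[i]
        pre += sum(W_h[i][j] * h_prev[j] for j in range(len(h_prev)))
        pre += sum(W_x[i][k] * x[k] for k in range(len(x)))
        h.append(math.tanh(pre))
    return h

def rnn_forward(xs, h0, W_h, W_x, b):
    """Run the recurrence over a sequence and return every hidden state."""
    states, h = [], h0
    for x in xs:
        h = rnn_step(h, x, W_h, W_x, b)
        states.append(h)
    return states

# Arbitrary illustrative parameters: hidden size 2, input size 1.
W_h = [[0.5, 0.0], [0.0, 0.5]]
W_x = [[1.0], [-1.0]]
b = [0.0, 0.0]
states = rnn_forward([[1.0], [0.0]], [0.0, 0.0], W_h, W_x, b)
```

A many-to-one model would read only `states[-1]`, while a tagger would attach an output layer to every entry of `states`.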
The Elman network, described by Jeffrey Elman in 1990 in "Finding Structure in Time," is the canonical simple RNN. Michael Jordan had proposed a related architecture in 1986 that fed the previous output, rather than the previous hidden state, back into the network. Vanilla RNNs are conceptually elegant but suffer from severe optimization difficulties on long sequences.
Training an RNN by backpropagation through time requires multiplying many Jacobians, one per timestep. When the spectral radius of the recurrent weight matrix is less than one, the gradient norm shrinks geometrically with the number of steps, producing the vanishing gradient problem analyzed by Sepp Hochreiter in his 1991 diploma thesis and by Yoshua Bengio, Patrice Simard, and Paolo Frasconi in 1994. When the spectral radius exceeds one, gradients can grow without bound, which is the exploding gradient problem. Vanishing gradients prevent learning long range dependencies, while exploding gradients destabilize training.
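A scalar toy case shows the geometric behavior. The sketch assumes the simplest possible recurrence, h_t = w * h_{t-1} with no nonlinearity, so each timestep contributes the same Jacobian factor w; real RNNs have matrix-valued, state-dependent Jacobians.

```python
def gradient_through_time(w, steps):
    """d h_T / d h_0 for the scalar linear recurrence h_t = w * h_{t-1}."""
    grad = 1.0
    for _ in range(steps):
        grad *= w  # one Jacobian factor per timestep
    return grad

shrinking = gradient_through_time(0.9, 100)  # |w| < 1: gradient vanishes
growing = gradient_through_time(1.1, 100)    # |w| > 1: gradient explodes
```

After 100 steps the first gradient is on the order of 1e-5 and the second on the order of 1e4, even though the two weights differ by only 0.2.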
A standard remedy for explosion is gradient clipping, proposed by Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio in 2013. Gradient clipping rescales the gradient vector whenever its norm exceeds a threshold, which keeps updates bounded without changing their direction. The cure for vanishing gradients was architectural: gated RNNs.
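Norm-based clipping as described above fits in a few lines. This is a minimal sketch over a flat list of gradient values; the function name `clip_by_norm` is an assumption, and deep learning libraries provide their own equivalents.

```python
import math

def clip_by_norm(grad, max_norm):
    """Rescale grad so its L2 norm is at most max_norm, keeping its direction."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)  # small enough: leave unchanged
    scale = max_norm / norm
    return [g * scale for g in grad]

clipped = clip_by_norm([3.0, 4.0], 1.0)    # norm 5 is rescaled to norm 1
untouched = clip_by_norm([0.3, 0.4], 1.0)  # norm 0.5 is below the threshold
```

Because every component is multiplied by the same factor, the clipped vector points in the same direction as the original gradient.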
The Long Short-Term Memory (LSTM), introduced by Sepp Hochreiter and Juergen Schmidhuber in 1997, replaced the simple recurrence with a memory cell controlled by gates. The classical LSTM has an input gate, an output gate, and a forget gate added by Felix Gers, Juergen Schmidhuber, and Fred Cummins in 2000. The cell state c_t is updated additively, c_t = f_t * c_{t-1} + i_t * g_t, which lets gradients flow through many steps without vanishing.
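The gated update can be sketched as a single LSTM step. The parameter layout (a dict mapping each gate name to a weight matrix and bias acting on the concatenation of h_{t-1} and x_t) is an illustrative assumption, as is the common choice h_t = o_t * tanh(c_t) for the hidden state.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, v, b):
    """Matrix-vector product plus bias, with plain lists."""
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]

def lstm_step(x, h_prev, c_prev, params):
    """params maps "f", "i", "o", "g" to (W, b) acting on [h_prev, x]."""
    z = h_prev + x  # list concatenation

    def gate(name, activation):
        W, b = params[name]
        return [activation(a) for a in affine(W, z, b)]

    f = gate("f", sigmoid)    # forget gate
    i = gate("i", sigmoid)    # input gate
    o = gate("o", sigmoid)    # output gate
    g = gate("g", math.tanh)  # candidate cell values
    # Additive cell update: c_t = f_t * c_{t-1} + i_t * g_t
    c = [ft * cp + it * gt for ft, cp, it, gt in zip(f, c_prev, i, g)]
    h = [ot * math.tanh(ct) for ot, ct in zip(o, c)]
    return h, c

# One hidden unit, one input, all-zero parameters: every sigmoid gate is 0.5.
zero = ([[0.0, 0.0]], [0.0])
params = {"f": zero, "i": zero, "o": zero, "g": zero}
h, c = lstm_step([1.0], [0.0], [2.0], params)
```

With all-zero parameters the candidate g is 0, so the new cell state is simply half the old one, which makes the role of the forget gate easy to see.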
LSTMs dominated sequence modeling from roughly 2014 to 2018. They powered Google Translate after its 2016 neural rewrite, large speech recognition systems at major laboratories, and many text classification and labeling pipelines. The shorthand LSTM is universally understood inside the field.
The gated recurrent unit, or GRU, was introduced by Kyunghyun Cho and colleagues in 2014. The GRU merges the forget and input gates into a single update gate and has no separate cell state, which yields fewer parameters than an LSTM. Empirical comparisons by Junyoung Chung and colleagues found GRUs and LSTMs to be roughly comparable on many tasks, with the GRU slightly faster and the LSTM occasionally more expressive on very long sequences.
A bidirectional RNN, introduced by Mike Schuster and Kuldip Paliwal in 1997, runs one RNN forward and another backward over the input, then concatenates their hidden states. This gives every position access to both past and future context and is well suited to non-generative tasks such as labeling and reading comprehension. Stacking multiple recurrent layers yields deep RNNs, which can capture hierarchical structure but increase optimization difficulty.
The encoder-decoder, or seq2seq, framework was proposed in two near-simultaneous papers in 2014: Ilya Sutskever, Oriol Vinyals, and Quoc Le's "Sequence to Sequence Learning with Neural Networks" and Kyunghyun Cho and colleagues' "Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation." An encoder RNN reads the input sequence into a fixed length vector, and a decoder RNN generates the output sequence conditioned on that vector. This single architecture handles translation, summarization, and many other text to text tasks.
A fixed length vector becomes a bottleneck on long inputs. Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio proposed attention in their 2014 paper "Neural Machine Translation by Jointly Learning to Align and Translate." Their attention layer lets the decoder, at each output step, compute a weighted average of all encoder hidden states, with weights produced by a learned alignment model. Minh-Thang Luong, Hieu Pham, and Christopher Manning later proposed simpler dot-product variants in 2015. Attention dramatically improved translation quality and provided interpretable alignments between source and target tokens.
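The weighted average at the heart of attention can be sketched as follows. For brevity the sketch scores each encoder state with a plain dot product, in the spirit of the simpler variants mentioned above, rather than the learned alignment model of the original paper; function names are illustrative.

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(decoder_state, encoder_states):
    """Return (context vector, attention weights) using dot-product scores."""
    scores = [sum(d * e for d, e in zip(decoder_state, h)) for h in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [sum(w * h[k] for w, h in zip(weights, encoder_states))
               for k in range(dim)]
    return context, weights

encoder_states = [[1.0, 0.0], [0.0, 1.0]]
uniform_ctx, uniform_w = attend([0.0, 0.0], encoder_states)
focused_ctx, focused_w = attend([10.0, 0.0], encoder_states)
```

A zero query attends uniformly, while a query aligned with the first encoder state puts almost all of its weight there; these weights are what get read off as soft alignments.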
In 2017 Ashish Vaswani and colleagues at Google published "Attention Is All You Need," introducing the Transformer. The Transformer removed recurrence entirely and replaced it with multi-head self-attention and feed-forward layers. Self-attention computes pairwise interactions between all tokens, with queries, keys, and values produced by linear projections of the input. Positional information is restored through positional encodings, originally a sinusoidal scheme.
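The sinusoidal scheme assigns each position a vector of sines and cosines at geometrically spaced frequencies. A minimal sketch, assuming the commonly cited form PE(pos, 2i) = sin(pos / 10000^(2i/d)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d)); the function name is illustrative.

```python
import math

def sinusoidal_encoding(position, d_model):
    """Positional encoding vector of length d_model for one position."""
    enc = []
    for i in range(0, d_model, 2):  # i runs over the even dimensions
        angle = position / (10000 ** (i / d_model))
        enc.append(math.sin(angle))
        if i + 1 < d_model:
            enc.append(math.cos(angle))
    return enc

first = sinusoidal_encoding(0, 4)
second = sinusoidal_encoding(1, 4)
```

These vectors are added to the token embeddings so that otherwise order-blind self-attention can distinguish positions.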
The original paper presented an encoder-decoder Transformer for machine translation, but the same building blocks support encoder-only models, decoder-only models, and many hybrids. Self-attention is parallelizable across sequence positions, which made Transformers far easier to scale on modern accelerators than RNNs.
Three main Transformer families have emerged. Encoder-only models such as BERT, introduced by Jacob Devlin and colleagues in 2018, use bidirectional self-attention and are trained with masked language modeling. They excel at understanding tasks such as classification and question answering. Decoder-only models such as the GPT series from OpenAI, beginning with Alec Radford and colleagues' 2018 GPT-1, use causal self-attention and are trained with next token prediction. They excel at generation and have become the dominant architecture for large language models. Encoder-decoder models such as T5, introduced by Colin Raffel and colleagues in 2019, frame all tasks as text to text problems and are widely used for translation, summarization, and instruction following.
Scaling laws established by Jared Kaplan and colleagues in 2020 and refined by the Chinchilla paper from Jordan Hoffmann and colleagues in 2022 showed that model loss decreases predictably with parameters, data, and compute. This empirical regularity drove the rise of frontier models such as GPT-3, GPT-4, Claude, Gemini, and Llama, which can be viewed as very large autoregressive sequence models. Variants such as Mixture of Experts, used in Switch Transformer and Mixtral, scale parameter counts while keeping per-token compute roughly constant.
Although Transformers are powerful, their attention layer has cost that scales quadratically in sequence length, which limits very long contexts. A line of research on structured state space models revives linear time recurrences with strong long range performance.
Albert Gu and Tri Dao introduced Mamba in late 2023 in "Mamba: Linear-Time Sequence Modeling with Selective State Spaces." Mamba uses input-dependent state space parameters, which make the recurrence selective in a sense analogous to gating. Mamba-2, presented by Tri Dao and Albert Gu in 2024 in "Transformers are SSMs," connects state space models to a class of structured attention and improves training efficiency. The Hyena architecture, by Michael Poli and colleagues in 2023, replaces attention with implicit long convolutions and a gated structure.
Hybrid models combine attention with state space layers. AI21 Labs released Jamba in 2024, a production hybrid that interleaves Mamba blocks, Transformer blocks, and Mixture of Experts. Hybrids aim to combine the precise token-level recall of attention with the linear scaling of state space recurrences on long contexts.
Sequence modeling for numerical time series has a partially separate lineage from natural language modeling. Classical methods include autoregressive integrated moving average (ARIMA) models, exponential smoothing, and state space approaches such as the Kalman filter.
Prophet, released by Facebook in 2017 by Sean Taylor and Benjamin Letham, is a structural time series tool that decomposes a series into trend, seasonality, and holidays. It is widely used for business forecasting because it is robust to missing data and easy to tune.
More recently, foundation models for time series have emerged. TimeGPT was released by Nixtla in 2023 as a closed source forecasting API trained on a large corpus of series. Chronos, from Amazon researchers Abdul Fatir Ansari and colleagues in 2024, tokenizes time series values and trains a Transformer language model on them. Lag-Llama, by Kashif Rasul and colleagues in 2023, is an open foundation model that conditions on lagged values. Moirai, from Salesforce researchers Gerald Woo and colleagues in 2024, is a masked encoder Transformer trained on a large multivariate corpus. PatchTST and N-BEATS represent earlier strong neural baselines.
Connectionist Temporal Classification (CTC), introduced by Alex Graves and colleagues in 2006, allows a sequence model to be trained on input-output pairs of different lengths without explicit alignment. CTC introduces a blank symbol and sums over all valid alignments. CTC remains a core building block for speech recognition, lip reading, and other tasks where the output is no longer than the input.
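The blank symbol and the sum over alignments can be illustrated on a tiny example. `ctc_collapse` applies the usual rule (merge repeated labels, then drop blanks), and `ctc_prob_bruteforce` enumerates every frame-level path, which is exponential in the number of frames. This is an illustrative brute-force sketch only; practical implementations compute the same sum with dynamic programming. The function names and the choice of "_" as the blank are assumptions.

```python
from itertools import groupby, product

def ctc_collapse(frame_labels, blank="_"):
    """Merge consecutive repeats, then remove blanks."""
    merged = [label for label, _ in groupby(frame_labels)]
    return [label for label in merged if label != blank]

def ctc_prob_bruteforce(frame_probs, target, blank="_"):
    """Sum the probability of every frame path that collapses to target."""
    symbols = list(frame_probs[0])
    total = 0.0
    for path in product(symbols, repeat=len(frame_probs)):
        if ctc_collapse(path, blank) == list(target):
            p = 1.0
            for t, s in enumerate(path):
                p *= frame_probs[t][s]
            total += p
    return total

decoded = "".join(ctc_collapse("hh_e_ll_lo"))
# Two frames, each with probability 0.5 for "a" and 0.5 for blank.
p_a = ctc_prob_bruteforce([{"a": 0.5, "_": 0.5}] * 2, "a")
```

In the two-frame example the paths aa, a_, and _a all collapse to "a", so its probability is 0.75. The blank between the two l runs in the first example is what lets CTC emit a doubled letter.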
The RNN transducer, also from Alex Graves in 2012, extends CTC with a separate prediction network that conditions on previous outputs. RNN-T is widely used in production speech recognizers because it streams naturally and supports robust on-device inference.
Whisper, released by OpenAI in 2022 by Alec Radford and colleagues, is an encoder-decoder Transformer trained on hundreds of thousands of hours of weakly supervised audio. It performs multilingual speech recognition, translation, and language identification in a single model. Other notable speech sequence models include Conformer (Anmol Gulati and colleagues, 2020), which augments Transformer blocks with convolution, and wav2vec 2.0 (Alexei Baevski and colleagues, 2020) and HuBERT (Wei-Ning Hsu and colleagues, 2021), which use self-supervised learning on raw audio.
Sequence models are typically trained with stochastic gradient descent and variants such as Adam. Several practical concerns deserve attention.
Teacher forcing feeds the ground truth previous token to the decoder during training, which speeds up convergence but can mismatch the autoregressive setting at inference. Scheduled sampling, proposed by Samy Bengio and colleagues in 2015, mixes ground truth with predictions during training to reduce this exposure bias.
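The difference between teacher forcing and scheduled sampling comes down to which token is fed to the decoder at the next step. A minimal sketch, with a hypothetical helper name and a fixed mixing probability (the original proposal decays this probability over training, which is omitted here):

```python
import random

def next_decoder_input(gold_token, predicted_token, teacher_forcing_prob, rng):
    """Feed the ground truth with the given probability, else the model's own prediction."""
    if rng.random() < teacher_forcing_prob:
        return gold_token
    return predicted_token

rng = random.Random(0)
always_gold = next_decoder_input("cat", "dog", 1.0, rng)   # pure teacher forcing
always_pred = next_decoder_input("cat", "dog", 0.0, rng)   # pure free running
```

Setting the probability to 1.0 recovers plain teacher forcing, and lowering it exposes the decoder to its own mistakes during training.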
Tokenization choices, such as byte pair encoding by Rico Sennrich and colleagues in 2016 or SentencePiece by Taku Kudo and John Richardson in 2018, determine the alphabet over which a language model operates. Subword tokenization handles rare words gracefully and is now standard for text models.
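One merge step of byte pair encoding can be sketched as: count adjacent symbol pairs across the corpus, then fuse the most frequent pair everywhere it occurs. The toy word frequencies and function names below are illustrative assumptions, and real tokenizers add details such as end-of-word markers and tie-breaking rules.

```python
from collections import Counter

def most_frequent_pair(words):
    """words maps a tuple of symbols to its corpus frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Replace every occurrence of pair with a single fused symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Toy corpus: each word is split into characters and paired with a count.
words = {("l", "o", "w"): 5, ("l", "o", "t"): 2, ("n", "e", "w"): 3}
best = most_frequent_pair(words)
words_after = merge_pair(words, best)
```

Repeating this step a fixed number of times yields the merge table, and the resulting symbols form the subword vocabulary.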
Long context training requires careful memory management. Techniques such as gradient checkpointing, FlashAttention from Tri Dao and colleagues in 2022, and sequence parallelism allow Transformers to be trained on contexts of tens or hundreds of thousands of tokens. State space models avoid the quadratic memory cost of attention but introduce their own engineering challenges.
Sequence model evaluation depends on the task. Language models report perplexity, defined as the exponential of the average negative log likelihood per token. Translation quality is measured with BLEU, METEOR, chrF, and learned metrics such as COMET. Summarization uses ROUGE and human ratings. Speech recognition reports word error rate or character error rate. Time series forecasting uses mean absolute error, root mean squared error, and mean absolute percentage error, often with seasonal naive baselines.
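Perplexity follows directly from its definition. The sketch below takes the probabilities a model assigned to each observed token (hypothetical values here) and exponentiates the average negative log likelihood.

```python
import math

def perplexity(token_probs):
    """exp of the average negative log likelihood per token."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

uniform = perplexity([0.25, 0.25, 0.25, 0.25])  # uniform guess over 4 options
mixed = perplexity([0.5, 0.125])                # geometric mean probability is 0.25
```

Both examples give a perplexity of 4, which matches the intuition that perplexity is the effective number of equally likely choices per token.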
For long range modeling, benchmarks such as the Long Range Arena introduced by Yi Tay and colleagues in 2020 measure how well a model handles sequences of thousands of tokens. For instruction following and reasoning, benchmarks such as MMLU, GSM8K, BIG-Bench, and HELM exercise large language models on diverse tasks.
Choosing a sequence architecture depends on context length, latency budget, training data, and deployment target. Short sequences with strong supervised signal often work well with bidirectional Transformer encoders or LSTMs. Long autoregressive generation favors decoder-only Transformers, possibly augmented with state space layers for very long contexts. Streaming applications such as on-device speech recognition benefit from RNN transducers and chunked attention. Time series forecasting with limited data may still favor classical methods like Prophet and ARIMA over large foundation models.
Positional encoding choices matter for long contexts. Rotary positional embeddings, introduced by Jianlin Su and colleagues in 2021, and ALiBi, introduced by Ofir Press and colleagues in 2021, generalize better to longer sequences than the original sinusoidal scheme. Mixture of context strategies, retrieval augmented generation, and external memory modules let models reach beyond their nominal context window.
The following pages on this wiki cover related sequence model concepts. This index preserves the original list of links from this gateway page.