Machine learning terms/Sequence Models

Machine Learning Model Architecture

Updated Jul 16, 2026

See also: Machine learning terms

introduction

Sequence models are a class of machine learning systems designed to process inputs or produce outputs that have a meaningful order. The order of the elements in a sequence carries information that point estimators or independent classifiers cannot easily capture. Typical sequence inputs include natural language text, speech audio, video frames, biological sequences such as DNA and proteins, sensor readings over time, financial price history, log streams, and user clickstreams. Many of the most influential systems in modern artificial intelligence, including large language models, machine translation systems, speech recognizers, and time series forecasters, are built on sequence modeling foundations.

Formally, a sequence model defines a probability distribution or a function over sequences. A common formulation factorizes the joint distribution of a sequence $x_1, x_2, \ldots, x_T$ using the chain rule of probability so that $p(x_1, \ldots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_1, \ldots, x_{t-1})$, where the first factor is simply $p(x_1)$. This autoregressive factorization underlies N-gram language models, classical recurrent neural networks, and decoder-only Transformer models such as the GPT family. Other sequence formulations include conditional models $p(y \mid x)$ for tasks such as translation, sequence labeling models for part-of-speech tagging or named entity recognition, and bidirectional encoders such as BERT that capture context from both directions.

Over several decades the dominant architectures for sequence modeling have shifted from probabilistic graphical models such as hidden Markov models and conditional random fields, to recurrent neural networks such as the LSTM and GRU, to attention-based encoder-decoders, to the Transformer and its many variants. More recently, structured state space models such as Mamba have re-introduced linear-time recurrences that compete with attention on long contexts. This article surveys these families with references to seminal papers and links to related wiki entries.

why order matters

In many real tasks the meaning of an input depends on the order of its parts. For example, the sentences "the dog bit the man" and "the man bit the dog" have identical bag-of-words representations but very different meanings. Time series such as electrocardiogram signals, stock prices, and weather measurements only make sense as ordered samples. Speech audio is a one dimensional sequence of pressure values. Even tabular data with a timestep index, such as patient records collected over visits, becomes a sequence problem.

A sequence model must therefore represent dependencies among elements that may be near in time, far in time, or both. Architectures differ in how they model these dependencies, in how their compute and memory scale with sequence length, and in how easy they are to train with gradient descent.

taxonomy of sequence tasks

Sequence problems come in several shapes, often described in terms of the relationship between input and output sequences:

pattern	example tasks	typical models
one to many	image captioning	CNN encoder with RNN or Transformer decoder
many to one	sentiment classification, audio classification	RNN, Transformer encoder
many to many synced	part of speech tagging, frame level video labeling	bidirectional RNN, Transformer encoder
many to many unsynced	machine translation, summarization	encoder-decoder Transformer
autoregressive generation	language modeling, text generation	decoder-only Transformer, RNN
sequence to scalar over time	next click prediction, time series forecasting	RNN, Transformer, state space model

A related distinction is causal versus non-causal modeling. Causal models predict element t using only elements 1 through t-1, which is required for autoregressive generation. Non-causal models are free to look at the full sequence, which can improve representation learning when generation is not required.

classical sequence models

markov models and hidden markov models

The simplest non-trivial sequence model is the Markov chain, in which the next state depends only on the current state. A first order Markov model assumes $p(x_t \mid x_1, \ldots, x_{t-1}) = p(x_t \mid x_{t-1})$. Higher order Markov models extend the dependency to a fixed window of preceding states.
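As a minimal sketch of the first order assumption, the probability of a state sequence is the initial probability times one transition probability per step. The two-state weather chain and all of its numbers are hypothetical and chosen only for illustration.

```python
import math

# Hypothetical two-state chain; the probabilities are illustrative only.
initial = {"sunny": 0.6, "rainy": 0.4}
transition = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.5, "rainy": 0.5},
}

def sequence_log_prob(states):
    """Log-probability of a state sequence under a first order Markov chain."""
    log_p = math.log(initial[states[0]])
    # Each step depends only on the immediately preceding state.
    for prev, curr in zip(states, states[1:]):
        log_p += math.log(transition[prev][curr])
    return log_p

log_p = sequence_log_prob(["sunny", "sunny", "rainy"])
print(math.exp(log_p))
```

Working in log space avoids numerical underflow when the sequence is long.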

Hidden Markov models, popularized for speech recognition by Lawrence Rabiner's 1989 tutorial, model an unobserved Markov state that emits observable outputs.^[33] HMMs were the dominant approach to acoustic modeling in automatic speech recognition from the 1980s through the early 2010s, often combined with Gaussian mixture models for emission distributions. Standard HMM algorithms include the forward-backward algorithm for inference, the Viterbi algorithm for finding the most likely state sequence, and Baum-Welch for parameter estimation. HMMs remain useful for low resource sequence labeling and bioinformatics.
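The Viterbi algorithm mentioned above can be sketched with plain dictionaries. The hidden states, observation symbols, and probabilities below are hypothetical toy values, not taken from any cited source.

```python
import math

# Hypothetical toy HMM: hidden weather states emit a count of 1, 2, or 3.
states = ["hot", "cold"]
start = {"hot": 0.8, "cold": 0.2}
trans = {"hot": {"hot": 0.7, "cold": 0.3}, "cold": {"hot": 0.4, "cold": 0.6}}
emit = {"hot": {1: 0.2, 2: 0.4, 3: 0.4}, "cold": {1: 0.5, 2: 0.4, 3: 0.1}}

def viterbi(observations):
    """Return the most likely hidden state path and its log-probability."""
    # best[t][s]: max log-prob of any path that ends in state s at time t.
    best = [{s: math.log(start[s]) + math.log(emit[s][observations[0]])
             for s in states}]
    back = []  # back[t-1][s]: best predecessor of state s at time t
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[-1][p] + math.log(trans[p][s]))
            scores[s] = (best[-1][prev] + math.log(trans[prev][s])
                         + math.log(emit[s][obs]))
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[-1][last]

path, log_prob = viterbi([3, 1, 3])
print(path, math.exp(log_prob))
```

The forward algorithm has the same structure with the maximum replaced by a sum over predecessor states.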

conditional random fields

Conditional random fields, introduced by John Lafferty, Andrew McCallum, and Fernando Pereira in 2001, are discriminative undirected graphical models for sequence labeling.^[6] Linear chain CRFs directly model $p(y \mid x)$ and avoid the label bias problem that affects locally normalized models. CRFs were widely used for named entity recognition, part-of-speech tagging, and shallow parsing through the 2000s. They are still useful as the final layer of neural sequence taggers, where a bidirectional LSTM or Transformer produces emission scores and a CRF layer enforces transition consistency.

n-gram language models

N-gram language models estimate $p(w_t \mid w_{t-n+1}, \ldots, w_{t-1})$ by counting word sequences in a corpus. A bigram model conditions on the previous word, while a trigram model conditions on two preceding words. Smoothing methods such as Kneser-Ney and Good-Turing reduce the impact of unseen sequences. N-gram models powered statistical machine translation systems, query auto-completion, and many speech systems for decades. Although they have been largely replaced by neural language models, they remain valuable for fast scoring, low memory budgets, and as features inside larger systems.
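A bigram model can be sketched in a few lines of counting. This example uses add-alpha (Laplace) smoothing as a simpler stand-in for Kneser-Ney or Good-Turing; the tiny corpus and the `<s>` and `</s>` boundary markers are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical two-sentence corpus.
corpus = [
    ["the", "dog", "bit", "the", "man"],
    ["the", "man", "bit", "the", "dog"],
]

context_counts = Counter()  # how often each word appears as a left context
bigram_counts = Counter()   # how often each (previous, current) pair appears
for sentence in corpus:
    tokens = ["<s>"] + sentence + ["</s>"]
    context_counts.update(tokens[:-1])
    bigram_counts.update(zip(tokens, tokens[1:]))

# Words that can be predicted, including the end-of-sentence marker.
vocab = {word for sentence in corpus for word in sentence} | {"</s>"}

def bigram_prob(prev, word, alpha=1.0):
    """Add-alpha smoothed estimate of p(word | prev)."""
    numerator = bigram_counts[(prev, word)] + alpha
    denominator = context_counts[prev] + alpha * len(vocab)
    return numerator / denominator

print(bigram_prob("the", "dog"), bigram_prob("the", "bit"))
```

The unseen pair ("the", "bit") still receives nonzero probability, which is the purpose of smoothing.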

recurrent neural networks

vanilla rnn and the elman network

A recurrent neural network processes a sequence one element at a time while maintaining a hidden state that summarizes the history. At each timestep $t$, the network computes $h_t = \phi(W_h h_{t-1} + W_x x_t + b)$, where $\phi$ is a nonlinearity such as $\tanh$. Outputs may be produced at every step or only at the end.
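The recurrence can be written directly in NumPy. This is a minimal sketch with randomly initialized weights; the dimensions, the random seed, and the function name `rnn_forward` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim, T = 3, 4, 5  # hypothetical sizes

W_h = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))
W_x = rng.normal(scale=0.5, size=(hidden_dim, input_dim))
b = np.zeros(hidden_dim)

def rnn_forward(xs, h0):
    """Apply h_t = tanh(W_h h_{t-1} + W_x x_t + b) over a whole sequence."""
    h = h0
    hidden_states = []
    for x_t in xs:  # strictly sequential: step t needs the state from t-1
        h = np.tanh(W_h @ h + W_x @ x_t + b)
        hidden_states.append(h)
    return np.stack(hidden_states)

xs = rng.normal(size=(T, input_dim))
hs = rnn_forward(xs, np.zeros(hidden_dim))
print(hs.shape)
```

The last row of `hs` serves as a summary for many-to-one tasks, while all rows are used for per-step outputs.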

The Elman network, described by Jeffrey Elman in 1990 in "Finding Structure in Time," is the canonical simple RNN.^[1] Michael Jordan had proposed a related architecture in 1986 that fed the previous output, rather than the previous hidden state, back into the network. Vanilla RNNs are conceptually elegant but suffer from severe optimization difficulties on long sequences.

vanishing and exploding gradients

Training an RNN by backpropagation through time requires multiplying many Jacobians, one per timestep. In the simplified linear analysis, when the spectral radius of the recurrent weight matrix is less than one the gradient norm shrinks geometrically with the number of timesteps, producing the vanishing gradient problem analyzed by Sepp Hochreiter in his 1991 diploma thesis and by Yoshua Bengio, Patrice Simard, and Paolo Frasconi in 1994.^[2]^[3] When the spectral radius exceeds one, gradients can grow geometrically instead, which is the exploding gradient problem. Vanishing gradients prevent learning long range dependencies, while exploding gradients destabilize training.
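The geometric behavior is easiest to see in a scalar linear recurrence $h_t = w h_{t-1}$, where the gradient of $h_T$ with respect to $h_0$ is $w^T$. The sketch below uses that simplified setting; the weights 0.9 and 1.1 and the 100-step horizon are arbitrary illustrative choices.

```python
def gradient_through_time(w, steps):
    """d h_T / d h_0 for the scalar linear recurrence h_t = w * h_{t-1}."""
    grad = 1.0
    for _ in range(steps):
        grad *= w  # each timestep contributes one Jacobian factor
    return grad

vanishing = gradient_through_time(0.9, 100)  # |w| < 1: shrinks toward zero
exploding = gradient_through_time(1.1, 100)  # |w| > 1: grows rapidly
print(vanishing, exploding)
```

In a real RNN the factors are matrices multiplied by the derivative of the nonlinearity, but the same compounding effect applies.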

A standard remedy for explosion is gradient clipping, proposed by Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio in 2013.^[9] Gradient clipping rescales the gradient vector whenever its norm exceeds a threshold, which keeps updates bounded without changing their direction. The cure for vanishing gradients was architectural: gated RNNs.
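Norm-based clipping can be sketched as follows. The function name `clip_by_global_norm` and the small epsilon guard are assumptions for this illustration, not a specific library's API.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their joint L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    # Scale is 1 when the norm is already within bounds, so small gradients pass through.
    scale = min(1.0, max_norm / (total_norm + 1e-12))
    return [g * scale for g in grads], total_norm

clipped, norm = clip_by_global_norm([np.array([3.0, 4.0])], max_norm=1.0)
print(norm, clipped[0])
```

Because every component is multiplied by the same scalar, the update direction is preserved and only its length changes.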

long short-term memory

The Long Short-Term Memory (LSTM), introduced by Sepp Hochreiter and Juergen Schmidhuber in 1997, replaced the simple recurrence with a memory cell controlled by gates.^[4] The original LSTM had an input gate and an output gate; the forget gate that is now standard was added by Felix Gers, Juergen Schmidhuber, and Fred Cummins in 2000.^[7] The cell state $c_t$ is updated additively, $c_t = f_t \odot c_{t-1} + i_t \odot g_t$, where $f_t$ and $i_t$ are the forget and input gate activations, $g_t$ is the candidate update, and $\odot$ denotes elementwise multiplication. This additive path lets gradients flow through many steps with far less attenuation than in a vanilla RNN.
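A single LSTM step with input, forget, and output gates can be sketched as below. Stacking the four gate blocks in the order input, forget, output, candidate inside one weight matrix is a layout assumption for this example; implementations differ on the ordering.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step.

    W: (4H, D) input weights, U: (4H, H) recurrent weights, b: (4H,) bias,
    with rows stacked as [input gate, forget gate, output gate, candidate].
    """
    H = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b
    i_t = sigmoid(z[0:H])          # input gate
    f_t = sigmoid(z[H:2 * H])      # forget gate
    o_t = sigmoid(z[2 * H:3 * H])  # output gate
    g_t = np.tanh(z[3 * H:4 * H])  # candidate cell update
    c_t = f_t * c_prev + i_t * g_t  # additive cell state update
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

H, D = 3, 2  # hypothetical hidden and input sizes
rng = np.random.default_rng(0)
h_t, c_t = lstm_step(
    rng.normal(size=D), np.zeros(H), np.zeros(H),
    rng.normal(size=(4 * H, D)), rng.normal(size=(4 * H, H)), np.zeros(4 * H),
)
print(h_t.shape, c_t.shape)
```

When the forget gate saturates near one and the input gate near zero, the cell state is carried forward almost unchanged, which is what preserves long range information.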

LSTMs dominated sequence modeling from roughly 2014 to 2018. They powered Google Translate after its 2016 neural rewrite, large speech recognition systems at major laboratories, and many text classification and labeling pipelines. The shorthand LSTM is universally understood inside the field.

gated recurrent unit

The gated recurrent unit, or GRU, was introduced by Kyunghyun Cho and colleagues in 2014.^[10] The GRU merges the forget and input gates into a single update gate and has no separate cell state, which yields fewer parameters than an LSTM. Empirical comparisons by Junyoung Chung and colleagues found GRUs and LSTMs to be roughly comparable on many tasks, with the GRU slightly faster and the LSTM occasionally more expressive on very long sequences.^[13]
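A GRU step can be sketched in the same style. The stacked weight layout (update, reset, candidate) is an assumption of this example, and the convention used here, in which an update gate near one selects the new candidate, is one of two conventions found in the literature; some papers swap the roles of $z_t$ and $1 - z_t$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W, U, b):
    """One GRU step. W: (3H, D), U: (3H, H), b: (3H,), stacked as [update, reset, candidate]."""
    H = h_prev.shape[0]
    W_z, W_r, W_n = W[:H], W[H:2 * H], W[2 * H:]
    U_z, U_r, U_n = U[:H], U[H:2 * H], U[2 * H:]
    b_z, b_r, b_n = b[:H], b[H:2 * H], b[2 * H:]
    z_t = sigmoid(W_z @ x_t + U_z @ h_prev + b_z)  # update gate
    r_t = sigmoid(W_r @ x_t + U_r @ h_prev + b_r)  # reset gate
    n_t = np.tanh(W_n @ x_t + U_n @ (r_t * h_prev) + b_n)  # candidate state
    # Interpolate between the old state and the candidate; there is no separate cell state.
    return (1.0 - z_t) * h_prev + z_t * n_t

H, D = 3, 2  # hypothetical sizes
rng = np.random.default_rng(0)
h_next = gru_step(rng.normal(size=D), rng.normal(size=H),
                  rng.normal(size=(3 * H, D)), rng.normal(size=(3 * H, H)), np.zeros(3 * H))
print(h_next.shape)
```

With three gate blocks instead of four, the GRU needs roughly three quarters of the recurrent parameters of an LSTM of the same hidden size.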

bidirectional and deep rnns

A bidirectional RNN, introduced by Mike Schuster and Kuldip Paliwal in 1997, runs one RNN forward and another backward over the input, then concatenates their hidden states.^[5] This gives every position access to both past and future context and is well suited to non-generative tasks such as labeling and reading comprehension. Stacking multiple recurrent layers yields deep RNNs, which can capture hierarchical structure but increase optimization difficulty.

encoder-decoder and attention

the seq2seq framework

The encoder-decoder, or seq2seq, framework was proposed in two near-simultaneous papers in 2014: Ilya Sutskever, Oriol Vinyals, and Quoc Le's "Sequence to Sequence Learning with Neural Networks"^[11] and Kyunghyun Cho and colleagues' "Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation."^[10] An encoder RNN reads the input sequence into a fixed length vector, and a decoder RNN generates the output sequence conditioned on that vector. This single architecture handles translation, summarization, and many other text to text tasks.

attention mechanism

A fixed length vector becomes a bottleneck on long inputs. Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio proposed attention in their 2014 paper "Neural Machine Translation by Jointly Learning to Align and Translate."^[12] Their attention layer lets the decoder, at each output step, compute a weighted average of all encoder hidden states, with weights produced by a learned alignment model. Minh-Thang Luong, Hieu Pham, and Christopher Manning later proposed simpler dot-product variants in 2015.^[14] Attention dramatically improved translation quality and provided interpretable alignments between source and target tokens.
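The weighted-average idea can be sketched with the simpler dot-product scoring associated with Luong and colleagues, rather than Bahdanau's learned additive alignment model. The function names and toy vectors are hypothetical.

```python
import numpy as np

def softmax(scores):
    shifted = scores - np.max(scores)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

def dot_product_attention(decoder_state, encoder_states):
    """Return a context vector and attention weights for one decoder step.

    decoder_state: (H,), encoder_states: (T, H).
    """
    scores = encoder_states @ decoder_state  # one alignment score per source position
    weights = softmax(scores)                # non-negative, sums to one
    context = weights @ encoder_states       # weighted average of encoder states
    return context, weights

encoder_states = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
context, weights = dot_product_attention(np.array([10.0, 0.0]), encoder_states)
print(weights, context)
```

The `weights` vector is what gets visualized as a soft alignment between a target token and the source tokens.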

the transformer

architecture

In 2017 Ashish Vaswani and colleagues at Google published "Attention Is All You Need," introducing the Transformer.^[16] The Transformer removed recurrence entirely and replaced it with multi-head self-attention and feed-forward layers. Self-attention computes pairwise interactions between all tokens, with queries, keys, and values produced by linear projections of the input. Positional information is restored through positional encodings, originally a sinusoidal scheme.
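A single attention head can be sketched in NumPy as scaled dot-product attention over linear projections of the input. This omits multiple heads, positional encodings, residual connections, and the feed-forward sublayer. The optional causal mask, sizes, and random weights are assumptions for illustration.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v, causal=False):
    """Single-head scaled dot-product self-attention. X: (T, D)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # (T, T) pairwise interactions
    if causal:
        T = X.shape[0]
        # Position t may not attend to positions after t.
        future = np.triu(np.ones((T, T), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
T, D = 4, 8  # hypothetical sequence length and model width
X = rng.normal(size=(T, D))
W_q, W_k, W_v = (rng.normal(size=(D, D)) for _ in range(3))
out, weights = self_attention(X, W_q, W_k, W_v, causal=True)
print(out.shape)
```

All positions are computed with a few matrix products and no loop over time, at the price of a score matrix whose size grows with the square of the sequence length.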

The original paper presented an encoder-decoder Transformer for machine translation, but the same building blocks support encoder-only models, decoder-only models, and many hybrids. Self-attention is parallelizable across sequence positions, which made Transformers far easier to scale on modern accelerators than RNNs.

encoder-only, decoder-only, and encoder-decoder variants

Three main Transformer families have emerged. Encoder-only models such as BERT, introduced by Jacob Devlin and colleagues in 2018, use bidirectional self-attention and are trained with masked language modeling.^[18] They excel at understanding tasks such as classification and question answering. Decoder-only models such as the GPT series from OpenAI, beginning with Alec Radford and colleagues' 2018 GPT-1, use causal self-attention and are trained with next token prediction.^[19] They excel at generation and have become the dominant architecture for large language models. Encoder-decoder models such as T5, introduced by Colin Raffel and colleagues in 2019, frame all tasks as text to text problems and are widely used for translation, summarization, and instruction following.^[20]

scaling and pretraining

Scaling laws established by Jared Kaplan and colleagues in 2020 and refined by the Chinchilla paper from Jordan Hoffmann and colleagues in 2022 showed that model loss decreases predictably with parameters, data, and compute.^[21]^[25] This empirical regularity drove the rise of frontier models such as GPT-3, GPT-4, Claude, Gemini, and Llama, which can be viewed as very large autoregressive sequence models. Variants such as Mixture of Experts, used in Switch Transformer and Mixtral, scale parameter counts while keeping per-token compute roughly constant.

state space models and linear sequence models

Although Transformers are powerful, their attention layer has cost that scales quadratically in sequence length, which limits very long contexts. A line of research on structured state space models revives linear time recurrences with strong long range performance.
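The underlying idea can be sketched with a discrete, time-invariant linear state space model, $h_t = A h_{t-1} + B x_t$ and $y_t = C h_t$. This is a simplified illustration and not the parameterization of any specific published architecture. The matrices and sizes are hypothetical, and the naive convolution is included only to show that it computes the same outputs as the recurrence.

```python
import numpy as np

def ssm_recurrent(A, B, C, xs):
    """Run h_t = A h_{t-1} + B x_t, y_t = C h_t in time linear in len(xs)."""
    h = np.zeros(A.shape[0])
    ys = []
    for x_t in xs:
        h = A @ h + B * x_t
        ys.append(C @ h)
    return np.array(ys)

def ssm_convolutional(A, B, C, xs):
    """Same outputs via the kernel K_k = C A^k B (naive, for checking only)."""
    T = len(xs)
    kernel = np.array([C @ np.linalg.matrix_power(A, k) @ B for k in range(T)])
    return np.array([sum(kernel[k] * xs[t - k] for k in range(t + 1))
                     for t in range(T)])

rng = np.random.default_rng(0)
A = np.array([[0.9, 0.1], [0.0, 0.8]])  # hypothetical stable state matrix
B = rng.normal(size=2)
C = rng.normal(size=2)
xs = rng.normal(size=16)
ys = ssm_recurrent(A, B, C, xs)
print(ys.shape)
```

The recurrent form suits step-by-step inference with constant memory per step, while the convolutional view allows parallel training when the parameters do not depend on the input.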

Albert Gu and Tri Dao introduced Mamba in late 2023 in "Mamba: Linear-Time Sequence Modeling with Selective State Spaces."^[29] Mamba uses input-dependent state space parameters, which make the recurrence selective in a sense analogous to gating. Mamba-2, presented by Tri Dao and Albert Gu in 2024 in "Transformers are SSMs," connects state space models to a class of structured attention and improves training efficiency.^[30] The Hyena architecture, by Michael Poli and colleagues in 2023, replaces attention with implicit long convolutions and a gated structure.^[28]

Hybrid models combine attention with state space layers. AI21 Labs released Jamba in 2024, a production hybrid that interleaves Mamba blocks, Transformer blocks, and Mixture of Experts. Hybrids aim to combine the modeling strength of attention on local patterns with the linear scaling of state space recurrences on long contexts.

time series forecasting models

Sequence modeling for numerical time series has a partially separate lineage from natural language modeling. Classical methods include autoregressive integrated moving average (ARIMA) models, exponential smoothing, and state space approaches such as the Kalman filter.

Prophet, released in 2017 by Sean Taylor and Benjamin Letham at Facebook, is a structural time series tool that decomposes a series into trend, seasonality, and holiday effects.^[17] It is widely used for business forecasting because it is robust to missing data and easy to tune.

More recently, foundation models for time series have emerged. TimeGPT was released by Nixtla in 2023 as a closed source forecasting API trained on a large corpus of series. Chronos, from Amazon researchers Abdul Fatir Ansari and colleagues in 2024, tokenizes time series values and trains a Transformer language model on them.^[31] Lag-Llama, by Kashif Rasul and colleagues in 2023, is an open foundation model that conditions on lagged values. Moirai, from Salesforce researchers Gerald Woo and colleagues in 2024, is a masked encoder Transformer trained on a large multivariate corpus.^[32] PatchTST and N-BEATS represent earlier strong neural baselines.

speech and audio sequence models

connectionist temporal classification

Connectionist Temporal Classification (CTC), introduced by Alex Graves and colleagues in 2006, allows a sequence model to be trained on input-output pairs of different lengths without explicit alignment.^[8] CTC introduces a blank symbol and sums over all valid alignments, where an alignment maps to an output by merging repeated symbols and then deleting blanks. CTC remains a core building block for speech recognition, lip reading, and other tasks where the output is no longer than the input.
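The mapping from a frame-level alignment to an output string can be sketched as below: merge consecutive repeats, then drop blanks. Using `_` as the blank symbol is an assumption of this example. The sketch covers only this collapse step, not the CTC loss, which sums probabilities over every alignment that collapses to the target.

```python
from itertools import groupby

BLANK = "_"  # hypothetical choice of blank symbol

def ctc_collapse(alignment):
    """Map a frame-level alignment to its output string."""
    merged = [symbol for symbol, _ in groupby(alignment)]  # merge repeated symbols
    return "".join(symbol for symbol in merged if symbol != BLANK)  # drop blanks

print(ctc_collapse("hh_e_ll_llo"))
```

A blank between two identical symbols is what lets the output contain a genuine double letter, as in the two runs of "l" above.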

rnn transducer

The RNN transducer, also from Alex Graves in 2012, extends CTC with a separate prediction network that conditions on previous outputs. RNN-T is widely used in production speech recognizers because it streams naturally and supports robust on-device inference.

whisper and modern speech models

Whisper, released in 2022 by Alec Radford and colleagues at OpenAI, is an encoder-decoder Transformer trained on hundreds of thousands of hours of weakly supervised audio.^[27] It performs multilingual speech recognition, translation, and language identification in a single model. Other notable speech sequence models include Conformer (Anmol Gulati and colleagues, 2020), which combines convolution with self-attention, and wav2vec 2.0 (Alexei Baevski and colleagues, 2020) and HuBERT (Wei-Ning Hsu and colleagues, 2021), which use self-supervised learning on raw audio.

training sequence models

Sequence models are typically trained with stochastic gradient descent and variants such as Adam. Several practical concerns deserve attention.

Teacher forcing feeds the ground truth previous token to the decoder during training, which speeds up convergence but can mismatch the autoregressive setting at inference. Scheduled sampling, proposed by Samy Bengio and colleagues in 2015, mixes ground truth with predictions during training to reduce this exposure bias.

Tokenization choices, such as byte pair encoding by Rico Sennrich and colleagues in 2016^[15] or SentencePiece by Taku Kudo and John Richardson in 2018, determine the alphabet over which a language model operates. Subword tokenization handles rare words gracefully and is now standard for text models.
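One merge step of byte pair encoding can be sketched as follows: count adjacent symbol pairs weighted by word frequency, then fuse the most frequent pair everywhere. The toy vocabulary and its counts are hypothetical, and the sketch omits end-of-word markers and the loop that repeats the merge until a target vocabulary size is reached.

```python
from collections import Counter

def most_frequent_pair(word_freqs):
    """Return the adjacent symbol pair with the highest frequency-weighted count."""
    pairs = Counter()
    for symbols, freq in word_freqs.items():
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return max(pairs, key=pairs.get)

def merge_pair(word_freqs, pair):
    """Replace every occurrence of `pair` with a single fused symbol."""
    merged = {}
    for symbols, freq in word_freqs.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Hypothetical word counts, with each word split into characters.
word_freqs = {
    ("l", "o", "w"): 5,
    ("l", "o", "w", "e", "r"): 2,
    ("n", "e", "w", "e", "s", "t"): 6,
    ("w", "i", "d", "e", "s", "t"): 3,
}
pair = most_frequent_pair(word_freqs)
word_freqs_after = merge_pair(word_freqs, pair)
print(pair, word_freqs_after)
```

Repeating these two steps builds up a vocabulary of progressively longer subword units, so frequent words become single tokens while rare words remain decomposable.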

Long context training requires careful memory management. Techniques such as gradient checkpointing, FlashAttention from Tri Dao and colleagues in 2022,^[26] and sequence parallelism allow Transformers to be trained on contexts of tens or hundreds of thousands of tokens. State space models avoid the quadratic memory cost of attention but introduce their own engineering challenges.

evaluating sequence models

Sequence model evaluation depends on the task. Language models report perplexity, defined as the exponential of the average negative log likelihood per token. Translation quality is measured with BLEU, METEOR, chrF, and learned metrics such as COMET. Summarization uses ROUGE and human ratings. Speech recognition reports word error rate or character error rate. Time series forecasting uses mean absolute error, root mean squared error, and mean absolute percentage error, often with seasonal naive baselines.
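Two of these metrics are short enough to sketch directly. Perplexity follows the definition above. Word error rate is computed here as word-level edit distance divided by reference length, with whitespace tokenization as a simplifying assumption. The function names are hypothetical.

```python
import math

def perplexity(token_probs):
    """exp of the average negative log likelihood per token."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

def word_error_rate(reference, hypothesis):
    """(substitutions + insertions + deletions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # edit distances against an empty reference
    for i, ref_word in enumerate(ref, 1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, 1):
            curr.append(min(
                prev[j] + 1,                           # deletion
                curr[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)

ppl = perplexity([0.25, 0.25, 0.25, 0.25])
wer = word_error_rate("the dog bit the man", "the dog bit a man")
print(ppl, wer)
```

A model that assigns probability 1/k to every token has perplexity k, which is why perplexity is often read as an effective branching factor.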

For long range modeling, benchmarks such as the Long Range Arena introduced by Yi Tay and colleagues in 2020 measure how well a model handles sequences of thousands of tokens.^[22] For instruction following and reasoning, benchmarks such as MMLU, GSM8K, BIG-Bench, and HELM exercise large language models on diverse tasks.

practical considerations

Choosing a sequence architecture depends on context length, latency budget, training data, and deployment target. Short sequences with strong supervised signal often work well with bidirectional Transformer encoders or LSTMs. Long autoregressive generation favors decoder-only Transformers, possibly augmented with state space layers for very long contexts. Streaming applications such as on-device speech recognition benefit from RNN transducers and chunked attention. Time series forecasting with limited data may still favor classical methods like Prophet and ARIMA over large foundation models.

Positional encoding choices matter for long contexts. Rotary positional embeddings, introduced by Jianlin Su and colleagues in 2021,^[23] and ALiBi, introduced by Ofir Press and colleagues in 2021,^[24] generalize better to longer sequences than the original sinusoidal scheme. Mixture of context strategies, retrieval augmented generation, and external memory modules let models reach beyond their nominal context window.
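The core of rotary positional embeddings can be sketched as a position-dependent rotation of adjacent feature pairs. This is a minimal illustration for a single vector with an even dimension; the function name `rope` and the base of 10000 are assumptions of the sketch.

```python
import numpy as np

def rope(x, position, base=10000.0):
    """Rotate adjacent pairs (x[2i], x[2i+1]) by the angle position * theta_i."""
    d = x.shape[-1]  # must be even
    freqs = base ** (-np.arange(d // 2) * 2.0 / d)  # theta_i, one per pair
    angles = position * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin
    out[1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)
score_near = rope(q, 5) @ rope(k, 3)    # positions 5 and 3, offset 2
score_far = rope(q, 12) @ rope(k, 10)   # positions 12 and 10, offset 2
print(score_near, score_far)
```

The two scores agree up to floating point error because the dot product of rotated queries and keys depends only on the offset between positions, which is the property that makes the scheme a relative position encoding.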

index of sequence model wiki pages

The following pages on this wiki cover related sequence model concepts. This index preserves the original list of links from this gateway page.

references

1. Elman, Jeffrey L. (1990). "Finding Structure in Time." Cognitive Science 14(2): 179-211.
2. Hochreiter, Sepp (1991). "Untersuchungen zu dynamischen neuronalen Netzen." Diploma thesis, Technische Universitaet Muenchen.
3. Bengio, Yoshua, Patrice Simard, and Paolo Frasconi (1994). "Learning Long-Term Dependencies with Gradient Descent is Difficult." IEEE Transactions on Neural Networks 5(2): 157-166.
4. Hochreiter, Sepp, and Juergen Schmidhuber (1997). "Long Short-Term Memory." Neural Computation 9(8): 1735-1780.
5. Schuster, Mike, and Kuldip K. Paliwal (1997). "Bidirectional Recurrent Neural Networks." IEEE Transactions on Signal Processing 45(11): 2673-2681.
6. Lafferty, John, Andrew McCallum, and Fernando Pereira (2001). "Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data." ICML.
7. Gers, Felix A., Juergen Schmidhuber, and Fred Cummins (2000). "Learning to Forget: Continual Prediction with LSTM." Neural Computation 12(10): 2451-2471.
8. Graves, Alex, Santiago Fernandez, Faustino Gomez, and Juergen Schmidhuber (2006). "Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks." ICML.
9. Pascanu, Razvan, Tomas Mikolov, and Yoshua Bengio (2013). "On the Difficulty of Training Recurrent Neural Networks." ICML.
10. Cho, Kyunghyun, et al. (2014). "Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation." EMNLP.
11. Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le (2014). "Sequence to Sequence Learning with Neural Networks." NeurIPS.
12. Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio (2014). "Neural Machine Translation by Jointly Learning to Align and Translate." arXiv:1409.0473.
13. Chung, Junyoung, et al. (2014). "Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling." arXiv:1412.3555.
14. Luong, Minh-Thang, Hieu Pham, and Christopher D. Manning (2015). "Effective Approaches to Attention-based Neural Machine Translation." EMNLP.
15. Sennrich, Rico, Barry Haddow, and Alexandra Birch (2016). "Neural Machine Translation of Rare Words with Subword Units." ACL.
16. Vaswani, Ashish, et al. (2017). "Attention Is All You Need." NeurIPS.
17. Taylor, Sean J., and Benjamin Letham (2017). "Forecasting at Scale." American Statistician 72(1): 37-45.
18. Devlin, Jacob, et al. (2018). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." arXiv:1810.04805.
19. Radford, Alec, et al. (2018). "Improving Language Understanding by Generative Pre-Training." OpenAI technical report.
20. Raffel, Colin, et al. (2019). "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer." arXiv:1910.10683.
21. Kaplan, Jared, et al. (2020). "Scaling Laws for Neural Language Models." arXiv:2001.08361.
22. Tay, Yi, et al. (2020). "Long Range Arena: A Benchmark for Efficient Transformers." arXiv:2011.04006.
23. Su, Jianlin, et al. (2021). "RoFormer: Enhanced Transformer with Rotary Position Embedding." arXiv:2104.09864.
24. Press, Ofir, Noah A. Smith, and Mike Lewis (2021). "Train Short, Test Long: Attention with Linear Biases." arXiv:2108.12409.
25. Hoffmann, Jordan, et al. (2022). "Training Compute-Optimal Large Language Models." arXiv:2203.15556.
26. Dao, Tri, et al. (2022). "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness." NeurIPS.
27. Radford, Alec, et al. (2022). "Robust Speech Recognition via Large-Scale Weak Supervision." arXiv:2212.04356.
28. Poli, Michael, et al. (2023). "Hyena Hierarchy: Towards Larger Convolutional Language Models." ICML.
29. Gu, Albert, and Tri Dao (2023). "Mamba: Linear-Time Sequence Modeling with Selective State Spaces." arXiv:2312.00752.
30. Dao, Tri, and Albert Gu (2024). "Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality." ICML.
31. Ansari, Abdul Fatir, et al. (2024). "Chronos: Learning the Language of Time Series." arXiv:2403.07815.
32. Woo, Gerald, et al. (2024). "Unified Training of Universal Time Series Forecasting Transformers." ICML.
33. Rabiner, Lawrence R. (1989). "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition." Proceedings of the IEEE 77(2): 257-286.
