# Machine learning terms/Sequence Models

> Source: https://aiwiki.ai/wiki/machine_learning_terms_sequence_models
> Updated: 2026-05-09
> Categories: Machine Learning, Model Architecture
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

*See also: [Machine learning terms](/wiki/machine_learning_terms)*

## introduction

Sequence models are a class of [machine learning](/wiki/machine_learning) systems designed to process inputs or produce outputs that have a meaningful order. The order of the elements in a sequence carries information that point estimators or independent classifiers cannot easily capture. Typical sequence inputs include natural language text, speech audio, video frames, biological sequences such as DNA and proteins, sensor readings over time, financial price history, log streams, and user clickstreams. Many of the most influential systems in modern [artificial intelligence](/wiki/artificial_intelligence), including [large language models](/wiki/large_language_model), [machine translation](/wiki/machine_translation) systems, speech recognizers, and time series forecasters, are built on sequence modeling foundations.

Formally, a sequence model defines a probability distribution or a function over sequences. A common formulation factorizes the joint distribution of a sequence x_1, x_2, ..., x_T using the chain rule of probability, so that p(x_1, ..., x_T) = ∏_{t=1}^{T} p(x_t | x_1, ..., x_{t-1}). This autoregressive factorization underlies [N-gram](/wiki/n-gram) language models, classical [recurrent neural networks](/wiki/recurrent_neural_network), and decoder-only [Transformer](/wiki/transformer) models such as the [GPT](/wiki/gpt) family. Other sequence formulations include conditional models p(y | x) for tasks such as translation, sequence labeling models for [part-of-speech tagging](/wiki/part_of_speech_tagging) or [named entity recognition](/wiki/named_entity_recognition), and bidirectional encoders such as [BERT](/wiki/bert) that capture context from both directions.

Over several decades the dominant architectures for sequence modeling have shifted from probabilistic graphical models such as [hidden Markov models](/wiki/hidden_markov_model) and [conditional random fields](/wiki/conditional_random_field), to recurrent neural networks such as the [LSTM](/wiki/long_short-term_memory_lstm) and [GRU](/wiki/gated_recurrent_unit), to attention based encoder-decoders, to the [Transformer](/wiki/transformer) and its many variants. More recently, structured [state space models](/wiki/state_space_model) such as [Mamba](/wiki/mamba) have re-introduced linear time recurrences that compete with attention on long contexts. This article surveys these families with references to seminal papers and links to related wiki entries.

## why order matters

In many real tasks the meaning of an input depends on the order of its parts. For example, the sentences "the dog bit the man" and "the man bit the dog" have identical bag-of-words representations but very different meanings. Time series such as electrocardiogram signals, stock prices, and weather measurements only make sense as ordered samples. Speech audio is a one dimensional sequence of pressure values. Even tabular data with a [timestep](/wiki/timestep) index, such as patient records collected over visits, becomes a sequence problem.

A sequence model must therefore represent dependencies among elements that may be near in time, far in time, or both. Architectures differ in how they model these dependencies, in how their compute and memory scale with sequence length, and in how easy they are to train with [gradient descent](/wiki/gradient_descent).

## taxonomy of sequence tasks

Sequence problems come in several shapes, often described in terms of the relationship between input and output sequences:

| pattern | example tasks | typical models |
|---|---|---|
| one to many | image captioning | CNN encoder with RNN or Transformer decoder |
| many to one | sentiment classification, audio classification | RNN, Transformer encoder |
| many to many synced | part of speech tagging, frame level video labeling | bidirectional RNN, Transformer encoder |
| many to many unsynced | machine translation, summarization | encoder-decoder Transformer |
| autoregressive generation | language modeling, text generation | decoder-only Transformer, RNN |
| sequence to scalar over time | next click prediction, time series forecasting | RNN, Transformer, state space model |

A related distinction is causal versus non-causal modeling. Causal models predict element t using only elements 1 through t-1, which is required for autoregressive generation. Non-causal models are free to look at the full sequence, which can improve representation learning when generation is not required.
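In attention-based models the causal constraint is commonly enforced with a mask. The following is a minimal sketch in plain Python; the function name `causal_mask` is an illustrative choice, not a library API. Row t marks the positions that are visible when computing the representation used to predict the next element.

```python
def causal_mask(length):
    """Return a length x length matrix; mask[t][s] is True if position t may look at position s."""
    return [[s <= t for s in range(length)] for t in range(length)]

# Print the mask for a sequence of four positions ("x" = visible, "." = hidden).
for row in causal_mask(4):
    print("".join("x" if allowed else "." for allowed in row))
# x...
# xx..
# xxx.
# xxxx
```

A non-causal model would simply use a mask that is True everywhere.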

## classical sequence models

### markov models and hidden markov models

The simplest non-trivial sequence model is the Markov chain, in which the next state depends only on the current state. A first order Markov model assumes p(x_t | x_1, ..., x_{t-1}) = p(x_t | x_{t-1}). Higher order Markov models extend the dependency to a fixed window.
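As a small illustration of the first order assumption, the sketch below scores a state sequence by multiplying one initial probability and one transition probability per step. The two-state weather chain and all of its numbers are hypothetical.

```python
import math

# Hypothetical two-state chain; the probabilities are illustrative only.
initial = {"sunny": 0.6, "rainy": 0.4}
transition = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.5, "rainy": 0.5},
}

def sequence_log_prob(states):
    """Log-probability of a state sequence under a first order Markov chain."""
    log_p = math.log(initial[states[0]])
    for prev, cur in zip(states, states[1:]):
        # Each step depends only on the immediately preceding state.
        log_p += math.log(transition[prev][cur])
    return log_p

p = math.exp(sequence_log_prob(["sunny", "sunny", "rainy"]))  # 0.6 * 0.8 * 0.2
```

Working in log space avoids numerical underflow on long sequences.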

Hidden Markov models, popularized for speech recognition by Lawrence Rabiner's 1989 tutorial, model an unobserved Markov state that emits observable outputs. HMMs were the dominant approach to acoustic modeling in [automatic speech recognition](/wiki/automatic_speech_recognition) from the 1980s through the early 2010s, often combined with Gaussian mixture models for emission distributions. Standard HMM algorithms include the forward-backward algorithm for inference, the [Viterbi algorithm](/wiki/viterbi_algorithm) for finding the most likely state sequence, and Baum-Welch for parameter estimation. HMMs remain useful for low resource sequence labeling and bioinformatics.
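The Viterbi recursion can be sketched in a few lines of plain Python. The toy model below (two hidden health states, three observable symptoms) and its probabilities are assumptions chosen for illustration; a practical implementation would work in log space to avoid underflow.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (best_path, probability) of the most likely hidden state sequence."""
    # best[t][s]: probability of the best path that ends in state s at step t.
    best = [{s: start_p[s] * emit_p[s][observations[0]] for s in states}]
    backpointer = [{}]
    for t in range(1, len(observations)):
        best.append({})
        backpointer.append({})
        for s in states:
            prev = max(states, key=lambda r: best[t - 1][r] * trans_p[r][s])
            best[t][s] = best[t - 1][prev] * trans_p[prev][s] * emit_p[s][observations[t]]
            backpointer[t][s] = prev
    # Trace back from the most probable final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(backpointer[t][path[-1]])
    path.reverse()
    return path, best[-1][last]

# Hypothetical toy HMM.
states = ["Healthy", "Fever"]
start_p = {"Healthy": 0.6, "Fever": 0.4}
trans_p = {"Healthy": {"Healthy": 0.7, "Fever": 0.3},
           "Fever": {"Healthy": 0.4, "Fever": 0.6}}
emit_p = {"Healthy": {"normal": 0.5, "cold": 0.4, "dizzy": 0.1},
          "Fever": {"normal": 0.1, "cold": 0.3, "dizzy": 0.6}}

path, prob = viterbi(["normal", "cold", "dizzy"], states, start_p, trans_p, emit_p)
```

For this toy model the most likely explanation of the three observations is Healthy, Healthy, Fever.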

### conditional random fields

[Conditional random fields](/wiki/conditional_random_field), introduced by John Lafferty, Andrew McCallum, and Fernando Pereira in 2001, are discriminative undirected graphical models for sequence labeling. Linear chain CRFs directly model p(y | x) and avoid the label bias problem that affects locally normalized models. CRFs were widely used for [named entity recognition](/wiki/named_entity_recognition), [part-of-speech tagging](/wiki/part_of_speech_tagging), and shallow parsing through the 2000s. They are still useful as the final layer of neural sequence taggers, where a bidirectional LSTM or Transformer produces emission scores and a CRF layer enforces transition consistency.

### n-gram language models

[N-gram](/wiki/n-gram) language models estimate p(w_t | w_{t-n+1}, ..., w_{t-1}) by counting word sequences in a corpus. A [bigram](/wiki/bigram) model conditions on the previous word, while a [trigram](/wiki/trigram) model conditions on two preceding words. Smoothing methods such as Kneser-Ney and Good-Turing reduce the impact of unseen sequences. N-gram models powered statistical machine translation systems, query auto-completion, and many speech systems for decades. Although they have been largely replaced by neural language models, they remain valuable for fast scoring, low memory budgets, and as features inside larger systems.
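A count-based bigram model can be sketched as follows. For brevity the example uses add-alpha (Laplace) smoothing rather than Kneser-Ney or Good-Turing, and the tiny corpus is made up for illustration.

```python
from collections import Counter

# Toy corpus; a real model would be estimated from millions of tokens.
corpus = "the dog bit the man . the man bit the dog .".split()
vocab = sorted(set(corpus))

context_counts = Counter(corpus[:-1])             # how often each word appears as a context
bigram_counts = Counter(zip(corpus, corpus[1:]))  # how often each adjacent pair appears

def bigram_prob(prev, word, alpha=1.0):
    """p(word | prev) with add-alpha smoothing over the toy vocabulary."""
    return (bigram_counts[(prev, word)] + alpha) / (context_counts[prev] + alpha * len(vocab))

p_seen = bigram_prob("the", "dog")    # pair observed twice in the corpus
p_unseen = bigram_prob("the", "bit")  # pair never observed, but still non-zero
```

Smoothing moves a little probability mass to unseen pairs while keeping each conditional distribution normalized.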

## recurrent neural networks

### vanilla rnn and the elman network

A [recurrent neural network](/wiki/recurrent_neural_network) processes a sequence one element at a time while maintaining a hidden state that summarizes the history. At each [timestep](/wiki/timestep) t, the network computes h_t = phi(W_h h_{t-1} + W_x x_t + b), where phi is a nonlinearity such as tanh. Outputs may be produced at every step or only at the end.
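The recurrence above can be written out directly. The sketch below uses plain Python lists instead of a tensor library, and the weights are arbitrary illustrative values rather than trained parameters.

```python
import math

def rnn_step(h_prev, x, W_h, W_x, b):
    """One step of h_t = tanh(W_h h_{t-1} + W_x x_t + b) on plain Python lists."""
    h = []
    for i in range(len(b)):
        pre = b[i]
        pre += sum(W_h[i][j] * h_prev[j] for j in range(len(h_prev)))
        pre += sum(W_x[i][k] * x[k] for k in range(len(x)))
        h.append(math.tanh(pre))
    return h

def rnn_forward(xs, W_h, W_x, b):
    """Run the recurrence over a whole sequence and return every hidden state."""
    h = [0.0] * len(b)  # initial hidden state h_0
    states = []
    for x in xs:
        h = rnn_step(h, x, W_h, W_x, b)
        states.append(h)
    return states

# Illustrative weights: hidden size 2, input size 1.
W_h = [[0.5, 0.0], [0.0, 0.5]]
W_x = [[1.0], [-1.0]]
b = [0.0, 0.0]
states = rnn_forward([[1.0], [0.0]], W_h, W_x, b)
```

The same weights are reused at every timestep, which is what lets the network handle sequences of any length.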

The Elman network, described by Jeffrey Elman in 1990 in "Finding Structure in Time," is the canonical simple RNN. Michael Jordan independently proposed a related architecture in 1986 that fed the previous output, rather than the previous hidden state, back into the network. Vanilla RNNs are conceptually elegant but suffer from severe optimization difficulties on long sequences.

### vanishing and exploding gradients

Training an RNN by [backpropagation through time](/wiki/backpropagation_through_time) requires multiplying many Jacobians, one per timestep. When these Jacobians are contractive, which for a tanh network is guaranteed if the largest singular value of the recurrent weight matrix is below one, the gradient norm shrinks geometrically with the number of steps, producing the [vanishing gradient problem](/wiki/vanishing_gradient_problem) analyzed by Sepp Hochreiter in his 1991 diploma thesis and by Yoshua Bengio, Patrice Simard, and Paolo Frasconi in 1994. When the largest singular value exceeds one, which is a necessary but not sufficient condition, gradients can grow without bound, producing the [exploding gradient problem](/wiki/exploding_gradient_problem). Vanishing gradients prevent learning long range dependencies, while exploding gradients destabilize training.
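The geometric behaviour is easiest to see in a deliberately simplified case. The sketch below assumes a scalar linear recurrence h_t = w * h_{t-1} with no nonlinearity, so every per-step Jacobian is just w; real networks multiply matrices, but the trend is the same.

```python
def gradient_norm_through_time(w, steps):
    """|d h_T / d h_0| for the scalar linear recurrence h_t = w * h_{t-1}."""
    grad = 1.0
    for _ in range(steps):
        grad *= w  # each timestep contributes one Jacobian factor, here just w
    return abs(grad)

vanishing = gradient_norm_through_time(0.9, 50)  # shrinks toward zero
exploding = gradient_norm_through_time(1.1, 50)  # grows rapidly
```

After 50 steps the factor 0.9 has reduced the gradient to well under one percent of its original size, while the factor 1.1 has amplified it more than a hundredfold.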

A standard remedy for explosion is [gradient clipping](/wiki/gradient_clipping), proposed by Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio in 2013. Gradient clipping rescales the gradient vector whenever its norm exceeds a threshold, which keeps updates bounded without changing their direction. The cure for vanishing gradients was architectural: gated RNNs.
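Norm-based clipping can be sketched as follows, treating the gradient as a flat Python list. The function name `clip_by_norm` is illustrative; deep learning frameworks provide their own utilities for this.

```python
import math

def clip_by_norm(grad, max_norm):
    """Rescale grad so that its Euclidean norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm or norm == 0.0:
        return list(grad)  # already within the threshold, leave unchanged
    scale = max_norm / norm
    return [g * scale for g in grad]  # same direction, smaller length

clipped = clip_by_norm([3.0, 4.0], max_norm=1.0)  # original norm is 5
```

Because every component is multiplied by the same factor, the update direction is preserved and only its length changes.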

### long short-term memory

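The [Long Short-Term Memory (LSTM)](/wiki/long_short-term_memory_lstm), introduced by Sepp Hochreiter and Juergen Schmidhuber in 1997, replaced the simple recurrence with a memory cell controlled by gates. The original design had an input gate and an output gate; the [forget gate](/wiki/forget_gate) was added by Felix Gers, Juergen Schmidhuber, and Fred Cummins in 2000 and is part of the standard LSTM used today. The cell state c_t is updated additively, c_t = f_t * c_{t-1} + i_t * g_t, where f_t and i_t are the forget and input gate activations, g_t is the candidate update, and * denotes elementwise multiplication. Because the path from c_{t-1} to c_t is scaled only by f_t, gradients can flow through many steps with far less attenuation than in a vanilla RNN.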
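A single LSTM step can be sketched for the scalar case (hidden size 1, input size 1). The `params` layout, which maps each gate name to a tuple of input weight, recurrent weight, and bias, is a hypothetical convention chosen for this example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, params):
    """One step of a scalar LSTM cell; params[name] = (w_x, w_h, bias)."""
    def pre(name):
        w_x, w_h, bias = params[name]
        return w_x * x + w_h * h_prev + bias

    i = sigmoid(pre("input"))         # input gate
    f = sigmoid(pre("forget"))        # forget gate
    o = sigmoid(pre("output"))        # output gate
    g = math.tanh(pre("candidate"))   # candidate cell update
    c = f * c_prev + i * g            # additive cell state update
    h = o * math.tanh(c)              # exposed hidden state
    return h, c

# Illustrative parameters: forget gate nearly open, input gate nearly closed.
params = {"input": (0.0, 0.0, -20.0), "forget": (0.0, 0.0, 20.0),
          "output": (0.0, 0.0, 0.0), "candidate": (1.0, 0.0, 0.0)}
h, c = 0.0, 1.0
for _ in range(100):
    h, c = lstm_step(0.0, h, c, params)
```

With the forget gate close to one and the input gate close to zero, the cell carries its stored value across 100 steps almost unchanged, which is the behaviour that makes long range memory possible.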

LSTMs dominated sequence modeling from roughly 2014 to 2018. They powered Google Translate after its 2016 neural rewrite, large speech recognition systems at major laboratories, and many text classification and labeling pipelines. The shorthand [LSTM](/wiki/lstm) is universally understood inside the field.

### gated recurrent unit

The [gated recurrent unit](/wiki/gated_recurrent_unit), or GRU, was introduced by Kyunghyun Cho and colleagues in 2014. The GRU merges the forget and input gates into a single update gate and has no separate cell state, which yields fewer parameters than an LSTM. Empirical comparisons by Junyoung Chung and colleagues found GRUs and LSTMs to be roughly comparable on many tasks, with the GRU slightly faster and the LSTM occasionally more expressive on very long sequences.
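For comparison, a scalar GRU step can be sketched in the same style. Sign conventions for the update gate differ between papers and libraries; this sketch assumes the form h_t = (1 - z_t) * h_{t-1} + z_t * h~_t, and the `params` layout is hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gru_step(x, h_prev, params):
    """One step of a scalar GRU; params[name] = (w_x, w_h, bias)."""
    def pre(name, h):
        w_x, w_h, bias = params[name]
        return w_x * x + w_h * h + bias

    z = sigmoid(pre("update", h_prev))                 # update gate
    r = sigmoid(pre("reset", h_prev))                  # reset gate
    h_tilde = math.tanh(pre("candidate", r * h_prev))  # candidate state
    # A single gate decides how much old state to keep and how much new state to write.
    return (1.0 - z) * h_prev + z * h_tilde

keep = {"update": (0.0, 0.0, -20.0), "reset": (0.0, 0.0, 0.0), "candidate": (1.0, 1.0, 0.0)}
write = {"update": (0.0, 0.0, 20.0), "reset": (0.0, 0.0, 0.0), "candidate": (1.0, 1.0, 0.0)}
h_keep = gru_step(1.0, 0.5, keep)    # update gate near 0: state is carried over
h_write = gru_step(1.0, 0.5, write)  # update gate near 1: state is overwritten
```

Unlike the LSTM, there is no separate cell state: the hidden state itself is the memory.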

### bidirectional and deep rnns

A [bidirectional RNN](/wiki/bidirectional_rnn), introduced by Mike Schuster and Kuldip Paliwal in 1997, runs one RNN forward and another backward over the input, then concatenates their hidden states. This gives every position access to both past and future context and is well suited to non-generative tasks such as labeling and reading comprehension. Stacking multiple recurrent layers yields deep RNNs, which can capture hierarchical structure but increase optimization difficulty.

## encoder-decoder and attention

### the seq2seq framework

The encoder-decoder, or seq2seq, framework was proposed in two near-simultaneous papers in 2014: Ilya Sutskever, Oriol Vinyals, and Quoc Le's "Sequence to Sequence Learning with Neural Networks" and Kyunghyun Cho and colleagues' "Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation." An encoder RNN reads the input sequence into a fixed length vector, and a decoder RNN generates the output sequence conditioned on that vector. This single architecture handles translation, summarization, and many other text to text tasks.

### attention mechanism

A fixed length vector becomes a bottleneck on long inputs. Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio proposed [attention](/wiki/attention) in their 2014 paper "Neural Machine Translation by Jointly Learning to Align and Translate." Their attention layer lets the decoder, at each output step, compute a weighted average of all encoder hidden states, with weights produced by a learned alignment model. Minh-Thang Luong, Hieu Pham, and Christopher Manning later proposed simpler dot-product variants in 2015. Attention dramatically improved translation quality and provided interpretable alignments between source and target tokens.
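The weighted average at the heart of attention can be sketched as below. For simplicity the example scores each encoder state with a plain dot product, in the spirit of the Luong variants, rather than the learned additive alignment model of Bahdanau and colleagues; vectors are plain Python lists and all values are illustrative.

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot_product_attention(query, encoder_states):
    """Return (context, weights) for one decoder step."""
    scores = [sum(q * k for q, k in zip(query, h)) for h in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    # The context vector is the attention-weighted average of the encoder states.
    context = [sum(w * h[d] for w, h in zip(weights, encoder_states)) for d in range(dim)]
    return context, weights

encoder_states = [[1.0, 0.0], [0.0, 1.0]]
uniform_context, uniform_weights = dot_product_attention([0.0, 0.0], encoder_states)
focused_context, focused_weights = dot_product_attention([10.0, 0.0], encoder_states)
```

A query that matches no state in particular yields a uniform average, while a query aligned with the first state concentrates almost all weight on it.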

## the transformer

### architecture

In 2017 Ashish Vaswani and colleagues at Google published "Attention Is All You Need," introducing the [Transformer](/wiki/transformer). The Transformer removed recurrence entirely and replaced it with multi-head [self-attention](/wiki/self_attention) and feed-forward layers. Self-attention computes pairwise interactions between all tokens, with queries, keys, and values produced by linear projections of the input. Positional information is restored through positional encodings, originally a sinusoidal scheme.
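The sinusoidal scheme assigns each position a vector of sines and cosines at geometrically spaced frequencies. A minimal sketch, assuming an even model dimension and the base of 10000 used in the original paper:

```python
import math

def sinusoidal_encoding(position, d_model):
    """Sinusoidal positional encoding for one position (d_model assumed even)."""
    enc = []
    for i in range(0, d_model, 2):
        # i is the even dimension index; each sine/cosine pair shares one frequency.
        angle = position / (10000 ** (i / d_model))
        enc.append(math.sin(angle))
        enc.append(math.cos(angle))
    return enc

first = sinusoidal_encoding(0, 4)   # [0.0, 1.0, 0.0, 1.0]
second = sinusoidal_encoding(1, 4)
```

These vectors are added to the token embeddings so that self-attention, which is otherwise order-agnostic, can distinguish positions.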

The original paper presented an encoder-decoder Transformer for machine translation, but the same building blocks support encoder-only models, decoder-only models, and many hybrids. Self-attention is parallelizable across sequence positions, which made Transformers far easier to scale on modern accelerators than RNNs.

### encoder-only, decoder-only, and encoder-decoder variants

Three main Transformer families have emerged. Encoder-only models such as [BERT](/wiki/bert), introduced by Jacob Devlin and colleagues in 2018, use bidirectional self-attention and are trained with masked language modeling. They excel at understanding tasks such as classification and question answering. Decoder-only models such as the [GPT](/wiki/gpt) series from OpenAI, beginning with Alec Radford and colleagues' 2018 GPT-1, use causal self-attention and are trained with next token prediction. They excel at generation and have become the dominant architecture for [large language models](/wiki/large_language_model). Encoder-decoder models such as [T5](/wiki/t5), introduced by Colin Raffel and colleagues in 2019, frame all tasks as text to text problems and are widely used for translation, summarization, and instruction following.

### scaling and pretraining

Scaling laws established by Jared Kaplan and colleagues in 2020 showed that language model loss falls predictably, approximately as a power law, as parameters, data, and compute grow. The Chinchilla paper from Jordan Hoffmann and colleagues in 2022 refined these results by showing that, for a fixed compute budget, model size and the number of training tokens should be scaled in roughly equal proportion. This empirical regularity drove the rise of frontier models such as GPT-3, GPT-4, Claude, Gemini, and Llama, which can be viewed as very large autoregressive sequence models. Variants such as Mixture of Experts, used in Switch Transformer and Mixtral, scale parameter counts while keeping per-token compute roughly constant.

## state space models and linear sequence models

Although Transformers are powerful, their attention layer has cost that scales quadratically in sequence length, which limits very long contexts. A line of research on structured [state space models](/wiki/state_space_model) revives linear time recurrences with strong long range performance.

Albert Gu and Tri Dao introduced [Mamba](/wiki/mamba) in late 2023 in "Mamba: Linear-Time Sequence Modeling with Selective State Spaces." Mamba uses input-dependent state space parameters, which make the recurrence selective in a sense analogous to gating. Mamba-2, presented by Tri Dao and Albert Gu in 2024 in "Transformers are SSMs," connects state space models to a class of structured attention and improves training efficiency. The Hyena architecture, by Michael Poli and colleagues in 2023, replaces attention with implicit long convolutions and a gated structure.
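The core idea of a linear state space layer can be illustrated with a scalar, time-invariant recurrence h_t = a * h_{t-1} + b * x_t, y_t = c * h_t. This is a simplification made for the example: Mamba makes the parameters depend on the input, in which case the convolutional form shown second no longer applies.

```python
def ssm_scan(xs, a, b, c):
    """Linear-time recurrent evaluation of a scalar state space model."""
    h, ys = 0.0, []
    for x in xs:
        h = a * h + b * x  # state update
        ys.append(c * h)   # readout
    return ys

def ssm_convolution(xs, a, b, c):
    """The same outputs computed as a causal convolution with kernel c * a**k * b."""
    kernel = [c * (a ** k) * b for k in range(len(xs))]
    return [sum(kernel[k] * xs[t - k] for k in range(t + 1)) for t in range(len(xs))]

xs = [1.0, 2.0, 0.0, -1.0]
recurrent = ssm_scan(xs, a=0.5, b=1.0, c=2.0)
convolved = ssm_convolution(xs, a=0.5, b=1.0, c=2.0)
```

Both functions produce the same outputs. The recurrent form needs only constant memory per step at inference time, while the convolutional form exposes parallelism during training.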

Hybrid models combine attention with state space layers. AI21 Labs released Jamba in 2024, a production hybrid that interleaves Mamba blocks, Transformer blocks, and Mixture of Experts. Hybrids aim to combine the precise, content-based recall of attention with the linear scaling of state space recurrences on long contexts.

## time series forecasting models

Sequence modeling for numerical time series has a partially separate lineage from natural language modeling. Classical methods include autoregressive integrated moving average (ARIMA) models, exponential smoothing, and state space approaches such as the Kalman filter.

Prophet, released by Facebook in 2017 by Sean Taylor and Benjamin Letham, is a structural time series tool that decomposes a series into trend, seasonality, and holidays. It is widely used for business forecasting because it is robust to missing data and easy to tune.

More recently, foundation models for time series have emerged. TimeGPT was released by Nixtla in 2023 as a closed source forecasting API trained on a large corpus of series. Chronos, from Amazon researchers Abdul Fatir Ansari and colleagues in 2024, tokenizes time series values and trains a Transformer language model on them. Lag-Llama, by Kashif Rasul and colleagues in 2023, is an open foundation model that conditions on lagged values. Moirai, from Salesforce researchers Gerald Woo and colleagues in 2024, is a masked encoder Transformer trained on a large multivariate corpus. PatchTST and N-BEATS represent earlier strong neural baselines.

## speech and audio sequence models

### connectionist temporal classification

Connectionist Temporal Classification (CTC), introduced by Alex Graves and colleagues in 2006, allows a sequence model to be trained on input-output pairs of different lengths without explicit alignment. CTC introduces a blank symbol and defines the probability of an output as the sum over all frame-level alignments that collapse to it once repeated symbols are merged and blanks are removed. CTC remains a core building block for speech recognition, lip reading, and other tasks where the output is no longer than the input.
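The mapping from a frame-level alignment to an output sequence can be sketched in a few lines. The blank symbol `_` and the example alignment are arbitrary choices for illustration.

```python
from itertools import groupby

def ctc_collapse(path, blank="_"):
    """Map a frame-level alignment to its output: merge repeats, then drop blanks."""
    merged = [symbol for symbol, _ in groupby(path)]  # merge consecutive duplicates
    return [s for s in merged if s != blank]          # remove the blank symbol

output = "".join(ctc_collapse("hh_e_ll_llo"))
```

The blank between the two runs of `l` is what allows the doubled letter in "hello" to survive the merge step; many different alignments collapse to the same output, and the CTC loss sums their probabilities.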

### rnn transducer

The RNN transducer, also from Alex Graves in 2012, extends CTC with a separate prediction network that conditions on previous outputs. RNN-T is widely used in production speech recognizers because it streams naturally and supports robust on-device inference.

### whisper and modern speech models

Whisper, released by OpenAI in 2022 by Alec Radford and colleagues, is an encoder-decoder Transformer trained on hundreds of thousands of hours of weakly supervised audio. It performs multilingual speech recognition, translation, and language identification in a single model. Other notable speech sequence models include Conformer (Anmol Gulati and colleagues, 2020), which augments Transformer blocks with convolution, as well as wav2vec 2.0 (Alexei Baevski and colleagues, 2020) and HuBERT (Wei-Ning Hsu and colleagues, 2021), which use self-supervised learning on raw audio.

## training sequence models

Sequence models are typically trained with stochastic gradient descent and variants such as [Adam](/wiki/adam_optimizer). Several practical concerns deserve attention.

Teacher forcing feeds the ground truth previous token to the decoder during training, which speeds up convergence but can mismatch the autoregressive setting at inference. Scheduled sampling, proposed by Samy Bengio and colleagues in 2015, mixes ground truth with predictions during training to reduce this exposure bias.

Tokenization choices, such as byte pair encoding by Rico Sennrich and colleagues in 2016 or SentencePiece by Taku Kudo and John Richardson in 2018, determine the alphabet over which a language model operates. Subword tokenization handles rare words gracefully and is now standard for text models.

Long context training requires careful memory management. Techniques such as gradient checkpointing, FlashAttention from Tri Dao and colleagues in 2022, and sequence parallelism allow Transformers to be trained on contexts of tens or hundreds of thousands of tokens. State space models avoid the quadratic memory cost of attention but introduce their own engineering challenges.

## evaluating sequence models

Sequence model evaluation depends on the task. Language models report perplexity, defined as the exponential of the average negative log likelihood per token. Translation quality is measured with BLEU, METEOR, chrF, and learned metrics such as COMET. Summarization uses ROUGE and human ratings. Speech recognition reports word error rate or character error rate. Time series forecasting uses mean absolute error, root mean squared error, and mean absolute percentage error, often with seasonal naive baselines.
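Two of these metrics are simple enough to sketch directly. The example below computes perplexity from the probabilities a model assigned to each reference token, and word error rate as word-level edit distance divided by reference length; the inputs are made up for illustration.

```python
import math

def perplexity(token_probs):
    """Exponential of the average negative log likelihood per token."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # distances for the empty reference prefix
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution or match
        prev = cur
    return prev[-1] / len(ref)

ppl = perplexity([0.25, 0.25, 0.25, 0.25])
wer = word_error_rate("the dog bit the man", "the dog bit a man")
```

A model that assigns probability 0.25 to every token has perplexity 4, the same as a uniform guess among four options, and one substituted word out of five gives a word error rate of 0.2.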

For long range modeling, benchmarks such as the Long Range Arena introduced by Yi Tay and colleagues in 2020 measure how well a model handles sequences of thousands of tokens. For instruction following and reasoning, benchmarks such as MMLU, GSM8K, BIG-Bench, and HELM exercise [large language models](/wiki/large_language_model) on diverse tasks.

## practical considerations

Choosing a sequence architecture depends on context length, latency budget, training data, and deployment target. Short sequences with strong supervised signal often work well with bidirectional Transformer encoders or LSTMs. Long autoregressive generation favors decoder-only Transformers, possibly augmented with state space layers for very long contexts. Streaming applications such as on-device speech recognition benefit from RNN transducers and chunked attention. Time series forecasting with limited data may still favor classical methods like Prophet and ARIMA over large foundation models.

Positional encoding choices matter for long contexts. Rotary positional embeddings, introduced by Jianlin Su and colleagues in 2021, and ALiBi, introduced by Ofir Press and colleagues in 2021, generalize better to longer sequences than the original sinusoidal scheme. Mixture of context strategies, retrieval augmented generation, and external memory modules let models reach beyond their nominal context window.

## index of sequence model wiki pages

The following pages on this wiki cover related sequence model concepts. This index preserves the original list of links from this gateway page.

- [bigram](/wiki/bigram)
- [exploding gradient problem](/wiki/exploding_gradient_problem)
- [forget gate](/wiki/forget_gate)
- [gradient clipping](/wiki/gradient_clipping)
- [Long Short-Term Memory (LSTM)](/wiki/long_short-term_memory_lstm)
- [LSTM](/wiki/lstm)
- [N-gram](/wiki/n-gram)
- [recurrent neural network](/wiki/recurrent_neural_network)
- [RNN](/wiki/rnn)
- [sequence model](/wiki/sequence_model)
- [timestep](/wiki/timestep)
- [trigram](/wiki/trigram)
- [vanishing gradient problem](/wiki/vanishing_gradient_problem)

## see also

- [Machine learning terms](/wiki/machine_learning_terms)
- [Transformer](/wiki/transformer)
- [self-attention](/wiki/self_attention)
- [attention](/wiki/attention)
- [large language model](/wiki/large_language_model)
- [BERT](/wiki/bert)
- [GPT](/wiki/gpt)
- [T5](/wiki/t5)
- [hidden Markov model](/wiki/hidden_markov_model)
- [conditional random field](/wiki/conditional_random_field)
- [gated recurrent unit](/wiki/gated_recurrent_unit)
- [bidirectional RNN](/wiki/bidirectional_rnn)
- [backpropagation through time](/wiki/backpropagation_through_time)
- [machine translation](/wiki/machine_translation)
- [automatic speech recognition](/wiki/automatic_speech_recognition)
- [state space model](/wiki/state_space_model)
- [Mamba](/wiki/mamba)
- [named entity recognition](/wiki/named_entity_recognition)
- [part-of-speech tagging](/wiki/part_of_speech_tagging)

## references

- Elman, Jeffrey L. (1990). "Finding Structure in Time." Cognitive Science 14(2): 179-211.
- Hochreiter, Sepp (1991). "Untersuchungen zu dynamischen neuronalen Netzen." Diploma thesis, Technische Universitaet Muenchen.
- Bengio, Yoshua, Patrice Simard, and Paolo Frasconi (1994). "Learning Long-Term Dependencies with Gradient Descent is Difficult." IEEE Transactions on Neural Networks 5(2): 157-166.
- Hochreiter, Sepp, and Juergen Schmidhuber (1997). "Long Short-Term Memory." Neural Computation 9(8): 1735-1780.
- Schuster, Mike, and Kuldip K. Paliwal (1997). "Bidirectional Recurrent Neural Networks." IEEE Transactions on Signal Processing 45(11): 2673-2681.
- Gers, Felix A., Juergen Schmidhuber, and Fred Cummins (2000). "Learning to Forget: Continual Prediction with LSTM." Neural Computation 12(10): 2451-2471.
- Lafferty, John, Andrew McCallum, and Fernando Pereira (2001). "Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data." ICML.
- Graves, Alex, Santiago Fernandez, Faustino Gomez, and Juergen Schmidhuber (2006). "Connectionist Temporal Classification." ICML.
- Pascanu, Razvan, Tomas Mikolov, and Yoshua Bengio (2013). "On the Difficulty of Training Recurrent Neural Networks." ICML.
- Cho, Kyunghyun, et al. (2014). "Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation." EMNLP.
- Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le (2014). "Sequence to Sequence Learning with Neural Networks." NeurIPS.
- Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio (2014). "Neural Machine Translation by Jointly Learning to Align and Translate." arXiv:1409.0473.
- Chung, Junyoung, et al. (2014). "Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling." arXiv:1412.3555.
- Luong, Minh-Thang, Hieu Pham, and Christopher D. Manning (2015). "Effective Approaches to Attention-based Neural Machine Translation." EMNLP.
- Sennrich, Rico, Barry Haddow, and Alexandra Birch (2016). "Neural Machine Translation of Rare Words with Subword Units." ACL.
- Vaswani, Ashish, et al. (2017). "Attention Is All You Need." NeurIPS.
- Taylor, Sean J., and Benjamin Letham (2017). "Forecasting at Scale." American Statistician 72(1): 37-45.
- Devlin, Jacob, et al. (2018). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." arXiv:1810.04805.
- Radford, Alec, et al. (2018). "Improving Language Understanding by Generative Pre-Training." OpenAI technical report.
- Raffel, Colin, et al. (2019). "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer." arXiv:1910.10683.
- Kaplan, Jared, et al. (2020). "Scaling Laws for Neural Language Models." arXiv:2001.08361.
- Tay, Yi, et al. (2020). "Long Range Arena: A Benchmark for Efficient Transformers." arXiv:2011.04006.
- Su, Jianlin, et al. (2021). "RoFormer: Enhanced Transformer with Rotary Position Embedding." arXiv:2104.09864.
- Press, Ofir, Noah A. Smith, and Mike Lewis (2021). "Train Short, Test Long: Attention with Linear Biases." arXiv:2108.12409.
- Hoffmann, Jordan, et al. (2022). "Training Compute-Optimal Large Language Models." arXiv:2203.15556.
- Dao, Tri, et al. (2022). "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness." NeurIPS.
- Radford, Alec, et al. (2022). "Robust Speech Recognition via Large-Scale Weak Supervision." arXiv:2212.04356.
- Poli, Michael, et al. (2023). "Hyena Hierarchy: Towards Larger Convolutional Language Models." ICML.
- Gu, Albert, and Tri Dao (2023). "Mamba: Linear-Time Sequence Modeling with Selective State Spaces." arXiv:2312.00752.
- Dao, Tri, and Albert Gu (2024). "Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality." ICML.
- Ansari, Abdul Fatir, et al. (2024). "Chronos: Learning the Language of Time Series." arXiv:2403.07815.
- Woo, Gerald, et al. (2024). "Unified Training of Universal Time Series Forecasting Transformers." ICML.
- Rabiner, Lawrence R. (1989). "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition." Proceedings of the IEEE 77(2): 257-286.

