A sequence model is a machine learning model designed to process, generate, or make predictions from ordered data in which the position and context of each element matter. Unlike models that treat inputs as independent observations, sequence models capture dependencies between elements in a series, making them suitable for tasks involving text, speech, time series, biological sequences, and other temporally or sequentially structured data.
Sequence models have evolved from early statistical approaches like hidden Markov models to neural network-based architectures such as recurrent neural networks (RNNs), long short-term memory (LSTM) networks, and Transformers. These models form the backbone of modern natural language processing, speech recognition, and many other sequential data applications.
Imagine you are reading a story. You understand each sentence because you remember what happened in the sentences before it. A sequence model works the same way. It reads things one piece at a time (like words in a sentence or notes in a song) and remembers what came before so it can understand what comes next. If the story says "The cat sat on the...", the model remembers "cat" and "sat" and guesses the next word might be "mat" or "couch." Some newer sequence models are even smarter: instead of reading one word at a time, they can look at all the words at once and figure out which ones are most related to each other, like being able to see the whole page of a book instead of reading it letter by letter.
The study of sequential data processing has roots stretching back decades, drawing on both neuroscience and statistical theory.
Markov chains, introduced by Andrey Markov in 1906, provided the first mathematical framework for modeling sequences where the probability of each state depends only on the preceding state. In the 1960s, Leonard Baum and colleagues at the Institute for Defense Analyses developed the theory of hidden Markov models (HMMs), which extended Markov chains by introducing unobservable (hidden) states that generate observed outputs through probabilistic emission functions. HMMs became the dominant approach for speech recognition and biological sequence analysis for several decades.
The idea of recurrence in neural networks has biological origins. Santiago Ramón y Cajal observed "recurrent semicircles" in cerebellar cortex structures as early as 1901, and Donald Hebb proposed "reverberating circuits" as short-term memory mechanisms in the 1940s.
In 1986, Michael I. Jordan introduced the Jordan network, one of the first recurrent neural architectures for sequential processing. In 1990, Jeffrey Elman published "Finding Structure in Time," describing the Elman network (also called a simple recurrent network), which introduced recurrent connections between hidden units and context (memory) units. These two architectures demonstrated that neural networks could learn temporal structure in data.
John Hopfield's 1982 application of spin glass theory to neural networks with binary activations also contributed to the foundations of recurrent computation, although Hopfield networks were primarily used for associative memory rather than sequence processing.
Sepp Hochreiter's 1991 diploma thesis formally identified the vanishing gradient problem, explaining why training RNNs on long sequences was so difficult. As gradients propagate backward through many time steps, they tend to shrink exponentially, making it nearly impossible for the network to learn long-range dependencies. This problem motivated the development of gated architectures.
In 1997, Hochreiter and Jürgen Schmidhuber introduced the LSTM, which used a system of gates to control the flow of information and maintain a stable cell state over many time steps. LSTM became the default RNN architecture for the next two decades.
Kyunghyun Cho and colleagues introduced the gated recurrent unit (GRU) in 2014, offering a simpler alternative to LSTM with comparable performance on many tasks.
In 2014, Dzmitry Bahdanau, Cho, and Yoshua Bengio introduced the attention mechanism for neural machine translation, allowing the decoder to focus on different parts of the input sequence at each output step rather than relying on a single fixed-length context vector. This resolved the information bottleneck that plagued earlier encoder-decoder models.
In 2017, Vaswani et al. published "Attention Is All You Need," introducing the Transformer architecture, which replaced recurrence entirely with self-attention and enabled full parallelization of sequence processing. The Transformer has since become the foundation of virtually all modern large language models and many other sequence processing systems.
Sequence models span a wide range of architectures, from classical statistical models to modern deep learning systems.
A hidden Markov model (HMM) is a probabilistic model that represents a system as a series of transitions between hidden states, each of which produces an observable output according to an emission probability distribution. The model makes the Markov assumption: the probability of transitioning to the next state depends only on the current state, not on any earlier states.
An HMM is defined by three components: an initial distribution over the hidden states, a matrix of transition probabilities between hidden states, and an emission distribution for each hidden state over the possible observations.
Three classical algorithms solve the main inference and learning problems for HMMs:
| Algorithm | Problem solved | Description |
|---|---|---|
| Forward algorithm | Evaluation | Computes the probability of an observed sequence given the model |
| Viterbi algorithm | Decoding | Finds the most likely sequence of hidden states for a given observation sequence |
| Baum-Welch algorithm | Learning | Estimates model parameters from observed data using expectation-maximization |
HMMs were the dominant approach in speech recognition from the mid-1970s through the 2010s and remain widely used in bioinformatics for tasks like gene finding and protein family classification.
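As an illustration, the forward algorithm from the table above can be written in a few lines of NumPy; the toy parameters and variable names below are placeholders, not drawn from any particular HMM application:

```python
import numpy as np

def forward(pi, A, B, obs):
    """Probability of an observation sequence under an HMM (forward algorithm).

    pi  -- initial state distribution, shape (S,)
    A   -- transition matrix, A[i, j] = P(next state j | current state i), shape (S, S)
    B   -- emission matrix, B[i, k] = P(symbol k | state i), shape (S, K)
    obs -- observed symbol indices, length T
    """
    alpha = pi * B[:, obs[0]]            # joint probability of the first symbol and each state
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]    # propagate one step, then weight by the emission
    return alpha.sum()                   # marginalize over the final hidden state

# Toy model: 2 hidden states, 3 observable symbols
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1],
              [0.1, 0.3, 0.6]])
print(forward(pi, A, B, obs=[0, 1, 2]))  # likelihood of the sequence 0, 1, 2
```

The Viterbi algorithm has the same structure, with the sum over previous states replaced by a maximum (plus back-pointers to recover the most likely state path).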
A recurrent neural network (RNN) is a neural network that processes sequential data by maintaining a hidden state vector that is updated at each time step. At each step t, the network takes the current input x_t and the previous hidden state h_(t-1) to produce a new hidden state h_t:
h_t = f(W_hh * h_(t-1) + W_xh * x_t + b)
where W_hh is the recurrent weight matrix, W_xh is the input weight matrix, b is a bias term, and f is a nonlinear activation function (typically tanh or ReLU).
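A minimal NumPy sketch of this recurrence, with arbitrary placeholder dimensions and randomly initialized weights:

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b):
    """One step of a vanilla RNN: h_t = tanh(W_hh @ h_prev + W_xh @ x_t + b)."""
    return np.tanh(W_hh @ h_prev + W_xh @ x_t + b)

rng = np.random.default_rng(0)
W_xh = rng.normal(scale=0.1, size=(8, 4))   # input-to-hidden weights (hidden=8, input=4)
W_hh = rng.normal(scale=0.1, size=(8, 8))   # hidden-to-hidden (recurrent) weights
b = np.zeros(8)

h = np.zeros(8)                              # initial hidden state
for x_t in rng.normal(size=(10, 4)):         # a sequence of 10 input vectors
    h = rnn_step(x_t, h, W_xh, W_hh, b)      # the same weights are reused at every time step
```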
RNNs are trained using backpropagation through time (BPTT), which unrolls the network across time steps and applies the chain rule to compute gradients. However, standard RNNs suffer from the vanishing and exploding gradient problems, which limit their ability to learn dependencies over long sequences.
Bidirectional RNNs, introduced by Schuster and Paliwal in 1997, process the input sequence in both the forward and backward directions simultaneously. The hidden states from both directions are concatenated at each time step, giving the model access to both past and future context. Bidirectional architectures are useful when the full sequence is available at inference time, as in text classification or named entity recognition, but are unsuitable for real-time or autoregressive tasks where future inputs are not yet known.
Deep RNNs stack multiple recurrent layers on top of each other. The output of one recurrent layer serves as the input to the next, allowing the network to learn increasingly abstract representations of the sequential input. Deep RNNs with two to four layers are common in practice; very deep stacks tend to be difficult to train without residual connections or other stabilization techniques.
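Deep-learning libraries typically expose both options as configuration flags. A PyTorch sketch with illustrative sizes:

```python
import torch
import torch.nn as nn

# A 3-layer bidirectional LSTM; layer count and dimensions are illustrative.
rnn = nn.LSTM(input_size=16, hidden_size=32, num_layers=3,
              bidirectional=True, batch_first=True)

x = torch.randn(4, 25, 16)      # batch of 4 sequences of length 25
out, _ = rnn(x)                 # out: (4, 25, 64) -- forward and backward states concatenated
```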
LSTM networks, introduced by Hochreiter and Schmidhuber in 1997 and refined by Felix Gers and colleagues in 2000 (who added the forget gate), address the vanishing gradient problem through a carefully designed gating mechanism and a dedicated memory cell.
An LSTM cell contains four main components:
| Component | Function | Equation concept |
|---|---|---|
| Forget gate | Decides what information to discard from the cell state | Sigmoid layer outputs values between 0 (discard) and 1 (keep) |
| Input gate | Determines which new information to store in the cell state | Sigmoid layer selects values; tanh layer creates candidate values |
| Cell state | Carries information across time steps with minimal transformation | Updated by pointwise addition after forget and input gate operations |
| Output gate | Controls what part of the cell state to expose as the hidden state output | Sigmoid layer filters the cell state passed through tanh |
The cell state acts as a "conveyor belt" that runs through the entire chain, with the gates adding or removing information. Because the cell state is updated through additive operations rather than multiplicative ones, gradients can flow through many time steps without vanishing, allowing LSTMs to learn dependencies spanning hundreds of time steps.
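The gating logic in the table can be expressed compactly. The sketch below stacks the four gate computations into single matrices W, U, and b, which is one common but not universal parameterization:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step; W, U, b stack the parameters of all four gates/layers."""
    z = W @ x_t + U @ h_prev + b            # shape (4 * hidden,)
    f, i, g, o = np.split(z, 4)
    f = sigmoid(f)                           # forget gate: 0 = discard, 1 = keep
    i = sigmoid(i)                           # input gate: which candidate values to write
    g = np.tanh(g)                           # candidate values for the cell state
    o = sigmoid(o)                           # output gate: what to expose as h_t
    c_t = f * c_prev + i * g                 # additive "conveyor belt" update of the cell state
    h_t = o * np.tanh(c_t)                   # hidden state passed to the next step/layer
    return h_t, c_t

rng = np.random.default_rng(0)
d_in, d_h = 4, 8
W = rng.normal(scale=0.1, size=(4 * d_h, d_in))
U = rng.normal(scale=0.1, size=(4 * d_h, d_h))
b = np.zeros(4 * d_h)
h, c = np.zeros(d_h), np.zeros(d_h)
for x_t in rng.normal(size=(10, d_in)):      # run over a short random sequence
    h, c = lstm_step(x_t, h, c, W, U, b)
```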
Common LSTM variants include the peephole LSTM (whose gates also receive the cell state as input), bidirectional LSTMs, stacked (deep) LSTMs, and the convolutional LSTM (ConvLSTM) used for spatiotemporal data.
The gated recurrent unit (GRU), introduced by Cho et al. in 2014, simplifies the LSTM architecture by combining the forget and input gates into a single update gate and merging the cell state and hidden state into one vector. The GRU uses two gates:
| Gate | Function |
|---|---|
| Update gate (z_t) | Controls how much of the previous hidden state to retain versus how much to replace with new candidate information |
| Reset gate (r_t) | Determines how much of the previous hidden state to incorporate when computing the new candidate hidden state |
GRUs have approximately 25% fewer parameters than LSTMs at the same hidden size, making them faster to train and more memory-efficient. Empirical evaluations by Chung et al. (2014) found that GRUs match or outperform LSTMs on many tasks with sequences under 200 to 300 steps, though LSTMs tend to have an advantage on very long sequences (beyond 500 steps).
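For comparison with the LSTM sketch above, here is a minimal GRU step under the same stacked-parameter convention (the names W, U, b are an assumption of this sketch, not a fixed standard):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W, U, b):
    """One GRU step; W, U, b stack the update gate, reset gate, and candidate parameters."""
    Wz, Wr, Wh = np.split(W, 3)
    Uz, Ur, Uh = np.split(U, 3)
    bz, br, bh = np.split(b, 3)
    z = sigmoid(Wz @ x_t + Uz @ h_prev + bz)               # update gate: keep vs. replace
    r = sigmoid(Wr @ x_t + Ur @ h_prev + br)               # reset gate: how much history to use
    h_tilde = np.tanh(Wh @ x_t + Uh @ (r * h_prev) + bh)   # candidate hidden state
    return (1 - z) * h_prev + z * h_tilde                  # single merged state, no separate cell
```

It can be driven by the same kind of loop as the LSTM sketch, with the stacked matrices holding three blocks instead of four.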
The encoder-decoder architecture, also known as sequence-to-sequence (seq2seq), was introduced independently by Sutskever, Vinyals, and Le (2014) and by Cho et al. (2014). It maps a variable-length input sequence to a variable-length output sequence using two separate networks: an encoder that reads the input sequence and compresses it into a context representation, and a decoder that generates the output sequence from that representation.
The original seq2seq models used stacked LSTMs for both the encoder and decoder. On the WMT 2014 English-to-French translation task, Sutskever et al. achieved a BLEU score of 34.8, demonstrating that end-to-end neural approaches could compete with phrase-based statistical systems.
A significant limitation of the basic seq2seq architecture is the information bottleneck: the entire input sequence must be compressed into a single fixed-size vector, causing performance to degrade on long input sequences. The attention mechanism, described below, was designed to address this limitation.
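A minimal PyTorch sketch of the basic, attention-free encoder-decoder (with illustrative dimensions) makes the bottleneck visible: the decoder sees the source only through the encoder's final state:

```python
import torch
import torch.nn as nn

encoder = nn.LSTM(input_size=16, hidden_size=64, batch_first=True)
decoder = nn.LSTM(input_size=16, hidden_size=64, batch_first=True)

src = torch.randn(2, 12, 16)      # source sequence: batch of 2, length 12
tgt = torch.randn(2, 9, 16)       # target-side inputs (teacher-forced), length 9

_, (h, c) = encoder(src)          # the entire source is compressed into the fixed-size (h, c)
dec_out, _ = decoder(tgt, (h, c)) # the decoder conditions only on that single summary state
```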
The attention mechanism allows a decoder to access all encoder hidden states rather than relying on a single summary vector. At each decoding step, the model computes a weighted sum of encoder states, where the weights (attention scores) indicate the relevance of each input position to the current output position.
There are several variants of attention:
| Variant | Year | Key idea |
|---|---|---|
| Bahdanau (additive) attention | 2014 | Alignment scores computed via a learned feed-forward network over concatenated encoder and decoder states |
| Luong (multiplicative) attention | 2015 | Alignment scores computed via dot product or bilinear transformation between encoder and decoder states |
| Self-attention | 2017 | Each position in a sequence attends to all other positions in the same sequence |
| Multi-head attention | 2017 | Multiple attention functions run in parallel on different learned projections, then concatenated |
| Cross-attention | 2017 | Attention between two different sequences (e.g., encoder output and decoder input) |
Bahdanau attention resolved the bottleneck problem by allowing the decoder to dynamically focus on different input positions at each output step. For example, when translating a sentence from English to French, the model can attend to the English word "cat" when generating the French word "chat," regardless of sentence length.
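The core computation is small. A sketch of Luong-style dot-product attention for a single decoder step, assuming the encoder and decoder states share the same dimensionality:

```python
import numpy as np

def dot_product_attention(decoder_state, encoder_states):
    """decoder_state: shape (d,); encoder_states: shape (T, d).
    Returns the context vector and the attention weights over input positions."""
    scores = encoder_states @ decoder_state            # alignment score for each input position
    weights = np.exp(scores - scores.max())            # numerically stable softmax
    weights /= weights.sum()
    context = weights @ encoder_states                 # weighted sum of encoder states
    return context, weights
```

Bahdanau (additive) attention replaces the dot product with a small feed-forward network over the concatenated states, but the weighted-sum step is the same.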
The Transformer, introduced by Vaswani et al. in 2017 in "Attention Is All You Need," dispenses with recurrence entirely and relies solely on attention mechanisms and feedforward layers. The architecture uses multi-head self-attention, position-wise feedforward networks, positional encodings to represent token order, and residual connections with layer normalization around each sublayer.
The original Transformer used a stack of 6 encoder layers and 6 decoder layers; the base configuration had roughly 65 million parameters and the larger variant roughly 213 million. The larger model achieved a BLEU score of 28.4 on the WMT 2014 English-to-German translation benchmark, surpassing all previously published models, including ensembles.
The self-attention operation has O(n^2) complexity with respect to sequence length n, which becomes expensive for very long sequences. Various approaches have been proposed to address this, including sparse attention patterns, linear attention approximations, and the structured state space models discussed below.
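A single-head self-attention sketch in NumPy shows both the mechanism and the source of the quadratic cost: the score matrix has one entry per pair of positions. The projection matrices here are placeholders for learned parameters:

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """X: input sequence, shape (n, d_model); W_q, W_k, W_v: learned projections."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # (n, n): every position vs. every other
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # softmax over key positions
    return weights @ V                                    # new representation for each position

rng = np.random.default_rng(0)
n, d = 6, 16
X = rng.normal(size=(n, d))
out = self_attention(X, *(rng.normal(scale=0.1, size=(d, d)) for _ in range(3)))
```

Multi-head attention runs several such projections in parallel and concatenates the results; a decoder additionally masks the score matrix so that each position cannot attend to later positions.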
Transformer-based models now dominate sequence modeling across domains. Encoder-only variants like BERT are used for classification and understanding tasks. Decoder-only variants like GPT are used for text generation. Encoder-decoder variants like T5 handle sequence-to-sequence tasks.
Temporal convolutional networks (TCNs) apply one-dimensional causal convolutions to sequential data. A causal convolution ensures that the output at time t depends only on inputs at time t and earlier, preserving the temporal ordering. TCNs use dilated convolutions to exponentially increase the receptive field without increasing the number of parameters, allowing them to capture long-range dependencies efficiently.
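A sketch of a causal, dilated 1-D convolution over a scalar sequence; the filter length and dilation are illustrative:

```python
import numpy as np

def causal_dilated_conv1d(x, w, dilation=1):
    """x: input sequence, shape (T,); w: filter taps, shape (k,).
    Output at time t depends only on x[t], x[t - d], ..., x[t - (k-1) * d]."""
    k = len(w)
    pad = (k - 1) * dilation
    x_padded = np.concatenate([np.zeros(pad), x])   # left-pad only, so no future values leak in
    return np.array([
        sum(w[j] * x_padded[t + pad - j * dilation] for j in range(k))
        for t in range(len(x))
    ])

y = causal_dilated_conv1d(np.arange(10.0), w=np.array([0.5, 0.5]), dilation=2)

# Stacking layers with dilations 1, 2, 4, 8, ... grows the receptive field exponentially
# while the number of parameters grows only linearly with depth.
```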
Bai, Kolter, and Koltun (2018) demonstrated that TCNs outperform LSTMs and GRUs on several standard sequence modeling benchmarks, particularly on tasks requiring very long memory spans. TCNs also offer advantages in parallelization, since all output positions can be computed simultaneously during training (unlike the sequential computation required by RNNs).
WaveNet, introduced by DeepMind in 2016, is a notable TCN-based architecture originally designed for raw audio generation. It uses stacked dilated causal convolutions and has been applied to text-to-speech synthesis and music generation.
Structured state space models (SSMs) represent a newer class of sequence models that draw on continuous-time dynamical systems theory. The Structured State Space Sequence model (S4), introduced by Albert Gu et al. in 2021, parameterizes a linear state space system and uses a special HiPPO (High-order Polynomial Projection Operator) initialization to efficiently capture long-range dependencies with linear complexity in sequence length.
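At its core, an SSM layer applies a linear recurrence to each input channel. The sketch below uses fixed, randomly chosen matrices purely for illustration; S4's contribution is the structured, HiPPO-initialized parameterization and an efficient way to evaluate this recurrence as a convolution:

```python
import numpy as np

def linear_ssm(A, B, C, u):
    """Run a discretized linear state space model over a scalar input sequence u.

    x_k = A @ x_{k-1} + B * u_k   (state update)
    y_k = C @ x_k                 (output)
    """
    x = np.zeros(A.shape[0])
    ys = []
    for u_k in u:                 # linear recurrence; equivalently a long 1-D convolution
        x = A @ x + B * u_k
        ys.append(C @ x)
    return np.array(ys)

rng = np.random.default_rng(0)
N = 8                             # state dimension
A = rng.normal(scale=0.1, size=(N, N))
B = rng.normal(size=N)
C = rng.normal(size=N)
y = linear_ssm(A, B, C, u=rng.normal(size=100))
```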
Mamba, introduced by Gu and Tri Dao in December 2023, extends S4 by making the state space parameters input-dependent (selective), allowing the model to filter information based on content rather than using fixed dynamics. Mamba achieves 5x higher inference throughput than similarly sized Transformers and scales linearly with sequence length. On language modeling benchmarks, Mamba-3B matches or outperforms Transformers twice its size.
Hybrid architectures combining SSM layers with attention layers (such as Jamba by AI21 Labs) have emerged as a way to combine the efficiency of SSMs with the strong in-context learning capabilities of Transformers.
| Architecture | Year introduced | Key mechanism | Sequence length handling | Parallelizable | Parameters (relative) | Main limitations |
|---|---|---|---|---|---|---|
| HMM | 1960s | Markov transitions + emission probabilities | Fixed-order Markov assumption | Yes (forward algorithm) | Low | Cannot model long-range dependencies; requires independence assumptions |
| Vanilla RNN | 1986/1990 | Recurrent hidden state | Poor on long sequences | No | Low | Vanishing/exploding gradients |
| LSTM | 1997 | Gated memory cell | Good (hundreds of steps) | No | Medium | Sequential processing; slow training on long sequences |
| GRU | 2014 | Update and reset gates | Good (moderate sequences) | No | Medium (fewer than LSTM) | Sequential processing; may underperform LSTM on very long sequences |
| Seq2seq (with attention) | 2014-2015 | Encoder-decoder + attention | Improved over basic seq2seq | Partially | Medium-High | Still relies on sequential encoder/decoder |
| Transformer | 2017 | Self-attention | Quadratic cost, but handles long-range well | Yes | High | O(n^2) memory and compute in sequence length |
| TCN | 2016-2018 | Dilated causal convolutions | Good (via dilation) | Yes | Medium | Fixed receptive field; less flexible than attention |
| SSM (S4/Mamba) | 2021-2023 | Structured state space | Linear complexity | Yes | Medium | Relatively new; less mature ecosystem |
Several specialized techniques are used when training sequence models.
BPTT is the standard algorithm for computing gradients in recurrent networks. It works by unrolling the recurrent computation graph across all time steps and then applying standard backpropagation. The computational and memory costs of BPTT grow linearly with sequence length.
Truncated BPTT addresses scalability by dividing the sequence into shorter segments and computing gradients within each segment. This introduces a bias (the model cannot learn dependencies longer than the segment length) but makes training practical for long sequences.
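A minimal PyTorch sketch of truncated BPTT; the model, the segment length of 50, and the random data are placeholders. The key step is detaching the hidden state at each segment boundary so gradients do not flow further back:

```python
import torch
import torch.nn as nn

model = nn.RNN(input_size=16, hidden_size=32, batch_first=True)
readout = nn.Linear(32, 1)
optimizer = torch.optim.Adam(list(model.parameters()) + list(readout.parameters()))

sequence = torch.randn(8, 1000, 16)     # (batch, time, features), placeholder data
targets = torch.randn(8, 1000, 1)
segment_len = 50

h = None
for start in range(0, sequence.size(1), segment_len):
    x_seg = sequence[:, start:start + segment_len]
    y_seg = targets[:, start:start + segment_len]
    out, h = model(x_seg, h)
    loss = nn.functional.mse_loss(readout(out), y_seg)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    h = h.detach()                       # truncate: no gradient flows across segment boundaries
```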
During training of autoregressive sequence models, teacher forcing feeds the ground-truth output at each time step as input to the next step, rather than using the model's own predictions. This stabilizes training and speeds convergence but can cause a mismatch between training (where the model sees correct inputs) and inference (where it sees its own potentially incorrect predictions). Techniques like scheduled sampling (Bengio et al., 2015) gradually transition from teacher forcing to model predictions during training to reduce this discrepancy.
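A sketch of a decoding loop that interpolates between teacher forcing and feeding back the model's own predictions; the embedding, GRU cell, and output projection are illustrative stand-ins for a real decoder:

```python
import random
import torch
import torch.nn as nn

vocab_size, hidden_size = 100, 32
embed = nn.Embedding(vocab_size, hidden_size)
decoder_cell = nn.GRUCell(hidden_size, hidden_size)
out_proj = nn.Linear(hidden_size, vocab_size)

def decode(target_tokens, h, teacher_forcing_prob=1.0):
    """target_tokens: (T,) ground-truth token ids; h: (hidden_size,) initial decoder state."""
    inp = target_tokens[0]
    logits_per_step = []
    for t in range(1, len(target_tokens)):
        h = decoder_cell(embed(inp).unsqueeze(0), h.unsqueeze(0)).squeeze(0)
        logits = out_proj(h)
        logits_per_step.append(logits)
        if random.random() < teacher_forcing_prob:
            inp = target_tokens[t]       # teacher forcing: feed the ground-truth token
        else:
            inp = logits.argmax()        # scheduled sampling: feed the model's own prediction
    return torch.stack(logits_per_step)

tokens = torch.randint(0, vocab_size, (12,))
logits = decode(tokens, torch.zeros(hidden_size), teacher_forcing_prob=0.75)
```

Scheduled sampling anneals `teacher_forcing_prob` from 1.0 toward a lower value as training progresses.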
Gradient clipping limits the magnitude of gradients during training to prevent the exploding gradient problem. When the norm of the gradient vector exceeds a threshold, it is rescaled to the threshold value. Pascanu, Mikolov, and Bengio (2013) showed that gradient clipping is effective for stabilizing RNN training.
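In PyTorch, clipping the global gradient norm is a single call between the backward pass and the optimizer step; the model, data, and threshold of 1.0 below are placeholders:

```python
import torch
import torch.nn as nn

model = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(4, 100, 8)              # placeholder batch
out, _ = model(x)
loss = out.pow(2).mean()                # placeholder loss

optimizer.zero_grad()
loss.backward()
total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()                        # the gradients applied here have norm at most 1.0
```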
CTC, introduced by Graves et al. in 2006, is a training criterion for sequence labeling problems where the alignment between input and output sequences is unknown. CTC marginalizes over all possible alignments, allowing the model to learn directly from unsegmented data. It has been widely adopted in speech recognition and handwriting recognition systems. Unlike HMM-based approaches, CTC does not require pre-segmented training data or an external alignment model.
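A sketch of CTC training with PyTorch's built-in loss; the shapes (50 input frames, 10-token labels, 28 classes with the blank at index 0) are illustrative:

```python
import torch
import torch.nn as nn

T, batch, num_classes = 50, 4, 28       # e.g. 26 letters + space, plus the CTC blank at index 0
log_probs = torch.randn(T, batch, num_classes, requires_grad=True).log_softmax(dim=-1)
targets = torch.randint(1, num_classes, (batch, 10))          # label sequences contain no blanks
input_lengths = torch.full((batch,), T, dtype=torch.long)
target_lengths = torch.full((batch,), 10, dtype=torch.long)

ctc = nn.CTCLoss(blank=0)               # marginalizes over all alignments of labels to frames
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()                          # in practice, log_probs would come from the network
```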
In curriculum learning, training examples are presented in order of increasing difficulty, starting with shorter or simpler sequences and progressing to longer or more complex ones. This approach can improve convergence speed and final model performance, particularly for sequence models that struggle with long-range dependencies early in training.
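A minimal length-based curriculum sketch; the number of stages and the use of sequence length as the difficulty measure are assumptions of this example:

```python
def curriculum_stages(examples, num_stages=4):
    """examples: list of (sequence, label) pairs.
    Yields progressively larger training pools, adding longer sequences at each stage."""
    ordered = sorted(examples, key=lambda ex: len(ex[0]))   # shortest (easiest) first
    stage_size = max(1, len(ordered) // num_stages)
    for stage in range(1, num_stages + 1):
        yield ordered[: stage * stage_size]                 # early stages see only short sequences
```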
Sequence models are applied across a broad range of domains.
Natural language processing (NLP) is the most prominent application area for sequence models. Key tasks include machine translation, language modeling and text generation, text classification and sentiment analysis, question answering, named entity recognition, and summarization.
Speech recognition systems convert spoken language to text. The field transitioned from HMM-based systems to deep learning approaches over the 2010s:
Text-to-speech synthesis has similarly benefited from sequence models, with systems like WaveNet (2016) and Tacotron (2017) producing natural-sounding speech from text.
Sequence models are used to predict future values in time series data, including:
LSTMs and GRUs have been standard approaches for time series forecasting. More recently, Transformer-based models and foundation models for time series (such as Amazon Chronos, 2024) have shown competitive results, treating time series forecasting as a token prediction problem analogous to language modeling.
Sequence models process biological sequences in multiple ways:
Sequence models generate music and audio by learning patterns in sequential data:
In robotics and autonomous systems, sequence models process sequential sensor data and generate action sequences:
Sequence models are evaluated on domain-specific benchmarks:
| Domain | Benchmark | Metric | Description |
|---|---|---|---|
| Machine translation | WMT | BLEU | Measures n-gram overlap between machine and reference translations |
| Language modeling | WikiText-103, The Pile | Perplexity | Lower perplexity indicates better prediction of held-out text |
| Speech recognition | LibriSpeech | Word Error Rate (WER) | Percentage of incorrectly transcribed words |
| Sentiment analysis | SST-2, IMDb | Accuracy | Correct classification rate on positive/negative reviews |
| Question answering | SQuAD | F1 / Exact Match | Overlap between predicted and ground-truth answer spans |
| Time series | M4, M5 | MASE, sMAPE | Error metrics for forecasting accuracy |
| Long-range | Long Range Arena | Accuracy | Tests model performance on tasks requiring dependencies over 1,000+ steps |
The Long Range Arena benchmark (Tay et al., 2020) was specifically designed to compare sequence models on their ability to handle long-range dependencies, providing standardized tasks across different sequence lengths and modalities.
Several trends are shaping the development of sequence models as of 2025-2026: