See also: Machine learning terms
A recurrent neural network (RNN) is a class of artificial neural networks designed to process sequential data by maintaining an internal hidden state that carries information across time steps. Unlike feedforward neural networks, which map a fixed-size input to a fixed-size output in a single pass, RNNs contain cyclic connections that allow information to flow from one step of a computation to the next. This makes them naturally suited for tasks where the order of inputs matters, such as language modeling, speech recognition, and time series forecasting.
RNNs were among the first neural architectures to handle variable-length sequences, and they were among the most widely used architectures in natural language processing and sequence modeling from the late 1980s through the mid-2010s. Although transformer-based models have largely supplanted RNNs in many domains since 2017, the core ideas behind recurrent computation remain influential in deep learning, and newer architectures such as state space models are revisiting recurrent principles with modern techniques. Recurrent networks also continue to see use in resource-constrained settings and certain real-time processing tasks.
Imagine you are reading a story one word at a time. After each word, you update a little summary in your head of what the story is about so far. When you reach the next word, you use both that new word and your running summary to understand what is happening. A recurrent neural network works the same way. It reads a sequence of inputs (like words) one at a time, keeps a "memory" of what it has seen, and uses that memory along with each new input to make predictions. The memory is not perfect, though. If the story is very long, the network might forget details from the beginning, which is why researchers invented improved versions like LSTM and GRU that are better at remembering important things over long stretches. This same idea helps computers understand sentences or patterns in music, and even predict what might come next.
The concept of recurrent connections in neural networks dates back to the early days of connectionist research. John Hopfield introduced the Hopfield network in 1982, a form of recurrent network used as an associative memory. While not a sequence model in the modern sense, the Hopfield network demonstrated that recurrent connections could store and retrieve patterns.
In 1986, David Rumelhart, Geoffrey Hinton, and Ronald Williams published their influential work on backpropagation, which laid the groundwork for training multilayer networks. Michael Jordan proposed the "Jordan network" in 1986, where the output layer rather than the hidden layer was fed back as context. Shortly after, in 1990, Jeffrey Elman introduced the "Elman network" (sometimes called the "simple recurrent network"), which added a context layer that fed the previous hidden state back into the network at each time step. This architecture became the prototype for what is now called the "vanilla RNN." Both Elman and Jordan networks established the basic principle of using recurrence to model sequences.
The 1990s saw growing awareness of the difficulties in training RNNs on long sequences, particularly the vanishing gradient problem identified by Sepp Hochreiter in his 1991 diploma thesis and later formalized by Yoshua Bengio, Patrice Simard, and Paolo Frasconi in 1994. This led to the invention of Long Short-Term Memory (LSTM) networks by Hochreiter and Jurgen Schmidhuber in 1997, which became the dominant RNN variant for over a decade.
The Gated Recurrent Unit (GRU) was introduced by Kyunghyun Cho and colleagues in 2014 as a simpler alternative to LSTM. Around the same time, the development of sequence-to-sequence models by Ilya Sutskever, Oriol Vinyals, and Quoc Le (2014), along with the attention mechanism by Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio (2014), brought RNN-based architectures to new heights of performance in machine translation and other tasks.
The publication of "Attention Is All You Need" by Vaswani et al. in 2017 introduced the Transformer, which replaced recurrence entirely with self-attention. This marked the beginning of a shift away from RNNs in most large-scale NLP applications.
At each time step t, an RNN receives an input vector x_t and combines it with the hidden state h_{t-1} from the previous time step to produce a new hidden state h_t. The hidden state acts as the network's memory: it encodes a compressed summary of all inputs the network has processed so far. An optional output y_t can be computed from the hidden state at any time step.
Conceptually, the same set of weights is reused at every time step. This weight sharing allows an RNN to process sequences of any length and to generalize across different positions within a sequence, and it is a defining characteristic of the architecture.
The simplest (vanilla) RNN computes the hidden state and output as follows:
h_t = f(W_hh * h_{t-1} + W_xh * x_t + b_h)
y_t = g(W_hy * h_t + b_y)
Where:
| Symbol | Meaning |
|---|---|
| x_t | Input vector at time step t |
| h_t | Hidden state at time step t |
| h_{t-1} | Hidden state from the previous time step |
| W_xh | Weight matrix from input to hidden layer |
| W_hh | Weight matrix from hidden layer to hidden layer (recurrent weights) |
| W_hy | Weight matrix from hidden layer to output |
| b_h, b_y | Bias vectors |
| f | Activation function, typically tanh or ReLU |
| g | Output activation (e.g., softmax for classification) |
The hidden state h_0 is usually initialized to a zero vector at the start of a sequence.
The hidden state h_t is a vector of fixed dimensionality (chosen as a hyperparameter) that summarizes all information from the input sequence up to time step t. In theory, this allows the RNN to capture arbitrarily long-range dependencies. In practice, vanilla RNNs struggle to retain information over many time steps due to the vanishing gradient problem, which motivated the development of gated architectures like LSTM and GRU.
The hidden state is initialized (typically to a zero vector) at the start of each sequence. As the network processes each element, the hidden state is progressively updated, building a compressed representation of the sequence history.
To understand how an RNN processes a sequence, it helps to "unroll" (or "unfold") the network across time steps. Unrolling replaces the single recurrent cell with a chain of identical cells, one for each time step. Each cell receives the input at its time step and the hidden state from the previous cell, and passes its own hidden state to the next cell. For a sequence of length T, the unrolled network has T copies of the same recurrent cell, connected sequentially, with all copies sharing the same weights. When unrolled, an RNN resembles a very deep feedforward network with shared weights at each layer, which is the perspective used during training with backpropagation through time.
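The unrolled computation is just a loop that reuses the same weights at every step. A minimal NumPy sketch of the forward pass defined by the equations above (dimensions and random weights are illustrative):

```python
import numpy as np

def rnn_forward(xs, W_xh, W_hh, W_hy, b_h, b_y):
    """Unrolled forward pass of a vanilla RNN over a sequence xs.

    xs: list of input vectors x_t, each of shape (input_size,)
    Returns the outputs y_t and the final hidden state.
    """
    hidden_size = W_hh.shape[0]
    h = np.zeros(hidden_size)                       # h_0 initialized to zeros
    outputs = []
    for x_t in xs:                                  # same weights reused at every step
        h = np.tanh(W_hh @ h + W_xh @ x_t + b_h)    # h_t = f(W_hh h_{t-1} + W_xh x_t + b_h)
        y_t = W_hy @ h + b_y                        # y_t = g(W_hy h_t + b_y), here g = identity
        outputs.append(y_t)
    return outputs, h

# Illustrative sizes: input_size=3, hidden_size=5, output_size=2
rng = np.random.default_rng(0)
W_xh, W_hh = rng.normal(size=(5, 3)), rng.normal(size=(5, 5))
W_hy, b_h, b_y = rng.normal(size=(2, 5)), np.zeros(5), np.zeros(2)
ys, h_T = rnn_forward([rng.normal(size=3) for _ in range(4)], W_xh, W_hh, W_hy, b_h, b_y)
```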
RNNs are trained using backpropagation through time (BPTT), a direct extension of the standard backpropagation algorithm to sequences. The network is unrolled across all time steps of the sequence, a forward pass computes the hidden states, outputs, and loss, and the error is then backpropagated through the unrolled graph; because the same weights are shared across time steps, the gradient for each weight is the sum of its contributions from every step.
The computational cost of BPTT scales linearly with the sequence length T in both time and memory, since the activations at every time step must be stored for the backward pass. For very long sequences, this can become prohibitively expensive.
To reduce memory and computation costs, practitioners often use truncated backpropagation through time. Instead of unrolling the entire sequence, truncated BPTT divides the sequence into shorter segments (for example, 20 or 50 time steps) and backpropagates gradients only within each segment. The hidden state is carried forward from one segment to the next (maintaining continuity), but gradients are not propagated across segment boundaries.
Truncated BPTT introduces a tradeoff: it reduces memory usage and speeds up training, but it limits the network's ability to learn dependencies longer than the truncation window. In practice, this is often an acceptable compromise, especially for tasks where the most relevant context is relatively local. Truncated BPTT was widely used in practice for training language models and other RNN applications on long documents or continuous data streams.
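A minimal sketch of truncated BPTT in PyTorch, using an illustrative GRU layer and arbitrary sizes; the key step is detaching the hidden state at each segment boundary so gradients do not flow across segments:

```python
import torch
import torch.nn as nn

# Illustrative model: a GRU layer followed by a linear readout (sizes are arbitrary).
rnn = nn.GRU(input_size=8, hidden_size=32)
readout = nn.Linear(32, 8)
optimizer = torch.optim.Adam(list(rnn.parameters()) + list(readout.parameters()), lr=1e-3)
loss_fn = nn.MSELoss()

inputs = torch.randn(200, 4, 8)     # (seq_len, batch, input_size)
targets = torch.randn(200, 4, 8)

k = 20                              # truncation length (segment size)
hidden = None
for start in range(0, inputs.size(0), k):
    seg_x, seg_y = inputs[start:start + k], targets[start:start + k]
    optimizer.zero_grad()
    out, hidden = rnn(seg_x, hidden)        # hidden state carried across segments
    loss = loss_fn(readout(out), seg_y)
    loss.backward()                         # gradients stop at the segment boundary
    optimizer.step()
    hidden = hidden.detach()                # keep the value, cut the gradient path
```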
Teacher forcing is a training strategy commonly used with RNNs that generate sequences, such as language model decoders and machine translation systems. During training, instead of feeding the model's own output from the previous time step as the next input, teacher forcing supplies the ground-truth token from the training data.
This approach speeds up convergence because the model does not have to recover from its own early mistakes during training. However, it introduces a mismatch between training and inference known as exposure bias: at inference time, the model must rely on its own predictions, which may differ from the ground-truth tokens it was trained on. Small prediction errors can compound over a long generated sequence, leading to degraded output quality.
Several techniques have been proposed to mitigate exposure bias, including scheduled sampling (gradually transitioning from teacher forcing to model predictions during training) and professor forcing, which uses adversarial training to align the hidden-state dynamics of teacher-forced and free-running modes.
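The contrast between the two decoding modes can be sketched as follows; `decoder_step`, `embed`, and `project` are hypothetical stand-ins for a recurrent cell, an embedding lookup, and an output projection, not functions from any particular library:

```python
import torch

def decode_with_teacher_forcing(decoder_step, embed, project, targets, h):
    """Training-time decoding: the ground-truth token from `targets` (seq_len, batch)
    is fed as the input at each step, regardless of what the model predicted."""
    logits = []
    for t in range(targets.size(0) - 1):
        x_t = embed(targets[t])              # ground-truth token at step t
        h = decoder_step(x_t, h)
        logits.append(project(h))            # predicts the token at step t+1
    return torch.stack(logits), h

def decode_free_running(decoder_step, embed, project, start_token, h, max_len):
    """Inference-time decoding: each predicted token becomes the next input,
    so early mistakes can compound (exposure bias)."""
    token, outputs = start_token, []
    for _ in range(max_len):
        h = decoder_step(embed(token), h)
        token = project(h).argmax(dim=-1)    # model's own prediction
        outputs.append(token)
    return torch.stack(outputs), h
```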
The most significant challenge in training vanilla RNNs is the vanishing gradient problem. When gradients are backpropagated through many time steps, they are repeatedly multiplied by the recurrent weight matrix W_hh and the derivative of the activation function. If the largest eigenvalue (or singular value) of W_hh is less than 1, gradients shrink exponentially with the number of time steps, so the error signal from distant outputs never reaches earlier hidden states and the network cannot learn long-range dependencies. Conversely, if the largest eigenvalue exceeds 1, gradients can grow exponentially, a condition known as the exploding gradient problem. Exploding gradients cause numerical instability and can make training diverge entirely.
These problems were formally analyzed by Yoshua Bengio, Patrice Simard, and Paolo Frasconi in 1994. Their analysis showed that for vanilla RNNs, the influence of an input on the hidden state decays (or explodes) exponentially with the temporal distance, and that the ability to learn long-range dependencies decreases exponentially with the length of the dependency. This analysis motivated the development of gated RNN architectures.
| Technique | Description | Addresses |
|---|---|---|
| Gradient clipping | Rescale gradients when their norm exceeds a threshold | Exploding gradients |
| LSTM / GRU gates | Gating mechanisms that control information flow and maintain stable gradients | Vanishing gradients |
| Orthogonal initialization | Initialize W_hh as an orthogonal matrix so eigenvalues start near 1 | Both |
| Skip connections | Add direct connections across multiple time steps | Vanishing gradients |
| Batch normalization / layer normalization | Normalize activations to stabilize training dynamics | Both |
| Gradient regularization | Add a penalty to encourage gradients to remain in a stable range | Both |
| Truncated BPTT | Limit the number of time steps for backpropagation | Exploding gradients (indirectly) |
Gradient clipping, proposed by Tomas Mikolov, is one of the simplest and most widely used techniques. It sets a maximum threshold for the gradient norm; if the gradient exceeds this threshold, it is scaled down proportionally. This prevents the extreme parameter updates caused by exploding gradients but does not solve the vanishing gradient problem. Of the listed techniques, gated architectures (LSTM and GRU) have proven the most effective and widely adopted solution to vanishing gradients.
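In PyTorch, for example, norm-based clipping is applied between the backward pass and the optimizer step. A minimal sketch with an illustrative model and an arbitrary threshold of 1.0:

```python
import torch
import torch.nn as nn

model = nn.LSTM(input_size=8, hidden_size=16)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(5, 2, 8)                 # (seq_len, batch, input_size)
target = torch.randn(5, 2, 16)

output, _ = model(x)
loss = nn.functional.mse_loss(output, target)
loss.backward()
# Rescale all gradients so their global L2 norm does not exceed the threshold.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```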
LSTM is a gated RNN architecture introduced by Sepp Hochreiter and Jurgen Schmidhuber in 1997 to address the vanishing gradient problem. The key innovation is a cell state (also called the memory cell) that runs through the entire sequence like a conveyor belt, with information added or removed through learned gating mechanisms. This design allows LSTMs to learn dependencies spanning hundreds or even thousands of time steps.
An LSTM cell contains three gates and a cell state. Each gate is implemented as a sigmoid layer followed by element-wise multiplication:
| Component | Function | Equation |
|---|---|---|
| Forget gate (f_t) | Decides what information to discard from the cell state | f_t = sigma(W_f * [h_{t-1}, x_t] + b_f) |
| Input gate (i_t) | Decides what new information to store in the cell state | i_t = sigma(W_i * [h_{t-1}, x_t] + b_i) |
| Candidate values | Creates a vector of new candidate values | C_tilde_t = tanh(W_C * [h_{t-1}, x_t] + b_C) |
| Cell state update | Combines old cell state with new candidates | C_t = f_t * C_{t-1} + i_t * C_tilde_t |
| Output gate (o_t) | Decides what part of the cell state to output | o_t = sigma(W_o * [h_{t-1}, x_t] + b_o) |
| Hidden state | Filtered version of the cell state | h_t = o_t * tanh(C_t) |
The forget gate outputs values between 0 and 1 for each element of the cell state; a value of 1 means "keep this entirely" and 0 means "discard this completely." The input gate and candidate values together determine what new information is written to the cell state. The output gate controls which parts of the cell state are exposed as the hidden state for downstream computation.
Because the cell state update involves only element-wise multiplication and addition (no matrix multiplication by W_hh), gradients can flow through the cell state with minimal attenuation. The forget gate allows the network to explicitly "reset" parts of the cell state when they are no longer relevant. This is the core mechanism that mitigates the vanishing gradient problem in LSTMs.
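The gate equations in the table translate directly into code. A minimal NumPy sketch of a single LSTM step, with each weight matrix acting on the concatenation [h_{t-1}, x_t]:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W_f, W_i, W_C, W_o, b_f, b_i, b_C, b_o):
    """One LSTM step following the gate equations above.
    Each W_* has shape (hidden_size, hidden_size + input_size)."""
    z = np.concatenate([h_prev, x_t])           # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)                # forget gate
    i_t = sigmoid(W_i @ z + b_i)                # input gate
    C_tilde = np.tanh(W_C @ z + b_C)            # candidate values
    C_t = f_t * C_prev + i_t * C_tilde          # cell state update (element-wise only)
    o_t = sigmoid(W_o @ z + b_o)                # output gate
    h_t = o_t * np.tanh(C_t)                    # hidden state
    return h_t, C_t
```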
Several variants of the original LSTM have been proposed, including LSTMs with peephole connections (Gers and Schmidhuber, 2000), which let the gates inspect the cell state directly, and variants that couple the forget and input gates.
Research by Klaus Greff et al. (2017) systematically evaluated LSTM variants and found that the standard LSTM architecture performed well across tasks, with no single variant consistently outperforming it. The forget gate bias initialization (setting it to 1 so the network defaults to remembering) was found to be one of the most important practical considerations.
The Gated Recurrent Unit (GRU) was introduced by Kyunghyun Cho and colleagues in 2014 as a simpler alternative to LSTM. GRUs achieve comparable performance to LSTMs on many tasks while using fewer parameters and less computation. The GRU merges the cell state and hidden state into a single state vector and uses only two gates instead of three.
A GRU has two gates:
| Component | Function | Equation |
|---|---|---|
| Update gate (z_t) | Controls how much of the previous hidden state to retain | z_t = sigma(W_z * [h_{t-1}, x_t] + b_z) |
| Reset gate (r_t) | Controls how much of the previous hidden state to forget when computing the candidate | r_t = sigma(W_r * [h_{t-1}, x_t] + b_r) |
| Candidate hidden state | Computes a proposed new hidden state | h_tilde_t = tanh(W * [r_t * h_{t-1}, x_t] + b) |
| Hidden state update | Interpolates between old and candidate hidden states | h_t = (1 - z_t) * h_{t-1} + z_t * h_tilde_t |
Notably, the GRU does not maintain a separate cell state. All memory is stored directly in the hidden state. The update gate serves a role similar to both the forget and input gates of the LSTM, while the reset gate determines how much past information to incorporate into the candidate computation. When the reset gate is close to 0, the network behaves as if it is reading the first symbol of a sequence, effectively allowing it to drop irrelevant past information. The update gate z_t acts as a direct interpolation between the old state and the candidate state: when z_t is close to 0, the hidden state is largely copied from the previous step; when z_t is close to 1, the hidden state is replaced by the new candidate.
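For comparison with the LSTM step above, a minimal NumPy sketch of one GRU step following the same conventions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W_z, W_r, W_h, b_z, b_r, b_h):
    """One GRU step. Each W_* has shape (hidden_size, hidden_size + input_size)."""
    z_in = np.concatenate([h_prev, x_t])                                # [h_{t-1}, x_t]
    z_t = sigmoid(W_z @ z_in + b_z)                                     # update gate
    r_t = sigmoid(W_r @ z_in + b_r)                                     # reset gate
    h_tilde = np.tanh(W_h @ np.concatenate([r_t * h_prev, x_t]) + b_h)  # candidate state
    h_t = (1 - z_t) * h_prev + z_t * h_tilde                            # interpolate old/new
    return h_t
```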
| Feature | LSTM | GRU |
|---|---|---|
| Number of gates | 3 (forget, input, output) | 2 (update, reset) |
| Separate cell state | Yes | No |
| Parameters per unit | More (roughly 4x hidden size^2) | Fewer (roughly 3x hidden size^2) |
| Training speed | Slower | Faster |
| Long-range dependencies | Generally better on complex sequences | Comparable on shorter or simpler sequences |
| Memory usage | Higher | Lower |
The GRU has fewer parameters than the LSTM, which makes it faster to train and less prone to overfitting on small datasets. Empirical studies have found that neither architecture consistently outperforms the other across all tasks. GRUs tend to do better when data is limited or sequences are shorter, while LSTMs often have an edge on tasks requiring very long-term memory. In practice, the choice between GRU and LSTM is often made based on computational budget and dataset size.
| Feature | Vanilla RNN | LSTM | GRU |
|---|---|---|---|
| Year introduced | ~1990 (Elman) | 1997 (Hochreiter & Schmidhuber) | 2014 (Cho et al.) |
| Number of gates | 0 | 3 (forget, input, output) | 2 (update, reset) |
| Separate cell state | No | Yes | No |
| Parameters per unit | Low | High | Medium |
| Long-range dependencies | Poor | Strong | Good |
| Training speed | Fast | Slow | Moderate |
| Risk of vanishing gradients | High | Low | Low |
| Best suited for | Short sequences, simple patterns | Long sequences, complex dependencies | Medium-length sequences, limited compute |
| Common use cases | Toy problems, educational examples | Machine translation, speech recognition, text generation | Similar to LSTM; preferred when speed matters |
The earliest practical RNN architectures were simple recurrent networks (SRNs). Jeffrey Elman introduced the Elman network in 1990, where the hidden state from the previous time step is copied to a set of "context units" that feed back into the hidden layer. Michael Jordan proposed the Jordan network in 1986, which instead feeds the output from the previous time step back to the hidden layer through context units.
| Architecture | Recurrent connection | Year |
|---|---|---|
| Jordan network | Output to hidden (via context units) | 1986 |
| Elman network | Hidden to hidden (via context units) | 1990 |
Both architectures were foundational in demonstrating that recurrent networks could learn temporal structure, but they struggled with long sequences due to the vanishing gradient problem.
A bidirectional RNN (BiRNN), introduced by Mike Schuster and Kuldip Paliwal in 1997, processes the input sequence in both the forward and backward directions using two separate hidden states. The forward hidden state captures information from past inputs, while the backward hidden state captures information from future inputs. At each time step, the two hidden states are typically concatenated to form the final representation.
Bidirectional RNNs are particularly useful for tasks where the entire input sequence is available before making predictions, such as named entity recognition, part-of-speech tagging, and speech recognition. The motivation for bidirectional processing is that in many tasks, the meaning of a particular element depends on both its past and future context. For example, the correct part-of-speech tag for a word may depend on words that appear both before and after it.
Bidirectional RNNs can use any recurrent cell type (vanilla RNN, LSTM, or GRU) in each direction. Bidirectional LSTMs (BiLSTMs) became the dominant architecture for many NLP tasks in the mid-2010s, including named entity recognition, sentiment analysis, question answering, and syntactic parsing.
A limitation of bidirectional RNNs is that they cannot be used in settings that require causal (left-to-right) generation or real-time/streaming applications, since the backward pass requires access to future inputs.
Stacking multiple RNN layers on top of each other creates a deep RNN (also called a stacked RNN). The hidden state output of one RNN layer serves as the input sequence for the next layer. Deep RNNs can learn hierarchical representations of sequential data: lower layers might capture local patterns (like phonemes in speech), while higher layers capture more abstract features (like words or phrases).
In practice, deep RNNs with 2 to 4 layers often outperform single-layer RNNs, but performance gains diminish with additional depth. Residual connections, highway connections, and layer normalization are commonly used between layers of deep RNNs to stabilize training and facilitate gradient flow, borrowing ideas from deep feedforward and convolutional neural network architectures. Deep RNNs were found to improve performance on tasks such as speech recognition and machine translation, though they are more difficult to train than single-layer models.
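Both stacking and bidirectionality are exposed as simple options in common frameworks. For instance, a two-layer bidirectional LSTM in PyTorch (all sizes here are illustrative):

```python
import torch
import torch.nn as nn

# Two stacked LSTM layers, each processing the sequence in both directions.
rnn = nn.LSTM(input_size=32, hidden_size=64, num_layers=2,
              bidirectional=True, batch_first=True)

x = torch.randn(8, 20, 32)        # (batch, seq_len, input_size)
output, (h_n, c_n) = rnn(x)
print(output.shape)               # (8, 20, 128): forward and backward states concatenated
print(h_n.shape)                  # (4, 8, 64): num_layers * num_directions final states
```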
The sequence-to-sequence (Seq2Seq) model was introduced independently by Ilya Sutskever, Oriol Vinyals, and Quoc V. Le (2014) and by Kyunghyun Cho et al. (2014). A Seq2Seq model maps variable-length input sequences to variable-length output sequences and consists of two RNNs: an encoder, which reads the input sequence and compresses it into a fixed-length context vector, and a decoder, which generates the output sequence one token at a time conditioned on that vector.
Seq2Seq models achieved breakthrough results in machine translation, text summarization, and conversational AI, and became the foundation for neural machine translation (NMT) systems. They were also applied to dialogue systems and code generation. However, the fixed-length context vector creates an information bottleneck: for long input sequences, the encoder must compress all relevant information into a single vector, which inevitably loses detail.
The attention mechanism, proposed by Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio in 2014, addressed the context vector bottleneck in Seq2Seq models. Instead of relying on a single fixed context vector, the decoder is allowed to attend to all encoder hidden states at each decoding step, computing a weighted sum where the weights reflect the relevance of each encoder state to the current decoding position.
Bahdanau attention uses an additive scoring function: the decoder hidden state and each encoder hidden state are passed through separate linear layers, summed, and fed through a tanh activation to produce alignment scores. These scores are then normalized with softmax to produce attention weights.
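A minimal sketch of this additive scoring in PyTorch; the layer names and dimensions are illustrative rather than taken from the original paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdditiveAttention(nn.Module):
    """Bahdanau-style additive attention over encoder hidden states."""
    def __init__(self, dec_dim, enc_dim, attn_dim):
        super().__init__()
        self.W_dec = nn.Linear(dec_dim, attn_dim, bias=False)
        self.W_enc = nn.Linear(enc_dim, attn_dim, bias=False)
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, dec_state, enc_states):
        # dec_state: (batch, dec_dim); enc_states: (batch, src_len, enc_dim)
        scores = self.v(torch.tanh(
            self.W_dec(dec_state).unsqueeze(1) + self.W_enc(enc_states)
        )).squeeze(-1)                           # (batch, src_len) alignment scores
        weights = F.softmax(scores, dim=-1)      # attention weights
        context = torch.bmm(weights.unsqueeze(1), enc_states).squeeze(1)
        return context, weights

attn = AdditiveAttention(dec_dim=64, enc_dim=128, attn_dim=32)
context, weights = attn(torch.randn(2, 64), torch.randn(2, 7, 128))
```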
This innovation dramatically improved translation quality for long sentences and became a standard component of RNN-based Seq2Seq systems. It also provided interpretability, as the attention weights could be visualized to show which input elements the model was focusing on. The success of attention in RNN-based models directly inspired the development of the self-attention mechanism in transformers, which dispensed with recurrence altogether.
RNNs (and their gated variants) have been applied to a wide range of sequential tasks:
| Application domain | Examples | Typical architecture |
|---|---|---|
| Natural language processing | Language modeling, sentiment analysis, text classification | LSTM / GRU with attention |
| Machine translation | Translating between languages | Seq2Seq with Bahdanau or Luong attention |
| Speech recognition | Converting audio waveforms to text | Bidirectional LSTM, CTC loss |
| Time series forecasting | Stock prices, weather prediction, energy demand | Stacked LSTM / GRU |
| Music generation | Composing melodies and harmonies | Character-level LSTM |
| Handwriting recognition | Converting handwritten text to digital characters | Bidirectional LSTM with CTC |
| Video analysis | Activity recognition, video captioning | CNN encoder + LSTM decoder |
| Bioinformatics | Protein structure prediction, gene sequence analysis | Bidirectional GRU / LSTM |
A language model assigns probabilities to sequences of words. RNN-based language models process text one word (or character) at a time, using the hidden state to maintain context. At each step, the model predicts the probability distribution over the next word given all previous words.
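A minimal sketch of such a model in PyTorch, with an embedding layer, an LSTM, and a softmax over the vocabulary (all sizes are illustrative):

```python
import torch
import torch.nn as nn

class RNNLanguageModel(nn.Module):
    """Predicts a distribution over the next token given the tokens so far."""
    def __init__(self, vocab_size=10000, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens, state=None):
        # tokens: (batch, seq_len) integer ids
        h, state = self.lstm(self.embed(tokens), state)
        return self.out(h), state              # logits over the next token at each position

model = RNNLanguageModel()
logits, _ = model(torch.randint(0, 10000, (2, 12)))
probs = torch.softmax(logits, dim=-1)          # P(next word | previous words)
```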
RNN language models significantly outperformed traditional n-gram models, particularly for capturing longer-range dependencies. Tomas Mikolov's work on RNN-based language models (2010-2012) demonstrated large improvements in perplexity (a standard metric for language models) compared to n-gram baselines.
Before the rise of Transformer-based models like GPT and BERT, LSTM-based language models represented the state of the art. OpenAI's early language model research used LSTMs before switching to Transformers.
Neural machine translation using RNN-based seq2seq models with attention became the dominant approach to machine translation from 2014 to 2017. Google's Neural Machine Translation system (GNMT), deployed in 2016, used a deep LSTM encoder-decoder with attention and achieved near-human-level translation quality on several language pairs. This system replaced Google's previous phrase-based statistical machine translation system.
RNNs are well suited for speech recognition because speech is inherently sequential. Deep bidirectional LSTMs became the standard acoustic model in automatic speech recognition (ASR) systems. Connectionist Temporal Classification (CTC), a training criterion designed for sequence labeling with RNNs, was introduced by Alex Graves et al. in 2006 and became widely used in end-to-end speech recognition.
Baidu's Deep Speech (2014) and Deep Speech 2 (2015) systems demonstrated that deep RNNs trained on large datasets could achieve competitive speech recognition performance with simplified pipelines. Google's voice search and dictation systems also used LSTM-based models extensively.
RNNs have been applied to forecasting future values in time series data, including stock prices, energy demand, weather patterns, and sensor readings. LSTMs are particularly popular for time series tasks because of their ability to capture both short-term and long-term patterns. However, more recent approaches using Transformers and specialized architectures (such as N-BEATS and Temporal Fusion Transformers) have shown competitive or superior performance on many time series benchmarks.
Beyond these, RNNs have also been applied to music generation, handwriting recognition, video captioning and activity recognition, and bioinformatics tasks such as protein structure prediction, as summarized in the applications table above.
From roughly 2013 to 2018, RNNs (particularly LSTMs and BiLSTMs) were the dominant architecture for most NLP tasks. Key milestones during this period include the sequence-to-sequence and attention-based neural machine translation models of 2014, the deployment of Google's Neural Machine Translation system in 2016, and LSTM-based pretraining approaches such as ELMo and ULMFiT in 2018.
RNNs also played a central role in early neural approaches to parsing, text classification, question answering, and dialogue systems.
The introduction of the transformer architecture by Vaswani et al. in 2017 ("Attention Is All You Need") fundamentally shifted the landscape of sequence modeling. Transformers replaced the recurrent computation of RNNs with self-attention, which computes relationships between all positions in a sequence simultaneously.
| Aspect | RNNs | Transformers |
|---|---|---|
| Processing order | Sequential (one step at a time) | Parallel (all positions at once) |
| Training parallelism | Limited by sequential dependency | Fully parallelizable on GPUs/TPUs |
| Long-range dependencies | Difficult (vanishing gradients, even with LSTM) | Handled natively by self-attention |
| Computational complexity per layer | O(T * d^2) | O(T^2 * d) |
| Memory at inference | O(d) constant state | O(T * d) grows with context length (KV cache) |
| Scalability | Hard to scale beyond ~1B parameters | Scales to hundreds of billions of parameters |
| Inductive bias | Strong sequential bias | No inherent sequential bias (uses positional encoding) |
The key advantage of transformers is parallelization during training. RNNs must process each token sequentially because computing h_t requires h_{t-1}. Transformers compute attention over all tokens in parallel, enabling efficient use of modern GPU hardware and making it practical to train on far larger datasets with far larger models. This parallelization advantage is the primary reason transformers enabled the jump from hundreds-of-millions-parameter models to trillion-parameter models. Self-attention also connects every position in the sequence directly to every other position, avoiding the long gradient paths that make it difficult for RNNs to learn distant dependencies.
By 2019, Transformer-based models had surpassed RNNs on virtually all major NLP benchmarks. Large language models (LLMs) such as GPT-2, GPT-3, and BERT are all based on the Transformer architecture. In speech recognition and machine translation, Transformers also gradually replaced RNN-based systems.
However, RNNs retain an advantage at inference time for autoregressive generation: they maintain a fixed-size hidden state, giving them O(1) memory per step, whereas transformers must store and attend over a key-value cache that grows linearly with context length.
Despite the dominance of transformers, the inference-time efficiency of recurrent models has motivated a resurgence of interest in recurrent-style architectures.
The S4 model (Gu et al., 2021) introduced structured state space models that use linear recurrence with carefully parameterized state matrices. S4 resolved the gradient stability issues of classical RNNs by leveraging HiPPO (High-order Polynomial Projection Operators) initialization, enabling stable modeling of sequences with tens of thousands of steps.
Mamba (Gu and Dao, 2023) built on S4 by introducing selective state spaces, in which the SSM parameters (the input and output projections B and C, and the discretization step) are functions of the current input rather than fixed. This gives the model content-aware filtering, analogous to a gating mechanism. Mamba achieves linear-time complexity O(T), roughly 5x higher inference throughput than comparably sized transformers, and strong performance across language, audio, and genomics. The Mamba-3B model outperformed transformers of the same size on several benchmarks.
Mamba-2 (2024) introduced the State Space Duality (SSD) framework, proving mathematically that SSMs and attention are dual representations of the same underlying computation on structured matrices.
RWKV (Receptance Weighted Key Value, 2023) combines the parallelizable training of transformers with the efficient inference of RNNs. It uses a linear attention mechanism and can be formulated as either a transformer (for parallel training) or an RNN (for efficient inference). RWKV-7 "Goose" offers linear-time complexity, constant memory at inference, and competitive performance with full-attention models in the 0.7B to 1.5B parameter range.
Sepp Hochreiter, co-inventor of the original LSTM, returned to the architecture with xLSTM (2024). xLSTM incorporates exponential gating, state expansion, and normalization techniques, offering two specialized modules: sLSTM (scalar memory) and mLSTM (matrix memory). While still in the research phase, xLSTM demonstrates that the core LSTM principles remain viable when modernized.
Various approaches have also emerged that replace softmax attention with linear approximations, recovering RNN-like recurrence during inference. These developments suggest that the boundary between recurrent and attention-based models is blurring, with hybrid and linear-recurrent architectures seeking the best of both paradigms.
RNNs are supported by all major deep learning frameworks:
| Framework | RNN support | Key features |
|---|---|---|
| PyTorch | torch.nn.RNN, torch.nn.LSTM, torch.nn.GRU | Dynamic computation graphs, cuDNN acceleration, packed sequences for variable-length inputs |
| TensorFlow / Keras | tf.keras.layers.LSTM, tf.keras.layers.GRU, tf.keras.layers.SimpleRNN | Static and dynamic graphs, TensorFlow Lite for mobile deployment |
| JAX / Flax | Custom implementations via jax.lax.scan | Functional transformations, JIT compilation, TPU support |
| ONNX Runtime | LSTM and GRU operators | Cross-framework model deployment and optimization |
Historically, Theano (developed at the University of Montreal) was one of the first frameworks to support efficient RNN training and was used in much of the foundational RNN research, including the original seq2seq and attention papers. Torch (the Lua-based predecessor to PyTorch) was also widely used in early RNN research.
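As a concrete example of the variable-length-sequence support noted in the PyTorch row above, packed sequences let the recurrent layer skip padded positions (sizes are illustrative):

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

lstm = nn.LSTM(input_size=16, hidden_size=32, batch_first=True)

# Batch of 3 sequences with true lengths 5, 3, and 2, padded to length 5.
x = torch.randn(3, 5, 16)
lengths = torch.tensor([5, 3, 2])

packed = pack_padded_sequence(x, lengths, batch_first=True, enforce_sorted=True)
packed_out, (h_n, c_n) = lstm(packed)
output, out_lengths = pad_packed_sequence(packed_out, batch_first=True)
print(output.shape)   # (3, 5, 32); positions beyond each true length are zeros
```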
When working with RNNs, several practical considerations influence performance, including the gradient clipping threshold, the truncation length used for BPTT, how variable-length sequences are batched and padded (for example, with packed sequences), the hidden-state dimensionality, the number of stacked layers, and whether bidirectional processing is applicable to the task.
| Year | Milestone | Researchers |
|---|---|---|
| 1982 | Hopfield network: recurrent network for associative memory | John Hopfield |
| 1986 | Jordan network: first simple recurrent network | Michael Jordan |
| 1989 | Teacher forcing introduced for RNN training | Ronald J. Williams, David Zipser |
| 1990 | Elman network: SRN with hidden-to-hidden recurrence | Jeffrey Elman |
| 1990 | Backpropagation through time formalized | Paul Werbos |
| 1991 | Vanishing gradient problem identified | Sepp Hochreiter (diploma thesis) |
| 1994 | Formal analysis of vanishing gradient problem in RNNs | Yoshua Bengio, Patrice Simard, Paolo Frasconi |
| 1997 | Long Short-Term Memory (LSTM) introduced | Sepp Hochreiter, Jurgen Schmidhuber |
| 1997 | Bidirectional RNNs proposed | Mike Schuster, Kuldip Paliwal |
| 2000 | LSTM with peephole connections | Felix Gers, Jurgen Schmidhuber |
| 2006 | Connectionist Temporal Classification (CTC) | Alex Graves et al. |
| 2010 | RNN-based language models demonstrate large perplexity gains | Tomas Mikolov et al. |
| 2014 | GRU introduced; Seq2Seq models; Bahdanau attention | Cho et al.; Sutskever et al.; Bahdanau et al. |
| 2016 | Google Neural Machine Translation (GNMT) deployed | Wu et al. |
| 2017 | Transformer introduced, beginning RNN decline | Vaswani et al. |
| 2018 | ELMo and ULMFiT demonstrate LSTM-based pretraining | Peters et al.; Howard, Ruder |
| 2021 | S4 structured state space model | Albert Gu et al. |
| 2023 | Mamba: selective state spaces with linear-time inference; RWKV | Albert Gu, Tri Dao; Peng et al. |
| 2024 | xLSTM; Mamba-2; RWKV v6/v7 | Hochreiter et al.; Gu and Dao; Peng et al. |