See also: Machine learning terms
A recurrent neural network (RNN) is a class of artificial neural networks designed to process sequential data by maintaining an internal hidden state that carries information across time steps. Unlike feedforward neural networks, which map a fixed-size input to a fixed-size output in a single pass, RNNs contain cyclic connections that allow information to flow from one step of a computation to the next. This makes them naturally suited for tasks where the order of inputs matters, such as language modeling, speech recognition, and time series forecasting.
RNNs were among the first neural architectures to handle variable-length sequences, and they were among the most widely used architectures in natural language processing and sequence modeling from the late 1980s through the mid-2010s. Although transformer-based models have largely supplanted RNNs in many domains since 2017, the core ideas behind recurrent computation remain influential in deep learning, and newer architectures such as state space models are revisiting recurrent principles with modern techniques. Recurrent networks also continue to see use in resource-constrained settings and certain real-time processing tasks.
Imagine you are reading a story one word at a time. After each word, you update a little summary in your head of what the story is about so far. When you reach the next word, you use both that new word and your running summary to understand what is happening. A recurrent neural network works the same way. It reads a sequence of inputs (like words) one at a time, keeps a "memory" of what it has seen, and uses that memory along with each new input to make predictions. The memory is not perfect, though. If the story is very long, the network might forget details from the beginning, which is why researchers invented improved versions like LSTM and GRU that are better at remembering important things over long stretches. This same idea helps computers understand sentences or patterns in music, and even predict what might come next.
The concept of recurrent connections in neural networks dates back to the early days of connectionist research. John Hopfield introduced the Hopfield network in 1982, a form of recurrent network used as an associative memory. While not a sequence model in the modern sense, the Hopfield network demonstrated that recurrent connections could store and retrieve patterns.
In 1986, David Rumelhart, Geoffrey Hinton, and Ronald Williams published their influential work on backpropagation, which laid the groundwork for training multilayer networks. Michael Jordan proposed the "Jordan network" in 1986, where the output layer rather than the hidden layer was fed back as context. Shortly after, in 1990, Jeffrey Elman introduced the "Elman network" (sometimes called the "simple recurrent network"), which added a context layer that fed the previous hidden state back into the network at each time step. This architecture became the prototype for what is now called the "vanilla RNN." Both Elman and Jordan networks established the basic principle of using recurrence to model sequences.
The 1990s saw growing awareness of the difficulties in training RNNs on long sequences, particularly the vanishing gradient problem identified by Sepp Hochreiter in his 1991 diploma thesis and later formalized by Yoshua Bengio, Patrice Simard, and Paolo Frasconi in 1994. This led to the invention of Long Short-Term Memory (LSTM) networks by Hochreiter and Jurgen Schmidhuber in 1997, which became the dominant RNN variant for over a decade.
The Gated Recurrent Unit (GRU) was introduced by Kyunghyun Cho and colleagues in 2014 as a simpler alternative to LSTM. Around the same time, the development of sequence-to-sequence models by Ilya Sutskever, Oriol Vinyals, and Quoc Le (2014), along with the attention mechanism by Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio (2014), brought RNN-based architectures to new heights of performance in machine translation and other tasks.
The publication of "Attention Is All You Need" by Vaswani et al. in 2017 introduced the Transformer, which replaced recurrence entirely with self-attention. This marked the beginning of a shift away from RNNs in most large-scale NLP applications.
At each time step t, an RNN receives an input vector x_t and combines it with the hidden state h_{t-1} from the previous time step to produce a new hidden state h_t. The hidden state acts as the network's memory: it encodes a compressed summary of all inputs the network has processed so far. An optional output y_t can be computed from the hidden state at any time step.
Conceptually, the same set of weights is reused at every time step. This weight sharing allows an RNN to process sequences of any length and to generalize across different positions within a sequence, and it is a defining characteristic of the architecture.
The simplest (vanilla) RNN computes the hidden state and output as follows:
h_t = f(W_hh * h_{t-1} + W_xh * x_t + b_h)
y_t = g(W_hy * h_t + b_y)
Where:
| Symbol | Meaning |
|---|---|
| x_t | Input vector at time step t |
| h_t | Hidden state at time step t |
| h_{t-1} | Hidden state from the previous time step |
| W_xh | Weight matrix from input to hidden layer |
| W_hh | Weight matrix from hidden layer to hidden layer (recurrent weights) |
| W_hy | Weight matrix from hidden layer to output |
| b_h, b_y | Bias vectors |
| f | Activation function, typically tanh or ReLU |
| g | Output activation (e.g., softmax for classification) |
The hidden state h_0 is usually initialized to a zero vector at the start of a sequence.
The hidden state h_t is a vector of fixed dimensionality (chosen as a hyperparameter) that summarizes all information from the input sequence up to time step t. In theory, this allows the RNN to capture arbitrarily long-range dependencies. In practice, vanilla RNNs struggle to retain information over many time steps due to the vanishing gradient problem, which motivated the development of gated architectures like LSTM and GRU.
The hidden state is initialized (typically to a zero vector) at the start of each sequence. As the network processes each element, the hidden state is progressively updated, building a compressed representation of the sequence history.
To understand how an RNN processes a sequence, it helps to "unroll" (or "unfold") the network across time steps. Unrolling replaces the single recurrent cell with a chain of identical cells, one for each time step. Each cell receives the input at its time step and the hidden state from the previous cell, and passes its own hidden state to the next cell. For a sequence of length T, the unrolled network has T copies of the same recurrent cell, connected sequentially, with all copies sharing the same weights. When unrolled, an RNN resembles a very deep feedforward network with shared weights at each layer, which is the perspective used during training with backpropagation through time.
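The unrolled computation is just a loop that reuses the same weights at every step. A minimal NumPy sketch of the forward pass defined by the equations above (dimensions and random weights are illustrative):

```python
import numpy as np

def rnn_forward(xs, W_xh, W_hh, W_hy, b_h, b_y):
    """Unrolled forward pass of a vanilla RNN over a sequence xs.

    xs: list of input vectors x_t, each of shape (input_size,)
    Returns the outputs y_t and the final hidden state.
    """
    hidden_size = W_hh.shape[0]
    h = np.zeros(hidden_size)                       # h_0 initialized to zeros
    outputs = []
    for x_t in xs:                                  # same weights reused at every step
        h = np.tanh(W_hh @ h + W_xh @ x_t + b_h)    # h_t = f(W_hh h_{t-1} + W_xh x_t + b_h)
        y_t = W_hy @ h + b_y                        # y_t = g(W_hy h_t + b_y), here g = identity
        outputs.append(y_t)
    return outputs, h

# Illustrative sizes: input_size=3, hidden_size=5, output_size=2
rng = np.random.default_rng(0)
W_xh, W_hh = rng.normal(size=(5, 3)), rng.normal(size=(5, 5))
W_hy, b_h, b_y = rng.normal(size=(2, 5)), np.zeros(5), np.zeros(2)
ys, h_T = rnn_forward([rng.normal(size=3) for _ in range(4)], W_xh, W_hh, W_hy, b_h, b_y)
```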
RNNs are trained using backpropagation through time (BPTT), a direct extension of the standard backpropagation algorithm to sequences. The network is unrolled across all time steps of the sequence, a forward pass computes the hidden states, outputs, and loss, and the error is then backpropagated through the unrolled graph; because the same weights are shared across time steps, the gradient for each weight is the sum of its contributions from every step.
The computational cost of BPTT scales linearly with the sequence length T in both time and memory, since the activations at every time step must be stored for the backward pass. For very long sequences, this can become prohibitively expensive.
To reduce memory and computation costs, practitioners often use truncated backpropagation through time. Instead of unrolling the entire sequence, truncated BPTT divides the sequence into shorter segments (for example, 20 or 50 time steps) and backpropagates gradients only within each segment. The hidden state is carried forward from one segment to the next (maintaining continuity), but gradients are not propagated across segment boundaries.
Truncated BPTT introduces a tradeoff: it reduces memory usage and speeds up training, but it limits the network's ability to learn dependencies longer than the truncation window. In practice, this is often an acceptable compromise, especially for tasks where the most relevant context is relatively local. Truncated BPTT was widely used in practice for training language models and other RNN applications on long documents or continuous data streams.
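A minimal sketch of truncated BPTT in PyTorch, using an illustrative GRU layer and arbitrary sizes; the key step is detaching the hidden state at each segment boundary so gradients do not flow across segments:

```python
import torch
import torch.nn as nn

# Illustrative model: a GRU layer followed by a linear readout (sizes are arbitrary).
rnn = nn.GRU(input_size=8, hidden_size=32)
readout = nn.Linear(32, 8)
optimizer = torch.optim.Adam(list(rnn.parameters()) + list(readout.parameters()), lr=1e-3)
loss_fn = nn.MSELoss()

inputs = torch.randn(200, 4, 8)     # (seq_len, batch, input_size)
targets = torch.randn(200, 4, 8)

k = 20                              # truncation length (segment size)
hidden = None
for start in range(0, inputs.size(0), k):
    seg_x, seg_y = inputs[start:start + k], targets[start:start + k]
    optimizer.zero_grad()
    out, hidden = rnn(seg_x, hidden)        # hidden state carried across segments
    loss = loss_fn(readout(out), seg_y)
    loss.backward()                         # gradients stop at the segment boundary
    optimizer.step()
    hidden = hidden.detach()                # keep the value, cut the gradient path
```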
Teacher forcing is a training strategy commonly used with RNNs that generate sequences, such as language model decoders and machine translation systems. During training, instead of feeding the model's own output from the previous time step as the next input, teacher forcing supplies the ground-truth token from the training data.
This approach speeds up convergence because the model does not have to recover from its own early mistakes during training. However, it introduces a mismatch between training and inference known as exposure bias: at inference time, the model must rely on its own predictions, which may differ from the ground-truth tokens it was trained on. Small prediction errors can compound over a long generated sequence, leading to degraded output quality.
Several techniques have been proposed to mitigate exposure bias, including scheduled sampling (gradually transitioning from teacher forcing to model predictions during training) and professor forcing, which uses adversarial training to align the hidden-state dynamics of teacher-forced and free-running modes.
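The contrast between the two decoding modes can be sketched as follows; `decoder_step`, `embed`, and `project` are hypothetical stand-ins for a recurrent cell, an embedding lookup, and an output projection, not functions from any particular library:

```python
import torch

def decode_with_teacher_forcing(decoder_step, embed, project, targets, h):
    """Training-time decoding: the ground-truth token from `targets` (seq_len, batch)
    is fed as the input at each step, regardless of what the model predicted."""
    logits = []
    for t in range(targets.size(0) - 1):
        x_t = embed(targets[t])              # ground-truth token at step t
        h = decoder_step(x_t, h)
        logits.append(project(h))            # predicts the token at step t+1
    return torch.stack(logits), h

def decode_free_running(decoder_step, embed, project, start_token, h, max_len):
    """Inference-time decoding: each predicted token becomes the next input,
    so early mistakes can compound (exposure bias)."""
    token, outputs = start_token, []
    for _ in range(max_len):
        h = decoder_step(embed(token), h)
        token = project(h).argmax(dim=-1)    # model's own prediction
        outputs.append(token)
    return torch.stack(outputs), h
```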
The most significant challenge in training vanilla RNNs is the vanishing gradient problem. When gradients are backpropagated through many time steps, they are repeatedly multiplied by the recurrent weight matrix W_hh and the derivative of the activation function. If the largest eigenvalue (or singular value) of W_hh is less than 1, gradients shrink exponentially with the number of time steps, so the error signal from distant outputs never reaches earlier hidden states and the network cannot learn long-range dependencies. Conversely, if the largest eigenvalue exceeds 1, gradients can grow exponentially, a condition known as the exploding gradient problem. Exploding gradients cause numerical instability and can make training diverge entirely.
These problems were formally analyzed by Yoshua Bengio, Patrice Simard, and Paolo Frasconi in 1994. Their analysis showed that for vanilla RNNs, the influence of an input on the hidden state decays (or explodes) exponentially with the temporal distance, and that the ability to learn long-range dependencies decreases exponentially with the length of the dependency. This analysis motivated the development of gated RNN architectures.
| Technique | Description | Addresses |
|---|---|---|
| Gradient clipping | Rescale gradients when their norm exceeds a threshold | Exploding gradients |
| LSTM / GRU gates | Gating mechanisms that control information flow and maintain stable gradients | Vanishing gradients |
| Orthogonal initialization | Initialize W_hh as an orthogonal matrix so eigenvalues start near 1 | Both |
| Skip connections | Add direct connections across multiple time steps | Vanishing gradients |
| Batch normalization / layer normalization | Normalize activations to stabilize training dynamics | Both |
| Gradient regularization | Add a penalty to encourage gradients to remain in a stable range | Both |
| Truncated BPTT | Limit the number of time steps for backpropagation | Exploding gradients (indirectly) |
Gradient clipping, proposed by Tomas Mikolov, is one of the simplest and most widely used techniques. It sets a maximum threshold for the gradient norm; if the gradient exceeds this threshold, it is scaled down proportionally. This prevents the extreme parameter updates caused by exploding gradients but does not solve the vanishing gradient problem. Of the listed techniques, gated architectures (LSTM and GRU) have proven the most effective and widely adopted solution to vanishing gradients.
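In PyTorch, for example, norm-based clipping is applied between the backward pass and the optimizer step. A minimal sketch with an illustrative model and an arbitrary threshold of 1.0:

```python
import torch
import torch.nn as nn

model = nn.LSTM(input_size=8, hidden_size=16)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(5, 2, 8)                 # (seq_len, batch, input_size)
target = torch.randn(5, 2, 16)

output, _ = model(x)
loss = nn.functional.mse_loss(output, target)
loss.backward()
# Rescale all gradients so their global L2 norm does not exceed the threshold.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```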
LSTM is a gated RNN architecture introduced by Sepp Hochreiter and Jurgen Schmidhuber in 1997 to address the vanishing gradient problem. The key innovation is a cell state (also called the memory cell) that runs through the entire sequence like a conveyor belt, with information added or removed through learned gating mechanisms. This design allows LSTMs to learn dependencies spanning hundreds or even thousands of time steps.
An LSTM cell contains three gates and a cell state. Each gate is implemented as a sigmoid layer followed by element-wise multiplication:
| Component | Function | Equation |
|---|---|---|
| Forget gate (f_t) | Decides what information to discard from the cell state | f_t = sigma(W_f * [h_{t-1}, x_t] + b_f) |
| Input gate (i_t) | Decides what new information to store in the cell state | i_t = sigma(W_i * [h_{t-1}, x_t] + b_i) |
| Candidate values | Creates a vector of new candidate values | C_tilde_t = tanh(W_C * [h_{t-1}, x_t] + b_C) |
| Cell state update | Combines old cell state with new candidates | C_t = f_t * C_{t-1} + i_t * C_tilde_t |
| Output gate (o_t) | Decides what part of the cell state to output | o_t = sigma(W_o * [h_{t-1}, x_t] + b_o) |
| Hidden state | Filtered version of the cell state | h_t = o_t * tanh(C_t) |
The forget gate outputs values between 0 and 1 for each element of the cell state; a value of 1 means "keep this entirely" and 0 means "discard this completely." The input gate and candidate values together determine what new information is written to the cell state. The output gate controls which parts of the cell state are exposed as the hidden state for downstream computation.
Because the cell state update involves only element-wise multiplication and addition (no matrix multiplication by W_hh), gradients can flow through the cell state with minimal attenuation. The forget gate allows the network to explicitly "reset" parts of the cell state when they are no longer relevant. This is the core mechanism that mitigates the vanishing gradient problem in LSTMs.
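The gate equations in the table translate directly into code. A minimal NumPy sketch of a single LSTM step, with each weight matrix acting on the concatenation [h_{t-1}, x_t]:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W_f, W_i, W_C, W_o, b_f, b_i, b_C, b_o):
    """One LSTM step following the gate equations above.
    Each W_* has shape (hidden_size, hidden_size + input_size)."""
    z = np.concatenate([h_prev, x_t])           # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)                # forget gate
    i_t = sigmoid(W_i @ z + b_i)                # input gate
    C_tilde = np.tanh(W_C @ z + b_C)            # candidate values
    C_t = f_t * C_prev + i_t * C_tilde          # cell state update (element-wise only)
    o_t = sigmoid(W_o @ z + b_o)                # output gate
    h_t = o_t * np.tanh(C_t)                    # hidden state
    return h_t, C_t
```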
Several variants of the original LSTM have been proposed, including LSTMs with peephole connections (Gers and Schmidhuber, 2000), which let the gates inspect the cell state directly, and variants that couple the forget and input gates.
Research by Klaus Greff et al. (2017) systematically evaluated LSTM variants and found that the standard LSTM architecture performed well across tasks, with no single variant consistently outperforming it. The forget gate bias initialization (setting it to 1 so the network defaults to remembering) was found to be one of the most important practical considerations.
The Gated Recurrent Unit (GRU) was introduced by Kyunghyun Cho and colleagues in 2014 as a simpler alternative to LSTM. GRUs achieve comparable performance to LSTMs on many tasks while using fewer parameters and less computation. The GRU merges the cell state and hidden state into a single state vector and uses only two gates instead of three.
A GRU has two gates:
| Component | Function | Equation |
|---|---|---|
| Update gate (z_t) | Controls how much of the previous hidden state to retain | z_t = sigma(W_z * [h_{t-1}, x_t] + b_z) |
| Reset gate (r_t) | Controls how much of the previous hidden state to forget when computing the candidate | r_t = sigma(W_r * [h_{t-1}, x_t] + b_r) |
| Candidate hidden state | Computes a proposed new hidden state | h_tilde_t = tanh(W * [r_t * h_{t-1}, x_t] + b) |
| Hidden state update | Interpolates between old and candidate hidden states | h_t = (1 - z_t) * h_{t-1} + z_t * h_tilde_t |
Notably, the GRU does not maintain a separate cell state. All memory is stored directly in the hidden state. The update gate serves a role similar to both the forget and input gates of the LSTM, while the reset gate determines how much past information to incorporate into the candidate computation. When the reset gate is close to 0, the network behaves as if it is reading the first symbol of a sequence, effectively allowing it to drop irrelevant past information. The update gate z_t acts as a direct interpolation between the old state and the candidate state: when z_t is close to 0, the hidden state is largely copied from the previous step; when z_t is close to 1, the hidden state is replaced by the new candidate.
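For comparison with the LSTM step above, a minimal NumPy sketch of one GRU step following the same conventions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W_z, W_r, W_h, b_z, b_r, b_h):
    """One GRU step. Each W_* has shape (hidden_size, hidden_size + input_size)."""
    z_in = np.concatenate([h_prev, x_t])                                # [h_{t-1}, x_t]
    z_t = sigmoid(W_z @ z_in + b_z)                                     # update gate
    r_t = sigmoid(W_r @ z_in + b_r)                                     # reset gate
    h_tilde = np.tanh(W_h @ np.concatenate([r_t * h_prev, x_t]) + b_h)  # candidate state
    h_t = (1 - z_t) * h_prev + z_t * h_tilde                            # interpolate old/new
    return h_t
```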
| Feature | LSTM | GRU |
|---|---|---|
| Number of gates | 3 (forget, input, output) | 2 (update, reset) |
| Separate cell state | Yes | No |
| Parameters per unit | More (roughly 4x hidden size^2) | Fewer (roughly 3x hidden size^2) |
| Training speed | Slower | Faster |
| Long-range dependencies | Generally better on complex sequences | Comparable on shorter or simpler sequences |
| Memory usage | Higher | Lower |
The GRU has fewer parameters than the LSTM, which makes it faster to train and less prone to overfitting on small datasets. Empirical studies have found that neither architecture consistently outperforms the other across all tasks. GRUs tend to do better when data is limited or sequences are shorter, while LSTMs often have an edge on tasks requiring very long-term memory. In practice, the choice between GRU and LSTM is often made based on computational budget and dataset size.
| Feature | Vanilla RNN | LSTM | GRU |
|---|---|---|---|
| Year introduced | ~1990 (Elman) | 1997 (Hochreiter & Schmidhuber) | 2014 (Cho et al.) |
| Number of gates | 0 | 3 (forget, input, output) | 2 (update, reset) |
| Separate cell state | No | Yes | No |
| Parameters per unit | Low | High | Medium |
| Long-range dependencies | Poor | Strong | Good |
| Training speed | Fast | Slow | Moderate |
| Risk of vanishing gradients | High | Low | Low |
| Best suited for | Short sequences, simple patterns | Long sequences, complex dependencies | Medium-length sequences, limited compute |
| Common use cases | Toy problems, educational examples | Machine translation, speech recognition, text generation | Similar to LSTM; preferred when speed matters |
The earliest practical RNN architectures were simple recurrent networks (SRNs). Jeffrey Elman introduced the Elman network in 1990, where the hidden state from the previous time step is copied to a set of "context units" that feed back into the hidden layer. Michael Jordan proposed the Jordan network in 1986, which instead feeds the output from the previous time step back to the hidden layer through context units.
| Architecture | Recurrent connection | Year |
|---|---|---|
| Jordan network | Output to hidden (via context units) | 1986 |
| Elman network | Hidden to hidden (via context units) | 1990 |
Both architectures were foundational in demonstrating that recurrent networks could learn temporal structure, but they struggled with long sequences due to the vanishing gradient problem.
A bidirectional RNN (BiRNN), introduced by Mike Schuster and Kuldip Paliwal in 1997, processes the input sequence in both the forward and backward directions using two separate hidden states. The forward hidden state captures information from past inputs, while the backward hidden state captures information from future inputs. At each time step, the two hidden states are typically concatenated to form the final representation.
Bidirectional RNNs are particularly useful for tasks where the entire input sequence is available before making predictions, such as named entity recognition, part-of-speech tagging, and speech recognition. The motivation for bidirectional processing is that in many tasks, the meaning of a particular element depends on both its past and future context. For example, the correct part-of-speech tag for a word may depend on words that appear both before and after it.
Bidirectional RNNs can use any recurrent cell type (vanilla RNN, LSTM, or GRU) in each direction. Bidirectional LSTMs (BiLSTMs) became the dominant architecture for many NLP tasks in the mid-2010s, including named entity recognition, sentiment analysis, question answering, and syntactic parsing.
A limitation of bidirectional RNNs is that they cannot be used in settings that require causal (left-to-right) generation or real-time/streaming applications, since the backward pass requires access to future inputs.
Stacking multiple RNN layers on top of each other creates a deep RNN (also called a stacked RNN). The hidden state output of one RNN layer serves as the input sequence for the next layer. Deep RNNs can learn hierarchical representations of sequential data: lower layers might capture local patterns (like phonemes in speech), while higher layers capture more abstract features (like words or phrases).
In practice, deep RNNs with 2 to 4 layers often outperform single-layer RNNs, but performance gains diminish with additional depth. Residual connections, highway connections, and layer normalization are commonly used between layers of deep RNNs to stabilize training and facilitate gradient flow, borrowing ideas from deep feedforward and convolutional neural network architectures. Deep RNNs were found to improve performance on tasks such as speech recognition and machine translation, though they are more difficult to train than single-layer models.
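Both stacking and bidirectionality are exposed as simple options in common frameworks. For instance, a two-layer bidirectional LSTM in PyTorch (all sizes here are illustrative):

```python
import torch
import torch.nn as nn

# Two stacked LSTM layers, each processing the sequence in both directions.
rnn = nn.LSTM(input_size=32, hidden_size=64, num_layers=2,
              bidirectional=True, batch_first=True)

x = torch.randn(8, 20, 32)        # (batch, seq_len, input_size)
output, (h_n, c_n) = rnn(x)
print(output.shape)               # (8, 20, 128): forward and backward states concatenated
print(h_n.shape)                  # (4, 8, 64): num_layers * num_directions final states
```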
The sequence-to-sequence (Seq2Seq) model was introduced independently by Ilya Sutskever, Oriol Vinyals, and Quoc V. Le (2014) and by Kyunghyun Cho et al. (2014). A Seq2Seq model maps variable-length input sequences to variable-length output sequences and consists of two RNNs: an encoder, which reads the input sequence and compresses it into a fixed-length context vector, and a decoder, which generates the output sequence one token at a time conditioned on that vector.
Seq2Seq models achieved breakthrough results in machine translation, text summarization, and conversational AI, and became the foundation for neural machine translation (NMT) systems. They were also applied to dialogue systems and code generation. However, the fixed-length context vector creates an information bottleneck: for long input sequences, the encoder must compress all relevant information into a single vector, which inevitably loses detail.
The attention mechanism, proposed by Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio in 2014, addressed the context vector bottleneck in Seq2Seq models. Instead of relying on a single fixed context vector, the decoder is allowed to attend to all encoder hidden states at each decoding step, computing a weighted sum where the weights reflect the relevance of each encoder state to the current decoding position.
Bahdanau attention uses an additive scoring function: the decoder hidden state and each encoder hidden state are passed through separate linear layers, summed, and fed through a tanh activation to produce alignment scores. These scores are then normalized with softmax to produce attention weights.
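A minimal sketch of this additive scoring in PyTorch; the layer names and dimensions are illustrative rather than taken from the original paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdditiveAttention(nn.Module):
    """Bahdanau-style additive attention over encoder hidden states."""
    def __init__(self, dec_dim, enc_dim, attn_dim):
        super().__init__()
        self.W_dec = nn.Linear(dec_dim, attn_dim, bias=False)
        self.W_enc = nn.Linear(enc_dim, attn_dim, bias=False)
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, dec_state, enc_states):
        # dec_state: (batch, dec_dim); enc_states: (batch, src_len, enc_dim)
        scores = self.v(torch.tanh(
            self.W_dec(dec_state).unsqueeze(1) + self.W_enc(enc_states)
        )).squeeze(-1)                           # (batch, src_len) alignment scores
        weights = F.softmax(scores, dim=-1)      # attention weights
        context = torch.bmm(weights.unsqueeze(1), enc_states).squeeze(1)
        return context, weights

attn = AdditiveAttention(dec_dim=64, enc_dim=128, attn_dim=32)
context, weights = attn(torch.randn(2, 64), torch.randn(2, 7, 128))
```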
This innovation dramatically improved translation quality for long sentences and became a standard component of RNN-based Seq2Seq systems. It also provided interpretability, as the attention weights could be visualized to show which input elements the model was focusing on. The success of attention in RNN-based models directly inspired the development of the self-attention mechanism in transformers, which dispensed with recurrence altogether.
RNNs (and their gated variants) have been applied to a wide range of sequential tasks:
| Application domain | Examples | Typical architecture |
|---|---|---|
| Natural language processing | Language modeling, sentiment analysis, text classification | LSTM / GRU with attention |
| Machine translation | Translating between languages | Seq2Seq with Bahdanau or Luong attention |
| Speech recognition | Converting audio waveforms to text | Bidirectional LSTM, CTC loss |
| Time series forecasting | Stock prices, weather prediction, energy demand | Stacked LSTM / GRU |
| Music generation | Composing melodies and harmonies | Character-level LSTM |
| Handwriting recognition | Converting handwritten text to digital characters | Bidirectional LSTM with CTC |
| Video analysis | Activity recognition, video captioning | CNN encoder + LSTM decoder |
| Bioinformatics | Protein structure prediction, gene sequence analysis | Bidirectional GRU / LSTM |
A language model assigns probabilities to sequences of words. RNN-based language models process text one word (or character) at a time, using the hidden state to maintain context. At each step, the model predicts the probability distribution over the next word given all previous words.
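A minimal sketch of such a model in PyTorch, with an embedding layer, an LSTM, and a softmax over the vocabulary (all sizes are illustrative):

```python
import torch
import torch.nn as nn

class RNNLanguageModel(nn.Module):
    """Predicts a distribution over the next token given the tokens so far."""
    def __init__(self, vocab_size=10000, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens, state=None):
        # tokens: (batch, seq_len) integer ids
        h, state = self.lstm(self.embed(tokens), state)
        return self.out(h), state              # logits over the next token at each position

model = RNNLanguageModel()
logits, _ = model(torch.randint(0, 10000, (2, 12)))
probs = torch.softmax(logits, dim=-1)          # P(next word | previous words)
```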
RNN language models significantly outperformed traditional n-gram models, particularly for capturing longer-range dependencies. Tomas Mikolov's work on RNN-based language models (2010-2012) demonstrated large improvements in perplexity (a standard metric for language models) compared to n-gram baselines.
Before the rise of Transformer-based models like GPT and BERT, LSTM-based language models represented the state of the art. OpenAI's early language model research used LSTMs before switching to Transformers.
Neural machine translation using RNN-based seq2seq models with attention became the dominant approach to machine translation from 2014 to 2017. Google's Neural Machine Translation system (GNMT), deployed in 2016, used a deep LSTM encoder-decoder with attention and achieved near-human-level translation quality on several language pairs. This system replaced Google's previous phrase-based statistical machine translation system.
RNNs are well suited for speech recognition because speech is inherently sequential. Deep bidirectional LSTMs became the standard acoustic model in automatic speech recognition (ASR) systems. Connectionist Temporal Classification (CTC), a training criterion designed for sequence labeling with RNNs, was introduced by Alex Graves et al. in 2006 and became widely used in end-to-end speech recognition.
Baidu's Deep Speech (2014) and Deep Speech 2 (2015) systems demonstrated that deep RNNs trained on large datasets could achieve competitive speech recognition performance with simplified pipelines. Google's voice search and dictation systems also used LSTM-based models extensively.
RNNs have been applied to forecasting future values in time series data, including stock prices, energy demand, weather patterns, and sensor readings. LSTMs are particularly popular for time series tasks because of their ability to capture both short-term and long-term patterns. However, more recent approaches using Transformers and specialized architectures (such as N-BEATS and Temporal Fusion Transformers) have shown competitive or superior performance on many time series benchmarks.
Beyond these, RNNs have also been applied to music generation, handwriting recognition, video captioning and activity recognition, and bioinformatics tasks such as protein structure prediction, as summarized in the applications table above.
From roughly 2013 to 2018, RNNs (particularly LSTMs and BiLSTMs) were the dominant architecture for most NLP tasks. Key milestones during this period include the sequence-to-sequence and attention-based neural machine translation models of 2014, the deployment of Google's Neural Machine Translation system in 2016, and LSTM-based pretraining approaches such as ELMo and ULMFiT in 2018.
RNNs also played a central role in early neural approaches to parsing, text classification, question answering, and dialogue systems.
The introduction of the transformer architecture by Vaswani et al. in 2017 ("Attention Is All You Need") fundamentally shifted the landscape of sequence modeling. Transformers replaced the recurrent computation of RNNs with self-attention, which computes relationships between all positions in a sequence simultaneously.
| Aspect | RNNs | Transformers |
|---|---|---|
| Processing order | Sequential (one step at a time) | Parallel (all positions at once) |
| Training parallelism | Limited by sequential dependency | Fully parallelizable on GPUs/TPUs |
| Long-range dependencies | Difficult (vanishing gradients, even with LSTM) | Handled natively by self-attention |
| Computational complexity per layer | O(T * d^2) | O(T^2 * d) |
| Memory at inference | O(d) constant state | O(T * d) grows with context length (KV cache) |
| Scalability | Hard to scale beyond ~1B parameters | Scales to hundreds of billions of parameters |
| Inductive bias | Strong sequential bias | No inherent sequential bias (uses positional encoding) |
The key advantage of transformers is parallelization during training. RNNs must process each token sequentially because computing h_t requires h_{t-1}. Transformers compute attention over all tokens in parallel, enabling efficient use of modern GPU hardware and making it practical to train on far larger datasets with far larger models. This parallelization advantage is the primary reason transformers enabled the jump from hundreds-of-millions-parameter models to trillion-parameter models. Self-attention also connects every position in the sequence directly to every other position, avoiding the long gradient paths that make it difficult for RNNs to learn distant dependencies.
By 2019, Transformer-based models had surpassed RNNs on virtually all major NLP benchmarks. Large language models (LLMs) such as GPT-2, GPT-3, and BERT are all based on the Transformer architecture. In speech recognition and machine translation, Transformers also gradually replaced RNN-based systems.
However, RNNs retain an advantage at inference time for autoregressive generation: they maintain a fixed-size hidden state, giving them O(1) memory per step, whereas transformers must store and attend over a key-value cache that grows linearly with context length.
Despite the dominance of transformers, the inference-time efficiency of recurrent models has motivated a resurgence of interest in recurrent-style architectures.
The S4 model (Gu et al., 2021) introduced structured state space models that use linear recurrence with carefully parameterized state matrices. S4 resolved the gradient stability issues of classical RNNs by leveraging HiPPO (High-order Polynomial Projection Operators) initialization, enabling stable modeling of sequences with tens of thousands of steps.
Mamba (Gu and Dao, 2023) built on S4 by introducing selective state spaces, in which the SSM parameters (the input and output projections B and C, and the discretization step) are functions of the current input rather than fixed. This gives the model content-aware filtering, analogous to a gating mechanism. Mamba achieves linear-time complexity O(T), roughly 5x higher inference throughput than comparably sized transformers, and strong performance across language, audio, and genomics. The Mamba-3B model outperformed transformers of the same size on several benchmarks.
Mamba-2 (2024) introduced the State Space Duality (SSD) framework, proving mathematically that SSMs and attention are dual representations of the same underlying computation on structured matrices.
RWKV (Receptance Weighted Key Value, 2023) combines the parallelizable training of transformers with the efficient inference of RNNs. It uses a linear attention mechanism and can be formulated as either a transformer (for parallel training) or an RNN (for efficient inference). RWKV-7 "Goose" offers linear-time complexity, constant memory at inference, and competitive performance with full-attention models in the 0.7B to 1.5B parameter range.
Sepp Hochreiter, co-inventor of the original LSTM, returned to the architecture with xLSTM (2024). xLSTM incorporates exponential gating, state expansion, and normalization techniques, offering two specialized modules: sLSTM (scalar memory) and mLSTM (matrix memory). While still in the research phase, xLSTM demonstrates that the core LSTM principles remain viable when modernized.
Various approaches have also emerged that replace softmax attention with linear approximations, recovering RNN-like recurrence during inference. These developments suggest that the boundary between recurrent and attention-based models is blurring, with hybrid and linear-recurrent architectures seeking the best of both paradigms.
RNNs are supported by all major deep learning frameworks:
| Framework | RNN support | Key features |
|---|---|---|
| PyTorch | torch.nn.RNN, torch.nn.LSTM, torch.nn.GRU | Dynamic computation graphs, cuDNN acceleration, packed sequences for variable-length inputs |
| TensorFlow / Keras | tf.keras.layers.LSTM, tf.keras.layers.GRU, tf.keras.layers.SimpleRNN | Static and dynamic graphs, TensorFlow Lite for mobile deployment |
| JAX / Flax | Custom implementations via jax.lax.scan | Functional transformations, JIT compilation, TPU support |
| ONNX Runtime | LSTM and GRU operators | Cross-framework model deployment and optimization |
Historically, Theano (developed at the University of Montreal) was one of the first frameworks to support efficient RNN training and was used in much of the foundational RNN research, including the original seq2seq and attention papers. Torch (the Lua-based predecessor to PyTorch) was also widely used in early RNN research.
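As a concrete example of the variable-length-sequence support noted in the PyTorch row above, packed sequences let the recurrent layer skip padded positions (sizes are illustrative):

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

lstm = nn.LSTM(input_size=16, hidden_size=32, batch_first=True)

# Batch of 3 sequences with true lengths 5, 3, and 2, padded to length 5.
x = torch.randn(3, 5, 16)
lengths = torch.tensor([5, 3, 2])

packed = pack_padded_sequence(x, lengths, batch_first=True, enforce_sorted=True)
packed_out, (h_n, c_n) = lstm(packed)
output, out_lengths = pad_packed_sequence(packed_out, batch_first=True)
print(output.shape)   # (3, 5, 32); positions beyond each true length are zeros
```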
When working with RNNs, several practical considerations influence performance, including the gradient clipping threshold, the truncation length used for BPTT, how variable-length sequences are batched and padded (for example, with packed sequences), the hidden-state dimensionality, the number of stacked layers, and whether bidirectional processing is applicable to the task.
| Year | Milestone | Researchers |
|---|---|---|
| 1982 | Hopfield network: recurrent network for associative memory | John Hopfield |
| 1986 | Jordan network: first simple recurrent network | Michael Jordan |
| 1989 | Teacher forcing introduced for RNN training | Ronald J. Williams, David Zipser |
| 1990 | Elman network: SRN with hidden-to-hidden recurrence | Jeffrey Elman |
| 1990 | Backpropagation through time formalized | Paul Werbos |
| 1991 | Vanishing gradient problem identified | Sepp Hochreiter (diploma thesis) |
| 1994 | Formal analysis of vanishing gradient problem in RNNs | Yoshua Bengio, Patrice Simard, Paolo Frasconi |
| 1997 | Long Short-Term Memory (LSTM) introduced | Sepp Hochreiter, Jurgen Schmidhuber |
| 1997 | Bidirectional RNNs proposed | Mike Schuster, Kuldip Paliwal |
| 2000 | LSTM with peephole connections | Felix Gers, Jurgen Schmidhuber |
| 2006 | Connectionist Temporal Classification (CTC) | Alex Graves et al. |
| 2010 | RNN-based language models demonstrate large perplexity gains | Tomas Mikolov et al. |
| 2014 | GRU introduced; Seq2Seq models; Bahdanau attention | Cho et al.; Sutskever et al.; Bahdanau et al. |
| 2016 | Google Neural Machine Translation (GNMT) deployed | Wu et al. |
| 2017 | Transformer introduced, beginning RNN decline | Vaswani et al. |
| 2018 | ELMo and ULMFiT demonstrate LSTM-based pretraining | Peters et al.; Howard, Ruder |
| 2021 | S4 structured state space model | Albert Gu et al. |
| 2023 | Mamba: selective state spaces with linear-time inference; RWKV | Albert Gu, Tri Dao; Peng et al. |
| 2024 | xLSTM; Mamba-2; RWKV v6/v7 | Hochreiter et al.; Gu and Dao; Peng et al. |