RNN
Last reviewed
May 24, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v6 · 6,774 words
See also: Machine learning terms
RNN is the standard abbreviation for recurrent neural network, a class of artificial neural network in which connections between units form cycles, so the activations from one time step are fed back into the network at the next time step. That feedback loop gives the model an internal hidden state that acts as a compressed running memory of what it has seen so far in a sequence, which is why RNNs were the dominant architecture for sequence modeling tasks like language modeling, speech recognition, and handwriting recognition for roughly two decades.[1][2] Most introductory references use "RNN" as a generic umbrella that covers vanilla Elman-style networks, gated variants such as LSTM and GRU, bidirectional networks, and stacked (deep) recurrent networks.[3] Modern usage often contrasts RNNs with transformer-based models, which since 2017 have replaced them in most large-scale natural language processing systems.[4]
The RNN article on this wiki is the short reference page for the abbreviation. The full article, with extensive coverage of architecture diagrams, training math, and applications, lives at recurrent neural network.
A feedforward neural network maps an input vector to an output vector in a single pass; nothing it computes for one example influences what it computes for the next. An RNN keeps a hidden state vector h_t that is updated at every time step from the current input x_t and the previous hidden state h_{t-1}. Because the same parameters are reused at each step, the network can in principle process sequences of any length using a fixed number of weights.[1] That parameter sharing across time, together with the recurrent connection, is what defines the architecture. The network is recurrent in the strict graph-theoretic sense: if you draw it as a computation graph, there is a cycle in the connections.
The simplest RNN, often called the vanilla RNN or the Elman network, computes:
h_t = f(W_xh x_t + W_hh h_{t-1} + b_h)
y_t = g(W_hy h_t + b_y)
Here x_t is the input at step t, h_t is the new hidden state, and y_t is the optional output at that step. W_xh, W_hh, and W_hy are weight matrices; b_h and b_y are bias vectors. The function f is a nonlinearity, almost always tanh in classic work and sometimes ReLU in later variants. The function g depends on the task, often softmax for classification or identity for regression. The hidden state h_0 is initialized to a zero vector at the start of a sequence.[1]
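As a minimal sketch of the two update equations, assuming NumPy, tanh for f, and the identity for g, the forward pass can be written as a short loop. The function name `rnn_forward` and the sizes are illustrative choices, not part of any library.

```python
import numpy as np

def rnn_forward(xs, W_xh, W_hh, W_hy, b_h, b_y):
    """Run an Elman-style RNN over a sequence; return all hidden states and outputs."""
    h = np.zeros(W_hh.shape[0])                  # h_0 is the zero vector
    hs, ys = [], []
    for x in xs:                                 # one update per input vector
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)   # f = tanh
        y = W_hy @ h + b_y                       # g = identity (regression-style output)
        hs.append(h)
        ys.append(y)
    return np.stack(hs), np.stack(ys)

rng = np.random.default_rng(0)
d, h_size, out_size, T = 3, 5, 2, 7              # illustrative sizes
params = dict(
    W_xh=rng.normal(scale=0.1, size=(h_size, d)),
    W_hh=rng.normal(scale=0.1, size=(h_size, h_size)),
    W_hy=rng.normal(scale=0.1, size=(out_size, h_size)),
    b_h=np.zeros(h_size),
    b_y=np.zeros(out_size),
)
xs = rng.normal(size=(T, d))
hs, ys = rnn_forward(xs, **params)
print(hs.shape, ys.shape)  # (7, 5) (7, 2)
```

The same five parameter arrays are reused at every step, so the loop works for any sequence length T.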
A widely cited compact form writes the update as h_t = tanh(W x_t + U h_{t-1} + b), with W playing the role of W_xh and U the role of W_hh.[2] The two notations are equivalent; the choice of letters is purely stylistic. What matters for analysis is that two operations act on the previous hidden state at every step: the recurrent matrix U (or W_hh) multiplies h_{t-1}, then a saturating elementwise nonlinearity reshapes the result.
It is worth noticing that the same matrix W_hh is multiplied into the hidden state at every step. The repeated application of one matrix is the source of the architecture's main strength (it lets a small model summarize an arbitrarily long sequence) and its main weakness (it makes long-range learning very hard, as the section on gradients explains).[5][6]
For an input vector of size d and a hidden state of size h, W_xh has shape (h, d), W_hh has shape (h, h), b_h has shape (h,), and the total recurrent parameter count is hd + hh + h. Hidden sizes used in mid-2010s production systems ranged from 256 for keyword spotting to 1024 or 2048 for large language models and machine translation encoders.[7][8] The classic Penn Treebank LSTM benchmarks from Zaremba, Sutskever, and Vinyals (2014) used hidden sizes of 200 (small), 650 (medium), and 1500 (large), with the large model containing roughly 66 million parameters.[9] A standard LSTM cell of hidden size h with input size d has roughly 4 * (hd + hh + h) parameters, four times the vanilla RNN count because the cell has four gate-style projections (input, forget, output, and candidate cell update). A GRU has roughly 3 * (hd + hh + h) parameters because it has three internal projections (update, reset, and candidate update), which is one reason GRUs are sometimes preferred for memory-constrained deployments.
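The parameter arithmetic can be checked with a few lines of Python. This is only a sketch of the approximate formulas quoted above; the helper name `recurrent_param_count` and the sizes are illustrative, and real implementations can differ slightly (some libraries keep two bias vectors per projection, for example).

```python
def recurrent_param_count(d, h, projections=1):
    """Parameters in `projections` blocks, each with an (h, d) matrix, an (h, h) matrix, and a bias."""
    return projections * (h * d + h * h + h)

d, h = 650, 650  # illustrative sizes, not taken from any specific model
vanilla = recurrent_param_count(d, h)                  # one projection
lstm = recurrent_param_count(d, h, projections=4)      # input, forget, output, candidate
gru = recurrent_param_count(d, h, projections=3)       # update, reset, candidate
print(vanilla, lstm, gru)
```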
Conceptually, W_hh defines a linear dynamical system in hidden space, and the elementwise nonlinearity (typically tanh) keeps the trajectories bounded. In the linearized system, the singular values of W_hh indicate whether information injected at step t survives, decays, or grows by step t + k. If the singular value along a direction is exactly 1, that direction is preserved and can carry information forward indefinitely; if it is less than 1, the direction decays; if it is greater than 1, it amplifies and eventually saturates the tanh, after which the unit behaves like a binary switch. Most of the analysis behind orthogonal initialization and norm-preserving recurrent architectures (such as unitary RNNs) starts from this observation.[21][22]
Different tasks expose different parts of the RNN to the loss. The patterns are usually grouped into a few categories:
| pattern | example task | description |
|---|---|---|
| one to one | image classification | trivial case, no recurrence really used |
| one to many | image captioning | single input, sequence output (decoder) |
| many to one | sentiment classification | sequence input, single output at the end |
| many to many (aligned) | per-frame phoneme labeling | sequence in, sequence out, same length |
| many to many (encoder-decoder) | machine translation | sequence in, then sequence out, different lengths |
Andrej Karpathy's 2015 essay "The Unreasonable Effectiveness of Recurrent Neural Networks" popularized this taxonomy and is still a useful one-page summary of how an RNN's interface depends on the task.[10] The essay showed character-level RNNs generating Shakespeare, Wikipedia markup, LaTeX, and Linux source code, and was influential in convincing a broader audience that recurrent networks could capture surprisingly long-range structure when trained on enough data.
Recurrent connections in neural networks predate the term RNN. John Hopfield's 1982 paper introduced what is now called the Hopfield network, a fully connected recurrent network used as an associative memory. It is not a sequence model in the modern sense (it relaxes to fixed points rather than producing time-indexed outputs), but it established that recurrence could store information in a stable way.[11] With symmetric weights and discrete-time updates, the Hopfield network provably settles into a fixed point, and when the starting state is a sufficiently complete partial version of a stored pattern that fixed point is typically the pattern itself, which gave the architecture an interpretation as content-addressable memory.
Michael Jordan's 1986 "Jordan network" fed the previous output of the network back into the hidden layer at the next step, which let a small network model temporal patterns.[12] Jeffrey Elman's 1990 paper Finding Structure in Time changed the wiring slightly, copying the previous hidden state (rather than the output) into a context layer that fed the hidden layer at the next step.[1] The Elman network, also called the simple recurrent network or SRN, is the architecture that most textbooks now mean when they say "vanilla RNN." Elman's experiments showed that an SRN could discover word boundaries from continuous letter streams and grammatical structure from sentence streams, which made the architecture interesting to cognitive scientists as well as engineers. The paper had been cited more than 12,000 times by the mid-2020s.[1]
The formal training algorithm for RNNs was worked out by Paul Werbos in 1990. His paper Backpropagation through time: what it does and how to do it described how to unroll the recurrent computation across the sequence and apply the chain rule to the resulting feedforward graph with tied weights, giving a recipe that practitioners have used essentially unchanged for thirty-five years.[13]
Throughout the 1990s, vanilla RNNs were used for small-scale phoneme recognition, control problems, and language modeling. The decade's most important conceptual contribution was negative: Sepp Hochreiter's 1991 diploma thesis at the Technical University of Munich, supervised by Jurgen Schmidhuber, identified the vanishing gradient problem, and Yoshua Bengio, Patrice Simard, and Paolo Frasconi formalized it in their 1994 paper Learning Long-Term Dependencies with Gradient Descent is Difficult.[5][6] That work showed why simple RNNs could not actually learn the long-range dependencies their architecture seemed to allow. The vanishing gradient analysis directly motivated the LSTM of Hochreiter and Schmidhuber, published in Neural Computation in 1997, which addressed the problem with gating and a constant error carousel; the LSTM paper has become the most cited neural network paper of the twentieth century, with more than seventy thousand citations.[14] It also motivated, indirectly, the GRU of Cho et al. (2014).[15]
The practical RNN era of the 2010s was driven by three factors: gated cells that finally trained well, large labeled datasets, and GPUs. Tomas Mikolov's 2010 RNN language model on the Penn Treebank cut perplexity by about half compared to backoff n-gram baselines and reduced word error rate by 18% on the Wall Street Journal speech task, which reset expectations for what was possible with recurrent language models.[16] Sutskever, Vinyals, and Le's 2014 sequence-to-sequence paper used a stack of four LSTMs for English-to-French translation on WMT'14 and reached a BLEU score competitive with strong phrase-based systems, kicking off the modern era of neural machine translation.[17] Bahdanau attention, introduced by Dzmitry Bahdanau, Kyunghyun Cho, and Bengio in the same year, then removed the fixed-size bottleneck of the encoder hidden state by letting the decoder attend over all encoder states.[18]
By 2014, Hasim Sak, Andrew Senior, and Francoise Beaufays at Google had shown that deep LSTM acoustic models with a recurrent projection layer beat the previous state of the art on large-vocabulary speech recognition, and the same group deployed LSTM acoustic models in Google's Android voice search in 2015.[7] By 2016 the Google Neural Machine Translation system (Wu et al.) was a deep stack of eight LSTM encoder layers and eight LSTM decoder layers with attention and residual connections, deployed to production for several language pairs.[8] For about three years, LSTM was the default sequence model in industrial deep learning.
That default ended with Vaswani et al.'s 2017 paper Attention Is All You Need, which introduced the transformer.[4] Transformers replaced recurrence with self-attention, which is parallelizable across the time dimension and scales much better on modern accelerators. By 2019 most large language models had moved off LSTMs entirely. The architecture did not disappear, but its share of the field shrank quickly.
RNNs are trained with a variant of backpropagation called backpropagation through time (BPTT). The trick is to first "unroll" the recurrent computation across the sequence: a network that processes T steps becomes, conceptually, a feedforward network with T layers that share weights. Once unrolled, ordinary backpropagation runs through the unrolled graph, and the gradient with respect to each shared weight is the sum of contributions from every time step.[13] The training loss is typically defined as the sum (or mean) of per-step losses, for example a sum of cross-entropy losses for language modeling.
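The unroll-then-backpropagate recipe can be sketched in NumPy for a small tanh RNN. The sketch assumes a per-step squared-error loss on the hidden state itself (no output layer), which is a simplification chosen here to keep the code short; the function name `bptt` is illustrative. A finite-difference check on one weight confirms that the summed per-step contributions match the numerical gradient.

```python
import numpy as np

def bptt(xs, ys, W_xh, W_hh, b_h):
    """Loss and gradients for a tanh RNN with loss 0.5 * sum_t ||h_t - y_t||^2."""
    T, h_size = len(xs), W_hh.shape[0]
    hs = np.zeros((T + 1, h_size))             # hs[0] is h_0 = 0; hs[t + 1] is the state after step t
    for t in range(T):                         # forward pass over the unrolled graph
        hs[t + 1] = np.tanh(W_xh @ xs[t] + W_hh @ hs[t] + b_h)
    loss = 0.5 * np.sum((hs[1:] - ys) ** 2)

    dW_xh, dW_hh, db_h = np.zeros_like(W_xh), np.zeros_like(W_hh), np.zeros_like(b_h)
    dh_next = np.zeros(h_size)                 # gradient arriving from later steps
    for t in reversed(range(T)):
        dh = (hs[t + 1] - ys[t]) + dh_next     # local loss term plus future contribution
        da = dh * (1.0 - hs[t + 1] ** 2)       # back through tanh
        dW_xh += np.outer(da, xs[t])           # shared weights: sum over time steps
        dW_hh += np.outer(da, hs[t])
        db_h += da
        dh_next = W_hh.T @ da                  # pass the gradient to the previous step
    return loss, dW_xh, dW_hh, db_h

rng = np.random.default_rng(0)
T, d, h_size = 6, 3, 4
xs, ys = rng.normal(size=(T, d)), rng.normal(size=(T, h_size))
W_xh = rng.normal(scale=0.5, size=(h_size, d))
W_hh = rng.normal(scale=0.5, size=(h_size, h_size))
b_h = np.zeros(h_size)

loss, dW_xh, dW_hh, db_h = bptt(xs, ys, W_xh, W_hh, b_h)

# Finite-difference check on one entry of W_hh.
eps = 1e-6
W_plus, W_minus = W_hh.copy(), W_hh.copy()
W_plus[1, 2] += eps
W_minus[1, 2] -= eps
numeric = (bptt(xs, ys, W_xh, W_plus, b_h)[0] - bptt(xs, ys, W_xh, W_minus, b_h)[0]) / (2 * eps)
print(abs(numeric - dW_hh[1, 2]))  # should be close to zero
```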
For very long sequences, full BPTT becomes impractical: it stores activations for every step in memory, and the per-step gradient computation gets expensive. Truncated BPTT (TBPTT) splits the sequence into fixed-length windows (commonly 20 to 200 steps), backpropagates within each window, and carries the hidden state forward across windows without backpropagating the gradient across the boundary. TBPTT was the workhorse training method for RNN language models throughout the 2010s and is sometimes denoted TBPTT(k1, k2), where k1 is how often a gradient update is performed and k2 is how far backward the gradient is allowed to flow.[9]
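The windowing part of truncated BPTT is simple enough to sketch in plain Python. The helper below is hypothetical and covers only the common case where the update interval and the backward horizon are the same window length (k1 = k2); the comments describe where a training loop would carry the hidden state forward and cut the gradient.

```python
def tbptt_windows(seq_len, window):
    """Yield (start, end) index pairs that cover a sequence in fixed-length windows."""
    for start in range(0, seq_len, window):
        yield start, min(start + window, seq_len)

windows = list(tbptt_windows(seq_len=10, window=4))
print(windows)  # [(0, 4), (4, 8), (8, 10)]

# In a training loop, each window would:
#   1. start from the hidden state left by the previous window,
#   2. run forward and backward passes over steps start..end only,
#   3. apply a parameter update,
#   4. pass the final hidden state on as a constant (no gradient flows across the boundary).
```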
A related technique used in sequence generation is teacher forcing: during training, the model is fed the ground-truth previous token rather than its own previous prediction, which speeds convergence at the cost of a train-test distribution mismatch known as exposure bias.[19] Variants such as scheduled sampling and professor forcing have been proposed to soften this mismatch by mixing model predictions and ground-truth tokens during training.
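The difference between teacher forcing and a mixed scheme comes down to which token is fed in at each step. The sketch below uses hypothetical helper names and a simplification: the model's predictions are supplied as a precomputed list, whereas a real scheduled-sampling loop would generate them step by step.

```python
import random

def teacher_forced_inputs(target, bos):
    """Teacher forcing: the input at step t is the ground-truth token from step t - 1."""
    return [bos] + list(target[:-1])

def mixed_inputs(target, predictions, bos, p_truth, rng):
    """Scheduled-sampling style mix: use the true previous token with probability p_truth."""
    inputs = [bos]
    for truth, predicted in zip(target[:-1], predictions[:-1]):
        inputs.append(truth if rng.random() < p_truth else predicted)
    return inputs

target = ["the", "cat", "sat", "<eos>"]
predictions = ["the", "dog", "sat", "<eos>"]        # hypothetical model outputs
print(teacher_forced_inputs(target, bos="<bos>"))   # ['<bos>', 'the', 'cat', 'sat']
print(mixed_inputs(target, predictions, "<bos>", p_truth=0.5, rng=random.Random(0)))
```

With p_truth = 1 the mixed scheme reduces to teacher forcing; with p_truth = 0 the model is fed only its own predictions, which is the situation it faces at test time.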
During BPTT, the gradient of a late loss with respect to an early hidden state involves a long product of Jacobians of the recurrent update. Each Jacobian carries one factor of W_hh and one factor of the derivative of the activation function. If the largest singular value of W_hh is below 1 (and the activation derivative is at most 1, as with tanh), the product shrinks toward zero exponentially with the number of steps; a largest singular value above 1 is necessary, though not sufficient, for the product to grow without bound.[5][20]
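A small numerical experiment makes the exponential behavior concrete. The sketch below assumes the linear case (activation derivative equal to 1, which is the most favorable case for tanh) and a recurrent matrix built as a scaled random orthogonal matrix, so every singular value equals the scale factor; both choices are simplifications made here for illustration.

```python
import numpy as np

def jacobian_product_norm(scale, steps, size=16, seed=0):
    """Spectral norm of W_hh^steps when W_hh = scale * (random orthogonal matrix)."""
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.normal(size=(size, size)))   # all singular values of Q equal 1
    W_hh = scale * Q
    product = np.linalg.matrix_power(W_hh, steps)
    return np.linalg.norm(product, 2)

for scale in (0.9, 1.0, 1.1):
    print(scale, jacobian_product_norm(scale, steps=50))
```

In this setting the norm after k steps is the scale raised to the power k, so a 10% deviation from 1 in either direction changes the gradient magnitude by orders of magnitude within 50 steps.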
The shrinking case is the vanishing gradient problem: by the time the gradient signal reaches an early step, it is numerically indistinguishable from noise, so the network cannot learn dependencies that span many steps. Hochreiter (1991) and Bengio, Simard, and Frasconi (1994) worked the math out and showed that this is not an artifact of bad optimization; it is a property of the architecture.[5][6] Empirically, vanilla RNNs struggle with dependencies longer than about 10 to 20 steps, while LSTMs in the original 1997 paper learned to bridge gaps in excess of 1000 time steps on benchmark problems.[14]
The growing case is the exploding gradient problem, where individual updates are so large that training diverges. Pascanu, Mikolov, and Bengio's 2013 paper On the difficulty of training recurrent neural networks analyzed both regimes geometrically and proposed gradient clipping as a simple, robust fix: when the global norm of the gradient exceeds a threshold, rescale the gradient so its norm equals the threshold, then take the step.[20] The geometric intuition is that the RNN loss surface contains steep, cliff-like walls where an unclipped gradient step throws the parameters far from the current region; clipping caps the step length so the update stays nearby. Gradient clipping has been part of the standard RNN training recipe ever since.
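Clipping by global norm takes only a few lines. The following is a NumPy sketch with an illustrative function name; deep learning libraries ship their own equivalents.

```python
import numpy as np

def clip_by_global_norm(grads, threshold):
    """Rescale a list of gradient arrays so their joint L2 norm is at most `threshold`."""
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if global_norm <= threshold:
        return grads, global_norm               # small gradients pass through unchanged
    factor = threshold / global_norm
    return [g * factor for g in grads], global_norm

grads = [np.array([3.0, 0.0]), np.array([[0.0, 4.0]])]   # joint norm is 5
clipped, norm_before = clip_by_global_norm(grads, threshold=1.0)
norm_after = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
print(norm_before, norm_after)
```

Because every array is scaled by the same factor, the direction of the update is preserved and only its length changes.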
Clipping handles explosion. It does not help vanishing. The mainstream fix for vanishing is to change the architecture so that the cell state is updated additively rather than multiplicatively, which is the central idea behind LSTM and GRU. A second line of defense is to constrain the recurrent matrix to lie near the orthogonal manifold; Saxe, McClelland, and Ganguli (2014) and Henaff, Szlam, and LeCun (2016) showed that initializing W_hh to an orthogonal or unitary matrix keeps singular values near 1 and lets gradient signals propagate over hundreds of steps even in a vanilla RNN.[21][22]
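One common way to obtain an orthogonal recurrent matrix is to take the Q factor of a QR decomposition of a Gaussian matrix. The sketch below assumes NumPy; the function name is illustrative and the sign correction is a standard but optional step.

```python
import numpy as np

def orthogonal_init(size, rng):
    """Sample a random orthogonal matrix from the QR decomposition of a Gaussian matrix."""
    A = rng.normal(size=(size, size))
    Q, R = np.linalg.qr(A)
    return Q * np.sign(np.diag(R))     # flip column signs so the result does not depend on QR sign conventions

rng = np.random.default_rng(0)
W_hh = orthogonal_init(64, rng)
singular_values = np.linalg.svd(W_hh, compute_uv=False)
print(singular_values.min(), singular_values.max())   # both close to 1
```

All singular values start at 1, so at initialization the recurrent matrix neither shrinks nor amplifies the backpropagated signal; training is free to move it away from that point.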
The term "RNN" covers a family of architectures that differ in how they wire the recurrent connections, what gates they include, and what direction they read the sequence. The table summarizes the main members of the family.
| variant | year | key idea | typical use |
|---|---|---|---|
| Hopfield network | 1982 | symmetric weights, settles to fixed point | associative memory |
| Jordan network | 1986 | output fed back to hidden via context units | early sequence modeling |
| Elman / vanilla RNN | 1990 | hidden state fed back via context units | textbook baseline, short sequences |
| LSTM | 1997 | cell state (constant error carousel) with input and output gates | long-range sequence modeling |
| bidirectional RNN (Schuster & Paliwal) | 1997 | two passes, forward and reverse, concatenated | tagging, NER, speech, biology |
| LSTM with forget gate (Gers et al.) | 2000 | adds explicit forget gate to original LSTM | continual prediction tasks |
| Echo state network | 2001 | random fixed reservoir, only output trained | low-cost time series, neuromorphic |
| GRU (Cho et al.) | 2014 | update and reset gates, no separate cell state | translation, smaller models |
| Stacked / deep RNN | 1990s onward | multiple recurrent layers stacked vertically | strong sequence models |
| ConvLSTM (Shi et al.) | 2015 | convolutions inside an LSTM cell | spatiotemporal data, weather |
| IndRNN (Li et al.) | 2018 | hidden units have only self-recurrence | very deep, long-sequence RNNs |
| xLSTM (Beck et al.) | 2024 | exponential gating, scalar and matrix memory | LLM-scale sequence modeling |
| minLSTM / minGRU (Feng et al.) | 2024 | parallel-scan training, no hidden-state gating | parallel-trainable lightweight RNN |
A bidirectional RNN, introduced by Mike Schuster and Kuldip Paliwal in 1997, runs two independent recurrent passes: one left-to-right and one right-to-left. The two hidden states at each position are concatenated, so each output sees both past and future context.[23] Bidirectional networks cannot be used in strict streaming settings (you need the whole sequence before you can run the backward pass), but they are standard for tasks like phoneme recognition, named entity recognition, and biological sequence labeling.
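The two-pass structure can be sketched with a plain tanh RNN in NumPy. Function names and sizes are illustrative, and the two directions use independent parameter sets as described above.

```python
import numpy as np

def run_rnn(xs, W_xh, W_hh, b_h):
    """Hidden states of a tanh RNN, reading the inputs in the order given."""
    h = np.zeros(W_hh.shape[0])
    out = []
    for x in xs:
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)
        out.append(h)
    return np.stack(out)

def bidirectional(xs, fwd_params, bwd_params):
    """Concatenate forward and backward hidden states position by position."""
    h_fwd = run_rnn(xs, *fwd_params)
    h_bwd = run_rnn(xs[::-1], *bwd_params)[::-1]   # re-reverse so index t lines up with x_t
    return np.concatenate([h_fwd, h_bwd], axis=1)

def make_params(rng, d, h_size):
    return (rng.normal(scale=0.3, size=(h_size, d)),
            rng.normal(scale=0.3, size=(h_size, h_size)),
            np.zeros(h_size))

rng = np.random.default_rng(0)
T, d, h_size = 5, 3, 4
xs = rng.normal(size=(T, d))
fwd, bwd = make_params(rng, d, h_size), make_params(rng, d, h_size)
states = bidirectional(xs, fwd, bwd)
print(states.shape)  # (5, 8)
```

The backward half of each row depends on inputs that come later in the sequence, which is why the whole sequence must be available before any output can be produced.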
Deep or stacked RNNs simply stack multiple recurrent layers on top of one another, with the hidden states of layer k serving as the inputs to layer k+1 at the same time step. Deep recurrent stacks (often deep LSTMs) were the workhorse of large-scale speech and translation systems in the mid-2010s. The Google NMT system mentioned above is a representative example, with eight LSTM layers in both the encoder and the decoder.[8]
The GRU, introduced by Cho, van Merrienboer, Gulcehre, Bahdanau, Bougares, Schwenk, and Bengio in 2014 alongside the encoder-decoder framework for statistical machine translation, simplifies the LSTM by merging the forget and input gates into a single update gate and combining the cell state and the hidden state.[15] Chung, Gulcehre, Cho, and Bengio (2014) compared the two empirically on polyphonic music and speech signal modeling and found that GRU and LSTM achieved comparable performance, with no clear winner across tasks.[24]
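A single GRU update can be sketched in NumPy as follows. The parameter names are illustrative, and the final interpolation uses one of two conventions found in the literature (some write z * h + (1 - z) * h_cand instead); the two are equivalent up to relabeling the gate.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x, h, p):
    """One GRU update. `p` maps illustrative names to weight arrays."""
    z = sigmoid(p["W_z"] @ x + p["U_z"] @ h + p["b_z"])              # update gate
    r = sigmoid(p["W_r"] @ x + p["U_r"] @ h + p["b_r"])              # reset gate
    h_cand = np.tanh(p["W_h"] @ x + p["U_h"] @ (r * h) + p["b_h"])   # candidate state
    return (1.0 - z) * h + z * h_cand                                # blend old state and candidate

rng = np.random.default_rng(0)
d, h_size = 3, 4
p = {}
for gate in ("z", "r", "h"):
    p[f"W_{gate}"] = rng.normal(scale=0.3, size=(h_size, d))
    p[f"U_{gate}"] = rng.normal(scale=0.3, size=(h_size, h_size))
    p[f"b_{gate}"] = np.zeros(h_size)

h = np.zeros(h_size)
for x in rng.normal(size=(6, d)):
    h = gru_step(x, h, p)
n_params = sum(v.size for v in p.values())
print(h.shape, n_params)
```

There is no separate cell state: the single vector h serves as both memory and output, and the parameter total is three projections of the vanilla RNN size.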
A large 2017 study by Greff, Srivastava, Koutnik, Steunebrink, and Schmidhuber, titled LSTM: A search space odyssey, evaluated eight LSTM variants across speech, handwriting, and polyphonic music tasks using 5400 experimental runs (about fifteen years of CPU time). The study found that no variant improved significantly over the standard LSTM and that the forget gate and the output activation were the most critical components.[25] Jozefowicz, Zaremba, and Sutskever (2015) searched over ten thousand RNN architectures with an evolutionary procedure and identified some variants that beat LSTM on specific tasks but no architecture that dominated across the board, again suggesting that the LSTM design is near a local optimum.[26]
Echo state networks (Herbert Jaeger, 2001) and the closely related liquid state machines (Wolfgang Maass et al., 2002) take a very different approach: they leave the recurrent weights random and fixed, and only train a linear readout on top. This reservoir computing view sidesteps the gradient-flow problems of full RNN training, which made it attractive for analog hardware implementations and for low-power time-series applications.[27][28] It also gives up the ability to shape the recurrent dynamics to a specific task, so it has not been competitive on large datasets.
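The reservoir idea can be sketched on a toy problem. The example below assumes NumPy, a sine-wave next-value prediction task, a reservoir rescaled to spectral radius 0.9, and a ridge-regression readout; all of these are illustrative choices rather than settings from the cited papers.

```python
import numpy as np

def reservoir_states(u, W_in, W_res):
    """Drive a fixed random reservoir with a scalar input sequence u."""
    x = np.zeros(W_res.shape[0])
    states = []
    for u_t in u:
        x = np.tanh(W_in * u_t + W_res @ x)     # recurrent weights are never trained
        states.append(x)
    return np.stack(states)

rng = np.random.default_rng(0)
n = 100                                                     # reservoir size (illustrative)
W_in = rng.uniform(-0.5, 0.5, size=n)
W_res = rng.normal(size=(n, n))
W_res *= 0.9 / np.max(np.abs(np.linalg.eigvals(W_res)))     # rescale to spectral radius 0.9

# Toy task: predict the next value of a sine wave.
t = np.arange(600)
u = np.sin(0.1 * t)
X = reservoir_states(u[:-1], W_in, W_res)
y = u[1:]

washout, split = 50, 400                                    # discard the initial transient
X_train, y_train = X[washout:split], y[washout:split]
ridge = 1e-6
# Only the linear readout is trained, here with closed-form ridge regression.
W_out = np.linalg.solve(X_train.T @ X_train + ridge * np.eye(n), X_train.T @ y_train)

test_error = np.sqrt(np.mean((X[split:] @ W_out - y[split:]) ** 2))
print(test_error)
```

Training reduces to solving one linear system, with no backpropagation through time at all.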
The most influential applications of RNNs (almost always LSTMs in practice) clustered around sequential data with strong temporal structure.
| domain | typical RNN role | representative system |
|---|---|---|
| language modeling | predict next token from previous tokens | Mikolov et al. RNN-LM (2010); Penn Treebank LSTM baselines (Zaremba et al., 2014) |
| machine translation | encoder-decoder with attention | Sutskever seq2seq (2014); GNMT (Wu et al., 2016) |
| speech recognition | acoustic model and CTC decoder | Google Voice Search LSTM (2014-2015); DeepSpeech 2 (2016) |
| handwriting recognition | online stroke sequence to text | Graves and Schmidhuber multi-dimensional LSTM (2008-2009); Graves (2013) |
| music and symbolic generation | next-note prediction | Magenta MelodyRNN (2016) |
| time series forecasting | next-value prediction with context | DeepAR (Salinas et al., 2017) |
| video classification | per-frame features then RNN aggregation | LRCN (Donahue et al., 2015) |
| robotic control | policy with hidden state for partial observability | DeepMind continuous control LSTM agents |
| reinforcement learning | recurrent policies for POMDPs | A3C-LSTM (Mnih et al., 2016) |
Many of these applications used RNNs as one stage of a larger pipeline. Speech recognition systems, for example, often combined a recurrent acoustic model with connectionist temporal classification (CTC) loss and a separate language model. CTC, introduced by Graves, Fernandez, Gomez, and Schmidhuber at ICML 2006, lets a recurrent network output a sequence of labels without explicit alignment to the input timeline, by marginalizing over all possible alignments with a special blank token.[29] CTC paired with deep bidirectional LSTMs powered the end-to-end speech systems of the mid-2010s, including Baidu's DeepSpeech 2 (Amodei et al., 2016), which transcribed Mandarin Chinese short voice queries at 3.7% character error rate compared to about 4.0% for human transcribers in their evaluation.[30]
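The mapping from a frame-level CTC path to a label sequence (merge repeated symbols, then remove blanks) is easy to sketch in Python. This shows only the collapsing function, not the CTC loss itself, which sums the probabilities of all paths that collapse to the target; the function name and the `-` blank symbol are illustrative.

```python
from itertools import groupby

def ctc_collapse(path, blank="-"):
    """Map a frame-level CTC path to a label sequence: merge repeats, then drop blanks."""
    merged = [symbol for symbol, _ in groupby(path)]
    return [symbol for symbol in merged if symbol != blank]

print(ctc_collapse(list("--hh-e-ll-lo--")))  # ['h', 'e', 'l', 'l', 'o']
print(ctc_collapse(list("hheelllo")))        # ['h', 'e', 'l', 'o']
```

The second call shows why the blank is needed: without a blank between the two runs of `l`, a genuinely doubled letter cannot be distinguished from one letter held for several frames.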
Alex Graves's 2013 paper Generating sequences with recurrent neural networks extended the recurrent generative recipe to real-valued data, training a deep LSTM to synthesize realistic cursive online handwriting conditioned on text input.[31] Karpathy's 2015 character-level demonstrations of Shakespeare and Linux source generation were essentially the discrete-text counterpart.[10]
Outside speech and text, RNNs were the default architecture for video classification (with LRCN combining per-frame convolutional features and an LSTM aggregator) and for univariate and multivariate time-series forecasting (with DeepAR using stacked LSTMs to produce probabilistic forecasts for retail demand at Amazon). In reinforcement learning, recurrent policies extended actor-critic agents to partially observable environments, with the A3C-LSTM agent of Mnih et al. (2016) one of the canonical examples.
In a representative production deployment, Google reported in 2015 that its voice search system had switched to LSTM acoustic models trained with the asynchronous distributed training framework described by Sak, Senior, and Beaufays in 2014, replacing the previous hybrid system that paired a deep neural network with a hidden Markov model, and lowering word error rates while reducing latency by avoiding the long context windows that the older system needed.[7] Apple, Microsoft, and Amazon shipped comparable LSTM-based speech recognition stacks during the same period. By the late 2010s, the dominant production speech systems were largely deep bidirectional LSTM acoustic models with CTC training and an external language model, sometimes a transformer language model used for n-best rescoring.
A separate line of work used RNNs as differentiable controllers for external memory. The neural Turing machine (Graves, Wayne, and Danihelka, 2014) and the differentiable neural computer (Graves et al., 2016) coupled an LSTM controller with an external addressable memory, blurring the line between neural networks and classical computer architectures. These models were not deployed at scale, but their content-based memory addressing influenced later attention-based architectures, including the key-value formulation of attention used in the transformer.
The shift from RNNs to transformers was not driven by the math of recurrence being wrong; it was driven by hardware. Recurrent computation is inherently serial along the time axis: to compute h_t you need h_{t-1}, which means you cannot parallelize across the time dimension on a single example. Transformers compute attention over all tokens at once, which fits how GPUs and TPUs prefer to work, so for the same number of parameters and the same dataset, a transformer trains much faster than an RNN.[4]
Three limits of RNNs were widely cited as reasons for the migration:
The places where RNNs still win are the inverse of those weaknesses. Streaming inference is a natural fit, because an RNN processes one new input at a time with constant memory and constant compute per step. Latency-sensitive deployments such as on-device speech keyword spotting, real-time captioning, and embedded sensor processing still use small LSTMs or GRUs. Time-series forecasting models in production, especially when they must run continuously on long histories, often benefit from the constant-memory inference of recurrence. And research on biologically inspired or neuromorphic hardware tends to lean recurrent because brains are recurrent.
Around 2023, recurrence came back into research attention through a set of architectures that combine RNN-style sequential state with the parallel training friendliness of transformers. They tend to be called "linear recurrent" or "selective state space" models, but the lineage is clearly recurrent.
| model | year | recurrent idea | notable property |
|---|---|---|---|
| Linear attention (Katharopoulos et al.) | 2020 | attention rewritten as RNN with kernel feature maps | O(N) inference, parallel training |
| S4 (Gu, Goel, Re) | 2022 | structured diagonal-plus-low-rank state space | sequential CIFAR-10 91% accuracy, Long Range Arena state of the art |
| RetNet (Sun et al.) | 2023 | retention with parallel and recurrent forms | trained in parallel, run as RNN |
| RWKV (Peng et al.) | 2023 | linear attention with RNN-style time mixing | scaled to 14B parameters, dense RNN |
| Mamba (Gu and Dao) | 2023 | selective state space model | linear time, competitive with transformers at 3B |
| Mamba-2 (Dao and Gu) | 2024 | state space duality with attention | unifies SSMs and linear attention |
| xLSTM (Beck et al.) | 2024 | LSTM with exponential gating, matrix memory | LLM-scale recurrent baseline |
| minLSTM / minGRU (Feng et al.) | 2024 | gates without hidden-state dependence, parallel scan | classic RNNs trained 175x faster |
The roots of this revival go back further than 2023. Katharopoulos, Vyas, Pappas, and Fleuret's 2020 paper Transformers are RNNs showed that attention with a positive kernel feature map can be rewritten as a recurrence, with the running sum of key-value outer products playing the role of an RNN hidden state. That gave a linear-time inference algorithm for attention and pointed to a close correspondence between the two architectures.[32] Gu, Goel, and Re's 2022 S4 paper showed that a carefully parameterized state-space model could be computed efficiently and could solve the 16384-step Path-X task that no previous architecture had solved; on image and language modeling tasks it also performed generation roughly 60 times faster than transformers.[33]
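The equivalence between causal linear attention and a recurrence can be checked numerically. The sketch below assumes NumPy and the elu(x) + 1 feature map; function names and sizes are illustrative. It computes the same outputs once with an explicit T x T score matrix and once with a fixed-size running state.

```python
import numpy as np

def feature_map(x):
    """elu(x) + 1, a positive feature map."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention_parallel(Q, K, V):
    """Causal linear attention computed with an explicit T x T score matrix."""
    scores = np.tril(feature_map(Q) @ feature_map(K).T)     # zero out future positions
    return (scores @ V) / scores.sum(axis=1, keepdims=True)

def linear_attention_recurrent(Q, K, V):
    """The same computation as a recurrence with a fixed-size state."""
    S = np.zeros((K.shape[1], V.shape[1]))    # running sum of outer products phi(k) v^T
    z = np.zeros(K.shape[1])                  # running sum of phi(k), used for normalization
    out = []
    for q, k, v in zip(Q, K, V):
        phi_k = feature_map(k)
        S += np.outer(phi_k, v)
        z += phi_k
        phi_q = feature_map(q)
        out.append(phi_q @ S / (phi_q @ z))
    return np.stack(out)

rng = np.random.default_rng(0)
T, d_k, d_v = 6, 4, 3
Q, K, V = rng.normal(size=(T, d_k)), rng.normal(size=(T, d_k)), rng.normal(size=(T, d_v))
parallel = linear_attention_parallel(Q, K, V)
recurrent = linear_attention_recurrent(Q, K, V)
print(np.allclose(parallel, recurrent))
```

The recurrent form stores only S and z, whose sizes do not depend on the sequence length, which is what makes per-step inference cost constant.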
The Mamba paper by Albert Gu and Tri Dao (2023) made the state-space approach data-dependent: the model's transition and output matrices become functions of the current input, which lets a single model selectively remember some context and forget the rest. Mamba-3B outperforms transformers of the same size and matches transformers twice its size on language modeling, achieves about five times higher throughput than transformers, and scales linearly with sequence length, with experiments reaching one-million-token sequences.[34] Mamba-2, by Dao and Gu (2024), introduces the structured state space duality (SSD) framework, which exposes a tight mathematical connection between SSMs and a generalization of attention with structured matrices and runs faster than Mamba-1 by using matrix multiplication as the inner primitive.[35]
RWKV (Receptance, Weight, Key, Value), introduced by Bo Peng, Eric Alcaide, and roughly thirty co-authors at Findings of EMNLP 2023, takes a different angle. It rewrites attention as a linear recurrence with time-mixing and channel-mixing operations, then trains the model in parallel like a transformer while keeping a recurrent inference form. RWKV scaled to 14 billion parameters in the original release, making it the largest dense RNN trained at the time, with quality comparable to similarly sized transformers.[36]
Maximilian Beck, Korbinian Poppel, Markus Spanring, Sepp Hochreiter (the same Hochreiter who co-introduced LSTM in 1997), and co-authors revisited the LSTM in the 2024 xLSTM paper, replacing the LSTM's sigmoid gates with exponential gating and replacing the scalar cell state with either an updated scalar state plus memory mixing (sLSTM) or a matrix-valued state with a covariance update rule that is fully parallelizable (mLSTM). xLSTM models trained on roughly the same data as Llama compete with state-of-the-art transformers and state-space models in both quality and scaling behavior.[37]
Leo Feng, Frederick Tung, Mohamed Osama Ahmed, Bengio, and Hossein Hajimirsadeghi's Were RNNs all we needed? paper (Mila and Borealis AI, October 2024) goes back to the classical LSTM and GRU and asks a sharper question: how much of the architectural complexity is actually necessary? They show that if you remove the dependence of the input, forget, and update gates on the hidden state, the resulting cell can be evaluated with a parallel scan, which lets minLSTM and minGRU train 175 times faster than the original cells on sequences of length 512 and match the test loss of transformers and Mamba on character-level Shakespeare and several other benchmarks.[38] The paper's framing was that a decade of architectural innovation may have been less essential than parallel training; with parallel training added, the 1997 and 2014 cells are close to state of the art.
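The reason input-only gates permit parallel evaluation can be seen in a small NumPy sketch. It assumes a minGRU-style update h_t = (1 - z_t) * h_{t-1} + z_t * c_t, where both the gate z_t and the candidate c_t depend only on x_t, with biases omitted and h_0 = 0. The all-steps version uses cumulative products and sums as a simple stand-in for a parallel scan; that formulation is an assumption of this sketch, and it becomes numerically fragile on long sequences, where a log-space scan would be the safer choice.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def min_gru_sequential(xs, W_z, W_h):
    """Step-by-step recurrence; the gate depends only on the current input."""
    h = np.zeros(W_z.shape[0])
    out = []
    for x in xs:
        z = sigmoid(W_z @ x)
        h = (1.0 - z) * h + z * (W_h @ x)
        out.append(h)
    return np.stack(out)

def min_gru_all_steps(xs, W_z, W_h):
    """Same recurrence for all steps at once, using cumulative products and sums."""
    z = sigmoid(xs @ W_z.T)                # every gate is computable up front
    a = 1.0 - z                            # h_t = a_t * h_{t-1} + b_t
    b = z * (xs @ W_h.T)
    A = np.cumprod(a, axis=0)              # A_t = a_1 * ... * a_t
    return A * np.cumsum(b / A, axis=0)    # closed form, valid for h_0 = 0

rng = np.random.default_rng(0)
T, d, h_size = 12, 3, 4
xs = rng.normal(size=(T, d))
W_z = rng.normal(scale=0.5, size=(h_size, d))
W_h = rng.normal(scale=0.5, size=(h_size, d))
print(np.allclose(min_gru_sequential(xs, W_z, W_h), min_gru_all_steps(xs, W_z, W_h)))
```

In a classical GRU the gate at step t needs h_{t-1}, so the coefficients a_t and b_t cannot be computed ahead of time and this rewriting is unavailable.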
These models are not a return to vanilla RNNs. They borrow specific ideas (a hidden state that summarizes the past, constant per-step compute at inference, an additive update that avoids vanishing gradients) and combine them with techniques borrowed from transformers and from classical signal processing. The common motivation is that the quadratic cost of attention in sequence length is a real problem for long-context applications, and that recurrent or recurrence-like updates give a principled way to keep inference cost linear.
Whether any of these architectures will displace transformers as the default for general-purpose language modeling is still open as of 2026. What is clear is that the recurrent idea did not die in 2017; it was reformulated.
Deep learning libraries expose RNNs at two levels. The cell level (one time step) lets you write a custom unrolling loop, which is useful for unusual architectures or for research. The layer level wraps the loop and the unrolling logic, which is what most applications use.
| library | cell API | layer API | gated cells |
|---|---|---|---|
| PyTorch | torch.nn.RNNCell | torch.nn.RNN | torch.nn.LSTM, torch.nn.GRU |
| TensorFlow / Keras | tf.keras.layers.SimpleRNNCell | tf.keras.layers.SimpleRNN | tf.keras.layers.LSTM, tf.keras.layers.GRU |
| JAX (Flax) | flax.linen.SimpleCell | flax.linen.RNN (scan-based wrapper around a cell) | flax.linen.LSTMCell, flax.linen.OptimizedLSTMCell, flax.linen.GRUCell |
| MXNet | mxnet.gluon.rnn.RNNCell | mxnet.gluon.rnn.RNN | LSTM, GRU |
Production RNN training on NVIDIA GPUs typically routes through cuDNN's fused LSTM and GRU kernels, which gain a factor of two to ten in throughput by combining the eight matrix multiplies of an LSTM step into fewer, larger operations. cuDNN's bias toward fixed cell structures was one of the practical reasons LSTM and GRU dominated over more exotic recurrent variants in industry: if your custom cell does not have a cuDNN kernel, you pay for it in wall-clock time.
A few rules of thumb that survived the LSTM era:
Even at the height of their dominance, RNNs were criticized for several specific failure modes that motivated much of the research described above:
These criticisms are part of why the field moved to transformers after 2017. The 2023-2024 reappraisal, with its parallel-trainable recurrent models, is partly an attempt to keep the strengths of recurrence (constant per-step inference cost, linear scaling with sequence length) while sidestepping the optimization and parallelism problems.
For most readers the relevant deeper articles are recurrent neural network (the long-form treatment of architecture, training, and theory), LSTM (the dominant gated cell), backpropagation through time (the training algorithm), vanishing gradient (the central theoretical obstacle), bidirectional RNN (two-direction context), and sequence-to-sequence task (the encoder-decoder pattern that drove RNN adoption in NLP). For modern alternatives, see transformer, Mamba, Mamba-2, state space model, linear attention, RWKV, and xLSTM. For neighboring training topics, see gradient clipping and the exploding gradient problem.