See also: Machine learning terms
RNN is the standard abbreviation for recurrent neural network, a class of artificial neural networks in which connections between units form cycles, so the activations from one time step are fed back into the network at the next time step. That feedback loop gives the model an internal hidden state that acts as a running memory of everything it has seen so far in a sequence, which is why RNNs were the dominant architecture for sequence modeling tasks like language modeling, speech recognition, and handwriting recognition for roughly two decades. Most introductory references use "RNN" as a generic umbrella that covers vanilla Elman-style networks, gated variants such as LSTM and GRU, bidirectional networks, and stacked (deep) recurrent networks. Modern usage often contrasts RNNs with transformer-based models, which since 2017 have replaced them in most large-scale natural language processing systems.
The RNN article on this wiki is the short reference page for the abbreviation. The full article, with extensive coverage of architecture diagrams, training math, and applications, lives at recurrent neural network.
A feedforward neural network maps an input vector to an output vector in a single pass; nothing it computes for one example influences what it computes for the next. An RNN keeps a hidden state vector h_t that is updated at every time step from the current input x_t and the previous hidden state h_{t-1}. Because the same parameters are reused at each step, the network can in principle process sequences of any length using a fixed number of weights. That parameter sharing across time, together with the recurrent connection, is what defines the architecture. The network is recurrent in the strict graph-theoretic sense: if you draw it as a computation graph, there is a cycle in the connections.
The simplest RNN, often called the vanilla RNN or the Elman network, computes:
h_t = f(W_xh x_t + W_hh h_{t-1} + b_h)
y_t = g(W_hy h_t + b_y)
Here x_t is the input at step t, h_t is the new hidden state, and y_t is the optional output at that step. W_xh, W_hh, and W_hy are weight matrices; b_h and b_y are bias vectors. The function f is a nonlinearity, almost always tanh in classic work and sometimes ReLU in later variants. The function g depends on the task, often softmax for classification or identity for regression. The hidden state h_0 is initialized to a zero vector at the start of a sequence.
It is worth noting that the same matrix W_hh is multiplied into the hidden state at every step. The repeated application of one matrix is the source of the architecture's main strength (it lets a small model summarize an arbitrarily long sequence) and its main weakness (it makes long-range learning very hard, as we will see in the section on gradients).
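The two equations above translate directly into code. Here is a minimal NumPy sketch of one forward pass; the function name, shapes, and identity output are illustrative, not from a specific library:

```python
import numpy as np

def rnn_forward(xs, W_xh, W_hh, W_hy, b_h, b_y):
    """Run a vanilla (Elman) RNN over a sequence.

    xs: array of shape (T, input_dim), one input vector per time step.
    Returns the stacked per-step outputs and the final hidden state.
    """
    h = np.zeros(W_hh.shape[0])              # h_0 is the zero vector
    ys = []
    for x in xs:
        # h_t = f(W_xh x_t + W_hh h_{t-1} + b_h), with f = tanh
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)
        # y_t = g(W_hy h_t + b_y), with g = identity here
        ys.append(W_hy @ h + b_y)
    return np.stack(ys), h
```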
Different tasks expose different parts of the RNN to the loss. The patterns are usually grouped into a few categories:
| pattern | example task | description |
|---|---|---|
| one to one | image classification | trivial case, no recurrence really used |
| one to many | image captioning | single input, sequence output (decoder) |
| many to one | sentiment classification | sequence input, single output at the end |
| many to many (aligned) | per-frame phoneme labeling | sequence in, sequence out, same length |
| many to many (encoder-decoder) | machine translation | sequence in, then sequence out, different lengths |
Andrej Karpathy's 2015 essay "The Unreasonable Effectiveness of Recurrent Neural Networks" popularized this taxonomy and is still a useful one-page summary of how an RNN's interface depends on the task.
Recurrent connections in neural networks predate the term RNN. John Hopfield's 1982 paper introduced what is now called the Hopfield network, a fully connected recurrent network used as an associative memory. It is not a sequence model in the modern sense (it relaxes to fixed points rather than producing time-indexed outputs), but it established that recurrence could store information in a stable way.
Michael Jordan's 1986 "Jordan network" fed the previous output of the network back into the hidden layer at the next step, which let a small network model temporal patterns. Jeffrey Elman's 1990 paper Finding Structure in Time changed the wiring slightly, copying the previous hidden state (rather than the output) into a context layer that fed the hidden layer at the next step. The Elman network, also called the simple recurrent network or SRN, is the architecture that most textbooks now mean when they say "vanilla RNN." Elman's experiments showed that an SRN could discover word boundaries from continuous letter streams and grammatical structure from sentence streams, which made the architecture interesting to cognitive scientists as well as engineers.
Throughout the 1990s, vanilla RNNs were used for small-scale phoneme recognition, control problems, and language modeling. The decade's most important conceptual contribution was negative: Sepp Hochreiter's 1991 diploma thesis identified the vanishing gradient problem, and Yoshua Bengio, Patrice Simard, and Paolo Frasconi formalized it in their 1994 paper Learning Long-Term Dependencies with Gradient Descent is Difficult. That work showed why simple RNNs could not actually learn the long-range dependencies their architecture seemed to allow. The vanishing gradient analysis directly motivated the LSTM (Hochreiter and Jürgen Schmidhuber, 1997), which addressed the problem with gating, and indirectly the GRU (Cho et al., 2014).
The practical RNN era of the 2010s was driven by three factors: gated cells that finally trained well, large labeled datasets, and GPUs. Sutskever, Vinyals, and Le's 2014 sequence-to-sequence paper used a stack of LSTMs for English to French translation and reached a BLEU score competitive with strong phrase-based systems, which kicked off the modern era of neural machine translation. By 2015, Google had deployed LSTM-based acoustic models in Voice Search, and by 2016 the Google Neural Machine Translation system (Wu et al.) was a deep stack of eight LSTM encoder layers and eight LSTM decoder layers with attention. For about three years, LSTM was the default sequence model in industrial deep learning.
That default ended with Vaswani et al.'s 2017 paper Attention Is All You Need, which introduced the transformer. Transformers replaced recurrence with self-attention, which is parallelizable across the time dimension and scales much better on modern accelerators. By 2019 most large language models had moved off LSTMs entirely. The architecture did not disappear, but its share of the field shrank quickly.
RNNs are trained with a variant of backpropagation called backpropagation through time (BPTT). The trick is to first "unroll" the recurrent computation across the sequence: a network that processes T steps becomes, conceptually, a feedforward network with T layers that share weights. Once unrolled, ordinary backpropagation runs through the unrolled graph, and the gradient with respect to each shared weight is the sum of contributions from every time step.
For very long sequences, full BPTT becomes impractical: it stores activations for every step in memory, and the per-step gradient computation gets expensive. Truncated BPTT (TBPTT) splits the sequence into fixed-length windows (commonly 20 to 200 steps), backpropagates within each window, and carries the hidden state forward across windows without backpropagating the gradient across the boundary. TBPTT was the workhorse training method for RNN language models throughout the 2010s.
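A minimal sketch of the TBPTT pattern in PyTorch, assuming a toy next-step prediction task; the model, sizes, and data are illustrative, but detaching the state at each window boundary is the essential move:

```python
import torch
import torch.nn as nn

# Toy next-step prediction setup; model, sizes, and data are illustrative.
torch.manual_seed(0)
model = nn.LSTM(input_size=8, hidden_size=16)
readout = nn.Linear(16, 8)
opt = torch.optim.Adam(list(model.parameters()) + list(readout.parameters()))
loss_fn = nn.MSELoss()

seq = torch.randn(1000, 1, 8)    # (time, batch, features)
window = 50                      # truncation length, inside the usual 20-200 range
state = None
for start in range(0, seq.size(0) - window, window):
    x = seq[start:start + window]
    y = seq[start + 1:start + window + 1]    # targets: the next step
    if state is not None:
        # Carry the hidden state across the window boundary,
        # but detach it so no gradient flows across the boundary.
        state = tuple(s.detach() for s in state)
    out, state = model(x, state)
    loss = loss_fn(readout(out), y)
    opt.zero_grad()
    loss.backward()              # BPTT runs only within this window
    opt.step()
```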
A related technique used in sequence generation is teacher forcing: during training, the model is fed the ground-truth previous token rather than its own previous prediction, which speeds convergence at the cost of a train-test distribution mismatch known as exposure bias.
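A sketch of what teacher forcing looks like in a decoder loop, using a hypothetical one-layer GRU decoder with made-up sizes; the only load-bearing line is the one that picks the next input:

```python
import torch
import torch.nn as nn

# Hypothetical one-layer decoder; vocabulary size and dimensions are made up.
vocab, dim = 100, 32
embed = nn.Embedding(vocab, dim)
cell = nn.GRUCell(dim, dim)
project = nn.Linear(dim, vocab)

def decode(targets, teacher_forcing=True):
    """targets: (T,) LongTensor of ground-truth token ids."""
    h = torch.zeros(1, dim)
    prev = torch.zeros(1, dtype=torch.long)   # assume token id 0 is <start>
    logits = []
    for t in range(targets.size(0)):
        h = cell(embed(prev), h)
        step = project(h)
        logits.append(step)
        # Teacher forcing feeds the ground-truth token back in; free
        # running would feed the model's own prediction instead.
        prev = targets[t:t + 1] if teacher_forcing else step.argmax(dim=-1)
    return torch.cat(logits)
```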
During BPTT, the gradient of a late loss with respect to an early hidden state involves a long product of Jacobians of the recurrent update. Each Jacobian carries one factor of W_hh and one factor of the derivative of the activation function. If the spectral radius of W_hh is below 1, the product shrinks toward zero exponentially with the number of steps; if it is above 1, the product can grow without bound.
The shrinking case is the vanishing gradient problem: by the time the gradient signal reaches an early step, it is numerically indistinguishable from noise, so the network cannot learn dependencies that span many steps. Hochreiter (1991) and Bengio, Simard, and Frasconi (1994) worked the math out and showed that this is not an artifact of bad optimization; it is a property of the architecture. Empirically, vanilla RNNs struggle with dependencies longer than about 10 to 20 steps.
The growing case is the exploding gradient problem, where individual updates are so large that training diverges. Pascanu, Mikolov, and Bengio's 2013 paper On the Difficulty of Training Recurrent Neural Networks analyzed both regimes geometrically and proposed gradient clipping as a simple, robust fix: when the global norm of the gradient exceeds a threshold, rescale the gradient so its norm equals the threshold, then take the step. Gradient clipping has been part of the standard RNN training recipe ever since.
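The clipping rule is only a few lines. A sketch of the global-norm version, alongside the PyTorch utility that implements the same rule:

```python
import torch

def clip_global_norm(params, threshold):
    """If the global gradient norm exceeds threshold, rescale it to threshold."""
    grads = [p.grad for p in params if p.grad is not None]
    total = torch.sqrt(sum((g ** 2).sum() for g in grads))
    if total > threshold:
        for g in grads:
            g.mul_(threshold / total)

# PyTorch ships the same rule as a built-in utility:
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=threshold)
```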
Clipping handles explosion. It does not help vanishing. The mainstream fix for vanishing is to change the architecture so that the cell state is updated additively rather than multiplicatively, which is the central idea behind LSTM and GRU.
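To make the additive update concrete, here is a minimal NumPy sketch of one LSTM step; the stacked 4*d parameter layout is an illustrative convention, not the only one in use:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step; W, U, b stack the four gates' parameters (4*d rows)."""
    d = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    i = sigmoid(z[0 * d:1 * d])   # input gate
    f = sigmoid(z[1 * d:2 * d])   # forget gate
    o = sigmoid(z[2 * d:3 * d])   # output gate
    g = np.tanh(z[3 * d:4 * d])   # candidate update
    # The additive update: the cell state is scaled and added to, not
    # repeatedly multiplied by a recurrent weight matrix, so gradients
    # can flow along the "+" path over many steps.
    c = f * c_prev + i * g
    h = o * np.tanh(c)
    return h, c
```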
The term "RNN" covers a family of architectures that differ in how they wire the recurrent connections, what gates they include, and what direction they read the sequence. The table summarizes the main members of the family.
| variant | year | key idea | typical use |
|---|---|---|---|
| Hopfield network | 1982 | symmetric weights, settles to fixed point | associative memory |
| Jordan network | 1986 | output fed back to hidden via context units | early sequence modeling |
| Elman / vanilla RNN | 1990 | hidden state fed back via context units | textbook baseline, short sequences |
| LSTM | 1997 | cell state with input, forget, output gates | long-range sequence modeling |
| bidirectional RNN (Schuster & Paliwal) | 1997 | two passes, forward and reverse, concatenated | tagging, NER, speech, biology |
| Echo state network | 2001 | random fixed reservoir, only output trained | low-cost time series, neuromorphic |
| GRU (Cho et al.) | 2014 | update and reset gates, no separate cell state | translation, smaller models |
| Stacked / deep RNN | 1990s onward | multiple recurrent layers stacked vertically | strong sequence models |
| ConvLSTM (Shi et al.) | 2015 | convolutions inside an LSTM cell | spatiotemporal data, weather |
| IndRNN (Li et al.) | 2018 | hidden units have only self-recurrence | very deep, long-sequence RNNs |
| xLSTM (Beck et al.) | 2024 | exponential gating, scalar and matrix memory | LLM-scale sequence modeling |
A bidirectional RNN, introduced by Mike Schuster and Kuldip Paliwal in 1997, runs two independent recurrent passes: one left-to-right and one right-to-left. The two hidden states at each position are concatenated, so each output sees both past and future context. Bidirectional networks cannot be used in strict streaming settings (you need the whole sequence before you can run the backward pass), but they are standard for tasks like phoneme recognition, named entity recognition, and biological sequence labeling.
Deep or stacked RNNs simply stack multiple recurrent layers on top of one another, with the hidden states of layer k serving as the inputs to layer k+1 at the same time step. Deep recurrent stacks (often deep LSTMs) were the workhorse of large-scale speech and translation systems in the mid-2010s. The Google NMT system mentioned above is a representative example.
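Both stacking and bidirectionality are single options on the standard layer classes. A PyTorch sketch with illustrative sizes:

```python
import torch
import torch.nn as nn

# Two stacked layers, each run in both directions; sizes are illustrative.
rnn = nn.LSTM(input_size=64, hidden_size=128,
              num_layers=2,          # stacked: layer k feeds layer k+1
              bidirectional=True)    # forward and backward passes
x = torch.randn(35, 8, 64)           # (time, batch, features)
out, (h_n, c_n) = rnn(x)
print(out.shape)  # torch.Size([35, 8, 256]): the two directions concatenated
```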
Echo state networks (Herbert Jaeger, 2001) and the closely related liquid state machines (Wolfgang Maass et al., 2002) take a very different approach: they leave the recurrent weights random and fixed, and only train a linear readout on top. This reservoir computing view sidesteps the gradient-flow problems of full RNN training, which made it attractive for analog hardware implementations and for low-power time-series applications. It also gives up the ability to shape the recurrent dynamics to a specific task, so it has not been competitive on large datasets.
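A minimal reservoir-computing sketch in NumPy; the reservoir size, spectral-radius target, toy signal, and ridge strength are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_res = 1, 200
W_in = 0.5 * rng.normal(size=(n_res, n_in))
W = rng.normal(size=(n_res, n_res))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))   # spectral radius about 0.9

def run_reservoir(u):
    """u: (T, n_in) inputs -> (T, n_res) reservoir states."""
    h = np.zeros(n_res)
    states = []
    for x in u:
        h = np.tanh(W_in @ x + W @ h)   # recurrent weights are never trained
        states.append(h)
    return np.stack(states)

# Train only the linear readout, by ridge regression, to predict the next value.
u = np.sin(np.linspace(0, 60, 600))[:, None]
H, y = run_reservoir(u[:-1]), u[1:]
W_out = np.linalg.solve(H.T @ H + 1e-6 * np.eye(n_res), H.T @ y)
pred = H @ W_out                      # one-step-ahead predictions
```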
The most influential applications of RNNs (almost always LSTMs in practice) clustered around sequential data with strong temporal structure.
| domain | typical RNN role | representative system |
|---|---|---|
| language modeling | predict next token from previous tokens | Mikolov et al. RNN-LM (2010); Penn Treebank LSTM baselines |
| machine translation | encoder-decoder with attention | Sutskever seq2seq (2014); GNMT (Wu et al., 2016) |
| speech recognition | acoustic model and CTC decoder | Google Voice Search LSTM (2015); DeepSpeech 2 (2016) |
| handwriting recognition | online stroke sequence to text | Graves et al. multi-dimensional LSTM (2009) |
| music and symbolic generation | next-note prediction | Magenta MelodyRNN (2016) |
| time series forecasting | next-value prediction with context | DeepAR (Salinas et al., 2017) |
| video classification | per-frame features then RNN aggregation | LRCN (Donahue et al., 2015) |
| robotic control | policy with hidden state for partial observability | DeepMind continuous control LSTM agents |
| reinforcement learning | recurrent policies for POMDPs | A3C-LSTM (Mnih et al., 2016) |
Many of these applications used RNNs as one stage of a larger pipeline. Speech recognition systems, for example, often combined a recurrent acoustic model with connectionist temporal classification (CTC) loss and a separate language model.
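A schematic of the CTC piece using PyTorch's built-in loss; the shapes follow the torch.nn.CTCLoss convention, while the sizes themselves are arbitrary:

```python
import torch
import torch.nn as nn

# Shapes follow torch.nn.CTCLoss conventions; the sizes are arbitrary.
T, N, C = 50, 4, 28    # time steps, batch size, classes (class 0 = blank)
log_probs = torch.randn(T, N, C).log_softmax(dim=-1)
targets = torch.randint(1, C, (N, 10))               # labels, no blanks
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 10, dtype=torch.long)
ctc = nn.CTCLoss(blank=0)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
```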
The shift from RNNs to transformers was not driven by the math of recurrence being wrong; it was driven by hardware. Recurrent computation is inherently serial along the time axis: to compute h_t you need h_{t-1}, which means you cannot parallelize across the time dimension on a single example. Transformers compute attention over all tokens at once, which fits how GPUs and TPUs prefer to work, so for the same number of parameters and the same dataset, a transformer trains much faster than an RNN.
Three limits of RNNs were widely cited as reasons for the migration: training is serial along the time axis, so it cannot exploit the full parallelism of modern accelerators; even gated cells have trouble carrying information across very long contexts, since everything must pass through a fixed-size hidden state; and that fixed-size state is an information bottleneck compared with attention, which can look directly at any earlier position.
The places where RNNs still win are the inverse of those weaknesses. Streaming inference is a natural fit, because an RNN processes one new input at a time with constant memory and constant compute per step. Latency-sensitive deployments such as on-device speech keyword spotting, real-time captioning, and embedded sensor processing still use small LSTMs or GRUs. Time-series forecasting models in production, especially when they must run continuously on long histories, often benefit from the constant-memory inference of recurrence. And research on biologically inspired or neuromorphic hardware tends to lean recurrent because brains are recurrent.
Around 2023, recurrence came back into research focus through a set of architectures that combine RNN-style sequential state with the parallel-training friendliness of transformers. They tend to be called "linear recurrent" or "selective state space" models, but the lineage is clearly recurrent.
| model | year | recurrent idea | notable property |
|---|---|---|---|
| Linear Transformer (Katharopoulos et al.) | 2020 | attention rewritten as RNN with kernel feature maps | O(N) inference, parallel training |
| RetNet (Sun et al.) | 2023 | retention with parallel and recurrent forms | trained in parallel, run as RNN |
| RWKV (Peng et al.) | 2023 | linear attention with RNN-style time mixing | scaled to 14B parameters, dense RNN |
| Mamba (Gu and Dao) | 2023 | selective state space model | linear time, competitive with transformers at 3B |
| Mamba-2 (Dao and Gu) | 2024 | state space duality with attention | unifies SSMs and linear attention |
| xLSTM (Beck et al.) | 2024 | LSTM with exponential gating, matrix memory | LLM-scale recurrent baseline |
These models are not a return to vanilla RNNs. They borrow specific ideas (a hidden state that summarizes the past, constant per-step compute at inference, an additive update that avoids vanishing gradients) and combine them with techniques borrowed from transformers and from classical signal processing. The common motivation is that the quadratic cost of attention in sequence length is a real problem for long-context applications, and that recurrent or recurrence-like updates give a principled way to keep inference cost linear.
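The "parallel training, recurrent inference" property is easiest to see on a scalar linear recurrence. A NumPy sketch with illustrative values; real models use numerically stabler scan formulations:

```python
import numpy as np

a, T = 0.9, 8
x = np.random.default_rng(0).normal(size=T)

# Sequential (RNN-style) evaluation: constant memory and compute per step.
h, seq = 0.0, []
for t in range(T):
    h = a * h + x[t]
    seq.append(h)

# Parallel evaluation: h_t = a^t * sum_{s<=t} x_s / a^s, one cumulative sum.
powers = a ** np.arange(T)
par = powers * np.cumsum(x / powers)

assert np.allclose(seq, par)
```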
Whether any of these architectures will displace transformers as the default for general-purpose language modeling is still open as of 2026. What is clear is that the recurrent idea did not die in 2017; it was reformulated.
Deep learning libraries expose RNNs at two levels. The cell level (one time step) lets you write a custom unrolling loop, which is useful for unusual architectures or for research. The layer level wraps the loop and the unrolling logic, which is what most applications use.
| library | cell API | layer API | gated cells |
|---|---|---|---|
| PyTorch | torch.nn.RNNCell | torch.nn.RNN | torch.nn.LSTM, torch.nn.GRU |
| TensorFlow / Keras | tf.keras.layers.SimpleRNNCell | tf.keras.layers.SimpleRNN | tf.keras.layers.LSTM, tf.keras.layers.GRU |
| JAX (Flax) | flax.linen.OptimizedLSTMCell | flax.linen.scan-based RNNs | LSTM, GRU cells |
| MXNet | mxnet.gluon.rnn.RNNCell | mxnet.gluon.rnn.RNN | LSTM, GRU |
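As a concrete illustration of the two levels in PyTorch (sizes illustrative):

```python
import torch
import torch.nn as nn

x = torch.randn(10, 3, 8)    # (time, batch, features)

# Layer level: the loop over time lives inside the module.
layer = nn.RNN(input_size=8, hidden_size=16)
out_layer, h_layer = layer(x)

# Cell level: one step at a time, with an explicit Python loop --
# useful when the unrolling logic itself needs customizing.
cell = nn.RNNCell(input_size=8, hidden_size=16)
h = torch.zeros(3, 16)
outs = []
for x_t in x:                # iterate over the time axis
    h = cell(x_t, h)
    outs.append(h)
out_cell = torch.stack(outs)
```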
Production RNN training on NVIDIA GPUs typically routes through cuDNN's fused LSTM and GRU kernels, which gain a factor of two to ten in throughput by batching the eight matrix multiplies of an LSTM step into a few large operations. cuDNN's bias toward fixed structure was one of the practical reasons LSTM and GRU dominated over more exotic recurrent variants in industry: if your custom cell does not have a cuDNN kernel, you pay for it in wall-clock time.
A few rules of thumb that survived the LSTM era: clip the global gradient norm, with thresholds around 1 to 5; initialize the forget gate bias to 1 or higher so the cell starts out remembering; apply dropout only to the non-recurrent connections, or use a variational scheme that reuses one mask across time steps; and prefer LSTM or GRU over a vanilla cell for any dependency longer than a couple dozen steps.
For most readers the relevant deeper articles are recurrent neural network (the long-form treatment of architecture, training, and theory), LSTM (the dominant gated cell), GRU (the simpler alternative), backpropagation through time (the training algorithm), vanishing gradient (the central theoretical obstacle), bidirectional RNN (two-direction context), and sequence-to-sequence task (the encoder-decoder pattern that drove RNN adoption in NLP). For modern alternatives, see transformer, Mamba, state space model, and RWKV.