RNN

See also: Machine learning terms

RNN is the standard abbreviation for recurrent neural network, a class of artificial neural network in which connections between units form cycles, so the activations from one time step are fed back into the network at the next time step. That feedback loop gives the model an internal hidden state that acts as a running, lossy summary of what it has seen so far in a sequence, which is why RNNs were a standard choice for sequence modeling tasks like language modeling, speech recognition, and handwriting recognition and, in their gated forms, the dominant architecture for much of the 2010s. Most introductory references use "RNN" as a generic umbrella that covers vanilla Elman-style networks, gated variants such as LSTM and GRU, bidirectional networks, and stacked (deep) recurrent networks. Modern usage often contrasts RNNs with transformer-based models, which since 2017 have replaced them in most large-scale natural language processing systems.

The RNN article on this wiki is the short reference page for the abbreviation. The full article, with extensive coverage of architecture diagrams, training math, and applications, lives at recurrent neural network.

what an RNN actually is

A feedforward neural network maps an input vector to an output vector in a single pass; nothing it computes for one example influences what it computes for the next. An RNN keeps a hidden state vector h_t that is updated at every time step from the current input x_t and the previous hidden state h_{t-1}. Because the same parameters are reused at each step, the network can in principle process sequences of any length using a fixed number of weights. That parameter sharing across time, together with the recurrent connection, is what defines the architecture. The network is recurrent in the strict graph-theoretic sense: if you draw it as a computation graph, there is a cycle in the connections.

The simplest RNN, often called the vanilla RNN or the Elman network, computes:

h_t = f(W_xh x_t + W_hh h_{t-1} + b_h)
y_t = g(W_hy h_t + b_y)

Here x_t is the input at step t, h_t is the new hidden state, and y_t is the optional output at that step. W_xh, W_hh, and W_hy are weight matrices; b_h and b_y are bias vectors. The function f is a nonlinearity, almost always tanh in classic work and sometimes ReLU in later variants. The function g depends on the task, often softmax for classification or identity for regression. The hidden state h_0 is initialized to a zero vector at the start of a sequence.
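The two update equations can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: it assumes f = tanh, g = identity, a zero initial state, and small random weights; the function name rnn_forward and all dimensions are made up for the example.

```python
import numpy as np

def rnn_forward(xs, W_xh, W_hh, W_hy, b_h, b_y):
    """Run a vanilla RNN over xs with shape (T, input_dim)."""
    h = np.zeros(W_hh.shape[0])                     # h_0 = 0
    hs, ys = [], []
    for x_t in xs:
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)    # f = tanh
        ys.append(W_hy @ h + b_y)                   # g = identity (regression)
        hs.append(h)
    return np.stack(hs), np.stack(ys)

rng = np.random.default_rng(0)
T, input_dim, hidden_dim, output_dim = 7, 3, 5, 2
W_xh = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
W_hy = rng.normal(scale=0.1, size=(output_dim, hidden_dim))
b_h = np.zeros(hidden_dim)
b_y = np.zeros(output_dim)
xs = rng.normal(size=(T, input_dim))

hs, ys = rnn_forward(xs, W_xh, W_hh, W_hy, b_h, b_y)
print(hs.shape, ys.shape)   # (7, 5) (7, 2)
```

The same three weight matrices are used at every step of the loop, which is the parameter sharing across time described above.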

It is worth noticing that the same matrix W_hh is multiplied into the hidden state at every step. The repeated application of one matrix is the source of the architecture's main strength (it lets a small model summarize an arbitrarily long sequence) and its main weakness (it makes long-range learning very hard, as we will see in the section on gradients).

inputs, outputs, and sequence patterns

Different tasks expose different parts of the RNN to the loss. The patterns are usually grouped into a few categories:

pattern	example task	description
one to one	image classification	trivial case, no recurrence really used
one to many	image captioning	single input, sequence output (decoder)
many to one	sentiment classification	sequence input, single output at the end
many to many (aligned)	per-frame phoneme labeling	sequence in, sequence out, same length
many to many (encoder-decoder)	machine translation	sequence in, then sequence out, different lengths

Andrej Karpathy's 2015 essay "The Unreasonable Effectiveness of Recurrent Neural Networks" popularized this taxonomy and is still a useful one-page summary of how an RNN's interface depends on the task.
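As a rough illustration of how the patterns differ only in which hidden states reach the output layer, the sketch below runs one recurrent loop and reads it out in two ways. The helper hidden_states and the weight shapes are assumptions for the example, and biases are omitted for brevity.

```python
import numpy as np

def hidden_states(xs, W_xh, W_hh):
    """Return the hidden state after every step, shape (T, hidden_dim)."""
    h = np.zeros(W_hh.shape[0])
    out = []
    for x_t in xs:
        h = np.tanh(W_xh @ x_t + W_hh @ h)
        out.append(h)
    return np.stack(out)

rng = np.random.default_rng(0)
T, input_dim, hidden_dim, n_classes = 6, 4, 8, 3
W_xh = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
W_hy = rng.normal(scale=0.1, size=(n_classes, hidden_dim))
xs = rng.normal(size=(T, input_dim))

hs = hidden_states(xs, W_xh, W_hh)

many_to_many = hs @ W_hy.T      # aligned: one output per time step, shape (T, n_classes)
many_to_one = hs[-1] @ W_hy.T   # only the final state is read out, shape (n_classes,)
print(many_to_many.shape, many_to_one.shape)   # (6, 3) (3,)
```

In the many-to-one case the loss only sees the last hidden state, so every earlier input influences the output solely through the recurrent connection.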

a brief history

Recurrent connections in neural networks predate the term RNN. John Hopfield's 1982 paper introduced what is now called the Hopfield network, a fully connected recurrent network used as an associative memory. It is not a sequence model in the modern sense (it relaxes to fixed points rather than producing time-indexed outputs), but it established that recurrence could store information in a stable way.

Michael Jordan's 1986 "Jordan network" fed the previous output of the network back into the hidden layer at the next step, which let a small network model temporal patterns. Jeffrey Elman's 1990 paper Finding Structure in Time changed the wiring slightly, copying the previous hidden state (rather than the output) into a context layer that fed the hidden layer at the next step. The Elman network, also called the simple recurrent network or SRN, is the architecture that most textbooks now mean when they say "vanilla RNN." Elman's experiments showed that an SRN could discover word boundaries from continuous letter streams and grammatical structure from sentence streams, which made the architecture interesting to cognitive scientists as well as engineers.

Throughout the 1990s, vanilla RNNs were used for small-scale phoneme recognition, control problems, and language modeling. The decade's most important conceptual contribution was negative: Sepp Hochreiter's 1991 diploma thesis identified the vanishing gradient problem, and Yoshua Bengio, Patrice Simard, and Paolo Frasconi formalized it in their 1994 paper Learning Long-Term Dependencies with Gradient Descent is Difficult. That work showed why simple RNNs could not actually learn the long-range dependencies their architecture seemed to allow. The vanishing gradient analysis directly motivated the LSTM (Hochreiter and Jurgen Schmidhuber, 1997), which addressed the problem with gating, and indirectly the GRU (Cho et al., 2014).

The practical RNN era of the 2010s was driven by three factors: gated cells that finally trained well, large labeled datasets, and GPUs. Sutskever, Vinyals, and Le's 2014 sequence-to-sequence paper used a stack of LSTMs for English to French translation and reached a BLEU score competitive with strong phrase-based systems, which kicked off the modern era of neural machine translation. By 2015, Google had deployed LSTM-based acoustic models in Voice Search, and by 2016 the Google Neural Machine Translation system (Wu et al.) was a deep stack of eight LSTM encoder layers and eight LSTM decoder layers with attention. For about three years, LSTM was the default sequence model in industrial deep learning.

That default ended with Vaswani et al.'s 2017 paper Attention Is All You Need, which introduced the transformer. Transformers replaced recurrence with self-attention, which is parallelizable across the time dimension and scales much better on modern accelerators. By 2019 most large language models had moved off LSTMs entirely. The architecture did not disappear, but its share of the field shrank quickly.

training an RNN

RNNs are trained with a variant of backpropagation called backpropagation through time (BPTT). The trick is to first "unroll" the recurrent computation across the sequence: a network that processes T steps becomes, conceptually, a feedforward network with T layers that share weights. Once unrolled, ordinary backpropagation runs through the unrolled graph, and the gradient with respect to each shared weight is the sum of contributions from every time step.
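A minimal NumPy sketch of BPTT for the vanilla RNN is shown below. It assumes a squared-error loss summed over time steps, an identity output function, and no bias terms; the function name bptt is illustrative. The backward loop walks the unrolled graph from the last step to the first and accumulates each shared weight's gradient across steps.

```python
import numpy as np

def bptt(xs, targets, W_xh, W_hh, W_hy):
    """Loss and gradients for a bias-free tanh RNN with squared-error loss."""
    T, H = len(xs), W_hh.shape[0]
    hs = np.zeros((T + 1, H))                  # hs[0] is h_0; hs[t + 1] follows input t
    for t in range(T):
        hs[t + 1] = np.tanh(W_xh @ xs[t] + W_hh @ hs[t])
    ys = hs[1:] @ W_hy.T
    loss = 0.5 * np.sum((ys - targets) ** 2)

    dW_xh, dW_hh, dW_hy = (np.zeros_like(W) for W in (W_xh, W_hh, W_hy))
    dh_next = np.zeros(H)                      # gradient arriving from later steps
    for t in reversed(range(T)):
        dy = ys[t] - targets[t]
        dW_hy += np.outer(dy, hs[t + 1])
        dh = W_hy.T @ dy + dh_next             # from this step's output and from the future
        da = dh * (1.0 - hs[t + 1] ** 2)       # back through tanh
        dW_xh += np.outer(da, xs[t])           # shared weights: sum over time steps
        dW_hh += np.outer(da, hs[t])
        dh_next = W_hh.T @ da                  # pass the gradient to step t - 1
    return loss, dW_xh, dW_hh, dW_hy

rng = np.random.default_rng(0)
T, D, H, O = 6, 3, 4, 2
W_xh = rng.normal(scale=0.5, size=(H, D))
W_hh = rng.normal(scale=0.5, size=(H, H))
W_hy = rng.normal(scale=0.5, size=(O, H))
xs = rng.normal(size=(T, D))
targets = rng.normal(size=(T, O))

loss, dW_xh, dW_hh, dW_hy = bptt(xs, targets, W_xh, W_hh, W_hy)
print(loss, dW_hh.shape)
```

A finite-difference comparison on a single weight entry is a cheap way to check a hand-written backward pass like this one.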

For very long sequences, full BPTT becomes impractical: it stores activations for every step in memory, and the per-step gradient computation gets expensive. Truncated BPTT (TBPTT) splits the sequence into fixed-length windows (commonly 20 to 200 steps), backpropagates within each window, and carries the hidden state forward across windows without backpropagating the gradient across the boundary. TBPTT was the workhorse training method for RNN language models throughout the 2010s.
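The windowing logic of truncated BPTT can be sketched as follows. This is an illustration under simplifying assumptions (bias-free tanh RNN, squared-error loss, plain SGD, one update per window); window_pass and train_tbptt are hypothetical names. The key detail is that the hidden state returned from one window is fed into the next as a constant, so no gradient crosses the boundary.

```python
import numpy as np

def window_pass(xs, targets, h0, params):
    """Forward and backward over one window; gradients stop at h0."""
    W_xh, W_hh, W_hy = params
    T, H = len(xs), len(h0)
    hs = np.zeros((T + 1, H))
    hs[0] = h0
    for t in range(T):
        hs[t + 1] = np.tanh(W_xh @ xs[t] + W_hh @ hs[t])
    ys = hs[1:] @ W_hy.T
    loss = 0.5 * np.sum((ys - targets) ** 2)
    grads = [np.zeros_like(p) for p in params]
    dh_next = np.zeros(H)
    for t in reversed(range(T)):
        dy = ys[t] - targets[t]
        grads[2] += np.outer(dy, hs[t + 1])
        da = (W_hy.T @ dy + dh_next) * (1.0 - hs[t + 1] ** 2)
        grads[0] += np.outer(da, xs[t])
        grads[1] += np.outer(da, hs[t])
        dh_next = W_hh.T @ da
    return loss, grads, hs[-1].copy()          # final state is carried on as a constant

def train_tbptt(xs, targets, params, window=20, lr=0.01):
    h = np.zeros(params[1].shape[0])
    losses = []
    for start in range(0, len(xs), window):
        sl = slice(start, start + window)
        loss, grads, h = window_pass(xs[sl], targets[sl], h, params)
        params = [p - lr * g for p, g in zip(params, grads)]   # one SGD step per window
        losses.append(loss)
    return params, losses, h

rng = np.random.default_rng(0)
T, D, H, O = 50, 3, 4, 2
params = [rng.normal(scale=0.3, size=s) for s in ((H, D), (H, H), (O, H))]
xs = rng.normal(size=(T, D))
targets = rng.normal(size=(T, O))

new_params, losses, h_last = train_tbptt(xs, targets, params, window=20, lr=0.01)
print(len(losses))   # 3 windows: 20 + 20 + 10 steps
```

Memory use is bounded by the window length rather than the full sequence length, at the cost of never seeing gradient signal for dependencies longer than one window.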

A related technique used in sequence generation is teacher forcing: during training, the model is fed the ground-truth previous token rather than its own previous prediction, which speeds convergence at the cost of a train-test distribution mismatch known as exposure bias.
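The difference between teacher forcing and free-running decoding comes down to one line in the decoding loop. The toy decoder below is an assumption made for illustration (an embedding table E, greedy argmax predictions, and token 0 as the start symbol); it is not drawn from any particular system.

```python
import numpy as np

def decode(targets, E, W_hh, W_hy, bos=0, teacher_forcing=True):
    """Return the tokens fed to the decoder and the tokens it predicted."""
    h = np.zeros(W_hh.shape[0])
    prev = bos
    inputs, preds = [], []
    for gold in targets:
        inputs.append(prev)
        h = np.tanh(E[prev] + W_hh @ h)
        pred = int(np.argmax(W_hy @ h))
        preds.append(pred)
        # Teacher forcing feeds the ground truth; free running feeds the model's own guess.
        prev = gold if teacher_forcing else pred
    return inputs, preds

rng = np.random.default_rng(0)
vocab, hidden = 5, 8
E = rng.normal(size=(vocab, hidden))
W_hh = rng.normal(scale=0.3, size=(hidden, hidden))
W_hy = rng.normal(size=(vocab, hidden))
targets = [3, 1, 4, 1, 2]

inputs_tf, preds_tf = decode(targets, E, W_hh, W_hy, teacher_forcing=True)
inputs_free, preds_free = decode(targets, E, W_hh, W_hy, teacher_forcing=False)
print(inputs_tf)     # [0, 3, 1, 4, 1]: the start symbol followed by the shifted targets
```

At test time only the free-running path is available, which is the source of the exposure bias mentioned above.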

vanishing and exploding gradients

During BPTT, the gradient of a late loss with respect to an early hidden state involves a long product of Jacobians of the recurrent update. Each Jacobian carries one factor of W_hh and one factor of the derivative of the activation function. Roughly, if the largest singular value of W_hh is below 1 (with an activation such as tanh, whose derivative is at most 1), the product shrinks toward zero exponentially with the number of steps; if the spectral radius of W_hh is above 1, the product can grow exponentially.
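The exponential behavior is easy to see numerically. The sketch below isolates the effect of W_hh by assuming a linear recurrence (activation derivative fixed at 1) and a scaled orthogonal matrix, for which every singular value and every eigenvalue magnitude equals the scale; both simplifications are choices made for the example.

```python
import numpy as np

def jacobian_product_norm(W_hh, steps):
    """Spectral norm of the product of `steps` Jacobians of h_t = W_hh h_{t-1}."""
    prod = np.eye(W_hh.shape[0])
    for _ in range(steps):
        prod = W_hh.T @ prod        # one more step of backpropagation through time
    return np.linalg.norm(prod, 2)

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(8, 8)))   # orthogonal: all singular values are 1

for scale in (0.9, 1.0, 1.1):
    norms = [jacobian_product_norm(scale * Q, t) for t in (10, 50)]
    print(scale, norms)   # shrinks for 0.9, stays at 1 for 1.0, grows for 1.1
```

With a saturating activation such as tanh, each Jacobian picks up an extra factor of at most 1, which pushes the product further toward the vanishing regime.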

The shrinking case is the vanishing gradient problem: by the time the gradient signal reaches an early step, it is numerically indistinguishable from noise, so the network cannot learn dependencies that span many steps. Hochreiter (1991) and Bengio, Simard, and Frasconi (1994) worked the math out and showed that this is not an artifact of bad optimization; it is a property of the architecture. Empirically, vanilla RNNs struggle with dependencies longer than about 10 to 20 steps.

The growing case is the exploding gradient problem, where individual updates are so large that training diverges. Pascanu, Mikolov, and Bengio's 2013 paper On the Difficulty of Training Recurrent Neural Networks analyzed both regimes geometrically and proposed gradient clipping as a simple, robust fix: when the global norm of the gradient exceeds a threshold, rescale the gradient so its norm equals the threshold, then take the step. Gradient clipping has been part of the standard RNN training recipe ever since.
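Global-norm clipping as described above fits in a few lines. This is a minimal sketch; the function name clip_by_global_norm and the choice to return the pre-clip norm alongside the gradients are conveniences for the example.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their joint L2 norm is at most max_norm."""
    total = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total > max_norm:
        scale = max_norm / total
        grads = [g * scale for g in grads]   # direction preserved, length capped
    return grads, total

grads = [np.array([3.0, 0.0]), np.array([[0.0, 4.0]])]   # joint norm is 5
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
print(norm_before, norm_after)   # 5.0 1.0
```

Because all arrays are scaled by the same factor, the update keeps its direction; only its length changes.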

Clipping handles explosion. It does not help vanishing. The mainstream fix for vanishing is to change the architecture so that the cell state is updated additively rather than multiplicatively, which is the central idea behind LSTM and GRU.
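To make the additive update concrete, here is a sketch of one LSTM step in its common textbook form. The stacked weight layout and the gate order (input, forget, output, candidate) are assumptions for the example; libraries differ in how they pack the gates.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step. W: (4H, D), U: (4H, H), b: (4H,)."""
    H = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    i = sigmoid(z[:H])              # input gate
    f = sigmoid(z[H:2 * H])         # forget gate
    o = sigmoid(z[2 * H:3 * H])     # output gate
    g = np.tanh(z[3 * H:])          # candidate values
    c = f * c_prev + i * g          # additive cell update
    h = o * np.tanh(c)
    return h, c

rng = np.random.default_rng(0)
D, H = 3, 4
W = rng.normal(scale=0.1, size=(4 * H, D))
U = rng.normal(scale=0.1, size=(4 * H, H))
b = np.zeros(4 * H)
x = rng.normal(size=D)
h_prev = np.zeros(H)
c_prev = rng.normal(size=H)

h, c = lstm_step(x, h_prev, c_prev, W, U, b)
print(h.shape, c.shape)   # (4,) (4,)
```

When the forget gate is near 1 and the input gate near 0, the cell state passes through a step almost unchanged, so the gradient along the cell path is not repeatedly multiplied by W_hh-like factors.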

variants

The term "RNN" covers a family of architectures that differ in how they wire the recurrent connections, what gates they include, and what direction they read the sequence. The table summarizes the main members of the family.

variant	year	key idea	typical use
Hopfield network	1982	symmetric weights, settles to fixed point	associative memory
Jordan network	1986	output fed back to hidden via context units	early sequence modeling
Elman / vanilla RNN	1990	hidden state fed back via context units	textbook baseline, short sequences
LSTM	1997	cell state with input, forget, output gates	long-range sequence modeling
bidirectional RNN (Schuster & Paliwal)	1997	two passes, forward and reverse, concatenated	tagging, NER, speech, biology
Echo state network	2001	random fixed reservoir, only output trained	low-cost time series, neuromorphic
GRU (Cho et al.)	2014	update and reset gates, no separate cell state	translation, smaller models
Stacked / deep RNN	1990s onward	multiple recurrent layers stacked vertically	strong sequence models
ConvLSTM (Shi et al.)	2015	convolutions inside an LSTM cell	spatiotemporal data, weather
IndRNN (Li et al.)	2018	hidden units have only self-recurrence	very deep, long-sequence RNNs
xLSTM (Beck et al.)	2024	exponential gating, scalar and matrix memory	LLM-scale sequence modeling

A bidirectional RNN, introduced by Mike Schuster and Kuldip Paliwal in 1997, runs two independent recurrent passes: one left-to-right and one right-to-left. The two hidden states at each position are concatenated, so each output sees both past and future context. Bidirectional networks cannot be used in strict streaming settings (you need the whole sequence before you can run the backward pass), but they are standard for tasks like phoneme recognition, named entity recognition, and biological sequence labeling.
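The two-pass structure can be sketched directly: run one RNN over the sequence, run a second RNN over the reversed sequence, flip the second result back, and concatenate per position. The helper run_rnn and the weight names are illustrative, and biases are omitted.

```python
import numpy as np

def run_rnn(xs, W_xh, W_hh):
    h = np.zeros(W_hh.shape[0])
    out = []
    for x_t in xs:
        h = np.tanh(W_xh @ x_t + W_hh @ h)
        out.append(h)
    return np.stack(out)

def bidirectional(xs, fwd_weights, bwd_weights):
    fwd = run_rnn(xs, *fwd_weights)                # left to right
    bwd = run_rnn(xs[::-1], *bwd_weights)[::-1]    # right to left, re-aligned to positions
    return np.concatenate([fwd, bwd], axis=1)

rng = np.random.default_rng(0)
T, D, H = 5, 3, 4
fwd_weights = (rng.normal(scale=0.3, size=(H, D)), rng.normal(scale=0.3, size=(H, H)))
bwd_weights = (rng.normal(scale=0.3, size=(H, D)), rng.normal(scale=0.3, size=(H, H)))
xs = rng.normal(size=(T, D))

states = bidirectional(xs, fwd_weights, bwd_weights)
print(states.shape)   # (5, 8)
```

Changing the last input alters the backward half of the state at position 0 but leaves the forward half untouched, which is why the whole sequence must be available before the backward pass can run.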

Deep or stacked RNNs simply stack multiple recurrent layers on top of one another, with the hidden states of layer k serving as the inputs to layer k+1 at the same time step. Deep recurrent stacks (often deep LSTMs) were the workhorse of large-scale speech and translation systems in the mid-2010s. The Google NMT system mentioned above is a representative example.

Echo state networks (Herbert Jaeger, 2001) and the closely related liquid state machines (Wolfgang Maass et al., 2002) take a very different approach: they leave the recurrent weights random and fixed, and only train a linear readout on top. This reservoir computing view sidesteps the gradient-flow problems of full RNN training, which made it attractive for analog hardware implementations and for low-power time-series applications. It also gives up the ability to shape the recurrent dynamics to a specific task, so it has not been competitive on large datasets.
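A toy echo state network shows how little is trained in this setup. The sketch below makes several assumptions for illustration: a 50-unit reservoir rescaled to spectral radius 0.9, a next-step sine prediction task, a short washout period, and a ridge-regression readout solved in closed form.

```python
import numpy as np

rng = np.random.default_rng(0)
n_res, washout, ridge = 50, 20, 1e-6

# Fixed random reservoir, rescaled to spectral radius 0.9 (an assumed, common choice).
W = rng.normal(size=(n_res, n_res))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))
W_in = rng.uniform(-0.5, 0.5, size=n_res)

# Toy task: predict the next sample of a sine wave.
u = np.sin(0.2 * np.arange(501))
inputs, targets = u[:-1], u[1:]

# Drive the reservoir; none of these weights are trained.
x = np.zeros(n_res)
states = []
for u_t in inputs:
    x = np.tanh(W_in * u_t + W @ x)
    states.append(x)
X = np.stack(states)[washout:]      # drop the initial transient
y = targets[washout:]

# Train only the linear readout (ridge regression).
W_out = np.linalg.solve(X.T @ X + ridge * np.eye(n_res), X.T @ y)
mse = np.mean((X @ W_out - y) ** 2)
print(mse)
```

No gradient is ever propagated through the recurrent weights, so vanishing and exploding gradients do not arise; the trade-off is that the reservoir dynamics are whatever the random draw happens to provide.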

applications, historic and current

The most influential applications of RNNs (almost always LSTMs in practice) clustered around sequential data with strong temporal structure.

domain	typical RNN role	representative system
language modeling	predict next token from previous tokens	Mikolov et al. RNN-LM (2010); Penn Treebank LSTM baselines
machine translation	encoder-decoder with attention	Sutskever seq2seq (2014); GNMT (Wu et al., 2016)
speech recognition	acoustic model and CTC decoder	Google Voice Search LSTM (2015); DeepSpeech 2 (2016)
handwriting recognition	pen-stroke sequence (online) or page image (offline) to text	Graves et al. multi-dimensional LSTM (2009)
music and symbolic generation	next-note prediction	Magenta MelodyRNN (2016)
time series forecasting	next-value prediction with context	DeepAR (Salinas et al., 2017)
video classification	per-frame features then RNN aggregation	LRCN (Donahue et al., 2015)
robotic control	policy with hidden state for partial observability	DeepMind continuous control LSTM agents
reinforcement learning	recurrent policies for POMDPs	A3C-LSTM (Mnih et al., 2016)

Many of these applications used RNNs as one stage of a larger pipeline. Speech recognition systems, for example, often combined a recurrent acoustic model with connectionist temporal classification (CTC) loss and a separate language model.

the move to transformers, and the limits that survived it

The shift from RNNs to transformers was not driven by the math of recurrence being wrong; it was driven by hardware. Recurrent computation is inherently serial along the time axis: to compute h_t you need h_{t-1}, which means you cannot parallelize across the time dimension on a single example. Transformers compute attention over all tokens at once, which fits how GPUs and TPUs prefer to work, so for the same number of parameters and the same dataset, a transformer typically trains much faster in wall-clock time than an RNN.

Three limits of RNNs were widely cited as reasons for the migration:

No parallelism over time, which makes training slow on accelerators built for matrix throughput.
Effective context length that is much shorter than the architectural context length, even with LSTM gating. Empirically, LSTM language models stop benefiting from context beyond a few hundred tokens, while transformers continue to benefit out to thousands or more.
Information bottleneck through a fixed-size hidden state, which limits how much the encoder can carry into the decoder in encoder-decoder setups. Attention mitigates this in seq2seq, but it does not remove it for an RNN encoder.

The places where RNNs still win are the inverse of those weaknesses. Streaming inference is a natural fit, because an RNN processes one new input at a time with constant memory and constant compute per step. Latency-sensitive deployments such as on-device speech keyword spotting, real-time captioning, and embedded sensor processing still use small LSTMs or GRUs. Time-series forecasting models in production, especially when they must run continuously on long histories, often benefit from the constant-memory inference of recurrence. And research on biologically inspired or neuromorphic hardware tends to lean recurrent because brains are recurrent.

a recurrent revival

Around 2023, recurrence came back into research attention through a set of architectures that combine RNN-style sequential state with the parallel training friendliness of transformers. They tend to be called "linear recurrent" or "selective state space" models, but the lineage is clearly recurrent.

model	year	recurrent idea	notable property
Linear Transformer (Katharopoulos et al.)	2020	attention rewritten as RNN with kernel feature maps	O(N) inference, parallel training
RetNet (Sun et al.)	2023	retention with parallel and recurrent forms	trained in parallel, run as RNN
RWKV (Peng et al.)	2023	linear attention with RNN-style time mixing	scaled to 14B parameters, dense RNN
Mamba (Gu and Dao)	2023	selective state space model	linear time, competitive with transformers at 3B
Mamba-2 (Dao and Gu)	2024	state space duality with attention	unifies SSMs and linear attention
xLSTM (Beck et al.)	2024	LSTM with exponential gating, matrix memory	LLM-scale recurrent baseline

These models are not a return to vanilla RNNs. They borrow specific ideas (a hidden state that summarizes the past, constant per-step compute at inference, an additive update that avoids vanishing gradients) and combine them with techniques borrowed from transformers and from classical signal processing. The common motivation is that the quadratic cost of attention in sequence length is a real problem for long-context applications, and that recurrent or recurrence-like updates give a principled way to keep inference cost linear.
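The "trained in parallel, run as an RNN" idea can be illustrated with causal linear attention in the style of the Linear Transformer row above. The sketch assumes the elu(x) + 1 feature map and a single head with no batching; the function names are made up for the example. Both functions compute the same outputs, one with a masked matrix product over the whole sequence and one with a fixed-size running state.

```python
import numpy as np

def phi(x):
    return np.where(x > 0, x + 1.0, np.exp(x))    # elu(x) + 1, always positive

def linear_attention_parallel(Q, K, V):
    A = np.tril(phi(Q) @ phi(K).T)                # causal mask: position t sees s <= t
    return (A @ V) / A.sum(axis=1, keepdims=True)

def linear_attention_recurrent(Q, K, V):
    S = np.zeros((K.shape[1], V.shape[1]))        # running sum of phi(k) v^T
    z = np.zeros(K.shape[1])                      # running sum of phi(k)
    out = []
    for q, k, v in zip(Q, K, V):
        S += np.outer(phi(k), v)                  # additive state update
        z += phi(k)
        out.append(phi(q) @ S / (phi(q) @ z))
    return np.stack(out)

rng = np.random.default_rng(0)
T, d_k, d_v = 10, 4, 3
Q = rng.normal(size=(T, d_k))
K = rng.normal(size=(T, d_k))
V = rng.normal(size=(T, d_v))

out_par = linear_attention_parallel(Q, K, V)
out_rec = linear_attention_recurrent(Q, K, V)
print(np.allclose(out_par, out_rec))   # True
```

The recurrent form keeps only S and z between steps, so per-token inference cost and memory do not grow with sequence length.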

Whether any of these architectures will displace transformers as the default for general-purpose language modeling is still open as of 2026. What is clear is that the recurrent idea did not die in 2017; it was reformulated.

implementations

Deep learning libraries expose RNNs at two levels. The cell level (one time step) lets you write a custom unrolling loop, which is useful for unusual architectures or for research. The layer level wraps the loop and the unrolling logic, which is what most applications use.

library	cell API	layer API	gated cells
PyTorch	torch.nn.RNNCell	torch.nn.RNN	torch.nn.LSTM, torch.nn.GRU
TensorFlow / Keras	tf.keras.layers.SimpleRNNCell	tf.keras.layers.SimpleRNN	tf.keras.layers.LSTM, tf.keras.layers.GRU
JAX (Flax)	flax.linen.OptimizedLSTMCell	flax.linen.RNN (wraps a cell in a scan over time)	flax.linen.LSTMCell, flax.linen.GRUCell
MXNet	mxnet.gluon.rnn.RNNCell	mxnet.gluon.rnn.RNN	LSTM, GRU

Production RNN training on NVIDIA GPUs typically routes through cuDNN's fused LSTM and GRU kernels, which gain a factor of two to ten in throughput by combining the eight matrix multiplies of an LSTM step into one large operation. cuDNN's bias toward a fixed set of cell structures was one of the practical reasons LSTM and GRU dominated over more exotic recurrent variants in industry: if your custom cell does not have a cuDNN kernel, you pay for it in wall-clock time.

practical tips

A few rules of thumb that survived the LSTM era:

Initialize the forget gate bias to a positive value (commonly 1) so that the LSTM defaults to remembering at the start of training.
Use orthogonal or identity initialization for W_hh in vanilla RNNs so that all of its eigenvalues start with magnitude 1.
Always clip gradients (a global-norm clip of 1 to 5 is a typical starting point).
Use truncated BPTT for long sequences, with a window length tuned to the task's expected dependency range.
Layer normalization usually helps; batch normalization across time is awkward with variable-length batches.
For variable-length batches, pack and pad sequences (PyTorch's pack_padded_sequence and TensorFlow's masking) to avoid wasting compute on pad tokens.
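Two of the initialization tips above can be sketched in NumPy. The helper names are hypothetical, and the bias layout assumes the gates are stacked in the order input, forget, output, candidate; real libraries pack gates differently, so the slice would need adjusting.

```python
import numpy as np

def orthogonal(n, rng):
    """Random orthogonal matrix via QR decomposition of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.normal(size=(n, n)))
    return q * np.sign(np.diag(r))     # sign fix so columns are not biased by QR conventions

def lstm_bias(hidden, forget_bias=1.0):
    """Stacked LSTM bias with a positive forget-gate block."""
    b = np.zeros(4 * hidden)           # assumed gate order: input, forget, output, candidate
    b[hidden:2 * hidden] = forget_bias
    return b

rng = np.random.default_rng(0)
W_hh = orthogonal(6, rng)
print(np.abs(np.linalg.eigvals(W_hh)))   # all magnitudes are 1 up to rounding

b = lstm_bias(6)
print(b[6:12])                            # the forget-gate block is set to 1
```

Starting from an orthogonal W_hh means the recurrent Jacobian neither shrinks nor stretches vectors at initialization, apart from the effect of the activation function.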

relationship to other articles on this wiki

For most readers the relevant deeper articles are recurrent neural network (the long-form treatment of architecture, training, and theory), LSTM (the dominant gated cell), GRU (the simpler alternative), backpropagation through time (the training algorithm), vanishing gradient (the central theoretical obstacle), bidirectional RNN (two-direction context), and sequence-to-sequence task (the encoder-decoder pattern that drove RNN adoption in NLP). For modern alternatives, see transformer, Mamba, state space model, and RWKV.

references

Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, 79(8), 2554-2558.
Jordan, M. I. (1986). Serial order: A parallel distributed processing approach. ICS Report 8604, University of California, San Diego.
Elman, J. L. (1990). Finding structure in time. Cognitive Science, 14(2), 179-211. https://onlinelibrary.wiley.com/doi/10.1207/s15516709cog1402_1
Hochreiter, S. (1991). Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, Technical University of Munich.
Bengio, Y., Simard, P., and Frasconi, P. (1994). Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2), 157-166.
Hochreiter, S., and Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735-1780.
Schuster, M., and Paliwal, K. K. (1997). Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11), 2673-2681. https://ieeexplore.ieee.org/document/650093
Gers, F. A., Schmidhuber, J., and Cummins, F. (2000). Learning to forget: Continual prediction with LSTM. Neural Computation, 12(10), 2451-2471.
Jaeger, H. (2001). The "echo state" approach to analysing and training recurrent neural networks. GMD Technical Report 148, German National Research Center for Information Technology.
Maass, W., Natschlager, T., and Markram, H. (2002). Real-time computing without stable states: A new framework for neural computation based on perturbations. Neural Computation, 14(11), 2531-2560.
Mikolov, T., Karafiat, M., Burget, L., Cernocky, J., and Khudanpur, S. (2010). Recurrent neural network based language model. Interspeech 2010.
Pascanu, R., Mikolov, T., and Bengio, Y. (2013). On the difficulty of training recurrent neural networks. ICML 2013. https://arxiv.org/abs/1211.5063
Cho, K., van Merrienboer, B., Gulcehre, C., et al. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation. EMNLP 2014.
Sutskever, I., Vinyals, O., and Le, Q. V. (2014). Sequence to sequence learning with neural networks. NIPS 2014. https://arxiv.org/abs/1409.3215
Sak, H., Senior, A., and Beaufays, F. (2014). Long short-term memory recurrent neural network architectures for large scale acoustic modeling. Interspeech 2014. (Background for Google Voice 2015.)
Karpathy, A. (2015). The unreasonable effectiveness of recurrent neural networks. Blog post.
Shi, X., Chen, Z., Wang, H., Yeung, D.-Y., Wong, W.-K., and Woo, W.-C. (2015). Convolutional LSTM network: A machine learning approach for precipitation nowcasting. NIPS 2015.
Wu, Y., Schuster, M., Chen, Z., et al. (2016). Google's neural machine translation system: Bridging the gap between human and machine translation. https://arxiv.org/abs/1609.08144
Vaswani, A., Shazeer, N., Parmar, N., et al. (2017). Attention is all you need. NIPS 2017.
Greff, K., Srivastava, R. K., Koutnik, J., Steunebrink, B. R., and Schmidhuber, J. (2017). LSTM: A search space odyssey. IEEE Transactions on Neural Networks and Learning Systems, 28(10), 2222-2232.
Li, S., Li, W., Cook, C., Zhu, C., and Gao, Y. (2018). Independently recurrent neural network (IndRNN): Building a longer and deeper RNN. CVPR 2018.
Katharopoulos, A., Vyas, A., Pappas, N., and Fleuret, F. (2020). Transformers are RNNs: Fast autoregressive transformers with linear attention. ICML 2020.
Peng, B., Alcaide, E., Anthony, Q., et al. (2023). RWKV: Reinventing RNNs for the transformer era. Findings of EMNLP 2023. https://arxiv.org/abs/2305.13048
Sun, Y., Dong, L., Huang, S., et al. (2023). Retentive network: A successor to transformer for large language models. https://arxiv.org/abs/2307.08621
Gu, A., and Dao, T. (2023). Mamba: Linear-time sequence modeling with selective state spaces. https://arxiv.org/abs/2312.00752
Beck, M., Poppel, K., Spanring, M., et al. (2024). xLSTM: Extended long short-term memory. NeurIPS 2024.
