See also: Machine learning terms
Long Short-Term Memory (LSTM) is a type of recurrent neural network architecture designed to learn long-range dependencies in sequential data. It was introduced by Sepp Hochreiter and Jürgen Schmidhuber in a 1997 paper in Neural Computation, and it became the dominant approach to sequence modeling from roughly 2013 to 2018, when it was largely supplanted in language tasks by the Transformer. LSTM cells maintain a separate cell state (a long-term memory) and use small learned gates to decide what to store, what to forget, and what to read out at each time step. That structure gives the network an additive, gradient-friendly path through time, which is the main reason LSTMs can train on sequences thousands of steps long without the gradients collapsing the way plain recurrent networks do.
Despite the rise of attention-based models, LSTMs remain in active use for speech recognition front-ends, on-device inference, time series, control policies, and any setting where strict left-to-right semantics and constant per-step compute matter more than raw throughput on a GPU. The architecture also enjoyed a small revival in 2024 with the publication of xLSTM, which scaled the original idea to billions of parameters and put it back in conversation with Mamba and modern state space models.
The story of LSTM starts with a problem rather than a solution. In April 1991, Sepp Hochreiter, then a master's student at the Technical University of Munich, submitted his diploma thesis Untersuchungen zu dynamischen neuronalen Netzen ("Investigations into Dynamic Neural Networks"). The thesis contains the first detailed analysis of what is now called the vanishing gradient problem: when backpropagation through time is unrolled across many steps, the gradient of an early input with respect to a late output decays (or, less often, blows up) exponentially with the number of steps in between. Hochreiter showed this both empirically and analytically. The thesis was written in German and was not widely read at the time, which delayed the field's recognition of the problem by several years.
Hochreiter and his advisor Jürgen Schmidhuber kept working on the issue and eventually proposed a fix in the 1997 paper "Long Short-Term Memory" in Neural Computation 9(8):1735-1780. Their core idea was the constant error carousel: a self-connected linear unit, with weight 1.0 on its self-loop, that lets the gradient flow backward through time without being multiplied by anything that could shrink or explode it. To control what gets written to and read from this carousel, they added two multiplicative "gates": an input gate and an output gate.
This original 1997 cell had no forget gate. Once a value was written into the cell state, it stayed there until the input gate wrote on top of it. That worked in benchmark tasks with clean sequence boundaries, but it caused problems on continuous streams where the cell state would drift and eventually saturate. Felix Gers, Jürgen Schmidhuber, and Fred Cummins fixed this in "Learning to Forget: Continual Prediction with LSTM" (Neural Computation 12(10):2451-2471, 2000), adding a third gate that learns when to reset the cell. Almost every modern reference to "the LSTM" actually means this 2000 variant.
The same year, Gers and Schmidhuber added peephole connections, which let the gates inspect the cell state directly when deciding whether to open or close. From there the architecture branched in many directions: bidirectional LSTM (Graves & Schmidhuber, 2005), the simpler GRU (Cho et al., 2014), ConvLSTM for spatiotemporal grids (Shi et al., 2015), and many more. By the mid 2010s LSTM was the default sequence model in deep learning, powering Google Voice Search, Apple's Siri dictation, Google Translate, and most academic NLP papers. The 2017 publication of the Transformer in "Attention Is All You Need" started a fast migration toward attention-based models, but LSTM stayed in the toolbox and, with xLSTM (Beck et al., 2024), came back into research focus.
A standard recurrent neural network updates a hidden state at each time step with something like h_t = tanh(W x_t + U h_{t-1} + b). When you train it with backpropagation through time, the gradient of the loss with respect to an early hidden state involves a long product of Jacobians of that recurrence. If those Jacobians have spectral radius below 1, the product shrinks toward zero (vanishing gradient); if above 1, it explodes (exploding gradient). Hochreiter's 1991 thesis worked the math out carefully, and the result is that vanilla RNNs cannot reliably learn dependencies that span more than 10 to 20 steps.
Gradient clipping helps with the explosion case, since you can cap the norm of the gradient and keep training stable. There is no equally simple fix for the vanishing case in a plain RNN; the gradient that you need to learn from has actually become numerical noise by the time it reaches the early steps. LSTM attacks the vanishing problem at the source by changing what is being multiplied. Instead of squashing every step through a tanh and a weight matrix, the cell state is updated by an additive expression: c_t = f_t * c_{t-1} + i_t * c_tilde_t. When the forget gate f_t is close to 1, the gradient of c_t with respect to c_{t-1} is also close to 1, and the chain of derivatives along the cell state path stays well behaved across hundreds or thousands of steps. The gates themselves still have to be learned, and they can saturate, but the additive cell-state path is what gives LSTM its name and its power.
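To see the decay concretely, the following sketch (an illustration added here, not taken from any cited paper) runs a plain tanh recurrence with no LSTM machinery and measures the gradient of a late hidden state with respect to the first input. The hidden size and the 0.5 weight scale are arbitrary choices; scaling the weights up instead flips the problem from vanishing to exploding gradients.

```python
import torch

def early_input_grad_norm(T, n=32):
    # Recurrent weights with spectral radius well below 1, so products shrink.
    W = 0.5 * torch.randn(n, n) / n ** 0.5
    x0 = torch.randn(n, requires_grad=True)   # the "early input" we differentiate against
    h = torch.tanh(x0)
    for _ in range(T - 1):
        h = torch.tanh(W @ h)                 # h_t = tanh(W h_{t-1}), no later inputs
    return torch.autograd.grad(h.sum(), x0)[0].norm().item()

for T in (5, 20, 50, 100):
    print(T, early_input_grad_norm(T))        # the norm collapses as T grows
```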
A modern LSTM cell, as it appears in textbooks and in deep learning libraries, has six computations per time step. Let x_t be the input vector at step t, h_{t-1} the previous hidden state, and c_{t-1} the previous cell state. Let sigma denote the sigmoid function and tanh the hyperbolic tangent. The cell computes:
| step | name | equation | role |
|---|---|---|---|
| 1 | forget gate | f_t = sigma(W_f [h_{t-1}, x_t] + b_f) | how much of the old cell state to keep |
| 2 | input gate | i_t = sigma(W_i [h_{t-1}, x_t] + b_i) | how much of the new candidate to write |
| 3 | candidate | c_tilde_t = tanh(W_c [h_{t-1}, x_t] + b_c) | proposed new content |
| 4 | cell update | c_t = f_t * c_{t-1} + i_t * c_tilde_t | additive memory update |
| 5 | output gate | o_t = sigma(W_o [h_{t-1}, x_t] + b_o) | how much of the cell to expose |
| 6 | hidden output | h_t = o_t * tanh(c_t) | what the rest of the network sees |
The brackets [h_{t-1}, x_t] denote vector concatenation; * denotes elementwise multiplication. Each gate is a small linear layer followed by a sigmoid, so its output is a vector of values between 0 and 1 that act as soft binary masks over the cell-state coordinates. The candidate vector c_tilde_t lives in (-1, 1) thanks to the tanh, and the gates decide how much of it to add and how much of the previous c_{t-1} to keep. The hidden state h_t is a gated, squashed view of the cell state, and it is the only thing other layers (or the next time step) can see.
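The six equations translate almost line for line into code. The sketch below is a minimal, unoptimized PyTorch version with concatenated weight matrices of shape (hidden, hidden + input); variable names follow the table, and it is illustrative rather than a drop-in replacement for a library cell.

```python
import torch

def lstm_step(x_t, h_prev, c_prev, W_f, b_f, W_i, b_i, W_c, b_c, W_o, b_o):
    """One LSTM time step, following steps 1-6 of the table above."""
    hx = torch.cat([h_prev, x_t], dim=-1)        # [h_{t-1}, x_t]
    f_t = torch.sigmoid(hx @ W_f.T + b_f)        # 1. forget gate
    i_t = torch.sigmoid(hx @ W_i.T + b_i)        # 2. input gate
    c_tilde = torch.tanh(hx @ W_c.T + b_c)       # 3. candidate content
    c_t = f_t * c_prev + i_t * c_tilde           # 4. additive cell update
    o_t = torch.sigmoid(hx @ W_o.T + b_o)        # 5. output gate
    h_t = o_t * torch.tanh(c_t)                  # 6. gated, squashed view of the cell
    return h_t, c_t
```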
PyTorch implements an equivalent form with separate input and recurrent weight matrices, which is mostly a notational difference. Its documentation writes the equations as i_t = sigma(W_ii x_t + b_ii + W_hi h_{t-1} + b_hi), and so on for f_t, g_t (the candidate), and o_t, with c_t = f_t * c_{t-1} + i_t * g_t and h_t = o_t * tanh(c_t). Two bias vectors per gate look redundant on paper, since b_ii + b_hi could be folded into a single bias, but keeping them separate matches the cuDNN kernels and avoids extra memory copies during training.
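The two formulations produce identical numbers. The sketch below checks this against torch.nn.LSTMCell; it relies on the ordering documented for PyTorch's packed weights, which stack the four gates as (input, forget, candidate, output).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
cell = nn.LSTMCell(input_size=8, hidden_size=16)
x = torch.randn(4, 8)                        # (batch, features)
h0, c0 = torch.zeros(4, 16), torch.zeros(4, 16)
h1, c1 = cell(x, (h0, c0))                   # the library's answer

# Manual computation with the same packed weights and both bias vectors.
gates = (x @ cell.weight_ih.T + cell.bias_ih
         + h0 @ cell.weight_hh.T + cell.bias_hh)
i, f, g, o = gates.chunk(4, dim=1)           # PyTorch gate order: i, f, g, o
c_manual = torch.sigmoid(f) * c0 + torch.sigmoid(i) * torch.tanh(g)
h_manual = torch.sigmoid(o) * torch.tanh(c_manual)

print(torch.allclose(h1, h_manual, atol=1e-6),
      torch.allclose(c1, c_manual, atol=1e-6))   # True True
```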
If you take the partial derivative of c_t with respect to c_{t-1} from the cell-state equation, you get f_t. When the forget gate is open (close to 1), the gradient flowing backward through that step is roughly the identity. Compose this across many steps and the product is the elementwise product of forget gates, which can stay near 1 if the network has learned to keep certain coordinates open. This is the modern, post-2000 version of the constant error carousel. The 1997 paper achieved the same effect by hard-coding a self-loop weight of 1.0; the 2000 paper let the network learn that weight per coordinate per time step.
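A quick autograd check of that claim. The snippet is an illustration of the cell-state path only, with the forget gate frozen at a constant and the write term lumped into a constant: the gradient across T steps comes out as exactly the T-fold product of the forget gates.

```python
import torch

T = 300
f = torch.full((8,), 0.99)                 # forget gates held open, near 1
c0 = torch.randn(8, requires_grad=True)
c = c0
for _ in range(T):
    c = f * c + 0.1                        # i_t * c_tilde_t lumped into a constant write
(grad,) = torch.autograd.grad(c.sum(), c0)
print(torch.allclose(grad, f ** T))        # True: the gradient is the product of forget gates
print(0.99 ** T)                           # ~0.049: small, but far from numerical zero
```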
A practical detail that turns out to matter a lot is the forget gate bias. If you initialize all biases to zero, the forget gate sigmoid starts at 0.5, so on average stored information decays by half at every step. This makes early training very slow, because the network has to learn to push the bias up before it can store anything for very long. Jozefowicz, Zaremba, and Sutskever showed in 2015 that simply initializing the forget gate bias to 1 (so the gate sigmoid starts at about 0.73) closes most of the gap between LSTM and GRU on a large empirical benchmark. Keras does this by default (unit_forget_bias=True); PyTorch does not, and some tutorials recommend an initial bias of 2.
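In PyTorch the trick has to be applied by hand. The sketch below relies on the documented (input, forget, candidate, output) ordering of the packed bias vectors; because PyTorch keeps two bias vectors per gate, only one of them is set to 1 here so the effective forget bias is 1 rather than 2.

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=64, hidden_size=128, num_layers=2)
H = lstm.hidden_size
with torch.no_grad():
    for name, bias in lstm.named_parameters():
        if name.startswith("bias_ih"):
            bias[H:2 * H].fill_(1.0)   # forget-gate slice of the input-side bias
        elif name.startswith("bias_hh"):
            bias[H:2 * H].fill_(0.0)   # zero the recurrent-side copy to avoid doubling
```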
Training an LSTM looks like training any other recurrent network. The sequence is unrolled in time, the loss is summed (or averaged) across the steps where there is supervision, and gradients are computed by backpropagation through time. In practice three details deserve attention.
First, gradient clipping. Even with the additive cell-state path, the gates and the input-to-hidden weights can still produce large gradients on rare events, especially early in training. Clipping the global gradient norm to a value like 1 or 5 is standard. Pascanu, Mikolov, and Bengio (2013) gave the now-standard analysis of why clipping works.
Second, truncation. For very long sequences, full BPTT is expensive and memory-hungry. Truncated BPTT processes the sequence in chunks of, say, 100 to 200 steps, carrying the cell and hidden state forward but only backpropagating gradients within the current chunk. This trades some long-range learning for tractability; in PyTorch it is what you get by calling .detach() on the carried hidden and cell states between chunks, as in the sketch below.
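A minimal training-loop sketch showing the first two details together: chunked truncated BPTT with a detached carried state, plus global-norm gradient clipping. The model, dummy data, chunk length, and clipping value are placeholders chosen for illustration.

```python
import torch
import torch.nn as nn

model = nn.LSTM(input_size=64, hidden_size=128, num_layers=2, batch_first=True)
head = nn.Linear(128, 10)                           # placeholder prediction head
params = list(model.parameters()) + list(head.parameters())
optimizer = torch.optim.Adam(params)
loss_fn = nn.CrossEntropyLoss()

long_x = torch.randn(8, 2000, 64)                   # one very long batch of sequences
long_y = torch.randint(0, 10, (8, 2000))            # per-step targets (dummy data)

state = None
chunk = 200                                         # truncation length
for start in range(0, long_x.size(1), chunk):
    x = long_x[:, start:start + chunk]
    y = long_y[:, start:start + chunk]

    outputs, state = model(x, state)
    loss = loss_fn(head(outputs).flatten(0, 1), y.flatten())

    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(params, max_norm=1.0)   # cap the global gradient norm
    optimizer.step()

    state = tuple(s.detach() for s in state)        # carry state forward, cut the graph
```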
Third, dropout. Naive dropout on the recurrent connections destroys the long-term memory because the dropout mask changes at every step. Two solutions are common: variational dropout (Gal & Ghahramani, 2016), which fixes a single dropout mask per sequence, and zoneout (Krueger et al., 2017), which randomly copies the previous hidden state to the next one. Most production code applies dropout only between stacked LSTM layers, not within the recurrence, which is what dropout controls in torch.nn.LSTM.
LSTM is less a single architecture than a family. Greff and colleagues published "LSTM: A Search Space Odyssey" in 2017 (IEEE TNNLS 28(10):2222-2232), comparing eight variants across speech, handwriting, and music tasks with about 5,400 training runs. Their conclusion was blunt: none of the standard variants meaningfully beats the modern (forget-gate, no peephole) LSTM on average, the forget gate and the output activation are the most important pieces, and the rest is mostly noise. Still, the family tree is worth knowing.
| variant | year | authors | distinguishing change |
|---|---|---|---|
| Vanilla LSTM | 1997 | Hochreiter & Schmidhuber | Constant error carousel, input and output gates, no forget gate |
| Forget-gate LSTM | 2000 | Gers, Schmidhuber, Cummins | Adds the forget gate, what most people now call "the LSTM" |
| Peephole LSTM | 2000 | Gers & Schmidhuber | Gates can inspect c_{t-1} directly, helps with precise timing tasks |
| Bidirectional LSTM | 2005 | Graves & Schmidhuber | Two LSTMs run forward and backward over the same sequence, outputs concatenated |
| Multiplicative LSTM | 2016 | Krause et al. | Input-conditioned recurrence, used in OpenAI's sentiment neuron |
| ConvLSTM | 2015 | Shi et al. | Replaces matrix multiplies with convolutions, designed for video and weather radar |
| Tree LSTM | 2015 | Tai, Socher & Manning | Recurrence over a tree structure rather than a linear chain |
| Highway LSTM | 2015 | Zhang et al. | Adds highway connections between stacked LSTM layers for deeper stacks |
| GRU | 2014 | Cho et al. | Two gates instead of three, no separate cell state |
| Mogrifier LSTM | 2020 | Melis, Kočiský & Blunsom | Pre-mixes the input and previous hidden state several times before the LSTM update |
| xLSTM (sLSTM, mLSTM) | 2024 | Beck et al. | Exponential gating, scalar and matrix memory, parallelizable variant for billions of parameters |
The Gated Recurrent Unit, introduced by Kyunghyun Cho and colleagues in "Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation" (EMNLP 2014), is the most widely used LSTM relative. It collapses the input and forget gates into a single update gate z_t and adds a reset gate r_t that controls how much of the previous hidden state contributes to the candidate. There is no separate cell state; the hidden state plays both roles. The full update is h_t = (1 - z_t) * h_{t-1} + z_t * tanh(W [r_t * h_{t-1}, x_t]).
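For comparison with the LSTM step above, here is a minimal sketch of one GRU step following the equations as written in this section (sign conventions for z_t vary between papers and libraries, and biases are omitted for brevity).

```python
import torch

def gru_step(x_t, h_prev, W_z, W_r, W_h):
    """One GRU time step; the W_* matrices have shape (hidden, hidden + input)."""
    hx = torch.cat([h_prev, x_t], dim=-1)
    z_t = torch.sigmoid(hx @ W_z.T)                     # update gate
    r_t = torch.sigmoid(hx @ W_r.T)                     # reset gate
    candidate = torch.tanh(
        torch.cat([r_t * h_prev, x_t], dim=-1) @ W_h.T)
    return (1 - z_t) * h_prev + z_t * candidate         # hidden state doubles as memory
```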
GRU has roughly 25 percent fewer parameters than LSTM at the same hidden size and trains slightly faster. Empirically the two are usually within a percentage point of each other on most tasks. Where they differ, LSTM tends to win on tasks with very long dependencies and GRU on smaller datasets where the parameter savings translate to less overfitting.
Most real systems do not use a single forward LSTM. They stack two or more LSTM layers (with the output of layer k being the input to layer k+1 at each time step), and they often run one or more layers in both directions. Graves and Schmidhuber's 2005 paper on framewise phoneme classification with bidirectional LSTM (Neural Networks 18(5-6):602-610) showed that adding a backward pass meaningfully improves performance on labeled sequence tasks (bidirectional RNNs themselves go back to Schuster and Paliwal, 1997). The trick obviously cannot be used for autoregressive generation, since the backward pass would peek at future tokens, but for tagging, classification, and acoustic modeling it is the default.
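With torch.nn.LSTM the stacked, bidirectional setup is just constructor flags; the sketch below only illustrates the resulting shapes.

```python
import torch
import torch.nn as nn

bilstm = nn.LSTM(input_size=64, hidden_size=128, num_layers=2,
                 batch_first=True, bidirectional=True)
x = torch.randn(8, 50, 64)          # (batch, time, features)
outputs, (h_n, c_n) = bilstm(x)
print(outputs.shape)                # (8, 50, 256): forward and backward outputs concatenated
print(h_n.shape)                    # (4, 8, 128): num_layers * num_directions final states
```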
ConvLSTM, from "Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting" (Shi et al., NeurIPS 2015), replaces the matrix multiplications inside the gates with convolutions. The cell state and hidden state then become 3D tensors (channels by height by width) instead of vectors. This is the natural fit for any task where each time step is itself a spatial grid: weather radar, video, or fluid simulation. The original paper used it for short-term rainfall prediction.
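A minimal sketch of the idea: the gate equations are the same as in the table earlier, but the pre-activations come from a single convolution over the concatenated input and hidden state, so the states are (channels, height, width) tensors. The Hadamard (peephole-style) cell-state terms of the original paper are omitted here for brevity.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """One ConvLSTM step with convolutional gates (illustrative sketch)."""
    def __init__(self, in_channels, hidden_channels, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv2d(in_channels + hidden_channels,
                              4 * hidden_channels, kernel_size,
                              padding=kernel_size // 2)

    def forward(self, x_t, h_prev, c_prev):
        gates = self.conv(torch.cat([x_t, h_prev], dim=1))   # all four gates at once
        i, f, g, o = gates.chunk(4, dim=1)                    # split along channels
        c_t = torch.sigmoid(f) * c_prev + torch.sigmoid(i) * torch.tanh(g)
        h_t = torch.sigmoid(o) * torch.tanh(c_t)
        return h_t, c_t
```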
LSTMs powered most of the practical sequence learning systems in the 2010s. A small selection:
| year | system | role of LSTM | reference |
|---|---|---|---|
| 2014 | Sequence to Sequence Learning | Deep (4-layer) LSTM encoder and decoder for machine translation | Sutskever, Vinyals, Le, NeurIPS 2014 |
| 2014 | Show and Tell image captioning | CNN encoder feeding an LSTM caption decoder | Vinyals, Toshev, Bengio, Erhan, CVPR 2015 |
| 2014 | Deep Speech (Baidu) | End-to-end bidirectional RNN with CTC for English recognition (plain recurrent units rather than LSTM; Mandarin came with Deep Speech 2) | Hannun et al., 2014 |
| 2014 | Sak, Senior, Beaufays acoustic model | Distributed LSTM acoustic models for speech recognition at Google | Interspeech 2014 |
| 2015 | Google Voice Search | Production deployment of LSTM acoustic models, cut transcription errors substantially | Sak et al. blog post |
| 2015 | Karpathy's char-rnn | Popular blog post showing LSTMs generating Shakespeare and Linux source code | "The Unreasonable Effectiveness of Recurrent Neural Networks" |
| 2016 | Google Neural Machine Translation | 8-layer LSTM encoder, 8-layer LSTM decoder, attention, wordpiece tokens, replaced phrase-based MT in Google Translate | Wu et al., arXiv:1609.08144 |
| 2016 | DeepMind WaveNet baselines | LSTM-RNN based parametric speech synthesizer used as a baseline against raw-audio generation | Van den Oord et al. |
| 2018 | ELMo contextual embeddings | Bidirectional LSTM language model whose hidden states are used as word features | Peters et al., NAACL 2018 |
| 2018 | OpenAI Five (Dota 2) | LSTM-based policy network trained with PPO to play 5v5 Dota at professional level | OpenAI |
| 2019 | DeepMind AlphaStar (StarCraft II) | LSTM core inside a transformer-LSTM hybrid policy that beat top human players | Vinyals et al., Nature 575 |
| 2019 | Apple Siri offline dictation | On-device LSTM acoustic model running locally on iPhone | Apple ML Journal |
A few categories beyond this table are worth calling out. In handwriting recognition Alex Graves' work in the late 2000s established LSTM with Connectionist Temporal Classification as the dominant approach; the same combination later went into Google Voice Search. In time series forecasting LSTM is a common baseline, often paired with attention or convolutional front-ends. In reinforcement learning LSTM cores are standard for partially observed environments; Deep Q-Networks with LSTM (DRQN), IMPALA, R2D2, and the OpenAI Five and AlphaStar agents above all used recurrent value or policy networks. In computational biology LSTM has been used for protein structure prediction features and DNA sequence analysis, although Transformers have largely taken over there as well.
The Transformer, introduced by Vaswani and colleagues in "Attention Is All You Need" (NeurIPS 2017), removes recurrence entirely and replaces it with self-attention over the whole sequence at once. This is good news on a GPU because every position can be processed in parallel rather than waiting for the previous step, and bad news on the memory bill because the attention matrix is quadratic in the sequence length. For most NLP tasks the trade-off favors Transformers, and from roughly 2018 onward the field shifted accordingly. ELMo (2018) was the last famous LSTM-based pretrained language model; BERT, GPT, and everything that followed used self-attention.
That said, LSTMs and Transformers have different inductive biases and different cost structures, and the choice is not as one-sided as the trend suggests.
| property | LSTM | Transformer |
|---|---|---|
| time per training step | O(T) sequential | O(T^2) parallel |
| time per token at inference | O(1), constant memory | O(T) attention over the cache, O(T) memory |
| state | fixed-size hidden + cell | growing key-value cache |
| parallelism over sequence | poor | excellent |
| typical context length in practice | thousands of steps with care | millions of tokens with hardware tricks |
| inductive bias | strong recency, sequential | none, position must be encoded |
| works well with very small data | yes | usually needs more |
| dominant in speech front-ends (2026) | still common | catching up |
The practical upshot: Transformers win when you have lots of data, lots of compute, and care about throughput. LSTMs win, or at least compete, when you care about constant memory at inference (streaming speech, on-device, embedded), when sequences are very long but mostly recent context matters, when data is small, or when you want a strong online learning algorithm.
After several years where new sequence architectures meant some variant of attention, the late 2023 and 2024 wave of papers brought recurrence back. The headline names are Mamba, RWKV, and xLSTM.
Mamba (Gu and Dao, arXiv:2312.00752, December 2023) is a state space model, not an LSTM, but it inherits the recurrent flavor: linear time in the sequence length, constant memory at inference, and, in the selective version, input-dependent gating that closely mirrors what an LSTM does with its forget gate. Mamba-3B was the first sub-quadratic model to clearly match a Transformer of the same size on language modeling, and on long sequences (hundreds of thousands to a million tokens) it is significantly faster.
RWKV (Peng and colleagues, 2023) reformulates attention so that it can be computed as a recurrence, giving it Transformer-like training and RNN-like inference. The training loop looks like a Transformer; deployment looks like an LSTM.
xLSTM (Beck, Pöppel, Spanring, Auer, Prudnikova, Kopp, Klambauer, Brandstetter, and Hochreiter, NeurIPS 2024, arXiv:2405.04517) is the closest direct revival of the original idea. The paper introduces two new cell types: sLSTM, which keeps the scalar memory of the original LSTM but adds exponential gating with stabilization, and mLSTM, which replaces the scalar cell state with a matrix and uses a covariance update rule that can be parallelized like attention. Stacked into residual blocks, xLSTM models compete with Llama-style Transformers and Mamba at billion-parameter scale. The fact that the original LSTM author co-led the paper, almost three decades after the 1997 paper, is a nice closing of a loop.
Every major deep learning framework ships an LSTM implementation backed by a fast GPU kernel. The interfaces are similar enough to swap with a one-line change.
| framework | module | notes |
|---|---|---|
| PyTorch | torch.nn.LSTM, torch.nn.LSTMCell | cuDNN backend, supports stacked, bidirectional, dropout between layers |
| TensorFlow / Keras | tf.keras.layers.LSTM, LSTMCell | XLA and cuDNN paths, the layer auto-selects the fast kernel when conditions are met |
| JAX | flax.linen.OptimizedLSTMCell, haiku.LSTM | functional API, scan over time |
| MXNet | mx.gluon.rnn.LSTM | similar API to PyTorch |
| ONNX | LSTM operator | for model export and cross-framework deployment |
| Apple Core ML | LSTMLayer | converts from PyTorch and TensorFlow for on-device inference |
A minimal usage example with torch.nn.LSTM:

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(
    input_size=64,
    hidden_size=128,
    num_layers=2,
    batch_first=True,
    bidirectional=False,
    dropout=0.2,
)

x = torch.randn(32, 100, 64)   # (batch, time, features)
outputs, (h_n, c_n) = lstm(x)
# outputs: (32, 100, 128) - top-layer hidden state at every step
# h_n, c_n: (2, 32, 128) - final hidden and cell states for each layer
```
The two big problems with LSTM are both about scale.
First, the recurrence is inherently sequential. The cell state at step t depends on the cell state at step t-1, so you cannot parallelize the forward pass over time the way you can for self-attention. Modern GPUs are extremely wide, and a model that uses 5 percent of the silicon for 100 percent of the time is slower than a model that uses 80 percent of the silicon some of the time, even if the second one technically does more arithmetic. cuDNN's fused LSTM kernel and tricks like quasi-RNN or simple recurrent unit (SRU) recover some of this, but not all of it.
Second, LSTMs are hard to scale to extremely long contexts. Even with a forget gate close to 1, the cell state is a fixed-size vector, and you cannot cram a million tokens of context into a few thousand floats without losing information. Transformers handle this by keeping a key-value cache that grows with the input, paying quadratic compute for the privilege. State space models and xLSTM-mLSTM aim at the middle ground: subquadratic compute with a richer state than a single vector.
Less fundamental but still real: LSTM hyperparameters (learning rate, gradient clipping value, hidden size, dropout) are tightly coupled, and the same recipe rarely transfers across tasks. The Greff et al. "Search Space Odyssey" paper found a few stable defaults, but practitioners still report that getting an LSTM to train well takes more babysitting than getting a Transformer to train well, mostly because Transformer pretraining recipes are now extremely well-documented and LSTM ones are not.
For about five years, from roughly 2013 through 2018, LSTM was synonymous with sequence modeling in deep learning. The original 1997 paper has been cited well over 100,000 times. Schmidhuber (2016) and Hochreiter (2021) each received the IEEE Neural Networks Pioneer Award in part for this work. Karpathy's 2015 blog post "The Unreasonable Effectiveness of Recurrent Neural Networks" introduced an entire generation of practitioners to LSTM by showing it generating plausible Shakespeare, Wikipedia articles, and even compilable C code from a character-level model. Christopher Olah's 2015 blog post "Understanding LSTM Networks" remains, ten years later, the diagram people reach for when they want to explain the gates.
The Transformer mostly displaced LSTM in research after 2018, but in production the transition has been slower. Speech recognition front-ends, on-device keyboard prediction, real-time translation pipelines, and a long tail of tabular and time series models still rely on LSTM cells, often because the constant memory and constant per-step latency are exactly what you need on a phone or in a low-latency service. With xLSTM, Mamba, and RWKV bringing recurrence back to the frontier, the architecture's second act is still being written.
torch.nn.LSTM. https://docs.pytorch.org/docs/stable/generated/torch.nn.LSTM.html