LSTM
Last reviewed
May 18, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v5 · 4,986 words
See also: Machine learning terms
Long Short-Term Memory (LSTM) is a type of [[recurrent_neural_network]] architecture designed to learn long-range dependencies in sequential data. It was introduced by Sepp [[sepp_hochreiter]] and Jürgen [[jurgen_schmidhuber]] in a 1997 paper in Neural Computation[^1], and it became the dominant approach to sequence modeling from roughly 2013 to 2018, when it was largely supplanted in language tasks by the [[transformer]][^2]. LSTM cells maintain a separate cell state (a long-term memory) and use small learned gates to decide what to store, what to forget, and what to read out at each time step. That structure gives the network an additive, gradient-friendly path through time, which is the main reason LSTMs can train on sequences thousands of steps long without the gradients collapsing the way plain recurrent networks do.
Despite the rise of attention-based models, LSTMs remain in active use for speech recognition front-ends, on-device inference, time series, control policies, and any setting where strict left-to-right semantics and constant per-step compute matter more than raw throughput on a GPU. The architecture also enjoyed a small revival in 2024 with the publication of xLSTM[^3], which scaled the original idea to billions of parameters and put it back in conversation with [[mamba]] and modern [[state_space_model]] approaches.
The story of LSTM starts with a problem rather than a solution. In June 1991, Sepp Hochreiter, then a master's student at the Technical University of Munich, submitted his diploma thesis Untersuchungen zu dynamischen neuronalen Netzen ("Investigations into Dynamic Neural Networks") under Schmidhuber's advisorship[^4]. The thesis contains the first detailed analysis of what is now called the [[vanishing_gradient]] problem: when [[backpropagation_through_time]] is unrolled across many steps, the gradient of a late output with respect to an early input decays (or, less often, blows up) exponentially with the number of steps in between. Hochreiter showed this both empirically and analytically. The thesis was written in German and was not widely read at the time, which delayed the field's recognition of the problem by several years.
A standard recurrent neural network updates its hidden state with h_t = tanh(W x_t + U h_{t-1} + b). When trained by backpropagation through time, the gradient of the loss with respect to an early hidden state involves a long product of Jacobians of that recurrence. If those Jacobians have spectral radius below 1, the product shrinks toward zero (vanishing gradient); if above 1, it explodes. The practical consequence is that vanilla RNNs cannot reliably learn dependencies that span more than 10 to 20 steps. Pascanu, Mikolov, and Bengio later gave the widely cited modern treatment[^5]. Gradient clipping fixes the explosion case but not the vanishing case; LSTM attacks the vanishing problem at the source by replacing the multiplicative chain with an additive cell-state path.
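The shrinking Jacobian product can be checked numerically. The sketch below is only an illustration, not taken from any of the cited papers. It assumes a recurrent matrix U equal to 0.5 times a random orthogonal matrix (so every singular value is 0.5), random inputs, and no bias term, and it tracks the spectral norm of the derivative of h_t with respect to h_0.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, steps = 16, 50

# Assumption for illustration: U = 0.5 * Q with Q orthogonal, so ||U||_2 = 0.5.
Q, _ = np.linalg.qr(rng.standard_normal((hidden, hidden)))
U = 0.5 * Q
W = 0.1 * rng.standard_normal((hidden, hidden))

h = np.zeros(hidden)
jac_product = np.eye(hidden)   # running d h_t / d h_0
norms = []

for _ in range(steps):
    x = rng.standard_normal(hidden)
    h = np.tanh(W @ x + U @ h)             # h_t = tanh(W x_t + U h_{t-1}), bias omitted
    # Jacobian of one step: d h_t / d h_{t-1} = diag(1 - h_t^2) @ U
    step_jac = (1.0 - h**2)[:, None] * U
    jac_product = step_jac @ jac_product   # chain rule across time
    norms.append(np.linalg.norm(jac_product, 2))

print(f"after 1 step: {norms[0]:.3e}")
print(f"after {steps} steps: {norms[-1]:.3e}")
```

Because each one-step Jacobian here has norm at most 0.5, the product is bounded by 0.5 raised to the number of steps, which is the exponential decay described above. Scaling U above 1 would be a way to explore the exploding case.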
Hochreiter and Schmidhuber kept working on the issue identified in the 1991 thesis and eventually proposed a fix in the 1997 paper "Long Short-Term Memory" in Neural Computation 9(8):1735-1780[^1]. Their core idea was the constant error carousel: a self-connected linear unit, with weight 1.0 on its self-loop, that lets the gradient flow backward through time without being multiplied by anything that could shrink or explode it. To control what gets written to and read from this carousel, they added two multiplicative "gates": an input gate and an output gate.
This original 1997 cell had no forget gate. Once a value was written into the cell state, it stayed there until the input gate wrote on top of it. That worked in benchmark tasks with clean sequence boundaries, but it caused problems on continuous streams where the cell state would drift and eventually saturate. Felix Gers, Jürgen Schmidhuber, and Fred Cummins fixed this in "Learning to Forget: Continual Prediction with LSTM" (Neural Computation 12(10):2451-2471, October 2000), adding a third gate that learns when to reset the cell[^6]. An earlier conference version appeared in 1999, and the IDSIA technical report carries the number IDSIA-01-99, but the canonical reference is the 2000 journal paper. Almost every modern reference to "the LSTM" actually means this 2000 variant.
The same year, Gers and Schmidhuber added peephole connections, which let the gates inspect the cell state directly when deciding whether to open or close[^7]. From there the architecture branched in many directions: bidirectional LSTM (Graves & Schmidhuber, 2005[^8]), the simpler [[gru]] (Cho et al., 2014[^9]), ConvLSTM for spatiotemporal grids (Shi et al., 2015[^10]), and many more. By the mid 2010s LSTM was the default sequence model in deep learning, powering Google Voice Search, Apple's Siri dictation, Google Translate, and most academic NLP papers. The 2017 publication of the Transformer in "Attention Is All You Need" started a fast migration toward [[attention_mechanism]]-based models, but LSTM stayed in the toolbox and, with xLSTM (Beck et al., 2024[^3]), came back into research focus.
Jürgen Schmidhuber has, for more than a decade, publicly argued that he and his collaborators have not received adequate credit for several foundational deep learning contributions, including LSTM[^11]. His critique has been pointed at Geoffrey Hinton, Yann LeCun, and Yoshua Bengio (recipients of the 2018 Turing Award) and at the broader public narrative around the deep learning revolution. The technical claims about the 1991 thesis and 1997 LSTM paper are uncontroversial; what is disputed is the relative weight given to recurrent versus feedforward work in the standard history. This article takes no position on the dispute and attributes each idea to its earliest verifiable publication.
Hochreiter went on to lead the Institute for Machine Learning at Johannes Kepler University Linz, and in 2021 he received the IEEE Computational Intelligence Society's Neural Networks Pioneer Award, in significant part for the LSTM work[^12]. Schmidhuber received the same award in 2016. The 1997 Neural Computation paper has been cited more than 100,000 times.
A modern LSTM cell, as it appears in textbooks and in deep learning libraries, has six computations per time step. Let x_t be the input vector at step t, h_{t-1} the previous hidden state, and c_{t-1} the previous cell state. Let sigma denote the sigmoid function and tanh the hyperbolic tangent. The cell computes:
| step | name | equation | role |
|---|---|---|---|
| 1 | forget gate | f_t = sigma(W_f [h_{t-1}, x_t] + b_f) | how much of the old cell state to keep |
| 2 | input gate | i_t = sigma(W_i [h_{t-1}, x_t] + b_i) | how much of the new candidate to write |
| 3 | candidate | c_tilde_t = tanh(W_c [h_{t-1}, x_t] + b_c) | proposed new content |
| 4 | cell update | c_t = f_t * c_{t-1} + i_t * c_tilde_t | additive memory update |
| 5 | output gate | o_t = sigma(W_o [h_{t-1}, x_t] + b_o) | how much of the cell to expose |
| 6 | hidden output | h_t = o_t * tanh(c_t) | what the rest of the network sees |
The brackets [h_{t-1}, x_t] denote vector concatenation; * denotes elementwise multiplication. Each gate is a small linear layer followed by a sigmoid, so its output is a vector of values between 0 and 1 that act as soft binary masks over the cell-state coordinates. The candidate vector c_tilde_t lives in (-1, 1) thanks to the tanh, and the gates decide how much of it to add and how much of the previous c_{t-1} to keep. The hidden state h_t is a gated, squashed view of the cell state, and it is the only thing other layers (or the next time step) can see.
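The six rows of the table translate almost line for line into code. The following is a minimal NumPy sketch of a single time step, not a library implementation; the function name `lstm_step`, the dictionary layout of the weights, and the sizes in the usage lines are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step following the six-row table.

    W[k] has shape (hidden, hidden + input) and b[k] has shape (hidden,)
    for k in {"f", "i", "c", "o"} (hypothetical layout for this sketch).
    """
    hx = np.concatenate([h_prev, x_t])        # [h_{t-1}, x_t]
    f_t = sigmoid(W["f"] @ hx + b["f"])       # 1. forget gate
    i_t = sigmoid(W["i"] @ hx + b["i"])       # 2. input gate
    c_tilde = np.tanh(W["c"] @ hx + b["c"])   # 3. candidate
    c_t = f_t * c_prev + i_t * c_tilde        # 4. cell update (additive)
    o_t = sigmoid(W["o"] @ hx + b["o"])       # 5. output gate
    h_t = o_t * np.tanh(c_t)                  # 6. hidden output
    return h_t, c_t

# Usage with hypothetical sizes and small random weights
rng = np.random.default_rng(0)
n_in, n_hid = 3, 4
W = {k: 0.1 * rng.standard_normal((n_hid, n_hid + n_in)) for k in "fico"}
b = {k: np.zeros(n_hid) for k in "fico"}

h, c = np.zeros(n_hid), np.zeros(n_hid)
for x_t in rng.standard_normal((5, n_in)):    # a 5-step sequence
    h, c = lstm_step(x_t, h, c, W, b)
print(h.shape, c.shape)                       # (4,) (4,)
```

Only `h` and `c` are carried from one step to the next; everything else is recomputed from the current input.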
PyTorch implements an equivalent form with separate input and recurrent weight matrices, which is mostly a notational difference[^13]. Its documentation writes the equations as i_t = sigma(W_ii x_t + b_ii + W_hi h_{t-1} + b_hi), and so on for f_t, g_t (the candidate), and o_t, with c_t = f_t * c_{t-1} + i_t * g_t and h_t = o_t * tanh(c_t). Two bias vectors per gate look redundant on paper, since b_ii + b_hi could be folded into a single bias, but keeping them separate matches the cuDNN kernels and avoids extra memory copies during training.
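That the two notations agree can be verified for one gate's pre-activation. This NumPy sketch uses random values and hypothetical sizes; it does not call PyTorch, and the variable names simply mirror the symbols in the equations above.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid = 3, 4
x_t = rng.standard_normal(n_in)
h_prev = rng.standard_normal(n_hid)

# Separate-matrix form with two bias vectors
W_ii = rng.standard_normal((n_hid, n_in))
W_hi = rng.standard_normal((n_hid, n_hid))
b_ii = rng.standard_normal(n_hid)
b_hi = rng.standard_normal(n_hid)
pre_separate = W_ii @ x_t + b_ii + W_hi @ h_prev + b_hi

# Concatenated form: W_i = [W_hi | W_ii] acting on [h_{t-1}, x_t], one folded bias
W_i = np.hstack([W_hi, W_ii])
b_i = b_ii + b_hi
pre_concat = W_i @ np.concatenate([h_prev, x_t]) + b_i

print(np.allclose(pre_separate, pre_concat))  # True
```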
Along the direct cell-state path, the partial derivative of c_t with respect to c_{t-1} is just f_t (the gates also depend on c_{t-1} indirectly through h_{t-1}, but the direct path is the one that carries long-range signal). When the forget gate is open (close to 1), the gradient flowing backward through that step is roughly the identity. Compose this across many steps and the product is the elementwise product of forget gates, which can stay near 1 if the network has learned to keep certain coordinates open. This is the modern version of the constant error carousel. The 1997 paper hard-coded the self-loop weight at 1.0; the 2000 paper let the network learn that weight per coordinate per time step.
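A small numerical check makes the product-of-forget-gates statement concrete. In this sketch the gate values are drawn at random and held fixed, which is a simplifying assumption: in a real LSTM they depend on the input and on h_{t-1}. Only the cell-update equation is simulated, for a single coordinate.

```python
import numpy as np

rng = np.random.default_rng(0)
steps = 200

# Assumption: gates are treated as given numbers rather than learned functions.
f = rng.uniform(0.97, 1.0, size=steps)       # forget gates close to 1
i = rng.uniform(0.0, 1.0, size=steps)        # input gates
c_tilde = rng.uniform(-1.0, 1.0, size=steps) # candidate values

def run_cell(c0):
    c = c0
    for t in range(steps):
        c = f[t] * c + i[t] * c_tilde[t]     # c_t = f_t * c_{t-1} + i_t * c_tilde_t
    return c

eps = 1e-6
numeric = (run_cell(1.0 + eps) - run_cell(1.0 - eps)) / (2 * eps)
analytic = np.prod(f)                        # product of forget gates

print(f"finite difference: {numeric:.6f}")
print(f"product of f_t: {analytic:.6f}")
print(f"same length with f_t = 0.5: {0.5 ** steps:.3e}")
```

The last line shows how quickly the same path collapses when the forget gates sit at 0.5 instead of near 1.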
A practical detail that matters a lot is the forget gate bias. With all biases zero, the forget gate sigmoid starts at 0.5, and the cell state decays by half per step on average. Jozefowicz, Zaremba, and Sutskever showed in 2015 that initializing the forget gate bias to 1 (gate sigmoid starts near 0.73) closes most of the gap between LSTM and GRU on a large empirical benchmark[^14]. Many libraries now do this by default.
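The effect of the bias is easy to quantify. The sketch below assumes the weight terms contribute roughly zero at initialization, so the gate value is set by the bias alone, and compares how much of a cell-state coordinate would survive ten steps under that assumption.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

for bias in (0.0, 1.0):
    f0 = sigmoid(bias)   # gate value at initialization, weights assumed to contribute ~0
    kept = f0 ** 10      # fraction of the cell state left after 10 steps
    print(f"bias={bias:.0f}: f={f0:.3f}, kept after 10 steps={kept:.4f}")
```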
LSTM is less a single architecture than a family. Greff and colleagues published "LSTM: A Search Space Odyssey" in 2017, comparing eight variants across speech, handwriting, and music tasks with about 5,400 training runs[^15]. Their conclusion was blunt: none of the standard variants meaningfully beats the modern (forget-gate, no peephole) LSTM on average, the forget gate and the output activation are the most important pieces, and the rest is mostly noise. Still, the family tree is worth knowing.
| variant | year | authors | distinguishing change |
|---|---|---|---|
| Original LSTM | 1997 | Hochreiter & Schmidhuber | Constant error carousel, input and output gates, no forget gate |
| Forget-gate LSTM | 2000 | Gers, Schmidhuber, Cummins | Adds the forget gate, what most people now call "the LSTM" |
| Peephole LSTM | 2000 | Gers & Schmidhuber | Gates can inspect c_{t-1} directly, helps with precise timing tasks |
| Bidirectional LSTM | 2005 | Graves & Schmidhuber | Two LSTMs run forward and backward over the same sequence, outputs concatenated |
| Tree LSTM | 2015 | Tai, Socher & Manning | Recurrence over a tree structure rather than a linear chain |
| ConvLSTM | 2015 | Shi et al. | Replaces matrix multiplies with convolutions, designed for video and weather radar |
| Highway LSTM | 2015 | Zhang et al. | Adds highway connections between stacked LSTM layers for deeper stacks |
| Multiplicative LSTM | 2016 | Krause et al. | Input-conditioned recurrence, used in OpenAI's sentiment neuron |
| Mogrifier LSTM | 2020 | Melis, Kočiský, Blunsom | Pre-mixes the input and previous hidden state several times before the LSTM update |
| xLSTM (sLSTM, mLSTM) | 2024 | Beck et al. | Exponential gating, scalar and matrix memory, parallelizable variant for billions of parameters |
Peephole connections, introduced by Gers and Schmidhuber in 2000[^7], let the gates inspect the cell state in addition to the hidden state h_{t-1}: in the usual formulation the input and forget gates see the previous cell state c_{t-1}, and the output gate sees the freshly updated c_t. The change is small (one extra term per gate) but matters for tasks that need precise timing, such as counting beats. Peepholes are not the default in modern libraries because they did not improve average performance in the Search Space Odyssey benchmark.
Most real systems stack two or more LSTM layers and often run the bottom layer in both directions. Graves and Schmidhuber's 2005 paper on framewise phoneme classification with bidirectional LSTM[^8] was the first to show that adding a backward pass meaningfully improves performance on labeled sequence tasks. The trick cannot be used for autoregressive generation, since the backward pass would peek at future tokens, but it is the default for tagging, classification, and acoustic modeling.
Tree LSTM (Tai, Socher, Manning, ACL 2015[^16]) generalises the chain-structured recurrence to arbitrary tree topologies, with each cell receiving inputs from its child nodes. The two main variants are the Child-Sum Tree LSTM and the N-ary Tree LSTM. Tree LSTMs improved on chain LSTMs for semantic relatedness on SemEval 2014 Task 1 and for sentiment classification on the Stanford Sentiment Treebank. The architecture has been largely displaced by attention-based parsers in production NLP.
The Gated Recurrent Unit, introduced by Cho and colleagues in EMNLP 2014[^9], is the most widely used LSTM relative. It collapses the input and forget gates into a single update gate z_t and adds a reset gate r_t controlling how much of the previous hidden state contributes to the candidate. There is no separate cell state. The update is h_t = (1 - z_t) * h_{t-1} + z_t * tanh(W [r_t * h_{t-1}, x_t]). GRU has roughly 25 percent fewer parameters than LSTM at the same hidden size and trains slightly faster; the two are usually within a percentage point on most tasks. See [[gru]] for a longer treatment.
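The update rule and the parameter comparison can both be sketched in a few lines of NumPy. The function names and the bias terms are assumptions for illustration (the update equation above is written without biases), and the parameter count uses the concatenated-weight convention from the LSTM table, with one matrix and one bias per gate or candidate.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W_z, W_r, W_h, b_z, b_r, b_h):
    hx = np.concatenate([h_prev, x_t])
    z_t = sigmoid(W_z @ hx + b_z)                    # update gate
    r_t = sigmoid(W_r @ hx + b_r)                    # reset gate
    h_cand = np.tanh(W_h @ np.concatenate([r_t * h_prev, x_t]) + b_h)
    return (1.0 - z_t) * h_prev + z_t * h_cand       # no separate cell state

def n_params(n_blocks, n_in, n_hid):
    # each block: one (n_hid, n_hid + n_in) matrix plus one bias vector
    return n_blocks * (n_hid * (n_hid + n_in) + n_hid)

n_in, n_hid = 64, 128                     # hypothetical sizes
lstm_params = n_params(4, n_in, n_hid)    # f, i, candidate, o
gru_params = n_params(3, n_in, n_hid)     # z, r, candidate
print(gru_params / lstm_params)           # 0.75
```

The ratio is 3/4 regardless of the sizes chosen, which is where the "roughly 25 percent fewer parameters" figure comes from under this counting convention.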
ConvLSTM (Shi et al., NeurIPS 2015[^10]) replaces the matrix multiplications inside the gates with convolutions. The cell state and hidden state become 3D tensors (channels by height by width) instead of vectors. This is the natural fit for any task where each time step is itself a spatial grid: weather radar, video, or fluid simulation. The original paper used it for short-term rainfall prediction.
Training an LSTM looks like training any other recurrent network: the sequence is unrolled in time, the loss is summed across supervised steps, and gradients flow back through [[backpropagation_through_time]]. Three details deserve attention.
Gradient clipping. Even with the additive cell-state path, the gates and input-to-hidden weights can produce large gradients on rare events. Clipping the global gradient norm to 1 or 5 is standard; Pascanu, Mikolov, and Bengio (2013) gave the now-standard analysis[^5].
Truncated BPTT. Full BPTT over very long sequences is expensive. Truncated BPTT processes the sequence in chunks of, say, 100 to 200 steps, carrying the cell and hidden state forward but only backpropagating gradients within the current chunk. This is what .detach() on the hidden state between batches achieves.
Dropout. Naive dropout on the recurrent connections destroys the long-term memory because the mask changes at every step. Variational dropout (Gal & Ghahramani, 2016[^17]) fixes a single mask per sequence; zoneout (Krueger et al., 2017) instead keeps a random subset of hidden units at their previous values at each step. Most production code applies dropout only between stacked LSTM layers, which is what the dropout argument controls in torch.nn.LSTM.
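Gradient clipping and truncated BPTT usually appear together in a training loop. The following PyTorch sketch shows one possible arrangement; the toy data, the layer sizes, the chunk length of 100, the SGD optimizer, and the regression head are all assumptions chosen to keep the example small, not a recommended recipe.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
head = nn.Linear(16, 1)
params = list(lstm.parameters()) + list(head.parameters())
optimizer = torch.optim.SGD(params, lr=0.01)

# Hypothetical toy data: 4 sequences of 300 steps each.
x = torch.randn(4, 300, 8)      # (batch, time, features)
y = torch.randn(4, 300, 1)

chunk_len = 100                 # truncation window for BPTT
max_norm = 1.0                  # global gradient-norm threshold
state = None                    # None means a zero initial (h, c)
grad_norms = []

for start in range(0, x.size(1), chunk_len):
    xb = x[:, start:start + chunk_len]
    yb = y[:, start:start + chunk_len]

    out, state = lstm(xb, state)            # state carries (h, c) across chunks
    loss = nn.functional.mse_loss(head(out), yb)

    optimizer.zero_grad()
    loss.backward()                         # gradients stop at the chunk boundary
    # Rescales all gradients in place; returns the norm measured before clipping.
    total_norm = torch.nn.utils.clip_grad_norm_(params, max_norm)
    grad_norms.append(float(total_norm))
    optimizer.step()

    # Keep the values of h and c but cut the autograd graph behind them.
    state = tuple(s.detach() for s in state)
```

Without the `detach()` call, the second `backward()` would try to propagate into the graph of the first chunk, which has already been freed.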
LSTMs powered most of the practical sequence learning systems in the 2010s. A small selection:
| year | system | role of LSTM | reference |
|---|---|---|---|
| 2013 | Deep RNN for TIMIT phoneme recognition | Deep bidirectional LSTM with CTC achieves 17.7% error on TIMIT | Graves, Mohamed, Hinton, ICASSP 2013[^18] |
| 2014 | Sequence to Sequence Learning | Two deep (4-layer) LSTMs, one as encoder and one as decoder, for machine translation | Sutskever, Vinyals, Le, NeurIPS 2014[^19] |
| 2014 | Sak, Senior, Beaufays acoustic model | Distributed LSTM acoustic models for speech recognition at Google | Sak et al., Interspeech 2014[^20] |
| 2015 | Show and Tell image captioning | CNN encoder feeding an LSTM caption decoder | Vinyals, Toshev, Bengio, Erhan, CVPR 2015[^21] |
| 2015 | Google Voice Search | Production deployment of LSTM acoustic models, cut transcription errors substantially | Sak et al. blog post |
| 2015 | Karpathy's char-rnn | Popular blog post showing LSTMs generating Shakespeare and Linux source code | "The Unreasonable Effectiveness of Recurrent Neural Networks"[^22] |
| 2016 | Google Neural Machine Translation | 8-layer LSTM encoder, 8-layer LSTM decoder, [[attention_mechanism]], wordpiece tokens, replaced phrase-based MT in Google Translate | Wu et al., arXiv:1609.08144[^23] |
| 2016 | DeepMind WaveNet baselines | LSTM language model used as comparison point for raw-audio generation | Van den Oord et al. |
| 2018 | ELMo contextual embeddings | Bidirectional LSTM language model whose hidden states are used as word features | Peters et al., NAACL 2018 |
| 2018 | OpenAI Five (Dota 2) | LSTM-based policy network trained with PPO to play 5v5 Dota at professional level | OpenAI |
| 2019 | DeepMind AlphaStar (StarCraft II) | LSTM core inside a transformer-LSTM hybrid policy that beat top human players | Vinyals et al., Nature 575 |
| 2019 | Apple Siri offline dictation | On-device LSTM acoustic model running locally on iPhone | Apple ML Journal |
Speech recognition was arguably the first commercially important domain where LSTM beat alternative approaches at scale. Graves, Mohamed, and Hinton's 2013 paper used deep bidirectional LSTM stacks trained with Connectionist Temporal Classification to set a TIMIT phoneme recognition record at 17.7% error[^18]. Sak, Senior, and Beaufays then showed at Interspeech 2014 that LSTM acoustic models could be trained on Google-scale distributed infrastructure and matched or beat existing hybrid DNN-HMM systems[^20]. Within a year these models were running in Google Voice Search, and similar architectures soon powered Apple's Siri and Amazon's Alexa speech front-ends.
Sutskever, Vinyals, and Le's 2014 paper[^19] introduced the [[seq2seq]] framework: an encoder LSTM compresses an input sequence into a fixed vector and a decoder LSTM expands that vector into the output. With Bahdanau attention added in 2015 and the wordpiece encoding of GNMT in 2016[^23], LSTM-based seq2seq became the engine of production neural machine translation until the Transformer replaced it.
Vinyals, Toshev, Bengio, and Erhan's "Show and Tell" model[^21] glued a pretrained convolutional image encoder to an LSTM language decoder, conditioning the LSTM on the visual features at the first step. The system tied for first place in the 2015 MSCOCO captioning challenge and established the encoder-decoder framework that dominated multimodal generation for several years.
LSTM language models held state of the art on Penn Treebank and WikiText benchmarks until Transformer language models took over. ELMo (Peters et al., NAACL 2018) was the last big LSTM-based pretrained model before BERT. In time-series forecasting LSTM remains a common baseline. In reinforcement learning LSTM cores are standard for partially observed environments; DRQN, IMPALA, R2D2, OpenAI Five, and AlphaStar all used recurrent value or policy networks. In computational biology LSTM has been used for protein structure features and DNA sequence analysis, although Transformers have largely taken over there as well.
The two big problems with LSTM are both about scale.
First, the recurrence is inherently sequential. The cell state at step t depends on the cell state at step t-1, so you cannot parallelize the forward pass over time the way you can for self-attention. On modern GPUs, a model that keeps a small fraction of the chip busy all of the time (say 5 percent) finishes later than one that keeps most of it busy in short bursts (say 80 percent), even if the second does more arithmetic. cuDNN's fused LSTM kernel and architectures such as the quasi-RNN (QRNN) or the Simple Recurrent Unit (SRU) recover some of this, but not all of it.
Second, LSTMs are hard to scale to extremely long contexts. Even with a forget gate close to 1, the cell state is a fixed-size vector; you cannot cram a million tokens of context into a few thousand floats without losing information. Transformers handle this by keeping a key-value cache that grows with the input, paying quadratic compute for the privilege. State space models and xLSTM-mLSTM aim at the middle ground.
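The difference between a fixed-size state and a growing cache is simple arithmetic. The sketch below counts stored floats under two stated assumptions: the LSTM keeps one hidden and one cell vector per layer, and the Transformer caches one full-width key and one full-width value vector per layer per token (no cache compression or key-value sharing). The layer count and width are hypothetical.

```python
def lstm_state_floats(n_layers, hidden_size):
    # one hidden vector and one cell vector per layer, independent of context length
    return 2 * n_layers * hidden_size

def kv_cache_floats(n_layers, d_model, context_len):
    # one key and one value vector of size d_model per layer per cached token
    return 2 * n_layers * d_model * context_len

# Hypothetical sizes chosen only for illustration
n_layers, width = 24, 2048
for T in (1_000, 100_000, 1_000_000):
    print(T, lstm_state_floats(n_layers, width), kv_cache_floats(n_layers, width, T))
```

Under these assumptions the cache is exactly T times larger than the recurrent state, which is the trade the paragraph describes: the Transformer keeps everything, and the LSTM must compress.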
LSTM hyperparameters (learning rate, gradient clipping, hidden size, dropout) are also tightly coupled, and the same recipe rarely transfers across tasks. Practitioners report that getting an LSTM to train well takes more babysitting than a Transformer, mostly because Transformer pretraining recipes are now extremely well-documented and LSTM ones are not.
The Transformer, introduced by Vaswani and colleagues in "Attention Is All You Need" (NeurIPS 2017[^2]), removed recurrence entirely and replaced it with self-attention over the whole sequence at once. This was good news on a GPU because every position can be processed in parallel, and bad news on the memory bill because the attention matrix is quadratic in sequence length. For most NLP tasks the trade-off favored Transformers, and from roughly 2018 onward the field shifted accordingly. ELMo (2018) was the last famous LSTM-based pretrained language model; BERT, GPT, and everything that followed used self-attention.
That said, LSTMs and Transformers have different inductive biases and different cost structures, and the choice is not as one-sided as the trend suggests.
| property | LSTM | Transformer |
|---|---|---|
| time per training step | O(T) sequential | O(T^2) parallel |
| time per token at inference | O(1), constant memory | O(T) attention over the cache, O(T) memory |
| state | fixed-size hidden + cell | growing key-value cache |
| parallelism over sequence | poor | excellent |
| typical context length in practice | thousands of steps with care | millions of tokens with hardware tricks |
| inductive bias | strong recency, sequential | none, position must be encoded |
| works well with very small data | yes | usually needs more |
| dominant in speech front-ends (2026) | still common | catching up |
The practical upshot: Transformers win when you have lots of data, lots of compute, and care about throughput. LSTMs win, or at least compete, when you care about constant memory at inference (streaming speech, on-device, embedded), when sequences are very long but mostly recent context matters, when data is small, or when you want a strong online learning algorithm. Speech recognition front-ends, on-device keyboard prediction, real-time translation pipelines, and a long tail of tabular and time-series models still rely on LSTM cells.
After several years where new sequence architectures meant some variant of attention, the late 2023 and 2024 wave of papers brought recurrence back. The headline names are Mamba, RWKV, and xLSTM.
[[mamba]] (Gu and Dao, arXiv:2312.00752, December 2023[^24]) is a [[state_space_model]], not an LSTM, but it inherits the recurrent flavor: linear time in sequence length, constant memory at inference, and, in the selective version, input-dependent gating that closely mirrors what an LSTM does with its forget gate. The authors report that Mamba-3B outperforms Transformers of the same size and matches Transformers twice its size on language modeling, and that on long sequences it is significantly faster at inference. Mamba builds on Gu's earlier S4 family of structured state space models and the "selective scan" idea that makes the recurrence data-dependent.
[[rwkv]] (Peng et al., EMNLP Findings 2023[^25]) reformulates attention so that it can be computed as a recurrence, giving it Transformer-like training and RNN-like inference. RWKV models have been trained up to 14 billion parameters, the largest dense RNN trained at the time of publication.
[[xlstm]] (Beck, Pöppel, Spanring, Auer, Prudnikova, Kopp, Klambauer, Brandstetter, and Hochreiter, arXiv:2405.04517, May 2024[^3]) is the closest direct revival of the original idea. The paper introduces two new cell types: sLSTM, which keeps the scalar memory of the original LSTM but adds exponential gating with stabilization, and mLSTM, which replaces the scalar cell state with a matrix and uses a covariance update rule that can be parallelized like attention. Stacked into residual blocks, xLSTM models compete with Llama-style Transformers and Mamba at billion-parameter scale. NXAI, the company Hochreiter co-founded in 2023, has continued to push xLSTM as a basis for European foundation models.
Mamba, RWKV, xLSTM, and their relatives (Griffin, Hawk, RetNet, GLA) are sometimes called the linear RNN renaissance or post-attention architectures. They share a few common moves: keep a fixed or slowly growing recurrent state, use input-dependent gating to recover some of the expressive power lost by giving up full attention, and arrange the math so the recurrence can be evaluated as a parallel prefix scan during training. They are most competitive in the long-context, on-device, and streaming settings where Transformers struggle.
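The parallel prefix scan works because a linear recurrence h_t = a_t * h_{t-1} + b_t composes associatively. The NumPy sketch below is a generic illustration, not the kernel used by any of the models named above: it compares a step-by-step loop with a doubling (Hillis-Steele style) scan that needs only about log2(T) rounds, each of which is a whole-array operation. The function names and the zero initial state are assumptions.

```python
import numpy as np

def scan_sequential(a, b):
    """h_t = a_t * h_{t-1} + b_t with a zero initial state, one step at a time."""
    h = np.zeros_like(b[0])
    out = []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        out.append(h)
    return np.stack(out)

def scan_parallel(a, b):
    """Same result via a doubling prefix scan over the pairs (a_t, b_t).

    Composing step (a_l, b_l) followed by step (a_r, b_r) gives
    (a_r * a_l, a_r * b_l + b_r), which is associative.
    """
    A, B = a.copy(), b.copy()
    shift = 1
    while shift < len(a):
        # Left operand for position t is position t - shift; pad with the identity (1, 0).
        A_left = np.concatenate([np.ones_like(A[:shift]), A[:-shift]])
        B_left = np.concatenate([np.zeros_like(B[:shift]), B[:-shift]])
        B = A * B_left + B      # uses the old A, so update B first
        A = A * A_left
        shift *= 2
    return B

rng = np.random.default_rng(0)
a = rng.uniform(0.5, 1.0, size=(37, 4))   # input-dependent "forget" factors
b = rng.standard_normal((37, 4))          # input-dependent writes
print(np.allclose(scan_sequential(a, b), scan_parallel(a, b)))  # True
```

A standard LSTM does not fit this form directly, because its gates depend on h_{t-1} through a nonlinearity; making the gates depend only on the input is a design choice that enables the scan.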
Every major deep learning framework ships an LSTM implementation backed by a fast GPU kernel. The interfaces are similar enough to swap with a one-line change.
| framework | module | notes |
|---|---|---|
| PyTorch | torch.nn.LSTM, torch.nn.LSTMCell | cuDNN backend, supports stacked, bidirectional, dropout between layers |
| TensorFlow / Keras | tf.keras.layers.LSTM, LSTMCell | XLA and cuDNN paths, the layer auto-selects the fast kernel when conditions are met |
| JAX | flax.linen.OptimizedLSTMCell, haiku.LSTM | functional API, scan over time |
| MXNet | mx.gluon.rnn.LSTM | similar API to PyTorch |
| ONNX | LSTM operator | for model export and cross-framework deployment |
| Apple Core ML | LSTMLayer | converts from PyTorch and TensorFlow for on-device inference |
```python
import torch
import torch.nn as nn

lstm = nn.LSTM(
    input_size=64,
    hidden_size=128,
    num_layers=2,
    batch_first=True,
    bidirectional=False,
    dropout=0.2,
)

x = torch.randn(32, 100, 64)  # (batch, time, features)
outputs, (h_n, c_n) = lstm(x)
# outputs: (32, 100, 128) - hidden state at every step
# h_n, c_n: (2, 32, 128) - final hidden and cell states for each layer
```
For about five years, from roughly 2013 through 2018, LSTM was synonymous with sequence modeling in deep learning. The original 1997 paper has been cited well over 100,000 times. Andrej Karpathy's 2015 blog post "The Unreasonable Effectiveness of Recurrent Neural Networks"[^22] introduced an entire generation of practitioners to LSTM by showing it generating plausible Shakespeare and convincing-looking C code from a character-level model. Christopher Olah's 2015 blog post "Understanding LSTM Networks"[^26] remains, more than a decade later, the diagram people reach for when explaining the gates.
The Transformer mostly displaced LSTM in research after 2018, but in production the transition has been slower. With xLSTM, Mamba, and RWKV bringing recurrence back to the frontier, the architecture's second act is still being written.