Long Short-Term Memory (LSTM) is a specialized type of recurrent neural network (RNN) architecture designed to learn long-range dependencies in sequential data. Introduced by Sepp Hochreiter and Jürgen Schmidhuber in 1997, LSTMs address the fundamental limitation of standard RNNs: their inability to retain information across many time steps due to the vanishing gradient problem. The key innovation is a memory cell regulated by learnable gates that control the flow of information, allowing the network to selectively remember or forget information over arbitrarily long sequences. LSTMs have become one of the most widely used architectures in deep learning, powering advances in natural language processing, speech recognition, machine translation, and time series forecasting.
Imagine you are reading a really long storybook. A regular brain (a basic RNN) forgets the beginning of the story by the time it reaches the end. An LSTM brain is like having a special notebook beside you while reading. The notebook has three colored pens:

- a crossing-out pen that strikes through notes that no longer matter (the forget gate),
- a writing pen that jots down new details worth remembering (the input gate), and
- a highlighter that marks which notes to read out loud right now (the output gate).
Because the notebook keeps a running record and only changes what is truly needed, you can remember important details from the very first page even when you are on the last page. That is what makes LSTMs so good at tasks where order and long-term context matter.
Traditional RNNs process sequential data by maintaining a hidden state that is updated at each time step. In theory, this hidden state can carry information from earlier parts of a sequence to later parts. In practice, however, training RNNs with backpropagation through time (BPTT) causes gradients to either shrink exponentially (vanishing gradients) or grow uncontrollably (exploding gradients) as they propagate backward through many time steps.
Sepp Hochreiter formally analyzed this problem in his 1991 diploma thesis, showing that error signals in standard RNNs decay exponentially with the number of time steps, making it nearly impossible to learn dependencies that span more than about 10 steps. This analysis motivated the search for an architecture that could maintain stable gradient flow over long sequences.
An LSTM unit replaces the simple hidden-state update of a vanilla RNN with a more elaborate structure consisting of a cell state and three gates: the forget gate, the input gate, and the output gate. The cell state acts as a conveyor belt that carries information through time with minimal interference, while the gates learn to regulate what information enters, persists in, and exits the cell.
The cell state is the defining feature of an LSTM. Unlike the hidden state in a standard RNN, which is entirely rewritten at every time step through a matrix multiplication and nonlinear activation function, the cell state is updated through additive operations. Hochreiter and Schmidhuber called this mechanism the constant error carousel (CEC): because the cell state is modified only by element-wise addition and multiplication (not by a full matrix transformation), the gradient of the cell state with respect to itself at a previous time step remains close to 1. This property allows error signals to flow backward through hundreds or even thousands of time steps without vanishing.
Introduced by Gers, Schmidhuber, and Cummins in 1999 as an improvement to the original LSTM (which lacked this gate), the forget gate decides which information in the cell state should be discarded. It takes the previous hidden state h(t-1) and the current input x(t) and passes them through a sigmoid function that outputs a value between 0 and 1 for each element of the cell state. A value of 1 means "keep this entirely" and a value of 0 means "discard this completely."
f(t) = sigmoid(W_f * x(t) + U_f * h(t-1) + b_f)
The input gate determines which new information should be written to the cell state. It has two parts: a sigmoid layer that decides which values to update, and a tanh layer that creates a vector of candidate values.
i(t) = sigmoid(W_i * x(t) + U_i * h(t-1) + b_i)
c_candidate(t) = tanh(W_c * x(t) + U_c * h(t-1) + b_c)
The new cell state is computed by combining the forget gate's filtering of the old cell state with the input gate's selection of new candidate values:
c(t) = f(t) * c(t-1) + i(t) * c_candidate(t)
Here, the asterisk (*) denotes element-wise (Hadamard) multiplication. This additive update is what enables the constant error carousel.
The output gate controls which parts of the cell state are exposed as the hidden state output. The cell state is passed through a tanh function (squashing values to the range [-1, 1]) and then multiplied element-wise by the output gate's sigmoid activation:
o(t) = sigmoid(W_o * x(t) + U_o * h(t-1) + b_o)
h(t) = o(t) * tanh(c(t))
The hidden state h(t) serves as both the output of the current time step and the recurrent input to the next time step.
| Component | Equation | Activation |
|---|---|---|
| Forget gate | f(t) = sigmoid(W_f * x(t) + U_f * h(t-1) + b_f) | Sigmoid |
| Input gate | i(t) = sigmoid(W_i * x(t) + U_i * h(t-1) + b_i) | Sigmoid |
| Candidate cell | c_candidate(t) = tanh(W_c * x(t) + U_c * h(t-1) + b_c) | Tanh |
| Cell state update | c(t) = f(t) * c(t-1) + i(t) * c_candidate(t) | None (element-wise) |
| Output gate | o(t) = sigmoid(W_o * x(t) + U_o * h(t-1) + b_o) | Sigmoid |
| Hidden state | h(t) = o(t) * tanh(c(t)) | Tanh |
In these equations, W denotes the input weight matrices, U denotes the recurrent weight matrices, and b denotes the bias vectors. Each gate has its own set of parameters, which are learned during training.
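The equations above translate almost directly into code. The following is a minimal NumPy sketch of a single LSTM step using the same W/U/b notation; the dimensions, random initialization, and parameter layout are illustrative assumptions, not taken from any particular library.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step following the gate equations above.

    params holds W_* (input weights), U_* (recurrent weights), and b_* (biases)
    for the forget, input, candidate, and output paths.
    """
    f_t = sigmoid(params["W_f"] @ x_t + params["U_f"] @ h_prev + params["b_f"])    # forget gate
    i_t = sigmoid(params["W_i"] @ x_t + params["U_i"] @ h_prev + params["b_i"])    # input gate
    c_hat = np.tanh(params["W_c"] @ x_t + params["U_c"] @ h_prev + params["b_c"])  # candidate values
    c_t = f_t * c_prev + i_t * c_hat                                               # additive cell update
    o_t = sigmoid(params["W_o"] @ x_t + params["U_o"] @ h_prev + params["b_o"])    # output gate
    h_t = o_t * np.tanh(c_t)                                                       # new hidden state
    return h_t, c_t

# Toy dimensions: 4-dimensional inputs, 3-dimensional hidden/cell state.
rng = np.random.default_rng(0)
n_in, n_hid = 4, 3
params = {}
for name in ("f", "i", "c", "o"):
    params[f"W_{name}"] = rng.standard_normal((n_hid, n_in)) * 0.1
    params[f"U_{name}"] = rng.standard_normal((n_hid, n_hid)) * 0.1
    params[f"b_{name}"] = np.zeros(n_hid)

h, c = np.zeros(n_hid), np.zeros(n_hid)
for x in rng.standard_normal((5, n_in)):   # a sequence of 5 input vectors
    h, c = lstm_step(x, h, c, params)
```

Running the loop carries h and c forward step by step, exactly as the recurrence in the table describes.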
The vanishing gradient problem occurs in standard RNNs because gradients are multiplied by the recurrent weight matrix at every time step during backpropagation through time. If the largest eigenvalue of this matrix is less than 1, gradients shrink exponentially; if it is greater than 1, gradients explode.
LSTMs circumvent this through two mechanisms:
Additive cell state updates. The cell state is updated via addition rather than multiplication by a weight matrix. When the forget gate is close to 1, the gradient of the loss with respect to the cell state at time t passes nearly unchanged to time t-1. This is the constant error carousel: the self-recurrence of the cell state has a derivative of approximately 1.
Learnable gates. The gates learn when to allow information in, retain it, or release it. By setting the forget gate close to 1 for information that should be remembered and close to 0 for information that should be discarded, the network adaptively controls gradient flow.
This does not completely eliminate gradient issues (gradients can still vanish through the gates themselves), but it greatly extends the effective range of temporal dependencies that can be learned. Hochreiter and Schmidhuber demonstrated that LSTMs can learn to bridge time lags of over 1,000 steps, far exceeding the capabilities of standard RNNs.
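A small autograd check makes the constant error carousel concrete. In this sketch (PyTorch, with arbitrary illustrative values), the gradient flowing from c(t) back to c(t-1) is exactly the forget gate activation, so it stays close to 1 whenever the forget gate does.

```python
import torch

# Illustrative gate activations and states; the numbers are arbitrary.
f_t = torch.tensor([0.95, 0.99, 0.90])        # forget gate close to 1
i_t = torch.tensor([0.10, 0.20, 0.05])        # input gate
c_hat = torch.tensor([0.50, -0.30, 0.80])     # candidate values
c_prev = torch.tensor([1.0, -2.0, 0.5], requires_grad=True)

c_t = f_t * c_prev + i_t * c_hat              # additive cell state update
c_t.sum().backward()

# d c(t) / d c(t-1) equals f(t) element-wise: the error signal barely shrinks.
print(c_prev.grad)                            # tensor([0.9500, 0.9900, 0.9000])
```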
Since the original 1997 publication, several variants of the LSTM architecture have been proposed.
Gers and Schmidhuber (2000) introduced peephole connections that allow the gates to access the cell state directly, rather than relying solely on the hidden state. In the standard LSTM, the gates see only the previous hidden state h(t-1) and the current input x(t). With peephole connections, the forget and input gates also receive c(t-1), and the output gate receives c(t):

f(t) = sigmoid(W_f * x(t) + U_f * h(t-1) + P_f * c(t-1) + b_f)
i(t) = sigmoid(W_i * x(t) + U_i * h(t-1) + P_i * c(t-1) + b_i)
o(t) = sigmoid(W_o * x(t) + U_o * h(t-1) + P_o * c(t) + b_o)
Here, P_f, P_i, and P_o are diagonal weight matrices for the peephole connections. Peephole connections were originally designed to help LSTMs learn precise timing, but empirical studies have shown mixed results regarding their general benefit. A large-scale study by Greff et al. (2017) found that peephole connections did not significantly improve performance on most tasks.
Graves and Schmidhuber (2005) combined LSTM with bidirectional processing. A bidirectional LSTM runs two separate LSTM layers in parallel: one processes the input sequence from left to right (forward), and the other processes it from right to left (backward). The outputs of both layers are concatenated at each time step, giving the network access to both past and future context.
BiLSTMs are especially useful for tasks where the entire input sequence is available at once, such as named entity recognition, part-of-speech tagging, and text classification. They are not suitable for real-time or autoregressive tasks where future inputs are unavailable.
Stacking multiple LSTM layers on top of each other creates a deep LSTM architecture. The hidden state output of the first LSTM layer serves as the input sequence for the second layer, and so on. Stacked LSTMs increase the model's capacity to learn hierarchical representations of sequential data. Sutskever et al. (2014) used a four-layer stacked LSTM for their influential sequence-to-sequence machine translation model.
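As a rough sketch of how both variants are typically configured in practice, PyTorch's nn.LSTM exposes stacking and bidirectionality as constructor arguments; the layer sizes and sequence lengths below are arbitrary assumptions.

```python
import torch
import torch.nn as nn

# Two stacked layers, each processed forward and backward; batches of 32
# sequences with 50 time steps of 100-dimensional inputs, hidden size 64.
lstm = nn.LSTM(input_size=100, hidden_size=64,
               num_layers=2,          # stacked: layer 2 consumes layer 1's outputs
               bidirectional=True,    # forward and backward passes, concatenated
               batch_first=True)

x = torch.randn(32, 50, 100)
output, (h_n, c_n) = lstm(x)

print(output.shape)  # torch.Size([32, 50, 128])  -> 64 forward + 64 backward
print(h_n.shape)     # torch.Size([4, 32, 64])    -> 2 layers * 2 directions
```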
The Gated Recurrent Unit was introduced by Cho et al. in 2014 as a simplified alternative to LSTM. GRUs merge the cell state and hidden state into a single state vector and replace the three gates of an LSTM with two: a reset gate and an update gate. This reduces the number of parameters and makes GRUs faster to train.
| Feature | LSTM | GRU |
|---|---|---|
| Gates | 3 (forget, input, output) | 2 (reset, update) |
| State vectors | 2 (cell state + hidden state) | 1 (hidden state only) |
| Parameters | More | About 25% fewer |
| Training speed | Slower | Faster |
| Long-sequence performance | Slightly better on very long sequences | Comparable on short-to-medium sequences |
| Memory usage | Higher | Lower |
Empirical comparisons by Chung et al. (2014) showed that GRUs and LSTMs perform comparably on many tasks, with neither consistently outperforming the other. GRUs tend to be preferred when computational resources are limited, while LSTMs may have an edge on tasks requiring fine-grained memory control over very long sequences.
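The parameter difference is easy to verify empirically. This small sketch compares equally sized PyTorch modules; the dimensions are arbitrary assumptions chosen only for illustration.

```python
import torch.nn as nn

def count_params(module):
    return sum(p.numel() for p in module.parameters())

lstm = nn.LSTM(input_size=128, hidden_size=256, num_layers=1)
gru = nn.GRU(input_size=128, hidden_size=256, num_layers=1)

print(count_params(lstm))  # ~395k: four weight/bias sets (forget, input, candidate, output)
print(count_params(gru))   # ~296k: three sets (reset, update, candidate) -> roughly 25% fewer
```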
LSTMs are trained using gradient descent combined with backpropagation through time (BPTT). Several practical considerations affect training quality and stability.
Although LSTMs mitigate the vanishing gradient problem, the exploding gradient problem can still occur during training. Gradient clipping, proposed by Pascanu et al. (2013), addresses this by rescaling the gradient vector whenever its norm exceeds a specified threshold. Norm-based clipping (scaling the entire gradient to a fixed maximum norm) is the most common approach. A clipping threshold of 1.0 to 5.0 is a typical starting point.
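In PyTorch, norm-based clipping is a single call placed between the backward pass and the optimizer step. The following is a minimal training-step sketch; the model, data, and the threshold of 1.0 (the starting point mentioned above) are all illustrative assumptions.

```python
import torch
import torch.nn as nn

# Toy model and data, purely to show where clipping fits in the loop.
lstm = nn.LSTM(input_size=10, hidden_size=20, batch_first=True)
head = nn.Linear(20, 1)
params = list(lstm.parameters()) + list(head.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)

x = torch.randn(8, 30, 10)       # batch of 8 sequences, 30 steps each
y = torch.randn(8, 1)            # dummy regression targets

output, _ = lstm(x)
pred = head(output[:, -1, :])    # prediction from the last time step
loss = nn.functional.mse_loss(pred, y)

loss.backward()
# Rescale the full gradient vector whenever its norm exceeds the threshold.
torch.nn.utils.clip_grad_norm_(params, max_norm=1.0)
optimizer.step()
optimizer.zero_grad()
```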
Proper initialization of LSTM weights is important for stable training. Common strategies include:

- Xavier (Glorot) initialization for the input weight matrices W,
- orthogonal initialization for the recurrent weight matrices U, which helps preserve gradient norms through time, and
- initializing the forget gate bias to 1 (or another small positive value) so the network starts out remembering rather than forgetting, a choice Jozefowicz et al. (2015) found particularly important (see the sketch below).
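As one concrete sketch of these strategies using PyTorch's nn.LSTM, whose gate ordering within each packed weight tensor is input, forget, cell, output; the layer sizes are arbitrary assumptions.

```python
import torch.nn as nn

hidden_size = 64
lstm = nn.LSTM(input_size=32, hidden_size=hidden_size, num_layers=1)

for name, param in lstm.named_parameters():
    if "weight_hh" in name:
        nn.init.orthogonal_(param)        # orthogonal recurrent weights
    elif "weight_ih" in name:
        nn.init.xavier_uniform_(param)    # Xavier for input weights
    elif "bias" in name:
        param.data.fill_(0.0)
        # Biases are packed as [input, forget, cell, output] gate slices;
        # set the forget-gate slice to 1 so the cell starts by remembering.
        param.data[hidden_size:2 * hidden_size].fill_(1.0)
```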
Adaptive optimizers such as Adam and RMSprop are commonly used for training LSTMs because they handle the varying gradient scales across different parameters more effectively than vanilla stochastic gradient descent. Adam is the most popular choice for LSTM-based models.
Overfitting is a common issue when training LSTMs on limited data. Dropout, applied to the input and output connections (but not the recurrent connections), is the most widely used regularization technique. Gal and Ghahramani (2016) proposed variational dropout for RNNs, which applies the same dropout mask at every time step, yielding better regularization than naive dropout. Other techniques include weight decay (L2 regularization) and early stopping.
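As a small sketch of the non-recurrent dropout described above: PyTorch's nn.LSTM dropout argument applies dropout to the outputs of each stacked layer except the last, leaving the step-to-step recurrent connections untouched, and weight decay can be added through the optimizer. Variational dropout, by contrast, requires a custom implementation that reuses one mask across time steps. Sizes and rates below are illustrative.

```python
import torch
import torch.nn as nn

# Dropout of 0.3 on the outputs of layer 1 before they feed layer 2;
# the recurrent connections within each layer are not dropped.
lstm = nn.LSTM(input_size=100, hidden_size=64, num_layers=2,
               dropout=0.3, batch_first=True)

# Weight decay (L2 regularization) applied via the optimizer.
optimizer = torch.optim.Adam(lstm.parameters(), lr=1e-3, weight_decay=1e-5)
```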
LSTMs have been successfully applied across a wide range of domains. Their ability to model sequential data with long-range dependencies makes them particularly well suited for the following tasks.
LSTMs were the dominant architecture for many NLP tasks before the rise of the transformer in 2017. Key applications include:

- language modeling and text generation,
- machine translation with sequence-to-sequence models,
- sentiment analysis and text classification,
- named entity recognition and part-of-speech tagging, and
- question answering and summarization.
LSTMs significantly improved automatic speech recognition (ASR) by modeling the temporal structure of audio signals. Graves et al. (2013) combined deep bidirectional LSTMs with connectionist temporal classification (CTC), achieving breakthrough results on speech benchmarks. Google adopted LSTM-based acoustic models in 2015, and LSTM remained central to commercial speech recognition systems for several years.
LSTMs are widely used for time series prediction in finance, energy, weather, and healthcare. Their ability to capture nonlinear temporal patterns and long-range seasonal dependencies makes them effective for forecasting stock prices, electricity demand, patient health metrics, and other sequential data.
| Domain | Application examples |
|---|---|
| Handwriting recognition | Online and offline handwriting synthesis and recognition |
| Music generation | Modeling musical sequences and generating compositions |
| Video analysis | Activity recognition and video captioning |
| Robotics | Robot control and planning from sequential sensor data |
| Bioinformatics | Protein structure prediction and genomic sequence analysis |
| Anomaly detection | Identifying unusual patterns in network traffic and sensor data |
The introduction of the transformer architecture by Vaswani et al. in 2017 marked a turning point for sequence modeling. Transformers replaced recurrence with self-attention, enabling parallel processing of entire sequences and capturing long-range dependencies more effectively on many benchmarks.
| Aspect | LSTM | Transformer |
|---|---|---|
| Sequence processing | Sequential (step by step) | Parallel (all positions at once) |
| Long-range dependencies | Good (via cell state) | Excellent (via self-attention) |
| Training speed | Slower (not parallelizable over time) | Faster (fully parallelizable) |
| Memory complexity | O(1) per step for hidden state | O(n^2) for self-attention |
| Data efficiency | Better on small datasets | Requires large datasets |
| Inference on streaming data | Natural fit (processes one step at a time) | Requires re-encoding or caching |
| Parameter count for comparable tasks | Fewer | More |
| Hardware utilization | Less efficient on modern GPUs/TPUs | Highly optimized for parallel hardware |
Transformers have largely supplanted LSTMs for large-scale NLP tasks such as language modeling, machine translation, and text generation. However, LSTMs remain competitive or preferred in several scenarios:

- small or modest-sized datasets, where their data efficiency helps,
- streaming and real-time inference, where processing one step at a time is a natural fit,
- deployment on resource-constrained hardware, where their lower memory footprint matters, and
- time series forecasting and other tasks with strongly sequential structure.
In 2024, Maximilian Beck, Sepp Hochreiter, and colleagues published xLSTM (Extended Long Short-Term Memory), a modernized LSTM architecture designed to compete with transformers and state space models at scale. The paper was accepted as a spotlight paper at NeurIPS 2024.
xLSTM introduces two key innovations:
Exponential gating. Traditional LSTM gates use a sigmoid function, which limits gate values to the range (0, 1). xLSTM replaces this with exponential gating (combined with normalization and stabilization techniques), which provides more expressive control over information flow.
Modified memory structures. xLSTM defines two new variants:

- sLSTM, which keeps a scalar memory cell and scalar update but adds exponential gating and a new memory mixing scheme, and
- mLSTM, which replaces the scalar cell with a matrix memory updated by a covariance (outer-product) rule and drops the hidden-to-hidden recurrence, making it fully parallelizable.
These LSTM extensions are integrated into residual block backbones, forming xLSTM blocks that are stacked to create deep architectures. The authors trained a 7-billion-parameter xLSTM language model on 2.3 trillion tokens and demonstrated performance competitive with state-of-the-art transformers and state space models, both in absolute performance and in scaling behavior.
xLSTM represents a significant development because it shows that the core LSTM principles (gated memory with additive updates) can be scaled to the parameter counts and dataset sizes used by modern large language models, challenging the assumption that attention-based architectures are inherently superior.
The LSTM paper by Hochreiter and Schmidhuber (1997) is one of the most cited papers in the history of artificial intelligence, with tens of thousands of citations. The architecture transformed both machine learning research and commercial technology.
| Year | Milestone |
|---|---|
| 1991 | Hochreiter analyzes the vanishing gradient problem in his diploma thesis |
| 1997 | Hochreiter and Schmidhuber publish the original LSTM paper in Neural Computation |
| 1999 | Gers, Schmidhuber, and Cummins introduce the forget gate |
| 2000 | Gers and Schmidhuber add peephole connections |
| 2005 | Graves and Schmidhuber develop bidirectional LSTM with full BPTT |
| 2006 | Graves et al. introduce CTC for sequence labeling with LSTMs |
| 2014 | Cho et al. propose the GRU as a simplified LSTM variant |
| 2014 | Sutskever et al. use stacked LSTMs for sequence-to-sequence machine translation |
| 2015 | Google deploys LSTM-based speech recognition in production |
| 2016 | Google Translate switches to LSTM-based neural machine translation |
| 2017 | Facebook reports 4 billion LSTM-based translations per day |
| 2017 | The transformer architecture begins to displace LSTMs for NLP tasks |
| 2024 | xLSTM demonstrates that modernized LSTMs can compete with transformers at scale |
LSTMs became a foundational technology for major technology companies during the 2010s. Google used LSTMs to improve speech recognition accuracy and to power Google Translate's neural machine translation system. Apple integrated LSTM-based models into Siri for voice recognition. Amazon used LSTMs in Alexa's language understanding pipeline. Facebook deployed LSTMs for machine translation at massive scale, handling billions of translations daily by 2017.
Although transformers have replaced LSTMs in many of these systems, the gating and memory principles pioneered by LSTMs directly influenced the design of later architectures, including the GRU, gated variants of the transformer's feed-forward layers, and modern state space models.