# Long Short-Term Memory (LSTM)

> Source: https://aiwiki.ai/wiki/long_short-term_memory_lstm
> Updated: 2026-06-21
> Categories: Deep Learning, Machine Learning, Model Architecture, Neural Networks
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

**Long Short-Term Memory (LSTM)** is a specialized type of [recurrent neural network](/wiki/recurrent_neural_network) (RNN) architecture designed to learn long-range dependencies in sequential data. Introduced by Sepp Hochreiter and Jürgen Schmidhuber in a 1997 paper in the journal *Neural Computation*, LSTMs address the fundamental limitation of standard RNNs: their inability to retain information across many time steps due to the [vanishing gradient problem](/wiki/vanishing_gradient_problem).[1] The key innovation is a memory cell regulated by learnable gates that control the flow of information, allowing the network to selectively remember or forget information over arbitrarily long sequences. In their original abstract, Hochreiter and Schmidhuber reported that "LSTM can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units."[1] LSTMs became one of the most widely used architectures in deep learning, powering advances in natural language processing, speech recognition, machine translation, and time series forecasting, and the 1997 paper is among the most cited in the history of artificial intelligence, with more than 90,000 citations recorded by Semantic Scholar.[1][13]

## What is an LSTM in simple terms? (ELI5)

Imagine you are reading a really long storybook. A regular brain (a basic RNN) forgets the beginning of the story by the time it reaches the end. An LSTM brain is like having a special notebook beside you while reading. The notebook has three colored pens:

- A **red pen** (forget gate) that crosses out notes you no longer need.
- A **green pen** (input gate) that writes down new important things.
- A **blue pen** (output gate) that picks which notes to share when someone asks you a question.

Because the notebook keeps a running record and only changes what is truly needed, you can remember important details from the very first page even when you are on the last page. That is what makes LSTMs so good at tasks where order and long-term context matter.

## Why were LSTMs invented? Background and motivation

Traditional RNNs process sequential data by maintaining a hidden state that is updated at each time step. In theory, this hidden state can carry information from earlier parts of a sequence to later parts. In practice, however, training RNNs with [backpropagation](/wiki/backpropagation) through time (BPTT) causes gradients to either shrink exponentially (vanishing gradients) or grow uncontrollably (exploding gradients) as they propagate backward through many time steps.

Sepp Hochreiter formally analyzed this problem in his 1991 diploma thesis, showing that error signals in standard RNNs decay exponentially with the number of time steps, making it nearly impossible to learn dependencies that span more than about 10 steps. This analysis motivated the search for an architecture that could maintain stable gradient flow over long sequences. A central design goal of the resulting LSTM was computational efficiency: the original paper described the method as "local in space and time," with a computational complexity per time step and weight of O(1).[1]

## Architecture

An LSTM unit replaces the simple hidden-state update of a vanilla RNN with a more elaborate structure consisting of a **cell state** and three **gates**: the forget gate, the input gate, and the output gate. The cell state acts as a conveyor belt that carries information through time with minimal interference, while the gates learn to regulate what information enters, persists in, and exits the cell.

### Cell state (the constant error carousel)

The cell state is the defining feature of an LSTM. Unlike the hidden state in a standard RNN, which is entirely rewritten at every time step through a matrix multiplication and nonlinear [activation function](/wiki/activation_function), the cell state is updated through additive operations. Hochreiter and Schmidhuber called this mechanism the **constant error carousel** (CEC): because the cell state is modified only by element-wise addition and multiplication (not by a full matrix transformation), the gradient of the cell state with respect to itself at a previous time step remains close to 1.[1] This property allows error signals to flow backward through hundreds or even thousands of time steps without vanishing.

### Forget gate

Introduced by Gers, Schmidhuber, and Cummins in 1999 as an improvement to the original LSTM (which lacked this gate), the forget gate decides which information in the cell state should be discarded.[2] It takes the previous hidden state h(t-1) and the current input x(t) and passes them through a [sigmoid function](/wiki/sigmoid_function), which outputs a value between 0 and 1 for each element of the cell state. A value of 1 means "keep this entirely" and a value of 0 means "discard this completely."

**f(t) = sigmoid(W_f * x(t) + U_f * h(t-1) + b_f)**

### Input gate

The input gate determines which new information should be written to the cell state. It has two parts: a sigmoid layer that decides which values to update, and a tanh layer that creates a vector of candidate values.

**i(t) = sigmoid(W_i * x(t) + U_i * h(t-1) + b_i)**

**c_candidate(t) = tanh(W_c * x(t) + U_c * h(t-1) + b_c)**

### Cell state update

The new cell state is computed by combining the forget gate's filtering of the old cell state with the input gate's selection of new candidate values:

**c(t) = f(t) * c(t-1) + i(t) * c_candidate(t)**

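Here, the asterisk (*) between two vectors denotes element-wise (Hadamard) multiplication, whereas in the gate equations an asterisk after a weight matrix (for example, W_f * x(t)) denotes an ordinary matrix-vector product. This additive update is what enables the constant error carousel.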

### Output gate

The output gate controls which parts of the cell state are exposed as the hidden state output. The cell state is passed through a tanh function (squashing values into the range (-1, 1)) and then multiplied element-wise by the output gate activation o(t):

**o(t) = sigmoid(W_o * x(t) + U_o * h(t-1) + b_o)**

**h(t) = o(t) * tanh(c(t))**

The hidden state h(t) serves as both the output of the current time step and the recurrent input to the next time step.

### Summary of LSTM equations

| Component | Equation | Activation |
|---|---|---|
| Forget gate | f(t) = sigmoid(W_f * x(t) + U_f * h(t-1) + b_f) | Sigmoid |
| Input gate | i(t) = sigmoid(W_i * x(t) + U_i * h(t-1) + b_i) | Sigmoid |
| Candidate cell | c_candidate(t) = tanh(W_c * x(t) + U_c * h(t-1) + b_c) | Tanh |
| Cell state update | c(t) = f(t) * c(t-1) + i(t) * c_candidate(t) | None (element-wise) |
| Output gate | o(t) = sigmoid(W_o * x(t) + U_o * h(t-1) + b_o) | Sigmoid |
| Hidden state | h(t) = o(t) * tanh(c(t)) | Tanh |

In these equations, W denotes the input weight matrices, U denotes the recurrent weight matrices, and b denotes the bias vectors. Each gate has its own set of parameters, which are learned during training.
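
The equations above can be sketched as a single time step in plain Python. This is a minimal illustration rather than a reference implementation: the function names (`sigmoid`, `affine`, `lstm_step`) and the `params` dictionary layout are assumptions made for this example, and matrices are stored as lists of rows so that no external library is needed.

```python
import math

def sigmoid(z):
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, x, U, h, b):
    """Compute W x + U h + b, with matrices stored as lists of rows."""
    return [
        sum(w * xj for w, xj in zip(W_row, x))
        + sum(u * hj for u, hj in zip(U_row, h))
        + bias
        for W_row, U_row, bias in zip(W, U, b)
    ]

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step. params maps "f", "i", "c", "o" to (W, U, b)."""
    def pre_activation(name):
        W, U, b = params[name]
        return affine(W, x, U, h_prev, b)

    f = [sigmoid(z) for z in pre_activation("f")]          # forget gate
    i = [sigmoid(z) for z in pre_activation("i")]          # input gate
    c_cand = [math.tanh(z) for z in pre_activation("c")]   # candidate values
    o = [sigmoid(z) for z in pre_activation("o")]          # output gate

    # Element-wise cell update: keep part of the old state, add gated new values.
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, c_cand)]
    # Hidden state: gated, squashed view of the cell state.
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c
```

With all weights and biases set to zero, every gate evaluates to 0.5 and the candidate is 0, so a call such as `lstm_step([1.0], [0.0, 0.0], [1.0, -2.0], params)` simply halves the previous cell state. Processing a sequence amounts to calling `lstm_step` once per input and feeding the returned `h` and `c` into the next call.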

## How do LSTMs solve the vanishing gradient problem?

The vanishing gradient problem occurs in standard RNNs because, during backpropagation through time, gradients are multiplied at every time step by the recurrent weight matrix (and by the derivative of the activation function). If the largest eigenvalue of this matrix is less than 1 in absolute value, gradients shrink exponentially; if it is greater than 1, gradients can grow exponentially and explode.

LSTMs circumvent this through two mechanisms:

1. **Additive cell state updates.** The cell state is updated via addition rather than multiplication by a weight matrix. When the forget gate is close to 1, the gradient of the loss with respect to the cell state at time t passes nearly unchanged to time t-1. This is the constant error carousel: the self-recurrence of the cell state has a derivative of approximately 1.

2. **Learnable gates.** The gates learn when to allow information in, retain it, or release it. By setting the forget gate close to 1 for information that should be remembered and close to 0 for information that should be discarded, the network adaptively controls gradient flow.

This does not completely eliminate gradient issues (gradients can still vanish through the gates themselves), but it greatly extends the effective range of temporal dependencies that can be learned. Hochreiter and Schmidhuber demonstrated that LSTMs can learn to bridge time lags of over 1,000 steps, far exceeding the capabilities of standard RNNs.[1]
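
A small numerical sketch makes the difference concrete. From the cell state update c(t) = f(t) * c(t-1) + i(t) * c_candidate(t), the derivative of c(t) with respect to c(t-1) along the direct cell path is f(t), so the gradient across many steps is the product of the forget gate values. The example below ignores the indirect paths through the hidden state and the gates, and the per-step factors 0.9 and 0.999 are illustrative assumptions, not measured values.

```python
def cell_path_gradient(forget_gates):
    """Product of per-step factors along the direct cell-state path."""
    grad = 1.0
    for f in forget_gates:
        grad *= f
    return grad

steps = 1000
shrinking = cell_path_gradient([0.9] * steps)    # per-step factor well below 1
carousel = cell_path_gradient([0.999] * steps)   # forget gate held near 1

print(f"factor 0.9 over {steps} steps:   {shrinking:.3e}")
print(f"factor 0.999 over {steps} steps: {carousel:.3f}")
```

With a per-step factor of 0.9 the signal is effectively zero after 1,000 steps (on the order of 1e-46), while a forget gate held at 0.999 still passes roughly 37% of it.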

## LSTM variants

Since the original 1997 publication, several variants of the LSTM architecture have been proposed.

### Peephole connections

Gers and Schmidhuber (2000) introduced **peephole connections** that allow the gates to access the cell state directly, rather than relying solely on the hidden state.[3] In the standard LSTM, the gates see only the previous hidden state h(t-1) and the current input x(t). With peephole connections, the forget and input gates also receive c(t-1), and the output gate receives c(t):

- f(t) = sigmoid(W_f * x(t) + U_f * h(t-1) + P_f * c(t-1) + b_f)
- i(t) = sigmoid(W_i * x(t) + U_i * h(t-1) + P_i * c(t-1) + b_i)
- o(t) = sigmoid(W_o * x(t) + U_o * h(t-1) + P_o * c(t) + b_o)

Here, P_f, P_i, and P_o are diagonal weight matrices for the peephole connections. Peephole connections were originally designed to help LSTMs learn precise timing, but empirical studies have shown mixed results regarding their general benefit. A large-scale study by Greff et al. (2017) found that peephole connections did not significantly improve performance on most tasks.[8]
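
Because the peephole matrices are diagonal, each peephole term reduces to an element-wise product between a weight vector and the cell state. The sketch below shows only that extra term; the function name `peephole_gate` and the convention of passing in the already computed W * x(t) + U * h(t-1) + b as `pre_activation` are assumptions for this example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def peephole_gate(pre_activation, peephole_weights, cell_state):
    """Gate activation with a diagonal peephole term added to each unit.

    pre_activation   : W x + U h + b for this gate, one value per unit
    peephole_weights : diagonal of P, one value per unit
    cell_state       : c(t-1) for the forget/input gates, c(t) for the output gate
    """
    return [
        sigmoid(a + p * c)
        for a, p, c in zip(pre_activation, peephole_weights, cell_state)
    ]
```

Setting every peephole weight to zero recovers the standard gate, which is one way to see the peephole variant as a strict extension of the equations given earlier.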

### Bidirectional LSTM (BiLSTM)

Graves and Schmidhuber (2005) combined LSTM with bidirectional processing.[4] A bidirectional LSTM runs two separate LSTM layers in parallel: one processes the input sequence from left to right (forward), and the other processes it from right to left (backward). The outputs of both layers are concatenated at each time step, giving the network access to both past and future context.

BiLSTMs are especially useful for tasks where the entire input sequence is available at once, such as named entity recognition, part-of-speech tagging, and text classification. They are not suitable for real-time or autoregressive tasks where future inputs are unavailable.
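
The bidirectional wiring can be sketched independently of the cell itself. In the example below, `toy_step` is a hypothetical stand-in (a running sum) used only so that the data flow is easy to trace; in a real BiLSTM each direction would use its own LSTM step function with separate parameters.

```python
def run_layer(step, inputs, state):
    """Apply a recurrent step function across a sequence, collecting outputs."""
    outputs = []
    for x in inputs:
        out, state = step(x, state)
        outputs.append(out)
    return outputs

def bidirectional(step_fwd, step_bwd, inputs, state_fwd, state_bwd):
    """Run one pass left to right and one right to left, then concatenate."""
    fwd = run_layer(step_fwd, inputs, state_fwd)
    # Process the reversed sequence, then flip the outputs back into input order.
    bwd = run_layer(step_bwd, inputs[::-1], state_bwd)[::-1]
    return [f + b for f, b in zip(fwd, bwd)]  # list concatenation per time step

def toy_step(x, state):
    """Stand-in for an LSTM step: the 'hidden state' is a running sum."""
    new_state = state + x
    return [new_state], new_state

result = bidirectional(toy_step, toy_step, [1, 2, 3], 0, 0)
```

For the input `[1, 2, 3]`, each position ends up with one value summarizing everything to its left (inclusive) and one summarizing everything to its right (inclusive), which is the sense in which a BiLSTM sees both past and future context.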

### Stacked (deep) LSTMs

Stacking multiple LSTM layers on top of each other creates a deep LSTM architecture. The hidden state output of the first LSTM layer serves as the input sequence for the second layer, and so on. Stacked LSTMs increase the model's capacity to learn hierarchical representations of sequential data. Sutskever et al. (2014) used a four-layer stacked LSTM for their influential sequence-to-sequence machine translation model.[6]
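
Stacking is a matter of feeding each layer's output sequence into the next layer. The sketch below uses a hypothetical `toy_step` (a running sum) in place of a real LSTM step, purely to show the layer-to-layer data flow.

```python
def run_layer(step, inputs, state):
    """Apply a recurrent step function across a sequence, collecting outputs."""
    outputs = []
    for x in inputs:
        out, state = step(x, state)
        outputs.append(out)
    return outputs

def stacked(steps, inputs, initial_states):
    """Feed the output sequence of each layer into the layer above it."""
    sequence = inputs
    for step, state in zip(steps, initial_states):
        sequence = run_layer(step, sequence, state)
    return sequence

def toy_step(x, state):
    """Stand-in for an LSTM step: output and state are both a running sum."""
    new_state = state + x
    return new_state, new_state

top_outputs = stacked([toy_step, toy_step], [1, 2, 3], [0, 0])
```

Here the first layer turns `[1, 2, 3]` into `[1, 3, 6]`, and the second layer consumes that sequence to produce `[1, 4, 10]`. Each layer keeps its own state, just as each LSTM layer keeps its own hidden and cell states.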

### Gated recurrent unit (GRU)

The Gated Recurrent Unit was introduced by Cho et al. in 2014 as a simplified alternative to LSTM.[5] GRUs merge the cell state and hidden state into a single state vector and replace the three gates of an LSTM with two: a reset gate and an update gate. This reduces the number of parameters and makes GRUs faster to train.

| Feature | LSTM | GRU |
|---|---|---|
| Gates | 3 (forget, input, output) | 2 (reset, update) |
| State vectors | 2 (cell state + hidden state) | 1 (hidden state only) |
| Parameters | More | Fewer (approximately 25% fewer) |
| Training speed | Slower | Faster |
| Long-sequence performance | Slightly better on very long sequences | Comparable on short-to-medium sequences |
| Memory usage | Higher | Lower |

Empirical comparisons by Chung et al. (2014) showed that GRUs and LSTMs perform comparably on many tasks, with neither consistently outperforming the other.[7] GRUs tend to be preferred when computational resources are limited, while LSTMs may have an edge on tasks requiring fine-grained memory control over very long sequences.
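
The parameter difference follows from counting weight blocks: an LSTM has four (forget, input, candidate, output) and a GRU has three (reset, update, candidate). The sketch below assumes one input matrix, one recurrent matrix, and one bias vector per block, matching the equations in this article; the function names and layer sizes are illustrative, and library implementations that keep additional bias vectors will report slightly different totals.

```python
def block_params(input_size, hidden_size):
    """Parameters in one block: W (hidden x input), U (hidden x hidden), b."""
    return hidden_size * input_size + hidden_size * hidden_size + hidden_size

def lstm_params(input_size, hidden_size):
    return 4 * block_params(input_size, hidden_size)  # f, i, candidate, o

def gru_params(input_size, hidden_size):
    return 3 * block_params(input_size, hidden_size)  # reset, update, candidate

lstm_total = lstm_params(128, 256)
gru_total = gru_params(128, 256)
print(lstm_total, gru_total, gru_total / lstm_total)
```

For an input size of 128 and a hidden size of 256 this gives 394,240 parameters for the LSTM layer and 295,680 for the GRU layer, a ratio of exactly 0.75 under these assumptions.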

### What did the "Search Space Odyssey" study find?

The most systematic empirical comparison of LSTM variants is the "LSTM: A Search Space Odyssey" study by Greff et al., published in the *IEEE Transactions on Neural Networks and Learning Systems* in 2017 (first released as a preprint in 2015).[8] The study evaluated eight LSTM variants on three benchmark tasks (speech recognition, handwriting recognition, and polyphonic music modeling), performing roughly 5,400 individual training runs that consumed about 15 years of CPU time. Its headline finding was that the standard ("vanilla") LSTM performs reasonably well on all tested problems and that none of the eight variants improved on it significantly. The authors identified the **forget gate** and the **output activation function** as the most critical components: "The most important changes... are the forget gate and the output activation function. Removing any of them significantly hurts performance."[8] By contrast, they reported that peephole connections, full gate recurrence, and coupling the input and forget gates produced no consistent benefit. These findings helped to standardize the modern LSTM cell used in libraries such as PyTorch and TensorFlow.

## Training LSTMs

LSTMs are trained using gradient descent combined with backpropagation through time (BPTT). Several practical considerations affect training quality and stability.

### Gradient clipping

Although LSTMs mitigate the vanishing gradient problem, the exploding gradient problem can still occur during training. Gradient clipping, proposed by Pascanu et al. (2013), addresses this by rescaling the gradient vector whenever its norm exceeds a specified threshold.[12] Norm-based clipping (scaling the entire gradient to a fixed maximum norm) is the most common approach. A clipping threshold of 1.0 to 5.0 is a typical starting point.
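
Norm-based clipping can be sketched in a few lines. The example treats the gradient as a flat list of floats and uses the hypothetical name `clip_by_norm`; deep learning libraries provide their own utilities for this, which operate on all parameter tensors at once.

```python
import math

def clip_by_norm(grads, max_norm):
    """Rescale the gradient vector if its L2 norm exceeds max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads)  # already within the threshold: leave unchanged
    scale = max_norm / total_norm
    return [g * scale for g in grads]  # same direction, norm equal to max_norm

clipped = clip_by_norm([3.0, 4.0], max_norm=1.0)    # norm 5.0, gets rescaled
untouched = clip_by_norm([0.3, 0.4], max_norm=1.0)  # norm 0.5, left as is
```

Because every component is multiplied by the same factor, clipping changes the step size but not the direction of the update.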

### Weight initialization

Proper initialization of LSTM weights is important for stable training. Common strategies include:

- **Xavier/Glorot initialization** for input-to-hidden weights, which maintains the variance of activations across layers.
- **Orthogonal initialization** for hidden-to-hidden (recurrent) weights, which helps preserve gradient magnitudes through recurrent connections.
- **Forget gate bias initialization** to a large value (e.g., 1.0 or 2.0), as recommended by Jozefowicz et al. (2015), so that the forget gate starts close to 1 and the network defaults to remembering information early in training.[9]
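
The effect of the forget gate bias is easy to check numerically. The sketch below assumes that the weighted input terms are close to zero at initialization, so that the gate's pre-activation is roughly equal to its bias; the function name is illustrative.

```python
import math

def initial_forget_activation(bias):
    """Forget gate value when the weighted inputs contribute roughly zero."""
    return 1.0 / (1.0 + math.exp(-bias))

for bias in (0.0, 1.0, 2.0):
    print(f"b_f = {bias}: f is about {initial_forget_activation(bias):.3f}")
```

Under that assumption, a bias of 0 starts the gate at 0.5 (half of the cell state is dropped at every step), whereas biases of 1.0 and 2.0 start it at about 0.73 and 0.88, so much more of the stored information survives each step early in training.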

### Optimizer selection

Adaptive optimizers such as Adam and RMSprop are commonly used for training LSTMs because they handle the varying gradient scales across different parameters more effectively than vanilla stochastic gradient descent. Adam is the most popular choice for LSTM-based models.

### Regularization

Overfitting is a common issue when training LSTMs on limited data. Dropout, applied to the input and output connections (but not the recurrent connections), is the most widely used regularization technique. Gal and Ghahramani (2016) proposed variational dropout for RNNs, which applies the same dropout mask at every time step, yielding better regularization than naive dropout. Other techniques include weight decay (L2 regularization) and early stopping.
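
The difference between naive and variational dropout lies in how the masks are sampled across time. The sketch below shows only the mask sampling, using inverted dropout scaling (kept units are multiplied by 1/keep_prob); the function names are assumptions for this example, and the choice of which connections to mask is left out.

```python
import random

def sample_mask(size, keep_prob, rng):
    """Inverted-dropout mask: kept units are scaled by 1 / keep_prob."""
    return [1.0 / keep_prob if rng.random() < keep_prob else 0.0
            for _ in range(size)]

def naive_masks(steps, size, keep_prob, rng):
    """A fresh, independent mask at every time step."""
    return [sample_mask(size, keep_prob, rng) for _ in range(steps)]

def variational_masks(steps, size, keep_prob, rng):
    """One mask sampled per sequence and reused at every time step."""
    mask = sample_mask(size, keep_prob, rng)
    return [mask for _ in range(steps)]

rng = random.Random(0)
shared = variational_masks(steps=5, size=4, keep_prob=0.8, rng=rng)
independent = naive_masks(steps=5, size=4, keep_prob=0.8, rng=rng)
```

With the shared mask, the same units are dropped for the whole sequence, so the noise does not change from step to step as it does when a new mask is drawn each time.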

## What are LSTMs used for? Applications

LSTMs have been successfully applied across a wide range of domains. Their ability to model sequential data with long-range dependencies makes them particularly well suited for the following tasks.

### Natural language processing

LSTMs were the dominant architecture for many NLP tasks before the rise of the [transformer](/wiki/transformer) in 2017. Key applications include:

- **[Language modeling](/wiki/language_model).** LSTMs were used in state-of-the-art language models, including the AWD-LSTM model by Merity et al. (2018), which held benchmark records on Penn Treebank and WikiText-2.
- **Machine translation.** Sutskever et al. (2014) demonstrated that a deep LSTM-based encoder-decoder architecture could achieve competitive results on English-to-French translation: an ensemble of deep LSTMs reached a BLEU score of 34.8 on the WMT-14 test set, surpassing a phrase-based statistical machine translation baseline that scored 33.3.[6] This line of work paved the way for Google's adoption of LSTM-based neural machine translation in Google Translate in 2016.
- **Sentiment analysis.** LSTMs capture long-range contextual cues in text, making them effective at classifying the sentiment of documents and sentences.
- **Named entity recognition.** BiLSTM-CRF models became the standard architecture for sequence labeling tasks such as NER before transformer-based models took over.

### Speech recognition

LSTMs significantly improved automatic speech recognition (ASR) by modeling the temporal structure of audio signals. Graves et al. (2013) combined deep bidirectional LSTMs with connectionist temporal classification (CTC), achieving breakthrough results on speech benchmarks. Google adopted LSTM-based acoustic models in 2015, and LSTM remained central to commercial speech recognition systems for several years.

### Time series forecasting

LSTMs are widely used for time series prediction in finance, energy, weather, and healthcare. Their ability to capture nonlinear temporal patterns and long-range seasonal dependencies makes them effective for forecasting stock prices, electricity demand, patient health metrics, and other sequential data.

### Other applications

| Domain | Application examples |
|---|---|
| Handwriting recognition | Online and offline handwriting synthesis and recognition |
| Music generation | Modeling musical sequences and generating compositions |
| Video analysis | Activity recognition and video captioning |
| Robotics | Robot control and planning from sequential sensor data |
| Bioinformatics | Protein structure prediction and genomic sequence analysis |
| Anomaly detection | Identifying unusual patterns in network traffic and sensor data |

## How do LSTMs differ from transformers?

The introduction of the transformer architecture by Vaswani et al. in 2017 marked a turning point for sequence modeling.[10] Transformers replaced recurrence with self-attention, enabling parallel processing of entire sequences and capturing long-range dependencies more effectively on many benchmarks.

| Aspect | LSTM | Transformer |
|---|---|---|
| Sequence processing | Sequential (step by step) | Parallel (all positions at once) |
| Long-range dependencies | Good (via cell state) | Excellent (via self-attention) |
| Training speed | Slower (not parallelizable over time) | Faster (fully parallelizable) |
| Memory complexity | O(1) per step for hidden state | O(n^2) for self-attention |
| Data efficiency | Better on small datasets | Requires large datasets |
| Inference on streaming data | Natural fit (processes one step at a time) | Requires re-encoding or caching |
| Parameter count for comparable tasks | Fewer | More |
| Hardware utilization | Less efficient on modern GPUs/TPUs | Highly optimized for parallel hardware |

Transformers have largely supplanted LSTMs for large-scale NLP tasks such as language modeling, machine translation, and text generation. However, LSTMs remain competitive or preferred in several scenarios:

- **Small datasets** where transformers tend to overfit.
- **Edge deployment** where memory and compute budgets are limited.
- **Streaming or real-time applications** where data arrives one step at a time.
- **Certain time series tasks,** especially in finance and signal processing, where LSTM-based models have shown more robust performance.

## xLSTM: extended Long Short-Term Memory (2024)

In 2024, Maximilian Beck, Sepp Hochreiter, and colleagues published **xLSTM** (Extended Long Short-Term Memory), a modernized LSTM architecture designed to compete with transformers and state space models at scale.[11] The paper was published at NeurIPS 2024. The authors framed the project around a single question: "How far do we get in language modeling when scaling LSTMs to billions of parameters, leveraging the latest techniques from modern LLMs, but mitigating known limitations of LSTMs?"[11]

xLSTM introduces two key innovations:

1. **Exponential gating.** Traditional LSTM gates use a sigmoid function, which limits gate values to the range (0, 1). xLSTM replaces this with exponential gating, using appropriate normalization and stabilization techniques, which provides more expressive control over information flow.

2. **Modified memory structures.** xLSTM defines two new variants:
   - **sLSTM** retains a scalar memory (like the original LSTM) but adds new memory mixing capabilities across multiple memory cells.
   - **mLSTM** uses a matrix-valued memory and a covariance update rule, making it fully parallelizable during training (unlike standard LSTMs).
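
A scalar sketch can illustrate why exponential gating needs normalization and stabilization. It follows the general form described for sLSTM (an additive cell update, a normalizer state, and a running log-scale stabilizer), but it is a simplified illustration under stated assumptions rather than the authors' implementation: all quantities are scalars, both the input and forget gates are taken to be exponential, memory mixing is omitted, and the names `exp_gate_step` and `stabilized_step` are hypothetical.

```python
import math

def exp_gate_step(i_pre, f_pre, z, o, c_prev, n_prev):
    """Unstabilized update with exponential input and forget gates."""
    i, f = math.exp(i_pre), math.exp(f_pre)
    c = f * c_prev + i * z   # additive cell update, as in the classic LSTM
    n = f * n_prev + i       # normalizer accumulates the gate weights
    return o * c / n, c, n   # hidden output uses the normalized cell state

def stabilized_step(i_pre, f_pre, z, o, c_prev, n_prev, m_prev):
    """Same update, computed relative to a running log-scale maximum m."""
    m = max(f_pre + m_prev, i_pre)
    i = math.exp(i_pre - m)            # exponent is never positive
    f = math.exp(f_pre + m_prev - m)   # exponent is never positive
    c = f * c_prev + i * z
    n = f * n_prev + i
    return o * c / n, c, n, m

# A large pre-activation overflows exp() in the unstabilized form,
# but the stabilized form handles it without difficulty.
h_big, _, _, _ = stabilized_step(800.0, 0.0, 0.5, 1.0, 0.0, 0.0, 0.0)
```

In the stabilized form, both the cell state and the normalizer are scaled by the same factor exp(-m), so the ratio c / n, and therefore the output, is unchanged. Starting from zero states, the two functions return the same hidden output for moderate inputs, while only the stabilized one avoids overflow for inputs such as a pre-activation of 800.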

These LSTM extensions are integrated into residual block backbones, forming xLSTM blocks that are stacked to create deep architectures. In the original paper, the authors trained xLSTM language models at 125M, 350M, 760M, and 1.3B parameters on 300 billion tokens from the SlimPajama dataset, comparing them against RWKV-4, Llama-style transformers, and Mamba; they reported that xLSTM models maintained low perplexity at longer context lengths and showed favorable scaling behavior relative to these baselines.[11]

In a 2025 follow-up paper, the same group released **xLSTM 7B**, scaling the architecture to 7 billion parameters. As the authors describe it, "we scale the xLSTM to 7B parameters and present our xLSTM 7B, a large language model trained on 2.3T tokens from the DCLM dataset with context length 8192 using 128 H100 GPUs."[13] xLSTM 7B was positioned as a recurrent large language model optimized for fast and efficient inference, with performance reported to be competitive with similarly sized transformer and Mamba models.[13]

xLSTM represents a significant development because it shows that the core LSTM principles (gated memory with additive updates) can be scaled to the parameter counts and dataset sizes used by modern large language models, challenging the assumption that attention-based architectures are inherently superior.

## Historical significance and impact

The LSTM paper by Hochreiter and Schmidhuber (1997) is one of the most cited papers in the history of artificial intelligence, with more than 90,000 citations recorded by Semantic Scholar.[1][13] The architecture transformed both machine learning research and commercial technology.

### Timeline of key milestones

| Year | Milestone |
|---|---|
| 1991 | Hochreiter analyzes the vanishing gradient problem in his diploma thesis |
| 1997 | Hochreiter and Schmidhuber publish the original LSTM paper in Neural Computation |
| 1999 | Gers, Schmidhuber, and Cummins introduce the forget gate |
| 2000 | Gers and Schmidhuber add peephole connections |
| 2005 | Graves and Schmidhuber develop bidirectional LSTM with full BPTT |
| 2006 | Graves et al. introduce CTC for sequence labeling with LSTMs |
| 2014 | Cho et al. propose the GRU as a simplified LSTM variant |
| 2014 | Sutskever et al. use stacked LSTMs for sequence-to-sequence machine translation (BLEU 34.8 on WMT-14 English-French) |
| 2015 | Google deploys LSTM-based speech recognition in production |
| 2015 | Greff et al. release "LSTM: A Search Space Odyssey," comparing eight variants across about 5,400 training runs |
| 2016 | Google Translate switches to LSTM-based neural machine translation (GNMT) |
| 2017 | Facebook reports 4 billion LSTM-based translations per day |
| 2017 | The transformer architecture begins to displace LSTMs for NLP tasks |
| 2024 | xLSTM demonstrates that modernized LSTMs can compete with transformers at scale |
| 2025 | xLSTM 7B scales the architecture to 7 billion parameters on 2.3 trillion tokens |

### Industry adoption

LSTMs became a foundational technology for major technology companies during the 2010s. Google used LSTMs to improve speech recognition accuracy and to power Google Translate's neural machine translation system. The Google Neural Machine Translation (GNMT) system, introduced in September 2016, used a deep LSTM network with 8 encoder layers and 8 decoder layers connected by an attention mechanism. Google reported that it reduced translation errors by 55 to 85 percent relative to its previous phrase-based system on several language pairs.[14] Apple integrated LSTM-based models into Siri for voice recognition. Amazon used LSTMs in Alexa's language understanding pipeline. Facebook deployed LSTMs for machine translation at massive scale, handling billions of translations daily by 2017.

Although transformers have replaced LSTMs in many of these systems, the gating and memory principles pioneered by LSTMs directly influenced the design of later architectures, including the GRU, the gated feed-forward layers used in some transformer variants, and modern state space models.

## References

1. Hochreiter, S., & Schmidhuber, J. (1997). "Long Short-Term Memory." *Neural Computation*, 9(8), 1735-1780.
2. Gers, F. A., Schmidhuber, J., & Cummins, F. (2000). "Learning to Forget: Continual Prediction with LSTM." *Neural Computation*, 12(10), 2451-2471.
3. Gers, F. A., & Schmidhuber, J. (2000). "Recurrent Nets that Time and Count." *Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks (IJCNN)*.
4. Graves, A., & Schmidhuber, J. (2005). "Framewise Phoneme Classification with Bidirectional LSTM and Other Neural Network Architectures." *Neural Networks*, 18(5-6), 602-610.
5. Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., & Bengio, Y. (2014). "Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation." *Proceedings of EMNLP*.
6. Sutskever, I., Vinyals, O., & Le, Q. V. (2014). "Sequence to Sequence Learning with Neural Networks." *Advances in Neural Information Processing Systems (NeurIPS)*. arXiv:1409.3215.
7. Chung, J., Gulcehre, C., Cho, K., & Bengio, Y. (2014). "Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling." *arXiv preprint arXiv:1412.3555*.
8. Greff, K., Srivastava, R. K., Koutnik, J., Steunebrink, B. R., & Schmidhuber, J. (2017). "LSTM: A Search Space Odyssey." *IEEE Transactions on Neural Networks and Learning Systems*, 28(10), 2222-2232. arXiv:1503.04069.
9. Jozefowicz, R., Zaremba, W., & Sutskever, I. (2015). "An Empirical Exploration of Recurrent Network Architectures." *Proceedings of the 32nd International Conference on Machine Learning (ICML)*.
10. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). "Attention Is All You Need." *Advances in Neural Information Processing Systems (NeurIPS)*.
11. Beck, M., Pöppel, K., Spanring, M., Auer, A., Prudnikova, O., Kopp, M., Klambauer, G., Brandstetter, J., & Hochreiter, S. (2024). "xLSTM: Extended Long Short-Term Memory." *Advances in Neural Information Processing Systems (NeurIPS) 2024*. arXiv:2405.04517.
12. Pascanu, R., Mikolov, T., & Bengio, Y. (2013). "On the Difficulty of Training Recurrent Neural Networks." *Proceedings of the 30th International Conference on Machine Learning (ICML)*.
13. Beck, M., Pöppel, K., Lippe, P., Kurle, R., Blies, P. M., Klambauer, G., Böck, S., & Hochreiter, S. (2025). "xLSTM 7B: A Recurrent LLM for Fast and Efficient Inference." *arXiv preprint arXiv:2503.13427*.
14. Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., et al. (2016). "Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation." *arXiv preprint arXiv:1609.08144*.

