Long Short-Term Memory (LSTM) is a specialized type of recurrent neural network (RNN) architecture designed to learn long-range dependencies in sequential data. Introduced by Sepp Hochreiter and Jürgen Schmidhuber in 1997, LSTMs address the fundamental limitation of standard RNNs: their inability to retain information across many time steps due to the vanishing gradient problem. The key innovation is a memory cell regulated by learnable gates that control the flow of information, allowing the network to selectively remember or forget information over arbitrarily long sequences. LSTMs have become one of the most widely used architectures in deep learning, powering advances in natural language processing, speech recognition, machine translation, and time series forecasting.
Imagine you are reading a really long storybook. A regular brain (a basic RNN) forgets the beginning of the story by the time it reaches the end. An LSTM brain is like having a special notebook beside you while reading. The notebook has three colored pens:

- a crossing-out pen that strikes through notes that no longer matter (the forget gate),
- a writing pen that jots down new details worth remembering (the input gate), and
- a highlighter that marks which notes to read out loud right now (the output gate).
Because the notebook keeps a running record and only changes what is truly needed, you can remember important details from the very first page even when you are on the last page. That is what makes LSTMs so good at tasks where order and long-term context matter.
Traditional RNNs process sequential data by maintaining a hidden state that is updated at each time step. In theory, this hidden state can carry information from earlier parts of a sequence to later parts. In practice, however, training RNNs with backpropagation through time (BPTT) causes gradients to either shrink exponentially (vanishing gradients) or grow uncontrollably (exploding gradients) as they propagate backward through many time steps.
Sepp Hochreiter formally analyzed this problem in his 1991 diploma thesis, showing that error signals in standard RNNs decay exponentially with the number of time steps, making it nearly impossible to learn dependencies that span more than about 10 steps. This analysis motivated the search for an architecture that could maintain stable gradient flow over long sequences.
An LSTM unit replaces the simple hidden-state update of a vanilla RNN with a more elaborate structure consisting of a cell state and three gates: the forget gate, the input gate, and the output gate. The cell state acts as a conveyor belt that carries information through time with minimal interference, while the gates learn to regulate what information enters, persists in, and exits the cell.
The cell state is the defining feature of an LSTM. Unlike the hidden state in a standard RNN, which is entirely rewritten at every time step through a matrix multiplication and nonlinear activation function, the cell state is updated through additive operations. Hochreiter and Schmidhuber called this mechanism the constant error carousel (CEC): because the cell state is modified only by element-wise addition and multiplication (not by a full matrix transformation), the gradient of the cell state with respect to itself at a previous time step remains close to 1. This property allows error signals to flow backward through hundreds or even thousands of time steps without vanishing.
Introduced by Gers, Schmidhuber, and Cummins in 1999 as an improvement to the original LSTM (which lacked this gate), the forget gate decides which information in the cell state should be discarded. It takes the previous hidden state h(t-1) and the current input x(t) and passes them through a sigmoid function that outputs a value between 0 and 1 for each element of the cell state. A value of 1 means "keep this entirely" and a value of 0 means "discard this completely."
f(t) = sigmoid(W_f * x(t) + U_f * h(t-1) + b_f)
The input gate determines which new information should be written to the cell state. It has two parts: a sigmoid layer that decides which values to update, and a tanh layer that creates a vector of candidate values.
i(t) = sigmoid(W_i * x(t) + U_i * h(t-1) + b_i)
c_candidate(t) = tanh(W_c * x(t) + U_c * h(t-1) + b_c)
The new cell state is computed by combining the forget gate's filtering of the old cell state with the input gate's selection of new candidate values:
c(t) = f(t) * c(t-1) + i(t) * c_candidate(t)
Here, the asterisk (*) denotes element-wise (Hadamard) multiplication. This additive update is what enables the constant error carousel.
The output gate controls which parts of the cell state are exposed as the hidden state output. The cell state is passed through a tanh function (squashing values to the range [-1, 1]) and then multiplied element-wise by the output gate's sigmoid activation:
o(t) = sigmoid(W_o * x(t) + U_o * h(t-1) + b_o)
h(t) = o(t) * tanh(c(t))
The hidden state h(t) serves as both the output of the current time step and the recurrent input to the next time step.
| Component | Equation | Activation |
|---|---|---|
| Forget gate | f(t) = sigmoid(W_f * x(t) + U_f * h(t-1) + b_f) | Sigmoid |
| Input gate | i(t) = sigmoid(W_i * x(t) + U_i * h(t-1) + b_i) | Sigmoid |
| Candidate cell | c_candidate(t) = tanh(W_c * x(t) + U_c * h(t-1) + b_c) | Tanh |
| Cell state update | c(t) = f(t) * c(t-1) + i(t) * c_candidate(t) | None (element-wise) |
| Output gate | o(t) = sigmoid(W_o * x(t) + U_o * h(t-1) + b_o) | Sigmoid |
| Hidden state | h(t) = o(t) * tanh(c(t)) | Tanh |
In these equations, W denotes the input weight matrices, U denotes the recurrent weight matrices, and b denotes the bias vectors. Each gate has its own set of parameters, which are learned during training.
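The equations above translate almost directly into code. The following is a minimal NumPy sketch of a single LSTM step using the same W/U/b notation; the dimensions, random initialization, and parameter layout are illustrative assumptions, not taken from any particular library.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step following the gate equations above.

    params holds W_* (input weights), U_* (recurrent weights), and b_* (biases)
    for the forget, input, candidate, and output paths.
    """
    f_t = sigmoid(params["W_f"] @ x_t + params["U_f"] @ h_prev + params["b_f"])    # forget gate
    i_t = sigmoid(params["W_i"] @ x_t + params["U_i"] @ h_prev + params["b_i"])    # input gate
    c_hat = np.tanh(params["W_c"] @ x_t + params["U_c"] @ h_prev + params["b_c"])  # candidate values
    c_t = f_t * c_prev + i_t * c_hat                                               # additive cell update
    o_t = sigmoid(params["W_o"] @ x_t + params["U_o"] @ h_prev + params["b_o"])    # output gate
    h_t = o_t * np.tanh(c_t)                                                       # new hidden state
    return h_t, c_t

# Toy dimensions: 4-dimensional inputs, 3-dimensional hidden/cell state.
rng = np.random.default_rng(0)
n_in, n_hid = 4, 3
params = {}
for name in ("f", "i", "c", "o"):
    params[f"W_{name}"] = rng.standard_normal((n_hid, n_in)) * 0.1
    params[f"U_{name}"] = rng.standard_normal((n_hid, n_hid)) * 0.1
    params[f"b_{name}"] = np.zeros(n_hid)

h, c = np.zeros(n_hid), np.zeros(n_hid)
for x in rng.standard_normal((5, n_in)):   # a sequence of 5 input vectors
    h, c = lstm_step(x, h, c, params)
```

Running the loop carries h and c forward step by step, exactly as the recurrence in the table describes.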
The vanishing gradient problem occurs in standard RNNs because gradients are multiplied by the recurrent weight matrix at every time step during backpropagation through time. If the largest eigenvalue of this matrix is less than 1, gradients shrink exponentially; if it is greater than 1, gradients explode.
LSTMs circumvent this through two mechanisms:
Additive cell state updates. The cell state is updated via addition rather than multiplication by a weight matrix. When the forget gate is close to 1, the gradient of the loss with respect to the cell state at time t passes nearly unchanged to time t-1. This is the constant error carousel: the self-recurrence of the cell state has a derivative of approximately 1.
Learnable gates. The gates learn when to allow information in, retain it, or release it. By setting the forget gate close to 1 for information that should be remembered and close to 0 for information that should be discarded, the network adaptively controls gradient flow.
This does not completely eliminate gradient issues (gradients can still vanish through the gates themselves), but it greatly extends the effective range of temporal dependencies that can be learned. Hochreiter and Schmidhuber demonstrated that LSTMs can learn to bridge time lags of over 1,000 steps, far exceeding the capabilities of standard RNNs.
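A small autograd check makes the constant error carousel concrete. In this sketch (PyTorch, with arbitrary illustrative values), the gradient flowing from c(t) back to c(t-1) is exactly the forget gate activation, so it stays close to 1 whenever the forget gate does.

```python
import torch

# Illustrative gate activations and states; the numbers are arbitrary.
f_t = torch.tensor([0.95, 0.99, 0.90])        # forget gate close to 1
i_t = torch.tensor([0.10, 0.20, 0.05])        # input gate
c_hat = torch.tensor([0.50, -0.30, 0.80])     # candidate values
c_prev = torch.tensor([1.0, -2.0, 0.5], requires_grad=True)

c_t = f_t * c_prev + i_t * c_hat              # additive cell state update
c_t.sum().backward()

# d c(t) / d c(t-1) equals f(t) element-wise: the error signal barely shrinks.
print(c_prev.grad)                            # tensor([0.9500, 0.9900, 0.9000])
```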
Since the original 1997 publication, several variants of the LSTM architecture have been proposed.
Gers and Schmidhuber (2000) introduced peephole connections that allow the gates to access the cell state directly, rather than relying solely on the hidden state. In the standard LSTM, the gates see only the previous hidden state h(t-1) and the current input x(t). With peephole connections, the forget and input gates also receive c(t-1), and the output gate receives c(t):

f(t) = sigmoid(W_f * x(t) + U_f * h(t-1) + P_f * c(t-1) + b_f)
i(t) = sigmoid(W_i * x(t) + U_i * h(t-1) + P_i * c(t-1) + b_i)
o(t) = sigmoid(W_o * x(t) + U_o * h(t-1) + P_o * c(t) + b_o)
Here, P_f, P_i, and P_o are diagonal weight matrices for the peephole connections. Peephole connections were originally designed to help LSTMs learn precise timing, but empirical studies have shown mixed results regarding their general benefit. A large-scale study by Greff et al. (2017) found that peephole connections did not significantly improve performance on most tasks.
Graves and Schmidhuber (2005) combined LSTM with bidirectional processing. A bidirectional LSTM runs two separate LSTM layers in parallel: one processes the input sequence from left to right (forward), and the other processes it from right to left (backward). The outputs of both layers are concatenated at each time step, giving the network access to both past and future context.
BiLSTMs are especially useful for tasks where the entire input sequence is available at once, such as named entity recognition, part-of-speech tagging, and text classification. They are not suitable for real-time or autoregressive tasks where future inputs are unavailable.
Stacking multiple LSTM layers on top of each other creates a deep LSTM architecture. The hidden state output of the first LSTM layer serves as the input sequence for the second layer, and so on. Stacked LSTMs increase the model's capacity to learn hierarchical representations of sequential data. Sutskever et al. (2014) used a four-layer stacked LSTM for their influential sequence-to-sequence machine translation model.
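As a rough sketch of how both variants are typically configured in practice, PyTorch's nn.LSTM exposes stacking and bidirectionality as constructor arguments; the layer sizes and sequence lengths below are arbitrary assumptions.

```python
import torch
import torch.nn as nn

# Two stacked layers, each processed forward and backward; batches of 32
# sequences with 50 time steps of 100-dimensional inputs, hidden size 64.
lstm = nn.LSTM(input_size=100, hidden_size=64,
               num_layers=2,          # stacked: layer 2 consumes layer 1's outputs
               bidirectional=True,    # forward and backward passes, concatenated
               batch_first=True)

x = torch.randn(32, 50, 100)
output, (h_n, c_n) = lstm(x)

print(output.shape)  # torch.Size([32, 50, 128])  -> 64 forward + 64 backward
print(h_n.shape)     # torch.Size([4, 32, 64])    -> 2 layers * 2 directions
```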
The Gated Recurrent Unit was introduced by Cho et al. in 2014 as a simplified alternative to LSTM. GRUs merge the cell state and hidden state into a single state vector and replace the three gates of an LSTM with two: a reset gate and an update gate. This reduces the number of parameters and makes GRUs faster to train.
| Feature | LSTM | GRU |
|---|---|---|
| Gates | 3 (forget, input, output) | 2 (reset, update) |
| State vectors | 2 (cell state + hidden state) | 1 (hidden state only) |
| Parameters | More | About 25% fewer |
| Training speed | Slower | Faster |
| Long-sequence performance | Slightly better on very long sequences | Comparable on short-to-medium sequences |
| Memory usage | Higher | Lower |
Empirical comparisons by Chung et al. (2014) showed that GRUs and LSTMs perform comparably on many tasks, with neither consistently outperforming the other. GRUs tend to be preferred when computational resources are limited, while LSTMs may have an edge on tasks requiring fine-grained memory control over very long sequences.
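The parameter difference is easy to verify empirically. This small sketch compares equally sized PyTorch modules; the dimensions are arbitrary assumptions chosen only for illustration.

```python
import torch.nn as nn

def count_params(module):
    return sum(p.numel() for p in module.parameters())

lstm = nn.LSTM(input_size=128, hidden_size=256, num_layers=1)
gru = nn.GRU(input_size=128, hidden_size=256, num_layers=1)

print(count_params(lstm))  # ~395k: four weight/bias sets (forget, input, candidate, output)
print(count_params(gru))   # ~296k: three sets (reset, update, candidate) -> roughly 25% fewer
```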
LSTMs are trained using gradient descent combined with backpropagation through time (BPTT). Several practical considerations affect training quality and stability.
Although LSTMs mitigate the vanishing gradient problem, the exploding gradient problem can still occur during training. Gradient clipping, proposed by Pascanu et al. (2013), addresses this by rescaling the gradient vector whenever its norm exceeds a specified threshold. Norm-based clipping (scaling the entire gradient to a fixed maximum norm) is the most common approach. A clipping threshold of 1.0 to 5.0 is a typical starting point.
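In PyTorch, norm-based clipping is a single call placed between the backward pass and the optimizer step. The following is a minimal training-step sketch; the model, data, and the threshold of 1.0 (the starting point mentioned above) are all illustrative assumptions.

```python
import torch
import torch.nn as nn

# Toy model and data, purely to show where clipping fits in the loop.
lstm = nn.LSTM(input_size=10, hidden_size=20, batch_first=True)
head = nn.Linear(20, 1)
params = list(lstm.parameters()) + list(head.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)

x = torch.randn(8, 30, 10)       # batch of 8 sequences, 30 steps each
y = torch.randn(8, 1)            # dummy regression targets

output, _ = lstm(x)
pred = head(output[:, -1, :])    # prediction from the last time step
loss = nn.functional.mse_loss(pred, y)

loss.backward()
# Rescale the full gradient vector whenever its norm exceeds the threshold.
torch.nn.utils.clip_grad_norm_(params, max_norm=1.0)
optimizer.step()
optimizer.zero_grad()
```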
Proper initialization of LSTM weights is important for stable training. Common strategies include:

- Xavier (Glorot) initialization for the input weight matrices W,
- orthogonal initialization for the recurrent weight matrices U, which helps preserve gradient norms through time, and
- initializing the forget gate bias to 1 (or another small positive value) so the network starts out remembering rather than forgetting, a choice Jozefowicz et al. (2015) found particularly important (see the sketch below).
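As one concrete sketch of these strategies using PyTorch's nn.LSTM, whose gate ordering within each packed weight tensor is input, forget, cell, output; the layer sizes are arbitrary assumptions.

```python
import torch.nn as nn

hidden_size = 64
lstm = nn.LSTM(input_size=32, hidden_size=hidden_size, num_layers=1)

for name, param in lstm.named_parameters():
    if "weight_hh" in name:
        nn.init.orthogonal_(param)        # orthogonal recurrent weights
    elif "weight_ih" in name:
        nn.init.xavier_uniform_(param)    # Xavier for input weights
    elif "bias" in name:
        param.data.fill_(0.0)
        # Biases are packed as [input, forget, cell, output] gate slices;
        # set the forget-gate slice to 1 so the cell starts by remembering.
        param.data[hidden_size:2 * hidden_size].fill_(1.0)
```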
Adaptive optimizers such as Adam and RMSprop are commonly used for training LSTMs because they handle the varying gradient scales across different parameters more effectively than vanilla stochastic gradient descent. Adam is the most popular choice for LSTM-based models.
Overfitting is a common issue when training LSTMs on limited data. Dropout, applied to the input and output connections (but not the recurrent connections), is the most widely used regularization technique. Gal and Ghahramani (2016) proposed variational dropout for RNNs, which applies the same dropout mask at every time step, yielding better regularization than naive dropout. Other techniques include weight decay (L2 regularization) and early stopping.
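As a small sketch of the non-recurrent dropout described above: PyTorch's nn.LSTM dropout argument applies dropout to the outputs of each stacked layer except the last, leaving the step-to-step recurrent connections untouched, and weight decay can be added through the optimizer. Variational dropout, by contrast, requires a custom implementation that reuses one mask across time steps. Sizes and rates below are illustrative.

```python
import torch
import torch.nn as nn

# Dropout of 0.3 on the outputs of layer 1 before they feed layer 2;
# the recurrent connections within each layer are not dropped.
lstm = nn.LSTM(input_size=100, hidden_size=64, num_layers=2,
               dropout=0.3, batch_first=True)

# Weight decay (L2 regularization) applied via the optimizer.
optimizer = torch.optim.Adam(lstm.parameters(), lr=1e-3, weight_decay=1e-5)
```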
LSTMs have been successfully applied across a wide range of domains. Their ability to model sequential data with long-range dependencies makes them particularly well suited for the following tasks.
LSTMs were the dominant architecture for many NLP tasks before the rise of the transformer in 2017. Key applications include:

- language modeling and text generation,
- machine translation with sequence-to-sequence models,
- sentiment analysis and text classification,
- named entity recognition and part-of-speech tagging, and
- question answering and summarization.
LSTMs significantly improved automatic speech recognition (ASR) by modeling the temporal structure of audio signals. Graves et al. (2013) combined deep bidirectional LSTMs with connectionist temporal classification (CTC), achieving breakthrough results on speech benchmarks. Google adopted LSTM-based acoustic models in 2015, and LSTM remained central to commercial speech recognition systems for several years.
LSTMs are widely used for time series prediction in finance, energy, weather, and healthcare. Their ability to capture nonlinear temporal patterns and long-range seasonal dependencies makes them effective for forecasting stock prices, electricity demand, patient health metrics, and other sequential data.
| Domain | Application examples |
|---|---|
| Handwriting recognition | Online and offline handwriting synthesis and recognition |
| Music generation | Modeling musical sequences and generating compositions |
| Video analysis | Activity recognition and video captioning |
| Robotics | Robot control and planning from sequential sensor data |
| Bioinformatics | Protein structure prediction and genomic sequence analysis |
| Anomaly detection | Identifying unusual patterns in network traffic and sensor data |
The introduction of the transformer architecture by Vaswani et al. in 2017 marked a turning point for sequence modeling. Transformers replaced recurrence with self-attention, enabling parallel processing of entire sequences and capturing long-range dependencies more effectively on many benchmarks.
| Aspect | LSTM | Transformer |
|---|---|---|
| Sequence processing | Sequential (step by step) | Parallel (all positions at once) |
| Long-range dependencies | Good (via cell state) | Excellent (via self-attention) |
| Training speed | Slower (not parallelizable over time) | Faster (fully parallelizable) |
| Memory complexity | O(1) per step for hidden state | O(n^2) for self-attention |
| Data efficiency | Better on small datasets | Requires large datasets |
| Inference on streaming data | Natural fit (processes one step at a time) | Requires re-encoding or caching |
| Parameter count for comparable tasks | Fewer | More |
| Hardware utilization | Less efficient on modern GPUs/TPUs | Highly optimized for parallel hardware |
Transformers have largely supplanted LSTMs for large-scale NLP tasks such as language modeling, machine translation, and text generation. However, LSTMs remain competitive or preferred in several scenarios:

- small or modest-sized datasets, where their data efficiency helps,
- streaming and real-time inference, where processing one step at a time is a natural fit,
- deployment on resource-constrained hardware, where their lower memory footprint matters, and
- time series forecasting and other tasks with strongly sequential structure.
In 2024, Maximilian Beck, Sepp Hochreiter, and colleagues published xLSTM (Extended Long Short-Term Memory), a modernized LSTM architecture designed to compete with transformers and state space models at scale. The paper was accepted as a spotlight paper at NeurIPS 2024.
xLSTM introduces two key innovations:
Exponential gating. Traditional LSTM gates use a sigmoid function, which limits gate values to the range (0, 1). xLSTM replaces this with exponential gating (combined with normalization and stabilization techniques), which provides more expressive control over information flow.
Modified memory structures. xLSTM defines two new variants:

- sLSTM, which keeps a scalar memory cell and scalar update but adds exponential gating and a new memory mixing scheme, and
- mLSTM, which replaces the scalar cell with a matrix memory updated by a covariance (outer-product) rule and drops the hidden-to-hidden recurrence, making it fully parallelizable.
These LSTM extensions are integrated into residual block backbones, forming xLSTM blocks that are stacked to create deep architectures. The authors trained a 7-billion-parameter xLSTM language model on 2.3 trillion tokens and demonstrated performance competitive with state-of-the-art transformers and state space models, both in absolute performance and in scaling behavior.
xLSTM represents a significant development because it shows that the core LSTM principles (gated memory with additive updates) can be scaled to the parameter counts and dataset sizes used by modern large language models, challenging the assumption that attention-based architectures are inherently superior.
The LSTM paper by Hochreiter and Schmidhuber (1997) is one of the most cited papers in the history of artificial intelligence, with tens of thousands of citations. The architecture transformed both machine learning research and commercial technology.
| Year | Milestone |
|---|---|
| 1991 | Hochreiter analyzes the vanishing gradient problem in his diploma thesis |
| 1997 | Hochreiter and Schmidhuber publish the original LSTM paper in Neural Computation |
| 1999 | Gers, Schmidhuber, and Cummins introduce the forget gate |
| 2000 | Gers and Schmidhuber add peephole connections |
| 2005 | Graves and Schmidhuber develop bidirectional LSTM with full BPTT |
| 2006 | Graves et al. introduce CTC for sequence labeling with LSTMs |
| 2014 | Cho et al. propose the GRU as a simplified LSTM variant |
| 2014 | Sutskever et al. use stacked LSTMs for sequence-to-sequence machine translation |
| 2015 | Google deploys LSTM-based speech recognition in production |
| 2016 | Google Translate switches to LSTM-based neural machine translation |
| 2017 | Facebook reports 4 billion LSTM-based translations per day |
| 2017 | The transformer architecture begins to displace LSTMs for NLP tasks |
| 2024 | xLSTM demonstrates that modernized LSTMs can compete with transformers at scale |
LSTMs became a foundational technology for major technology companies during the 2010s. Google used LSTMs to improve speech recognition accuracy and to power Google Translate's neural machine translation system. Apple integrated LSTM-based models into Siri for voice recognition. Amazon used LSTMs in Alexa's language understanding pipeline. Facebook deployed LSTMs for machine translation at massive scale, handling billions of translations daily by 2017.
Although transformers have replaced LSTMs in many of these systems, the gating and memory principles pioneered by LSTMs directly influenced the design of later architectures, including the GRU, gated variants of the transformer's feed-forward layers, and modern state space models.