# Forget Gate

> Source: https://aiwiki.ai/wiki/forget_gate
> Updated: 2026-04-09
> Categories: Machine Learning, Neural Networks
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

The **forget gate** is a component of [Long Short-Term Memory (LSTM)](/wiki/long_short-term_memory_lstm) [neural networks](/wiki/neural_network) that controls how much information from the previous cell state is retained or discarded at each time step. Introduced by Felix Gers, Jurgen Schmidhuber, and Fred Cummins in 1999, the forget gate was not part of the original LSTM architecture proposed by Hochreiter and Schmidhuber in 1997. Its addition solved a key limitation: without a mechanism to reset or clear memory cells, the internal state of an LSTM could grow indefinitely during continuous input streams, eventually causing the network to break down. The forget gate allows each memory cell to learn when to flush its contents, enabling the network to handle tasks where old information must be replaced by new information over time.

## How the forget gate works

The forget gate receives two inputs at each time step *t*: the hidden state from the previous time step (*h*<sub>t-1</sub>) and the current input (*x*<sub>t</sub>). Each is multiplied by its own learned weight matrix (equivalently, the two vectors are concatenated and multiplied by a single stacked matrix), and a bias term is added. The result is passed through a [sigmoid activation function](/wiki/sigmoid_function), which squashes the output to a value between 0 and 1 for each element in the cell state vector.

The mathematical formulation is:

**f**<sub>t</sub> = σ(**W**<sub>f</sub> **x**<sub>t</sub> + **U**<sub>f</sub> **h**<sub>t-1</sub> + **b**<sub>f</sub>)

where:

| Symbol | Meaning |
|---|---|
| **f**<sub>t</sub> | Forget gate activation vector at time step *t* |
| σ | [Sigmoid function](/wiki/sigmoid_function), outputting values in (0, 1) |
| **W**<sub>f</sub> | Weight matrix for the current input |
| **U**<sub>f</sub> | Weight matrix for the previous hidden state |
| **h**<sub>t-1</sub> | Hidden state from the previous time step |
| **x**<sub>t</sub> | Input vector at the current time step |
| **b**<sub>f</sub> | Bias vector for the forget gate |
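
As a minimal sketch of the formula above, the following NumPy snippet computes a forget gate activation for a single time step. The dimensions, the random weights, and the helper names `sigmoid` and `forget_gate` are illustrative assumptions, not part of any particular library.

```python
import numpy as np

def sigmoid(z):
    """Element-wise logistic function; outputs lie in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def forget_gate(x_t, h_prev, W_f, U_f, b_f):
    """f_t = sigmoid(W_f x_t + U_f h_{t-1} + b_f)."""
    return sigmoid(W_f @ x_t + U_f @ h_prev + b_f)

# Hypothetical sizes: 3 input features, 4 hidden units
rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4
W_f = rng.normal(scale=0.1, size=(hidden_size, input_size))
U_f = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
b_f = np.zeros(hidden_size)

x_t = rng.normal(size=input_size)
h_prev = rng.normal(size=hidden_size)

f_t = forget_gate(x_t, h_prev, W_f, U_f, b_f)

# Same computation with one stacked matrix applied to the concatenated vector
f_concat = sigmoid(np.hstack([W_f, U_f]) @ np.concatenate([x_t, h_prev]) + b_f)

print(f_t.shape)  # one activation per cell state element
```

The two forms, separate matrices or a single stacked matrix on the concatenated input, produce the same activations; which one appears in a given text is a matter of notation.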

The output **f**<sub>t</sub> is then used in the cell state update through element-wise (Hadamard) multiplication with the previous cell state:

**c**<sub>t</sub> = **f**<sub>t</sub> ⊙ **c**<sub>t-1</sub> + **i**<sub>t</sub> ⊙ **c̃**<sub>t</sub>

Here, **i**<sub>t</sub> is the input gate activation, **c̃**<sub>t</sub> is the candidate cell state (produced by a tanh layer), and ⊙ denotes element-wise multiplication. When a particular element of **f**<sub>t</sub> is close to 0, the corresponding value in the cell state is effectively erased. When it is close to 1, the value is carried forward almost unchanged.
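
To make the effect of gate values near 0 and 1 concrete, here is a small numeric sketch of the cell state update. All values are made up for illustration.

```python
import numpy as np

c_prev = np.array([2.0, -1.5, 0.8])    # previous cell state
f_t = np.array([0.0, 1.0, 0.5])        # forget gate: erase, keep, halve
i_t = np.array([1.0, 0.0, 0.5])        # input gate
c_tilde = np.array([0.3, 0.9, -0.4])   # candidate values from the tanh layer

# Element-wise update: c_t = f_t * c_{t-1} + i_t * c_tilde
c_t = f_t * c_prev + i_t * c_tilde
print(c_t)
```

The first element is replaced by its candidate value (0.3), the second is carried forward unchanged (-1.5), and the third is a blend of half the old value and half the candidate (0.2).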

## Role within the LSTM architecture

The standard LSTM cell contains three gates and one candidate memory computation. The forget gate operates alongside these other components, and each plays a distinct role in regulating information flow.

| Component | Function | Activation |
|---|---|---|
| Forget gate | Decides how much of the previous cell state to retain | [Sigmoid](/wiki/sigmoid_function) |
| Input gate | Controls how much of the new candidate values to add to the cell state | [Sigmoid](/wiki/sigmoid_function) |
| Candidate memory (**c̃**<sub>t</sub>) | Proposes new values to potentially add to the cell state | tanh |
| Output gate | Determines which parts of the cell state are exposed as the hidden state | [Sigmoid](/wiki/sigmoid_function) |
| Cell state (**c**<sub>t</sub>) | Carries information across time steps; updated by the forget and input gates | None (linear) |

The full set of LSTM equations at each time step is:

1. **Forget gate:** **f**<sub>t</sub> = σ(**W**<sub>f</sub> **x**<sub>t</sub> + **U**<sub>f</sub> **h**<sub>t-1</sub> + **b**<sub>f</sub>)
2. **Input gate:** **i**<sub>t</sub> = σ(**W**<sub>i</sub> **x**<sub>t</sub> + **U**<sub>i</sub> **h**<sub>t-1</sub> + **b**<sub>i</sub>)
3. **Candidate memory:** **c̃**<sub>t</sub> = tanh(**W**<sub>c</sub> **x**<sub>t</sub> + **U**<sub>c</sub> **h**<sub>t-1</sub> + **b**<sub>c</sub>)
4. **Cell state update:** **c**<sub>t</sub> = **f**<sub>t</sub> ⊙ **c**<sub>t-1</sub> + **i**<sub>t</sub> ⊙ **c̃**<sub>t</sub>
5. **Output gate:** **o**<sub>t</sub> = σ(**W**<sub>o</sub> **x**<sub>t</sub> + **U**<sub>o</sub> **h**<sub>t-1</sub> + **b**<sub>o</sub>)
6. **Hidden state:** **h**<sub>t</sub> = **o**<sub>t</sub> ⊙ tanh(**c**<sub>t</sub>)

The cell state update in step 4 is where the forget gate has its direct effect. It acts as a learned multiplier on the previous cell state, controlling the degree to which past information persists.

## Historical background

The development of the forget gate was a direct response to limitations discovered in the original LSTM design.

### The original LSTM (1997)

Sepp Hochreiter and Jurgen Schmidhuber introduced the LSTM in 1997 to address the [vanishing gradient problem](/wiki/vanishing_gradient_problem) that plagued standard [recurrent neural networks (RNNs)](/wiki/recurrent_neural_network). Their design centered on the "Constant Error Carousel" (CEC), a self-connected recurrent edge with a fixed weight of 1. This allowed error signals to flow backward through time without exponential decay, enabling the network to learn dependencies across hundreds or even thousands of time steps. The original architecture included input and output gates, but no forget gate. Once information was stored in a memory cell, it remained there indefinitely: later inputs admitted by the input gate could only add to the stored value, and there was no mechanism to reset it.

### The forget gate addition (1999-2000)

Gers, Schmidhuber, and Cummins published "Learning to Forget: Continual Prediction with LSTM" at the 1999 International Conference on Artificial Neural Networks (ICANN), with an expanded journal version appearing in Neural Computation in 2000. They identified that the original LSTM architecture failed on tasks involving continuous input streams that were not segmented into subsequences with clearly marked endpoints. Without a mechanism to reset the cell state, internal values could grow without bound, eventually destabilizing the network.

The forget gate solved this by giving each memory cell the ability to learn when to clear its contents. In their experiments, standard LSTM and other [recurrent neural network](/wiki/recurrent_neural_network) algorithms failed on continual versions of benchmark problems, while LSTM with forget gates solved them reliably.

### Peephole connections (2000)

Gers and Schmidhuber introduced peephole connections in their 2000 paper "Recurrent Nets that Time and Count." In standard LSTM, the gates receive input from the current input and the previous hidden state, but they cannot directly observe the cell state. Peephole connections add direct connections from the cell state to each gate, allowing gates to make decisions based on the actual stored values. This modification was intended to improve the network's ability to learn precise timing. However, a large-scale empirical study by Greff et al. (2017) found that peephole connections did not provide consistent performance improvements across tasks.

### Timeline of LSTM development

| Year | Development | Authors |
|---|---|---|
| 1997 | Original LSTM with input and output gates, Constant Error Carousel | Hochreiter and Schmidhuber |
| 1999 | Forget gate introduced | Gers, Schmidhuber, and Cummins |
| 2000 | Peephole connections added | Gers and Schmidhuber |
| 2014 | [Gated Recurrent Unit (GRU)](/wiki/recurrent_neural_network) proposed as a simpler alternative | Cho et al. |
| 2015 | Empirical study recommending forget gate bias initialization to 1 | Jozefowicz, Zaremba, and Sutskever |
| 2017 | Large-scale comparison of 8 LSTM variants; forget gate and output activation function identified as the most critical components | Greff et al. |
| 2018 | "Unreasonable effectiveness of the forget gate" demonstrated | van der Westhuizen and Eloff |

## Gradient flow and the vanishing gradient problem

The forget gate plays a direct role in how [gradients](/wiki/gradient) flow through the LSTM during [backpropagation](/wiki/backpropagation) through time (BPTT). To understand this, consider the gradient of the loss with respect to the cell state at a previous time step.

During backpropagation, the partial derivative of the cell state at time *t* with respect to the cell state at time *t-1* is:

∂**c**<sub>t</sub> / ∂**c**<sub>t-1</sub> = **f**<sub>t</sub> + (other terms involving gate derivatives)

The key insight is that the dominant term in this derivative is the forget gate value **f**<sub>t</sub> itself. In a standard RNN, the analogous derivative involves multiplying by the recurrent weight matrix and by the derivative of a saturating nonlinearity (such as tanh), which is at most 1 and usually smaller. Over many time steps, these multiplications typically cause the gradient to shrink exponentially, producing the vanishing gradient problem.

In an LSTM, the gradient through the cell state is multiplied by the forget gate activation at each step. If the forget gate learns to output values close to 1 for time steps where long-term information should be preserved, the gradient passes through with minimal attenuation. This generalizes Hochreiter and Schmidhuber's Constant Error Carousel, which corresponds to a forget gate fixed at 1: the cell state acts as a highway for gradient flow, and the forget gate is the toll operator deciding how much signal gets through.

When the forget gate is close to 0, the gradient is attenuated at that time step, which is the desired behavior since it corresponds to information that the network has chosen to discard. This selective gradient gating is what allows LSTMs to learn dependencies spanning hundreds or thousands of time steps, something that standard RNNs cannot do.
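
The effect of this repeated multiplication can be sketched numerically. The snippet below ignores the "other terms" in the derivative and assumes the same forget gate value at every step; both are simplifications made purely for illustration, and `gradient_scale` is a hypothetical helper name.

```python
def gradient_scale(forget_value, steps):
    """Product of `steps` identical forget gate activations: the factor by
    which the direct cell-state path scales a gradient over that span."""
    return forget_value ** steps

for f in (0.5, 0.9, 0.99, 0.999):
    scale = gradient_scale(f, 100)
    print(f"f = {f}: gradient scale after 100 steps = {scale:.3e}")
```

Under these assumptions, a gate value of 0.5 shrinks the gradient to the order of 10<sup>-31</sup> after 100 steps, while a gate value of 0.99 still passes roughly 37% of it.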

## Bias initialization

A practical detail with significant impact on LSTM training is the initialization of the forget gate bias (**b**<sub>f</sub>). If the bias is initialized to 0 or a small random value, the sigmoid output starts near 0.5, which means the network begins training by partially forgetting all previous cell states. This can reintroduce vanishing gradient issues early in training, before the network has learned which information to keep.

Gers et al. (2000) noted in their original forget gate paper that setting the initial bias to a positive value (such as 1) could improve performance. Jozefowicz, Zaremba, and Sutskever (2015) confirmed this empirically in their paper "An Empirical Exploration of Recurrent Network Architectures." They found that initializing the forget gate bias to 1 substantially improved LSTM performance, closing the gap between LSTMs and GRUs on several benchmarks. Their evaluation covered over 10,000 different RNN architectures, and they recommended this initialization for every LSTM implementation.

The rationale is straightforward: a bias of 1 shifts the sigmoid input so that the initial forget gate output is about 0.73 rather than 0.5, meaning the network starts by retaining most of the cell state. The network then learns during training which cells should forget and when. Defaults differ between [deep learning](/wiki/deep_learning) frameworks: the Keras LSTM layer in [TensorFlow](/wiki/tensorflow) applies this initialization by default (`unit_forget_bias=True`), while `nn.LSTM` in [PyTorch](/wiki/pytorch) draws all biases from the same small uniform distribution, so the forget gate bias must be set manually.

| Bias initialization | Initial forget gate output | Effect on training |
|---|---|---|
| 0 | ~0.5 | Partial forgetting from the start; can cause vanishing gradients early in training |
| 1 | ~0.73 | Mostly retains information; network learns to forget as needed |
| 2 | ~0.88 | Strong retention; useful for tasks requiring very long memory spans |
| 5 | ~0.99 | Nearly always retains; gate essentially disabled until training adjusts it |
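
The middle column of the table can be reproduced with a few lines of Python. This sketch assumes that the weighted input and hidden state terms contribute roughly zero at initialization, so the gate output is approximately the sigmoid of the bias alone.

```python
import math

def sigmoid(z):
    """Logistic function for a scalar input."""
    return 1.0 / (1.0 + math.exp(-z))

for bias in (0, 1, 2, 5):
    print(f"bias = {bias}: initial forget gate output ~ {sigmoid(bias):.2f}")
```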

## Empirical importance

Several large-scale empirical studies have investigated which components of the LSTM architecture are most important.

### Greff et al. (2017): "LSTM: A Search Space Odyssey"

Greff et al. tested eight LSTM variants across three tasks (speech recognition, handwriting recognition, and polyphonic music modeling) in approximately 5,400 experimental runs totaling roughly 15 years of CPU time. Their central finding was that the **forget gate and the output activation function are the most critical components** of the LSTM. Removing the forget gate consistently degraded performance across all datasets, while removing or modifying other components (such as peephole connections or the input gate) had less consistent effects. None of the eight variants significantly outperformed the standard LSTM architecture.

### van der Westhuizen and Eloff (2018): "The unreasonable effectiveness of the forget gate"

This study went further by proposing JANET (Just Another NETwork), an LSTM variant that uses **only the forget gate**, removing both the input gate and the output gate entirely. Despite this radical simplification, JANET matched or outperformed the standard LSTM on multiple benchmark datasets, including MNIST, permuted MNIST, and MIT-BIH arrhythmia classification. JANET also used approximately 50% fewer parameters and required roughly 5/6ths the computation of a standard LSTM. The authors attributed this success to the fact that a forget-gate-only architecture, combined with proper bias initialization (chrono initialization), creates implicit skip connections that facilitate gradient flow.

### Jozefowicz et al. (2015): "An Empirical Exploration of Recurrent Network Architectures"

In an evaluation of over 10,000 RNN architectures, Jozefowicz et al. found that the most impactful single modification to the standard LSTM was initializing the forget gate bias to 1. This change alone was sufficient to make the LSTM competitive with the GRU, which had been performing better in several benchmarks.

## Comparison with GRU gates

The [Gated Recurrent Unit (GRU)](/wiki/recurrent_neural_network), proposed by Cho et al. in 2014, takes a different approach to gating. Instead of separate forget and input gates, the GRU uses a single **update gate** that simultaneously controls what to forget and what to add, which couples the two decisions. In the convention of Cho et al., if the update gate outputs a value *z*<sub>t</sub>, then the fraction of old information retained is *z*<sub>t</sub> and the fraction of new information added is (1 - *z*<sub>t</sub>); some later presentations swap the two roles.

| Feature | LSTM forget gate | GRU update gate |
|---|---|---|
| Number of gates | 3 (forget, input, output) | 2 (update, reset) |
| Forget/input coupling | Independent | Coupled (f<sub>t</sub> = 1 - i<sub>t</sub>) |
| Separate cell state | Yes | No (hidden state only) |
| Parameters | More (separate weights for each gate) | Fewer |
| Output gating | Yes (output gate controls visibility) | No |
| Computational cost | Higher | Lower |
| Performance | Generally similar; slight advantages on tasks requiring fine-grained memory control | Generally similar; slight advantages when data is limited |

The coupled forget-input gate is also available as an LSTM variant (sometimes called CIFG, for Coupled Input and Forget Gate). In this variant, the input gate is set to **i**<sub>t</sub> = 1 - **f**<sub>t</sub>, so the LSTM only adds new information to positions in the cell state where old information is being discarded. Greff et al. (2017) found that this coupling did not significantly impair LSTM performance compared to independent gates.
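
A minimal sketch of the coupled update follows. The function name `cifg_update` and the sample values are assumptions chosen for illustration.

```python
import numpy as np

def cifg_update(c_prev, f_t, c_tilde):
    """Coupled update: the input gate is tied to the forget gate (i_t = 1 - f_t)."""
    i_t = 1.0 - f_t
    return f_t * c_prev + i_t * c_tilde

c_prev = np.array([1.0, 1.0, 1.0])
c_tilde = np.array([-1.0, -1.0, -1.0])
f_t = np.array([1.0, 0.5, 0.0])   # keep all, blend evenly, replace all

c_t = cifg_update(c_prev, f_t, c_tilde)
print(c_t)
```

Because each element is a weighted average of the old value and the candidate, the result always lies between the two. Here the three elements come out as 1, 0, and -1.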

## Variants involving the forget gate

Several LSTM variants modify how the forget gate operates.

### Peephole connections

In the standard LSTM, gate computations depend only on the current input and the previous hidden state. With peephole connections (Gers and Schmidhuber, 2000), the gates also receive the previous cell state as an additional input:

**f**<sub>t</sub> = σ(**W**<sub>f</sub> **x**<sub>t</sub> + **U**<sub>f</sub> **h**<sub>t-1</sub> + **V**<sub>f</sub> **c**<sub>t-1</sub> + **b**<sub>f</sub>)

where **V**<sub>f</sub> is a diagonal weight matrix connecting the cell state to the forget gate. This allows the forget gate to consider the actual stored values when deciding what to discard. However, empirical results from Greff et al. (2017) showed that peephole connections did not consistently improve performance.
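
The peephole formula can be sketched as follows. Because **V**<sub>f</sub> is diagonal, this sketch stores it as a vector `v_f` and applies it element-wise; the function name and the test values are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))

def peephole_forget_gate(x_t, h_prev, c_prev, W_f, U_f, v_f, b_f):
    """Forget gate with a peephole term; v_f holds the diagonal of V_f."""
    return sigmoid(W_f @ x_t + U_f @ h_prev + v_f * c_prev + b_f)

# Isolate the peephole term: zero input weights, recurrent weights, and bias
hidden_size, input_size = 3, 2
W_f = np.zeros((hidden_size, input_size))
U_f = np.zeros((hidden_size, hidden_size))
b_f = np.zeros(hidden_size)
v_f = np.ones(hidden_size)

x_t = np.ones(input_size)
h_prev = np.zeros(hidden_size)
c_prev = np.array([-3.0, 0.0, 3.0])

f_t = peephole_forget_gate(x_t, h_prev, c_prev, W_f, U_f, v_f, b_f)
print(f_t)  # the middle element is 0.5 because its stored cell value is 0
```

With everything else zeroed out, the gate activation depends only on the stored cell values, which is the extra information a peephole connection provides.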

### JANET (forget-gate-only LSTM)

JANET removes both the input and output gates, retaining only the forget gate. The cell state update simplifies to:

**c**<sub>t</sub> = **f**<sub>t</sub> ⊙ **c**<sub>t-1</sub> + (1 - **f**<sub>t</sub>) ⊙ **c̃**<sub>t</sub>

This effectively couples the forget and input mechanisms (similar to the GRU update gate) and produces the hidden state directly from the cell state without output gating.
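
The update described above can be sketched as a single cell step. This is an illustration built only from the equation and description in this section, with the hidden state taken to be the cell state itself; the published JANET model may include details not covered here. The function name `janet_cell` and all sizes are hypothetical.

```python
import numpy as np

def sigmoid(z):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))

def janet_cell(x_t, h_prev, c_prev, W_f, U_f, b_f, W_c, U_c, b_c):
    """One step of a forget-gate-only cell (sketch)."""
    f_t = sigmoid(W_f @ x_t + U_f @ h_prev + b_f)       # forget gate
    c_tilde = np.tanh(W_c @ x_t + U_c @ h_prev + b_c)   # candidate memory
    c_t = f_t * c_prev + (1.0 - f_t) * c_tilde          # coupled update
    h_t = c_t                                           # assumption: no output gate
    return h_t, c_t

# Hypothetical sizes and random parameters
rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4
params = dict(
    W_f=rng.normal(size=(hidden_size, input_size)),
    U_f=rng.normal(size=(hidden_size, hidden_size)),
    b_f=np.ones(hidden_size),
    W_c=rng.normal(size=(hidden_size, input_size)),
    U_c=rng.normal(size=(hidden_size, hidden_size)),
    b_c=np.zeros(hidden_size),
)

h = c = np.zeros(hidden_size)
for x_t in rng.normal(size=(10, input_size)):   # a 10-step input sequence
    h, c = janet_cell(x_t, h, c, **params)
print(h)
```

Only two sets of weights remain (forget gate and candidate memory) instead of the four in a standard LSTM cell.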

### Chrono initialization

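Chrono initialization was proposed by Tallec and Ollivier (2018) in "Can Recurrent Neural Networks Warp Time?" and adopted by van der Westhuizen and Eloff (2018) for JANET. Different units receive different initial forget gate biases: each bias is set to the logarithm of a value drawn uniformly from the range [1, *T*<sub>max</sub> - 1], where *T*<sub>max</sub> is the longest dependency range expected in the data. This gives each unit a different initial "time constant," allowing some units to retain information over short periods and others over long periods from the start of training.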
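
A minimal sketch of this kind of initialization follows, assuming the formulation from Tallec and Ollivier's chrono initialization in which each forget gate bias is the logarithm of a uniform sample between 1 and *T*<sub>max</sub> - 1. The function name `chrono_forget_bias` and the value of `t_max` are hypothetical.

```python
import numpy as np

def chrono_forget_bias(hidden_size, t_max, rng):
    """Draw b_f = log(u) with u ~ Uniform(1, t_max - 1), one value per unit."""
    u = rng.uniform(1.0, t_max - 1.0, size=hidden_size)
    return np.log(u)

rng = np.random.default_rng(0)
b_f = chrono_forget_bias(hidden_size=8, t_max=784, rng=rng)

# A unit whose gate sits at sigmoid(b_f) retains information over roughly
# 1 / (1 - sigmoid(b_f)) steps, which simplifies to 1 + exp(b_f).
timescales = 1.0 + np.exp(b_f)

print(np.round(b_f, 2))
print(np.round(timescales, 1))
```

The resulting timescales are spread across the range from a couple of steps up to about `t_max`, so different units start out specialized for different memory lengths.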

## Intuitive example: language modeling

Consider a [language model](/wiki/language_model) processing the sentence: "The cat, which had been sleeping on the windowsill since morning, suddenly jumped."

As the model processes this sentence token by token, the forget gate at each step decides how much of the accumulated context to keep. When the model encounters "The cat," it stores information about the subject. During the long relative clause ("which had been sleeping on the windowsill since morning"), the forget gate for the subject-related memory cells should remain close to 1, preserving the fact that "cat" is the subject. Meanwhile, cells storing less relevant information (like specific words in the relative clause) can have their forget gates set closer to 0.

When the model reaches "suddenly jumped," it needs to recall that "cat" is the subject in order to predict a verb that fits it. The forget gate's ability to selectively preserve this information across a long intervening clause is what distinguishes LSTMs from standard RNNs on such tasks.

Another example comes from [time series](/wiki/time_series_analysis) forecasting. When predicting stock prices, daily fluctuations might be noise that the forget gate learns to discard, while longer-term trends (quarterly earnings patterns, market cycles) are retained across many time steps.

## Applications

The forget gate is active in every application that uses LSTM networks. Some notable domains where the forget gate's ability to manage long-term dependencies is particularly important include:

- **[Natural language processing](/wiki/natural_language_understanding):** [Machine translation](/wiki/machine_translation), [sentiment analysis](/wiki/sentiment_analysis), [text summarization](/wiki/text_summarization), and [language modeling](/wiki/language_model) all require tracking dependencies across variable-length sequences. The forget gate allows the model to clear irrelevant context while preserving syntactic and semantic information needed later in the sequence.
- **[Speech recognition](/wiki/speech_recognition):** Google reported in 2015 that LSTM-based models cut transcription errors in Google Voice by 49%. The forget gate enables the model to retain phonological context over variable-length utterances while discarding acoustic noise.
- **[Time series](/wiki/time_series_analysis) forecasting:** Financial prediction, weather forecasting, and energy demand prediction rely on LSTMs to capture both short-term and long-term patterns. The forget gate allows the model to flush outdated observations while retaining seasonal or cyclical patterns.
- **Music generation:** Modeling musical sequences requires remembering key signatures, chord progressions, and melodic motifs over extended passages. The forget gate manages the balance between retaining structural elements and adapting to new musical phrases.
- **Handwriting recognition:** Recognizing handwritten text requires the model to process strokes sequentially. The forget gate clears completed character information while retaining positional context.

## Implementation example

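The following Python (NumPy) sketch illustrates the forward pass of an LSTM cell, showing where the forget gate operates. Passing the weights as dictionaries keyed by gate name is an illustrative choice, not a library convention:

```python
import numpy as np

def sigmoid(z):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step. W, U, and b are dicts keyed by 'f', 'i', 'c', 'o'."""
    # Forget gate: fraction of each previous cell value to keep
    f_t = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])

    # Input gate: fraction of each candidate value to write
    i_t = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])

    # Candidate memory: proposed new cell values
    c_tilde = np.tanh(W["c"] @ x_t + U["c"] @ h_prev + b["c"])

    # Cell state update (forget gate applied here)
    c_t = f_t * c_prev + i_t * c_tilde

    # Output gate: fraction of the cell state to expose
    o_t = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])

    # Hidden state
    h_t = o_t * np.tanh(c_t)

    return h_t, c_t
```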

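In [PyTorch](/wiki/pytorch), the LSTM module handles this internally. The forget gate bias can be initialized to 1 by accessing the bias parameters directly after creating the module. Each layer holds two bias vectors (`bias_ih` and `bias_hh`) that are added together, so setting the forget slice to 1 in both would give an effective bias of 2:

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=128, hidden_size=256)

# Each bias vector is laid out as [b_input, b_forget, b_cell, b_output].
# Set the forget slice to 1 in bias_ih and 0 in bias_hh so that their sum,
# the effective forget gate bias, is exactly 1.
with torch.no_grad():
    for name, param in lstm.named_parameters():
        if "bias" in name:
            n = param.size(0)
            value = 1.0 if "bias_ih" in name else 0.0
            param[n // 4 : n // 2].fill_(value)  # forget gate slice
```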

## Explain like I'm 5 (ELI5)

Imagine your brain is like a toy box. Every day, you get new toys (information). But your toy box can only hold so many toys. The forget gate is like a helper who looks at your toys each day and decides which ones you still play with and which ones you have outgrown. The toys you still use stay in the box, and the ones you do not need anymore get removed to make room for new toys. Without this helper, your toy box would overflow and you would not be able to find anything. The forget gate keeps things organized so the important toys are always easy to find when you need them.

## References

1. Hochreiter, S., & Schmidhuber, J. (1997). "Long Short-Term Memory." *Neural Computation*, 9(8), 1735-1780.
2. Gers, F. A., Schmidhuber, J., & Cummins, F. (2000). "Learning to Forget: Continual Prediction with LSTM." *Neural Computation*, 12(10), 2451-2471.
3. Gers, F. A., & Schmidhuber, J. (2000). "Recurrent Nets that Time and Count." *Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks (IJCNN)*, 3, 189-194.
4. Gers, F. A., Schraudolph, N. N., & Schmidhuber, J. (2002). "Learning Precise Timing with LSTM Recurrent Networks." *Journal of Machine Learning Research*, 3, 115-143.
5. Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., & Bengio, Y. (2014). "Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation." *Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)*.
6. Jozefowicz, R., Zaremba, W., & Sutskever, I. (2015). "An Empirical Exploration of Recurrent Network Architectures." *Proceedings of the 32nd International Conference on Machine Learning (ICML)*.
7. Greff, K., Srivastava, R. K., Koutnik, J., Steunebrink, B. R., & Schmidhuber, J. (2017). "LSTM: A Search Space Odyssey." *IEEE Transactions on Neural Networks and Learning Systems*, 28(10), 2222-2232.
8. van der Westhuizen, J., & Eloff, J. (2018). "The Unreasonable Effectiveness of the Forget Gate." *arXiv preprint arXiv:1804.04849*.
9. Olah, C. (2015). "Understanding LSTM Networks." *colah's blog*. https://colah.github.io/posts/2015-08-Understanding-LSTMs/
10. Sak, H., Senior, A., & Beaufays, F. (2014). "Long Short-Term Memory Recurrent Neural Network Architectures for Large Scale Acoustic Modeling." *Proceedings of INTERSPEECH*.
11. Zhang, A., Lipton, Z. C., Li, M., & Smola, A. J. (2023). "Dive into Deep Learning." Cambridge University Press. https://d2l.ai/
12. Goodfellow, I., Bengio, Y., & Courville, A. (2016). *Deep Learning*. MIT Press.
