The forget gate is a component of Long Short-Term Memory (LSTM) neural networks that controls how much information from the previous cell state is retained or discarded at each time step. Introduced by Felix Gers, Jurgen Schmidhuber, and Fred Cummins in 1999, the forget gate was not part of the original LSTM architecture proposed by Hochreiter and Schmidhuber in 1997. Its addition solved a key limitation: without a mechanism to reset or clear memory cells, the internal state of an LSTM could grow indefinitely during continuous input streams, eventually causing the network to break down. The forget gate allows each memory cell to learn when to flush its contents, enabling the network to handle tasks where old information must be replaced by new information over time.
The forget gate receives two inputs at each time step t: the hidden state from the previous time step (ht-1) and the current input (xt). Each is multiplied by its own learned weight matrix (equivalently, the two vectors are concatenated and multiplied by a single combined matrix), and a bias term is added. The result is passed through a sigmoid activation function, which squashes the output to a value between 0 and 1 for each element of the cell state vector.
The mathematical formulation is:
ft = σ(Wf xt + Uf ht-1 + bf)
where:
| Symbol | Meaning |
|---|---|
| ft | Forget gate activation vector at time step t |
| σ | Sigmoid function, outputting values in (0, 1) |
| Wf | Weight matrix for the current input |
| Uf | Weight matrix for the previous hidden state |
| ht-1 | Hidden state from the previous time step |
| xt | Input vector at the current time step |
| bf | Bias vector for the forget gate |
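As a minimal NumPy sketch of this computation (dimensions, weights, and inputs below are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim = 4, 3                         # hypothetical sizes

W_f = rng.standard_normal((hidden_dim, input_dim))   # weights for the current input
U_f = rng.standard_normal((hidden_dim, hidden_dim))  # weights for the previous hidden state
b_f = np.ones(hidden_dim)                            # forget gate bias (here set to 1)

x_t = rng.standard_normal(input_dim)                 # current input
h_prev = rng.standard_normal(hidden_dim)             # previous hidden state

# f_t = sigma(W_f x_t + U_f h_{t-1} + b_f): one value in (0, 1) per cell element
f_t = 1.0 / (1.0 + np.exp(-(W_f @ x_t + U_f @ h_prev + b_f)))
print(f_t.shape)  # (3,)
```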
The output ft is then used in the cell state update through element-wise (Hadamard) multiplication with the previous cell state:
ct = ft ⊙ ct-1 + it ⊙ c̃t
Here, it is the input gate activation, c̃t is the candidate cell state (produced by a tanh layer), and ⊙ denotes element-wise multiplication. When a particular element of ft is close to 0, the corresponding value in the cell state is effectively erased. When it is close to 1, the value is carried forward almost unchanged.
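A small hand-picked example (all values hypothetical) shows this erase/retain behaviour element by element:

```python
import numpy as np

c_prev = np.array([2.0, -1.5, 0.8])    # previous cell state
f_t = np.array([0.99, 0.02, 0.5])      # forget gate: keep, erase, halve
i_t = np.array([0.10, 0.90, 0.5])      # input gate
c_tilde = np.array([0.3, 1.0, -0.4])   # candidate values from the tanh layer

c_t = f_t * c_prev + i_t * c_tilde
print(c_t)  # [2.01  0.87  0.2] -- first element kept, second replaced, third mixed
```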
The standard LSTM cell contains three gates and one candidate memory computation. The forget gate operates alongside these other components, and each plays a distinct role in regulating information flow.
| Component | Function | Activation |
|---|---|---|
| Forget gate | Decides how much of the previous cell state to retain | Sigmoid |
| Input gate | Controls how much of the new candidate values to add to the cell state | Sigmoid |
| Candidate memory (c̃t) | Proposes new values to potentially add to the cell state | tanh |
| Output gate | Determines which parts of the cell state are exposed as the hidden state | Sigmoid |
| Cell state (ct) | Carries information across time steps; updated by the forget and input gates | None (linear) |
The full set of LSTM equations at each time step is:

1. Forget gate: ft = σ(Wf xt + Uf ht-1 + bf)
2. Input gate: it = σ(Wi xt + Ui ht-1 + bi)
3. Candidate memory: c̃t = tanh(Wc xt + Uc ht-1 + bc)
4. Cell state update: ct = ft ⊙ ct-1 + it ⊙ c̃t
5. Output gate: ot = σ(Wo xt + Uo ht-1 + bo)
6. Hidden state: ht = ot ⊙ tanh(ct)
The cell state update in step 4 is where the forget gate has its direct effect. It acts as a learned multiplier on the previous cell state, controlling the degree to which past information persists.
The development of the forget gate was a direct response to limitations discovered in the original LSTM design.
Sepp Hochreiter and Jurgen Schmidhuber introduced the LSTM in 1997 to address the vanishing gradient problem that plagued standard recurrent neural networks (RNNs). Their design centered on the "Constant Error Carousel" (CEC), a self-connected recurrent edge with a fixed weight of 1. This allowed error signals to flow backward through time without exponential decay, enabling the network to learn dependencies across hundreds or even thousands of time steps. The original architecture included input and output gates, but no forget gate. Once information was stored in a memory cell, it remained there indefinitely unless overwritten by sufficiently strong input gate activations.
Gers, Schmidhuber, and Cummins published "Learning to Forget: Continual Prediction with LSTM" at the 1999 International Conference on Artificial Neural Networks (ICANN), with an expanded journal version appearing in Neural Computation in 2000. They identified that the original LSTM architecture failed on tasks involving continuous input streams that were not segmented into subsequences with clearly marked endpoints. Without a mechanism to reset the cell state, internal values could grow without bound, eventually destabilizing the network.
The forget gate solved this by giving each memory cell the ability to learn when to clear its contents. In their experiments, standard LSTM and other recurrent neural network algorithms failed on continual versions of benchmark problems, while LSTM with forget gates solved them reliably.
Gers and Schmidhuber later introduced peephole connections in 2000, published in the paper "Recurrent Nets that Time and Count." In standard LSTM, the gates receive input from the current input and the previous hidden state, but they cannot directly observe the cell state. Peephole connections add direct connections from the cell state to each gate, allowing gates to make decisions based on the actual stored values. This modification was intended to improve the network's ability to learn precise timing. However, a large-scale empirical study by Greff et al. (2017) found that peephole connections did not provide consistent performance improvements across tasks.
| Year | Development | Authors |
|---|---|---|
| 1997 | Original LSTM with input and output gates, Constant Error Carousel | Hochreiter and Schmidhuber |
| 1999 | Forget gate introduced | Gers, Schmidhuber, and Cummins |
| 2000 | Peephole connections added | Gers and Schmidhuber |
| 2014 | Gated Recurrent Unit (GRU) proposed as a simpler alternative | Cho et al. |
| 2015 | Empirical study recommending forget gate bias initialization to 1 | Jozefowicz, Zaremba, and Sutskever |
| 2017 | Large-scale comparison of 8 LSTM variants; forget gate identified as most important component | Greff et al. |
| 2018 | "Unreasonable effectiveness of the forget gate" demonstrated; JANET proposed | van der Westhuizen and Lasenby |
The forget gate plays a direct role in how gradients flow through the LSTM during backpropagation through time (BPTT). To understand this, consider the gradient of the loss with respect to the cell state at a previous time step.
During backpropagation, the partial derivative of the cell state at time t with respect to the cell state at time t-1 is:
∂ct / ∂ct-1 = ft + (other terms involving gate derivatives)
The key insight is that the dominant term in this derivative is the forget gate value ft itself. In a standard RNN, the analogous derivative involves multiplying by the recurrent weight matrix and the derivative of a saturating nonlinearity (such as tanh), a product whose magnitude is typically less than 1. Over many time steps, these repeated multiplications cause the gradient to shrink exponentially, producing the vanishing gradient problem.
In an LSTM, the gradient through the cell state is multiplied by the forget gate activation at each step. If the forget gate learns to output values close to 1 for time steps where long-term information should be preserved, the gradient passes through with minimal attenuation. This generalizes Hochreiter and Schmidhuber's Constant Error Carousel: where the original LSTM fixed the cell's self-connection at 1, the cell state still acts as a highway for gradient flow, with the forget gate deciding how much signal gets through at each step.
When the forget gate is close to 0, the gradient is attenuated at that time step, which is the desired behavior since it corresponds to information that the network has chosen to discard. This selective gradient gating is what allows LSTMs to learn dependencies spanning hundreds or thousands of time steps, something that standard RNNs cannot do.
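The cumulative effect over many time steps can be seen by multiplying per-step forget gate values together; the values below are illustrative rather than taken from any trained model:

```python
import numpy as np

steps = 100
# The gradient carried through the cell state is roughly the product of the
# forget gate activations along the path (ignoring the other update terms).
for f in (0.5, 0.9, 0.99, 1.0):
    print(f, np.prod(np.full(steps, f)))
# 0.5  -> ~7.9e-31  (signal effectively erased)
# 0.9  -> ~2.7e-05
# 0.99 -> ~0.37
# 1.0  -> 1.0       (signal passes through unattenuated)
```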
A practical detail with significant impact on LSTM training is the initialization of the forget gate bias (bf). If the bias is initialized to 0 or a small random value, the sigmoid output starts near 0.5, which means the network begins training by partially forgetting all previous cell states. This can reintroduce vanishing gradient issues early in training, before the network has learned which information to keep.
Gers et al. (2000) noted in their original forget gate paper that setting the initial bias to a positive value (such as 1) could improve performance. Jozefowicz, Zaremba, and Sutskever (2015) confirmed this empirically in their paper "An Empirical Exploration of Recurrent Network Architectures." They found that initializing the forget gate bias to 1 substantially improved LSTM performance, closing the gap between LSTMs and GRUs on several benchmarks. Their evaluation covered over 10,000 different RNN architectures, and they recommended this initialization for every LSTM implementation.
The rationale is straightforward: a bias of 1 shifts the sigmoid input so that the initial forget gate output is close to 1, meaning the network starts by retaining most information. The network then learns during training which cells should forget and when. This initialization has become standard practice: Keras/TensorFlow applies it by default through the unit_forget_bias option of its LSTM layers, and in PyTorch it is commonly applied manually after constructing the module (see the implementation example below).
| Bias initialization | Initial forget gate output | Effect on training |
|---|---|---|
| 0 | ~0.5 | Partial forgetting from the start; can cause vanishing gradients early in training |
| 1 | ~0.73 | Mostly retains information; network learns to forget as needed |
| 2 | ~0.88 | Strong retention; useful for tasks requiring very long memory spans |
| 5 | ~0.99 | Nearly always retains; gate essentially disabled until training adjusts it |
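The initial outputs in the table are simply the sigmoid of the bias, assuming the weighted input terms are near zero at the start of training:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for b in (0.0, 1.0, 2.0, 5.0):
    print(b, round(float(sigmoid(b)), 2))
# 0.0 -> 0.5, 1.0 -> 0.73, 2.0 -> 0.88, 5.0 -> 0.99
```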
Several large-scale empirical studies have investigated which components of the LSTM architecture are most important.
Greff et al. tested eight LSTM variants across three tasks (speech recognition, handwriting recognition, and polyphonic music modeling) in approximately 5,400 experimental runs totaling roughly 15 years of CPU time. Their central finding was that the forget gate and the output activation function are the most critical components of the LSTM. Removing the forget gate consistently degraded performance across all datasets, while removing or modifying other components (such as peephole connections or the input gate) had less consistent effects. None of the eight variants significantly outperformed the standard LSTM architecture.
Building on this finding, van der Westhuizen and Lasenby (2018) proposed JANET (Just Another NETwork), an LSTM variant that uses only the forget gate, removing both the input gate and the output gate entirely. Despite this radical simplification, JANET matched or outperformed the standard LSTM on multiple benchmark datasets, including MNIST, permuted MNIST, and MIT-BIH arrhythmia classification. JANET also used approximately 50% fewer parameters and required roughly 5/6ths the computation of a standard LSTM. The authors attributed this success to the fact that a forget-gate-only architecture, combined with proper bias initialization (chrono initialization), creates implicit skip connections that facilitate gradient flow.
In an evaluation of over 10,000 RNN architectures, Jozefowicz et al. found that the most impactful single modification to the standard LSTM was initializing the forget gate bias to 1. This change alone was sufficient to make the LSTM competitive with the GRU, which had been performing better in several benchmarks.
The Gated Recurrent Unit (GRU), proposed by Cho et al. in 2014, takes a different approach to gating. Instead of separate forget and input gates, the GRU uses a single update gate that simultaneously controls what to forget and what to add. The relationship is that the GRU effectively couples the forget and input decisions: if the update gate outputs a value zt, then the fraction of old information retained is zt and the fraction of new information added is (1 - zt).
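In code, the coupled update is a single convex combination, as in this sketch (using the convention stated above; some presentations swap the roles of zt and 1 - zt):

```python
import numpy as np

def gru_state_update(z_t, h_prev, h_tilde):
    """GRU hidden-state update: one gate decides both what to keep and what to add."""
    return z_t * h_prev + (1.0 - z_t) * h_tilde

h_t = gru_state_update(np.array([0.9, 0.1]), np.array([1.0, 1.0]), np.array([0.0, 5.0]))
print(h_t)  # [0.9  4.6]
```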
| Feature | LSTM | GRU |
|---|---|---|
| Number of gates | 3 (forget, input, output) | 2 (update, reset) |
| Forget/input coupling | Independent | Coupled (ft = 1 - it) |
| Separate cell state | Yes | No (hidden state only) |
| Parameters | More (separate weights for each gate) | Fewer |
| Output gating | Yes (output gate controls visibility) | No |
| Computational cost | Higher | Lower |
| Performance | Generally similar; slight advantages on tasks requiring fine-grained memory control | Generally similar; slight advantages when data is limited |
The coupled forget-input gate is also available as an LSTM variant (sometimes called CIFG, for Coupled Input and Forget Gate). In this variant, the input gate is set to it = 1 - ft, so the LSTM only adds new information to positions in the cell state where old information is being discarded. Greff et al. (2017) found that this coupling did not significantly impair LSTM performance compared to independent gates.
Several LSTM variants modify how the forget gate operates.
In the standard LSTM, gate computations depend only on the current input and the previous hidden state. With peephole connections (Gers and Schmidhuber, 2000), the gates also receive the previous cell state as an additional input:
ft = σ(Wf xt + Uf ht-1 + Vf ct-1 + bf)
where Vf is a diagonal weight matrix connecting the cell state to the forget gate. This allows the forget gate to consider the actual stored values when deciding what to discard. However, empirical results from Greff et al. (2017) showed that peephole connections did not consistently improve performance.
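Because Vf is diagonal, the peephole term reduces to an element-wise product with a per-unit weight vector, as in this sketch (function and argument names are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def peephole_forget_gate(x_t, h_prev, c_prev, W_f, U_f, v_f, b_f):
    """Forget gate with a peephole: the gate also sees the previous cell state.

    v_f holds the diagonal of V_f, so the peephole term is element-wise.
    """
    return sigmoid(W_f @ x_t + U_f @ h_prev + v_f * c_prev + b_f)
```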
JANET removes both the input and output gates, retaining only the forget gate. The cell state update simplifies to:
ct = ft ⊙ ct-1 + (1 - ft) ⊙ c̃t
This effectively couples the forget and input mechanisms (similar to the GRU update gate) and produces the hidden state directly from the cell state without output gating.
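A sketch of one JANET step under these equations (the dictionary layout and names are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def janet_cell(x_t, h_prev, c_prev, W, U, b):
    """Forget-gate-only cell: coupled update, no output gate."""
    f_t = sigmoid(W['f'] @ x_t + U['f'] @ h_prev + b['f'])
    c_tilde = np.tanh(W['c'] @ x_t + U['c'] @ h_prev + b['c'])
    c_t = f_t * c_prev + (1.0 - f_t) * c_tilde
    return c_t, c_t  # the hidden state is the cell state itself
```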
For the forget gate bias, the JANET authors adopted chrono initialization (Tallec and Ollivier, 2018), in which each unit's initial bias is the logarithm of a value drawn uniformly from a specified range of timescales. This gives each unit a different initial "time constant", allowing some units to retain information over short periods and others over long periods from the start of training.
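A sketch of such an initialization (t_max is a hypothetical upper bound on the dependency lengths of interest):

```python
import numpy as np

def chrono_forget_bias(hidden_dim, t_max, rng=None):
    """Per-unit forget-gate biases giving a spread of initial time constants."""
    rng = rng or np.random.default_rng()
    # Each unit gets b_f = log(T) for T drawn uniformly from [1, t_max];
    # sigmoid(log(T)) = T / (T + 1), so larger T means stronger initial retention.
    return np.log(rng.uniform(1.0, t_max, size=hidden_dim))

print(chrono_forget_bias(4, t_max=100.0))
```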
Consider a language model processing the sentence: "The cat, which had been sleeping on the windowsill since morning, suddenly jumped."
As the model processes this sentence token by token, the forget gate at each step decides how much of the accumulated context to keep. When the model encounters "The cat," it stores information about the subject. During the long relative clause ("which had been sleeping on the windowsill since morning"), the forget gate for the subject-related memory cells should remain close to 1, preserving the fact that "cat" is the subject. Meanwhile, cells storing less relevant information (like specific words in the relative clause) can have their forget gates set closer to 0.
When the model reaches "suddenly jumped," it needs to recall that "cat" is the subject in order to predict the verb and complete the sentence correctly. The forget gate's ability to selectively preserve this information across a long intervening clause is what distinguishes LSTMs from standard RNNs on such tasks.
Another example comes from time series forecasting. When predicting stock prices, daily fluctuations might be noise that the forget gate learns to discard, while longer-term trends (quarterly earnings patterns, market cycles) are retained across many time steps.
The forget gate is active in every application that uses LSTM networks. Its ability to manage long-term dependencies is particularly important in domains such as language modeling, speech recognition, handwriting recognition, polyphonic music modeling, and time series forecasting.
The following NumPy implementation of a single LSTM cell step illustrates the forward pass and shows where the forget gate operates:
```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell(x_t, h_prev, c_prev, weights, biases):
    """One forward step of an LSTM cell.

    `weights` holds the input matrices W_f, W_i, W_c, W_o and the recurrent
    matrices U_f, U_i, U_c, U_o; `biases` holds b_f, b_i, b_c, b_o.
    """
    # Forget gate: how much of the previous cell state to keep
    f_t = sigmoid(weights['W_f'] @ x_t + weights['U_f'] @ h_prev + biases['b_f'])
    # Input gate: how much of the new candidate values to add
    i_t = sigmoid(weights['W_i'] @ x_t + weights['U_i'] @ h_prev + biases['b_i'])
    # Candidate memory
    c_tilde = np.tanh(weights['W_c'] @ x_t + weights['U_c'] @ h_prev + biases['b_c'])
    # Cell state update (forget gate applied here)
    c_t = f_t * c_prev + i_t * c_tilde
    # Output gate: which parts of the cell state are exposed
    o_t = sigmoid(weights['W_o'] @ x_t + weights['U_o'] @ h_prev + biases['b_o'])
    # Hidden state
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t
```
In PyTorch, the nn.LSTM module computes all of the gates internally. The forget gate bias can be initialized to 1 by modifying the bias parameters directly after creating the module:
```python
import torch.nn as nn

lstm = nn.LSTM(input_size=128, hidden_size=256)

# PyTorch packs each bias vector as [b_input, b_forget, b_cell, b_output].
# Note that nn.LSTM keeps two bias vectors per layer (bias_ih and bias_hh);
# setting the forget slice of both to 1 gives an effective forget bias of 2.
for name, param in lstm.named_parameters():
    if 'bias' in name:
        n = param.size(0)
        param.data[n // 4:n // 2].fill_(1.0)  # forget gate bias
```
Imagine your brain is like a toy box. Every day, you get new toys (information). But your toy box can only hold so many toys. The forget gate is like a helper who looks at your toys each day and decides which ones you still play with and which ones you have outgrown. The toys you still use stay in the box, and the ones you do not need anymore get removed to make room for new toys. Without this helper, your toy box would overflow and you would not be able to find anything. The forget gate keeps things organized so the important toys are always easy to find when you need them.