# Forget Gate

> Source: https://aiwiki.ai/wiki/forget_gate
> Updated: 2026-06-28
> Categories: Machine Learning, Neural Networks
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

The **forget gate** is a sigmoid layer inside a [Long Short-Term Memory (LSTM)](/wiki/long_short-term_memory_lstm) [recurrent neural network](/wiki/recurrent_neural_network) that decides, element by element, how much of the previous cell state to keep and how much to discard at each time step. It looks at the previous hidden state and the current input and outputs a value between 0 and 1 for every number in the cell state, where 1 means "completely keep this" and 0 means "completely get rid of this" [1][2]. The forget gate is one of the three LSTM gates (input, forget, and output) and is the component most responsible for letting an LSTM reset its memory, which is central to mitigating the [vanishing gradient problem](/wiki/vanishing_gradient_problem) over long sequences [3].

The forget gate was not part of the original 1997 LSTM. It was introduced in 1999 by Felix Gers, Jurgen Schmidhuber, and Fred Cummins in "Learning to Forget: Continual Prediction with LSTM" (journal version published in *Neural Computation* in 2000) to fix a specific failure: without a way to reset a memory cell, the internal state of an LSTM processing a continuous input stream could grow without bound and eventually cause the network to break down [4]. The forget gate gives each memory cell the ability to learn when to flush its contents, so the network can replace old information with new information over time. A 2017 large-scale study of LSTM variants concluded that "the forget gate and the output activation function are its most critical components" [3].

## What is the forget gate?

The forget gate is the first of the gating mechanisms an LSTM applies at each time step. In Christopher Olah's widely cited explainer "Understanding LSTM Networks," he describes it directly: "The first step in our LSTM is to decide what information we're going to throw away from the cell state. This decision is made by a sigmoid layer called the 'forget gate layer'" [1]. The gate produces one number per cell-state element, and "a 1 represents 'completely keep this' while a 0 represents 'completely get rid of this'" [1].

Because the output is bounded to the open interval (0, 1) by the [sigmoid activation function](/wiki/sigmoid_function), the forget gate acts as a learned, per-element multiplier on memory. Values near 1 preserve a piece of stored information across the time step; values near 0 erase it. This soft, differentiable gating (rather than a hard on/off switch) is what lets the forget gate be trained end to end by [backpropagation](/wiki/backpropagation) through time.

## How does the forget gate work?

The forget gate receives two inputs at each time step *t*: the hidden state from the previous time step (*h*<sub>t-1</sub>) and the current input (*x*<sub>t</sub>). These are concatenated and multiplied by a learned weight matrix, with a bias term added. The result is passed through a [sigmoid activation function](/wiki/sigmoid_function), which squashes the output to a value between 0 and 1 for each element in the cell state vector.

The mathematical formulation is:

**f**<sub>t</sub> = σ(**W**<sub>f</sub> **x**<sub>t</sub> + **U**<sub>f</sub> **h**<sub>t-1</sub> + **b**<sub>f</sub>)

This is often written in the equivalent concatenated form **f**<sub>t</sub> = σ(**W**<sub>f</sub> · [**h**<sub>t-1</sub>, **x**<sub>t</sub>] + **b**<sub>f</sub>), where the two weight matrices are stacked into a single matrix **W**<sub>f</sub> [1]. The symbols are:

| Symbol | Meaning |
|---|---|
| **f**<sub>t</sub> | Forget gate activation vector at time step *t* |
| σ | [Sigmoid function](/wiki/sigmoid_function), outputting values in (0, 1) |
| **W**<sub>f</sub> | Weight matrix for the current input |
| **U**<sub>f</sub> | Weight matrix for the previous hidden state |
| **h**<sub>t-1</sub> | Hidden state from the previous time step |
| **x**<sub>t</sub> | Input vector at the current time step |
| **b**<sub>f</sub> | Bias vector for the forget gate |

The output **f**<sub>t</sub> is then used in the cell state update through element-wise (Hadamard) multiplication with the previous cell state:

**c**<sub>t</sub> = **f**<sub>t</sub> ⊙ **c**<sub>t-1</sub> + **i**<sub>t</sub> ⊙ **c̃**<sub>t</sub>

Here, **i**<sub>t</sub> is the input gate activation, **c̃**<sub>t</sub> is the candidate cell state (produced by a tanh layer), and ⊙ denotes element-wise multiplication. When a particular element of **f**<sub>t</sub> is close to 0, the corresponding value in the cell state is effectively erased. When it is close to 1, the value is carried forward almost unchanged.
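As a small numeric illustration of the update above, the following NumPy sketch applies a hand-picked forget gate vector to a three-element cell state. All values and variable names are made up for illustration and do not come from a trained network.

```python
import numpy as np

# Hypothetical values chosen by hand for illustration.
c_prev = np.array([2.0, -1.5, 0.8])    # previous cell state c_{t-1}
f_t = np.array([0.99, 0.01, 0.5])      # forget gate: keep, erase, halve
i_t = np.array([0.0, 1.0, 0.5])        # input gate
c_tilde = np.array([0.3, 0.7, -0.4])   # candidate cell state

# Cell state update: element-wise retention plus gated new content.
c_t = f_t * c_prev + i_t * c_tilde
print(c_t)
```

The first element is carried forward almost unchanged (1.98), the second is almost entirely replaced by the candidate value (0.685), and the third is an even blend of old and new content (0.2).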

## Role within the LSTM architecture

The standard LSTM cell contains three gates and one candidate memory computation. The forget gate operates alongside these other components, and each plays a distinct role in regulating information flow.

| Component | Function | Activation |
|---|---|---|
| Forget gate | Decides how much of the previous cell state to retain | [Sigmoid](/wiki/sigmoid_function) |
| Input gate | Controls how much of the new candidate values to add to the cell state | [Sigmoid](/wiki/sigmoid_function) |
| Candidate memory (**c̃**<sub>t</sub>) | Proposes new values to potentially add to the cell state | tanh |
| Output gate | Determines which parts of the cell state are exposed as the hidden state | [Sigmoid](/wiki/sigmoid_function) |
| Cell state (**c**<sub>t</sub>) | Carries information across time steps; updated by the forget and input gates | None (linear) |

The full set of LSTM equations at each time step is:

1. **Forget gate:** **f**<sub>t</sub> = σ(**W**<sub>f</sub> **x**<sub>t</sub> + **U**<sub>f</sub> **h**<sub>t-1</sub> + **b**<sub>f</sub>)
2. **Input gate:** **i**<sub>t</sub> = σ(**W**<sub>i</sub> **x**<sub>t</sub> + **U**<sub>i</sub> **h**<sub>t-1</sub> + **b**<sub>i</sub>)
3. **Candidate memory:** **c̃**<sub>t</sub> = tanh(**W**<sub>c</sub> **x**<sub>t</sub> + **U**<sub>c</sub> **h**<sub>t-1</sub> + **b**<sub>c</sub>)
4. **Cell state update:** **c**<sub>t</sub> = **f**<sub>t</sub> ⊙ **c**<sub>t-1</sub> + **i**<sub>t</sub> ⊙ **c̃**<sub>t</sub>
5. **Output gate:** **o**<sub>t</sub> = σ(**W**<sub>o</sub> **x**<sub>t</sub> + **U**<sub>o</sub> **h**<sub>t-1</sub> + **b**<sub>o</sub>)
6. **Hidden state:** **h**<sub>t</sub> = **o**<sub>t</sub> ⊙ tanh(**c**<sub>t</sub>)

The cell state update in step 4 is where the forget gate has its direct effect. It acts as a learned multiplier on the previous cell state, controlling the degree to which past information persists.

## Why do LSTMs need a forget gate?

The development of the forget gate was a direct response to limitations discovered in the original LSTM design.

### The original LSTM (1997)

Sepp Hochreiter and Jurgen Schmidhuber introduced the LSTM in 1997 to address the [vanishing gradient problem](/wiki/vanishing_gradient_problem) that plagued standard [recurrent neural networks (RNNs)](/wiki/recurrent_neural_network) [5]. Their design centered on the "Constant Error Carousel" (CEC), a self-connected recurrent edge with a fixed weight of 1. This allowed error signals to flow backward through time without exponential decay, enabling the network to learn dependencies across hundreds or even thousands of time steps. The original architecture included input and output gates, but no forget gate. Once information was stored in a memory cell, it stayed there indefinitely: new input could be added to the cell state through the input gate, but no mechanism could scale the stored value back down.

### The forget gate addition (1999-2000)

Gers, Schmidhuber, and Cummins published "Learning to Forget: Continual Prediction with LSTM" at the 1999 International Conference on Artificial Neural Networks (ICANN), with an expanded journal version appearing in *Neural Computation* in 2000 (volume 12, issue 10, pages 2451-2471) [4]. They identified that the original LSTM architecture failed on tasks involving continuous input streams that were not segmented into subsequences with clearly marked endpoints. As they put it, in such cases "the internal values of the cells may grow without bound" and the network can eventually break down [4]. The forget gate, which they also called a "keep gate," gives the cell an "adaptive" way to "learn to reset itself at appropriate times, thus releasing internal resources" [4].

The forget gate solved this by giving each memory cell the ability to learn when to clear its contents. In their experiments, standard LSTM and other [recurrent neural network](/wiki/recurrent_neural_network) algorithms failed on continual versions of benchmark problems, while LSTM with forget gates solved them reliably [4].

### Peephole connections (2000)

Gers and Schmidhuber introduced peephole connections in 2000 in the paper "Recurrent Nets that Time and Count" [9]. In standard LSTM, the gates receive input from the current input and the previous hidden state, but they cannot directly observe the cell state. Peephole connections add direct connections from the cell state to each gate, allowing gates to make decisions based on the actual stored values. This modification was intended to improve the network's ability to learn precise timing. However, a large-scale empirical study by Greff et al. (2017) found that peephole connections did not provide consistent performance improvements across tasks [3].

### Timeline of LSTM development

| Year | Development | Authors |
|---|---|---|
| 1997 | Original LSTM with input and output gates, Constant Error Carousel | Hochreiter and Schmidhuber |
| 1999 | Forget gate introduced | Gers, Schmidhuber, and Cummins |
| 2000 | Peephole connections added | Gers and Schmidhuber |
| 2014 | [Gated Recurrent Unit (GRU)](/wiki/recurrent_neural_network) proposed as a simpler alternative | Cho et al. |
| 2015 | Empirical study recommending forget gate bias initialization to 1 | Jozefowicz, Zaremba, and Sutskever |
| 2017 | Large-scale comparison of 8 LSTM variants; forget gate and output activation function identified as the most critical components | Greff et al. |
| 2018 | "Unreasonable effectiveness of the forget gate" demonstrated | van der Westhuizen and Lasenby |

## How does the forget gate affect gradient flow?

The forget gate plays a direct role in how [gradients](/wiki/gradient) flow through the LSTM during [backpropagation](/wiki/backpropagation) through time (BPTT). To understand this, consider the gradient of the loss with respect to the cell state at a previous time step.

During backpropagation, the partial derivative of the cell state at time *t* with respect to the cell state at time *t-1* is:

∂**c**<sub>t</sub> / ∂**c**<sub>t-1</sub> = **f**<sub>t</sub> + (other terms involving gate derivatives)

The key insight is that the dominant term in this derivative is the forget gate value **f**<sub>t</sub> itself. In a standard RNN, the analogous derivative involves multiplying by the derivative of a saturating nonlinearity (such as tanh), which typically produces values less than 1. Over many time steps, these multiplications cause the gradient to shrink exponentially, producing the vanishing gradient problem.

In an LSTM, the gradient through the cell state is multiplied by the forget gate activation at each step. If the forget gate learns to output values close to 1 for time steps where long-term information should be preserved, the gradient passes through with minimal attenuation. This generalizes Hochreiter and Schmidhuber's Constant Error Carousel: the original CEC fixed this multiplier at exactly 1, whereas the forget gate makes it a learned quantity. The cell state acts as a highway for gradient flow, and the forget gate is the toll operator deciding how much signal gets through.

When the forget gate is close to 0, the gradient is attenuated at that time step, which is the desired behavior since it corresponds to information that the network has chosen to discard. This selective gradient gating is what allows LSTMs to learn dependencies spanning hundreds or thousands of time steps, something that standard RNNs cannot do [5].
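The effect of repeated multiplication can be made concrete with a rough numeric sketch. It considers only the direct **f**<sub>t</sub> term of the derivative, ignores the other terms, and assumes (purely for illustration) that the gate takes the same value at every step. The helper name `path_gradient` is hypothetical.

```python
import numpy as np

def path_gradient(factor, steps):
    """Product of identical per-step factors along the cell-state path."""
    return float(np.prod(np.full(steps, factor)))

steps = 100
keep = path_gradient(0.99, steps)     # forget gate held near 1
partial = path_gradient(0.5, steps)   # forget gate held at 0.5

print(f"f = 0.99 over {steps} steps: {keep:.3f}")
print(f"f = 0.50 over {steps} steps: {partial:.3e}")
```

With a gate value of 0.99, roughly a third of the gradient survives 100 steps, while a gate value of 0.5 shrinks it to a negligible magnitude.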

## Why does forget gate bias initialization matter?

A practical detail with significant impact on LSTM training is the initialization of the forget gate bias (**b**<sub>f</sub>). If the bias is initialized to 0 or a small random value, the sigmoid output starts near 0.5, which means the network begins training by partially forgetting all previous cell states. This can reintroduce vanishing gradient issues early in training, before the network has learned which information to keep.

Gers et al. (2000) noted in their original forget gate paper that setting the initial bias to a positive value (such as 1) could improve performance [4]. Jozefowicz, Zaremba, and Sutskever (2015) confirmed this empirically in their paper "An Empirical Exploration of Recurrent Network Architectures" [6]. They found that initializing the forget gate bias to 1 substantially improved LSTM performance, and stated plainly that "adding a bias of 1 to the LSTM's forget gate closes the gap between the LSTM and the GRU" [6]. Their evaluation covered over 10,000 different RNN architectures, and they recommended this initialization for every LSTM implementation.

The rationale is straightforward: a bias of 1 shifts the sigmoid input so that the initial forget gate output is about 0.73 rather than 0.5, meaning the network starts by retaining most of the stored information. The network then learns during training which cells should forget and when. Framework support varies: the Keras LSTM layer in [TensorFlow](/wiki/tensorflow) applies this initialization by default through its `unit_forget_bias` option, whereas `nn.LSTM` in [PyTorch](/wiki/pytorch) initializes all biases the same way, so the forget gate bias has to be set manually.

| Bias initialization | Initial forget gate output | Effect on training |
|---|---|---|
| 0 | ~0.5 | Partial forgetting from the start; can cause vanishing gradients early in training |
| 1 | ~0.73 | Mostly retains information; network learns to forget as needed |
| 2 | ~0.88 | Strong retention; useful for tasks requiring very long memory spans |
| 5 | ~0.99 | Nearly always retains; gate essentially disabled until training adjusts it |
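The initial gate outputs in the table can be reproduced with a few lines of Python. This sketch assumes the weighted inputs contribute roughly zero at initialization, so the gate's pre-activation equals the bias.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Initial forget gate output for each bias value, assuming the
# weighted input terms are approximately zero at initialization.
initial_outputs = {b: round(sigmoid(b), 2) for b in (0, 1, 2, 5)}
print(initial_outputs)
```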

## How important is the forget gate, empirically?

Several large-scale empirical studies have investigated which components of the LSTM architecture are most important.

### Greff et al. (2017): "LSTM: A Search Space Odyssey"

Greff et al. tested eight LSTM variants across three tasks (speech recognition, handwriting recognition, and polyphonic music modeling) in approximately 5,400 experimental runs totaling roughly 15 years of CPU time, the largest study of its kind at the time [3]. Their central finding was that the **forget gate and the output activation function are the most critical components** of the LSTM. Removing the forget gate consistently degraded performance across all datasets, while removing or modifying other components (such as peephole connections or the input gate) had less consistent effects. None of the eight variants significantly outperformed the standard LSTM architecture [3].

### van der Westhuizen and Lasenby (2018): "The unreasonable effectiveness of the forget gate"

This study went further by proposing JANET (Just Another NETwork), an LSTM variant that uses **only the forget gate**, removing both the input gate and the output gate entirely [7]. Despite this radical simplification, JANET matched or outperformed the standard LSTM on multiple benchmark datasets. On MNIST it reached 99% accuracy versus 98.5% for the standard LSTM, and on permuted MNIST (pMNIST) it reached 92.5% versus 91% [7]. JANET also needs roughly half the parameters of a standard LSTM, because it keeps two of the four weight blocks (forget gate and candidate memory), and correspondingly less computation. The authors attributed this success to the fact that a forget-gate-only architecture, combined with chrono initialization of the forget gate bias, creates implicit skip connections that facilitate gradient flow.

### Jozefowicz et al. (2015): "An Empirical Exploration of Recurrent Network Architectures"

In an evaluation of over 10,000 RNN architectures, Jozefowicz et al. found that the most impactful single modification to the standard LSTM was initializing the forget gate bias to 1 [6]. This change alone was sufficient to make the LSTM competitive with the GRU, which had been performing better in several benchmarks.

## Comparison with GRU gates

The [Gated Recurrent Unit (GRU)](/wiki/recurrent_neural_network), proposed by Cho et al. in 2014, takes a different approach to gating [8]. Instead of separate forget and input gates, the GRU uses a single **update gate** that simultaneously controls what to forget and what to add. The GRU effectively couples the forget and input decisions: if the update gate outputs a value *z*<sub>t</sub>, then, in the convention of Cho et al., the fraction of old information retained is *z*<sub>t</sub> and the fraction of new information added is (1 - *z*<sub>t</sub>). Some later formulations swap the roles of *z*<sub>t</sub> and 1 - *z*<sub>t</sub>.

| Feature | LSTM (forget gate) | GRU (update gate) |
|---|---|---|
| Number of gates | 3 (forget, input, output) | 2 (update, reset) |
| Forget/input coupling | Independent | Coupled (f<sub>t</sub> = 1 - i<sub>t</sub>) |
| Separate cell state | Yes | No (hidden state only) |
| Parameters | More (separate weights for each gate) | Fewer |
| Output gating | Yes (output gate controls visibility) | No |
| Computational cost | Higher | Lower |
| Performance | Generally similar; slight advantages on tasks requiring fine-grained memory control | Generally similar; slight advantages when data is limited |

The coupled forget-input gate is also available as an LSTM variant (sometimes called CIFG, for Coupled Input and Forget Gate). In this variant, the input gate is set to **i**<sub>t</sub> = 1 - **f**<sub>t</sub>, so the LSTM only adds new information to positions in the cell state where old information is being discarded. Greff et al. (2017) found that this coupling did not significantly impair LSTM performance compared to independent gates [3].

## Variants involving the forget gate

Several LSTM variants modify how the forget gate operates.

### Peephole connections

In the standard LSTM, gate computations depend only on the current input and the previous hidden state. With peephole connections (Gers and Schmidhuber, 2000), the gates also receive the previous cell state as an additional input:

**f**<sub>t</sub> = σ(**W**<sub>f</sub> **x**<sub>t</sub> + **U**<sub>f</sub> **h**<sub>t-1</sub> + **V**<sub>f</sub> **c**<sub>t-1</sub> + **b**<sub>f</sub>)

where **V**<sub>f</sub> is a diagonal weight matrix connecting the cell state to the forget gate. This allows the forget gate to consider the actual stored values when deciding what to discard. However, empirical results from Greff et al. (2017) showed that peephole connections did not consistently improve performance [3].
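Because **V**<sub>f</sub> is diagonal, the peephole term reduces to an element-wise product. The following NumPy sketch shows one way to write this, storing only the diagonal as a vector `v_f`. The function name, argument names, and random test values are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def peephole_forget_gate(x_t, h_prev, c_prev, W_f, U_f, v_f, b_f):
    """Forget gate with a peephole term; v_f holds the diagonal of V_f."""
    return sigmoid(W_f @ x_t + U_f @ h_prev + v_f * c_prev + b_f)

# Random values purely for demonstration.
rng = np.random.default_rng(0)
n_in, n_hid = 4, 3
f_t = peephole_forget_gate(
    x_t=rng.standard_normal(n_in),
    h_prev=rng.standard_normal(n_hid),
    c_prev=rng.standard_normal(n_hid),
    W_f=rng.standard_normal((n_hid, n_in)),
    U_f=rng.standard_normal((n_hid, n_hid)),
    v_f=rng.standard_normal(n_hid),
    b_f=np.ones(n_hid),
)
print(f_t)
```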

### JANET (forget-gate-only LSTM)

JANET removes both the input and output gates, retaining only the forget gate. The cell state update simplifies to:

**c**<sub>t</sub> = **f**<sub>t</sub> ⊙ **c**<sub>t-1</sub> + (1 - **f**<sub>t</sub>) ⊙ **c̃**<sub>t</sub>

This effectively couples the forget and input mechanisms (similar to the GRU update gate) and produces the hidden state directly from the cell state without output gating [7].
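A single JANET step can be sketched in NumPy as follows. This is a minimal illustration that assumes the hidden state is simply set equal to the cell state, as described above. The function and argument names are hypothetical and not taken from the authors' code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def janet_step(x_t, h_prev, c_prev, W_f, U_f, b_f, W_c, U_c, b_c):
    """One forget-gate-only step; returns (h_t, c_t) with h_t equal to c_t."""
    f_t = sigmoid(W_f @ x_t + U_f @ h_prev + b_f)        # forget gate
    c_tilde = np.tanh(W_c @ x_t + U_c @ h_prev + b_c)    # candidate memory
    c_t = f_t * c_prev + (1.0 - f_t) * c_tilde           # coupled update
    return c_t, c_t

# With all weights and biases at zero, f_t = 0.5 and the candidate is 0,
# so the cell state is simply halved.
n_in, n_hid = 3, 2
zeros_in = np.zeros((n_hid, n_in))
zeros_hid = np.zeros((n_hid, n_hid))
h_t, c_t = janet_step(
    np.zeros(n_in), np.zeros(n_hid), np.array([2.0, -4.0]),
    zeros_in, zeros_hid, np.zeros(n_hid),
    zeros_in, zeros_hid, np.zeros(n_hid),
)
print(c_t)
```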

### Chrono initialization

Chrono initialization was introduced by Tallec and Ollivier (2018) in "Can Recurrent Neural Networks Warp Time?" and adopted by van der Westhuizen and Lasenby for JANET [7]. Each unit's initial forget gate bias is set to the logarithm of a value drawn uniformly from a specified range, typically between 1 and one less than the longest dependency length expected in the data. This gives each unit a different initial "time constant," allowing some units to retain information over short periods and others over long periods from the start of training.
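A minimal sketch of this bias initialization, assuming the form b<sub>f</sub> = log(u) with u drawn uniformly from [1, T<sub>max</sub> - 1] as in Tallec and Ollivier's formulation of chrono initialization. The function name and the `t_max` value are illustrative assumptions.

```python
import numpy as np

def chrono_forget_bias(hidden_size, t_max, rng):
    """Sample one forget gate bias per unit as log(u), u ~ U(1, t_max - 1)."""
    return np.log(rng.uniform(1.0, t_max - 1.0, size=hidden_size))

rng = np.random.default_rng(0)
b_f = chrono_forget_bias(hidden_size=8, t_max=1000, rng=rng)

# Initial gate output per unit if the weighted inputs are near zero.
# Since sigmoid(log(u)) = u / (1 + u), units with larger u start closer to 1.
initial_f = 1.0 / (1.0 + np.exp(-b_f))
print(b_f)
print(initial_f)
```

Because each unit receives a different bias, the initial retention factors are spread between 0.5 and values very close to 1, which corresponds to a spread of memory time scales.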

## Intuitive example: language modeling

Consider a [language model](/wiki/language_model) processing the sentence: "The cat, which had been sleeping on the windowsill since morning, suddenly jumped."

As the model processes this sentence token by token, the forget gate at each step decides how much of the accumulated context to keep. When the model encounters "The cat," it stores information about the subject. During the long relative clause ("which had been sleeping on the windowsill since morning"), the forget gate for the subject-related memory cells should remain close to 1, preserving the fact that "cat" is the subject. Meanwhile, cells storing less relevant information (like specific words in the relative clause) can have their forget gates set closer to 0.

When the model reaches "suddenly jumped," it needs to recall that "cat" is the subject in order to predict a verb that fits it. The forget gate's ability to selectively preserve this information across a long intervening clause is what distinguishes LSTMs from standard RNNs on such tasks.

Another example comes from [time series](/wiki/time_series_analysis) forecasting. When predicting stock prices, daily fluctuations might be noise that the forget gate learns to discard, while longer-term trends (quarterly earnings patterns, market cycles) are retained across many time steps.

## Applications

The forget gate is active in every application that uses LSTM networks. Some notable domains where the forget gate's ability to manage long-term dependencies is particularly important include:

- **[Natural language processing](/wiki/natural_language_understanding):** [Machine translation](/wiki/machine_translation), [sentiment analysis](/wiki/sentiment_analysis), [text summarization](/wiki/text_summarization), and [language modeling](/wiki/language_model) all require tracking dependencies across variable-length sequences. The forget gate allows the model to clear irrelevant context while preserving syntactic and semantic information needed later in the sequence.
- **[Speech recognition](/wiki/speech_recognition):** Google deployed deep LSTM recurrent acoustic models (trained with connectionist temporal classification) for large-vocabulary speech recognition and Google Voice Search around 2014-2015 [10]. The forget gate enables the model to retain phonological context over variable-length utterances while discarding acoustic noise.
- **[Time series](/wiki/time_series_analysis) forecasting:** Financial prediction, weather forecasting, and energy demand prediction rely on LSTMs to capture both short-term and long-term patterns. The forget gate allows the model to flush outdated observations while retaining seasonal or cyclical patterns.
- **Music generation:** Modeling musical sequences requires remembering key signatures, chord progressions, and melodic motifs over extended passages. The forget gate manages the balance between retaining structural elements and adapting to new musical phrases.
- **Handwriting recognition:** Recognizing handwritten text requires the model to process strokes sequentially. The forget gate clears completed character information while retaining positional context.

## Implementation example

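The following NumPy sketch illustrates the forward pass of a single LSTM cell, showing where the forget gate operates. The `weights` and `biases` dictionaries and their key names are an illustrative convention, not a library API:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell(x_t, h_prev, c_prev, weights, biases):
    # weights: dict with keys "W_f", "U_f", "W_i", "U_i", "W_c", "U_c", "W_o", "U_o"
    # biases: dict with keys "b_f", "b_i", "b_c", "b_o"
    w, b = weights, biases

    # Forget gate: per-element retention factor in (0, 1)
    f_t = sigmoid(w["W_f"] @ x_t + w["U_f"] @ h_prev + b["b_f"])

    # Input gate: how much of the candidate to write
    i_t = sigmoid(w["W_i"] @ x_t + w["U_i"] @ h_prev + b["b_i"])

    # Candidate memory: proposed new values in (-1, 1)
    c_tilde = np.tanh(w["W_c"] @ x_t + w["U_c"] @ h_prev + b["b_c"])

    # Cell state update (forget gate applied here)
    c_t = f_t * c_prev + i_t * c_tilde

    # Output gate: how much of the cell state to expose
    o_t = sigmoid(w["W_o"] @ x_t + w["U_o"] @ h_prev + b["b_o"])

    # Hidden state
    h_t = o_t * np.tanh(c_t)

    return h_t, c_t
```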

In [PyTorch](/wiki/pytorch), the LSTM module handles this internally. The forget gate bias can be initialized to 1 by accessing the bias parameters directly after creating the module:

```python
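import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=128, hidden_size=256)

# PyTorch stores each bias vector in gate order [input, forget, cell, output].
# Every layer has two bias vectors (bias_ih and bias_hh) that are added together,
# so set the forget slice to 1 in bias_ih and to 0 in bias_hh.
# Filling both with 1 would give an effective forget gate bias of 2.
with torch.no_grad():
    for name, param in lstm.named_parameters():
        if "bias" in name:
            n = param.size(0)
            param[n // 4 : n // 2].fill_(1.0 if "bias_ih" in name else 0.0)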
```

## Explain like I'm 5 (ELI5)

Imagine your brain is like a toy box. Every day, you get new toys (information). But your toy box can only hold so many toys. The forget gate is like a helper who looks at your toys each day and decides which ones you still play with and which ones you have outgrown. The toys you still use stay in the box, and the ones you do not need anymore get removed to make room for new toys. Without this helper, your toy box would overflow and you would not be able to find anything. The forget gate keeps things organized so the important toys are always easy to find when you need them.

## See also

- [Long Short-Term Memory (LSTM)](/wiki/long_short-term_memory_lstm)
- [Recurrent neural network](/wiki/recurrent_neural_network)
- [Sigmoid function](/wiki/sigmoid_function)
- [Vanishing gradient problem](/wiki/vanishing_gradient_problem)
- [Backpropagation](/wiki/backpropagation)

## References

1. Olah, C. (2015). "Understanding LSTM Networks." *colah's blog*. https://colah.github.io/posts/2015-08-Understanding-LSTMs/
2. Zhang, A., Lipton, Z. C., Li, M., & Smola, A. J. (2023). "Dive into Deep Learning" (chapter on Long Short-Term Memory). Cambridge University Press. https://d2l.ai/
3. Greff, K., Srivastava, R. K., Koutnik, J., Steunebrink, B. R., & Schmidhuber, J. (2017). "LSTM: A Search Space Odyssey." *IEEE Transactions on Neural Networks and Learning Systems*, 28(10), 2222-2232. https://arxiv.org/abs/1503.04069
4. Gers, F. A., Schmidhuber, J., & Cummins, F. (2000). "Learning to Forget: Continual Prediction with LSTM." *Neural Computation*, 12(10), 2451-2471. https://direct.mit.edu/neco/article/12/10/2451/6415
5. Hochreiter, S., & Schmidhuber, J. (1997). "Long Short-Term Memory." *Neural Computation*, 9(8), 1735-1780.
6. Jozefowicz, R., Zaremba, W., & Sutskever, I. (2015). "An Empirical Exploration of Recurrent Network Architectures." *Proceedings of the 32nd International Conference on Machine Learning (ICML)*. https://proceedings.mlr.press/v37/jozefowicz15.html
7. van der Westhuizen, J., & Lasenby, J. (2018). "The Unreasonable Effectiveness of the Forget Gate." *arXiv preprint arXiv:1804.04849*. https://arxiv.org/abs/1804.04849
8. Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., & Bengio, Y. (2014). "Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation." *Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)*.
9. Gers, F. A., & Schmidhuber, J. (2000). "Recurrent Nets that Time and Count." *Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks (IJCNN)*, 3, 189-194.
10. Sak, H., Senior, A., & Beaufays, F. (2014). "Long Short-Term Memory Recurrent Neural Network Architectures for Large Scale Acoustic Modeling." *Proceedings of INTERSPEECH*.
11. Gers, F. A., Schraudolph, N. N., & Schmidhuber, J. (2002). "Learning Precise Timing with LSTM Recurrent Networks." *Journal of Machine Learning Research*, 3, 115-143.
12. Goodfellow, I., Bengio, Y., & Courville, A. (2016). *Deep Learning*. MIT Press. https://www.deeplearningbook.org/