See also: Machine learning terms
Long Short-Term Memory (LSTM) is a type of recurrent neural network architecture designed to learn long-range dependencies in sequential data. It was introduced by Sepp Hochreiter and Jürgen Schmidhuber in a 1997 paper in Neural Computation, and it became the dominant approach to sequence modeling from roughly 2013 to 2018, when it was largely supplanted in language tasks by the Transformer. LSTM cells maintain a separate cell state (a long-term memory) and use small learned gates to decide what to store, what to forget, and what to read out at each time step. That structure gives the network an additive, gradient-friendly path through time, which is the main reason LSTMs can train on sequences thousands of steps long without the gradients collapsing the way plain recurrent networks do.
Despite the rise of attention-based models, LSTMs remain in active use for speech recognition front-ends, on-device inference, time series, control policies, and any setting where strict left-to-right semantics and constant per-step compute matter more than raw throughput on a GPU. The architecture also enjoyed a small revival in 2024 with the publication of xLSTM, which scaled the original idea to billions of parameters and put it back in conversation with Mamba and modern state space models.
The story of LSTM starts with a problem rather than a solution. In April 1991, Sepp Hochreiter, then a master's student at the Technical University of Munich, submitted his diploma thesis Untersuchungen zu dynamischen neuronalen Netzen ("Investigations into Dynamic Neural Networks"). The thesis contains the first detailed analysis of what is now called the vanishing gradient problem: when backpropagation through time is unrolled across many steps, the gradient of an early input with respect to a late output decays (or, less often, blows up) exponentially with the number of steps in between. Hochreiter showed this both empirically and analytically. The thesis was written in German and was not widely read at the time, which delayed the field's recognition of the problem by several years.
Hochreiter and his advisor Jürgen Schmidhuber kept working on the issue and eventually proposed a fix in the 1997 paper "Long Short-Term Memory" in Neural Computation 9(8):1735-1780. Their core idea was the constant error carousel: a self-connected linear unit, with weight 1.0 on its self-loop, that lets the gradient flow backward through time without being multiplied by anything that could shrink or explode it. To control what gets written to and read from this carousel, they added two multiplicative "gates": an input gate and an output gate.
This original 1997 cell had no forget gate. Once a value was written into the cell state, it stayed there until the input gate wrote on top of it. That worked in benchmark tasks with clean sequence boundaries, but it caused problems on continuous streams where the cell state would drift and eventually saturate. Felix Gers, Jürgen Schmidhuber, and Fred Cummins fixed this in "Learning to Forget: Continual Prediction with LSTM" (Neural Computation 12(10):2451-2471, 2000), adding a third gate that learns when to reset the cell. Almost every modern reference to "the LSTM" actually means this 2000 variant.
The same year, Gers and Schmidhuber added peephole connections, which let the gates inspect the cell state directly when deciding whether to open or close. From there the architecture branched in many directions: bidirectional LSTM (Graves & Schmidhuber, 2005), the simpler GRU (Cho et al., 2014), ConvLSTM for spatiotemporal grids (Shi et al., 2015), and many more. By the mid 2010s LSTM was the default sequence model in deep learning, powering Google Voice Search, Apple's Siri dictation, Google Translate, and most academic NLP papers. The 2017 publication of the Transformer in "Attention Is All You Need" started a fast migration toward attention-based models, but LSTM stayed in the toolbox and, with xLSTM (Beck et al., 2024), came back into research focus.
A standard recurrent neural network updates a hidden state at each time step with something like h_t = tanh(W x_t + U h_{t-1} + b). When you train it with backpropagation through time, the gradient of the loss with respect to an early hidden state involves a long product of Jacobians of that recurrence. If those Jacobians have spectral radius below 1, the product shrinks toward zero (vanishing gradient); if above 1, it explodes (exploding gradient). Hochreiter's 1991 thesis worked the math out carefully, and the result is that vanilla RNNs cannot reliably learn dependencies that span more than 10 to 20 steps.
Gradient clipping helps with the explosion case, since you can cap the norm of the gradient and keep training stable. There is no equally simple fix for the vanishing case in a plain RNN; the gradient that you need to learn from has actually become numerical noise by the time it reaches the early steps. LSTM attacks the vanishing problem at the source by changing what is being multiplied. Instead of squashing every step through a tanh and a weight matrix, the cell state is updated by an additive expression: c_t = f_t * c_{t-1} + i_t * c_tilde_t. When the forget gate f_t is close to 1, the gradient of c_t with respect to c_{t-1} is also close to 1, and the chain of derivatives along the cell state path stays well behaved across hundreds or thousands of steps. The gates themselves still have to be learned, and they can saturate, but the additive cell-state path is what gives LSTM its name and its power.
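To see the decay concretely, the following sketch (an illustration added here, not taken from any cited paper) runs a plain tanh recurrence with no LSTM machinery and measures the gradient of a late hidden state with respect to the first input. The hidden size and the 0.5 weight scale are arbitrary choices; scaling the weights up instead flips the problem from vanishing to exploding gradients.

```python
import torch

def early_input_grad_norm(T, n=32):
    # Recurrent weights with spectral radius well below 1, so products shrink.
    W = 0.5 * torch.randn(n, n) / n ** 0.5
    x0 = torch.randn(n, requires_grad=True)   # the "early input" we differentiate against
    h = torch.tanh(x0)
    for _ in range(T - 1):
        h = torch.tanh(W @ h)                 # h_t = tanh(W h_{t-1}), no later inputs
    return torch.autograd.grad(h.sum(), x0)[0].norm().item()

for T in (5, 20, 50, 100):
    print(T, early_input_grad_norm(T))        # the norm collapses as T grows
```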
A modern LSTM cell, as it appears in textbooks and in deep learning libraries, has six computations per time step. Let x_t be the input vector at step t, h_{t-1} the previous hidden state, and c_{t-1} the previous cell state. Let sigma denote the sigmoid function and tanh the hyperbolic tangent. The cell computes:
| step | name | equation | role |
|---|---|---|---|
| 1 | forget gate | f_t = sigma(W_f [h_{t-1}, x_t] + b_f) | how much of the old cell state to keep |
| 2 | input gate | i_t = sigma(W_i [h_{t-1}, x_t] + b_i) | how much of the new candidate to write |
| 3 | candidate | c_tilde_t = tanh(W_c [h_{t-1}, x_t] + b_c) | proposed new content |
| 4 | cell update | c_t = f_t * c_{t-1} + i_t * c_tilde_t | additive memory update |
| 5 | output gate | o_t = sigma(W_o [h_{t-1}, x_t] + b_o) | how much of the cell to expose |
| 6 | hidden output | h_t = o_t * tanh(c_t) | what the rest of the network sees |
The brackets [h_{t-1}, x_t] denote vector concatenation; * denotes elementwise multiplication. Each gate is a small linear layer followed by a sigmoid, so its output is a vector of values between 0 and 1 that act as soft binary masks over the cell-state coordinates. The candidate vector c_tilde_t lives in (-1, 1) thanks to the tanh, and the gates decide how much of it to add and how much of the previous c_{t-1} to keep. The hidden state h_t is a gated, squashed view of the cell state, and it is the only thing other layers (or the next time step) can see.
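The six equations translate almost line for line into code. The sketch below is a minimal, unoptimized PyTorch version with concatenated weight matrices of shape (hidden, hidden + input); variable names follow the table, and it is illustrative rather than a drop-in replacement for a library cell.

```python
import torch

def lstm_step(x_t, h_prev, c_prev, W_f, b_f, W_i, b_i, W_c, b_c, W_o, b_o):
    """One LSTM time step, following steps 1-6 of the table above."""
    hx = torch.cat([h_prev, x_t], dim=-1)        # [h_{t-1}, x_t]
    f_t = torch.sigmoid(hx @ W_f.T + b_f)        # 1. forget gate
    i_t = torch.sigmoid(hx @ W_i.T + b_i)        # 2. input gate
    c_tilde = torch.tanh(hx @ W_c.T + b_c)       # 3. candidate content
    c_t = f_t * c_prev + i_t * c_tilde           # 4. additive cell update
    o_t = torch.sigmoid(hx @ W_o.T + b_o)        # 5. output gate
    h_t = o_t * torch.tanh(c_t)                  # 6. gated, squashed view of the cell
    return h_t, c_t
```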
PyTorch implements an equivalent form with separate input and recurrent weight matrices, which is mostly a notational difference. Its documentation writes the equations as i_t = sigma(W_ii x_t + b_ii + W_hi h_{t-1} + b_hi), and so on for f_t, g_t (the candidate), and o_t, with c_t = f_t * c_{t-1} + i_t * g_t and h_t = o_t * tanh(c_t). Two bias vectors per gate look redundant on paper, since b_ii + b_hi could be folded into a single bias, but keeping them separate matches the cuDNN kernels and avoids extra memory copies during training.
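The two formulations produce identical numbers. The sketch below checks this against torch.nn.LSTMCell; it relies on the ordering documented for PyTorch's packed weights, which stack the four gates as (input, forget, candidate, output).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
cell = nn.LSTMCell(input_size=8, hidden_size=16)
x = torch.randn(4, 8)                        # (batch, features)
h0, c0 = torch.zeros(4, 16), torch.zeros(4, 16)
h1, c1 = cell(x, (h0, c0))                   # the library's answer

# Manual computation with the same packed weights and both bias vectors.
gates = (x @ cell.weight_ih.T + cell.bias_ih
         + h0 @ cell.weight_hh.T + cell.bias_hh)
i, f, g, o = gates.chunk(4, dim=1)           # PyTorch gate order: i, f, g, o
c_manual = torch.sigmoid(f) * c0 + torch.sigmoid(i) * torch.tanh(g)
h_manual = torch.sigmoid(o) * torch.tanh(c_manual)

print(torch.allclose(h1, h_manual, atol=1e-6),
      torch.allclose(c1, c_manual, atol=1e-6))   # True True
```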
If you take the partial derivative of c_t with respect to c_{t-1} from the cell-state equation, you get f_t. When the forget gate is open (close to 1), the gradient flowing backward through that step is roughly the identity. Compose this across many steps and the product is the elementwise product of forget gates, which can stay near 1 if the network has learned to keep certain coordinates open. This is the modern, post-2000 version of the constant error carousel. The 1997 paper achieved the same effect by hard-coding a self-loop weight of 1.0; the 2000 paper let the network learn that weight per coordinate per time step.
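A quick autograd check of that claim. The snippet is an illustration of the cell-state path only, with the forget gate frozen at a constant and the write term lumped into a constant: the gradient across T steps comes out as exactly the T-fold product of the forget gates.

```python
import torch

T = 300
f = torch.full((8,), 0.99)                 # forget gates held open, near 1
c0 = torch.randn(8, requires_grad=True)
c = c0
for _ in range(T):
    c = f * c + 0.1                        # i_t * c_tilde_t lumped into a constant write
(grad,) = torch.autograd.grad(c.sum(), c0)
print(torch.allclose(grad, f ** T))        # True: the gradient is the product of forget gates
print(0.99 ** T)                           # ~0.049: small, but far from numerical zero
```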
A practical detail that turns out to matter a lot is the forget gate bias. If you initialize all biases to zero, the forget gate sigmoid starts at 0.5, so on average stored information decays by half at every step. This makes early training very slow, because the network has to learn to push the bias up before it can store anything for very long. Jozefowicz, Zaremba, and Sutskever showed in 2015 that simply initializing the forget gate bias to 1 (so the gate sigmoid starts at about 0.73) closes most of the gap between LSTM and GRU on a large empirical benchmark. Keras does this by default (unit_forget_bias=True); PyTorch does not, and some tutorials recommend an initial bias of 2.
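In PyTorch the trick has to be applied by hand. The sketch below relies on the documented (input, forget, candidate, output) ordering of the packed bias vectors; because PyTorch keeps two bias vectors per gate, only one of them is set to 1 here so the effective forget bias is 1 rather than 2.

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=64, hidden_size=128, num_layers=2)
H = lstm.hidden_size
with torch.no_grad():
    for name, bias in lstm.named_parameters():
        if name.startswith("bias_ih"):
            bias[H:2 * H].fill_(1.0)   # forget-gate slice of the input-side bias
        elif name.startswith("bias_hh"):
            bias[H:2 * H].fill_(0.0)   # zero the recurrent-side copy to avoid doubling
```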
Training an LSTM looks like training any other recurrent network. The sequence is unrolled in time, the loss is summed (or averaged) across the steps where there is supervision, and gradients are computed by backpropagation through time. In practice three details deserve attention.
First, gradient clipping. Even with the additive cell-state path, the gates and the input-to-hidden weights can still produce large gradients on rare events, especially early in training. Clipping the global gradient norm to a value like 1 or 5 is standard. Pascanu, Mikolov, and Bengio (2013) gave the now-standard analysis of why clipping works.
Second, truncation. For very long sequences, full BPTT is expensive and memory-hungry. Truncated BPTT processes the sequence in chunks of, say, 100 to 200 steps, carrying the cell and hidden state forward but only backpropagating gradients within the current chunk. This trades some long-range learning for tractability; in PyTorch it is what you get by calling .detach() on the carried hidden and cell states between chunks, as in the sketch below.
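A minimal training-loop sketch showing the first two details together: chunked truncated BPTT with a detached carried state, plus global-norm gradient clipping. The model, dummy data, chunk length, and clipping value are placeholders chosen for illustration.

```python
import torch
import torch.nn as nn

model = nn.LSTM(input_size=64, hidden_size=128, num_layers=2, batch_first=True)
head = nn.Linear(128, 10)                           # placeholder prediction head
params = list(model.parameters()) + list(head.parameters())
optimizer = torch.optim.Adam(params)
loss_fn = nn.CrossEntropyLoss()

long_x = torch.randn(8, 2000, 64)                   # one very long batch of sequences
long_y = torch.randint(0, 10, (8, 2000))            # per-step targets (dummy data)

state = None
chunk = 200                                         # truncation length
for start in range(0, long_x.size(1), chunk):
    x = long_x[:, start:start + chunk]
    y = long_y[:, start:start + chunk]

    outputs, state = model(x, state)
    loss = loss_fn(head(outputs).flatten(0, 1), y.flatten())

    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(params, max_norm=1.0)   # cap the global gradient norm
    optimizer.step()

    state = tuple(s.detach() for s in state)        # carry state forward, cut the graph
```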
Third, dropout. Naive dropout on the recurrent connections destroys the long-term memory because the dropout mask changes at every step. Two solutions are common: variational dropout (Gal & Ghahramani, 2016), which fixes a single dropout mask per sequence, and zoneout (Krueger et al., 2017), which randomly copies the previous hidden state to the next one. Most production code applies dropout only between stacked LSTM layers, not within the recurrence, which is what dropout controls in torch.nn.LSTM.
LSTM is less a single architecture than a family. Greff and colleagues published "LSTM: A Search Space Odyssey" in 2017 (IEEE TNNLS 28(10):2222-2232), comparing eight variants across speech, handwriting, and music tasks with about 5,400 training runs. Their conclusion was blunt: none of the standard variants meaningfully beats the modern (forget-gate, no peephole) LSTM on average, the forget gate and the output activation are the most important pieces, and the rest is mostly noise. Still, the family tree is worth knowing.
| variant | year | authors | distinguishing change |
|---|---|---|---|
| Vanilla LSTM | 1997 | Hochreiter & Schmidhuber | Constant error carousel, input and output gates, no forget gate |
| Forget-gate LSTM | 2000 | Gers, Schmidhuber, Cummins | Adds the forget gate, what most people now call "the LSTM" |
| Peephole LSTM | 2000 | Gers & Schmidhuber | Gates can inspect c_{t-1} directly, helps with precise timing tasks |
| Bidirectional LSTM | 2005 | Graves & Schmidhuber | Two LSTMs run forward and backward over the same sequence, outputs concatenated |
| Multiplicative LSTM | 2016 | Krause et al. | Input-conditioned recurrence, used in OpenAI's sentiment neuron |
| ConvLSTM | 2015 | Shi et al. | Replaces matrix multiplies with convolutions, designed for video and weather radar |
| Tree LSTM | 2015 | Tai, Socher & Manning | Recurrence over a tree structure rather than a linear chain |
| Highway LSTM | 2015 | Zhang et al. | Adds highway connections between stacked LSTM layers for deeper stacks |
| GRU | 2014 | Cho et al. | Two gates instead of three, no separate cell state |
| Mogrifier LSTM | 2020 | Melis, Kočiský & Blunsom | Pre-mixes the input and previous hidden state several times before the LSTM update |
| xLSTM (sLSTM, mLSTM) | 2024 | Beck et al. | Exponential gating, scalar and matrix memory, parallelizable variant for billions of parameters |
The Gated Recurrent Unit, introduced by Kyunghyun Cho and colleagues in "Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation" (EMNLP 2014), is the most widely used LSTM relative. It collapses the input and forget gates into a single update gate z_t and adds a reset gate r_t that controls how much of the previous hidden state contributes to the candidate. There is no separate cell state; the hidden state plays both roles. The full update is h_t = (1 - z_t) * h_{t-1} + z_t * tanh(W [r_t * h_{t-1}, x_t]).
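For comparison with the LSTM step above, here is a minimal sketch of one GRU step following the equations as written in this section (sign conventions for z_t vary between papers and libraries, and biases are omitted for brevity).

```python
import torch

def gru_step(x_t, h_prev, W_z, W_r, W_h):
    """One GRU time step; the W_* matrices have shape (hidden, hidden + input)."""
    hx = torch.cat([h_prev, x_t], dim=-1)
    z_t = torch.sigmoid(hx @ W_z.T)                     # update gate
    r_t = torch.sigmoid(hx @ W_r.T)                     # reset gate
    candidate = torch.tanh(
        torch.cat([r_t * h_prev, x_t], dim=-1) @ W_h.T)
    return (1 - z_t) * h_prev + z_t * candidate         # hidden state doubles as memory
```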
GRU has roughly 25 percent fewer parameters than LSTM at the same hidden size and trains slightly faster. Empirically the two are usually within a percentage point of each other on most tasks. Where they differ, LSTM tends to win on tasks with very long dependencies and GRU on smaller datasets where the parameter savings translate to less overfitting.
Most real systems do not use a single forward LSTM. They stack two or more LSTM layers (with the output of layer k being the input to layer k+1 at each time step), and they often run one or more layers in both directions. Graves and Schmidhuber's 2005 paper on framewise phoneme classification with bidirectional LSTM (Neural Networks 18(5-6):602-610) showed that adding a backward pass meaningfully improves performance on labeled sequence tasks (bidirectional RNNs themselves go back to Schuster and Paliwal, 1997). The trick obviously cannot be used for autoregressive generation, since the backward pass would peek at future tokens, but for tagging, classification, and acoustic modeling it is the default.
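With torch.nn.LSTM the stacked, bidirectional setup is just constructor flags; the sketch below only illustrates the resulting shapes.

```python
import torch
import torch.nn as nn

bilstm = nn.LSTM(input_size=64, hidden_size=128, num_layers=2,
                 batch_first=True, bidirectional=True)
x = torch.randn(8, 50, 64)          # (batch, time, features)
outputs, (h_n, c_n) = bilstm(x)
print(outputs.shape)                # (8, 50, 256): forward and backward outputs concatenated
print(h_n.shape)                    # (4, 8, 128): num_layers * num_directions final states
```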
ConvLSTM, from "Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting" (Shi et al., NeurIPS 2015), replaces the matrix multiplications inside the gates with convolutions. The cell state and hidden state then become 3D tensors (channels by height by width) instead of vectors. This is the natural fit for any task where each time step is itself a spatial grid: weather radar, video, or fluid simulation. The original paper used it for short-term rainfall prediction.
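A minimal sketch of the idea: the gate equations are the same as in the table earlier, but the pre-activations come from a single convolution over the concatenated input and hidden state, so the states are (channels, height, width) tensors. The Hadamard (peephole-style) cell-state terms of the original paper are omitted here for brevity.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """One ConvLSTM step with convolutional gates (illustrative sketch)."""
    def __init__(self, in_channels, hidden_channels, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv2d(in_channels + hidden_channels,
                              4 * hidden_channels, kernel_size,
                              padding=kernel_size // 2)

    def forward(self, x_t, h_prev, c_prev):
        gates = self.conv(torch.cat([x_t, h_prev], dim=1))   # all four gates at once
        i, f, g, o = gates.chunk(4, dim=1)                    # split along channels
        c_t = torch.sigmoid(f) * c_prev + torch.sigmoid(i) * torch.tanh(g)
        h_t = torch.sigmoid(o) * torch.tanh(c_t)
        return h_t, c_t
```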
LSTMs powered most of the practical sequence learning systems in the 2010s. A small selection:
| year | system | role of LSTM | reference |
|---|---|---|---|
| 2014 | Sequence to Sequence Learning | Deep (4-layer) LSTM encoder and decoder for machine translation | Sutskever, Vinyals, Le, NeurIPS 2014 |
| 2014 | Show and Tell image captioning | CNN encoder feeding an LSTM caption decoder | Vinyals, Toshev, Bengio, Erhan, CVPR 2015 |
| 2014 | Deep Speech (Baidu) | End-to-end bidirectional RNN with CTC for English recognition (plain recurrent units rather than LSTM; Mandarin came with Deep Speech 2) | Hannun et al., 2014 |
| 2014 | Sak, Senior, Beaufays acoustic model | Distributed LSTM acoustic models for speech recognition at Google | Interspeech 2014 |
| 2015 | Google Voice Search | Production deployment of LSTM acoustic models, cut transcription errors substantially | Sak et al. blog post |
| 2015 | Karpathy's char-rnn | Popular blog post showing LSTMs generating Shakespeare and Linux source code | "The Unreasonable Effectiveness of Recurrent Neural Networks" |
| 2016 | Google Neural Machine Translation | 8-layer LSTM encoder, 8-layer LSTM decoder, attention, wordpiece tokens, replaced phrase-based MT in Google Translate | Wu et al., arXiv:1609.08144 |
| 2016 | DeepMind WaveNet baselines | LSTM-RNN based parametric speech synthesizer used as a baseline against raw-audio generation | Van den Oord et al. |
| 2018 | ELMo contextual embeddings | Bidirectional LSTM language model whose hidden states are used as word features | Peters et al., NAACL 2018 |
| 2018 | OpenAI Five (Dota 2) | LSTM-based policy network trained with PPO to play 5v5 Dota at professional level | OpenAI |
| 2019 | DeepMind AlphaStar (StarCraft II) | LSTM core inside a transformer-LSTM hybrid policy that beat top human players | Vinyals et al., Nature 575 |
| 2019 | Apple Siri offline dictation | On-device LSTM acoustic model running locally on iPhone | Apple ML Journal |
A few categories beyond this table are worth calling out. In handwriting recognition Alex Graves' work in the late 2000s established LSTM with Connectionist Temporal Classification as the dominant approach; the same combination later went into Google Voice Search. In time series forecasting LSTM is a common baseline, often paired with attention or convolutional front-ends. In reinforcement learning LSTM cores are standard for partially observed environments; Deep Q-Networks with LSTM (DRQN), IMPALA, R2D2, and the OpenAI Five and AlphaStar agents above all used recurrent value or policy networks. In computational biology LSTM has been used for protein structure prediction features and DNA sequence analysis, although Transformers have largely taken over there as well.
The Transformer, introduced by Vaswani and colleagues in "Attention Is All You Need" (NeurIPS 2017), removes recurrence entirely and replaces it with self-attention over the whole sequence at once. This is good news on a GPU because every position can be processed in parallel rather than waiting for the previous step, and bad news on the memory bill because the attention matrix is quadratic in the sequence length. For most NLP tasks the trade-off favors Transformers, and from roughly 2018 onward the field shifted accordingly. ELMo (2018) was the last famous LSTM-based pretrained language model; BERT, GPT, and everything that followed used self-attention.
That said, LSTMs and Transformers have different inductive biases and different cost structures, and the choice is not as one-sided as the trend suggests.
| property | LSTM | Transformer |
|---|---|---|
| time per training step | O(T) sequential | O(T^2) parallel |
| time per token at inference | O(1), constant memory | O(T) attention over the cache, O(T) memory |
| state | fixed-size hidden + cell | growing key-value cache |
| parallelism over sequence | poor | excellent |
| typical context length in practice | thousands of steps with care | millions of tokens with hardware tricks |
| inductive bias | strong recency, sequential | none, position must be encoded |
| works well with very small data | yes | usually needs more |
| dominant in speech front-ends (2026) | still common | catching up |
The practical upshot: Transformers win when you have lots of data, lots of compute, and care about throughput. LSTMs win, or at least compete, when you care about constant memory at inference (streaming speech, on-device, embedded), when sequences are very long but mostly recent context matters, when data is small, or when you want a strong online learning algorithm.
After several years where new sequence architectures meant some variant of attention, the late 2023 and 2024 wave of papers brought recurrence back. The headline names are Mamba, RWKV, and xLSTM.
Mamba (Gu and Dao, arXiv:2312.00752, December 2023) is a state space model, not an LSTM, but it inherits the recurrent flavor: linear time in the sequence length, constant memory at inference, and, in the selective version, input-dependent gating that closely mirrors what an LSTM does with its forget gate. Mamba-3B was the first sub-quadratic model to clearly match a Transformer of the same size on language modeling, and on long sequences (hundreds of thousands to a million tokens) it is significantly faster.
RWKV (Peng and colleagues, 2023) reformulates attention so that it can be computed as a recurrence, giving it Transformer-like training and RNN-like inference. The training loop looks like a Transformer; deployment looks like an LSTM.
xLSTM (Beck, Pöppel, Spanring, Auer, Prudnikova, Kopp, Klambauer, Brandstetter, and Hochreiter, NeurIPS 2024, arXiv:2405.04517) is the closest direct revival of the original idea. The paper introduces two new cell types: sLSTM, which keeps the scalar memory of the original LSTM but adds exponential gating with stabilization, and mLSTM, which replaces the scalar cell state with a matrix and uses a covariance update rule that can be parallelized like attention. Stacked into residual blocks, xLSTM models compete with Llama-style Transformers and Mamba at billion-parameter scale. The fact that the original LSTM author co-led the paper, almost three decades after the 1997 paper, is a nice closing of a loop.
Every major deep learning framework ships an LSTM implementation backed by a fast GPU kernel. The interfaces are similar enough to swap with a one-line change.
| framework | module | notes |
|---|---|---|
| PyTorch | torch.nn.LSTM, torch.nn.LSTMCell | cuDNN backend, supports stacked, bidirectional, dropout between layers |
| TensorFlow / Keras | tf.keras.layers.LSTM, LSTMCell | XLA and cuDNN paths, the layer auto-selects the fast kernel when conditions are met |
| JAX | flax.linen.OptimizedLSTMCell, haiku.LSTM | functional API, scan over time |
| MXNet | mx.gluon.rnn.LSTM | similar API to PyTorch |
| ONNX | LSTM operator | for model export and cross-framework deployment |
| Apple Core ML | LSTMLayer | converts from PyTorch and TensorFlow for on-device inference |
A minimal usage example with torch.nn.LSTM:

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(
    input_size=64,
    hidden_size=128,
    num_layers=2,
    batch_first=True,
    bidirectional=False,
    dropout=0.2,
)

x = torch.randn(32, 100, 64)   # (batch, time, features)
outputs, (h_n, c_n) = lstm(x)
# outputs: (32, 100, 128) - top-layer hidden state at every step
# h_n, c_n: (2, 32, 128) - final hidden and cell states for each layer
```
The two big problems with LSTM are both about scale.
First, the recurrence is inherently sequential. The cell state at step t depends on the cell state at step t-1, so you cannot parallelize the forward pass over time the way you can for self-attention. Modern GPUs are extremely wide, and a model that uses 5 percent of the silicon for 100 percent of the time is slower than a model that uses 80 percent of the silicon some of the time, even if the second one technically does more arithmetic. cuDNN's fused LSTM kernel and tricks like quasi-RNN or simple recurrent unit (SRU) recover some of this, but not all of it.
Second, LSTMs are hard to scale to extremely long contexts. Even with a forget gate close to 1, the cell state is a fixed-size vector, and you cannot cram a million tokens of context into a few thousand floats without losing information. Transformers handle this by keeping a key-value cache that grows with the input, paying quadratic compute for the privilege. State space models and xLSTM-mLSTM aim at the middle ground: subquadratic compute with a richer state than a single vector.
Less fundamental but still real: LSTM hyperparameters (learning rate, gradient clipping value, hidden size, dropout) are tightly coupled, and the same recipe rarely transfers across tasks. The Greff et al. "Search Space Odyssey" paper found a few stable defaults, but practitioners still report that getting an LSTM to train well takes more babysitting than getting a Transformer to train well, mostly because Transformer pretraining recipes are now extremely well-documented and LSTM ones are not.
For about five years, from roughly 2013 through 2018, LSTM was synonymous with sequence modeling in deep learning. The original 1997 paper has been cited well over 100,000 times. Schmidhuber (2016) and Hochreiter (2021) each received the IEEE Neural Networks Pioneer Award in part for this work. Karpathy's 2015 blog post "The Unreasonable Effectiveness of Recurrent Neural Networks" introduced an entire generation of practitioners to LSTM by showing it generating plausible Shakespeare, Wikipedia articles, and even compilable C code from a character-level model. Christopher Olah's 2015 blog post "Understanding LSTM Networks" remains, ten years later, the diagram people reach for when they want to explain the gates.
The Transformer mostly displaced LSTM in research after 2018, but in production the transition has been slower. Speech recognition front-ends, on-device keyboard prediction, real-time translation pipelines, and a long tail of tabular and time series models still rely on LSTM cells, often because the constant memory and constant per-step latency are exactly what you need on a phone or in a low-latency service. With xLSTM, Mamba, and RWKV bringing recurrence back to the frontier, the architecture's second act is still being written.
torch.nn.LSTM. https://docs.pytorch.org/docs/stable/generated/torch.nn.LSTM.html