xLSTM
Last reviewed
May 20, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 4,005 words
xLSTM (Extended Long Short-Term Memory) is a recurrent neural network architecture introduced in May 2024 by Maximilian Beck, Korbinian Pöppel, Sepp Hochreiter, and collaborators at Johannes Kepler University Linz and the NXAI research lab, designed to scale the classical LSTM cell to billions of parameters and challenge transformer dominance in language modeling.[^1] The architecture replaces the original LSTM with two redesigned cell variants: sLSTM, which keeps a scalar memory but adds exponential gating, a normalizer state, and head-wise memory mixing; and mLSTM, which replaces the scalar cell state with a $d \times d$ matrix memory updated by a covariance (outer product) rule and is fully parallelizable across the sequence dimension.[^1][^2] Both cells sit inside residual xLSTM blocks with LayerNorm and projection layers, and stacks of these blocks form xLSTM language models that the original paper reports as competitive with same-size transformers, Mamba, and RWKV on validation perplexity, downstream zero-shot benchmarks, and long-context extrapolation.[^1][^3] In March 2025, NXAI released a 7-billion-parameter xLSTM 7B model trained on roughly 2.3 trillion tokens of DCLM-derived data, alongside JAX training code and inference kernels intended to demonstrate the design at production scale.[^4][^5] Follow-up work has applied the same block to vision (Vision-LSTM), robotics (Large Recurrent Action Model), and long-horizon time-series forecasting (xLSTMTime, xLSTM-Mixer).[^6][^7][^8][^9]
LSTM cells, introduced by Sepp Hochreiter and Jürgen Schmidhuber in a 1997 Neural Computation paper that built on Hochreiter's 1991 diploma thesis, were the dominant recurrent architecture for sequence modeling through the 2010s and remain the canonical answer to the vanishing gradient problem of plain RNNs.[^10] An LSTM cell maintains a cell state c_t updated through sigmoid-gated input, forget, and output gates and a tanh candidate, allowing gradients to flow across hundreds of timesteps when the gates remain near saturation. LSTMs powered early neural machine translation, speech recognition, and the language models that preceded large-scale transformer systems, and Hochreiter received the IEEE Computational Intelligence Society Neural Networks Pioneer Award in 2021 for the contribution.[^11]
Three limitations of the classical LSTM became prominent once the field scaled to billions of parameters. First, the sigmoid input and forget gates compress activations into the unit interval, which makes it numerically difficult for the cell to revise a stored value with a much larger new one: once a token has been written with a gate near one, an LSTM cannot easily overwrite it with a competing later token. Second, the cell state is a fixed-size vector, which limits the amount of information that can be retained, especially for tasks like multi-query associative recall where many key-value pairs must be stored. Third, the sequential dependency $c_t = f_t \odot c_{t-1} + i_t \odot z_t$, whose gates are themselves computed from the previous hidden state, prevents parallel training over the sequence dimension, which is why LSTMs lost ground to transformers that compute all timesteps in parallel during training.[^1][^2]
By 2023 two new families of architectures had revived interest in sub-quadratic alternatives to attention. Mamba and other selective state space models generalized linear time-invariant systems with input-dependent parameters and an associated parallel scan, achieving transformer-like quality at smaller scales and linear-time inference.[^12] RWKV and related linear attention variants reformulated attention as a recurrent update with constant memory at inference, also reaching billions of parameters. These results suggested that the bottleneck of the LSTM was less the recurrent paradigm itself and more the specific choice of gating and storage. xLSTM was framed by its authors as a direct attempt to repair the LSTM rather than to replace it, asking how far one could push the classical cell when armed with modern gating, normalization, parallel formulations, and contemporary training infrastructure.[^1]
The work was anchored at the Institute for Machine Learning at Johannes Kepler University Linz, where Hochreiter has led research since 2018 and headed the LIT AI Lab since 2017, and at NXAI, an Austrian AI lab co-founded by Hochreiter, Netural X, and PIERER Digital Holding in December 2023 to develop European foundation models around the xLSTM idea.[^11][^13] The first arXiv preprint (2405.04517) appeared on May 7, 2024, with a revised v2 on December 6, 2024, and the paper was accepted as a spotlight presentation at NeurIPS 2024.[^1][^14] Reference code was open-sourced under Apache 2.0 at github.com/NX-AI/xlstm as a pip-installable Python package shortly after publication.[^2][^15]
The xLSTM paper's central modification to the LSTM equation is to replace the sigmoid in the input and forget gates with an exponential. In an sLSTM cell, the pre-activation values for the input gate $\tilde{i}_t$ and forget gate $\tilde{f}_t$ are computed from the same linear projection of the input and previous hidden state used in a standard LSTM, but the gates themselves are $i_t = \exp(\tilde{i}_t)$ and $f_t = \exp(\tilde{f}_t)$, with a sigmoid still permitted as an alternative for the forget gate.[^1][^16] Because $\exp$ is unbounded above, two changes are necessary to keep training stable.
A normalizer state $n_t$ is added to the cell. It accumulates the input gate values, discounted by the forget gate, according to $n_t = f_t \cdot n_{t-1} + i_t$, and the hidden state is computed as $h_t = o_t \odot (c_t / n_t)$, where $o_t$ is the output gate. This normalization, in effect, divides each retrieval by the sum of write magnitudes that contributed to the current memory, mimicking the role of the softmax denominator in attention.[^1][^16]
A stabilizer state $m_t$ prevents numerical overflow in the exponentials by tracking the running maximum log gate value. The paper defines $m_t = \max(\log(f_t) + m_{t-1}, \log(i_t))$ and then computes the actual gates after subtracting $m_t$ in log space, yielding $i'_t = \exp(\log(i_t) - m_t)$ and $f'_t = \exp(\log(f_t) + m_{t-1} - m_t)$. This is structurally identical to the safe softmax trick that subtracts the maximum logit before exponentiating, and it lets exponential gates be used safely even at long sequence lengths.[^1][^16]
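The normalizer and stabilizer updates above can be sketched for a single scalar cell in plain Python. This is a minimal illustration rather than the reference implementation: the function name `slstm_step`, the helper `run`, the zero initial states, and the fixed output gate of 1 are assumptions made here, and the gate pre-activations are passed in directly instead of being computed from learned projections.

```python
import math

def slstm_step(c_prev, n_prev, m_prev, i_pre, f_pre, z, o=1.0):
    """One scalar sLSTM update with exponential gates in stabilized form.

    i_pre and f_pre are gate pre-activations, so log(i_t) = i_pre and
    log(f_t) = f_pre. z is the candidate value, o the output gate value.
    """
    m = max(f_pre + m_prev, i_pre)          # stabilizer state m_t
    i_gate = math.exp(i_pre - m)            # stabilized input gate, at most 1
    f_gate = math.exp(f_pre + m_prev - m)   # stabilized forget gate, at most 1
    c = f_gate * c_prev + i_gate * z        # cell state
    n = f_gate * n_prev + i_gate            # normalizer state
    h = o * (c / n)                         # normalized hidden state
    return c, n, m, h

def run(steps):
    """Feed (i_pre, f_pre, z) triples through the cell, return the last h."""
    c, n, m, h = 0.0, 0.0, 0.0, 0.0          # assumed initial states
    for i_pre, f_pre, z in steps:
        c, n, m, h = slstm_step(c, n, m, i_pre, f_pre, z)
    return h

# Write the value 1.0, then write 5.0 with a much larger input gate:
# the later write dominates the normalized output.
h_revised = run([(0.0, 0.0, 1.0), (10.0, 0.0, 5.0)])

# A pre-activation of 800 would overflow a direct exp() in double precision,
# but the stabilized form only ever exponentiates non-positive numbers.
h_large = run([(0.0, 0.0, 1.0), (800.0, 0.0, 5.0)])

try:
    math.exp(800.0)
    naive_overflows = False
except OverflowError:
    naive_overflows = True
```

Because the cell state and the normalizer are scaled by the same factor, the ratio `c / n` is unchanged by the stabilization, which is why subtracting `m` does not alter the output.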
The practical effect of exponential gating is that the cell can now revise stored values. If a later token deserves more weight than an earlier token, an exponentially larger input gate lets the new write dominate what is already stored, something that the sigmoid gates of the original LSTM, which are bounded by one, cannot achieve.[^1]
The sLSTM cell retains the classical scalar cell state $c_t \in \mathbb{R}$ but adds exponential gating, the normalizer $n_t$, and the stabilizer $m_t$ described above. It also introduces a multi-head structure inspired by multi-head attention, but with one key restriction. The recurrent weight matrices that govern memory mixing are block-diagonal across heads, so each head mixes only within itself. There is no cross-head mixing within an sLSTM layer, but the within-head recurrence preserves an inductive bias for state tracking that the matrix memory of mLSTM, by virtue of being fully parallelizable, must give up.[^1][^16]
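The block-diagonal restriction can be illustrated with a small sketch. The helper names `block_diagonal` and `matvec`, the two heads of width 2, and the numeric weights are all hypothetical; the point is only that changing one head's part of the previous hidden state leaves the other head's recurrent contribution untouched.

```python
def block_diagonal(blocks):
    """Assemble square per-head blocks into one block-diagonal matrix."""
    size = sum(len(b) for b in blocks)
    full = [[0.0] * size for _ in range(size)]
    offset = 0
    for b in blocks:
        for r, row in enumerate(b):
            for c, val in enumerate(row):
                full[offset + r][offset + c] = val
        offset += len(b)
    return full

def matvec(mat, vec):
    """Plain matrix-vector product."""
    return [sum(w * x for w, x in zip(row, vec)) for row in mat]

# Two hypothetical heads, each with a 2 x 2 recurrent block.
head_blocks = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[5.0, 6.0], [7.0, 8.0]],
]
R = block_diagonal(head_blocks)

# Change only head 0's slice of the previous hidden state.
h_prev = [1.0, 1.0, 1.0, 1.0]
h_perturbed = [2.0, 0.0, 1.0, 1.0]

out_a = matvec(R, h_prev)
out_b = matvec(R, h_perturbed)
# Entries 0-1 (head 0) differ between out_a and out_b;
# entries 2-3 (head 1) are identical.
```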
This design lets sLSTM solve problems that require true sequential state tracking, such as parity computation and certain context-free language tasks. The paper argues that sLSTM occupies a position in the architecture-power hierarchy that fully parallelizable models (including transformers, Mamba, and mLSTM) cannot reach, which is why a small fraction of sLSTM blocks is mixed in even when most blocks are mLSTM.[^1]
The mLSTM cell discards scalar storage entirely and replaces $c_t$ with a matrix $C_t \in \mathbb{R}^{d \times d}$. At each timestep the input is projected to a query $q_t$, a key $k_t$, and a value $v_t$ (sharing notation with attention), and the memory is updated by an outer-product, or covariance, rule

$$C_t = f_t \cdot C_{t-1} + i_t \cdot v_t k_t^\top$$

with a normalizer vector $n_t = f_t \cdot n_{t-1} + i_t \cdot k_t$.[^1][^16] Retrieval at time $t$ computes $h_t = o_t \odot (C_t q_t) / \max(|n_t^\top q_t|, 1)$. LayerNorm is applied to keys and values before the outer product to keep their magnitudes controlled. Because there is no recurrent memory mixing across heads or time apart from the diagonal forget gate, the update is a linear function of the sequence and admits a parallel formulation: the entire matrix memory across a sequence can be computed as a sum of outer products weighted by cumulative gate products, which is amenable to chunkwise parallel scans on GPU.[^1][^2][^16]
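The equivalence between the recurrent update and the sum-of-outer-products view can be checked numerically on a toy sequence. The sketch below is an illustration under simplifying assumptions, not the paper's kernel: gates are given as small plain numbers so no stabilizer is needed, the output gate is fixed to 1, there is a single head, and the function names and input values are invented for the example. The second function carries no state between timesteps, so each output could in principle be computed independently.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mlstm_recurrent(qs, ks, vs, i_gates, f_gates):
    """Run the matrix-memory recurrence and return every hidden state."""
    d = len(ks[0])
    C = [[0.0] * d for _ in range(d)]   # matrix memory C_t
    n = [0.0] * d                       # normalizer vector n_t
    hs = []
    for q, k, v, i, f in zip(qs, ks, vs, i_gates, f_gates):
        # C_t = f_t * C_{t-1} + i_t * v_t k_t^T
        C = [[f * C[r][c] + i * v[r] * k[c] for c in range(d)]
             for r in range(d)]
        # n_t = f_t * n_{t-1} + i_t * k_t
        n = [f * n_c + i * k_c for n_c, k_c in zip(n, k)]
        denom = max(abs(dot(n, q)), 1.0)
        hs.append([dot(row, q) / denom for row in C])
    return hs

def mlstm_parallel(qs, ks, vs, i_gates, f_gates):
    """Same outputs as weighted sums over past steps, with no carried state."""
    d = len(vs[0])
    hs = []
    for t, q in enumerate(qs):
        # Weight of step s seen from step t: i_s * f_{s+1} * ... * f_t
        weights = [i_gates[s] * math.prod(f_gates[s + 1:t + 1])
                   for s in range(t + 1)]
        scores = [w * dot(ks[s], q) for s, w in enumerate(weights)]
        denom = max(abs(sum(scores)), 1.0)
        hs.append([sum(scores[s] * vs[s][r] for s in range(t + 1)) / denom
                   for r in range(d)])
    return hs

# Toy inputs (illustrative values only), d = 2, three timesteps.
qs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ks = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
vs = [[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]]
i_gates = [1.0, 0.5, 2.0]
f_gates = [0.9, 0.8, 0.7]

h_rec = mlstm_recurrent(qs, ks, vs, i_gates, f_gates)
h_par = mlstm_parallel(qs, ks, vs, i_gates, f_gates)
```

The `scores` list plays the role of an attention row: a gate-weighted query-key product for every earlier position, which is what makes a matrix-multiplication formulation over the whole sequence possible.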
The covariance rule is mathematically related to fast weight programmers, linear attention, and modern linear RNNs, but xLSTM specifically combines it with exponential input gates and the n_t and m_t stabilization machinery, giving the matrix memory the ability to overwrite stored associations rather than merely accumulate them.[^1] On a multi-query associative recall benchmark in the paper, mLSTM is reported to handle up to 256 key-value pairs in a single sequence, outperforming Mamba and other sub-quadratic baselines at the same scale.[^1][^16]
The two cell types are embedded into residual blocks. The paper describes two block templates: a post up-projection block (typically used for sLSTM) that summarizes the past in the original embedding space and then projects up through a gated MLP, and a pre up-projection block (typically used for mLSTM) that first projects up, runs the recurrent core in the higher-dimensional space, and then projects back down. Both templates use pre-LayerNorm and a residual connection, mirroring the now-standard transformer block topology.[^1][^16]
An xLSTM language model is then a stack of such blocks. The paper denotes configurations as xLSTM[a:b], where a:b is the ratio of mLSTM blocks to sLSTM blocks in the stack. xLSTM[1:0] is pure mLSTM. xLSTM[7:1] interleaves seven mLSTM blocks with one sLSTM block and was the most studied configuration in the paper. The motivation for any sLSTM at all is the state-tracking inductive bias described above; the motivation to keep most blocks as mLSTM is parallel training and matrix capacity.[^1][^16]
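As a rough illustration of the xLSTM[a:b] notation, the ratio can be expanded into a list of block types. The function `block_layout` is hypothetical, and the placement it uses (mLSTM blocks first, sLSTM blocks last within each repeating group) is an assumption made for the example; the notation itself only fixes the ratio, not where the sLSTM blocks sit.

```python
def block_layout(a, b, num_blocks):
    """Expand an xLSTM[a:b] ratio into a list of block types.

    Assumed placement: within each group of a + b blocks, the a mLSTM
    blocks come first and the b sLSTM blocks last.
    """
    if a < 0 or b < 0 or a + b == 0:
        raise ValueError("a and b must be non-negative and not both zero")
    group = ["mLSTM"] * a + ["sLSTM"] * b
    return [group[idx % len(group)] for idx in range(num_blocks)]

layout_7_1 = block_layout(7, 1, 16)   # seven mLSTM blocks per sLSTM block
layout_1_0 = block_layout(1, 0, 4)    # pure mLSTM stack
layout_0_1 = block_layout(0, 1, 4)    # pure sLSTM stack
```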
The reference implementation at NX-AI/xlstm contains an xLSTMBlockStack module intended as a drop-in transformer replacement and an xLSTMLMModel wrapper for autoregressive language modeling. The repository ships both native PyTorch and Triton kernel implementations of mLSTM and includes a custom CUDA extension for sLSTM that requires GPUs of compute capability 8.0 or higher (Ampere or newer). The recommended development environment is PyTorch 2.4 with CUDA 12.4. The package is published as xlstm on PyPI under the Apache 2.0 license.[^2][^15] A separate xlstm-jax repository contains a JAX implementation used for the 7B model release.[^17]
The May 2024 paper reports two main classes of experiments on synthetic and natural language data.
To probe state tracking and storage capacity in isolation, the paper evaluates xLSTM[1:0] (pure mLSTM), xLSTM[1:1] (an even mix), and xLSTM[0:1] (pure sLSTM) on tasks including multi-query associative recall, parity, modular arithmetic, and formal-language tasks. mLSTM-heavy stacks dominate the associative recall task, while sLSTM-heavy stacks dominate the parity and other state-tracking tasks, consistent with the theoretical argument that fully parallelizable models cannot solve certain state-tracking problems, whereas scalar memories with sequential memory mixing can.[^1][^16]
The first language-modeling experiment trains 350M-parameter models on 15 billion tokens of the SlimPajama dataset and compares validation perplexity across thirteen architectures: GPT-3 style transformer, Llama, H3, Mamba, RWKV-4, RWKV-5, RWKV-6, GLA, HGRN2, RetNet, Hyena, xLSTM[1:0], and xLSTM[7:1].[^1][^16][^18] Both xLSTM variants attain the lowest perplexities; the paper highlights that xLSTM[7:1] in particular outperforms all other models. The 15B token sweep is paired with a parameter sweep at 125M, 350M, 760M, 1.3B, and 2.7B that fits scaling-law style curves and shows xLSTM curves below those of Llama and Mamba across the studied range.[^1][^16]
A second set of runs trains 1.3B-parameter models on 300 billion tokens of SlimPajama, matching the token budget used in earlier comparisons of Mamba and Griffin. The paper reports that xLSTM[1:0] and xLSTM[7:1] achieve lower perplexity than Llama, Mamba, and RWKV-4 at the same compute, and that xLSTM[1:0] sustains higher inference throughput than Mamba because the mLSTM update does not require the selective scan.[^1][^16]
To test sequence-length extrapolation, models trained at context length 2048 are evaluated up to 16384 tokens without any retraining. xLSTM maintains roughly flat per-token perplexity across the longer contexts, whereas transformer baselines diverge once they exceed their training context.[^1][^16] The paper attributes this to the recurrent state, which has fixed size and therefore does not encode any explicit length-dependent positional structure.
The 1.3B xLSTM models are evaluated zero-shot on standard reasoning and commonsense benchmarks including PIQA, HellaSwag, LAMBADA, ARC-Easy, ARC-Challenge, WinoGrande, and OpenBookQA, plus the PALOMA suite of language-modeling perplexities across domains.[^1][^16] xLSTM ranks first on PALOMA among the recurrent baselines and is competitive with the transformer.
NXAI scaled the recipe to seven billion parameters in a follow-up release described in the paper "xLSTM 7B: A Recurrent LLM for Fast and Efficient Inference" (Beck, Pöppel, Lippe, Kurle, Blies, Klambauer, Böck, Hochreiter), posted to arXiv on March 17, 2025 and accepted at ICML 2025.[^4][^15] The 7B model uses an optimized architecture variant, labeled xLSTM Large in the public repository, that drops the sLSTM blocks and uses only mLSTM in a stack designed for maximum training throughput and inference efficiency.[^2][^4]
| xLSTM 7B specification | Value |
|---|---|
| Parameters | 7 billion |
| Training tokens | approximately 2.3 trillion[^5] |
| Training data | DCLM-Baseline plus selected high-quality sources[^5] |
| Training framework | xlstm-jax (JAX)[^17] |
| Released by | NXAI, Linz, Austria[^4] |
| Release | March 2025[^4] |
| License | NXAI Community License[^5] |
| Block type | mLSTM only (sLSTM removed)[^2] |
The model card on Hugging Face reports lighteval Leaderboard v1 scores of 0.584 on ARC-Challenge (25-shot), 0.589 on MMLU (5-shot), 0.710 on HellaSwag (10-shot), 0.742 on WinoGrande (5-shot), 0.420 on TruthfulQA, 0.817 on PIQA, and 0.443 on OpenBookQA, with much weaker performance on math (GSM8K at 0.004 in the base model without instruction tuning).[^5] Hochreiter, quoted on the NXAI release page in his role as Chief Scientist, described the model as the strongest large RNN-based language model of its time and highlighted targeted use cases in robotics, automotive, and medical technology.[^4]
The accompanying inference kernels, distributed as the mlstm_kernels package alongside the main repository, are written in Triton and target H100-class accelerators. The xlstm-jax codebase contains the full distributed training pipeline used to produce the model.[^17] The 7B paper emphasizes that, unlike transformer language models, the time and memory cost of decoding a token in xLSTM 7B is constant in the context length, making it well suited to long-output reasoning loops where transformer KV caches grow without bound.[^4]
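The contrast between a growing KV cache and a fixed recurrent state can be made concrete with back-of-the-envelope arithmetic. Everything numeric below is hypothetical: the layer count, head count, head dimension, and 2-byte values are chosen for illustration and are not the published xLSTM 7B configuration. The state accounting assumes one square matrix memory, one normalizer vector, and one stabilizer scalar per head.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_value=2):
    """Memory for cached keys and values of one sequence in a transformer."""
    # Two tensors (keys and values) per layer, one row per cached token.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value

def mlstm_state_bytes(num_layers, num_heads, head_dim, bytes_per_value=2):
    """Memory for the recurrent state of one sequence in an mLSTM stack."""
    # Per head: matrix memory (d x d), normalizer vector (d), stabilizer (1).
    per_head = head_dim * head_dim + head_dim + 1
    return num_layers * num_heads * per_head * bytes_per_value

# Hypothetical configuration used for both models.
LAYERS, HEADS, HEAD_DIM = 32, 8, 128

kv_short = kv_cache_bytes(LAYERS, HEADS, HEAD_DIM, seq_len=2048)
kv_long = kv_cache_bytes(LAYERS, HEADS, HEAD_DIM, seq_len=65536)
state = mlstm_state_bytes(LAYERS, HEADS, HEAD_DIM)

for name, size in [("KV cache @ 2,048 tokens", kv_short),
                   ("KV cache @ 65,536 tokens", kv_long),
                   ("mLSTM state (any length)", state)]:
    print(f"{name}: {size / 2**20:.1f} MiB")
```

The recurrent state has no `seq_len` argument at all, which is the arithmetic form of the constant-cost decoding claim; the KV cache grows in direct proportion to the number of tokens kept.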
Vision-LSTM (ViL), introduced by Benedikt Alkin, Maximilian Beck, Korbinian Pöppel, Sepp Hochreiter, and Johannes Brandstetter in arXiv:2406.04303 (June 2024, ICLR 2025), adapts the xLSTM block as a backbone for ImageNet-scale vision tasks.[^6] Like Vision Mamba, ViL treats an image as a sequence of patch tokens and processes them with alternating directions: odd-indexed blocks scan top-to-bottom, even-indexed blocks bottom-to-top, providing the bidirectional context that pure unidirectional recurrence lacks. The reported configurations include ViL-T (6M parameters, 192 latent dim, 24 blocks), ViL-S (23M, 384), and ViL-B (89M, 768), all using 16x16 patches. ImageNet-1K top-1 accuracies are 78.3 percent for ViL-T, 81.5 percent for ViL-S, and 82.4 percent for ViL-B, with ViL-T exceeding DeiT-III-T at 76.2 percent and ViL-S matching DeiT-III-S at 81.4 percent. ViL-B trails DeiT-III-B (83.7 percent) at base scale.[^6][^19]
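The alternating traversal can be sketched as a function from block index to visiting order. This is an illustrative simplification: the name `scan_order` is hypothetical, patches are assumed to be flattened row by row so that index 0 is the top-left patch, and blocks are counted from 1.

```python
def scan_order(num_patches, block_index):
    """Order in which a block visits the flattened patch tokens.

    Odd-numbered blocks read the sequence forward (top to bottom),
    even-numbered blocks read it reversed (bottom to top).
    """
    order = list(range(num_patches))
    return order if block_index % 2 == 1 else order[::-1]

# A 2 x 2 grid of patches flattened to indices 0..3.
first_block = scan_order(4, block_index=1)
second_block = scan_order(4, block_index=2)
```

Stacking blocks with opposite directions means every patch has seen information from both earlier and later positions after two blocks, without running two recurrences inside a single block.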
The Large Recurrent Action Model (LRAM), by Schmied, Adler, Patil, Beck, Pöppel, Brandstetter, Klambauer, Pascanu, and Hochreiter, places an xLSTM at the core of a Decision Transformer-style action model evaluated on 432 tasks across six robotics domains.[^7][^20] The motivation is the same as in language modeling: transformer-based action models pay quadratic decoding cost during real-time control, whereas an xLSTM at the core has linear-time inference and natural length extrapolation. The paper reports task performance competitive with a transformer baseline at substantially faster inference, particularly at longer context windows. It was presented at the NeurIPS 2024 OWA workshop on open-world agents and is open-sourced at ml-jku/LRAM.[^7][^20]
Two notable time-series adaptations followed. xLSTMTime (arXiv:2407.10240) reuses the xLSTM block with patch-style tokenization for long-horizon time-series forecasting and reports improvements over DLinear and PatchTST on standard datasets (ETTh, ETTm, weather, traffic), with reported gains over DLinear of around 18 percent at horizon 96 and 13 percent at horizon 192 on the weather dataset.[^8] xLSTM-Mixer (arXiv:2410.16928) applies the sLSTM scalar memory in a mixer-style architecture for multivariate forecasting and reports competitive or superior MSE to a broad set of baselines on the GIFT-Eval benchmark.[^9]
Several additional papers in 2024 and 2025 have used xLSTM as a building block: U-VixLSTM for medical 3D image segmentation, MAL (cluster-masked and multi-task pretraining) for vision representation learning, and various exploratory applications in speech, finance, and combined transformer hybrid stacks.[^21][^22] The repository ecosystem also includes xlstm-jax and mlstm_kernels for performance-critical deployments, both maintained by NXAI alongside the main package.[^17][^2]
The official implementations span three repositories under the NX-AI GitHub organization.[^15]
| Repository | Purpose | Framework |
|---|---|---|
| NX-AI/xlstm | Reference research code, xLSTM blocks and language model wrapper, Triton/CUDA kernels[^2] | PyTorch |
| NX-AI/xlstm-jax | Production training code used for xLSTM 7B[^17] | JAX |
| NX-AI/mlstm_kernels | Optimized inference kernels for mLSTM, targeting H100[^2] | Triton / CUDA |
The package is distributed on PyPI as xlstm under Apache 2.0. Pretrained weights of the 7B model are hosted on Hugging Face under the NXAI Community License, with about 4,800 downloads in the month before the most recent snapshot of the model card.[^5][^15] The repository documentation lists PyTorch 2.4 with CUDA 12.4 as the recommended environment, and the sLSTM kernel additionally requires compute capability 8.0 or higher.[^2]
The release pattern parallels the rollout of Mamba and RWKV open-source ecosystems: a primary paper plus a reference implementation, a scaled-up model with separate paper and weights release, and supplementary kernel work to make inference competitive on commodity GPUs.[^15][^17]
Because xLSTM offers transformer-like training parallelism plus constant-cost decoding, the natural application domains are those where transformer KV caches become an inference bottleneck or where input sequences extend well beyond standard context windows. NXAI has publicly emphasized industrial, robotics, automotive, and medical use cases.[^4]
In language modeling, xLSTM 7B is presented as an inference-efficient alternative to Llama and Mamba at the same parameter count, particularly attractive for batch decoding of long completions where transformer memory grows linearly with sequence length.[^4] In robotics, the LRAM line of work positions xLSTM at the core of large action models for real-time control, where latency dominates.[^7] In computer vision, ViL serves as a generic attention-free backbone for classification and downstream transfer learning, filling the same role as a Vision Transformer but with blocks whose cost is linear in the number of patch tokens.[^6] In time-series forecasting, xLSTMTime and xLSTM-Mixer leverage the matrix and scalar memories respectively for long forecast horizons that exceed typical transformer training lengths.[^8][^9]
The xLSTM line of work has acknowledged constraints. The sLSTM cell, by retaining recurrent memory mixing within each head via block-diagonal matrices, is not fully parallelizable along the sequence dimension and is generally slower to train than the mLSTM. As a consequence, the production xLSTM 7B drops sLSTM entirely from its block stack, trading away the parity and state-tracking capabilities that motivated the dual-cell design in the original paper.[^2][^4]
The mLSTM matrix memory has size $d \times d$ per head, where $d$ is the head dimension. This is fixed at training time and cannot be enlarged at inference, unlike the KV cache of a transformer, which grows with the context. Models can therefore saturate on tasks that require recall of more independent key-value pairs than the matrix dimension can hold, although the paper reports favorable scaling up to 256 keys at the studied scales.[^1][^16]
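The saturation effect can be seen in a toy outer-product memory. The sketch below is a simplification and not the mLSTM itself: all gates are fixed to 1, the normalizer is omitted, and the function names, keys, and values are invented for the example. With d = 2, two orthogonal keys are retrieved exactly, while a third key cannot be orthogonal to both and therefore contaminates earlier retrievals.

```python
def store(pairs, d):
    """Accumulate key-value pairs into a d x d memory via outer products."""
    C = [[0.0] * d for _ in range(d)]
    for k, v in pairs:
        for r in range(d):
            for c in range(d):
                C[r][c] += v[r] * k[c]   # adds v k^T
    return C

def retrieve(C, q):
    """Read the memory with a query vector: C q."""
    return [sum(row[c] * q[c] for c in range(len(q))) for row in C]

d = 2
key_a, val_a = [1.0, 0.0], [1.0, 2.0]
key_b, val_b = [0.0, 1.0], [3.0, 4.0]
key_c, val_c = [0.6, 0.8], [5.0, 6.0]   # overlaps with both earlier keys

# Two orthogonal keys fit in a 2 x 2 memory without interference.
clean = retrieve(store([(key_a, val_a), (key_b, val_b)], d), key_a)

# A third pair exceeds what d = 2 orthogonal directions can hold.
crowded = retrieve(
    store([(key_a, val_a), (key_b, val_b), (key_c, val_c)], d), key_a
)
interference = [x - y for x, y in zip(crowded, clean)]
```

Here `clean` reproduces `val_a`, while `crowded` is shifted by a multiple of `val_c` proportional to the overlap between `key_c` and `key_a`. Gating and learned keys can reduce such overlap in practice, but the number of mutually orthogonal directions is still capped by `d`.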
Downstream benchmark scores for xLSTM 7B, while competitive with mid-2024 7B transformers on commonsense reasoning, lag the strongest 2025 7B transformer instruction-tuned models on math and on instruction-following benchmarks; the model card itself reports a GSM8K score of 0.004 in the base model and 0.244 on IfEval, both well below leading 7B instruction-tuned baselines.[^5] The model is released as a base model intended for further fine-tuning, which partly accounts for these gaps.[^4][^5]
Finally, the broader competitive landscape for sub-quadratic architectures continues to evolve. Mamba 2 and newer hybrid attention models have closed quality gaps relative to xLSTM in some 2024 to 2025 evaluations, and direct head-to-head benchmarks at identical training budgets remain comparatively scarce in the literature, with most public comparisons drawn from the xLSTM authors' own ablation tables.[^1][^4]
| Architecture | Year | Memory mechanism | Parallel training | Decoding cost per token |
|---|---|---|---|---|
| LSTM | 1997 | Scalar cell, sigmoid gates | No | O(1) state, sequential |
| Transformer | 2017 | Attention over full sequence | Yes | O(n) (KV cache grows) |
| Retentive Network (RetNet) | 2023 | Multi-scale retention | Yes (parallel form) | O(1) |
| Mamba | 2023 | Selective SSM with parallel scan | Yes | O(1) |
| RWKV | 2023+ | Linear attention with recurrent rewrite | Yes | O(1) |
| sLSTM (in xLSTM) | 2024 | Scalar memory + exponential gates + head mixing | Partial | O(1) |
| mLSTM (in xLSTM) | 2024 | Matrix memory + covariance update | Yes | O(1), fixed-size matrix |
The xLSTM block sits closest to linear attention and to Mamba in the design space: all three pair a recurrent formulation for inference with a parallel formulation for training, while xLSTM differs from both in its explicit covariance update and exponential gating. The full xLSTM paper compares directly against H3, GLA, HGRN2, RetNet, Hyena, Mamba, and three generations of RWKV on shared validation perplexity and benchmark suites.[^1][^16]