xLSTM

Updated Jul 11, 2026

xLSTM (Extended Long Short-Term Memory) is a recurrent neural network architecture introduced in May 2024 by Maximilian Beck, Korbinian Pöppel, Sepp Hochreiter, and collaborators at Johannes Kepler University Linz and the NXAI research lab, designed to scale the classical LSTM cell to billions of parameters and challenge transformer dominance in language modeling.^[1] The architecture replaces the original LSTM with two redesigned cell variants: sLSTM, which keeps a scalar memory but adds exponential gating, a normalizer state, and head-wise memory mixing; and mLSTM, which replaces the scalar cell state with a $d \times d$ matrix memory updated by a covariance (outer product) rule and is fully parallelizable across the sequence dimension.^[1]^[2] Both cells sit inside residual xLSTM blocks with LayerNorm and projection layers, and stacks of these blocks form xLSTM language models that the original paper reports as competitive with same-size transformers, Mamba, and RWKV on validation perplexity, downstream zero-shot benchmarks, and long-context extrapolation.^[1]^[3] In March 2025, NXAI released a 7-billion-parameter xLSTM 7B model trained on roughly 2.3 trillion tokens of DCLM-derived data, alongside JAX training code and inference kernels intended to demonstrate the design at production scale.^[4]^[5] Follow-up work has applied the same block to vision (Vision-LSTM), robotics (Large Recurrent Action Model), and long-horizon time-series forecasting (xLSTMTime, xLSTM-Mixer).^[6]^[7]^[8]^[9]

Background

LSTM cells, introduced by Sepp Hochreiter and Jürgen Schmidhuber in a 1997 Neural Computation paper that built on Hochreiter's 1991 diploma thesis, were the dominant recurrent architecture for sequence modeling through the 2010s and remain the canonical answer to the vanishing gradient problem of plain RNNs.^[10] An LSTM cell maintains a cell state c_t updated through sigmoid-gated input, forget, and output gates and a tanh candidate, allowing gradients to flow across hundreds of timesteps when the gates remain near saturation. LSTMs powered early neural machine translation, speech recognition, and the language models that preceded large-scale transformer systems, and Hochreiter received the IEEE Computational Intelligence Society Neural Networks Pioneer Award in 2021 for the contribution.^[11]

Three limitations of the classical LSTM became prominent once the field scaled to billions of parameters. First, the sigmoid input and forget gates compress activations into the unit interval, which makes it numerically difficult for the cell to revise a stored value with a much larger new one: once a token has been written with a gate near one, an LSTM cannot easily overwrite it with a competing later token. Second, the cell state is a fixed-size vector, which limits the amount of information that can be retained, especially for tasks like multi-query associative recall where many key-value pairs must be stored. Third, the sequential dependency $c_t = f_t \odot c_{t-1} + i_t \odot z_t$ prevents parallel training over the sequence dimension, which is why LSTMs lost ground to transformers that compute all timesteps in parallel during training.^[1]^[2]

By 2023 two new families of architectures revived interest in sub-quadratic alternatives to attention. Mamba and other selective state space models generalized linear time invariant systems with input-dependent parameters and an associated parallel scan, achieving transformer-like quality at smaller scales and linear time inference.^[12] RWKV and related linear attention variants reformulated attention as a recurrent update with constant memory at inference, also reaching billions of parameters. These results suggested that the bottleneck of LSTM was less the recurrent paradigm itself and more the specific choice of gating and storage. xLSTM was framed by its authors as a direct attempt to repair the LSTM rather than to replace it, asking how far one could push the classical cell when armed with modern gating, normalization, parallel formulations, and contemporary training infrastructure.^[1]

The work was anchored at the Institute for Machine Learning at Johannes Kepler University Linz, where Hochreiter has led research since 2018 and headed the LIT AI Lab since 2017, and at NXAI, an Austrian AI lab co-founded by Hochreiter, Netural X, and PIERER Digital Holding in December 2023 to develop European foundation models around the xLSTM idea.^[11]^[13] The first arXiv preprint (2405.04517) appeared on May 7, 2024, with a revised v2 on December 6, 2024, and the paper was accepted as a spotlight presentation at NeurIPS 2024.^[1]^[14] Reference code was open-sourced under Apache 2.0 at github.com/NX-AI/xlstm as a pip-installable Python package shortly after publication.^[2]^[15]

Technical details

Exponential gating

The xLSTM paper's central modification to the LSTM equations is to replace the sigmoid in the input and forget gates with an exponential. In an sLSTM cell, the pre-activation values for the input gate $\tilde{i}_t$ and forget gate $\tilde{f}_t$ are computed from the same linear projection of the input and previous hidden state used in a standard LSTM, but the gates themselves are $i_t = \exp(\tilde{i}_t)$ and $f_t = \exp(\tilde{f}_t)$, with a sigmoid forget gate still permitted as an alternative.^[1]^[16] Because $\exp$ is unbounded above, two additions are necessary to keep training stable.

A normalizer state $n_t$ is added to the cell. It tracks a running estimate of the cumulative input gate contributions, satisfying $n_t = f_t \cdot n_{t-1} + i_t$, so that the hidden state output is computed as $h_t = o_t \odot (c_t / n_t)$, where $o_t$ is the output gate. This normalization, in effect, divides each retrieval by the sum of write magnitudes that contributed to the current memory, mimicking the role of the softmax denominator in attention.^[1]^[16]

A stabilizer state $m_t$ prevents numerical overflow in the exponentials by tracking the running maximum log gate value. The paper defines $m_t = \max(\log(f_t) + m_{t-1}, \log(i_t))$ and then computes the actual gates after subtracting $m_t$ in log space, yielding $i'_t = \exp(\log(i_t) - m_t)$ and $f'_t = \exp(\log(f_t) + m_{t-1} - m_t)$. This is structurally identical to the safe softmax trick that subtracts the maximum logit before exponentiating, and it lets exponential gates be used safely even at long sequence lengths.^[1]^[16]
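The normalizer and stabilizer recurrences above can be sketched for a single scalar cell as follows. This is a minimal illustration rather than the reference kernel: the function name `slstm_step`, the `(c, n, m)` state tuple, and passing gate pre-activations in directly (instead of computing them from learned projections) are assumptions made for the example. Initialising the stabilizer to negative infinity is also a choice made here so that the first write needs no special case.

```python
import math

def slstm_step(state, i_tilde, f_tilde, z, o):
    """One stabilized sLSTM update for a scalar cell (illustrative sketch).

    state   : (c, n, m) = cell state, normalizer, stabilizer
    i_tilde : input-gate pre-activation, i.e. log(i_t)
    f_tilde : forget-gate pre-activation, i.e. log(f_t) for an exponential forget gate
    z, o    : candidate value and output gate, assumed already activated
    """
    c_prev, n_prev, m_prev = state
    # m_t = max(log f_t + m_{t-1}, log i_t)
    m = max(f_tilde + m_prev, i_tilde)
    # Stabilized gates: both exponents are <= 0, so exp cannot overflow.
    i_stab = math.exp(i_tilde - m)
    f_stab = math.exp(f_tilde + m_prev - m)
    c = f_stab * c_prev + i_stab * z
    n = f_stab * n_prev + i_stab
    # The common factor exp(-m) cancels in the ratio c / n.
    h = o * (c / n)
    return (c, n, m), h

# Pre-activations this large would overflow a direct math.exp call,
# but the stabilized update stays finite.
state = (0.0, 0.0, float("-inf"))
state, h_first = slstm_step(state, i_tilde=800.0, f_tilde=0.0, z=1.0, o=1.0)
state, h_second = slstm_step(state, i_tilde=801.0, f_tilde=0.0, z=3.0, o=1.0)
```

Because the same factor scales both `c` and `n`, the output matches what the unstabilized recurrences would give whenever those can be evaluated without overflow.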

The practical effect of exponential gating is that the cell can now revise stored values. If a later token deserves more weight than an earlier token, an exponentially larger input gate lets the new write outweigh the earlier one after normalization, something that the sigmoid gates of the original LSTM, bounded in the unit interval, cannot achieve.^[1]
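A short worked example may make the revision effect concrete. Unrolling the recurrences for $c_t$ and $n_t$ from zero initial states (the weights $w_{t,s}$ are notation introduced here, not taken from the paper) gives

$$\frac{c_t}{n_t} = \frac{\sum_{s=1}^{t} w_{t,s}\, z_s}{\sum_{s=1}^{t} w_{t,s}}, \qquad w_{t,s} = i_s \prod_{r=s+1}^{t} f_r .$$

In the special case $f_r = 1$ the weights reduce to $w_{t,s} = \exp(\tilde{i}_s)$, so the retrieved value is a softmax-weighted average of the candidates $z_s$. As an illustrative choice of numbers, take $z_1 = 1$, $z_2 = 3$, $\tilde{i}_1 = 0$, and $\tilde{i}_2 = 4$:

$$\frac{c_2}{n_2} = \frac{e^{0} \cdot 1 + e^{4} \cdot 3}{e^{0} + e^{4}} \approx 2.96 ,$$

so the second write almost completely replaces the first.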

sLSTM: scalar memory with head-wise mixing

The sLSTM cell retains the classical scalar cell state $c_t \in \mathbb{R}$ but adds exponential gating, the normalizer $n_t$, and the stabilizer $m_t$ described above. It also introduces a multi-head structure inspired by multi-head attention, but with one key restriction. The recurrent weight matrices that govern memory mixing are block-diagonal across heads, so each head mixes only within itself. There is no cross-head mixing within an sLSTM layer, which preserves an inductive bias for state tracking that the matrix memory of mLSTM, by virtue of being fully parallelizable, must give up.^[1]^[16]
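The block-diagonal restriction can be illustrated with a small NumPy sketch. The helper `block_diagonal_recurrent` and the toy sizes (4 heads of dimension 2) are hypothetical and chosen only for the example; they are not taken from the reference implementation.

```python
import numpy as np

def block_diagonal_recurrent(head_blocks):
    """Assemble per-head recurrent matrices into one block-diagonal matrix."""
    sizes = [block.shape[0] for block in head_blocks]
    total = sum(sizes)
    recurrent = np.zeros((total, total))
    offset = 0
    for block, size in zip(head_blocks, sizes):
        # Each head only gets weights connecting its own units.
        recurrent[offset:offset + size, offset:offset + size] = block
        offset += size
    return recurrent

rng = np.random.default_rng(0)
num_heads, head_dim = 4, 2
R = block_diagonal_recurrent(
    [rng.normal(size=(head_dim, head_dim)) for _ in range(num_heads)]
)

# Activate a single unit in head 0 of the previous hidden state.
h_prev = np.zeros(num_heads * head_dim)
h_prev[0] = 1.0
recurrent_input = R @ h_prev
# Only the units of head 0 receive a recurrent contribution.
```

In this toy setup, a perturbation in one head reaches the other heads only through layers outside the recurrent matrix, which is the sense in which mixing is head-wise.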

This design lets sLSTM solve problems that require true sequential state tracking, such as parity computation and certain context-free language tasks. The paper argues that sLSTM occupies a position in the architecture-power hierarchy that fully parallelizable models (including transformers, Mamba, and mLSTM) cannot reach, which is why a small fraction of sLSTM blocks is mixed in even when most blocks are mLSTM.^[1]

mLSTM: matrix memory and the covariance update rule

The mLSTM cell discards scalar storage entirely and replaces $c_t$ with a matrix $C_t \in \mathbb{R}^{d \times d}$. At each timestep the input is projected to a query $q_t$, a key $k_t$, and a value $v_t$ (sharing notation with attention), and the memory is updated by an outer-product, or covariance, rule

$$C_t = f_t \cdot C_{t-1} + i_t \cdot v_t k_t^\top$$

with a normalizer vector $n_t = f_t \cdot n_{t-1} + i_t \cdot k_t$.^[1]^[16] Retrieval at time $t$ computes $h_t = o_t \odot (C_t q_t) / \max(\lvert n_t^\top q_t \rvert, 1)$. LayerNorm is applied to keys and values before the outer product to keep their magnitudes controlled. Because there is no recurrent memory mixing across heads or time apart from the diagonal forget gate, the update is a linear function of the sequence and admits a parallel formulation: the entire matrix memory across a sequence can be computed as a sum of outer products weighted by cumulative gate products, which is amenable to chunkwise parallel scans on GPU.^[1]^[2]^[16]
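The equivalence between the recurrent update and the parallel form can be checked numerically. The sketch below is a simplified single-head version under stated assumptions: the output gate and the stabilizer $m_t$ are omitted, keys are not rescaled, the forget gate is a sigmoid, and the function names `mlstm_recurrent` and `mlstm_parallel` are invented for the example. It is not the chunkwise kernel used in practice.

```python
import numpy as np

def mlstm_recurrent(q, k, v, i_gate, f_gate):
    """Step-by-step mLSTM: C_t = f_t C_{t-1} + i_t v_t k_t^T, n_t = f_t n_{t-1} + i_t k_t."""
    T, d = q.shape
    C = np.zeros((d, d))
    n = np.zeros(d)
    out = np.zeros((T, d))
    for t in range(T):
        C = f_gate[t] * C + i_gate[t] * np.outer(v[t], k[t])
        n = f_gate[t] * n + i_gate[t] * k[t]
        out[t] = (C @ q[t]) / max(abs(n @ q[t]), 1.0)
    return out

def mlstm_parallel(q, k, v, i_gate, f_gate):
    """Same computation for all timesteps at once via a gate-weight matrix D."""
    log_f_cum = np.cumsum(np.log(f_gate))
    # decay[t, s] = prod_{r=s+1..t} f_r for s <= t (upper triangle is discarded).
    decay = np.exp(log_f_cum[:, None] - log_f_cum[None, :])
    D = np.tril(decay) * i_gate[None, :]
    # S[t, s] = (q_t . k_s) * D[t, s]; its row sums equal n_t . q_t.
    S = (q @ k.T) * D
    denom = np.maximum(np.abs(S.sum(axis=1)), 1.0)
    return (S @ v) / denom[:, None]

rng = np.random.default_rng(0)
T, d = 6, 4
q, k, v = (rng.normal(size=(T, d)) for _ in range(3))
i_gate = np.exp(rng.normal(size=T))                   # exponential input gate
f_gate = 1.0 / (1.0 + np.exp(-rng.normal(size=T)))    # sigmoid forget gate
h_rec = mlstm_recurrent(q, k, v, i_gate, f_gate)
h_par = mlstm_parallel(q, k, v, i_gate, f_gate)
```

The parallel version materializes a $T \times T$ matrix, much like attention, which is why long sequences are typically handled chunk by chunk, with the recurrent state carried between chunks.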

The covariance rule is mathematically related to fast weight programmers, linear attention, and modern linear RNNs, but xLSTM specifically combines it with exponential input gates and the $n_t$ and $m_t$ stabilization machinery, giving the matrix memory the ability to overwrite stored associations rather than merely accumulate them.^[1] On a multi-query associative recall benchmark in the paper, mLSTM is reported to handle up to 256 key-value pairs in a single sequence, outperforming Mamba and other sub-quadratic baselines at the same scale.^[1]^[16]

xLSTM blocks and stacking

The two cell types are embedded into residual blocks. The paper describes two block templates: a post up-projection block (typically used for sLSTM) that summarizes the past in the original embedding space and then projects up through a gated MLP, and a pre up-projection block (typically used for mLSTM) that first projects up, runs the recurrent core in the higher-dimensional space, and then projects back down. Both templates use pre-LayerNorm and a residual connection, mirroring the now-standard transformer block topology.^[1]^[16]

An xLSTM language model is then a stack of such blocks. The paper denotes configurations as xLSTM[$a$:$b$], meaning a stack with $a$ mLSTM blocks for every $b$ sLSTM blocks. xLSTM[1:0] is pure mLSTM. xLSTM[7:1] interleaves seven mLSTM blocks with one sLSTM block and was the most studied configuration in the paper. The motivation for any sLSTM at all is the state-tracking inductive bias described above; the motivation to keep most blocks as mLSTM is parallel training and matrix capacity.^[1]^[16]
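The ratio notation can be made concrete with a small helper that lays out block types for a given depth. The function `xlstm_block_pattern` is hypothetical, and the placement it uses (repeating groups of mLSTM blocks followed by sLSTM blocks) is an assumption for illustration; the notation itself only fixes the ratio, not where the sLSTM blocks sit in the stack.

```python
def xlstm_block_pattern(a, b, num_blocks):
    """Return block types for an xLSTM[a:b] stack of the given depth.

    Assumption: blocks repeat in groups of `a` mLSTM blocks followed by
    `b` sLSTM blocks. Actual placements may differ.
    """
    if a < 0 or b < 0 or a + b == 0:
        raise ValueError("a and b must be non-negative and not both zero")
    group = ["mLSTM"] * a + ["sLSTM"] * b
    return [group[index % len(group)] for index in range(num_blocks)]

pure_mlstm = xlstm_block_pattern(1, 0, 4)
mixed = xlstm_block_pattern(7, 1, 16)
```

Under this layout, a 16-block xLSTM[7:1] stack contains two sLSTM blocks, one at the end of each group of eight.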

Reference implementation and kernels

The reference implementation at NX-AI/xlstm contains an xLSTMBlockStack module intended as a drop-in transformer replacement and an xLSTMLMModel wrapper for autoregressive language modeling. The repository ships both native PyTorch and Triton kernel implementations of mLSTM and includes a custom CUDA extension for sLSTM that requires GPUs of compute capability 8.0 or higher (Ampere or newer). The recommended development environment is PyTorch 2.4 with CUDA 12.4. The package is published as xlstm on PyPI under the Apache 2.0 license.^[2]^[15] A separate xlstm-jax repository contains a JAX implementation used for the 7B model release.^[17]

Scaling experiments in the original paper

The May 2024 paper reports two main classes of experiments on synthetic and natural language data.

Synthetic and formal language tasks

To probe state tracking and storage capacity in isolation, the paper evaluates xLSTM[1:0] (pure mLSTM), xLSTM[1:1] (an even mix), and xLSTM[0:1] (pure sLSTM) on tasks including multi-query associative recall, parity, modular arithmetic, and formal-language tasks. mLSTM-heavy stacks dominate the associative recall task, while sLSTM-heavy stacks dominate the parity and other state-tracking tasks, consistent with the theoretical argument that fully parallelizable models cannot solve certain state-tracking problems whereas sequentially mixing scalar memories can.^[1]^[16]

Validation perplexity on SlimPajama

The first language-modeling experiment trains 350M-parameter models on 15 billion tokens of the SlimPajama dataset and compares validation perplexity across thirteen architectures: GPT-3 style transformer, Llama, H3, Mamba, RWKV-4, RWKV-5, RWKV-6, GLA, HGRN2, RetNet, Hyena, xLSTM[1:0], and xLSTM[7:1].^[1]^[16]^[18] Both xLSTM variants attain the lowest perplexities; the paper highlights that xLSTM[7:1] in particular outperforms all other models. The 15B token sweep is paired with a parameter sweep at 125M, 350M, 760M, 1.3B, and 2.7B that fits scaling-law style curves and shows xLSTM curves below those of Llama and Mamba across the studied range.^[1]^[16]

Larger token budget

A second set of runs trains 1.3B-parameter models on 300 billion tokens of SlimPajama, matching the token budget used in earlier comparisons of Mamba and Griffin. The paper reports that xLSTM[1:0] and xLSTM[7:1] achieve lower perplexity than Llama, Mamba, and RWKV-4 at the same compute, and that xLSTM[1:0] sustains higher inference throughput than Mamba because the mLSTM update does not require the selective scan.^[1]^[16]

Long context

To test sequence-length extrapolation, models trained at context length 2048 are evaluated up to 16384 tokens without any retraining. xLSTM maintains roughly flat per-token perplexity across the longer contexts, whereas transformer baselines diverge once they exceed their training context.^[1]^[16] The paper attributes this to the recurrent state, which has fixed size and therefore does not encode any explicit length-dependent positional structure.

Downstream zero-shot benchmarks

The 1.3B xLSTM models are evaluated zero-shot on standard reasoning and commonsense benchmarks including PIQA, HellaSwag, LAMBADA, ARC-Easy, ARC-Challenge, WinoGrande, and OpenBookQA, plus the PALOMA suite of language-modeling perplexities across domains.^[1]^[16] xLSTM ranks first on PALOMA among the recurrent baselines and is competitive with the transformer.

xLSTM 7B

NXAI scaled the recipe to seven billion parameters in a follow-up release described in the paper "xLSTM 7B: A Recurrent LLM for Fast and Efficient Inference" (Beck, Pöppel, Lippe, Kurle, Blies, Klambauer, Böck, Hochreiter), posted to arXiv on March 17, 2025 and accepted at ICML 2025.^[4]^[15] The 7B model uses an optimized architecture variant, labeled xLSTM Large in the public repository, that drops the sLSTM blocks and uses only mLSTM in a stack designed for maximum training throughput and inference efficiency.^[2]^[4]

xLSTM 7B specification	Value
Parameters	7 billion
Training tokens	approximately 2.3 trillion^[5]
Training data	DCLM-Baseline plus selected high-quality sources^[5]
Training framework	xlstm-jax (JAX)^[17]
Released by	NXAI, Linz, Austria^[4]
Release	March 2025^[4]
License	NXAI Community License^[5]
Block type	mLSTM only (sLSTM removed)^[2]

The model card on Hugging Face reports lighteval Leaderboard v1 scores of 0.584 on ARC-Challenge (25-shot), 0.589 on MMLU (5-shot), 0.710 on HellaSwag (10-shot), 0.742 on WinoGrande (5-shot), 0.420 on TruthfulQA, 0.817 on PIQA, and 0.443 on OpenBookQA, with much weaker performance on math (GSM8K at 0.004 in the base model without instruction tuning).^[5] Hochreiter, quoted on the NXAI release page in his role as Chief Scientist, described the model as the strongest large RNN-based language model of its time and highlighted targeted use cases in robotics, automotive, and medical technology.^[4]

The accompanying inference kernels, distributed as the mlstm_kernels package alongside the main repository, are written in Triton and target H100-class accelerators. The xlstm-jax codebase contains the full distributed training pipeline used to produce the model.^[17] The 7B paper emphasizes that, unlike transformer language models, the time and memory cost of decoding a token in xLSTM 7B is constant in the context length, making it well suited to long-output reasoning loops where transformer KV caches grow with every generated token.^[4]
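The contrast between a growing KV cache and a fixed-size recurrent state can be put into rough numbers. The sketch below is back-of-the-envelope arithmetic only: the helper names and every dimension used (layer count, head count, head sizes, 2 bytes per value) are hypothetical and are not the configuration of xLSTM 7B or of any particular transformer.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, context_len, bytes_per_value=2):
    """Memory for cached keys and values of one sequence in a transformer."""
    # Two tensors (keys, values) per layer, each context_len x num_kv_heads x head_dim.
    return 2 * num_layers * num_kv_heads * head_dim * context_len * bytes_per_value

def mlstm_state_bytes(num_layers, num_heads, qk_dim, v_dim, bytes_per_value=2):
    """Memory for the recurrent state of one sequence in an mLSTM-only stack."""
    # Per head: matrix memory C (v_dim x qk_dim), normalizer n (qk_dim), stabilizer m.
    per_head = v_dim * qk_dim + qk_dim + 1
    return num_layers * num_heads * per_head * bytes_per_value

# Hypothetical transformer: 32 layers, 32 KV heads of dimension 128.
cache_4k = kv_cache_bytes(32, 32, 128, context_len=4096)
cache_32k = kv_cache_bytes(32, 32, 128, context_len=32768)

# Hypothetical mLSTM stack: 32 layers, 8 heads, 256-dim keys, 512-dim values.
state = mlstm_state_bytes(32, 8, qk_dim=256, v_dim=512)
```

The cache size scales linearly with `context_len`, whereas `mlstm_state_bytes` has no context-length argument at all, which is the property the paper highlights.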

Variants and follow-up work

Vision-LSTM

Vision-LSTM (ViL), introduced by Benedikt Alkin, Maximilian Beck, Korbinian Pöppel, Sepp Hochreiter, and Johannes Brandstetter in arXiv:2406.04303 (June 2024, ICLR 2025), adapts the xLSTM block as a backbone for ImageNet-scale vision tasks.^[6] Like Vision Mamba, ViL treats an image as a sequence of patch tokens and processes them with alternating directions: odd-indexed blocks scan top-to-bottom, even-indexed blocks bottom-to-top, providing the bidirectional context that pure unidirectional recurrence lacks. The reported configurations include ViL-T (6M parameters, 192 latent dim, 24 blocks), ViL-S (23M, 384), and ViL-B (89M, 768), all using 16x16 patches. ImageNet-1K top-1 accuracies are 78.3 percent for ViL-T, 81.5 percent for ViL-S, and 82.4 percent for ViL-B, with ViL-T exceeding DeiT-III-T at 76.2 percent and ViL-S matching DeiT-III-S at 81.4 percent. ViL-B trails DeiT-III-B (83.7 percent) at base scale.^[6]^[19]
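The alternating scan order can be sketched without any model code. In the example below, `causal_block` is a stand-in for a unidirectional recurrent block (a running sum, so position $t$ only sees positions up to $t$), and `alternating_scan` is a hypothetical helper; neither is part of the ViL codebase.

```python
from itertools import accumulate

def causal_block(tokens):
    """Stand-in for a unidirectional recurrent block: a running sum."""
    return list(accumulate(tokens))

def alternating_scan(tokens, blocks):
    """Apply blocks in order, reversing the sequence for even-indexed blocks.

    Blocks are counted from 1: odd blocks scan first-to-last patch,
    even blocks scan last-to-first patch.
    """
    x = list(tokens)
    for index, block in enumerate(blocks, start=1):
        if index % 2 == 0:
            # Flip, run the causal block, then flip back to the original order.
            x = block(x[::-1])[::-1]
        else:
            x = block(x)
    return x

# A signal placed only in the last patch.
patches = [0, 0, 0, 1]
after_one_block = alternating_scan(patches, [causal_block])
after_two_blocks = alternating_scan(patches, [causal_block, causal_block])
```

After a single forward block the first patch has seen nothing of the last one; after the second, reversed block every position has, which is how stacking alternating directions recovers bidirectional context.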

Large Recurrent Action Model (LRAM)

LRAM, by Schmied, Adler, Patil, Beck, Pöppel, Brandstetter, Klambauer, Pascanu, and Hochreiter, builds an xLSTM-based action model in the style of the decision transformer, evaluated on 432 tasks across six robotics domains.^[7]^[20] The motivation is the same as in language modeling: transformer-based action models pay quadratic decoding cost during real-time control, whereas an xLSTM at the core has linear-time inference and natural length extrapolation. The paper reports task performance competitive with a transformer baseline at substantially faster inference, particularly at longer context windows. It was presented at the NeurIPS 2024 OWA workshop on open-world agents and is open sourced at ml-jku/LRAM.^[7]^[20]

Time-series variants

Two notable time-series adaptations followed. xLSTMTime (arXiv:2407.10240) reuses the xLSTM block with patch-style tokenization for long-horizon time-series forecasting and reports improvements over DLinear and PatchTST on standard datasets (ETTh, ETTm, weather, traffic), including gains over DLinear of around 18 percent at horizon 96 and 13 percent at horizon 192 on the weather dataset.^[8] xLSTM-Mixer (arXiv:2410.16928) applies the sLSTM scalar memory in a mixer-style architecture for multivariate forecasting and reports MSE competitive with or superior to a broad set of baselines on the GIFT-Eval benchmark.^[9]

Other adaptations

Several additional papers in 2024 and 2025 have used xLSTM as a building block: U-VixLSTM for medical 3D image segmentation, MAL (cluster-masked and multi-task pretraining) for vision representation learning, and various exploratory applications in speech, finance, and combined transformer hybrid stacks.^[21]^[22] The repository ecosystem also includes xlstm-jax and mlstm_kernels for performance-critical deployments, both maintained by NXAI alongside the main package.^[17]^[2]

Implementations and ecosystem

The official implementations span three repositories under the NX-AI GitHub organization.^[15]

Repository	Purpose	Framework
NX-AI/xlstm	Reference research code, xLSTM blocks and language model wrapper, Triton/CUDA kernels^[2]	PyTorch
NX-AI/xlstm-jax	Production training code used for xLSTM 7B^[17]	JAX
NX-AI/mlstm_kernels	Optimized inference kernels for mLSTM, targeting H100^[2]	Triton / CUDA

The package is distributed on PyPI as xlstm under Apache 2.0. Pretrained weights of the 7B model are hosted on Hugging Face under the NXAI Community License, with about 4,800 downloads in the month before the most recent snapshot of the model card.^[5]^[15] The repository documentation lists PyTorch 2.4 with CUDA 12.4 as the recommended environment, and the sLSTM kernel additionally requires compute capability 8.0 or higher.^[2]

The release pattern parallels the rollout of Mamba and RWKV open-source ecosystems: a primary paper plus a reference implementation, a scaled-up model with separate paper and weights release, and supplementary kernel work to make inference competitive on commodity GPUs.^[15]^[17]

Applications

Because xLSTM offers transformer-like training parallelism plus constant-cost decoding, the natural application domains are those where transformer KV caches become an inference bottleneck or where input sequences extend well beyond standard context windows. NXAI has publicly emphasized industrial, robotics, automotive, and medical use cases.^[4]

In language modeling, xLSTM 7B is presented as an inference-efficient alternative to Llama and Mamba at the same parameter count, particularly attractive for batch decoding of long completions where transformer memory grows linearly with sequence length.^[4] In robotics, the LRAM line of work positions xLSTM at the core of large action models for real-time control, where latency dominates.^[7] In computer vision, ViL serves as a generic backbone for classification and downstream transfer learning, with attention-free scaling characteristics similar to Vision Transformer but with linear-time blocks.^[6] In time-series forecasting, xLSTMTime and xLSTM-Mixer leverage the matrix and scalar memories respectively for long forecast horizons that exceed typical transformer training lengths.^[8]^[9]

Limitations and criticisms

The xLSTM line of work has acknowledged constraints. The sLSTM cell, by retaining recurrent memory mixing within each head via block-diagonal matrices, is not parallelizable along the sequence dimension and is generally slower to train than the mLSTM. As a consequence, the production xLSTM 7B drops sLSTM entirely from its block stack, trading away the parity and state-tracking capabilities that motivated the dual-cell design in the original paper.^[2]^[4]

The mLSTM matrix memory has size $d \times d$ per head, where $d$ is the head dimension. This is fixed at training time and cannot be enlarged at inference, unlike the KV cache of a transformer, which grows with the context. Models can therefore saturate on tasks that require recall of more independent key-value pairs than the matrix dimension can hold, although the paper reports favorable scaling up to 256 keys at the studied scales.^[1]^[16]

Downstream benchmark scores for xLSTM 7B, while competitive with mid-2024 7B transformers on commonsense reasoning, lag the strongest 2025 7B transformer instruction-tuned models on math and on instruction-following benchmarks; the model card itself reports a GSM8K score of 0.004 in the base model and 0.244 on IfEval, both well below leading 7B instruction-tuned baselines.^[5] The model is released as a base model intended for further fine-tuning, which partly accounts for these gaps.^[4]^[5]

Finally, the broader competitive landscape for sub-quadratic architectures continues to evolve. Mamba 2 and newer hybrid attention models have closed quality gaps relative to xLSTM in some 2024 to 2025 evaluations, and direct head-to-head benchmarks at identical training budgets remain comparatively scarce in the literature, with most public comparisons drawn from the xLSTM authors' own ablation tables.^[1]^[4]

Related work

Architecture	Year	Memory mechanism	Parallel training	Decoding cost per token
LSTM	1997	Scalar cell, sigmoid gates	No	O(1) state, sequential
Transformer	2017	Attention over full sequence	Yes	O(n) (KV cache grows)
Retentive Network (RetNet)	2023	Multi-scale retention	Yes (parallel form)	O(1)
Mamba	2023	Selective SSM with parallel scan	Yes	O(1)
RWKV	2023+	Linear attention with recurrent rewrite	Yes	O(1)
sLSTM (in xLSTM)	2024	Scalar memory + exponential gates + head-wise mixing	Partial	O(1)
mLSTM (in xLSTM)	2024	Matrix memory + covariance update	Yes	O(1), fixed-size matrix

The xLSTM block sits closest to linear attention and to Mamba in the design space, sharing the recurrent rewrite-based parallel training pattern, while differing from both in its explicit covariance update and exponential gating. The full xLSTM paper compares directly against H3, GLA, HGRN2, RetNet, Hyena, Mamba, and three generations of RWKV on shared validation perplexity and benchmark suites.^[1]^[16]

References

1. Beck, Pöppel, Spanring, Auer, Prudnikova, Kopp, Klambauer, Brandstetter, Hochreiter, "xLSTM: Extended Long Short-Term Memory", arXiv, 2024-05-07 (v1) and 2024-12-06 (v2). https://arxiv.org/abs/2405.04517. Accessed 2026-05-20.
2. NX-AI, "NX-AI/xlstm: Official repository of the xLSTM", GitHub, 2024-05. https://github.com/NX-AI/xlstm. Accessed 2026-05-20.
3. NX-AI, "xLSTM: Extended Long Short-Term Memory (OpenReview discussion)", OpenReview, 2024-09-25. https://openreview.net/forum?id=ARAxPPIAhq. Accessed 2026-05-20.
4. NXAI, "xLSTM 7B: NXAI releases its new xLSTM 7B model", NXAI News, 2025-03. https://www.nx-ai.com/en/news/xlstm-7b-nxai-releases-its-new-xlstm-7b-model. Accessed 2026-05-20.
5. NX-AI, "NX-AI/xLSTM-7b model card", Hugging Face, 2025. https://huggingface.co/NX-AI/xLSTM-7b. Accessed 2026-05-20.
6. Alkin, Beck, Pöppel, Hochreiter, Brandstetter, "Vision-LSTM: xLSTM as Generic Vision Backbone", arXiv, 2024-06-06. https://arxiv.org/abs/2406.04303. Accessed 2026-05-20.
7. Schmied, Adler, Patil, Beck, Pöppel, Brandstetter, Klambauer, Pascanu, Hochreiter, "A Large Recurrent Action Model: xLSTM enables Fast Inference for Robotics Tasks", arXiv, 2024-10-29. https://arxiv.org/abs/2410.22391. Accessed 2026-05-20.
8. Alharthi, Mahmood, "xLSTMTime: Long-term Time Series Forecasting With xLSTM", arXiv, 2024-07-14. https://arxiv.org/abs/2407.10240. Accessed 2026-05-20.
9. Kraus et al., "xLSTM-Mixer: Multivariate Time Series Forecasting by Mixing via Scalar Memories", arXiv, 2024-10-22. https://arxiv.org/abs/2410.16928. Accessed 2026-05-20.
10. Hochreiter, Schmidhuber, "Long Short-Term Memory", Neural Computation 9(8), 1735-1780, 1997-11. https://doi.org/10.1162/neco.1997.9.8.1735. Accessed 2026-05-20.
11. Wikipedia contributors, "Sepp Hochreiter", Wikipedia, 2026. https://en.wikipedia.org/wiki/Sepp_Hochreiter. Accessed 2026-05-20.
12. JKU Linz, "AI Made in Europe: Entrepreneurial Reinforcement for Leading Researcher Sepp Hochreiter and His xLSTM", JKU LIT Open Innovation Center, 2023-12. https://www.jku.at/en/lit-open-innovation-center/news-events/news/detail/news/ai-made-in-europe-spitzenforscher-sepp-hochreiter-und-sein-xlstm-erhalten-unternehmerische-verstaerkung-fuer-europaeisches-large-language-model/. Accessed 2026-05-20.
13. Invest in Austria, "NXAI: Pioneering Industrial AI Solutions from Austria", Austrian Business Agency, 2024. https://investinaustria.at/en/blog/nxai-top-researchers-in-austria-develop-solutions-for-industrial-companies/. Accessed 2026-05-20.
14. NeurIPS, "xLSTM: Extended Long Short-Term Memory (NeurIPS 2024 poster page)", NeurIPS, 2024-12. https://neurips.cc/virtual/2024/poster/96260. Accessed 2026-05-20.
15. NX-AI organization, "NX-AI repositories", GitHub, 2024-2025. https://github.com/NX-AI. Accessed 2026-05-20.
16. Sapunov, "xLSTM: Extended Long Short-Term Memory", gonzoML Substack, 2024-05. https://gonzoml.substack.com/p/xlstm-extended-long-short-term-memory. Accessed 2026-05-20.
17. NX-AI, "NX-AI/xlstm-jax: Official JAX implementation of xLSTM", GitHub, 2025-03. https://github.com/NX-AI/xlstm-jax. Accessed 2026-05-20.
18. KDnuggets, "LSTMs Rise Again: Extended-LSTM Models Challenge the Transformer Superiority", KDnuggets, 2024-06. https://www.kdnuggets.com/lstms-rise-again-extended-lstm-models-challenge-the-transformer-superiority. Accessed 2026-05-20.
19. Alkin et al., "Vision-LSTM: xLSTM as Generic Vision Backbone (HTML version)", arXiv, 2024-06. https://arxiv.org/html/2406.04303v2. Accessed 2026-05-20.
20. NeurIPS, "A Large Recurrent Action Model: xLSTM enables Fast Inference for Robotics Tasks", NeurIPS 2024 OWA Workshop, 2024-12. https://neurips.cc/virtual/2024/100902. Accessed 2026-05-20.
21. Authors of MAL, "MAL: Cluster-Masked and Multi-Task Pretraining for Enhanced xLSTM Vision Performance", arXiv, 2024-12-14. https://arxiv.org/abs/2412.10730. Accessed 2026-05-20.
22. Dutta et al., "Are Vision xLSTM Embedded UNet More Reliable in Medical 3D Image Segmentation?", arXiv, 2024-06-24. https://arxiv.org/abs/2406.16993. Accessed 2026-05-20.
