See also: Transformer, Recurrent neural network, Deep learning
A state space model (SSM) in deep learning is a class of sequence models derived from continuous-time state space equations that are discretized for use in neural networks. SSMs map an input sequence to an output sequence through a latent state vector, drawing on decades of theory from control systems and signal processing. In the context of modern machine learning, the term most commonly refers to structured state space models, a family of architectures that began with the S4 model introduced by Albert Gu, Karan Goel, and Christopher Ré at Stanford University in 2021 [1].
SSMs have attracted intense interest because they offer an alternative to the Transformer architecture that scales linearly with sequence length rather than quadratically. While Transformers rely on attention mechanisms that compute pairwise interactions between all tokens (yielding O(n^2) complexity), SSMs process sequences through a fixed-size recurrent state, achieving O(n) complexity for both training and inference. This property makes SSMs particularly appealing for tasks involving very long sequences, such as genomics, audio modeling, and long-document understanding.
The most influential SSM architecture to date is Mamba, introduced by Albert Gu and Tri Dao in December 2023 [2]. Mamba introduced selective state spaces, a mechanism that makes SSM parameters input-dependent, and combined this with a hardware-aware implementation optimized for modern GPUs. As of early 2026, SSMs and Transformer-SSM hybrids represent the most active area of research in efficient sequence modeling, with production deployments by NVIDIA, AI21 Labs, Google DeepMind, and others.
State space models are rooted in the classical state space representation from control theory. A continuous-time, linear, time-invariant (LTI) system is defined by four matrices (A, B, C, D) and two equations:

x'(t) = A x(t) + B u(t)
y(t) = C x(t) + D u(t)
Here, u(t) is the input signal, x(t) is the hidden state vector of dimension N, and y(t) is the output signal. The matrix A (of size N x N) governs how the state evolves over time. B (N x 1) controls how the input influences the state. C (1 x N) maps the state to the output. D (1 x 1) provides a direct skip connection from input to output and is often omitted or set to zero in practice.
Since deep learning operates on discrete sequences rather than continuous signals, the continuous system must be discretized. Given a step size delta, the continuous matrices are converted to discrete counterparts using a discretization rule. The most common approach is the zero-order hold (ZOH):

A_bar = exp(delta A)
B_bar = (delta A)^(-1) (exp(delta A) - I) (delta B)
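As a concrete sketch, ZOH discretization can be computed in a few lines for the diagonal state matrices used by modern SSMs such as S4D and Mamba (for a diagonal A, the matrix exponential and inverse reduce to elementwise operations). The function name and single-input shapes here are illustrative, not from any particular library:

```python
import numpy as np

def discretize_zoh_diag(A_diag, B, delta):
    """ZOH discretization for a diagonal state matrix (as in S4D/Mamba).

    A_diag : (N,) diagonal entries of A (typically with negative real part)
    B      : (N,) input matrix, single-input case
    delta  : scalar step size
    """
    A_bar = np.exp(delta * A_diag)         # exp(delta * A), elementwise for diagonal A
    B_bar = (A_bar - 1.0) / A_diag * B     # (delta A)^{-1} (exp(delta A) - I) delta B
    return A_bar, B_bar
```

For small delta, A_bar approaches I + delta A and B_bar approaches delta B, recovering the simple Euler discretization.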
The discrete recurrence then becomes:

x_k = A_bar x_{k-1} + B_bar u_k
y_k = C x_k + D u_k
This recurrence can be unrolled as a convolution during training (enabling parallelism on GPUs) or executed step-by-step as a recurrence during inference (enabling constant memory per step). This dual view, convolution for training and recurrence for generation, is one of the key advantages of SSMs over both Transformers and traditional RNNs.
By unrolling the discrete recurrence, the output sequence y can be expressed as a convolution of the input sequence u with a kernel K:

K = (C B_bar, C A_bar B_bar, C A_bar^2 B_bar, ..., C A_bar^{L-1} B_bar)
y = u * K
This convolutional form allows efficient parallel computation using Fast Fourier Transforms (FFT) during training, achieving O(n log n) complexity. The ability to switch between convolutional mode (for training) and recurrent mode (for inference) is fundamental to all structured SSM architectures.
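The equivalence of the two modes can be demonstrated directly. The sketch below (illustrative, for a diagonal A_bar and scalar inputs; a real implementation would use FFTs rather than the explicit O(n^2) convolution shown here) computes the same output both ways:

```python
import numpy as np

def ssm_recurrent(A_bar, B_bar, C, u):
    """Step-by-step recurrence: x_k = A_bar x_{k-1} + B_bar u_k, y_k = C x_k."""
    x = np.zeros_like(B_bar)
    ys = []
    for u_k in u:
        x = A_bar * x + B_bar * u_k        # elementwise update for diagonal A_bar
        ys.append(np.dot(C, x))
    return np.array(ys)

def ssm_convolutional(A_bar, B_bar, C, u):
    """Same map as an explicit causal convolution with kernel K_j = C A_bar^j B_bar."""
    L = len(u)
    K = np.array([np.dot(C, (A_bar ** j) * B_bar) for j in range(L)])
    # causal convolution: y_k = sum_{j <= k} K_j * u_{k-j}
    return np.array([np.dot(K[: k + 1], u[k::-1]) for k in range(L)])
```

Both functions return identical outputs; the recurrent form needs only O(1) state per step, while the convolutional form exposes the parallelism used during training.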
The S4 (Structured State Space sequence) architecture, published by Gu, Goel, and Ré in late 2021 in the paper "Efficiently Modeling Long Sequences with Structured State Spaces" [1], was the breakthrough that made SSMs competitive with Transformers on a range of sequence modeling tasks. The paper was presented at ICLR 2022.
A critical ingredient in S4 is the HiPPO (High-order Polynomial Projection Operators) initialization, developed by Gu et al. in 2020 [3]. HiPPO provides a principled way to initialize the state matrix A so that the hidden state maintains a compressed representation of the input history. Specifically, HiPPO derives special matrices that, when used as A, cause the state vector to store coefficients of an optimal polynomial approximation to the input signal seen so far.
The HiPPO-LegS (Legendre Scaled) variant uses a particular matrix that projects the input history onto a basis of scaled Legendre polynomials. This initialization proved empirically essential for S4's ability to capture long-range dependencies. Without HiPPO, SSMs tend to forget distant information rapidly. With it, they can maintain useful representations across thousands or even tens of thousands of time steps.
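The HiPPO-LegS matrix has a simple closed form. The sketch below constructs it under one common sign convention (the SSM uses the negated matrix as its state matrix A); this is a minimal illustration, not the full S4 initialization pipeline:

```python
import numpy as np

def hippo_legs(N):
    """HiPPO-LegS matrix, negated for use as an SSM state matrix.

    Entry (n, k) is sqrt(2n+1)*sqrt(2k+1) below the diagonal and
    n+1 on the diagonal; the matrix is lower triangular.
    """
    A = np.zeros((N, N))
    for n in range(N):
        for k in range(N):
            if n > k:
                A[n, k] = np.sqrt(2 * n + 1) * np.sqrt(2 * k + 1)
            elif n == k:
                A[n, k] = n + 1
    return -A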
Naively computing the SSM convolution kernel requires materializing powers of the N x N matrix A, which is prohibitively expensive. S4 overcame this through a structured parameterization that constrains A to be a diagonal plus low-rank (DPLR) matrix. This structure allows the convolution kernel to be computed efficiently via a Cauchy kernel, reducing the cost from O(N^2) to O(N). Combined with FFT-based convolution, S4 achieves O(n log n) total complexity for a sequence of length n.
The result was dramatic. S4 achieved 91% accuracy on sequential CIFAR-10 (processing images one pixel at a time) and was the first model to solve the Path-X task, a synthetic benchmark requiring reasoning over sequences of length 16,384 that all prior architectures, including Transformers, had failed on [1].
Several refinements followed S4 quickly. DSS and S4D showed that a purely diagonal state matrix, properly initialized, matches the performance of the full diagonal-plus-low-rank parameterization while being much simpler to implement, and S5 replaced the bank of single-input systems with a single multi-input, multi-output SSM computed via a parallel scan.
Mamba was introduced by Albert Gu (Carnegie Mellon University) and Tri Dao (Princeton University) in December 2023, in the paper "Mamba: Linear-Time Sequence Modeling with Selective State Spaces" [2]. It addressed a fundamental limitation of all prior SSMs: their linear time-invariance.
In S4 and its variants, the matrices A, B, C, and the step size delta are fixed parameters that do not change based on the input. This makes the system linear and time-invariant (LTI), which is precisely what allows the efficient convolutional computation. However, LTI systems treat all inputs identically. They cannot selectively attend to or ignore specific tokens based on content. This is a severe limitation for language modeling, where the relevance of a token depends entirely on context.
Consider a simple example: given the sentence "The capital of France is Paris," an LTI system processes "The," "capital," "of," "France," "is," and "Paris" through the same fixed dynamics. It has no mechanism to recognize that "France" is the key piece of information for determining the output "Paris." Transformers handle this naturally through content-based attention, but prior SSMs could not.
Mamba's core innovation is making the SSM parameters input-dependent (selective). Specifically, the matrices B, C, and the step size delta become functions of the current input token:

B_t = Linear_B(u_t)
C_t = Linear_C(u_t)
delta_t = softplus(Linear_delta(u_t))

(The state matrix A itself remains input-independent, though its discretized form A_bar depends on delta_t.)
This selectivity allows the model to control what information enters the state, what information is retrieved from the state, and how quickly the state transitions, all conditioned on the current input. In effect, a large delta causes the model to focus on the current input and reset the state, while a small delta causes it to retain the existing state and largely ignore the current input.
This mechanism provides functionality analogous to the gating in LSTM networks or the content-based filtering in attention, but within the SSM framework.
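A sequential reference implementation of the selective scan makes the mechanism concrete. The sketch below is illustrative: the projection matrices W_B, W_C, W_delta are hypothetical stand-ins for Mamba's learned linear layers, and it uses Mamba's simplified Euler-style discretization of B (B_bar = delta * B) alongside ZOH for A:

```python
import numpy as np

def selective_scan(u, A, W_B, W_C, W_delta):
    """Sequential selective-SSM reference (illustrative shapes and names).

    u : (L, D) input sequence
    A : (N,) fixed diagonal state matrix (input-independent)
    """
    L, D = u.shape
    N = A.shape[0]
    x = np.zeros((D, N))                          # one state vector per channel
    ys = np.zeros((L, D))
    for t in range(L):
        delta = np.logaddexp(0.0, u[t] @ W_delta)     # softplus, per channel: (D,)
        B_t = u[t] @ W_B                              # input-dependent B: (N,)
        C_t = u[t] @ W_C                              # input-dependent C: (N,)
        A_bar = np.exp(delta[:, None] * A[None, :])   # ZOH for diagonal A: (D, N)
        B_bar = delta[:, None] * B_t[None, :]         # simplified Euler-style B_bar
        x = A_bar * x + B_bar * u[t][:, None]         # selective state update
        ys[t] = x @ C_t                               # input-dependent readout
    return ys
```

Because delta, B, and C change at every step, the update can gate information in or out of the state token by token, which is exactly what the fixed dynamics of an LTI system cannot do.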
Making the parameters input-dependent breaks the LTI property, which means the efficient convolutional mode of S4 can no longer be used. Mamba compensates for this through a carefully designed hardware-aware algorithm that computes the selective SSM recurrence efficiently on modern GPUs.
The key insight is that the recurrence can be computed using a parallel scan algorithm. Gu and Dao developed a custom CUDA kernel that fuses the discretization, scan, and output computation into a single kernel; keeps the expanded state in fast on-chip SRAM rather than materializing it in GPU high-bandwidth memory; and recomputes intermediate states during the backward pass instead of storing them.
This approach is directly inspired by the IO-aware principles behind FlashAttention (also developed by Tri Dao). The result is that Mamba achieves true linear-time complexity, O(n), for both training and inference, with wall-clock speeds that are practical at scale.
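The reason a recurrence can be parallelized at all is that the update x_k = a_k x_{k-1} + b_k composes under an associative operator, so it can be evaluated in O(log n) parallel steps. The sketch below demonstrates this on scalars with a Hillis-Steele-style scan (a simplified illustration of the idea, not Mamba's CUDA kernel):

```python
import numpy as np

def scan_sequential(a, b):
    """x_k = a_k * x_{k-1} + b_k with x_{-1} = 0, computed step by step."""
    x, out = 0.0, []
    for a_k, b_k in zip(a, b):
        x = a_k * x + b_k
        out.append(x)
    return np.array(out)

def scan_parallel(a, b):
    """Same recurrence via the associative operator
    (a1, b1) o (a2, b2) = (a1*a2, a2*b1 + b2), combined tree-style."""
    def combine(p, q):
        return (p[0] * q[0], q[0] * p[1] + q[1])
    pairs = list(zip(a, b))
    n, step = len(pairs), 1
    while step < n:
        # descending i ensures pairs[i - step] is still the previous pass's value
        for i in range(n - 1, step - 1, -1):
            pairs[i] = combine(pairs[i - step], pairs[i])
        step *= 2
    return np.array([p[1] for p in pairs])
```

Each pass doubles the span each element summarizes, so the loop runs O(log n) times; on a GPU, each pass is one parallel step across the sequence.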
The full Mamba block combines the selective SSM with a gated architecture. Each block consists of: a linear projection that expands the model dimension (by a factor of two in the reference implementation), a short causal 1-D convolution, a SiLU activation, the selective SSM itself, a multiplicative gate computed from a parallel linear branch, and a final linear projection back to the model dimension.
Notably, Mamba does not use attention or explicit MLP blocks. The entire architecture is a simple stack of identical Mamba blocks with residual connections and RMSNorm.
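The block's data flow can be sketched in a few lines. This is a schematic forward pass with hypothetical weight names; the selective SSM is passed in as a stand-in callable, and real implementations fuse these steps into optimized kernels:

```python
import numpy as np

def silu(z):
    return z / (1.0 + np.exp(-z))

def mamba_block(u, params):
    """Schematic Mamba block (weight names and shapes are illustrative).

    1. project the input up to an expanded dimension, in two branches
    2. short causal conv + SiLU + selective SSM on the main branch
    3. multiply by the SiLU-gated second branch, project back down
    """
    h = u @ params["W_in_main"]          # (L, E) main branch
    g = u @ params["W_in_gate"]          # (L, E) gating branch
    # depthwise causal conv of width 4 (left padding preserves causality)
    k = params["conv_kernel"]            # (4, E)
    h_pad = np.vstack([np.zeros((3, h.shape[1])), h])
    h = np.stack([(h_pad[t:t + 4] * k).sum(axis=0) for t in range(len(u))])
    h = silu(h)
    h = params["ssm"](h)                 # selective SSM (stand-in callable here)
    y = h * silu(g)                      # multiplicative gate
    return y @ params["W_out"]           # back to the model dimension
```

Stacking many such blocks with residual connections and RMSNorm yields the full architecture described above.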
Mamba-3B outperformed Transformers of the same size and matched Transformers with twice as many parameters on language modeling benchmarks [2]. On inference, Mamba achieved 5x higher throughput than comparably sized Transformers because it does not require a KV cache. Its memory footprint during generation is constant regardless of sequence length, compared to the linearly growing KV cache of Transformers. Mamba also demonstrated strong results on audio modeling (SaShiMi benchmark) and DNA sequence modeling, with performance improving on real data up to million-length sequences.
In May 2024, Tri Dao and Albert Gu published "Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality" [5], introducing Mamba-2. This paper established a deep theoretical connection between SSMs and attention.
The central contribution is the Structured State Space Duality (SSD) framework, which proves that a structured SSM with a scalar-times-identity state matrix is mathematically equivalent to a form of masked self-attention using a 1-semiseparable causal mask. In other words, certain SSMs and certain attention mechanisms compute exactly the same function, just expressed through different mathematical decompositions.
This duality has both theoretical and practical implications. Theoretically, it unifies two seemingly disparate families of sequence models under a single mathematical framework. Practically, it means that SSM computations can be decomposed into matrix multiplications, enabling much more efficient use of GPU tensor cores.
Mamba-2 refines the selective SSM layer from Mamba-1 using insights from SSD. The key changes include: producing the SSM parameters (delta, B, C) in parallel at the start of the block rather than sequentially from the SSM input, which simplifies tensor parallelism; restricting A to a scalar times the identity, the structure required by the SSD equivalence; supporting much larger state dimensions than Mamba-1; and computing the recurrence through block-decomposed matrix multiplications that map well onto GPU tensor cores.
The result is a model that is 2-8x faster to train than Mamba-1 while maintaining competitive quality with Transformers on language modeling. Mamba-2 was presented at ICML 2024.
In early 2025, Gu and collaborators introduced Mamba-3, which builds on the SSM framework with three axes of improvement [6]: a more accurate trapezoidal discretization rule in place of zero-order hold, complex-valued state updates that restore expressivity lost with real diagonal states, and a MIMO (multi-input, multi-output) formulation that increases arithmetic intensity on modern hardware.
At the 1.5B parameter scale, Mamba-3 improved average downstream accuracy by 1.8 points over the prior best linear-complexity model (Gated DeltaNet) while achieving comparable perplexity to Mamba-2 with half the state size.
The comparison between SSMs and Transformers involves trade-offs across multiple dimensions.
The most cited advantage of SSMs is their linear complexity. Standard Transformer self-attention computes pairwise interactions between all n tokens, resulting in O(n^2) time and memory complexity. SSMs process each token through a fixed-size state update, achieving O(n) complexity. For a sequence of 100,000 tokens, this difference is enormous: roughly 10 billion operations for attention versus 100,000 for the SSM recurrence (ignoring constant factors).
In practice, Transformers are faster at short sequence lengths (under roughly 8,000 tokens) due to highly optimized attention implementations like FlashAttention and the overhead of SSM scan operations. But SSMs become dramatically faster at longer sequences. Empirical measurements show SSMs can be up to 4x faster than Transformers at context lengths around 57,000 tokens, with the gap widening further at longer lengths [7].
During autoregressive generation, Transformers must store a KV cache that grows linearly with the number of generated tokens. For a large model generating a long sequence, this cache can consume tens of gigabytes of GPU memory. SSMs maintain a fixed-size state regardless of sequence length, resulting in roughly 64% reduced memory footprint in long-context scenarios [7].
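The memory gap is easy to quantify with back-of-the-envelope arithmetic. The sketch below uses hypothetical but representative numbers (an 8B-class model with grouped-query attention, fp16 cache); the exact figures depend on the model configuration:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per=2):
    """KV cache size = 2 (K and V) * layers * kv_heads * head_dim * seq_len * bytes.

    Illustrative formula for a standard Transformer decoder; an SSM's state
    is layers * (fixed state size) and does not depend on seq_len at all.
    """
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per

# Hypothetical 8B-class config: 32 layers, 8 KV heads, head dim 128, fp16
cache = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, seq_len=100_000)
# about 13 GB for a single 100K-token sequence
```

Note that the cache grows linearly in seq_len and is paid per concurrent request, which is why KV memory, not compute, often bounds batch size at long contexts.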
Transformers excel at in-context learning, the ability to learn new tasks from examples provided in the prompt without any parameter updates. This capability emerges from the attention mechanism's ability to perform arbitrary content-based lookups across the entire context. SSMs, by contrast, must compress all past information into a fixed-size state vector, which limits their ability to perform precise retrieval of arbitrary facts from the context.
NVIDIA's empirical study of 8B-parameter models confirmed this gap: while pure Mamba and Mamba-2 models matched or exceeded Transformers on many standard benchmarks, they lagged behind on tasks requiring strong copying, in-context learning (e.g., five-shot MMLU), or long-context reasoning such as phonebook lookup [8].
| Property | Transformer | SSM (e.g., Mamba) | Hybrid (e.g., Jamba) |
|---|---|---|---|
| Training complexity | O(n^2) per layer | O(n) per layer | O(n) to O(n^2) depending on layer type |
| Inference memory | O(n) KV cache, grows with context | O(1) fixed state | Reduced KV cache (fewer attention layers) |
| Inference throughput | Bottlenecked by KV cache at long contexts | 5x higher than Transformer at similar size | 3-5x higher than Transformer |
| Long-range dependencies | Strong via direct attention | Strong with HiPPO initialization | Strong (both mechanisms available) |
| In-context learning | Excellent | Weaker, especially on recall tasks | Near-Transformer quality |
| Training parallelism | Fully parallel (matrix multiply) | Parallel via scan or convolution | Fully parallel |
| Hardware utilization | Highly optimized (FlashAttention) | Improving (Mamba-2/3 SSD algorithm) | Benefits from both |
The complementary strengths and weaknesses of Transformers and SSMs have motivated hybrid architectures that combine both.
Jamba was the first production-grade hybrid Transformer-Mamba model, released by AI21 Labs in March 2024 [9]. Its architecture interleaves Mamba layers and Transformer attention layers at a ratio of approximately 7:1 (seven Mamba layers for every one attention layer). It also incorporates Mixture of Experts (MoE) layers to increase parameter count without proportionally increasing compute.
Jamba demonstrated that the hybrid approach could achieve state-of-the-art performance on standard language model benchmarks while supporting context lengths up to 256K tokens with significantly lower memory consumption than a pure Transformer of equivalent quality. AI21 later released Jamba 1.5 in late 2024, scaling to 398B total parameters (94B active) across 72 layers, using grouped-query attention and 16 MoE experts [10].
NVIDIA's Nemotron-H family demonstrated hybrid Mamba-Transformer models at scale [11]. The Nemotron-H-8B model consists of 24 Mamba-2 layers, 24 MLP layers, and 4 self-attention layers, pre-trained on 15 trillion tokens. A larger 56B variant was pre-trained on 20 trillion tokens using FP8 precision. NVIDIA also released a compressed 47B variant designed to support roughly 1-million-token context on a single NVIDIA RTX 5090 GPU.
Nemotron-H models offered accuracy on par with or better than similarly sized pure Transformer models (such as Qwen-2.5 and Llama-3.1) while providing up to 3x faster inference. The architecture was also used as the backbone for NVIDIA's Cosmos-Reason 1 vision-language model.
In a systematic 2024 study, NVIDIA trained 8B-parameter Mamba, Mamba-2, Transformer, and hybrid models on up to 3.5 trillion tokens using identical datasets [8]. The hybrid model (43% Mamba-2 layers, 7% self-attention layers, 50% MLP layers) outperformed the pure Transformer on all 12 standard benchmarks by an average of 2.65 points and was projected to be up to 8x faster at token generation during inference.
The SSM family is part of a broader wave of architectures seeking to replace or complement Transformer attention with sub-quadratic alternatives.
H3 (Hungry Hungry Hippos), developed by the same Stanford/Hazy Research group behind S4, uses two stacked SSM layers sandwiched around multiplicative gating [12]. Each layer contains a short convolution (for local patterns) and a long SSM convolution (for global patterns). H3 was one of the first models to show that SSMs could match Transformer perplexity on language modeling at moderate scale, achieving results within 0.4 perplexity points of a similarly sized Transformer.
Hyena, also from the Hazy Research lab at Stanford, replaced the SSM convolution with implicitly parameterized long convolutions, generating the convolution filters through a small neural network rather than deriving them from state space equations [13]. Hyena achieved sub-quadratic O(n log n) complexity and showed competitive results with attention-based models on language tasks up to moderate context lengths.
RWKV (Receptance Weighted Key Value) takes a different approach, combining the parallelizable training of Transformers with the efficient O(1) inference of RNNs [14]. Developed primarily by Bo Peng (BlinkDL) and an open-source community, RWKV can be formulated as either a Transformer-like model (for parallel training) or an RNN (for efficient inference). RWKV uses a linear attention mechanism with learned decay factors rather than explicit state space equations.
The architecture has gone through multiple versions. RWKV-5 (Eagle) and RWKV-6 (Finch) introduced matrix-valued states and dynamic recurrence, significantly improving expressiveness [14]. RWKV-7 (Goose) followed, and by early 2026, RWKV-X added hybrid elements combining linear complexity with sparse attention.
Microsoft's RetNet (Retentive Network) frames efficient sequence modeling through a retention mechanism that can be computed in three equivalent ways: a parallel form (similar to attention, for training), a recurrent form (for O(1) inference), and a chunk-wise form (balancing parallelism and memory) [15]. RetNet is closely related to H3 but simplifies the SSM component to a state dimension of N=1, which allows parallelization through a variant of multi-head attention with exponential decay rather than through convolutions.
Google DeepMind's Griffin architecture mixes gated linear recurrences with local attention [16]. It uses a recurrent block (similar to an SSM) for global sequence mixing and windowed attention for local interactions. The companion model Hawk uses only the gated linear recurrence without any attention. Griffin matched the performance of Llama-2 despite being trained on more than 6x fewer tokens, suggesting strong data efficiency. Google released RecurrentGemma, an open-weights model based on the Griffin architecture, for production use.
The relative performance of SSMs, Transformers, and hybrids has been evaluated across multiple benchmarks and scales.
At the 3B parameter scale, Mamba matched or outperformed Transformers of equal size on perplexity and downstream tasks, while matching Transformers of twice the size on several evaluations [2]. At the 8B scale, NVIDIA's study found that pure Mamba-2 achieved competitive perplexity but trailed Transformers on recall-heavy tasks. The hybrid Mamba-2-Hybrid closed this gap entirely, exceeding the Transformer on all evaluated tasks [8].
The Long-Range Arena (LRA) benchmark, designed specifically to test long-range dependency modeling, has been a standard evaluation suite for SSMs. S4 achieved near-perfect scores on several LRA tasks, including the first solution to Path-X (sequence length 16,384). Subsequent SSM variants have continued to perform strongly on LRA, generally outperforming Transformer baselines that struggle with the longest sequences.
Mamba demonstrated particularly strong results in domains with inherently long sequences. On DNA sequence modeling, Mamba outperformed prior specialized architectures. On the SaShiMi audio generation benchmark, SSM-based models have consistently outperformed Transformer baselines, likely because audio waveforms are naturally continuous signals that align well with the state space formulation.
Mamba's inference advantage is most pronounced during autoregressive generation. Without a KV cache, Mamba can process much larger batch sizes on the same hardware, yielding 4-5x higher throughput than a similarly sized Transformer. For applications requiring long-form generation or serving many concurrent requests, this efficiency advantage is substantial.
Despite their advantages, SSMs have several notable limitations.
In-context learning and retrieval. The most significant limitation is weaker performance on tasks requiring precise information retrieval from context. Because SSMs compress all history into a fixed-size state vector, they struggle with tasks like phonebook lookup, where a specific fact must be retrieved from among many stored facts. Transformers, with their O(n^2) attention, can directly access any position in the context.
Associative recall. Related to in-context learning, SSMs have difficulty with associative recall tasks where the model must remember arbitrary key-value mappings presented in context. Research from 2025 has shown that the mechanisms underlying in-context learning differ significantly across SSM variants, and hybrid models partially but not fully close the gap with Transformers [7].
Ecosystem maturity. As of early 2026, the Transformer ecosystem is far more mature. Optimized inference engines (vLLM, TensorRT-LLM), training frameworks (Megatron-LM, DeepSpeed), and hardware (NVIDIA tensor cores optimized for matrix multiplication) are all designed primarily for Transformers. SSM-specific tooling is improving, with NVIDIA's NeMo framework adding hybrid SSM support, but the gap remains.
Scaling uncertainty. While SSMs have shown strong results up to roughly 8B parameters, there is less empirical evidence about their behavior at the 70B+ scale that frontier Transformer LLMs operate at. Nemotron-H-56B and Jamba 1.5 (398B total, 94B active) provide early data points, but the scaling behavior of pure SSMs at very large scale remains an open question.
Training speed at short contexts. For sequences shorter than roughly 8,000 tokens, highly optimized Transformer implementations (FlashAttention-3 on H100/H200 GPUs) can be faster than SSM implementations. The SSM advantage materializes primarily at longer context lengths.
SSMs have found applications across several domains: language modeling (Mamba, Jamba, Nemotron-H), DNA and genomics sequence modeling, audio generation (the SaShiMi benchmark), vision-language modeling (NVIDIA's Cosmos-Reason 1), and long-document understanding.
As of early 2026, the landscape of state space models is evolving rapidly.
The dominant trend is hybridization. Pure SSM models, while impressively efficient, have not convincingly matched Transformers across all tasks at scale. Hybrid architectures that combine a majority of SSM layers with a small number of attention layers appear to offer the best of both worlds: near-linear complexity and memory efficiency from the SSM layers, combined with strong in-context learning and recall from the attention layers. NVIDIA's Nemotron-H, AI21's Jamba, and Google's RecurrentGemma all follow this pattern.
The theoretical understanding of SSMs has deepened considerably. The SSD framework from Mamba-2 demonstrated a formal equivalence between structured SSMs and structured attention. Mamba-3's complex-valued states and MIMO formulation addressed known expressivity gaps. IBM's work on structured sparse transition matrices, a NeurIPS 2025 spotlight, tackled the expressivity-efficiency balance from yet another angle.
On the hardware front, SSMs are beginning to benefit from dedicated optimization. NVIDIA's NeMo framework supports hybrid SSM training, and the Mamba-2/3 kernels take advantage of tensor core operations. Research published in Nature Communications in 2025 demonstrated SSM implementation in compute-in-memory hardware for energy-efficient, event-driven processing.
A comprehensive survey published in March 2025, "From S4 to Mamba," traced the full evolution of structured state space models, cataloging dozens of variants and their applications [18]. The survey highlights both the rapid progress in the field and the remaining open questions, particularly around scaling SSMs to frontier model sizes and closing the gap with Transformers on recall-intensive tasks.
The Transformer remains the dominant architecture for frontier AI systems. But SSMs have established themselves as a legitimate and increasingly practical alternative, particularly for long-context, memory-constrained, or latency-sensitive applications. Whether the future belongs to pure SSMs, pure Transformers, or hybrids remains one of the most consequential open questions in deep learning architecture research.
Imagine you are listening to a very long story. A Transformer is like writing down every single word of the story on sticky notes and spreading them all across a huge table so you can look at any two words at the same time. This works great, but it takes a lot of sticky notes and a really big table for a long story.
A state space model is more like keeping a small notebook. As you hear each word, you update your notes with the important stuff and erase what you do not need. You never need a bigger notebook no matter how long the story gets. The tricky part is deciding what to write down and what to erase. Mamba is a clever version that looks at each new word and decides on the spot whether it is important enough to remember.