# State space model (deep learning)

> Source: https://aiwiki.ai/wiki/state_space_model
> Updated: 2026-06-21
> Categories: Deep Learning, Machine Learning
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

*See also: [Transformer](/wiki/transformer), [Recurrent neural network](/wiki/recurrent_neural_network), [Deep learning](/wiki/deep_learning)*

## What is a state space model?

A **state space model** (SSM) in [deep learning](/wiki/deep_learning) is a class of sequence model that maps an input sequence to an output sequence through a fixed-size latent state, using continuous-time state space equations that are discretized for neural networks. SSMs scale linearly with sequence length, O(n), rather than the quadratic O(n^2) of [Transformer](/wiki/transformer) [attention](/wiki/attention), which makes them efficient on very long sequences such as genomics, audio, and long documents. The term most commonly refers to **structured state space models**, a family that began with the S4 model introduced by Albert Gu, Karan Goel, and Christopher Re at Stanford University in 2021 [1], and whose most influential member is **[Mamba](/wiki/mamba)**, introduced by Albert Gu and Tri Dao in December 2023 [2].

SSMs have attracted intense interest as an alternative to the Transformer architecture. Transformers rely on attention mechanisms that compute pairwise interactions between all tokens, whereas SSMs process a sequence through a fixed-size recurrent state. As a result, compute grows linearly with sequence length for recurrent or scan-based SSMs (O(n log n) for FFT-based convolutional training), and the memory needed per generated token stays constant.

Mamba introduced selective state spaces, a mechanism that makes SSM parameters input-dependent, and combined this with a hardware-aware implementation optimized for modern GPUs. The original Mamba paper reports that the model "enjoys fast inference (5x higher throughput than Transformers) and linear scaling in sequence length, and its performance improves on real data up to million-length sequences" [2]. As of early 2026, SSMs and Transformer-SSM hybrids are among the most active areas of research in efficient sequence modeling, with production deployments by NVIDIA, [AI21 Labs](/wiki/ai21_labs), [Google DeepMind](/wiki/google_deepmind), and others.

Albert Gu is an assistant professor in the Machine Learning Department at Carnegie Mellon University and co-founder and chief scientist of the SSM startup Cartesia; Tri Dao is an assistant professor of computer science at Princeton University and co-founder and chief scientist of Together AI [19].

## Mathematical Formulation

State space models are rooted in the classical state space representation from control theory. A continuous-time, linear, time-invariant (LTI) system is defined by four matrices (A, B, C, D) and two equations:

- **State equation**: x'(t) = Ax(t) + Bu(t)
- **Output equation**: y(t) = Cx(t) + Du(t)

Here, u(t) is the input signal, x(t) is the hidden state vector of dimension N, and y(t) is the output signal. The matrix A (of size N x N) governs how the state evolves over time. B (N x 1) controls how the input influences the state. C (1 x N) maps the state to the output. D (1 x 1) provides a direct skip connection from input to output and is often omitted or set to zero in practice.

### Discretization

Since deep learning operates on discrete sequences rather than continuous signals, the continuous system must be discretized. Given a step size delta, the continuous matrices are converted to discrete counterparts using a discretization rule. The most common approach is the zero-order hold (ZOH):

- A_bar = exp(delta * A)
- B_bar = (delta * A)^(-1) * (exp(delta * A) - I) * delta * B

The discrete recurrence then becomes:

- x_k = A_bar * x_{k-1} + B_bar * u_k
- y_k = C * x_k + D * u_k

This recurrence can be unrolled as a convolution during training (enabling parallelism on GPUs) or executed step-by-step as a recurrence during inference (enabling constant memory per step). This dual view, convolution for training and recurrence for generation, is one of the key advantages of SSMs over both Transformers and traditional [RNNs](/wiki/recurrent_neural_network).
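The discretization rule and recurrence above can be sketched in a few lines. This is a minimal illustration, assuming a diagonal state matrix A = diag(a) with nonzero entries so that the matrix exponential and inverse reduce to elementwise operations; the function names and the toy parameter values are hypothetical and not taken from any SSM library.

```python
import math

def discretize_zoh_diagonal(a, b, delta):
    """Zero-order hold for a diagonal state matrix A = diag(a), a_i != 0.

    For diagonal A the matrix formulas reduce to elementwise ones:
      A_bar_i = exp(delta * a_i)
      B_bar_i = (exp(delta * a_i) - 1) / a_i * b_i
    """
    a_bar = [math.exp(delta * ai) for ai in a]
    b_bar = [(abi - 1.0) / ai * bi for abi, ai, bi in zip(a_bar, a, b)]
    return a_bar, b_bar

def run_recurrence(a_bar, b_bar, c, d, u):
    """x_k = A_bar x_{k-1} + B_bar u_k ;  y_k = C x_k + D u_k."""
    x = [0.0] * len(a_bar)  # fixed-size state, starting from zero
    ys = []
    for u_k in u:
        x = [ab * xi + bb * u_k for ab, bb, xi in zip(a_bar, b_bar, x)]
        ys.append(sum(ci * xi for ci, xi in zip(c, x)) + d * u_k)
    return ys

# Hypothetical toy system with N = 2 stable (negative) modes.
a = [-1.0, -2.0]
b = [1.0, 1.0]
c = [1.0, 0.5]
d = 0.0
a_bar, b_bar = discretize_zoh_diagonal(a, b, delta=0.1)

# Response to a unit impulse: the output decays as the state relaxes.
y = run_recurrence(a_bar, b_bar, c, d, u=[1.0, 0.0, 0.0, 0.0])
```

The state `x` has the same size at every step, which is why generation needs constant memory regardless of how many inputs have been processed.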

### Connection to Convolutions

By unrolling the discrete recurrence, the output sequence y can be expressed as a convolution of the input sequence u with a kernel K:

- K = (C * B_bar, C * A_bar * B_bar, C * A_bar^2 * B_bar, ...)
- y = K * u (here * denotes discrete convolution, whereas in the definition of K it denotes matrix multiplication)

This convolutional form allows efficient parallel computation using [Fast Fourier Transforms](/wiki/fast_fourier_transform) (FFT) during training, achieving O(n log n) complexity. The ability to switch between convolutional mode (for training) and recurrent mode (for inference) is fundamental to all structured SSM architectures.
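The equivalence between the two modes can be checked numerically. The sketch below assumes already-discretized diagonal parameters and D = 0, and uses a direct O(n^2) convolution for readability rather than an FFT; all names and values are illustrative.

```python
def ssm_kernel(a_bar, b_bar, c, length):
    """K_l = C * A_bar^l * B_bar for l = 0..length-1 (diagonal A_bar)."""
    return [sum(ci * (ab ** l) * bb for ci, ab, bb in zip(c, a_bar, b_bar))
            for l in range(length)]

def causal_conv(kernel, u):
    """y_k = sum_{j=0..k} K_j * u_{k-j} (direct form; an FFT would do this in O(n log n))."""
    return [sum(kernel[j] * u[k - j] for j in range(k + 1)) for k in range(len(u))]

def recurrence(a_bar, b_bar, c, u):
    """Step-by-step evaluation of the same system with a fixed-size state."""
    x = [0.0] * len(a_bar)
    ys = []
    for u_k in u:
        x = [ab * xi + bb * u_k for ab, bb, xi in zip(a_bar, b_bar, x)]
        ys.append(sum(ci * xi for ci, xi in zip(c, x)))
    return ys

# Hypothetical discretized parameters with state size N = 2.
a_bar, b_bar, c = [0.9, 0.5], [0.1, 0.3], [1.0, -2.0]
u = [1.0, 2.0, -1.0, 0.5, 3.0]

y_conv = causal_conv(ssm_kernel(a_bar, b_bar, c, len(u)), u)
y_rec = recurrence(a_bar, b_bar, c, u)
```

Both lists agree up to floating-point rounding, since the kernel entries are exactly the coefficients obtained by unrolling the recurrence.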

## S4: Structured State Spaces for Sequence Modeling

The **S4** (Structured State Spaces for Sequence Modeling) architecture, published by Gu, Goel, and Re in late 2021 [1], was the breakthrough that made SSMs competitive with Transformers on a range of sequence modeling tasks. The paper was presented at ICLR 2022, where it received an Outstanding Paper Honorable Mention. Its abstract claims S4 is the first model "solving the challenging Path-X task of length 16k that all prior work fails on, while being as efficient as all competitors" [1].

### The HiPPO Framework

A critical ingredient in S4 is the **HiPPO** (High-order Polynomial Projection Operators) initialization, developed by Gu et al. in 2020 [3]. HiPPO provides a principled way to initialize the state matrix A so that the hidden state maintains a compressed representation of the input history. Specifically, HiPPO derives special matrices that, when used as A, cause the state vector to store coefficients of an optimal polynomial approximation to the input signal seen so far.

The HiPPO-LegS (Legendre Scaled) variant uses a particular matrix that projects the input history onto a basis of scaled Legendre polynomials. This initialization proved empirically essential for S4's ability to capture long-range dependencies. Without HiPPO, SSMs tend to forget distant information rapidly. With it, they can maintain useful representations across thousands or even tens of thousands of time steps.
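For concreteness, the HiPPO-LegS matrix can be written down directly. The sketch below follows the form given in the S4 and HiPPO papers [1][3], with the negative sign folded into A; sign and scaling conventions vary between implementations, so treat this as one common convention and not a canonical one. The function name is hypothetical.

```python
import math

def hippo_legs(n_state):
    """Return (A, B) for HiPPO-LegS as nested lists.

    A[n][k] = -sqrt(2n+1) * sqrt(2k+1)  if n > k
            = -(n + 1)                  if n == k
            = 0                         if n < k
    B[n]    = sqrt(2n+1)
    """
    A = [[0.0] * n_state for _ in range(n_state)]
    for n in range(n_state):
        for k in range(n_state):
            if n > k:
                A[n][k] = -math.sqrt(2 * n + 1) * math.sqrt(2 * k + 1)
            elif n == k:
                A[n][k] = -(n + 1.0)
    B = [math.sqrt(2 * n + 1) for n in range(n_state)]
    return A, B

A, B = hippo_legs(4)
```

The matrix is lower triangular with a strictly negative diagonal, so every mode of the continuous system decays, with higher-order coefficients decaying faster.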

### Structured Parameterization

Naively computing the SSM convolution kernel requires materializing powers of the N x N matrix A, which is prohibitively expensive: on the order of O(N^2 n) operations for a kernel of length n. S4 overcame this through a structured parameterization that constrains A to be a **diagonal plus low-rank** (DPLR) matrix. This structure reduces the kernel computation to evaluating a Cauchy kernel, which costs roughly O(N + n) operations up to logarithmic factors [1]. Combined with FFT-based convolution, S4 achieves O(n log n) total complexity for a sequence of length n.

The result was dramatic. S4 achieved 91% accuracy on sequential CIFAR-10 (processing images one pixel at a time), on par with a larger 2-D ResNet, and performed autoregressive generation 60x faster than equivalent baselines [1]. It was also the first model to solve the Path-X task, a synthetic benchmark requiring reasoning over sequences of length 16,384 (128 x 128 pixels) that all prior architectures, including Transformers, had failed on, and it reached state of the art on every task in the Long Range Arena benchmark [1].

### S4 Variants

Several refinements followed S4 quickly:

- **S4D** (2022) simplified S4 by showing that a purely diagonal state matrix (without the low-rank correction) could achieve similar performance, greatly simplifying implementation [4].
- **S5** (2022) introduced a MIMO (multi-input, multi-output) SSM formulation that processes all input channels through a single shared state, improving efficiency.
- **DSS** (Diagonal State Spaces) further explored diagonal parameterizations with complex-valued states.

## Mamba: Selective State Spaces

**Mamba** was introduced by Albert Gu (Carnegie Mellon University) and Tri Dao (Princeton University) on December 1, 2023, in the paper "Mamba: Linear-Time Sequence Modeling with Selective State Spaces" [2]. It addressed a fundamental limitation of all prior SSMs: their linear time-invariance.

### Why are linear time-invariant SSMs limited?

In S4 and its variants, the matrices A, B, C, and the step size delta are fixed parameters that do not change based on the input. This makes the system linear and time-invariant (LTI), which is precisely what allows the efficient convolutional computation. However, LTI systems treat all inputs identically. They cannot selectively attend to or ignore specific tokens based on content. This is a severe limitation for language modeling, where the relevance of a token depends entirely on context.

Consider a simple example: given the sentence "The capital of France is Paris," an LTI system processes "The," "capital," "of," "France," "is," and "Paris" through the same fixed dynamics. It has no mechanism to recognize that "France" is the key piece of information for determining the output "Paris." Transformers handle this naturally through content-based attention, but prior SSMs could not. The Mamba authors identified that "a key weakness of such models is their inability to perform content-based reasoning" [2].

### Selective State Spaces

Mamba's core innovation is making the SSM parameters **input-dependent** (selective). Specifically, the matrices B, C, and the step size delta become functions of the current input token u_k (using the notation of the Mathematical Formulation section, where u is the input and x is the state):

- B_k = Linear(u_k)
- C_k = Linear(u_k)
- delta_k = softplus(Linear(u_k))

This selectivity allows the model to control what information enters the state, what information is retrieved from the state, and how quickly the state transitions, all conditioned on the current input. In effect, a large delta causes the model to focus on the current input and reset the state, while a small delta causes it to retain the existing state and largely ignore the current input.

This mechanism provides functionality analogous to the gating in [LSTM](/wiki/long_short-term_memory_lstm) networks or the content-based filtering in attention, but within the SSM framework.
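The effect of an input-dependent step size can be seen in a small sketch. It writes the input as u_k (as in the Mathematical Formulation section) and makes several simplifying assumptions that are not from the Mamba paper: a single scalar input channel, a diagonal A, scalar weights standing in for the linear projections, and the exact zero-order hold for both A_bar and B_bar. All names and values are hypothetical.

```python
import math

def softplus(z):
    return math.log1p(math.exp(z))

def selective_step(h, u_k, a, w_b, w_c, w_delta, b_delta):
    """One step of a single-channel selective SSM with diagonal A = diag(a).

    B_k, C_k and delta_k are computed from the current input u_k, so the
    discretized dynamics differ from token to token.
    """
    delta = softplus(w_delta * u_k + b_delta)       # delta_k > 0
    b_k = [w * u_k for w in w_b]                    # stands in for Linear(u_k)
    c_k = [w * u_k for w in w_c]                    # stands in for Linear(u_k)
    a_bar = [math.exp(delta * ai) for ai in a]      # ZOH, elementwise
    b_bar = [(ab - 1.0) / ai * bi for ab, ai, bi in zip(a_bar, a, b_k)]
    h = [ab * hi + bb * u_k for ab, bb, hi in zip(a_bar, b_bar, h)]
    y_k = sum(ci * hi for ci, hi in zip(c_k, h))
    return h, y_k, delta

# Hypothetical parameters: inputs near 1 produce a large delta, inputs near 0 a small one.
params = dict(a=[-1.0, -2.0], w_b=[1.0, 2.0], w_c=[1.0, 1.0],
              w_delta=10.0, b_delta=-5.0)
h0 = [10.0, 10.0]  # pretend the state already holds earlier context

h_keep, _, d_small = selective_step(h0, 0.01, **params)   # small delta: state kept
h_reset, _, d_large = selective_step(h0, 1.0, **params)   # large delta: state overwritten
```

With a small delta, A_bar is close to 1 and B_bar close to 0, so the old state passes through almost unchanged; with a large delta, A_bar is close to 0 and the new state is determined almost entirely by the current input.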

### Hardware-Aware Implementation

Making the parameters input-dependent breaks the LTI property, which means the efficient convolutional mode of S4 can no longer be used. Mamba compensates for this through a carefully designed **hardware-aware algorithm** that computes the selective SSM recurrence efficiently on modern GPUs.

The key insight is that the recurrence can be computed using a parallel scan algorithm. Gu and Dao developed a custom [CUDA](/wiki/cuda) kernel that:

1. Avoids materializing the full state in GPU high-bandwidth memory (HBM)
2. Performs computation in fast on-chip SRAM using kernel fusion
3. Uses recomputation in the backward pass instead of storing intermediate states

This approach is directly inspired by the IO-aware principles behind [FlashAttention](/wiki/flash_attention) (also developed by Tri Dao). The result is that Mamba achieves true linear-time complexity, O(n), for both training and inference, with wall-clock speeds that are practical at scale.
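The reason a scan applies is that each recurrence step h -> a_k * h + b_k is an affine map, and composing affine maps is associative. The sketch below shows only that algebraic idea using a recursive-doubling scan in plain Python; it is not Mamba's fused CUDA kernel, and the function names are illustrative.

```python
def combine(left, right):
    """Compose two steps of h -> a*h + b, applying `left` first and then `right`."""
    a1, b1 = left
    a2, b2 = right
    return (a1 * a2, a2 * b1 + b2)

def sequential_scan(a, b):
    """Reference: h_k = a_k * h_{k-1} + b_k, starting from a zero state."""
    h, out = 0.0, []
    for a_k, b_k in zip(a, b):
        h = a_k * h + b_k
        out.append(h)
    return out

def doubling_scan(a, b):
    """Inclusive scan by recursive doubling (Hillis-Steele style).

    Each of the roughly log2(n) rounds updates every position independently,
    which is what makes the scan parallelizable; here the rounds run in a loop.
    """
    elems = list(zip(a, b))
    n, shift = len(elems), 1
    while shift < n:
        elems = [elems[i] if i < shift else combine(elems[i - shift], elems[i])
                 for i in range(n)]
        shift *= 2
    # Starting from a zero state, h_k equals the offset of the composed map.
    return [b_k for _, b_k in elems]

# Hypothetical per-step coefficients, e.g. a_k = A_bar_k and b_k = B_bar_k * u_k.
a = [0.9, 0.5, 0.8, 0.7, 0.6]
b = [1.0, 2.0, -1.0, 0.5, 3.0]
h_seq = sequential_scan(a, b)
h_par = doubling_scan(a, b)
```

Because the coefficients a_k may differ at every step, this formulation handles the input-dependent case where a single fixed convolution kernel no longer exists.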

### Mamba Architecture

The full Mamba block combines the selective SSM with a gated architecture. Each block consists of:

1. A linear projection that expands the input dimension
2. A 1D depthwise convolution
3. The selective SSM
4. A SiLU activation gate
5. A linear projection back to the model dimension

Notably, Mamba does not use attention or explicit MLP blocks. The entire architecture is a simple stack of identical Mamba blocks with [residual connections](/wiki/residual_connection) and [RMSNorm](/wiki/layer_normalization).

### How well does Mamba perform?

Mamba-3B outperformed Transformers of the same size and matched Transformers with twice as many parameters on language modeling benchmarks [2]. On inference, Mamba achieved 5x higher throughput than comparably sized Transformers because it does not require a KV cache. Its memory footprint during generation is constant regardless of sequence length, compared to the linearly growing KV cache of Transformers. Mamba also demonstrated strong results on audio modeling (SaShiMi benchmark) and DNA sequence modeling, with performance improving on real data up to million-length sequences.

## Mamba-2: State Space Duality

In May 2024, Tri Dao and Albert Gu published "Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality" [5], introducing **Mamba-2**. This paper established a deep theoretical connection between SSMs and attention.

### The SSD Framework

The central contribution is the **Structured State Space Duality** (SSD) framework, which proves that a structured SSM with a scalar-times-identity state matrix is mathematically equivalent to a form of masked self-attention using a 1-semiseparable causal mask. In other words, certain SSMs and certain attention mechanisms compute exactly the same function, just expressed through different mathematical decompositions.

This duality has both theoretical and practical implications. Theoretically, it unifies two seemingly disparate families of sequence models under a single mathematical framework. Practically, it means that SSM computations can be decomposed into matrix multiplications, enabling much more efficient use of GPU tensor cores. The accompanying "minimal SSD" reference implementation expresses the selective SSM in roughly 25 lines of code [5].
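The duality can be illustrated on a toy example. The sketch below assumes a state matrix of the form a_k times the identity and a single input channel, and compares the recurrent computation with the equivalent matrix form y = (L * (C B^T)) u, where * is elementwise and L is the causal decay mask. It is written for clarity, not efficiency, and is not the paper's minimal SSD implementation; all names and values are hypothetical.

```python
def ssm_recurrent(a, B, C, u):
    """h_k = a_k * h_{k-1} + B_k * u_k ;  y_k = <C_k, h_k>  (state matrix a_k * I)."""
    h, ys = [0.0] * len(B[0]), []
    for a_k, B_k, C_k, u_k in zip(a, B, C, u):
        h = [a_k * hi + bi * u_k for hi, bi in zip(h, B_k)]
        ys.append(sum(ci * hi for ci, hi in zip(C_k, h)))
    return ys

def ssm_as_masked_attention(a, B, C, u):
    """Same map in attention-like form.

    L[i][j] = a_{j+1} * ... * a_i for i >= j (zero above the diagonal) is the
    1-semiseparable causal mask; C plays the role of queries, B of keys,
    and u of values.
    """
    ys = []
    for i in range(len(u)):
        total = 0.0
        for j in range(i + 1):
            decay = 1.0
            for t in range(j + 1, i + 1):
                decay *= a[t]
            score = sum(ci * bj for ci, bj in zip(C[i], B[j]))  # dot(C_i, B_j)
            total += decay * score * u[j]
        ys.append(total)
    return ys

# Hypothetical inputs: sequence length 4, state size 2.
a = [0.9, 0.8, 0.95, 0.5]
B = [[1.0, 0.0], [0.5, -1.0], [2.0, 1.0], [0.0, 1.0]]
C = [[1.0, 1.0], [0.0, 2.0], [-1.0, 0.5], [1.0, -1.0]]
u = [1.0, -2.0, 0.5, 3.0]

y_rec = ssm_recurrent(a, B, C, u)
y_att = ssm_as_masked_attention(a, B, C, u)
```

The recurrent form touches each token once with a fixed-size state, while the matrix form materializes all pairwise scores; they produce the same outputs.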

### Mamba-2 Architecture

Mamba-2 refines the selective SSM layer from Mamba-1 using insights from SSD. The key changes include:

- A **multi-head SSM** structure analogous to multi-head attention, where multiple independent SSM heads operate in parallel
- Larger state dimensions enabled by the more efficient SSD algorithm
- Simplified gating compared to Mamba-1

The result is a model whose core layer is 2-8x faster to train than Mamba-1 while remaining competitive with Transformers on language modeling [5]. Mamba-2 was presented at ICML 2024.

## Mamba-3

In early 2026, Gu and collaborators introduced **Mamba-3** (an ICLR 2026 conference paper), which builds on the SSM framework with three axes of improvement rooted in SSM principles [6]:

1. **Trapezoidal discretization**: Replacing the zero-order hold or Euler discretization with a generalized trapezoidal rule yields a second-order-accurate, more expressive recurrence.
2. **Complex-valued states**: Reintroducing complex-valued state matrices allows the model to perform richer state tracking. The paper draws a theoretical bridge between complex SSMs and data-dependent rotary embeddings (RoPE). Prior work had shown that restricting states to real, non-negative values degraded performance on tasks requiring precise state tracking.
3. **MIMO formulation**: Instead of the standard single-input, single-output (SISO) SSM where the state update involves a low-arithmetic-intensity outer product, Mamba-3 uses a multi-input, multi-output formulation with matrix-valued inputs and matrix products. This pushes the computation into a compute-bound regime, improving hardware utilization without increasing the recurrent state size.

At the 1.5B parameter scale, the base Mamba-3 improved average downstream accuracy by 0.6 points over the prior best linear-complexity model (Gated DeltaNet), and the MIMO variant added a further 1.2 points, for a total gain of 1.8 points [6]. Mamba-3 also matched the pretraining perplexity of Mamba-2 with half the state size: for example, Mamba-3 with state size 64 matches Mamba-2 with state size 128 [6].

## How do SSMs differ from Transformers?

The comparison between SSMs and Transformers involves trade-offs across multiple dimensions.

### Computational Complexity

The most cited advantage of SSMs is their linear complexity. Standard Transformer self-attention computes pairwise interactions between all n tokens, resulting in O(n^2) time and memory complexity. SSMs process each token through a fixed-size state update, achieving O(n) complexity. For a sequence of 100,000 tokens, this difference is enormous: roughly 10 billion operations for attention versus 100,000 for the SSM recurrence (ignoring constant factors).

In practice, Transformers are faster at short sequence lengths (under roughly 8,000 tokens) due to highly optimized attention implementations like FlashAttention and the overhead of SSM scan operations. But SSMs become dramatically faster at longer sequences. Empirical measurements show SSMs can be up to 4x faster than Transformers at context lengths around 57,000 tokens, with the gap widening further at longer lengths [7].

### Memory Efficiency

During autoregressive generation, Transformers must store a KV cache that grows linearly with the number of generated tokens. For a large model generating a long sequence, this cache can consume tens of gigabytes of GPU memory. SSMs maintain a fixed-size state regardless of sequence length, resulting in roughly 64% reduced memory footprint in long-context scenarios [7].
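A back-of-the-envelope calculation shows why the two memory profiles diverge. The model shapes below are hypothetical and chosen only to illustrate the scaling; they do not describe any particular released model, and the SSM estimate ignores smaller buffers such as convolution state.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Keys and values (the factor 2) are cached for every layer and every token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

def ssm_state_bytes(n_layers, d_inner, state_dim, bytes_per_value=2):
    """One (d_inner x state_dim) recurrent state per layer, independent of seq_len."""
    return n_layers * d_inner * state_dim * bytes_per_value

# Hypothetical configurations with 16-bit values.
ssm = ssm_state_bytes(n_layers=64, d_inner=8192, state_dim=16)
for seq_len in (1_000, 100_000):
    kv = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=seq_len)
    print(f"{seq_len:>7} tokens: KV cache {kv / 1e9:.2f} GB, SSM state {ssm / 1e6:.2f} MB")
```

Under these assumptions the KV cache grows by a factor of 100 between the two sequence lengths, while the SSM state stays the same size per sequence.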

### In-Context Learning

Transformers excel at [in-context learning](/wiki/in_context_learning), the ability to learn new tasks from examples provided in the prompt without any parameter updates. This capability emerges from the attention mechanism's ability to perform arbitrary content-based lookups across the entire context. SSMs, by contrast, must compress all past information into a fixed-size state vector, which limits their ability to perform precise retrieval of arbitrary facts from the context.

NVIDIA's empirical study of 8B-parameter models confirmed this gap: while pure Mamba and Mamba-2 models matched or exceeded Transformers on many standard benchmarks, they lagged behind on tasks requiring strong copying, in-context learning (e.g., five-shot [MMLU](/wiki/mmlu)), or long-context reasoning such as phonebook lookup [8].

### Comparison Table

| Property | Transformer | SSM (e.g., Mamba) | Hybrid (e.g., Jamba) |
|---|---|---|---|
| Training complexity | O(n^2) per layer | O(n) per layer | O(n) to O(n^2) depending on layer type |
| Inference memory | O(n) KV cache, grows with context | O(1) fixed state | Reduced KV cache (fewer attention layers) |
| Inference throughput | Bottlenecked by KV cache at long contexts | 5x higher than Transformer at similar size | 3-5x higher than Transformer |
| Long-range dependencies | Strong via direct attention | Strong with HiPPO initialization | Strong (both mechanisms available) |
| In-context learning | Excellent | Weaker, especially on recall tasks | Near-Transformer quality |
| Training parallelism | Fully parallel (matrix multiply) | Parallel via scan or convolution | Fully parallel |
| Hardware utilization | Highly optimized (FlashAttention) | Improving (Mamba-2/3 SSD algorithm) | Benefits from both |

## Hybrid Architectures

The complementary strengths and weaknesses of Transformers and SSMs have motivated hybrid architectures that combine both.

### Jamba (AI21 Labs, 2024)

**[Jamba](/wiki/jamba)** was the first production-grade hybrid Transformer-Mamba model, released by [AI21 Labs](/wiki/ai21_labs) in March 2024 [9]. Its architecture interleaves Mamba layers and Transformer attention layers at a ratio of approximately 7:1 (seven Mamba layers for every one attention layer). It also incorporates [Mixture of Experts](/wiki/mixture_of_experts) (MoE) layers to increase parameter count without proportionally increasing compute. The base Jamba model has 52B total parameters but activates only 12B during inference, and is released with open weights under the Apache 2.0 license [9].

Jamba demonstrated that the hybrid approach could achieve state-of-the-art performance on standard [language model](/wiki/large_language_model) benchmarks while supporting context lengths up to 256K tokens (about 210 pages of text) with significantly lower memory consumption than a pure Transformer of equivalent quality. It fits up to 140K tokens of context on a single 80GB GPU and delivers roughly 3x the throughput of a similarly sized Transformer such as Mixtral 8x7B on long contexts [9]. AI21 later released **Jamba 1.5** in late 2024, scaling to 398B total parameters (94B active) across 72 layers, using grouped-query attention and 16 MoE experts [10].

### NVIDIA Nemotron-H (2025)

NVIDIA's **Nemotron-H** family demonstrated hybrid Mamba-Transformer models at scale [11]. The Nemotron-H-8B model consists of 24 Mamba-2 layers, 24 MLP layers, and 4 self-attention layers, pre-trained on 15 trillion tokens. A larger 56B variant was pre-trained on 20 trillion tokens using FP8 precision. NVIDIA also released a compressed 47B variant (created from the 56B model using a pruning-and-distillation technique called MiniPuzzle) designed to support roughly 1-million-token context on a single NVIDIA RTX 5090 GPU.

Nemotron-H models offered accuracy on par with or better than similarly sized pure Transformer models (such as [Qwen](/wiki/qwen)-2.5 and Llama-3.1) while providing up to 3x faster inference [11]. The architecture was also used as the backbone for NVIDIA's Cosmos-Reason 1 vision-language model.

### NVIDIA Mamba-2-Hybrid Study

In a systematic 2024 study, NVIDIA trained 8B-parameter Mamba, Mamba-2, Transformer, and hybrid models on up to 3.5 trillion tokens using identical datasets [8]. The hybrid model (43% Mamba-2 layers, 7% self-attention layers, 50% MLP layers) outperformed the pure Transformer on all 12 standard benchmarks by an average of 2.65 points and was projected to be up to 8x faster at token generation during inference.

## Other SSM and Linear-Complexity Variants

The SSM family is part of a broader wave of architectures seeking to replace or complement Transformer attention with sub-quadratic alternatives.

### H3 (2022)

H3 (Hungry Hungry Hippos), developed by the same Stanford/Hazy Research group behind S4, uses two stacked SSM layers sandwiched around multiplicative gating [12]. Each layer contains a short convolution (for local patterns) and a long SSM convolution (for global patterns). H3 was one of the first models to show that SSMs could match Transformer perplexity on language modeling at moderate scale, achieving results within 0.4 perplexity points of a similarly sized Transformer.

### Hyena (2023)

Hyena, also from the Hazy Research lab at Stanford, replaced the SSM convolution with implicitly parameterized long convolutions, generating the convolution filters through a small neural network rather than deriving them from state space equations [13]. Hyena achieved sub-quadratic O(n log n) complexity and showed competitive results with attention-based models on language tasks up to moderate context lengths.

### RWKV (2023-2024)

[RWKV](/wiki/rwkv) (Receptance Weighted Key Value) takes a different approach, combining the parallelizable training of Transformers with the efficient O(1) inference of RNNs [14]. Developed primarily by Bo Peng (BlinkDL) and an open-source community, RWKV can be formulated as either a Transformer-like model (for parallel training) or an RNN (for efficient inference). RWKV uses a linear attention mechanism with learned decay factors rather than explicit state space equations.

The architecture has gone through multiple versions. RWKV-5 (Eagle) and RWKV-6 (Finch) introduced matrix-valued states and dynamic recurrence, significantly improving expressiveness [14]. RWKV-7 (Goose) followed, and by early 2026, RWKV-X added hybrid elements combining linear complexity with sparse attention.

### RetNet (2023)

Microsoft's **RetNet** (Retentive Network) frames efficient sequence modeling through a retention mechanism that can be computed in three equivalent ways: a parallel form (similar to attention, for training), a recurrent form (for O(1) inference), and a chunk-wise form (balancing parallelism and memory) [15]. RetNet is closely related to H3 but simplifies the SSM component to a state dimension of N=1, which allows parallelization through a variant of multi-head attention with exponential decay rather than through convolutions.

### Griffin (Google DeepMind, 2024)

[Google DeepMind](/wiki/google_deepmind)'s **Griffin** architecture mixes gated linear recurrences with local attention [16]. It uses a recurrent block (similar to an SSM) for global sequence mixing and windowed attention for local interactions. The companion model **Hawk** uses only the gated linear recurrence without any attention. Griffin matched the performance of Llama-2 despite being trained on more than 6x fewer tokens, suggesting strong data efficiency. Google released **RecurrentGemma**, an open-weights model based on the Griffin architecture, for production use.

## Benchmarks and Empirical Results

The relative performance of SSMs, Transformers, and hybrids has been evaluated across multiple benchmarks and scales.

### Language Modeling

At the 3B parameter scale, Mamba matched or outperformed Transformers of equal size on perplexity and downstream tasks, while matching Transformers of twice the size on several evaluations [2]. At the 8B scale, NVIDIA's study found that pure Mamba-2 achieved competitive perplexity but trailed Transformers on recall-heavy tasks. The hybrid Mamba-2-Hybrid closed this gap entirely, exceeding the Transformer on all evaluated tasks [8].

### Long-Range Arena

The Long-Range Arena (LRA) benchmark, designed specifically to test long-range dependency modeling, has been a standard evaluation suite for SSMs. S4 reached state of the art on every task in LRA, including the first solution to Path-X (sequence length 16,384) [1]. Subsequent SSM variants have continued to perform strongly on LRA, generally outperforming Transformer baselines that struggle with the longest sequences.

### Genomics and Audio

Mamba demonstrated particularly strong results in domains with inherently long sequences. On DNA sequence modeling, Mamba outperformed prior specialized architectures. On the SaShiMi audio generation benchmark, SSM-based models have consistently outperformed Transformer baselines, likely because audio waveforms are naturally continuous signals that align well with the state space formulation.

### Inference Throughput

Mamba's inference advantage is most pronounced during autoregressive generation. Without a KV cache, Mamba can process much larger batch sizes on the same hardware, yielding 4-5x higher throughput than a similarly sized Transformer. For applications requiring long-form generation or serving many concurrent requests, this efficiency advantage is substantial.

## Limitations

Despite their advantages, SSMs have several notable limitations.

**In-context learning and retrieval.** The most significant limitation is weaker performance on tasks requiring precise information retrieval from context. Because SSMs compress all history into a fixed-size state vector, they struggle with tasks like phonebook lookup, where a specific fact must be retrieved from among many stored facts. Transformers, with their O(n^2) attention, can directly access any position in the context.

**Associative recall.** Related to in-context learning, SSMs have difficulty with associative recall tasks where the model must remember arbitrary key-value mappings presented in context. Research from 2025 has shown that the mechanisms underlying in-context learning differ significantly across SSM variants, and hybrid models partially but not fully close the gap with Transformers [7].

**Ecosystem maturity.** As of early 2026, the Transformer ecosystem is far more mature. Optimized inference engines ([vLLM](/wiki/vllm), [TensorRT](/wiki/tensorrt)-LLM), training frameworks (Megatron-LM, [DeepSpeed](/wiki/deepspeed)), and hardware (NVIDIA tensor cores optimized for matrix multiplication) are all designed primarily for Transformers. SSM-specific tooling is improving, with NVIDIA's NeMo framework adding hybrid SSM support, but the gap remains.

**Scaling uncertainty.** While SSMs have shown strong results up to roughly 8B parameters, there is less empirical evidence about their behavior at the 70B+ scale that frontier Transformer LLMs operate at. Nemotron-H-56B and Jamba 1.5 (398B total, 94B active) provide early data points, but the scaling behavior of pure SSMs at very large scale remains an open question.

**Training speed at short contexts.** For sequences shorter than roughly 8,000 tokens, highly optimized Transformer implementations (FlashAttention-3 on H100/H200 GPUs) can be faster than SSM implementations. The SSM advantage materializes primarily at longer context lengths.

## What are state space models used for?

SSMs have found applications across several domains:

- **Language modeling**: Mamba, Mamba-2, and hybrid architectures serve as general-purpose language model backbones, with production deployments by AI21 Labs (Jamba) and NVIDIA (Nemotron-H).
- **Genomics**: DNA sequences are often hundreds of thousands of nucleotides long, making the linear complexity of SSMs particularly valuable. Mamba-based models have achieved state-of-the-art results on genomic benchmarks.
- **Audio and speech**: The continuous-signal origins of SSMs make them natural fits for audio processing. SSM-based models have been used for speech recognition, audio generation, and music modeling.
- **[Computer vision](/wiki/computer_vision)**: MambaVision (NVIDIA, CVPR 2025) introduced a hybrid Mamba-Transformer backbone for vision tasks, achieving competitive results with Vision Transformers while being more efficient at high resolutions.
- **Edge computing**: SSMs' lower memory requirements make them attractive for deployment on resource-constrained hardware. BrainChip demonstrated a 1-billion-parameter SSM model (TENN) running on dedicated hardware at under 0.5 watts, suitable for dashcams, medical devices, and IoT applications [17].
- **Physical AI and robotics**: NVIDIA used Nemotron-H as the backbone for Cosmos-Reason 1, a vision-language model designed for physical AI and robotics applications.

## Current State (2025-2026)

As of early 2026, the landscape of state space models is evolving rapidly.

The dominant trend is **hybridization**. Pure SSM models, while impressively efficient, have not convincingly matched Transformers across all tasks at scale. Hybrid architectures that combine a majority of SSM layers with a small number of attention layers appear to offer the best of both worlds: near-linear complexity and memory efficiency from the SSM layers, combined with strong in-context learning and recall from the attention layers. NVIDIA's Nemotron-H, AI21's Jamba, and Google's RecurrentGemma all follow this pattern.

The theoretical understanding of SSMs has deepened considerably. The SSD framework from Mamba-2 demonstrated a formal equivalence between structured SSMs and structured attention. Mamba-3's complex-valued states and MIMO formulation addressed known expressivity gaps. IBM's work on structured sparse transition matrices, a [NeurIPS](/wiki/neurips) 2025 spotlight, tackled the expressivity-efficiency balance from yet another angle.

On the hardware front, SSMs are beginning to benefit from dedicated optimization. NVIDIA's NeMo framework supports hybrid SSM training, and the Mamba-2/3 kernels take advantage of tensor core operations. Research published in Nature Communications in 2025 demonstrated SSM implementation in compute-in-memory hardware for energy-efficient, event-driven processing.

A comprehensive survey published in March 2025, "From S4 to Mamba," traced the full evolution of structured state space models, cataloging dozens of variants and their applications [18]. The survey highlights both the rapid progress in the field and the remaining open questions, particularly around scaling SSMs to frontier model sizes and closing the gap with Transformers on recall-intensive tasks.

The Transformer remains the dominant architecture for frontier AI systems. But SSMs have established themselves as a legitimate and increasingly practical alternative, particularly for long-context, memory-constrained, or latency-sensitive applications. Whether the future belongs to pure SSMs, pure Transformers, or hybrids remains one of the most consequential open questions in deep learning architecture research.

## Explain Like I'm 5 (ELI5)

Imagine you are listening to a very long story. A Transformer is like writing down every single word of the story on sticky notes and spreading them all across a huge table so you can look at any two words at the same time. This works great, but it takes a lot of sticky notes and a really big table for a long story.

A state space model is more like keeping a small notebook. As you hear each word, you update your notes with the important stuff and erase what you do not need. You never need a bigger notebook no matter how long the story gets. The tricky part is deciding what to write down and what to erase. Mamba is a clever version that looks at each new word and decides on the spot whether it is important enough to remember.

## References

[1] Gu, A., Goel, K., & Re, C. (2021). "Efficiently Modeling Long Sequences with Structured State Spaces." ICLR 2022 (Outstanding Paper Honorable Mention). arXiv:2111.00396. https://arxiv.org/abs/2111.00396

[2] Gu, A. & Dao, T. (2023). "Mamba: Linear-Time Sequence Modeling with Selective State Spaces." arXiv:2312.00752. https://arxiv.org/abs/2312.00752

[3] Gu, A., Dao, T., Ermon, S., Rudra, A., & Re, C. (2020). "HiPPO: Recurrent Memory with Optimal Polynomial Projections." NeurIPS 2020. arXiv:2008.07669. https://arxiv.org/abs/2008.07669

[4] Gu, A., Gupta, A., Goel, K., & Re, C. (2022). "On the Parameterization and Initialization of Diagonal State Space Models." NeurIPS 2022. arXiv:2206.11893. https://arxiv.org/abs/2206.11893

[5] Dao, T. & Gu, A. (2024). "Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality." ICML 2024. arXiv:2405.21060. https://arxiv.org/abs/2405.21060

[6] Gu, A. et al. (2025). "Mamba-3: Improved Sequence Modeling using State Space Principles." ICLR 2026. arXiv:2603.15569. https://arxiv.org/abs/2603.15569

[7] Mitra, S., Karami, R., Xu, H., Huang, S., & Kwon, H. (2025). "Characterizing State Space Model and Hybrid Language Model Performance with Long Context." arXiv:2507.12442. https://arxiv.org/abs/2507.12442

[8] Waleffe, R. et al. (2024). "An Empirical Study of Mamba-based Language Models." NVIDIA Research. arXiv:2406.07887. https://arxiv.org/abs/2406.07887

[9] Lieber, O. et al. (2024). "Jamba: A Hybrid Transformer-Mamba Language Model." AI21 Labs. arXiv:2403.19887. https://arxiv.org/abs/2403.19887

[10] AI21 Labs. (2024). "Jamba-1.5: Hybrid Transformer-Mamba Models at Scale." https://www.ai21.com/research/jamba-1-5-hybrid-transformer-mamba-models-at-scale/

[11] NVIDIA. (2025). "Nemotron-H: A Family of Accurate and Efficient Hybrid Mamba-Transformer Models." arXiv:2504.03624. https://arxiv.org/abs/2504.03624

[12] Fu, D., Dao, T., Saab, K., Thomas, A., Rudra, A., & Re, C. (2022). "Hungry Hungry Hippos: Towards Language Modeling with State Space Models." ICLR 2023. arXiv:2212.14052. https://arxiv.org/abs/2212.14052

[13] Poli, M. et al. (2023). "Hyena Hierarchy: Towards Larger Convolutional Language Models." ICML 2023. arXiv:2302.10866. https://arxiv.org/abs/2302.10866

[14] Peng, B. et al. (2024). "Eagle and Finch: RWKV with Matrix-Valued States and Dynamic Recurrence." COLM 2024. arXiv:2404.05892. https://arxiv.org/abs/2404.05892

[15] Sun, Y. et al. (2023). "Retentive Network: A Successor to Transformer for Large Language Models." arXiv:2307.08621. https://arxiv.org/abs/2307.08621

[16] De, S. et al. (2024). "Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models." Google DeepMind. arXiv:2402.19427. https://arxiv.org/abs/2402.19427

[17] BrainChip. (2025). "State Space Models for Edge AI." Presented at Embedded Vision Summit 2025. https://www.infoq.com/news/2025/07/state-space-models-edge-compute/

[18] Somvanshi, S. et al. (2025). "From S4 to Mamba: A Comprehensive Survey on Structured State Space Models." arXiv:2503.18970. https://arxiv.org/abs/2503.18970

[19] Cartesia. "Company." https://www.cartesia.ai/company/ ; Tri Dao, personal website. https://tridao.me/