See also: transformer, attention mechanism, recurrent neural network, large language model
Mamba is a neural network architecture for sequence modeling that uses selective state space models (SSMs) to process sequential data in linear time with respect to sequence length. It was introduced by Albert Gu and Tri Dao in December 2023 in the paper "Mamba: Linear-Time Sequence Modeling with Selective State Spaces." The architecture offers an alternative to the transformer, which relies on the attention mechanism and scales quadratically with sequence length.
The central idea behind Mamba is making the parameters of a state space model depend on the input, allowing the model to selectively propagate or forget information along the sequence depending on the current token. Combined with a hardware-aware parallel scan algorithm, Mamba achieves both the modeling power of content-aware reasoning and the computational efficiency of linear-time processing.
Mamba was pretrained at multiple scales (130M, 370M, 790M, 1.4B, and 2.8B parameters) on 300 billion tokens from the Pile dataset. The Mamba-3B model matched the perplexity of a Transformer twice its size while delivering 5x higher inference throughput. As a general sequence model backbone, Mamba achieved strong performance across language, audio, and genomics modalities.
State space models originate from control theory and signal processing. A continuous-time SSM maps an input signal u(t) to an output signal y(t) through a latent state vector x(t) using two equations:

x'(t) = A x(t) + B u(t)

y(t) = C x(t) + D u(t)
Here, A is the state transition matrix that governs how the latent state evolves over time, B is the input projection matrix, C is the output projection matrix, and D provides a direct skip connection from input to output. The model learns these parameters to capture the dynamics of input-to-output mappings through the latent state representation.
In classical systems, A, B, C, and D are fixed matrices. This property is called Linear Time Invariance (LTI): the same dynamics apply regardless of when or what input arrives. LTI systems have useful mathematical properties but cannot perform content-dependent reasoning, a limitation that Mamba directly addresses.
Since real-world data like text tokens and audio samples arrive as discrete sequences rather than continuous signals, the continuous SSM must be discretized before it can be applied computationally. The most common method is the zero-order hold (ZOH) technique, which holds each discrete input value constant until the next sample arrives.
A learnable step size parameter (denoted delta) controls the resolution of the discretization. Through ZOH, the continuous matrices A and B are converted into their discrete counterparts (typically written as A-bar and B-bar) that operate on sequences step by step:

A-bar = exp(delta A)

B-bar = (delta A)^(-1) (exp(delta A) - I) · delta B

x_k = A-bar x_(k-1) + B-bar u_k, with y_k = C x_k
Discretization is one of the most important aspects of SSM architectures because it enables two equivalent computational views of the same model: a recurrent view (processing tokens one at a time) and a convolutional view (processing the entire sequence in parallel). This dual representation allows SSMs to use the parallelizable convolutional form during training for speed, then switch to the recurrent form during inference for efficiency.
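To make the two views concrete, here is a minimal NumPy sketch (illustrative only: A is chosen diagonal so the matrix exponential is trivial, and the D skip connection is omitted) that discretizes a one-dimensional LTI SSM with ZOH and checks that the recurrent and convolutional computations agree:

```python
# Minimal sketch: ZOH discretization of an LTI SSM, then the recurrent and
# convolutional views of the same model. Names follow the notation above.
import numpy as np

N, L = 4, 32                                   # state dimension, sequence length
rng = np.random.default_rng(0)
A = -np.diag(rng.uniform(0.5, 1.5, N))         # stable diagonal A (so expm is just exp of the diagonal)
B = rng.standard_normal((N, 1))
C = rng.standard_normal((1, N))
delta = 0.1                                    # step size

# Zero-order hold: A_bar = exp(delta A), B_bar = (delta A)^(-1) (A_bar - I) delta B
A_bar = np.diag(np.exp(delta * np.diag(A)))
B_bar = np.linalg.inv(delta * A) @ (A_bar - np.eye(N)) @ (delta * B)

u = rng.standard_normal(L)

# Recurrent view: process one token at a time.
x = np.zeros((N, 1))
y_rec = np.zeros(L)
for k in range(L):
    x = A_bar @ x + B_bar * u[k]
    y_rec[k] = (C @ x).item()

# Convolutional view: precompute the kernel K_k = C A_bar^k B_bar,
# then apply it as a causal convolution over the whole sequence.
K = np.array([(C @ np.linalg.matrix_power(A_bar, k) @ B_bar).item() for k in range(L)])
y_conv = np.array([np.dot(K[:k + 1][::-1], u[:k + 1]) for k in range(L)])

assert np.allclose(y_rec, y_conv)              # the two views give the same output
```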
A key challenge for any recurrent model is remembering information over long sequences without suffering from vanishing or exploding gradients. The HiPPO (High-order Polynomial Projection Operators) framework, introduced by Albert Gu, Tri Dao, Stefano Ermon, Atri Rudra, and Christopher Re in 2020, provides a principled initialization for the state matrix A.
HiPPO works by continuously projecting the input history onto a basis of orthogonal polynomials (specifically Legendre polynomials). This creates a state representation that compresses historical information optimally: recent tokens are captured with high fidelity while older tokens decay gracefully. The resulting HiPPO matrix ensures that each update step requires only O(N) operations, and gradient norms scale as O(1/t), preventing the vanishing and exploding gradient problems that plague standard RNNs.
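For illustration, the HiPPO matrix can be built directly from its closed form. The sketch below follows the HiPPO-LegS formula as stated in the S4 paper; treat the exact normalization as an assumption to be checked against the paper rather than a reference implementation:

```python
# Sketch: constructing the N x N HiPPO-LegS matrix used to initialize A.
import numpy as np

def hippo_legs(N: int) -> np.ndarray:
    """HiPPO-LegS matrix: A[n, k] = -sqrt(2n+1) * sqrt(2k+1) for n > k,
    -(n + 1) on the diagonal, and 0 above the diagonal."""
    A = np.zeros((N, N))
    for n in range(N):
        for k in range(N):
            if n > k:
                A[n, k] = -np.sqrt(2 * n + 1) * np.sqrt(2 * k + 1)
            elif n == k:
                A[n, k] = -(n + 1)
    return A

A = hippo_legs(16)   # N = 16 is the typical Mamba-1 state dimension
```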
HiPPO initialization proved to be a foundational component for all subsequent SSM architectures, including S4 and Mamba.
The Structured State Space for Sequences (S4) model, introduced by Albert Gu, Karan Goel, and Christopher Re in 2021 (published at ICLR 2022 where it received an Outstanding Paper Honorable Mention), was the first SSM architecture to achieve competitive performance with transformers on a variety of sequence modeling tasks.
S4 combined three components: the continuous-time state space formulation described above, HiPPO initialization of the state matrix A to retain long-range context, and a structured parameterization of A that makes the computation tractable.
The main technical contribution of S4 was decomposing the HiPPO matrix A into a Normal Plus Low-Rank (NPLR) form. This decomposition allowed the SSM convolutional kernel to be computed in O(N + L) operations and memory (where N is the state dimension and L is the sequence length), reducing what would otherwise be an intractable computation.
S4 achieved several notable results:
| Benchmark | S4 result | Significance |
|---|---|---|
| Sequential CIFAR-10 | 91.0% accuracy | Without data augmentation or auxiliary losses |
| Long Range Arena (Path-X) | Solved (length 16,384) | First model to solve this task; all prior methods failed |
| Generation speed | 60x faster than Transformers | On comparable autoregressive tasks |
However, S4 and other LTI state space models shared a fundamental limitation: because their parameters remained constant regardless of input content, they could not perform content-based reasoning. For example, they struggled with tasks requiring the model to attend to specific tokens based on their content rather than their position.
Between S4 and Mamba, several intermediate SSM variants were developed:
| Model | Year | Key contribution |
|---|---|---|
| S4 | 2021 | NPLR parameterization of HiPPO matrix |
| DSS (Diagonal State Spaces) | 2022 | Showed diagonal approximation of A achieves comparable performance |
| S4D | 2022 | Simplified S4 with diagonal initialization |
| S5 | 2023 | Multi-input multi-output (MIMO) SSM with parallel scan |
| H3 (Hungry Hungry Hippos) | 2023 | Combined SSM with multiplicative gating for language modeling |
| Mamba | 2023 | Input-dependent (selective) SSM parameters |
Each step simplified the architecture while maintaining or improving performance, with Mamba making the final and most significant leap by abandoning time invariance altogether.
The core limitation of all prior SSM architectures was Linear Time Invariance. Because the matrices A, B, and C remained identical for every token regardless of input content, these models could not perform content-aware filtering. Consider a language model processing the sentence "The cat sat on the mat": an LTI model treats the word "the" with exactly the same dynamics as the word "cat," even though they carry very different amounts of semantic information.
Transformers solve this problem through self-attention, which computes pairwise similarity scores between all tokens. This enables content-based reasoning but comes at O(L^2) cost in both time and memory. Mamba achieves content-aware processing while maintaining O(L) complexity.
Mamba makes three key SSM parameters functions of the input:
| Parameter | Role | How it becomes selective |
|---|---|---|
| B (input matrix) | Controls how input enters the state | Projected from the input via a linear layer; different for each token |
| C (output matrix) | Controls how state maps to output | Projected from the input via a linear layer; different for each token |
| Delta (step size) | Controls discretization resolution | Projected from the input via a linear layer + softplus; different for each token |
The state matrix A itself remains input-independent and keeps its HiPPO-style initialization; however, because A only enters the computation through the product delta·A during discretization, the input-dependent step size already makes the effective discrete transition A-bar vary from token to token.
The step size delta plays a particularly important role as a selection mechanism. A large delta causes the model to emphasize the current input token and reset more of the historical state. A small delta causes the model to suppress the current token in favor of preserving existing context. This gives the model a learned, content-dependent ability to decide what information to retain and what to discard, similar to the gating mechanisms in LSTMs and GRUs.
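A hedged PyTorch sketch of this selection mechanism is shown below. The layer names and shapes are illustrative rather than taken from the released mamba_ssm code, and delta is produced here by a plain per-channel linear projection rather than the low-rank parameterization used in practice:

```python
# Sketch: B, C, and delta are projected from each token, while A stays fixed.
import torch
import torch.nn as nn
import torch.nn.functional as F

batch, length, d_model, d_state = 2, 16, 64, 16

x = torch.randn(batch, length, d_model)          # input sequence
proj_B = nn.Linear(d_model, d_state)             # s_B(x)
proj_C = nn.Linear(d_model, d_state)             # s_C(x)
proj_delta = nn.Linear(d_model, d_model)         # s_delta(x), one step size per channel

B = proj_B(x)                                    # (batch, length, d_state): one B per token
C = proj_C(x)                                    # (batch, length, d_state): one C per token
delta = F.softplus(proj_delta(x))                # (batch, length, d_model): positive step sizes

A = -torch.exp(torch.randn(d_model, d_state))    # fixed, input-independent state matrix
A_bar = torch.exp(delta.unsqueeze(-1) * A)       # discretized per token: (batch, length, d_model, d_state)
```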
Making parameters input-dependent means the model can no longer be computed as a fixed convolution kernel, since the kernel changes at every time step. This forces Mamba to use the recurrent representation. However, naive sequential recurrence would be far too slow for training on modern GPUs.
Mamba solves this with the parallel scan (prefix sum) algorithm. The key insight is that the recurrence operation is associative: the order in which intermediate results are combined does not affect the final answer. This property allows the sequence to be split into segments that are computed in parallel, with results merged iteratively. The parallel scan reduces the time complexity from O(L) sequential steps to O(log L) parallel steps while producing the same output as sequential recurrence.
The combination of input-dependent parameters and the parallel scan algorithm is what the authors call the "selective scan" mechanism.
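The associativity argument can be made concrete in a few lines of NumPy. The sketch below (illustrative only, not the fused CUDA kernel) defines the combine operator for the scalar recurrence h_t = a_t · h_(t-1) + b_t and checks that a divide-and-conquer scan reproduces the sequential result:

```python
# Sketch: the selective recurrence can be computed with a parallel scan
# because the pairwise combine operator below is associative.
import numpy as np

def combine(left, right):
    """Compose two recurrence segments: apply (a1, b1) first, then (a2, b2)."""
    a1, b1 = left
    a2, b2 = right
    return (a1 * a2, a2 * b1 + b2)

def scan(pairs):
    """Inclusive prefix scan using the associative combine (divide and conquer)."""
    if len(pairs) == 1:
        return pairs
    mid = len(pairs) // 2
    left, right = scan(pairs[:mid]), scan(pairs[mid:])
    carry = left[-1]                              # total effect of the left half
    return left + [combine(carry, r) for r in right]

rng = np.random.default_rng(0)
L = 8
a = rng.uniform(0.5, 1.0, L)                      # per-token decay (from the discretized A)
b = rng.standard_normal(L)                        # per-token input contribution (B_bar * u)

# Sequential recurrence as the reference.
h, h_seq = 0.0, []
for t in range(L):
    h = a[t] * h + b[t]
    h_seq.append(h)

# Scan result: applying each prefix to h_0 = 0 leaves only the accumulated term.
h_par = [p[1] for p in scan(list(zip(a, b)))]

assert np.allclose(h_seq, h_par)
```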
The selective scan mechanism creates a computational challenge: the expanded state (with input-dependent B and C matrices incorporating the batch and sequence length dimensions) is much larger than the original state. Naively materializing this expanded state in GPU high-bandwidth memory (HBM, also called DRAM) would erase the efficiency gains of using an SSM in the first place.
Mamba addresses this with a hardware-aware algorithm that mirrors techniques from FlashAttention (also developed by Tri Dao). The implementation uses three key optimizations:
Kernel fusion. Instead of writing intermediate results (discretization output, scan output, and the C multiplication) back to slow GPU DRAM between each operation, Mamba fuses all three operations into a single GPU kernel that keeps intermediate values in fast on-chip SRAM. This eliminates costly memory transfers between the GPU memory hierarchy levels.
Recomputation. Rather than storing the large intermediate states during the forward pass for use in backpropagation, Mamba recomputes them during the backward pass. Although this doubles the computation for those states, recomputation is faster than the alternative of reading large intermediate tensors from DRAM, because modern GPUs are memory-bandwidth-limited rather than compute-limited.
Avoiding materialization. The expanded state is never fully materialized in DRAM. Only the compressed hidden state (of size N, the state dimension) is kept in memory, not the full expanded representation.
These optimizations make the selective scan operation 20 to 40 times faster than a naive implementation. The resulting algorithm is faster than optimized attention implementations (such as FlashAttention) for long sequences while scaling linearly rather than quadratically.
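The recomputation strategy has a framework-level analogue in gradient checkpointing. The sketch below uses PyTorch's torch.utils.checkpoint purely to illustrate trading recomputation for memory; it is not a description of Mamba's fused kernel:

```python
# Sketch: gradient checkpointing discards intermediate activations in the
# forward pass and recomputes them during backward, trading FLOPs for memory.
import torch
from torch.utils.checkpoint import checkpoint

layer = torch.nn.Sequential(
    torch.nn.Linear(512, 2048), torch.nn.GELU(), torch.nn.Linear(2048, 512)
)
x = torch.randn(8, 512, requires_grad=True)

# Activations inside `layer` are not stored; they are recomputed on backward.
y = checkpoint(layer, x, use_reentrant=False)
y.sum().backward()
```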
The full Mamba architecture wraps the selective SSM layer into a simplified block design. Unlike transformers, which alternate between self-attention layers and MLP (feed-forward) layers, each Mamba block combines both functions into a single unit: a linear projection expands the input into two parallel branches; the main branch passes through a short causal convolution, a SiLU activation, and the selective SSM, while the second branch is SiLU-activated and acts as a multiplicative gate; the two branches are multiplied elementwise and projected back down to the model dimension.
This design is inspired by the gated MLP block used in Llama and similar models, with the selective SSM (preceded by the causal convolution) inserted into the main branch. The result is a simpler architecture that does not require separate attention and MLP blocks, layer normalization within the block, or positional embeddings.
A complete Mamba model stacks many of these blocks (for example, 48 blocks for the 1.4B parameter model and 64 blocks for the 2.8B model), with RMSNorm applied before each block.
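A hedged PyTorch sketch of this block structure is shown below; dimensions and layer names are illustrative rather than copied from the reference code, and the selective SSM is left as an identity placeholder so the example runs end to end:

```python
# Sketch of the gated Mamba block: expand, (conv -> SiLU -> SSM) on one branch,
# SiLU gate on the other, multiply, project back down.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MambaBlockSketch(nn.Module):
    def __init__(self, d_model: int, expand: int = 2, d_conv: int = 4):
        super().__init__()
        d_inner = expand * d_model
        self.in_proj = nn.Linear(d_model, 2 * d_inner)   # two branches: SSM path and gate
        self.conv1d = nn.Conv1d(d_inner, d_inner, d_conv,
                                groups=d_inner, padding=d_conv - 1)  # depthwise causal conv
        self.out_proj = nn.Linear(d_inner, d_model)

    def selective_ssm(self, x):
        # Placeholder for the selective scan described earlier; identity here
        # so the sketch runs end to end.
        return x

    def forward(self, x):                                 # x: (batch, length, d_model)
        length = x.shape[1]
        x_ssm, z = self.in_proj(x).chunk(2, dim=-1)       # split into SSM branch and gate branch
        x_ssm = self.conv1d(x_ssm.transpose(1, 2))[..., :length].transpose(1, 2)
        x_ssm = self.selective_ssm(F.silu(x_ssm))         # conv -> SiLU -> selective SSM
        return self.out_proj(x_ssm * F.silu(z))           # gate, then project back to d_model

block = MambaBlockSketch(d_model=64)
out = block(torch.randn(2, 16, 64))                       # shape (2, 16, 64)
```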
| Aspect | Transformer | Mamba |
|---|---|---|
| Core mechanism | Self-attention | Selective state space model |
| Time complexity (sequence length L) | O(L^2) per layer | O(L) per layer |
| Training parallelism | Fully parallel (attention matrix) | Parallel (via parallel scan) |
| Inference mode | Autoregressive with KV cache | Recurrent with fixed-size state |
| Inference memory | KV cache grows linearly with context | Fixed-size state (constant memory) |
| Content-aware reasoning | Yes (attention computes pairwise token interactions) | Yes (input-dependent SSM parameters) |
| Context window | Fixed maximum (requires extension techniques) | Theoretically unbounded |
| Long-range dependencies | Attention can directly connect any two positions | State carries compressed history |
| In-context learning | Strong | Weaker (limited by finite state size) |
| Copying/retrieval | Strong (direct access to all past tokens) | Weaker (information must survive state compression) |
| Position encoding | Required (sinusoidal, RoPE, ALiBi, etc.) | Not required |
| Inference throughput (similar size) | Baseline | Up to 5x higher |
The fundamental trade-off is between expressiveness and efficiency. Transformers maintain an uncompressed representation of the entire sequence through the KV cache, allowing direct access to any past token. Mamba compresses the entire history into a fixed-size hidden state, which is much more memory-efficient but means information can only be accessed if it survived the compression. This makes transformers better at tasks requiring exact retrieval from context (such as looking up a phone number mentioned earlier in the text) while Mamba excels at tasks involving long-range dependencies and high-throughput generation.
Mamba was evaluated across multiple modalities on the Pile dataset (300B tokens for language) and domain-specific datasets for audio and genomics.
On language modeling with the Pile, Mamba models showed consistent improvements over transformer baselines at each scale:
| Model | Parameters | Pile perplexity | Notes |
|---|---|---|---|
| Transformer | 1.4B | Baseline | Standard transformer |
| Mamba | 1.4B | Lower than Transformer-1.4B | Outperforms the size-matched Transformer |
| Transformer | 2.8B | Baseline | Standard transformer |
| Mamba | 2.8B | Matches Transformer-6.9B | Comparable quality with roughly 40% of the parameters |
On downstream zero-shot evaluation tasks, Mamba-3B outperformed Transformer-3B models and matched or exceeded the performance of Transformers with twice as many parameters.
On the SC09 speech generation benchmark, a small Mamba model outperformed larger autoregressive, GAN-based, and diffusion-based baselines (including WaveNet, SampleRNN, WaveGAN, DiffWave, and SaShiMi). A larger Mamba variant further improved fidelity metrics, reducing the FID score by more than half compared to the prior state of the art.
Mamba demonstrated strong results on DNA sequence modeling, outperforming HyenaDNA across model sizes. Unlike HyenaDNA, whose performance degraded with longer sequences, Mamba continued to improve with context lengths up to 1 million tokens. On a downstream species classification task distinguishing five great ape species (which share approximately 99% DNA similarity), Mamba's ability to use extremely long contexts proved particularly effective.
Mamba's selective scan implementation achieved up to 3x speedup over prior SSM methods on A100 GPUs and up to 5x higher generation throughput compared to similarly sized Transformers. The efficient scan kernel was 40x faster than a standard (naive) implementation of the selective recurrence.
In May 2024, Tri Dao and Albert Gu published "Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality," introducing Mamba-2. The paper established a theoretical framework called Structured State Space Duality (SSD) that reveals a deep connection between SSMs and attention mechanisms.
The key theoretical result is that a state space model with a scalar-times-identity state matrix (where all diagonal elements of A are identical) is mathematically equivalent to a form of masked self-attention with a 1-semiseparable causal mask. This duality means the same computation can be expressed either as an SSM recurrence or as a matrix multiplication resembling attention: the recurrence h_t = a_t · h_(t-1) + B_t x_t with output y_t = C_t^T h_t unrolls to y_t = sum over s <= t of (a_t · a_(t-1) · ... · a_(s+1)) · (C_t^T B_s) · x_s, which is multiplication by a lower-triangular matrix whose (t, s) entry is the pairwise score C_t^T B_s weighted by the cumulative decay between positions s and t.
When all a_t values equal 1, the attention form reduces to standard causal linear attention.
This connection runs through the theory of semiseparable matrices, a well-studied class in numerical linear algebra. The authors showed that both SSMs and attention can be understood as different decompositions of the same structured matrix.
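The duality can be checked numerically for a single channel. In the sketch below (illustrative notation, not the SSD reference code), the same output is computed once as a recurrence and once as multiplication by the masked, decay-weighted matrix described above:

```python
# Sketch: scalar-decay selective recurrence vs. the equivalent masked
# attention-like matrix multiplication (one channel).
import numpy as np

rng = np.random.default_rng(0)
L, N = 6, 4
x = rng.standard_normal(L)             # one input channel
a = rng.uniform(0.5, 1.0, L)           # scalar decay per step (scalar-times-identity A)
B = rng.standard_normal((L, N))        # per-token input projection
C = rng.standard_normal((L, N))        # per-token output projection

# Recurrent (SSM) form.
h = np.zeros(N)
y_rec = np.zeros(L)
for t in range(L):
    h = a[t] * h + B[t] * x[t]
    y_rec[t] = C[t] @ h

# Attention-like form: M[t, s] = (C_t . B_s) * a_(s+1) * ... * a_t for s <= t.
M = np.zeros((L, L))
for t in range(L):
    for s in range(t + 1):
        M[t, s] = (C[t] @ B[s]) * np.prod(a[s + 1:t + 1])
y_att = M @ x

assert np.allclose(y_rec, y_att)       # same computation, two decompositions
```

Setting every a_t to 1 makes M the plain causal mask over the scores C_t^T B_s, recovering standard causal linear attention as noted above.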
| Feature | Mamba-1 | Mamba-2 |
|---|---|---|
| State matrix A | Diagonal (different value per channel) | Scalar times identity (one value shared) |
| Head dimension | P = 1 (independent SSMs per channel) | P >= 64 (shared dynamics across channels) |
| Typical state dimension N | 16 | 64 to 256 |
| Parameter generation | Sequential (SSM params depend on x) | Parallel (A, B, C generated alongside x) |
| Core algorithm | Selective scan via parallel prefix sum | Chunkwise SSD (quadratic within chunks, linear across chunks) |
The larger state dimension in Mamba-2 (up to 16x larger than Mamba-1) significantly improves performance on associative recall tasks, where Mamba-1 was weakest. The multi-head structure, where P channels share a single state transition, reduces the total number of independent recurrences while increasing expressiveness.
Mamba-2's core SSD layer runs 2 to 8 times faster than Mamba-1's selective SSM while maintaining competitive performance with Transformers on language modeling benchmarks. The SSD algorithm achieves this speedup by decomposing sequences into chunks, applying the quadratic (attention-like) computation within each chunk for hardware efficiency, and passing SSM states between chunks to maintain the linear overall scaling. A minimal PyTorch implementation requires approximately 30 lines of code.
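The chunkwise idea can be sketched for a single channel as follows; this is a simplified illustration of the decomposition described above, not the official PyTorch reference implementation:

```python
# Sketch: quadratic (attention-like) computation inside each chunk, with a
# single SSM state passed between chunks, keeping the overall cost linear.
import numpy as np

rng = np.random.default_rng(0)
L, N, Q = 12, 4, 4                       # sequence length, state dim, chunk length
x = rng.standard_normal(L)
a = rng.uniform(0.5, 1.0, L)             # scalar decay per step
B = rng.standard_normal((L, N))
C = rng.standard_normal((L, N))

y = np.zeros(L)
h = np.zeros(N)                          # state carried between chunks
for start in range(0, L, Q):
    idx = slice(start, start + Q)
    xc, ac, Bc, Cc = x[idx], a[idx], B[idx], C[idx]
    q = len(xc)
    for t in range(q):
        # Intra-chunk: quadratic, attention-like computation within the chunk.
        y[start + t] = sum(
            np.prod(ac[s + 1:t + 1]) * (Cc[t] @ Bc[s]) * xc[s] for s in range(t + 1)
        )
        # Inter-chunk: contribution of the state carried in from earlier chunks.
        y[start + t] += np.prod(ac[:t + 1]) * (Cc[t] @ h)
    # Update the carried state to the end of this chunk.
    h = np.prod(ac) * h + sum(np.prod(ac[s + 1:]) * Bc[s] * xc[s] for s in range(q))

# Reference: plain sequential recurrence over the whole sequence.
h_ref, y_ref = np.zeros(N), np.zeros(L)
for t in range(L):
    h_ref = a[t] * h_ref + B[t] * x[t]
    y_ref[t] = C[t] @ h_ref
assert np.allclose(y, y_ref)
```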
In March 2026, Aakash Lahoti, Kevin Y. Li, Berlin Chen, Caitlin Wang, Aviv Bick, J. Zico Kolter, Tri Dao, and Albert Gu introduced Mamba-3 (published at ICLR 2026). The paper addresses remaining expressivity limitations in Mamba-2 through three innovations:
Exponential-trapezoidal discretization. This replaces the simpler exponential-Euler discretization used in Mamba-1 and Mamba-2 with a second-order accurate method. The improved discretization enables an implicit convolution applied on the SSM input, increasing the expressivity of the recurrence.
Complex-valued state updates. By modeling the SSM in the complex number domain, Mamba-3 achieves a more expressive state update. This connects to data-dependent rotary position embeddings (RoPE), providing a theoretical bridge between complex SSMs and the position encoding techniques used in transformers.
Multi-input multi-output (MIMO) formulation. MIMO transitions from outer-product to matrix-multiplication-based state updates, increasing the rank of input and output projections. This raises decoding FLOPs by up to 4x relative to Mamba-2 at a fixed state size but maintains similar wall-clock decode latency.
At the 1.5B parameter scale, Mamba-3 improved average downstream accuracy by 0.6 points over Gated DeltaNet (the next best model) in the SISO configuration, with the MIMO variant adding another 1.2 points. Mamba-3 with state size 64 matched the perplexity of Mamba-2 with state size 128, effectively achieving the same language modeling quality at half the state size. The model also solved formal language tasks (parity and modular arithmetic) that Mamba-2 could not handle, addressing known state-tracking deficiencies in prior linear recurrent models.
Recognizing that transformers and Mamba have complementary strengths, AI21 Labs released Jamba in March 2024, the first production-grade hybrid architecture combining Transformer attention layers, Mamba SSM layers, and Mixture-of-Experts (MoE).
Jamba interleaves blocks of Transformer and Mamba layers with a ratio of approximately one attention layer for every seven Mamba layers. Each block contains either an attention or Mamba layer followed by a feed-forward network (MLP). The MoE component allows the model to use only 12B of its total 52B parameters at inference time, keeping compute costs manageable.
Jamba supports context lengths up to 256K tokens while fitting in a single 80GB GPU. The hybrid design provides the in-context retrieval strength of attention layers, the linear-time long-context efficiency and small inference state of the Mamba layers, and the added capacity of MoE at a roughly constant active-parameter cost.
AI21 Labs later released Jamba-1.5 in two sizes: Jamba-1.5-Large (94B active parameters) and Jamba-1.5-Mini (12B active parameters), both with 256K token effective context length. Jamba demonstrated that hybrid architectures can capture the best properties of both design paradigms rather than forcing a choice between them.
Mamba and its variants have been applied to language modeling, text generation, and related NLP tasks. The linear-time inference makes Mamba particularly attractive for applications requiring long-context generation or high-throughput serving. Several open-source Mamba-based language models have been released, and the architecture has been integrated into model serving frameworks.
SSMs have proven especially effective for genomics, where sequences can be extremely long (millions of base pairs). Mamba-based models outperform transformer-based and convolution-based alternatives on DNA sequence modeling tasks. Caduceus, a bidirectional Mamba model for DNA, outperformed comparably sized unidirectional models and transformer models orders of magnitude larger on biologically relevant tasks including predicting the effects of genetic mutations on gene expression.
Mamba achieved state-of-the-art results on audio waveform modeling, outperforming SaShiMi, Hyena, and transformer-based models on both pretraining quality and downstream generation metrics. The ability to process very long sequences efficiently makes SSMs a natural fit for raw audio waveforms, which require modeling dependencies across tens of thousands of samples.
Vision Mamba (Vim), introduced in January 2024 and accepted at ICML 2024, adapts the Mamba architecture for visual tasks. Vim uses bidirectional Mamba blocks with position embeddings to process image patch sequences. On ImageNet classification, COCO object detection, and ADE20k semantic segmentation, Vim achieves higher performance than the DeiT vision transformer while being 2.8x faster and using 86.8% less GPU memory at high resolutions.
Other vision variants include VMamba (NeurIPS 2024), VideoMamba for video understanding (ECCV 2024), and Mamba-ND for multi-dimensional data.
Despite its efficiency advantages, Mamba has several known limitations compared to transformers:
In-context learning. Empirical studies show that Mamba and Mamba-2 lag behind transformers on in-context learning tasks. On the standard five-shot MMLU benchmark, Mamba models score approximately 15 points lower than similarly sized transformer models after training on 1.1 trillion tokens.
Copying and retrieval. Transformers can copy information from their input context with near-perfect accuracy up to their context length, while Mamba models begin to fail at copying for input sequences beyond approximately 500 tokens. This limitation is fundamental: a constant-size state cannot faithfully store an arbitrary-length sequence. Generalized SSMs cannot copy input sequences uniformly unless the state size grows linearly with sequence length.
Multi-query associative recall. Mamba struggles to retrieve specific key-value pairs from context, because the finite hidden state can be overwhelmed as the number of key-value pairs increases. Pretrained transformers can outperform Mamba models with 10x more parameters on information retrieval tasks.
State capacity. All of these limitations stem from the same root cause: Mamba compresses the entire sequence history into a fixed-size state vector. Any information that does not survive this compression is permanently lost. Transformers avoid this problem by maintaining full access to all past tokens through the KV cache, at the cost of linear memory growth and quadratic computation.
These limitations have motivated hybrid architectures like Jamba, which interleave a small number of attention layers to handle retrieval-heavy tasks while relying on Mamba layers for efficient long-range processing.
Imagine you are listening to a very long story. A transformer is like writing down every single word of the story so you can look back at any word whenever you want. This works really well, but your notebook gets bigger and bigger, and it takes longer and longer to flip through all the pages.
Mamba is like keeping a summary in your head instead of writing everything down. As you hear each new word, you decide whether it is important enough to remember or whether you can forget it. You update your mental summary as you go. Your summary always stays the same size no matter how long the story gets, so you can listen to very, very long stories without running out of space. The downside is that if someone asks you to repeat word number 47 exactly, you might not remember it because you only kept a summary, not the full text.
The special trick Mamba uses is that it gets to decide what to remember based on what it is hearing right now. If it hears something important, it pays more attention and updates its summary. If it hears something less important, it mostly ignores it and keeps its existing summary. This is what "selective" means in "selective state space model."