Mamba is a deep learning architecture for sequence modeling based on selective state space models (SSMs). Introduced by Albert Gu and Tri Dao in December 2023, Mamba addresses fundamental limitations of Transformer architectures by replacing the quadratic-time self-attention mechanism with a linear-time selective state space layer. The architecture achieves performance competitive with Transformers on language modeling and other sequential tasks while scaling linearly in sequence length, enabling efficient processing of sequences up to millions of tokens. Mamba was published at the first Conference on Language Modeling (COLM) in 2024, where it received the Outstanding Paper Award.
Since their introduction in 2017, Transformers have dominated sequence modeling across natural language processing, computer vision, and other domains. The core mechanism of the Transformer, self-attention, computes pairwise interactions between all tokens in a sequence. This gives the model a global receptive field at every layer but comes at a computational cost of O(n^2) in sequence length n, both in time and memory. For long sequences (thousands to millions of tokens), this quadratic scaling becomes prohibitively expensive. Moreover, during autoregressive generation, Transformers must store a key-value (KV) cache that grows linearly with sequence length, creating memory bottlenecks during inference.
These limitations motivated research into sub-quadratic architectures that could match Transformer quality while handling longer sequences more efficiently. State space models emerged as one of the most promising directions.
State space models draw on classical control theory and signal processing. A continuous-time SSM maps an input signal x(t) to an output y(t) through a latent state h(t) using the following equations:
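h'(t) = A h(t) + B x(t)
y(t) = C h(t) + D x(t)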
Here, A is the state transition matrix, B is the input projection matrix, C is the output projection matrix, and D provides a skip connection. To apply these models to discrete sequences (such as text tokens), the continuous parameters are discretized using techniques like the zero-order hold (ZOH) method, which introduces a learnable step size parameter (delta) that controls the resolution at which the continuous dynamics are sampled.
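As an illustration, ZOH discretization for the common case of a diagonal A can be sketched as follows (a minimal numpy sketch, not the official implementation; the function name is illustrative):

```python
import numpy as np

def zoh_discretize(A_diag, B, delta):
    """Zero-order-hold discretization for a diagonal continuous-time SSM.

    A_diag: (N,) diagonal entries of A (negative for stability)
    B:      (N,) input projection
    delta:  scalar step size
    Returns (A_bar, B_bar) for the discrete recurrence
    h_k = A_bar * h_{k-1} + B_bar * x_k.
    """
    A_bar = np.exp(delta * A_diag)
    # Exact ZOH for diagonal A: B_bar = (exp(delta*A) - 1) / A * B
    B_bar = (A_bar - 1.0) / A_diag * B
    return A_bar, B_bar

A_diag = np.array([-1.0, -0.5])
B = np.array([1.0, 1.0])
A_bar, B_bar = zoh_discretize(A_diag, B, delta=0.1)
# For small delta, B_bar is close to delta * B
```

Note how the step size delta sets the effective sampling rate: a larger delta shrinks A_bar toward zero (faster forgetting), while a smaller delta keeps it near one.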
The theoretical foundation for modern SSMs traces back to the HiPPO (High-order Polynomial Projection Operators) framework, introduced by Albert Gu and collaborators at NeurIPS 2020. HiPPO provides a principled method for initializing the state matrix A so that the hidden state optimally compresses the history of the input signal using orthogonal polynomial projections. This initialization proved critical for enabling SSMs to capture long-range dependencies, a problem that had long plagued recurrent neural networks.
Building on HiPPO, the S4 (Structured State Space sequence model) architecture was introduced by Albert Gu, Karan Goel, and Christopher Ré at ICLR 2022. S4 addressed the computational challenges of working with large state matrices by decomposing A as a normal matrix plus a low-rank correction, allowing stable diagonalization and reducing the computation to a Cauchy kernel. S4 achieved breakthrough results on the Long Range Arena benchmark, including solving the Path-X task (sequence length 16,384) that all prior methods had failed on. It also reached 91% accuracy on sequential CIFAR-10 without data augmentation, generated sequences 60 times faster than Transformers, and substantially closed the gap on language and image modeling tasks.
However, S4 and its variants operated as linear time-invariant (LTI) systems, meaning the state transition parameters remained fixed regardless of the input content. This property enabled efficient computation via convolutions but limited the model's ability to perform content-based reasoning.
The H3 (Hungry Hungry Hippos) model, published at ICLR 2023 by Tri Dao, Dan Fu, Khaled Saab, and others from Stanford's Hazy Research group, identified two key capabilities that SSMs lacked for effective language modeling: recalling earlier tokens in a sequence and comparing tokens across the sequence. H3 addressed these shortcomings by stacking two SSMs with multiplicative interactions between their outputs and input projections. At the 2.7B parameter scale, H3 came within 0.4 perplexity points of Transformers on OpenWebText while matching attention performance on synthetic language tasks. H3 also introduced hardware-aware algorithmic optimizations that would later influence Mamba's design.
The central innovation of Mamba is making the SSM parameters functions of the input, creating what the authors call a "selective" state space model (sometimes referred to as S6). In traditional SSMs like S4, the matrices A, B, and C remain fixed across all positions in the sequence. This means the model applies the same dynamics regardless of what token it is processing. While this linear time-invariance allows efficient parallel computation via convolutions, it fundamentally limits the model's ability to selectively focus on or ignore specific inputs.
In Mamba, the parameters B, C, and the discretization step size delta become explicit functions of the current input token. Specifically, at each position t in the sequence, these parameters are computed through learned linear projections of the input:
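B_t = W_B x_t
C_t = W_C x_t
delta_t = softplus(W_delta x_t + b_delta)

Here W_B, W_C, and W_delta are learned weights, and the softplus keeps the step size positive. (In the paper, delta is produced through a low-rank projection; W_delta and b_delta here stand in for that parameterization.)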
The step size delta controls how much the model attends to the current input versus retaining information from previous states. A larger delta causes the model to focus more on the current token and reset previous context, while a smaller delta allows the model to retain more historical information. This gives Mamba a content-aware gating mechanism conceptually similar to the forget gates in LSTM networks, but operating within the SSM framework.
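Because discretization yields A_bar = exp(delta * A) with negative entries in A, this retention behavior follows directly (a small numeric illustration; the values are arbitrary):

```python
import numpy as np

a = -1.0                         # one (negative) diagonal entry of A
A_bar_small = np.exp(0.01 * a)   # small delta: ~0.99, state largely retained
A_bar_large = np.exp(5.0 * a)    # large delta: ~0.007, previous state mostly reset
```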
The matrix A, which governs state transitions, is initialized using the HiPPO framework and is not made input-dependent. Instead, it is parameterized in log-space and combined with the input-dependent delta during discretization.
The Mamba architecture organizes computation into repeated blocks that combine ideas from the H3 SSM block and the gated MLP (multi-layer perceptron) blocks used in Transformers. Each Mamba block processes its input through the following stages:
Input projection: The input of dimension D is linearly projected into two branches, each of dimension E times D, where E is the expansion factor (typically E = 2). One branch serves as the main processing path, and the other provides a gating signal.
1D convolution: The main branch passes through a one-dimensional depthwise convolution with a small kernel (typically kernel size 4). This convolution captures local patterns and provides positional awareness without requiring explicit positional encoding, enabling the SSM to incorporate short-range context before the recurrent computation.
Activation: A SiLU (Sigmoid Linear Unit, also known as Swish) activation function introduces non-linearity after the convolution.
Selective SSM: The activated features are processed by the selective state space layer. The input-dependent parameters B_t, C_t, and delta_t are computed via linear projections, the continuous parameters are discretized, and the selective scan algorithm computes the recurrence.
Gating: The output of the selective SSM is multiplied element-wise with the gating branch (which has passed through its own SiLU activation). This gating mechanism, inspired by gated linear units (GLUs), allows the model to control information flow.
Output projection: A final linear projection maps the gated output back to the model dimension D.
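The six stages above can be traced at the shape level in a deliberately simplified numpy sketch (random weights stand in for learned parameters, and the scan uses the paper's simplified discretization of B; the real implementation fuses these steps into custom CUDA kernels):

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def softplus(x):
    return np.log1p(np.exp(x))

rng = np.random.default_rng(0)
L, D, E, N, K = 16, 8, 2, 4, 4           # seq len, model dim, expansion, state dim, conv kernel
ED = E * D

# Random stand-ins for learned weights
W_in   = rng.normal(0, 0.1, (D, 2 * ED))      # stage 1: input projection (two branches)
conv_w = rng.normal(0, 0.1, (ED, K))          # stage 2: depthwise causal conv
W_B    = rng.normal(0, 0.1, (ED, N))          # stage 4: selection projections
W_C    = rng.normal(0, 0.1, (ED, N))
w_dt   = rng.normal(0, 0.1, ED)
A      = -np.exp(rng.normal(0, 0.1, (ED, N))) # fixed state matrix, negative for stability
W_out  = rng.normal(0, 0.1, (ED, D))          # stage 6: output projection

x = rng.normal(size=(L, D))

# 1) project into main branch u and gate branch z
u, z = np.split(x @ W_in, 2, axis=1)

# 2) causal depthwise convolution, 3) SiLU activation
u_pad = np.vstack([np.zeros((K - 1, ED)), u])
u = silu(np.stack([(u_pad[t:t + K].T * conv_w).sum(axis=1) for t in range(L)]))

# 4) selective SSM with input-dependent B, C, delta (sequential scan)
B, C = u @ W_B, u @ W_C                  # (L, N) each
delta = softplus(u * w_dt)               # (L, ED), positive step sizes
h = np.zeros((ED, N))
y = np.empty((L, ED))
for t in range(L):
    A_bar = np.exp(delta[t][:, None] * A)              # ZOH discretization of A
    h = A_bar * h + (delta[t] * u[t])[:, None] * B[t]  # simplified discretization of B
    y[t] = h @ C[t]

# 5) gate with SiLU(z), 6) project back to model dimension
out = (y * silu(z)) @ W_out              # shape (L, D)
```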
This block replaces both the attention and MLP sub-layers found in a standard Transformer block. Because Mamba combines these functions into a single block, a Mamba model typically uses roughly twice as many layers as a comparably sized Transformer to achieve a similar parameter count.
The design draws inspiration from the Gated Attention Unit (GAU), which similarly merged attention and MLP computations. Most of the parameters in each block (approximately 3ED^2) reside in the linear projections, while the SSM itself contributes relatively few parameters.
Making the SSM parameters input-dependent eliminates the ability to compute the model as a convolution, which was the primary efficiency mechanism for prior SSMs like S4. The data-dependent nature of the selective scan means each step depends on the previous step's output, creating a sequential dependency that seems to preclude parallelism.
Mamba addresses this through a hardware-aware parallel scan algorithm, inspired by the same GPU memory hierarchy principles behind FlashAttention. The key insight is that the expanded state (of dimension D times N, where N is the state dimension) should never be materialized in GPU high-bandwidth memory (HBM). Instead, the implementation loads the SSM parameters from HBM into fast on-chip SRAM, performs the discretization and the recurrent scan entirely in SRAM, and writes only the final outputs back to HBM.
During the backward pass, rather than storing the large intermediate states for gradient computation, Mamba recomputes them from the inputs. This trades extra computation for reduced memory access, which is favorable on modern GPUs where memory bandwidth is often the bottleneck.
The result is an implementation that runs up to 3 times faster than equivalent S4 implementations on A100 GPUs, while achieving true linear scaling in sequence length for both training and inference.
During autoregressive generation, Mamba operates as a true recurrent model. At each generation step, the model maintains a fixed-size hidden state (of dimension D times N) and processes one token at a time, requiring only constant time and memory per step. This contrasts sharply with Transformer inference, where the KV cache grows linearly with the number of generated tokens, causing both memory and computation to increase with sequence length.
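Decoding therefore reduces to a single fixed-size state update per token (an illustrative sketch with frozen, hypothetical parameter values; a real model would compute A_bar, B_bar, and C_t from the current input):

```python
import numpy as np

def decode_step(h, x_t, A_bar, B_bar, C_t):
    """One generation step: constant time and memory regardless of position."""
    h = A_bar * h + np.outer(x_t, B_bar)   # fixed-size state, shape (D, N)
    return h, h @ C_t                      # per-channel output, shape (D,)

D, N = 4, 3
rng = np.random.default_rng(0)
h = np.zeros((D, N))
for _ in range(1000):                      # 1000 tokens later, the state is still (D, N)
    h, y_t = decode_step(h, rng.normal(size=D),
                         0.9 * np.ones((D, N)), rng.normal(size=N),
                         rng.normal(size=N))
```

A Transformer at the same point in generation would be attending over a cache of 1000 keys and values; here the per-step cost never changes.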
This property gives Mamba up to 5 times higher throughput than Transformers during inference, with the advantage growing as sequences become longer.
The original Mamba release included pretrained models at five scales: 130M, 370M, 790M, 1.4B, and 2.8B parameters. All models were trained on 300 billion tokens from the Pile, a curated dataset commonly used for language model pretraining. The models followed the standard dimension configurations described by GPT-3. An additional variant, Mamba-2.8B-SlimPJ, was trained on 600 billion tokens from the SlimPajama dataset.
These models were released as base pretrained models without instruction tuning or other downstream modifications. The official implementation, built in PyTorch with custom CUDA kernels for the selective scan, is available in the state-spaces/mamba repository on GitHub.
In May 2024, Tri Dao and Albert Gu introduced Mamba-2 in the paper "Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality," published at ICML 2024. This work established a deep theoretical connection between state space models and attention mechanisms.
The structured state space duality (SSD) framework reveals that a specific class of SSMs is mathematically equivalent to a form of structured masked attention. Specifically, an SSM with a scalar-times-identity state matrix (where all diagonal elements of A share the same value) is equivalent to causal linear attention with a 1-semiseparable mask matrix.
This duality means the same computation can be expressed in two ways: as a linear-time recurrence that processes the sequence step by step with a fixed-size state (the SSM view), or as a quadratic-time masked matrix multiplication that materializes all pairwise token interactions (the attention view).
The SSD framework connects these views through the theory of semiseparable matrices, providing a rich body of theoretical results that bridges the SSM and attention literatures.
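For the scalar-times-identity case, the two modes can be checked against each other numerically (a self-contained sketch; shapes and values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
T, N = 6, 4                        # sequence length, state dimension
a = rng.uniform(0.5, 1.0, T)       # per-step scalar transition (scalar-times-identity A)
B = rng.normal(size=(T, N))
C = rng.normal(size=(T, N))
x = rng.normal(size=T)

# Linear mode: O(T) recurrent scan
h = np.zeros(N)
y_rec = np.empty(T)
for t in range(T):
    h = a[t] * h + B[t] * x[t]
    y_rec[t] = C[t] @ h

# Quadratic mode: causal linear attention with a 1-semiseparable mask
L = np.zeros((T, T))
for t in range(T):
    for s in range(t + 1):
        L[t, s] = np.prod(a[s + 1:t + 1])   # product of decays between positions s and t
y_att = ((C @ B.T) * L) @ x

assert np.allclose(y_rec, y_att)    # both modes compute identical outputs
```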
Mamba-2 introduces several modifications compared to the original Mamba:

- The SSM parameters delta, B, and C are produced in parallel at the start of the block by projections of the block input, rather than computed from the output of the convolution, simplifying the design and making it amenable to tensor parallelism.
- The state matrix A is restricted to scalar-times-identity form, the structure required by the SSD equivalence.
- The layer adopts a multi-head structure analogous to multi-head attention.
- The state dimension N is increased substantially (from 16 in Mamba to 64 or more), made affordable by the matrix-multiplication-based SSD algorithm.
- An extra normalization layer is added before the output projection to improve training stability at scale.
The SSD algorithm computes the selective state space layer by decomposing the sequence into chunks and processing each chunk using matrix multiplications, while maintaining inter-chunk state through the SSM recurrence. This approach achieves the same asymptotic FLOP count as SSMs (linear in sequence length) while leveraging the tensor cores on modern GPUs that are optimized for matrix multiplication.
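A simplified version of the chunked computation can be checked against a plain scan (a self-contained sketch for the scalar-decay case; the real algorithm batches these steps as dense matrix multiplications on tensor cores):

```python
import numpy as np

rng = np.random.default_rng(1)
T, N, Q = 8, 4, 4                   # sequence length, state dim, chunk size (T % Q == 0)
a = rng.uniform(0.5, 1.0, T)        # per-step scalar decay
B = rng.normal(size=(T, N))
C = rng.normal(size=(T, N))
x = rng.normal(size=T)

# Reference: plain sequential scan
h = np.zeros(N)
y_ref = np.empty(T)
for t in range(T):
    h = a[t] * h + B[t] * x[t]
    y_ref[t] = C[t] @ h

# Chunked computation: quadratic within chunks, recurrent between chunks
h = np.zeros(N)
y = np.empty(T)
for c0 in range(0, T, Q):
    ac, Bc, Cc, xc = a[c0:c0+Q], B[c0:c0+Q], C[c0:c0+Q], x[c0:c0+Q]
    # intra-chunk part: masked matrix multiply
    Lc = np.zeros((Q, Q))
    for i in range(Q):
        for j in range(i + 1):
            Lc[i, j] = np.prod(ac[j + 1:i + 1])
    y_intra = ((Cc @ Bc.T) * Lc) @ xc
    # inter-chunk part: contribution of the state carried in from earlier chunks
    decay_in = np.array([np.prod(ac[:i + 1]) for i in range(Q)])
    y[c0:c0+Q] = y_intra + decay_in * (Cc @ h)
    # carry the state forward to the next chunk
    decay_out = np.array([np.prod(ac[i + 1:]) for i in range(Q)])
    h = np.prod(ac) * h + (decay_out[:, None] * Bc * xc[:, None]).sum(axis=0)

assert np.allclose(y, y_ref)        # chunked result matches the sequential scan
```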
In practice, the SSD algorithm makes Mamba-2's core layer 2 to 8 times faster than Mamba during training, while maintaining competitive performance with Transformers on language modeling benchmarks.
Mamba-3, introduced at ICLR 2026, represents the third generation of the architecture. It integrates three key technical improvements over its predecessors.
At the 1.5B parameter scale, Mamba-3 improves average downstream accuracy by 1.8 percentage points over the next-best baseline (Gated DeltaNet), and achieves better pretraining perplexity than Mamba-2 while using half the state size.
Several notable language models have adopted Mamba or Mamba-based architectures, either in pure SSM form or as hybrid designs.
Jamba is a hybrid Transformer-Mamba language model introduced by AI21 Labs in March 2024. Its architecture interleaves Mamba and Transformer layers, with an overall ratio of one attention layer for every eight total layers. Some layers incorporate mixture-of-experts (MoE) modules to increase model capacity without proportionally increasing active parameters.
The initial Jamba model fits on a single 80GB GPU and supports a context length of 256K tokens. Its hybrid design reduces the KV cache to approximately 4GB at full context length, compared to substantially higher requirements for equivalent pure Transformer models. AI21 Labs later scaled the architecture to Jamba 1.5, with 398B total parameters and 94B active parameters, marking the first large-scale deployment of a Mamba-Transformer hybrid.
Jamba was published at ICLR 2025.
Falcon Mamba 7B, released by the Technology Innovation Institute (TII) of Abu Dhabi in August 2024, is a pure Mamba-based language model. Built on the original Mamba architecture with additional RMS normalization layers for training stability at scale, it was the first major open-source state space language model.
Falcon Mamba 7B outperformed several Transformer models in its class, including Meta's Llama 3.1 8B and Mistral 7B on certain benchmarks, as independently verified by Hugging Face. The model was released under an open-access license with pretrained, instruction-tuned, and quantized variants.
Zamba, introduced by Zyphra in May 2024, uses a novel hybrid approach consisting of a Mamba backbone with a single shared attention layer inserted every six blocks. By sharing attention weights across all insertion points, Zamba minimizes the parameter cost of incorporating attention while still benefiting from its strengths. The architecture also concatenates original input embeddings to the shared attention block to improve information flow across model depth.
Zamba2 extended this approach with two shared attention layers and was released in multiple sizes (1.2B, 2.7B, and 7B parameters).
Codestral Mamba, released by Mistral AI in July 2024, is a 7B-parameter Mamba-2-based model specialized for code generation. It supports a 256K token context window and achieves a 75.0% score on HumanEval, outperforming CodeGemma-1.1 7B (61.0%) and DeepSeek v1.5 7B (65.9%). The model is available under an Apache 2.0 open-source license.
Additional models incorporating Mamba include MoE-Mamba (combining mixture-of-experts with Mamba layers), MambaByte (a token-free variant operating directly on raw bytes), and various domain-specific adaptations for vision, audio, and genomics.
The following table summarizes the key architectural and performance differences between Mamba and Transformer models.
| Property | Transformer | Mamba | Mamba-2 |
|---|---|---|---|
| Core mechanism | Self-attention | Selective SSM (S6) | Structured state space duality (SSD) |
| Training complexity (sequence length n) | O(n^2) | O(n) | O(n) |
| Inference complexity (per token) | O(n) due to KV cache | O(1) constant | O(1) constant |
| Memory during generation | KV cache grows linearly | Fixed-size hidden state | Fixed-size hidden state |
| Parallelism during training | Fully parallel (matrix multiply) | Parallel scan (custom CUDA) | Chunk-wise matrix multiply + scan |
| Tensor core utilization | High | Limited (scalar operations) | High (matrix multiply based) |
| Long-range dependencies | Global attention at every layer | Through recurrent state dynamics | Through recurrent state + head structure |
| Positional encoding | Required (sinusoidal, RoPE, etc.) | Implicit via convolution and SSM | Implicit via SSM |
| In-context learning | Strong | Weaker on some tasks (e.g., MMLU few-shot) | Improved with hybrid approaches |
| Throughput advantage | Baseline | Up to 5x faster at inference | 2-8x faster than Mamba during training |
Empirical comparisons between Mamba, Mamba-2, and Transformer architectures have been conducted at various scales. The following results come from the NVIDIA empirical study conducted at the 8B parameter scale with controlled training conditions.
| Benchmark | Transformer (8B) | Mamba-2 (8B) | Mamba-2-Hybrid (8B) |
|---|---|---|---|
| MMLU (5-shot) | 50.07% | 48.70% | 53.60% |
| HellaSwag | 75.89% | 77.69% | 77.68% |
| PIQA | ~78% | ~78% | ~79% |
| WinoGrande | ~70% | ~71% | ~72% |
| Average (12 tasks) | Baseline | +1.81 pts | +2.65 pts |
Pure SSM models and pure Transformers each have distinct strengths. SSMs excel at efficient long-sequence processing and fast inference, while Transformers offer strong in-context learning and retrieval capabilities through attention. Hybrid architectures attempt to combine the best of both.
The typical hybrid approach interleaves Mamba (or Mamba-2) layers with a smaller number of attention layers. The NVIDIA study found that a configuration of 24 Mamba-2 layers, 4 attention layers, and 28 MLP layers (in a 56-layer model at 8B parameters) consistently outperformed both pure Transformers and pure Mamba models.
Hybrid models offer several practical advantages:

- With only a few attention layers, the KV cache remains small relative to a pure Transformer, keeping long contexts affordable in memory.
- The attention layers restore the precise retrieval and in-context learning abilities on which pure SSMs are weaker.
- Because most layers are recurrent, inference throughput remains close to that of a pure Mamba model.
This hybrid paradigm has been adopted by AI21 Labs (Jamba), Zyphra (Zamba), and NVIDIA in their research models, suggesting it may become a standard architectural pattern.
Mamba's linear scaling and efficient inference make it particularly well-suited for domains involving long sequences.
Mamba and its derivatives have been applied to standard language modeling, code generation (Codestral Mamba), and long-document processing. The constant-memory inference property is especially valuable for applications requiring long-context generation, such as document summarization and conversational AI with extended histories.
DNA sequences can span millions of base pairs, making the quadratic cost of attention impractical. Mamba has demonstrated strong results on genomic sequence modeling, enabling pretraining and fine-tuning on sequences up to one million tokens in length. The original paper reported that Mamba outperformed prior models (including Hyena and Transformers) on DNA sequence modeling benchmarks.
Mamba has been applied to autoregressive waveform generation and speech synthesis. On the SC09 speech generation dataset and YouTubeMix audio dataset, Mamba achieved lower negative log-likelihood than competing models. The architecture's ability to process long sequences efficiently aligns well with the high temporal resolution of audio data, where a single second of audio at 16kHz corresponds to 16,000 time steps.
Vision Mamba (Vim), published at ICML 2024, adapts the Mamba architecture for visual recognition tasks. By processing image patches as sequences with bidirectional state space modeling, Vim achieves comparable accuracy to Vision Transformers (ViT) like DeiT on ImageNet classification while being 2.8 times faster and using 86.8% less GPU memory during batch inference at high resolutions (1248 x 1248). Additional Mamba-based vision models have been proposed for object detection, semantic segmentation, and medical image analysis.
Mamba has been adapted for time series forecasting (MambaTS), recommendation systems, video understanding, and drug design. The architecture's general-purpose sequence modeling capabilities and efficiency advantages make it applicable to any domain where data can be represented as ordered sequences.
Despite its advantages, Mamba has several known limitations:

- In-context learning and precise retrieval are weaker than in Transformers on some tasks; pure Mamba models, for example, lag on few-shot MMLU, a gap that hybrid architectures largely close.
- The original selective scan depends on custom CUDA kernels and makes limited use of tensor cores, a constraint the Mamba-2 SSD algorithm was designed to address.
- The surrounding ecosystem of tooling, optimized inference stacks, and pretrained checkpoints remains far less mature than for Transformers.
- The architecture is less thoroughly validated at the largest model scales than Transformers, with hybrid rather than pure designs leading large-scale deployments so far.
The following table summarizes key milestones in the development of the Mamba architecture and related state space models.
| Date | Milestone |
|---|---|
| August 2020 | HiPPO framework introduced (Gu et al., NeurIPS 2020) |
| November 2021 | S4 (Structured State Spaces) introduced (Gu, Goel, Ré; published at ICLR 2022) |
| December 2022 | H3 (Hungry Hungry Hippos) introduced (Dao, Fu et al.; published at ICLR 2023) |
| December 2023 | Mamba introduced (Gu and Dao; published at COLM 2024 with Outstanding Paper Award) |
| March 2024 | Jamba released by AI21 Labs (published at ICLR 2025) |
| May 2024 | Mamba-2 introduced (Dao and Gu; published at ICML 2024) |
| May 2024 | Zamba released by Zyphra |
| July 2024 | Codestral Mamba released by Mistral AI |
| August 2024 | Falcon Mamba 7B released by TII |
| March 2026 | Mamba-3 introduced (published at ICLR 2026) |