Mamba 2 is a state space model architecture introduced in the paper "Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality" by Tri Dao and Albert Gu, published at ICML 2024 (arXiv:2405.21060). It refines the selective SSM layer from the original Mamba model through a theoretical framework called State Space Duality (SSD), which establishes a formal equivalence between structured state space models and a class of attention mechanisms. The new layer is 2 to 8 times faster than the Mamba 1 selective scan at the same state dimension, and allows the state dimension to grow from 16 in Mamba 1 to 64 or 128 by default, while remaining competitive with Transformer architectures on language modeling benchmarks.
The SSD framework reveals that SSMs and attention are not competing families of sequence models but rather two computational views of the same underlying mathematical object: structured semiseparable matrices. This theoretical result brought a wave of practical benefits, including compatibility with the tensor parallelism and sequence parallelism techniques developed for Transformers, and motivated a family of hybrid architectures that interleave SSM layers with attention layers to capture advantages of both.
Mamba 2 was developed by Tri Dao and Albert Gu.
Tri Dao is an assistant professor at Princeton University. He completed his PhD at Stanford, where he worked on efficient deep learning systems. He is also the primary author of Flash Attention, the IO-aware attention algorithm that became a standard component in large language model training. His research focuses on hardware-aware algorithm design, aiming to close the gap between the theoretical complexity of machine learning operations and the practical throughput achievable on real hardware.
Albert Gu is an assistant professor at Carnegie Mellon University. His doctoral work at Stanford, supervised by Christopher Ré, introduced the S4 model and the broader family of structured state space models that preceded Mamba. He is also Chief Scientist at Cartesia, which he co-founded to build real-time AI products on top of SSM architectures. Gu was named to Time's 100 Most Influential People in AI.
The two researchers had previously co-authored the original Mamba paper (December 2023) and have continued to collaborate on SSM research through both their academic positions and the Goomba Lab blog, where they published a four-part series explaining the theory, algorithms, and systems of Mamba 2.
State space models originated in control theory, where they describe continuous-time dynamical systems through linear differential equations. In the sequence modeling context, they map an input sequence to an output sequence through a hidden state that evolves over time. The theoretical appeal is significant: SSMs process sequences in linear time relative to sequence length, whereas self-attention in Transformers scales quadratically. Inference is also efficient because the hidden state has fixed size, so generating each new token requires only an O(1) update rather than attending to a growing key-value cache.
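To make the O(1) claim concrete, here is a minimal sketch (NumPy, illustrative shapes only, not the actual Mamba implementation) of a single generation step: the model updates a fixed-size state and never re-reads earlier tokens.

```python
import numpy as np

def ssm_step(h, x_t, alpha_t, B_t, C_t):
    """One inference step of a linear SSM with state h of shape (N, d).

    x_t: (d,) input token representation
    alpha_t: scalar decay; B_t: (N,); C_t: (N,)
    Cost is O(N * d) regardless of how many tokens came before.
    """
    h = alpha_t * h + np.outer(B_t, x_t)  # fixed-size state update
    y_t = C_t @ h                         # output read from the state alone
    return h, y_t
```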
For most of the 2010s, recurrent models such as LSTMs and GRUs were the dominant approach to sequence modeling, but they struggled with long-range dependencies and were difficult to parallelize during training. Structured state space models, starting with S4 (Gu et al., 2021), showed that with the right parameterization a linear recurrence could be trained efficiently in parallel and could capture long-range dependencies competitive with Transformers on sequence tasks. S4, S5, and related work demonstrated strong results on the Long Range Arena benchmark.
The first Mamba model (Gu and Dao, 2023) added input-dependent selectivity to the SSM framework. Prior SSMs used time-invariant parameters, meaning the model processed every token with the same dynamics. Mamba introduced selective state spaces, where the projections B and C and the discretization step delta are functions of the input (delta, in turn, makes the effective discrete transition input-dependent), allowing the model to gate what information to retain in the hidden state. This brought SSMs to near-parity with Transformers on language modeling tasks while retaining linear complexity. Training Mamba 1 relied on a hardware-aware parallel associative scan that runs efficiently on GPUs but does not use the tensor cores (matrix multiplication units) that dominate modern GPU throughput.
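A rough single-channel sketch of the selectivity mechanism (w_dt and A_log are hypothetical learned parameters here, following the exp(delta * A) discretization described in the Mamba papers):

```python
import numpy as np

def selective_decay(x_t, w_dt, A_log):
    """Input-dependent decay: the step size dt is a function of the token,
    so each position chooses how strongly earlier state is forgotten
    (alpha near 0) or retained (alpha near 1)."""
    dt = np.logaddexp(0.0, w_dt @ x_t)    # softplus keeps the step size positive
    return np.exp(-dt * np.exp(A_log))    # discretized decay alpha_t in (0, 1)
```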
Modern GPUs are designed around matrix multiplication. An A100 achieves 312 TFLOPS in BF16 matrix multiplication but only 19 TFLOPS in FP32 scalar arithmetic. An H100 achieves 989 TFLOPS in BF16 matrix multiplication but only 67 TFLOPS in scalar arithmetic. This is roughly a 15x gap. Mamba 1's parallel scan operates element-wise rather than through matrix multiplications, which means it uses the slower arithmetic units and leaves most GPU capacity idle. The SSD algorithm in Mamba 2 was designed specifically to express the SSM computation as matrix multiplications, closing this hardware gap.
The central theoretical contribution of the Mamba 2 paper is the State Space Duality (SSD) framework. It connects structured SSMs and attention through the mathematics of semiseparable matrices.
A semiseparable matrix is a structured lower-triangular matrix with the property that every submatrix contained on and below its diagonal is low-rank (rank at most N for an N-semiseparable matrix). The key insight of the SSD framework is that the input-to-output mapping of any causal linear SSM can be written as multiplication by a semiseparable matrix. Entry (i, j) of this matrix, for i >= j, equals C_i times the chain A_i * A_{i-1} * ... * A_{j+1} times B_j, which describes how the contribution of the input at position j propagates to the output at position i through the recurrence. An SSM with a scalar-times-identity transition matrix (A_t = alpha_t * I) produces a 1-semiseparable matrix, meaning each off-diagonal submatrix has rank at most 1.
This representation is not just a notational trick. It means that computing an SSM is equivalent to multiplying a vector by a semiseparable matrix, and any algorithm for that matrix multiplication corresponds to a valid algorithm for the SSM. The parallel associative scan used in Mamba 1 is one such algorithm; the SSD algorithm is another, chosen to exploit tensor cores.
The SSD framework also shows that a certain class of attention mechanisms produces the same semiseparable matrix. Standard self-attention computes output as the product of a softmax-normalized score matrix with a value matrix. Linear attention removes the softmax, making the score matrix a product of query and key matrices directly. Structured Masked Attention (SMA) generalizes this by replacing the causal mask with any structured matrix L that supports fast matrix-vector multiplication.
When the mask L is chosen to be a 1-semiseparable matrix encoding the same scalar discount factors used in an SSM, SMA and the SSM compute identical results. They are the same model expressed in two different computational forms. This is the duality: the SSM view naturally leads to recurrent algorithms, while the SMA view naturally leads to attention-like algorithms with matrix multiplications.
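The duality can be checked numerically in a few lines. The sketch below (NumPy, toy dimensions, scalar-identity A as in Mamba 2) computes the same output twice: once with the sequential recurrence and once as masked matrix multiplication with a 1-semiseparable mask L. This is illustrative code, not the paper's implementation.

```python
import numpy as np

def ssm_recurrence(X, alpha, B, C):
    """Recurrent (SSM) view: O(T) sequential updates on a fixed-size state."""
    T, d = X.shape
    N = B.shape[1]
    h = np.zeros((N, d))
    Y = np.empty_like(X)
    for t in range(T):
        h = alpha[t] * h + np.outer(B[t], X[t])   # h_t = alpha_t h_{t-1} + B_t x_t
        Y[t] = C[t] @ h                           # y_t = C_t h_t
    return Y

def ssm_masked_attention(X, alpha, B, C):
    """Dual (SMA) view: Y = (L * (C B^T)) X, L[i, j] = prod_{k=j+1..i} alpha_k."""
    s = np.cumsum(np.log(alpha))                  # requires alpha > 0
    L = np.tril(np.exp(s[:, None] - s[None, :]))  # 1-semiseparable decay mask
    return (L * (C @ B.T)) @ X

rng = np.random.default_rng(0)
T, d, N = 128, 4, 16
X = rng.standard_normal((T, d))
B = rng.standard_normal((T, N))
C = rng.standard_normal((T, N))
alpha = rng.uniform(0.9, 1.0, size=T)             # per-token scalar decays
assert np.allclose(ssm_recurrence(X, alpha, B, C),
                   ssm_masked_attention(X, alpha, B, C))
```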
The paper also proves a converse result: any kernel attention method that has a fast recurrent form must be an SSM. This establishes SSMs and kernel attention as the same class of models, not merely analogous families.
SSD differs from standard softmax self-attention in exactly two ways. First, it removes the softmax normalization. Second, it applies a multiplicative mask that encodes the input-dependent discount factors from the SSM recurrence. These discount factors represent how much the influence of a token decays over distance: if the model assigns a low discount value at position t, earlier contributions are suppressed rapidly.
From this perspective, Mamba 2 can be understood as causal linear attention with a learnable, input-dependent positional mask. The model controls this mask through the scalar A values, which determine how much context each position carries forward. This view clarifies why SSMs tend to be good at smooth, predictable patterns but can struggle with exact retrieval: the discount mechanism compresses distant tokens rather than preserving them precisely.
The duality also connects to the broader literature on efficient attention. Linear attention (Katharopoulos et al., 2020) removes the softmax from standard attention to achieve linear complexity, but lacks a mechanism for position-dependent weighting. Retention (Sun et al., 2023) adds a geometric decay factor to linear attention, bringing it closer to SSMs. The SSD framework shows that these are all instances of a single parametric family indexed by the choice of mask matrix L. SSMs with learned input-dependent A values occupy a particularly expressive position in this family because their mask is data-driven at each step rather than fixed.
The following table compares the core computational properties of standard attention, linear attention, and SSD:
| Property | Softmax attention | Linear attention | SSD (Mamba 2) |
|---|---|---|---|
| Training complexity | O(T^2 d) | O(T d N) | O(T d N) |
| Inference (per token) | O(T d) (grows with T) | O(d N) | O(d N) |
| Memory (inference) | O(T d) KV cache | O(d N) fixed state | O(d N) fixed state |
| Mask type | Causal (uniform) | None | Input-dependent 1-semiseparable |
| Exact token retrieval | Yes | Approximate | Approximate |
| Tensor core friendly | Yes | Partial | Yes |
| Supports tensor parallelism | Yes | Partial | Yes (Mamba 2) |
The training complexity for SSD is O(T d N) rather than O(T^2 d) because the SSD algorithm operates on chunks of size Q, running the quadratic form within each chunk and the linear recurrence across chunks. For representative values (T = 8192, Q = 128, d = 2048, N = 64), SSD performs far fewer floating-point operations than attention, and in wall-clock benchmarks it overtakes optimized attention implementations at moderate sequence lengths (around 2K tokens) and pulls further ahead as sequences grow.
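A back-of-the-envelope comparison, counting only the dominant matrix multiplications at 2mnk FLOPs each (a sketch; constants vary by implementation):

```python
T, Q, d, N = 8192, 128, 2048, 64

# Softmax attention: Q K^T and (scores) V, each roughly 2 * T^2 * d FLOPs.
attention_flops = 2 * (2 * T * T * d)

# SSD: per chunk, build C B^T (2 Q^2 N) and apply it to X (2 Q^2 d), plus
# roughly O(T d N) work for the chunk-state and output-state steps.
ssd_flops = (T // Q) * (2 * Q * Q * N + 2 * Q * Q * d) + 2 * (2 * T * d * N)

print(f"attention: {attention_flops:.2e} FLOPs, SSD: {ssd_flops:.2e} FLOPs")
# Attention grows as T^2; SSD grows linearly in T for fixed Q and N.
```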
The core architectural change from Mamba 1 to Mamba 2 is a constraint on the recurrent transition matrix A. In Mamba 1, A is a diagonal matrix (each element of the state vector has its own decay rate). In Mamba 2, A is further restricted to scalar-times-identity: A_t = alpha_t * I, where alpha_t is a single scalar. All dimensions of the hidden state share the same decay rate at each time step.
This restriction looks like it should reduce expressiveness. In practice, the authors found that the SSD models with scalar A remain highly competitive on language modeling, and the constraint enables the entire block decomposition strategy that makes the SSD algorithm efficient. The expressiveness lost from the constraint on A is recovered by expanding the state dimension N from 16 (Mamba 1 default) to 64 or 128 in Mamba 2, which the faster algorithm makes affordable.
In Mamba 1, the SSM parameters B, C, and delta (the discretization step) were computed sequentially, with delta applied to A and B before use. Mamba 2 moves all data-dependent projections to a single parallel step at the beginning of the block. The inputs X, A, B, and C are all projected from the input in parallel, with no sequential dependency between them. This simplifies the data flow, reduces the number of passes through GPU memory, and makes the block more amenable to tensor parallelism.
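A minimal sketch of the fused-projection idea (hypothetical shapes and variable names; the real block also includes a gating branch, a short convolution, and per-head structure):

```python
import numpy as np

def mamba2_input_projections(u, W_in, d_inner, N, nheads):
    """One fused matmul produces X, B, C, and dt for all positions at once,
    with no sequential dependency between them.

    u: (T, d_model); W_in: (d_model, d_inner + 2*N + nheads)
    """
    z = u @ W_in                                    # single pass over the input
    X = z[:, :d_inner]                              # values
    B = z[:, d_inner:d_inner + N]                   # shared "key" projection
    C = z[:, d_inner + N:d_inner + 2 * N]           # shared "query" projection
    dt = np.logaddexp(0.0, z[:, d_inner + 2 * N:])  # (T, nheads) positive step sizes
    return X, B, C, dt
```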
Mamba 2 introduces a multi-head structure analogous to multi-head attention (MHA) in Transformers. The channels are split into heads, with X playing the role of the values, B the keys, and C the queries.
The paper analyzes several head-sharing patterns through this analogy. The pattern Mamba 2 adopts corresponds to multi-value attention (MVA): each head has its own value channels (its slice of X), while a single set of B and C projections is shared across all heads. The grouped-value attention (GVA) generalization, analogous to grouped-query attention in Transformers, shares B and C within groups of heads. These sharing schemes allow a larger effective state without proportional increases in parameters or computation.
The multi-head structure also enables tensor parallelism across GPUs. In Mamba 1, the custom selective scan kernel required two all-reduce operations per layer during distributed training, compared to one for Transformer layers. Mamba 2's parallel projection structure and head-based organization reduce this to one all-reduce per layer, matching Transformers and enabling more efficient large-scale training.
Mamba 2 adds a group normalization (or RMS normalization) layer applied to the output of the SSM computation before the final output projection. This follows a pattern also seen in some Mamba 1 training runs and in models like Falcon Mamba, where additional normalization was found necessary for stable training at scale.
The SSD algorithm is a four-step procedure that computes the SSM output by decomposing the semiseparable matrix into blocks and using matrix multiplication for most of the computation.
Given a sequence of length T, the algorithm selects a chunk size Q (typically between 64 and 256 tokens). The sequence is divided into T/Q chunks. The four steps are:
Intra-chunk outputs: For each chunk, compute the contribution to the output from tokens within the same chunk using the attention-like (quadratic) form of SSD. This is a Q x Q matrix multiplication per chunk and runs fully in parallel across all chunks.
Chunk states: For each chunk, compute the SSM state at the end of the chunk assuming a zero initial state. This summarizes what each chunk contributes to future chunks.
Pass states: Run a sequential SSM scan over the T/Q chunk-level states computed in step 2. This is the only sequential step, but it operates on T/Q elements rather than T, reducing its cost by a factor of Q. For Q = 64, the sequential work is 64 times smaller than a full sequential scan.
Output states: Compute the additional contribution to each token's output from the state carried into its chunk by the inter-chunk scan in step 3. This is again a matrix multiplication.
Steps 1, 2, and 4 are fully parallel and use matrix multiplications, which run on tensor cores. Only step 3 is sequential, and it operates on a much shorter sequence. The reference implementation of this algorithm requires approximately 25 lines of code, compared to the more complex selective scan in Mamba 1.
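The following NumPy sketch implements the four steps for the scalar-decay case (toy and unoptimized; it assumes T divisible by Q and alpha > 0, and the real implementation batches these operations as einsums on tensor cores). The naive recurrence at the bottom serves as a correctness check.

```python
import numpy as np

def ssd_chunked(X, alpha, B, C, Q=64):
    """Chunked SSD: X (T, d) inputs, alpha (T,) scalar decays,
    B, C (T, N) projections, Q chunk length (T divisible by Q)."""
    T, d = X.shape
    N = B.shape[1]
    Y = np.empty_like(X)
    h = np.zeros((N, d))                      # step 3 carries this across chunks
    for s0 in range(0, T, Q):
        x, a = X[s0:s0 + Q], alpha[s0:s0 + Q]
        b, c = B[s0:s0 + Q], C[s0:s0 + Q]
        s = np.cumsum(np.log(a))              # s[i] = log prod_{k<=i} a_k
        # Step 1 (intra-chunk outputs): quadratic, attention-like form.
        L = np.tril(np.exp(s[:, None] - s[None, :]))   # decay mask
        Y[s0:s0 + Q] = (L * (c @ b.T)) @ x
        # Step 4 (output states): add the contribution of the state entering
        # this chunk, decayed to each position.
        Y[s0:s0 + Q] += np.exp(s)[:, None] * (c @ h)
        # Steps 2 + 3 (chunk states, pass states): summarize this chunk and
        # advance the inter-chunk recurrence, the only sequential dependency.
        decay_to_end = np.exp(s[-1] - s)               # prod_{k=j+1..Q-1} a_k
        h = np.exp(s[-1]) * h + (decay_to_end[:, None] * b).T @ x
    return Y

# Correctness check against a naive O(T) recurrence.
rng = np.random.default_rng(0)
T, d, N = 256, 4, 8
X = rng.standard_normal((T, d))
B = rng.standard_normal((T, N))
C = rng.standard_normal((T, N))
alpha = rng.uniform(0.9, 1.0, size=T)
h = np.zeros((N, d))
Y_ref = np.empty_like(X)
for t in range(T):
    h = alpha[t] * h + np.outer(B[t], X[t])
    Y_ref[t] = C[t] @ h
assert np.allclose(ssd_chunked(X, alpha, B, C, Q=64), Y_ref)
```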
This design is similar in spirit to Flash Attention, which uses block-level chunking and tiling to keep computation within fast on-chip SRAM rather than reading from slower HBM. The SSD algorithm achieves comparable memory efficiency by processing chunks that fit in SRAM while combining results across chunks without materializing the full semiseparable matrix.
The primary benchmarked improvement is training speed. At the same state dimension, the SSD algorithm is 2 to 8 times faster than the Mamba 1 parallel scan. The range reflects the fact that the speedup grows with sequence length: at short sequences (under 1,000 tokens) the gain is modest, while at longer sequences the matrix-multiplication-heavy SSD algorithm amortizes its overhead more efficiently and the tensor core advantage becomes dominant.
Because the SSD algorithm is faster, Mamba 2 can afford to run with a state dimension 4 to 8 times larger than Mamba 1 at comparable or lower training cost. The larger state improves performance on associative recall tasks, where the model needs to store many key-value associations in its hidden state. On the synthetic Multi-Query Associative Recall (MQAR) benchmark used by the authors, Mamba 2 with N=64 substantially outperforms Mamba 1 with N=16.
The paper trains Mamba 2 on 300 billion tokens from the Pile dataset and evaluates against Mamba 1 and the Transformer++ baseline (a Transformer with RoPE, SwiGLU, and other modern improvements). Mamba 2 matches or slightly exceeds Mamba 1 and sits close to the Transformer++ at equivalent parameter counts and token budgets. The 2.7B Mamba 2 model achieves 8.5 perplexity on an 8K context window, compared to 9.1 for comparably sized Transformers.
The following table shows approximate compute cost per token as sequence length grows, normalized to the 2,048-token row:
| Sequence length | Transformer attention (relative per-token cost) | Mamba 2 SSM (relative per-token cost) |
|---|---|---|
| 2,048 tokens | 1x | ~1x |
| 16,384 tokens | ~6x | ~1x |
| 65,536 tokens | ~25x | ~1x |
| 262,144 tokens | ~100x+ | ~1x |
At 256K tokens, NVIDIA reports Mamba 2 is approximately 18 times faster than a Transformer layer at the same sequence length. This asymptotic advantage reflects the linear vs. quadratic complexity difference: attention compute grows as O(T^2) while the SSM computation grows as O(T).
Beyond the core algorithm, Mamba 2's parallel projection structure enables several systems optimizations that were not available for Mamba 1.
Tensor parallelism splits model parameters across GPUs. Mamba 2 achieves this by dividing the input and output projection matrices across the tensor parallel degree (2, 4, or 8 shards), applying per-GPU group normalization, and using a single all-reduce per layer. This matches the communication pattern of Transformer layers and allows Mamba 2 to scale across multiple GPUs with no additional overhead compared to Transformers.
Sequence parallelism splits long sequences across multiple GPUs. Mamba 2 supports this through the block decomposition property of SSD: each GPU computes local outputs for its portion of the sequence, then passes final chunk states to subsequent GPUs via point-to-point communication.
Variable-length batches, where sequences within a batch have different lengths, normally require padding to a uniform length, which wastes computation on padding tokens. Mamba 2 handles this by treating the entire batch as a single concatenated sequence and setting the transition scalar A_t to zero at sequence boundaries, preventing state from leaking between sequences.
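A toy illustration using the recurrence from the duality sketch above: concatenate the sequences into one stream and zero the decay at each sequence start, so no state crosses a boundary. (The recurrent form handles the zero decay directly; the log-based mask form would need adjustment at alpha = 0.)

```python
import numpy as np

rng = np.random.default_rng(1)
seq_lens = [5, 3, 8]                       # three variable-length sequences
T, d, N = sum(seq_lens), 4, 8
X = rng.standard_normal((T, d))
B = rng.standard_normal((T, N))
C = rng.standard_normal((T, N))
alpha = rng.uniform(0.9, 1.0, size=T)
starts = np.cumsum([0] + seq_lens[:-1])    # positions 0, 5, 8
alpha[starts] = 0.0                        # h_t = 0 * h_{t-1} + B_t x_t at a boundary

h = np.zeros((N, d))
for t in range(T):                         # one pass, no padding tokens
    h = alpha[t] * h + np.outer(B[t], X[t])
    # h now contains information only from the current sequence
```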
A widely adopted finding from both the Mamba 1 and Mamba 2 literature is that hybrid architectures, which alternate SSM layers with occasional attention layers, outperform either architecture used alone. SSM layers provide efficient long-range context compression; attention layers provide precise token-level retrieval. The combination addresses the primary weakness of pure SSMs (limited in-context recall) while retaining most of their efficiency advantage.
The Mamba 2 paper itself tested a hybrid of 6 attention layers and 58 SSD blocks, which outperformed both the pure Mamba 2 model and the pure Transformer baseline on the 300B-token Pile training run. This result has been replicated across a wide range of subsequent work.
Jamba, developed by AI21 Labs, was among the first commercially released hybrid models to combine Mamba layers with attention. The Jamba 1.5 series, released in 2024, uses 72 total layers interleaving Mamba and attention blocks along with a mixture-of-experts (MoE) routing mechanism. Jamba 1.5 Mini and Jamba 1.5 Large support 256K-token context windows and achieve faster inference than comparable Transformer-only models on long contexts.
On a 262,144-token input, Jamba 1.5 Mini generates approximately 62 tokens per second, compared to 41 for LLaMA 3.1 8B and 39 for Mixtral under the same conditions. The model fits in a smaller GPU memory footprint than Transformer equivalents because SSM layers do not require a KV cache, and Jamba 1.5 Mini can handle up to 140K tokens on a single GPU.
Zamba2, developed by Zyphra, replaces the Mamba 1 blocks from the original Zamba architecture with Mamba 2 blocks. The architecture uses a backbone of Mamba 2 layers interleaved with two shared attention blocks in an alternating pattern. Each invocation of a shared block applies its own low-rank adaptation (LoRA) projector, allowing the shared weights to specialize per layer at minimal parameter cost.
Zamba2-2.7B was trained on 3 trillion tokens and subsequently annealed on 100 billion high-quality tokens. The model has extremely low inference latency due to the Mamba 2 blocks, which the Zyphra team reports have roughly four times the throughput of an equal-parameter Transformer block. Zamba2 models are available in 1.2B, 2.7B, and 7B parameter sizes.
Bamba is an open-source hybrid Mamba 2 model developed by IBM Research in collaboration with Princeton, CMU, and UIUC, with direct involvement from Albert Gu and Tri Dao. Bamba-9B was trained on 2.2 trillion tokens and achieves benchmark performance comparable to LLaMA 3.1 8B despite having been trained on roughly one-seventh the data.
At inference time on vLLM, Bamba-9B delivers 2.5 times the throughput and 2 times lower latency compared to a standard Transformer of similar size. IBM released the full training recipe, data mixtures, and a quantization framework alongside the model weights, making Bamba one of the most fully open hybrid SSM models available. The architecture later informed IBM's Granite 4.0 series of enterprise models.
The Technology Innovation Institute (TII) in Abu Dhabi released Falcon Mamba 7B in 2024 as a pure SSM model based on Mamba 1 with additional RMS normalization layers for training stability. This was followed by the Falcon-H1 series in 2025, which uses a hybrid architecture where attention heads and Mamba 2 SSM components run in parallel within each layer. Falcon-H1 models range from 0.5B to 34B parameters and support up to 256K-token context windows across 18 languages.
NVIDIA developed the Nemotron-H family of hybrid Mamba-Transformer models as part of its NeMo framework. Nemotron-H comes in 8B, 47B, and 56B parameter sizes and is designed to offer Transformer-quality accuracy at lower inference cost. At long sequence lengths of 256K tokens, NVIDIA reports Mamba 2 layers are approximately 18 times faster than Transformer attention layers. The NeMo framework provides reference implementations of Mamba 2 training with full support for tensor parallelism, sequence parallelism, and variable-length batching.
Cartesia is an AI company co-founded by Albert Gu (as Chief Scientist) and other researchers to build real-time AI applications on top of state space model technology. The company, based in San Francisco, raised a $5 million seed round and later a $22 million Series A led by Index Ventures in December 2024, bringing total capital raised to $27 million.
Cartesia's primary product is Sonic, a text-to-speech model based on a state space model architecture. Sonic was released in May 2024 and achieved sub-90 millisecond latency to first audio output, which the company positioned as the fastest TTS model available at the time. The latency advantage comes directly from the SSM architecture: because SSMs process tokens through a fixed-size state with O(1) cost per token, there is no growing KV cache to read during generation, which reduces memory bandwidth pressure and enables consistent low-latency output regardless of prompt length.
Cartesia also released Mamba-3B-SlimPJ, an SSM trained on the SlimPajama dataset that demonstrated competitive performance against Transformer models of the same size, and has continued developing multi-modal real-time models. The company has argued that SSMs are particularly well-suited to always-on, streaming applications because their fixed memory footprint and constant per-token cost make resource consumption predictable.
The main documented limitation of Mamba 2, shared with Mamba 1 and most SSMs, is imprecise in-context retrieval. Transformers store all past tokens in a KV cache and can attend directly to any of them during generation, making exact lookup tasks straightforward. SSMs compress past context into a fixed-size hidden state, which means tokens that were not retained strongly during compression cannot be precisely retrieved later.
NVIDIA's empirical study of Mamba-based language models found that both Mamba 1 and Mamba 2 begin to fail at phonebook lookup tasks once input sequences exceed approximately 500 tokens. A Transformer-based 8B model maintained near-perfect accuracy on the same task up to its full pretraining context length of 4,096 tokens. This gap persisted even after training Mamba 2 on 3.5 trillion tokens, suggesting it is architectural rather than a matter of training scale.
Five-shot in-context learning, where the model reads a few examples in the prompt and generalizes to new inputs, also lags behind Transformers. Because few-shot learning relies on the model attending back to the demonstration examples during generation, the SSM's compressed state representation is at a structural disadvantage.
The state dimension in Mamba 2 is a fixed hyperparameter chosen at training time. While the SSD algorithm allows this dimension to be set much larger than in Mamba 1 without prohibitive cost, there is still a ceiling: once the model's hidden state is full, new information can only be accommodated by discarding or overwriting old information. Transformers have no equivalent constraint; their KV cache grows proportionally with context length.
This tradeoff means Mamba 2 is well-suited to tasks where the relevant information is distributed throughout a long sequence (dense context), but can be unreliable on tasks that require finding and retrieving a specific fact buried in a very long document (sparse retrieval).
Several groups training large SSM models reported that plain Mamba architectures can exhibit instability at scale without additional normalization. Falcon Mamba 7B addressed this by adding RMS normalization layers. The Mamba 2 architecture includes group normalization on the SSM output, which partially addresses this, but training SSMs at the scale of the largest Transformer models (70B+ parameters, trillions of tokens) remains less well-characterized than Transformer training at those scales.
As of mid-2024, the Mamba 2 ecosystem was less mature than the Transformer ecosystem. Transformer models benefit from years of optimized inference kernels, serving frameworks, and deployment tooling. Mamba 2 required custom CUDA kernels for efficient training, which initially limited availability to researchers with access to appropriate GPU hardware running Linux. Integration into mainstream frameworks like HuggingFace Transformers arrived later in 2024.
The reference implementation at the state-spaces/mamba GitHub repository provides five Mamba 2 models trained on 300 billion tokens from the Pile:
| Model | Layers | Model dimension | Parameters |
|---|---|---|---|
| mamba2-130m | 24 | 768 | 130M |
| mamba2-370m | 48 | 1,024 | 370M |
| mamba2-780m | 48 | 1,536 | 780M |
| mamba2-1.3b | 48 | 2,048 | 1.3B |
| mamba2-2.7b | 64 | 2,560 | 2.7B |
All models use a default state dimension of 64 or 128, compared to 16 in the corresponding Mamba 1 models. A hybrid variant, mamba2attn-2.7b, adds attention layers to the 2.7B model.
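For reference, usage follows the repository README at the time of writing (the exact constructor arguments and from_pretrained signature may have changed since; the d_state value below is the README's example, not a guarantee):

```python
import torch
from mamba_ssm import Mamba2

# Standalone Mamba 2 block on random input.
batch, length, dim = 2, 64, 256
x = torch.randn(batch, length, dim).to("cuda")
block = Mamba2(d_model=dim, d_state=64, d_conv=4, expand=2).to("cuda")
y = block(x)          # same shape as the input: (batch, length, dim)
assert y.shape == x.shape

# Loading one of the pretrained checkpoints listed above.
from mamba_ssm.models.mixer_seq_simple import MambaLMHeadModel
model = MambaLMHeadModel.from_pretrained(
    "state-spaces/mamba2-2.7b", device="cuda", dtype=torch.bfloat16
)
```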