Mamba 2

Mamba 2 is a state space model architecture introduced in the paper "Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality" by Tri Dao and Albert Gu, published at ICML 2024 (arXiv:2405.21060). It refines the selective SSM layer from the original Mamba model through a theoretical framework called State Space Duality (SSD), which establishes a formal equivalence between structured state space models and a class of attention mechanisms. The new layer is 2 to 8 times faster than the Mamba 1 selective scan at the same state dimension, and allows the state dimension to grow from 16 in Mamba 1 to 64 or 128 by default, while remaining competitive with Transformer architectures on language modeling benchmarks.

The SSD framework shows that SSMs and attention are not competing families of sequence models but two computational views of the same underlying mathematical object: structured semiseparable matrices. This theoretical result has practical consequences, including compatibility with the tensor parallelism and sequence parallelism techniques developed for Transformers, and it motivated a family of hybrid architectures that interleave SSM layers with attention layers to capture the advantages of both. In the two years since publication, Mamba 2 has become the dominant SSM variant in production deployments, anchoring models such as Mistral Codestral Mamba, NVIDIA Nemotron-H, IBM Bamba and Granite 4.0, and Tencent Hunyuan-TurboS, while related hybrids such as AI21's Jamba 1.5 and 1.6 series and NVIDIA Hymba pair attention with Mamba-style SSM layers.

Authors

Mamba 2 was developed by Tri Dao and Albert Gu.

Tri Dao is an assistant professor at Princeton University. He completed his PhD at Stanford, where he worked on efficient deep learning systems. He is also the primary author of FlashAttention, the IO-aware attention algorithm that became a standard component in large language model training. His research focuses on hardware-aware algorithm design, aiming to close the gap between the theoretical complexity of machine learning operations and the practical throughput achievable on real hardware.

Albert Gu is an assistant professor at Carnegie Mellon University. His doctoral work at Stanford, supervised by Christopher Ré, introduced the S4 model and the broader family of structured state space models that preceded Mamba. He is also Chief Scientist at Cartesia, which he co-founded to build real-time AI products on top of SSM architectures. Gu was named to Time's 100 Most Influential People in AI.

The two researchers had previously co-authored the original Mamba paper (December 2023) and have continued to collaborate on SSM research through both their academic positions and the Goomba Lab blog, where they published a four-part series explaining the theory, algorithms, and systems of Mamba 2.

Background

State space models and the SSM revival

State space models originated in control theory, where they describe continuous-time dynamical systems through linear differential equations. In the sequence modeling context, they map an input sequence to an output sequence through a hidden state that evolves over time. The theoretical appeal is significant: SSMs process sequences in linear time relative to sequence length, whereas self-attention in Transformers scales quadratically. Inference is also efficient because the hidden state has fixed size, so generating each new token requires only an O(1) update rather than attending to a growing key-value cache.
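The constant-cost update can be illustrated with a minimal sketch of a discrete linear recurrence. The function names, the single input channel, and the diagonal transition are assumptions chosen for illustration, not the parameterization of any particular model.

```python
# Minimal sketch of a discrete linear SSM recurrence (single input channel).
# Assumption: a diagonal transition, so each state element has its own decay.

def ssm_step(h, x_t, a, b, c):
    """One recurrent update: h_t = a * h_{t-1} + b * x_t, y_t = <c, h_t>."""
    h_new = [a_n * h_n + b_n * x_t for a_n, h_n, b_n in zip(a, h, b)]
    y_t = sum(c_n * h_n for c_n, h_n in zip(c, h_new))
    return h_new, y_t

def run_ssm(xs, a, b, c):
    """Process a whole sequence; the state size never depends on len(xs)."""
    h = [0.0] * len(a)
    ys = []
    for x_t in xs:
        h, y_t = ssm_step(h, x_t, a, b, c)
        ys.append(y_t)
    return ys, h

a = [0.5, 0.9]    # per-element decay (hypothetical values)
b = [1.0, 1.0]    # input projection
c = [1.0, -1.0]   # output projection
ys, h_final = run_ssm([1.0, 0.0, 0.0], a, b, c)
```

Each call to `ssm_step` touches only the fixed-size state, which is why per-token generation cost does not grow with the number of tokens already processed.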

For most of the 2010s, recurrent models such as LSTMs and GRUs were the dominant alternative to attention, but they struggled with long-range dependencies and were difficult to parallelize during training. Structured state space models, starting with S4 (Gu et al., 2021), showed that with the right parameterization a linear recurrence could be trained efficiently in parallel and could capture long-range dependencies competitive with Transformers on sequence tasks. S4, S5, and related work demonstrated strong results on the Long Range Arena benchmark.

The first Mamba model (Gu and Dao, 2023) added input-dependent selectivity to the SSM framework. Prior SSMs used time-invariant transition matrices, meaning the model processed every token with the same dynamics. Mamba introduced selective state spaces, where the parameters B and C and the discretization step size delta are functions of the input, so the discretized transition applied at each step is input-dependent and the model can gate what information to retain in the hidden state. This brought SSMs to near-parity with Transformers on language modeling tasks while retaining linear complexity. Training Mamba 1 relied on a parallel associative scan algorithm that runs efficiently on GPUs but does not use the tensor cores (matrix multiplication units) that dominate modern GPU throughput.

Hardware context

Modern GPUs are designed around matrix multiplication. An A100 achieves 312 TFLOPS in BF16 matrix multiplication but only 19 TFLOPS in FP32 scalar arithmetic. An H100 achieves 989 TFLOPS in BF16 matrix multiplication but only 67 TFLOPS in scalar arithmetic. This is roughly a 15x gap. Mamba 1's parallel scan operates element-wise rather than through matrix multiplications, which means it uses the slower arithmetic units and leaves most GPU capacity idle. The SSD algorithm in Mamba 2 was designed specifically to express the SSM computation as matrix multiplications, closing this hardware gap.

The hardware gap had grown wider with successive GPU generations. NVIDIA's roadmap from Volta (2017) through Hopper (2022) to Blackwell (2024) increased low-precision matrix throughput (FP16, then BF16 and FP8) far faster than scalar throughput, on the assumption that workloads would be dominated by Transformer-style dense matrix multiplication. Algorithms that did not fit this assumption, including most variants of linear attention and recurrent SSMs, were therefore disadvantaged at the kernel level. The SSD reformulation was as much a response to this hardware reality as it was a theoretical advance.

State Space Duality

The central theoretical contribution of the Mamba 2 paper is the State Space Duality (SSD) framework. It connects structured SSMs and attention through the mathematics of semiseparable matrices.

Semiseparable matrices

A semiseparable matrix is a structured lower-triangular matrix with the property that every submatrix contained on and below its diagonal is low-rank. The key insight of the SSD framework is that the input-to-output mapping of any causal linear SSM can be written as multiplication by a semiseparable matrix. Entry (i, j) of this matrix, for i >= j, equals C_i^T times the chain A_i * A_{i-1} * ... * A_{j+1} times B_j (an empty chain, for i = j, is the identity), which describes how the contribution of the input at position j propagates to the output at position i through the recurrence. The rank of these submatrices is bounded by the state dimension N. With a scalar-times-identity transition matrix (A_t = alpha_t * I), the chain collapses to the scalar product alpha_{j+1} * ... * alpha_i, and the matrix becomes the elementwise product of C B^T with a 1-semiseparable mask, meaning each submatrix of the mask on or below the diagonal has rank at most 1.
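The matrix view can be checked numerically on a toy example. The sketch below assumes a single input channel and a diagonal, input-dependent transition; the helper names `ssm_recurrence` and `ssm_matrix` are hypothetical and exist only for this illustration.

```python
import math
import random

def ssm_recurrence(xs, A, B, C):
    """y_t = C_t . h_t with h_t = A_t * h_{t-1} + B_t * x_t (A_t diagonal)."""
    n = len(A[0])
    h = [0.0] * n
    ys = []
    for x_t, a_t, b_t, c_t in zip(xs, A, B, C):
        h = [a_t[k] * h[k] + b_t[k] * x_t for k in range(n)]
        ys.append(sum(c_t[k] * h[k] for k in range(n)))
    return ys

def ssm_matrix(A, B, C):
    """Materialize M with M[i][j] = C_i^T (A_i ... A_{j+1}) B_j for i >= j."""
    T, n = len(A), len(A[0])
    M = [[0.0] * T for _ in range(T)]
    for i in range(T):
        for j in range(i + 1):
            # math.prod over an empty range is 1, which covers the i == j case.
            M[i][j] = sum(
                C[i][k] * math.prod(A[t][k] for t in range(j + 1, i + 1)) * B[j][k]
                for k in range(n)
            )
    return M

random.seed(0)
T, N = 6, 3
A = [[random.uniform(0.5, 1.0) for _ in range(N)] for _ in range(T)]
B = [[random.gauss(0, 1) for _ in range(N)] for _ in range(T)]
C = [[random.gauss(0, 1) for _ in range(N)] for _ in range(T)]
xs = [random.gauss(0, 1) for _ in range(T)]

y_rec = ssm_recurrence(xs, A, B, C)
M = ssm_matrix(A, B, C)
y_mat = [sum(M[i][j] * xs[j] for j in range(T)) for i in range(T)]
```

Both `y_rec` and `y_mat` hold the same outputs up to floating-point error; the first costs O(T N) and the second O(T^2 N), which is why the choice of multiplication algorithm matters in practice.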

This representation is not just a notational trick. It means that computing an SSM is equivalent to multiplying a vector by a semiseparable matrix, and any algorithm for that matrix multiplication corresponds to a valid algorithm for the SSM. The parallel associative scan used in Mamba 1 is one such algorithm; the SSD algorithm is another, chosen to exploit tensor cores. The duality is therefore not merely a conceptual bridge but an explicit menu of computational strategies, each with different memory and throughput tradeoffs, that practitioners can pick from based on hardware constraints and sequence length.

Structured Masked Attention

The SSD framework also shows that a certain class of attention mechanisms produces the same semiseparable matrix. Standard self-attention computes output as the product of a softmax-normalized score matrix with a value matrix. Linear attention removes the softmax, making the score matrix a product of query and key matrices directly. Structured Masked Attention (SMA) generalizes this by replacing the causal mask with any structured matrix L that supports fast matrix-vector multiplication.

When the mask L is chosen to be a 1-semiseparable matrix encoding the same scalar discount factors used in an SSM, SMA and the SSM compute identical results. They are the same model expressed in two different computational forms. This is the duality: the SSM view naturally leads to recurrent algorithms, while the SMA view naturally leads to attention-like algorithms with matrix multiplications.
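The equivalence of the two forms can be sketched for a single head with NumPy. The shapes, the random parameters, and the use of a log-space cumulative sum to build the mask are illustrative assumptions; this is not the reference kernel.

```python
import numpy as np

rng = np.random.default_rng(0)
T, N, P = 8, 4, 3                      # sequence length, state size, head dim
alpha = rng.uniform(0.5, 1.0, size=T)  # scalar decay per position
B = rng.standard_normal((T, N))
C = rng.standard_normal((T, N))
X = rng.standard_normal((T, P))

# Recurrent (SSM) form: h_t = alpha_t * h_{t-1} + B_t x_t^T, y_t = C_t^T h_t.
h = np.zeros((N, P))
Y_rec = np.zeros((T, P))
for t in range(T):
    h = alpha[t] * h + np.outer(B[t], X[t])
    Y_rec[t] = C[t] @ h

# Attention-like (SMA) form: Y = (L * (C B^T)) X with a 1-semiseparable mask,
# L[i, j] = alpha_{j+1} * ... * alpha_i for i >= j and 0 above the diagonal.
log_cum = np.cumsum(np.log(alpha))
L = np.tril(np.exp(log_cum[:, None] - log_cum[None, :]))
Y_att = (L * (C @ B.T)) @ X
```

Setting every `alpha` to 1 turns `L` into the plain causal mask, which recovers causal linear attention as a special case.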

The paper also proves a converse result: any kernel attention method that has a fast recurrent form must be an SSM. Within this overlap, SSMs and kernel attention are therefore the same models rather than merely analogous families.

Connection to standard attention

SSD differs from standard softmax self-attention in exactly two ways. First, it removes the softmax normalization. Second, it applies a multiplicative mask that encodes the input-dependent discount factors from the SSM recurrence. These discount factors represent how much the influence of a token decays over distance: if the model assigns a low discount value at position t, earlier contributions are suppressed rapidly.

From this perspective, Mamba 2 can be understood as causal linear attention with a learnable, input-dependent positional mask. The model controls this mask through the scalar A values, which determine how much context each position carries forward. This view clarifies why SSMs tend to be good at smooth, predictable patterns but can struggle with exact retrieval: the discount mechanism compresses distant tokens rather than preserving them precisely.

The duality also connects to the broader literature on efficient attention. Linear attention (Katharopoulos et al., 2020) removes the softmax from standard attention to achieve linear complexity, but lacks a mechanism for position-dependent weighting. Retention (Sun et al., 2023) adds a geometric decay factor to linear attention, bringing it closer to SSMs. The SSD framework shows that these are all instances of a single parametric family indexed by the choice of mask matrix L. SSMs with learned input-dependent A values occupy a particularly expressive position in this family because their mask is data-driven at each step rather than fixed.

Relation to other linear attention variants

The SSD framework has clarified the relationships between several formerly disparate efficient attention proposals. RetNet's retention operator corresponds to SMA with a geometric decay mask whose rate is fixed for each head and does not depend on the input or the position. RWKV-4 and earlier RWKV variants use a similar fixed exponential decay, while RWKV-5 and RWKV-6 introduce data-dependent decays that bring them structurally closer to Mamba 2. Gated Linear Attention (GLA, Yang et al., 2024) uses a learnable input-dependent gate per channel that, in the SSD vocabulary, corresponds to a diagonal rather than scalar transition matrix. DeltaNet adds an explicit delta-rule update that lets the model overwrite specific state slots, addressing one of the recall weaknesses of softmax-free recurrences.

When these models are placed in the SSD coordinate system, they differ primarily in three axes: the form of the transition matrix (scalar versus diagonal versus more general), the dependence of the transition on input (fixed versus data-dependent), and the presence or absence of corrective update rules. Mamba 2 occupies the data-dependent scalar transition corner, accepting a constraint on the transition shape in exchange for an algorithm that maps cleanly onto matrix multiplication units. Several research groups have followed up with hybrids that pair Mamba 2 blocks with DeltaNet-style update rules or GLA-style diagonal transitions to recover expressiveness without sacrificing throughput.

Computational comparison

The following table compares the core computational properties of standard attention, linear attention, and SSD:

Property	Softmax attention	Linear attention	SSD (Mamba 2)
Training complexity	O(T^2 d)	O(T d N)	O(T d N)
Inference (per token)	O(T d) (grows with T)	O(d N)	O(d N)
Memory (inference)	O(T d) KV cache	O(d N) fixed state	O(d N) fixed state
Mask type	Causal (uniform)	Causal (uniform, no decay)	Input-dependent 1-semiseparable
Exact token retrieval	Yes	Approximate	Approximate
Tensor core friendly	Yes	Partial	Yes
Supports tensor parallelism	Yes	Partial	Yes

The training complexity for SSD is O(T d N) rather than O(T^2 d) because the SSD algorithm operates on chunks of size Q, running the quadratic form within each chunk and the linear recurrence across chunks. For typical values (T = 8192, Q = 128, d = 2048, N = 64), the SSD algorithm achieves similar FLOP counts to attention at moderate sequence lengths and becomes faster at longer sequences.
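A rough per-head cost accounting makes the linear scaling concrete. The head dimension P and the grouping of terms are assumptions added for this sketch, and constant factors are omitted.

```latex
\begin{aligned}
\text{intra-chunk quadratic form} &: \frac{T}{Q}\cdot O\!\left(Q^{2}(N+P)\right) = O\!\left(TQ(N+P)\right) \\
\text{chunk states and state-to-output terms} &: \frac{T}{Q}\cdot O(QNP) = O(TNP) \\
\text{recurrence across chunks} &: \frac{T}{Q}\cdot O(NP) \\
\text{total per head} &: O\!\left(TQ(N+P) + TNP\right)
\end{aligned}
```

With Q, N, and P held fixed, every term is linear in T. Taking Q and P comparable to N gives O(T N^2) per head, and summing over roughly d/N heads gives the O(T d N) entry in the table.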

Mamba 2 architecture

Constraints on the transition matrix

The core architectural change from Mamba 1 to Mamba 2 is a constraint on the recurrent transition matrix A. In Mamba 1, A is a diagonal matrix (each element of the state vector has its own decay rate). In Mamba 2, A is further restricted to scalar-times-identity: A_t = alpha_t * I, where alpha_t is a single scalar. All dimensions of the hidden state share the same decay rate at each time step.

This restriction looks like it should reduce expressiveness. In practice, the authors found that the SSD models with scalar A remain highly competitive on language modeling, and the constraint enables the entire block decomposition strategy that makes the SSD algorithm efficient. The expressiveness lost from the constraint on A is recovered by expanding the state dimension N from 16 (Mamba 1 default) to 64 or 128 in Mamba 2, which the faster algorithm makes affordable.

Parallel input projections

In Mamba 1, the SSM parameters B, C, and delta (the discretization step) were computed sequentially, with delta applied to A and B before use. Mamba 2 moves all data-dependent projections to a single parallel step at the beginning of the block. The inputs X, A, B, and C are all projected from the input in parallel, with no sequential dependency between them. This simplifies the data flow, reduces the number of passes through GPU memory, and makes the block more amenable to tensor parallelism.

Multi-head structure

Mamba 2 introduces a multi-head structure analogous to multi-head attention (MHA) in Transformers. Under the duality, X plays the role of the values, B the keys, and C the queries, and the model splits X into multiple heads, each with its own scalar decay.

The paper describes several head-sharing patterns. Mamba 2, like Mamba 1, follows the multi-value attention (MVA) pattern, which the paper also calls a multi-input SSM: B and C are shared across heads while X is independent per head. A grouped variant, grouped-value attention (GVA), shares B and C within groups of heads rather than across all of them, analogous to grouped-query attention in Transformers, and lets the number of groups match the tensor parallel degree. The paper also considers a multi-query pattern (X and B shared, C independent per head) and a multi-key pattern (X and C shared, B independent per head). These sharing schemes allow larger effective model capacity without proportional increases in parameters or computation.

The multi-head structure also enables tensor parallelism across GPUs. In Mamba 1, the SSM parameters are computed from activations inside the block, so sharding the block requires two all-reduce operations per layer during distributed training, compared to one for Transformer layers. Mamba 2's parallel projection structure and head-based organization reduce this to one all-reduce per layer, matching Transformers and enabling more efficient large-scale training.

Normalization

Mamba 2 adds a group normalization (or RMS normalization) layer applied to the output of the SSM computation before the final output projection. This follows a pattern also seen in some Mamba 1 training runs and in models like Falcon Mamba, where additional normalization was found necessary for stable training at scale.

The SSD algorithm

The SSD algorithm is a four-step procedure that computes the SSM output by decomposing the semiseparable matrix into blocks and using matrix multiplication for most of the computation.

Given a sequence of length T, the algorithm selects a chunk size Q (typically between 64 and 256 tokens). The sequence is divided into T/Q chunks. The four steps are:

1. Intra-chunk outputs: For each chunk, compute the contribution to the output from tokens within the same chunk using the attention-like (quadratic) form of SSD. This is a Q x Q matrix multiplication per chunk and runs fully in parallel across all chunks.
2. Chunk states: For each chunk, compute the SSM state at the end of the chunk assuming a zero initial state. This summarizes what each chunk contributes to future chunks.
3. Pass states: Run a sequential SSM scan over the T/Q chunk-level states computed in step 2. This is the only sequential step, but it operates on T/Q elements rather than T, reducing its cost by a factor of Q. For Q = 64, the sequential work is 64 times smaller than a full sequential scan.
4. Output states: Compute the additional contribution to each token's output from the true initial state that was propagated through the inter-chunk scan in step 3. This is again a matrix multiplication.

Steps 1, 2, and 4 are fully parallel and use matrix multiplications, which run on tensor cores. Only step 3 is sequential, and it operates on a much shorter sequence. The reference implementation of this algorithm requires approximately 25 lines of code, compared to the more complex selective scan in Mamba 1.
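The four steps can be sketched for a single head in NumPy. This is an illustrative reimplementation under simplifying assumptions, not the reference code: it loops over chunks instead of batching them, forms decay products by direct multiplication and division rather than in log space throughout, and requires T to be a multiple of Q. The function names are hypothetical.

```python
import numpy as np

def decay_mask(alpha):
    """Lower-triangular L with L[i, j] = alpha[j+1] * ... * alpha[i]."""
    c = np.cumsum(np.log(alpha))
    return np.tril(np.exp(c[:, None] - c[None, :]))

def ssd_recurrent(X, alpha, B, C):
    """Token-by-token reference: h_t = alpha_t h_{t-1} + B_t x_t^T."""
    T, P = X.shape
    h = np.zeros((B.shape[1], P))
    Y = np.zeros((T, P))
    for t in range(T):
        h = alpha[t] * h + np.outer(B[t], X[t])
        Y[t] = C[t] @ h
    return Y

def ssd_chunked(X, alpha, B, C, Q):
    T, P = X.shape
    N = B.shape[1]
    assert T % Q == 0, "sketch assumes T is a multiple of the chunk size"
    Y = np.zeros((T, P))
    states, decays = [], []
    for s in range(0, T, Q):
        x, a, b, c = X[s:s + Q], alpha[s:s + Q], B[s:s + Q], C[s:s + Q]
        # Step 1: intra-chunk outputs via the quadratic (attention-like) form.
        Y[s:s + Q] = (decay_mask(a) * (c @ b.T)) @ x
        # Step 2: state at the end of the chunk, assuming a zero initial state.
        to_end = np.prod(a) / np.cumprod(a)   # decay from position j to chunk end
        states.append((b * to_end[:, None]).T @ x)
        decays.append(np.prod(a))             # total decay across the chunk
    # Step 3: sequential scan over T/Q chunk-level states.
    incoming, h = [], np.zeros((N, P))
    for state, decay in zip(states, decays):
        incoming.append(h)                    # true state entering this chunk
        h = decay * h + state
    # Step 4: add the contribution of each chunk's incoming state.
    for k, s in enumerate(range(0, T, Q)):
        from_start = np.cumprod(alpha[s:s + Q])
        Y[s:s + Q] += (C[s:s + Q] * from_start[:, None]) @ incoming[k]
    return Y

rng = np.random.default_rng(0)
T, N, P = 16, 4, 3
alpha = rng.uniform(0.5, 1.0, size=T)
B, C = rng.standard_normal((T, N)), rng.standard_normal((T, N))
X = rng.standard_normal((T, P))
Y_chunked = ssd_chunked(X, alpha, B, C, Q=4)
Y_reference = ssd_recurrent(X, alpha, B, C)
```

Setting Q = T reduces the sketch to the pure quadratic form, and Q = 1 reduces it to the plain recurrence, so the chunk size interpolates between the two views.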

This design is similar in spirit to FlashAttention, which uses block-level chunking and tiling to keep computation within fast on-chip SRAM rather than reading from slower HBM. The SSD algorithm achieves comparable memory efficiency by processing chunks that fit in SRAM while combining results across chunks without materializing the full semiseparable matrix.

Speed and efficiency

Training throughput

The primary benchmarked improvement is training speed. Measured on the same state dimension, the SSD algorithm is 2 to 8 times faster than the Mamba 1 parallel scan. The range reflects the fact that the speedup grows with sequence length: at short sequences (under 1,000 tokens) the gain is modest, while at longer sequences the matrix-multiplication-heavy SSD algorithm amortizes its overhead more efficiently and the tensor core advantage becomes dominant.

Because the SSD algorithm is faster, Mamba 2 can afford to run with a state dimension 4 to 8 times larger than Mamba 1 at comparable or lower training cost. The larger state improves performance on associative recall tasks, where the model needs to store many key-value associations in its hidden state. On the synthetic Multi-Query Associative Recall (MQAR) benchmark used by the authors, Mamba 2 with N=64 substantially outperforms Mamba 1 with N=16.

Language modeling results

The paper trains Mamba 2 on 300 billion tokens from the Pile dataset and evaluates against Mamba 1 and the Transformer++ baseline (a Transformer with RoPE, SwiGLU, and other modern improvements). Mamba 2 matches or slightly exceeds Mamba 1 and sits close to the Transformer++ at equivalent parameter counts and token budgets. The 2.7B Mamba 2 model achieves 8.5 perplexity on an 8K context window, compared to 9.1 for comparably sized Transformers.

Sequence length scaling

Sequence length	Transformer attention (relative)	Mamba 2 SSM (relative)
2,048 tokens	1x	~1x
16,384 tokens	~6x	~1x
65,536 tokens	~25x	~1x
262,144 tokens	~100x+	~1x

At 256K tokens, NVIDIA reports Mamba 2 is approximately 18 times faster than a Transformer layer at the same sequence length. This asymptotic advantage reflects the linear vs. quadratic complexity difference: attention compute grows as O(T^2) while the SSM computation grows as O(T).

Systems-level improvements

Beyond the core algorithm, Mamba 2's parallel projection structure enables several systems optimizations that were not available for Mamba 1.

Tensor parallelism splits model parameters across GPUs. Mamba 2 achieves this by dividing the input and output projection matrices across the tensor parallel degree (2, 4, or 8 shards), applying per-GPU group normalization, and using a single all-reduce per layer. This matches the communication pattern of Transformer layers and allows Mamba 2 to scale across multiple GPUs with no additional overhead compared to Transformers.
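The communication pattern can be simulated on one machine. In the sketch below, `tanh` is only a stand-in for the per-head SSM computation, the shard loop stands in for separate GPUs, and the final sum stands in for the all-reduce; all sizes and names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_inner, shards = 8, 16, 4
x = rng.standard_normal((5, d_model))
W_in = rng.standard_normal((d_model, d_inner))
W_out = rng.standard_normal((d_inner, d_model))

def group_norm(z, groups):
    """RMS-normalize each of `groups` contiguous feature groups separately."""
    g = z.reshape(z.shape[0], groups, -1)
    g = g / np.sqrt((g ** 2).mean(axis=-1, keepdims=True) + 1e-6)
    return g.reshape(z.shape)

# Unsharded reference: one group per future shard.
y_ref = group_norm(np.tanh(x @ W_in), groups=shards) @ W_out

# Sharded version: split W_in by columns and W_out by rows.
partials = []
for W_in_k, W_out_k in zip(np.split(W_in, shards, axis=1),
                           np.split(W_out, shards, axis=0)):
    z_k = np.tanh(x @ W_in_k)          # local features, no communication
    z_k = group_norm(z_k, groups=1)    # normalization stays local to the shard
    partials.append(z_k @ W_out_k)     # partial output of full width d_model
y_tp = np.sum(partials, axis=0)        # the single all-reduce per layer
```

Because each normalization group lives entirely on one shard, no statistics need to cross devices, which is the reason the sketch uses group normalization rather than a norm over the full inner dimension.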

Sequence parallelism splits long sequences across multiple GPUs. Mamba 2 supports this through the block decomposition property of SSD: each GPU computes local outputs for its portion of the sequence, then passes final chunk states to subsequent GPUs via point-to-point communication.

Variable-length batches, where sequences within a batch have different lengths, normally require padding to a uniform length, which wastes computation on padding tokens. Mamba 2 handles this by treating the entire batch as a single concatenated sequence and setting the transition scalar A_t to zero at sequence boundaries, preventing state from leaking between sequences.
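A toy scalar-state example shows why zeroing the decay at a boundary isolates the sequences. The `scan` helper and the numeric values are hypothetical and chosen only for illustration.

```python
def scan(xs, alphas, bs):
    """Scalar-state recurrence h_t = alpha_t * h_{t-1} + b_t * x_t; returns all h_t."""
    h, out = 0.0, []
    for x, a, b in zip(xs, alphas, bs):
        h = a * h + b * x
        out.append(h)
    return out

seq1 = dict(xs=[1.0, 2.0, 3.0], alphas=[0.9, 0.8, 0.7], bs=[1.0, 0.5, 0.25])
seq2 = dict(xs=[4.0, 5.0], alphas=[0.6, 0.5], bs=[2.0, 1.0])

# Baseline: process each sequence on its own, starting from a zero state.
separate = scan(**seq1) + scan(**seq2)

# Packed: concatenate both sequences and set the decay to zero at the first
# token of the second sequence so that no state crosses the boundary.
packed_alphas = seq1["alphas"] + [0.0] + seq2["alphas"][1:]
packed = scan(seq1["xs"] + seq2["xs"], packed_alphas, seq1["bs"] + seq2["bs"])
```

The packed run produces the same states as the separate runs without any padding tokens, because a zero decay multiplies away whatever state the previous sequence left behind.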

Hybrid models

A widely adopted finding from both the Mamba 1 and Mamba 2 literature is that hybrid architectures, which alternate SSM layers with occasional attention layers, outperform either architecture used alone. SSM layers provide efficient long-range context compression; attention layers provide precise token-level retrieval. The combination addresses the primary weakness of pure SSMs (limited in-context recall) while retaining most of their efficiency advantage.

The Mamba 2 paper itself tested a hybrid of 6 attention layers and 58 SSD blocks, which outperformed both the pure Mamba 2 model and the pure Transformer baseline on the 300B-token Pile training run. This result has been replicated across a wide range of subsequent work, and by 2025 essentially every major SSM deployment uses a hybrid configuration rather than a pure SSM stack.

Codestral Mamba (Mistral AI)

Mistral released Codestral Mamba 7B (also marketed as Mamba-Codestral-7B-v0.1) on July 16, 2024, making it the first major commercial code generation model built on the Mamba 2 architecture. The model is released under an Apache 2.0 license and was developed with input from Albert Gu and Tri Dao. It is targeted specifically at code completion and was trained on a curated mix of public code repositories with special emphasis on long-range structural code patterns.

Unlike the hybrids described in the rest of this section, Codestral Mamba is a pure Mamba 2 stack with no interleaved attention layers. It supports a 256K-token context window, more than seven times the original Mistral 7B context, which the team argued is well-suited to entire-repository code understanding scenarios. Inference benefits from the SSM's fixed-state behavior: token throughput remains roughly constant as the prompt grows, in contrast to Transformer code models whose KV cache memory and bandwidth requirements scale linearly with context length.

On the HumanEval benchmark, Codestral Mamba achieves results competitive with similarly sized Transformer-based code models, including DeepSeek-Coder and CodeLlama variants. Mistral published Codestral Mamba alongside Mathstral 7B, a related math-specialist model that uses a more conventional Transformer backbone.

Jamba 1.5 and 1.6 (AI21 Labs)

Jamba, developed by AI21 Labs, was among the first commercially released hybrid models to combine Mamba layers with attention. The Jamba 1.5 series, released in 2024, interleaves Mamba and attention blocks (72 layers in total for Jamba 1.5 Large) along with a mixture-of-experts (MoE) routing mechanism. Its SSM layers are Mamba 1 rather than Mamba 2; AI21 reported that Mamba 1 performed better than Mamba 2 in its hybrid setting, so Jamba is a closely related hybrid rather than a Mamba 2 deployment. Jamba 1.5 Mini and Jamba 1.5 Large support 256K-token context windows and achieve faster inference than comparable Transformer-only models on long contexts.

On a 262,144-token input, Jamba 1.5 Mini generates approximately 62 tokens per second, compared to 41 for LLaMA 3.1 8B and 39 for Mixtral under the same conditions. The model fits in a smaller GPU memory footprint than Transformer equivalents because SSM layers do not require a KV cache, and Jamba 1.5 Mini can handle up to 140K tokens on a single 24 GB GPU.

AI21 released Jamba 1.6 in early 2025 as a successor focused on enterprise retrieval-augmented generation use cases. Jamba 1.6 keeps the hybrid SSM-attention-MoE backbone and 256K-token context window but retunes the model for grounded question answering and data classification. The company reports a 26 percentage point improvement in data classification quality over Jamba 1.5 and over 90 percent citation consistency on long-context grounded QA. A practical consequence reported by AI21 is that customers can migrate from Jamba 1.5 Large to Jamba 1.6 Mini while keeping quality high and recovering roughly 40 percent in latency, illustrating that architectural maturity, not just raw scale, is now the lever in hybrid SSM deployments.

Zamba 2 (Zyphra)

Zamba2, developed by Zyphra, replaces the Mamba 1 blocks from the original Zamba architecture with Mamba 2 blocks. The architecture uses a backbone of Mamba 2 layers interleaved with two shared attention blocks in an alternating pattern. The shared attention blocks use a low-rank adaptation (LoRA) projector at each invocation to allow the shared MLP weights to specialize per layer at minimal parameter cost.

Zamba2-2.7B was trained on 3 trillion tokens and subsequently annealed on 100 billion high-quality tokens. The model has extremely low inference latency due to the Mamba 2 blocks, which the Zyphra team reports have roughly four times the throughput of an equal-parameter Transformer block. Zamba2 models are available in 1.2B, 2.7B, and 7B parameter sizes.

Bamba (IBM Research)

Bamba is an open-source hybrid Mamba 2 model developed by IBM Research in collaboration with Princeton, CMU, and UIUC, with direct involvement from Albert Gu and Tri Dao. Bamba-9B was trained on 2.2 trillion tokens and achieves benchmark performance comparable to LLaMA 3.1 8B despite having been trained on roughly one-seventh the data.

At inference time on vLLM, Bamba-9B delivers 2.5 times the throughput and half the latency of a standard Transformer of similar size. IBM released the full training recipe, data mixtures, and a quantization framework alongside the model weights, making Bamba one of the most fully open hybrid SSM models available. The architecture later informed IBM's Granite 4.0 series of enterprise models.

Falcon-H1 (Falcon family at TII)

The Technology Innovation Institute (TII) in Abu Dhabi released Falcon Mamba 7B in 2024 as a pure SSM model based on Mamba 1 with additional RMS normalization layers for training stability. This was followed by the Falcon-H1 series in 2025, which uses a hybrid architecture where attention heads and Mamba 2 SSM components run in parallel within each layer. Falcon-H1 models range from 0.5B to 34B parameters and support up to 256K-token context windows across 18 languages.

NVIDIA Hymba and Nemotron-H

NVIDIA developed the Nemotron-H family of hybrid Mamba-Transformer models as part of its NeMo framework. Nemotron-H comes in 8B, 47B, and 56B parameter sizes and is designed to offer Transformer-quality accuracy at lower inference cost. At long sequence lengths of 256K tokens, NVIDIA reports Mamba 2 layers are approximately 18 times faster than Transformer attention layers. The NeMo framework provides reference implementations of Mamba 2 training with full support for tensor parallelism, sequence parallelism, and variable-length batching.

In late 2024, ahead of Nemotron-H, NVIDIA released Hymba, a 1.5B-parameter small language model that integrates the two mechanisms more tightly. Rather than alternating Mamba and attention layers in series, Hymba runs them in parallel within a single hybrid head: each Hymba block contains a normalization layer, a hybrid head that applies attention and SSM in parallel and then fuses their outputs, another normalization, and a feedforward network. The block stack uses full attention only in the first, middle, and final blocks; the rest use sliding window attention to bound KV cache memory. NVIDIA reports that over half of the work formerly done by attention can be replaced by SSM processing with negligible accuracy loss, and Hymba's cross-layer KV cache sharing reduces cache memory by up to ten times relative to a same-size Transformer. Hymba 1.5B outperforms Llama 3.2 1B, OpenELM 1B, and Qwen 2.5 1.5B on MMLU, ARC-Challenge, HellaSwag, and SQuAD-C while remaining far smaller in memory footprint.

Hunyuan-TurboS (Tencent)

In March 2025 Tencent published Hunyuan-TurboS, the first publicly described ultra-large-scale hybrid Mamba-Transformer model deployed in production. Hunyuan-TurboS uses a mixture-of-experts backbone with 56B activated parameters out of 560B total, organized into 128 layers that alternate Mamba 2, attention, and feedforward sublayers using a structured AMF/MF block pattern. In the AMF configuration, each block sequentially applies an attention layer, a Mamba 2 layer (a faster variant the authors call "Faster Mamba2"), and a feedforward MoE layer. Grouped-query attention is used to bound the KV cache size of the attention sublayers.

The model was pretrained on 16 trillion tokens, supports a 256K context window, and uses an adaptive long-short chain-of-thought scheme that lets the model switch between a fast direct-answer mode and a deliberative reasoning mode depending on prompt difficulty. Tencent reports that the AMF pattern delivers 2.3 times faster long-context processing than a comparable pure Transformer with less than 15 percent quality degradation on a battery of reasoning benchmarks. On the LMSYS Chatbot Arena leaderboard, Tencent reports an Elo score of 1356 for Hunyuan-TurboS, placing it in the top tier of contemporary models and ahead of Gemini 2.0 Flash and the OpenAI o4-mini-2025-04-16 snapshot at the time of the evaluation. The deployment is significant because it confirms that hybrid SSM architectures can be operated economically at frontier scale in real consumer-facing products.

Cartesia

Cartesia is an AI company co-founded by Albert Gu (as Chief Scientist) and other researchers to build real-time AI applications on top of state space model technology. The company, based in San Francisco, raised a $5 million seed round and later a $22 million Series A led by Index Ventures in December 2024, bringing total capital raised to $27 million.

Cartesia's primary product is Sonic, a text-to-speech model based on a state space model architecture. Sonic was released in May 2024 and achieved sub-90 millisecond latency to first audio output, which the company positioned as the fastest TTS model available at the time. The latency advantage comes directly from the SSM architecture: because SSMs process tokens through a fixed-size state with O(1) cost per token, there is no growing KV cache to read during generation, which reduces memory bandwidth pressure and enables consistent low-latency output regardless of prompt length.

Cartesia also released Mamba-3B-SlimPJ, an SSM trained on the SlimPajama dataset that demonstrated competitive performance against Transformer models of the same size, and has continued developing multi-modal real-time models. The company has argued that SSMs are particularly well-suited to always-on, streaming applications because their fixed memory footprint and constant per-token cost make resource consumption predictable.

Limitations

In-context recall

The main documented limitation of Mamba 2, shared with Mamba 1 and most SSMs, is imprecise in-context retrieval. Transformers store all past tokens in a KV cache and can attend directly to any of them during generation, making exact lookup tasks straightforward. SSMs compress past context into a fixed-size hidden state, which means tokens that were not retained strongly during compression cannot be precisely retrieved later.

NVIDIA's empirical study of Mamba-based language models found that both Mamba 1 and Mamba 2 begin to fail at phonebook lookup tasks once input sequences exceed approximately 500 tokens. A Transformer-based 8B model maintained near-perfect accuracy on the same task up to its full pretraining context length of 4,096 tokens. This gap persisted even after training Mamba 2 on 3.5 trillion tokens, suggesting it is architectural rather than a matter of training scale.

The same study found that five-shot in-context learning (for example, five-shot MMLU), where the model reads a few examples in the prompt and generalizes to new inputs, also lags behind Transformers. Because few-shot learning relies on the model attending back to the demonstration examples during generation, the SSM's compressed state representation is at a structural disadvantage.

Fixed state size

The state dimension in Mamba 2 is a fixed hyperparameter chosen at training time. While the SSD algorithm allows this dimension to be set much larger than in Mamba 1 without prohibitive cost, there is still a ceiling: once the model's hidden state is full, new information can only be accommodated by discarding or overwriting old information. Transformers have no equivalent constraint; their KV cache grows proportionally with context length.

This tradeoff means Mamba 2 is well-suited to tasks where the relevant information is distributed throughout a long sequence (dense context), but can be unreliable on tasks that require finding and retrieving a specific fact buried in a very long document (sparse retrieval).
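The effect of a fixed-size state on retrieval can be seen in a toy associative memory. The sketch below stores key-value pairs as a sum of outer products in a single `dim x dim` matrix, in the style of linear attention without decay, and measures how far a read-out drifts from the stored value as more pairs are written. It is an illustrative simplification, not the Mamba 2 layer; the dimensions, seeds, and function names are assumptions.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Sample a random direction on the unit sphere."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def retrieval_error(n_pairs, dim=32, seed=0):
    """Write n_pairs (key, value) pairs into one dim x dim state, read the first back."""
    rng = random.Random(seed)
    keys = [random_unit_vector(dim, rng) for _ in range(n_pairs)]
    values = [random_unit_vector(dim, rng) for _ in range(n_pairs)]

    # Fixed-size state: S = sum_i outer(value_i, key_i).
    # Its shape is dim x dim no matter how many pairs are written.
    state = [[0.0] * dim for _ in range(dim)]
    for k, v in zip(keys, values):
        for r in range(dim):
            for c in range(dim):
                state[r][c] += v[r] * k[c]

    # Read with the first key: S @ k_0 = v_0 + crosstalk from all other pairs.
    readout = [sum(state[r][c] * keys[0][c] for c in range(dim)) for r in range(dim)]
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(readout, values[0])))


if __name__ == "__main__":
    for n in (1, 4, 32, 256):
        print(n, round(retrieval_error(n), 3))
```

With a single stored pair the read-out is essentially exact; as the number of pairs grows past what the state can hold, crosstalk from the other entries dominates. A KV cache avoids this by keeping every key and value separately, at the cost of memory that grows with context length.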

Training instability at scale

Several groups training large SSM models reported that plain Mamba architectures can exhibit instability at scale without additional normalization. Falcon Mamba 7B addressed this by adding RMS normalization layers. The Mamba 2 architecture includes group normalization on the SSM output, which partially addresses this, but training SSMs at the scale of the largest Transformer models (70B+ parameters, trillions of tokens) remains less well-characterized than Transformer training at those scales. The 2025 deployment of Hunyuan-TurboS at the 560B-parameter MoE scale has begun to fill in this gap, but the recipes for stable training at frontier scale remain less openly documented than their Transformer analogs.
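For reference, RMS normalization itself is a small operation. The following is a minimal pure-Python sketch of the standard formula applied to one activation vector; where exactly such layers are placed in Falcon Mamba or Mamba 2 is not shown here, and the `weight` and `eps` defaults are assumptions.

```python
import math


def rms_norm(x, weight=None, eps=1e-6):
    """Rescale x so its root-mean-square is about 1, then apply a per-channel gain."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    if weight is None:
        weight = [1.0] * len(x)  # identity gain when no learned weight is given
    return [v * inv_rms * w for v, w in zip(x, weight)]


if __name__ == "__main__":
    activations = [3.0, -4.0, 12.0, 0.5]
    normalized = rms_norm(activations)
    # Scaling the input by a large factor leaves the output almost unchanged,
    # which is what keeps activation magnitudes bounded during training.
    blown_up = rms_norm([1000.0 * v for v in activations])
    print(normalized)
    print(blown_up)
```

Because the output depends only on the direction of the input vector (up to `eps`), the normalization bounds the activation scale that later layers see regardless of how large the unnormalized values become.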

Ecosystem maturity

As of mid-2024, the Mamba 2 ecosystem was less mature than the Transformer ecosystem. Transformer models benefit from years of optimized inference kernels, serving frameworks, and deployment tooling. Mamba 2 required custom CUDA kernels for efficient training, which initially limited availability to researchers with access to appropriate GPU hardware running Linux. Integration into mainstream frameworks like HuggingFace Transformers arrived later in 2024, and vLLM added first-class support for Mamba 2 inference in early 2025, which substantially closed the deployment gap.

Pretrained models

The reference implementation at the state-spaces/mamba GitHub repository provides five Mamba 2 models trained on 300 billion tokens from the Pile:

Model	Layers	Model dimension	Parameters
mamba2-130m	24	768	130M
mamba2-370m	48	1,024	370M
mamba2-780m	48	1,536	780M
mamba2-1.3b	48	2,048	1.3B
mamba2-2.7b	64	2,560	2.7B

All models use a default state dimension of 64 or 128, compared to 16 in the corresponding Mamba 1 models. A hybrid variant, mamba2attn-2.7b, adds attention layers to the 2.7B model.
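As a rough sanity check on the table above, the parameter counts can be approximated from the layer count and model dimension alone. The sketch below counts only the input projection, output projection, and a tied embedding matrix. The expansion factor of 2, head dimension of 64, single B/C group, state dimension of 128, and padded vocabulary of 50,288 are assumptions about the reference configuration, and smaller terms (convolution, norms, per-head scalars) are ignored.

```python
# Assumed hyperparameters; not stated in the table itself.
VOCAB_SIZE = 50288
D_STATE = 128
HEADDIM = 64
EXPAND = 2
NGROUPS = 1

# name: (layers, model dimension, nominal parameter count), taken from the table.
CONFIGS = {
    "mamba2-130m": (24, 768, 130e6),
    "mamba2-370m": (48, 1024, 370e6),
    "mamba2-780m": (48, 1536, 780e6),
    "mamba2-1.3b": (48, 2048, 1.3e9),
    "mamba2-2.7b": (64, 2560, 2.7e9),
}


def estimate_params(n_layers, d_model):
    """Approximate parameter count from the two large projections plus the embedding."""
    d_inner = EXPAND * d_model
    n_heads = d_inner // HEADDIM
    # Input projection produces z and x (d_inner each), B and C (d_state each per group),
    # and one dt value per head.
    in_proj = d_model * (2 * d_inner + 2 * NGROUPS * D_STATE + n_heads)
    out_proj = d_inner * d_model
    embedding = VOCAB_SIZE * d_model  # assumed tied with the output head
    return n_layers * (in_proj + out_proj) + embedding


if __name__ == "__main__":
    for name, (n_layers, d_model, nominal) in CONFIGS.items():
        est = estimate_params(n_layers, d_model)
        print(f"{name}: estimated {est / 1e6:8.1f}M, nominal {nominal / 1e6:8.1f}M")
```

Under these assumptions the estimates land within a few percent of the nominal sizes, which suggests the listed layer counts and model dimensions are mutually consistent.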

Beyond the reference checkpoints, the following table summarizes representative production deployments that use Mamba 2 as a core component:

Model	Year	Organization	Parameters	Architecture pattern	Context
Codestral Mamba	2024	Mistral AI	7B	Pure Mamba 2 (no attention layers)	256K
Jamba 1.5 Mini / Large	2024	AI21 Labs	Mini: 12B active / 52B total; Large: 94B active / 398B total (MoE)	Interleaved Mamba 2 + attention + MoE	256K
Jamba 1.6 Mini / Large	2025	AI21 Labs	Mini: 12B active / 52B total; Large: 94B active / 398B total (MoE)	Refined Jamba 1.5 hybrid	256K
Zamba2	2024	Zyphra	1.2B / 2.7B / 7B	Mamba 2 backbone, shared attention with LoRA	4K-16K
Bamba	2024	IBM Research	9B	Hybrid Mamba 2 + attention	32K
Falcon-H1	2025	TII	0.5B-34B	Parallel Mamba 2 and attention heads	256K
Nemotron-H	2024-2025	NVIDIA	8B / 47B / 56B	Mamba 2 + attention	256K+
Hymba	2024	NVIDIA	1.5B	Parallel hybrid head, SWA + Mamba 2	8K
Hunyuan-TurboS	2025	Tencent	56B active / 560B MoE	AMF block pattern	256K

References

Dao, T., & Gu, A. (2024). "Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality." *ICML 2024*. arXiv:2405.21060.
Gu, A., & Dao, T. (2023). "Mamba: Linear-Time Sequence Modeling with Selective State Spaces." arXiv:2312.00752.
Gu, A., Goel, K., & Ré, C. (2021). "Efficiently Modeling Long Sequences with Structured State Spaces." *ICLR 2022*. arXiv:2111.00396.
Dao, T., Fu, D. Y., Ermon, S., Rudra, A., & Ré, C. (2022). "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness." *NeurIPS 2022*. arXiv:2205.14135.
Lieber, O., et al. (2024). "Jamba: A Hybrid Transformer-Mamba Language Model." AI21 Labs. arXiv:2403.19887.
AI21 Labs. (2024). "Jamba 1.5: Hybrid Transformer-Mamba Models at Scale." https://www.ai21.com/research/jamba-1-5-hybrid-transformer-mamba-models-at-scale/
AI21 Labs. (2025). "Introducing Jamba 1.6." https://www.ai21.com/blog/introducing-jamba-1-6/
Mistral AI. (2024). "Codestral Mamba." https://mistral.ai/news/codestral-mamba/
Mistral AI. (2024). "Mamba-Codestral-7B-v0.1 Model Card." https://huggingface.co/mistralai/Mamba-Codestral-7B-v0.1
Zyphra. (2024). "Zamba2-7B." https://www.zyphra.com/post/zamba2-7b
IBM Research. (2024). "Meet Bamba, IBM's new attention-state space model." https://research.ibm.com/blog/bamba-ssm-transformer-model
Hugging Face. (2024). "Bamba: Inference-Efficient Hybrid Mamba2 Model." https://huggingface.co/blog/bamba
Technology Innovation Institute. (2024). "Welcome Falcon Mamba: The first strong attention-free 7B model." https://huggingface.co/blog/falconmamba
NVIDIA. (2024). "An Empirical Study of Mamba-based Language Models." arXiv:2406.07887.
NVIDIA. (2024). "Mamba 2 and Hybrid Models - NeMo Framework User Guide." https://docs.nvidia.com/nemo-framework/user-guide/latest/llms/mamba.html
NVIDIA. (2024). "Hymba: A Hybrid-Head Architecture for Small Language Models." NVIDIA Technical Blog. https://developer.nvidia.com/blog/hymba-hybrid-head-architecture-boosts-small-language-model-performance/
Liu, Y., et al. (2025). "Hunyuan-TurboS: Advancing Large Language Models through Mamba-Transformer Synergy and Adaptive Chain-of-Thought." Tencent. arXiv:2505.15431.
Cartesia. (2024). "Announcing Sonic: a low-latency voice model for lifelike speech." https://cartesia.ai/blog/sonic
Index Ventures. (2024). "Building the Next Generation of Real-Time AI Models." https://www.indexventures.com/perspectives/building-the-next-generation-of-real-time-ai-models-our-investment-in-cartesia/
Dao, T. (2024). "State Space Duality (Mamba-2) Part IV - The Systems." https://tridao.me/blog/2024/mamba2-part4-systems/
Gu, A. (2024). "State Space Duality (Mamba-2) Part I - The Model." https://goombalab.github.io/blog/2024/mamba2-part1-model/
Goomba Lab. (2024). "State Space Duality (Mamba-2) Part III - The Algorithm." https://goombalab.github.io/blog/2024/mamba2-part3-algorithm/
Yang, S., et al. (2024). "Gated Linear Attention Transformers with Hardware-Efficient Training." arXiv:2312.06635.
