Mamba
Last reviewed
May 17, 2026
Sources
21 citations
Review status
Source-backed
Revision
v7 · 6,498 words
See also: transformer, attention mechanism, recurrent neural network, large language model, Mamba-2, state space model
Mamba is a neural network architecture for sequence modeling that uses selective state space models (SSMs) to process sequential data in linear time with respect to sequence length. It was introduced by Albert Gu and Tri Dao in December 2023 in the paper "Mamba: Linear-Time Sequence Modeling with Selective State Spaces." The architecture offers an alternative to the transformer, which relies on the attention mechanism and scales quadratically with sequence length.
The central idea behind Mamba is making the parameters of a state space model depend on the input, allowing the model to selectively propagate or forget information along the sequence depending on the current token. Combined with a hardware-aware parallel scan algorithm, Mamba achieves both the modeling power of content-aware reasoning and the computational efficiency of linear-time processing.
Mamba was pretrained at multiple scales (130M, 370M, 790M, 1.4B, and 2.8B parameters) on 300 billion tokens from the Pile dataset. The Mamba-3B model matched the perplexity of a Transformer twice its size while delivering 5x higher inference throughput. As a general sequence model backbone, Mamba achieved strong performance across language, audio, and genomics modalities.
Since the original paper, Mamba has matured into a production architecture. Mistral, AI21 Labs, the Technology Innovation Institute (TII), Nvidia, and IBM have all shipped Mamba-based or Mamba-hybrid models, and frameworks such as vLLM, TensorRT-LLM, and Hugging Face transformers now support selective scan kernels out of the box. The architecture also seeded a family of follow-up variants including Mamba-2, Mamba-3, MambaByte, and Vision Mamba.
State space models originate from control theory and signal processing. A continuous-time SSM maps an input signal u(t) to an output signal y(t) through a latent state vector x(t) using two equations:

$$x'(t) = A\,x(t) + B\,u(t)$$
$$y(t) = C\,x(t) + D\,u(t)$$
Here, A is the state transition matrix that governs how the latent state evolves over time, B is the input projection matrix, C is the output projection matrix, and D provides a direct skip connection from input to output. The model learns these parameters to capture the dynamics of input-to-output mappings through the latent state representation.
In classical systems, A, B, C, and D are fixed matrices. This property is called Linear Time Invariance (LTI): the same dynamics apply regardless of when or what input arrives. LTI systems have useful mathematical properties but cannot perform content-dependent reasoning, a limitation that Mamba directly addresses.
Since real-world data like text tokens and audio samples arrive as discrete sequences rather than continuous signals, the continuous SSM must be discretized before it can be applied computationally. The most common method is the zero-order hold (ZOH) technique, which holds each discrete input value constant until the next sample arrives.
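A learnable step size parameter (denoted delta, $\Delta$) controls the resolution of the discretization. Through ZOH, the continuous matrices A and B are converted into their discrete counterparts (typically written as A-bar and B-bar, $\bar{A}$ and $\bar{B}$) that operate on sequences step by step:

$$\bar{A} = \exp(\Delta A), \qquad \bar{B} = (\Delta A)^{-1}\left(\exp(\Delta A) - I\right)\Delta B$$
$$x_k = \bar{A}\,x_{k-1} + \bar{B}\,u_k, \qquad y_k = C\,x_k + D\,u_k$$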
Discretization is one of the most important aspects of SSM architectures because it enables two equivalent computational views of the same model: a recurrent view (processing tokens one at a time) and a convolutional view (processing the entire sequence in parallel). This dual representation allows SSMs to use the parallelizable convolutional form during training for speed, then switch to the recurrent form during inference for efficiency.
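The equivalence of the two views can be checked numerically. The following is a minimal sketch, assuming a diagonal A (one scalar per state dimension), a single input channel, and no skip term D; the function names and the toy values are illustrative and not taken from any reference implementation.

```python
import math

def zoh_discretize(a, b, delta):
    """Zero-order hold for a diagonal SSM: one (a, b) pair per state dimension."""
    a_bar = [math.exp(delta * a_n) for a_n in a]
    # (delta*a)^-1 * (exp(delta*a) - 1) * delta*b simplifies to (exp(delta*a) - 1) / a * b
    b_bar = [(ab - 1.0) / a_n * b_n for ab, a_n, b_n in zip(a_bar, a, b)]
    return a_bar, b_bar

def run_recurrent(a_bar, b_bar, c, u):
    """Recurrent view: update the state once per input sample."""
    x = [0.0] * len(a_bar)
    ys = []
    for u_k in u:
        x = [ab * x_n + bb * u_k for ab, bb, x_n in zip(a_bar, b_bar, x)]
        ys.append(sum(c_n * x_n for c_n, x_n in zip(c, x)))
    return ys

def run_convolutional(a_bar, b_bar, c, u):
    """Convolutional view: kernel K_j = sum_n c_n * a_bar_n**j * b_bar_n."""
    length = len(u)
    kernel = [
        sum(c_n * ab ** j * bb for c_n, ab, bb in zip(c, a_bar, b_bar))
        for j in range(length)
    ]
    # causal convolution of the input with the precomputed kernel
    return [sum(kernel[j] * u[k - j] for j in range(k + 1)) for k in range(length)]

a = [-1.0, -0.5]          # stable (negative) diagonal entries of A
b = [1.0, 1.0]
c = [0.5, -0.25]
u = [1.0, 0.0, 2.0, -1.0, 0.5]

a_bar, b_bar = zoh_discretize(a, b, delta=0.1)
y_rec = run_recurrent(a_bar, b_bar, c, u)
y_conv = run_convolutional(a_bar, b_bar, c, u)
```

Both functions produce the same outputs up to floating-point rounding. The kernel can be precomputed only because `a_bar` and `b_bar` are the same at every step, which is exactly the property that input-dependent parameters give up.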
A key challenge for any recurrent model is remembering information over long sequences without suffering from vanishing or exploding gradients. The HiPPO (High-order Polynomial Projection Operators) framework, introduced by Albert Gu, Tri Dao, Stefano Ermon, Atri Rudra, and Christopher Ré in 2020, provides a principled initialization for the state matrix A.
HiPPO works by continuously projecting the input history onto a basis of orthogonal polynomials (specifically Legendre polynomials). This creates a state representation that compresses historical information optimally: recent tokens are captured with high fidelity while older tokens decay gracefully. The resulting HiPPO matrix ensures that each update step requires only O(N) operations, and gradient norms scale as O(1/t), preventing the vanishing and exploding gradient problems that plague standard RNNs.
HiPPO initialization proved to be a foundational component for all subsequent SSM architectures, including S4 and Mamba.
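As a concrete illustration, the scaled-Legendre (LegS) variant of the HiPPO matrix has a simple closed form: it is lower triangular, with diagonal entries -(n + 1) and below-diagonal entries -sqrt((2n + 1)(2k + 1)). The sketch below builds that matrix; the function name `hippo_legs` is hypothetical, and other HiPPO variants use different formulas.

```python
import math

def hippo_legs(n_state):
    """Build the HiPPO-LegS state matrix A as a nested list (rows n, columns k)."""
    A = [[0.0] * n_state for _ in range(n_state)]
    for n in range(n_state):
        for k in range(n_state):
            if n > k:
                # below the diagonal: coupling between Legendre coefficients
                A[n][k] = -math.sqrt((2 * n + 1) * (2 * k + 1))
            elif n == k:
                # diagonal: each coefficient decays at its own rate
                A[n][k] = -(n + 1.0)
            # entries above the diagonal stay zero
    return A

A = hippo_legs(4)
```

The negative diagonal keeps the continuous-time dynamics stable, and the triangular structure is what later diagonal approximations of A simplify.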
The Structured State Space for Sequences (S4) model, introduced by Albert Gu, Karan Goel, and Christopher Ré in 2021 (published at ICLR 2022 where it received an Outstanding Paper Honorable Mention), was the first SSM architecture to achieve competitive performance with transformers on a variety of sequence modeling tasks.
S4 combined three components:
The main technical contribution of S4 was decomposing the HiPPO matrix A into a Normal Plus Low-Rank (NPLR) form. This decomposition allowed the SSM convolutional kernel to be computed in O(N + L) operations, up to logarithmic factors, and O(N + L) memory (where N is the state dimension and L is the sequence length), reducing what would otherwise be an intractable computation.
S4 achieved several notable results:
| Benchmark | S4 result | Significance |
|---|---|---|
| Sequential CIFAR-10 | 91.0% accuracy | Without data augmentation or auxiliary losses |
| Long Range Arena (Path-X) | Solved (length 16,384) | First model to solve this task; all prior methods failed |
| Generation speed | 60x faster than Transformers | On comparable autoregressive tasks |
However, S4 and other LTI state space models shared a fundamental limitation: because their parameters remained constant regardless of input content, they could not perform content-based reasoning. For example, they struggled with tasks requiring the model to attend to specific tokens based on their content rather than their position.
Between S4 and Mamba, several intermediate SSM variants were developed:
| Model | Year | Key contribution |
|---|---|---|
| S4 | 2021 | NPLR parameterization of HiPPO matrix |
| DSS (Diagonal State Spaces) | 2022 | Showed diagonal approximation of A achieves comparable performance |
| S4D | 2022 | Simplified S4 with diagonal initialization |
| S5 | 2023 | Multi-input multi-output (MIMO) SSM with parallel scan |
| H3 (Hungry Hungry Hippos) | 2023 | Combined SSM with multiplicative gating for language modeling |
| Mamba | 2023 | Input-dependent (selective) SSM parameters |
Each step simplified the architecture while maintaining or improving performance, with Mamba making the most significant leap by abandoning time invariance altogether.
The core limitation of all prior SSM architectures was Linear Time Invariance. Because the matrices A, B, and C remained identical for every token regardless of input content, these models could not perform content-aware filtering. Consider a language model processing the sentence "The cat sat on the mat": an LTI model treats the word "the" with exactly the same dynamics as the word "cat," even though they carry very different amounts of semantic information.
Transformers solve this problem through self-attention, which computes pairwise similarity scores between all tokens. This enables content-based reasoning but comes at O(L^2) cost in both time and memory. Mamba achieves content-aware processing while maintaining O(L) complexity.
The Mamba paper demonstrated this concretely with two synthetic tasks. In Selective Copying, models had to copy a small set of tokens at variable positions, ignoring filler tokens. In Induction Heads, they had to retrieve the next token after a specific marker pattern. Vanilla S4 and other LTI baselines scored near random; Mamba achieved near-perfect accuracy and generalized to sequences much longer than those seen in training.
Mamba makes three key SSM parameters functions of the input:
| Parameter | Role | How it becomes selective |
|---|---|---|
| B (input matrix) | Controls how input enters the state | Projected from the input via a linear layer; different for each token |
| C (output matrix) | Controls how state maps to output | Projected from the input via a linear layer; different for each token |
| Delta (step size) | Controls discretization resolution | Projected from the input via a linear layer + softplus; different for each token |
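The state matrix A is not made input-dependent: it is a learned parameter with a HiPPO-derived initialization that is shared across all tokens. It still acts selectively, because A enters the recurrence only through the discretized A-bar, which depends on the input-dependent delta. Selectivity in delta is therefore enough to make both A-bar and B-bar vary per token.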
The step size delta plays a particularly important role as a selection mechanism. A large delta causes the model to emphasize the current input token and reset more of the historical state. A small delta causes the model to suppress the current token in favor of preserving existing context. This gives the model a learned, content-dependent ability to decide what information to retain and what to discard, similar to the gating mechanisms in LSTMs and GRUs.
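The two regimes can be seen in a toy calculation. This is a minimal sketch, assuming a single scalar state with A = -1 and B = 1 and zero-order-hold discretization; the pre-activation values -6 and 6 are arbitrary stand-ins for what the input projection might produce.

```python
import math

def softplus(z):
    """Smooth positive activation applied to the projected step size."""
    return math.log1p(math.exp(z))

def discretize(delta, a=-1.0):
    """ZOH for one scalar state with B = 1: returns (a_bar, b_bar)."""
    a_bar = math.exp(delta * a)
    b_bar = (a_bar - 1.0) / a
    return a_bar, b_bar

def step(state, token, delta):
    """One recurrence step: decay the old state, then mix in the current token."""
    a_bar, b_bar = discretize(delta)
    return a_bar * state + b_bar * token

old_state, token = 1.0, 5.0
small = step(old_state, token, delta=softplus(-6.0))  # delta close to 0
large = step(old_state, token, delta=softplus(6.0))   # delta close to 6
```

With the small step size the new state stays close to `old_state` (the token is mostly ignored); with the large one it lands close to `token` (the history is mostly overwritten). In the full model, delta is computed per token and per channel, so this decision is made independently for every feature.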
Making parameters input-dependent means the model can no longer be computed as a fixed convolution kernel, since the kernel changes at every time step. This forces Mamba to use the recurrent representation. However, naive sequential recurrence would be far too slow for training on modern GPUs.
Mamba solves this with the parallel scan (prefix sum) algorithm. The key insight is that the recurrence operation is associative: the way consecutive steps are grouped when they are combined does not affect the final answer. This property allows the sequence to be split into segments that are computed in parallel, with results merged iteratively. Given enough parallel processors, the scan reduces the critical path from O(L) sequential steps to O(log L) parallel steps while producing the same output as sequential recurrence.
The combination of input-dependent parameters and the parallel scan algorithm is what the authors call the "selective scan" mechanism.
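The associativity argument can be made concrete for the scalar recurrence h_t = a_t * h_{t-1} + b_t. Each step is an affine map, represented by the pair (a_t, b_t), and composing two such maps yields another pair. The sketch below uses a simple doubling (Hillis-Steele style) scan for readability; this is an illustrative choice, not the work-efficient scan used in the fused CUDA kernel, and the function names are hypothetical.

```python
def combine(left, right):
    """Compose two steps: apply `left` first, then `right`."""
    a1, b1 = left
    a2, b2 = right
    # a2 * (a1 * h + b1) + b2 == (a1 * a2) * h + (a2 * b1 + b2)
    return (a1 * a2, a2 * b1 + b2)

def sequential_scan(a, b):
    """Reference: L dependent steps, one after another."""
    h, out = 0.0, []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        out.append(h)
    return out

def doubling_scan(a, b):
    """About log2(L) rounds; within a round every position is independent."""
    elems = list(zip(a, b))
    offset = 1
    while offset < len(elems):
        elems = [
            combine(elems[i - offset], elems[i]) if i >= offset else elems[i]
            for i in range(len(elems))
        ]
        offset *= 2
    # with h_{-1} = 0, the state at step i is the offset term of the composed map
    return [b_t for _, b_t in elems]

a = [0.9, 0.5, 0.8, 1.0, 0.3, 0.7, 0.6]
b = [1.0, 2.0, -1.0, 0.5, 0.0, 3.0, -2.0]
h_seq = sequential_scan(a, b)
h_par = doubling_scan(a, b)
```

The list comprehension inside each round stands in for work a GPU would execute simultaneously. Because `a` and `b` may differ at every position, the same scan handles input-dependent parameters without modification.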
The selective scan mechanism creates a computational challenge: the expanded state (with input-dependent B and C matrices incorporating the batch and sequence length dimensions) is much larger than the original state. Naively materializing this expanded state in GPU high-bandwidth memory (HBM, also called DRAM) would erase the efficiency gains of using an SSM in the first place.
Mamba addresses this with a hardware-aware algorithm that mirrors techniques from FlashAttention (also developed by Tri Dao). The implementation uses three key optimizations:
Kernel fusion. Instead of writing intermediate results (discretization output, scan output, and the C multiplication) back to slow GPU DRAM between each operation, Mamba fuses all three operations into a single GPU kernel that keeps intermediate values in fast on-chip SRAM. This eliminates costly memory transfers between the GPU memory hierarchy levels.
Recomputation. Rather than storing the large intermediate states during the forward pass for use in backpropagation, Mamba recomputes them during the backward pass. Although this doubles the computation for those states, recomputation is faster than the alternative of reading large intermediate tensors from DRAM, because modern GPUs are memory-bandwidth-limited rather than compute-limited.
Avoiding materialization. The expanded state is never fully materialized in DRAM. Only the compressed hidden state (of size N, the state dimension) is kept in memory, not the full expanded representation.
These optimizations make the selective scan operation 20 to 40 times faster than a naive implementation. The resulting algorithm is faster than optimized attention implementations (such as FlashAttention) for long sequences while scaling linearly rather than quadratically. The reference CUDA implementation (mamba-ssm) became the canonical kernel used by virtually every downstream Mamba variant.
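A rough calculation shows why materializing the expanded state is costly. The sketch below assumes the expanded state has one entry per (batch, position, channel, state dimension) while ordinary activations have one per (batch, position, channel); the sizes are illustrative and not taken from a specific model configuration.

```python
def tensor_bytes(*shape, bytes_per_value=2):
    """Size of a dense tensor with the given shape (2 bytes per value = 16-bit floats)."""
    count = 1
    for dim in shape:
        count *= dim
    return count * bytes_per_value

# Illustrative sizes (assumptions for this sketch)
batch, seq_len, d_inner, d_state = 8, 2048, 4096, 16

activations = tensor_bytes(batch, seq_len, d_inner)
expanded_state = tensor_bytes(batch, seq_len, d_inner, d_state)

print(f"activations per layer:    {activations / 2**20:.0f} MiB")
print(f"expanded state per layer: {expanded_state / 2**20:.0f} MiB")
```

The expanded state is larger than the layer's input and output by a factor of the state dimension N, which is why the fused kernel keeps it in SRAM and writes only the final outputs back.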
The full Mamba architecture wraps the selective SSM layer into a simplified block design. Unlike transformers, which alternate between self-attention layers and MLP (feed-forward) layers, each Mamba block combines both functions into a single unit:
This design is inspired by the gated MLP architecture from Llama and similar models, but with the selective SSM replacing the nonlinear activation on one branch. The result is a simpler architecture that does not require separate attention and MLP blocks, layer normalization within the block, or positional embeddings.
A complete Mamba model stacks many of these blocks (for example, 48 blocks for the 1.4B parameter model and 64 blocks for the 2.8B model), with RMSNorm applied before each block.
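The data flow inside one block can be sketched in plain Python. This is a simplified illustration under several assumptions, not the reference implementation: weights are random, the delta projection is a full matrix without bias, B-bar is approximated as delta times B, and the skip term D, normalization, and the residual connection are omitted. The class name `TinyMambaBlock` and all dimensions are hypothetical.

```python
import math
import random

def silu(v):
    return v / (1.0 + math.exp(-v))

def softplus(v):
    return math.log1p(math.exp(v))

def matvec(w, x):
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in w]

def rand_matrix(rows, cols, rng, scale=0.3):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

class TinyMambaBlock:
    def __init__(self, d_model=4, expand=2, d_state=3, d_conv=3, seed=0):
        rng = random.Random(seed)
        d = self.d_inner = expand * d_model
        self.d_state, self.d_conv = d_state, d_conv
        self.w_in = rand_matrix(2 * d, d_model, rng)   # projects to SSM branch + gate branch
        self.conv = rand_matrix(d, d_conv, rng)        # depthwise causal convolution taps
        self.w_delta = rand_matrix(d, d, rng)          # input -> per-channel step size
        self.w_b = rand_matrix(d_state, d, rng)        # input -> B_t
        self.w_c = rand_matrix(d_state, d, rng)        # input -> C_t
        self.a = [[-(n + 1.0) for n in range(d_state)] for _ in range(d)]  # diagonal A
        self.w_out = rand_matrix(d_model, d, rng)

    def forward(self, xs):
        d, n_state = self.d_inner, self.d_state
        proj = [matvec(self.w_in, x) for x in xs]
        branch = [p[:d] for p in proj]
        gate = [p[d:] for p in proj]
        h = [[0.0] * n_state for _ in range(d)]        # one small state per channel
        outputs = []
        for t in range(len(xs)):
            # short causal convolution over the last d_conv steps, then SiLU
            u = [silu(sum(self.conv[ch][k] * branch[t - k][ch]
                          for k in range(self.d_conv) if t - k >= 0))
                 for ch in range(d)]
            # selective parameters: all three depend on the current token
            delta = [softplus(v) for v in matvec(self.w_delta, u)]
            b_t, c_t = matvec(self.w_b, u), matvec(self.w_c, u)
            y = []
            for ch in range(d):
                for n in range(n_state):
                    a_bar = math.exp(delta[ch] * self.a[ch][n])
                    h[ch][n] = a_bar * h[ch][n] + delta[ch] * b_t[n] * u[ch]
                y.append(sum(c_t[n] * h[ch][n] for n in range(n_state)))
            # multiplicative gate from the second branch, then output projection
            gated = [y_ch * silu(g) for y_ch, g in zip(y, gate[t])]
            outputs.append(matvec(self.w_out, gated))
        return outputs
```

A short usage example: `TinyMambaBlock().forward(xs)` takes a list of `d_model`-sized vectors and returns a list of the same shape. The loop over `t` is written sequentially for clarity; it is the part that the selective scan parallelizes during training.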
| Aspect | Transformer | Mamba |
|---|---|---|
| Core mechanism | Self-attention | Selective state space model |
| Time complexity (sequence length L) | O(L^2) per layer | O(L) per layer |
| Training parallelism | Fully parallel (attention matrix) | Parallel (via parallel scan) |
| Inference mode | Autoregressive with KV cache | Recurrent with fixed-size state |
| Inference memory | KV cache grows linearly with context | Fixed-size state (constant memory) |
| Content-aware reasoning | Yes (attention computes pairwise token interactions) | Yes (input-dependent SSM parameters) |
| Context window | Fixed maximum (requires extension techniques) | Theoretically unbounded |
| Long-range dependencies | Attention can directly connect any two positions | State carries compressed history |
| In-context learning | Strong | Weaker (limited by finite state size) |
| Copying/retrieval | Strong (direct access to all past tokens) | Weaker (information must survive state compression) |
| Position encoding | Required (sinusoidal, RoPE, ALiBi, etc.) | Not required |
| Inference throughput (similar size) | Baseline | Up to 5x higher |
The fundamental trade-off is between expressiveness and efficiency. Transformers maintain an uncompressed representation of the entire sequence through the KV cache, allowing direct access to any past token. Mamba compresses the entire history into a fixed-size hidden state, which is much more memory-efficient but means information can only be accessed if it survived the compression. This makes transformers better at tasks requiring exact retrieval from context (such as looking up a phone number mentioned earlier in the text) while Mamba excels at tasks involving long-range dependencies and high-throughput generation.
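The memory side of this trade-off can be estimated with simple arithmetic. The sketch below assumes a transformer that caches one full-width key and one full-width value vector per token per layer (no grouped-query or other cache compression), and an SSM that stores a recurrent state plus a short convolution buffer per layer. The layer counts and widths are illustrative assumptions.

```python
def kv_cache_bytes(seq_len, n_layers, d_model, bytes_per_value=2):
    """One key and one value vector of width d_model per token per layer."""
    return 2 * n_layers * seq_len * d_model * bytes_per_value

def ssm_state_bytes(n_layers, d_inner, d_state, d_conv=4, bytes_per_value=2):
    """Recurrent state (d_inner x d_state) plus the convolution buffer per layer."""
    return n_layers * d_inner * (d_state + d_conv) * bytes_per_value

for seq_len in (1_000, 100_000):
    kv = kv_cache_bytes(seq_len, n_layers=32, d_model=2560)
    ssm = ssm_state_bytes(n_layers=64, d_inner=5120, d_state=16)
    print(f"{seq_len:>7} tokens: KV cache {kv / 2**20:9.1f} MiB, "
          f"SSM state {ssm / 2**20:5.1f} MiB")
```

The KV cache grows in proportion to `seq_len`, while the SSM figure has no `seq_len` term at all. The same fixed size is also the reason exact retrieval is harder: everything the model remembers has to fit in that state.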
Mamba was evaluated across multiple modalities: language modeling on the Pile dataset (300B tokens), plus domain-specific datasets for audio and genomics.
On language modeling with the Pile, Mamba models showed consistent improvements over transformer baselines at each scale:
| Model | Parameters | Pile perplexity | Notes |
|---|---|---|---|
| Transformer | 1.4B | Baseline | Standard transformer |
| Mamba | 1.4B | Better than Transformer-1.4B | Lower perplexity at the same parameter count |
| Transformer | 2.8B | Baseline | Standard transformer |
| Mamba | 2.8B | Matches Transformer-6.9B | Comparable quality at roughly 40% of the parameters |
On downstream zero-shot evaluation tasks, Mamba-3B outperformed Transformer-3B models and matched or exceeded the performance of Transformers with twice as many parameters.
On the SC09 speech generation benchmark, a small Mamba model outperformed larger autoregressive, GAN-based, and diffusion-based baselines (including WaveNet, SampleRNN, WaveGAN, DiffWave, and SaShiMi). A parameter-matched larger variant further improved fidelity metrics, reducing the FID score by more than half compared to prior state-of-the-art.
Mamba demonstrated strong results on DNA sequence modeling, outperforming HyenaDNA across model sizes. Unlike HyenaDNA, whose performance degraded with longer sequences, Mamba continued to improve with context lengths up to 1 million tokens. On a downstream species classification task distinguishing five great ape species (which share approximately 99% DNA similarity), Mamba's ability to use extremely long contexts proved particularly effective.
Mamba's selective scan implementation achieved up to 3x speedup over prior SSM methods on A100 GPUs and up to 5x higher generation throughput compared to similarly sized Transformers. The efficient scan kernel was 20 to 40x faster than a standard (naive) implementation of the selective recurrence.
In May 2024, Tri Dao and Albert Gu published "Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality," introducing Mamba-2. The paper established a theoretical framework called Structured State Space Duality (SSD) that reveals a deep connection between SSMs and attention mechanisms.
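The key theoretical result is that a state space model with a scalar-times-identity state matrix (where all diagonal elements of A are identical) is mathematically equivalent to a form of masked self-attention with a 1-semiseparable causal mask. This duality means the same computation can be expressed either as an SSM recurrence or as a matrix multiplication resembling attention:

$$h_t = a_t\,h_{t-1} + B_t\,x_t, \qquad y_t = C_t^{\top} h_t$$
$$Y = \left(L \circ C B^{\top}\right) X, \qquad L_{ij} = \begin{cases} a_i\,a_{i-1} \cdots a_{j+1} & i \ge j \\ 0 & i < j \end{cases}$$

Here L denotes the causal mask matrix (not the sequence length), and C and B play roles analogous to queries and keys.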
When all a_t values equal 1, the attention form reduces to standard causal linear attention.
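The duality can be verified numerically for a single channel. In the sketch below, the recurrent form updates a state vector h with a scalar decay a_t, while the matrix form weights each past input x_j by the product of decays between positions j and i and by the dot product of C_i and B_j. Function names and test values are illustrative.

```python
def ssm_recurrence(a, B, C, x):
    """Linear form: h_t = a_t * h_{t-1} + B_t * x_t, y_t = C_t . h_t."""
    h = [0.0] * len(B[0])
    ys = []
    for a_t, B_t, C_t, x_t in zip(a, B, C, x):
        h = [a_t * h_n + b_n * x_t for h_n, b_n in zip(h, B_t)]
        ys.append(sum(c_n * h_n for c_n, h_n in zip(C_t, h)))
    return ys

def masked_attention_form(a, B, C, x):
    """Quadratic form: y_i = sum over j <= i of mask[i][j] * (C_i . B_j) * x_j."""
    ys = []
    for i in range(len(x)):
        y_i = 0.0
        for j in range(i + 1):
            decay = 1.0
            for k in range(j + 1, i + 1):
                decay *= a[k]                                # mask entry for (i, j)
            score = sum(c * b for c, b in zip(C[i], B[j]))  # plays the role of Q_i . K_j
            y_i += decay * score * x[j]
        ys.append(y_i)
    return ys

a = [0.9, 0.8, 1.0, 0.5, 0.7]
B = [[1.0, 0.0], [0.5, -1.0], [0.2, 0.3], [-0.4, 0.9], [1.5, 0.1]]
C = [[0.3, 0.7], [1.0, -0.2], [0.0, 0.5], [0.6, 0.6], [-1.0, 0.4]]
x = [1.0, -2.0, 0.5, 3.0, -1.0]

y_linear = ssm_recurrence(a, B, C, x)
y_quadratic = masked_attention_form(a, B, C, x)
```

Both functions return the same sequence up to rounding. Setting every `a[k]` to 1 makes the mask a plain lower-triangular matrix of ones, which is the causal linear attention case.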
This connection runs through the theory of semiseparable matrices, a well-studied class in numerical linear algebra. The authors showed that both SSMs and attention can be understood as different decompositions of the same structured matrix.
| Feature | Mamba-1 | Mamba-2 |
|---|---|---|
| State matrix A | Diagonal (different value per channel) | Scalar times identity (one value shared) |
| Head dimension | P = 1 (independent SSMs per channel) | P >= 64 (shared dynamics across channels) |
| Typical state dimension N | 16 | 64 to 256 |
| Parameter generation | Sequential (SSM params depend on x) | Parallel (A, B, C generated alongside x) |
| Core algorithm | Selective scan via parallel prefix sum | Chunkwise SSD (quadratic within chunks, linear across chunks) |
The larger state dimension in Mamba-2 (up to 16x larger than Mamba-1) significantly improves performance on associative recall tasks, where Mamba-1 was weakest. The multi-head structure, where P channels share a single state transition, reduces the total number of independent recurrences while increasing expressiveness.
Mamba-2's core SSD layer runs 2 to 8 times faster than Mamba-1's selective SSM while maintaining competitive performance with Transformers on language modeling benchmarks. The SSD algorithm achieves this speedup by decomposing sequences into chunks, applying the quadratic (attention-like) computation within each chunk for hardware efficiency, and passing SSM states between chunks to maintain the linear overall scaling. A minimal PyTorch implementation requires approximately 30 lines of code.
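The chunking idea can be sketched for a single channel with a scalar decay per step. Inside each chunk the output uses the quadratic, attention-like form; everything before the chunk is summarized by one carried state vector. This is a readability-first illustration with hypothetical function names, not the batched matrix-multiplication implementation described in the paper.

```python
def reference_recurrence(a, B, C, x):
    """Plain recurrence used as ground truth."""
    h = [0.0] * len(B[0])
    ys = []
    for a_t, B_t, C_t, x_t in zip(a, B, C, x):
        h = [a_t * h_n + b_n * x_t for h_n, b_n in zip(h, B_t)]
        ys.append(sum(c_n * h_n for c_n, h_n in zip(C_t, h)))
    return ys

def ssd_chunked(a, B, C, x, chunk=4):
    h = [0.0] * len(B[0])                 # state carried across chunk boundaries
    ys = []
    for start in range(0, len(x), chunk):
        end = min(start + chunk, len(x))
        for i in range(start, end):
            # contribution of all earlier chunks, through the carried state
            decay_in = 1.0
            for k in range(start, i + 1):
                decay_in *= a[k]
            y_i = decay_in * sum(c * h_n for c, h_n in zip(C[i], h))
            # quadratic part, restricted to positions inside this chunk
            for j in range(start, i + 1):
                decay = 1.0
                for k in range(j + 1, i + 1):
                    decay *= a[k]
                y_i += decay * sum(c * b for c, b in zip(C[i], B[j])) * x[j]
            ys.append(y_i)
        # advance the carried state to the end of the chunk
        for t in range(start, end):
            h = [a[t] * h_n + b_n * x[t] for h_n, b_n in zip(h, B[t])]
    return ys

a = [0.9, 0.8, 1.0, 0.5, 0.7, 0.6, 0.95]
B = [[1.0, 0.0], [0.5, -1.0], [0.2, 0.3], [-0.4, 0.9], [1.5, 0.1], [0.0, 1.0], [0.7, 0.7]]
C = [[0.3, 0.7], [1.0, -0.2], [0.0, 0.5], [0.6, 0.6], [-1.0, 0.4], [0.2, 0.2], [0.9, -0.3]]
x = [1.0, -2.0, 0.5, 3.0, -1.0, 0.25, 2.0]

y_ref = reference_recurrence(a, B, C, x)
y_chunked = ssd_chunked(a, B, C, x, chunk=3)
```

The result does not depend on the chunk size. The quadratic work is bounded by the chunk length rather than the sequence length, so total cost stays linear in the number of chunks.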
In March 2026, Aakash Lahoti, Kevin Y. Li, Berlin Chen, Caitlin Wang, Aviv Bick, J. Zico Kolter, Tri Dao, and Albert Gu introduced Mamba-3 (published at ICLR 2026). The paper addresses remaining expressivity limitations in Mamba-2 through three innovations:
Exponential-trapezoidal discretization. This replaces the simpler exponential-Euler discretization used in Mamba-1 and Mamba-2 with a second-order accurate method. The improved discretization enables an implicit convolution applied on the SSM input, increasing the expressivity of the recurrence.
Complex-valued state updates. By modeling the SSM in the complex number domain, Mamba-3 achieves a more expressive state update. This connects to data-dependent rotary position embeddings (RoPE), providing a theoretical bridge between complex SSMs and the position encoding techniques used in transformers.
Multi-input multi-output (MIMO) formulation. MIMO transitions from outer-product to matrix-multiplication-based state updates, increasing the rank of input and output projections. This raises decoding FLOPs by up to 4x relative to Mamba-2 at a fixed state size but maintains similar wall-clock decode latency.
At the 1.5B parameter scale, Mamba-3 improved average downstream accuracy by 0.6 points over Gated DeltaNet (the next best model) in the SISO configuration, with the MIMO variant adding another 1.2 points. Mamba-3 with state size 64 matched the perplexity of Mamba-2 with state size 128, effectively achieving the same language modeling quality at half the state size. The model also solved formal language tasks (parity and modular arithmetic) that Mamba-2 could not handle, addressing known state-tracking deficiencies in prior linear recurrent models.
MambaByte, introduced by Junxiong Wang and collaborators at COLM 2024, is a token-free adaptation of Mamba trained directly on raw byte sequences. Conventional large language models operate on subword tokens produced by a BPE or SentencePiece tokenizer; this inductive bias creates failure modes around multilingual text, unusual scripts, code, and adversarial typos. Byte-level modeling removes the tokenizer but multiplies effective sequence length roughly four to five times, which is prohibitive for quadratic attention.
MambaByte exploits Mamba's linear-time scan to make byte-level modeling tractable. MambaByte-972M is competitive with state-of-the-art subword transformer baselines on language modeling while remaining more robust to character-level noise and orthographic perturbations. The paper also introduced speculative decoding with tokenized drafting and byte-level verification, achieving a 2.6x inference speedup over the standard byte-level decode loop. MambaByte demonstrated a broader pattern: architectures whose cost is linear in sequence length unlock modeling regimes (raw audio, DNA, bytes, video frames) where tokenization was previously the bottleneck.
Within eighteen months of the original paper, Mamba and Mamba-2 moved from research code to commercial deployments across at least five organizations. The following are notable releases that ship pure-Mamba or hybrid-Mamba weights publicly.
Mistral released Codestral Mamba 7B on 16 July 2024 under the Apache 2.0 license, positioning it as the first major Mamba-2 model deployed for production code generation. Mistral framed the choice as a deliberate bet that Mamba's linear-time decode would matter more than transformer parity for the long-context, autocomplete-heavy workloads typical of code assistants.
Codestral Mamba achieved 75.0% on HumanEval Python, outperforming CodeGemma-1.1 7B (61.0%), CodeLlama 7B (31.1%), and DeepSeek Coder 6.7B v1.5 (65.9%), all models of similar size. On the Spider SQL benchmark it reached 58.8%. The model was tested on in-context retrieval up to 256K tokens and ships through Mistral's la Plateforme API as codestral-mamba-2407, with raw weights on Hugging Face and deployment recipes for the mistral-inference SDK and TensorRT-LLM.
The Technology Innovation Institute (TII) in Abu Dhabi released Falcon Mamba 7B on 12 August 2024 under the TII Falcon License 2.0. It was the first pure-SSM 7B-scale model to surpass leading transformer baselines on standardized evaluations, outperforming Llama 3.1 8B, Mistral 7B v0.3, and Falcon 2 11B on Hugging Face's Open LLM Leaderboard at release.
Falcon Mamba's key practical advantage was constant-memory generation: because the model carries a fixed-size SSM state rather than a KV cache, decoding a million-token output consumes the same GPU memory as decoding a hundred-token output. TII later released Falcon3-Mamba-7B-Instruct in January 2025 and Falcon-H1, a hybrid Mamba-attention architecture, in mid-2025.
AI21 Labs released Jamba in March 2024, Jamba 1.5 Mini and 1.5 Large in August 2024, Jamba 1.6 in March 2025, and Jamba 1.7 in July 2025. The family interleaves Mamba layers with sparse attention layers and Mixture-of-Experts routing, targeting enterprise long-context workloads such as retrieval-augmented generation, contract review, and customer-support knowledge bases.
| Version | Released | Architecture highlights | Effective context |
|---|---|---|---|
| Jamba v0.1 | March 2024 | 52B total / 12B active params, 1:7 attention:Mamba ratio, MoE | 256K tokens |
| Jamba 1.5 Mini | August 2024 | 52B total / 12B active params | 256K tokens |
| Jamba 1.5 Large | August 2024 | 398B total / 94B active params | 256K tokens |
| Jamba 1.6 | March 2025 | Quality and speed improvements over 1.5 | 256K tokens |
| Jamba 1.7 | July 2025 | Better grounding, optimized quantization for self-hosted deploy | 256K tokens |
At 256K tokens, Jamba 1.5 Large maintains a fixed activation memory budget that fits in a single 8x H100 node, where a comparably sized pure transformer would need hundreds of gigabytes for the KV cache. AI21 distributes Jamba under the Jamba Open Model License on Hugging Face and through cloud catalogs including Google Cloud Vertex AI, Microsoft Azure AI Foundry, NVIDIA NIM, AWS Bedrock, Databricks, and Snowflake Cortex. The 1.6 and 1.7 releases emphasize private on-prem and VPC deployment, reflecting Jamba's adoption by regulated enterprise customers.
Nvidia introduced Hymba in November 2024 as a family of small language models built around a hybrid-head parallel attention scheme. Rather than alternating Mamba and attention layers as Jamba does, each Hymba layer runs standard attention heads and Mamba heads in parallel and combines their outputs. Hymba-1.5B-Base outperformed similarly sized Llama 3.2 1B, SmolLM 1.7B, and Qwen 2.5 1.5B on common-sense reasoning benchmarks while requiring roughly 10x less KV-cache memory.
Nvidia followed Hymba with Nemotron Nano 2 (mid-2025) and Nemotron 3 (15 December 2025), both pairing Mamba-2 layers with a small fraction of attention layers and a Mixture-of-Experts router. Nvidia positions Nemotron as the reference architecture for agentic AI workloads on Blackwell GPUs, where the fixed-size SSM state simplifies multi-tenant serving.
IBM Research collaborated directly with Tri Dao, Albert Gu, and Minjia Zhang (University of Illinois Urbana-Champaign) on Bamba and Bamba V2, hybrid Mamba-2 models released through 2024 and 2025. They served as the empirical foundation for IBM's enterprise language models. On 2 October 2025, IBM launched the Granite 4.0 family, the first commercial LLM line built around a 9:1 ratio of Mamba-2 to attention layers. IBM reports a >70% reduction in serving RAM versus equivalent transformer Granite models for long-context and multi-session inference, with quality preserved on RAG, summarization, and code completion benchmarks. Granite 4.0 ships through IBM watsonx, the Red Hat AI portfolio, and as Apache 2.0 weights on Hugging Face.
| Model | Vendor | Released | Architecture | Notable use case |
|---|---|---|---|---|
| Codestral Mamba 7B | Mistral AI | July 2024 | Pure Mamba-2 | Code generation, autocomplete |
| Falcon Mamba 7B | TII | August 2024 | Pure Mamba | Constant-memory long generation |
| Jamba 1.5 Mini / Large | AI21 Labs | August 2024 | Hybrid Mamba + Attention + MoE | Enterprise 256K-context RAG |
| Hymba 1.5B | Nvidia | November 2024 | Parallel hybrid heads | On-device small LLM |
| Falcon3-Mamba-7B-Instruct | TII | January 2025 | Pure Mamba | Instruction-following SSM |
| Jamba 1.6 / 1.7 | AI21 Labs | March / July 2025 | Hybrid Mamba + Attention + MoE | Private enterprise deploy |
| Granite 4.0 | IBM | October 2025 | 9:1 Mamba-2:Attention hybrid | Enterprise long-context workloads |
| Nemotron 3 | Nvidia | December 2025 | Hybrid Mamba-2 + Attention + MoE | Agentic AI on Blackwell GPUs |
| Falcon H1R 7B | TII | January 2026 | Hybrid Mamba + Attention (reasoning) | Compact reasoning model |
Recognizing that transformers and Mamba have complementary strengths, AI21 Labs released Jamba in March 2024, the first production-grade hybrid architecture combining Transformer attention layers, Mamba SSM layers, and Mixture-of-Experts (MoE). Jamba's success catalyzed a generation of hybrid designs, and by 2026 many long-context models from major vendors use some form of attention-SSM interleaving.
Jamba interleaves blocks of Transformer and Mamba layers with a ratio of approximately one attention layer for every seven Mamba layers. Each block contains either an attention or Mamba layer followed by a feed-forward network (MLP). The MoE component allows the model to use only 12B of its total 52B parameters at inference time, keeping compute costs manageable.
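The interleaving ratio can be expressed as a simple layer schedule. The sketch below only illustrates the ratio: the position of the attention layer inside each group (`attention_offset`) and the total layer counts are assumptions for this example, and MoE placement is left out.

```python
def hybrid_layout(n_layers, mamba_per_attention=7, attention_offset=4):
    """Return a list of layer types with one attention layer per group."""
    period = mamba_per_attention + 1
    return [
        "attention" if i % period == attention_offset else "mamba"
        for i in range(n_layers)
    ]

layout = hybrid_layout(32)                             # 1:7 attention-to-Mamba ratio
sparser = hybrid_layout(40, mamba_per_attention=9)     # a 9:1 Mamba-to-attention ratio

print(layout.count("attention"), "attention layers,", layout.count("mamba"), "Mamba layers")
```

Only the attention layers need a KV cache, so under this schedule the cache is roughly one eighth the size it would be if every layer used attention.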
Jamba supports context lengths up to 256K tokens while fitting in a single 80GB GPU. The hybrid design provides:
AI21 Labs later released Jamba-1.5 in two sizes: Jamba-1.5-Large (94B active parameters) and Jamba-1.5-Mini (12B active parameters), both with 256K token effective context length. Jamba 1.6 and Jamba 1.7 (released in 2025) extended the family with improved retrieval grounding, more aggressive quantization, and optimizations targeting on-prem deployment. Jamba demonstrated that hybrid architectures can capture the best properties of both design paradigms rather than forcing a choice between them.
Following Jamba, multiple groups released hybrid architectures with different interleaving schemes:
| Model | Group | Hybrid design |
|---|---|---|
| Zamba | Zyphra | Single shared attention block applied at multiple layers, Mamba elsewhere |
| Samba | Microsoft Research | Alternating Sliding Window Attention and Mamba |
| Hunyuan-TurboS | Tencent | Mamba-2 with periodic attention layers and MoE |
| Nemotron Nano 2 / 3 | Nvidia | Mamba-2 with attention and MoE on Blackwell |
| Hymba | Nvidia | Parallel attention and Mamba heads inside each layer |
| Granite 4.0 | IBM | 9:1 Mamba-2 to attention ratio |
| Bamba V2 | IBM + UIUC | Open hybrid that informed Granite 4.0 |
| Falcon H1 / H1R | TII | Hybrid Mamba + attention, with H1R adding reasoning post-training |
The shared insight is that a small number of attention layers (typically 5 to 15% of the stack) is enough to recover most of the in-context-learning and retrieval capabilities that pure SSMs lack, while the majority Mamba layers preserve linear-time scaling. The 2025 survey "Hybrid Architectures for Language Models" (Lahoti et al., arXiv:2510.04800) found that interleaved hybrids outperform both pure Mamba and pure transformer baselines on most long-context benchmarks at matched compute.
Mamba and its variants have been applied to language modeling, text generation, and related NLP tasks. Linear-time inference makes Mamba attractive for long-context generation or high-throughput serving. Codestral Mamba in IDE autocomplete, Jamba in enterprise retrieval-augmented generation, and Falcon Mamba for constant-memory streaming each illustrate production niches that pure transformers handle awkwardly.
SSMs have proven especially effective for genomics, where sequences can stretch to millions of base pairs. Mamba-based models outperform transformer and convolution alternatives on DNA sequence modeling. Caduceus, a bidirectional Mamba for DNA introduced by Schiff and colleagues at ICML 2024, outperformed comparably sized unidirectional models, as well as transformer models orders of magnitude larger, on tasks including predicting the effects of genetic mutations on gene expression. Evo, a 7B genomic foundation model published in November 2024 and built on the related StripedHyena architecture (a hybrid of Hyena convolution operators and attention rather than Mamba), demonstrated that the long-context advantages of sub-quadratic sequence models extend to generating functional protein and DNA sequences.
Mamba achieved state-of-the-art results on audio waveform modeling, outperforming SaShiMi, Hyena, and transformer-based models on both pretraining quality and downstream generation metrics. The ability to process very long sequences efficiently makes SSMs a natural fit for raw audio, which requires modeling dependencies across tens of thousands of samples. Mamba variants have also been used for streaming automatic speech recognition, where the fixed-size state enables real-time decoding without the growing KV cache typical of transformer ASR.
Vision Mamba (Vim), introduced in January 2024 and accepted at ICML 2024, adapts the Mamba architecture for visual tasks. Vim uses bidirectional Mamba blocks with position embeddings to process image patch sequences. On ImageNet classification, COCO object detection, and ADE20K semantic segmentation, Vim achieves higher performance than the DeiT vision transformer. When extracting features from high-resolution (1248×1248) images in batch inference, it is 2.8x faster than DeiT and uses 86.8% less GPU memory.
Other vision variants include VMamba (NeurIPS 2024), VideoMamba for video understanding (ECCV 2024), Mamba-ND for multi-dimensional data, and U-Mamba for medical image segmentation.
Codestral Mamba's release demonstrated that pure Mamba can match dedicated transformer code models on functional benchmarks like HumanEval. Multimodal Mamba models such as Cobra (a vision-language model released in March 2024) and VL-Mamba showed that the architecture handles cross-modal sequence modeling, though hybrid attention-Mamba designs typically outperform pure Mamba on benchmarks that require fine-grained visual grounding.
The Mamba reference implementation lives in the mamba-ssm Python package maintained by Albert Gu and Tri Dao, which provides the fused CUDA selective-scan kernel and Mamba block layers compatible with PyTorch. The causal-conv1d package supplies the optimized 1D convolution used inside each Mamba block. Both packages target CUDA-capable GPUs (Ampere and newer) and form the foundation for most downstream training and inference stacks.
Production-grade serving frameworks gained Mamba support throughout 2024 and 2025:
| Framework | Mamba support |
|---|---|
| Hugging Face transformers | Native MambaForCausalLM and Mamba2ForCausalLM classes |
| vLLM | Selective scan and chunkwise SSD kernels for high-throughput serving |
| TensorRT-LLM | Optimized Mamba and hybrid kernels for Nvidia Hopper and Blackwell GPUs |
| llama.cpp | Mamba and Jamba inference on CPU and Apple Silicon via GGUF |
| MLC LLM | Mobile and WebGPU deployment of small Mamba models |
By mid-2025, deploying a Mamba or hybrid model required no more engineering effort than deploying a transformer of equivalent size.
Despite its efficiency advantages, Mamba has several known limitations compared to transformers:
In-context learning. Empirical studies show that Mamba and Mamba-2 lag behind transformers on in-context learning tasks. On the standard five-shot MMLU benchmark, Mamba models score approximately 15 points lower than similarly sized transformer models after training on 1.1 trillion tokens.
Copying and retrieval. Transformers can copy information from their input context with near-perfect accuracy up to their context length, while Mamba models begin to fail at copying for input sequences beyond approximately 500 tokens. This limitation is fundamental: a constant-size state cannot faithfully store an arbitrary-length sequence. Generalized SSMs cannot copy input sequences uniformly unless the state size grows linearly with sequence length.
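The counting argument behind this limit can be made concrete. A state that holds a fixed number of bits can distinguish only finitely many inputs, so once the number of possible sequences exceeds the number of representable states, two different sequences must collide and cannot both be copied correctly. The sketch below computes that upper bound. The function name `max_copyable_tokens` and the state size, precision, and vocabulary size are illustrative assumptions rather than a real model configuration, and trained models fail far below this theoretical ceiling.

```python
import math


def max_copyable_tokens(state_elements: int, bits_per_element: int, vocab_size: int) -> int:
    """Upper bound on the length of an arbitrary token sequence a fixed state can store.

    Pigeonhole bound: vocab_size ** length must not exceed 2 ** (total state bits).
    """
    total_bits = state_elements * bits_per_element
    bits_per_token = math.log2(vocab_size)
    return math.floor(total_bits / bits_per_token)


# Illustrative numbers only (assumptions, not a real model configuration).
bound = max_copyable_tokens(state_elements=32_768, bits_per_element=16, vocab_size=65_536)
for seq_len in (1_000, 30_000, 100_000):
    verdict = "may fit" if seq_len <= bound else "cannot fit"
    print(f"{seq_len:>7} tokens: {verdict} (bound = {bound})")
```

The bound does not depend on the input length, so a long enough input always exceeds it. Raising the bound in step with the input requires growing the state, which is the linear growth mentioned above.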
Multi-query associative recall. Mamba struggles to retrieve specific key-value pairs from context, because the finite hidden state can be overwhelmed as the number of key-value pairs increases. Pretrained transformers can outperform Mamba models with 10x more parameters on information retrieval tasks.
State capacity. All of these limitations stem from the same root cause: Mamba compresses the entire sequence history into a fixed-size state vector. Any information that does not survive this compression is permanently lost. Transformers avoid this problem by maintaining full access to all past tokens through the KV cache, at the cost of linear memory growth and quadratic computation.
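The memory side of this trade-off is simple arithmetic, sketched below. The two helper functions and every dimension passed to them are hypothetical values chosen for illustration, and the SSM estimate counts only the recurrent state, leaving out the small convolution buffer that each Mamba block also keeps.

```python
def kv_cache_bytes(seq_len: int, n_layers: int, n_kv_heads: int, head_dim: int,
                   bytes_per_value: int = 2) -> int:
    """Memory for cached keys and values: grows linearly with seq_len."""
    # Factor of 2 because each layer stores one key tensor and one value tensor.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


def ssm_state_bytes(n_layers: int, d_inner: int, d_state: int,
                    bytes_per_value: int = 2) -> int:
    """Memory for one (d_inner x d_state) recurrent state per layer: independent of seq_len."""
    return n_layers * d_inner * d_state * bytes_per_value


# Assumed dimensions for illustration only.
state = ssm_state_bytes(n_layers=32, d_inner=4096, d_state=16)
for seq_len in (1_000, 10_000, 100_000):
    cache = kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=32, head_dim=128)
    print(f"seq_len={seq_len:>7}: KV cache {cache / 2**20:10.1f} MiB, "
          f"SSM state {state / 2**20:6.1f} MiB")
```

Under these assumptions the cache grows by a factor of ten with each tenfold increase in context, while the recurrent state stays at the same size throughout.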
State tracking and formal language tasks. Mamba and Mamba-2 cannot reliably solve tasks that require maintaining unbounded discrete state, such as checking parenthesis matching at arbitrary depth or computing parity over long binary strings. Mamba-3's complex-valued state updates partially address this limitation, but transformers with sufficient depth remain stronger on formal-language tasks and chain-of-thought-style reasoning.
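Parity gives a compact view of why the form of the state update matters. The toy below is a one-dimensional linear recurrence h_t = a_t · h_{t-1}, not Mamba itself. When the input-dependent factor a_t is allowed to be -1, the sign of the state tracks parity exactly. When a_t is confined to (0, 1], a simplifying assumption that mirrors a real, positive, decay-style transition, the state can shrink but never change sign. Rotating a complex state by π is the same as multiplying by -1, which offers one intuition for why complex-valued updates help. All function names here are hypothetical.

```python
import cmath


def parity_with_sign_flip(bits):
    """Track parity with h_t = a_t * h_{t-1}, where a_t = -1 on a 1 bit and +1 on a 0 bit."""
    h = 1.0
    for b in bits:
        h *= -1.0 if b else 1.0
    return 0 if h > 0 else 1


def state_with_positive_decay(bits, decay_on_one=0.5):
    """Same recurrence with a_t restricted to (0, 1]: the state shrinks but keeps its sign."""
    h = 1.0
    for b in bits:
        h *= decay_on_one if b else 1.0
    return h


def parity_with_rotation(bits):
    """Complex-valued variant: rotate the state by pi for every 1 bit."""
    h = 1 + 0j
    for b in bits:
        if b:
            h *= cmath.exp(1j * cmath.pi)
    return 0 if h.real > 0 else 1


bits = [1, 0, 1, 1, 0, 1, 1]
print("true parity:       ", sum(bits) % 2)
print("sign-flip state:   ", parity_with_sign_flip(bits))
print("complex rotation:  ", parity_with_rotation(bits))
print("positive-decay h > 0:", state_with_positive_decay(bits) > 0)
```

The positive-decay state encodes how many 1 bits were seen only through its magnitude, which vanishes quickly, so reading parity from it requires ever finer precision as the string grows.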
These limitations have motivated the hybrid architectures listed above. The dominant industry view by 2026 is that pure Mamba and pure transformer architectures are both Pareto-suboptimal for general-purpose LLMs, and that hybrids capture the best of both.
Imagine you are listening to a very long story. A transformer is like writing down every single word of the story so you can look back at any word whenever you want. This works really well, but your notebook gets bigger and bigger, and it takes longer and longer to flip through all the pages.
Mamba is like keeping a summary in your head instead of writing everything down. As you hear each new word, you decide whether it is important enough to remember or whether you can forget it. You update your mental summary as you go. Your summary always stays the same size no matter how long the story gets, so you can listen to very, very long stories without running out of space. The downside is that if someone asks you to repeat word number 47 exactly, you might not remember it because you only kept a summary, not the full text.
The special trick Mamba uses is that it gets to decide what to remember based on what it is hearing right now. If it hears something important, it pays more attention and updates its summary. If it hears something less important, it mostly ignores it and keeps its existing summary. This is what "selective" means in "selective state space model."
Most modern systems use a clever combination: mostly summary keeping (Mamba) with a few pages of detailed notes (attention) sprinkled in where exact recall matters. That way you get the speed of summarizing and the precision of writing things down.