# Mamba

> Source: https://aiwiki.ai/wiki/mamba
> Updated: 2026-06-20
> Categories: Large Language Models, Model Architecture
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

*See also: [transformer](/wiki/transformer), [attention mechanism](/wiki/attention), [recurrent neural network](/wiki/recurrent_neural_network), [large language model](/wiki/large_language_model), [Mamba-2](/wiki/mamba_2), [state space model](/wiki/state_space_model)*

## Introduction

Mamba is a [neural network](/wiki/neural_network) architecture for [sequence modeling](/wiki/sequence_modeling) that uses selective [state space models](/wiki/state_space_model) (SSMs) to process sequential data in linear time with respect to sequence length, offering an alternative to the [transformer](/wiki/transformer) whose [attention mechanism](/wiki/attention) scales quadratically. It was introduced by Albert Gu and Tri Dao on 1 December 2023 in the paper "Mamba: Linear-Time Sequence Modeling with Selective State Spaces."[1] In benchmarks reported by its authors, Mamba delivers 5x higher inference throughput than Transformers, scales linearly in sequence length, keeps improving on real data up to million-length sequences, and at the 3B-parameter scale outperforms Transformers of the same size while matching Transformers twice its size.[1]

The central idea behind Mamba is making the parameters of a state space model depend on the input, allowing the model to selectively propagate or forget information along the sequence depending on the current token.[1] In the words of Gu and Dao, "simply letting the SSM parameters be functions of the input addresses their weakness with discrete modalities, allowing the model to selectively propagate or forget information along the sequence length dimension depending on the current token."[1] Combined with a hardware-aware parallel scan algorithm, Mamba achieves both the modeling power of content-aware reasoning and the computational efficiency of linear-time processing.[1]

Mamba was pretrained at multiple scales (130M, 370M, 790M, 1.4B, and 2.8B parameters) on 300 billion tokens from the Pile dataset.[1][12] The Mamba-3B model matched the [perplexity](/wiki/perplexity) of a Transformer twice its size while delivering 5x higher inference throughput.[1] As a general sequence model backbone, Mamba achieved strong performance across language, audio, and genomics modalities.[1]

Since the original paper, Mamba has matured into a production architecture. Mistral, AI21 Labs, the Technology Innovation Institute (TII), Nvidia, and IBM have all shipped Mamba-based or Mamba-hybrid models, and frameworks such as vLLM, TensorRT-LLM, and Hugging Face transformers now support selective scan kernels out of the box. The architecture also seeded a family of follow-up variants including [Mamba-2](/wiki/mamba_2), Mamba-3, MambaByte, and Vision Mamba.

## At a glance

| Property | Value |
|----------|-------|
| Architecture type | Selective state space model (SSM) |
| Introduced | 1 December 2023 (arXiv:2312.00752)[1] |
| Authors | Albert Gu (Carnegie Mellon), Tri Dao (Princeton)[1] |
| Time complexity | O(L) in sequence length L (vs O(L^2) for attention)[1] |
| Inference throughput | Up to 5x higher than a comparable Transformer[1] |
| Pretraining scales | 130M, 370M, 790M, 1.4B, 2.8B parameters on 300B Pile tokens[1][12] |
| Headline result | Mamba-3B matches a Transformer twice its size[1] |
| Reference implementation | mamba-ssm (CUDA selective-scan kernel), Apache 2.0 |
| Direct successors | Mamba-2 (2024), Mamba-3 (2026)[4][5] |

## Background: state space models

### Continuous-time formulation

State space models originate from control theory and signal processing. A continuous-time SSM maps an input signal u(t) to an output signal y(t) through a latent state vector x(t) using two equations:

- **State equation:** x'(t) = A x(t) + B u(t)
- **Output equation:** y(t) = C x(t) + D u(t)

Here, A is the state transition matrix that governs how the latent state evolves over time, B is the input projection matrix, C is the output projection matrix, and D provides a direct skip connection from input to output. The model learns these parameters to capture the dynamics of input-to-output mappings through the latent state representation.

In classical systems, A, B, C, and D are fixed matrices. This property is called Linear Time Invariance (LTI): the same dynamics apply regardless of when or what input arrives. LTI systems have useful mathematical properties but cannot perform content-dependent reasoning, a limitation that Mamba directly addresses.[1]

### Discretization

Since real-world data like text tokens and audio samples arrive as discrete sequences rather than continuous signals, the continuous SSM must be discretized before it can be applied computationally. The most common method is the zero-order hold (ZOH) technique, which holds each discrete input value constant until the next sample arrives.

A learnable step size parameter (denoted delta) controls the resolution of the discretization. Through ZOH, the continuous matrices A and B are converted into their discrete counterparts (typically written as A-bar and B-bar) that operate on sequences step by step:

- A-bar = exp(delta * A)
- B-bar = (delta * A)^(-1) * (A-bar - I) * delta * B
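
The two ZOH formulas above can be checked numerically in the simplest case. The following is a minimal sketch, assuming a single scalar state (so A and B are plain numbers and the matrix inverse becomes a division); the function name `zoh_discretize` is illustrative and not taken from any library.

```python
import math

def zoh_discretize(a: float, b: float, delta: float) -> tuple[float, float]:
    """Zero-order-hold discretization of one scalar state x' = a*x + b*u.

    Assumes a != 0 so that (delta * a) can be inverted.
    """
    a_bar = math.exp(delta * a)
    # Scalar version of (delta*A)^(-1) * (A_bar - I) * delta * B.
    b_bar = (a_bar - 1.0) / (delta * a) * (delta * b)
    return a_bar, b_bar

a_bar, b_bar = zoh_discretize(a=-1.0, b=1.0, delta=0.1)
print(a_bar, b_bar)
```

For a stable state (negative `a`), `a_bar` lies between 0 and 1, and for very small `delta` the value of `b_bar` approaches `delta * b`, which is the simpler Euler-style approximation.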

Discretization is one of the most important aspects of SSM architectures because, for a time-invariant SSM, it enables two equivalent computational views of the same model: a recurrent view (processing tokens one at a time) and a convolutional view (processing the entire sequence in parallel). This dual representation allows time-invariant SSMs to use the parallelizable convolutional form during training for speed, then switch to the recurrent form during inference for efficiency.[2]
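
The equivalence of the two views can be illustrated with a toy example. This is a sketch under simplifying assumptions: one scalar state, fixed (time-invariant) discrete parameters, and no D skip term. The function names are hypothetical.

```python
def ssm_recurrent(a_bar: float, b_bar: float, c: float, u: list[float]) -> list[float]:
    """Recurrent view: update the state one input at a time."""
    x, ys = 0.0, []
    for u_t in u:
        x = a_bar * x + b_bar * u_t
        ys.append(c * x)
    return ys

def ssm_convolutional(a_bar: float, b_bar: float, c: float, u: list[float]) -> list[float]:
    """Convolutional view: build the kernel K_k = c * a_bar^k * b_bar once,
    then apply it as a causal convolution over the whole sequence."""
    length = len(u)
    kernel = [c * (a_bar ** k) * b_bar for k in range(length)]
    return [sum(kernel[k] * u[t - k] for k in range(t + 1)) for t in range(length)]

u = [1.0, 0.0, 2.0, -1.0, 0.5]
print(ssm_recurrent(0.9, 0.1, 2.0, u))
print(ssm_convolutional(0.9, 0.1, 2.0, u))
```

Both calls produce the same outputs up to floating-point rounding. The kernel can be precomputed only because `a_bar`, `b_bar`, and `c` do not change from step to step.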

### HiPPO initialization

A key challenge for any recurrent model is remembering information over long sequences without suffering from vanishing or exploding gradients. The HiPPO (High-order Polynomial Projection Operators) framework, introduced by Albert Gu, Tri Dao, Stefano Ermon, Atri Rudra, and Christopher Re in 2020, provides a principled initialization for the state matrix A.[3]

HiPPO works by continuously projecting the input history onto a basis of orthogonal polynomials (specifically Legendre polynomials). This creates a state representation that compresses historical information optimally: recent tokens are captured with high fidelity while older tokens decay gracefully.[3] The resulting HiPPO matrix ensures that each update step requires only O(N) operations, and gradient norms scale as O(1/t), preventing the vanishing and exploding gradient problems that plague standard [RNNs](/wiki/recurrent_neural_network).[3]

HiPPO initialization proved to be a foundational component for all subsequent SSM architectures, including S4 and Mamba.[2][3]
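
For concreteness, the sketch below builds the HiPPO matrix in the Legendre (LegS) form as it is commonly stated in the S4 literature: entries below the diagonal are -sqrt((2n+1)(2k+1)), diagonal entries are -(n+1), and entries above the diagonal are zero. The function name `hippo_legs` is illustrative, and production code typically uses further transformed or diagonal variants of this matrix.

```python
import math

def hippo_legs(n_state: int) -> tuple[list[list[float]], list[float]]:
    """Return the HiPPO-LegS state matrix A and input vector B as nested lists."""
    A = [[0.0] * n_state for _ in range(n_state)]
    for n in range(n_state):
        for k in range(n_state):
            if n > k:
                A[n][k] = -math.sqrt((2 * n + 1) * (2 * k + 1))
            elif n == k:
                A[n][k] = -float(n + 1)
            # entries with n < k stay 0.0 (lower-triangular structure)
    B = [math.sqrt(2 * n + 1) for n in range(n_state)]
    return A, B

A, B = hippo_legs(4)
for row in A:
    print([round(v, 3) for v in row])
```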

## S4: structured state space sequences

The Structured State Space for Sequences (S4) model, introduced by Albert Gu, Karan Goel, and Christopher Re in 2021 (published at ICLR 2022 where it received an Outstanding Paper Honorable Mention), was the first SSM architecture to achieve competitive performance with [transformers](/wiki/transformer) on a variety of sequence modeling tasks.[2]

S4 combined three components:

1. The SSM framework described above
2. HiPPO initialization for handling long-range dependencies
3. A structured parameterization that makes the computation efficient

The main technical contribution of S4 was decomposing the HiPPO matrix A into a Normal Plus Low-Rank (NPLR) form. This decomposition allowed the SSM convolutional kernel to be computed in O(N + L) operations and memory (where N is the state dimension and L is the sequence length), reducing what would otherwise be an intractable computation.[2]

S4 achieved several notable results:

| Benchmark | S4 result | Significance |
|-----------|-----------|-------------|
| Sequential CIFAR-10 | 91.0% accuracy | Without data augmentation or auxiliary losses |
| Long Range Arena (Path-X) | Solved (length 16,384) | First model to solve this task; all prior methods failed |
| Generation speed | 60x faster than Transformers | On comparable autoregressive tasks |

However, S4 and other LTI state space models shared a fundamental limitation: because their parameters remained constant regardless of input content, they could not perform content-based reasoning.[1] For example, they struggled with tasks requiring the model to attend to specific tokens based on their content rather than their position.[1]

### Evolution from S4 to Mamba

Between S4 and Mamba, several intermediate SSM variants were developed:

| Model | Year | Key contribution |
|-------|------|------------------|
| S4 | 2021 | NPLR parameterization of HiPPO matrix |
| DSS (Diagonal State Spaces) | 2022 | Showed diagonal approximation of A achieves comparable performance |
| S4D | 2022 | Simplified S4 with diagonal initialization |
| S5 | 2023 | Multi-input multi-output (MIMO) SSM with parallel scan |
| H3 (Hungry Hungry Hippos) | 2023 | Combined SSM with multiplicative gating for language modeling |
| Mamba | 2023 | Input-dependent (selective) SSM parameters |

Each step simplified the architecture while maintaining or improving performance, with Mamba taking the most significant step in this line of work by abandoning time invariance altogether.[1]

## How does Mamba's selective scan mechanism work?

### The selectivity problem

The core limitation of all prior SSM architectures was Linear Time Invariance. Because the matrices A, B, and C remained identical for every token regardless of input content, these models could not perform content-aware filtering.[1] Consider a language model processing the sentence "The cat sat on the mat": an LTI model treats the word "the" with exactly the same dynamics as the word "cat," even though they carry very different amounts of semantic information.

The Mamba authors diagnosed this directly: "a key weakness of such models is their inability to perform content-based reasoning."[1] Transformers solve this problem through [self-attention](/wiki/self_attention), which computes pairwise similarity scores between all tokens. This enables content-based reasoning but comes at O(L^2) cost in both time and memory. Mamba achieves content-aware processing while maintaining O(L) complexity.[1]

The Mamba paper demonstrated this concretely with two synthetic tasks. In Selective Copying, models had to copy a small set of tokens at variable positions, ignoring filler tokens. In Induction Heads, they had to retrieve the next token after a specific marker pattern. Vanilla S4 and other LTI baselines scored near random; Mamba achieved near-perfect accuracy and generalized to sequences much longer than those seen in training.[1]

### Input-dependent parameters

Mamba makes three key SSM parameters functions of the input:[1]

| Parameter | Role | How it becomes selective |
|-----------|------|-------------------------|
| B (input matrix) | Controls how input enters the state | Projected from the input via a linear layer; different for each token |
| C (output matrix) | Controls how state maps to output | Projected from the input via a linear layer; different for each token |
| Delta (step size) | Controls discretization resolution | Projected from the input via a linear layer + softplus; different for each token |

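The state matrix A itself is not made input-dependent: it remains a learned, diagonal parameter with a HiPPO-derived initialization. A affects the recurrence only through the discretized A-bar = exp(delta * A), so making delta input-dependent is already enough to make the effective state transition selective.[1]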

The step size delta plays a particularly important role as a selection mechanism. A large delta causes the model to emphasize the current input token and reset more of the historical state. A small delta causes the model to suppress the current token in favor of preserving existing context.[1] This gives the model a learned, content-dependent ability to decide what information to retain and what to discard, similar to the gating mechanisms in [LSTMs](/wiki/long_short-term_memory_lstm) and [GRUs](/wiki/recurrent_neural_network).[1]
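
The selection mechanism can be written out as a plain sequential loop. The following is a minimal sketch, assuming a single channel whose token feature is one scalar, a diagonal A, and the ZOH formulas from the discretization section. The weight names (`w_b`, `w_c`, `w_delta`, `bias_delta`) and both function names are hypothetical; the reference implementation projects from the full token vector and runs as a fused GPU kernel.

```python
import math

def softplus(v: float) -> float:
    return math.log1p(math.exp(v))

def retention(delta: float, a: float) -> float:
    """Fraction of the previous state kept after one step: exp(delta * a)."""
    return math.exp(delta * a)

def selective_ssm(xs, a_diag, w_b, w_c, w_delta, bias_delta):
    """Sequential reference for a selective SSM on a scalar input stream."""
    h = [0.0] * len(a_diag)
    ys = []
    for x_t in xs:
        # B, C and delta are recomputed from the current input at every step.
        delta = softplus(w_delta * x_t + bias_delta)
        b_t = [w * x_t for w in w_b]
        c_t = [w * x_t for w in w_c]
        for n, a in enumerate(a_diag):
            a_bar = retention(delta, a)
            b_bar = (a_bar - 1.0) / a * b_t[n]  # ZOH with delta cancelled
            h[n] = a_bar * h[n] + b_bar * x_t
        ys.append(sum(c * s for c, s in zip(c_t, h)))
    return ys

print(retention(0.01, -1.0))  # small delta: most of the old state is kept
print(retention(5.0, -1.0))   # large delta: the old state is mostly reset
```

With a negative entry of A, a small delta keeps the retention factor close to 1 (the state is preserved and the input is largely ignored), while a large delta drives it toward 0 (the state is overwritten by the current input).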

### Parallel scan algorithm

Making parameters input-dependent means the model can no longer be computed as a fixed convolution kernel, since the kernel changes at every time step. This forces Mamba to use the recurrent representation. However, naive sequential recurrence would be far too slow for training on modern GPUs.[1]

Mamba solves this with the parallel scan (prefix sum) algorithm. The key insight is that the recurrence operation is associative: the way consecutive steps are grouped when intermediate results are combined does not affect the final answer. This property allows the sequence to be split into segments that are computed in parallel, with results merged iteratively. Given enough parallel processors, the scan reduces the computation from O(L) sequential steps to O(log L) parallel steps while producing the same output as sequential recurrence.[1]
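
The associative operator behind the scan can be made explicit. In the sketch below, each step of the recurrence h_t = a_t * h_{t-1} + b_t is represented as a pair (a_t, b_t), and two consecutive steps are merged into one equivalent step. The scan shown is a simple Hillis-Steele-style doubling scheme simulated in plain Python; it is an illustration of the idea, not the work-efficient, fused CUDA scan used by the reference implementation.

```python
def combine(left, right):
    """Merge two consecutive steps: apply `left` first, then `right`."""
    a1, b1 = left
    a2, b2 = right
    return (a1 * a2, a2 * b1 + b2)

def sequential_scan(a, b):
    h, out = 0.0, []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        out.append(h)
    return out

def parallel_scan(a, b):
    """Inclusive scan in about log2(L) rounds.

    Within a round, every position depends only on the previous round's
    values, so all positions could be updated at the same time.
    """
    elems = list(zip(a, b))
    step = 1
    while step < len(elems):
        elems = [combine(elems[i - step], elems[i]) if i >= step else elems[i]
                 for i in range(len(elems))]
        step *= 2
    # With a zero initial state, the second component is h_t.
    return [h for _, h in elems]

a = [0.9, 0.5, 1.0, 0.2, 0.7, 0.3, 0.8]
b = [1.0, -2.0, 0.5, 3.0, 0.0, 1.5, -1.0]
print(sequential_scan(a, b))
print(parallel_scan(a, b))
```

Because `combine` is associative but not commutative, the left-to-right order of the elements must be preserved while the grouping is free to change.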

The combination of input-dependent parameters and the parallel scan algorithm is what the authors call the "selective scan" mechanism.[1]

## Hardware-aware implementation

The selective scan mechanism creates a computational challenge: the expanded state (with input-dependent B and C matrices incorporating the batch and sequence length dimensions) is much larger than the original state. Naively materializing this expanded state in GPU high-bandwidth memory (HBM, also called DRAM) would erase the efficiency gains of using an SSM in the first place.[1]

Mamba addresses this with a hardware-aware algorithm that mirrors techniques from [FlashAttention](/wiki/flash_attention) (also developed by Tri Dao).[1][10] The implementation uses three key optimizations:

**Kernel fusion.** Instead of writing intermediate results (discretization output, scan output, and the C multiplication) back to slow GPU DRAM between each operation, Mamba fuses all three operations into a single GPU kernel that keeps intermediate values in fast on-chip SRAM. This eliminates costly memory transfers between the GPU memory hierarchy levels.[1]

**Recomputation.** Rather than storing the large intermediate states during the forward pass for use in backpropagation, Mamba recomputes them during the backward pass. Although this doubles the computation for those states, recomputation is faster than the alternative of reading large intermediate tensors from DRAM, because modern GPUs are memory-bandwidth-limited rather than compute-limited.[1]

**Avoiding materialization.** The expanded state is never fully materialized in DRAM. Only the compressed hidden state (of size N, the state dimension) is kept in memory, not the full expanded representation.[1]
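
A back-of-the-envelope calculation shows why materializing the expanded state is costly. The shapes below (batch 8, sequence length 2,048, inner width 2,048, state dimension 16, 2 bytes per element) are hypothetical values chosen for illustration, not figures from the paper.

```python
def tensor_bytes(*dims: int, bytes_per_elem: int = 2) -> int:
    """Size in bytes of a dense tensor with the given dimensions."""
    total = bytes_per_elem
    for d in dims:
        total *= d
    return total

batch, seq_len, d_inner, n_state = 8, 2048, 2048, 16

# Expanded state: one N-dimensional state per (batch, position, channel).
expanded_bytes = tensor_bytes(batch, seq_len, d_inner, n_state)
# Ordinary layer input/output: one value per (batch, position, channel).
io_bytes = tensor_bytes(batch, seq_len, d_inner)

print(expanded_bytes / 2**20, "MiB expanded state")
print(io_bytes / 2**20, "MiB input/output")
```

Under these assumptions the expanded state is N times larger than the layer's input or output, which is why the kernel keeps it in on-chip SRAM and writes only the much smaller results back to HBM.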

These optimizations make the selective scan operation 20 to 40 times faster than a naive implementation.[1] The resulting algorithm is faster than optimized attention implementations (such as FlashAttention) for long sequences while scaling linearly rather than quadratically.[1] The reference CUDA implementation (`mamba-ssm`) became the canonical kernel used by virtually every downstream Mamba variant.

## Mamba architecture

The full Mamba architecture wraps the selective SSM layer into a simplified block design. Unlike transformers, which alternate between self-attention layers and MLP (feed-forward) layers, each Mamba block combines both functions into a single unit:[1]

1. The input is projected to a higher dimension through two parallel linear projections.
2. One projection passes through a 1D convolution followed by a SiLU activation and the selective SSM.
3. The other projection passes through a SiLU activation (acting as a multiplicative gate).
4. The two paths are combined via element-wise multiplication.
5. A final linear projection maps back to the model dimension.
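
The five steps above can be sketched as a data-flow skeleton. This is a simplified illustration in plain Python lists, not the reference implementation: the selective SSM is passed in as a callable `ssm` (any sequence-to-sequence function of matching width), the residual connection and normalization are omitted, and all names and weight layouts are assumptions made for the example.

```python
import math

def silu(v: float) -> float:
    return v / (1.0 + math.exp(-v))

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def causal_depthwise_conv(seq, kernels):
    """Per-channel causal 1D convolution; the last kernel tap is the current step."""
    k_size = len(kernels[0])
    out = []
    for t in range(len(seq)):
        row = []
        for ch, kern in enumerate(kernels):
            acc = 0.0
            for k in range(k_size):
                src = t - (k_size - 1) + k
                if src >= 0:
                    acc += kern[k] * seq[src][ch]
            row.append(acc)
        out.append(row)
    return out

def mamba_block(xs, w_in_x, w_in_z, conv_kernels, ssm, w_out):
    u = [matvec(w_in_x, x) for x in xs]            # step 1: SSM branch
    z = [matvec(w_in_z, x) for x in xs]            # step 1: gate branch
    u = causal_depthwise_conv(u, conv_kernels)     # step 2: conv ...
    u = [[silu(v) for v in row] for row in u]      # ... then SiLU ...
    u = ssm(u)                                     # ... then selective SSM
    gate = [[silu(v) for v in row] for row in z]   # step 3: SiLU gate
    mixed = [[a * g for a, g in zip(ur, gr)]       # step 4: element-wise product
             for ur, gr in zip(u, gate)]
    return [matvec(w_out, row) for row in mixed]   # step 5: output projection
```

A call such as `mamba_block(xs, w_in, w_in, kernels, lambda seq: seq, w_out)` runs the skeleton with an identity function standing in for the SSM, which is enough to check shapes and causality.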

This design is inspired by the gated MLP architecture from [Llama](/wiki/llama) and similar models, but with the selective SSM replacing the nonlinear activation on one branch.[1] The result is a simpler architecture that does not require separate attention and MLP blocks, layer normalization within the block, or positional embeddings.[1] The authors describe Mamba as "a simplified end-to-end neural network architecture without attention or even MLP blocks."[1]

A complete Mamba model stacks many of these blocks (for example, 48 blocks for the 1.4B parameter model and 64 blocks for the 2.8B model), with [RMSNorm](/wiki/rmsnorm) applied before each block.[1]

## How does Mamba differ from a transformer?

| Aspect | [Transformer](/wiki/transformer) | Mamba |
|--------|-------------|-------|
| Core mechanism | [Self-attention](/wiki/self_attention) | Selective state space model |
| Time complexity (sequence length L) | O(L^2) per layer | O(L) per layer |
| Training parallelism | Fully parallel (attention matrix) | Parallel (via parallel scan) |
| Inference mode | Autoregressive with KV cache | Recurrent with fixed-size state |
| Inference memory | KV cache grows linearly with context | Fixed-size state (constant memory) |
| Content-aware reasoning | Yes (attention computes pairwise token interactions) | Yes (input-dependent SSM parameters) |
| Context window | Fixed maximum (requires extension techniques) | Theoretically unbounded |
| Long-range dependencies | Attention can directly connect any two positions | State carries compressed history |
| In-context learning | Strong | Weaker (limited by finite state size) |
| Copying/retrieval | Strong (direct access to all past tokens) | Weaker (information must survive state compression) |
| Position encoding | Required (sinusoidal, RoPE, ALiBi, etc.) | Not required |
| Inference throughput (similar size) | Baseline | Up to 5x higher |

The fundamental trade-off is between expressiveness and efficiency. Transformers maintain an uncompressed representation of the entire sequence through the KV cache, allowing direct access to any past token. Mamba compresses the entire history into a fixed-size hidden state, which is much more memory-efficient but means information can only be accessed if it survived the compression.[1] This makes transformers better at tasks requiring exact retrieval from context (such as looking up a phone number mentioned earlier in the text) while Mamba excels at tasks involving long-range dependencies and high-throughput generation.[9]
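
The memory side of this trade-off is simple arithmetic. The sketch below compares a KV cache, which stores one key and one value vector per layer and per past token, with a recurrent state, whose size does not depend on the context length. The model shapes are hypothetical, 2 bytes per element are assumed, and the small convolution state that a Mamba layer also carries is ignored.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Keys and values (factor 2) for every layer and every cached token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

def ssm_state_bytes(n_layers, d_inner, n_state, bytes_per_elem=2):
    """One N-dimensional state per channel and layer, independent of context."""
    return n_layers * d_inner * n_state * bytes_per_elem

for seq_len in (4096, 65536, 1048576):
    kv = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=seq_len)
    ssm = ssm_state_bytes(n_layers=64, d_inner=5120, n_state=16)
    print(seq_len, kv / 2**20, "MiB KV cache vs", ssm / 2**20, "MiB SSM state")
```

The KV cache doubles whenever the context doubles, while the SSM state stays the same size for any context length. Those fixed bytes are also all the model has available to remember the past.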

## What benchmark results did Mamba achieve?

Mamba was evaluated across multiple modalities: language modeling on the Pile dataset (300B tokens), plus domain-specific datasets for audio and genomics.[1]

### Language modeling

On language modeling with the Pile, Mamba models showed consistent improvements over transformer baselines at each scale:[1]

| Model | Parameters | Pile perplexity (relative) | Notes |
|-------|-----------|----------------|-------|
| Transformer | 1.4B | Baseline | Standard transformer |
| Mamba | 1.4B | Lower than Transformer-1.4B | Better quality at equal parameter count |
| Transformer | 2.8B | Baseline | Standard transformer |
| Mamba | 2.8B | Matches Transformer-6.9B | Comparable quality with about 40% of the parameters |

On downstream zero-shot evaluation tasks, Mamba-3B outperformed Transformer-3B models and matched or exceeded the performance of Transformers with twice as many parameters.[1]

### Audio generation

On the SC09 speech generation benchmark, a small Mamba model outperformed larger baselines spanning several model families: autoregressive (WaveNet, SampleRNN), GAN-based (WaveGAN), [diffusion](/wiki/diffusion_model)-based (DiffWave), and SSM-based (SaShiMi). A larger variant, parameter-matched to the baselines, further improved fidelity metrics, reducing the FID score by more than half compared to prior state-of-the-art.[1]

### DNA sequence modeling

Mamba demonstrated strong results on DNA sequence modeling, outperforming HyenaDNA across model sizes. Unlike HyenaDNA, whose performance degraded with longer sequences, Mamba continued to improve with context lengths up to 1 million tokens. On a downstream species classification task distinguishing five great ape species (which share approximately 99% DNA similarity), Mamba's ability to use extremely long contexts proved particularly effective.[1]

### Inference speed

Mamba's selective scan implementation achieved up to 3x speedup over prior SSM methods on A100 GPUs and up to 5x higher generation throughput compared to similarly sized Transformers. The efficient scan kernel was up to 40x faster than a standard (naive) implementation of the selective recurrence.[1]

## Mamba-2: structured state space duality

In May 2024, Tri Dao and Albert Gu published "Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality," introducing [Mamba-2](/wiki/mamba_2).[4] The paper established a theoretical framework called Structured State Space Duality (SSD) that reveals a deep connection between SSMs and [attention mechanisms](/wiki/attention).[4]

### The SSD framework

The key theoretical result is that a state space model with a scalar-times-identity state matrix (where all diagonal elements of A are identical) is mathematically equivalent to a form of masked self-attention with a 1-semiseparable causal mask. This duality means the same computation can be expressed either as an SSM recurrence or as a matrix multiplication resembling attention:[4]

- **SSM form:** h_t = a_t * h_{t-1} + B_t * x_t, then y_t = C_t^T * h_t
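- **Attention form:** y = M x with M = L ∘ (C B^T), where ∘ is the element-wise product and L is the lower-triangular mask whose entry (i, j), for i >= j, is the cumulative product a_{j+1} * ... * a_i (here L denotes the mask, not the sequence length)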

When all a_t values equal 1, the attention form reduces to standard causal linear attention.[4]
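
The duality can be verified numerically on a small example. The sketch below assumes a single channel with scalar inputs x_t, a scalar decay a_t per step, and N-dimensional vectors B_t and C_t. The mask entry for positions i >= j is taken to be the product a_{j+1} * ... * a_i (equal to 1 when i = j). Function names are illustrative.

```python
def ssm_form(a, B, C, x):
    """Recurrence: h_t = a_t * h_{t-1} + B_t * x_t, y_t = C_t . h_t."""
    h = [0.0] * len(B[0])
    ys = []
    for a_t, b_t, c_t, x_t in zip(a, B, C, x):
        h = [a_t * h_i + b_i * x_t for h_i, b_i in zip(h, b_t)]
        ys.append(sum(c_i * h_i for c_i, h_i in zip(c_t, h)))
    return ys

def attention_form(a, B, C, x):
    """Matrix form: y_i = sum over j <= i of mask[i][j] * (C_i . B_j) * x_j."""
    ys = []
    for i in range(len(x)):
        y_i = 0.0
        for j in range(i + 1):
            decay = 1.0
            for k in range(j + 1, i + 1):
                decay *= a[k]
            score = sum(c * b for c, b in zip(C[i], B[j]))
            y_i += decay * score * x[j]
        ys.append(y_i)
    return ys

a = [0.9, 0.8, 1.0, 0.5]
B = [[1.0, 0.0], [0.5, 0.5], [0.0, 2.0], [1.0, -1.0]]
C = [[1.0, 1.0], [0.0, 1.0], [2.0, 0.0], [0.5, 0.5]]
x = [1.0, -1.0, 2.0, 0.5]
print(ssm_form(a, B, C, x))
print(attention_form(a, B, C, x))
```

The two functions return the same values up to rounding. The first costs time linear in the sequence length; the second is quadratic but consists of dense products that map well onto matrix-multiplication hardware.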

This connection runs through the theory of semiseparable matrices, a well-studied class in numerical linear algebra. The authors showed that both SSMs and attention can be understood as different decompositions of the same structured matrix.[4]

### Architecture changes from Mamba-1 to Mamba-2

| Feature | Mamba-1 | Mamba-2 |
|---------|---------|--------|
| State matrix A | Diagonal (different value per channel) | Scalar times identity (one value shared) |
| Head dimension | P = 1 (independent SSMs per channel) | P >= 64 (shared dynamics across channels) |
| Typical state dimension N | 16 | 64 to 256 |
| Parameter generation | Sequential (SSM params depend on x) | Parallel (A, B, C generated alongside x) |
| Core algorithm | Selective scan via parallel prefix sum | Chunkwise SSD (quadratic within chunks, linear across chunks) |

The larger state dimension in Mamba-2 (up to 16x larger than Mamba-1) significantly improves performance on associative recall tasks, where Mamba-1 was weakest. The multi-head structure, where P channels share a single state transition, reduces the total number of independent recurrences while increasing expressiveness.[4]

### Performance

Mamba-2's core SSD layer runs 2 to 8 times faster than Mamba-1's selective SSM while maintaining competitive performance with Transformers on language modeling benchmarks. The SSD algorithm achieves this speedup by decomposing sequences into chunks, applying the quadratic (attention-like) computation within each chunk for hardware efficiency, and passing SSM states between chunks to maintain the linear overall scaling. A minimal PyTorch implementation requires approximately 30 lines of code.[4]
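
The chunking idea can be sketched in a few lines of plain Python. This is an illustrative scalar version under simplifying assumptions (one channel, scalar decay a_t, unbatched lists), not the paper's PyTorch listing: inside each chunk the output is computed with the quadratic, attention-like sum, and only a single state vector is carried from one chunk to the next. All names are hypothetical.

```python
def decay(a, j, t):
    """Product a[j+1] * ... * a[t]; equals 1.0 when j == t."""
    p = 1.0
    for k in range(j + 1, t + 1):
        p *= a[k]
    return p

def dot(u, v):
    return sum(p * q for p, q in zip(u, v))

def ssd_recurrent(a, B, C, x):
    """Step-by-step reference used to check the chunked version."""
    h, ys = [0.0] * len(B[0]), []
    for t in range(len(x)):
        h = [a[t] * h_i + b_i * x[t] for h_i, b_i in zip(h, B[t])]
        ys.append(dot(C[t], h))
    return ys

def ssd_chunked(a, B, C, x, chunk=4):
    h_in = [0.0] * len(B[0])  # state after the last position of the previous chunk
    ys = []
    for s in range(0, len(x), chunk):
        e = min(s + chunk, len(x))
        for t in range(s, e):
            # Quadratic part: interactions between positions inside the chunk.
            intra = sum(decay(a, j, t) * dot(C[t], B[j]) * x[j] for j in range(s, t + 1))
            # Everything before the chunk enters only through the carried state.
            inter = decay(a, s - 1, t) * dot(C[t], h_in)
            ys.append(intra + inter)
        # Linear part: update the state handed to the next chunk.
        h_out = [decay(a, s - 1, e - 1) * h_i for h_i in h_in]
        for j in range(s, e):
            w = decay(a, j, e - 1) * x[j]
            h_out = [h_i + w * b_i for h_i, b_i in zip(h_out, B[j])]
        h_in = h_out
    return ys
```

With a fixed chunk size, the quadratic work per chunk is constant, so total cost grows linearly with the number of chunks and therefore with sequence length.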

## Mamba-3

In March 2026, Aakash Lahoti, Kevin Y. Li, Berlin Chen, Caitlin Wang, Aviv Bick, J. Zico Kolter, Tri Dao, and Albert Gu introduced Mamba-3 (published at ICLR 2026).[5] The paper addresses remaining expressivity limitations in Mamba-2 through three innovations:

**Exponential-trapezoidal discretization.** This replaces the simpler exponential-Euler discretization used in Mamba-1 and Mamba-2 with a second-order accurate method. The improved discretization enables an implicit convolution applied on the SSM input, increasing the expressivity of the recurrence.[5]

**Complex-valued state updates.** By modeling the SSM in the complex number domain, Mamba-3 achieves a more expressive state update. This connects to data-dependent rotary position embeddings ([RoPE](/wiki/rope)), providing a theoretical bridge between complex SSMs and the position encoding techniques used in transformers.[5]

**Multi-input multi-output (MIMO) formulation.** MIMO transitions from outer-product to matrix-multiplication-based state updates, increasing the rank of input and output projections. This raises decoding FLOPs by up to 4x relative to Mamba-2 at a fixed state size but maintains similar wall-clock decode latency.[5]

At the 1.5B parameter scale, Mamba-3 improved average downstream accuracy by 0.6 points over Gated DeltaNet (the next best model) in the SISO configuration, with the MIMO variant adding another 1.2 points. Mamba-3 with state size 64 matched the perplexity of Mamba-2 with state size 128, effectively achieving the same language modeling quality at half the state size. The model also solved formal language tasks (parity and modular arithmetic) that Mamba-2 could not handle, addressing known state-tracking deficiencies in prior linear recurrent models.[5]

## MambaByte and byte-level sequence modeling

MambaByte, introduced by Junxiong Wang and collaborators at COLM 2024, is a token-free adaptation of Mamba trained directly on raw byte sequences.[13] Conventional [large language models](/wiki/large_language_model) operate on subword tokens produced by a [BPE](/wiki/byte_pair_encoding) or SentencePiece tokenizer; this inductive bias creates failure modes around multilingual text, unusual scripts, code, and adversarial typos. Byte-level modeling removes the tokenizer but multiplies effective sequence length roughly four to five times, which is prohibitive for quadratic attention.[13]
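
The cost argument is easy to reproduce. The sketch below turns a string into its UTF-8 byte sequence (the vocabulary a byte-level model sees) and compares how a linear-time and a quadratic-time layer scale when the sequence gets longer by a given factor. The helper names are illustrative, and the factor of 4 is simply the lower end of the range quoted above.

```python
def attention_cost_factor(length_factor: float) -> float:
    """Quadratic layers: cost grows with the square of the length."""
    return length_factor ** 2

def scan_cost_factor(length_factor: float) -> float:
    """Linear-time layers: cost grows in proportion to the length."""
    return length_factor

# Byte-level input: every UTF-8 byte is one position, values in 0..255.
byte_ids = list("naïve".encode("utf-8"))
print(len(byte_ids), byte_ids)

print(attention_cost_factor(4.0), scan_cost_factor(4.0))
```

A sequence that is 4 to 5 times longer costs roughly 16 to 25 times more under quadratic attention, but only 4 to 5 times more under a linear-time scan.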

MambaByte exploits Mamba's linear-time scan to make byte-level modeling tractable. MambaByte-972M is competitive with state-of-the-art subword transformer baselines on language modeling while remaining more robust to character-level noise and orthographic perturbations.[13] The paper also introduced speculative decoding with tokenized drafting and byte-level verification, achieving a 2.6x inference speedup over the standard byte-level decode loop.[13] MambaByte demonstrated a broader pattern: architectures whose cost is linear in sequence length unlock modeling regimes (raw audio, DNA, bytes, video frames) where tokenization was previously the bottleneck.

## Which production models use Mamba?

Within eighteen months of the original paper, Mamba and Mamba-2 moved from research code to commercial deployments across at least five organizations. The following are notable releases that ship pure-Mamba or hybrid-Mamba weights publicly.

### Codestral Mamba

Mistral released Codestral Mamba 7B on 16 July 2024 under the Apache 2.0 license, positioning it as the first major Mamba-2 model deployed for production code generation.[14] Mistral framed the choice as a deliberate bet that Mamba's linear-time decode would matter more than transformer parity for the long-context, autocomplete-heavy workloads typical of [code assistants](/wiki/code_generation).[14]

Codestral Mamba achieved 75.0% on [HumanEval](/wiki/humaneval) Python, outperforming CodeGemma-1.1 7B (61.0%), CodeLlama 7B (31.1%), and DeepSeek Coder v1.5 7B (65.9%) at similar parameter counts. On the Spider SQL benchmark it reached 58.8%.[14] The model supports an effective context of 256K tokens for in-context retrieval and ships through Mistral's la Plateforme API as `codestral-mamba-2407`, with raw weights on Hugging Face and deployment recipes for the `mistral-inference` SDK and TensorRT-LLM.[14]

### Falcon Mamba 7B

The Technology Innovation Institute (TII) in Abu Dhabi released Falcon Mamba 7B on 12 August 2024 under the TII Falcon License 2.0.[15] It was the first pure-SSM 7B-scale model to surpass leading transformer baselines on standardized evaluations, outperforming Llama 3.1 8B, Mistral 7B v0.3, and Falcon 2 11B on Hugging Face's Open LLM Leaderboard at release.[15]

Falcon Mamba's key practical advantage was constant-memory generation: because the model carries a fixed-size SSM state rather than a KV cache, decoding a million-token output consumes the same GPU memory as decoding a hundred-token output.[15] TII later released Falcon3-Mamba-7B-Instruct in January 2025 and [Falcon-H1](/wiki/falcon_h1), a hybrid Mamba-attention architecture, in mid-2025.

### AI21 Jamba family

AI21 Labs released Jamba in March 2024, Jamba 1.5 Mini and 1.5 Large in August 2024, Jamba 1.6 in March 2025, and Jamba 1.7 in July 2025.[6][16][17] The family interleaves Mamba layers with a small number of attention layers and [Mixture-of-Experts](/wiki/mixture_of_experts) routing, targeting enterprise long-context workloads such as retrieval-augmented generation, contract review, and customer-support knowledge bases.[6]

| Version | Released | Architecture highlights | Effective context |
|---------|----------|------------------------|-------------------|
| Jamba v0.1 | March 2024 | 52B total / 12B active params, 1:7 attention:Mamba ratio, MoE | 256K tokens |
| Jamba 1.5 Mini | August 2024 | 52B total / 12B active params | 256K tokens |
| Jamba 1.5 Large | August 2024 | 398B total / 94B active params | 256K tokens |
| Jamba 1.6 | March 2025 | Quality and speed improvements over 1.5 | 256K tokens |
| Jamba 1.7 | July 2025 | Better grounding, optimized quantization for self-hosted deploy | 256K tokens |

At 256K tokens, Jamba 1.5 Large maintains a fixed activation memory budget that fits in a single 8x H100 node, where a comparably sized pure transformer would need hundreds of gigabytes for the KV cache.[16] AI21 distributes Jamba under the Jamba Open Model License on Hugging Face and through cloud catalogs including Google Cloud Vertex AI, Microsoft Azure AI Foundry, NVIDIA NIM, AWS Bedrock, Databricks, and Snowflake Cortex.[16] The 1.6 and 1.7 releases emphasize private on-prem and VPC deployment, reflecting Jamba's adoption by regulated enterprise customers.[17]

### Nvidia Hymba and Nemotron

Nvidia introduced Hymba in November 2024 as a family of small language models built around a hybrid-head parallel attention scheme.[18] Rather than alternating Mamba and attention layers as Jamba does, each Hymba layer runs standard attention heads and Mamba heads in parallel and combines their outputs.[18] Hymba-1.5B-Base outperformed similarly sized Llama 3.2 1B, SmolLM 1.7B, and Qwen 2.5 1.5B on common-sense reasoning benchmarks while requiring roughly 10x less KV-cache memory.[18]

Nvidia followed Hymba with Nemotron Nano 2 (mid-2025) and Nemotron 3 (15 December 2025), both pairing Mamba-2 layers with a small fraction of attention layers and a [Mixture-of-Experts](/wiki/mixture_of_experts) router. Nvidia positions Nemotron as the reference architecture for agentic AI workloads on Blackwell GPUs, where the fixed-size SSM state simplifies multi-tenant serving.

### IBM Granite 4.0 and Bamba

IBM Research collaborated directly with Tri Dao, Albert Gu, and Minjia Zhang (University of Illinois Urbana-Champaign) on Bamba and Bamba V2, hybrid Mamba-2 models released through 2024 and 2025. They served as the empirical foundation for IBM's enterprise language models. On 2 October 2025, IBM launched the Granite 4.0 family, the first commercial LLM line built around a 9:1 ratio of Mamba-2 to attention layers.[19] IBM reports a >70% reduction in serving RAM versus equivalent transformer Granite models for long-context and multi-session inference, with quality preserved on RAG, summarization, and code completion benchmarks.[19] Granite 4.0 ships through IBM watsonx, the Red Hat AI portfolio, and as Apache 2.0 weights on Hugging Face.[19]

### Production model summary

| Model | Vendor | Released | Architecture | Notable use case |
|-------|--------|----------|--------------|------------------|
| Codestral Mamba 7B | Mistral AI | July 2024 | Pure Mamba-2 | Code generation, autocomplete |
| Falcon Mamba 7B | TII | August 2024 | Pure Mamba | Constant-memory long generation |
| Jamba 1.5 Mini / Large | AI21 Labs | August 2024 | Hybrid Mamba + Attention + MoE | Enterprise 256K-context RAG |
| Hymba 1.5B | Nvidia | November 2024 | Parallel hybrid heads | On-device small LLM |
| Falcon3-Mamba-7B-Instruct | TII | January 2025 | Pure Mamba | Instruction-following SSM |
| Jamba 1.6 / 1.7 | AI21 Labs | March / July 2025 | Hybrid Mamba + Attention + MoE | Private enterprise deploy |
| Granite 4.0 | IBM | October 2025 | 9:1 Mamba-2:Attention hybrid | Enterprise long-context workloads |
| Nemotron 3 | Nvidia | December 2025 | Hybrid Mamba-2 + Attention + MoE | Agentic AI on Blackwell GPUs |
| Falcon H1R 7B | TII | January 2026 | Hybrid Mamba + Attention (reasoning) | Compact reasoning model |

## Hybrid architectures: Jamba and beyond

Recognizing that transformers and Mamba have complementary strengths, AI21 Labs released Jamba in March 2024, the first production-grade hybrid architecture combining Transformer attention layers, Mamba SSM layers, and [Mixture-of-Experts](/wiki/mixture_of_experts) (MoE).[6] Jamba's success catalyzed a generation of hybrid designs, and by 2026 attention-SSM interleaving had become a common choice for long-context models.[20]

### Jamba architecture

Jamba interleaves Transformer and Mamba layers at a ratio of approximately one attention layer for every seven Mamba layers. Each layer contains either an attention or a Mamba module, followed by a [feed-forward network](/wiki/feedforward_neural_network_ffn) (MLP). The MoE component lets the model activate only 12B of its 52B total parameters for each token, keeping compute costs manageable.[6]
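
As a rough illustration of this interleaving, the following Python sketch builds a layer schedule with one attention layer for every seven Mamba layers and reports the resulting attention fraction. The helper names, the 32-layer depth, and the position of the attention layer inside each group are assumptions made for illustration, not details taken from the Jamba paper.

```python
def hybrid_layer_schedule(n_layers: int, mamba_per_attention: int,
                          attention_offset: int = 0) -> list[str]:
    """Label each layer so that every complete group of
    (mamba_per_attention + 1) layers holds one attention layer."""
    period = mamba_per_attention + 1
    return [
        "attention" if i % period == attention_offset else "mamba"
        for i in range(n_layers)
    ]


def attention_fraction(schedule: list[str]) -> float:
    """Share of layers in the schedule that are attention layers."""
    return schedule.count("attention") / len(schedule)


# Assumed depth of 32 layers; one attention layer per seven Mamba layers.
jamba_like = hybrid_layer_schedule(n_layers=32, mamba_per_attention=7)
print(jamba_like[:8])
print(f"attention fraction: {attention_fraction(jamba_like):.3f}")

# MoE: 12B active out of 52B total parameters (figures from the text).
active_fraction = 12 / 52
print(f"active parameter fraction: {active_fraction:.2f}")
```

Under these assumptions, one layer in eight (12.5%) is attention, and roughly 23% of the parameters are active for any given token.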

Jamba supports context lengths up to 256K tokens while fitting in a single 80GB GPU.[6] The hybrid design provides:

- High throughput from the Mamba layers (which dominate the architecture)
- Strong in-context learning and retrieval from the sparse attention layers
- Parameter efficiency from the MoE routing

AI21 Labs later released Jamba-1.5 in two sizes: Jamba-1.5-Large (94B active parameters) and Jamba-1.5-Mini (12B active parameters), both with 256K token effective context length.[16] Jamba 1.6 and Jamba 1.7 (released in 2025) extended the family with improved retrieval grounding, more aggressive quantization, and optimizations targeting on-prem deployment.[17] Jamba demonstrated that hybrid architectures can capture the best properties of both design paradigms rather than forcing a choice between them.

### The broader hybrid landscape

Following Jamba, multiple groups released hybrid architectures with different interleaving schemes:

| Model | Group | Hybrid design |
|-------|-------|--------------|
| Zamba | Zyphra | Single shared attention block applied at multiple layers, Mamba elsewhere |
| Samba | Microsoft Research | Alternating Sliding Window Attention and Mamba |
| Hunyuan-TurboS | Tencent | Mamba-2 with periodic attention layers and MoE |
| Nemotron Nano 2 / 3 | Nvidia | Mamba-2 with attention and MoE on Blackwell |
| Hymba | Nvidia | Parallel attention and Mamba heads inside each layer |
| Granite 4.0 | IBM | 9:1 Mamba-2 to attention ratio |
| Bamba V2 | IBM + UIUC | Open hybrid that informed Granite 4.0 |
| Falcon H1 / H1R | TII | Hybrid Mamba + attention, with H1R adding reasoning post-training |

The shared insight is that a small number of attention layers (typically 5 to 15% of the stack) is enough to recover most of the in-context-learning and retrieval capabilities that pure SSMs lack, while the Mamba layers that make up the rest of the stack keep most of the computation linear in sequence length. The 2025 study "Hybrid Architectures for Language Models" (Lahoti et al., arXiv:2510.04800) found that interleaved hybrids outperform both pure Mamba and pure transformer baselines on most long-context benchmarks at matched compute.[20]

## What is Mamba used for?

### Natural language processing

Mamba and its variants have been applied to [language modeling](/wiki/language_modeling), [text generation](/wiki/text_generation_models), and related NLP tasks. Linear-time inference makes Mamba attractive for long-context generation or high-throughput serving. Codestral Mamba in IDE autocomplete, Jamba in enterprise retrieval-augmented generation, and Falcon Mamba for constant-memory streaming each illustrate production niches that pure transformers handle awkwardly.

### Genomics and DNA modeling

SSMs have proven especially effective for genomics, where sequences can stretch to millions of base pairs. Mamba-based models outperform transformer and convolution alternatives on DNA sequence modeling.[1] Caduceus, a bidirectional Mamba for DNA introduced by Schiff and colleagues at ICML 2024, outperformed comparably sized unidirectional models, as well as transformer models orders of magnitude larger, on tasks including predicting the effects of genetic mutations on gene expression.[21] Evo, a 7B genomic foundation model published in November 2024, is built on the StripedHyena architecture rather than Mamba, but it showed that the long-context advantages of sub-quadratic sequence models extend to generating functional protein and DNA sequences.

### Audio and speech

Mamba achieved state-of-the-art results on audio waveform modeling, outperforming SaShiMi, Hyena, and transformer-based models on both pretraining quality and downstream generation metrics.[1] The ability to process very long sequences efficiently makes SSMs a natural fit for raw audio, which requires modeling dependencies across tens of thousands of samples. Mamba variants have also been used for streaming automatic speech recognition, where the fixed-size state enables real-time decoding without the growing KV cache typical of transformer ASR.

### Computer vision

Vision Mamba (Vim), introduced in January 2024 and accepted at ICML 2024, adapts the Mamba architecture for visual tasks.[7] Vim uses bidirectional Mamba blocks with position embeddings to process image patch sequences.[7] On [ImageNet](/wiki/imagenet) classification, COCO object detection, and ADE20k semantic segmentation, Vim achieves higher performance than the DeiT vision transformer while being 2.8x faster and using 86.8% less GPU memory at high resolutions.[7]

Other vision variants include VMamba (NeurIPS 2024), VideoMamba for video understanding (ECCV 2024), Mamba-ND for multi-dimensional data, and U-Mamba for medical image segmentation.

### Multimodal and code

Codestral Mamba's release demonstrated that pure Mamba can match dedicated transformer code models on functional benchmarks like HumanEval.[14] Multimodal Mamba models such as Cobra (a vision-language model released in March 2024) and VL-Mamba showed that the architecture handles cross-modal sequence modeling, though hybrid attention-Mamba designs typically outperform pure Mamba on benchmarks that require fine-grained visual grounding.

## Tooling and ecosystem

The Mamba reference implementation lives in the `mamba-ssm` Python package maintained by Albert Gu and Tri Dao, which provides the fused CUDA selective-scan kernel and Mamba block layers compatible with PyTorch. The `causal-conv1d` package supplies the optimized 1D convolution used inside each Mamba block. Both packages target CUDA-capable GPUs (Ampere and newer) and form the foundation for most downstream training and inference stacks.

Production-grade serving frameworks gained Mamba support throughout 2024 and 2025:

| Framework | Mamba support |
|-----------|---------------|
| Hugging Face transformers | Native `MambaForCausalLM` and `Mamba2ForCausalLM` classes |
| vLLM | Selective scan and chunkwise SSD kernels for high-throughput serving |
| TensorRT-LLM | Optimized Mamba and hybrid kernels for Nvidia Hopper and Blackwell GPUs |
| llama.cpp | Mamba and Jamba inference on CPU and Apple Silicon via GGUF |
| MLC LLM | Mobile and WebGPU deployment of small Mamba models |

By mid-2025, these frameworks allowed Mamba and hybrid models to be deployed through the same serving workflows as transformers of equivalent size.

## What are Mamba's limitations?

Despite its efficiency advantages, Mamba has several known limitations compared to transformers:

**In-context learning.** Empirical studies show that Mamba and Mamba-2 lag behind transformers on in-context learning tasks. On the standard five-shot MMLU benchmark, Mamba models score approximately 15 points lower than similarly sized transformers after training on 1.1 trillion tokens.[8]

**Copying and retrieval.** Transformers can copy information from their input context with near-perfect accuracy up to their context length, while Mamba models begin to fail at copying for input sequences beyond approximately 500 tokens.[9] This limitation is fundamental: a constant-size state cannot faithfully store an arbitrary-length sequence. Generalized SSMs cannot copy input sequences uniformly unless the state size grows linearly with sequence length.[9]
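
The capacity argument can be made concrete with a simple counting bound. The sketch below is not the analysis from [9]; it only observes that a state holding a fixed number of bits cannot losslessly store more random tokens than those bits allow. The function name, state size, precision, and vocabulary size are hypothetical values chosen for illustration.

```python
import math


def max_copy_length(state_elements: int, bits_per_element: int,
                    vocab_size: int) -> int:
    """Upper bound on the number of uniformly random tokens that a
    fixed-size state could store without loss (counting argument)."""
    state_bits = state_elements * bits_per_element
    bits_per_token = math.log2(vocab_size)
    return math.floor(state_bits / bits_per_token)


# Hypothetical state: 4,096 values at 16 bits each, vocabulary of 32,768.
small_state = max_copy_length(4096, 16, 32768)
# Doubling the state doubles the bound; sequence length never enters.
large_state = max_copy_length(8192, 16, 32768)
print(small_state, large_state)
```

The bound is independent of the input length, so for any fixed state there is some length beyond which exact copying is impossible. It is only an upper limit: trained models can fail far earlier, as the roughly 500-token figure above indicates.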

**Multi-query associative recall.** Mamba struggles to retrieve specific key-value pairs from context, because its finite hidden state is overwhelmed as the number of pairs grows. On information retrieval tasks, pretrained transformers can outperform Mamba models that have 10x as many parameters.[9]

**State capacity.** All of these limitations stem from the same root cause: Mamba compresses the entire sequence history into a fixed-size state vector. Any information that does not survive this compression is permanently lost. Transformers avoid this problem by maintaining full access to all past tokens through the KV cache, at the cost of linear memory growth and quadratic computation.
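
The memory side of this trade-off can be estimated with simple arithmetic. The following sketch compares the size of a transformer KV cache, which grows with every token, against a Mamba-style recurrent state, which does not. Both model configurations are hypothetical, and the formulas are the usual back-of-the-envelope estimates (16-bit values, batch size 1) rather than measurements of any specific model.

```python
def kv_cache_bytes(seq_len: int, n_attn_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Keys plus values (factor of 2) for every layer, head, and token."""
    return (2 * n_attn_layers * n_kv_heads * head_dim
            * seq_len * bytes_per_value)


def ssm_state_bytes(n_mamba_layers: int, d_inner: int, d_state: int,
                    d_conv: int, bytes_per_value: int = 2) -> int:
    """Recurrent state (d_inner x d_state) plus the short convolution
    buffer (d_inner x d_conv) per layer; no dependence on sequence length."""
    return n_mamba_layers * d_inner * (d_state + d_conv) * bytes_per_value


# Hypothetical transformer: 32 layers, 8 KV heads of dimension 128.
kv_1k = kv_cache_bytes(1024, n_attn_layers=32, n_kv_heads=8, head_dim=128)
kv_256k = kv_cache_bytes(262144, n_attn_layers=32, n_kv_heads=8, head_dim=128)

# Hypothetical pure Mamba: 64 layers, d_inner=8192, d_state=16, d_conv=4.
ssm = ssm_state_bytes(n_mamba_layers=64, d_inner=8192, d_state=16, d_conv=4)

mib = 1024 ** 2
print(f"KV cache at 1K tokens:   {kv_1k / mib:.0f} MiB")
print(f"KV cache at 256K tokens: {kv_256k / mib:.0f} MiB")
print(f"SSM state at any length: {ssm / mib:.0f} MiB")
```

With these assumed settings the KV cache grows from 128 MiB at 1K tokens to 32 GiB at 256K tokens, while the recurrent state stays at 20 MiB regardless of length.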

**State tracking and formal language tasks.** Mamba and Mamba-2 cannot reliably solve tasks that require maintaining unbounded discrete state, such as evaluating parenthesis matching at arbitrary depth or computing parity over long binary strings.[11] Mamba-3's complex-valued state updates partially address this limitation, but transformers with sufficient depth remain stronger on formal-language and CoT-style reasoning chains.[5][11]
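
A toy scalar recurrence gives some intuition for why parity is hard for decay-only state updates. The sketch below is an illustration under simplifying assumptions, not the construction used in [11] or the Mamba-3 parameterization: it compares a transition restricted to the interval (0, 1], which can only shrink the state, with a transition that may multiply by -1 (a rotation by pi in the complex plane). All function names are hypothetical.

```python
def scan(bits, transition):
    """Scalar linear recurrence h_t = transition(x_t) * h_{t-1}, h_0 = 1."""
    h = 1.0
    for x in bits:
        h = transition(x) * h
    return h


def flip_transition(x):
    # Sign-changing update: a 1 bit flips the state, a 0 bit keeps it.
    return -1.0 if x == 1 else 1.0


def decay_transition(x, decay=0.5):
    # Decay-only update: stays in (0, 1], so the sign can never change.
    return decay if x == 1 else 1.0


def parity_from_sign(h):
    """Read parity off the sign of the state: negative means odd."""
    return 1 if h < 0 else 0


short = [1, 0, 1, 1]   # three ones, so parity is 1
long = [1] * 2001      # 2001 ones, so parity is 1

print(parity_from_sign(scan(short, flip_transition)))
print(scan(long, flip_transition))
print(scan(long, decay_transition))
```

The sign-flipping state stays at magnitude 1 and encodes parity exactly at any length, whereas the decay-only state never changes sign and, on the long input, underflows to zero in double precision, so the count it carried is lost.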

These limitations have motivated the hybrid architectures listed above. The dominant industry view by 2026 is that pure Mamba and pure transformer architectures are both Pareto-suboptimal for general-purpose LLMs, and that hybrids capture the best of both.[20]

## Explain like I'm 5 (ELI5)

Imagine you are listening to a very long story. A transformer is like writing down every single word of the story so you can look back at any word whenever you want. This works really well, but your notebook gets bigger and bigger, and it takes longer and longer to flip through all the pages.

Mamba is like keeping a summary in your head instead of writing everything down. As you hear each new word, you decide whether it is important enough to remember or whether you can forget it. You update your mental summary as you go. Your summary always stays the same size no matter how long the story gets, so you can listen to very, very long stories without running out of space. The downside is that if someone asks you to repeat word number 47 exactly, you might not remember it because you only kept a summary, not the full text.

The special trick Mamba uses is that it gets to decide what to remember based on what it is hearing right now. If it hears something important, it pays more attention and updates its summary. If it hears something less important, it mostly ignores it and keeps its existing summary. This is what "selective" means in "selective state space model."

Most modern systems use a clever combination: mostly summary keeping (Mamba) with a few pages of detailed notes (attention) sprinkled in where exact recall matters. That way you get the speed of summarizing and the precision of writing things down.

## References

1. Gu, A., & Dao, T. (2023). "Mamba: Linear-Time Sequence Modeling with Selective State Spaces." arXiv:2312.00752.
2. Gu, A., Goel, K., & Ré, C. (2021). "Efficiently Modeling Long Sequences with Structured State Spaces." *Proceedings of ICLR 2022*. arXiv:2111.00396.
3. Gu, A., Dao, T., Ermon, S., Rudra, A., & Ré, C. (2020). "HiPPO: Recurrent Memory with Optimal Polynomial Projections." *Proceedings of NeurIPS 2020*. arXiv:2008.07669.
4. Dao, T., & Gu, A. (2024). "Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality." *Proceedings of ICML 2024*. arXiv:2405.21060.
5. Lahoti, A., Li, K. Y., Chen, B., Wang, C., Bick, A., Kolter, J. Z., Dao, T., & Gu, A. (2026). "Mamba-3: Improved Sequence Modeling using State Space Principles." *Proceedings of ICLR 2026*. arXiv:2603.15569.
6. Lieber, O., Lenz, B., et al. (2024). "Jamba: A Hybrid Transformer-Mamba Language Model." arXiv:2403.19887.
7. Zhu, L., Liao, B., Zhang, Q., Wang, X., Liu, W., & Wang, X. (2024). "Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model." *Proceedings of ICML 2024*. arXiv:2401.09417.
8. Waleffe, R., Byeon, W., Riber, D., Norick, B., Korthikanti, V., Dao, T., Gu, A., Hatamizadeh, A., Singh, S., Narang, D., Micikevicius, P., & Catanzaro, B. (2024). "An Empirical Study of Mamba-based Language Models." arXiv:2406.07887.
9. Jelassi, S., Brandfonbrener, D., Kakade, S., & Malach, E. (2024). "Repeat After Me: Transformers are Better than State Space Models at Copying." *Proceedings of ICML 2024*.
10. Dao, T. (2024). "FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning." *Proceedings of ICLR 2024*. arXiv:2307.08691.
11. Schlag, I., Munkhdalai, T., & Schmidhuber, J. (2024). "Exploring the Limitations of Mamba in COPY and CoT Reasoning." arXiv:2410.03810.
12. Gao, L., et al. (2020). "The Pile: An 800GB Dataset of Diverse Text for Language Modeling." arXiv:2101.00027.
13. Wang, J., Gangavarapu, T., Yan, J. N., & Rush, A. M. (2024). "MambaByte: Token-free Selective State Space Model." *Conference on Language Modeling (COLM) 2024*. arXiv:2401.13660.
14. Mistral AI. (2024). "Codestral Mamba." Mistral AI blog, 16 July 2024.
15. Technology Innovation Institute. (2024). "TII Releases First SSLM with Falcon Mamba 7B." TII press release, 12 August 2024.
16. AI21 Labs. (2024). "The Jamba 1.5 Open Model Family: The Most Powerful and Efficient Long Context Models." AI21 blog, 22 August 2024.
17. AI21 Labs. (2025). "AI21's Jamba 1.6: The Best Open Model for Private Enterprise Deployment." AI21 blog, 6 March 2025.
18. Dong, X., Fu, Y., Diao, S., Byeon, W., Chen, Z., Mahabaleshwarkar, A. S., et al. (2024). "Hymba: A Hybrid-head Architecture for Small Language Models." Nvidia Research. arXiv:2411.13676.
19. IBM Research. (2025). "IBM Granite 4.0: Hybrid Mamba-2 / Transformer Architecture." IBM watsonx documentation, 2 October 2025.
20. Lahoti, A., Li, K. Y., et al. (2025). "Hybrid Architectures for Language Models: Systematic Analysis and Design Insights." arXiv:2510.04800.
21. Schiff, Y., Kao, C.-H., Gokaslan, A., Dao, T., Gu, A., & Kuleshov, V. (2024). "Caduceus: Bi-Directional Equivariant Long-Range DNA Sequence Modeling." *Proceedings of ICML 2024*. arXiv:2403.03234.

