Jamba is a family of large language models developed by AI21 Labs, first released in March 2024. It is notable for being the first production-grade language model to combine a Transformer architecture with a Mamba state space model (SSM) and a Mixture of Experts (MoE) routing mechanism within a single unified design. By interleaving Transformer attention layers with Mamba SSM layers and selectively activating only a fraction of total parameters through MoE, Jamba achieves high throughput, a small memory footprint, and strong benchmark performance, particularly on long-context tasks. The original Jamba paper was accepted and published as a conference paper at ICLR 2025.
AI21 Labs is an Israeli artificial intelligence company founded in 2017 by Yoav Shoham, Ori Goshen, and Amnon Shashua, headquartered in Tel Aviv. The company focuses on building language models for enterprise use. Before Jamba, AI21 Labs had developed the Jurassic series of Transformer-based language models. In March 2024, the company shifted direction by releasing Jamba, a model that broke from the Transformer-only paradigm that had dominated the field since 2017.
The motivation behind Jamba's hybrid design stems from the complementary strengths and weaknesses of Transformers and state space models. Pure Transformer models rely on self-attention mechanisms that scale quadratically with sequence length, making them computationally expensive for long contexts. They also require a key-value (KV) cache that grows linearly with context length, consuming large amounts of GPU memory. Transformers excel, however, at tasks requiring precise recall of specific information from the input context, a strength closely tied to their in-context learning ability.
Pure SSM models like Mamba, introduced by Albert Gu and Tri Dao in December 2023, offer linear-time sequence processing and a fixed-size state that does not grow with context length. This gives them substantial efficiency advantages for long sequences. However, research showed that pure Mamba models struggled with certain recall-intensive tasks where the model must retrieve and reproduce specific details from its input. AI21 Labs reasoned that combining both architectures could capture the efficiency of SSMs alongside the recall strength of Transformers.
Jamba's architecture is built around repeating units called Jamba blocks. Each Jamba block contains a fixed number of layers, where each layer is either a Mamba SSM layer or a Transformer attention layer, followed by a multilayer perceptron (MLP) sublayer. The key design choice is the ratio of attention layers to Mamba layers within each block.
In the original Jamba configuration (Jamba 1.0), the model uses an attention-to-Mamba ratio of 1:7. This means that out of every 8 layers in a Jamba block, 1 is a Transformer attention layer and the remaining 7 are Mamba SSM layers. The original model contains 4 Jamba blocks with 8 layers each, for a total of 32 layers. This ratio was determined through ablation studies that found a small number of attention layers was sufficient to recover the recall capabilities that pure Mamba models lacked, while preserving the efficiency benefits of the SSM layers.
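The resulting layer layout can be reconstructed from these numbers. The sketch below assumes attention sits at a fixed position within each block and that MoE replaces the MLP on every second layer; this reproduces the stated counts, but the exact in-block ordering is an illustrative assumption rather than AI21's published layout:

```python
# Sketch of the Jamba 1.0 layer schedule, inferred from the stated
# configuration: 4 blocks x 8 layers, 1:7 attention-to-Mamba ratio,
# MoE on every second layer. In-block ordering is assumed.
NUM_BLOCKS = 4
LAYERS_PER_BLOCK = 8
ATTN_EVERY = 8   # 1 attention layer per 8 layers (1:7 ratio)
MOE_EVERY = 2    # MoE replaces the plain MLP on every second layer

schedule = []
for i in range(NUM_BLOCKS * LAYERS_PER_BLOCK):
    mixer = "attention" if i % ATTN_EVERY == 0 else "mamba"
    mlp = "moe" if i % MOE_EVERY == 1 else "mlp"
    schedule.append((mixer, mlp))

# The stated totals fall out of the schedule:
assert sum(m == "attention" for m, _ in schedule) == 4   # 4 attention layers
assert sum(m == "mamba" for m, _ in schedule) == 28      # 28 Mamba layers
assert sum(e == "moe" for _, e in schedule) == 16        # 16 MoE layers
```

Note how few attention layers result: only 4 out of 32, which is what drives the KV-cache savings discussed later.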
The Mamba layers in Jamba use the Mamba-1 selective SSM formulation. AI21 Labs tested upgrading to Mamba-2 during the development of Jamba 1.5 but found that it did not yield improved performance, so they retained Mamba-1 blocks throughout the model family.
Jamba incorporates Mixture of Experts (MoE) into certain layers to increase the model's total capacity without proportionally increasing the computational cost at inference time. In the original Jamba configuration, MoE is applied at every other layer (every second layer), with 16 total experts per MoE layer and a top-2 routing strategy, meaning that for each input token, only the 2 most relevant experts are activated.
This MoE design is what creates the gap between Jamba's total parameter count and its active parameter count. In the original model, total parameters amount to 52 billion, but only 12 billion are active for any given token. The inactive experts contribute to the model's capacity (the breadth of knowledge encoded in its weights) without adding to the per-token compute cost.
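The top-2 routing step can be sketched for a single token as follows. The router weights, expert shapes, and renormalized-softmax weighting here are generic MoE conventions used for illustration, not details taken from the Jamba papers:

```python
import numpy as np

# Illustrative top-2 routing over 16 experts for one token.
# Router and expert parameters are random stand-ins.
rng = np.random.default_rng(0)
d_model, n_experts, top_k = 64, 16, 2

x = rng.standard_normal(d_model)                       # one token's hidden state
router_w = rng.standard_normal((n_experts, d_model))   # linear router
experts = rng.standard_normal((n_experts, d_model, d_model))

logits = router_w @ x                      # one score per expert
top = np.argsort(logits)[-top_k:]          # indices of the 2 best experts
weights = np.exp(logits[top]) / np.exp(logits[top]).sum()  # renormalized softmax

# Only the 2 selected experts run; the other 14 cost no compute for this token.
y = sum(w * (experts[i] @ x) for w, i in zip(weights, top))
assert y.shape == (d_model,)
```

Since all 16 expert matrices still exist in memory, this is also where the total-versus-active parameter gap comes from: capacity scales with 16 experts, compute with 2.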
One of Jamba's most significant practical advantages is its reduced KV cache memory usage. In a standard Transformer, every attention layer maintains a KV cache that stores key and value vectors for all tokens in the context window. This cache grows linearly with both the number of attention layers and the sequence length.
Because Jamba uses only 1 attention layer per 8 total layers, it has far fewer attention layers than a comparably sized pure Transformer. The Mamba layers use a fixed-size recurrent state instead of a growing KV cache. The result is a dramatically smaller memory footprint for long contexts.
| Model | Total Parameters | Active Parameters | KV Cache at 256K Tokens (16-bit) |
|---|---|---|---|
| Jamba | 52B | 12B | 4 GB |
| Mixtral 8x7B | 46.7B | 12.9B | 32 GB |
| Mistral 7B | 7.2B | 7.2B | 32 GB |
| LLaMA-2 70B | 70B | 70B | 128 GB |
As the table shows, Jamba's KV cache at 256K tokens is only 4 GB, compared to 32 GB for Mixtral and 128 GB for LLaMA-2 70B. This 8x to 32x reduction in cache memory enables Jamba to fit much longer contexts on a single GPU.
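The 4 GB figure is reproducible by arithmetic. With 32 layers at a 1:7 ratio, Jamba has 4 attention layers; assuming grouped-query attention with 8 KV heads of dimension 128 (plausible values at this scale, but not stated in the table), the cache size works out as follows:

```python
# Back-of-the-envelope KV-cache size at a 256K-token context.
# 4 attention layers follows from 32 layers at a 1:7 ratio;
# the 8 KV heads x 128 head dim are assumed GQA dimensions.
def kv_cache_gib(attn_layers, kv_heads, head_dim, seq_len, bytes_per_elem=2):
    # Each attention layer stores a key AND a value vector per token (factor of 2).
    return 2 * attn_layers * kv_heads * head_dim * seq_len * bytes_per_elem / 2**30

jamba = kv_cache_gib(attn_layers=4, kv_heads=8, head_dim=128, seq_len=256 * 1024)
dense = kv_cache_gib(attn_layers=32, kv_heads=8, head_dim=128, seq_len=256 * 1024)
print(jamba)   # 4.0  -- matches the table's Jamba entry
print(dense)   # 32.0 -- a 32-layer model with every layer attending
```

The 8x gap between the two calls is purely the attention-layer count: the Mamba layers' fixed-size state is negligible by comparison and does not grow with sequence length.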
Jamba 1.0 was released on March 28, 2024, making it the first production-scale model to successfully deploy a hybrid SSM-Transformer architecture. The model was released as an open-weights base model under the Apache 2.0 license and made available on Hugging Face.
The key specifications of Jamba 1.0 are:
| Specification | Value |
|---|---|
| Total parameters | 52 billion |
| Active parameters (per token) | 12 billion |
| Context window | 256K tokens |
| Jamba blocks | 4 |
| Layers per block | 8 |
| Total layers | 32 |
| Attention-to-Mamba ratio | 1:7 |
| MoE experts per layer | 16 |
| Active experts per token | 2 (top-2) |
| MoE frequency | Every 2nd layer |
| License | Apache 2.0 |
Jamba 1.0 was benchmarked against models with comparable parameter counts, primarily Mixtral 8x7B (46.7B total, 12.9B active) and LLaMA-2 70B (70B dense). The model performed competitively on standard academic benchmarks:
| Benchmark | Jamba | LLaMA-2 70B | Mixtral 8x7B |
|---|---|---|---|
| HellaSwag (10-shot) | 87.1 | 85.3 | 86.7 |
| WinoGrande (5-shot) | 82.5 | 80.2 | 81.2 |
| ARC-Easy | 73.5 | 80.2 | 77.6 |
| ARC-Challenge (25-shot) | 64.4 | 67.3 | 66.0 |
| PIQA (zero-shot) | 83.2 | 82.8 | 83.0 |
| BoolQ (10-shot) | 88.2 | 85.0 | 88.4 |
| GSM8K (3-shot CoT) | 59.9 | 55.3 | 60.4 |
| HumanEval (pass@1) | 29.3 | 29.9 | 34.8 |
| Natural Questions (5-shot) | 45.9 | 46.9 | 44.8 |
| TruthfulQA (zero-shot) | 46.4 | 44.9 | 46.8 |
| MMLU (5-shot) | 67.4 | 69.8 | 70.6 |
| BBH (3-shot) | 45.4 | 51.2 | 50.3 |
Jamba outperformed both Mixtral and LLaMA-2 70B on commonsense reasoning tasks such as HellaSwag, WinoGrande, and PIQA. On knowledge-intensive benchmarks like MMLU and BBH, LLaMA-2 70B and Mixtral held a slight edge, which was expected given that LLaMA-2 70B is a fully dense 70B-parameter model with nearly 6x more active parameters than Jamba.
Jamba's primary advantage over comparable models was throughput and memory efficiency rather than raw benchmark scores. On a single NVIDIA A100 80 GB GPU with 8K context and INT8 quantization, Jamba achieved approximately 3x the throughput of Mixtral 8x7B at batch size 16. On four A100 GPUs processing 128K-token contexts with a single batch, Jamba similarly delivered about 3x the throughput of Mixtral.
The model could fit a context of up to 140K tokens on a single 80 GB GPU, while Mixtral was limited to much shorter contexts on the same hardware due to its larger KV cache. This practical advantage made Jamba particularly attractive for applications requiring long document processing.
On long-context question-answering tasks, Jamba performed comparably to or slightly better than Mixtral:
| Dataset | Jamba (F1) | Mixtral (F1) |
|---|---|---|
| NarrativeQA | 0.30 | 0.29 |
| Natural Questions | 0.60 | 0.58 |
| LongFQA | 0.44 | 0.42 |
| CUAD | 0.44 | 0.46 |
| SFiction | 0.40 | 0.42 |
| Average | 0.44 | 0.43 |
On August 22, 2024, AI21 Labs released the Jamba 1.5 model family, consisting of two instruction-tuned models: Jamba 1.5 Mini and Jamba 1.5 Large. The release marked a significant scaling milestone: Jamba 1.5 Large was the first hybrid SSM-Transformer architecture scaled to nearly 400 billion total parameters.
Both models were released under the Jamba Open Model License and made available on Hugging Face, with deployment support through Google Cloud Vertex AI, Microsoft Azure, Amazon Bedrock, and NVIDIA NIM.
Jamba 1.5 Mini retains the same 52B total / 12B active parameter profile as the original Jamba 1.0. It includes improvements from continued pretraining and instruction tuning, with support for 9 languages: English, Spanish, French, Portuguese, Italian, Dutch, German, Arabic, and Hebrew. It maintains a 256K token context window.
Jamba 1.5 Large scales the hybrid architecture substantially:
| Specification | Jamba 1.5 Mini | Jamba 1.5 Large |
|---|---|---|
| Total parameters | 52B | 398B |
| Active parameters | 12B | 94B |
| Context window | 256K tokens | 256K tokens |
| Total layers | 32 | 72 |
| Jamba blocks | 4 | 9 |
| Layers per block | 8 | 8 |
| Attention-to-Mamba ratio | 1:7 | 1:7 |
| MoE experts | 16 | 16 |
| Active experts (top-k) | 2 | 2 |
Jamba 1.5 Large uses 9 Jamba blocks with 8 layers each, totaling 72 layers. It maintains the same 1:7 attention-to-Mamba ratio and 16-expert MoE configuration as the smaller model. The architecture also uses grouped-query attention for the Transformer layers.
To make Jamba 1.5 Large deployable on practical hardware, AI21 Labs developed a custom quantization technique called ExpertsInt8. This method quantizes the MoE and MLP layer weights (which account for over 85% of the model's parameters) to INT8 format while keeping activations in BF16. The approach has several advantages: it is fast (quantization takes only a few minutes), it does not require calibration data (avoiding an often unstable and time-consuming process), and it still supports BF16 for large activations. With ExpertsInt8, Jamba 1.5 Large fits on a single 8-GPU node while utilizing its full 256K context window.
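The general weight-only INT8 idea behind such a scheme can be illustrated as below: quantize an expert matrix with a per-output-channel symmetric scale, store it as INT8, and dequantize on the fly during the matrix multiply while the activation stays in floating point. This is a generic sketch of the technique, not AI21's actual ExpertsInt8 kernel:

```python
import numpy as np

# Weight-only INT8 quantization sketch (per-output-channel, symmetric).
# Illustrates the idea behind ExpertsInt8; not AI21's implementation.
rng = np.random.default_rng(0)
w = rng.standard_normal((256, 64)).astype(np.float32)   # one expert's weight matrix

scale = np.abs(w).max(axis=1, keepdims=True) / 127.0    # per-row scale factor
w_int8 = np.round(w / scale).astype(np.int8)            # stored at 1 byte/param

x = rng.standard_normal(64).astype(np.float32)          # activation stays float
y = (w_int8.astype(np.float32) * scale) @ x             # dequantize during matmul
y_ref = w @ x                                           # full-precision reference

assert np.abs(y - y_ref).max() < 0.5                    # small quantization error
```

Because the scale is derived directly from the weights' own maximum, no calibration data is needed, which is one of the properties the ExpertsInt8 description emphasizes.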
The memory efficiency advantages scale up with Jamba 1.5 Large:
| Model | KV Cache at 256K Tokens |
|---|---|
| Jamba 1.5 Mini | 4 GB |
| Jamba 1.5 Large | 9 GB |
| LLaMA 3.1 70B | 80 GB |
| LLaMA 3.1 405B | 252 GB |
| Mistral Large 2 | 88 GB |
Jamba 1.5 Large requires only 9 GB of KV cache for a 256K-token context, compared to 252 GB for LLaMA 3.1 405B. This represents roughly a 28x reduction in cache memory.
Jamba 1.5 models were evaluated against leading open-weight models on standard academic benchmarks:
| Benchmark | Jamba 1.5 Mini | Jamba 1.5 Large | LLaMA 3.1 8B | LLaMA 3.1 70B | Mistral Large 2 |
|---|---|---|---|---|---|
| MMLU | 69.7 | 80.0 | 69.4 | 83.6 | 82.5 |
| MMLU-Pro | 39.8 | 48.3 | 38.0 | 53.0 | 54.2 |
| GPQA | 32.3 | 36.9 | 27.0 | 36.0 | 40.7 |
| ARC-Challenge | 85.7 | 93.0 | 83.4 | 94.8 | 65.0 |
| BBH | 53.4 | 65.5 | 51.0 | 69.0 | 70.8 |
| HumanEval | 62.8 | 71.3 | 72.6 | 80.5 | 92.0 |
| GSM8K | 75.8 | 87.0 | 75.2 | 71.5 | 91.0 |
Jamba 1.5 Mini outperformed LLaMA 3.1 8B on most benchmarks despite having a similar active parameter count. Jamba 1.5 Large performed competitively with LLaMA 3.1 70B, trading wins across tasks: Mistral Large 2 and LLaMA 3.1 70B led on several knowledge and coding benchmarks, while Jamba 1.5 Large beat LLaMA 3.1 70B decisively on GSM8K (87.0 vs 71.5) and Mistral Large 2 on ARC-Challenge (93.0 vs 65.0).
On conversational and instruction-following benchmarks, the Jamba 1.5 models showed strong results:
| Benchmark | Jamba 1.5 Mini | Jamba 1.5 Large | LLaMA 3.1 70B | Mistral Large 2 |
|---|---|---|---|---|
| Arena Hard | 46.1 | 65.4 | 55.7 | 70.4 |
| WildBench | 42.4 | 48.5 | 49.8 | 56.3 |
Jamba 1.5 Large scored 65.4 on Arena Hard, surpassing LLaMA 3.1 70B (55.7) and approaching Mistral Large 2 (70.4). AI21 Labs noted that Jamba 1.5 Mini was the strongest model in its size class on Arena Hard, outperforming Mixtral 8x22B and Cohere Command-R+.
The RULER benchmark evaluates models' ability to maintain quality across different context lengths. Jamba 1.5 models demonstrated strong long-context performance:
| Context Length | Jamba 1.5 Mini | Jamba 1.5 Large |
|---|---|---|
| 4K | 95.7 | 96.7 |
| 8K | 95.2 | 96.6 |
| 16K | 94.7 | 96.4 |
| 32K | 93.8 | 96.0 |
| 64K | 92.7 | 95.4 |
| 128K | 89.8 | 95.1 |
| 256K | 86.1 | 93.9 |
| Average | 92.6 | 95.7 |
Both models maintained their effective context length at the full 256K tokens. By comparison, LLaMA 3.1 70B had an effective length of 64K on the RULER benchmark, and Mistral Large 2 had an effective length of 32K. This means the Jamba 1.5 models could reliably process and reason over much longer documents than their Transformer-only counterparts.
On January 8, 2026, AI21 Labs released Jamba 2, the third generation of the Jamba model family. Unlike the previous releases, which emphasized scaling and general-purpose performance, Jamba 2 focused on enterprise reliability, instruction following, and grounding (the ability to produce answers faithful to provided source material).
Jamba 2 was released in two variants:
| Specification | Jamba 2 3B | Jamba 2 Mini |
|---|---|---|
| Architecture | Dense SSM-Transformer | MoE SSM-Transformer |
| Total parameters | 3B | 52B |
| Active parameters | 3B | 12B |
| Context window | 256K tokens | 256K tokens |
| License | Apache 2.0 | Apache 2.0 |
The Jamba 2 3B is a dense model (no MoE routing), small enough to run on consumer devices including smartphones, laptops, and desktop computers. The Jamba 2 Mini retains the MoE architecture with 52B total and 12B active parameters.
Jamba 2 models were built from Jamba 1.5 pretraining checkpoints and then mid-trained on 500 billion carefully curated tokens with a higher representation of math and code, alongside high-quality web data and long documents. Training included a state-passing phase to optimize the Mamba layers for context length generalization, followed by cold-start supervised fine-tuning, Direct Preference Optimization (DPO), and multiple on-policy reinforcement learning phases using a combination of verifiable and model-based rewards.
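The DPO phase mentioned above optimizes a simple preference objective. The sketch below shows the standard DPO loss for one preference pair (chosen vs. rejected response), using toy log-probabilities rather than model outputs; the beta value and the numbers are illustrative, not AI21's training settings:

```python
import math

# Standard Direct Preference Optimization (DPO) loss for one preference pair.
# logp_* are the policy's sequence log-probs; ref_* are the frozen
# reference model's. All values here are toy numbers.
def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    # Margin by which the policy prefers the chosen response more than
    # the reference model does.
    margin = (logp_chosen - ref_chosen) - (logp_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))  # -log sigmoid

# Loss shrinks as the policy moves toward the chosen response relative
# to the reference; with no movement it sits at log(2).
assert dpo_loss(-10.0, -12.0, -11.0, -11.0) < dpo_loss(-11.0, -11.0, -11.0, -11.0)
```

Unlike the reinforcement learning phases that follow it, DPO needs no reward model: the frozen reference policy plays that role implicitly.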
Jamba 2 models were evaluated primarily on enterprise-relevant benchmarks measuring instruction following and grounding.
In blind side-by-side human evaluations on 100 real-world enterprise prompts, Jamba 2 Mini achieved a statistically significant advantage over Ministral 14B across factuality, style, constraint adherence, and helpfulness criteria. The evaluation focus shifted away from standard academic benchmarks toward practical enterprise reliability measures, reflecting AI21 Labs' positioning of Jamba 2 as a component in production agent systems.
The hybrid SSM-Transformer approach addresses a fundamental tension in language model design. Transformers provide powerful in-context learning and recall capabilities through their attention mechanism, but their quadratic scaling with sequence length and growing KV cache create practical bottlenecks for long-context applications. SSMs like Mamba provide linear-time processing and constant memory usage regardless of sequence length, but they struggle with precise information retrieval from long contexts.
Jamba's solution is to use just enough attention layers (1 out of every 8) to provide the recall capability, while relying on Mamba layers for the bulk of sequence processing. This yields several practical benefits:

- a KV cache roughly an order of magnitude smaller than that of comparable Transformers at long context lengths;
- approximately 3x the throughput of Mixtral 8x7B in long-context serving;
- the ability to fit contexts of up to 140K tokens on a single 80 GB GPU;
- benchmark performance competitive with pure Transformer models of similar active parameter count.
The success of Jamba's hybrid approach has influenced the broader research community. Following Jamba's release, several other research groups began exploring hybrid SSM-Transformer architectures, validating the idea that combining the two paradigms yields better efficiency-quality tradeoffs than either architecture alone.
The following table provides an overview comparison of Jamba models with other notable language models:
| Model | Developer | Release | Architecture | Total Params | Active Params | Context Length | License |
|---|---|---|---|---|---|---|---|
| Jamba 1.0 | AI21 Labs | Mar 2024 | Hybrid SSM-Transformer + MoE | 52B | 12B | 256K | Apache 2.0 |
| Jamba 1.5 Mini | AI21 Labs | Aug 2024 | Hybrid SSM-Transformer + MoE | 52B | 12B | 256K | Jamba Open Model License |
| Jamba 1.5 Large | AI21 Labs | Aug 2024 | Hybrid SSM-Transformer + MoE | 398B | 94B | 256K | Jamba Open Model License |
| Jamba 2 3B | AI21 Labs | Jan 2026 | Dense SSM-Transformer | 3B | 3B | 256K | Apache 2.0 |
| Jamba 2 Mini | AI21 Labs | Jan 2026 | Hybrid SSM-Transformer + MoE | 52B | 12B | 256K | Apache 2.0 |
| Mixtral 8x7B | Mistral AI | Dec 2023 | Transformer + MoE | 46.7B | 12.9B | 32K | Apache 2.0 |
| LLaMA 2 70B | Meta AI | Jul 2023 | Dense Transformer | 70B | 70B | 4K | LLaMA 2 License |
| LLaMA 3.1 70B | Meta AI | Jul 2024 | Dense Transformer | 70B | 70B | 128K | LLaMA 3.1 License |
| LLaMA 3.1 405B | Meta AI | Jul 2024 | Dense Transformer | 405B | 405B | 128K | LLaMA 3.1 License |
| Mamba 3B | Albert Gu, Tri Dao | Dec 2023 | Pure SSM | 3B | 3B | Variable | Apache 2.0 |
Jamba models are available through multiple channels:

- open weights on Hugging Face;
- managed deployment via Google Cloud Vertex AI, Microsoft Azure, Amazon Bedrock, and NVIDIA NIM.
The Jamba architecture is integrated into the Hugging Face Transformers library as a first-class model type, with support for features like Flash Attention 2 for the attention layers.
Despite its architectural advantages, Jamba has several known limitations. On knowledge-intensive benchmarks like MMLU and BBH, the original Jamba 1.0 scored below LLaMA-2 70B and Mixtral 8x7B, likely because its 12B active parameters encode less factual knowledge than larger dense models. The Jamba 1.5 models partially closed this gap through continued pretraining and scaling, but Jamba 1.5 Large still trailed Mistral Large 2 and LLaMA 3.1 70B on several coding and reasoning benchmarks.
The Mamba components of the architecture are also less mature in terms of hardware optimization compared to the extensively optimized Transformer attention kernels. While custom CUDA kernels exist for Mamba, the Transformer ecosystem has had years of optimization work (FlashAttention, PagedAttention, etc.) that give pure Transformer models an implementation advantage that may narrow the theoretical efficiency gap in practice.
Additionally, the MoE architecture means that while only 12B parameters are active per token, all 52B (or 398B for Jamba 1.5 Large) parameters must be loaded into GPU memory. This creates a minimum memory requirement that exceeds what the active parameter count alone would suggest.