Jamba Reasoning 3B
Last reviewed
Jun 8, 2026
Sources
6 citations
Review status
Source-backed
Revision
v1 · 1,612 words
Jamba Reasoning 3B is an open-weight small reasoning model released by the Israeli artificial-intelligence company AI21 Labs on October 8, 2025 [1][2]. The model has roughly 3 billion parameters and is built on AI21's Jamba hybrid architecture, which interleaves Mamba state space model (SSM) layers with Transformer attention layers. AI21 positions it as a "tiny" model designed to deliver competitive reasoning quality and a very long context window while running efficiently on consumer devices such as laptops and phones, rather than only in the data center [1][3]. It supports a context window of 256,000 tokens natively and up to roughly 1 million tokens with context extension, and is distributed under the permissive Apache License 2.0 [1][4].
AI21 Labs, founded in 2017 by Yoav Shoham, Ori Goshen, and Amnon Shashua, is one of the longest-running independent generative-AI companies and is known for an enterprise focus and for its work on alternatives to pure Transformer language models. In March 2024 the company introduced Jamba, described as the first production-grade model built on a hybrid SSM-Transformer architecture that combined Mamba layers, Transformer attention, and mixture-of-experts (MoE) routing. AI21 followed with the Jamba 1.5 family (Mini and Large) in August 2024, Jamba 1.6 in March 2025, and a second-generation Jamba 2 line. The original Jamba models were comparatively large MoE systems aimed at long-context enterprise workloads.
Jamba Reasoning 3B represents a different application of the same architectural idea: instead of scaling up for the cloud, AI21 scaled the hybrid design down into a compact, dense model intended for on-device and edge inference [1][3]. Co-CEO Ori Goshen framed the release around the economics of AI inference, arguing that a large share of routine tasks can be served by small models running locally rather than by expensive centralized compute. AI21 cited research indicating that 40 to 70 percent of AI tasks can be handled by small language models at 10x to 30x lower cost through intelligent routing, with on-device models processing simple requests locally while reserving cloud resources for harder reasoning [1].
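The routing claim can be turned into a quick back-of-the-envelope estimate. The sketch below is illustrative only: it assumes a fixed fraction of requests is served locally at a fixed cost reduction, with the remainder served in the cloud at a normalized cost of 1.0. The function name `blended_cost` and the normalization are assumptions made for this example and are not part of AI21's analysis.

```python
def blended_cost(local_fraction: float, cost_reduction: float,
                 cloud_cost: float = 1.0) -> float:
    """Average per-request cost when `local_fraction` of requests run on a
    small model that is `cost_reduction` times cheaper than the cloud model."""
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be between 0 and 1")
    if cost_reduction <= 0:
        raise ValueError("cost_reduction must be positive")
    local_cost = cloud_cost / cost_reduction
    return local_fraction * local_cost + (1.0 - local_fraction) * cloud_cost


# Endpoints of the cited ranges: 40-70 percent of tasks, 10x-30x cheaper.
conservative = blended_cost(0.40, 10)  # 0.40 * 0.1 + 0.60 * 1.0
optimistic = blended_cost(0.70, 30)    # 0.70 / 30 + 0.30 * 1.0

print(f"conservative: {conservative:.2f} of the all-cloud cost")
print(f"optimistic:   {optimistic:.2f} of the all-cloud cost")
```

Under these simplified assumptions the blended cost falls to between roughly two thirds and one third of an all-cloud deployment. Real savings would also depend on routing accuracy and on the cost of the router itself, which this sketch ignores.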
Like the rest of the Jamba family, Jamba Reasoning 3B uses a hybrid neural-network design that mixes two kinds of sequence layers. Most of the network consists of Mamba layers, a state space model variant whose memory and compute scale roughly linearly with sequence length, making them efficient for long inputs. A smaller number of standard Transformer self-attention layers are interleaved to capture the complex token-to-token dependencies that pure SSMs handle less well. According to AI21's model card, the network has 28 layers in total, composed of 26 Mamba layers and 2 attention layers, with the attention blocks using multi-query attention (20 query heads and a single shared key-value head) and a 64,000-token vocabulary [4].
Unlike the original 2024 Jamba models, Jamba Reasoning 3B is a dense model and does not use mixture-of-experts routing; AI21's documentation describes it solely as a hybrid Transformer-Mamba network and makes no reference to experts [4][5]. MoE was a feature of the larger Jamba releases, not of the 3B reasoning model. The hybrid design's main practical benefit is memory efficiency at long context. Because Mamba layers do not maintain a growing key-value (KV) cache the way attention does, AI21 reports that Jamba Reasoning 3B keeps a KV cache roughly 8 times smaller than a comparable "vanilla" Transformer, which is what allows long contexts to fit in the limited memory of a laptop or phone [1]. The model was trained in several stages: large-scale pre-training on diverse documents, a mid-training phase of about 0.5 trillion tokens emphasizing mathematics and code, supervised fine-tuning at 32K context, direct preference optimization at 64K context, and a final stage of reinforcement learning with verifiable rewards (RLVR) targeting code generation, mathematical problem solving, structured output, and information extraction [4].
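To make the KV-cache argument concrete, the following sketch applies the standard cache-size formula (keys plus values, per attention layer, per KV head, per token) to the layer counts given in the model card. The head dimension of 128 is an assumption, since the source does not state it, and the all-attention baseline is hypothetical. The sketch therefore does not attempt to reproduce AI21's "roughly 8 times" figure, which depends on the baseline AI21 chose. It also ignores the fixed-size recurrent state that Mamba layers keep.

```python
def kv_cache_bytes(attn_layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int = 2) -> int:
    """Bytes needed to cache keys and values for one sequence.

    The leading factor of 2 covers storing both K and V;
    bytes_per_value=2 corresponds to BF16/FP16 storage.
    """
    return 2 * attn_layers * kv_heads * head_dim * seq_len * bytes_per_value


def to_mib(num_bytes: int) -> float:
    return num_bytes / (1024 ** 2)


HEAD_DIM = 128      # ASSUMPTION: head dimension is not given in the source
SEQ_LEN = 256_000   # native context window cited in the article

# Hybrid layout from the model card: 2 attention layers, 1 shared KV head.
hybrid = kv_cache_bytes(attn_layers=2, kv_heads=1,
                        head_dim=HEAD_DIM, seq_len=SEQ_LEN)

# HYPOTHETICAL baseline: all 28 layers are attention, same single KV head.
all_attention = kv_cache_bytes(attn_layers=28, kv_heads=1,
                               head_dim=HEAD_DIM, seq_len=SEQ_LEN)

print(f"hybrid:        {to_mib(hybrid):.0f} MiB")
print(f"all-attention: {to_mib(all_attention):.0f} MiB")
print(f"ratio:         {all_attention / hybrid:.0f}x")
```

Because the cache grows linearly with both the number of attention layers and the number of KV heads, replacing most attention layers with Mamba layers and sharing a single KV head across query heads each shrink the cache independently of the other.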
AI21 released Jamba Reasoning 3B as open weights, publishing the model on Hugging Face (including a quantized GGUF version) and on Kaggle, and making it runnable through local-inference tools such as llama.cpp, Ollama, and LM Studio [1][2][4]. The release was framed as the first in a planned series of small reasoning models from AI21.
The model is built for tasks that benefit from local processing of large documents or long histories: AI21 lists use cases including on-device retrieval-augmented generation, processing of legal and medical documents, field-technician manuals, productivity and conversational assistants, and agentic systems that run partly or wholly on the device [1]. Its 256K native context window, extensible to about 1 million tokens via RoPE scaling, is unusually large for a model of this size; AI21 and several reviewers noted that previous small models such as Llama 3.2 3B and Microsoft's Phi-4-mini were typically limited to around 128K tokens [3][5].
| Attribute | Detail |
|---|---|
| Developer | AI21 Labs |
| Release date | October 8, 2025 [1][2] |
| Model type | Small reasoning model, dense hybrid SSM-Transformer |
| Parameters | ~3 billion (BF16) [4] |
| Architecture | 28 layers: 26 Mamba (SSM) + 2 attention; multi-query attention (20 query heads, 1 KV head); no MoE [4] |
| Vocabulary | 64,000 tokens [4] |
| Context window | 256,000 tokens native; up to ~1,000,000 with extension [1][4] |
| Training | Pre-training, ~0.5T-token math/code mid-training, SFT (32K), DPO (64K), RLVR [4] |
| License | Apache License 2.0 [1][4] |
| Availability | Hugging Face, Kaggle, llama.cpp, Ollama, LM Studio [1] |
| Reported throughput | ~40 tokens/second on an Apple M3 MacBook Pro at 32K context [1] |
AI21 presented Jamba Reasoning 3B as a leader among very small reasoning models on standard intelligence benchmarks, and its model card includes a comparison against other sub-5-billion-parameter open models [4]. On the IFBench instruction-following benchmark, AI21 reports a score of 52 percent, which it and the third-party evaluator Artificial Analysis described as the best among tiny models; AI21 also cited a score of 21 on the Artificial Analysis Intelligence Index, a composite measure [4][6]. The figures below are AI21's own reported numbers and should be read as vendor-reported benchmarks.
| Model | MMLU-Pro | Humanity's Last Exam | IFBench |
|---|---|---|---|
| Jamba Reasoning 3B | 61.0% | 6.0% | 52.0% |
| Qwen3 4B | 70% | 5.1% | 33% |
| Gemma 3 4B | 42% | 5.2% | 28% |
| Llama 3.2 3B | 35% | 5.2% | 26% |
| IBM Granite 4.0 Micro | 44.7% | 5.1% | 24.8% |
| Phi-4-mini | 47% | 4.2% | 21% |
Source: AI21 Labs model card [4].
By these figures Jamba Reasoning 3B leads its size class on IFBench and Humanity's Last Exam and is competitive on MMLU-Pro, where AI21's own table places it second to Alibaba's Qwen3 4B [4][5]. As with all vendor benchmarks, independent reproduction matters, and the absolute Humanity's Last Exam scores for every model in this class are low, reflecting how difficult that benchmark is for small models.
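The ranking statements can be checked directly against the table. The sketch below copies the vendor-reported scores into a dictionary and sorts each benchmark; the variable and function names are chosen for this example, and "HLE" abbreviates Humanity's Last Exam.

```python
# Vendor-reported scores (percent), copied from the comparison table.
SCORES = {
    "Jamba Reasoning 3B":    {"MMLU-Pro": 61.0, "HLE": 6.0, "IFBench": 52.0},
    "Qwen3 4B":              {"MMLU-Pro": 70.0, "HLE": 5.1, "IFBench": 33.0},
    "Gemma 3 4B":            {"MMLU-Pro": 42.0, "HLE": 5.2, "IFBench": 28.0},
    "Llama 3.2 3B":          {"MMLU-Pro": 35.0, "HLE": 5.2, "IFBench": 26.0},
    "IBM Granite 4.0 Micro": {"MMLU-Pro": 44.7, "HLE": 5.1, "IFBench": 24.8},
    "Phi-4-mini":            {"MMLU-Pro": 47.0, "HLE": 4.2, "IFBench": 21.0},
}


def ranking(benchmark: str) -> list[str]:
    """Model names ordered from highest to lowest score on one benchmark."""
    return sorted(SCORES, key=lambda name: SCORES[name][benchmark], reverse=True)


for benchmark in ("MMLU-Pro", "HLE", "IFBench"):
    order = ranking(benchmark)
    place = order.index("Jamba Reasoning 3B") + 1
    print(f"{benchmark}: Jamba Reasoning 3B ranks {place}, leader is {order[0]}")
```

The same dictionary makes the size of each lead visible: the IFBench gap to the runner-up is 19 points, while the Humanity's Last Exam gap is under 1 point.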
AI21's stronger claims concern efficiency rather than raw accuracy. The company reports that Jamba Reasoning 3B sustains about 40 tokens per second on an Apple M3 MacBook Pro at a 32,000-token context, and in an accompanying announcement stated that it runs 3x to 5x faster than Llama 3.2 3B and Qwen3 4B at 32K tokens [1][2]. AI21 also described "2x to 5x efficiency gains" over competitors including DeepSeek, Google, Meta, and Microsoft at long context, attributing the advantage to the small KV cache of the hybrid architecture [1]. Independent coverage reported broadly similar throughput, on the order of 35 tokens per second at 32K context on a MacBook Pro versus roughly 8 to 15 tokens per second for comparable Transformer models, with the model remaining usable (around 17 or more tokens per second) at its maximum context length [3].
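The independently reported throughput figures can be converted into a speedup range and a rough wall-clock estimate. The sketch below uses only the numbers quoted above; the 1,000-token response length is a hypothetical value chosen for illustration, and the function names are not from any source.

```python
def speedup_range(fast_tps: float, slow_tps_low: float,
                  slow_tps_high: float) -> tuple[float, float]:
    """Smallest and largest speedup of a fast model over a slower model
    whose throughput lies between slow_tps_low and slow_tps_high."""
    return fast_tps / slow_tps_high, fast_tps / slow_tps_low


def generation_seconds(tokens: int, tokens_per_second: float) -> float:
    """Time to generate `tokens` output tokens at a steady throughput."""
    return tokens / tokens_per_second


# Independent coverage: ~35 tok/s versus roughly 8 to 15 tok/s at 32K context.
low, high = speedup_range(35, 8, 15)
print(f"implied speedup: {low:.1f}x to {high:.1f}x")

# HYPOTHETICAL 1,000-token response at AI21's reported 40 tok/s.
print(f"1,000 tokens at 40 tok/s: {generation_seconds(1000, 40):.0f} s")
```

The implied speedup of roughly 2.3x to 4.4x falls inside the "2x to 5x" range AI21 describes, although both sets of figures come from a single hardware configuration.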
Jamba Reasoning 3B is notable as one of the clearest demonstrations that hybrid state space model architectures can be scaled down for on-device reasoning while preserving long context. By combining Mamba layers' near-linear scaling with a small amount of Transformer attention, it targets the memory bottleneck that normally prevents small models from handling very long inputs on consumer hardware. Its release sits within a broader 2025 wave of compact, efficient reasoning models, alongside Alibaba's Qwen3 4B, NVIDIA's Llama Nemotron Nano line, Microsoft's Phi-4-mini, Google's Gemma 3, and IBM Granite 4.0, several of which also adopt hybrid or SSM-influenced designs.
Strategically, the model reflects AI21's argument that the economics of AI will push a meaningful fraction of inference toward the edge, with small local models handling routine and agentic tasks and cloud models reserved for the hardest problems [1][3]. Its Apache 2.0 open-weight release lowers the barrier to that kind of distributed, on-device deployment and continues AI21's long-running effort to make the Jamba hybrid architecture, rather than the pure Transformer, a practical foundation for production language models.