IBM Granite 4.0
Last reviewed: Jun 8, 2026
Sources: 9 citations
Review status: Source-backed
Revision: v1 · 1,591 words
IBM Granite 4.0 is the fourth generation of IBM's Granite family of open-weight enterprise large language models, released on October 2, 2025.[1][2] Its headline change is a hybrid architecture that replaces most of the self-attention layers found in conventional Transformers with Mamba-2 state-space-model layers, retaining only a minority of attention blocks and adding mixture-of-experts (MoE) routing in some variants.[2][3] IBM positions Granite 4.0 as a cost-efficient foundation for enterprise and agentic workloads, claiming that the hybrid design cuts memory use by more than 70 percent for long-context and concurrent-session inference, relative to comparable conventional Transformer models, while preserving instruction-following and tool-use quality.[2][4] The models are released as open weights under the Apache 2.0 license and, according to IBM, were the first open model family covered by an accredited ISO/IEC 42001 certification for AI management systems.[1][2]
Granite 4.0 is a family of small to mid-sized language models intended to run on modest hardware rather than large GPU clusters. The initial October 2025 launch comprised four models, each available in base (pretrained) and instruction-tuned form: Granite-4.0-H-Small (about 32 billion total parameters with roughly 9 billion active), Granite-4.0-H-Tiny (about 7 billion total with roughly 1 billion active), Granite-4.0-H-Micro (about 3 billion dense parameters), and a non-hybrid Granite-4.0-Micro (about 3 billion dense parameters using a conventional Transformer stack).[2][3][5] The "H" in a model name denotes the hybrid Mamba-2/Transformer architecture, while the plain Micro model is provided as a pure-Transformer option for runtimes and tooling that do not yet support state-space layers.[2][6]
IBM frames the release around efficiency rather than raw scale: instead of competing with the largest frontier systems, Granite 4.0 targets strong performance per dollar for tasks such as retrieval-augmented generation (RAG), function calling, and customer-service or document-processing automation.[2][4] Press coverage at launch nicknamed the release a "Western Qwen," comparing IBM's open, efficiency-focused strategy to that of Alibaba's Qwen family.[3] All models were trained on more than 15 trillion tokens.[2][5]
Granite is IBM's line of enterprise foundation models, first introduced in 2023 and progressively opened up over subsequent releases. Earlier generations included the Granite Code models for software tasks and the Granite 3.x series of general-purpose language and instruction models.[2] Granite 4.0 is the direct successor to Granite 3.x, and IBM reports that the smaller Granite 4.0 models match or exceed the older Granite 3.3 8B model despite using fewer active parameters.[2][6] The Granite family is distributed through IBM's watsonx.ai platform and a range of third-party channels, and IBM offers contractual indemnification for Granite models used through watsonx, a feature aimed at enterprises wary of intellectual-property risk from open models.[2]
A defining trait across the Granite line is an emphasis on data governance and provenance: IBM states that Granite 4.0 was trained on a mix of permissively licensed open datasets, curated synthetic data, and human-authored examples, with documentation intended to support enterprise compliance review.[2][5] Granite 4.0 continues this positioning while making efficiency the central technical story.
The core innovation of Granite 4.0 is its hybrid sequence-mixing design. In the H variants, layers alternate between Mamba-2 state-space blocks and conventional Transformer attention blocks in a roughly 9-to-1 ratio, meaning the large majority of layers are Mamba-2 and only a small fraction use self-attention.[2][3][7] Mamba is a state-space model whose compute and memory cost grows linearly with sequence length, in contrast to the quadratic cost of standard attention. By keeping just enough attention layers to preserve in-context recall and reasoning while delegating most sequence processing to Mamba-2, IBM aims to capture the efficiency of state-space models without the quality loss seen in some pure-SSM systems.[2][3]
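The scaling argument above can be illustrated with a small sketch. It assumes a toy cost model in which an attention layer costs on the order of the sequence length squared and a Mamba-2 layer costs on the order of the sequence length, with all constant factors ignored. The function names and the exact placement of the attention layers (one after every nine Mamba-2 layers) are hypothetical and are not taken from IBM's implementation.

```python
def hybrid_layer_pattern(n_layers: int, mamba_per_attention: int = 9) -> list[str]:
    """Build a layer-type list with one attention layer after every
    `mamba_per_attention` Mamba-2 layers (placement is an assumption)."""
    period = mamba_per_attention + 1
    return [
        "attention" if (i + 1) % period == 0 else "mamba2"
        for i in range(n_layers)
    ]


def relative_mixing_cost(pattern: list[str], seq_len: int) -> int:
    """Toy sequence-mixing cost: seq_len**2 per attention layer,
    seq_len per Mamba-2 layer. Constant factors are deliberately ignored."""
    return sum(seq_len ** 2 if kind == "attention" else seq_len for kind in pattern)


pattern = hybrid_layer_pattern(40)   # 36 Mamba-2 layers, 4 attention layers
all_attention = ["attention"] * 40   # hypothetical pure-Transformer stack

for seq_len in (1_000, 8_000, 128_000):
    ratio = relative_mixing_cost(pattern, seq_len) / relative_mixing_cost(all_attention, seq_len)
    print(f"{seq_len:>7} tokens: hybrid mixing cost is {ratio:.4f} of the all-attention stack")
```

Under this toy model the hybrid stack's mixing cost approaches one tenth of the all-attention stack as sequences grow, because the quadratic term of the four attention layers dominates. Real throughput depends on kernel efficiency, feedforward cost, and hardware, none of which are modeled here.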
According to the published model card for Granite-4.0-H-Small, the model uses 36 Mamba-2 layers and 4 attention layers, with 72 experts (10 active per token) in its MoE feedforward blocks and 32 attention heads sharing 8 key-value heads.[5] Granite-4.0-H-Tiny similarly uses MoE routing (about 1 billion of its 7 billion parameters active per token), while the Micro models use dense feedforward layers instead of expert routing.[5][6]
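The H-Small figures above can be collected into a small record to make the derived ratios explicit. This is a minimal sketch: the class and field names are hypothetical and are not the keys used in the actual Hugging Face configuration file. Reading 32 attention heads over 8 key-value heads as grouped-query attention is an inference from those two numbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HybridMoEConfig:
    """Illustrative container for the layer and expert counts quoted above.
    Field names are hypothetical, not real configuration keys."""
    mamba2_layers: int
    attention_layers: int
    num_experts: int
    experts_per_token: int
    attention_heads: int
    kv_heads: int

    @property
    def total_layers(self) -> int:
        return self.mamba2_layers + self.attention_layers

    @property
    def attention_fraction(self) -> float:
        """Share of layers that use self-attention."""
        return self.attention_layers / self.total_layers

    @property
    def expert_activation_fraction(self) -> float:
        """Share of experts routed to for each token."""
        return self.experts_per_token / self.num_experts

    @property
    def queries_per_kv_head(self) -> int:
        """Query heads sharing each key-value head (grouped-query attention)."""
        return self.attention_heads // self.kv_heads


h_small = HybridMoEConfig(
    mamba2_layers=36,
    attention_layers=4,
    num_experts=72,
    experts_per_token=10,
    attention_heads=32,
    kv_heads=8,
)

print(f"layers: {h_small.total_layers}, attention share: {h_small.attention_fraction:.0%}")
print(f"experts active per token: {h_small.expert_activation_fraction:.1%}")
print(f"query heads per key-value head: {h_small.queries_per_kv_head}")
```

The 4-in-40 attention share matches the roughly 9-to-1 ratio described for the H variants, and sharing each key-value head across four query heads shrinks the key-value cache of the remaining attention layers.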
Two further design choices reinforce the efficiency goal. First, the hybrid models use no positional encoding (NoPE), omitting the explicit rotary or learned position signals common in Transformers; IBM reports this does not harm long-context behavior in the hybrid setup.[2][5] Second, because the dominant Mamba-2 layers do not maintain a growing key-value cache, memory consumption stays far flatter as context length and the number of concurrent requests rise. IBM states this yields more than a 70 percent reduction in RAM for long-context and multi-session inference relative to comparable conventional Transformer LLMs, allowing models such as H-Small to serve multiple long-context sessions on a single entry-level data-center GPU like the NVIDIA L40S.[2][4][7] The Granite 4.0 models were trained on data samples up to 512K tokens in length, with performance validated on tasks up to 128K tokens.[2][5]
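The effect of having fewer attention layers on the key-value cache can be estimated with simple arithmetic. The sketch below is not IBM's measurement and does not reproduce the 70 percent figure: it counts only the key-value cache, ignores model weights and the fixed-size Mamba-2 state, and assumes a head dimension of 128 and 16-bit cache values, neither of which is stated in the text above. The all-attention comparison model is hypothetical.

```python
def kv_cache_bytes(
    attention_layers: int,
    kv_heads: int,
    head_dim: int,
    context_tokens: int,
    sessions: int = 1,
    bytes_per_value: int = 2,
) -> int:
    """Approximate key-value cache size. The leading factor of 2 covers
    one key tensor and one value tensor per attention layer."""
    return (
        2 * attention_layers * kv_heads * head_dim
        * context_tokens * sessions * bytes_per_value
    )


HEAD_DIM = 128          # assumption, not taken from the model card
CONTEXT = 128 * 1024    # 128K tokens
SESSIONS = 8            # hypothetical number of concurrent sessions

hybrid = kv_cache_bytes(4, 8, HEAD_DIM, CONTEXT, SESSIONS)
all_attention = kv_cache_bytes(40, 8, HEAD_DIM, CONTEXT, SESSIONS)

gib = 1024 ** 3
print(f"4 attention layers:  {hybrid / gib:.0f} GiB of KV cache")
print(f"40 attention layers: {all_attention / gib:.0f} GiB of KV cache")
```

Because the cache grows in proportion to both context length and session count, cutting the number of attention layers reduces the slope of that growth rather than a fixed overhead, which is why the savings matter most for long-context and multi-session serving.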
The table below summarizes the initial October 2025 models. Active-parameter and architecture figures are as reported by IBM and the Hugging Face model cards; benchmark numbers are IBM's own and should be read as vendor-reported.[2][5]
| Model | Total params | Active params | Architecture | Context (validated) | License |
|---|---|---|---|---|---|
| Granite-4.0-H-Small | ~32B | ~9B | Hybrid Mamba-2/Transformer, MoE | 128K | Apache 2.0 |
| Granite-4.0-H-Tiny | ~7B | ~1B | Hybrid Mamba-2/Transformer, MoE | 128K | Apache 2.0 |
| Granite-4.0-H-Micro | ~3B | ~3B (dense) | Hybrid Mamba-2/Transformer, dense | 128K | Apache 2.0 |
| Granite-4.0-Micro | ~3B | ~3B (dense) | Conventional Transformer, dense | 128K | Apache 2.0 |
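As a rough illustration of the table, the share of parameters used per token can be computed from the approximate totals. The figures below are the rounded values from the table (in billions), so the resulting percentages are only indicative; the dictionary and function names are hypothetical.

```python
# (total parameters, active parameters) in billions, rounded as in the table
models = {
    "Granite-4.0-H-Small": (32, 9),
    "Granite-4.0-H-Tiny": (7, 1),
    "Granite-4.0-H-Micro": (3, 3),
    "Granite-4.0-Micro": (3, 3),
}


def active_fraction(total_b: float, active_b: float) -> float:
    """Fraction of parameters used for each token."""
    return active_b / total_b


for name, (total_b, active_b) in models.items():
    share = active_fraction(total_b, active_b)
    print(f"{name}: about {share:.0%} of parameters active per token")
```

The MoE models touch only a fraction of their weights for each token, while the dense Micro models use all of them.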
IBM also signaled a broader roadmap. A larger Granite 4.0 Medium model for heavier enterprise workloads was announced as planned for later in 2025.[2][4] On October 28, 2025, IBM released the Granite 4.0 Nano series, a set of eight very small models in 350M and roughly 1B parameter sizes, each offered in both hybrid state-space and pure-Transformer variants and in base and instruction-tuned form, all under Apache 2.0.[8][9] The Nano models reuse the Granite 4.0 training methodology and are designed for on-device and edge deployment, including running locally in a web browser, while still providing tool-use and instruction-following capability.[8][9] The supported languages across the Granite 4.0 instruct models include English, German, Spanish, French, Japanese, Portuguese, Arabic, Czech, Italian, Korean, Dutch, and Chinese.[5]
Granite 4.0 is released under the permissive Apache 2.0 license, with open weights distributed on Hugging Face and through IBM watsonx.ai, Docker Hub, Kaggle, LM Studio, NVIDIA NIM, Ollama, Replicate, Dell Pro AI Studio and Enterprise Hub, and other partners.[1][2] Beyond licensing, IBM emphasizes a set of governance and security features intended for regulated enterprise buyers.
The models were, according to IBM, the world's first open models covered by an accredited ISO/IEC 42001:2023 certification, an international standard for AI management systems addressing security, governance, and transparency.[1][2] The released model checkpoints are cryptographically signed so that deployers can verify provenance and integrity.[1][2] IBM also extends its standard intellectual-property indemnification to Granite 4.0 models used through watsonx and trains the models on data it describes as governed and provenance-documented.[2] The models were trained on a CoreWeave-hosted cluster of NVIDIA GB200 NVL72 systems.[5]
IBM reports that Granite 4.0 delivers strong results on enterprise-relevant tasks for its size. On the IFEval instruction-following benchmark, as measured through Stanford's HELM framework, IBM states that Granite-4.0-H-Small exceeds all open-weight models except Meta's much larger Llama 4 Maverick.[2] The H-Small model card lists an IFEval average of 87.55, a HumanEval pass@1 of 88, and a GSM8K 8-shot score of 87.27.[5] IBM further claims competitive results on the Berkeley Function Calling Leaderboard v3 (BFCLv3) and on multi-turn RAG evaluation (MTRAG), arguing that Granite 4.0 keeps pace with substantially larger models at lower cost on agentic and retrieval workloads.[2][4] These figures are vendor-reported and have not been independently reproduced here; they should be attributed to IBM.
In context, Granite 4.0 is significant as one of the first production-grade open model families to commit to a hybrid Mamba/Transformer architecture at scale, alongside contemporaneous hybrid efforts such as Qwen3-Next and earlier research systems like NVIDIA's Nemotron-H and AI21's Jamba.[3][7] Where competitors such as Qwen, Llama, Mistral, and Microsoft's Phi compete largely on benchmark quality and openness, IBM's distinctive pitch combines that openness with formal governance credentials (ISO/IEC 42001 certification, checkpoint signing, indemnification) and an efficiency-first architecture aimed squarely at reducing the cost of deploying capable open models in the enterprise.[2][3][4] IBM continued the line with a Granite 4.1 update. The defining bet of the Granite 4.0 generation is that small, governed, memory-efficient models can serve a large share of real enterprise demand.[2][4]