IBM Granite
Last reviewed
May 31, 2026
Sources
18 citations
Review status
Source-backed
Revision
v1 · 2,253 words
IBM Granite is a family of open foundation models built by IBM for enterprise use. The family spans general purpose large language models, code models, safety classifiers, speech recognition, document conversion, text embeddings, and time series forecasting. Since 2024, IBM has released the weights for most Granite models under the permissive Apache 2.0 license, and it distributes them through its own watsonx platform as well as Hugging Face and several cloud partners. Granite is IBM's answer to a specific question that large companies kept asking: can you get a model that is open enough to inspect and self-host, small enough to run cheaply, and documented enough to pass a compliance review? [1][2]
The family is best known for two things. The first is IBM's heavy emphasis on transparency and governance, including disclosure of training data sources and, with the 4.0 generation, an ISO 42001 certification and cryptographic signing of model checkpoints. The second is the 4.0 release in October 2025, which moved most of the family onto a hybrid architecture that mixes Mamba state space layers with a small number of transformer attention layers to cut memory use sharply. [3][4]
Granite started inside IBM's watsonx push in 2023. The brand covered a set of proprietary enterprise models trained on business, legal, and technical text, and the first public Granite paper appeared in September 2023. [5] Over 2024 and 2025 the focus shifted toward open weights, and the lineup grew into a broad toolkit rather than a single chatbot model.
IBM positions Granite for work like summarization, classification and extraction, question answering over private documents, code generation, function calling, and multilingual dialog. The models target English plus a set of other languages including German, Spanish, French, Japanese, Portuguese, Arabic, Czech, Italian, Korean, Dutch, and Chinese. [6] The pitch is not that any single Granite model beats the largest frontier systems on every leaderboard. It is that a company can deploy a small, well documented, openly licensed model on its own hardware, fine tune it on internal data, and know where the training data came from. That combination matters more in a bank or a hospital than a few extra points on a public benchmark.
The openness here is real but worth stating precisely. The model weights are Apache 2.0, which permits commercial use, modification, and redistribution. IBM also publishes technical reports and lists the data sources used for training, which is more than many other open weight releases offer. IBM does not, however, release the full training datasets or the training code, so Granite is open weight with strong documentation rather than fully open source in the strictest sense. [2][7]
The open Granite line came together quickly across 2024 and 2025.
Granite Code arrived first, in May 2024. IBM released base and instruction tuned code models at 3B, 8B, 20B, and 34B parameters under Apache 2.0. These are decoder only models trained on code from 116 programming languages, aimed at code generation, explanation, and repair. The smaller models saw roughly four trillion tokens of code data in a first phase, followed by a second phase that mixed in technical, math, and web text. [8][9]
Granite 3.0 followed in October 2024 as the first broad general purpose open release. It included dense base and instruct models at 2B and 8B, two mixture of experts models for low latency serving, and the first Granite Guardian safety models. The dense models launched with a 4K context window. IBM reported training on a large multi trillion token corpus and published the data documentation alongside the weights. [1]
Granite 3.1 came in December 2024. Its main change was context length, extended from 4K to 128K tokens through a progressive training recipe, along with general quality improvements and an embedding model release. [10]
Granite 3.2, released on 26 February 2025, added optional chain of thought reasoning that developers can switch on or off, so the model only spends extra compute when a task needs it. It also brought Granite Vision 3.2 2B, a compact vision language model aimed at document understanding such as tables and charts. [11]
Granite 3.3, released on 16 April 2025, introduced Granite Speech 3.3 8B, IBM's first official speech to text model with translation, plus fill in the middle support for code and a set of retrieval focused LoRA adapters. [12]
Granite 4.0, released on 2 October 2025, is the headline generation and is covered in its own section below. A follow on 4.1 release later refined the lineup with dense models at 3B, 8B, and 30B and updated speech, vision, guardian, and embedding components. IBM reported that the 4.1 8B instruct model matches or beats the earlier 4.0 32B mixture of experts model while using a simpler architecture, and that the family trained on roughly 15 trillion tokens with context extended toward 512K. [3]
Granite 4.0 is where the family changed shape. Most Granite 4.0 models drop the pure transformer design in favor of a hybrid that interleaves Mamba 2 state space layers with a small number of standard attention layers. In the H-Small model card, the stack runs 36 Mamba 2 layers to 4 attention layers, which matches the 9 to 1 ratio IBM describes as canonical for the hybrid design. The models also drop explicit positional encodings, a setup IBM labels NoPE. [4][13]
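To make the 9 to 1 ratio concrete, here is a minimal Python sketch that builds a 40 layer schedule with one attention layer per group of ten. The `hybrid_schedule` helper and the evenly spaced placement are hypothetical; the model card gives the layer counts, but the actual positions of the attention layers in H-Small may differ.

```python
def hybrid_schedule(num_layers: int, attention_every: int) -> list[str]:
    """Return a list of layer types with one attention layer per `attention_every` layers.

    Hypothetical illustration: attention sits in the last slot of each group,
    and every other slot is a Mamba 2 block.
    """
    if num_layers % attention_every != 0:
        raise ValueError("num_layers must be a multiple of attention_every")
    schedule = []
    for i in range(num_layers):
        if (i + 1) % attention_every == 0:
            schedule.append("attention")
        else:
            schedule.append("mamba2")
    return schedule


layers = hybrid_schedule(num_layers=40, attention_every=10)
print(layers.count("mamba2"), layers.count("attention"))  # 36 4
```

Because positional encodings are dropped (NoPE), nothing in this schedule needs position embeddings; the Mamba layers already see tokens in order.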
The motivation is memory. In a transformer, attention compute grows with the square of the sequence length, so doubling the context roughly quadruples that work. Each attention layer also keeps a key value cache that grows with the length of every sequence and with the number of concurrent requests. Mamba layers instead process a sequence in linear time and carry a fixed size state, so their memory does not grow with context length. By making most layers Mamba and keeping a few attention layers for the precise recall that attention is good at, Granite 4.0-H can, according to IBM, cut the RAM needed for long inputs and many concurrent requests by more than 70 percent compared with a conventional transformer of similar quality. That translates into running the same workload on cheaper GPUs, which is the entire point for IBM's enterprise buyers. [3][4][14]
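The scaling argument can be checked with back of the envelope arithmetic. The sketch below counts only long context state: the key value cache for attention layers and a fixed recurrent state for Mamba layers. Every dimension (8 KV heads, head size 128, fp16, 4 MiB of state per Mamba layer) is a hypothetical placeholder, not Granite's real configuration.

```python
BYTES_FP16 = 2
GiB = 2**30


def kv_cache_bytes(attn_layers: int, seq_len: int, kv_heads: int = 8,
                   head_dim: int = 128, batch: int = 1) -> int:
    """Keys and values cached by every attention layer: linear in seq_len."""
    return 2 * attn_layers * kv_heads * head_dim * seq_len * batch * BYTES_FP16


def mamba_state_bytes(mamba_layers: int, state_bytes_per_layer: int = 4 * 2**20,
                      batch: int = 1) -> int:
    """Recurrent state per Mamba layer has a fixed size, independent of seq_len."""
    return mamba_layers * state_bytes_per_layer * batch


def attention_score_ops(seq_len: int, heads: int = 32, head_dim: int = 128) -> int:
    """Query-key dot products for one attention layer over a full sequence: quadratic."""
    return seq_len * seq_len * heads * head_dim


for seq_len in (32_768, 131_072):
    # Hypothetical 40 layer models: all attention versus 4 attention + 36 Mamba.
    dense = kv_cache_bytes(attn_layers=40, seq_len=seq_len)
    hybrid = kv_cache_bytes(attn_layers=4, seq_len=seq_len) + mamba_state_bytes(36)
    print(f"{seq_len:>7} tokens: transformer {dense / GiB:.2f} GiB, "
          f"hybrid {hybrid / GiB:.2f} GiB")
```

This toy count ignores weights and activations, so its ratio is not comparable to IBM's 70 percent figure, which refers to overall RAM in real deployments. What it does show is the shape of the argument: quadrupling the context quadruples the transformer's cache, while the hybrid's growth comes only from its four attention layers.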
The 4.0 lineup ships in several sizes. The hybrid mixture of experts models are H-Small at 32B total with about 9B active and H-Tiny at 7B total with about 1B active. H-Micro is a 3B hybrid dense model. IBM also shipped a conventional transformer Micro at 3B for developers who want the familiar architecture, plus a Nano series at roughly 350M and 1B that is small enough to run in a web browser or on edge hardware. The models carry a 128K context window in deployment and were trained on samples up to 512K tokens. Training ran on an NVIDIA GB200 cluster, and every model in the lineup is released under Apache 2.0. [13][14][15]
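The gap between total and active parameters comes from mixture of experts routing: a small router scores the experts for each token, and only the top scoring few run. The sketch below shows generic top-k routing with toy scalar "experts". The function names, the value of k, and the parameter counts in the accounting helpers are hypothetical and do not describe Granite's actual router or expert sizes.

```python
import math


def softmax(xs: list[float]) -> list[float]:
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits: list[float], k: int) -> list[tuple[int, float]]:
    """Pick the k highest scoring experts and renormalize their gate weights."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)[:k]
    gates = softmax([router_logits[i] for i in ranked])
    return list(zip(ranked, gates))


def moe_forward(x: float, experts, router_logits: list[float], k: int) -> float:
    """Only the selected experts execute; the others stay idle for this token."""
    return sum(g * experts[i](x) for i, g in route_top_k(router_logits, k))


def total_params(shared: int, per_expert: int, n_experts: int) -> int:
    return shared + n_experts * per_expert


def active_params(shared: int, per_expert: int, k: int) -> int:
    return shared + k * per_expert


# Four toy experts that scale their input by 1, 2, 3, and 4.
experts = [lambda x, s=i + 1: s * x for i in range(4)]
print(moe_forward(1.0, experts, [0.0, 5.0, 0.0, 5.0], k=2))
```

Memory still has to hold every expert, which is why the total size matters for hardware, while per token compute tracks the active count.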
Granite 4.0 also leaned hard into governance. IBM says it is the first open model family to earn ISO 42001 certification, the international standard for AI management systems covering accountability, data privacy, and reliability. IBM cryptographically signs every 4.0 checkpoint on Hugging Face so that users can verify provenance, and it ran a bug bounty program with HackerOne offering up to 100,000 dollars for successful jailbreaks of guarded Granite deployments. [3][16][17]
Granite is a toolkit, and the specialized members are part of why IBM pitches it for end to end enterprise work.
Granite Code covers code generation, explanation, repair, and fill in the middle completion across many programming languages. [8]
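Fill in the middle completion works by rearranging the prompt so the model sees the code before and after a gap and then generates the gap. The sketch below builds a prompt in the common prefix, suffix, middle layout. The sentinel strings are placeholders modeled on a widely used convention; they are an assumption here and should be checked against the special tokens in the specific Granite tokenizer before use.

```python
# Placeholder sentinel strings (assumption): verify against the model's tokenizer.
FIM_PREFIX = "<fim_prefix>"
FIM_SUFFIX = "<fim_suffix>"
FIM_MIDDLE = "<fim_middle>"


def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Prefix-suffix-middle layout: the model writes the missing code after FIM_MIDDLE."""
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"


prompt = build_fim_prompt("def add(a, b):\n    return ", "\n")
print(prompt)
```

Generation then stops at an end of sequence token, and the output is spliced between the prefix and the suffix in the editor.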
Granite Guardian is a set of safety classifiers used as input and output guardrails. They flag categories such as harmful content, jailbreak attempts, and, in RAG settings, groundedness and relevance of retrieved context. They are meant to sit around a generation model rather than replace its own alignment. [1]
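The "around a generation model" arrangement can be sketched as a wrapper that checks the prompt before generation and the response after it. In the sketch below, the classifiers and the model are hypothetical stand-ins (simple keyword stubs), not Granite Guardian's API; a real deployment would call a Guardian model that returns a risk score for a given category, and a groundedness check would also receive the retrieved context.

```python
from typing import Callable

Classifier = Callable[[str], float]  # returns a risk score in [0, 1]
Generator = Callable[[str], str]


def guarded_generate(prompt: str, generate: Generator, input_check: Classifier,
                     output_check: Classifier, threshold: float = 0.5,
                     refusal: str = "Request blocked by guardrail.") -> str:
    """Screen the input, generate, then screen the output before returning it."""
    if input_check(prompt) >= threshold:
        return refusal
    response = generate(prompt)
    if output_check(response) >= threshold:
        return refusal
    return response


# Hypothetical stubs standing in for real classifier and generation models.
def toy_input_check(text: str) -> float:
    return 1.0 if "ignore previous instructions" in text.lower() else 0.0


def toy_output_check(text: str) -> float:
    return 1.0 if "password" in text.lower() else 0.0


def echo_model(prompt: str) -> str:
    return f"Summary: {prompt}"


print(guarded_generate("Summarize Q3 results", echo_model,
                       toy_input_check, toy_output_check))
```

Keeping the checks outside the generator means the guardrail policy can change without retraining the model it protects.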
Granite Speech provides speech to text with translation. The 3.3 8B model can process longer audio than some fixed window systems, and IBM later reported a 5.33 percent word error rate for a 4.1 speech variant. [12][3]
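Word error rate is the word level edit distance between a reference transcript and the model's output, divided by the number of reference words. A 5.33 percent WER therefore means about 5.33 substitutions, deletions, or insertions per 100 reference words. Here is a minimal sketch of the standard computation; published benchmarks also normalize case and punctuation first, which this version does not.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Levenshtein distance over words, divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        curr = [i] + [0] * len(hyp)
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```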
Granite Docling is a compact 258M parameter vision language model for converting documents into machine readable form while preserving layout, tables, equations, and lists. IBM says its quality rivals systems several times its size, and it is released under Apache 2.0. [18]
Granite Embedding supplies text embedding models for semantic search and RAG, with later multilingual versions covering more than 200 languages. [3][10]
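In semantic search and RAG, an embedding model turns the query and each document into vectors, and retrieval ranks documents by vector similarity. The sketch below uses cosine similarity on tiny hand written vectors; the document names and vector values are hypothetical, and in practice the vectors would come from an embedding model such as Granite Embedding and be searched with a vector index.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], doc_vecs: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the ids of the k documents most similar to the query."""
    ranked = sorted(doc_vecs.items(), key=lambda kv: cosine(query_vec, kv[1]), reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]


# Hypothetical 3-dimensional embeddings; real ones have hundreds of dimensions.
docs = {
    "refund_policy": [0.9, 0.1, 0.0],
    "holiday_schedule": [0.0, 0.2, 0.9],
    "return_window": [0.8, 0.3, 0.1],
}
print(top_k([1.0, 0.2, 0.0], docs, k=2))
```

The retrieved passages are then placed in the generation model's prompt, which is where a groundedness check can compare the answer against them.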
Granite Time Series provides compact models for forecasting on enterprise time series data, a domain most language model families ignore. [11]
The table below summarizes the main families and representative sizes.
| Family | Type | Representative sizes | License |
|---|---|---|---|
| Granite 4.0 language (H) | Hybrid Mamba 2 / transformer, dense and MoE | H-Small 32B (9B active), H-Tiny 7B (1B active), H-Micro 3B | Apache 2.0 |
| Granite 4.0 Micro / Nano | Conventional transformer | Micro 3B, Nano ~350M and ~1B | Apache 2.0 |
| Granite 3.x language | Transformer, dense and MoE | 2B, 8B dense; 1B and 3B MoE | Apache 2.0 |
| Granite Code | Code, decoder only | 3B, 8B, 20B, 34B | Apache 2.0 |
| Granite Guardian | Safety classifier | 2B, 8B | Apache 2.0 |
| Granite Speech | Speech to text and translation | ~2B, 8B | Apache 2.0 |
| Granite Vision | Document focused VLM | ~2B | Apache 2.0 |
| Granite Docling | Document conversion VLM | 258M | Apache 2.0 |
| Granite Embedding | Text embeddings | small, multilingual | Apache 2.0 |
| Granite Time Series | Forecasting | compact | Apache 2.0 |
IBM publishes evaluation tables on the Hugging Face model cards. The figures below come from the Granite 4.0 H-Small card and cover instruction following, reasoning, math, code, and tool use. They are self reported, so treat them as IBM's measurements rather than independent results.
| Benchmark | Setting | Granite 4.0 H-Small |
|---|---|---|
| MMLU | 5-shot | 78.44 |
| MMLU-Pro | 5-shot, CoT | 55.47 |
| BBH | 3-shot, CoT | 81.62 |
| GPQA | 0-shot, CoT | 40.63 |
| IFEval | average, strict | 87.55 |
| AlpacaEval 2.0 | - | 42.48 |
| GSM8K | 8-shot | 87.27 |
| HumanEval | pass@1 | 88 |
| MBPP | pass@1 | 84 |
| BFCL v3 | tool calling | 64.69 |
| SALAD-Bench | safety | 97.3 |
On instruction following, IBM reported that H-Small trailed only Meta's much larger Llama 4 Maverick among open models on IFEval at launch, and that on the Berkeley Function Calling Leaderboard it kept pace with larger systems at a lower serving cost. IBM also noted that the smallest 4.0 models outperformed the previous generation Granite 3.3 8B despite being less than half its size. The recurring theme across these numbers is capability per dollar rather than a top of the chart score. [3][14][15]
Granite sits in a specific spot in the open model landscape. It does not chase the largest possible model or the top of general leaderboards. It competes on small footprints, permissive licensing, and a paper trail that a risk team can actually read. The 4.0 generation pushed that further on two fronts at once. The hybrid Mamba and transformer design attacked the memory cost that makes long context and high concurrency expensive, and the ISO 42001 certification with checkpoint signing addressed the governance questions that slow enterprise adoption. [3][4]
That positioning has earned Granite comparisons to efficient open families from other vendors, with some coverage casting Granite 4.0 as a Western counterpart to the small, efficient open models coming out of China. [14] Whether Granite wins broad developer mindshare is a separate matter from whether it succeeds inside IBM's accounts, where the model is bundled with watsonx tooling, partner cloud availability, and IBM's support contracts. For organizations that need an openly licensed model they can host, audit, and certify, Granite is one of the few families built from the start around those requirements. [1][2]