Gemma 4
Last reviewed
Jun 3, 2026
Sources
9 citations
Review status
Source-backed
Revision
v1 ยท 2,238 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
Jun 3, 2026
Sources
9 citations
Review status
Source-backed
Revision
v1 ยท 2,238 words
Add missing citations, update stale details, or suggest a clearer explanation.
Gemma 4 is a family of open-weight multimodal models developed by Google DeepMind and released on April 2, 2026. It is the fourth generation of the Gemma series, succeeding Gemma 3, and the first Gemma release distributed under the Apache License 2.0 rather than Google's custom Gemma Terms of Use. The family spans four instruction-tuned and pre-trained variants: two edge-oriented dense models, E2B and E4B, and two larger models, a 26B mixture-of-experts model (26B A4B) and a 31B dense model. Google built Gemma 4 from the research and technology behind Gemini 3 and described it as "our most intelligent open models," positioned to maximize intelligence per parameter on hardware ranging from phones to a single accelerator. [1][2][3]
The headline result at launch was that the 31B dense model reached an Elo of 1452 on the text leaderboard run by LMArena (branded Arena AI), placing third among open models worldwide, while the 26B MoE model took sixth. Google noted that the 31B model "outcompetes models 20x its size," and several outlets pointed out that it scored above OpenAI's gpt-oss 120B despite having roughly a quarter of the parameters. [1][3][5][9]
Gemma 3, released in March 2025, came in 1B, 4B, 12B, and 27B sizes, added native vision, and extended context to 128K tokens. Gemma 4 keeps the same general philosophy of small models that punch above their weight, but the lineup and the technical baseline both shifted. The dense flagship grew slightly to 31B, a new sparse 26B MoE option joined the family, and the two smallest models adopted the "effective parameter" naming (E2B and E4B) that Gemma 3n had introduced for on-device deployment. Context on the larger models doubled to 256K tokens, and audio understanding, previously confined to the experimental Gemma 3n branch, became a standard feature of the edge models. [1][3][4]
The relationship to Gemini tightened further. Where Gemma 3 distilled from Gemini 2.0 era teachers, Gemma 4 is built on Gemini 3 research and technology. The two product lines remain distinct in their intent: Gemini 3 is Google's frontier, closed, API-served family, while Gemma 4 ships open weights that anyone can download, fine-tune, and run locally. The jump in benchmark scores between generations is large. On MMLU-Pro the 31B model scores 85.2 percent against 67.6 percent for Gemma 3 27B, and on the AIME 2026 math competition without tools it reaches 89.2 percent against 20.8 percent for the older model. Those gains track the broader reasoning improvements Google folded in from the Gemini 3 program. [3][4]
Gemma 4 ships in four sizes, each available as a pre-trained base model and an instruction-tuned (IT) model. The two edge models are dense; the 26B model is a sparse mixture of experts that loads all of its parameters into memory but activates only a fraction during each forward pass.
| Variant | Type | Total parameters | Effective / active parameters | Layers | Context window | Modalities |
|---|---|---|---|---|---|---|
| Gemma 4 E2B | Dense | 5.1B (with embeddings) | 2.3B effective | 35 | 128K | Text, image, audio, video |
| Gemma 4 E4B | Dense | 8B (with embeddings) | 4.5B effective | 42 | 128K | Text, image, audio, video |
| Gemma 4 26B A4B | Mixture of experts | 25.2B | 3.8B active | 30 | 256K | Text, image, video |
| Gemma 4 31B | Dense | 30.7B | 30.7B | 60 | 256K | Text, image, video |
The 26B A4B model uses an expert configuration of 8 active experts out of 128, plus 1 shared expert, so that only about 3.8 billion parameters are engaged per token. This is what lets a model with frontier-class accuracy run at the inference cost of a 4B model, and it is the reason the 26B placed sixth on the Arena leaderboard while activating so few parameters. The "A4B" in its name refers to those roughly four billion active parameters. [3][4]
The two edge models reuse the effective-parameter scheme from Gemma 3n. Per-Layer Embeddings (PLE) move a large share of the embedding table out of the always-resident working set, so the count of parameters that must sit in fast memory during generation is far smaller than the total. Google reports that E2B can run in under 1.5 GB of memory on some devices, which is what makes phone and single-board-computer deployment realistic. [4][6]
Gemma 4 carries forward the hybrid attention design that defined Gemma 3, interleaving local sliding window attention layers with periodic global full-context layers. The local windows are 512 tokens on the two edge models and 1,024 tokens on the 26B and 31B models. Keeping most layers local is what holds the key-value cache cost down enough to make 256K-token context practical without a large memory blowup. [3][4]
Positional encoding uses a dual Rotary Position Embedding scheme. Sliding-window layers use a standard RoPE configuration, while the global layers apply a proportional or pruned variant (p-RoPE) tuned to generalize across the much longer global context. The tokenizer keeps the 262K-entry SentencePiece vocabulary shared with the Gemini line, which encodes non-Latin scripts efficiently and helps the 140-language coverage. Two further efficiency mechanisms appear in Gemma 4: a shared KV cache, where the last several layers of a given attention type reuse the keys and values from the last non-shared layer of that type, and Multi-Token Prediction (MTP) drafter models, small same-architecture assistants released for all four sizes that enable speculative decoding for up to roughly a 3x end-to-end speedup with no quality loss. [4]
Vision is handled by an encoder of about 150M parameters on the edge models and about 550M on the larger ones, with learned 2D positions and multidimensional RoPE so that images keep their original aspect ratios. The encoder supports configurable image token budgets of 70, 140, 280, 560, or 1,120 tokens, trading resolution against speed and memory. The audio path on E2B and E4B uses a USM-style conformer encoder of around 300M parameters, the same lineage as Gemma 3n. As with Gemma 3, training leans heavily on knowledge distillation from larger Gemini teachers, which is the main reason the small models reach accuracy their parameter counts would not predict on their own. [3][4]
All four models accept text and images and can process video by sampling frame sequences, with the model card noting support for clips up to about 60 seconds. The two edge models, E2B and E4B, additionally accept native audio input for speech recognition and spoken-language understanding, while the 26B and 31B models focus on text, image, and video. On MMMU-Pro the 31B model scores 76.9 percent and the 26B model 73.8 percent, against 49.7 percent for Gemma 3 27B. On document understanding, measured by OmniDocBench 1.5 edit distance where lower is better, the 31B model reaches 0.131 versus 0.365 for Gemma 3 27B. The variable aspect-ratio encoder and the larger token budgets target OCR, charts, and dense documents, areas Gemma 3 already pursued with its Pan and Scan windowing. [3][4]
The edge models support a 128K-token context window, while the 26B and 31B models extend to 256K tokens. On the MRCR v2 long-context retrieval benchmark with 8 needles at 128K, the 31B model scores 66.4 percent and the 26B model 44.1 percent, both well above the 13.5 percent recorded for Gemma 3 27B, a sign the larger models can track multiple facts across a long input rather than just locate a single one. Language coverage stays at more than 140 languages, trained natively. Google reports a training-data cutoff of January 2025 and a corpus drawn from web documents, code, images, and audio, and the shared 262K-token vocabulary keeps multilingual text compact. [3][4]
A recurring theme in Google's launch materials is that Gemma 4 is built for agents that run on local hardware. The models offer native function calling, structured JSON output, native system instructions, and constrained decoding for predictable outputs, along with a thinking mode that can spend additional tokens on multi-step reasoning before answering. Google frames the goal as agents that "plan, navigate apps, and complete tasks on your behalf," and the developer materials describe multi-step planning, autonomous action, and offline code generation as the target workloads. [1][2][6]
The coding numbers back the agentic pitch. On LiveCodeBench v6 the 31B model scores 80.0 percent and the 26B model 77.1 percent, against 29.1 percent for Gemma 3 27B. On Codeforces the 31B model reaches an Elo of 2150, a large jump from the prior generation. Reasoning benchmarks show a similar pattern: GPQA Diamond at 84.3 percent and MMLU-Pro at 85.2 percent for the 31B model. [4]
The instruction-tuned scores below come from the Gemma 4 model card and the Hugging Face launch writeup, with Gemma 3 27B shown for comparison. They illustrate how the 26B MoE and 31B dense models separate from the edge models on reasoning-heavy tasks.
| Benchmark | E2B | E4B | 26B A4B | 31B | Gemma 3 27B |
|---|---|---|---|---|---|
| MMLU-Pro | 60.0 | 69.4 | 82.6 | 85.2 | 67.6 |
| AIME 2026 (no tools) | 37.5 | 42.5 | 88.3 | 89.2 | 20.8 |
| GPQA Diamond | 43.4 | 58.6 | 82.3 | 84.3 | 42.4 |
| BIG-Bench Hard | 21.9 | 33.1 | 64.8 | 74.4 | 19.3 |
| LiveCodeBench v6 | 44.0 | 52.0 | 77.1 | 80.0 | 29.1 |
| MMMU-Pro (vision) | 44.2 | 52.6 | 73.8 | 76.9 | 49.7 |
| MATH-Vision | 52.4 | 59.5 | 82.4 | 85.6 | 46.0 |
| MRCR v2 (8 needle, 128K) | 19.1 | 25.4 | 44.1 | 66.4 | 13.5 |
On the LMArena text leaderboard, which ranks models by blind human preference rather than by static benchmarks, the 31B dense model recorded an Elo of about 1452 and the 26B MoE model about 1441. Google cited these scores in reporting that the 31B placed third and the 26B sixth among open models at launch. Because Arena rankings shift as new models arrive and vote counts grow, those positions describe the field as of early April 2026 rather than a permanent standing. [1][3][4][5]
Gemma 4 is the first Gemma release distributed under the Apache License 2.0, an OSI-approved open-source license. Earlier Gemma generations, including Gemma 3, used the custom Gemma Terms of Use, which permitted commercial use but required downstream distributors to pass along Google's prohibited-use policy. Compliance teams routinely flagged those terms, and the requirement to propagate the restrictions created friction for enterprises redistributing fine-tuned models. [1][7][8]
The switch removes that friction. Under Apache 2.0 developers can modify Gemma 4, run it privately and entirely offline, and distribute derivatives commercially without the prior usage policy attached. Google framed the change around autonomy, control, and clarity, and Hugging Face CEO Clement Delangue called it "a huge milestone" on the day of release. The move was widely read as a direct response to feedback on the older terms, feedback that Google itself had acknowledged when Gemma 3 launched. [1][7][8]
Gemma 4 weights are available on Hugging Face, Kaggle, Google AI Studio, and Vertex AI, with base and instruction-tuned checkpoints for every size. Day-one runtime support is unusually broad. Hugging Face Transformers offers first-class multimodal support across PyTorch and JAX; llama.cpp added image and text inference from launch with an OpenAI-compatible server; Ollama and vLLM provide local and served deployment; MLX covers Apple Silicon through mlx-vlm; and transformers.js runs the models in the browser via WebGPU. GGUF and ONNX builds, plus the MTP drafter models, ship alongside the main weights. [3][4]
On edge hardware the edge models are the headline. Google reports E2B running at 133 prefill and 7.6 decode tokens per second on a Raspberry Pi 5 CPU, and 3,700 prefill and 31 decode tokens per second on a Qualcomm Dragonwing IQ8 NPU, with E2B and E4B able to run "completely offline with near-zero latency" on phones, Raspberry Pi, and Jetson Nano. By launch the broader Gemma series had been downloaded more than 400 million times and the community Gemmaverse exceeded 100,000 variants, figures Google expects to grow faster now that the licensing obstacle is gone. [1][2][6][7]