# Gemma

> Source: https://aiwiki.ai/wiki/gemma
> Updated: 2026-06-20
> Categories: Google DeepMind, Large Language Models, Open Source AI, Small Language Models
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

Gemma is a family of open-weight [large language models](/wiki/large_language_model) developed by [Google DeepMind](/wiki/google_deepmind), built from the same research and technology used to create Google's [Gemini](/wiki/gemini) models but small enough to run on a single GPU, a laptop, or a smartphone. Google describes Gemma as "a family of lightweight, state-of-the-art open models built from the same research and technology used to create the Gemini models" [1]. Named after the Latin word for "precious stone," Gemma launched on February 21, 2024 and has since expanded across four major generations (Gemma 1, 2, 3, and 4) plus specialized variants spanning 270 million to 31 billion parameters. By April 2026, developers had downloaded Gemma more than 400 million times and built over 100,000 community variants, making it one of the most-used open model families and a leading competitor to [Meta](/wiki/meta)'s [Llama](/wiki/llama), [Mistral AI](/wiki/mistral_ai)'s Mistral series, and [Microsoft](/wiki/microsoft)'s [Phi](/wiki/phi) models [16][17].

## Overview

Google DeepMind introduced Gemma on February 21, 2024, alongside a blog post emphasizing the company's commitment to making capable AI models available to the broader developer and research community [1]. "At Google, we believe in making AI helpful for everyone," the launch announcement stated, framing Gemma as a contribution to "the open community of developers and researchers powering AI innovation" [1]. The motivation behind Gemma was straightforward: while frontier models like Gemini Ultra and Gemini Pro deliver state-of-the-art performance, their size and computational requirements put them out of reach for many researchers, independent developers, and organizations that need to run models locally or on constrained hardware. Gemma fills that gap by distilling key insights from Gemini research into models with parameter counts ranging from 270 million to 31 billion.

All Gemma models are released with both pre-trained (base) and instruction-tuned variants. The instruction-tuned versions have undergone additional training with supervised fine-tuning on demonstration data and [reinforcement learning from human feedback](/wiki/reinforcement_learning_from_human_feedback) ([RLHF](/wiki/rlhf)) to make them more helpful and safer for conversational use. Model weights are distributed through platforms like [Hugging Face](/wiki/hugging_face), Kaggle, and Google's own Vertex AI, with support for popular frameworks including [PyTorch](/wiki/pytorch), [JAX](/wiki/jax), and Keras.

The table below summarizes all major releases in the Gemma family:

| Release | Date | Model Sizes | Key Features |
|---|---|---|---|
| Gemma 1 | February 21, 2024 | 2B, 7B | First open-weight release; 8K context; MQA/MHA |
| Gemma 2 | June 27, 2024 | 2B, 9B, 27B | [Knowledge distillation](/wiki/knowledge_distillation); GQA; sliding window attention |
| Gemma 3 | March 12, 2025 | 1B, 4B, 12B, 27B | Multimodal vision; 128K context; 140+ languages |
| Gemma 3n | June 26, 2025 | E2B, E4B | MatFormer architecture; on-device; audio/video input |
| Gemma 3 270M | August 14, 2025 | 270M | Ultra-compact; on-device fine-tuning |
| Gemma 4 | April 2, 2026 | E2B, E4B, 26B (MoE), 31B | MoE + dense; up to 256K context; audio input; reasoning |

## When was Gemma released?

Gemma 1 was released on February 21, 2024 [1]. Later releases followed at intervals of roughly two to nine months: Gemma 2 on June 27, 2024, Gemma 3 on March 12, 2025, Gemma 3n (full release) on June 26, 2025, Gemma 3 270M on August 14, 2025, and Gemma 4 on April 2, 2026 [3][4][6][12][16]. The cadence reflects an aggressive iteration schedule aimed at keeping Gemma at the frontier of small open-weight models.

## Gemma 1 (February 2024)

The first generation of Gemma was released on February 21, 2024, in two sizes: 2 billion (2B) and 7 billion (7B) parameters [1]. Both models use a [decoder-only transformer](/wiki/transformer) architecture with several modifications drawn from the Gemini research program.

### Architecture

Gemma 1 incorporates four notable architectural features that distinguish it from a vanilla transformer:

- **Multi-Query [Attention](/wiki/attention) (MQA):** The 2B model uses MQA, where a single key-value head serves multiple query heads, reducing memory bandwidth requirements during inference. The 7B model uses multi-head attention (MHA) instead.
- **Rotary Position [Embeddings](/wiki/embeddings) (RoPE):** Both models use [RoPE](/wiki/rotary_position_embedding) for positional encoding, which rotates query and key vectors by position-dependent angles so that attention scores depend on the relative distance between tokens.
- **GeGLU activation function:** Gemma uses the GeGLU variant of the gated linear unit as its feedforward activation function, which has been shown to improve training efficiency.
- **RMSNorm:** Both models use RMSNorm (Root Mean Square Layer Normalization) for input normalization, which is computationally simpler than standard LayerNorm.
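
The two feed-forward-related features above can be illustrated with a minimal pure-Python sketch. It is not Gemma's implementation: the helper names (`rms_norm`, `gelu`, `matvec`, `geglu`) and the toy weights are hypothetical, and production code may differ in details such as the GELU approximation or how the normalization weight is parameterized.

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """Divide x by its root mean square, then apply a learned per-channel scale."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]

def gelu(v):
    """Exact GELU written with the Gaussian error function."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))

def matvec(matrix, x):
    """Multiply a row-major matrix by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in matrix]

def geglu(x, w_gate, w_up):
    """GeGLU: GELU(W_gate x) multiplied elementwise by (W_up x)."""
    gate = [gelu(v) for v in matvec(w_gate, x)]
    up = matvec(w_up, x)
    return [g * u for g, u in zip(gate, up)]

# Toy 2-dimensional input and weights (illustrative values only).
x = [3.0, 4.0]
normed = rms_norm(x, weight=[1.0, 1.0])
hidden = geglu(normed, w_gate=[[1.0, 0.0], [0.0, 1.0]], w_up=[[1.0, 1.0], [1.0, -1.0]])
print(normed, hidden)
```

Unlike LayerNorm, `rms_norm` never subtracts the mean, which is where the computational saving comes from.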

The detailed architecture specifications for Gemma 1 are shown below:

| Parameter | Gemma 2B | Gemma 7B |
|---|---|---|
| Layers | 18 | 28 |
| Hidden Dimension (d_model) | 2,048 | 3,072 |
| Intermediate Size (FFN) | 32,768 | 49,152 |
| Attention Heads | 8 | 16 |
| KV Heads | 1 (MQA) | 16 (MHA) |
| Head Dimension | 256 | 256 |
| Vocabulary Size | 256,128 | 256,128 |
| Context Length | 8,192 | 8,192 |
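
The practical effect of MQA can be estimated from the table with a short calculation. The sketch below assumes 2 bytes per cached value (for example bfloat16), which the article does not specify; `kv_cache_bytes` is a hypothetical helper, and the third configuration is a made-up comparison point rather than a released model.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, context_len, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence."""
    # Keys and values are both stored, hence the leading factor of 2.
    return 2 * layers * kv_heads * head_dim * context_len * bytes_per_value

MIB = 1024 ** 2

gemma_2b = kv_cache_bytes(layers=18, kv_heads=1, head_dim=256, context_len=8192)
gemma_7b = kv_cache_bytes(layers=28, kv_heads=16, head_dim=256, context_len=8192)
# Hypothetical: the 2B shape if it cached all 8 heads (MHA) instead of 1 (MQA).
hypothetical_2b_mha = kv_cache_bytes(layers=18, kv_heads=8, head_dim=256, context_len=8192)

print(gemma_2b // MIB)             # 144 MiB
print(gemma_7b // MIB)             # 3584 MiB
print(hypothetical_2b_mha // MIB)  # 1152 MiB
```

Under these assumptions, sharing a single key-value head cuts the 2B model's full-context cache by a factor of eight.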

### Training

The 2B model was trained on 2 trillion tokens and the 7B model on 6 trillion tokens. The training data consists primarily of web documents, code, and mathematics content, filtered for quality and safety. Google did not release the full details of the training data composition but noted that extensive filtering was applied to remove personally identifiable information and other sensitive content [1]. Both models use a [SentencePiece](/wiki/sentencepiece) tokenizer with a vocabulary of 256,128 tokens, shared with the Gemini model family.

### Performance

At launch, Gemma 1 models demonstrated strong performance relative to their size. The 7B model outperformed [Llama 2](/wiki/llama_2) 7B and [Mistral](/wiki/mistral) 7B on multiple academic benchmarks [2]. In particular, Gemma 7B showed notable gains in mathematical reasoning ([GSM8K](/wiki/gsm8k), [MATH](/wiki/math)) and code generation ([HumanEval](/wiki/humaneval)), areas where earlier open models at this scale had struggled.

| Benchmark | Gemma 7B | Llama 2 7B | Mistral 7B |
|---|---|---|---|
| MMLU (5-shot) | 64.3% | 45.3% | 62.5% |
| HumanEval | 32.3% | 12.8% | 26.2% |
| GSM8K | 46.4% | 14.6% | 35.4% |
| MATH | 24.3% | 2.5% | 12.7% |
| HellaSwag | 82.3% | 77.2% | 81.3% |

Google's performance benchmarking using the MaxText reference implementation also showed up to 3x better performance-per-dollar for the Gemma 7B model compared to baseline training performance with Llama 2 7B on Google Cloud infrastructure [2].

## Gemma 2 (June 2024)

Google DeepMind released Gemma 2 on June 27, 2024, with a focus on improving performance at practical model sizes. The second generation was available in three sizes: 2B, 9B, and 27B parameters [3]. The paper describing Gemma 2, titled "Gemma 2: Improving Open Language Models at a Practical Size," emphasized architectural innovations aimed at maximizing quality-per-parameter.

### Architectural Changes

Gemma 2 introduced several improvements over the first generation:

- **[Knowledge distillation](/wiki/knowledge_distillation):** The smaller Gemma 2 models (2B and 9B) were trained using knowledge distillation from the 27B teacher model. Rather than a standard approach, Google DeepMind used on-policy distillation, where the student model generates its own completions from supervised fine-tuning prompts. The KL divergence between the teacher's and student's logit distributions is then minimized during training, allowing the student to learn from a richer signal than next-token prediction alone. This approach reduces the train-inference mismatch that can occur with off-policy distillation methods [3].
- **Grouped-Query Attention (GQA):** All Gemma 2 models use [GQA](/wiki/grouped_query_attention), a middle ground between MQA and full MHA that balances memory efficiency with representational capacity.
- **Sliding window attention:** Gemma 2 alternates between local sliding window attention (4,096 tokens) and full global attention (8,192 tokens) across layers, reducing the computational cost of processing long sequences while maintaining the ability to capture long-range dependencies.
- **Logit soft capping:** A logit soft-capping mechanism was introduced to improve training stability. The formula applies a hyperbolic tangent function scaled by a cap value (50.0 for self-attention logits and 30.0 for the final output layer), preventing logits from growing excessively large during training [3].
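
The soft-capping rule in the last item can be written in a few lines. This is a sketch of the formula as described (a tanh scaled by the cap value), with the cap constants taken from the text; the function and constant names are illustrative, not taken from Google's code.

```python
import math

ATTENTION_CAP = 50.0  # cap for self-attention logits, per the text
FINAL_CAP = 30.0      # cap for final output logits, per the text

def soft_cap(logit, cap):
    """Squash a logit smoothly into the open interval (-cap, cap)."""
    return cap * math.tanh(logit / cap)

# Small logits pass through almost unchanged; large ones saturate near the cap.
for raw in (1.0, 25.0, 100.0, 1000.0):
    print(raw, soft_cap(raw, ATTENTION_CAP))
```

Because tanh is close to the identity near zero, the cap only changes values that are already large relative to it.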

The detailed specifications for all three Gemma 2 sizes are:

| Parameter | Gemma 2 2B | Gemma 2 9B | Gemma 2 27B |
|---|---|---|---|
| Layers | 26 | 42 | 46 |
| Hidden Dimension | 2,304 | 3,584 | 4,608 |
| Attention Heads | 8 | 16 | 32 |
| KV Heads | 4 | 8 | 16 |
| Local Attention Window | 4,096 | 4,096 | 4,096 |
| Global Attention Span | 8,192 | 8,192 | 8,192 |
| Training Tokens | 2T | 8T | 13T |
| Vocabulary Size | 256,000 | 256,000 | 256,000 |

The 27B model was trained from scratch on 13 trillion tokens without distillation, while the 9B model was distilled from the 27B model on a token budget more than 50x the compute-optimal quantity predicted by scaling law theory [3]. This "over-training" strategy, combined with distillation, allowed the 9B model to score well above what its parameter count alone would suggest on benchmarks.
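
The distillation signal described earlier in this section, a KL divergence between teacher and student next-token distributions, can be sketched for a single token position as follows. The direction KL(teacher || student), the helper names, and the toy logits are assumptions made for illustration; the article does not give the exact loss used in training.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(teacher_logits, student_logits):
    """KL(teacher || student) for one next-token distribution (assumed direction)."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

# Toy 4-token vocabulary: the student is penalized across the whole distribution,
# not only on the single token that a hard label would mark as correct.
teacher = [2.0, 1.0, 0.1, -1.0]
student = [1.5, 1.2, 0.0, -0.5]
loss = kl_divergence(teacher, student)
print(loss)
```

The loss is zero only when the student reproduces the teacher's full distribution, which is the "richer signal" compared with next-token prediction on one-hot targets.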

### Performance

Gemma 2 delivered substantial improvements across benchmarks, with the 27B model competing against models significantly larger in parameter count.

| Benchmark | Gemma 2 2B | Gemma 2 9B | Gemma 2 27B |
|---|---|---|---|
| MMLU (5-shot) | 52.2% | 71.3% | 75.2% |
| HellaSwag (10-shot) | 72.9% | 81.9% | 86.4% |
| GSM8K | 23.9% | 68.6% | 74.0% |
| ARC-c | 55.4% | 68.4% | 71.4% |
| Winogrande | 70.9% | 80.6% | 83.7% |

On the LMSys [Chatbot Arena](/wiki/lmsys_chatbot_arena) leaderboard, the Gemma 2 27B instruction-tuned model achieved an Elo score of 1218, surpassing [Llama 3](/wiki/llama) 70B (Elo 1206), a model nearly three times its size [3]. Google's evaluations also found that Gemma 2 models exhibited significantly lower memorization rates than prior models, with verbatim memorization below 0.1%.

## Gemma 3 (March 2025)

Gemma 3 was released on March 12, 2025, and was the largest expansion of the family up to that point. It introduced four model sizes (1B, 4B, 12B, and 27B), multimodal capabilities for vision and text understanding, support for over 140 languages, and context windows of up to 128,000 tokens [4]. Google positioned Gemma 3 as "the most capable model you can run on a single GPU or TPU" [5].

### Multimodal Vision Support

The headline feature of Gemma 3 is native multimodal support. The 4B, 12B, and 27B models can process both images and text as input, while the 1B model remains text-only due to its compact size. Image understanding is enabled through a 400M-parameter variant of the SigLIP vision encoder, a [Vision Transformer](/wiki/vision_transformer) (ViT) trained with a variant of the [CLIP](/wiki/clip) contrastive loss [4].

The vision encoder takes square images resized to 896 x 896 pixels and encodes them into a sequence of visual tokens. These tokens are then condensed into a fixed set of 256 image token vectors before being fed into the language model alongside text tokens. This condensation step keeps computational costs manageable even when processing multiple images within a single prompt.
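
The token arithmetic behind this condensation step can be sketched as below. The 896-pixel input and the 256-token output come from the text; the patch size of 14 pixels and the 4x4 pooling window are assumptions chosen because they reproduce those figures, and `vision_token_count` is a hypothetical helper.

```python
def vision_token_count(image_size=896, patch_size=14, pool=4):
    """Return (patch tokens before pooling, image tokens after pooling)."""
    # patch_size and pool are assumed values, not stated in the article.
    patches_per_side = image_size // patch_size  # 896 // 14 = 64
    pooled_per_side = patches_per_side // pool   # 64 // 4 = 16
    return patches_per_side ** 2, pooled_per_side ** 2

before, after = vision_token_count()
print(before, after)  # 4096 256

# With several images in a prompt, the fixed budget keeps the total predictable.
images_in_prompt = 5
print(images_in_prompt * after)  # 1280 image tokens
```

Under these assumptions, pooling reduces the sequence the language model must attend over by a factor of 16 per image.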

For images with non-standard aspect ratios, Gemma 3 employs a Pan and Scan (P&S) method inspired by [LLaVA](/wiki/llava). This approach segments images into non-overlapping crops of equal size that cover the entire image, resizes each crop to 896 x 896 pixels, and processes them individually through the encoder. The result is that Gemma 3 can handle images of varying resolutions and aspect ratios without distorting or losing important details [4].

This allows Gemma 3 to perform tasks like image captioning, visual question answering, document understanding, chart interpretation, and optical character recognition.

### Architecture and Context Window

Gemma 3 uses a decoder-only [transformer](/wiki/transformer) architecture with Grouped-Query Attention (GQA) and RMSNorm, consistent with Gemma 2. A key change from Gemma 2 is the replacement of logit soft capping with QK-norm (query-key normalization), which normalizes query and key vectors before computing attention scores [4].

Gemma 3 dramatically increased the context window compared to Gemma 2's 8K limit. The 1B model supports 32,768 tokens, while the 4B, 12B, and 27B models support 128,000 tokens [4]. The 16x increase for the larger models is made practical by an interleaved attention pattern: there are 5 local attention layers for every global attention layer. Local layers use a sliding window of just 1,024 tokens, while global layers attend to the full context. This design significantly reduces the memory and compute cost of long-context processing, since most layers only need to attend to a small window. The RoPE base frequency was increased from 10,000 to 1,000,000 to support the longer context lengths.
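
A rough sense of the saving from the 5:1 pattern comes from counting how many token positions each layer has to keep in its key-value cache. The sketch below uses the ratio, window size, and context length from the text; the 12-layer depth is a hypothetical value chosen for readability, and the exact ordering of local and global layers is an assumption.

```python
def layer_pattern(n_layers, local_per_global=5):
    """Label layers so every (local_per_global + 1)-th layer is global (assumed ordering)."""
    period = local_per_global + 1
    return ["global" if (i + 1) % period == 0 else "local" for i in range(n_layers)]

def cached_positions(pattern, context_len, window=1024):
    """Total token positions kept in the KV cache, summed over layers."""
    return sum(
        context_len if kind == "global" else min(context_len, window)
        for kind in pattern
    )

N_LAYERS = 12  # hypothetical depth, not a real Gemma 3 configuration
pattern = layer_pattern(N_LAYERS)
interleaved = cached_positions(pattern, context_len=128_000)
all_global = cached_positions(["global"] * N_LAYERS, context_len=128_000)

print(pattern.count("local"), pattern.count("global"))  # 10 2
print(interleaved, all_global)                          # 266240 1536000
print(interleaved / all_global)                         # roughly 0.17
```

At full context, this toy configuration caches about one sixth of the positions that an all-global stack of the same depth would.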

The full architecture specifications for each Gemma 3 variant:

| Parameter | Gemma 3 1B | Gemma 3 4B | Gemma 3 12B | Gemma 3 27B |
|---|---|---|---|---|
| Embedding Parameters | 302M | 675M | 1,012M | 1,416M |
| Non-embedding Parameters | 698M | 3,209M | 10,759M | 25,600M |
| Total Parameters | 1B | ~3.9B | ~11.8B | ~27B |
| Context Window | 32K | 128K | 128K | 128K |
| Vocabulary Size | ~262K | ~262K | ~262K | ~262K |
| Multimodal | Text only | Vision + Text | Vision + Text | Vision + Text |

### Training Scale

The training data volume increased substantially across all model sizes compared to previous generations:

| Model | Parameters | Training Tokens | Context Window | Multimodal |
|---|---|---|---|---|
| Gemma 3 1B | 1 billion | 2 trillion | 32K | Text only |
| Gemma 3 4B | 4 billion | 4 trillion | 128K | Vision + Text |
| Gemma 3 12B | 12 billion | 12 trillion | 128K | Vision + Text |
| Gemma 3 27B | 27 billion | 14 trillion | 128K | Vision + Text |

The training data includes web documents, code, mathematics, science articles, and multilingual content spanning over 140 languages. Compared to Gemma 2, the 27B model was trained on 14 trillion tokens (up from 13 trillion), and the midsize models saw even larger relative increases in data volume [4].

### Performance

Gemma 3's instruction-tuned models improved substantially on Gemma 2, particularly in mathematical reasoning and code generation. Reported scores for each size are:

| Benchmark | Gemma 3 1B IT | Gemma 3 4B IT | Gemma 3 12B IT | Gemma 3 27B IT |
|---|---|---|---|---|
| MMLU | 38.8% | 58.1% | 71.9% | 76.9% |
| MMLU-Pro | 14.7% | 43.6% | 60.6% | 67.5% |
| HumanEval | 41.5% | 71.3% | 85.4% | 87.8% |
| GSM8K | 62.8% | 89.2% | 94.4% | 95.9% |
| MATH | 48.0% | 75.6% | 83.8% | 89.0% |
| HellaSwag | 62.3% | 77.2% | 84.2% | 85.6% |
| LiveCodeBench | 1.9% | 12.6% | 24.6% | 29.7% |
| GPQA Diamond | 19.2% | 30.8% | 40.9% | 42.4% |

On the LMSys Chatbot Arena, the Gemma 3 27B instruction-tuned model scored an Elo of 1338, placing it among the top 10 models overall and above much larger models such as [DeepSeek](/wiki/deepseek)-V3 (1318), [Llama 3.1](/wiki/llama) 405B (1257), and [Qwen](/wiki/qwen) 2.5 72B (1257) [4][5]. That score was a large jump from Gemma 2 27B's 1220 [4]. This performance level, achieved with a model small enough to run on a single GPU, represented a significant milestone for the open model ecosystem.

### Gemma 3 270M

On August 14, 2025, Google released Gemma 3 270M, the smallest model in the Gemma family [12]. With just 270 million parameters (170 million embedding parameters and 100 million transformer block parameters), it is designed for ultra-efficient on-device deployment and task-specific [fine-tuning](/wiki/fine_tuning). Despite its compact size, Gemma 3 270M "establishes a new level of performance for its size" on the IFEval instruction-following benchmark, according to Google [12]. Internal testing on a Pixel 9 Pro showed the INT4-quantized model consumed only 0.75% of battery life over 25 conversations, making it one of the most power-efficient language models available [12]. Google framed the release around right-sizing models to tasks: "In engineering, success is defined by efficiency, not just raw power. You wouldn't use a sledgehammer to hang a picture frame," the announcement explained [12]. Google also released FunctionGemma, a specialized fine-tune of the 270M model for function calling, enabling on-device agents to translate natural-language commands into structured API calls [13].

## Gemma 3n (June 2025)

Gemma 3n is a variant of the Gemma family specifically optimized for on-device and [edge computing](/wiki/edge_computing) deployment. Previewed at Google I/O 2025 and fully released on June 26, 2025, Gemma 3n introduces architectural innovations that allow powerful models to run with minimal memory footprints on smartphones, tablets, and other resource-constrained devices [6].

### MatFormer Architecture

The key innovation in Gemma 3n is the MatFormer (Matryoshka Transformer) architecture, a novel nested transformer design built for elastic inference. Like Russian nesting dolls (Matryoshka dolls), a MatFormer model contains smaller, fully functional sub-models within its parameter space. During training of the E4B (4 billion effective parameter) model, a smaller E2B (2 billion effective parameter) sub-model is simultaneously optimized within it. This allows a single trained model to be deployed at multiple compute and memory levels without retraining, providing flexibility for devices with different capabilities [6].

Developers can use Gemma 3n in two modes:

- **Pre-extracted models:** Download either the standalone E4B or E2B variant for direct deployment, with the E2B sub-model offering up to 2x faster inference than the E4B.
- **Mix-n-Match:** Create custom model sizes between E2B and E4B by adjusting feed-forward dimensions and selectively skipping layers, enabling fine-grained control over the accuracy-latency tradeoff.
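
The nesting idea can be illustrated with a toy feed-forward block in which a smaller sub-model reuses a prefix of the larger model's hidden units. This is a simplified sketch, not Gemma 3n's implementation: prefix slicing as the nesting mechanism, the ReLU activation, the `ffn` helper, and all weights are assumptions made for illustration.

```python
def ffn(x, w_in, w_out, hidden_units):
    """Feed-forward block that uses only the first `hidden_units` hidden neurons."""
    # ReLU is used for brevity; it stands in for the real activation function.
    hidden = [
        max(0.0, sum(w * v for w, v in zip(w_in[j], x)))
        for j in range(hidden_units)
    ]
    return [
        sum(w_out[i][j] * hidden[j] for j in range(hidden_units))
        for i in range(len(w_out))
    ]

# One shared set of weights: 2 model dimensions, 4 hidden units.
w_in = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
w_out = [[1.0, 0.0, 0.5, 0.5], [0.0, 1.0, 0.5, -0.5]]
x = [1.0, 2.0]

full = ffn(x, w_in, w_out, hidden_units=4)   # larger model: all hidden units
small = ffn(x, w_in, w_out, hidden_units=2)  # nested sub-model: first half only
print(full, small)  # [2.5, 3.5] [1.0, 2.0]
```

Choosing an intermediate value of `hidden_units` corresponds loosely to the Mix-n-Match idea of picking a custom feed-forward dimension between the two trained sizes.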

### Per-Layer Embeddings (PLE)

The second major innovation in Gemma 3n is Per-Layer Embeddings (PLE), a technique that dramatically reduces accelerator memory (GPU/TPU VRAM) usage. In a standard transformer, the embedding matrix is loaded into high-speed accelerator memory. PLE instead associates separate embedding parameters with each transformer layer and stores them in regular CPU memory. Only the core transformer weights need to reside in accelerator memory, which is the bottleneck for on-device deployment. As a result, while the raw parameter counts for Gemma 3n are 5 billion (E2B) and 8 billion (E4B), the effective accelerator memory footprint is comparable to traditional 2B and 4B models [6].

### Model Sizes and Memory

| Specification | E2B | E4B |
|---|---|---|
| Raw Parameter Count | 5 billion | 8 billion |
| Effective Parameters | ~2 billion | ~4 billion |
| Accelerator Memory | ~2 GB | ~3 GB |
| LMArena Score | N/A | >1,300 |
| Modalities (Input) | Text, image, audio, video | Text, image, audio, video |
| Modalities (Output) | Text | Text |
| Language Support (Text) | 140 languages | 140 languages |
| Language Support (Multimodal) | 35 languages | 35 languages |

The E4B model became the first model under 10 billion raw parameters to exceed an LMArena score of 1,300, a milestone that underscored the effectiveness of the MatFormer and PLE innovations [6]. The released model can run on devices with as little as 2 GB of memory [6].

### Multimodal Capabilities

Unlike Gemma 3, which only supports vision and text inputs, Gemma 3n expands multimodal support to include audio and video in addition to images and text:

- **Vision:** Uses a [MobileNet](/wiki/mobilenet)-V5-300M encoder (replacing Gemma 3's SigLIP) optimized for on-device inference. It supports input resolutions of 256x256, 512x512, and 768x768 pixels, achieves 60 frames per second on a Google Pixel device, and provides a 13x speedup with quantization compared to the SigLIP baseline, with 46% fewer parameters and a 4x smaller memory footprint [6].
- **Audio:** Uses a Universal Speech Model (USM) encoder that generates approximately one token per 160 milliseconds (~6 tokens/second). It supports automatic speech recognition (ASR) and automatic speech translation (AST) for audio clips up to 30 seconds [6].
- **Video:** Processes video by extracting frames and encoding them through the vision encoder, enabling basic video understanding tasks.
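
The audio figures above imply a small, predictable token budget per clip. The following sketch works through the arithmetic; rounding partial tokens up is an assumption, and `audio_tokens` is a hypothetical helper.

```python
import math

MS_PER_AUDIO_TOKEN = 160  # one token per 160 ms, per the text
MAX_CLIP_SECONDS = 30     # maximum clip length, per the text

def audio_tokens(duration_seconds):
    """Estimated audio tokens for a clip, rounding any partial token up (assumed)."""
    return math.ceil(duration_seconds * 1000 / MS_PER_AUDIO_TOKEN)

tokens_per_second = 1000 / MS_PER_AUDIO_TOKEN
max_clip_tokens = audio_tokens(MAX_CLIP_SECONDS)

print(tokens_per_second)  # 6.25
print(max_clip_tokens)    # 188
```

A full 30-second clip therefore occupies fewer tokens than a single 256-token image does in Gemma 3.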

### KV Cache Sharing

Gemma 3n introduces KV cache sharing to optimize prefill performance for long-context inputs. This technique delivers approximately 2x improvement on prefill performance compared to Gemma 3 4B, which is critical for responsive on-device inference where users expect near-instant replies [6].

## Gemma 4 (April 2026)

Google released Gemma 4 on April 2, 2026, describing it as "the most capable model family you can run on your hardware" [16]. The fourth generation spans four sizes at launch and is the first Gemma generation to ship a [mixture-of-experts](/wiki/mixture_of_experts) (MoE) model alongside dense models, while extending native audio input and reasoning capabilities across the family.

### What sizes does Gemma 4 come in?

Gemma 4 launched in four sizes: two edge models (E2B and E4B, effective 2 billion and 4 billion parameters) that use Per-Layer Embeddings, a 26B mixture-of-experts model that activates only 3.8 billion parameters during inference, and a 31B fully dense model [16]. Google subsequently expanded the lineup with a Gemma 4 12B Unified model on June 3, 2026.

| Model | Type | Active Parameters | Context Window | Multimodal |
|---|---|---|---|---|
| Gemma 4 E2B | PLE edge | ~2 billion (effective) | 128K | Video, image, audio |
| Gemma 4 E4B | PLE edge | ~4 billion (effective) | 128K | Video, image, audio |
| Gemma 4 26B | Mixture-of-experts | 3.8 billion | Up to 256K | Video, image |
| Gemma 4 31B | Dense | 31 billion | Up to 256K | Video, image |
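
The gap between total and active parameters in the 26B model follows from how mixture-of-experts layers work in general: a router sends each token to a few experts and the rest stay idle. The sketch below shows generic top-k routing with toy scalar "experts"; the article does not describe Gemma 4's router, so the value of `k`, the softmax weighting, and every name here are assumptions.

```python
import math

def top_k_route(router_logits, k=2):
    """Return {expert_index: weight} for the k highest-scoring experts."""
    ranked = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:k]
    m = max(router_logits[i] for i in chosen)
    exps = {i: math.exp(router_logits[i] - m) for i in chosen}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

def moe_layer(x, experts, router_logits, k=2):
    """Run only the selected experts and mix their outputs by router weight."""
    weights = top_k_route(router_logits, k)
    return sum(w * experts[i](x) for i, w in weights.items())

# Four toy experts; only two are evaluated for this input.
experts = [lambda v: v + 1.0, lambda v: 2.0 * v, lambda v: -v, lambda v: v * v]
y = moe_layer(3.0, experts, router_logits=[0.1, 2.0, -1.0, 2.0], k=2)
print(y)  # 7.5

# Share of the 26B model's parameters that is active per token, from the table.
active_fraction = 3.8 / 26
print(round(active_fraction, 3))  # 0.146
```

Per-token compute scales with the active parameters, while memory must still hold all experts, which is why the 26B model is listed separately from the edge models.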

### Capabilities and benchmarks

Gemma 4 adds multi-step reasoning, native function-calling, and structured JSON output, and all sizes process video and images at variable resolutions, with native audio input on the E2B and E4B edge models [16]. The family is natively trained on over 140 languages and supports context windows of 128K tokens on the edge models and up to 256K tokens on the larger models [16].

On the LMArena (Arena AI) text leaderboard, Google reported that the Gemma 4 31B model was the "#3 open model in the world," with the 26B mixture-of-experts model "securing the #6 spot" [16]. Google stated that Gemma 4 "outcompetes models 20x its size" [16].

## How many times has Gemma been downloaded?

Gemma has become one of the most-downloaded open model families in the AI community, and Google has reported its adoption through a series of public milestones. By Gemma's one-year anniversary in February 2025, the models had passed 100 million downloads and more than 60,000 community-created variants in what Google calls the "Gemmaverse" [17]. "Gemma just passed 150 million downloads and over 70k variants on Hugging Face," Google DeepMind developer relations engineer Omar Sanseviero said in May 2025 [17]. By the Gemma 4 launch in April 2026, the cumulative total had surpassed 400 million downloads and more than 100,000 variants [16].

| Milestone | Date | Cumulative Downloads | Gemmaverse Variants |
|---|---|---|---|
| One-year anniversary | February 2025 | 100 million+ | 60,000+ |
| Mid-2025 milestone | May 2025 | 150 million+ | 70,000+ |
| Gemma 4 launch | April 2026 | 400 million+ | 100,000+ |

## Architecture Evolution Across Generations

The progression from Gemma 1 to Gemma 4 shows a clear trajectory of architectural refinement:

| Feature | Gemma 1 | Gemma 2 | Gemma 3 | Gemma 3n | Gemma 4 |
|---|---|---|---|---|---|
| Attention Type | MQA (2B) / MHA (7B) | GQA (all sizes) | GQA + interleaved local/global | GQA + MatFormer elastic | GQA; MoE (26B) + dense (31B) |
| Position Encoding | [RoPE](/wiki/rotary_position_embedding) (base 10K) | RoPE (base 10K) | RoPE (base 1M) | RoPE (base 1M) | RoPE |
| Normalization | RMSNorm | RMSNorm | RMSNorm + QK-norm | RMSNorm + QK-norm | RMSNorm |
| Activation | GeGLU | GeGLU | GeGLU | GeGLU | GeGLU |
| Max Context | 8,192 | 8,192 | 128,000 | 128,000 | 256,000 |
| Distillation | None | On-policy (2B, 9B) | Yes (all sizes) | Nested (MatFormer) | Nested (PLE edge) |
| Multimodal | No | No | Vision (SigLIP 400M) | Vision (MobileNet-V5), Audio (USM), Video | Video, image (all); audio (edge) |
| Vocabulary | 256,128 | 256,000 | ~262K | ~262K | 256,000 |

## Training Data and Methodology

All Gemma models are trained on Google's proprietary data mixture, which Google has described in general terms but has not released publicly. The training data includes:

- **Web documents:** Filtered web crawl data with quality and safety filters applied to remove personally identifiable information, toxic content, and low-quality pages.
- **Code:** Source code from public repositories across multiple programming languages including Python, Java, C++, JavaScript, Go, and Rust.
- **Mathematics:** Mathematical content including textbooks, problem sets, and formal proofs.
- **Science articles:** Scientific publications and technical documentation.
- **Multilingual content:** Starting with Gemma 3, the training data was significantly expanded to cover over 140 languages, with dedicated multilingual data curation.

The number of training tokens for the largest model grew with each generation. Gemma 1's 7B model saw 6 trillion tokens, Gemma 2's 27B model was trained on 13 trillion tokens, and Gemma 3's 27B model processed 14 trillion tokens. The smaller Gemma 2 models were "over-trained" relative to [Chinchilla](/wiki/chinchilla) scaling law predictions, with the 9B model trained on 8 trillion tokens (more than 50x the compute-optimal amount for its size) to maximize quality [3].

Google applied extensive safety filtering during data preparation, including the removal of child sexual abuse material (CSAM), personally identifiable information, and content that violates Google's policies. The exact composition and proportions of the training data have not been disclosed, which has been a point of criticism from researchers who argue that full data transparency is necessary for reproducible science.

## Specialized Variants

Beyond the core Gemma models, Google DeepMind has released several task-specific variants that build on the Gemma architecture.

### CodeGemma

CodeGemma is a family of models specialized for code generation and completion tasks. Released alongside the first Gemma generation, CodeGemma models support multiple programming languages including Python, Java, C++, JavaScript, and more. The models are available in sizes that mirror the base Gemma lineup and are designed for both code completion (fill-in-the-middle) and general coding assistance [7].

### PaliGemma

[PaliGemma](/wiki/paligemma) is a vision-language model that combines a SigLIP image encoder with a Gemma language model decoder. It is specifically designed for fine-tuning on visual understanding tasks such as image captioning, [object detection](/wiki/object_detection), and document understanding. PaliGemma 2, built on Gemma 2 and released in December 2024, expanded support to multiple image resolutions and additional model sizes [8].

### ShieldGemma

ShieldGemma is a safety classifier built on the Gemma architecture. ShieldGemma 2, a 4B parameter model built on Gemma 3, functions as an image safety classifier that can identify potentially harmful content across three categories: dangerous content, sexually explicit material, and violence. It is intended for use as a guardrail in production applications that process user-generated or model-generated images [9].

### RecurrentGemma

RecurrentGemma is a variant that replaces the standard transformer attention mechanism with a linear recurrence based on the Griffin architecture. Available in 2B and 9B parameter sizes, it offers faster inference at long sequence lengths due to the constant memory footprint of recurrent computation, though it trades some quality for this efficiency gain.
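
The constant-memory property can be seen in a deliberately simplified linear recurrence. This is not the Griffin recurrence itself, which uses learned, input-dependent gates; the fixed `decay` value and the function name are assumptions chosen to keep the sketch short.

```python
def linear_recurrence(inputs, decay=0.9):
    """h_t = decay * h_{t-1} + (1 - decay) * x_t, carrying only the latest state."""
    state = 0.0
    outputs = []
    for x in inputs:
        # The update needs only the previous state, never the full history.
        state = decay * state + (1.0 - decay) * x
        outputs.append(state)
    return outputs, state

outputs, final_state = linear_recurrence([1.0, 1.0], decay=0.5)
print(outputs, final_state)  # [0.5, 0.75] 0.75
```

However long the input grows, the carried state stays a fixed size, in contrast to an attention key-value cache that grows with every token.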

### FunctionGemma

FunctionGemma is a 270M parameter model fine-tuned from Gemma 3 270M for function calling tasks. It translates natural-language user commands into structured API or tool calls, enabling on-device agents that can control mobile applications, IoT devices, and other tools without sending data to the cloud [13].

| Variant | Size(s) | Purpose | Key Features |
|---|---|---|---|
| CodeGemma | 2B, 7B | Code generation and completion | Multi-language support; fill-in-the-middle capability |
| [PaliGemma](/wiki/paligemma) / PaliGemma 2 | Multiple | Vision-language tasks | Fine-tunable for image understanding; multi-resolution |
| ShieldGemma 2 | 4B | Image safety classification | Classifies dangerous, explicit, and violent content |
| RecurrentGemma | 2B, 9B | Efficient long-sequence inference | Griffin linear recurrence; constant memory |
| Gemma 3 270M | 270M | On-device fine-tuning | Ultra-compact; 0.75% battery per 25 conversations |
| FunctionGemma | 270M | Function calling | Structured API calls from natural language |
| Gemma 3n | E2B, E4B | On-device deployment | MatFormer architecture; multimodal; ultra-low memory |

## Responsible AI Toolkit

Google released the Responsible [Generative AI](/wiki/generative_ai) Toolkit alongside the Gemma models to help developers build safe and responsible applications. The toolkit provides several resources [14]:

- **Safety tuning guidance:** Documentation on best practices for setting safety policies and applying safety-focused fine-tuning to Gemma models.
- **Safety classifiers:** ShieldGemma and related classifiers that can filter inputs and outputs for harmful content, including hate speech, violence, and sexually explicit material.
- **Learning [Interpretability](/wiki/interpretability) Tool (LIT):** An interactive tool for investigating and debugging Gemma's behavior in response to different prompts. LIT helps developers understand why a model produces certain outputs and identify potential failure modes.
- **LLM Comparator:** A tool for running and visualizing comparative evaluations between different model versions, fine-tunes, or configurations.
- **Safeguards documentation:** Guidance on building input/output filtering pipelines, setting content policies, and implementing defense-in-depth strategies for production deployments.

The toolkit encourages a holistic approach to responsible AI that addresses safety, privacy, fairness, and accountability at both the model and application levels [14].

## Google AI Edge Deployment

Google AI Edge is the primary platform for deploying Gemma models on mobile and edge devices. The SDK provides optimized inference runtimes for Android, iOS, and web applications [15].

Key deployment capabilities include:

- **Gemma 3 1B on mobile:** The INT4-quantized Gemma 3 1B model is just 529 MB in size and achieves up to 2,585 tokens per second on prefill via Google AI Edge's LLM inference runtime, allowing it to process a page of content in under a second [15].
- **Gemma 3n on mobile:** The E2B and E4B models can be deployed through the Google AI Edge SDK on CPUs, NPUs, and mobile GPUs. Google collaborated with mobile hardware partners including Qualcomm Technologies, MediaTek, and Samsung's System LSI to optimize Gemma 3n for their respective chipsets [6].
- **LiteRT format:** Gemma models can be converted to Google's LiteRT (formerly TensorFlow Lite) format for deployment on resource-constrained devices.
- **Web deployment:** Through WebAssembly and WebGPU, Gemma models can run directly in web browsers without a server backend.

The on-device deployment capabilities are significant for privacy-sensitive applications, since data never needs to leave the user's device. Healthcare, finance, and enterprise applications in particular benefit from the ability to run inference locally.
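The Gemma 3 1B figures quoted above can be sanity-checked with simple arithmetic. The sketch below uses the throughput from the text; the page length of 700 tokens and the nominal count of exactly 1 billion parameters are assumptions made for illustration, and the size estimate ignores metadata and any tensors kept at higher precision.

```python
PREFILL_TOKENS_PER_SEC = 2585  # peak prefill throughput quoted in the text
PAGE_TOKENS = 700              # assumption: rough token count of one page


def prefill_seconds(prompt_tokens: int,
                    tokens_per_sec: float = PREFILL_TOKENS_PER_SEC) -> float:
    """Time to ingest a prompt at a given prefill throughput."""
    return prompt_tokens / tokens_per_sec


def int4_weight_mb(num_params: float) -> float:
    """Raw weight storage at 4 bits (0.5 bytes) per parameter, in MB (10**6 bytes)."""
    return num_params * 0.5 / 1e6


if __name__ == "__main__":
    print(f"Prefill for one page: {prefill_seconds(PAGE_TOKENS):.2f} s")
    print(f"Nominal INT4 weights for 1B params: {int4_weight_mb(1e9):.0f} MB")
```

Under these assumptions a page is ingested in well under a second, and the nominal 4-bit weight size lands in the same range as the reported 529 MB file.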

## Community Fine-Tunes and Ecosystem

The Gemma family has generated a large ecosystem of community-created fine-tunes, quantizations, and adaptations. As of early 2026, Gemma models are among the most downloaded model families on [Hugging Face](/wiki/hugging_face), with more than 400 million cumulative downloads and over 100,000 variants across all releases [16].

Popular community contributions include:

- **Quantized versions:** Organizations like Unsloth have published 2-bit through 16-bit [GGUF](/wiki/gguf) quantizations of all Gemma model sizes, enabling deployment on consumer hardware with minimal quality loss.
- **Domain-specific fine-tunes:** Researchers and companies have created fine-tunes for specific domains including medical question answering, legal document analysis, multilingual translation, and customer support automation.
- **Alignment-tuned models:** Community members have applied techniques like [Direct Preference Optimization](/wiki/direct_preference_optimization_dpo) (DPO) and other alignment methods to create Gemma variants optimized for helpfulness, harmlessness, and honesty.
- **Framework support:** Gemma is supported across a wide range of inference and fine-tuning frameworks including [Hugging Face](/wiki/hugging_face) Transformers, [Ollama](/wiki/ollama), [llama.cpp](/wiki/llama_cpp), MLX (for Apple Silicon), [vLLM](/wiki/vllm), [SGLang](/wiki/sglang), NVIDIA NIM, Keras, and Google's own GenAI API [6].
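To give a feel for the trade-off behind the quantized versions mentioned above, the following sketch applies plain symmetric uniform quantization to synthetic weights at several bit-widths and measures the round-trip error. It is only an illustration: GGUF formats use more elaborate block-wise schemes, the Gaussian weights are made up, and `quantize_roundtrip` and `mean_abs_error` are hypothetical helper names.

```python
import random


def quantize_roundtrip(weights, bits):
    """Symmetric uniform quantization to `bits` bits, then dequantization."""
    levels = 2 ** (bits - 1) - 1            # e.g. 7 for 4-bit, 127 for 8-bit
    max_abs = max(abs(w) for w in weights) or 1.0
    scale = max_abs / levels                # one shared scale for the whole tensor
    return [round(w / scale) * scale for w in weights]


def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


if __name__ == "__main__":
    random.seed(0)
    # Synthetic stand-in for a weight tensor (not real Gemma weights).
    weights = [random.gauss(0.0, 0.02) for _ in range(4096)]
    for bits in (2, 4, 8):
        err = mean_abs_error(weights, quantize_roundtrip(weights, bits))
        print(f"{bits}-bit: mean abs error {err:.6f}")
```

The error shrinks as the bit-width grows, which is why very low bit-widths are typically reserved for the most memory-constrained hardware.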

Google has also released Gemma Scope, a set of [sparse autoencoders](/wiki/sparse_autoencoder) trained on Gemma 2 models, to support the [AI safety](/wiki/ai_safety) community's work on [mechanistic interpretability](/wiki/mechanistic_interpretability). Gemma Scope 2, released in 2025, expanded coverage to Gemma 3 models and provided deeper tools for understanding complex model behaviors [11].

## Comparison with Other Small Open Models

Gemma competes in the rapidly growing market for small-to-medium open-weight language models. The following table compares Gemma with its primary competitors at similar parameter counts:

| Model Family | Developer | Key Sizes | License | Multimodal | Max Context | Notable Strengths |
|---|---|---|---|---|---|---|
| Gemma 3 | [Google DeepMind](/wiki/google_deepmind) | 1B, 4B, 12B, 27B | Gemma Terms of Use | Vision + Text | 128K | Strong chat quality; 140+ languages; on-device variants |
| [Llama](/wiki/llama) 3.2 | [Meta](/wiki/meta) | 1B, 3B, 11B, 90B | Llama Community License | Vision + Text (11B, 90B) | 128K | Large ecosystem; strong code performance |
| [Mistral](/wiki/mistral) / Mixtral | [Mistral AI](/wiki/mistral_ai) | 7B, 8x7B, 8x22B | Apache 2.0 / Custom | Text only (base) | 32K-128K | Mixture-of-experts; fast inference |
| [Phi](/wiki/phi)-4 | [Microsoft](/wiki/microsoft) | 3.8B (mini), 14B | MIT | Text only (base) | 128K | Strong reasoning at small sizes; MIT license |
| [Qwen](/wiki/qwen) 2.5 | Alibaba | 0.5B to 72B | Apache 2.0 / Custom | Vision + Text (VL variants) | 128K | Multilingual; strong coding; wide size range |

### Benchmark Comparison (Small Models, Instruction-Tuned)

The following table compares instruction-tuned models in the sub-10B parameter range, a popular category for local and on-device deployment:

| Benchmark | Gemma 3 4B IT | [Phi](/wiki/phi)-4-mini (3.8B) | [Llama](/wiki/llama) 3.2 3B | [Qwen](/wiki/qwen) 2.5 7B |
|---|---|---|---|---|
| MMLU-Pro | 43.6 | 52.8 | N/A | 56.3 |
| HumanEval | 71.3 | N/A | 71.3 | 57.9 |
| GSM8K | 89.2 | 88.6 | 77.7 | 91.6 |
| ARC-c | 56.2 | 83.7 | 78.6 | N/A |
| Approx. RAM (Q4) | ~3 GB | ~2.5 GB | ~2 GB | ~5 GB |

At the 27B parameter level, Gemma 3 27B IT's LMArena Elo of 1338 placed it well ahead of similarly sized or even larger open models. Its combination of multimodal capabilities, 128K context, and broad language support gives it advantages in use cases requiring vision understanding or multilingual processing, while competitors like Phi-4-mini offer stronger reasoning at smaller sizes under a more permissive MIT license.
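An Elo gap can be translated into an expected head-to-head preference rate. The sketch below assumes the standard Elo logistic formula applies to the leaderboard scores, which is a simplification of how arena ratings are fitted; the rival rating of 1300 is a hypothetical value chosen only for illustration.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the standard Elo formula."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


GEMMA_3_27B_ELO = 1338         # from the text
HYPOTHETICAL_RIVAL_ELO = 1300  # assumption, not a real model's rating

if __name__ == "__main__":
    p = elo_expected_score(GEMMA_3_27B_ELO, HYPOTHETICAL_RIVAL_ELO)
    print(f"Expected preference rate: {p:.3f}")
```

Under this formula a gap of a few dozen points corresponds to a preference rate only modestly above 50%, so small rating differences should be read cautiously.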

## Is Gemma open source?

Gemma models are released under the Gemma Terms of Use, a custom license created by Google rather than a standard open-source license like MIT or Apache 2.0 [10]. Google says these "terms of use permit responsible commercial usage and distribution for all organizations, regardless of size" [1]. The license permits free use for individual developers, researchers, and commercial entities, including the right to redistribute and modify model weights. However, it includes several restrictions:

- A **Prohibited Use Policy** that forbids using the models for generating hate speech, malware, or other harmful content
- A requirement to **pass usage restrictions downstream** to any users of applications built on Gemma
- Google's right to **update the terms** or terminate access

While the Gemma license is permissive enough for the vast majority of commercial applications, it does not meet the strict definition of "open source" as defined by the [Open Source Initiative](/wiki/open_source_ai) (OSI). This distinction has been a point of discussion in the AI community, with some advocates arguing that licenses like Gemma's (and similar ones from Meta for Llama) create a grey area between fully open and proprietary models [10].

In practical terms, developers can use Gemma models to build and sell commercial products, deploy them on their own infrastructure, and modify them through fine-tuning or other techniques, as long as they comply with the usage restrictions.

## Impact and Adoption

Since its initial release, Gemma has become one of the most downloaded and used open model families in the AI community. The models are available through Hugging Face, Google's Vertex AI and AI Studio platforms, Kaggle, and numerous third-party inference providers. Cumulative downloads grew from 100 million at the one-year mark (February 2025) to more than 400 million by April 2026, accompanied by a Gemmaverse of over 100,000 community variants [16][17].

Gemma's impact extends beyond direct usage. The release of model weights has enabled academic researchers to study [transformer](/wiki/transformer) internals, develop new [fine-tuning](/wiki/fine_tuning) techniques, and create specialized models for domains ranging from healthcare to legal analysis. The Gemma Scope interpretability tools have become a resource for the [mechanistic interpretability](/wiki/mechanistic_interpretability) research community, helping researchers understand how language models represent and process information internally.

The on-device deployment story has also influenced the competitive landscape. Gemma 3n's success in running multimodal models with audio, video, and image understanding in under 3 GB of memory has raised the bar for what is expected from on-device AI models. Hardware partners including Qualcomm, MediaTek, and Samsung have integrated Gemma 3n optimizations into their chipset software stacks, signaling that on-device open models are becoming a mainstream deployment target rather than a niche use case [6].

The progression from Gemma 1 to Gemma 4 illustrates a broader industry trend: the frontier of what is possible with small, locally runnable models is advancing rapidly, driven by improvements in training data, architecture, distillation techniques, and post-training optimization. Each generation of Gemma has narrowed the gap between open and proprietary models, making capable AI more accessible to developers and researchers worldwide.

## See Also

- [RecurrentGemma](/wiki/recurrentgemma)
- [Gemini](/wiki/gemini)
- [Llama](/wiki/llama)
- [Mistral](/wiki/mistral)
- [Phi](/wiki/phi)
- [Qwen](/wiki/qwen)
- [Knowledge Distillation](/wiki/knowledge_distillation)
- [Open-Weight Models](/wiki/open_weight_models)
- [Vision Transformer](/wiki/vision_transformer)
- [Edge Computing](/wiki/edge_computing)

## References

1. Google. "Gemma: Google introduces new state-of-the-art open models." Google Blog, February 21, 2024. https://blog.google/technology/developers/gemma-open-models/
2. Google DeepMind. "Gemma: Open Models Based on Gemini Research and Technology." Technical Report, February 2024. https://ai.google.dev/gemma/docs
3. Gemma Team, Google DeepMind. "Gemma 2: Improving Open Language Models at a Practical Size." arXiv:2408.00118, June 2024. https://arxiv.org/abs/2408.00118
4. Gemma Team, Google DeepMind. "Gemma 3 Technical Report." arXiv:2503.19786, March 2025. https://arxiv.org/abs/2503.19786
5. Hugging Face. "Welcome Gemma 3: Google's all new multimodal, multilingual, long context open LLM." Hugging Face Blog, March 2025. https://huggingface.co/blog/gemma3
6. Google Developers Blog. "Introducing Gemma 3n: The developer guide." June 2025. https://developers.googleblog.com/en/introducing-gemma-3n-developer-guide/
7. Google DeepMind. "Gemma models overview." Google AI for Developers. https://ai.google.dev/gemma/docs
8. Google DeepMind. "Gemma: PaliGemma." Google DeepMind. https://deepmind.google/models/gemma/
9. Google DeepMind. "ShieldGemma." Google DeepMind. https://deepmind.google/models/gemma/
10. Google. "Gemma Terms of Use." Google AI for Developers. https://ai.google.dev/gemma/terms
11. Google DeepMind. "Gemma Scope 2: Helping the AI Safety Community Deepen Understanding of Complex Language Model Behavior." Google DeepMind Blog. https://deepmind.google/blog/gemma-scope-2-helping-the-ai-safety-community-deepen-understanding-of-complex-language-model-behavior/
12. Google Developers Blog. "Introducing Gemma 3 270M: The compact model for hyper-efficient AI." August 14, 2025. https://developers.googleblog.com/introducing-gemma-3-270m/
13. Google Blog. "FunctionGemma: Bringing bespoke function calling to the edge." 2025. https://blog.google/innovation-and-ai/technology/developers-tools/functiongemma/
14. Google Developers Blog. "Smaller, Safer, More Transparent: Advancing Responsible AI with Gemma." 2024. https://developers.googleblog.com/en/smaller-safer-more-transparent-advancing-responsible-ai-with-gemma/
15. Google Developers Blog. "Gemma 3 on mobile and web with Google AI Edge." 2025. https://developers.googleblog.com/en/gemma-3-on-mobile-and-web-with-google-ai-edge/
16. Google Blog. "Gemma 4: Byte for byte, the most capable open models." Google Blog, April 2, 2026. https://blog.google/innovation-and-ai/technology/developers-tools/gemma-4/
17. TechCrunch. "Google's Gemma AI models surpass 150M downloads." May 12, 2025. https://techcrunch.com/2025/05/12/googles-gemma-ai-models-surpass-150m-downloads/

