Gemma is a family of open-weight large language models developed by Google DeepMind and released under a permissive license that allows free use for most purposes. Named after the Latin word for "precious stone," the Gemma models are built from the same research and technology that underpin Google's larger Gemini models, but are designed to be small enough to run on a single GPU, a laptop, or even a smartphone. Since the first release in February 2024, the Gemma family has expanded across three major generations and several specialized variants, establishing itself as one of the leading open model families competing with Meta's Llama, Mistral AI's Mistral series, and Microsoft's Phi models.
Google DeepMind introduced Gemma on February 21, 2024, alongside a blog post emphasizing the company's commitment to making capable AI models available to the broader developer and research community [1]. The motivation behind Gemma was straightforward: while frontier models like Gemini Ultra and Gemini Pro deliver state-of-the-art performance, their size and computational requirements put them out of reach for many researchers, independent developers, and organizations that need to run models locally or on constrained hardware. Gemma fills that gap by distilling key insights from Gemini research into models with parameter counts ranging from 270 million to 27 billion.
All Gemma models are released with both pre-trained (base) and instruction-tuned variants. The instruction-tuned versions have undergone additional training with supervised fine-tuning on demonstration data and reinforcement learning from human feedback (RLHF) to make them more helpful and safer for conversational use. Model weights are distributed through platforms like Hugging Face, Kaggle, and Google's own Vertex AI, with support for popular frameworks including PyTorch, JAX, and Keras.
The table below summarizes all major releases in the Gemma family:
| Release | Date | Model Sizes | Key Features |
|---|---|---|---|
| Gemma 1 | February 21, 2024 | 2B, 7B | First open-weight release; 8K context; MQA/MHA |
| Gemma 2 | June 27, 2024 | 2B, 9B, 27B | Knowledge distillation; GQA; sliding window attention |
| Gemma 3 | March 12, 2025 | 1B, 4B, 12B, 27B | Multimodal vision; 128K context; 140+ languages |
| Gemma 3 270M | August 14, 2025 | 270M | Ultra-compact; on-device fine-tuning |
| Gemma 3n | June 26, 2025 | E2B, E4B | MatFormer architecture; on-device; audio/video input |
The first generation of Gemma was released on February 21, 2024, in two sizes: 2 billion (2B) and 7 billion (7B) parameters [1]. Both models use a decoder-only transformer architecture with several modifications drawn from the Gemini research program.
Gemma 1 incorporates four notable architectural features that distinguish it from a vanilla transformer: multi-query attention (MQA) in the 2B model (the 7B retains standard multi-head attention), rotary position embeddings (RoPE) in place of absolute positional encodings, GeGLU activations in place of the standard ReLU feed-forward, and RMSNorm for layer normalization.
The detailed architecture specifications for Gemma 1 are shown below:
| Parameter | Gemma 2B | Gemma 7B |
|---|---|---|
| Layers | 18 | 28 |
| Hidden Dimension (d_model) | 2,048 | 3,072 |
| Intermediate Size (FFN) | 32,768 | 49,152 |
| Attention Heads | 8 | 16 |
| KV Heads | 1 (MQA) | 16 (MHA) |
| Head Dimension | 256 | 256 |
| Vocabulary Size | 256,128 | 256,128 |
| Context Length | 8,192 | 8,192 |
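The practical consequence of the 2B model's multi-query attention is a much smaller KV cache during inference. The sketch below estimates per-sequence cache sizes from the table above; the 2-byte (bfloat16) cache entries are an assumption, not a documented detail:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_entry=2):
    """Per-sequence KV cache: 2 tensors (K and V) x layers x KV heads
    x head dimension x sequence length x bytes per entry."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_entry

# Gemma 2B uses MQA (a single KV head); Gemma 7B uses MHA (16 KV heads).
gemma_2b = kv_cache_bytes(layers=18, kv_heads=1, head_dim=256, seq_len=8192)
gemma_7b = kv_cache_bytes(layers=28, kv_heads=16, head_dim=256, seq_len=8192)
print(f"Gemma 2B: {gemma_2b / 2**20:.0f} MiB, Gemma 7B: {gemma_7b / 2**20:.0f} MiB")
```

Under these assumptions the full 8K context costs roughly 144 MiB of cache for the 2B model versus about 3.5 GiB for the 7B, which is why MQA matters on memory-constrained hardware.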
The 2B model was trained on 2 trillion tokens and the 7B model on 6 trillion tokens. The training data consists primarily of web documents, code, and mathematics content, filtered for quality and safety. Google did not release the full details of the training data composition but noted that extensive filtering was applied to remove personally identifiable information and other sensitive content [1]. Both models use a SentencePiece tokenizer with a vocabulary of 256,128 tokens, shared with the Gemini model family.
At launch, Gemma 1 models demonstrated strong performance relative to their size. The 7B model outperformed Llama 2 7B and Mistral 7B on multiple academic benchmarks [2]. In particular, Gemma 7B showed notable gains in mathematical reasoning (GSM8K, MATH) and code generation (HumanEval), areas where earlier open models at this scale had struggled.
| Benchmark | Gemma 7B | Llama 2 7B | Mistral 7B |
|---|---|---|---|
| MMLU (5-shot) | 64.3% | 45.3% | 62.5% |
| HumanEval | 32.3% | 12.8% | 26.2% |
| GSM8K | 46.4% | 14.6% | 35.4% |
| MATH | 24.3% | 2.5% | 12.7% |
| HellaSwag | 82.3% | 77.2% | 81.3% |
Google's benchmarking with the MaxText reference implementation also showed up to 3x better training performance-per-dollar for Gemma 7B than for Llama 2 7B on Google Cloud infrastructure [2].
Google DeepMind released Gemma 2 on June 27, 2024, with a focus on improving performance at practical model sizes. The second generation was available in three sizes: 2B, 9B, and 27B parameters [3]. The paper describing Gemma 2, titled "Gemma 2: Improving Open Language Models at a Practical Size," emphasized architectural innovations aimed at maximizing quality-per-parameter.
Gemma 2 introduced several improvements over the first generation: grouped-query attention (GQA) across all sizes, alternating local sliding-window (4,096-token) and global (8,192-token) attention layers, logit soft capping, and knowledge distillation for the 2B and 9B models.
The detailed specifications for all three Gemma 2 sizes are:
| Parameter | Gemma 2 2B | Gemma 2 9B | Gemma 2 27B |
|---|---|---|---|
| Layers | 26 | 42 | 46 |
| Hidden Dimension | 2,304 | 3,584 | 4,608 |
| Attention Heads | 8 | 16 | 32 |
| KV Heads | 4 | 8 | 16 |
| Local Attention Window | 4,096 | 4,096 | 4,096 |
| Global Attention Span | 8,192 | 8,192 | 8,192 |
| Training Tokens | 2T | 8T | 13T |
| Vocabulary Size | 256,128 | 256,128 | 256,128 |
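The alternation between local and global attention layers can be made concrete with a toy mask predicate. This is a simplification (the actual ordering of local and global layers within the stack is not spelled out here), but it captures what each layer type is allowed to see:

```python
def can_attend(q, k, window=None):
    """True if query position q may attend to key position k under a
    causal mask, optionally restricted to a sliding window of `window`."""
    if k > q:
        return False              # causal: never attend to the future
    return window is None or q - k < window

# Gemma 2: local layers use a 4,096-token sliding window,
# global layers attend over the full 8,192-token context.
print(can_attend(6000, 100, window=4096))  # local layer: token 100 is too far back
print(can_attend(6000, 100))               # global layer: still visible
```

Because half the layers only ever look at the last 4,096 positions, their KV caches and attention computations stay bounded even as the sequence grows toward the 8,192-token limit.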
The 27B model was trained from scratch on 13 trillion tokens without distillation, while the 2B and 9B models were trained with knowledge distillation from a larger teacher, on token budgets far beyond the compute-optimal quantity predicted by scaling laws (roughly 50x for the 2B) [3]. This "over-training" strategy, combined with distillation, allowed the smaller models to punch well above their weight class on benchmarks.
Gemma 2 delivered substantial improvements across benchmarks, with the 27B model competing against models significantly larger in parameter count.
| Benchmark | Gemma 2 2B | Gemma 2 9B | Gemma 2 27B |
|---|---|---|---|
| MMLU (5-shot) | 52.2% | 71.3% | 75.2% |
| HellaSwag (10-shot) | 72.9% | 81.9% | 86.4% |
| GSM8K | 23.9% | 68.6% | 74.0% |
| ARC-c | 55.4% | 68.4% | 71.4% |
| Winogrande | 70.9% | 80.6% | 83.7% |
On the LMSys Chatbot Arena leaderboard, the Gemma 2 27B instruction-tuned model achieved an Elo score of 1218, surpassing Llama 3 70B (Elo 1206), a model more than twice its size [3]. Memorization analyses also showed that Gemma 2 models emit training data verbatim far less often than prior models, with verbatim memorization rates below 0.1%.
Gemma 3 was released on March 12, 2025, representing the most significant expansion of the family to date. It introduced four model sizes (1B, 4B, 12B, and 27B), multimodal capabilities for vision and text understanding, support for over 140 languages, and context windows of up to 128,000 tokens [4].
The headline feature of Gemma 3 is native multimodal support. The 4B, 12B, and 27B models can process both images and text as input, while the 1B model remains text-only due to its compact size. Image understanding is enabled through a 400M-parameter variant of the SigLIP vision encoder, a Vision Transformer (ViT) trained with a variant of the CLIP contrastive loss [4].
The vision encoder takes square images resized to 896 x 896 pixels and encodes them into a sequence of visual tokens. These tokens are then condensed into a fixed set of 256 image token vectors before being fed into the language model alongside text tokens. This condensation step keeps computational costs manageable even when processing multiple images within a single prompt.
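The condensation step amounts to spatial pooling over the encoder's patch grid. The sketch below assumes a ViT with 14x14 patches (so an 896x896 input yields 64x64 = 4,096 patch tokens) and a 4x4 average pool down to 16x16 = 256 tokens; the patch size and the 1,152-dimensional embedding width are assumptions for illustration, not documented values:

```python
import numpy as np

patch, img, dim = 14, 896, 1152            # assumed patch size and encoder width
side = img // patch                        # 64 patches per side
tokens = np.random.randn(side, side, dim)  # stand-in for the ViT's output grid
# Average-pool each 4x4 neighborhood of patch tokens into one image token.
pooled = tokens.reshape(16, 4, 16, 4, dim).mean(axis=(1, 3))
flat = pooled.reshape(-1, dim)             # sequence fed to the language model
print(flat.shape)
```

Whatever the exact mechanism, the key property is the fixed budget: every image costs the language model exactly 256 token positions, regardless of its original resolution.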
For images with non-standard aspect ratios, Gemma 3 employs a Pan and Scan (P&S) method inspired by LLaVA. This approach segments images into non-overlapping crops of equal size that cover the entire image, resizes each crop to 896 x 896 pixels, and processes them individually through the encoder. The result is that Gemma 3 can handle images of varying resolutions and aspect ratios without distorting or losing important details [4].
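The cropping logic can be sketched as choosing the smallest grid of equal, non-overlapping tiles that covers the image; each tile is then resized to 896x896 and encoded separately. The exact crop-selection heuristics of P&S are not specified here, so this is only illustrative:

```python
import math

def pan_and_scan_crops(width, height, crop=896):
    """Split an image into the smallest grid of equal, non-overlapping
    crops that covers it; each crop (x0, y0, x1, y1) would then be
    resized to crop x crop before encoding. Illustrative sketch only."""
    nx = max(1, math.ceil(width / crop))
    ny = max(1, math.ceil(height / crop))
    return [(ix * width // nx, iy * height // ny,
             (ix + 1) * width // nx, (iy + 1) * height // ny)
            for iy in range(ny) for ix in range(nx)]

# A wide 1792x896 panorama becomes two square crops side by side.
print(pan_and_scan_crops(1792, 896))
```

A square image that already fits yields a single crop, so well-behaved inputs pay no extra encoding cost.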
This allows Gemma 3 to perform tasks like image captioning, visual question answering, document understanding, chart interpretation, and optical character recognition.
Gemma 3 uses a decoder-only transformer architecture with Grouped-Query Attention (GQA) and RMSNorm, consistent with Gemma 2. A key change from Gemma 2 is the replacement of logit soft capping with QK-norm (query-key normalization), which normalizes query and key vectors before computing attention scores [4].
Gemma 3 dramatically increased the context window compared to Gemma 2's 8K limit. The 1B model supports 32,768 tokens, while the 4B, 12B, and 27B models support 128,000 tokens [4]. This 16x increase is achieved through an interleaved attention pattern: for every global attention layer there are 5 local attention layers. Local layers use a sliding window of just 1,024 tokens, while global layers attend to the full context. This design significantly reduces the computational and memory cost of long-context processing, since most layers only attend within a small window. To support the longer context, the RoPE base frequency of the global layers was increased from 10,000 to 1,000,000, while local layers keep the 10,000 base.
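The KV-cache savings of the 5:1 interleaving can be quantified with a short calculation. The 48-layer configuration below is chosen for illustration (it divides evenly into the 6-layer pattern), not taken from the model specs:

```python
def gemma3_kv_tokens(num_layers, context, window=1024, pattern=6):
    """KV entries per head under Gemma 3's 5-local:1-global interleaving
    (every `pattern`-th layer global), vs. making every layer global."""
    n_global = num_layers // pattern
    n_local = num_layers - n_global
    interleaved = n_global * context + n_local * min(window, context)
    all_global = num_layers * context
    return interleaved, all_global

inter, full = gemma3_kv_tokens(num_layers=48, context=128_000)
print(f"KV entries: {inter:,} vs {full:,} ({full / inter:.1f}x reduction)")
```

At the full 128K context nearly all of the cache belongs to the handful of global layers, which is what makes the long window affordable on a single accelerator.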
The full architecture specifications for each Gemma 3 variant:
| Parameter | Gemma 3 1B | Gemma 3 4B | Gemma 3 12B | Gemma 3 27B |
|---|---|---|---|---|
| Embedding Parameters | 302M | 675M | 1,012M | 1,416M |
| Non-embedding Parameters | 698M | 3,209M | 10,759M | 25,600M |
| Total Parameters | 1B | ~3.9B | ~11.8B | ~27B |
| Context Window | 32K | 128K | 128K | 128K |
| Vocabulary Size | 262,144 | 262,144 | 262,144 | 262,144 |
| Multimodal | Text only | Vision + Text | Vision + Text | Vision + Text |
The training data volume increased substantially across all model sizes compared to previous generations:
| Model | Parameters | Training Tokens | Context Window | Multimodal |
|---|---|---|---|---|
| Gemma 3 1B | 1 billion | 2 trillion | 32K | Text only |
| Gemma 3 4B | 4 billion | 4 trillion | 128K | Vision + Text |
| Gemma 3 12B | 12 billion | 12 trillion | 128K | Vision + Text |
| Gemma 3 27B | 27 billion | 14 trillion | 128K | Vision + Text |
The training data includes web documents, code, mathematics, science articles, and multilingual content spanning over 140 languages. Compared to Gemma 2, the 27B model was trained on 14 trillion tokens (up from 13 trillion), and the midsize models saw even larger relative increases in data volume [4].
Gemma 3 achieved remarkable benchmark results across all sizes. The instruction-tuned models showed large improvements over Gemma 2, particularly in mathematical reasoning and code generation:
| Benchmark | Gemma 3 1B IT | Gemma 3 4B IT | Gemma 3 12B IT | Gemma 3 27B IT |
|---|---|---|---|---|
| MMLU | 38.8% | 58.1% | 71.9% | 76.9% |
| MMLU-Pro | 14.7% | 43.6% | 60.6% | 67.5% |
| HumanEval | 41.5% | 71.3% | 85.4% | 87.8% |
| GSM8K | 62.8% | 89.2% | 94.4% | 95.9% |
| MATH | 48.0% | 75.6% | 83.8% | 89.0% |
| HellaSwag | 62.3% | 77.2% | 84.2% | 85.6% |
| LiveCodeBench | 1.9% | 12.6% | 24.6% | 29.7% |
| GPQA Diamond | 19.2% | 30.8% | 40.9% | 42.4% |
On the LMSys Chatbot Arena, the Gemma 3 27B instruction-tuned model scored an Elo of 1338, placing it at rank 9 overall and above much larger models such as DeepSeek-V3 (1318), Llama 3 405B (1257), and Qwen 2.5 70B (1257) [5]. This performance level, achieved with a model small enough to run on a single GPU, represented a significant milestone for the open model ecosystem.
On August 14, 2025, Google released Gemma 3 270M, the smallest model in the Gemma family [12]. With just 270 million parameters (170 million embedding parameters and 100 million transformer block parameters), it is designed for ultra-efficient on-device deployment and task-specific fine-tuning. Despite its compact size, Gemma 3 270M demonstrates strong instruction-following capabilities as measured by the IFEval benchmark. Internal testing on a Pixel 9 Pro showed the INT4-quantized model consumed only 0.75% of battery life over 25 conversations, making it one of the most power-efficient language models available. Google also released FunctionGemma, a specialized fine-tune of the 270M model for function calling, enabling on-device agents to translate natural-language commands into structured API calls [13].
Gemma 3n is a variant of the Gemma family specifically optimized for on-device and edge computing deployment. Previewed at Google I/O 2025 and fully released on June 26, 2025, Gemma 3n introduces architectural innovations that allow powerful models to run with minimal memory footprints on smartphones, tablets, and other resource-constrained devices [6].
The key innovation in Gemma 3n is the MatFormer (Matryoshka Transformer) architecture, a novel nested transformer design built for elastic inference. Like Russian nesting dolls (Matryoshka dolls), a MatFormer model contains smaller, fully functional sub-models within its parameter space. During training of the E4B (4 billion effective parameter) model, a smaller E2B (2 billion effective parameter) sub-model is simultaneously optimized within it. This allows a single trained model to be deployed at multiple compute and memory levels without retraining, providing flexibility for devices with different capabilities [6].
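The nesting idea can be illustrated with a toy feed-forward layer in which the smaller sub-model reuses a prefix slice of the larger model's hidden dimension. This is a conceptual sketch, not the released weights: the dimensions are arbitrary and ReLU stands in for Gemma's GeGLU activation:

```python
import numpy as np

d_model, d_ffn_full, d_ffn_sub = 512, 2048, 1024
W_in = np.random.randn(d_model, d_ffn_full) * 0.02
W_out = np.random.randn(d_ffn_full, d_model) * 0.02

def ffn(x, hidden):
    """Run the FFN using only the first `hidden` columns/rows of the
    shared weight matrices -- the Matryoshka-style nested sub-model."""
    h = np.maximum(x @ W_in[:, :hidden], 0.0)   # ReLU stand-in for GeGLU
    return h @ W_out[:hidden, :]

x = np.random.randn(1, d_model)
full = ffn(x, d_ffn_full)   # "E4B"-style full-width path
sub = ffn(x, d_ffn_sub)     # nested "E2B"-style path, same weights
print(full.shape, sub.shape)
```

Because both paths read from the same matrices, training the full model while jointly optimizing the prefix slice yields two deployable models for the cost of one set of weights.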
Developers can use Gemma 3n in two modes: deploy the full E4B model for maximum quality, or use the standalone E2B sub-model (or a custom size between the two, assembled with the Mix-n-Match technique) for faster, lighter inference.
The second major innovation in Gemma 3n is Per-Layer Embeddings (PLE), a technique that dramatically reduces accelerator memory (GPU/TPU VRAM) usage. In a standard transformer, the embedding matrix is loaded into high-speed accelerator memory. PLE instead associates separate embedding parameters with each transformer layer and stores them in regular CPU memory. Only the core transformer weights need to reside in accelerator memory, which is the bottleneck for on-device deployment. As a result, while the raw parameter counts for Gemma 3n are 5 billion (E2B) and 8 billion (E4B), the effective accelerator memory footprint is comparable to traditional 2B and 4B models [6].
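The accounting behind PLE is simple. The sketch below assumes int8 (1-byte) weights and an illustrative split in which roughly 3 billion of the E2B model's 5 billion raw parameters are per-layer embeddings offloaded to CPU RAM; both figures are assumptions chosen to match the reported footprint, not published numbers:

```python
def accelerator_footprint_gb(raw_params_b, ple_params_b, bytes_per_param=1):
    """Accelerator memory with Per-Layer Embeddings held in CPU RAM:
    only the remaining transformer weights occupy GPU/TPU memory."""
    return (raw_params_b - ple_params_b) * 1e9 * bytes_per_param / 1e9

# Illustrative split (assumption): ~3B of E2B's 5B raw parameters
# live in ordinary CPU memory as per-layer embeddings.
print(f"E2B: ~{accelerator_footprint_gb(5, 3):.0f} GB in accelerator memory")
```

This is how a 5B-parameter model can present the accelerator footprint of a traditional 2B model, as the table below reports.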
| Specification | E2B | E4B |
|---|---|---|
| Raw Parameter Count | 5 billion | 8 billion |
| Effective Parameters | ~2 billion | ~4 billion |
| Accelerator Memory | ~2 GB | ~3 GB |
| LMArena Score | N/A | >1,300 |
| Modalities (Input) | Text, image, audio, video | Text, image, audio, video |
| Modalities (Output) | Text | Text |
| Language Support (Text) | 140 languages | 140 languages |
| Language Support (Multimodal) | 35 languages | 35 languages |
The E4B model became the first model under 10 billion raw parameters to exceed an LMArena score of 1,300, a milestone that underscored the effectiveness of the MatFormer and PLE innovations [6].
Unlike Gemma 3, which supports only vision and text inputs, Gemma 3n expands multimodal support to audio and video in addition to images and text: vision is handled by a new MobileNet-V5 encoder optimized for on-device efficiency, audio by an encoder based on the Universal Speech Model (USM) that enables speech recognition and translation, and video understanding builds on the combined image and audio streams.
Gemma 3n introduces KV cache sharing to optimize prefill performance for long-context inputs. This technique delivers approximately 2x improvement on prefill performance compared to Gemma 3 4B, which is critical for responsive on-device inference where users expect near-instant replies [6].
The progression from Gemma 1 to Gemma 3n shows a clear trajectory of architectural refinement:
| Feature | Gemma 1 | Gemma 2 | Gemma 3 | Gemma 3n |
|---|---|---|---|---|
| Attention Type | MQA (2B) / MHA (7B) | GQA (all sizes) | GQA + interleaved local/global | GQA + MatFormer elastic |
| Position Encoding | RoPE (base 10K) | RoPE (base 10K) | RoPE (base 1M) | RoPE (base 1M) |
| Normalization | RMSNorm | RMSNorm | RMSNorm + QK-norm | RMSNorm + QK-norm |
| Activation | GeGLU | GeGLU | GeGLU | GeGLU |
| Max Context | 8,192 | 8,192 | 128,000 | 128,000 |
| Distillation | None | On-policy (2B, 9B) | Yes (all sizes) | Nested (MatFormer) |
| Multimodal | No | No | Vision (SigLIP 400M) | Vision (MobileNet-V5), Audio (USM), Video |
| Vocabulary | 256,128 | 256,128 | 262,144 | 262,144 |
All Gemma models are trained on Google's proprietary data mixture, which Google has described in general terms but has not released publicly. The training data includes web documents in English and, from Gemma 3 onward, more than 140 other languages; source code; mathematics and science content; and, for the multimodal models, images paired with text.
The total training compute increased substantially with each generation. Gemma 1's 7B model was trained on 6 trillion tokens, Gemma 2's 27B model on 13 trillion, and Gemma 3's 27B model on 14 trillion. The smaller Gemma 2 models were "over-trained" relative to Chinchilla scaling-law predictions: the 2B saw roughly 50x its compute-optimal token budget, and the 9B, at 8 trillion tokens, more than 40x [3].
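These over-training ratios follow from the Chinchilla rule of thumb of roughly 20 training tokens per parameter (an approximation, not an exact law):

```python
def overtraining_ratio(params_b, tokens_t, tokens_per_param=20):
    """Ratio of actual training tokens to a Chinchilla-style
    compute-optimal budget of ~20 tokens per parameter."""
    optimal = params_b * 1e9 * tokens_per_param
    return tokens_t * 1e12 / optimal

for name, p, t in [("Gemma 2 2B", 2, 2), ("Gemma 2 9B", 9, 8), ("Gemma 3 27B", 27, 14)]:
    print(f"{name}: {overtraining_ratio(p, t):.0f}x compute-optimal")
```

Over-training trades extra training compute for a smaller, cheaper-to-serve model, which is exactly the trade-off an open family aimed at single-GPU deployment wants to make.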
Google applied extensive safety filtering during data preparation, including the removal of child sexual abuse material (CSAM), personally identifiable information, and content that violates Google's policies. The exact composition and proportions of the training data have not been disclosed, which has been a point of criticism from researchers who argue that full data transparency is necessary for reproducible science.
Beyond the core Gemma models, Google DeepMind has released several task-specific variants that build on the Gemma architecture.
CodeGemma is a family of models specialized for code generation and completion tasks. Released alongside the first Gemma generation, CodeGemma models support multiple programming languages including Python, Java, C++, JavaScript, and more. The models are available in sizes that mirror the base Gemma lineup and are designed for both code completion (fill-in-the-middle) and general coding assistance [7].
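Fill-in-the-middle works by rearranging a file around special tags so the model completes the gap. The sketch below uses the prefix-suffix-middle (PSM) tag names from the CodeGemma model card; verify them against the tokenizer you actually load before relying on this format:

```python
def fim_prompt(prefix, suffix):
    """Build a fill-in-the-middle prompt in PSM order; the model
    generates the missing middle after the <|fim_middle|> tag."""
    return f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>"

prompt = fim_prompt("def add(a, b):\n    return ", "\n\nprint(add(2, 3))")
print(prompt)
```

An editor plugin would send this prompt with the cursor position splitting the file into prefix and suffix, then splice the generated middle back at the cursor.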
PaliGemma is a vision-language model that combines a SigLIP image encoder with a Gemma language model decoder. It is specifically designed for fine-tuning on visual understanding tasks such as image captioning, object detection, and document understanding. PaliGemma 2, released alongside Gemma 2, expanded support to multiple image resolutions and additional model sizes [8].
ShieldGemma is a safety classifier built on the Gemma architecture. ShieldGemma 2, a 4B parameter model built on Gemma 3, functions as an image safety classifier that can identify potentially harmful content across three categories: dangerous content, sexually explicit material, and violence. It is intended for use as a guardrail in production applications that process user-generated or model-generated images [9].
RecurrentGemma is a variant that replaces the standard transformer attention mechanism with a linear recurrence based on the Griffin architecture. Available in 2B and 9B parameter sizes, it offers faster inference at long sequence lengths due to the constant memory footprint of recurrent computation, though it trades some quality for this efficiency gain.
FunctionGemma is a 270M parameter model fine-tuned from Gemma 3 270M for function calling tasks. It translates natural-language user commands into structured API or tool calls, enabling on-device agents that can control mobile applications, IoT devices, and other tools without sending data to the cloud [13].
| Variant | Size(s) | Purpose | Key Features |
|---|---|---|---|
| CodeGemma | 2B, 7B | Code generation and completion | Multi-language support; fill-in-the-middle capability |
| PaliGemma / PaliGemma 2 | Multiple | Vision-language tasks | Fine-tunable for image understanding; multi-resolution |
| ShieldGemma 2 | 4B | Image safety classification | Classifies dangerous, explicit, and violent content |
| RecurrentGemma | 2B, 9B | Efficient long-sequence inference | Griffin linear recurrence; constant memory |
| Gemma 3 270M | 270M | On-device fine-tuning | Ultra-compact; 0.75% battery per 25 conversations |
| FunctionGemma | 270M | Function calling | Structured API calls from natural language |
| Gemma 3n | E2B, E4B | On-device deployment | MatFormer architecture; multimodal; ultra-low memory |
Google released the Responsible Generative AI Toolkit alongside the Gemma models to help developers build safe and responsible applications. The toolkit's resources include safety classifiers such as ShieldGemma, guidance for defining application-level safety policies, and tools for evaluating and debugging model behavior [14]:
The toolkit encourages a holistic approach to responsible AI that addresses safety, privacy, fairness, and accountability at both the model and application levels [14].
Google AI Edge is the primary platform for deploying Gemma models on mobile and edge devices. The SDK provides optimized inference runtimes for Android, iOS, and web applications [15].
Key deployment capabilities include quantized inference (down to INT4), CPU and GPU acceleration, and cross-platform runtimes for Android, iOS, and the web.
The on-device deployment capabilities are significant for privacy-sensitive applications, since data never needs to leave the user's device. Healthcare, finance, and enterprise applications in particular benefit from the ability to run inference locally.
The Gemma family has generated a large ecosystem of community-created fine-tunes, quantizations, and adaptations. As of early 2026, Gemma models are among the most downloaded model families on Hugging Face, with millions of cumulative downloads across all variants.
Popular community contributions include quantized GGUF builds for local inference engines such as llama.cpp and Ollama, parameter-efficient fine-tunes built with LoRA and QLoRA, and domain-specialized adaptations for fields such as medicine, law, and regional languages.
Google has also released Gemma Scope, a suite of sparse autoencoders trained on the Gemma 2 models, to support the AI safety community's work on mechanistic interpretability. Gemma Scope 2, released in 2025, expanded this coverage and provided deeper tools for understanding complex model behaviors [11].
Gemma competes in the rapidly growing market for small-to-medium open-weight language models. The following table compares Gemma with its primary competitors at similar parameter counts:
| Model Family | Developer | Key Sizes | License | Multimodal | Max Context | Notable Strengths |
|---|---|---|---|---|---|---|
| Gemma 3 | Google DeepMind | 1B, 4B, 12B, 27B | Gemma Terms of Use | Vision + Text | 128K | Strong chat quality; 140+ languages; on-device variants |
| Llama 3.2 | Meta | 1B, 3B, 11B, 90B | Llama Community License | Vision + Text (11B, 90B) | 128K | Large ecosystem; strong code performance |
| Mistral / Mixtral | Mistral AI | 7B, 8x7B, 8x22B | Apache 2.0 / Custom | Text only (base) | 32K-128K | Mixture-of-experts; fast inference |
| Phi-4 | Microsoft | 3.8B (mini), 14B | MIT | Text only (base) | 128K | Strong reasoning at small sizes; MIT license |
| Qwen 2.5 | Alibaba | 0.5B to 72B | Apache 2.0 / Custom | Vision + Text (VL variants) | 128K | Multilingual; strong coding; wide size range |
The following table compares instruction-tuned models in the sub-10B parameter range, a popular category for local and on-device deployment:
| Benchmark | Gemma 3 4B IT | Phi-4-mini (3.8B) | Llama 3.2 3B | Qwen 2.5 7B |
|---|---|---|---|---|
| MMLU-Pro | 43.6 | 52.8 | N/A | 56.3 |
| HumanEval | 71.3 | N/A | 71.3 | 57.9 |
| GSM8K | 89.2 | 88.6 | 77.7 | 91.6 |
| ARC-c | 56.2 | 83.7 | 78.6 | N/A |
| Approx. RAM (Q4) | ~3 GB | ~2.5 GB | ~2 GB | ~5 GB |
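The RAM figures above can be sanity-checked with a back-of-the-envelope estimate: 4-bit weights cost half a byte per parameter, plus some overhead. The 15% overhead factor and the approximate parameter counts below are assumptions; real footprints also depend on the quantization format and the KV cache:

```python
def quantized_ram_gb(params_b, bits=4, overhead=1.15):
    """Rough RAM to hold quantized weights, with ~15% assumed overhead
    for quantization scales, activations, and runtime buffers."""
    return params_b * 1e9 * bits / 8 * overhead / 1e9

for name, p in [("Gemma 3 4B", 3.9), ("Llama 3.2 3B", 3.2), ("Qwen 2.5 7B", 7.6)]:
    print(f"{name}: ~{quantized_ram_gb(p):.1f} GB at 4-bit")
```

The estimates land close to the table's values, with the remaining gap largely attributable to context-dependent KV cache memory.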
At the 27B parameter level, Gemma 3 27B IT's LMArena Elo of 1338 placed it well ahead of similarly sized or even larger open models. Its combination of multimodal capabilities, 128K context, and broad language support gives it advantages in use cases requiring vision understanding or multilingual processing, while competitors like Phi-4-mini offer stronger reasoning at smaller sizes under a more permissive MIT license.
Gemma models are released under the Gemma Terms of Use, a custom license created by Google rather than a standard open-source license like MIT or Apache 2.0 [10]. The license permits free use for individual developers, researchers, and commercial entities, including the right to redistribute and modify model weights. However, it includes several restrictions: users must comply with Google's Gemma Prohibited Use Policy, which bars harmful applications, and anyone who redistributes the models or derivatives must pass these use restrictions on to downstream recipients.
While the Gemma license is permissive enough for the vast majority of commercial applications, it does not meet the strict definition of "open source" as defined by the Open Source Initiative (OSI). This distinction has been a point of discussion in the AI community, with some advocates arguing that licenses like Gemma's (and similar ones from Meta for Llama) create a grey area between fully open and proprietary models [10].
In practical terms, developers can use Gemma models to build and sell commercial products, deploy them on their own infrastructure, and modify them through fine-tuning or other techniques, as long as they comply with the usage restrictions.
Since its initial release, Gemma has become one of the most downloaded and used open model families in the AI community. The models are available through Hugging Face (where they have been downloaded millions of times), Google's Vertex AI and AI Studio platforms, Kaggle, and numerous third-party inference providers.
Gemma's impact extends beyond direct usage. The release of model weights has enabled academic researchers to study transformer internals, develop new fine-tuning techniques, and create specialized models for domains ranging from healthcare to legal analysis. The Gemma Scope interpretability tools have become a resource for the mechanistic interpretability research community, helping researchers understand how language models represent and process information internally.
The on-device deployment story has also influenced the competitive landscape. Gemma 3n's success in running multimodal models with audio, video, and image understanding in under 3 GB of memory has raised the bar for what is expected from on-device AI models. Hardware partners including Qualcomm, MediaTek, and Samsung have integrated Gemma 3n optimizations into their chipset software stacks, signaling that on-device open models are becoming a mainstream deployment target rather than a niche use case [6].
The progression from Gemma 1 to Gemma 3n illustrates a broader industry trend: the frontier of what is possible with small, locally runnable models is advancing rapidly, driven by improvements in training data, architecture, distillation techniques, and post-training optimization. Each generation of Gemma has closed the gap between open and proprietary models, making capable AI more accessible to developers and researchers worldwide.