Gemma is a family of open-weight large language models developed by Google DeepMind and released under a permissive license that allows free use for most purposes. Named after the Latin word for "precious stone," the Gemma models are built from the same research and technology that underpin Google's larger Gemini models, but are designed to be small enough to run on a single GPU, a laptop, or even a smartphone. Since the first release in February 2024, the Gemma family has expanded across three major generations and several specialized variants, establishing itself as one of the leading open model families competing with Meta's Llama, Mistral AI's Mistral series, and Microsoft's Phi models.
Google DeepMind introduced Gemma on February 21, 2024, alongside a blog post emphasizing the company's commitment to making capable AI models available to the broader developer and research community [1]. The motivation behind Gemma was straightforward: while frontier models like Gemini Ultra and Gemini Pro deliver state-of-the-art performance, their size and computational requirements put them out of reach for many researchers, independent developers, and organizations that need to run models locally or on constrained hardware. Gemma fills that gap by distilling key insights from Gemini research into models with parameter counts ranging from 270 million to 27 billion.
All Gemma models are released with both pre-trained (base) and instruction-tuned variants. The instruction-tuned versions have undergone additional training with supervised fine-tuning on demonstration data and reinforcement learning from human feedback (RLHF) to make them more helpful and safer for conversational use. Model weights are distributed through platforms like Hugging Face, Kaggle, and Google's own Vertex AI, with support for popular frameworks including PyTorch, JAX, and Keras.
The table below summarizes all major releases in the Gemma family:
| Release | Date | Model Sizes | Key Features |
|---|---|---|---|
| Gemma 1 | February 21, 2024 | 2B, 7B | First open-weight release; 8K context; MQA/MHA |
| Gemma 2 | June 27, 2024 | 2B, 9B, 27B | Knowledge distillation; GQA; sliding window attention |
| Gemma 3 | March 12, 2025 | 1B, 4B, 12B, 27B | Multimodal vision; 128K context; 140+ languages |
| Gemma 3 270M | August 14, 2025 | 270M | Ultra-compact; on-device fine-tuning |
| Gemma 3n | June 26, 2025 | E2B, E4B | MatFormer architecture; on-device; audio/video input |
The first generation of Gemma was released on February 21, 2024, in two sizes: 2 billion (2B) and 7 billion (7B) parameters [1]. Both models use a decoder-only transformer architecture with several modifications drawn from the Gemini research program.
Gemma 1 incorporates four notable architectural features that distinguish it from a vanilla transformer: multi-query attention (MQA) in the 2B model (the 7B retains standard multi-head attention), rotary position embeddings (RoPE) in place of absolute positional encodings, GeGLU activations in place of the standard ReLU feed-forward, and RMSNorm for layer normalization.
The detailed architecture specifications for Gemma 1 are shown below:
| Parameter | Gemma 2B | Gemma 7B |
|---|---|---|
| Layers | 18 | 28 |
| Hidden Dimension (d_model) | 2,048 | 3,072 |
| Intermediate Size (FFN) | 32,768 | 49,152 |
| Attention Heads | 8 | 16 |
| KV Heads | 1 (MQA) | 16 (MHA) |
| Head Dimension | 256 | 256 |
| Vocabulary Size | 256,128 | 256,128 |
| Context Length | 8,192 | 8,192 |
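The practical consequence of the 2B model's multi-query attention is a much smaller KV cache during inference. The sketch below estimates per-sequence cache sizes from the table above; the 2-byte (bfloat16) cache entries are an assumption, not a documented detail:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_entry=2):
    """Per-sequence KV cache: 2 tensors (K and V) x layers x KV heads
    x head dimension x sequence length x bytes per entry."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_entry

# Gemma 2B uses MQA (a single KV head); Gemma 7B uses MHA (16 KV heads).
gemma_2b = kv_cache_bytes(layers=18, kv_heads=1, head_dim=256, seq_len=8192)
gemma_7b = kv_cache_bytes(layers=28, kv_heads=16, head_dim=256, seq_len=8192)
print(f"Gemma 2B: {gemma_2b / 2**20:.0f} MiB, Gemma 7B: {gemma_7b / 2**20:.0f} MiB")
```

Under these assumptions the full 8K context costs roughly 144 MiB of cache for the 2B model versus about 3.5 GiB for the 7B, which is why MQA matters on memory-constrained hardware.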
The 2B model was trained on 2 trillion tokens and the 7B model on 6 trillion tokens. The training data consists primarily of web documents, code, and mathematics content, filtered for quality and safety. Google did not release the full details of the training data composition but noted that extensive filtering was applied to remove personally identifiable information and other sensitive content [1]. Both models use a SentencePiece tokenizer with a vocabulary of 256,128 tokens, shared with the Gemini model family.
At launch, Gemma 1 models demonstrated strong performance relative to their size. The 7B model outperformed Llama 2 7B and Mistral 7B on multiple academic benchmarks [2]. In particular, Gemma 7B showed notable gains in mathematical reasoning (GSM8K, MATH) and code generation (HumanEval), areas where earlier open models at this scale had struggled.
| Benchmark | Gemma 7B | Llama 2 7B | Mistral 7B |
|---|---|---|---|
| MMLU (5-shot) | 64.3% | 45.3% | 62.5% |
| HumanEval | 32.3% | 12.8% | 26.2% |
| GSM8K | 46.4% | 14.6% | 35.4% |
| MATH | 24.3% | 2.5% | 12.7% |
| HellaSwag | 82.3% | 77.2% | 81.3% |
Google's benchmarking with the MaxText reference implementation also showed up to 3x better training performance-per-dollar for Gemma 7B than for Llama 2 7B on Google Cloud infrastructure [2].
Google DeepMind released Gemma 2 on June 27, 2024, with a focus on improving performance at practical model sizes. The second generation was available in three sizes: 2B, 9B, and 27B parameters [3]. The paper describing Gemma 2, titled "Gemma 2: Improving Open Language Models at a Practical Size," emphasized architectural innovations aimed at maximizing quality-per-parameter.
Gemma 2 introduced several improvements over the first generation: grouped-query attention (GQA) across all sizes, alternating local sliding-window (4,096-token) and global (8,192-token) attention layers, logit soft capping, and knowledge distillation for the 2B and 9B models.
The detailed specifications for all three Gemma 2 sizes are:
| Parameter | Gemma 2 2B | Gemma 2 9B | Gemma 2 27B |
|---|---|---|---|
| Layers | 26 | 42 | 46 |
| Hidden Dimension | 2,304 | 3,584 | 4,608 |
| Attention Heads | 8 | 16 | 32 |
| KV Heads | 4 | 8 | 16 |
| Local Attention Window | 4,096 | 4,096 | 4,096 |
| Global Attention Span | 8,192 | 8,192 | 8,192 |
| Training Tokens | 2T | 8T | 13T |
| Vocabulary Size | 256,128 | 256,128 | 256,128 |
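The alternation between local and global attention layers can be made concrete with a toy mask predicate. This is a simplification (the actual ordering of local and global layers within the stack is not spelled out here), but it captures what each layer type is allowed to see:

```python
def can_attend(q, k, window=None):
    """True if query position q may attend to key position k under a
    causal mask, optionally restricted to a sliding window of `window`."""
    if k > q:
        return False              # causal: never attend to the future
    return window is None or q - k < window

# Gemma 2: local layers use a 4,096-token sliding window,
# global layers attend over the full 8,192-token context.
print(can_attend(6000, 100, window=4096))  # local layer: token 100 is too far back
print(can_attend(6000, 100))               # global layer: still visible
```

Because half the layers only ever look at the last 4,096 positions, their KV caches and attention computations stay bounded even as the sequence grows toward the 8,192-token limit.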
The 27B model was trained from scratch on 13 trillion tokens without distillation, while the 2B and 9B models were trained with knowledge distillation from a larger teacher, on token budgets far beyond the compute-optimal quantity predicted by scaling laws (roughly 50x for the 2B) [3]. This "over-training" strategy, combined with distillation, allowed the smaller models to punch well above their weight class on benchmarks.
Gemma 2 delivered substantial improvements across benchmarks, with the 27B model competing against models significantly larger in parameter count.
| Benchmark | Gemma 2 2B | Gemma 2 9B | Gemma 2 27B |
|---|---|---|---|
| MMLU (5-shot) | 52.2% | 71.3% | 75.2% |
| HellaSwag (10-shot) | 72.9% | 81.9% | 86.4% |
| GSM8K | 23.9% | 68.6% | 74.0% |
| ARC-c | 55.4% | 68.4% | 71.4% |
| Winogrande | 70.9% | 80.6% | 83.7% |
On the LMSys Chatbot Arena leaderboard, the Gemma 2 27B instruction-tuned model achieved an Elo score of 1218, surpassing Llama 3 70B (Elo 1206), a model more than twice its size [3]. Memorization analyses also showed that Gemma 2 models emit training data verbatim far less often than prior models, with verbatim memorization rates below 0.1%.
Gemma 3 was released on March 12, 2025, representing the most significant expansion of the family to date. It introduced four model sizes (1B, 4B, 12B, and 27B), multimodal capabilities for vision and text understanding, support for over 140 languages, and context windows of up to 128,000 tokens [4].
The headline feature of Gemma 3 is native multimodal support. The 4B, 12B, and 27B models can process both images and text as input, while the 1B model remains text-only due to its compact size. Image understanding is enabled through a 400M-parameter variant of the SigLIP vision encoder, a Vision Transformer (ViT) trained with a variant of the CLIP contrastive loss [4].
The vision encoder takes square images resized to 896 x 896 pixels and encodes them into a sequence of visual tokens. These tokens are then condensed into a fixed set of 256 image token vectors before being fed into the language model alongside text tokens. This condensation step keeps computational costs manageable even when processing multiple images within a single prompt.
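The condensation step amounts to spatial pooling over the encoder's patch grid. The sketch below assumes a ViT with 14x14 patches (so an 896x896 input yields 64x64 = 4,096 patch tokens) and a 4x4 average pool down to 16x16 = 256 tokens; the patch size and the 1,152-dimensional embedding width are assumptions for illustration, not documented values:

```python
import numpy as np

patch, img, dim = 14, 896, 1152            # assumed patch size and encoder width
side = img // patch                        # 64 patches per side
tokens = np.random.randn(side, side, dim)  # stand-in for the ViT's output grid
# Average-pool each 4x4 neighborhood of patch tokens into one image token.
pooled = tokens.reshape(16, 4, 16, 4, dim).mean(axis=(1, 3))
flat = pooled.reshape(-1, dim)             # sequence fed to the language model
print(flat.shape)
```

Whatever the exact mechanism, the key property is the fixed budget: every image costs the language model exactly 256 token positions, regardless of its original resolution.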
For images with non-standard aspect ratios, Gemma 3 employs a Pan and Scan (P&S) method inspired by LLaVA. This approach segments images into non-overlapping crops of equal size that cover the entire image, resizes each crop to 896 x 896 pixels, and processes them individually through the encoder. The result is that Gemma 3 can handle images of varying resolutions and aspect ratios without distorting or losing important details [4].
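The cropping logic can be sketched as choosing the smallest grid of equal, non-overlapping tiles that covers the image; each tile is then resized to 896x896 and encoded separately. The exact crop-selection heuristics of P&S are not specified here, so this is only illustrative:

```python
import math

def pan_and_scan_crops(width, height, crop=896):
    """Split an image into the smallest grid of equal, non-overlapping
    crops that covers it; each crop (x0, y0, x1, y1) would then be
    resized to crop x crop before encoding. Illustrative sketch only."""
    nx = max(1, math.ceil(width / crop))
    ny = max(1, math.ceil(height / crop))
    return [(ix * width // nx, iy * height // ny,
             (ix + 1) * width // nx, (iy + 1) * height // ny)
            for iy in range(ny) for ix in range(nx)]

# A wide 1792x896 panorama becomes two square crops side by side.
print(pan_and_scan_crops(1792, 896))
```

A square image that already fits yields a single crop, so well-behaved inputs pay no extra encoding cost.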
This allows Gemma 3 to perform tasks like image captioning, visual question answering, document understanding, chart interpretation, and optical character recognition.
Gemma 3 uses a decoder-only transformer architecture with Grouped-Query Attention (GQA) and RMSNorm, consistent with Gemma 2. A key change from Gemma 2 is the replacement of logit soft capping with QK-norm (query-key normalization), which normalizes query and key vectors before computing attention scores [4].
Gemma 3 dramatically increased the context window compared to Gemma 2's 8K limit. The 1B model supports 32,768 tokens, while the 4B, 12B, and 27B models support 128,000 tokens [4]. This 16x increase is achieved through an interleaved attention pattern: for every global attention layer there are 5 local attention layers. Local layers use a sliding window of just 1,024 tokens, while global layers attend to the full context. This design significantly reduces the computational and memory cost of long-context processing, since most layers only attend within a small window. To support the longer context, the RoPE base frequency of the global layers was increased from 10,000 to 1,000,000, while local layers keep the 10,000 base.
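The KV-cache savings of the 5:1 interleaving can be quantified with a short calculation. The 48-layer configuration below is chosen for illustration (it divides evenly into the 6-layer pattern), not taken from the model specs:

```python
def gemma3_kv_tokens(num_layers, context, window=1024, pattern=6):
    """KV entries per head under Gemma 3's 5-local:1-global interleaving
    (every `pattern`-th layer global), vs. making every layer global."""
    n_global = num_layers // pattern
    n_local = num_layers - n_global
    interleaved = n_global * context + n_local * min(window, context)
    all_global = num_layers * context
    return interleaved, all_global

inter, full = gemma3_kv_tokens(num_layers=48, context=128_000)
print(f"KV entries: {inter:,} vs {full:,} ({full / inter:.1f}x reduction)")
```

At the full 128K context nearly all of the cache belongs to the handful of global layers, which is what makes the long window affordable on a single accelerator.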
The full architecture specifications for each Gemma 3 variant:
| Parameter | Gemma 3 1B | Gemma 3 4B | Gemma 3 12B | Gemma 3 27B |
|---|---|---|---|---|
| Embedding Parameters | 302M | 675M | 1,012M | 1,416M |
| Non-embedding Parameters | 698M | 3,209M | 10,759M | 25,600M |
| Total Parameters | 1B | ~3.9B | ~11.8B | ~27B |
| Context Window | 32K | 128K | 128K | 128K |
| Vocabulary Size | 262,144 | 262,144 | 262,144 | 262,144 |
| Multimodal | Text only | Vision + Text | Vision + Text | Vision + Text |
The training data volume increased substantially across all model sizes compared to previous generations:
| Model | Parameters | Training Tokens | Context Window | Multimodal |
|---|---|---|---|---|
| Gemma 3 1B | 1 billion | 2 trillion | 32K | Text only |
| Gemma 3 4B | 4 billion | 4 trillion | 128K | Vision + Text |
| Gemma 3 12B | 12 billion | 12 trillion | 128K | Vision + Text |
| Gemma 3 27B | 27 billion | 14 trillion | 128K | Vision + Text |
The training data includes web documents, code, mathematics, science articles, and multilingual content spanning over 140 languages. Compared to Gemma 2, the 27B model was trained on 14 trillion tokens (up from 13 trillion), and the midsize models saw even larger relative increases in data volume [4].
Gemma 3 achieved remarkable benchmark results across all sizes. The instruction-tuned models showed large improvements over Gemma 2, particularly in mathematical reasoning and code generation:
| Benchmark | Gemma 3 1B IT | Gemma 3 4B IT | Gemma 3 12B IT | Gemma 3 27B IT |
|---|---|---|---|---|
| MMLU | 38.8% | 58.1% | 71.9% | 76.9% |
| MMLU-Pro | 14.7% | 43.6% | 60.6% | 67.5% |
| HumanEval | 41.5% | 71.3% | 85.4% | 87.8% |
| GSM8K | 62.8% | 89.2% | 94.4% | 95.9% |
| MATH | 48.0% | 75.6% | 83.8% | 89.0% |
| HellaSwag | 62.3% | 77.2% | 84.2% | 85.6% |
| LiveCodeBench | 1.9% | 12.6% | 24.6% | 29.7% |
| GPQA Diamond | 19.2% | 30.8% | 40.9% | 42.4% |
On the LMSys Chatbot Arena, the Gemma 3 27B instruction-tuned model scored an Elo of 1338, placing it at rank 9 overall and above much larger models such as DeepSeek-V3 (1318), Llama 3 405B (1257), and Qwen 2.5 70B (1257) [5]. This performance level, achieved with a model small enough to run on a single GPU, represented a significant milestone for the open model ecosystem.
On August 14, 2025, Google released Gemma 3 270M, the smallest model in the Gemma family [12]. With just 270 million parameters (170 million embedding parameters and 100 million transformer block parameters), it is designed for ultra-efficient on-device deployment and task-specific fine-tuning. Despite its compact size, Gemma 3 270M demonstrates strong instruction-following capabilities as measured by the IFEval benchmark. Internal testing on a Pixel 9 Pro showed the INT4-quantized model consumed only 0.75% of battery life over 25 conversations, making it one of the most power-efficient language models available. Google also released FunctionGemma, a specialized fine-tune of the 270M model for function calling, enabling on-device agents to translate natural-language commands into structured API calls [13].
Gemma 3n is a variant of the Gemma family specifically optimized for on-device and edge computing deployment. Previewed at Google I/O 2025 and fully released on June 26, 2025, Gemma 3n introduces architectural innovations that allow powerful models to run with minimal memory footprints on smartphones, tablets, and other resource-constrained devices [6].
The key innovation in Gemma 3n is the MatFormer (Matryoshka Transformer) architecture, a novel nested transformer design built for elastic inference. Like Russian nesting dolls (Matryoshka dolls), a MatFormer model contains smaller, fully functional sub-models within its parameter space. During training of the E4B (4 billion effective parameter) model, a smaller E2B (2 billion effective parameter) sub-model is simultaneously optimized within it. This allows a single trained model to be deployed at multiple compute and memory levels without retraining, providing flexibility for devices with different capabilities [6].
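The nesting idea can be illustrated with a toy feed-forward layer in which the smaller sub-model reuses a prefix slice of the larger model's hidden dimension. This is a conceptual sketch, not the released weights: the dimensions are arbitrary and ReLU stands in for Gemma's GeGLU activation:

```python
import numpy as np

d_model, d_ffn_full, d_ffn_sub = 512, 2048, 1024
W_in = np.random.randn(d_model, d_ffn_full) * 0.02
W_out = np.random.randn(d_ffn_full, d_model) * 0.02

def ffn(x, hidden):
    """Run the FFN using only the first `hidden` columns/rows of the
    shared weight matrices -- the Matryoshka-style nested sub-model."""
    h = np.maximum(x @ W_in[:, :hidden], 0.0)   # ReLU stand-in for GeGLU
    return h @ W_out[:hidden, :]

x = np.random.randn(1, d_model)
full = ffn(x, d_ffn_full)   # "E4B"-style full-width path
sub = ffn(x, d_ffn_sub)     # nested "E2B"-style path, same weights
print(full.shape, sub.shape)
```

Because both paths read from the same matrices, training the full model while jointly optimizing the prefix slice yields two deployable models for the cost of one set of weights.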
Developers can use Gemma 3n in two modes: deploy the full E4B model for maximum quality, or use the standalone E2B sub-model (or a custom size between the two, assembled with the Mix-n-Match technique) for faster, lighter inference.
The second major innovation in Gemma 3n is Per-Layer Embeddings (PLE), a technique that dramatically reduces accelerator memory (GPU/TPU VRAM) usage. In a standard transformer, the embedding matrix is loaded into high-speed accelerator memory. PLE instead associates separate embedding parameters with each transformer layer and stores them in regular CPU memory. Only the core transformer weights need to reside in accelerator memory, which is the bottleneck for on-device deployment. As a result, while the raw parameter counts for Gemma 3n are 5 billion (E2B) and 8 billion (E4B), the effective accelerator memory footprint is comparable to traditional 2B and 4B models [6].
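The accounting behind PLE is simple. The sketch below assumes int8 (1-byte) weights and an illustrative split in which roughly 3 billion of the E2B model's 5 billion raw parameters are per-layer embeddings offloaded to CPU RAM; both figures are assumptions chosen to match the reported footprint, not published numbers:

```python
def accelerator_footprint_gb(raw_params_b, ple_params_b, bytes_per_param=1):
    """Accelerator memory with Per-Layer Embeddings held in CPU RAM:
    only the remaining transformer weights occupy GPU/TPU memory."""
    return (raw_params_b - ple_params_b) * 1e9 * bytes_per_param / 1e9

# Illustrative split (assumption): ~3B of E2B's 5B raw parameters
# live in ordinary CPU memory as per-layer embeddings.
print(f"E2B: ~{accelerator_footprint_gb(5, 3):.0f} GB in accelerator memory")
```

This is how a 5B-parameter model can present the accelerator footprint of a traditional 2B model, as the table below reports.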
| Specification | E2B | E4B |
|---|---|---|
| Raw Parameter Count | 5 billion | 8 billion |
| Effective Parameters | ~2 billion | ~4 billion |
| Accelerator Memory | ~2 GB | ~3 GB |
| LMArena Score | N/A | >1,300 |
| Modalities (Input) | Text, image, audio, video | Text, image, audio, video |
| Modalities (Output) | Text | Text |
| Language Support (Text) | 140 languages | 140 languages |
| Language Support (Multimodal) | 35 languages | 35 languages |
The E4B model became the first model under 10 billion raw parameters to exceed an LMArena score of 1,300, a milestone that underscored the effectiveness of the MatFormer and PLE innovations [6].
Unlike Gemma 3, which supports only vision and text inputs, Gemma 3n expands multimodal support to audio and video in addition to images and text: vision is handled by a new MobileNet-V5 encoder optimized for on-device efficiency, audio by an encoder based on the Universal Speech Model (USM) that enables speech recognition and translation, and video understanding builds on the combined image and audio streams.
Gemma 3n introduces KV cache sharing to optimize prefill performance for long-context inputs. This technique delivers approximately 2x improvement on prefill performance compared to Gemma 3 4B, which is critical for responsive on-device inference where users expect near-instant replies [6].
The progression from Gemma 1 to Gemma 3n shows a clear trajectory of architectural refinement:
| Feature | Gemma 1 | Gemma 2 | Gemma 3 | Gemma 3n |
|---|---|---|---|---|
| Attention Type | MQA (2B) / MHA (7B) | GQA (all sizes) | GQA + interleaved local/global | GQA + MatFormer elastic |
| Position Encoding | RoPE (base 10K) | RoPE (base 10K) | RoPE (base 1M) | RoPE (base 1M) |
| Normalization | RMSNorm | RMSNorm | RMSNorm + QK-norm | RMSNorm + QK-norm |
| Activation | GeGLU | GeGLU | GeGLU | GeGLU |
| Max Context | 8,192 | 8,192 | 128,000 | 128,000 |
| Distillation | None | On-policy (2B, 9B) | Yes (all sizes) | Nested (MatFormer) |
| Multimodal | No | No | Vision (SigLIP 400M) | Vision (MobileNet-V5), Audio (USM), Video |
| Vocabulary | 256,128 | 256,128 | 262,144 | 262,144 |
All Gemma models are trained on Google's proprietary data mixture, which Google has described in general terms but has not released publicly. The training data includes web documents in English and, from Gemma 3 onward, more than 140 other languages; source code; mathematics and science content; and, for the multimodal models, images paired with text.
The total training compute increased substantially with each generation. Gemma 1's 7B model was trained on 6 trillion tokens, Gemma 2's 27B model on 13 trillion, and Gemma 3's 27B model on 14 trillion. The smaller Gemma 2 models were "over-trained" relative to Chinchilla scaling-law predictions: the 2B saw roughly 50x its compute-optimal token budget, and the 9B, at 8 trillion tokens, more than 40x [3].
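These over-training ratios follow from the Chinchilla rule of thumb of roughly 20 training tokens per parameter (an approximation, not an exact law):

```python
def overtraining_ratio(params_b, tokens_t, tokens_per_param=20):
    """Ratio of actual training tokens to a Chinchilla-style
    compute-optimal budget of ~20 tokens per parameter."""
    optimal = params_b * 1e9 * tokens_per_param
    return tokens_t * 1e12 / optimal

for name, p, t in [("Gemma 2 2B", 2, 2), ("Gemma 2 9B", 9, 8), ("Gemma 3 27B", 27, 14)]:
    print(f"{name}: {overtraining_ratio(p, t):.0f}x compute-optimal")
```

Over-training trades extra training compute for a smaller, cheaper-to-serve model, which is exactly the trade-off an open family aimed at single-GPU deployment wants to make.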
Google applied extensive safety filtering during data preparation, including the removal of child sexual abuse material (CSAM), personally identifiable information, and content that violates Google's policies. The exact composition and proportions of the training data have not been disclosed, which has been a point of criticism from researchers who argue that full data transparency is necessary for reproducible science.
Beyond the core Gemma models, Google DeepMind has released several task-specific variants that build on the Gemma architecture.
CodeGemma is a family of models specialized for code generation and completion tasks. Released alongside the first Gemma generation, CodeGemma models support multiple programming languages including Python, Java, C++, JavaScript, and more. The models are available in sizes that mirror the base Gemma lineup and are designed for both code completion (fill-in-the-middle) and general coding assistance [7].
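Fill-in-the-middle works by rearranging a file around special tags so the model completes the gap. The sketch below uses the prefix-suffix-middle (PSM) tag names from the CodeGemma model card; verify them against the tokenizer you actually load before relying on this format:

```python
def fim_prompt(prefix, suffix):
    """Build a fill-in-the-middle prompt in PSM order; the model
    generates the missing middle after the <|fim_middle|> tag."""
    return f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>"

prompt = fim_prompt("def add(a, b):\n    return ", "\n\nprint(add(2, 3))")
print(prompt)
```

An editor plugin would send this prompt with the cursor position splitting the file into prefix and suffix, then splice the generated middle back at the cursor.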
PaliGemma is a vision-language model that combines a SigLIP image encoder with a Gemma language model decoder. It is specifically designed for fine-tuning on visual understanding tasks such as image captioning, object detection, and document understanding. PaliGemma 2, released alongside Gemma 2, expanded support to multiple image resolutions and additional model sizes [8].
ShieldGemma is a safety classifier built on the Gemma architecture. ShieldGemma 2, a 4B parameter model built on Gemma 3, functions as an image safety classifier that can identify potentially harmful content across three categories: dangerous content, sexually explicit material, and violence. It is intended for use as a guardrail in production applications that process user-generated or model-generated images [9].
RecurrentGemma is a variant that replaces the standard transformer attention mechanism with a linear recurrence based on the Griffin architecture. Available in 2B and 9B parameter sizes, it offers faster inference at long sequence lengths due to the constant memory footprint of recurrent computation, though it trades some quality for this efficiency gain.
FunctionGemma is a 270M parameter model fine-tuned from Gemma 3 270M for function calling tasks. It translates natural-language user commands into structured API or tool calls, enabling on-device agents that can control mobile applications, IoT devices, and other tools without sending data to the cloud [13].
| Variant | Size(s) | Purpose | Key Features |
|---|---|---|---|
| CodeGemma | 2B, 7B | Code generation and completion | Multi-language support; fill-in-the-middle capability |
| PaliGemma / PaliGemma 2 | Multiple | Vision-language tasks | Fine-tunable for image understanding; multi-resolution |
| ShieldGemma 2 | 4B | Image safety classification | Classifies dangerous, explicit, and violent content |
| RecurrentGemma | 2B, 9B | Efficient long-sequence inference | Griffin linear recurrence; constant memory |
| Gemma 3 270M | 270M | On-device fine-tuning | Ultra-compact; 0.75% battery per 25 conversations |
| FunctionGemma | 270M | Function calling | Structured API calls from natural language |
| Gemma 3n | E2B, E4B | On-device deployment | MatFormer architecture; multimodal; ultra-low memory |
Google released the Responsible Generative AI Toolkit alongside the Gemma models to help developers build safe and responsible applications. The toolkit's resources include safety classifiers such as ShieldGemma, guidance for defining application-level safety policies, and tools for evaluating and debugging model behavior [14]:
The toolkit encourages a holistic approach to responsible AI that addresses safety, privacy, fairness, and accountability at both the model and application levels [14].
Google AI Edge is the primary platform for deploying Gemma models on mobile and edge devices. The SDK provides optimized inference runtimes for Android, iOS, and web applications [15].
Key deployment capabilities include quantized inference (down to INT4), CPU and GPU acceleration, and cross-platform runtimes for Android, iOS, and the web.
The on-device deployment capabilities are significant for privacy-sensitive applications, since data never needs to leave the user's device. Healthcare, finance, and enterprise applications in particular benefit from the ability to run inference locally.
The Gemma family has generated a large ecosystem of community-created fine-tunes, quantizations, and adaptations. As of early 2026, Gemma models are among the most downloaded model families on Hugging Face, with millions of cumulative downloads across all variants.
Popular community contributions include quantized GGUF builds for local inference engines such as llama.cpp and Ollama, parameter-efficient fine-tunes built with LoRA and QLoRA, and domain-specialized adaptations for fields such as medicine, law, and regional languages.
Google has also released Gemma Scope, a suite of sparse autoencoders trained on the Gemma 2 models, to support the AI safety community's work on mechanistic interpretability. Gemma Scope 2, released in 2025, expanded this coverage and provided deeper tools for understanding complex model behaviors [11].
Gemma competes in the rapidly growing market for small-to-medium open-weight language models. The following table compares Gemma with its primary competitors at similar parameter counts:
| Model Family | Developer | Key Sizes | License | Multimodal | Max Context | Notable Strengths |
|---|---|---|---|---|---|---|
| Gemma 3 | Google DeepMind | 1B, 4B, 12B, 27B | Gemma Terms of Use | Vision + Text | 128K | Strong chat quality; 140+ languages; on-device variants |
| Llama 3.2 | Meta | 1B, 3B, 11B, 90B | Llama Community License | Vision + Text (11B, 90B) | 128K | Large ecosystem; strong code performance |
| Mistral / Mixtral | Mistral AI | 7B, 8x7B, 8x22B | Apache 2.0 / Custom | Text only (base) | 32K-128K | Mixture-of-experts; fast inference |
| Phi-4 | Microsoft | 3.8B (mini), 14B | MIT | Text only (base) | 128K | Strong reasoning at small sizes; MIT license |
| Qwen 2.5 | Alibaba | 0.5B to 72B | Apache 2.0 / Custom | Vision + Text (VL variants) | 128K | Multilingual; strong coding; wide size range |
The following table compares instruction-tuned models in the sub-10B parameter range, a popular category for local and on-device deployment:
| Benchmark | Gemma 3 4B IT | Phi-4-mini (3.8B) | Llama 3.2 3B | Qwen 2.5 7B |
|---|---|---|---|---|
| MMLU-Pro | 43.6 | 52.8 | N/A | 56.3 |
| HumanEval | 71.3 | N/A | 71.3 | 57.9 |
| GSM8K | 89.2 | 88.6 | 77.7 | 91.6 |
| ARC-c | 56.2 | 83.7 | 78.6 | N/A |
| Approx. RAM (Q4) | ~3 GB | ~2.5 GB | ~2 GB | ~5 GB |
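The RAM figures above can be sanity-checked with a back-of-the-envelope estimate: 4-bit weights cost half a byte per parameter, plus some overhead. The 15% overhead factor and the approximate parameter counts below are assumptions; real footprints also depend on the quantization format and the KV cache:

```python
def quantized_ram_gb(params_b, bits=4, overhead=1.15):
    """Rough RAM to hold quantized weights, with ~15% assumed overhead
    for quantization scales, activations, and runtime buffers."""
    return params_b * 1e9 * bits / 8 * overhead / 1e9

for name, p in [("Gemma 3 4B", 3.9), ("Llama 3.2 3B", 3.2), ("Qwen 2.5 7B", 7.6)]:
    print(f"{name}: ~{quantized_ram_gb(p):.1f} GB at 4-bit")
```

The estimates land close to the table's values, with the remaining gap largely attributable to context-dependent KV cache memory.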
At the 27B parameter level, Gemma 3 27B IT's LMArena Elo of 1338 placed it well ahead of similarly sized or even larger open models. Its combination of multimodal capabilities, 128K context, and broad language support gives it advantages in use cases requiring vision understanding or multilingual processing, while competitors like Phi-4-mini offer stronger reasoning at smaller sizes under a more permissive MIT license.
Gemma models are released under the Gemma Terms of Use, a custom license created by Google rather than a standard open-source license like MIT or Apache 2.0 [10]. The license permits free use for individual developers, researchers, and commercial entities, including the right to redistribute and modify model weights. However, it includes several restrictions: users must comply with Google's Gemma Prohibited Use Policy, which bars harmful applications, and anyone who redistributes the models or derivatives must pass these use restrictions on to downstream recipients.
While the Gemma license is permissive enough for the vast majority of commercial applications, it does not meet the strict definition of "open source" as defined by the Open Source Initiative (OSI). This distinction has been a point of discussion in the AI community, with some advocates arguing that licenses like Gemma's (and similar ones from Meta for Llama) create a grey area between fully open and proprietary models [10].
In practical terms, developers can use Gemma models to build and sell commercial products, deploy them on their own infrastructure, and modify them through fine-tuning or other techniques, as long as they comply with the usage restrictions.
Since its initial release, Gemma has become one of the most downloaded and used open model families in the AI community. The models are available through Hugging Face (where they have been downloaded millions of times), Google's Vertex AI and AI Studio platforms, Kaggle, and numerous third-party inference providers.
Gemma's impact extends beyond direct usage. The release of model weights has enabled academic researchers to study transformer internals, develop new fine-tuning techniques, and create specialized models for domains ranging from healthcare to legal analysis. The Gemma Scope interpretability tools have become a resource for the mechanistic interpretability research community, helping researchers understand how language models represent and process information internally.
The on-device deployment story has also influenced the competitive landscape. Gemma 3n's success in running multimodal models with audio, video, and image understanding in under 3 GB of memory has raised the bar for what is expected from on-device AI models. Hardware partners including Qualcomm, MediaTek, and Samsung have integrated Gemma 3n optimizations into their chipset software stacks, signaling that on-device open models are becoming a mainstream deployment target rather than a niche use case [6].
The progression from Gemma 1 to Gemma 3n illustrates a broader industry trend: the frontier of what is possible with small, locally runnable models is advancing rapidly, driven by improvements in training data, architecture, distillation techniques, and post-training optimization. Each generation of Gemma has closed the gap between open and proprietary models, making capable AI more accessible to developers and researchers worldwide.