Gemma 2 is a family of open-weights large language models developed by Google DeepMind and released starting June 27, 2024. The family spans three parameter sizes: 2 billion (2B), 9 billion (9B), and 27 billion (27B). Gemma 2 introduced several architectural innovations compared to its predecessor, including interleaved sliding window and global attention, logit soft-capping for training stability, grouped-query attention, and knowledge distillation from a larger teacher model. On the LMSYS Chatbot Arena leaderboard, Gemma 2 27B posted an Elo score above Llama 3 70B despite having fewer than half as many parameters, and the 9B model effectively matched GPT-4-0314. Google released both base and instruction-tuned variants under a custom Gemma Terms of Use that permits commercial deployment. The release was accompanied by Gemma Scope, an open collection of sparse autoencoders for mechanistic interpretability research, and ShieldGemma, a safety content moderation classifier built on the same weights.
The Gemma model family began in February 2024, when Google DeepMind published Gemma 1 in 2B and 7B parameter sizes. That first generation used a standard decoder-only transformer architecture derived from the one underlying Gemini. Gemma 1 was notable for demonstrating that compact, openly released models could compete with models two to four times larger on standard reasoning and language benchmarks. Both base and instruction-tuned checkpoints were published under a relatively permissive custom license, enabling fine-tuning and commercial use for most applications. The initial reception was strong enough that Google committed to treating Gemma as an ongoing model family rather than a one-time release.
Despite that positive start, Gemma 1 had several constraints. Its training used roughly 6 trillion tokens for the 7B model, and the architecture did not incorporate techniques like alternating attention patterns or knowledge distillation. Context length was set at 8,192 tokens. The instruction-tuning pipeline was relatively straightforward compared to what Google applied in Gemma 2.
Gemma 2 was announced at Google I/O 2024 and formally released on June 27, 2024, with the 9B and 27B sizes immediately available. The 2B model followed on July 31, 2024. A Japanese-language variant, Gemma 2 JPN 2B, shipped on October 3, 2024. The technical report was posted to arXiv under the identifier 2408.00118.
Gemma 2 comes in three sizes, each available in a base (pretrained) and an instruction-tuned (IT) version. The instruction-tuned versions are aligned for conversational and instruction-following tasks through supervised fine-tuning (SFT), reinforcement learning from human feedback (RLHF), and model merging via the three-stage WARP procedure described below.
| Variant | Parameters | Layers | Model dimension | Attention heads | KV heads | Training tokens |
|---|---|---|---|---|---|---|
| Gemma 2 2B | 2.6 billion | 26 | 2,304 | 8 | 4 | 2 trillion |
| Gemma 2 9B | 9 billion | 42 | 3,584 | 16 | 8 | 8 trillion |
| Gemma 2 27B | 27 billion | 46 | 4,608 | 32 | 16 | 13 trillion |
All three share a vocabulary of 256,128 tokens built with SentencePiece, a context window of 8,192 tokens, RoPE (Rotary Position Embedding) positional encoding, and GeGLU activation functions. The head size is 256 for the 2B and 9B models and 128 for the 27B.
The 27B model was trained from scratch using standard next-token prediction. The 9B and 2B models used knowledge distillation from the 27B model during pretraining, learning from the teacher's output distributions rather than only from raw data targets. All three underwent the same post-training alignment pipeline.
The 2B model, released on July 31, 2024, drew particular attention for its size-to-performance ratio. With 2.6 billion parameters, it is small enough to run inference on the free tier of an NVIDIA T4 GPU and to fit comfortably on edge hardware. LMSYS Chatbot Arena evaluations placed the Gemma 2 2B instruction-tuned model at an Elo of approximately 1,126 to 1,130, ahead of GPT-3.5-Turbo-0613 (1,117) and Mixtral 8x7B (1,114), the latter being a mixture-of-experts model that activates roughly five times as many parameters per token. On MMLU the 2B model scored 52.2%, and on MBPP (Mostly Basic Python Programming) it reached 36.6%.
VentureBeat described the 2B release as a "surprising upset," citing the rarity of a sub-3B model outperforming much larger baselines on human preference evaluations. The result was attributed largely to the knowledge distillation training approach, where the small model absorbed signals from the full 27B teacher distribution rather than learning only from one-hot next-token labels.
Gemma 2 shares the decoder-only transformer foundation of its predecessor but adds four architectural features not present in Gemma 1: interleaved attention patterns, logit soft-capping, grouped-query attention, and a hybrid normalization scheme.
Every other transformer layer in Gemma 2 applies local sliding window attention, where each token can attend only to the 4,096 preceding tokens. The alternating layers use standard global (full quadratic) attention across the full 8,192-token context. This means the model alternates between a local view (4K window) and a global view (8K window) as tokens pass through successive layers.
The motivation is computational: self-attention scales quadratically with sequence length, so applying full attention in every layer is expensive. Using sliding window attention in half the layers reduces the total attention cost, while the global layers preserve the ability to reason across distant tokens. Sliding window attention itself was popularized by Mistral 7B, which applies it uniformly in every layer; Gemma 2's layer-by-layer alternation differs from that uniform application and performed better in the technical report's ablations.
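The alternation can be made concrete as attention masks. The sketch below (PyTorch) builds the boolean mask for a single layer; treating even-indexed layers as the local ones is an illustrative assumption, since the report only specifies that the two patterns alternate.

```python
import torch

def gemma2_layer_mask(seq_len: int, layer_idx: int, window: int = 4096) -> torch.Tensor:
    """Boolean attention mask (True = may attend) for a single layer.

    Even-indexed layers here use local sliding-window attention; odd-indexed
    layers use global causal attention. Which parity is local is an
    illustrative choice, not a detail from the report.
    """
    pos = torch.arange(seq_len)
    causal = pos[None, :] <= pos[:, None]            # token i attends only to j <= i
    if layer_idx % 2 == 0:                           # local layer: 4,096-token window
        return causal & (pos[:, None] - pos[None, :] < window)
    return causal                                    # global layer: full context

local_mask = gemma2_layer_mask(8192, layer_idx=0)
global_mask = gemma2_layer_mask(8192, layer_idx=1)
# A token at position 6000 cannot see token 0 in a local layer,
# but the next (global) layer restores the long-range connection.
assert not local_mask[6000, 0] and global_mask[6000, 0]
```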
Gemma 2 applies a soft-capping non-linearity in two places: to attention logits before the softmax operation inside each attention layer, and to the final vocabulary logits produced by the output projection. The formula is:
logits <- soft_cap * tanh(logits / soft_cap)
The self-attention layers use a cap value of 50.0. The final output layer uses 30.0. This operation squashes extreme logit values into a bounded range without the discontinuities introduced by hard clipping. The practical effect is a more stable loss curve during training, particularly in the early stages when logit values tend to grow rapidly.
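A minimal sketch of the operation in PyTorch, using the published cap values:

```python
import torch

ATTN_CAP = 50.0    # applied to attention logits
FINAL_CAP = 30.0   # applied to final vocabulary logits

def soft_cap(logits: torch.Tensor, cap: float) -> torch.Tensor:
    """Smoothly bound logits to (-cap, cap): tanh saturates, so extreme
    values flatten out near the cap instead of being hard-clipped."""
    return cap * torch.tanh(logits / cap)

x = torch.tensor([10.0, 100.0, 1000.0])
print(soft_cap(x, ATTN_CAP))   # ~[9.87, 48.20, 50.00]: extremes saturate near the cap
```

Because tanh is smooth everywhere, gradients never vanish abruptly the way they do at the boundary of a hard clip.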
A significant engineering consequence is that soft-capping breaks compatibility with FlashAttention and PyTorch's fused scaled-dot-product attention (SDPA), both of which assume standard unbounded logits. Users fine-tuning Gemma 2 must switch to eager (unfused) attention, which uses more GPU memory and runs more slowly. This tradeoff was acknowledged in the Hugging Face Transformers documentation for Gemma 2 and was a common point of friction for practitioners.
All three Gemma 2 sizes use grouped-query attention (GQA) with two query heads per key-value head pair (num_groups=2). Standard multi-head attention (MHA) maintains separate key and value tensors for every attention head, which grows the key-value cache proportionally with the number of heads and the sequence length. GQA reduces this by having multiple query heads share a single key and value head. At num_groups=2, the KV cache is roughly half the size of what standard MHA would require, which translates directly to lower GPU memory usage during inference and enables higher batch sizes or longer sequences on the same hardware.
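The memory saving can be checked with back-of-envelope arithmetic. The sketch below uses the 9B configuration from the table above and compares the KV cache under GQA against a hypothetical MHA baseline at bfloat16 precision:

```python
def kv_cache_gib(layers: int, kv_heads: int, head_dim: int,
                 seq_len: int, bytes_per_value: int = 2) -> float:
    """KV cache size for one sequence: a key and a value tensor per layer,
    each of shape (kv_heads, seq_len, head_dim), at bfloat16 (2 bytes)."""
    return 2 * layers * kv_heads * seq_len * head_dim * bytes_per_value / 2**30

# Gemma 2 9B: 42 layers, head size 256, 8,192-token context (see table above)
print(kv_cache_gib(42, kv_heads=8,  head_dim=256, seq_len=8192))   # GQA: ~2.6 GiB
print(kv_cache_gib(42, kv_heads=16, head_dim=256, seq_len=8192))   # MHA baseline: ~5.2 GiB
```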
Gemma 2 applies RMSNorm in both the pre-norm position (before the attention or feedforward sublayer) and the post-norm position (on the sublayer's output, before it is added back to the residual stream). Most contemporary decoder-only models use only pre-norm. The dual-norm approach reduces activation magnitude drift in deep networks and was found to improve training stability in ablations, particularly for the 27B model, where gradient flow through 46 layers can be problematic.
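A minimal sketch of the sandwich pattern, assuming PyTorch's built-in nn.RMSNorm (available since PyTorch 2.4) and a generic sublayer standing in for attention or the feedforward network:

```python
import torch
import torch.nn as nn

class SandwichBlock(nn.Module):
    """One residual branch with Gemma 2-style dual normalization."""
    def __init__(self, dim: int, sublayer: nn.Module):
        super().__init__()
        self.pre_norm = nn.RMSNorm(dim)    # normalizes the sublayer input
        self.post_norm = nn.RMSNorm(dim)   # normalizes the sublayer output
        self.sublayer = sublayer           # attention or feedforward, interchangeably

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Both norms sit inside the residual branch, so the skip
        # connection itself remains an identity path.
        return x + self.post_norm(self.sublayer(self.pre_norm(x)))

block = SandwichBlock(dim=2304, sublayer=nn.Linear(2304, 2304))
out = block(torch.randn(1, 8, 2304))
```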
The instruction-tuned models use a procedure called WARP (Weight Averaged Rewarded Policies) during the reinforcement learning stage of post-training. WARP applies three sequential operations:

- Exponential moving average (EMA): during RL fine-tuning, the policy is regularized toward an exponential moving average of its own weights, which serves as a dynamic anchor.
- Spherical linear interpolation (SLERP): several independently fine-tuned policies are merged by spherically interpolating their weights.
- Linear interpolation toward initialization (LITI): the merged weights are interpolated partway back toward the pre-RL initialization.
The third step is unusual and addresses a known problem with RLHF: policies optimized too aggressively against a reward model tend to collapse toward narrow, reward-hacking behaviors. By pulling weights slightly back toward initialization, LITI acts as a regularizer that preserves broader capabilities while retaining alignment gains. The technical report showed improved benchmark scores for all three model sizes when using WARP compared to standard RLHF.
The 27B model trained on approximately 13 trillion tokens, making it one of the most extensively trained models in its parameter class at the time of release. The 9B model used 8 trillion tokens and the 2B model 2 trillion tokens. All three corpora draw from web documents (primarily English-language), code repositories, and scientific articles. The exact data mix, filtering methodology, and source proportions were not disclosed, consistent with Google's general practice for Gemini-family training.
Data processing included quality filtering to remove boilerplate and low-quality web text, deduplication to limit the influence of repeated content, and safety filtering to reduce explicit harmful content in the training signal. The SentencePiece tokenizer uses a 256,128-token vocabulary, considerably larger than the 32,000 tokens in Llama 2 or the 50,257 in GPT-2-era models. The enlarged vocabulary helps the model handle non-English text, programming languages, and mathematical notation more efficiently, because common multi-character patterns in those domains are represented as single tokens rather than character sequences.
For the 9B and 2B models, the pretraining objective was augmented with knowledge distillation from the 27B teacher. Rather than training only on the binary signal of whether the model predicted the correct next token, the smaller models were trained to match the full probability distribution the teacher assigned to all tokens at each position. The loss function minimizes the KL divergence between teacher and student distributions, computed over the same training corpus.
This approach, sometimes called token-level distillation, provides a richer training signal than standard cross-entropy on one-hot targets. The teacher's distribution encodes information about which tokens are nearly correct alternatives, which is information that gets discarded in standard language modeling. The benefit was quantified in the technical report: a 2B model trained with distillation for 500 billion tokens averaged 67.7 on three evaluation benchmarks, compared to 60.3 for the same model trained without distillation, a difference of about 12% in relative terms.
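A minimal sketch of the token-level objective in PyTorch. The exact loss weighting, and whether a standard cross-entropy term is mixed in, are not specified here and should be read as assumptions:

```python
import torch
import torch.nn.functional as F

def distill_loss(student_logits: torch.Tensor, teacher_logits: torch.Tensor) -> torch.Tensor:
    """Token-level distillation: KL(teacher || student) over the vocabulary.

    Both inputs have shape (batch, seq_len, vocab). The teacher is frozen,
    so its logits are detached from the graph.
    """
    s = F.log_softmax(student_logits, dim=-1).flatten(0, 1)            # (batch*seq, vocab)
    t = F.log_softmax(teacher_logits.detach(), dim=-1).flatten(0, 1)
    return F.kl_div(s, t, log_target=True, reduction="batchmean")      # mean KL per token
```

Minimizing this KL drives the student to reproduce the teacher's full distribution, including the relative probabilities of near-miss alternatives that a one-hot target discards.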
All Gemma 2 sizes underwent a multi-stage post-training pipeline:

- supervised fine-tuning (SFT) on a mixture of synthetic and human-generated prompt-response pairs, with responses largely distilled from a larger teacher model;
- reinforcement learning from human feedback (RLHF), using a reward model substantially larger than the policy being trained;
- model merging via the WARP averaging procedure described above.
The teacher model for post-training distillation was not named in the report. Given the context and Google's infrastructure, it is widely understood to be a variant of Gemini, most likely a version of Gemini 1.5.
Training used Google Cloud TPU pods of different generations by model size: TPUv5e (512 chips arranged in a 2x16x16 configuration) for the 2B model, TPUv4 (4,096 chips in an 8x16x32 configuration) for the 9B, and TPUv5p (6,144 chips in an 8x24x32 configuration) for the 27B. All training ran on JAX with the ML Pathways system for distributed computation. Total reported carbon emissions were approximately 1,247.61 tCO2eq across all three models, produced at Google's carbon-neutral data centers.
Gemma 2 is distributed under the Gemma Terms of Use, a custom license that permits commercial use but differs from standard open-source licenses such as Apache 2.0 or MIT. The key provisions are:

- commercial use, distribution, and fine-tuning are permitted, including for derivative models;
- all use must comply with Google's Gemma Prohibited Use Policy;
- anyone redistributing the models or derivatives must pass the use restrictions downstream and provide recipients with the terms;
- Google reserves the right to restrict use of the models, including remotely, if it reasonably believes the terms have been violated.
The revocation provision was the most-discussed limitation. Unlike Apache 2.0, which grants irrevocable rights once distributed, the Gemma license lets Google remotely restrict access. In practice this provision was rarely invoked, but it meant that enterprises with strict procurement requirements around irrevocable licenses could not adopt Gemma 2 under the same terms they might accept for an Apache-licensed model.
Gemma 3, released in March 2025, continued to use a custom Gemma license rather than switching to a standard open-source license; Gemma 2 likewise remains under the original custom terms.
Alongside the Gemma 2 models, Google DeepMind released Gemma Scope, an open collection of sparse autoencoders (SAEs) trained on the internal activations of Gemma 2. The goal is to support mechanistic interpretability research: the practice of reverse-engineering what computations a neural network performs, rather than studying only its external input-output behavior.
Sparse autoencoder techniques for language models work by training an auxiliary model to reconstruct a target layer's activations as a sparse linear combination of learned feature vectors. Because the reconstruction must be sparse, the learned features tend to be interpretable concepts rather than statistical noise. Each feature can be inspected by finding the inputs that maximally activate it, which often reveals that a given feature corresponds to a recognizable concept such as a person's name, a syntactic pattern, or a topic domain.
Gemma Scope trained SAEs at every layer and sublayer of Gemma 2 2B and Gemma 2 9B, and at selected layers of Gemma 2 27B. The resulting collection comprises over 400 individual autoencoders with more than 30 million total learned features. The SAE training consumed approximately 15% of the total compute used to pretrain Gemma 2 9B, and the team stored roughly 20 pebibytes of intermediate activations. The JumpReLU SAE architecture was selected because it achieves a better balance between detecting feature presence and estimating feature magnitude than standard ReLU-based SAEs.
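The JumpReLU mechanism can be sketched compactly. In the toy PyTorch module below, the dimensions and initialization are illustrative, and the straight-through gradient estimator needed to actually train the thresholds is omitted:

```python
import torch
import torch.nn as nn

class JumpReLUSAE(nn.Module):
    """Toy JumpReLU sparse autoencoder in the spirit of Gemma Scope.

    A feature fires only if its pre-activation clears a learned per-feature
    threshold, and when it fires it keeps its full value. This separates
    "is the feature present?" from "how strong is it?".
    """
    def __init__(self, d_model: int = 2304, n_features: int = 16384):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)
        self.log_threshold = nn.Parameter(torch.zeros(n_features))

    def forward(self, activations: torch.Tensor):
        pre = self.encoder(activations)
        features = pre * (pre > self.log_threshold.exp())  # JumpReLU gate
        return self.decoder(features), features            # reconstruction, sparse codes
```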
Google DeepMind described the release as the largest open-source interpretability tool release from any AI lab at the time. Weights were published on Hugging Face under the google/gemma-scope repository. The Neuronpedia platform provided an interactive web interface for browsing learned features without requiring a local GPU. The technical paper was posted to arXiv as 2408.05147.
Intended use cases include studying how models represent factual knowledge, analyzing where unsafe behaviors originate in the computation graph, investigating chain-of-thought faithfulness, and detecting deceptive reasoning. The scale of the Gemma Scope release made it possible for small research groups to conduct mechanistic interpretability experiments that would have required significant proprietary infrastructure to replicate from scratch.
Gemma Scope 2, released alongside Gemma 3 in 2025, extended coverage to all Gemma 3 model sizes (270M to 27B) and added transcoders, which model how activations in one layer are transformed to produce activations in the next, enabling analysis of computational flow rather than static representations alone.
ShieldGemma is a suite of safety content moderation classifiers released by Google alongside the Gemma 2 family. The models are fine-tuned from Gemma 2 base weights to classify text inputs and outputs for four harm categories: sexually explicit content, dangerous or harmful content, hate speech, and harassment. ShieldGemma is available in 2B, 9B, and 27B parameter sizes, matching the Gemma 2 family.
The classifier is designed for two deployment patterns. In the pre-filter configuration, ShieldGemma processes incoming user messages before they reach the main generation model, blocking requests classified as harmful above the operator's chosen threshold. In the post-filter configuration, it screens the model's responses before they are returned to the user. Both patterns can be used simultaneously for defense in depth.
Unlike binary classifiers that output only a harmful/not-harmful label, ShieldGemma outputs a probability score for each harm category, giving operators control over the sensitivity-specificity tradeoff for their specific application.
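A hypothetical wiring of the two deployment patterns around per-category scores. The category names mirror the four harm types, but the function signatures and threshold values here are illustrative, not ShieldGemma's actual API:

```python
HARM_CATEGORIES = ("sexually_explicit", "dangerous_content", "hate_speech", "harassment")

def allowed(scores: dict[str, float], thresholds: dict[str, float]) -> bool:
    """Pass only if every category's probability stays under its threshold."""
    return all(scores[c] <= thresholds[c] for c in HARM_CATEGORIES)

def moderated_reply(message: str, classify, generate, thresholds=None) -> str:
    thresholds = thresholds or {c: 0.5 for c in HARM_CATEGORIES}  # tune per application
    if not allowed(classify(message), thresholds):     # pre-filter: screen the request
        return "Request declined by safety policy."
    reply = generate(message)
    if not allowed(classify(reply), thresholds):       # post-filter: screen the response
        return "Response withheld by safety policy."
    return reply
```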
On public safety benchmarks, ShieldGemma outperformed the then-current Llama Guard by 10.8 percentage points on AU-PRC (area under the precision-recall curve) and WildGuard by 4.3 percentage points. The technical paper was posted to arXiv as 2407.21772.
Limitations documented in the ShieldGemma paper include high sensitivity to the precise wording of the safety principles provided as context, inconsistent handling of implicit harm (where harmful intent is inferable but not stated), and potential gaps for harm categories underrepresented in evaluation data. ShieldGemma is included in Google's Responsible Generative AI Toolkit alongside other safety and evaluation resources.
The following tables present Gemma 2 performance on standard academic benchmarks as reported in the technical report (arXiv:2408.00118). Base model results use few-shot prompting. Chatbot Arena scores reflect human preference ratings collected by LMSYS.
| Benchmark | Gemma 2 2B | Gemma 2 9B | Gemma 2 27B | Llama 3 8B | Llama 3 70B |
|---|---|---|---|---|---|
| MMLU (5-shot) | 52.2% | 71.3% | 75.2% | 66.6% | 79.2% |
| GSM8K (5-shot) | 24.3% | 68.6% | 75.1% | 45.7% | 76.9% |
| MATH (4-shot) | 16.0% | 36.6% | 42.3% | -- | -- |
| HumanEval (0-shot) | 20.1% | 40.2% | 51.8% | -- | -- |
| ARC-Challenge | 55.7% | 68.4% | 71.4% | 59.2% | 68.8% |
| HellaSwag | -- | 81.9% | 86.4% | 82.0% | 88.0% |
| Winogrande | -- | 80.6% | 83.7% | 78.5% | 85.3% |
Head-to-head comparison of Gemma 2 9B against similarly sized open-weights models:

| Benchmark | Mistral 7B | Llama 3 8B | Gemma 2 9B |
|---|---|---|---|
| MMLU | 62.5% | 66.6% | 71.3% |
| GSM8K | 34.5% | 45.7% | 62.3% |
| ARC-Challenge | 60.5% | 59.2% | 68.4% |
| HellaSwag | 83.0% | 82.0% | 81.9% |
| Winogrande | 78.5% | 78.5% | 80.6% |
Chatbot Arena Elo ratings for Gemma 2 and selected reference models:

| Model | Chatbot Arena Elo |
|---|---|
| GPT-4o | ~1,285 |
| Gemma 2 27B | 1,218 |
| Llama 3 70B | 1,206 |
| Gemma 2 9B | 1,187 |
| GPT-4-0314 | 1,186 |
| Gemma 2 2B | 1,126 |
| GPT-3.5-Turbo | 1,116 |
| Mixtral 8x7B | 1,114 |
Chatbot Arena ratings are derived from head-to-head human preference votes using the Bradley-Terry model. The Gemma 2 27B instruction-tuned model exceeded Llama 3 70B (a model with more than twice as many parameters) by 12 Elo points. Gemma 2 9B effectively matched GPT-4-0314 (1,187 versus 1,186). Gemma 2 2B surpassed both GPT-3.5-Turbo and Mixtral 8x7B.
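Under the Bradley-Terry model, an Elo gap translates directly into an expected preference rate, which puts the 12-point margin in perspective:

```python
def preference_rate(elo_a: float, elo_b: float) -> float:
    """Bradley-Terry / Elo: expected probability that model A wins a vote."""
    return 1 / (1 + 10 ** ((elo_b - elo_a) / 400))

print(f"{preference_rate(1218, 1206):.3f}")  # ~0.517: Gemma 2 27B vs Llama 3 70B
print(f"{preference_rate(1126, 1114):.3f}")  # same slim edge for 2B vs Mixtral 8x7B
```

A 12-point Elo lead thus corresponds to winning roughly 51.7% of pairwise votes: statistically meaningful at Arena sample sizes, but a narrow preference rather than a dominant one.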
Llama 3, released by Meta in April 2024, and Mistral 7B, released by Mistral AI in September 2023, were the two most widely used open-weights baselines at the time of Gemma 2's release.
| Dimension | Mistral 7B | Llama 3 8B | Gemma 2 9B | Llama 3 70B | Gemma 2 27B |
|---|---|---|---|---|---|
| Parameters | 7B | 8B | 9B | 70B | 27B |
| Context window | 32K | 8K | 8K | 8K | 8K |
| Training tokens | 1T | 15T | 8T | 15T | 13T |
| MMLU | 62.5% | 66.6% | 71.3% | 79.2% | 75.2% |
| GSM8K | 34.5% | 45.7% | 62.3% | 76.9% | 75.1% |
| License | Apache 2.0 | Llama 3 Community | Gemma ToU | Llama 3 Community | Gemma ToU |
| Architecture | Sliding window | Dense transformer | Hybrid attention | Dense transformer | Hybrid attention |
Gemma 2 9B outperformed Mistral 7B on every reported benchmark except HellaSwag, where Mistral 7B held a narrow margin of 83.0% to 81.9%. The gap on math and reasoning was large: Gemma 2 9B scored 62.3% on GSM8K compared to 34.5% for Mistral 7B. Against Llama 3 8B, Gemma 2 9B led on all reported benchmarks. Gemma 2 27B came within a few points of Llama 3 70B on most tasks despite having well under half as many parameters and training on 2 trillion fewer tokens.
Mistral 7B retains a significant advantage in context length: its 32K window (versus Gemma 2's 8K) supports tasks involving long documents, codebases, or extended multi-turn conversations that exceed what Gemma 2 can process. Mistral 7B also distributes under Apache 2.0, offering irrevocable commercial rights. Llama 3's community license imposes restrictions on redistribution for applications with more than 700 million monthly active users, a threshold well above what most organizations encounter but a consideration for large-scale deployment.
Gemma 2 was designed with practical deployment efficiency as a core goal. The 27B model runs at full bfloat16 precision on a single NVIDIA A100 80GB or H100 80GB GPU, or on a Google Cloud TPU v5e host. The 9B model requires approximately 18 GB of VRAM, fitting on high-end consumer cards including the NVIDIA RTX 4090 (24 GB) and the RTX 3090 (24 GB). With 4-bit quantization via GGUF or GPTQ formats, the 27B model compresses to roughly 18 GB, enabling it to run on the same hardware as the unquantized 9B.
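These figures follow from simple arithmetic on parameter counts. A rough estimator (weights only; the KV cache, activations, and quantization metadata add to the real footprint):

```python
def weight_gib(params_billion: float, bits_per_param: float) -> float:
    """Memory for model weights alone, in GiB."""
    return params_billion * 1e9 * bits_per_param / 8 / 2**30

print(f"{weight_gib(27, 16):.0f} GiB")  # ~50 GiB: 27B at bf16 fits one 80 GB A100/H100
print(f"{weight_gib(9, 16):.0f} GiB")   # ~17 GiB: 9B at bf16 fits a 24 GB RTX 4090/3090
print(f"{weight_gib(27, 4):.0f} GiB")   # ~13 GiB: 4-bit 27B weights; real GGUF/GPTQ files
                                        # run larger (~18 GB) due to mixed precision and scales
```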
Documented deployment use cases include:

- on-device and edge inference with the 2B model, including on the free tier of an NVIDIA T4;
- local development and experimentation with the 9B model on 24 GB consumer GPUs;
- single-accelerator production serving of the 27B on an A100, H100, or TPU v5e host;
- quantized local inference of the 27B via GGUF or GPTQ on the same consumer hardware as the unquantized 9B.
The instruction-tuned models use a chat template with <start_of_turn> and <end_of_turn> tokens marking speaker boundaries, which maps straightforwardly onto Hugging Face Transformers chat template formats and is supported natively in Ollama and llama.cpp.
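A minimal sketch of the documented turn format; in practice, tokenizer.apply_chat_template in Hugging Face Transformers produces this string automatically:

```python
def format_gemma_prompt(turns: list[tuple[str, str]]) -> str:
    """Render a conversation in Gemma's chat format.

    Each turn is (role, text) with role "user" or "model". The trailing
    model header cues the model to generate its reply.
    """
    prompt = "<bos>"
    for role, text in turns:
        prompt += f"<start_of_turn>{role}\n{text}<end_of_turn>\n"
    return prompt + "<start_of_turn>model\n"

print(format_gemma_prompt([("user", "Why is the sky blue?")]))
```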
Gemma 2 integration across the open model ecosystem moved quickly after the June 2024 release. Hugging Face Transformers added support in version 4.42, and all six canonical model variants (base and instruction-tuned across three sizes) were hosted in the google organization on Hugging Face. The Text Generation Inference (TGI) server added Gemma 2 support for high-throughput production deployments. Google AI Studio provided free access to the instruction-tuned 9B and 27B models, and Kaggle hosted the models with notebook integration for experimentation.
Community fine-tuned variants appeared within days of the initial release. The Gemmaverse, Google's informal name for the community ecosystem around Gemma models, accumulated hundreds of derivative checkpoints on Hugging Face within the first few months. Notable community derivatives built on Gemma 2 included:

- SEA-LION, continued pre-training targeting Southeast Asian languages;
- Navarasa, a fine-tune covering Indic languages;
- a long tail of task-specific and language-specific fine-tunes published by individual developers.
Google published Gemma.cpp as a dedicated C++ inference engine for the Gemma architecture, and compatibility was added to vLLM for high-throughput batched serving. GGUF-quantized community releases for llama.cpp and Ollama appeared within hours or days of each official Google release.
Vertex AI added Gemma 2 to its model garden for managed enterprise deployment, with monitoring and access control capabilities not available in self-hosted configurations. By the time Gemma 3 was announced in early 2025, the Gemma 2 models had accumulated millions of downloads across Hugging Face and Kaggle.
Gemma 2 received broadly positive coverage from the AI research and developer community at launch. The Chatbot Arena results attracted particular attention because they demonstrated that parameter count is not the primary determinant of human preference ratings when architecture and training methodology are both improved. A Hacker News thread shortly after the 27B release called it "exceptionally strong," and several commenters noted that the 27B model's Elo score exceeding Llama 3 70B was a more informative result than standard benchmark scores because it reflected real user preference rather than curated academic tasks.
The Gemma Scope release was described by the interpretability research community as a significant contribution. The scale of the SAE collection (400+ autoencoders, 30 million features) was beyond what any individual lab had previously made available in open form. Researchers noted that having SAEs trained at every layer and sublayer of 2B and 9B models enabled types of circuit analysis that had previously been feasible only at small scales or on proprietary infrastructure.
ShieldGemma received more mixed commentary. The performance improvements over prior safety classifiers were acknowledged, but practitioners found the sensitivity to prompt phrasing a practical obstacle in production. Setting the right safety principle description to get consistent results required iteration that was not always straightforward.
Some reviews focused on the Gemma Terms of Use. The commercial permissions were seen as adequate for most use cases, but the revocation provision was flagged as a structural difference from Apache 2.0 that made Gemma 2 less suitable for enterprise deployments where legal counsel required irrevocable rights. This was frequently contrasted with Mistral AI's fully permissive licensing approach.
Gemma 2 also generated significant discussion about what the distillation results imply for the practical value of scaling pretraining compute. The paper's data showed that a student trained with distillation reached a performance level that a same-size student trained from scratch could not reach regardless of the quantity of compute applied. This was taken as evidence that access to a strong teacher model may matter as much as raw scale, a point with implications for how organizations without frontier-model training budgets should approach model development.
All Gemma 2 models are capped at an 8,192-token context window. This is identical to Llama 3 8B at release, but substantially shorter than the 32K offered by Mistral 7B and far below the 128K or longer contexts provided by Claude 3, GPT-4 Turbo, and later models. Tasks involving long documents, full codebases, multi-document research summaries, or extended dialogue histories are constrained by this limit. The sliding window attention mechanism operates efficiently within 8K but does not provide a path to extending the effective context at inference time without retraining.
Gemma 3, released by Google DeepMind in early 2025, extended the context window to 128K tokens across all model sizes, directly addressing this gap.
The technical report describes the training corpus as web documents, code, and scientific articles with English-primary coverage, but does not disclose source proportions, filtering criteria, or specific datasets. This limits the ability of external researchers to audit for data contamination on benchmarks, identify demographic or cultural biases, or assess the representation of specific domains. The opacity is consistent across the Gemma and Gemini families and reflects Google's standard practice, but it differs from more transparent documentation in some contemporaneous releases.
The primarily English-language training corpus means Gemma 2 performs unevenly on non-English tasks. The Gemma 2 JPN 2B variant addressed Japanese specifically, but no equivalent targeted release was made for other languages at launch. Community projects like SEA-LION and Navarasa partially filled this gap through continued pre-training, but these required additional resources to produce and were not available at release.
The logit soft-capping design means Gemma 2 cannot use FlashAttention or PyTorch SDPA in standard form, both of which provide 2-4x speedups and significant memory savings during fine-tuning. Users must run eager attention, which is slower and uses more GPU memory. This practical constraint was a recurring topic in community discussions around fine-tuning Gemma 2, especially for the 27B model where memory pressure is already significant.
Human evaluation results in the technical report noted weaknesses on complex end-to-end tasks requiring sequential reasoning, tool use, or multi-step planning. The model passed fewer end-to-end task challenges than human baselines in certain evaluations, including multi-step system interaction tasks. This limitation was present across open models at the time and was not unique to Gemma 2, but it is worth noting as a constraint on agentic use cases.