Gemma 2 is a family of open-weights large language models developed by Google DeepMind and released starting June 27, 2024. The family spans three parameter sizes: 2 billion (2B), 9 billion (9B), and 27 billion (27B). Gemma 2 introduced several architectural innovations compared to its predecessor, including interleaved sliding window and global attention, logit soft-capping for training stability, grouped-query attention, and knowledge distillation from a larger teacher model. On the LMSYS Chatbot Arena leaderboard, Gemma 2 27B posted an Elo score above Llama 3 70B despite having fewer than half as many parameters, and the 9B model effectively matched GPT-4-0314. Google released both base and instruction-tuned variants under a custom Gemma Terms of Use that permits commercial deployment. The release was accompanied by Gemma Scope, an open collection of sparse autoencoders for mechanistic interpretability research, and ShieldGemma, a safety content moderation classifier built on the same weights.
The Gemma model family began in February 2024, when Google DeepMind published Gemma 1 in 2B and 7B parameter sizes. That first generation used a standard decoder-only transformer architecture derived from the one underlying Gemini. Gemma 1 was notable for demonstrating that compact, openly released models could compete with models two to four times larger on standard reasoning and language benchmarks. Both base and instruction-tuned checkpoints were published under a relatively permissive custom license, enabling fine-tuning and commercial use for most applications. The initial reception was strong enough that Google committed to treating Gemma as an ongoing model family rather than a one-time release.
Despite that positive start, Gemma 1 had several constraints. Its training used roughly 6 trillion tokens for the 7B model, and the architecture did not incorporate techniques like alternating attention patterns or knowledge distillation. Context length was set at 8,192 tokens. The instruction-tuning pipeline was relatively straightforward compared to what Google applied in Gemma 2.
Gemma 2 was announced at Google I/O 2024 and formally released on June 27, 2024, with the 9B and 27B sizes immediately available. The 2B model followed on July 31, 2024. A Japanese-language variant, Gemma 2 JPN 2B, shipped on October 3, 2024. The technical report was posted to arXiv under the identifier 2408.00118.
Gemma 2 comes in three sizes, each available in a base (pretrained) and an instruction-tuned (IT) version. The instruction-tuned versions are aligned for conversational and instruction-following tasks through supervised fine-tuning (SFT), reinforcement learning from human feedback (RLHF), and model merging via the three-stage WARP procedure described below.
| Variant | Parameters | Layers | Model dimension | Attention heads | KV heads | Training tokens |
|---|---|---|---|---|---|---|
| Gemma 2 2B | 2.6 billion | 26 | 2,304 | 8 | 4 | 2 trillion |
| Gemma 2 9B | 9 billion | 42 | 3,584 | 16 | 8 | 8 trillion |
| Gemma 2 27B | 27 billion | 46 | 4,608 | 32 | 16 | 13 trillion |
All three share a vocabulary of 256,128 tokens built with SentencePiece, a context window of 8,192 tokens, RoPE (Rotary Position Embedding) positional encoding, and GeGLU activation functions. The head size is 256 for the 2B and 9B models and 128 for the 27B.
The 27B model was trained from scratch using standard next-token prediction. The 9B and 2B models used knowledge distillation from the 27B model during pretraining, learning from the teacher's output distributions rather than only from raw data targets. All three underwent the same post-training alignment pipeline.
The 2B model, released on July 31, 2024, drew particular attention for its size-to-performance ratio. With 2.6 billion parameters, it is small enough to run inference on the free tier of an NVIDIA T4 GPU and to fit comfortably on edge hardware. LMSYS Chatbot Arena evaluations placed the Gemma 2 2B instruction-tuned model at an Elo of approximately 1,126 to 1,130, ahead of GPT-3.5-Turbo-0613 (1,117) and Mixtral 8x7B (1,114), the latter being a mixture-of-experts model that activates roughly five times as many parameters per token. On MMLU the 2B model scored 52.2%, and on MBPP (Mostly Basic Python Programming) it reached 36.6%.
VentureBeat described the 2B release as a "surprising upset," citing the rarity of a sub-3B model outperforming much larger baselines on human preference evaluations. The result was attributed largely to the knowledge distillation training approach, where the small model absorbed signals from the full 27B teacher distribution rather than learning only from one-hot next-token labels.
Gemma 2 shares the decoder-only transformer foundation of its predecessor but adds four architectural features not present in Gemma 1: interleaved attention patterns, logit soft-capping, grouped-query attention, and a hybrid normalization scheme.
Every other transformer layer in Gemma 2 applies local sliding window attention, where each token can attend only to the 4,096 preceding tokens. The alternating layers use standard global (full quadratic) attention across the full 8,192-token context. This means the model alternates between a local view (4K window) and a global view (8K window) as tokens pass through successive layers.
The motivation is computational: self-attention scales quadratically with sequence length, so applying full attention in every layer is expensive. Using sliding window attention in half the layers reduces the total attention cost, while the global layers preserve the ability to reason across distant tokens. Sliding window attention itself was popularized by Mistral 7B, which applies it uniformly in every layer; Gemma 2's layer-by-layer alternation differs from that uniform application and performed better in the technical report's ablations.
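The alternation can be made concrete as attention masks. The sketch below (PyTorch) builds the boolean mask for a single layer; treating even-indexed layers as the local ones is an illustrative assumption, since the report only specifies that the two patterns alternate.

```python
import torch

def gemma2_layer_mask(seq_len: int, layer_idx: int, window: int = 4096) -> torch.Tensor:
    """Boolean attention mask (True = may attend) for a single layer.

    Even-indexed layers here use local sliding-window attention; odd-indexed
    layers use global causal attention. Which parity is local is an
    illustrative choice, not a detail from the report.
    """
    pos = torch.arange(seq_len)
    causal = pos[None, :] <= pos[:, None]            # token i attends only to j <= i
    if layer_idx % 2 == 0:                           # local layer: 4,096-token window
        return causal & (pos[:, None] - pos[None, :] < window)
    return causal                                    # global layer: full context

local_mask = gemma2_layer_mask(8192, layer_idx=0)
global_mask = gemma2_layer_mask(8192, layer_idx=1)
# A token at position 6000 cannot see token 0 in a local layer,
# but the next (global) layer restores the long-range connection.
assert not local_mask[6000, 0] and global_mask[6000, 0]
```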
Gemma 2 applies a soft-capping non-linearity in two places: to attention logits before the softmax operation inside each attention layer, and to the final vocabulary logits produced by the output projection. The formula is:
logits <- soft_cap * tanh(logits / soft_cap)
The self-attention layers use a cap value of 50.0. The final output layer uses 30.0. This operation squashes extreme logit values into a bounded range without the discontinuities introduced by hard clipping. The practical effect is a more stable loss curve during training, particularly in the early stages when logit values tend to grow rapidly.
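A minimal sketch of the operation in PyTorch, using the published cap values:

```python
import torch

ATTN_CAP = 50.0    # applied to attention logits
FINAL_CAP = 30.0   # applied to final vocabulary logits

def soft_cap(logits: torch.Tensor, cap: float) -> torch.Tensor:
    """Smoothly bound logits to (-cap, cap): tanh saturates, so extreme
    values flatten out near the cap instead of being hard-clipped."""
    return cap * torch.tanh(logits / cap)

x = torch.tensor([10.0, 100.0, 1000.0])
print(soft_cap(x, ATTN_CAP))   # ~[9.87, 48.20, 50.00]: extremes saturate near the cap
```

Because tanh is smooth everywhere, gradients never vanish abruptly the way they do at the boundary of a hard clip.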
A significant engineering consequence is that soft-capping breaks compatibility with FlashAttention and PyTorch's fused scaled-dot-product attention (SDPA), both of which assume standard unbounded logits. Users fine-tuning Gemma 2 must switch to eager (unfused) attention, which uses more GPU memory and runs more slowly. This tradeoff was acknowledged in the Hugging Face Transformers documentation for Gemma 2 and was a common point of friction for practitioners.
All three Gemma 2 sizes use grouped-query attention (GQA) with two query heads per key-value head pair (num_groups=2). Standard multi-head attention (MHA) maintains separate key and value tensors for every attention head, which grows the key-value cache proportionally with the number of heads and the sequence length. GQA reduces this by having multiple query heads share a single key and value head. At num_groups=2, the KV cache is roughly half the size of what standard MHA would require, which translates directly to lower GPU memory usage during inference and enables higher batch sizes or longer sequences on the same hardware.
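The memory saving can be checked with back-of-envelope arithmetic. The sketch below uses the 9B configuration from the table above and compares the KV cache under GQA against a hypothetical MHA baseline at bfloat16 precision:

```python
def kv_cache_gib(layers: int, kv_heads: int, head_dim: int,
                 seq_len: int, bytes_per_value: int = 2) -> float:
    """KV cache size for one sequence: a key and a value tensor per layer,
    each of shape (kv_heads, seq_len, head_dim), at bfloat16 (2 bytes)."""
    return 2 * layers * kv_heads * seq_len * head_dim * bytes_per_value / 2**30

# Gemma 2 9B: 42 layers, head size 256, 8,192-token context (see table above)
print(kv_cache_gib(42, kv_heads=8,  head_dim=256, seq_len=8192))   # GQA: ~2.6 GiB
print(kv_cache_gib(42, kv_heads=16, head_dim=256, seq_len=8192))   # MHA baseline: ~5.2 GiB
```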
Gemma 2 applies RMSNorm in both the pre-norm position (before the attention or feedforward sublayer) and the post-norm position (on the sublayer's output, before it is added back to the residual stream). Most contemporary decoder-only models use only pre-norm. The dual-norm approach reduces activation magnitude drift in deep networks and was found to improve training stability in ablations, particularly for the 27B model, where gradient flow through 46 layers can be problematic.
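A minimal sketch of the sandwich pattern, assuming PyTorch's built-in nn.RMSNorm (available since PyTorch 2.4) and a generic sublayer standing in for attention or the feedforward network:

```python
import torch
import torch.nn as nn

class SandwichBlock(nn.Module):
    """One residual branch with Gemma 2-style dual normalization."""
    def __init__(self, dim: int, sublayer: nn.Module):
        super().__init__()
        self.pre_norm = nn.RMSNorm(dim)    # normalizes the sublayer input
        self.post_norm = nn.RMSNorm(dim)   # normalizes the sublayer output
        self.sublayer = sublayer           # attention or feedforward, interchangeably

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Both norms sit inside the residual branch, so the skip
        # connection itself remains an identity path.
        return x + self.post_norm(self.sublayer(self.pre_norm(x)))

block = SandwichBlock(dim=2304, sublayer=nn.Linear(2304, 2304))
out = block(torch.randn(1, 8, 2304))
```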
The instruction-tuned models use a procedure called WARP (Weight Averaged Rewarded Policies) during the reinforcement learning stage of post-training. WARP applies three sequential operations:

- Exponential moving average (EMA): during RL fine-tuning, the policy is regularized toward an exponential moving average of its own weights, which serves as a dynamic anchor.
- Spherical linear interpolation (SLERP): several independently fine-tuned policies are merged by spherically interpolating their weights.
- Linear interpolation toward initialization (LITI): the merged weights are interpolated partway back toward the pre-RL initialization.
The third step is unusual and addresses a known problem with RLHF: policies optimized too aggressively against a reward model tend to collapse toward narrow, reward-hacking behaviors. By pulling weights slightly back toward initialization, LITI acts as a regularizer that preserves broader capabilities while retaining alignment gains. The technical report showed improved benchmark scores for all three model sizes when using WARP compared to standard RLHF.
The 27B model trained on approximately 13 trillion tokens, making it one of the most extensively trained models in its parameter class at the time of release. The 9B model used 8 trillion tokens and the 2B model 2 trillion tokens. All three corpora draw from web documents (primarily English-language), code repositories, and scientific articles. The exact data mix, filtering methodology, and source proportions were not disclosed, consistent with Google's general practice for Gemini-family training.
Data processing included quality filtering to remove boilerplate and low-quality web text, deduplication to limit the influence of repeated content, and safety filtering to reduce explicit harmful content in the training signal. The SentencePiece tokenizer uses a 256,128-token vocabulary, considerably larger than the 32,000 tokens in Llama 2 or the 50,257 in GPT-2-era models. The enlarged vocabulary helps the model handle non-English text, programming languages, and mathematical notation more efficiently, because common multi-character patterns in those domains are represented as single tokens rather than character sequences.
For the 9B and 2B models, the pretraining objective was augmented with knowledge distillation from the 27B teacher. Rather than training only on the binary signal of whether the model predicted the correct next token, the smaller models were trained to match the full probability distribution the teacher assigned to all tokens at each position. The loss function minimizes the KL divergence between teacher and student distributions, computed over the same training corpus.
This approach, sometimes called token-level distillation, provides a richer training signal than standard cross-entropy on one-hot targets. The teacher's distribution encodes information about which tokens are nearly correct alternatives, which is information that gets discarded in standard language modeling. The benefit was quantified in the technical report: a 2B model trained with distillation for 500 billion tokens averaged 67.7 on three evaluation benchmarks, compared to 60.3 for the same model trained without distillation, a difference of about 12% in relative terms.
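A minimal sketch of the token-level objective in PyTorch. The exact loss weighting, and whether a standard cross-entropy term is mixed in, are not specified here and should be read as assumptions:

```python
import torch
import torch.nn.functional as F

def distill_loss(student_logits: torch.Tensor, teacher_logits: torch.Tensor) -> torch.Tensor:
    """Token-level distillation: KL(teacher || student) over the vocabulary.

    Both inputs have shape (batch, seq_len, vocab). The teacher is frozen,
    so its logits are detached from the graph.
    """
    s = F.log_softmax(student_logits, dim=-1).flatten(0, 1)            # (batch*seq, vocab)
    t = F.log_softmax(teacher_logits.detach(), dim=-1).flatten(0, 1)
    return F.kl_div(s, t, log_target=True, reduction="batchmean")      # mean KL per token
```

Minimizing this KL drives the student to reproduce the teacher's full distribution, including the relative probabilities of near-miss alternatives that a one-hot target discards.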
All Gemma 2 sizes underwent a multi-stage post-training pipeline:

- supervised fine-tuning (SFT) on a mixture of synthetic and human-generated prompt-response pairs, with responses largely distilled from a larger teacher model;
- reinforcement learning from human feedback (RLHF), using a reward model substantially larger than the policy being trained;
- model merging via the WARP averaging procedure described above.
The teacher model for post-training distillation was not named in the report. Given the context and Google's infrastructure, it is widely understood to be a variant of Gemini, most likely a version of Gemini 1.5.
Training used Google Cloud TPU pods of different generations by model size: TPUv5e (512 chips arranged in a 2x16x16 configuration) for the 2B model, TPUv4 (4,096 chips in an 8x16x32 configuration) for the 9B, and TPUv5p (6,144 chips in an 8x24x32 configuration) for the 27B. All training ran on JAX with the ML Pathways system for distributed computation. Total reported carbon emissions were approximately 1,247.61 tCO2eq across all three models, produced at Google's carbon-neutral data centers.
Gemma 2 is distributed under the Gemma Terms of Use, a custom license that permits commercial use but differs from standard open-source licenses such as Apache 2.0 or MIT. The key provisions are:

- commercial use, distribution, and fine-tuning are permitted, including for derivative models;
- all use must comply with Google's Gemma Prohibited Use Policy;
- anyone redistributing the models or derivatives must pass the use restrictions downstream and provide recipients with the terms;
- Google reserves the right to restrict use of the models, including remotely, if it reasonably believes the terms have been violated.
The revocation provision was the most-discussed limitation. Unlike Apache 2.0, which grants irrevocable rights once distributed, the Gemma license lets Google remotely restrict access. In practice this provision was rarely invoked, but it meant that enterprises with strict procurement requirements around irrevocable licenses could not adopt Gemma 2 under the same terms they might accept for an Apache-licensed model.
Gemma 3, released in March 2025, continued to use a custom Gemma license rather than switching to a standard open-source license; Gemma 2 likewise remains under the original custom terms.
Alongside the Gemma 2 models, Google DeepMind released Gemma Scope, an open collection of sparse autoencoders (SAEs) trained on the internal activations of Gemma 2. The goal is to support mechanistic interpretability research: the practice of reverse-engineering what computations a neural network performs, rather than studying only its external input-output behavior.
Sparse autoencoder techniques for language models work by training an auxiliary model to reconstruct a target layer's activations as a sparse linear combination of learned feature vectors. Because the reconstruction must be sparse, the learned features tend to be interpretable concepts rather than statistical noise. Each feature can be inspected by finding the inputs that maximally activate it, which often reveals that a given feature corresponds to a recognizable concept such as a person's name, a syntactic pattern, or a topic domain.
Gemma Scope trained SAEs at every layer and sublayer of Gemma 2 2B and Gemma 2 9B, and at selected layers of Gemma 2 27B. The resulting collection comprises over 400 individual autoencoders with more than 30 million total learned features. The SAE training consumed approximately 15% of the total compute used to pretrain Gemma 2 9B, and the team stored roughly 20 pebibytes of intermediate activations. The JumpReLU SAE architecture was selected because it achieves a better balance between detecting feature presence and estimating feature magnitude than standard ReLU-based SAEs.
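The JumpReLU mechanism can be sketched compactly. In the toy PyTorch module below, the dimensions and initialization are illustrative, and the straight-through gradient estimator needed to actually train the thresholds is omitted:

```python
import torch
import torch.nn as nn

class JumpReLUSAE(nn.Module):
    """Toy JumpReLU sparse autoencoder in the spirit of Gemma Scope.

    A feature fires only if its pre-activation clears a learned per-feature
    threshold, and when it fires it keeps its full value. This separates
    "is the feature present?" from "how strong is it?".
    """
    def __init__(self, d_model: int = 2304, n_features: int = 16384):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)
        self.log_threshold = nn.Parameter(torch.zeros(n_features))

    def forward(self, activations: torch.Tensor):
        pre = self.encoder(activations)
        features = pre * (pre > self.log_threshold.exp())  # JumpReLU gate
        return self.decoder(features), features            # reconstruction, sparse codes
```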
Google DeepMind described the release as the largest open-source interpretability tool release from any AI lab at the time. Weights were published on Hugging Face under the google/gemma-scope repository. The Neuronpedia platform provided an interactive web interface for browsing learned features without requiring a local GPU. The technical paper was posted to arXiv as 2408.05147.
Intended use cases include studying how models represent factual knowledge, analyzing where unsafe behaviors originate in the computation graph, investigating chain-of-thought faithfulness, and detecting deceptive reasoning. The scale of the Gemma Scope release made it possible for small research groups to conduct mechanistic interpretability experiments that would have required significant proprietary infrastructure to replicate from scratch.
Gemma Scope 2, released alongside Gemma 3 in 2025, extended coverage to all Gemma 3 model sizes (270M to 27B) and added transcoders, which model how activations in one layer are transformed to produce activations in the next, enabling analysis of computational flow rather than static representations alone.
ShieldGemma is a suite of safety content moderation classifiers released by Google alongside the Gemma 2 family. The models are fine-tuned from Gemma 2 base weights to classify text inputs and outputs for four harm categories: sexually explicit content, dangerous or harmful content, hate speech, and harassment. ShieldGemma is available in 2B, 9B, and 27B parameter sizes, matching the Gemma 2 family.
The classifier is designed for two deployment patterns. In the pre-filter configuration, ShieldGemma processes incoming user messages before they reach the main generation model, blocking requests classified as harmful above the operator's chosen threshold. In the post-filter configuration, it screens the model's responses before they are returned to the user. Both patterns can be used simultaneously for defense in depth.
Unlike binary classifiers that output only a harmful/not-harmful label, ShieldGemma outputs a probability score for each harm category, giving operators control over the sensitivity-specificity tradeoff for their specific application.
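A hypothetical wiring of the two deployment patterns around per-category scores. The category names mirror the four harm types, but the function signatures and threshold values here are illustrative, not ShieldGemma's actual API:

```python
HARM_CATEGORIES = ("sexually_explicit", "dangerous_content", "hate_speech", "harassment")

def allowed(scores: dict[str, float], thresholds: dict[str, float]) -> bool:
    """Pass only if every category's probability stays under its threshold."""
    return all(scores[c] <= thresholds[c] for c in HARM_CATEGORIES)

def moderated_reply(message: str, classify, generate, thresholds=None) -> str:
    thresholds = thresholds or {c: 0.5 for c in HARM_CATEGORIES}  # tune per application
    if not allowed(classify(message), thresholds):     # pre-filter: screen the request
        return "Request declined by safety policy."
    reply = generate(message)
    if not allowed(classify(reply), thresholds):       # post-filter: screen the response
        return "Response withheld by safety policy."
    return reply
```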
On public safety benchmarks, ShieldGemma outperformed the then-current Llama Guard by 10.8 percentage points on AU-PRC (area under the precision-recall curve) and WildGuard by 4.3 percentage points. The technical paper was posted to arXiv as 2407.21772.
Limitations documented in the ShieldGemma paper include high sensitivity to the precise wording of the safety principles provided as context, inconsistent handling of implicit harm (where harmful intent is inferable but not stated), and potential gaps for harm categories underrepresented in evaluation data. ShieldGemma is included in Google's Responsible Generative AI Toolkit alongside other safety and evaluation resources.
The following tables present Gemma 2 performance on standard academic benchmarks as reported in the technical report (arXiv:2408.00118). Base model results use few-shot prompting. Chatbot Arena scores reflect human preference ratings collected by LMSYS.
| Benchmark | Gemma 2 2B | Gemma 2 9B | Gemma 2 27B | Llama 3 8B | Llama 3 70B |
|---|---|---|---|---|---|
| MMLU (5-shot) | 52.2% | 71.3% | 75.2% | 66.6% | 79.2% |
| GSM8K (5-shot) | 24.3% | 68.6% | 75.1% | 45.7% | 76.9% |
| MATH (4-shot) | 16.0% | 36.6% | 42.3% | -- | -- |
| HumanEval (0-shot) | 20.1% | 40.2% | 51.8% | -- | -- |
| ARC-Challenge | 55.7% | 68.4% | 71.4% | 59.2% | 68.8% |
| HellaSwag | -- | 81.9% | 86.4% | 82.0% | 88.0% |
| Winogrande | -- | 80.6% | 83.7% | 78.5% | 85.3% |
Head-to-head comparison of Gemma 2 9B against similarly sized open-weights models:

| Benchmark | Mistral 7B | Llama 3 8B | Gemma 2 9B |
|---|---|---|---|
| MMLU | 62.5% | 66.6% | 71.3% |
| GSM8K | 34.5% | 45.7% | 62.3% |
| ARC-Challenge | 60.5% | 59.2% | 68.4% |
| HellaSwag | 83.0% | 82.0% | 81.9% |
| Winogrande | 78.5% | 78.5% | 80.6% |
Chatbot Arena Elo ratings for Gemma 2 and selected reference models:

| Model | Chatbot Arena Elo |
|---|---|
| GPT-4o | ~1,285 |
| Gemma 2 27B | 1,218 |
| Llama 3 70B | 1,206 |
| Gemma 2 9B | 1,187 |
| GPT-4-0314 | 1,186 |
| Gemma 2 2B | 1,126 |
| GPT-3.5-Turbo | 1,116 |
| Mixtral 8x7B | 1,114 |
Chatbot Arena ratings are derived from head-to-head human preference votes using the Bradley-Terry model. The Gemma 2 27B instruction-tuned model exceeded Llama 3 70B (a model with more than twice as many parameters) by 12 Elo points. Gemma 2 9B effectively matched GPT-4-0314 (1,187 versus 1,186). Gemma 2 2B surpassed both GPT-3.5-Turbo and Mixtral 8x7B.
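Under the Bradley-Terry model, an Elo gap translates directly into an expected preference rate, which puts the 12-point margin in perspective:

```python
def preference_rate(elo_a: float, elo_b: float) -> float:
    """Bradley-Terry / Elo: expected probability that model A wins a vote."""
    return 1 / (1 + 10 ** ((elo_b - elo_a) / 400))

print(f"{preference_rate(1218, 1206):.3f}")  # ~0.517: Gemma 2 27B vs Llama 3 70B
print(f"{preference_rate(1126, 1114):.3f}")  # same slim edge for 2B vs Mixtral 8x7B
```

A 12-point Elo lead thus corresponds to winning roughly 51.7% of pairwise votes: statistically meaningful at Arena sample sizes, but a narrow preference rather than a dominant one.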
Llama 3, released by Meta in April 2024, and Mistral 7B, released by Mistral AI in September 2023, were the two most widely used open-weights baselines at the time of Gemma 2's release.
| Dimension | Mistral 7B | Llama 3 8B | Gemma 2 9B | Llama 3 70B | Gemma 2 27B |
|---|---|---|---|---|---|
| Parameters | 7B | 8B | 9B | 70B | 27B |
| Context window | 32K | 8K | 8K | 8K | 8K |
| Training tokens | 1T | 15T | 8T | 15T | 13T |
| MMLU | 62.5% | 66.6% | 71.3% | 79.2% | 75.2% |
| GSM8K | 34.5% | 45.7% | 62.3% | 76.9% | 75.1% |
| License | Apache 2.0 | Llama 3 Community | Gemma ToU | Llama 3 Community | Gemma ToU |
| Architecture | Sliding window | Dense transformer | Hybrid attention | Dense transformer | Hybrid attention |
Gemma 2 9B outperformed Mistral 7B on every reported benchmark except HellaSwag, where Mistral 7B held a narrow margin of 83.0% to 81.9%. The gap on math and reasoning was large: Gemma 2 9B scored 62.3% on GSM8K compared to 34.5% for Mistral 7B. Against Llama 3 8B, Gemma 2 9B led on all reported benchmarks. Gemma 2 27B came within a few points of Llama 3 70B on most tasks despite having well under half as many parameters and training on 2 trillion fewer tokens.
Mistral 7B retains a significant advantage in context length: its 32K window (versus Gemma 2's 8K) supports tasks involving long documents, codebases, or extended multi-turn conversations that exceed what Gemma 2 can process. Mistral 7B also distributes under Apache 2.0, offering irrevocable commercial rights. Llama 3's community license imposes restrictions on redistribution for applications with more than 700 million monthly active users, a threshold well above what most organizations encounter but a consideration for large-scale deployment.
Gemma 2 was designed with practical deployment efficiency as a core goal. The 27B model runs at full bfloat16 precision on a single NVIDIA A100 80GB or H100 80GB GPU, or on a Google Cloud TPU v5e host. The 9B model requires approximately 18 GB of VRAM, fitting on high-end consumer cards including the NVIDIA RTX 4090 (24 GB) and the RTX 3090 (24 GB). With 4-bit quantization via GGUF or GPTQ formats, the 27B model compresses to roughly 18 GB, enabling it to run on the same hardware as the unquantized 9B.
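These figures follow from simple arithmetic on parameter counts. A rough estimator (weights only; the KV cache, activations, and quantization metadata add to the real footprint):

```python
def weight_gib(params_billion: float, bits_per_param: float) -> float:
    """Memory for model weights alone, in GiB."""
    return params_billion * 1e9 * bits_per_param / 8 / 2**30

print(f"{weight_gib(27, 16):.0f} GiB")  # ~50 GiB: 27B at bf16 fits one 80 GB A100/H100
print(f"{weight_gib(9, 16):.0f} GiB")   # ~17 GiB: 9B at bf16 fits a 24 GB RTX 4090/3090
print(f"{weight_gib(27, 4):.0f} GiB")   # ~13 GiB: 4-bit 27B weights; real GGUF/GPTQ files
                                        # run larger (~18 GB) due to mixed precision and scales
```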
Documented deployment use cases include:

- on-device and edge inference with the 2B model, including on the free tier of an NVIDIA T4;
- local development and experimentation with the 9B model on 24 GB consumer GPUs;
- single-accelerator production serving of the 27B on an A100, H100, or TPU v5e host;
- quantized local inference of the 27B via GGUF or GPTQ on the same consumer hardware as the unquantized 9B.
The instruction-tuned models use a chat template with <start_of_turn> and <end_of_turn> tokens marking speaker boundaries, which maps straightforwardly onto Hugging Face Transformers chat template formats and is supported natively in Ollama and llama.cpp.
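A minimal sketch of the documented turn format; in practice, tokenizer.apply_chat_template in Hugging Face Transformers produces this string automatically:

```python
def format_gemma_prompt(turns: list[tuple[str, str]]) -> str:
    """Render a conversation in Gemma's chat format.

    Each turn is (role, text) with role "user" or "model". The trailing
    model header cues the model to generate its reply.
    """
    prompt = "<bos>"
    for role, text in turns:
        prompt += f"<start_of_turn>{role}\n{text}<end_of_turn>\n"
    return prompt + "<start_of_turn>model\n"

print(format_gemma_prompt([("user", "Why is the sky blue?")]))
```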
Gemma 2 integration across the open model ecosystem moved quickly after the June 2024 release. Hugging Face Transformers added support in version 4.42, and all six canonical model variants (base and instruction-tuned across three sizes) were hosted in the google organization on Hugging Face. The Text Generation Inference (TGI) server added Gemma 2 support for high-throughput production deployments. Google AI Studio provided free access to the instruction-tuned 9B and 27B models, and Kaggle hosted the models with notebook integration for experimentation.
Community fine-tuned variants appeared within days of the initial release. The Gemmaverse, Google's informal name for the community ecosystem around Gemma models, accumulated hundreds of derivative checkpoints on Hugging Face within the first few months. Notable community derivatives built on Gemma 2 included:

- SEA-LION, continued pre-training targeting Southeast Asian languages;
- Navarasa, a fine-tune covering Indic languages;
- a long tail of task-specific and language-specific fine-tunes published by individual developers.
Google published Gemma.cpp as a dedicated C++ inference engine for the Gemma architecture, and compatibility was added to vLLM for high-throughput batched serving. GGUF-quantized community releases for llama.cpp and Ollama appeared within hours or days of each official Google release.
Vertex AI added Gemma 2 to its model garden for managed enterprise deployment, with monitoring and access control capabilities not available in self-hosted configurations. By the time Gemma 3 was announced in early 2025, the Gemma 2 models had accumulated millions of downloads across Hugging Face and Kaggle.
Gemma 2 received broadly positive coverage from the AI research and developer community at launch. The Chatbot Arena results attracted particular attention because they demonstrated that parameter count is not the primary determinant of human preference ratings when architecture and training methodology are both improved. A Hacker News thread shortly after the 27B release called it "exceptionally strong," and several commenters noted that the 27B model's Elo score exceeding Llama 3 70B was a more informative result than standard benchmark scores because it reflected real user preference rather than curated academic tasks.
The Gemma Scope release was described by the interpretability research community as a significant contribution. The scale of the SAE collection (400+ autoencoders, 30 million features) was beyond what any individual lab had previously made available in open form. Researchers noted that having SAEs trained at every layer and sublayer of 2B and 9B models enabled types of circuit analysis that had previously been feasible only at small scales or on proprietary infrastructure.
ShieldGemma received more mixed commentary. The performance improvements over prior safety classifiers were acknowledged, but practitioners found the sensitivity to prompt phrasing a practical obstacle in production. Setting the right safety principle description to get consistent results required iteration that was not always straightforward.
Some reviews focused on the Gemma Terms of Use. The commercial permissions were seen as adequate for most use cases, but the revocation provision was flagged as a structural difference from Apache 2.0 that made Gemma 2 less suitable for enterprise deployments where legal counsel required irrevocable rights. This was frequently contrasted with Mistral AI's fully permissive licensing approach.
Gemma 2 also generated significant discussion about what the distillation results imply for the practical value of scaling pretraining compute. The paper's data showed that a student trained with distillation reached a performance level that a same-size student trained from scratch could not reach regardless of the quantity of compute applied. This was taken as evidence that access to a strong teacher model may matter as much as raw scale, a point with implications for how organizations without frontier-model training budgets should approach model development.
All Gemma 2 models are capped at an 8,192-token context window. This is identical to Llama 3 8B at release, but substantially shorter than the 32K offered by Mistral 7B and far below the 128K or longer contexts provided by Claude 3, GPT-4 Turbo, and later models. Tasks involving long documents, full codebases, multi-document research summaries, or extended dialogue histories are constrained by this limit. The sliding window attention mechanism operates efficiently within 8K but does not provide a path to extending the effective context at inference time without retraining.
Gemma 3, released by Google DeepMind in early 2025, extended the context window to 128K tokens across all model sizes, directly addressing this gap.
The technical report describes the training corpus as web documents, code, and scientific articles with English-primary coverage, but does not disclose source proportions, filtering criteria, or specific datasets. This limits the ability of external researchers to audit for data contamination on benchmarks, identify demographic or cultural biases, or assess the representation of specific domains. The opacity is consistent across the Gemma and Gemini families and reflects Google's standard practice, but it differs from more transparent documentation in some contemporaneous releases.
The primarily English-language training corpus means Gemma 2 performs unevenly on non-English tasks. The Gemma 2 JPN 2B variant addressed Japanese specifically, but no equivalent targeted release was made for other languages at launch. Community projects like SEA-LION and Navarasa partially filled this gap through continued pre-training, but these required additional resources to produce and were not available at release.
The logit soft-capping design means Gemma 2 cannot use FlashAttention or PyTorch SDPA in standard form, both of which provide 2-4x speedups and significant memory savings during fine-tuning. Users must run eager attention, which is slower and uses more GPU memory. This practical constraint was a recurring topic in community discussions around fine-tuning Gemma 2, especially for the 27B model where memory pressure is already significant.
Human evaluation results in the technical report noted weaknesses on complex end-to-end tasks requiring sequential reasoning, tool use, or multi-step planning. The model passed fewer end-to-end task challenges than human baselines in certain evaluations, including multi-step system interaction tasks. This limitation was present across open models at the time and was not unique to Gemma 2, but it is worth noting as a constraint on agentic use cases.