Gemma 3 is a family of open-weight large language models developed by Google DeepMind and released on March 12, 2025. It is the third generation of the Gemma model series and the first to incorporate native multimodal capabilities alongside text. Gemma 3 comes in four parameter sizes (1B, 4B, 12B, and 27B) and introduces a significantly expanded 128K-token context window, support for over 140 languages, and a hybrid local-global attention architecture that makes large-context inference practical on consumer hardware. The 27B instruction-tuned variant achieved a Chatbot Arena Elo score of 1338 at launch, ranking ninth overall and outperforming much larger models including Llama 3.1 405B and DeepSeek-V3 on that leaderboard.
Google described Gemma 3 as "the most capable model you can run on a single GPU or TPU," referencing the 27B model's ability to run on a single NVIDIA H100, versus eight or more for comparably performing competitors at the time.
The Gemma model family began in February 2024 when Google DeepMind released Gemma 1, a pair of lightweight open-weight text models at 2B and 7B parameter scales. Gemma 1 drew on techniques developed for the Gemini model family and demonstrated that relatively small open models, trained carefully and distilled from larger systems, could achieve competitive performance on standard benchmarks. The release attracted significant developer interest and established a pattern Google would continue: open weights with a custom proprietary license rather than a fully open-source license.
Gemma 2 followed in June 2024 with 2B, 9B, and 27B variants. It introduced local-global interleaved attention as a memory-saving measure and improved benchmark scores across the board. However, Gemma 2 remained text-only with a maximum context of 8K tokens and limited multilingual coverage, leaving a visible gap between it and frontier models on vision and long-context tasks.
Gemma 3 addresses these gaps directly. It adds multimodal vision to three of its four variants, expands the context window to 128K tokens for those same variants, and roughly doubles the multilingual training data to cover more than 140 languages.
Gemma 3 is tightly coupled to Gemini at the architecture and training levels. The tokenizer matches Gemini 2.0's 262K-entry SentencePiece vocabulary, which provides better encoding efficiency for non-Latin scripts. The pre-training and post-training pipelines use knowledge distillation from Gemini models as a teacher signal. The technical report describes this as sampling 256 logits per token weighted by teacher probabilities, with the student Gemma model trained to match the teacher's output distribution via cross-entropy loss. This distillation approach is central to why relatively small Gemma 3 variants can match the performance of larger independently trained models.
Gemma 3 ships in four base sizes, each available in a pre-trained (PT) version and an instruction-tuned (IT) version. Three quantized variants are also available for each instruction-tuned model.
| Variant | Parameters | Vision | Context window | Languages |
|---|---|---|---|---|
| Gemma 3 1B | 1B (302M embedding + 698M non-embedding) | Text only | 32K tokens | English primarily |
| Gemma 3 4B | 4B (675M embedding + 3,209M non-embedding + 417M vision encoder) | Text and images | 128K tokens | 140+ |
| Gemma 3 12B | 12B (1,012M embedding + 10,759M non-embedding + 417M vision encoder) | Text and images | 128K tokens | 140+ |
| Gemma 3 27B | 27B (1,416M embedding + 25,600M non-embedding + 417M vision encoder) | Text and images | 128K tokens | 140+ |
The 1B model is optimized for on-device and edge deployment and runs at up to 2,585 tokens per second on prefill when using Google AI Edge's LLM inference runtime. Its smaller footprint (approximately 529 MB in quantized form) makes it suitable for embedding directly in mobile applications.
The 4B model is notable for punching well above its weight class. According to the technical report, Gemma 3 4B IT scores comparably to Gemma 2 27B IT on standard instruction-following benchmarks, representing a roughly 7x improvement in parameter efficiency across generations.
The 27B model is positioned as a single-GPU deployment target. In bfloat16 precision, the raw model weights occupy 54 GB, which fits within a single high-memory GPU such as an NVIDIA H100 (80 GB HBM). With int4 quantization (QAT), the weights compress to 14.1 GB, and the full model including a 32K token KV cache fits within 32.8 GB, making it accessible on prosumer hardware.
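The arithmetic behind these footprints is simple to sanity-check. A minimal sketch using the parameter counts from the variant table above (scale-factor overhead for the quantized formats is ignored, which is why the int4 estimate lands slightly below the official 14.1 GB):

```python
# Back-of-the-envelope weight memory for the 27B variant.
# Parameter count from the variant table: 1,416M embedding
# + 25,600M non-embedding + 417M vision encoder.
params = (1_416 + 25_600 + 417) * 1e6

bytes_per_param = {"bfloat16": 2.0, "SFP8": 1.0, "int4": 0.5}

for fmt, b in bytes_per_param.items():
    print(f"{fmt:>8}: {params * b / 1e9:5.1f} GB of weights")

# bfloat16: ~54.9 GB  (the ~54 GB quoted above)
#     SFP8: ~27.4 GB  (matches the quantization table below)
#     int4: ~13.7 GB  (the official 14.1 GB adds per-channel scale factors)
```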
A 270M parameter model was added to the Gemma 3 family in August 2025 for use cases requiring very low memory and compute budgets.
Gemma 3 is the first generation in the Gemma family to support image understanding. The 4B, 12B, and 27B models each incorporate a 417M-parameter SigLIP vision encoder that processes images at a fixed input resolution of 896 by 896 pixels. The encoder is frozen during language model training; only the language model weights and a lightweight multimodal projector, which converts visual features into 256 "soft tokens" readable by the language model, are updated.
The vision architecture follows the same general pattern as PaliGemma, Google's earlier vision-language model. Image tokens are processed with bidirectional (non-causal) attention, while text tokens use standard causal attention. This combination lets the model attend fully across all image patches before generating output tokens.
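The mask logic can be expressed compactly. A minimal sketch of the idea (illustrative only, not Gemma's actual implementation):

```python
import numpy as np

def mixed_attention_mask(is_image: np.ndarray) -> np.ndarray:
    """Boolean [seq, seq] mask where True means the query position may
    attend to the key position. Text is causal; image tokens also
    attend bidirectionally to every other image token."""
    n = len(is_image)
    causal = np.tril(np.ones((n, n), dtype=bool))            # standard causal attention
    image_pairs = np.logical_and.outer(is_image, is_image)   # query and key are both image tokens
    return causal | image_pairs

# Example: 4 image soft tokens followed by 3 text tokens.
mask = mixed_attention_mask(np.array([True] * 4 + [False] * 3))
print(mask.astype(int))
```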
Multimodal support unlocks a range of tasks including document understanding, chart and table reading, image captioning, visual question answering, and interleaved image-text reasoning. The 27B model achieved 85.6 on DocVQA, 68.6 on TextVQA, 76.3 on ChartQA, and 64.9 on MMMU in zero-shot evaluation.
The fixed 896x896 input resolution of the SigLIP encoder creates a problem for images with unusual aspect ratios or fine-grained detail spread across a large image. To address this, Gemma 3 introduces Pan and Scan (P&S), an adaptive windowing algorithm that runs at inference time.
Pan and Scan works by segmenting a non-square or high-resolution source image into overlapping crops, resizing each crop to 896x896, and encoding each crop independently through the SigLIP encoder. The resulting token sequences are concatenated before being passed to the language model. This gives the model the equivalent of "zooming in" on different parts of the image without changing the encoder architecture.
The performance benefit is measurable. On DocVQA, enabling Pan and Scan for the 27B model improves scores by 4.8 percentage points. On InfoVQA, which tests understanding of information-dense documents with complex layouts, the gain is 17.0 percentage points. The cost is additional compute proportional to the number of crops generated.
Gemma 2 supported a maximum context of 8K tokens. Gemma 3 extends this to 128K tokens for the 4B, 12B, and 27B models (the 1B model supports 32K). Google estimated this as roughly equivalent to 96,000 words, 198 pages of text, 500 images, or more than eight minutes of video at one frame per second.
Extending context length at this scale without making inference prohibitively expensive required architectural changes. The key mechanism is the hybrid local-global attention design described in the architecture section below. The practical result is that the KV cache memory overhead at 32K context drops to below 15% of total model memory, compared to around 60% for a standard global-attention-only model. This makes 128K context inference feasible on a single GPU with quantization.
Pre-training uses 32K sequence lengths, with 128K support added via a RoPE scaling phase at the end of pre-training. On the RULER long-context benchmark, the 27B pre-trained model achieves 72.9% accuracy at 128K context.
Gemma 2 had limited multilingual coverage. Gemma 3 expands coverage to over 140 languages by increasing the proportion of multilingual data in pre-training and adopting the Gemini 2.0 tokenizer, which handles non-Latin scripts more efficiently.
The Gemini 2.0 tokenizer's 262K-entry vocabulary (versus smaller vocabularies in many competing models) encodes Chinese, Japanese, Korean, Arabic, and other non-Latin scripts more compactly. This means equivalent multilingual content consumes fewer tokens, reducing memory pressure and inference cost for non-English languages.
On the MGSM (multilingual grade school math) benchmark, Gemma 3 27B IT scores 74.3, compared to 34.7 for the 4B model and 64.3 for the 12B model. On Global MMLU, the 27B model scores 75.7. On WMT24++ translation, the 27B model scores 55.7.
The 1B model is primarily English-focused and not recommended for multilingual tasks at the same reliability level as the larger variants.
Gemma 3's most important architectural change from Gemma 2 is the shift in the ratio of local to global attention layers. Gemma 2 used a 1:1 ratio, alternating between local sliding window attention and global full-sequence attention. Gemma 3 moves to a 5:1 ratio: five consecutive local attention layers followed by one global attention layer, repeating throughout the model depth.
Local attention layers use a sliding window of only 1,024 tokens. They efficiently handle short-range dependencies without requiring the model to store key-value pairs for the entire sequence. The single global layer in each group attends to the full 128K context window and handles long-range dependencies.
Because only one in six layers needs to maintain a full-context KV cache, the memory overhead scales much more favorably. At 32K context, this architecture reduces KV-cache memory to below 15% of model memory; a global-only model at the same context would use around 60%.
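The savings follow from simple accounting, sketched below. The 62-layer depth is an assumed stand-in for illustration; the local-to-global ratio is what matters:

```python
# KV-cache accounting for the 5-local : 1-global layer pattern.
n_layers = 62      # assumed depth for illustration
context = 32_000   # tokens of context
window = 1_024     # local sliding-window size

n_global = n_layers // 6                 # one global layer per group of six
n_local = n_layers - n_global

hybrid = n_global * context + n_local * min(window, context)
global_only = n_layers * context         # every layer caches the full context

print(f"hybrid cache holds {hybrid / global_only:.1%} of the global-only entries")
# ~19% with these stand-in numbers; per-entry size (heads x head_dim x
# 2 tensors x bytes) cancels out of the ratio.
```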
Soft-capping from Gemma 2 (which applied a tanh function to attention logits to stabilize training) is replaced with QK-norm (query-key normalization). QK-norm proved both more stable and faster at the sequence lengths needed for 128K context training.
Gemma 3 uses a dual RoPE (Rotary Position Embedding) strategy. Local attention layers retain the standard RoPE base frequency of 10K, appropriate for short windows. Global attention layers use a base frequency of 1M, a 100x increase from the 10K used in Gemma 2. This higher base frequency allows the positional encoding to generalize to much longer sequences without degradation.
Models are pre-trained at 32K sequence lengths and then extended to 128K by rescaling RoPE with a scaling factor of 8, following a well-established protocol for extending context post-pre-training.
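A short sketch shows what the two base frequencies buy; head_dim=128 is an assumed value for illustration:

```python
import numpy as np

def rope_frequencies(base: float, head_dim: int = 128) -> np.ndarray:
    """Per-pair rotation frequencies used by RoPE."""
    return base ** (-np.arange(0, head_dim, 2) / head_dim)

local_freqs = rope_frequencies(10_000)       # local layers: standard base
global_freqs = rope_frequencies(1_000_000)   # global layers: 100x larger base

# The slowest-rotating component sets the longest range over which
# positions stay distinguishable before the encoding wraps around.
for name, f in [("local", local_freqs), ("global", global_freqs)]:
    print(f"{name}: longest wavelength ~ {2 * np.pi / f[-1]:,.0f} positions")

# The 32K -> 128K extension additionally rescales positions on the
# global layers by the factor of 8 described above.
```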
The vocabulary grew from the smaller tokenizers used in Gemma 1 and Gemma 2 to 262K entries, matching Gemini 2.0. The tokenizer uses SentencePiece with split digits, preserved whitespace, and byte-level encodings. The larger vocabulary directly improves efficiency for multilingual text and reduces the token count needed for math and code where precise digit handling matters.
Knowledge distillation is central to Gemma 3's training methodology at both the pre-training and post-training stages.
During pre-training, Gemma 3 models are trained on tokens generated or supervised by Gemini family teacher models. The specific mechanism described in the technical report involves sampling 256 logits per token from the teacher model, weighted by their probabilities, and training the student to match this compressed distribution. This "top-K logit distillation" approach transfers the teacher's uncertainty and calibration to the student, not just its most likely predictions.
During post-training, instruction-tuned versions of Gemma 3 are trained using a similar distillation process from a large instruction-tuned Gemini teacher. This is supplemented by reinforcement learning from human feedback via three techniques: BOND (Best-of-N Distillation), WARM (Weight Averaged Reward Models), and WARP (Weight Averaged Rewarded Policies). Reward functions target helpfulness, mathematics, coding, reasoning, instruction-following, and multilingual performance, with code execution feedback used as a ground-truth signal for coding tasks.
The result of this training stack is that Gemma 3 models perform well above what their parameter counts alone would predict. The 4B model matches Gemma 2 27B on many benchmarks; the 27B model matches Gemini 1.5 Pro on several.
Gemma 3 ships with official quantized weights for all instruction-tuned variants, produced through Quantization-Aware Training (QAT). Unlike post-training quantization, which applies precision reduction after training is complete, QAT incorporates the quantization into the training process itself, allowing the model to adapt its weights to minimize the accuracy loss from reduced precision.
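The core QAT mechanic is "fake quantization": round weights to the target precision in the forward pass while letting gradients flow as if no rounding occurred. A generic sketch of that building block, not Google's specific recipe:

```python
import torch

class FakeQuantInt4(torch.autograd.Function):
    """Quantize to int4 in the forward pass; pass gradients straight
    through in the backward pass (straight-through estimator)."""
    @staticmethod
    def forward(ctx, w, scale):
        q = torch.clamp(torch.round(w / scale), -8, 7)  # int4 value range
        return q * scale                                # dequantized weights used downstream

    @staticmethod
    def backward(ctx, grad_out):
        return grad_out, None  # ignore the rounding step in the gradient

# Each training step sees the quantized weights, so the model learns
# parameters that survive int4 rounding with minimal accuracy loss.
w = torch.randn(4, 4, requires_grad=True)
scale = w.detach().abs().max() / 7
w_q = FakeQuantInt4.apply(w, scale)
```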
Three weight formats are provided:
| Format | Description |
|---|---|
| Per-channel int4 | One scale factor per output channel; smallest footprint among the int4 options |
| Per-block int4 (blocks=32) | Groups of 32 weights share a scale factor; finer-grained scaling typically preserves slightly more accuracy at a slightly larger size |
| Switched FP8 (SFP8) | 8-bit floating point; intermediate between int4 and bfloat16 in size and accuracy |
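The difference between the two int4 layouts is just the granularity of the scale factors, sketched below under symmetric quantization (the shipped formats may differ in detail):

```python
import torch

def quantize_int4(w: torch.Tensor, block: int | None = None):
    """One scale per output channel (block=None) vs one scale per
    group of `block` weights along the input dimension."""
    if block is None:                                    # per-channel
        scale = w.abs().amax(dim=1, keepdim=True) / 7    # [out, 1]
        grouped = w
    else:                                                # per-block
        out_ch, in_ch = w.shape
        grouped = w.reshape(out_ch, in_ch // block, block)
        scale = grouped.abs().amax(dim=-1, keepdim=True) / 7
    q = torch.clamp(torch.round(grouped / scale), -8, 7).to(torch.int8)
    return q, scale

w = torch.randn(64, 128)
_, s_channel = quantize_int4(w)            # 64 scale factors
_, s_block = quantize_int4(w, block=32)    # 64 * 4 = 256 scale factors
# More scales fit the weight distribution more tightly but add storage,
# which is why blocks=32 is slightly larger on disk (15.3 GB vs 14.1 GB).
```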
For the 27B model, these translate to the following memory footprints:
| Precision | Weights only | Weights + 32K KV cache |
|---|---|---|
| bfloat16 | 54.0 GB | 72.7 GB |
| int4 (per-channel) | 14.1 GB | 32.8 GB |
| int4 (blocks=32) | 15.3 GB | 34.0 GB |
| SFP8 | 27.4 GB | 46.1 GB |
The int4 versions enable running the 27B model on consumer GPUs like the NVIDIA RTX 3090 (24 GB) at reduced context lengths, or on a single H100 at full 32K context.
Gemma 3 is released under the Gemma Terms of Service, a custom proprietary license maintained by Google. It is not an open-source license under the Open Source Initiative definition.
The license permits free use, modification, and distribution for most purposes, including commercial applications. Users who redistribute Gemma or derivatives must include a copy of the terms and must incorporate the use restrictions from Section 3.2 (which reference Google's Gemma Prohibited Use Policy) in any downstream distribution agreements.
The Prohibited Use Policy bars uses including generating content for weapons of mass destruction, creating malware, generating child sexual abuse material, and other clearly harmful applications. These terms are broadly consistent with similar "responsible use" provisions in Meta's Llama licenses and Microsoft's Phi licenses.
Gemma 4, released in 2025, moved to the Apache 2.0 license. Gemma 3 retains the custom terms.
Text and reasoning benchmarks (instruction-tuned models, with Gemini 2.0 Pro for comparison):

| Benchmark | Gemma 3 4B | Gemma 3 12B | Gemma 3 27B | Gemini 2.0 Pro |
|---|---|---|---|---|
| MMLU-Pro | 43.6 | 60.6 | 67.5 | 79.1 |
| MATH | 75.6 | 83.8 | 89.0 | 91.8 |
| GPQA Diamond | 30.8 | 40.9 | 42.4 | 64.7 |
| LiveCodeBench | 12.6 | 24.6 | 29.7 | 36.0 |
| Bird-SQL | 36.3 | 47.9 | 54.4 | 59.3 |
| SimpleQA | 4.0 | 6.3 | 10.0 | 44.3 |
| Global MMLU | 57.0 | 69.4 | 75.7 | -- |
Multilingual benchmarks (instruction-tuned models):

| Benchmark | Gemma 3 4B | Gemma 3 12B | Gemma 3 27B |
|---|---|---|---|
| MGSM | 34.7 | 64.3 | 74.3 |
| Global-MMLU-Lite | 54.5 | 69.5 | 75.1 |
| WMT24++ | 48.4 | 53.9 | 55.7 |
| Flores | 39.2 | 46.0 | 48.8 |
Multimodal benchmarks:

| Benchmark | Gemma 3 4B | Gemma 3 12B | Gemma 3 27B |
|---|---|---|---|
| DocVQA | 72.8 | 82.3 | 85.6 |
| TextVQA | 58.9 | 66.5 | 68.6 |
| ChartQA | 63.6 | 74.7 | 76.3 |
| VQAv2 | 63.9 | 71.2 | 72.9 |
| MMMU (val) | 39.2 | 50.3 | 56.1 |
Gemma 3 27B IT achieved a Chatbot Arena Elo score of 1338 at launch, placing ninth overall. This put it ahead of DeepSeek-V3 (1318), Llama 3.1 405B (1269), and Qwen 2.5 72B (1257), and roughly on par with o1-preview (1335).
Gemma 3 launched into a competitive field of open-weight models. The comparisons most frequently cited in the technical report and community evaluations involve Llama 3.3 70B (Meta), Phi-4 14B (Microsoft), and Qwen 2.5 72B (Alibaba).
| Model | Developer | Parameters | Context | Multimodal | MMLU-Pro | MATH | Arena Elo |
|---|---|---|---|---|---|---|---|
| Gemma 3 27B | Google DeepMind | 27B | 128K | Yes | 67.5 | 89.0 | 1338 |
| Gemma 3 12B | Google DeepMind | 12B | 128K | Yes | 60.6 | 83.8 | -- |
| Llama 3.3 70B | Meta | 70B | 128K | No (text only) | ~65 | ~77 | ~1280 |
| Phi-4 14B | Microsoft | 14B | 16K | No (text only) | ~70 | ~80 | ~1140 |
| Qwen 2.5 72B | Alibaba | 72B | 128K | No (text only) | ~65 | 83.1 | 1257 |
Several differences stand out. Gemma 3 is the only model in this group to include multimodal vision at the time of its March 2025 release. At 27B parameters, it achieves a higher Chatbot Arena Elo than Llama 3.3 70B (which has 2.6x more parameters) and Qwen 2.5 72B (which has 2.7x more parameters). Phi-4 at 14B scores competitively on math and reasoning benchmarks but lacks multimodal capability and has a significantly shorter context window of 16K tokens.
Qwen 2.5 72B holds an advantage on CJK (Chinese, Japanese, Korean) language tasks, reflecting training data weighted more heavily toward those languages. Llama 3.3 70B is generally faster at inference due to Meta's optimization of its architecture for throughput-oriented serving. Gemma 3 4B is notable as perhaps the most capable 4B-class multimodal model available at its release date, with a 128K context window that no competing 4B model matched.
In May 2025, Google DeepMind released a preview of Gemma 3n, a separate model family derived from Gemma 3 but designed specifically for on-device deployment on mobile phones and edge hardware. The full release followed in late June 2025.
Gemma 3n introduces several architectural innovations not present in the main Gemma 3 models:
MatFormer (Matryoshka Transformer): Gemma 3n uses a nested architecture where a larger model (nominally 4B parameters, referred to as E4B) contains a smaller functional submodel (nominally 2B parameters, E2B) embedded within it. At inference time, developers can choose to use either the full model or the nested submodel without reloading weights. This enables dynamic performance-quality tradeoff without requiring separate model files.
Per-Layer Embeddings (PLE): Embedding parameters are distributed across layers rather than stored in a single shared table. According to Google DeepMind, this reduces RAM requirements substantially: Gemma 3n E4B achieves performance comparable to a 4B model while requiring only 3 GB of active memory. The E2B submodel runs in approximately 2 GB.
Expanded multimodality: In addition to text and images, Gemma 3n includes audio understanding through a dedicated audio encoder, supporting automatic speech recognition, speech translation, and interleaved audio-text inputs. The vision encoder is a MobileNet-based architecture optimized for mobile inference rather than the SigLIP encoder used in Gemma 3.
Gemma 3n is roughly 1.5x faster on mobile hardware than Gemma 3 4B, according to Google's benchmarks. It integrates with next-generation Gemini Nano features on Android.
The combination of multimodal input, 128K context, and competitive performance across a range of sizes has made Gemma 3 models widely used for several application categories.
Retrieval-augmented generation (RAG): The 128K context window allows Gemma 3 models to process large retrieved document sets without truncation. The multimodal variants can handle mixed text-and-image document collections, making them suitable for enterprise document RAG pipelines where scanned PDFs, charts, and tables are common.
Coding assistance: Gemma 3 instruction-tuned models perform well on code generation and debugging tasks. The Bird-SQL benchmark, which tests text-to-SQL conversion, shows the 27B model at 54.4, and LiveCodeBench shows competitive scores. The models support function calling and structured JSON output, which are required for tool-using coding agents.
Multilingual applications: Support for 140 languages with a single model simplifies deployment for global applications. The Gemma 3 tokenizer's efficient encoding of non-Latin scripts reduces inference cost compared to models with smaller vocabularies.
On-device deployment: The 1B model and QAT quantized versions of larger models run on consumer hardware. The 1B model fits in mobile application bundles; the int4 quantized 4B model runs on laptops with 8 GB of VRAM.
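As a concrete illustration, a sketch loading the text-only 1B instruction-tuned checkpoint in 4 bits via bitsandbytes (generic post-training quantization, shown as a stand-in for the official QAT releases; the repository is gated behind license acceptance):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "google/gemma-3-1b-it"  # text-only variant, loadable as a causal LM
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)

tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb, device_map="auto"
)

prompt = tok.apply_chat_template(
    [{"role": "user", "content": "In one sentence, what is a KV cache?"}],
    add_generation_prompt=True, return_tensors="pt",
).to(model.device)
out = model.generate(prompt, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))
```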
Medical and scientific applications: Google released MedGemma, a Gemma 3 4B variant fine-tuned on medical data, which showed state-of-the-art performance among 4B-class models on medical multimodal question answering and chest X-ray classification tasks.
Document and visual understanding: Pan and Scan's ability to process high-resolution documents makes Gemma 3 multimodal models practical for document intelligence workflows involving scanned pages, infographics, and mixed-layout content.
Gemma 3's March 2025 release generated significant interest in the developer and research communities. Downloads of Gemma models across all generations exceeded 150 million by May 2025. The broader "Gemmaverse" of community fine-tunes and derivative models exceeded 60,000 variants on Hugging Face within two months of the Gemma 3 release.
The Chatbot Arena performance of the 27B model drew particular attention. Ranking ninth overall while being runnable on a single GPU put it in a category no prior open model had occupied: near-frontier quality at single-GPU scale. Developers noted that comparisons to Llama 3.1 405B (which requires multiple high-memory GPUs) and DeepSeek-V3 (a mixture-of-experts model requiring significant infrastructure) were meaningful for practical deployment.
Google launched a Gemma 3 Academic Program alongside the model, offering $10,000 in Google Cloud credits per award to research groups building on Gemma 3. More than 600 projects were submitted to the Gemma 3n Impact Challenge on Kaggle by mid-2025.
Some developers raised concerns about the Gemma Terms of Service license, noting that the propagation requirements (requiring downstream distributors to pass along the terms) create compliance overhead in commercial redistribution scenarios. The license is permissive enough for most commercial use but demands more legal review than Apache 2.0 or MIT. Google's decision to adopt Apache 2.0 for Gemma 4 was widely seen as a response to this feedback.
Researchers also noted the SimpleQA weakness (10.0 for the 27B model versus 44.3 for Gemini 2.0 Pro), which indicates that despite strong performance on structured reasoning tasks, Gemma 3 can be unreliable on basic factual recall questions. This limitation was acknowledged in the technical report.
Gemma 3 has several documented limitations.
SimpleQA and factual recall: The 27B model scores 10.0 on SimpleQA, far below Gemini 2.0 Pro (44.3). This suggests the model is prone to hallucinating on basic factual queries even when strong performance on other benchmarks might suggest otherwise. The gap likely reflects the distillation training approach prioritizing reasoning and instruction-following over memorization of factual content.
Vision resolution ceiling: The SigLIP encoder has a fixed input resolution of 896x896 pixels. Pan and Scan partially addresses this for high-resolution documents, but images requiring continuous fine-grained detail across the entire frame (such as satellite imagery or medical scans with small lesions) may be processed with less accuracy than purpose-built vision models.
Long-context reasoning depth: While Gemma 3 supports 128K tokens and the RULER benchmark shows reasonable retrieval accuracy at that length, the depth of actual multi-hop reasoning over very long contexts is less well-characterized. Many benchmarks measure whether the model can find relevant information in a long context, not whether it can reason across multiple pieces of information spread across 100K+ tokens.
1B model limitations: The 1B variant is English-focused and does not include vision support. It is not recommended for multilingual tasks at comparable reliability to the larger models.
Symbolic hallucination: Research published after the Gemma 3 release found that symbolic hallucinations (errors involving structured reasoning with symbols, counts, or formal relationships) persist even in the 27B model, consistent with patterns observed in other large language models. Question-answering format with minimal constraints tends to elicit higher hallucination rates than constrained generation tasks.
License restrictions: Unlike Apache 2.0 or MIT, the Gemma Terms of Service require downstream distributors to pass along Google's use restrictions. This creates compliance overhead for some commercial redistribution scenarios.