Gemma 3 is a family of open-weight large language models developed by Google DeepMind and released on March 12, 2025. It is the third generation of the Gemma model series and the first to incorporate native multimodal capabilities alongside text. Gemma 3 comes in four parameter sizes (1B, 4B, 12B, and 27B) and introduces a significantly expanded 128K-token context window, support for over 140 languages, and a hybrid local-global attention architecture that makes large-context inference practical on consumer hardware. The 27B instruction-tuned variant achieved a Chatbot Arena Elo score of 1338 at launch, ranking ninth overall and outperforming much larger models including Llama 3.1 405B and DeepSeek-V3 on that leaderboard.
Google described Gemma 3 as "the most capable model you can run on a single GPU or TPU," referencing the 27B model's ability to run on a single NVIDIA H100, versus eight or more for comparably performing competitors at the time.
The Gemma model family began in February 2024 when Google DeepMind released Gemma 1, a pair of lightweight open-weight text models at 2B and 7B parameter scales. Gemma 1 drew on techniques developed for the Gemini model family and demonstrated that relatively small open models, trained carefully and distilled from larger systems, could achieve competitive performance on standard benchmarks. The release attracted significant developer interest and established a pattern Google would continue: open weights with a custom proprietary license rather than a fully open-source license.
Gemma 2 followed in June 2024 with 2B, 9B, and 27B variants. It introduced local-global interleaved attention as a memory-saving measure and improved benchmark scores across the board. However, Gemma 2 remained text-only with a maximum context of 8K tokens and limited multilingual coverage, leaving a visible gap between it and frontier models on vision and long-context tasks.
Gemma 3 addresses these gaps directly. It adds multimodal vision to three of its four variants, expands the context window to 128K tokens for those same variants, and roughly doubles the multilingual training data to cover more than 140 languages.
Gemma 3 is tightly coupled to Gemini at the architecture and training levels. The tokenizer matches Gemini 2.0's 262K-entry SentencePiece vocabulary, which provides better encoding efficiency for non-Latin scripts. The pre-training and post-training pipelines use knowledge distillation from Gemini models as a teacher signal. The technical report describes this as sampling 256 logits per token weighted by teacher probabilities, with the student Gemma model trained to match the teacher's output distribution via cross-entropy loss. This distillation approach is central to why relatively small Gemma 3 variants can match the performance of larger independently trained models.
Gemma 3 ships in four base sizes, each available in a pre-trained (PT) version and an instruction-tuned (IT) version. Three quantized variants are also available for each instruction-tuned model.
| Variant | Parameters | Vision | Context window | Languages |
|---|---|---|---|---|
| Gemma 3 1B | 1B (302M embedding + 698M non-embedding) | Text only | 32K tokens | English primarily |
| Gemma 3 4B | 4B (675M embedding + 3,209M non-embedding + 417M vision encoder) | Text and images | 128K tokens | 140+ |
| Gemma 3 12B | 12B (1,012M embedding + 10,759M non-embedding + 417M vision encoder) | Text and images | 128K tokens | 140+ |
| Gemma 3 27B | 27B (1,416M embedding + 25,600M non-embedding + 417M vision encoder) | Text and images | 128K tokens | 140+ |
The 1B model is optimized for on-device and edge deployment and runs at up to 2,585 tokens per second on prefill when using Google AI Edge's LLM inference runtime. Its smaller footprint (approximately 529 MB in quantized form) makes it suitable for embedding directly in mobile applications.
The 4B model is notable for punching well above its weight class. According to the technical report, Gemma 3 4B IT scores comparably to Gemma 2 27B IT on standard instruction-following benchmarks, representing a roughly 7x improvement in parameter efficiency across generations.
The 27B model is positioned as a single-GPU deployment target. In bfloat16 precision, the raw model weights occupy 54 GB, which fits within a single high-memory GPU such as an NVIDIA H100 (80 GB HBM). With int4 quantization (QAT), the weights compress to 14.1 GB, and the full model including a 32K token KV cache fits within 32.8 GB, making it accessible on prosumer hardware.
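The arithmetic behind these footprints is simple to sanity-check. A minimal sketch using the parameter counts from the variant table above (scale-factor overhead for the quantized formats is ignored, which is why the int4 estimate lands slightly below the official 14.1 GB):

```python
# Back-of-the-envelope weight memory for the 27B variant.
# Parameter count from the variant table: 1,416M embedding
# + 25,600M non-embedding + 417M vision encoder.
params = (1_416 + 25_600 + 417) * 1e6

bytes_per_param = {"bfloat16": 2.0, "SFP8": 1.0, "int4": 0.5}

for fmt, b in bytes_per_param.items():
    print(f"{fmt:>8}: {params * b / 1e9:5.1f} GB of weights")

# bfloat16: ~54.9 GB  (the ~54 GB quoted above)
#     SFP8: ~27.4 GB  (matches the quantization table below)
#     int4: ~13.7 GB  (the official 14.1 GB adds per-channel scale factors)
```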
A 270M parameter model was added to the Gemma 3 family in August 2025 for use cases requiring very low memory and compute budgets.
Gemma 3 is the first generation in the Gemma family to support image understanding. The 4B, 12B, and 27B models each incorporate a 417M-parameter SigLIP vision encoder that processes images at a fixed input resolution of 896 by 896 pixels. The encoder is frozen during language model training; only the language model weights and a lightweight multimodal projector, which converts visual features into 256 "soft tokens" readable by the language model, are updated.
The vision architecture follows the same general pattern as PaliGemma, Google's earlier vision-language model. Image tokens are processed with bidirectional (non-causal) attention, while text tokens use standard causal attention. This combination lets the model attend fully across all image patches before generating output tokens.
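The mask logic can be expressed compactly. A minimal sketch of the idea (illustrative only, not Gemma's actual implementation):

```python
import numpy as np

def mixed_attention_mask(is_image: np.ndarray) -> np.ndarray:
    """Boolean [seq, seq] mask where True means the query position may
    attend to the key position. Text is causal; image tokens also
    attend bidirectionally to every other image token."""
    n = len(is_image)
    causal = np.tril(np.ones((n, n), dtype=bool))            # standard causal attention
    image_pairs = np.logical_and.outer(is_image, is_image)   # query and key are both image tokens
    return causal | image_pairs

# Example: 4 image soft tokens followed by 3 text tokens.
mask = mixed_attention_mask(np.array([True] * 4 + [False] * 3))
print(mask.astype(int))
```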
Multimodal support unlocks a range of tasks including document understanding, chart and table reading, image captioning, visual question answering, and interleaved image-text reasoning. The 27B model achieved 85.6 on DocVQA, 68.6 on TextVQA, 76.3 on ChartQA, and 64.9 on MMMU in zero-shot evaluation.
The fixed 896x896 input resolution of the SigLIP encoder creates a problem for images with unusual aspect ratios or fine-grained detail spread across a large image. To address this, Gemma 3 introduces Pan and Scan (P&S), an adaptive windowing algorithm that runs at inference time.
Pan and Scan works by segmenting a non-square or high-resolution source image into overlapping crops, resizing each crop to 896x896, and encoding each crop independently through the SigLIP encoder. The resulting token sequences are concatenated before being passed to the language model. This gives the model the equivalent of "zooming in" on different parts of the image without changing the encoder architecture.
The performance benefit is measurable. On DocVQA, enabling Pan and Scan for the 27B model improves scores by 4.8 percentage points. On InfoVQA, which tests understanding of information-dense documents with complex layouts, the gain is 17.0 percentage points. The cost is additional compute proportional to the number of crops generated.
Gemma 2 supported a maximum context of 8K tokens. Gemma 3 extends this to 128K tokens for the 4B, 12B, and 27B models (the 1B model supports 32K). Google estimated this as roughly equivalent to 96,000 words, 198 pages of text, 500 images, or more than eight minutes of video at one frame per second.
Extending context length at this scale without making inference prohibitively expensive required architectural changes. The key mechanism is the hybrid local-global attention design described in the architecture section below. The practical result is that the KV cache memory overhead at 32K context drops to below 15% of total model memory, compared to around 60% for a standard global-attention-only model. This makes 128K context inference feasible on a single GPU with quantization.
Pre-training uses 32K sequence lengths, with 128K support added via a RoPE scaling phase at the end of pre-training. On the RULER long-context benchmark, the 27B pre-trained model achieves 72.9% accuracy at 128K context.
Gemma 2 had limited multilingual coverage. Gemma 3 expands coverage to over 140 languages by increasing the proportion of multilingual data in pre-training and adopting the Gemini 2.0 tokenizer, which handles non-Latin scripts more efficiently.
The Gemini 2.0 tokenizer's 262K-entry vocabulary (versus smaller vocabularies in many competing models) encodes Chinese, Japanese, Korean, Arabic, and other non-Latin scripts more compactly. This means equivalent multilingual content consumes fewer tokens, reducing memory pressure and inference cost for non-English languages.
On the MGSM (multilingual grade school math) benchmark, Gemma 3 27B IT scores 74.3, compared to 34.7 for the 4B model and 64.3 for the 12B model. On Global MMLU, the 27B model scores 75.7. On WMT24++ translation, the 27B model scores 55.7.
The 1B model is primarily English-focused and not recommended for multilingual tasks at the same reliability level as the larger variants.
Gemma 3's most important architectural change from Gemma 2 is the shift in the ratio of local to global attention layers. Gemma 2 used a 1:1 ratio, alternating between local sliding window attention and global full-sequence attention. Gemma 3 moves to a 5:1 ratio: five consecutive local attention layers followed by one global attention layer, repeating throughout the model depth.
Local attention layers use a sliding window of only 1,024 tokens. They efficiently handle short-range dependencies without requiring the model to store key-value pairs for the entire sequence. The single global layer in each group attends to the full 128K context window and handles long-range dependencies.
Because only one in six layers needs to maintain a full-context KV cache, the memory overhead scales much more favorably. At 32K context, this architecture reduces KV-cache memory to below 15% of model memory; a global-only model at the same context would use around 60%.
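The savings follow from simple accounting, sketched below. The 62-layer depth is an assumed stand-in for illustration; the local-to-global ratio is what matters:

```python
# KV-cache accounting for the 5-local : 1-global layer pattern.
n_layers = 62      # assumed depth for illustration
context = 32_000   # tokens of context
window = 1_024     # local sliding-window size

n_global = n_layers // 6                 # one global layer per group of six
n_local = n_layers - n_global

hybrid = n_global * context + n_local * min(window, context)
global_only = n_layers * context         # every layer caches the full context

print(f"hybrid cache holds {hybrid / global_only:.1%} of the global-only entries")
# ~19% with these stand-in numbers; per-entry size (heads x head_dim x
# 2 tensors x bytes) cancels out of the ratio.
```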
Soft-capping from Gemma 2 (which applied a tanh function to attention logits to stabilize training) is replaced with QK-norm (query-key normalization). QK-norm proved both more stable and faster at the sequence lengths needed for 128K context training.
Gemma 3 uses a dual RoPE (Rotary Position Embedding) strategy. Local attention layers retain the standard RoPE base frequency of 10K, appropriate for short windows. Global attention layers use a base frequency of 1M, a 100x increase from the 10K used in Gemma 2. This higher base frequency allows the positional encoding to generalize to much longer sequences without degradation.
Models are pre-trained at 32K sequence lengths and then extended to 128K by rescaling RoPE with a scaling factor of 8, following a well-established protocol for extending context post-pre-training.
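A short sketch shows what the two base frequencies buy; head_dim=128 is an assumed value for illustration:

```python
import numpy as np

def rope_frequencies(base: float, head_dim: int = 128) -> np.ndarray:
    """Per-pair rotation frequencies used by RoPE."""
    return base ** (-np.arange(0, head_dim, 2) / head_dim)

local_freqs = rope_frequencies(10_000)       # local layers: standard base
global_freqs = rope_frequencies(1_000_000)   # global layers: 100x larger base

# The slowest-rotating component sets the longest range over which
# positions stay distinguishable before the encoding wraps around.
for name, f in [("local", local_freqs), ("global", global_freqs)]:
    print(f"{name}: longest wavelength ~ {2 * np.pi / f[-1]:,.0f} positions")

# The 32K -> 128K extension additionally rescales positions on the
# global layers by the factor of 8 described above.
```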
The vocabulary grew from the smaller tokenizers used in Gemma 1 and Gemma 2 to 262K entries, matching Gemini 2.0. The tokenizer uses SentencePiece with split digits, preserved whitespace, and byte-level encodings. The larger vocabulary directly improves efficiency for multilingual text and reduces the token count needed for math and code where precise digit handling matters.
Knowledge distillation is central to Gemma 3's training methodology at both the pre-training and post-training stages.
During pre-training, Gemma 3 models are trained on tokens generated or supervised by Gemini family teacher models. The specific mechanism described in the technical report involves sampling 256 logits per token from the teacher model, weighted by their probabilities, and training the student to match this compressed distribution. This "top-K logit distillation" approach transfers the teacher's uncertainty and calibration to the student, not just its most likely predictions.
During post-training, instruction-tuned versions of Gemma 3 are trained using a similar distillation process from a large instruction-tuned Gemini teacher. This is supplemented by reinforcement learning from human feedback via three techniques: BOND (Best-of-N Distillation), WARM (Weight Averaged Reward Models), and WARP (Weight Averaged Rewarded Policies). Reward functions target helpfulness, mathematics, coding, reasoning, instruction-following, and multilingual performance, with code execution feedback used as a ground-truth signal for coding tasks.
The result of this training stack is that Gemma 3 models perform well above what their parameter counts alone would predict. The 4B model matches Gemma 2 27B on many benchmarks; the 27B model matches Gemini 1.5 Pro on several.
Gemma 3 ships with official quantized weights for all instruction-tuned variants, produced through Quantization-Aware Training (QAT). Unlike post-training quantization, which applies precision reduction after training is complete, QAT incorporates the quantization into the training process itself, allowing the model to adapt its weights to minimize the accuracy loss from reduced precision.
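The core QAT mechanic is "fake quantization": round weights to the target precision in the forward pass while letting gradients flow as if no rounding occurred. A generic sketch of that building block, not Google's specific recipe:

```python
import torch

class FakeQuantInt4(torch.autograd.Function):
    """Quantize to int4 in the forward pass; pass gradients straight
    through in the backward pass (straight-through estimator)."""
    @staticmethod
    def forward(ctx, w, scale):
        q = torch.clamp(torch.round(w / scale), -8, 7)  # int4 value range
        return q * scale                                # dequantized weights used downstream

    @staticmethod
    def backward(ctx, grad_out):
        return grad_out, None  # ignore the rounding step in the gradient

# Each training step sees the quantized weights, so the model learns
# parameters that survive int4 rounding with minimal accuracy loss.
w = torch.randn(4, 4, requires_grad=True)
scale = w.detach().abs().max() / 7
w_q = FakeQuantInt4.apply(w, scale)
```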
Three weight formats are provided:
| Format | Description |
|---|---|
| Per-channel int4 | One scale factor per output channel; smallest footprint among the int4 options |
| Per-block int4 (blocks=32) | Groups of 32 weights share a scale factor; finer-grained scaling typically preserves slightly more accuracy at a slightly larger size |
| Switched FP8 (SFP8) | 8-bit floating point; intermediate between int4 and bfloat16 in size and accuracy |
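The difference between the two int4 layouts is just the granularity of the scale factors, sketched below under symmetric quantization (the shipped formats may differ in detail):

```python
import torch

def quantize_int4(w: torch.Tensor, block: int | None = None):
    """One scale per output channel (block=None) vs one scale per
    group of `block` weights along the input dimension."""
    if block is None:                                    # per-channel
        scale = w.abs().amax(dim=1, keepdim=True) / 7    # [out, 1]
        grouped = w
    else:                                                # per-block
        out_ch, in_ch = w.shape
        grouped = w.reshape(out_ch, in_ch // block, block)
        scale = grouped.abs().amax(dim=-1, keepdim=True) / 7
    q = torch.clamp(torch.round(grouped / scale), -8, 7).to(torch.int8)
    return q, scale

w = torch.randn(64, 128)
_, s_channel = quantize_int4(w)            # 64 scale factors
_, s_block = quantize_int4(w, block=32)    # 64 * 4 = 256 scale factors
# More scales fit the weight distribution more tightly but add storage,
# which is why blocks=32 is slightly larger on disk (15.3 GB vs 14.1 GB).
```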
For the 27B model, these translate to the following memory footprints:
| Precision | Weights only | Weights + 32K KV cache |
|---|---|---|
| bfloat16 | 54.0 GB | 72.7 GB |
| int4 (per-channel) | 14.1 GB | 32.8 GB |
| int4 (blocks=32) | 15.3 GB | 34.0 GB |
| SFP8 | 27.4 GB | 46.1 GB |
The int4 versions enable running the 27B model on consumer GPUs like the NVIDIA RTX 3090 (24 GB) at reduced context lengths, or on a single H100 at full 32K context.
Gemma 3 is released under the Gemma Terms of Service, a custom proprietary license maintained by Google. It is not an open-source license under the Open Source Initiative definition.
The license permits free use, modification, and distribution for most purposes, including commercial applications. Users who redistribute Gemma or derivatives must include a copy of the terms and must incorporate the use restrictions from Section 3.2 (which reference Google's Gemma Prohibited Use Policy) in any downstream distribution agreements.
The Prohibited Use Policy bars uses including generating content for weapons of mass destruction, creating malware, generating child sexual abuse material, and other clearly harmful applications. These terms are broadly consistent with similar "responsible use" provisions in Meta's Llama licenses and Microsoft's Phi licenses.
Gemma 4, released in 2025, moved to the Apache 2.0 license. Gemma 3 retains the custom terms.
Text and reasoning benchmarks (instruction-tuned models, with Gemini 2.0 Pro for comparison):

| Benchmark | Gemma 3 4B | Gemma 3 12B | Gemma 3 27B | Gemini 2.0 Pro |
|---|---|---|---|---|
| MMLU-Pro | 43.6 | 60.6 | 67.5 | 79.1 |
| MATH | 75.6 | 83.8 | 89.0 | 91.8 |
| GPQA Diamond | 30.8 | 40.9 | 42.4 | 64.7 |
| LiveCodeBench | 12.6 | 24.6 | 29.7 | 36.0 |
| Bird-SQL | 36.3 | 47.9 | 54.4 | 59.3 |
| SimpleQA | 4.0 | 6.3 | 10.0 | 44.3 |
| Global MMLU | 57.0 | 69.4 | 75.7 | -- |
Multilingual benchmarks (instruction-tuned models):

| Benchmark | Gemma 3 4B | Gemma 3 12B | Gemma 3 27B |
|---|---|---|---|
| MGSM | 34.7 | 64.3 | 74.3 |
| Global-MMLU-Lite | 54.5 | 69.5 | 75.1 |
| WMT24++ | 48.4 | 53.9 | 55.7 |
| Flores | 39.2 | 46.0 | 48.8 |
Multimodal benchmarks:

| Benchmark | Gemma 3 4B | Gemma 3 12B | Gemma 3 27B |
|---|---|---|---|
| DocVQA | 72.8 | 82.3 | 85.6 |
| TextVQA | 58.9 | 66.5 | 68.6 |
| ChartQA | 63.6 | 74.7 | 76.3 |
| VQAv2 | 63.9 | 71.2 | 72.9 |
| MMMU (val) | 39.2 | 50.3 | 56.1 |
Gemma 3 27B IT achieved a Chatbot Arena Elo score of 1338 at launch, placing ninth overall. This put it ahead of DeepSeek-V3 (1318), Llama 3.1 405B (1269), and Qwen 2.5 72B (1257), and roughly on par with o1-preview (1335).
Gemma 3 launched into a competitive field of open-weight models. The comparisons most frequently cited in the technical report and community evaluations involve Llama 3.3 70B (Meta), Phi-4 14B (Microsoft), and Qwen 2.5 72B (Alibaba).
| Model | Developer | Parameters | Context | Multimodal | MMLU-Pro | MATH | Arena Elo |
|---|---|---|---|---|---|---|---|
| Gemma 3 27B | Google DeepMind | 27B | 128K | Yes | 67.5 | 89.0 | 1338 |
| Gemma 3 12B | Google DeepMind | 12B | 128K | Yes | 60.6 | 83.8 | -- |
| Llama 3.3 70B | Meta | 70B | 128K | No (text only) | ~65 | ~77 | ~1280 |
| Phi-4 14B | Microsoft | 14B | 16K | No (text only) | ~70 | ~80 | ~1140 |
| Qwen 2.5 72B | Alibaba | 72B | 128K | No (text only) | ~65 | 83.1 | 1257 |
Several differences stand out. Gemma 3 is the only model in this group to include multimodal vision at the time of its March 2025 release. At 27B parameters, it achieves a higher Chatbot Arena Elo than Llama 3.3 70B (which has 2.6x more parameters) and Qwen 2.5 72B (which has 2.7x more parameters). Phi-4 at 14B scores competitively on math and reasoning benchmarks but lacks multimodal capability and has a significantly shorter context window of 16K tokens.
Qwen 2.5 72B holds an advantage on CJK (Chinese, Japanese, Korean) language tasks, reflecting training data weighted more heavily toward those languages. Llama 3.3 70B is generally faster at inference due to Meta's optimization of its architecture for throughput-oriented serving. Gemma 3 4B is notable as perhaps the most capable 4B-class multimodal model available at its release date, with a 128K context window that no competing 4B model matched.
In May 2025, Google DeepMind released a preview of Gemma 3n, a separate model family derived from Gemma 3 but designed specifically for on-device deployment on mobile phones and edge hardware. The full release followed in late June 2025.
Gemma 3n introduces several architectural innovations not present in the main Gemma 3 models:
MatFormer (Matryoshka Transformer): Gemma 3n uses a nested architecture where a larger model (nominally 4B parameters, referred to as E4B) contains a smaller functional submodel (nominally 2B parameters, E2B) embedded within it. At inference time, developers can choose to use either the full model or the nested submodel without reloading weights. This enables dynamic performance-quality tradeoff without requiring separate model files.
Per-Layer Embeddings (PLE): Embedding parameters are distributed across layers rather than stored in a single shared table. According to Google DeepMind, this reduces RAM requirements substantially: Gemma 3n E4B achieves performance comparable to a 4B model while requiring only 3 GB of active memory. The E2B submodel runs in approximately 2 GB.
Expanded multimodality: In addition to text and images, Gemma 3n includes audio understanding through a dedicated audio encoder, supporting automatic speech recognition, speech translation, and interleaved audio-text inputs. The vision encoder is a MobileNet-based architecture optimized for mobile inference rather than the SigLIP encoder used in Gemma 3.
Gemma 3n is roughly 1.5x faster on mobile hardware than Gemma 3 4B, according to Google's benchmarks. It integrates with next-generation Gemini Nano features on Android.
The combination of multimodal input, 128K context, and competitive performance across a range of sizes has made Gemma 3 models widely used for several application categories.
Retrieval-augmented generation (RAG): The 128K context window allows Gemma 3 models to process large retrieved document sets without truncation. The multimodal variants can handle mixed text-and-image document collections, making them suitable for enterprise document RAG pipelines where scanned PDFs, charts, and tables are common.
Coding assistance: Gemma 3 instruction-tuned models perform well on code generation and debugging tasks. The Bird-SQL benchmark, which tests text-to-SQL conversion, shows the 27B model at 54.4, and LiveCodeBench shows competitive scores. The models support function calling and structured JSON output, which are required for tool-using coding agents.
Multilingual applications: Support for 140 languages with a single model simplifies deployment for global applications. The Gemma 3 tokenizer's efficient encoding of non-Latin scripts reduces inference cost compared to models with smaller vocabularies.
On-device deployment: The 1B model and QAT quantized versions of larger models run on consumer hardware. The 1B model fits in mobile application bundles; the int4 quantized 4B model runs on laptops with 8 GB of VRAM.
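As a concrete illustration, a sketch loading the text-only 1B instruction-tuned checkpoint in 4 bits via bitsandbytes (generic post-training quantization, shown as a stand-in for the official QAT releases; the repository is gated behind license acceptance):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "google/gemma-3-1b-it"  # text-only variant, loadable as a causal LM
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)

tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb, device_map="auto"
)

prompt = tok.apply_chat_template(
    [{"role": "user", "content": "In one sentence, what is a KV cache?"}],
    add_generation_prompt=True, return_tensors="pt",
).to(model.device)
out = model.generate(prompt, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))
```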
Medical and scientific applications: Google released MedGemma, a Gemma 3 4B variant fine-tuned on medical data, which showed state-of-the-art performance among 4B-class models on medical multimodal question answering and chest X-ray classification tasks.
Document and visual understanding: Pan and Scan's ability to process high-resolution documents makes Gemma 3 multimodal models practical for document intelligence workflows involving scanned pages, infographics, and mixed-layout content.
Gemma 3's March 2025 release generated significant interest in the developer and research communities. Downloads of Gemma models across all generations exceeded 150 million by May 2025. The broader "Gemmaverse" of community fine-tunes and derivative models exceeded 60,000 variants on Hugging Face within two months of the Gemma 3 release.
The Chatbot Arena performance of the 27B model drew particular attention. Ranking ninth overall while being runnable on a single GPU put it in a category no prior open model had occupied: near-frontier quality at single-GPU scale. Developers noted that comparisons to Llama 3.1 405B (which requires multiple high-memory GPUs) and DeepSeek-V3 (a mixture-of-experts model requiring significant infrastructure) were meaningful for practical deployment.
Google launched a Gemma 3 Academic Program alongside the model, offering $10,000 in Google Cloud credits per award to research groups building on Gemma 3. More than 600 projects were submitted to the Gemma 3n Impact Challenge on Kaggle by mid-2025.
Some developers raised concerns about the Gemma Terms of Service license, noting that the propagation requirements (requiring downstream distributors to pass along the terms) create compliance overhead in commercial redistribution scenarios. The license is permissive enough for most commercial use but demands more legal review than Apache 2.0 or MIT. Google's decision to adopt Apache 2.0 for Gemma 4 was widely seen as a response to this feedback.
Researchers also noted the SimpleQA weakness (10.0 for the 27B model versus 44.3 for Gemini 2.0 Pro), which indicates that despite strong performance on structured reasoning tasks, Gemma 3 can be unreliable on basic factual recall questions. This limitation was acknowledged in the technical report.
Gemma 3 has several documented limitations.
SimpleQA and factual recall: The 27B model scores 10.0 on SimpleQA, far below Gemini 2.0 Pro (44.3). This suggests the model is prone to hallucinating on basic factual queries even when strong performance on other benchmarks might suggest otherwise. The gap likely reflects the distillation training approach prioritizing reasoning and instruction-following over memorization of factual content.
Vision resolution ceiling: The SigLIP encoder has a fixed input resolution of 896x896 pixels. Pan and Scan partially addresses this for high-resolution documents, but images requiring continuous fine-grained detail across the entire frame (such as satellite imagery or medical scans with small lesions) may be processed with less accuracy than purpose-built vision models.
Long-context reasoning depth: While Gemma 3 supports 128K tokens and the RULER benchmark shows reasonable retrieval accuracy at that length, the depth of actual multi-hop reasoning over very long contexts is less well-characterized. Many benchmarks measure whether the model can find relevant information in a long context, not whether it can reason across multiple pieces of information spread across 100K+ tokens.
1B model limitations: The 1B variant is English-focused and does not include vision support. It is not recommended for multilingual tasks at comparable reliability to the larger models.
Symbolic hallucination: Research published after the Gemma 3 release found that symbolic hallucinations (errors involving structured reasoning with symbols, counts, or formal relationships) persist even in the 27B model, consistent with patterns observed in other large language models. Question-answering format with minimal constraints tends to elicit higher hallucination rates than constrained generation tasks.
License restrictions: Unlike Apache 2.0 or MIT, the Gemma Terms of Service require downstream distributors to pass along Google's use restrictions. This creates compliance overhead for some commercial redistribution scenarios.