Gemma 3n
Last reviewed
Jun 3, 2026
Sources
7 citations
Review status
Source-backed
Revision
v1 · 1,181 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
Jun 3, 2026
Sources
7 citations
Review status
Source-backed
Revision
v1 · 1,181 words
Add missing citations, update stale details, or suggest a clearer explanation.
Gemma 3n is an open, mobile-first multimodal model from Google built to run locally on phones, tablets, laptops, and other resource-constrained hardware. It accepts text, image, audio, and video as input and produces text output, and it relies on a set of memory-saving techniques that let comparatively large networks fit within the tight RAM budgets of consumer devices. Google previewed the model in May 2025 and shipped the full release in June 2025 [1][2].
Gemma 3n belongs to the wider Gemma family of open models and shares lineage with Gemma 3. It was developed by Google DeepMind, and Google has stated that the same underlying architecture powers the next generation of Gemini Nano, the on-device model embedded across Google products [1].
Google announced a preview of Gemma 3n on May 20, 2025, framing it as a "powerful, efficient, mobile-first" model and timing the reveal to Google I/O 2025, where additional material went live at io.google on May 22 [1]. A staged preview followed on Hugging Face and Kaggle, with the litert-preview checkpoints noting that video input was still being rolled out at that stage [3]. The full developer release arrived on June 26, 2025, alongside a detailed engineering write-up [2].
The model fits Google's broader effort to push generative AI off the data center and onto local hardware, so that features can run with low latency, work offline, and keep data on the device. Google describes Gemma 3n as its first open model built on a shared architecture that it also uses for Gemini Nano [1][2].
At the center of Gemma 3n is MatFormer, short for Matryoshka Transformer, a nested design built for what Google calls elastic inference. The idea borrows from Russian nesting dolls: a larger network fully contains a smaller, independently usable network. In Gemma 3n this means the larger model holds a complete, functional smaller model inside it, so developers can run the full network or extract the compact sub-model depending on the device and the latency budget [2][4]. Google trained the 4B-active variant so that it natively contains a nested 2B-active sub-model [1].
The second major technique is Per-Layer Embeddings (PLE), described by Google DeepMind as an innovation that sharply reduces RAM use. Rather than keeping all embedding parameters resident on the accelerator, PLE gives each transformer layer its own embeddings and lets those parameters be generated and cached, with much of that work handled by the CPU. Only the core transformer weights need to sit on the GPU or TPU, which lowers the accelerator memory the model needs to keep loaded [1][2][5]. Gemma 3n also supports conditional parameter loading, so an application can skip loading audio or vision parameters it does not need and reclaim that memory [6].
A further efficiency feature is KV cache sharing, which Google reports doubles prefill performance relative to Gemma 3 4B when handling long inputs. Combined with activation quantization, Google says these techniques give roughly 1.5x faster on-device responses than Gemma 3 4B at comparable quality [2].
Gemma 3n ships in two sizes, named by their effective rather than raw parameter counts. The "E" prefix denotes effective parameters: although the networks hold more total weights, the PLE and parameter-skipping techniques mean the active memory footprint resembles that of a smaller model [2][6]. The E2B model holds about 5 billion raw parameters but runs with an effective load of just under 2 billion (Google cites 1.91B), and the E4B model holds roughly 8 billion raw parameters while presenting a 4-billion effective footprint [2][6].
| Model | Raw parameters | Effective parameters | Approx. minimum RAM |
|---|---|---|---|
| Gemma 3n E2B | ~5B | ~2B (1.91B) | ~2 GB |
| Gemma 3n E4B | ~8B | ~4B | ~3 GB |
Because the E4B model contains the E2B model through MatFormer, a single download can serve both a higher-quality and a lighter-weight configuration [2][4]. Reported on-device measurements bear out the small footprint: on a Galaxy S25 Ultra running a 4-bit dynamic quantization, Google's documentation lists a model size of roughly 4.2 GB and peak memory in the low gigabytes [3].
Gemma 3n is natively multimodal. It takes text, images, audio, and video as input and returns text [2][4]. For vision, it uses a MobileNet-V5-300M encoder that normalizes images to 256x256, 512x512, or 768x768 pixels and encodes each image to 256 tokens; Google reports the encoder can process up to 60 frames per second on a Pixel phone [2][3]. Audio is handled by an encoder based on Google's Universal Speech Model, which produces about 6.25 tokens per second and supported clips up to 30 seconds at launch, enabling on-device automatic speech recognition and speech translation [2][3].
The model carries a 32,000-token context window [3][6]. Google reports text support across more than 140 languages, with multimodal understanding in 35 languages, and notes particularly strong speech translation between English and Spanish, French, Italian, and Portuguese [2][4]. The model card lists training on roughly 11 trillion tokens spanning web text, code, mathematics, images, and audio, with a knowledge cutoff of June 2024 [3]. On the LMArena leaderboard, Google reported that Gemma 3n E4B scored above 1300 Elo, which it described as the first model under 10 billion parameters to do so [2][7].
Gemma 3n is aimed squarely at local inference. Google released it with open weights for commercial use and integrated it with Google AI Edge for on-device development, while also publishing checkpoints on Hugging Face and Kaggle [1][3][6]. The developer guide lists broad ecosystem support, including Hugging Face Transformers, llama.cpp, Ollama, MLX, LMStudio, transformers.js, Docker, Unsloth, and serving stacks such as vLLM and SGLang, in addition to cloud access through the Google GenAI API and Vertex AI [2]. The combination of a small memory footprint, the nested MatFormer sub-model, and PLE caching is what lets the model run within the constraints of phones and edge devices rather than requiring server-class accelerators [1][2].
Gemma 3n was first made available as a preview tied to Google I/O 2025 on May 20, 2025, then released in full on June 26, 2025 [1][2]. The weights are distributed under Gemma's open license terms and can be downloaded from Hugging Face and Kaggle, used through Google AI Edge for on-device work, or accessed in the cloud via Google's APIs [2][3][6]. As with other Gemini and Gemma releases, Google positioned Gemma 3n as both a standalone open model for developers and a research preview of architecture that feeds its on-device product line [1][2].