Gemma 3n
Last reviewed
Sources
7 citations
Review status
Source-backed
Revision
v2 · 1,410 words
Gemma 3n is an open, mobile-first multimodal model from Google built to run locally on phones, tablets, laptops, and other resource-constrained hardware, accepting text, image, audio, and video as input and returning text. Its larger E4B variant was, according to Google, the first model under 10 billion parameters to score above 1300 Elo on the LMArena leaderboard, yet it runs in roughly 3 GB of RAM by using memory-saving techniques such as the MatFormer nested architecture and Per-Layer Embeddings [2][7]. Google previewed the model in May 2025 and shipped the full release on June 26, 2025 [1][2].
Gemma 3n belongs to the wider Gemma family of open models and shares lineage with Gemma 3. It was developed by Google DeepMind, and Google has stated that the same underlying architecture powers the next generation of Gemini Nano, the on-device model embedded across Google products [1].
Gemma 3n is Google's open, on-device generative AI model designed for low-latency multimodal inference on consumer hardware. It takes text, images, audio, and video as input and produces text output, and it relies on a set of memory-saving techniques that let comparatively large networks fit within the tight RAM budgets of phones, tablets, and laptops [1][2]. Google describes it as a "powerful, efficient, mobile-first" model and as its first open model built on a shared architecture that it also uses for Gemini Nano [1][2].
Google announced a preview of Gemma 3n on May 20, 2025, timing the reveal to Google I/O 2025, where additional material went live at io.google on May 22 [1]. A staged preview followed on Hugging Face and Kaggle, with the litert-preview checkpoints noting that video input was still being rolled out at that stage [3]. The full developer release arrived on June 26, 2025, alongside a detailed engineering write-up [2].
The model fits Google's broader effort to move generative AI out of the data center and onto local hardware, so that features can run with low latency, work offline, and keep data on the device [1][2].
At the center of Gemma 3n is MatFormer, short for Matryoshka Transformer, a nested design built for what Google calls elastic inference. Google describes it as "a novel nested transformer built for elastic inference" and adds: "Think of it like Matryoshka dolls: a larger model contains smaller, fully functional versions of itself" [2]. In Gemma 3n this means the larger model holds a complete, functional smaller model inside it, so developers can run the full network or extract the compact sub-model depending on the device and the latency budget [2][4]. Google trained the 4B-active variant so that it natively contains a nested 2B-active sub-model, and reports the extracted sub-model can deliver up to 2x faster inference [1][2].
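The nesting idea can be illustrated with a toy feed-forward block in which the smaller model reuses a prefix of the larger model's hidden units. This is a minimal sketch under that assumption, not Google's implementation: the class `NestedFFN`, the `hidden_units` parameter, and the tiny weight values are all hypothetical and chosen only to make the arithmetic easy to follow.

```python
from typing import List, Optional

Vector = List[float]
Matrix = List[List[float]]


def relu(value: float) -> float:
    return value if value > 0.0 else 0.0


class NestedFFN:
    """Toy feed-forward block whose smaller variant is a prefix slice of the larger one."""

    def __init__(self, w_in: Matrix, w_out: Matrix) -> None:
        # w_in and w_out each hold one row per hidden unit; every row has d_model weights.
        assert len(w_in) == len(w_out)
        self.w_in = w_in
        self.w_out = w_out

    def forward(self, x: Vector, hidden_units: Optional[int] = None) -> Vector:
        # Use only the first `hidden_units` hidden units; None means the full block.
        k = len(self.w_in) if hidden_units is None else hidden_units
        out = [0.0] * len(x)
        for row_in, row_out in zip(self.w_in[:k], self.w_out[:k]):
            h = relu(sum(w * xi for w, xi in zip(row_in, x)))
            for j, w in enumerate(row_out):
                out[j] += h * w
        return out

    def extract(self, hidden_units: int) -> "NestedFFN":
        # Build a standalone smaller block from the leading slice of the weights.
        return NestedFFN(self.w_in[:hidden_units], self.w_out[:hidden_units])


# Hypothetical weights: 4 hidden units over a 2-dimensional input.
full = NestedFFN(
    w_in=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 1.0]],
    w_out=[[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [1.0, -1.0]],
)
small = full.extract(2)

x = [2.0, 3.0]
y_small_in_place = full.forward(x, hidden_units=2)  # sub-model run inside the full block
y_small_extracted = small.forward(x)                # sub-model run after extraction
y_full = full.forward(x)                            # full block
```

In this toy setup the extracted block and the in-place slice produce identical outputs, which is the property that lets one set of weights serve both a larger and a smaller configuration. A real MatFormer is trained so that the nested slice is useful on its own; the sketch shows only the weight-sharing layout.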
The second major technique is Per-Layer Embeddings (PLE), described by Google DeepMind as an innovation that sharply reduces RAM use. Rather than keeping all embedding parameters resident on the accelerator, PLE gives each transformer layer its own embeddings and lets those parameters be generated and cached, with much of that work handled by the CPU. Google states the technique "dramatically improves model quality without increasing the high-speed memory footprint required on your device's accelerator (GPU/TPU)" [2]. Only the core transformer weights need to sit on the GPU or TPU, which lowers the accelerator memory the model needs to keep loaded [1][2][5]. Gemma 3n also supports conditional parameter loading, so an application can skip loading audio or vision parameters it does not need and reclaim that memory [6].
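One way to picture the split described above is a small planner that assigns each parameter group to accelerator memory, to host (CPU) memory, or to nothing at all. The sketch below is illustrative only. The group names, the `plan_placement` helper, and the sizes (in arbitrary units) are hypothetical and are not taken from Google's documentation. Placing a requested modality encoder on the accelerator is likewise an assumption.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class ParamGroup:
    name: str
    kind: str   # "core", "per_layer_embedding", or a modality such as "vision"
    size: int   # arbitrary units; placeholder values, not published figures


def plan_placement(groups: List[ParamGroup], modalities: Set[str]) -> Dict[str, List[ParamGroup]]:
    """Assign each parameter group to 'accelerator', 'host', or 'skipped'."""
    plan: Dict[str, List[ParamGroup]] = {"accelerator": [], "host": [], "skipped": []}
    for group in groups:
        if group.kind == "core":
            plan["accelerator"].append(group)   # core transformer weights
        elif group.kind == "per_layer_embedding":
            plan["host"].append(group)          # PLE kept off the accelerator
        elif group.kind in modalities:
            plan["accelerator"].append(group)   # assumption: requested encoders sit here
        else:
            plan["skipped"].append(group)       # conditional loading: never loaded
    return plan


def total_size(groups: List[ParamGroup]) -> int:
    return sum(group.size for group in groups)


# Hypothetical parameter groups with placeholder sizes.
GROUPS = [
    ParamGroup("transformer_core", "core", 10),
    ParamGroup("per_layer_embeddings", "per_layer_embedding", 6),
    ParamGroup("vision_encoder", "vision", 3),
    ParamGroup("audio_encoder", "audio", 1),
]

text_only = plan_placement(GROUPS, modalities=set())
with_vision = plan_placement(GROUPS, modalities={"vision"})

for label, plan in (("text only", text_only), ("text + vision", with_vision)):
    print(label, {where: total_size(members) for where, members in plan.items()})
```

With these placeholder sizes, a text-only application keeps only the core group on the accelerator, while enabling vision adds the vision encoder and still leaves the audio encoder unloaded.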
A further efficiency feature is KV cache sharing, which Google reports delivers a 2x improvement in prefill performance relative to Gemma 3 4B when handling long inputs. Google says that, combined with activation quantization, these techniques give roughly 1.5x faster on-device responses than Gemma 3 4B at comparable quality [2].
Gemma 3n ships in two sizes, named by their effective rather than raw parameter counts. The "E" prefix denotes effective parameters: although the networks hold more total weights, the PLE and parameter-skipping techniques mean the active memory footprint resembles that of a smaller model [2][6]. Google states the models are "available in two sizes based on effective parameters: E2B and E4B," with raw counts of 5B and 8B respectively, "operating with as little as 2GB (E2B) and 3GB (E4B) of memory" [2]. The E2B model holds about 5 billion raw parameters but runs with an effective load of just under 2 billion (Google cites 1.91B), and the E4B model holds roughly 8 billion raw parameters while presenting a 4-billion effective footprint [2][6].
| Model | Raw parameters | Effective parameters | Approx. minimum RAM |
|---|---|---|---|
| Gemma 3n E2B | ~5B | ~2B (1.91B) | ~2 GB |
| Gemma 3n E4B | ~8B | ~4B | ~3 GB |
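The table's figures imply how much of each model counts toward the effective load. The short calculation below uses only the numbers in the table (5B and 8B raw, 1.91B and 4B effective). The helper name `effective_share` is hypothetical, and the result is simple arithmetic on rounded figures rather than a measurement.

```python
def effective_share(raw_billion: float, effective_billion: float) -> float:
    """Fraction of the raw parameter count that is counted as effective."""
    return effective_billion / raw_billion


# (raw, effective) parameter counts in billions, as listed in the table.
MODELS = {"E2B": (5.0, 1.91), "E4B": (8.0, 4.0)}

for name, (raw, effective) in MODELS.items():
    share = effective_share(raw, effective)
    outside = raw - effective
    print(f"{name}: {share:.1%} effective, {outside:.2f}B outside the effective count")
# E2B: 38.2% effective, 3.09B outside the effective count
# E4B: 50.0% effective, 4.00B outside the effective count
```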
Because the E4B model contains the E2B model through MatFormer, a single download can serve both a higher-quality and a lighter-weight configuration [2][4]. Reported on-device measurements bear out the small footprint: on a Galaxy S25 Ultra running a build with 4-bit dynamic quantization, Google's documentation lists a model size of roughly 4.2 GB and peak memory in the low gigabytes [3].
Gemma 3n is natively multimodal. It takes text, images, audio, and video as input and returns text [2][4]. For vision, it uses a MobileNet-V5-300M encoder that normalizes images to 256x256, 512x512, or 768x768 pixels and encodes each image to 256 tokens; Google reports the encoder can process up to 60 frames per second on a Pixel phone [2][3]. Google further reports the MobileNet-V5 encoder "delivers a 13x speedup with quantization (6.5x without), requires 46% fewer parameters, and has a 4x smaller memory footprint" compared with the baseline SoViT encoder in Gemma 3 [2]. Audio is handled by an encoder based on Google's Universal Speech Model, which produces about 6.25 tokens per second of audio and, at launch, supported clips of up to 30 seconds, enabling on-device automatic speech recognition and speech translation [2][3].
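The per-image and per-second token figures make it possible to estimate how much of a prompt is consumed by media. The sketch below is a rough budget calculator, assuming that image and audio tokens count against the model's reported 32,000-token context window and ignoring any delimiter or template tokens. The function names are hypothetical.

```python
import math

IMAGE_TOKENS = 256               # tokens per encoded image
AUDIO_TOKENS_PER_SECOND = 6.25   # approximate audio encoder rate
CONTEXT_WINDOW = 32_000          # reported context length in tokens


def media_tokens(num_images: int, audio_seconds: float) -> int:
    """Estimate tokens used by images and audio, rounding audio up to a whole token."""
    audio = math.ceil(audio_seconds * AUDIO_TOKENS_PER_SECOND)
    return num_images * IMAGE_TOKENS + audio


def remaining_for_text(num_images: int, audio_seconds: float) -> int:
    """Tokens left for text after media, under the assumptions stated above."""
    return CONTEXT_WINDOW - media_tokens(num_images, audio_seconds)


# A 30-second clip (the launch limit) plus four images.
print(media_tokens(num_images=0, audio_seconds=30))        # 188
print(media_tokens(num_images=4, audio_seconds=30))        # 1212
print(remaining_for_text(num_images=4, audio_seconds=30))  # 30788
```

Under these assumptions a maximum-length audio clip costs fewer tokens than a single image, so image count tends to dominate the media share of a prompt.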
The model carries a 32,000-token context window [3][6]. Google reports text support across more than 140 languages, with multimodal understanding in 35 languages, and notes particularly strong speech translation between English and Spanish, French, Italian, and Portuguese [2][4]. The model card lists training on roughly 11 trillion tokens spanning web text, code, mathematics, images, and audio, with a knowledge cutoff of June 2024 [3]. On the LMArena leaderboard, Google reported that Gemma 3n E4B scored above 1300 Elo, stating it was "the first model under 10 billion parameters to reach this benchmark" [2][7].
Gemma 3n is aimed squarely at local inference. Google released it with open weights for commercial use and integrated it with Google AI Edge for on-device development, while also publishing checkpoints on Hugging Face and Kaggle [1][3][6]. The developer guide lists broad ecosystem support, including Hugging Face Transformers, llama.cpp, Ollama, MLX, LMStudio, transformers.js, Docker, Unsloth, and serving stacks such as vLLM and SGLang, in addition to cloud access through the Google GenAI API and Vertex AI [2]. The combination of a small memory footprint, the nested MatFormer sub-model, and PLE caching is what lets the model run within the constraints of phones and edge devices rather than requiring server-class accelerators [1][2].
Gemma 3n first became available as a preview on May 20, 2025, timed to Google I/O 2025, and was released in full on June 26, 2025 [1][2]. The weights are distributed under Gemma's open license terms and can be downloaded from Hugging Face and Kaggle, used through Google AI Edge for on-device work, or accessed in the cloud via Google's APIs [2][3][6]. As with other Gemini and Gemma releases, Google positioned Gemma 3n as both a standalone open model for developers and a research preview of the architecture that feeds its on-device product line [1][2].