EmbeddingGemma
Last reviewed
May 31, 2026
Sources
10 citations
Review status
Source-backed
Revision
v1 · 1,831 words
EmbeddingGemma is an open text embedding model from Google, released in September 2025, that turns text into dense numeric vectors for search, retrieval, classification, and clustering. It has about 308 million parameters and is built on the Gemma 3 architecture, which makes it small enough to run on a phone, a laptop, or other hardware without a connection to a data center. At launch Google described it as the highest-ranking open multilingual text embedding model under 500 million parameters on the Massive Text Embedding Benchmark, and it supports more than 100 languages while staying under 200 MB of memory when quantized.[1][2][3]
The model sits in a different part of Google's lineup from the chat-oriented Gemma models. Where a generative model produces text, an embedding model produces a fixed-length vector that captures meaning, so two sentences that say roughly the same thing land close together in vector space. That property is what powers semantic search and the retrieval step in a retrieval-augmented generation pipeline. EmbeddingGemma was designed so that this step can happen locally, which keeps user data on the device and removes the latency and cost of a server round trip.[1][2]
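A minimal sketch of what "close together in vector space" means, assuming cosine similarity as the closeness measure. The three 4-dimensional vectors are invented for illustration and are not outputs of EmbeddingGemma.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Hypothetical embeddings: two paraphrases and one unrelated sentence.
cat_sits = [0.9, 0.1, 0.0, 0.2]
feline_rests = [0.8, 0.2, 0.1, 0.3]
stock_report = [0.0, 0.9, 0.8, 0.1]

close = cosine_similarity(cat_sits, feline_rests)
far = cosine_similarity(cat_sits, stock_report)
```

With these toy values the paraphrase pair scores much higher than the unrelated pair, which is the ordering a semantic search relies on.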
Most high-quality embedding models in 2024 and 2025 were served through cloud APIs, including Google's own Gemini Embedding. Those models are accurate, but they require sending text to a remote server, and they bill per request. For applications that run offline, handle private documents, or need to embed large volumes of text cheaply, a smaller model that runs on the user's hardware is a better fit.
EmbeddingGemma targets that gap. Google positions it as the on-device counterpart to Gemini Embedding: the larger Gemini model remains the recommended choice for large-scale, server-side work, while EmbeddingGemma covers mobile and offline use where size and latency matter more than squeezing out the last point of accuracy.[2] The on-device framing also shapes the engineering. The model uses quantization-aware training so that it keeps most of its quality after weights are compressed, and it adopts Matryoshka Representation Learning so that a single set of weights can emit several embedding sizes depending on how much storage and compute an application can spare.[1][2]
EmbeddingGemma is built from the Gemma 3 large language model family, but it is not used as a generative decoder. The training recipe described in the technical report first adapts a decoder-only Gemma 3 checkpoint into an encoder-decoder model using the T5Gemma recipe, then takes the encoder half to initialize the embedding model. The point of that step is to give the model the richer bidirectional context that an encoder sees, where every token can attend to tokens on both sides, rather than the left-to-right view of an autoregressive decoder.[4]
The encoder produces one vector per input token. A mean pooling layer averages those token vectors into a single representation, and two dense projection layers map that representation to the final 768-dimensional output embedding.[1] Of the roughly 308 million parameters, Google breaks the count into about 100 million in the transformer backbone and about 200 million in the embedding table.[2]
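The pooling and projection steps can be sketched on toy data as follows. This is an illustration only: the layer sizes, the bias-free linear layers, and the weight values are assumptions, and a real implementation would also exclude padding tokens from the average.

```python
def mean_pool(token_vectors):
    """Average a list of equal-length token vectors into one vector."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    count = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / count for i in range(dim)]

def dense(vector, weights):
    """Apply a bias-free linear layer; `weights` has one row per output dimension."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]

# Toy sizes (hypothetical): 3 tokens, hidden size 2, intermediate size 3, output size 2.
token_vectors = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
w_first = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # maps 2 -> 3 dimensions
w_second = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]    # maps 3 -> 2 dimensions

pooled = mean_pool(token_vectors)                    # one vector for the whole input
embedding = dense(dense(pooled, w_first), w_second)  # two projections in sequence
```

The key point is the shape change: many token vectors go in, and a single fixed-length vector comes out regardless of input length.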
Matryoshka Representation Learning is the feature that lets one model serve several output sizes. During training the model is encouraged to pack the most important information into the leading dimensions of the vector, so a 768-dimensional embedding can be truncated to 512, 256, or 128 dimensions and still work well. Smaller vectors use less storage and make nearest-neighbor search faster, at a modest cost in accuracy. The technique is shared with several recent embedding models and is documented separately under Matryoshka representation learning.[1][4]
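Truncation itself is a simple operation, sketched below. Rescaling the shortened vector back to unit length is a common practice that this sketch assumes; the 8-dimensional vector stands in for a real 768-dimensional embedding.

```python
import math

def truncate_embedding(vector, dim):
    """Keep the first `dim` values and rescale the result to unit length."""
    if not 0 < dim <= len(vector):
        raise ValueError("dim must be between 1 and len(vector)")
    head = vector[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in head]

# Hypothetical embedding whose leading dimensions carry most of the magnitude.
full = [0.5, -0.5, 0.5, -0.5, 0.1, 0.1, 0.0, 0.0]
small = truncate_embedding(full, 4)
```

Queries and documents must be truncated to the same size, since vectors of different lengths cannot be compared.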
The training itself leans on two further ideas. The model distills knowledge from Gemini Embedding through embedding matching, so the smaller model learns to imitate the geometry of a much larger teacher. It also uses model souping, which averages the weights of several checkpoints that were fine-tuned on different data mixtures rather than different hyperparameters, and a spread-out regularizer that pushes embeddings to use the vector space more evenly. Training ran on TPUv5e hardware over roughly 320 billion tokens of multilingual web text, code, and technical documentation.[1][4]
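As a rough illustration of model souping, the sketch below averages matching parameters across checkpoints. Uniform weighting is an assumption here, and the checkpoint names and values are hypothetical; the actual combination used in training may differ.

```python
def soup(checkpoints):
    """Uniformly average parameter values across checkpoints with identical names."""
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    names = checkpoints[0].keys()
    if any(ckpt.keys() != names for ckpt in checkpoints):
        raise ValueError("checkpoints must share the same parameter names")
    count = len(checkpoints)
    return {
        name: [sum(values) / count
               for values in zip(*(ckpt[name] for ckpt in checkpoints))]
        for name in names
    }

# Two hypothetical checkpoints fine-tuned on different data mixtures.
mix_a = {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]}
mix_b = {"layer.weight": [3.0, 4.0], "layer.bias": [1.0]}
averaged = soup([mix_a, mix_b])
```

The result is a single set of weights, so the averaged model costs no more to run than any one of its ingredients.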
The table below collects the headline figures from Google's documentation and the technical report.
| Property | Value |
|---|---|
| Developer | Google DeepMind |
| Release date | September 4, 2025 |
| Parameters | ~308 million (about 100M backbone, 200M embedding) |
| Base architecture | Gemma 3, with T5Gemma encoder-decoder initialization |
| Encoder type | Bidirectional, mean pooling plus two dense layers |
| Output dimensions | 768, truncatable to 512, 256, or 128 (Matryoshka) |
| Context length | 2,048 tokens |
| Languages | 100+ |
| Training data | ~320 billion tokens |
| Training hardware | TPUv5e |
| Memory footprint | Under 200 MB of RAM when quantized |
| Precision | float32 or bfloat16 (float16 not supported) |
| License | Gemma Terms of Use |
The model expects task-specific prompts on its inputs. A search query is prefixed with text such as "task: search result | query:", a stored document is prefixed with "title:" and "text:", and other modes exist for classification, clustering, semantic similarity, question answering, code retrieval, and fact checking. Matching the prompt to the task is part of getting good results.[3]
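A small helper pair shows how such prefixes might be applied before embedding. The exact separators and the "none" placeholder for a missing title are assumptions based on the pattern quoted above, and the function names are hypothetical; the model card should be treated as the authority on the templates.

```python
def format_query(text, task="search result"):
    """Prefix a query with its task description."""
    return f"task: {task} | query: {text}"

def format_document(text, title=None):
    """Prefix a stored document; a missing title becomes the placeholder 'none'."""
    return f"title: {title if title else 'none'} | text: {text}"

query_input = format_query("how do solar panels work")
doc_input = format_document("Photovoltaic cells convert light into current.",
                            title="Solar energy")
untitled_input = format_document("Photovoltaic cells convert light into current.")
```

The formatted strings, not the raw text, are what would be passed to the model.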
EmbeddingGemma was evaluated on the Massive Text Embedding Benchmark, covering its broad multilingual track (often written MMTEB) as well as its English and code tracks. The scores below are from Google's model card, measured at full precision with the full 768-dimensional output. Each track reports a mean across individual tasks and a mean across task types.
| MTEB track | Mean (Task) | Mean (TaskType) |
|---|---|---|
| Multilingual v2 | 61.15 | 54.31 |
| English v2 | 69.67 | 65.11 |
| Code v1 | 68.76 | 68.76 |
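The two columns differ in how they average. The sketch below uses invented scores, not benchmark data, to show why a track with many tasks of one type can have a task mean that differs from its task-type mean.

```python
from collections import defaultdict

def mean(values):
    values = list(values)
    return sum(values) / len(values)

def mean_by_task(results):
    """Average over every task equally."""
    return mean(score for _, score in results.values())

def mean_by_task_type(results):
    """Average within each task type first, then average those type means."""
    by_type = defaultdict(list)
    for task_type, score in results.values():
        by_type[task_type].append(score)
    return mean(mean(scores) for scores in by_type.values())

# Hypothetical results: task name -> (task type, score).
results = {
    "retrieval_a": ("retrieval", 60.0),
    "retrieval_b": ("retrieval", 70.0),
    "retrieval_c": ("retrieval", 80.0),
    "clustering_a": ("clustering", 40.0),
}

task_mean = mean_by_task(results)
type_mean = mean_by_task_type(results)
```

Here the three retrieval tasks dominate the task mean, while the task-type mean gives the single clustering task the same weight as the whole retrieval group.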
The headline claim from Google and the technical report is about standing rather than raw numbers: among open models with fewer than 500 million parameters, EmbeddingGemma ranked at the top of the multilingual leaderboard at release, and Google reports performance comparable to models close to twice its size.[1][2][4] That ratio of quality to cost is the model's main selling point, since it lets developers reach for an on-device model in places that previously needed a larger or cloud-hosted one.
Quality holds up under two kinds of compression. Because of Matryoshka training, truncating the embedding from 768 to a smaller size loses only a little accuracy, and because of quantization-aware training, compressing the weights to lower precision costs similarly little.[1][4] On Google's reported figures the model runs in under 200 MB of RAM once quantized, and on EdgeTPU hardware it produces embeddings for a 256-token input in under 15 milliseconds, which is fast enough for interactive use.[2]
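A back-of-the-envelope calculation suggests why quantization is needed to reach that memory figure. The bit widths below are illustrative assumptions rather than the scheme Google uses, and the estimate covers weights only, ignoring activations and runtime overhead.

```python
PARAMETERS = 308_000_000  # approximate parameter count quoted in this article

def weight_megabytes(parameters, bits_per_weight):
    """Weight storage in MB (10**6 bytes) at a uniform bit width."""
    return parameters * bits_per_weight / 8 / 1_000_000

# Hypothetical uniform precisions, from float32 down to 4-bit.
sizes = {bits: weight_megabytes(PARAMETERS, bits) for bits in (32, 16, 8, 4)}
```

Under these assumptions, full 32-bit weights would take well over a gigabyte, and only an average of roughly 5 bits per weight or fewer would bring the weights alone under 200 MB.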
The clearest application is on-device retrieval-augmented generation. A small embedding model indexes a user's local files, notes, or messages, and a small generative model then answers questions grounded in whatever the embedding step retrieves, all without leaving the device. EmbeddingGemma is sized to pair with compact generative models such as the smaller Gemma 3 variants for exactly this kind of fully local assistant.[2]
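The retrieval step of such a pipeline can be sketched as a top-k search over stored vectors. The file names and 3-dimensional vectors are hypothetical, and the sketch assumes every vector is already unit length so that a dot product equals cosine similarity.

```python
import heapq

def top_k(query, documents, k=2):
    """Return the k (score, doc_id) pairs with the highest dot product to the query."""
    scored = (
        (sum(q * d for q, d in zip(query, vector)), doc_id)
        for doc_id, vector in documents.items()
    )
    return heapq.nlargest(k, scored)

# Hypothetical local index: document id -> unit-length embedding.
documents = {
    "notes.txt": [1.0, 0.0, 0.0],
    "recipe.md": [0.0, 1.0, 0.0],
    "trip.md": [0.6, 0.8, 0.0],
}
query = [0.8, 0.6, 0.0]

hits = top_k(query, documents, k=2)
retrieved_ids = [doc_id for _, doc_id in hits]
```

In a full assistant, the text of the retrieved documents would then be placed in the generative model's prompt. A linear scan like this is adequate for small local collections; larger ones typically use an approximate nearest-neighbor index.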
Beyond that, the model fits the standard menu of embedding tasks: semantic search over a document collection, clustering similar items, classification, deduplication, and recommendation. The multilingual coverage means a query in one language can retrieve relevant passages written in another, which is useful for search across mixed-language corpora. The small output dimensions help when an application must store millions of vectors, since a 128-dimensional embedding takes a sixth of the space of the full 768.[1][3]
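The storage saving is straightforward arithmetic. The sketch below assumes 4 bytes per value (float32) and a collection of one million vectors, both chosen for illustration, and it ignores any index overhead.

```python
def index_bytes(num_vectors, dim, bytes_per_value=4):
    """Raw storage for a collection of dense vectors, excluding index overhead."""
    return num_vectors * dim * bytes_per_value

full_size = index_bytes(1_000_000, 768)   # 3,072,000,000 bytes
small_size = index_bytes(1_000_000, 128)  # 512,000,000 bytes
ratio = full_size // small_size
```

At these settings a million full-size vectors need about 3.1 GB, while the 128-dimensional versions need about 0.5 GB.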
Google and the community shipped EmbeddingGemma into the common tooling quickly. It works with the Sentence Transformers library and Hugging Face Transformers, and it has support across runtimes, vector stores, and application frameworks including llama.cpp, MLX, Ollama, LiteRT, transformers.js, LM Studio, Weaviate, LlamaIndex, and LangChain.[1][2] The model is distributed through Hugging Face, Kaggle, and Vertex AI.[4]
Within Google's own lineup, EmbeddingGemma and Gemini Embedding are complementary. Gemini Embedding is a larger, API-served model aimed at server-side workloads where accuracy is the priority, and EmbeddingGemma is the open, on-device option that distills from it. Learning from the larger model as a teacher helps the smaller one punch above its parameter count.[2][4]
In the wider field, EmbeddingGemma competes with other compact open embedders such as the multilingual E5 family, the BGE models from BAAI, the GTE and Qwen embedding models from Alibaba, Nomic Embed, and Snowflake's Arctic Embed. Many of these also use Matryoshka-style training and instruction prompts, so the differences come down to size, language coverage, license, and benchmark standing. EmbeddingGemma's distinguishing combination is its small footprint, its broad multilingual support, and its leaderboard position among sub-500M models at the time of release.[1][4]
EmbeddingGemma is released under the Gemma Terms of Use, the same license that covers the rest of the Gemma family, rather than a standard open-source license such as Apache 2.0. The terms make the weights freely available and permit commercial use, fine-tuning, and redistribution, subject to a prohibited-use policy that restricts certain harmful applications. Anyone redistributing the model or derivatives is expected to pass the same terms along.[2][3] Some early third-party write-ups described the model as Apache 2.0, but the official model card lists the license as "gemma" and points to Google's Gemma terms.[3]
The quality of an embedding model is bounded by its training data, and Google notes that gaps or biases in that data can carry into the model's representations despite filtering for unsafe and sensitive content. The model can also miss subtle meaning, such as sarcasm or figurative language, which affects how well similar-but-not-identical texts cluster.[3]
There are practical constraints too. The 2,048-token context window means long documents must be chunked before embedding, and the prompts that prefix queries and documents need to match the task to get the best results. Truncating embeddings to 128 or 256 dimensions trades accuracy for size, so the right dimension depends on the application. And although the model leads its size class, larger cloud models and Google's own Gemini Embedding still score higher overall, which is the expected price of running on a phone instead of a server.[1][2][3]
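Chunking for the context window can be sketched as a sliding window over a token sequence. The integer list below stands in for real token ids, and the overlap size is an arbitrary illustrative choice. An actual pipeline would count tokens with the model's own tokenizer and leave room for the task prompt.

```python
def chunk_tokens(tokens, max_tokens=2048, overlap=0):
    """Split a token sequence into windows of at most `max_tokens`, with optional overlap."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be in [0, max_tokens)")
    step = max_tokens - overlap
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # this window already reaches the end of the sequence
        start += step
    return chunks

tokens = list(range(5000))  # stand-in for the token ids of a long document
plain = chunk_tokens(tokens)
overlapped = chunk_tokens(tokens, overlap=48)
```

Each chunk is embedded separately, so a long document ends up as several vectors in the index rather than one.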