Qwen3 Embedding
Last reviewed
May 31, 2026
Sources
10 citations
Review status
Source-backed
Revision
v1 · 2,044 words
Qwen3 Embedding is a family of open text embedding and reranking models released by Alibaba's Qwen team in June 2025. The series comes in three sizes (0.6 billion, 4 billion, and 8 billion parameters) and is built on the Qwen3 foundation models. Each embedding model converts text into dense vectors that can be compared for semantic similarity, and the companion rerankers score how well a document answers a query. At launch the 8B embedding model reached the top spot on the multilingual leaderboard of the Massive Text Embedding Benchmark, and the whole series ships under the permissive Apache 2.0 license. [1][2][3]
The models are designed for retrieval, search, and retrieval-augmented generation, where good embeddings decide which passages a large language model gets to read. Alibaba positions the series as a unified answer to both halves of that pipeline, with embedding models for first-stage recall and reranking models for precise reordering of the top candidates. [3]
Qwen3 Embedding is really two model families that share a training recipe and a base model. The embedding family has three members, Qwen3-Embedding-0.6B, Qwen3-Embedding-4B, and Qwen3-Embedding-8B. The reranking family mirrors those sizes with Qwen3-Reranker-0.6B, Qwen3-Reranker-4B, and Qwen3-Reranker-8B. [1][2]
The difference between the two is how they are used. An embedding model maps a single piece of text to a fixed-length vector, so you can encode a whole corpus once, store the vectors, and later find matches by cosine similarity against an encoded query. A reranker instead takes a query and a candidate document together and outputs a single relevance score, which is more accurate but too slow to run over millions of documents. The common pattern is to retrieve a few hundred candidates with the embedding model and then reorder them with the reranker. [2][3]
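The retrieve-then-rerank pattern can be sketched in a few lines. This is a minimal illustration with toy vectors and made-up scores, not the Qwen team's code: `retrieve_then_rerank`, `rerank_score`, `recall_k`, and `final_k` are hypothetical names, and a real system would use an approximate nearest-neighbor index rather than a full sort.

```python
import math
from typing import Callable, List, Sequence

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve_then_rerank(
    query_vec: Sequence[float],
    doc_vecs: Sequence[Sequence[float]],
    rerank_score: Callable[[int], float],
    recall_k: int,
    final_k: int,
) -> List[int]:
    """Return document indices: cheap vector recall first, costly rerank second."""
    # Stage 1: rank every document by cosine similarity to the query vector.
    by_similarity = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine(query_vec, doc_vecs[i]),
        reverse=True,
    )
    candidates = by_similarity[:recall_k]
    # Stage 2: rescore only the shortlisted candidates with the reranker.
    reranked = sorted(candidates, key=rerank_score, reverse=True)
    return reranked[:final_k]

# Toy data standing in for real model outputs (all values are invented).
docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.7, 0.7]]
query = [1.0, 0.0]
toy_scores = {0: 0.2, 1: 0.9, 2: 0.99, 3: 0.5}
top = retrieve_then_rerank(query, docs, lambda i: toy_scores[i], recall_k=3, final_k=2)
print(top)  # [1, 3]
```

Document 2 has the highest toy reranker score but never reaches the reranker, because the first stage did not recall it. That is the trade-off the pattern accepts in exchange for speed.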
All six models inherit the multilingual reach of Qwen3, covering more than 100 natural languages plus programming languages. They support context windows up to 32,768 tokens, which lets them embed long documents without aggressive truncation. [1][2]
Each embedding model is a Qwen3 transformer adapted for representation learning. The 0.6B model has 28 layers, and the 4B and 8B models have 36 layers each. Because the backbone is a causal language model rather than a bidirectional encoder like BERT, the embedding is read off the hidden state of the final token. Qwen3 Embedding appends an end-of-text marker and uses the hidden state at that position as the sentence vector, a scheme often called last-token pooling. [2][4]
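Last-token pooling itself is a small operation. The sketch below uses plain Python lists in place of real tensors and assumes an attention mask where 1 marks a real token and 0 marks padding; `last_token_pool` is an illustrative name, not a function from any Qwen release.

```python
from typing import List

def last_token_pool(
    hidden_states: List[List[float]],
    attention_mask: List[int],
) -> List[float]:
    """Return the hidden state at the last real (non-padding) token."""
    # Find the final position whose mask is 1. Scanning the mask this way
    # gives the right answer for both left-padded and right-padded input.
    last_index = max(i for i, m in enumerate(attention_mask) if m == 1)
    return hidden_states[last_index]

# One sequence of four positions with 2-dimensional hidden states (toy values).
states = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.0, 0.0]]
print(last_token_pool(states, [1, 1, 1, 0]))  # [0.5, 0.6]
```

In a causal model only the final position has attended to every earlier token, which is why that position is the natural place to read a whole-sequence representation.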
The models are instruction aware. A query is formatted as the instruction followed by the text, in the template {Instruction} {Query}<|endoftext|>, so the same model can produce different embeddings for different tasks just by changing the natural-language instruction. The Qwen team reports that a well-chosen instruction typically improves results by roughly 1% to 5%, and the default search instruction is a plain sentence asking the model to retrieve passages that answer a web search query. Documents on the corpus side are usually encoded without an instruction. [2][4]
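The template can be shown as a pair of small string helpers. This is a sketch of the format as described above, with hypothetical function names and an invented instruction sentence; the released inference code may use different prefixes or whitespace, and in practice the tokenizer usually appends the end-of-text marker, so the model card should be checked before relying on an exact string.

```python
END_OF_TEXT = "<|endoftext|>"

def format_query(instruction: str, query: str) -> str:
    """Build a query-side input: instruction, then query, then the end marker."""
    return f"{instruction} {query}{END_OF_TEXT}"

def format_document(document: str) -> str:
    """Build a corpus-side input: the text alone, followed by the end marker."""
    return f"{document}{END_OF_TEXT}"

# Illustrative instruction text, not a quoted default.
task = "Retrieve passages that answer the question."
print(format_query(task, "what is last-token pooling?"))
print(format_document("Last-token pooling reads the final hidden state."))
```

Because only the query side carries the instruction, a stored corpus index can be reused across tasks while the instruction changes per request.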
A notable feature is support for Matryoshka representation learning. The full vectors are 1024 dimensions for the 0.6B model, 2560 for the 4B model, and 4096 for the 8B model, but the training arranges information so that a prefix of each vector is still useful on its own. Users can request shorter outputs, anywhere from 32 dimensions up to the full width, and trade a little accuracy for cheaper storage and faster similarity search. [2][4]
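Using a shorter Matryoshka vector amounts to keeping a prefix and rescaling it. The sketch below assumes the common practice of re-normalizing the prefix to unit length so that dot products still behave as cosine similarities; `truncate_embedding` is an illustrative helper, not part of any Qwen library.

```python
import math
from typing import List

def truncate_embedding(vector: List[float], dim: int) -> List[float]:
    """Keep the first `dim` values and rescale the prefix to unit length."""
    if not 1 <= dim <= len(vector):
        raise ValueError("dim must be between 1 and the full vector width")
    prefix = vector[:dim]
    norm = math.sqrt(sum(x * x for x in prefix))
    if norm == 0.0:
        return prefix  # an all-zero prefix cannot be normalized
    return [x / norm for x in prefix]

full = [3.0, 4.0, 12.0]             # toy 3-dimensional "embedding"
short = truncate_embedding(full, 2)  # keep the first 2 dimensions
print(short)
```

Queries and documents must be truncated to the same width, since vectors of different lengths cannot be compared.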
The rerankers use the same Qwen3 backbones but treat relevance as a binary classification problem. The model reads a system prompt, the query, and a candidate document through the chat template, and the score comes from the logits it assigns to a yes-or-no judgment of relevance. That raw score can be turned into a value between 0 and 1 with a sigmoid if a probability-style score is needed. [2][4]
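One way to picture the scoring step is as a two-way comparison between the "yes" and "no" logits. The sketch below assumes the score is a softmax over just those two logits, which is mathematically the same as a sigmoid of their difference; the function name and the example logit values are invented for illustration.

```python
import math

def relevance_probability(yes_logit: float, no_logit: float) -> float:
    """Two-way softmax over the yes/no logits, written as a stable sigmoid."""
    diff = yes_logit - no_logit
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    # For negative differences, use the equivalent form that avoids overflow.
    e = math.exp(diff)
    return e / (1.0 + e)

print(relevance_probability(4.0, 1.0))  # well above 0.5: judged relevant
print(relevance_probability(1.0, 4.0))  # well below 0.5: judged not relevant
```

Only the difference between the two logits matters, so adding the same constant to both leaves the score unchanged.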
The embedding models are trained with a multi-stage pipeline that leans heavily on synthetic data. In the first stage the team uses Qwen3 itself to generate weakly supervised training pairs. A configurable prompt system asks the foundation model to produce queries, relevant passages, and related text across many task types and languages, which sidesteps the usual reliance on scraped community question-and-answer data. The paper reports roughly 150 million such pairs generated this way. [4]
The second stage is supervised fine-tuning on high-quality data. This mixes human-labeled retrieval datasets with a filtered subset of the synthetic pairs, about 12 million examples chosen by cosine similarity so that only clean, well-matched pairs survive. Both stages use a contrastive learning objective that pulls matching query-document pairs together while pushing apart in-batch negatives and mined hard negatives. [4]
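The contrastive objective can be illustrated with a generic InfoNCE-style loss for a single query. This is a textbook form under stated assumptions, not the exact loss from the paper: the paper's denominator terms and masking may differ, and the temperature of 0.05 is an illustrative value only.

```python
import math
from typing import List

def info_nce_loss(
    similarities: List[float],
    positive_index: int,
    temperature: float = 0.05,
) -> float:
    """Cross-entropy of the positive document against all candidates.

    `similarities` holds the query's similarity to the positive document and
    to every negative (in-batch or mined hard negatives).
    """
    scaled = [s / temperature for s in similarities]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    max_s = max(scaled)
    log_denominator = max_s + math.log(sum(math.exp(s - max_s) for s in scaled))
    return log_denominator - scaled[positive_index]

# The positive (index 0) is far more similar than the negatives: low loss.
print(info_nce_loss([0.9, 0.1, 0.2], positive_index=0))
# A hard negative nearly as similar as the positive raises the loss.
print(info_nce_loss([0.9, 0.85, 0.2], positive_index=0))
```

Minimizing this quantity raises the positive's similarity relative to the negatives, which is the "pull together, push apart" behavior described above.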
The final stage is model merging. Rather than picking a single best checkpoint, the team applies spherical linear interpolation, often shortened to slerp, to combine several checkpoints saved during fine-tuning. Merging tends to average away the idiosyncrasies of any one run and produces a model that generalizes better across the wide spread of evaluation tasks. [4]
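Spherical linear interpolation between two weight vectors can be sketched as follows. The source does not say how slerp is applied to full checkpoints (per tensor or flattened, pairwise or across many), so this shows only the two-vector building block, with toy values and a hypothetical function name.

```python
import math
from typing import List

def slerp(a: List[float], b: List[float], t: float) -> List[float]:
    """Interpolate from a (t=0) to b (t=1) along the arc between them."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    cos_omega = sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)
    cos_omega = max(-1.0, min(1.0, cos_omega))  # guard against rounding drift
    omega = math.acos(cos_omega)                # angle between the two vectors
    if omega < 1e-6:
        # Nearly parallel vectors: fall back to ordinary linear interpolation.
        return [(1.0 - t) * x + t * y for x, y in zip(a, b)]
    sin_omega = math.sin(omega)
    weight_a = math.sin((1.0 - t) * omega) / sin_omega
    weight_b = math.sin(t * omega) / sin_omega
    return [weight_a * x + weight_b * y for x, y in zip(a, b)]

midpoint = slerp([1.0, 0.0], [0.0, 1.0], 0.5)
print(midpoint)  # both components are close to 0.7071
```

Unlike a plain average, which would give [0.5, 0.5] here, the slerp midpoint keeps unit length, so the merged vector does not shrink toward the origin.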
The rerankers follow a shorter path. The Qwen team found in its own experiments that the weakly supervised first stage did not help reranking, so those models go straight to supervised training on high-quality labeled data, again finished with model merging. [3][4]
Because the backbone was pretrained on a multilingual corpus, the embedding models handle cross-lingual search, where the query is in one language and the answer is in another, along with monolingual retrieval in many languages. The series also targets code retrieval, the task of finding a relevant snippet or function given a natural-language description, which connects it to Alibaba's broader code work such as Qwen3-Coder. On the code portion of the MTEB suite the 8B embedding model scored 80.68, which the paper reports as ahead of the proprietary Gemini Embedding model on that task. [1][4]
Qwen3 Embedding was evaluated mainly on MTEB and its multilingual expansion. The multilingual track draws on MMTEB, a community-built benchmark of more than 500 tasks across over 250 languages, which is the largest multilingual evaluation collection for embedding models. On the MTEB multilingual leaderboard snapshot from June 2025, Qwen3-Embedding-8B ranked first with a mean task score of 70.58, ahead of Google's experimental Gemini Embedding entry. [1][4][5]
The table below lists the embedding scores the Qwen team reported across the multilingual, English, and Chinese benchmarks. Mean task scores are shown for each.
| Model | Params | MMTEB (multilingual) | MTEB English v2 | C-MTEB (Chinese) |
|---|---|---|---|---|
| Qwen3-Embedding-0.6B | 0.6B | 64.33 | 70.70 | 66.33 |
| Qwen3-Embedding-4B | 4B | 69.45 | 74.60 | 72.27 |
| Qwen3-Embedding-8B | 8B | 70.58 | 75.22 | 73.84 |
| gemini-embedding-exp-03-07 | n/a | 68.37 | n/a | n/a |
Scores are from the Qwen3 Embedding paper and model cards. [2][4]
The rerankers were evaluated by reordering the top 100 candidates that Qwen3-Embedding-0.6B retrieved, so their numbers measure the lift a reranker adds on top of a fixed first stage. The table reports mean scores on several retrieval suites.
| Reranker | Params | MTEB-R | CMTEB-R | MMTEB-R | MLDR | MTEB-Code |
|---|---|---|---|---|---|---|
| Qwen3-Reranker-0.6B | 0.6B | 65.80 | 71.31 | 66.36 | 67.28 | 73.42 |
| Qwen3-Reranker-4B | 4B | 69.76 | 75.94 | 72.74 | 69.97 | 81.20 |
| Qwen3-Reranker-8B | 8B | 69.02 | 77.45 | 72.94 | 70.19 | 81.22 |
Reranker scores are from the Qwen3 Reranker model cards. [2]
A pattern worth noting is that the 4B and 8B models are close on many tracks, and the 4B reranker actually edges the 8B on the English MTEB retrieval split (69.76 against 69.02 on MTEB-R). The 0.6B models trail the larger ones but remain competitive for their size, which the paper frames as the main argument for shipping a small option. [2][4]
| Attribute | 0.6B | 4B | 8B |
|---|---|---|---|
| Parameters | 0.6B | 4B | 8B |
| Layers | 28 | 36 | 36 |
| Max context | 32K tokens | 32K tokens | 32K tokens |
| Embedding dimension | 1024 | 2560 | 4096 |
| Matryoshka dimensions | 32 to 1024 | 32 to 2560 | 32 to 4096 |
| Instruction aware | Yes | Yes | Yes |
| Languages | 100+ | 100+ | 100+ |
| License | Apache 2.0 | Apache 2.0 | Apache 2.0 |
Specifications apply to the embedding models; the rerankers share the same sizes, layer counts, and 32K context. [2]
All six models are released under the Apache 2.0 license, which permits commercial use, modification, and redistribution. Weights are hosted on Hugging Face and ModelScope, and the models integrate with common tooling such as sentence-transformers, the Transformers library, and serving engines like vLLM and Text Embeddings Inference. Smaller variants are also available through community runtimes such as Ollama. [1][2][3]
The most common use is the two-stage retrieval pipeline at the heart of retrieval-augmented generation. A document collection is split into chunks and encoded once with an embedding model, and the vectors go into a similarity index. At query time the system encodes the question, pulls back the nearest chunks, optionally reranks them with a Qwen3 reranker, and feeds the best passages to a generator model as context. The instruction-aware design lets a single deployment serve several tasks, and Matryoshka truncation lets operators shrink the index when storage or latency matters more than the last point of accuracy. [2][3]
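The storage side of that trade-off is simple arithmetic. The sketch below assumes uncompressed 32-bit floats, ignores any index overhead, and uses a hypothetical corpus of one million chunks; the 512-dimension setting is just one example of a truncated width.

```python
def index_size_bytes(num_vectors: int, dim: int, bytes_per_value: int = 4) -> int:
    """Raw vector storage: one value per dimension per vector."""
    return num_vectors * dim * bytes_per_value

chunks = 1_000_000                           # hypothetical corpus size
full = index_size_bytes(chunks, 4096)        # full-width 8B model vectors
truncated = index_size_bytes(chunks, 512)    # Matryoshka-truncated vectors

print(full / 1e9)        # 16.384 (GB)
print(truncated / 1e9)   # 2.048 (GB)
print(full // truncated) # 8
```

Raw storage scales linearly with the dimension, so cutting 4096 dimensions to 512 shrinks the vectors by a factor of eight before any quantization is applied.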
When Qwen3 Embedding arrived, the multilingual leaderboard was led by Google's Gemini Embedding, a closed model available only through an API. Qwen3-Embedding-8B reported a higher multilingual mean task score, 70.58 against the 68.37 the Qwen team listed for the experimental Gemini entry, while remaining open and self-hostable. [4][6]
Against OpenAI's text-embedding-3 models, which are also API-only and top out at 3072 dimensions, the larger Qwen3 models report stronger multilingual results, though direct comparison depends on the exact benchmark version. Compared with BGE-M3, the widely used open multilingual embedder from the Beijing Academy of Artificial Intelligence, Qwen3 Embedding posts higher MTEB scores but follows a different design philosophy. BGE-M3 bundles dense, sparse, and multi-vector retrieval in one model for hybrid search, whereas Qwen3 Embedding focuses on dense vectors with flexible dimensions and instruction control. The choice between them often comes down to whether an application wants lexical signals alongside dense similarity. [6][7][8]
The larger models are heavy. An 8B embedding or reranking model needs a capable GPU for low-latency inference, and running a reranker over many candidates is costly, which is why the published reranker numbers assume only the top 100 results are reordered. Every text to embed also requires a full forward pass through a decoder with 0.6 billion to 8 billion parameters, so high-throughput indexing of huge corpora can be slow compared with smaller bidirectional encoders. [2][4]
There are evaluation caveats too. Leaderboard standings move quickly, and the June 2025 ranking reflects a single snapshot against the baselines available then; newer models, including later Gemini Embedding releases, have since changed the field. MTEB and MMTEB scores measure aggregate task performance and do not guarantee results on any specific domain, so the usual advice to test on representative in-domain data still applies. The synthetic-data pipeline also ties data quality to the behavior of the Qwen3 generator, which can pass its own blind spots into the embeddings. [4][5]