Jina Embeddings v3
Jina Embeddings v3 is a multilingual text embedding model released by Jina AI on September 18, 2024. With 570 million parameters, support for 89 languages, an 8,192 token context window, and a stack of task-specific Low-Rank Adaptation (LoRA) adapters, the model targets production retrieval, classification, clustering, and semantic similarity workloads. It also integrates Matryoshka Representation Learning, allowing users to truncate the 1,024-dimensional output to as few as 32 dimensions without retraining. The accompanying technical report, titled jina-embeddings-v3: Multilingual Embeddings With Task LoRA, was posted to arXiv under identifier 2409.10173.
At launch, Jina Embeddings v3 ranked as the top-performing model under 1 billion parameters on the Massive Text Embedding Benchmark (MTEB) English leaderboard, trailing only much larger 7-billion-parameter models overall, and led the MTEB multilingual track among models of comparable size. Jina AI positioned the release as evidence that compact, well-engineered open-weight encoders can outperform substantially larger proprietary embedding systems such as OpenAI's text-embedding-3 family and Cohere Embed Multilingual v3 on a wide variety of retrieval and similarity tasks. The model is distributed under a Creative Commons Attribution-NonCommercial 4.0 license through Hugging Face, with commercial use available through Jina AI's hosted Embeddings API.
The Jina Embeddings family began with v1 in mid-2023, an English-only T5-based encoder, followed in October 2023 by v2 models that extended context to 8,192 tokens using ALiBi positional biases. By mid-2024, the market for general-purpose text embeddings had become crowded. OpenAI had introduced text-embedding-3-small and text-embedding-3-large in January 2024, both using Matryoshka Representation Learning for truncatable outputs. Cohere offered Embed v3 in English and a multilingual variant covering more than 100 languages. The Beijing Academy of Artificial Intelligence released BGE-M3 in early 2024 with dense, sparse, and multi-vector outputs, while Microsoft Research released multilingual-e5-large-instruct.
Jina Embeddings v3 was developed by a Jina AI team led by Saba Sturua, Isabelle Mohr, and Mohammad Kalim Akram, with collaborators including Michael Günther, Andreas Koukounas, and Han Xiao. Rather than ship one large generalist model, the v3 design pairs a single 559M-parameter encoder backbone with a small library of swap-in adapters, each trained for a specific information retrieval or NLP use case.
The Jina Embeddings v3 backbone is a modified version of XLM-RoBERTa, Facebook AI's 2019 cross-lingual encoder built on the same masked language modeling objective as RoBERTa but trained on the CC-100 corpus across 100 languages. Jina AI refers to the modified backbone as jina-XLM-RoBERTa. The principal architectural changes are the replacement of XLM-RoBERTa's absolute learned position embeddings with Rotary Position Embeddings (RoPE), which enables sequence-length extrapolation beyond the pre-training window; the integration of FlashAttention 2 for memory-efficient attention at long sequence lengths; and a mean-pooling head used to derive a single fixed-size embedding from token-level outputs.
The encoder has 24 transformer layers and a hidden size of 1,024. Combined with the original 250,000-token SentencePiece vocabulary inherited from XLM-RoBERTa, the base model has approximately 559 million parameters. With all five LoRA adapters loaded simultaneously, the total rises to roughly 572 million, so the adapters contribute about 13 million parameters, less than 3 percent of the parameter budget, in line with the design philosophy of LoRA.
| Property | Value |
|---|---|
| Backbone | jina-XLM-RoBERTa (modified) |
| Transformer layers | 24 |
| Hidden size | 1,024 |
| Attention heads | 16 |
| Vocabulary size | 250,000 |
| Base parameters | 559 million |
| Parameters with all LoRA adapters | 572 million |
| Default output dimension | 1,024 |
| Matryoshka supported dimensions | 32 to 1,024 |
| Maximum context length | 8,192 tokens |
| Position encoding | Rotary Position Embeddings (RoPE) |
| RoPE base (training) | 10,000 |
| RoPE base (inference) | 20,000 |
| Attention implementation | FlashAttention 2 |
| Pooling | Mean pooling |
| Languages | 89 |
| Languages with optimized performance | 30 |
| License | CC BY-NC 4.0 |
| Release date | September 18, 2024 |
| arXiv identifier | 2409.10173 |
XLM-RoBERTa was originally trained with a 512-token maximum sequence length and absolute learned positional embeddings, which do not extrapolate well to longer inputs. To extend the receptive field, Jina AI swapped the absolute embeddings for Rotary Position Embeddings (RoPE), the same scheme used in models such as LLaMA and Mistral. During pre-training the team used a RoPE base of 10,000, then increased it to 20,000 at inference time, a technique known as RoPE rescaling that allows the model to process up to 8,192 tokens with minimal accuracy loss. Long-context inference is paired with FlashAttention 2, an I/O-aware attention implementation that drastically reduces GPU memory consumption at long sequence lengths.
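The effect of raising the rotary base can be seen directly in the frequency computation. The following is a minimal, generic sketch of RoPE base rescaling, not Jina AI's implementation; the per-head dimension (assumed here to be 1,024 / 16 = 64) follows the table above.

```python
import torch

def rope_inv_freq(head_dim: int, base: float) -> torch.Tensor:
    """Inverse rotary frequencies for one attention head.

    base=10_000 matches the value used during pre-training;
    base=20_000 is the rescaled value used at inference for 8,192-token inputs.
    """
    return 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))

def rotary_angles(positions: torch.Tensor, head_dim: int, base: float) -> torch.Tensor:
    """Angle matrix of shape (seq_len, head_dim / 2) used to rotate query/key pairs."""
    inv_freq = rope_inv_freq(head_dim, base)
    return torch.outer(positions.float(), inv_freq)

# A larger base slows the rotation of every frequency band, so position 8_191
# at base 20_000 stays within an angular range comparable to what the model
# saw for shorter positions at base 10_000 during training.
train_angles = rotary_angles(torch.arange(512), head_dim=64, base=10_000)
infer_angles = rotary_angles(torch.arange(8_192), head_dim=64, base=20_000)
```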
The defining innovation of Jina Embeddings v3 is its set of task-specific LoRA adapters, each conditioning the encoder for a class of information retrieval or NLP task. LoRA, introduced by Edward Hu and colleagues at Microsoft Research in 2021, trains small low-rank matrices added to the frozen weight matrices of a base model, producing task specialization at a tiny fraction of the cost of full fine-tuning, with adapter weights that can be swapped at runtime. At inference, the user selects one of five task names, listed in the table after the sketch below. Because all five adapters share the same backbone, a model server needs only one copy of the 559M-parameter encoder in memory, and adapter switching is essentially free.
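As a rough illustration of the mechanism (not the report's exact configuration; the rank, scaling factor, and choice of which weight matrices receive adapters are assumptions here), a LoRA-augmented linear layer adds a trainable low-rank update to a frozen base projection:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen dense layer plus a trainable low-rank update: W x + (alpha / r) * B A x.

    Illustrative only; the ranks, scaling, and adapter placement used in
    jina-embeddings-v3 follow the technical report, not this sketch.
    """
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():          # backbone weights stay frozen
            p.requires_grad = False
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)        # adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))
```

Because only the two small matrices are trained per task, five adapters add just a few percent to the backbone's parameter count, which is how v3 reaches roughly 572 million parameters with all adapters loaded.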
| Task name | Use case |
|---|---|
| retrieval.query | Query embeddings for asymmetric retrieval, paired with retrieval.passage |
| retrieval.passage | Document or passage embeddings, paired with retrieval.query |
| separation | Clustering, reranking, deduplication; spreads out distinct documents |
| classification | Classifier features, trained with classification labels |
| text-matching | Symmetric semantic similarity, paraphrase, STS; general-purpose default |
The retrieval.query and retrieval.passage adapters address the asymmetry of search workloads, where queries tend to be short and underspecified while passages are long and information-rich. Jina v3 follows a separate-encoder approach made cheap by LoRA, where one shared backbone with two adapters effectively yields two specialized encoders. The separation adapter is trained so that documents from different sources yield embeddings that are more spread out in vector space, improving clustering quality. The classification adapter is trained with labeled datasets so that documents of the same class cluster together. The text-matching adapter is the closest to a general-purpose default, optimized for symmetric semantic textual similarity.
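In practice the adapter is selected with a task argument at encode time. The sketch below roughly follows the usage pattern shown on the Hugging Face model card; exact keyword arguments may differ across library versions.

```python
from transformers import AutoModel

# trust_remote_code pulls in the custom encode() helper shipped with the model.
model = AutoModel.from_pretrained("jinaai/jina-embeddings-v3", trust_remote_code=True)

# Asymmetric retrieval: queries and passages go through different adapters
# but land in the same vector space.
query_emb = model.encode(
    ["How does late chunking work?"],
    task="retrieval.query",
)
passage_embs = model.encode(
    ["Late chunking pools token embeddings after a full-document forward pass."],
    task="retrieval.passage",
)
```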
Jina Embeddings v3 was trained in three sequential stages: pre-training of the backbone with a masked language modeling objective, contrastive pair training on more than one billion text pairs, and LoRA adapter training on task-specific labeled data.
The modified jina-XLM-RoBERTa backbone was pre-trained on multilingual web text drawn from the CulturaX corpus, a cleaned multilingual dataset covering 167 languages. The mixture used 89 languages with roughly 20 percent English. Pre-training proceeded in two phases: a short-context phase of 100,000 steps at 512 tokens, then a long-context phase of 60,000 steps at 8,192 tokens. The model was trained with a masked language modeling objective, following the original RoBERTa recipe without next-sentence prediction.
The second stage applied a bidirectional InfoNCE contrastive loss to more than one billion text pairs assembled from over 300 source datasets. Pairs spanned web queries and clicked passages, parallel multilingual sentences, paraphrase corpora, question-answer pairs, and synthetic translations. A heuristic filter removed pairs with 80 percent or more lexical overlap to focus the model on semantic rather than surface-level matching. Pair training was conducted first on short sequences and then on longer ones to preserve the long-context behavior introduced during pre-training.
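A minimal PyTorch sketch of a bidirectional InfoNCE loss with in-batch negatives is shown below; the temperature value is an illustrative assumption, not a figure from the report.

```python
import torch
import torch.nn.functional as F

def bidirectional_infonce(q: torch.Tensor, p: torch.Tensor,
                          temperature: float = 0.05) -> torch.Tensor:
    """InfoNCE with in-batch negatives, averaged over both directions.

    q, p: (batch, dim) embeddings of paired texts; q[i] and p[i] form a positive
    pair, and every other row in the batch serves as a negative.
    """
    q = F.normalize(q, dim=-1)
    p = F.normalize(p, dim=-1)
    logits = q @ p.T / temperature                      # (batch, batch) cosine similarities
    labels = torch.arange(q.size(0), device=q.device)   # positives sit on the diagonal
    loss_q2p = F.cross_entropy(logits, labels)          # match each query to its passage
    loss_p2q = F.cross_entropy(logits.T, labels)        # and each passage to its query
    return 0.5 * (loss_q2p + loss_p2q)
```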
After pair training, the team froze the backbone and trained each LoRA adapter on a task-specific dataset. Retrieval adapters were trained on labeled query-document pairs with hard negatives. The text-matching adapter was trained on semantic textual similarity (STS) datasets and paraphrase pairs. The classification adapter was trained on a curated mixture of classification datasets. The separation adapter was trained with a loss that increases the distance between unrelated documents from distinct domains. In addition, the team generated synthetic failure-case data targeting four known retrieval pathologies: misleading syntactic similarity, named-entity misinterpretation, polar question handling (preserving negations), and a preference for low-quality documents that merely repeat the query.
| Stage | Devices | Steps | Batch size | Sequence length | Learning rate |
|---|---|---|---|---|---|
| Pre-training (short) | 8 | 100,000 | 128×8 | 512 | 1×10⁻⁴ |
| Pre-training (long) | 8 | 60,000 | 8×8 | 8,192 | 5×10⁻⁵ |
| Pair training (short) | 8 | 60,000 | 8×2,048 | 192 | 3×10⁻⁵ |
| Pair training (long) | 2 | 50,000 | 2×512 | 1,024 | 2×10⁻⁵ |
| Retrieval adapters | 1 | 20,000 | 128 | 512 | 5×10⁻⁴ |
| Text matching | 1 | 15,000 | 256 | 192 | 1×10⁻⁴ |
| Classification | 1 | 11,500 | 256 | 192 | 5×10⁻⁴ |
| Separation | 1 | 12,500 | 512 | 192 | 5×10⁻⁵ |
The text-matching adapter is trained with a Matryoshka Representation Learning loss, a technique introduced in 2022 by Aditya Kusupati and colleagues that allows a single embedding to be truncated to a shorter prefix while remaining useful. Conventionally, an embedding model produces a fixed-length output, and reducing dimensionality afterward through PCA or random projection sacrifices accuracy. Matryoshka training avoids this by adding a sum of contrastive losses computed at nested prefix lengths.
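A simplified sketch of such a nested loss follows; the prefix lengths and temperature are illustrative assumptions, and the report's exact formulation may weight the prefixes differently.

```python
import torch
import torch.nn.functional as F

def matryoshka_contrastive_loss(q: torch.Tensor, p: torch.Tensor,
                                dims=(32, 64, 128, 256, 512, 1024),
                                temperature: float = 0.05) -> torch.Tensor:
    """Average of contrastive losses computed over nested embedding prefixes.

    Each prefix q[:, :d] is re-normalized and scored as if it were the full
    embedding, so every prefix length learns to be independently usable.
    """
    labels = torch.arange(q.size(0), device=q.device)
    total = torch.zeros((), device=q.device)
    for d in dims:
        q_d = F.normalize(q[:, :d], dim=-1)
        p_d = F.normalize(p[:, :d], dim=-1)
        logits = q_d @ p_d.T / temperature
        total = total + F.cross_entropy(logits, labels)
    return total / len(dims)
```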
Jina Embeddings v3 outputs 1,024-dimensional vectors by default but is trained to support truncation to 32, 64, 128, 256, 512, or 768. According to the technical report, retrieval scores degrade only marginally when truncating from 1,024 to 512 dimensions and remain useful at 64 to 128 dimensions for many tasks. This places Jina Embeddings v3 in the same family as OpenAI's text-embedding-3-large (truncatable between 256 and 3,072 dimensions), Nomic Embed Text v1.5, and other Matryoshka-trained encoders.
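On the consumption side, truncation amounts to slicing off the leading dimensions and re-normalizing before cosine-similarity search. A minimal sketch with stand-in data (the helper name and random vectors are illustrative):

```python
import numpy as np

def truncate_embeddings(embeddings: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` Matryoshka dimensions and re-normalize for cosine search."""
    truncated = embeddings[:, :dim]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    return truncated / np.clip(norms, 1e-12, None)

# Stand-in for model output: four 1,024-dimensional vectors.
rng = np.random.default_rng(0)
full_embeddings = rng.standard_normal((4, 1024))

# Shrink to 128 dimensions before writing to a vector index.
small = truncate_embeddings(full_embeddings, 128)
```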
A notable inference-time technique introduced alongside v3 is called late chunking. Traditional embedding-based retrieval-augmented generation (RAG) pipelines split a document into chunks of, say, 256 or 512 tokens, then embed each chunk independently. The resulting chunk embeddings have no awareness of context outside the chunk boundaries (for example, a pronoun whose antecedent appears in an earlier chunk), which hurts retrieval quality. Late chunking instead feeds the entire long document through the encoder in a single forward pass to obtain contextualized token embeddings, then performs the chunking afterward by mean-pooling the token vectors within each chunk, so every chunk embedding reflects information from the full document.
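The pooling step can be expressed compactly. The sketch below assumes a Hugging Face encoder whose forward pass returns last_hidden_state and uses fixed-size token spans as chunk boundaries; Jina AI's reference implementation aligns boundaries to natural text units and handles attention masks more carefully.

```python
import torch

def late_chunk(model, tokenizer, document: str, chunk_tokens: int = 256) -> torch.Tensor:
    """Late-chunking sketch: one forward pass over the whole document,
    then mean-pooling of contextualized token states within each chunk."""
    inputs = tokenizer(document, return_tensors="pt", truncation=True, max_length=8192)
    with torch.no_grad():
        token_states = model(**inputs).last_hidden_state[0]   # (seq_len, hidden)
    spans = token_states.split(chunk_tokens)                   # spans that already saw full context
    return torch.stack([span.mean(dim=0) for span in spans])   # one embedding per chunk
```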
Jina Embeddings v3 was evaluated on the Massive Text Embedding Benchmark (MTEB), the de facto standard for general-purpose embedding evaluation introduced by Niklas Muennighoff and colleagues in 2022. MTEB covers eight task categories: classification, clustering, pair classification, reranking, retrieval, sentence similarity (STS), summarization, and bitext mining, spanning more than 50 datasets and 100 languages.
| Task | Jina Embeddings v3 | OpenAI text-embedding-3-large | Cohere Embed Multilingual v3 | multilingual-e5-large-instruct |
|---|---|---|---|---|
| Classification (accuracy) | 82.58 | 75.45 | 77.71 | 76.81 |
| Sentence similarity (STS) | 85.80 | 81.73 | 81.42 | 84.78 |
| Retrieval (nDCG@10) | 53.87 | 55.44 | 55.39 | 52.50 |
| Clustering (V-measure) | 45.27 | 49.01 | 47.46 | 50.86 |
| Reranking | 58.77 | 59.16 | 58.31 | 56.36 |
| Pair classification | 85.80 | 87.96 | 86.05 | 86.46 |
| Summarization | 30.91 | 30.85 | 30.99 | 31.61 |
| Overall English average | 65.52 | 64.60 | 64.01 | 64.41 |
The English average of 65.52 placed Jina Embeddings v3 ahead of OpenAI's text-embedding-3-large (64.60) and Cohere's Embed Multilingual v3 (64.01) at launch, despite Jina v3's comparatively small 570-million-parameter footprint.
On the multilingual MTEB tasks, Jina v3 averaged 64.44, narrowly ahead of multilingual-e5-large-instruct at 64.25. On the LongEmbed long-document retrieval benchmark, Jina v3 averaged 70.39 nDCG@10 across six tasks, compared with 56.56 for BGE-M3, 60.99 for jina-embeddings-v2-base-en, and lower scores for shorter-context proprietary models that cannot accept inputs of 8,192 tokens or more.
| LongEmbed task | Jina v3 nDCG@10 |
|---|---|
| NarrativeQA | 33.32 |
| Passkey retrieval | 100.00 |
| WikiQA | 72.46 |
| Needle retrieval | 92.00 |
| QMSum | 36.05 |
| 2WikiMultihopQA | 88.50 |
| Average | 70.39 |
The report acknowledges that the much larger 7-billion-parameter e5-mistral-7b-instruct model still outscores Jina v3 on overall English MTEB at 66.63 versus 65.52, but at more than 12 times the parameter count and with a 4,096-dimensional output that makes downstream serving more expensive.
| Model | Vendor | Parameters | Max context | Default dim | Languages | License | MTEB Eng. avg |
|---|---|---|---|---|---|---|---|
| Jina Embeddings v3 | Jina AI | 570M | 8,192 | 1,024 | 89 | CC BY-NC 4.0 | 65.52 |
| text-embedding-3-large | OpenAI | undisclosed | 8,191 | 3,072 | 100+ | proprietary API | 64.60 |
| text-embedding-3-small | OpenAI | undisclosed | 8,191 | 1,536 | 100+ | proprietary API | 62.30 |
| Embed Multilingual v3 | Cohere | undisclosed | 512 | 1,024 | 100+ | proprietary API | 64.01 |
| BGE-M3 | BAAI | 568M | 8,192 | 1,024 | 100+ | MIT | ~63.0 |
| multilingual-e5-large-instruct | Microsoft | 560M | 512 | 1,024 | 100 | MIT | 64.41 |
| e5-mistral-7b-instruct | Microsoft | 7.1B | 32,768 | 4,096 | English-focused | MIT | 66.63 |
| voyage-3 | Voyage AI | undisclosed | 32,000 | 1,024 | English | proprietary API | ~63 |
Jina v3 occupies a sweet spot of open weights, modest parameter count, long context, and competitive accuracy that none of the peer offerings replicate simultaneously. Voyage-3, released by Voyage AI shortly after Jina v3, takes a different path with a much longer context window of 32,000 tokens but remains proprietary and is primarily English-focused. BGE-M3 offers the unique feature of producing dense, sparse, and multi-vector outputs from a single model, useful for hybrid search. Cohere's Embed Multilingual v3 covers more languages but is limited to 512 tokens of input.
Jina Embeddings v2, released in late 2023 and early 2024, used a custom BERT-style backbone with ALiBi positional biases. The v2 family included jina-embeddings-v2-base-en (137M parameters), jina-embeddings-v2-small-en (33M parameters), and bilingual variants for German-English, Chinese-English, and Spanish-English. All v2 models supported 8,192-token context but produced 768-dimensional embeddings and had no task-specific adapters.
| Aspect | Jina v2 base | Jina Embeddings v3 |
|---|---|---|
| Backbone | Custom BERT with ALiBi | jina-XLM-RoBERTa with RoPE |
| Parameters | 137M | 570M |
| Output dimensions | 768 (fixed) | 1,024 (Matryoshka, truncatable to 32) |
| Languages | English (plus bilingual variants) | 89 |
| Task adapters | None | 5 LoRA adapters |
| Long context | 8,192 tokens via ALiBi | 8,192 tokens via RoPE |
| Attention | Standard | FlashAttention 2 |
| MTEB English avg | ~60.4 | 65.52 |
The move from ALiBi to RoPE was motivated by RoPE's better support for sequence length extrapolation and its compatibility with FlashAttention 2. The shift from a small custom encoder to a larger XLM-RoBERTa-derived backbone gave v3 its multilingual capability without requiring a separate model per language pair. The addition of LoRA adapters allowed task specialization without separate model releases.
Jina Embeddings v3 is distributed under a Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license, which permits free non-commercial use including academic research, prototyping, and personal experimentation. Commercial use requires a paid agreement with Jina AI, typically through Jina AI's hosted Embeddings API priced at $0.02 per million tokens at launch. The model weights are hosted at jinaai/jina-embeddings-v3 on Hugging Face and can be loaded via Hugging Face Transformers, sentence-transformers, the Jina AI SDK, vLLM, and several inference platforms including Azure Marketplace and AWS.
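For the hosted route, a request to the Embeddings API looks roughly like the following; the endpoint, field names, and response shape reflect launch-era documentation and should be treated as assumptions to verify against the current API reference.

```python
import os
import requests

resp = requests.post(
    "https://api.jina.ai/v1/embeddings",
    headers={"Authorization": f"Bearer {os.environ['JINA_API_KEY']}"},
    json={
        "model": "jina-embeddings-v3",
        "task": "text-matching",          # one of the five adapter names
        "dimensions": 256,                # server-side Matryoshka truncation
        "input": ["Berlin is the capital of Germany."],
    },
    timeout=30,
)
resp.raise_for_status()
embedding = resp.json()["data"][0]["embedding"]
```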
The license choice differs from the Apache 2.0 license used for the earlier Jina v2 family and reflects an industry trend toward source-available rather than fully open licenses for state-of-the-art embedding models. BGE-M3 from BAAI remains under the MIT license, while the multilingual-e5 models from Microsoft are also released under MIT. OpenAI, Cohere, and Voyage AI keep their best embedding weights closed and offer only API access.
Jina Embeddings v3 was widely covered in technical media at launch as a notable open-weight challenger to OpenAI and Cohere on retrieval benchmarks. The MTEB leaderboard placed it at the top of the sub-billion-parameter bracket for several months. By late 2024 it was integrated into popular RAG and search frameworks including LangChain, LlamaIndex, Haystack, Qdrant FastEmbed, Weaviate, Pinecone, and Milvus. Adoption was driven by the combination of open weights, multilingual coverage, long context, and the task adapter design that simplifies switching between retrieval, classification, and similarity workloads. The model has also been used as a base for community fine-tunes, including classification-focused distilled variants. Jina AI subsequently shipped follow-on models in the same family, including jina-clip-v2 for multimodal embeddings, which reuses the jina-XLM-RoBERTa text backbone, as well as newer rerankers such as jina-reranker-v3.
The technical report and subsequent community evaluations note several limitations. The CC BY-NC 4.0 license bars free commercial use, complicating adoption for organizations that prefer permissive licensing. Despite Matryoshka support, retrieval quality on the very shortest dimensions (32 or 64) degrades meaningfully on some tasks. The 89-language coverage is broad but uneven, with the report explicitly identifying 30 languages where performance is strongest; underrepresented languages may perform substantially worse. The model also inherits known issues of dense bi-encoder retrieval generally: it can be fooled by lexical-surface matches when meaning is subtle, and does not natively output sparse or multi-vector representations the way BGE-M3 does. Hybrid search pipelines must combine Jina v3 with separate BM25 or learned-sparse components to capture both lexical and semantic signals.
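One common way to fuse lexical and dense rankings without calibrating their scores is reciprocal rank fusion, sketched below as a generic helper (not part of any Jina AI library).

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k: int = 60) -> list:
    """Combine several rankings (e.g. BM25 and Jina v3 cosine similarity)
    into one list, rewarding documents that rank well in any of them.

    ranked_lists: iterable of document-id lists, each sorted best-first.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: fuse a BM25 ranking with a dense ranking of the same corpus.
fused = reciprocal_rank_fusion([["d3", "d1", "d7"], ["d1", "d3", "d9"]])
```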