Jina Embeddings v3
Jina Embeddings v3 is a multilingual text embedding model released by Jina AI on September 18, 2024. With 570 million parameters, support for 89 languages, an 8,192 token context window, and a stack of task-specific Low-Rank Adaptation (LoRA) adapters, the model targets production retrieval, classification, clustering, and semantic similarity workloads. It also integrates Matryoshka Representation Learning, allowing users to truncate the 1,024-dimensional output to as few as 32 dimensions without retraining. The accompanying technical report, titled jina-embeddings-v3: Multilingual Embeddings With Task LoRA, was posted to arXiv under identifier 2409.10173.
At launch, Jina Embeddings v3 ranked as the top-performing model under 1 billion parameters on the Massive Text Embedding Benchmark (MTEB) English leaderboard, trailing only much larger 7-billion-parameter models overall, and led the MTEB multilingual track among models of comparable size. Jina AI positioned the release as evidence that compact, well-engineered open-weight encoders can outperform substantially larger proprietary embedding systems such as OpenAI's text-embedding-3 family and Cohere Embed Multilingual v3 on a wide variety of retrieval and similarity tasks. The model is distributed under a Creative Commons Attribution-NonCommercial 4.0 license through Hugging Face, with commercial use available through Jina AI's hosted Embeddings API.
The Jina Embeddings family began with v1 in mid-2023, an English-only T5-based encoder, followed in October 2023 by v2 models that extended context to 8,192 tokens using ALiBi positional biases. By mid-2024, the market for general-purpose text embeddings had become crowded. OpenAI had introduced text-embedding-3-small and text-embedding-3-large in January 2024, both using Matryoshka Representation Learning for truncatable outputs. Cohere offered Embed v3 in English and a multilingual variant covering more than 100 languages. The Beijing Academy of Artificial Intelligence released BGE-M3 in early 2024 with dense, sparse, and multi-vector outputs, while Microsoft Research released multilingual-e5-large-instruct.
Jina Embeddings v3 was developed by a Jina AI team led by Saba Sturua, Isabelle Mohr, and Mohammad Kalim Akram, with collaborators including Michael Günther, Andreas Koukounas, and Han Xiao. Rather than ship one large generalist model, the v3 design pairs a single 559M-parameter encoder backbone with a small library of swap-in adapters, each trained for a specific information retrieval or NLP use case.
The Jina Embeddings v3 backbone is a modified version of XLM-RoBERTa, Facebook AI's 2019 cross-lingual encoder built on the same masked language modeling objective as RoBERTa but trained on the CC-100 corpus across 100 languages. Jina AI refers to the modified backbone as jina-XLM-RoBERTa. The principal architectural changes are the replacement of XLM-RoBERTa's absolute learned position embeddings with Rotary Position Embeddings (RoPE), which enables sequence-length extrapolation beyond the pre-training window; the integration of FlashAttention 2 for memory-efficient attention at long sequence lengths; and a mean-pooling head used to derive a single fixed-size embedding from token-level outputs.
The encoder has 24 transformer layers and a hidden size of 1,024. Combined with the original 250,000-token SentencePiece vocabulary inherited from XLM-RoBERTa, the base model has approximately 559 million parameters. With all five LoRA adapters loaded simultaneously, the total rises to roughly 572 million, so the adapters contribute about 13 million parameters, less than 3 percent of the parameter budget, in line with the design philosophy of LoRA.
| Property | Value |
|---|---|
| Backbone | jina-XLM-RoBERTa (modified) |
| Transformer layers | 24 |
| Hidden size | 1,024 |
| Attention heads | 16 |
| Vocabulary size | 250,000 |
| Base parameters | 559 million |
| Parameters with all LoRA adapters | 572 million |
| Default output dimension | 1,024 |
| Matryoshka supported dimensions | 32 to 1,024 |
| Maximum context length | 8,192 tokens |
| Position encoding | Rotary Position Embeddings (RoPE) |
| RoPE base (training) | 10,000 |
| RoPE base (inference) | 20,000 |
| Attention implementation | FlashAttention 2 |
| Pooling | Mean pooling |
| Languages | 89 |
| Languages with optimized performance | 30 |
| License | CC BY-NC 4.0 |
| Release date | September 18, 2024 |
| arXiv identifier | 2409.10173 |
XLM-RoBERTa was originally trained with a 512-token maximum sequence length and absolute learned positional embeddings, which do not extrapolate well to longer inputs. To extend the receptive field, Jina AI swapped the absolute embeddings for Rotary Position Embeddings (RoPE), the same scheme used in models such as LLaMA and Mistral. During pre-training the team used a RoPE base of 10,000, then increased it to 20,000 at inference time, a technique known as RoPE rescaling that allows the model to process up to 8,192 tokens with minimal accuracy loss. Long-context inference is paired with FlashAttention 2, an I/O-aware attention implementation that drastically reduces GPU memory consumption at long sequence lengths.
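The effect of raising the rotary base can be seen directly in the frequency computation. The following is a minimal, generic sketch of RoPE base rescaling, not Jina AI's implementation; the per-head dimension (assumed here to be 1,024 / 16 = 64) follows the table above.

```python
import torch

def rope_inv_freq(head_dim: int, base: float) -> torch.Tensor:
    """Inverse rotary frequencies for one attention head.

    base=10_000 matches the value used during pre-training;
    base=20_000 is the rescaled value used at inference for 8,192-token inputs.
    """
    return 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))

def rotary_angles(positions: torch.Tensor, head_dim: int, base: float) -> torch.Tensor:
    """Angle matrix of shape (seq_len, head_dim / 2) used to rotate query/key pairs."""
    inv_freq = rope_inv_freq(head_dim, base)
    return torch.outer(positions.float(), inv_freq)

# A larger base slows the rotation of every frequency band, so position 8_191
# at base 20_000 stays within an angular range comparable to what the model
# saw for shorter positions at base 10_000 during training.
train_angles = rotary_angles(torch.arange(512), head_dim=64, base=10_000)
infer_angles = rotary_angles(torch.arange(8_192), head_dim=64, base=20_000)
```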
The defining innovation of Jina Embeddings v3 is its set of task-specific LoRA adapters, each conditioning the encoder for a class of information retrieval or NLP task. LoRA, introduced by Edward Hu and colleagues at Microsoft Research in 2021, trains small low-rank matrices added to the frozen weight matrices of a base model, producing task specialization at a tiny fraction of the cost of full fine-tuning, with adapter weights that can be swapped at runtime. At inference, the user selects one of five task names, listed in the table after the sketch below. Because all five adapters share the same backbone, a model server needs only one copy of the 559M-parameter encoder in memory, and adapter switching is essentially free.
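As a rough illustration of the mechanism (not the report's exact configuration; the rank, scaling factor, and choice of which weight matrices receive adapters are assumptions here), a LoRA-augmented linear layer adds a trainable low-rank update to a frozen base projection:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen dense layer plus a trainable low-rank update: W x + (alpha / r) * B A x.

    Illustrative only; the ranks, scaling, and adapter placement used in
    jina-embeddings-v3 follow the technical report, not this sketch.
    """
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():          # backbone weights stay frozen
            p.requires_grad = False
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)        # adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))
```

Because only the two small matrices are trained per task, five adapters add just a few percent to the backbone's parameter count, which is how v3 reaches roughly 572 million parameters with all adapters loaded.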
| Task name | Use case |
|---|---|
| retrieval.query | Query embeddings for asymmetric retrieval, paired with retrieval.passage |
| retrieval.passage | Document or passage embeddings, paired with retrieval.query |
| separation | Clustering, reranking, deduplication; spreads out distinct documents |
| classification | Classifier features, trained with classification labels |
| text-matching | Symmetric semantic similarity, paraphrase, STS; general-purpose default |
The retrieval.query and retrieval.passage adapters address the asymmetry of search workloads, where queries tend to be short and underspecified while passages are long and information-rich. Jina v3 follows a separate-encoder approach made cheap by LoRA, where one shared backbone with two adapters effectively yields two specialized encoders. The separation adapter is trained so that documents from different sources yield embeddings that are more spread out in vector space, improving clustering quality. The classification adapter is trained with labeled datasets so that documents of the same class cluster together. The text-matching adapter is the closest to a general-purpose default, optimized for symmetric semantic textual similarity.
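In practice the adapter is selected with a task argument at encode time. The sketch below roughly follows the usage pattern shown on the Hugging Face model card; exact keyword arguments may differ across library versions.

```python
from transformers import AutoModel

# trust_remote_code pulls in the custom encode() helper shipped with the model.
model = AutoModel.from_pretrained("jinaai/jina-embeddings-v3", trust_remote_code=True)

# Asymmetric retrieval: queries and passages go through different adapters
# but land in the same vector space.
query_emb = model.encode(
    ["How does late chunking work?"],
    task="retrieval.query",
)
passage_embs = model.encode(
    ["Late chunking pools token embeddings after a full-document forward pass."],
    task="retrieval.passage",
)
```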
Jina Embeddings v3 was trained in three sequential stages: pre-training of the backbone with a masked language modeling objective, contrastive pair training on more than one billion text pairs, and LoRA adapter training on task-specific labeled data.
The modified jina-XLM-RoBERTa backbone was pre-trained on multilingual web text drawn from the CulturaX corpus, a cleaned multilingual dataset covering 167 languages. The mixture used 89 languages with roughly 20 percent English. Pre-training proceeded in two phases: a short-context phase of 100,000 steps at 512 tokens, then a long-context phase of 60,000 steps at 8,192 tokens. The model was trained with a masked language modeling objective, following the original RoBERTa recipe without next-sentence prediction.
The second stage applied a bidirectional InfoNCE contrastive loss to more than one billion text pairs assembled from over 300 source datasets. Pairs spanned web queries and clicked passages, parallel multilingual sentences, paraphrase corpora, question-answer pairs, and synthetic translations. A heuristic filter removed pairs with 80 percent or more lexical overlap to focus the model on semantic rather than surface-level matching. Pair training was conducted first on short sequences and then on longer ones to preserve the long-context behavior introduced during pre-training.
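A minimal PyTorch sketch of a bidirectional InfoNCE loss with in-batch negatives is shown below; the temperature value is an illustrative assumption, not a figure from the report.

```python
import torch
import torch.nn.functional as F

def bidirectional_infonce(q: torch.Tensor, p: torch.Tensor,
                          temperature: float = 0.05) -> torch.Tensor:
    """InfoNCE with in-batch negatives, averaged over both directions.

    q, p: (batch, dim) embeddings of paired texts; q[i] and p[i] form a positive
    pair, and every other row in the batch serves as a negative.
    """
    q = F.normalize(q, dim=-1)
    p = F.normalize(p, dim=-1)
    logits = q @ p.T / temperature                      # (batch, batch) cosine similarities
    labels = torch.arange(q.size(0), device=q.device)   # positives sit on the diagonal
    loss_q2p = F.cross_entropy(logits, labels)          # match each query to its passage
    loss_p2q = F.cross_entropy(logits.T, labels)        # and each passage to its query
    return 0.5 * (loss_q2p + loss_p2q)
```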
After pair training, the team froze the backbone and trained each LoRA adapter on a task-specific dataset. Retrieval adapters were trained on labeled query-document pairs with hard negatives. The text-matching adapter was trained on semantic textual similarity (STS) datasets and paraphrase pairs. The classification adapter was trained on a curated mixture of classification datasets. The separation adapter was trained with a loss that increases the distance between unrelated documents from distinct domains. In addition, the team generated synthetic failure-case data targeting four known retrieval pathologies: misleading syntactic similarity, named-entity misinterpretation, polar question handling (preserving negations), and a preference for low-quality documents that merely repeat the query.
| Stage | Devices | Steps | Batch size | Sequence length | Learning rate |
|---|---|---|---|---|---|
| Pre-training (short) | 8 | 100,000 | 128×8 | 512 | 1×10⁻⁴ |
| Pre-training (long) | 8 | 60,000 | 8×8 | 8,192 | 5×10⁻⁵ |
| Pair training (short) | 8 | 60,000 | 8×2,048 | 192 | 3×10⁻⁵ |
| Pair training (long) | 2 | 50,000 | 2×512 | 1,024 | 2×10⁻⁵ |
| Retrieval adapters | 1 | 20,000 | 128 | 512 | 5×10⁻⁴ |
| Text matching | 1 | 15,000 | 256 | 192 | 1×10⁻⁴ |
| Classification | 1 | 11,500 | 256 | 192 | 5×10⁻⁴ |
| Separation | 1 | 12,500 | 512 | 192 | 5×10⁻⁵ |
The text-matching adapter is trained with a Matryoshka Representation Learning loss, a technique introduced in 2022 by Aditya Kusupati and colleagues that allows a single embedding to be truncated to a shorter prefix while remaining useful. Conventionally, an embedding model produces a fixed-length output, and reducing dimensionality afterward through PCA or random projection sacrifices accuracy. Matryoshka training avoids this by adding a sum of contrastive losses computed at nested prefix lengths.
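A simplified sketch of such a nested loss follows; the prefix lengths and temperature are illustrative assumptions, and the report's exact formulation may weight the prefixes differently.

```python
import torch
import torch.nn.functional as F

def matryoshka_contrastive_loss(q: torch.Tensor, p: torch.Tensor,
                                dims=(32, 64, 128, 256, 512, 1024),
                                temperature: float = 0.05) -> torch.Tensor:
    """Average of contrastive losses computed over nested embedding prefixes.

    Each prefix q[:, :d] is re-normalized and scored as if it were the full
    embedding, so every prefix length learns to be independently usable.
    """
    labels = torch.arange(q.size(0), device=q.device)
    total = torch.zeros((), device=q.device)
    for d in dims:
        q_d = F.normalize(q[:, :d], dim=-1)
        p_d = F.normalize(p[:, :d], dim=-1)
        logits = q_d @ p_d.T / temperature
        total = total + F.cross_entropy(logits, labels)
    return total / len(dims)
```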
Jina Embeddings v3 outputs 1,024-dimensional vectors by default but is trained to support truncation to 32, 64, 128, 256, 512, or 768. According to the technical report, retrieval scores degrade only marginally when truncating from 1,024 to 512 dimensions and remain useful at 64 to 128 dimensions for many tasks. This places Jina Embeddings v3 in the same family as OpenAI's text-embedding-3-large (truncatable between 256 and 3,072 dimensions), Nomic Embed Text v1.5, and other Matryoshka-trained encoders.
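On the consumption side, truncation amounts to slicing off the leading dimensions and re-normalizing before cosine-similarity search. A minimal sketch with stand-in data (the helper name and random vectors are illustrative):

```python
import numpy as np

def truncate_embeddings(embeddings: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` Matryoshka dimensions and re-normalize for cosine search."""
    truncated = embeddings[:, :dim]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    return truncated / np.clip(norms, 1e-12, None)

# Stand-in for model output: four 1,024-dimensional vectors.
rng = np.random.default_rng(0)
full_embeddings = rng.standard_normal((4, 1024))

# Shrink to 128 dimensions before writing to a vector index.
small = truncate_embeddings(full_embeddings, 128)
```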
A notable inference-time technique introduced alongside v3 is called late chunking. Traditional embedding-based retrieval-augmented generation (RAG) pipelines split a document into chunks of, say, 256 or 512 tokens, then embed each chunk independently. The resulting chunk embeddings have no awareness of context outside the chunk boundaries (for example, a pronoun whose antecedent appears in an earlier chunk), which hurts retrieval quality. Late chunking instead feeds the entire long document through the encoder in a single forward pass to obtain contextualized token embeddings, then performs the chunking afterward by mean-pooling the token vectors within each chunk, so every chunk embedding reflects information from the full document.
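The pooling step can be expressed compactly. The sketch below assumes a Hugging Face encoder whose forward pass returns last_hidden_state and uses fixed-size token spans as chunk boundaries; Jina AI's reference implementation aligns boundaries to natural text units and handles attention masks more carefully.

```python
import torch

def late_chunk(model, tokenizer, document: str, chunk_tokens: int = 256) -> torch.Tensor:
    """Late-chunking sketch: one forward pass over the whole document,
    then mean-pooling of contextualized token states within each chunk."""
    inputs = tokenizer(document, return_tensors="pt", truncation=True, max_length=8192)
    with torch.no_grad():
        token_states = model(**inputs).last_hidden_state[0]   # (seq_len, hidden)
    spans = token_states.split(chunk_tokens)                   # spans that already saw full context
    return torch.stack([span.mean(dim=0) for span in spans])   # one embedding per chunk
```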
Jina Embeddings v3 was evaluated on the Massive Text Embedding Benchmark (MTEB), the de facto standard for general-purpose embedding evaluation introduced by Niklas Muennighoff and colleagues in 2022. MTEB covers eight task categories: classification, clustering, pair classification, reranking, retrieval, sentence similarity (STS), summarization, and bitext mining, spanning more than 50 datasets and 100 languages.
| Task | Jina Embeddings v3 | OpenAI text-embedding-3-large | Cohere Embed Multilingual v3 | multilingual-e5-large-instruct |
|---|---|---|---|---|
| Classification (accuracy) | 82.58 | 75.45 | 77.71 | 76.81 |
| Sentence similarity (STS) | 85.80 | 81.73 | 81.42 | 84.78 |
| Retrieval (nDCG@10) | 53.87 | 55.44 | 55.39 | 52.50 |
| Clustering (V-measure) | 45.27 | 49.01 | 47.46 | 50.86 |
| Reranking | 58.77 | 59.16 | 58.31 | 56.36 |
| Pair classification | 85.80 | 87.96 | 86.05 | 86.46 |
| Summarization | 30.91 | 30.85 | 30.99 | 31.61 |
| Overall English average | 65.52 | 64.60 | 64.01 | 64.41 |
The English average of 65.52 placed Jina Embeddings v3 ahead of OpenAI's text-embedding-3-large (64.60) and Cohere's Embed Multilingual v3 (64.01) at launch, despite Jina v3's comparatively small 570-million-parameter footprint.
On the multilingual MTEB tasks, Jina v3 averaged 64.44, narrowly ahead of multilingual-e5-large-instruct at 64.25. On the LongEmbed long-document retrieval benchmark, Jina v3 averaged 70.39 nDCG@10 across six tasks, compared with 56.56 for BGE-M3, 60.99 for jina-embeddings-v2-base-en, and lower scores for shorter-context proprietary models that cannot accept inputs of 8,192 tokens or more.
| LongEmbed task | Jina v3 nDCG@10 |
|---|---|
| NarrativeQA | 33.32 |
| Passkey retrieval | 100.00 |
| WikiQA | 72.46 |
| Needle retrieval | 92.00 |
| QMSum | 36.05 |
| 2WikiMultihopQA | 88.50 |
| Average | 70.39 |
The report acknowledges that the much larger 7-billion-parameter e5-mistral-7b-instruct model still outscores Jina v3 on overall English MTEB at 66.63 versus 65.52, but at more than 12 times the parameter count and with a 4,096-dimensional output that makes downstream serving more expensive.
| Model | Vendor | Parameters | Max context | Default dim | Languages | License | MTEB Eng. avg |
|---|---|---|---|---|---|---|---|
| Jina Embeddings v3 | Jina AI | 570M | 8,192 | 1,024 | 89 | CC BY-NC 4.0 | 65.52 |
| text-embedding-3-large | OpenAI | undisclosed | 8,191 | 3,072 | 100+ | proprietary API | 64.60 |
| text-embedding-3-small | OpenAI | undisclosed | 8,191 | 1,536 | 100+ | proprietary API | 62.30 |
| Embed Multilingual v3 | Cohere | undisclosed | 512 | 1,024 | 100+ | proprietary API | 64.01 |
| BGE-M3 | BAAI | 568M | 8,192 | 1,024 | 100+ | MIT | ~63.0 |
| multilingual-e5-large-instruct | Microsoft | 560M | 512 | 1,024 | 100 | MIT | 64.41 |
| e5-mistral-7b-instruct | Microsoft | 7.1B | 32,768 | 4,096 | English-focused | MIT | 66.63 |
| voyage-3 | Voyage AI | undisclosed | 32,000 | 1,024 | English | proprietary API | ~63 |
Jina v3 occupies a sweet spot of open weights, modest parameter count, long context, and competitive accuracy that none of the peer offerings replicate simultaneously. Voyage-3, released by Voyage AI shortly after Jina v3, takes a different path with a much longer context window of 32,000 tokens but remains proprietary and is primarily English-focused. BGE-M3 offers the unique feature of producing dense, sparse, and multi-vector outputs from a single model, useful for hybrid search. Cohere's Embed Multilingual v3 covers more languages but is limited to 512 tokens of input.
Jina Embeddings v2, released in late 2023 and early 2024, used a custom BERT-style backbone with ALiBi positional biases. The v2 family included jina-embeddings-v2-base-en (137M parameters), jina-embeddings-v2-small-en (33M parameters), and bilingual variants for German-English, Chinese-English, and Spanish-English. All v2 models supported 8,192-token context but produced 768-dimensional embeddings and had no task-specific adapters.
| Aspect | Jina v2 base | Jina Embeddings v3 |
|---|---|---|
| Backbone | Custom BERT with ALiBi | jina-XLM-RoBERTa with RoPE |
| Parameters | 137M | 570M |
| Output dimensions | 768 (fixed) | 1,024 (Matryoshka, truncatable to 32) |
| Languages | English (plus bilingual variants) | 89 |
| Task adapters | None | 5 LoRA adapters |
| Long context | 8,192 tokens via ALiBi | 8,192 tokens via RoPE |
| Attention | Standard | FlashAttention 2 |
| MTEB English avg | ~60.4 | 65.52 |
The move from ALiBi to RoPE was motivated by RoPE's better support for sequence length extrapolation and its compatibility with FlashAttention 2. The shift from a small custom encoder to a larger XLM-RoBERTa-derived backbone gave v3 its multilingual capability without requiring a separate model per language pair. The addition of LoRA adapters allowed task specialization without separate model releases.
Jina Embeddings v3 is distributed under a Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license, which permits free non-commercial use including academic research, prototyping, and personal experimentation. Commercial use requires a paid agreement with Jina AI, typically through Jina AI's hosted Embeddings API priced at $0.02 per million tokens at launch. The model weights are hosted at jinaai/jina-embeddings-v3 on Hugging Face and can be loaded via Hugging Face Transformers, sentence-transformers, the Jina AI SDK, vLLM, and several inference platforms including Azure Marketplace and AWS.
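For the hosted route, a request to the Embeddings API looks roughly like the following; the endpoint, field names, and response shape reflect launch-era documentation and should be treated as assumptions to verify against the current API reference.

```python
import os
import requests

resp = requests.post(
    "https://api.jina.ai/v1/embeddings",
    headers={"Authorization": f"Bearer {os.environ['JINA_API_KEY']}"},
    json={
        "model": "jina-embeddings-v3",
        "task": "text-matching",          # one of the five adapter names
        "dimensions": 256,                # server-side Matryoshka truncation
        "input": ["Berlin is the capital of Germany."],
    },
    timeout=30,
)
resp.raise_for_status()
embedding = resp.json()["data"][0]["embedding"]
```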
The license choice differs from the Apache 2.0 license used for the earlier Jina v2 family and reflects an industry trend toward source-available rather than fully open licenses for state-of-the-art embedding models. BGE-M3 from BAAI remains under the MIT license, while the multilingual-e5 models from Microsoft are also released under MIT. OpenAI, Cohere, and Voyage AI keep their best embedding weights closed and offer only API access.
Jina Embeddings v3 was widely covered in technical media at launch as a notable open-weight challenger to OpenAI and Cohere on retrieval benchmarks. The MTEB leaderboard placed it at the top of the sub-billion-parameter bracket for several months. By late 2024 it was integrated into popular RAG and search frameworks including LangChain, LlamaIndex, Haystack, Qdrant FastEmbed, Weaviate, Pinecone, and Milvus. Adoption was driven by the combination of open weights, multilingual coverage, long context, and the task adapter design that simplifies switching between retrieval, classification, and similarity workloads. The model has also been used as a base for community fine-tunes, including classification-focused distilled variants. Jina AI subsequently shipped follow-on models in the same family, including jina-clip-v2 for multimodal embeddings, which reuses the jina-XLM-RoBERTa text backbone, as well as newer rerankers such as jina-reranker-v3.
The technical report and subsequent community evaluations note several limitations. The CC BY-NC 4.0 license bars free commercial use, complicating adoption for organizations that prefer permissive licensing. Despite Matryoshka support, retrieval quality on the very shortest dimensions (32 or 64) degrades meaningfully on some tasks. The 89-language coverage is broad but uneven, with the report explicitly identifying 30 languages where performance is strongest; underrepresented languages may perform substantially worse. The model also inherits known issues of dense bi-encoder retrieval generally: it can be fooled by lexical-surface matches when meaning is subtle, and does not natively output sparse or multi-vector representations the way BGE-M3 does. Hybrid search pipelines must combine Jina v3 with separate BM25 or learned-sparse components to capture both lexical and semantic signals.
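One common way to fuse lexical and dense rankings without calibrating their scores is reciprocal rank fusion, sketched below as a generic helper (not part of any Jina AI library).

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k: int = 60) -> list:
    """Combine several rankings (e.g. BM25 and Jina v3 cosine similarity)
    into one list, rewarding documents that rank well in any of them.

    ranked_lists: iterable of document-id lists, each sorted best-first.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: fuse a BM25 ranking with a dense ranking of the same corpus.
fused = reciprocal_rank_fusion([["d3", "d1", "d7"], ["d1", "d3", "d9"]])
```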