BGE (BAAI General Embedding)

Updated Jul 23, 2026

BGE (BAAI General Embedding) is a family of open-source text embedding and reranking models from the Beijing Academy of Artificial Intelligence (BAAI), first released in August 2023 and distributed through the FlagEmbedding project on GitHub and Hugging Face. At launch, bge-large-en and bge-large-zh ranked first on the MTEB (English) and C-MTEB (Chinese) leaderboards, respectively. The family has since grown to include BGE-M3, a multilingual, multi-functional model that supports dense, sparse, and multi-vector retrieval across more than 100 languages with inputs of up to 8,192 tokens, as well as the bge-reranker series, multimodal variants, and large-language-model-based embedders. The core models are released under the permissive MIT license and are free for commercial use, which helped make them common defaults in open retrieval-augmented generation (RAG) stacks. ^[1]^[2]^[3]^[5]

What is BGE used for?

A text embedding is a fixed-length vector of numbers that represents the meaning of a piece of text, so that texts with similar meaning sit close together in vector space. Embeddings are the backbone of semantic search, where a query is matched against a corpus by vector similarity rather than exact keywords, and they are a core component of retrieval-augmented generation (RAG), where relevant documents are fetched and fed to a large language model as context. In a typical pipeline, an embedding model encodes every document in a corpus once, the vectors are stored in a vector database, and at query time the nearest vectors are retrieved using cosine similarity or inner product. BGE models are designed to produce those embeddings, and the project also ships rerankers that re-score a shortlist of candidates for higher precision. ^[3]
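
The query-time step of that pipeline can be sketched in a few lines. This is a minimal illustration, not code from the BGE project: the three-dimensional vectors are made up to stand in for real embeddings, and the helper names `cosine` and `top_k` are hypothetical. A production system would use a vector index rather than a linear scan.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k=2):
    """Return (doc_id, score) pairs for the k most similar documents."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Toy vectors standing in for stored document embeddings (invented values).
doc_vecs = {
    "doc_cats": [0.9, 0.1, 0.0],
    "doc_dogs": [0.7, 0.3, 0.1],
    "doc_tax": [0.0, 0.2, 0.9],
}
query_vec = [1.0, 0.0, 0.0]

results = top_k(query_vec, doc_vecs, k=2)
for doc_id, score in results:
    print(doc_id, round(score, 3))
```

If the stored vectors are normalized to unit length, cosine similarity and inner product give the same ranking, which is why vector databases often offer both.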

What were the original BGE models?

BAAI released the first BGE models, bge-large-en and bge-large-zh, on 2 August 2023, followed by base- and small-scale variants on 5 August 2023. The v1.5 update arrived on 12 September 2023. The original release required an instruction prefix on queries for retrieval ("Represent this sentence for searching relevant passages:"), and the v1.5 models were issued mainly to give the embeddings a more reasonable similarity distribution and to make that instruction optional, so retrieval without the prefix loses only a small amount of quality. ^[3]^[4]
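
As a rough sketch of how the instruction is applied, the prefix goes on queries only and passages are encoded unchanged. The helper name `prepare_inputs` is hypothetical, and the trailing space after the colon is an assumption about formatting rather than something stated above.

```python
# Instruction text quoted in the article; the trailing space is an assumption.
QUERY_INSTRUCTION = "Represent this sentence for searching relevant passages: "

def prepare_inputs(queries, passages, use_instruction=True):
    """Prefix queries (never passages) with the retrieval instruction."""
    if use_instruction:
        queries = [QUERY_INSTRUCTION + q for q in queries]
    return list(queries), list(passages)

queries, passages = prepare_inputs(
    ["how do rerankers work"],
    ["A reranker scores a query and a document together."],
)
print(queries[0])
print(passages[0])
```

With the v1.5 models, calling the helper with `use_instruction=False` corresponds to the optional, prefix-free mode described above.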

The models are encoder-only Transformers built on a BERT-style backbone, with a 512-token maximum input length. They come in three sizes that differ in embedding dimension and parameter count. ^[3]^[4]

Model	Language	Parameters	Embedding dim	Max tokens	License
bge-large-en-v1.5	English	335M	1024	512	MIT
bge-base-en-v1.5	English	109M	768	512	MIT
bge-small-en-v1.5	English	33M	384	512	MIT
bge-large-zh-v1.5	Chinese	326M	1024	512	MIT
bge-base-zh-v1.5	Chinese	102M	768	512	MIT
bge-small-zh-v1.5	Chinese	24M	384	512	MIT

The choice of dimension is a practical trade-off: the small model's 384-dimensional vectors are cheaper to store and search, while the large model's 1024 dimensions give the best accuracy. The v1.5 English and Chinese models remain among the most downloaded embedding models on Hugging Face, with bge-large-en-v1.5 alone drawing well over ten million downloads per month. ^[1]^[4]
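
The storage side of that trade-off is simple arithmetic. The sketch below assumes uncompressed 32-bit floats and a hypothetical corpus of ten million vectors; real indexes add overhead and are often quantized, so treat the figures as a lower-bound estimate for raw vectors only.

```python
BYTES_PER_FLOAT32 = 4  # assumption: vectors stored as uncompressed float32

def index_size_gib(num_vectors, dim, bytes_per_value=BYTES_PER_FLOAT32):
    """Raw storage for num_vectors embeddings of the given dimension, in GiB."""
    return num_vectors * dim * bytes_per_value / 2**30

NUM_VECTORS = 10_000_000  # hypothetical corpus size
for name, dim in [("small", 384), ("base", 768), ("large", 1024)]:
    print(f"{name} ({dim} dims): {index_size_gib(NUM_VECTORS, dim):.1f} GiB")
```

Storage scales linearly with dimension, so the large model's vectors take about 2.7 times the space of the small model's for the same corpus.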

How were the BGE models trained?

The training recipe for the original BGE models is described in the C-Pack paper, "C-Pack: Packed Resources For General Chinese Embeddings" (arXiv:2309.07597), by Shitao Xiao, Zheng Liu, Peitian Zhang, Niklas Muennighoff, Defu Lian, and Jian-Yun Nie, first submitted on 14 September 2023. C-Pack is a bundle of three resources: C-MTEB, a Chinese embedding benchmark covering 6 task types and 35 datasets; C-MTP, a large training corpus assembled from labeled and unlabeled Chinese text; and C-TEM, the family of embedding models. The same paper also released the English data and models, with the English training set roughly twice the size of the Chinese data. ^[2]

Training proceeds in three stages. First, the encoder is pre-trained with RetroMAE, a masked auto-encoding objective in which a heavily corrupted version of a text must be reconstructed from its embedding, which biases the model toward producing information-rich sentence vectors. Second, the model is fine-tuned with contrastive learning on large volumes of unlabeled text pairs, learning to pull matching pairs together and push unrelated (negative) texts apart. Third, it is fine-tuned on labeled data with task-specific instructions and mined hard negatives. The contrastive stages rely heavily on in-batch negatives, and the paper reports scaling the effective batch size up to 19,200 using gradient checkpointing and cross-device sharing of embeddings, with larger batches consistently improving results. The released sizes are small (24M parameters), base (102M), and large (326M). The paper states that the C-Pack models "outperform all prior Chinese text embeddings on C-MTEB by up to +10% upon the time of the release." ^[2]
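
The role of in-batch negatives can be illustrated with a generic InfoNCE-style loss. This is a sketch of the general technique, not the paper's training code: the temperature value and the function name are assumptions, and real training operates on tensors with automatic differentiation.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def in_batch_contrastive_loss(query_vecs, passage_vecs, temperature=0.05):
    """Mean InfoNCE loss: passage i is the positive for query i, and every
    other passage in the batch serves as a negative."""
    losses = []
    for i, q in enumerate(query_vecs):
        logits = [dot(q, p) / temperature for p in passage_vecs]
        max_logit = max(logits)  # subtract the max for numerical stability
        log_denominator = max_logit + math.log(
            sum(math.exp(l - max_logit) for l in logits)
        )
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)

queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = in_batch_contrastive_loss(queries, [[1.0, 0.0], [0.0, 1.0]])
misaligned = in_batch_contrastive_loss(queries, [[0.0, 1.0], [1.0, 0.0]])
print(aligned, misaligned)
```

Each query is contrasted against every passage in the batch, so a batch of size B supplies B - 1 negatives per query at no extra encoding cost, which is one way to read the paper's emphasis on very large batches.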

What is BGE-M3?

BGE-M3 is a later multilingual model introduced in the paper "M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation" (arXiv:2402.03216), by Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu, first submitted on 5 February 2024. The model card was published on 1 February 2024. The name's three M's refer to its design goals, which the paper sets out directly: "It can simultaneously accomplish the three common retrieval functionalities: dense retrieval, multi-vector retrieval, and sparse retrieval," it "provides a uniform support for the semantic retrieval of more than 100 working languages," and it is "capable of processing inputs of different granularities, spanning from short sentences to long documents of up to 8,192 tokens." ^[3]^[5]^[6]

Multi-functionality means a single model produces all three of the common retrieval representations at once. Dense retrieval uses one 1024-dimensional vector per text. Sparse (lexical) retrieval assigns a learned weight to each token, giving a bag-of-words style representation comparable to BM25 but with learned rather than statistical term weights. Multi-vector retrieval, in the style of ColBERT, keeps one vector per token and scores a query against a document by late interaction: each query token is matched to its most similar document token, and those token-level similarities are aggregated into one score. The three scores can be combined with adjustable weights for hybrid retrieval. Multi-linguality means the model supports more than 100 working languages and achieves strong results on multilingual and cross-lingual tasks. Multi-granularity means it handles inputs from short sentences up to long documents of 8,192 tokens, a large jump from the 512-token limit of the v1.5 models. ^[3]^[5]^[6]
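
The three scoring modes and their weighted combination can be sketched as follows. The function names, the toy inputs, and the weights are illustrative assumptions, and the multi-vector score averages over query tokens rather than summing, which only rescales the score for a fixed query.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def dense_score(q_vec, d_vec):
    """Inner product of the two single-vector embeddings."""
    return dot(q_vec, d_vec)

def sparse_score(q_weights, d_weights):
    """Sum of weight products over tokens present in both texts."""
    return sum(w * d_weights[tok] for tok, w in q_weights.items() if tok in d_weights)

def multi_vector_score(q_token_vecs, d_token_vecs):
    """Late interaction: each query token takes its best-matching document
    token, and the per-token maxima are averaged."""
    best = [max(dot(q, d) for d in d_token_vecs) for q in q_token_vecs]
    return sum(best) / len(best)

def hybrid_score(dense, sparse, multi, weights=(1.0, 0.3, 1.0)):
    """Weighted sum of the three scores; the default weights are assumptions."""
    w_dense, w_sparse, w_multi = weights
    return w_dense * dense + w_sparse * sparse + w_multi * multi

d = dense_score([1.0, 0.0], [1.0, 0.0])
s = sparse_score({"cat": 0.5, "sat": 0.2}, {"cat": 0.4, "mat": 0.9})
m = multi_vector_score([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.5, 0.5]])
print(d, s, m, hybrid_score(d, s, m))
```

Only the token "cat" is shared in the sparse example, so unmatched tokens contribute nothing, which is the sense in which the sparse mode behaves like a lexical matcher.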

BGE-M3 is built on the XLM-RoBERTa-large backbone, which gives it broad multilingual coverage out of the box, and it is trained with a self-knowledge distillation scheme in which the relevance scores from the dense, sparse, and multi-vector heads are combined into a single teacher signal that supervises each individual head. The paper evaluates on multilingual retrieval benchmarks including MIRACL and the cross-lingual MKQA, plus long-document retrieval, reporting state-of-the-art results among open models at the time. BGE-M3 is distributed under the MIT license. ^[5]^[6]
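
One way to picture the distillation step is as a soft-label cross-entropy between each head and the combined score. The sketch below is an interpretation under stated assumptions, not the paper's implementation: the equal weights, the softmax over a single query's candidates, and the function names are all chosen for illustration.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_losses(dense, sparse, multi, weights=(1.0, 1.0, 1.0)):
    """Each argument holds one score per candidate passage for a single query.
    Returns each head's cross-entropy against the combined teacher distribution."""
    w1, w2, w3 = weights
    combined = [w1 * d + w2 * s + w3 * m for d, s, m in zip(dense, sparse, multi)]
    teacher = softmax(combined)

    def cross_entropy(student_scores):
        student = softmax(student_scores)
        return -sum(t * math.log(p) for t, p in zip(teacher, student))

    return {
        "dense": cross_entropy(dense),
        "sparse": cross_entropy(sparse),
        "multi": cross_entropy(multi),
    }

# Two candidates: the dense and sparse heads prefer the first, the
# multi-vector head prefers the second.
losses = distillation_losses(dense=[2.0, 0.0], sparse=[2.0, 0.0], multi=[0.0, 2.0])
print(losses)
```

The head that disagrees with the combined signal receives the larger loss, so training pulls it toward the consensus of all three.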

What are the BGE rerankers and later models?

Rerankers in the BGE project are cross-encoders rather than embedding models. Instead of encoding a query and a document separately into vectors, a reranker takes the query and document together as a single input and outputs one relevance score, which is more accurate but too slow to run over a whole corpus. The standard pattern is to retrieve a candidate set with an embedding model and then reorder the top results with a reranker. ^[3]^[7]
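
The retrieve-then-rerank pattern can be sketched with placeholder scorers. Nothing here calls a real model: word overlap stands in for an embedding model and an exact-phrase bonus stands in for a cross-encoder, and all function names are hypothetical.

```python
def retrieve_then_rerank(query, corpus, embed_score, rerank_score,
                         shortlist_size=3, final_k=2):
    """Stage 1 scores the whole corpus cheaply; stage 2 applies the
    expensive scorer to the shortlist only."""
    shortlist = sorted(corpus, key=lambda doc: embed_score(query, doc),
                       reverse=True)[:shortlist_size]
    return sorted(shortlist, key=lambda doc: rerank_score(query, doc),
                  reverse=True)[:final_k]

def toy_embed_score(query, doc):
    """Stand-in for embedding similarity: number of shared words."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

def toy_rerank_score(query, doc):
    """Stand-in for a cross-encoder: rewards an exact phrase match."""
    return toy_embed_score(query, doc) + (5 if query.lower() in doc.lower() else 0)

corpus = [
    "source code that is open",
    "open source embedding models",
    "models of the open ocean",
    "cooking with source ingredients",
]
top = retrieve_then_rerank("open source", corpus, toy_embed_score, toy_rerank_score)
print(top)
```

The first two documents tie in the first stage, and the second stage breaks the tie in favor of the exact phrase. The expensive scorer runs only `shortlist_size` times regardless of corpus size.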

The project ships several generations. The first rerankers, bge-reranker-base and bge-reranker-large, are cross-encoders aimed at Chinese and English text. The v2 series, described in a separate paper (arXiv:2312.15503), broadens the options: bge-reranker-v2-m3 is a lightweight multilingual reranker built on bge-m3, bge-reranker-v2-gemma is built on Gemma-2B, and bge-reranker-v2-minicpm-layerwise is built on MiniCPM-2B and lets the user pick how many layers to run to trade speed against accuracy. ^[3]^[7]

The family also includes several specialized embedders. BGE-EN-ICL, described in "Making Text Embedders Few-Shot Learners" (arXiv:2409.15700) and released around September 2024, is built on Mistral-7B and accepts a few in-context examples in the query to adapt to new tasks without fine-tuning, reaching the top of the MTEB English leaderboard on release. BGE-Multilingual-Gemma2, released in July 2024, is an LLM-based multilingual embedder built on Gemma-2-9B. Visualized BGE (and the later BGE-VL line, released in March 2025) add image inputs for multimodal and hybrid image-text retrieval. LLM-Embedder is a single model tuned to serve several kinds of retrieval augmentation for LLMs. ^[3]^[8]

Model	Type	Base / backbone	Notes
bge-*-en/zh-v1.5	Dense embedding	BERT	512 tokens, MIT
bge-m3	Dense + sparse + multi-vector	XLM-RoBERTa-large	100+ languages, 8192 tokens
bge-en-icl	LLM embedding	Mistral-7B	In-context (few-shot) learning
bge-multilingual-gemma2	LLM embedding	Gemma-2-9B	Multilingual
bge-reranker-v2-m3	Cross-encoder reranker	bge-m3	Lightweight, multilingual
bge-reranker-v2-gemma	Cross-encoder reranker	Gemma-2B	LLM-based
Visualized BGE / BGE-VL	Multimodal embedding	BGE + vision	Image and text

What is the FlagEmbedding library?

FlagEmbedding is the open-source toolkit, maintained at FlagOpen/FlagEmbedding on GitHub, that houses BGE and provides inference and fine-tuning code for both embedding and reranking models. It bundles convenience classes such as FlagModel for dense embedders, BGEM3FlagModel for the three BGE-M3 output modes, FlagReranker for cross-encoder rerankers, and FlagICLModel for the in-context BGE-EN-ICL model. The same project publishes the model weights as BAAI/bge-* repositories on Hugging Face and maintains documentation describing each model and its usage. ^[1]^[3]
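
A hedged sketch of dense retrieval with FlagModel follows. The constructor arguments and the `encode_queries` and `encode` methods reflect the project's README as best understood at the time of writing and may differ between releases; the helper names `load_dense_model` and `score_matrix` are hypothetical. The import is deferred so the scoring helper can be exercised without the library installed.

```python
QUERY_INSTRUCTION = "Represent this sentence for searching relevant passages: "

def load_dense_model(name="BAAI/bge-large-en-v1.5"):
    """Load a dense BGE embedder (downloads weights on first use)."""
    # Lazy import: FlagEmbedding is only required when a model is loaded.
    from FlagEmbedding import FlagModel
    return FlagModel(
        name,
        query_instruction_for_retrieval=QUERY_INSTRUCTION,
        use_fp16=True,  # assumption: half precision is acceptable for inference
    )

def score_matrix(model, queries, passages):
    """Return scores[i][j], the inner product of query i and passage j."""
    q_vecs = model.encode_queries(queries)  # instruction is added to queries only
    p_vecs = model.encode(passages)         # passages are encoded as-is
    return [
        [sum(float(a) * float(b) for a, b in zip(q, p)) for p in p_vecs]
        for q in q_vecs
    ]

# Usage (requires the FlagEmbedding package and network access):
#   model = load_dense_model()
#   scores = score_matrix(model, ["what is BGE?"], ["BGE is an embedding model."])
```

Any object exposing the same two methods can be passed to `score_matrix`, which makes it straightforward to swap in a stub for testing.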

How does BGE perform on MTEB?

BGE's headline claims are tied to MTEB (the Massive Text Embedding Benchmark) for English and C-MTEB for Chinese. At the time of their 2023 release, bge-large-en and bge-large-zh ranked first on those leaderboards, and the C-Pack paper reported that the Chinese models beat all prior Chinese embeddings on C-MTEB by up to about 10 percent. bge-large-en-v1.5 posts an MTEB English average of 64.23 across 56 datasets. The table below lists the per-task averages reported for bge-large-en-v1.5, with the C-MTEB average for bge-large-zh-v1.5 in the final row for comparison. ^[1]^[2]^[3]^[4]

Benchmark (bge-large-en-v1.5)	Datasets	Score
MTEB average (English)	56	64.23
Retrieval	15	54.29
STS (semantic similarity)	10	83.11
Pair classification	3	87.12
Reranking	4	60.03
Clustering	11	46.08
C-MTEB average (Chinese, bge-large-zh-v1.5)	N/A	63.96

These leaderboard positions describe 2023. The MTEB rankings have since moved on: by mid-2025 the top of the English and multilingual boards was held by newer LLM-based embedders such as NVIDIA's NV-Embed-v2, Google's Gemini Embedding, and Alibaba's Qwen3 Embedding family, which posts multilingual averages above 70. BAAI's own BGE-EN-ICL (around 71 on MTEB English) is part of that later wave. The v1.5 models nonetheless stay in heavy production use because they are small, fast, permissively licensed, and well-supported, and BGE-M3 remains a common default for multilingual and hybrid retrieval. ^[9]

Is BGE open source?

The core BGE models, the v1.5 English and Chinese embedders and BGE-M3, are released under the permissive MIT license and are free for commercial use, which is a large part of why they became defaults in open RAG stacks. Licensing on the newer LLM-based models varies and should be checked per model: bge-en-icl and bge-reranker-v2-m3 are distributed under Apache 2.0, while bge-multilingual-gemma2, being derived from Google's Gemma-2, carries the Gemma license rather than a fully open license. ^[3]^[4]^[7]^[8]

References

1. BAAI/bge-large-en-v1.5, Hugging Face model card. https://huggingface.co/BAAI/bge-large-en-v1.5
2. Shitao Xiao et al., "C-Pack: Packed Resources For General Chinese Embeddings," arXiv:2309.07597. https://arxiv.org/abs/2309.07597
3. FlagOpen/FlagEmbedding, GitHub repository. https://github.com/FlagOpen/FlagEmbedding
4. BAAI/bge-large-en, Hugging Face model card (original BGE and release dates). https://huggingface.co/BAAI/bge-large-en
5. Jianlv Chen et al., "M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation," arXiv:2402.03216. https://arxiv.org/abs/2402.03216
6. BAAI/bge-m3, Hugging Face model card. https://huggingface.co/BAAI/bge-m3
7. BAAI/bge-reranker-v2-m3, Hugging Face model card. https://huggingface.co/BAAI/bge-reranker-v2-m3
8. BAAI/bge-en-icl, Hugging Face model card; "Making Text Embedders Few-Shot Learners," arXiv:2409.15700. https://huggingface.co/BAAI/bge-en-icl
9. Qwen3 Embedding, Qwen team blog (MTEB multilingual leaderboard context, 2025). https://qwenlm.github.io/blog/qwen3-embedding/
