BGE (BAAI General Embedding)
Last reviewed: Jun 3, 2026
Sources: 9 citations
Review status: Source-backed
Revision: v1 · 1,803 words
BGE (BAAI General Embedding) is a family of open text embedding and reranking models from the Beijing Academy of Artificial Intelligence (BAAI). The original BGE models were released in August 2023 and reached the top of the MTEB (English) and C-MTEB (Chinese) retrieval leaderboards at that time. The family has since grown to include the multilingual, multi-functional BGE-M3 model, a set of rerankers, multimodal variants, and large-language-model-based embedders, all distributed through BAAI's FlagEmbedding toolkit on GitHub and Hugging Face. [1][2][3]
A text embedding is a fixed-length vector of numbers that represents the meaning of a piece of text, so that texts with similar meaning sit close together in vector space. Embeddings are the backbone of semantic search, where a query is matched against a corpus by vector similarity rather than exact keywords, and they are a core component of retrieval-augmented generation (RAG), where relevant documents are fetched and fed to a large language model as context. In a typical pipeline, an embedding model encodes every document in a corpus once, the vectors are stored in a vector database, and at query time the nearest vectors are retrieved using cosine similarity or inner product. BGE models are designed to produce those embeddings, and the project also ships rerankers that re-score a shortlist of candidates for higher precision. [3]
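The retrieval step described above can be sketched in a few lines. This is a minimal illustration, not how a vector database is implemented: the three-dimensional vectors, the document ids, and the `top_k` helper are all made up for the example, and a real system would use an approximate nearest-neighbour index instead of a full scan.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=2):
    """Return (doc_id, score) pairs for the k documents closest to the query."""
    scored = [(doc_id, cosine_similarity(query_vec, vec))
              for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 3-dimensional "embeddings" (hypothetical values); real BGE vectors
# have 384 to 1024 dimensions and come from the model, not from hand.
doc_vecs = {
    "doc_cats": [0.9, 0.1, 0.0],
    "doc_dogs": [0.7, 0.3, 0.1],
    "doc_tax": [0.0, 0.2, 0.9],
}
query_vec = [1.0, 0.0, 0.0]

results = top_k(query_vec, doc_vecs, k=2)
for doc_id, score in results:
    print(doc_id, round(score, 3))
```

If the vectors are normalised to unit length beforehand, the inner product and the cosine similarity give the same ranking, which is why both appear as options in practice.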
BAAI released the first BGE models, bge-large-en and bge-large-zh, on 2 August 2023, followed by base- and small-scale variants on 5 August 2023. The v1.5 update arrived on 12 September 2023. The original release required an instruction prefix on queries for retrieval ("Represent this sentence for searching relevant passages:"), and the v1.5 models were issued mainly to give the embeddings a more reasonable similarity distribution and to make that instruction optional, so retrieval without the prefix loses only a small amount of quality. [3][4]
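The asymmetry of the instruction (applied to queries, never to passages) is easy to get wrong, so a small sketch may help. The helper name `prepare_inputs` and its `use_instruction` flag are hypothetical, and the trailing space after the colon is an assumption of this sketch; the instruction text itself is the one quoted above.

```python
# Instruction quoted in the text; the trailing space is an assumption.
QUERY_INSTRUCTION = "Represent this sentence for searching relevant passages: "


def prepare_inputs(queries, passages, use_instruction=True):
    """Prefix queries (never passages) with the retrieval instruction.

    With v1.5 models the prefix is optional, so use_instruction=False
    is a reasonable setting when a small quality loss is acceptable.
    """
    if use_instruction:
        queries = [QUERY_INSTRUCTION + q for q in queries]
    return list(queries), list(passages)


queries, passages = prepare_inputs(
    ["what is a text embedding?"],
    ["A text embedding is a fixed-length vector of numbers."],
)
print(queries[0])
print(passages[0])
```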
The models are encoder-only Transformers built on a BERT-style backbone, with a 512-token maximum input length. They come in three sizes that differ in embedding dimension and parameter count. [3][4]
| Model | Language | Parameters | Embedding dim | Max tokens | License |
|---|---|---|---|---|---|
| bge-large-en-v1.5 | English | 335M | 1024 | 512 | MIT |
| bge-base-en-v1.5 | English | 109M | 768 | 512 | MIT |
| bge-small-en-v1.5 | English | 33M | 384 | 512 | MIT |
| bge-large-zh-v1.5 | Chinese | 326M | 1024 | 512 | MIT |
| bge-base-zh-v1.5 | Chinese | 102M | 768 | 512 | MIT |
| bge-small-zh-v1.5 | Chinese | 24M | 384 | 512 | MIT |
The choice of dimension is a practical trade-off: the small model's 384-dimensional vectors are cheaper to store and search, while the large model's 1024-dimensional vectors give the best accuracy of the three sizes. The v1.5 English and Chinese models remain among the most downloaded embedding models on Hugging Face, with bge-large-en-v1.5 alone drawing well over ten million downloads per month. [1][4]
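The storage side of that trade-off is simple arithmetic. The sketch below assumes uncompressed float32 vectors (4 bytes per value) and a hypothetical corpus of ten million passages; real indexes add overhead and often use quantisation, so treat the figures as a lower bound on raw vector storage only.

```python
BYTES_PER_FLOAT32 = 4


def index_size_gib(num_vectors, dim, bytes_per_value=BYTES_PER_FLOAT32):
    """Raw size of num_vectors dense vectors of width dim, in GiB."""
    return num_vectors * dim * bytes_per_value / 2**30


num_passages = 10_000_000  # hypothetical corpus size

for name, dim in [("small", 384), ("base", 768), ("large", 1024)]:
    size = index_size_gib(num_passages, dim)
    print(f"bge-{name}: {dim} dims, {size:.1f} GiB")
```

The ratio between the large and small models is fixed at 1024 / 384, about 2.7 times the storage, regardless of corpus size.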
The training recipe for the original BGE models is described in the C-Pack paper, "C-Pack: Packed Resources For General Chinese Embeddings" (arXiv:2309.07597), by Shitao Xiao, Zheng Liu, Peitian Zhang, Niklas Muennighoff, Defu Lian, and Jian-Yun Nie, first submitted on 14 September 2023. C-Pack is a bundle of three resources: C-MTEB, a Chinese embedding benchmark covering 6 task types and 35 datasets; C-MTP, a large training corpus assembled from labeled and unlabeled Chinese text; and C-TEM, the family of embedding models. The same paper also released the English data and models. [2]
Training proceeds in three stages. First, the encoder is pre-trained with RetroMAE, a masked auto-encoding objective in which a heavily corrupted version of a text must be reconstructed from its embedding. This biases the model toward producing information-rich sentence vectors. Second, the model is fine-tuned with contrastive learning on large volumes of unlabeled text pairs, learning to pull matching pairs together and push unrelated (negative) texts apart. Third, it is fine-tuned on labeled data with task-specific instructions and mined hard negatives. The contrastive stages rely heavily on in-batch negatives, and the paper reports scaling the effective batch size up to 19,200 using gradient checkpointing and cross-device sharing of embeddings, with larger batches consistently improving results. The paper lists the released Chinese model sizes as small (24M parameters), base (102M), and large (326M). [2]
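The idea of in-batch negatives can be made concrete with a small loss function. This is a generic InfoNCE-style sketch, not the C-Pack training code: the temperature value of 0.05, the function names, and the two-dimensional vectors are assumptions chosen for illustration. For query i, passage i is the positive and every other passage in the batch acts as a negative, so a batch of size B yields B - 1 negatives per query at no extra encoding cost.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def in_batch_contrastive_loss(query_vecs, passage_vecs, temperature=0.05):
    """Mean cross-entropy loss where passage i is the positive for query i.

    All other passages in the batch serve as negatives for that query.
    """
    losses = []
    for i, q in enumerate(query_vecs):
        logits = [dot(q, p) / temperature for p in passage_vecs]
        # Numerically stable log-sum-exp over the whole batch.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(
            sum(math.exp(l - max_logit) for l in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


# Two queries whose positives are well aligned with them.
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]
# The same passages in the wrong order: each query's "positive" is unrelated.
mismatched = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = in_batch_contrastive_loss(queries, aligned)
loss_mismatched = in_batch_contrastive_loss(queries, mismatched)
print(loss_aligned, loss_mismatched)
```

Larger batches make the task harder because each positive competes with more negatives, which is one common explanation for why scaling the batch size helps.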
BGE-M3 is a later multilingual model introduced in the paper "M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation" (arXiv:2402.03216), by Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu, first submitted on 5 February 2024. The model card had been published a few days earlier, on 1 February 2024. The three M's in the name refer to its design goals: multi-functionality, multi-linguality, and multi-granularity. [3][5][6]
Multi-functionality means a single model produces all three of the common retrieval representations at once. Dense retrieval uses one 1024-dimensional vector per text. Sparse (lexical) retrieval assigns a learned weight to each token, giving a bag-of-words-style representation that plays the same role as BM25 term weights but is learned rather than computed from corpus statistics. Multi-vector retrieval, in the style of ColBERT, keeps one vector per token and scores a query against a document by a late-interaction sum of token-level similarities. The three scores can be combined with adjustable weights for hybrid retrieval. Multi-linguality means the model supports more than 100 working languages and achieves strong results on multilingual and cross-lingual tasks. Multi-granularity means it handles inputs from short sentences up to long documents of 8,192 tokens, a large jump from the 512-token limit of the v1.5 models. [3][5][6]
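The three scoring modes and their weighted combination can be sketched as follows. This is an illustration of the scoring arithmetic only, not BGE-M3's implementation: the function names, the default weights, and all vectors and token weights are hypothetical. The multi-vector score here sums, over query tokens, the best match among document tokens, in the ColBERT style; whether the real model normalises that sum is not covered by the text above.

```python
def dense_score(q_vec, d_vec):
    """Inner product of two single-vector representations."""
    return sum(x * y for x, y in zip(q_vec, d_vec))


def sparse_score(q_weights, d_weights):
    """Sum of weight products over tokens that occur in both texts."""
    return sum(w * d_weights[tok] for tok, w in q_weights.items()
               if tok in d_weights)


def multi_vector_score(q_token_vecs, d_token_vecs):
    """Late interaction: best document-token match per query token, summed."""
    return sum(max(dense_score(q, d) for d in d_token_vecs)
               for q in q_token_vecs)


def hybrid_score(dense, sparse, multi, weights=(1.0, 0.3, 1.0)):
    """Weighted sum of the three scores; the default weights are made up."""
    w_dense, w_sparse, w_multi = weights
    return w_dense * dense + w_sparse * sparse + w_multi * multi


# Made-up representations of one query and one document.
s_dense = dense_score([0.6, 0.8], [0.8, 0.6])
s_sparse = sparse_score({"bge": 0.5, "model": 0.2}, {"bge": 0.4, "paper": 0.9})
s_multi = multi_vector_score([[1.0, 0.0], [0.0, 1.0]],
                             [[1.0, 0.0], [0.5, 0.5]])

print(hybrid_score(s_dense, s_sparse, s_multi))
```

Setting a weight to zero switches that mode off, so the same function covers dense-only, sparse-only, and hybrid retrieval.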
BGE-M3 is built on the XLM-RoBERTa-large backbone, which gives it broad multilingual coverage out of the box, and it is trained with a self-knowledge distillation scheme in which the relevance scores from the dense, sparse, and multi-vector heads are combined into a single teacher signal that supervises each individual head. The paper evaluates on multilingual retrieval benchmarks including MIRACL and the cross-lingual MKQA, plus long-document retrieval, reporting state-of-the-art results among open models at the time. [5][6]
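A rough sketch of the self-knowledge distillation idea follows. It assumes that the teacher signal is the plain sum of the three heads' scores and that each head is supervised with a soft cross-entropy against the teacher's softmax distribution over candidates. Both choices are assumptions of this sketch, as are all the numbers; the paper's exact weighting and loss terms are not reproduced here.

```python
import math


def softmax(scores):
    """Turn raw scores over candidate passages into a probability distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_scores, student_scores):
    """Cross-entropy between the teacher's and a student's distributions."""
    p_teacher = softmax(teacher_scores)
    p_student = softmax(student_scores)
    return -sum(t * math.log(s) for t, s in zip(p_teacher, p_student))


# Scores of one query against three candidate passages (made-up values).
head_scores = {
    "dense": [2.0, 0.5, 0.1],
    "sparse": [1.5, 0.2, 0.4],
    "multi_vector": [2.2, 0.3, 0.2],
}

# Teacher: the heads' scores combined per candidate (assumed: unweighted sum).
teacher = [sum(per_candidate) for per_candidate in zip(*head_scores.values())]

# Each head is pulled toward the combined signal.
losses = {name: distillation_loss(teacher, scores)
          for name, scores in head_scores.items()}
for name, loss in losses.items():
    print(name, round(loss, 4))
```

Because the teacher is built from the model's own outputs, no separate teacher model is needed, which is the "self" in self-knowledge distillation.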
Rerankers in the BGE project are cross-encoders rather than embedding models. Instead of encoding a query and a document separately into vectors, a reranker takes the query and document together as a single input and outputs one relevance score, which is more accurate but too slow to run over a whole corpus. The standard pattern is to retrieve a candidate set with an embedding model and then reorder the top results with a reranker. [3][7]
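The retrieve-then-rerank pattern can be sketched independently of any particular model. In the example below both scoring functions are toy stand-ins (word overlap for the embedding stage, word overlap plus an exact-phrase bonus for the reranker), and the function names and parameters are hypothetical. What matters is the shape of the pipeline: the cheap scorer runs over the whole corpus, and the expensive one runs only on the shortlist.

```python
def toy_embed_score(query, doc):
    """Stand-in for vector similarity: number of shared words."""
    return len(set(query.lower().split()) & set(doc.lower().split()))


def toy_rerank_score(query, doc):
    """Stand-in for a cross-encoder: reads query and document together.

    Adds a bonus when the exact query phrase appears in the document,
    something a bag-of-words comparison cannot see.
    """
    bonus = 2 if query.lower() in doc.lower() else 0
    return toy_embed_score(query, doc) + bonus


def retrieve_then_rerank(query, corpus, embed_score, rerank_score,
                         shortlist_size=3, final_k=2):
    # Stage 1: cheap scoring over the whole corpus.
    shortlist = sorted(corpus, key=lambda d: embed_score(query, d),
                       reverse=True)[:shortlist_size]
    # Stage 2: expensive scoring on the shortlist only.
    return sorted(shortlist, key=lambda d: rerank_score(query, d),
                  reverse=True)[:final_k]


corpus = [
    "embedding models map text to vectors",
    "text embedding models are used for search",
    "models of text and embedding layers",
    "rerankers score a query and a document together",
]
ranked = retrieve_then_rerank("text embedding models", corpus,
                              toy_embed_score, toy_rerank_score)
for doc in ranked:
    print(doc)
```

The first stage cannot separate the three documents that share the same words; the second stage promotes the one that contains the query as a phrase.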
The project has shipped several generations of rerankers. The first, bge-reranker-base and bge-reranker-large, are multilingual cross-encoders. The v2 series, described in a separate paper (arXiv:2312.15503), broadens the options: bge-reranker-v2-m3 is a lightweight multilingual reranker built on bge-m3, bge-reranker-v2-gemma is built on Gemma-2B, and bge-reranker-v2-minicpm-layerwise is built on MiniCPM-2B and lets the user pick how many layers to run to trade speed against accuracy. [3][7]
The family also includes several specialized embedders. BGE-EN-ICL, described in "Making Text Embedders Few-Shot Learners" (arXiv:2409.15700) and released around September 2024, is built on Mistral-7B and accepts a few in-context examples in the query to adapt to new tasks without fine-tuning, reaching the top of the MTEB English leaderboard on release. BGE-Multilingual-Gemma2, released in July 2024, is an LLM-based multilingual embedder built on Gemma-2-9B. Visualized BGE (and the later BGE-VL line, released in March 2025) add image inputs for multimodal and hybrid image-text retrieval. LLM-Embedder is a single model tuned to serve several kinds of retrieval augmentation for LLMs. [3][8]
| Model | Type | Base / backbone | Notes |
|---|---|---|---|
| bge-*-en/zh-v1.5 | Dense embedding | BERT | 512 tokens, MIT |
| bge-m3 | Dense + sparse + multi-vector | XLM-RoBERTa-large | 100+ languages, 8192 tokens |
| bge-en-icl | LLM embedding | Mistral-7B | In-context (few-shot) learning |
| bge-multilingual-gemma2 | LLM embedding | Gemma-2-9B | Multilingual |
| bge-reranker-v2-m3 | Cross-encoder reranker | bge-m3 | Lightweight, multilingual |
| bge-reranker-v2-gemma | Cross-encoder reranker | Gemma-2B | LLM-based |
| Visualized BGE / BGE-VL | Multimodal embedding | BGE + vision | Image and text |
FlagEmbedding is BAAI's open-source toolkit for BGE, hosted at FlagOpen/FlagEmbedding on GitHub. It provides inference and fine-tuning code for both embedding and reranking models, and bundles convenience classes such as FlagModel for dense embedders, BGEM3FlagModel for the three BGE-M3 output modes, FlagReranker for cross-encoder rerankers, and FlagICLModel for the in-context BGE-EN-ICL model. The same project publishes the model weights as BAAI/bge-* repositories on Hugging Face and maintains documentation describing each model and its usage. [1][3]
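A typical use of the dense-embedding class might look like the sketch below. It assumes the FlagEmbedding package is installed and that the model weights can be downloaded; the wrapper function `encode_with_bge` is hypothetical, and the constructor arguments reflect common usage of `FlagModel` and should be checked against the installed version of the toolkit.

```python
def encode_with_bge(queries, passages, model_name="BAAI/bge-base-en-v1.5"):
    """Encode queries and passages, returning a query-by-passage score matrix.

    The import is done lazily so this module can be loaded without the
    FlagEmbedding package; calling the function requires it.
    """
    from FlagEmbedding import FlagModel

    model = FlagModel(
        model_name,
        # Applied to queries only; optional with the v1.5 models.
        query_instruction_for_retrieval=(
            "Represent this sentence for searching relevant passages: "),
        use_fp16=False,  # assumption: CPU-friendly default for a demo
    )
    query_vecs = model.encode_queries(queries)  # instruction is prepended here
    passage_vecs = model.encode(passages)       # no instruction for passages
    # Inner product of every query vector with every passage vector.
    return query_vecs @ passage_vecs.T


if __name__ == "__main__":
    scores = encode_with_bge(
        ["what is BGE?"],
        ["BGE is a family of embedding models from BAAI.",
         "The weather is mild today."],
    )
    print(scores)
```

BGEM3FlagModel and FlagReranker follow a similar load-then-score pattern, but their method names and return types differ, so the toolkit's documentation is the place to confirm the details.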
BGE's headline claims are tied to MTEB (the Massive Text Embedding Benchmark) for English and C-MTEB for Chinese. At the time of their 2023 release, bge-large-en and bge-large-zh ranked first on those leaderboards, and the C-Pack paper reported that the Chinese models beat all prior Chinese embeddings on C-MTEB by up to about 10 percent. The table below lists the overall average and selected per-category averages reported for bge-large-en-v1.5, together with the overall C-MTEB average for bge-large-zh-v1.5. [2][3][4]
| Benchmark (bge-large-en-v1.5) | Datasets | Score |
|---|---|---|
| MTEB average (English) | 56 | 64.23 |
| Retrieval | 15 | 54.29 |
| STS (semantic similarity) | 10 | 83.11 |
| Pair classification | 3 | 87.12 |
| Reranking | 4 | 60.03 |
| Clustering | 11 | 46.08 |
| C-MTEB average (Chinese, bge-large-zh-v1.5) | N/A | 63.96 |
These leaderboard positions reflect the state of the boards in 2023. The MTEB rankings have since moved on: by mid-2025 the top of the English and multilingual boards was held by newer LLM-based embedders such as NVIDIA's NV-Embed-v2, Google's Gemini Embedding, and Alibaba's Qwen3 Embedding family, which posts multilingual averages above 70. BAAI's own BGE-EN-ICL (around 71 on MTEB English) is part of that later wave. The v1.5 models nonetheless remain in heavy production use because they are small, fast, permissively licensed, and well supported, and BGE-M3 remains a common default for multilingual and hybrid retrieval. [9]
The core BGE models, the v1.5 English and Chinese embedders and BGE-M3, are released under the permissive MIT license and are free for commercial use, which is a large part of why they became defaults in open RAG stacks. Licensing on the newer LLM-based models varies and should be checked per model: bge-en-icl and bge-reranker-v2-m3 are distributed under Apache 2.0, while bge-multilingual-gemma2, being derived from Google's Gemma-2, carries the Gemma license rather than a fully open license. [3][4][7][8]