sentence-transformers/all-mpnet-base-v2
| Field | Value |
|---|---|
| Name | all-mpnet-base-v2 |
| User / Organization | sentence-transformers |
| Type | Model |
| Task | Sentence Similarity, Feature Extraction |
| Library | PyTorch, Sentence Transformers |
| Base model | microsoft/mpnet-base |
| Architecture | MPNet (12-layer transformer encoder) |
| Embedding dimension | 768 |
| Max sequence length | 384 word pieces |
| Parameters | ~110 million |
| License | Apache 2.0 |
| Released | August 2021 (Flax community sprint) |
| Paper | arxiv:1908.10084 (Sentence-BERT), arxiv:2004.09297 (MPNet) |
sentence-transformers/all-mpnet-base-v2 is a sentence-embedding model that maps English sentences and short paragraphs to a 768-dimensional dense vector. It is one of the flagship general-purpose models distributed through the Sentence Transformers library and one of the most downloaded text embedding models on Hugging Face, with tens of millions of downloads per month as of 2026.
The model is a fine-tuned version of Microsoft's mpnet-base checkpoint. Microsoft released MPNet at NeurIPS 2020 as a pre-training objective that combines masked and permuted language modeling. The fine-tuning that produced all-mpnet-base-v2 happened during the Hugging Face Flax/JAX community sprint of summer 2021, in a project titled "Train the Best Sentence Embedding Model Ever with 1B Training Pairs." The team trained the model with a contrastive objective on roughly 1.17 billion sentence pairs collected from more than 30 datasets, using seven TPU v3-8s donated by Google.
The v2 suffix distinguishes this checkpoint from earlier versions trained on smaller mixtures of data. It quickly became the default reference model for general-purpose English embeddings and has held that role through 2026, even as larger models from BGE, GTE, NV-Embed, and Qwen now rank higher on the MTEB leaderboard.
| Field | Value |
|---|---|
| Hugging Face ID | sentence-transformers/all-mpnet-base-v2 |
| Base checkpoint | microsoft/mpnet-base |
| Architecture | MPNet encoder |
| Pooling | Mean pooling over token outputs (with attention mask) |
| Output | 768-dimensional L2-normalized vector |
| Library | Sentence Transformers, transformers |
| Frameworks | PyTorch (primary), JAX/Flax (training), ONNX, Core ML, TensorFlow.js (community ports) |
| License | Apache 2.0 |
| Project lead during training | Nils Reimers (then UKP Lab) |
| Current maintainer | Tom Aarsen at Hugging Face |
Nils Reimers created the Sentence Transformers project in 2019 at the Ubiquitous Knowledge Processing (UKP) Lab at TU Darmstadt, under Iryna Gurevych. He led the 2021 sprint that produced the all-* family. In late 2023 Tom Aarsen took over maintenance, and in 2025 the project officially moved from UKP Lab to Hugging Face.
The encoder is identical to microsoft/mpnet-base and follows the standard BERT-base shape. Architecture values come from the published config.json on Hugging Face.
| Component | Value |
|---|---|
| Transformer layers | 12 |
| Hidden size | 768 |
| Attention heads | 12 |
| Feed-forward (intermediate) size | 3072 |
| Max position embeddings | 514 |
| Vocabulary size | 30,527 word pieces |
| Tokenizer | MPNet tokenizer (WordPiece, uncased) |
| Total parameters | ~110 million |
| Output token dim | 768 |
| Sentence vector dim | 768 |
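Since these values come straight from the checkpoint's config.json, they can be read back programmatically. A minimal sketch using the transformers AutoConfig loader; the printed numbers should match the table above:

```python
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("sentence-transformers/all-mpnet-base-v2")
print(cfg.num_hidden_layers)        # 12 transformer layers
print(cfg.hidden_size)              # 768
print(cfg.num_attention_heads)      # 12
print(cfg.intermediate_size)        # 3072
print(cfg.max_position_embeddings)  # 514
print(cfg.vocab_size)               # 30527
```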
During inference, sequences are tokenized with the MPNet tokenizer, padded or truncated to a maximum of 384 word pieces, and run through the 12 encoder blocks. The sentence embedding is the mean of the contextualized token embeddings, weighted by the attention mask so that padding tokens are excluded. The result is then L2-normalized so that dot product equals cosine similarity.
The choice of mean pooling rather than the [CLS] token follows the original Sentence-BERT paper by Reimers and Gurevych, which found that mean pooling produced better sentence-level representations than the classification token used in BERT fine-tuning.
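The pooling and normalization steps are stored as part of the saved SentenceTransformer pipeline rather than something the caller has to implement. A quick sketch, assuming the checkpoint ships with its pooling and normalization modules (as the model card's usage suggests), that confirms the truncation limit and the module stack:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
print(model.max_seq_length)  # 384: inputs longer than this many word pieces are truncated
print(model)                 # module list, expected: Transformer -> Pooling (mean) -> Normalize
```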
The base model uses the MPNet objective from Song et al., "MPNet: Masked and Permuted Pre-training for Language Understanding" (NeurIPS 2020). MPNet sits between BERT and XLNet. It uses permuted language modeling like XLNet so the model learns dependencies between predicted tokens, and it feeds full position information into the encoder so the model always sees the length of the sentence (which XLNet does not). This combination outperformed BERT-base, RoBERTa-base, and XLNet-base on GLUE and SQuAD when normalized for parameter count.
all-mpnet-base-v2 does not change this objective. It only adds a contrastive fine-tuning stage on top.
| Item | Value |
|---|---|
| Hardware | 7 TPU v3-8s (Google Cloud) |
| Framework | JAX/Flax |
| Optimizer | AdamW, learning rate 2e-5 |
| Warmup | 500 steps, linear |
| Steps | 100,000 |
| Batch size | 1,024 sentence pairs (128 per TPU core) |
| Sequence length during training | 128 word pieces |
| Loss | Cross-entropy over scaled cosine similarity (Multiple Negatives Ranking Loss) |
Training ran during the Hugging Face Flax/JAX community sprint in July and August 2021. The TPUs were donated by Google's Cloud team, and the codebase was released by the flax-sentence-embeddings organization on the Hub.
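The original run used JAX/Flax on TPUs and is not reproduced here; the sketch below only mirrors the shape of the recipe in PyTorch with the sentence-transformers training API, reusing the hyperparameters from the table (learning rate 2e-5, 500 warmup steps, MNRL with scale 20) on a toy dataset:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Start from the Microsoft base checkpoint; sentence-transformers adds mean pooling on top.
model = SentenceTransformer("microsoft/mpnet-base")
model.max_seq_length = 128  # training used sequences of 128 word pieces

# Toy stand-in for the ~1.17 billion (anchor, positive) pairs used in the sprint.
train_examples = [
    InputExample(texts=["how do I reset my password", "steps to reset a forgotten password"]),
    InputExample(texts=["what is the capital of france", "paris is the capital of france"]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)  # real batch size: 1,024
train_loss = losses.MultipleNegativesRankingLoss(model, scale=20.0)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=500,               # linear warmup, as in the table above
    optimizer_params={"lr": 2e-5},  # AdamW at 2e-5
)
```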
The team used Multiple Negatives Ranking Loss (also called InfoNCE or NTXent) over scaled cosine similarity:
```
loss = -1/n * sum_i log( exp(C * cos(a_i, p_i)) / sum_j exp(C * cos(a_i, p_j)) )
```
For each anchor a_i the matched positive p_i must score higher than every other sentence in the batch. The scale factor C = 20 (an inverse temperature) sharpens the distribution. Because there is no explicit negative mining, all other items in the batch act as in-batch negatives, which is why the team used a batch size of 1,024. This contrastive recipe is the same one that powers most modern dual-encoder retrievers.
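Written out as code, the loss above is a cross-entropy over a batch-by-batch matrix of scaled cosine similarities. A minimal PyTorch sketch (not the sprint's training code):

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(anchor_emb, positive_emb, scale=20.0):
    # anchor_emb, positive_emb: (batch, dim) tensors; row i of each forms a matched pair.
    a = F.normalize(anchor_emb, p=2, dim=1)
    p = F.normalize(positive_emb, p=2, dim=1)
    scores = scale * a @ p.T                     # scores[i, j] = C * cos(a_i, p_j)
    labels = torch.arange(scores.size(0), device=scores.device)
    # Cross-entropy pushes the diagonal entry (the true pair) above every in-batch negative.
    return F.cross_entropy(scores, labels)
```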
The team mixed roughly 1.17 billion sentence pairs from over thirty sources. The largest contributors are listed below.
| Dataset | Pairs | Type |
|---|---|---|
| Reddit comments (2015 to 2018) | 726,484,430 | Conversational |
| S2ORC citation pairs (abstracts) | 116,288,806 | Scientific |
| WikiAnswers duplicate questions | 77,427,422 | Question paraphrase |
| PAQ (question, answer) | 64,371,441 | Open-domain QA |
| S2ORC citation pairs (titles) | 52,603,982 | Scientific |
| S2ORC (title, abstract) | 41,769,185 | Scientific |
| Stack Exchange (title, body) | 25,316,456 | Technical Q&A |
| Stack Exchange (title+body, answer) | 21,396,559 | Technical Q&A |
| Stack Exchange (title, answer) | 21,396,559 | Technical Q&A |
| MS MARCO triplets | 9,144,553 | Web search |
| GOOAQ | 3,012,496 | Web Q&A |
| Yahoo Answers (title, answer) | 1,198,260 | Community Q&A |
| CodeSearchNet | 1,151,414 | Code, docstring |
| COCO image captions | 828,395 | Captions |
| SPECTER citation triplets | 684,100 | Scientific |
| SearchQA | 582,261 | Web QA |
| ELI5 | 325,475 | Long-form QA |
| Flickr 30k | 317,695 | Captions |
| Stack Exchange duplicate questions | 304,525 | Question paraphrase |
| AllNLI (SNLI + MultiNLI) | 277,230 | Natural language inference |
| Sentence Compression | 180,000 | Paraphrase |
| WikiHow | 128,542 | Procedural |
| AltLex | 112,696 | Causal paraphrase |
| Quora Question Triplets | 103,663 | Question paraphrase |
| Simple Wikipedia | 102,225 | Paraphrase |
| Natural Questions | 100,231 | Open-domain QA |
| SQuAD 2.0 | 87,599 | Reading comprehension |
| TriviaQA | 73,346 | Trivia QA |
The mixture is heavy on Reddit (about 62 percent of all pairs) and scientific text from S2ORC (about 18 percent combined). That bias is part of why the model performs strongly on conversational and academic retrieval but is mediocre on long, formal documents.
During training, batches were assembled with a sampling strategy that drew from at least two datasets at a time, mixing in-batch negatives across domains so that the model would not collapse on any single distribution.
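The sprint's exact sampler is not spelled out in the model card; purely as an illustration of the idea, a hypothetical batching loop that draws each batch from two randomly chosen sources might look like this (the two-source split and the data layout are assumptions):

```python
import random

def mixed_batches(datasets, batch_size=1024, sources_per_batch=2):
    """Yield batches of (anchor, positive) pairs drawn from several sources at once,
    so that in-batch negatives span more than one domain."""
    names = list(datasets)
    per_source = batch_size // sources_per_batch
    while True:
        chosen = random.sample(names, k=sources_per_batch)
        batch = []
        for name in chosen:
            # Sample with replacement for simplicity; a real loader would iterate without it.
            batch.extend(random.choices(datasets[name], k=per_source))
        random.shuffle(batch)
        yield batch
```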
With the Sentence Transformers library installed, encoding and similarity scoring take a few lines:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

sentences = [
    "That is a happy person",
    "That is a happy dog",
    "That is a very happy person",
    "Today is a sunny day",
]

embeddings = model.encode(sentences)
print(embeddings.shape)  # (4, 768)

similarities = model.similarity(embeddings, embeddings)
print(similarities)
```
SentenceTransformer.encode already handles tokenization, mean pooling, and L2 normalization. The default similarity function is cosine.
If the sentence-transformers library is not available, the same model can be loaded through the base transformers library, but pooling and normalization must be done by hand.
```python
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F


def mean_pooling(model_output, attention_mask):
    # First element of the model output holds the last hidden state (token embeddings).
    token_embeddings = model_output[0]
    mask = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * mask, 1) / torch.clamp(mask.sum(1), min=1e-9)


sentences = ["This is an example sentence", "Each sentence is converted"]

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-mpnet-base-v2")
model = AutoModel.from_pretrained("sentence-transformers/all-mpnet-base-v2")

encoded = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    output = model(**encoded)

embeddings = mean_pooling(output, encoded["attention_mask"])
embeddings = F.normalize(embeddings, p=2, dim=1)
```
The full checkpoint can be cloned from the Hub. Git LFS is required for the model weights.
```
git lfs install
git clone https://huggingface.co/sentence-transformers/all-mpnet-base-v2
```
To download only metadata and pointers, prepend GIT_LFS_SKIP_SMUDGE=1.
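For example, to fetch only the small files and LFS pointer stubs:

```
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/sentence-transformers/all-mpnet-base-v2
```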
On the original MTEB English benchmark (Muennighoff et al., 2022), all-mpnet-base-v2 scored an average of about 57.8 across 56 tasks. When that leaderboard launched in 2022, it ranked near the top among public models of comparable size.
By 2026 the model has been overtaken on the absolute leaderboard. Top current models such as Qwen3-Embedding-8B (around 70 on MTEB v2), NV-Embed-v2 (about 72 on MTEB v1), Google's Gemini Embedding 001 (around 68), and the BGE and GTE families all use larger encoders and instruction tuning. Despite this, all-mpnet-base-v2 remains a common baseline because of its small size, permissive license, and extremely fast CPU inference relative to billion-parameter rivals.
The Sentence Transformers documentation positions all-mpnet-base-v2 at the high-quality end of the general-purpose family.
| Model | Embedding dim | Params | Max tokens | Relative speed | Quality position |
|---|---|---|---|---|---|
| all-mpnet-base-v2 | 768 | ~110M | 384 | 1x | Highest in the all-* family |
| all-distilroberta-v1 | 768 | ~82M | 512 | ~3x | Slightly below mpnet |
| all-MiniLM-L12-v2 | 384 | ~33M | 128 | ~6x | Mid-tier |
| all-MiniLM-L6-v2 | 384 | ~22M | 256 | ~14x | Good for speed-critical use |
The official sbert.net guidance is that all-mpnet-base-v2 provides the best quality among the all-* models, while all-MiniLM-L6-v2 is roughly five times faster on GPU and still competitive on most retrieval tasks. The relative-speed numbers above are taken from the documentation's CPU and GPU throughput tables.
For English retrieval at moderate scale, all-mpnet-base-v2 sits at a useful inflection point: the next jump in quality (BGE-M3, GTE-large, NV-Embed-v2, Qwen3-Embedding) costs three to ten times the parameters and often requires task-specific instructions. The next jump in speed (all-MiniLM-L6-v2) gives up roughly two to three points of average MTEB score in exchange for a 5x to 14x speed-up and half the embedding dimension.
Because the model produces a single fixed-size vector per input and supports cosine similarity directly, it slots into a wide range of retrieval and similarity workflows.
| Use case | Notes |
|---|---|
| Semantic search over English documents | Default starter model in many tutorials |
| Dense passage retrieval for retrieval augmented generation | Common index for LangChain, LlamaIndex, Haystack quickstarts |
| Clustering | Combine with k-means, HDBSCAN, or BERTopic |
| Paraphrase mining | Mine duplicate questions or near-duplicate documents |
| FAQ matching | Match user queries against a small bank of answers |
| Topic discovery | BERTopic uses it as a default backbone |
| Re-ranking candidates | First-stage retrieval, before a cross-encoder |
| Zero-shot classification | Score class names against a query embedding |
The single 768-dim vector also makes the model trivial to store in vector databases such as pgvector, Pinecone, Qdrant, Weaviate, Milvus, and Chroma.
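As one concrete instance of the zero-shot classification row in the table above, class names can be embedded once and each incoming query scored against them with cosine similarity. A minimal sketch (the labels and query are made up):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

labels = ["billing question", "technical support", "account cancellation"]
label_embeddings = model.encode(labels)

query = "I was charged twice for last month"
query_embedding = model.encode([query])

# Cosine similarity between the query and every class name; pick the best match.
scores = model.similarity(query_embedding, label_embeddings)  # shape (1, 3)
print(labels[int(scores.argmax())])  # expected: "billing question"
```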
For non-English or multilingual input, the usual recommendation is paraphrase-multilingual-mpnet-base-v2 or BGE-M3 instead.
Despite all of the above, all-mpnet-base-v2 is still the most-cited starter model in 2026 RAG and search tutorials. Three reasons account for this. First, it works out of the box without API keys, payments, or special hardware. Second, the quality is good enough for prototypes and for small-scale production, and the failure modes are well understood. Third, every major vector database, evaluation harness, and tutorial uses it as the reference, so reproducing a published experiment usually means installing this exact model.
The model's place in the ecosystem is closer to that of bert-base-uncased than to a cutting-edge release: not the strongest option on any benchmark, but the one almost everyone has loaded at some point.