# Sentence-transformers/all-mpnet-base-v2 model

> Source: https://aiwiki.ai/wiki/sentence-transformers_all-mpnet-base-v2_model
> Updated: 2026-04-26
> Categories: AI Models, Natural Language Processing, Open Source AI
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

| Field | Value |
|---|---|
| Hugging Face | sentence-transformers/all-mpnet-base-v2 |
| Name | all-mpnet-base-v2 |
| User / Organization | sentence-transformers |
| Type | [Natural Language Processing](/wiki/natural_language_processing) |
| Task | [Sentence Similarity](/wiki/sentence_similarity_models), [Feature Extraction](/wiki/feature_extraction) |
| Library | PyTorch, [Sentence Transformers](/wiki/sentence-transformers) |
| Base model | microsoft/mpnet-base |
| Architecture | [MPNet](/wiki/mpnet) (12-layer transformer encoder) |
| Embedding dimension | 768 |
| Max sequence length | 384 word pieces |
| Parameters | ~110 million |
| License | Apache 2.0 |
| Released | August 2021 (Flax community sprint) |
| Paper | [arxiv:1908.10084](https://arxiv.org/abs/1908.10084) (Sentence-BERT), [arxiv:2004.09297](https://arxiv.org/abs/2004.09297) (MPNet) |

## Overview

`sentence-transformers/all-mpnet-base-v2` is a sentence-embedding model that maps English sentences and short paragraphs to a 768-dimensional dense vector. It is one of the flagship general-purpose models distributed through the [Sentence Transformers](/wiki/sentence-transformers) library and one of the most downloaded text embedding models on [Hugging Face](/wiki/hugging_face), with tens of millions of downloads per month as of 2026.

The model is a fine-tuned version of Microsoft's `mpnet-base` checkpoint. Microsoft released MPNet at NeurIPS 2020 as a pre-training objective that combines masked and permuted language modeling. The fine-tuning that produced `all-mpnet-base-v2` happened during the Hugging Face Flax/JAX community sprint of summer 2021, in a project titled "Train the Best Sentence Embedding Model Ever with 1B Training Pairs." The team trained the model with a contrastive objective on roughly 1.17 billion sentence pairs collected from more than 30 datasets, using seven TPU v3-8 pods donated by Google.

The v2 suffix distinguishes this checkpoint from earlier versions trained on smaller mixtures of data. It quickly became the default reference model for general-purpose English embeddings and has held that role through 2026, even as larger models from BGE, GTE, NV-Embed, and Qwen now rank higher on the [MTEB](/wiki/mteb) leaderboard.

## Identity and provenance

| Field | Value |
|---|---|
| Hugging Face ID | `sentence-transformers/all-mpnet-base-v2` |
| Base checkpoint | `microsoft/mpnet-base` |
| Architecture | [MPNet](/wiki/mpnet) encoder |
| Pooling | Mean pooling over token outputs (with attention mask) |
| Output | 768-dimensional L2-normalized vector |
| Library | [Sentence Transformers](/wiki/sentence-transformers), `transformers` |
| Frameworks | PyTorch (primary), JAX/Flax (training), ONNX, Core ML, TensorFlow.js (community ports) |
| License | Apache 2.0 |
| Project lead during training | Nils Reimers (then UKP Lab) |
| Current maintainer | Tom Aarsen at [Hugging Face](/wiki/hugging_face) |

Nils Reimers created the Sentence Transformers project in 2019 at the Ubiquitous Knowledge Processing (UKP) Lab at TU Darmstadt, under Iryna Gurevych. He led the 2021 sprint that produced the `all-*` family. In late 2023 Tom Aarsen took over maintenance, and in 2025 the project officially moved from UKP Lab to Hugging Face.

## Architecture

The encoder is identical to `microsoft/mpnet-base` and follows the standard BERT-base shape. Architecture values come from the published config.json on Hugging Face.

| Component | Value |
|---|---|
| Transformer layers | 12 |
| Hidden size | 768 |
| Attention heads | 12 |
| Feed-forward (intermediate) size | 3072 |
| Max position embeddings | 514 |
| Vocabulary size | 30,527 word pieces |
| Tokenizer | MPNet tokenizer (WordPiece, uncased) |
| Total parameters | ~110 million |
| Output token dim | 768 |
| Sentence vector dim | 768 |

During inference, sequences are tokenized with the MPNet tokenizer, padded or truncated to a maximum of 384 word pieces, and run through the 12 encoder blocks. The sentence embedding is the mean of the contextualized token embeddings, weighted by the attention mask so that padding tokens are excluded. The result is then L2-normalized so that dot product equals [cosine similarity](/wiki/cosine_similarity).
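
The pooling and normalization steps can be illustrated on toy numbers. The following is a minimal pure-Python sketch, not the library's implementation: the helper names `masked_mean_pool` and `l2_normalize` and the 2-dimensional token vectors are made up for illustration (real token vectors have 768 dimensions).

```python
import math

def masked_mean_pool(token_vectors, attention_mask):
    """Average only the token vectors whose mask entry is 1 (real tokens)."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention mask excludes every token")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]

def l2_normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

# Hypothetical toy input: two real tokens followed by one padding token.
token_vectors = [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]]
attention_mask = [1, 1, 0]

pooled = masked_mean_pool(token_vectors, attention_mask)  # padding row is ignored
sentence_vector = l2_normalize(pooled)
print(pooled)
print(sentence_vector)
```

The padding row `[9.0, 9.0]` has no effect on the result, and the final vector has unit length, which is what lets a plain dot product stand in for cosine similarity.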

The choice of [mean pooling](/wiki/mean_pooling) rather than the `[CLS]` token follows the original [Sentence-BERT](/wiki/sentence-bert) paper by Reimers and Gurevych, which found that mean pooling produced better sentence-level representations than the classification token used in [BERT](/wiki/bert) fine-tuning.
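
As a rough cross-check of the ~110 million figure, the parameter count can be estimated from the config values in the table above. This sketch assumes a standard BERT-style layout (token and position embeddings, four attention projections and a two-layer feed-forward block per layer, each with biases and layer norms). It ignores smaller pieces such as MPNet's relative position bias and the pooler head, so it is an approximation rather than an exact count.

```python
# Config values from the architecture table.
vocab_size = 30_527
hidden = 768
layers = 12
intermediate = 3_072
max_positions = 514

# Embeddings: token table, position table, and one layer norm (weight + bias).
embedding_params = vocab_size * hidden + max_positions * hidden + 2 * hidden

# One encoder layer (assumed BERT-style layout).
attention_params = 4 * (hidden * hidden + hidden)  # Q, K, V, and output projections
feed_forward_params = (hidden * intermediate + intermediate) + (intermediate * hidden + hidden)
layer_norm_params = 2 * (2 * hidden)  # two layer norms per layer
per_layer = attention_params + feed_forward_params + layer_norm_params

estimated_total = embedding_params + layers * per_layer
print(f"{estimated_total:,}")  # about 109 million
```

Roughly a fifth of the estimate sits in the token embedding table, and the rest is spread evenly over the 12 encoder layers.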

### MPNet pre-training objective

The base model uses the MPNet objective from Song et al., "MPNet: Masked and Permuted Pre-training for Language Understanding" (NeurIPS 2020). MPNet sits between BERT and XLNet. It uses permuted language modeling like XLNet so the model learns dependencies between predicted tokens, and it feeds full position information into the encoder so the model always sees the length of the sentence, which XLNet's objective does not provide. In the paper's experiments this combination outperformed BERT-base, RoBERTa-base, and XLNet-base on GLUE and SQuAD at comparable model size.

`all-mpnet-base-v2` does not change this objective. It only adds a contrastive fine-tuning stage on top.

## Training

### Hardware and software

| Item | Value |
|---|---|
| Hardware | 7 TPU v3-8 pods (Google Cloud) |
| Framework | JAX/Flax |
| Optimizer | AdamW, learning rate 2e-5 |
| Warmup | 500 steps, linear |
| Steps | 100,000 |
| Batch size | 1,024 sentence pairs (128 per TPU core) |
| Sequence length during training | 128 word pieces |
| Loss | Cross-entropy over scaled cosine similarity (Multiple Negatives Ranking Loss) |

Training ran during the Hugging Face Flax/JAX community sprint in July and August 2021. The TPUs were donated by Google's Cloud team, and the codebase was released by the `flax-sentence-embeddings` organization on the Hub.
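
Two quantities follow directly from the table: the global batch size and the number of pairs processed over the run. The sketch below also shows one plausible reading of the warmup row. The function name `warmup_lr` is hypothetical, and holding the rate constant after warmup is an assumption, since the table does not say what happens after step 500.

```python
PEAK_LR = 2e-5
WARMUP_STEPS = 500
TOTAL_STEPS = 100_000
PAIRS_PER_CORE = 128
CORES_PER_DEVICE = 8  # a TPU v3-8 exposes 8 cores

def warmup_lr(step):
    """Linear warmup to PEAK_LR; the constant rate afterwards is an assumption."""
    return PEAK_LR * min(1.0, step / WARMUP_STEPS)

global_batch = PAIRS_PER_CORE * CORES_PER_DEVICE
pairs_seen = global_batch * TOTAL_STEPS

print(global_batch)  # 1024
print(pairs_seen)    # 102400000
print(warmup_lr(250), warmup_lr(500), warmup_lr(50_000))
```

Taking the table at face value, the run processes about 102 million pairs, which is well under one pass over the 1.17 billion pairs available.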

### Loss function

The team used Multiple Negatives Ranking Loss (also called InfoNCE or NTXent) over scaled cosine similarity:

```
loss = -1/n * sum_i log( exp(C * cos(a_i, p_i)) / sum_j exp(C * cos(a_i, p_j)) )
```

For each anchor `a_i` the matched positive `p_i` must score higher than every other sentence in the batch. The scale constant `C = 20` (equivalent to a softmax temperature of 0.05) sharpens the distribution. Because there is no explicit negative mining, all other items in the batch act as in-batch negatives, so a larger batch supplies more negatives per anchor, which is why the team used a batch size of 1,024. This [contrastive learning](/wiki/contrastive_learning) recipe is the same one that powers most modern dual-encoder retrievers.
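
The formula can be written out directly. This is a minimal pure-Python sketch for small toy inputs, not the sprint's JAX/Flax training code: the function names `cosine` and `mnr_loss` are made up for illustration, and the 2-dimensional vectors stand in for real embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mnr_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the correct match for anchors[i]
    and every other positive in the batch acts as a negative."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, p) for p in positives]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)

# Toy batch: each anchor is identical to its own positive.
anchors = [[1.0, 0.0], [0.0, 1.0]]
matched = mnr_loss(anchors, [[1.0, 0.0], [0.0, 1.0]])
# Same batch with the positives swapped, so every pairing is wrong.
swapped = mnr_loss(anchors, [[0.0, 1.0], [1.0, 0.0]])
print(matched, swapped)
```

With correctly matched pairs the loss is close to zero. With swapped pairs it is close to 20, the gap in scaled cosine between the wrong and right candidates.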

### Training datasets

The team mixed roughly 1.17 billion sentence pairs from over thirty sources. The largest contributors are listed below.

| Dataset | Pairs | Type |
|---|---|---|
| Reddit comments (2015 to 2018) | 726,484,430 | Conversational |
| S2ORC citation pairs (abstracts) | 116,288,806 | Scientific |
| WikiAnswers duplicate questions | 77,427,422 | Question paraphrase |
| PAQ (question, answer) | 64,371,441 | Open-domain QA |
| S2ORC citation pairs (titles) | 52,603,982 | Scientific |
| S2ORC (title, abstract) | 41,769,185 | Scientific |
| Stack Exchange (title, body) | 25,316,456 | Technical Q&A |
| Stack Exchange (title+body, answer) | 21,396,559 | Technical Q&A |
| Stack Exchange (title, answer) | 21,396,559 | Technical Q&A |
| MS MARCO triplets | 9,144,553 | Web search |
| GOOAQ | 3,012,496 | Web Q&A |
| Yahoo Answers (title, answer) | 1,198,260 | Community Q&A |
| CodeSearchNet | 1,151,414 | Code, docstring |
| COCO image captions | 828,395 | Captions |
| SPECTER citation triplets | 684,100 | Scientific |
| SearchQA | 582,261 | Web QA |
| ELI5 | 325,475 | Long-form QA |
| Flickr 30k | 317,695 | Captions |
| Stack Exchange duplicate questions | 304,525 | Question paraphrase |
| AllNLI (SNLI + MultiNLI) | 277,230 | Natural language inference |
| Sentence Compression | 180,000 | Paraphrase |
| WikiHow | 128,542 | Procedural |
| AltLex | 112,696 | Causal paraphrase |
| Quora Question Triplets | 103,663 | Question paraphrase |
| Simple Wikipedia | 102,225 | Paraphrase |
| Natural Questions | 100,231 | Open-domain QA |
| SQuAD 2.0 | 87,599 | Reading comprehension |
| TriviaQA | 73,346 | Trivia QA |

The mixture is heavy on Reddit (about 62 percent of all pairs) and scientific text from S2ORC (about 18 percent combined). That bias is part of why the model performs strongly on conversational and academic retrieval but is mediocre on long, formal documents.
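
The two percentages can be reproduced from the table. The sketch uses the rounded total of 1.17 billion pairs quoted in the text, so the shares are approximate. They describe raw pair counts, which are not necessarily the proportions sampled during training.

```python
TOTAL_PAIRS = 1_170_000_000  # approximate total stated in the text

reddit_pairs = 726_484_430
# S2ORC rows: citation pairs (abstracts) + citation pairs (titles) + (title, abstract)
s2orc_pairs = 116_288_806 + 52_603_982 + 41_769_185

reddit_share = reddit_pairs / TOTAL_PAIRS
s2orc_share = s2orc_pairs / TOTAL_PAIRS

print(f"Reddit: {reddit_share:.1%}")  # Reddit: 62.1%
print(f"S2ORC:  {s2orc_share:.1%}")   # S2ORC:  18.0%
```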

During training, batches were assembled with a sampling strategy that drew from at least two datasets at a time, mixing in-batch negatives across domains so that the model would not collapse on any single distribution.

## How to use it

### Sentence Transformers (recommended)

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

sentences = [
    "That is a happy person",
    "That is a happy dog",
    "That is a very happy person",
    "Today is a sunny day",
]

embeddings = model.encode(sentences)
print(embeddings.shape)  # (4, 768)

similarities = model.similarity(embeddings, embeddings)
print(similarities)
```

`SentenceTransformer.encode` already handles tokenization, mean pooling, and L2 normalization. The default similarity function is cosine.

### Plain transformers (manual pooling)

If the `sentence-transformers` library is not available, the same model can be loaded through the base `transformers` library, but pooling and normalization must be done by hand.

```python
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

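def mean_pooling(model_output, attention_mask):
    # The first element of the model output holds the per-token embeddings.
    token_embeddings = model_output[0]
    # Expand the mask over the hidden dimension so padding tokens contribute zero.
    mask = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    # Sum the real tokens and divide by their count (clamped to avoid division by zero).
    return torch.sum(token_embeddings * mask, 1) / torch.clamp(mask.sum(1), min=1e-9)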

sentences = ["This is an example sentence", "Each sentence is converted"]

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-mpnet-base-v2")
model = AutoModel.from_pretrained("sentence-transformers/all-mpnet-base-v2")

# Truncate to 384 word pieces to match the model's maximum sequence length.
encoded = tokenizer(sentences, padding=True, truncation=True, max_length=384, return_tensors="pt")

with torch.no_grad():
    output = model(**encoded)

embeddings = mean_pooling(output, encoded["attention_mask"])
embeddings = F.normalize(embeddings, p=2, dim=1)
```

### Cloning the repository

The full checkpoint can be cloned from the Hub. Git LFS is required for the model weights.

```bash
git lfs install
git clone https://huggingface.co/sentence-transformers/all-mpnet-base-v2
```

To download only the metadata and LFS pointer files without the weights, prefix the `git clone` command with `GIT_LFS_SKIP_SMUDGE=1`.

## Performance

### MTEB

On the original MTEB English benchmark (Muennighoff et al., 2022), `all-mpnet-base-v2` scored an average of about 57.8 across 56 tasks. When the benchmark was released in 2022, about a year after the model itself, it ranked near the top of the leaderboard among public models of comparable size.

By 2026 the model has been overtaken on the absolute leaderboard. Top current models such as Qwen3-Embedding-8B (around 70 on MTEB v2), NV-Embed-v2 (about 72 on MTEB v1), Google's Gemini Embedding 001 (around 68), and the BGE and GTE families all use larger encoders and instruction tuning. Despite this, `all-mpnet-base-v2` remains a common baseline because of its small size, permissive license, and much faster CPU inference than billion-parameter rivals.

### Comparison with other Sentence Transformers models

The Sentence Transformers documentation positions `all-mpnet-base-v2` at the high-quality end of the general-purpose family.

| Model | Embedding dim | Params | Max tokens | Relative speed | Quality position |
|---|---|---|---|---|---|
| [all-mpnet-base-v2](/wiki/sentence-transformers_all-mpnet-base-v2_model) | 768 | ~110M | 384 | 1x | Highest in the all-* family |
| all-distilroberta-v1 | 768 | ~82M | 512 | ~3x | Slightly below mpnet |
| all-MiniLM-L12-v2 | 384 | ~33M | 256 | ~6x | Mid-tier |
| [all-MiniLM-L6-v2](/wiki/sentence-transformers_all-minilm-l6-v2_model) | 384 | ~22M | 256 | ~14x | Good for speed-critical use |

The official sbert.net guidance is that `all-mpnet-base-v2` provides the best quality among the all-* models, while `all-MiniLM-L6-v2` is roughly five times faster on GPU and still competitive on most retrieval tasks. The relative-speed numbers above are taken from the documentation's CPU and GPU throughput tables.

### Quality versus cost

For English retrieval at moderate scale, `all-mpnet-base-v2` sits at a useful inflection point: the next jump in quality (BGE-M3, GTE-large, NV-Embed-v2, Qwen3-Embedding) costs at least three times the parameters, and roughly seventy times for the 7B to 8B models, and often requires task-specific instructions. The next jump in speed (`all-MiniLM-L6-v2`) gives up roughly two to three points of average MTEB score in exchange for a 5x to 14x speed-up and half the embedding dimension.
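
The storage side of this trade-off is easy to estimate. The sketch assumes vectors are stored as uncompressed 32-bit floats with no index overhead. The function name `index_size_bytes` is made up for illustration, and real vector databases add their own overhead or compression.

```python
BYTES_PER_FLOAT32 = 4

def index_size_bytes(num_vectors, dim, bytes_per_value=BYTES_PER_FLOAT32):
    """Raw storage for num_vectors dense vectors of the given dimension."""
    return num_vectors * dim * bytes_per_value

one_million = 1_000_000
mpnet_bytes = index_size_bytes(one_million, 768)   # all-mpnet-base-v2
minilm_bytes = index_size_bytes(one_million, 384)  # all-MiniLM-L6-v2

print(mpnet_bytes / 1e9)   # 3.072 (GB)
print(minilm_bytes / 1e9)  # 1.536 (GB)
```

Under these assumptions each 768-dimensional vector takes 3,072 bytes, so halving the dimension halves the raw footprint.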

## Use cases

Because the model produces a single fixed-size vector per input and supports cosine similarity directly, it slots into a wide range of retrieval and similarity workflows.

| Use case | Notes |
|---|---|
| Semantic search over English documents | Default starter model in many tutorials |
| Dense passage retrieval for [retrieval augmented generation](/wiki/retrieval_augmented_generation) | Common index for LangChain, LlamaIndex, Haystack quickstarts |
| Clustering | Combine with k-means, HDBSCAN, or BERTopic |
| Paraphrase mining | Mine duplicate questions or near-duplicate documents |
| FAQ matching | Match user queries against a small bank of answers |
| Topic discovery | BERTopic uses it as a default backbone |
| Candidate retrieval before re-ranking | Acts as the first-stage retriever whose top hits are re-scored by a cross-encoder |
| Zero-shot classification | Score class names against a query [embedding](/wiki/embeddings) |

The single 768-dim vector also makes the model trivial to store in vector databases such as pgvector, Pinecone, Qdrant, Weaviate, Milvus, and Chroma.
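
At small scale, the semantic-search and FAQ-matching rows above reduce to a dot product and a sort, because the vectors are already L2-normalized. The following sketch uses hand-written 3-dimensional unit vectors as stand-ins for real 768-dimensional embeddings, and `top_k` is a hypothetical helper, not part of any library.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def top_k(query, corpus, k=2):
    """Return (index, score) pairs for the k highest dot products.

    Assumes every vector is already unit length, so the dot product
    equals cosine similarity.
    """
    scored = [(idx, dot(query, doc)) for idx, doc in enumerate(corpus)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy unit vectors standing in for document embeddings (illustrative only).
corpus = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.6, 0.8, 0.0],
]
query = [0.8, 0.6, 0.0]

hits = top_k(query, corpus, k=2)
print(hits)
```

A vector database replaces the linear scan with an approximate nearest-neighbor index, but the scoring rule is the same.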

## Strengths and limitations

### Strengths

1. Apache 2.0 license, no use restrictions.
2. Small enough to run on a CPU at hundreds of sentences per second, and on a single mid-range GPU at thousands per second.
3. Trained on a deliberately diverse mixture, so it generalizes across domains better than narrowly tuned alternatives.
4. Mature ecosystem: ONNX, GGUF, Core ML, TensorRT, and TensorFlow.js community ports exist.
5. The Sentence Transformers library reduces loading and encoding to a line or two each and ships built-in utilities for fine-tuning and quantizing the model.

### Limitations

1. English only. Multilingual queries should use `paraphrase-multilingual-mpnet-base-v2` or BGE-M3 instead.
2. Hard 384-token cap. Longer documents must be chunked.
3. No instruction tuning. Newer instruction-aware embedding models score higher on diverse tasks because they can be told what to do.
4. Reddit-heavy training mix. The model can overfit to casual conversational style and underperform on formal legal, medical, or financial text without fine-tuning.
5. As of 2026 it is no longer state of the art on MTEB. Teams that need maximum quality typically pick BGE-M3, GTE-large-en-v1.5, NV-Embed-v2, or Qwen3-Embedding.
6. The model is trained to produce a single fixed-size vector per input. Its token-level outputs are not tuned for tasks like extractive QA or named entity recognition.
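
Limitation 2 is usually handled by splitting long inputs into overlapping windows before encoding. A minimal sketch, assuming the input is already a list of token ids: `chunk_tokens` is a hypothetical helper, the overlap of 32 is an arbitrary illustrative choice, and the default window of 382 assumes that two of the 384 positions are taken by special tokens.

```python
def chunk_tokens(tokens, max_len=382, overlap=32):
    """Split a token list into windows of at most max_len tokens.

    Consecutive windows share `overlap` tokens so that text cut at a
    boundary still appears with some context in the next window.
    """
    if max_len <= 0 or not 0 <= overlap < max_len:
        raise ValueError("need max_len > 0 and 0 <= overlap < max_len")
    step = max_len - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # the last window already reaches the end of the input
    return chunks

# Stand-in for a tokenized 1,000-token document (illustrative only).
document = list(range(1000))
chunks = chunk_tokens(document)
print([len(chunk) for chunk in chunks])  # [382, 382, 300]
```

Each chunk is then embedded separately, and the per-chunk vectors are either indexed individually or averaged, depending on the application.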

## Why it remains the default

Despite all of the above, `all-mpnet-base-v2` is still the most-cited starter model in 2026 RAG and search tutorials. Three reasons account for this. First, it works out of the box without API keys, payments, or special hardware. Second, the quality is good enough for prototype and production at small scale, and the failure modes are well understood. Third, every major vector database, evaluation harness, and tutorial uses it as the reference, so reproducing a published experiment usually means installing this exact model.

The model's place in the ecosystem is closer to that of `bert-base-uncased` than to a cutting-edge release: not the strongest option on any benchmark, but the one almost everyone has loaded at some point.

## See also

- [Sentence-BERT](/wiki/sentence-bert)
- [Sentence Transformers](/wiki/sentence-transformers)
- [MPNet](/wiki/mpnet)
- [BERT](/wiki/bert)
- [Transformers](/wiki/transformers)
- [all-MiniLM-L6-v2](/wiki/sentence-transformers_all-minilm-l6-v2_model)
- [Hugging Face](/wiki/hugging_face)
- [Embedding](/wiki/embeddings)
- [Sentence similarity](/wiki/sentence_similarity_models)
- [Feature extraction](/wiki/feature_extraction)
- [Contrastive learning](/wiki/contrastive_learning)
- [Mean pooling](/wiki/mean_pooling)
- [Cosine similarity](/wiki/cosine_similarity)
- [MTEB](/wiki/mteb)
- [Retrieval augmented generation](/wiki/retrieval_augmented_generation)

## References

1. Reimers, N. and Gurevych, I. (2019). "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks." EMNLP 2019. [arXiv:1908.10084](https://arxiv.org/abs/1908.10084).
2. Song, K., Tan, X., Qin, T., Lu, J., and Liu, T.-Y. (2020). "MPNet: Masked and Permuted Pre-training for Language Understanding." NeurIPS 2020. [arXiv:2004.09297](https://arxiv.org/abs/2004.09297).
3. Hugging Face model card. "sentence-transformers/all-mpnet-base-v2." [https://huggingface.co/sentence-transformers/all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2).
4. Microsoft Research model card. "microsoft/mpnet-base." [https://huggingface.co/microsoft/mpnet-base](https://huggingface.co/microsoft/mpnet-base).
5. Hugging Face Blog (2021). "Train a Sentence Embedding Model with 1 Billion Training Pairs." [https://huggingface.co/blog/1b-sentence-embeddings](https://huggingface.co/blog/1b-sentence-embeddings).
6. SBERT documentation. "Pretrained Models." [https://www.sbert.net/docs/sentence_transformer/pretrained_models.html](https://www.sbert.net/docs/sentence_transformer/pretrained_models.html).
7. Muennighoff, N., Tazi, N., Magne, L., and Reimers, N. (2022). "MTEB: Massive Text Embedding Benchmark." [arXiv:2210.07316](https://arxiv.org/abs/2210.07316).
8. Hugging Face Blog (2025). "Sentence Transformers is joining Hugging Face." [https://huggingface.co/blog/sentence-transformers-joins-hf](https://huggingface.co/blog/sentence-transformers-joins-hf).
9. MTEB Leaderboard. [https://huggingface.co/spaces/mteb/leaderboard](https://huggingface.co/spaces/mteb/leaderboard).
10. NVIDIA Developer Blog. "NVIDIA Text Embedding Model Tops MTEB Leaderboard." [https://developer.nvidia.com/blog/nvidia-text-embedding-model-tops-mteb-leaderboard/](https://developer.nvidia.com/blog/nvidia-text-embedding-model-tops-mteb-leaderboard/).

