sentence-transformers/all-mpnet-base-v2
| Field | Value |
|---|---|
| Name | all-mpnet-base-v2 |
| User / Organization | sentence-transformers |
| Type | Model |
| Task | Sentence Similarity, Feature Extraction |
| Library | PyTorch, Sentence Transformers |
| Base model | microsoft/mpnet-base |
| Architecture | MPNet (12-layer transformer encoder) |
| Embedding dimension | 768 |
| Max sequence length | 384 word pieces |
| Parameters | ~110 million |
| License | Apache 2.0 |
| Released | August 2021 (Flax community sprint) |
| Paper | arxiv:1908.10084 (Sentence-BERT), arxiv:2004.09297 (MPNet) |
sentence-transformers/all-mpnet-base-v2 is a sentence-embedding model that maps English sentences and short paragraphs to a 768-dimensional dense vector. It is one of the flagship general-purpose models distributed through the Sentence Transformers library and one of the most downloaded text embedding models on Hugging Face, with tens of millions of downloads per month as of 2026.
The model is a fine-tuned version of Microsoft's mpnet-base checkpoint. Microsoft released MPNet at NeurIPS 2020 as a pre-training objective that combines masked and permuted language modeling. The fine-tuning that produced all-mpnet-base-v2 happened during the Hugging Face Flax/JAX community sprint of summer 2021, in a project titled "Train the Best Sentence Embedding Model Ever with 1B Training Pairs." The team trained the model with a contrastive objective on roughly 1.17 billion sentence pairs collected from more than 30 datasets, using seven TPU v3-8s donated by Google.
The v2 suffix distinguishes this checkpoint from earlier versions trained on smaller mixtures of data. It quickly became the default reference model for general-purpose English embeddings and has held that role through 2026, even as larger models from BGE, GTE, NV-Embed, and Qwen now rank higher on the MTEB leaderboard.
| Field | Value |
|---|---|
| Hugging Face ID | sentence-transformers/all-mpnet-base-v2 |
| Base checkpoint | microsoft/mpnet-base |
| Architecture | MPNet encoder |
| Pooling | Mean pooling over token outputs (with attention mask) |
| Output | 768-dimensional L2-normalized vector |
| Library | Sentence Transformers, transformers |
| Frameworks | PyTorch (primary), JAX/Flax (training), ONNX, Core ML, TensorFlow.js (community ports) |
| License | Apache 2.0 |
| Project lead during training | Nils Reimers (then UKP Lab) |
| Current maintainer | Tom Aarsen at Hugging Face |
Nils Reimers created the Sentence Transformers project in 2019 at the Ubiquitous Knowledge Processing (UKP) Lab at TU Darmstadt, under Iryna Gurevych. He led the 2021 sprint that produced the all-* family. In late 2023 Tom Aarsen took over maintenance, and in 2025 the project officially moved from UKP Lab to Hugging Face.
The encoder is identical to microsoft/mpnet-base and follows the standard BERT-base shape. Architecture values come from the published config.json on Hugging Face.
| Component | Value |
|---|---|
| Transformer layers | 12 |
| Hidden size | 768 |
| Attention heads | 12 |
| Feed-forward (intermediate) size | 3072 |
| Max position embeddings | 514 |
| Vocabulary size | 30,527 word pieces |
| Tokenizer | MPNet tokenizer (WordPiece, uncased) |
| Total parameters | ~110 million |
| Output token dim | 768 |
| Sentence vector dim | 768 |
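Since these values come straight from the checkpoint's config.json, they can be read back programmatically. A minimal sketch using the transformers AutoConfig loader; the printed numbers should match the table above:

```python
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("sentence-transformers/all-mpnet-base-v2")
print(cfg.num_hidden_layers)        # 12 transformer layers
print(cfg.hidden_size)              # 768
print(cfg.num_attention_heads)      # 12
print(cfg.intermediate_size)        # 3072
print(cfg.max_position_embeddings)  # 514
print(cfg.vocab_size)               # 30527
```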
During inference, sequences are tokenized with the MPNet tokenizer, padded or truncated to a maximum of 384 word pieces, and run through the 12 encoder blocks. The sentence embedding is the mean of the contextualized token embeddings, weighted by the attention mask so that padding tokens are excluded. The result is then L2-normalized so that dot product equals cosine similarity.
The choice of mean pooling rather than the [CLS] token follows the original Sentence-BERT paper by Reimers and Gurevych, which found that mean pooling produced better sentence-level representations than the classification token used in BERT fine-tuning.
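The pooling and normalization steps are stored as part of the saved SentenceTransformer pipeline rather than something the caller has to implement. A quick sketch, assuming the checkpoint ships with its pooling and normalization modules (as the model card's usage suggests), that confirms the truncation limit and the module stack:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
print(model.max_seq_length)  # 384: inputs longer than this many word pieces are truncated
print(model)                 # module list, expected: Transformer -> Pooling (mean) -> Normalize
```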
The base model uses the MPNet objective from Song et al., "MPNet: Masked and Permuted Pre-training for Language Understanding" (NeurIPS 2020). MPNet sits between BERT and XLNet. It uses permuted language modeling like XLNet so the model learns dependencies between predicted tokens, and it feeds full position information into the encoder so the model always sees the length of the sentence (which XLNet does not). This combination outperformed BERT-base, RoBERTa-base, and XLNet-base on GLUE and SQuAD when normalized for parameter count.
all-mpnet-base-v2 does not change this objective. It only adds a contrastive fine-tuning stage on top.
| Item | Value |
|---|---|
| Hardware | 7 TPU v3-8s (Google Cloud) |
| Framework | JAX/Flax |
| Optimizer | AdamW, learning rate 2e-5 |
| Warmup | 500 steps, linear |
| Steps | 100,000 |
| Batch size | 1,024 sentence pairs (128 per TPU core) |
| Sequence length during training | 128 word pieces |
| Loss | Cross-entropy over scaled cosine similarity (Multiple Negatives Ranking Loss) |
Training ran during the Hugging Face Flax/JAX community sprint in July and August 2021. The TPUs were donated by Google's Cloud team, and the codebase was released by the flax-sentence-embeddings organization on the Hub.
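The original run used JAX/Flax on TPUs and is not reproduced here; the sketch below only mirrors the shape of the recipe in PyTorch with the sentence-transformers training API, reusing the hyperparameters from the table (learning rate 2e-5, 500 warmup steps, MNRL with scale 20) on a toy dataset:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Start from the Microsoft base checkpoint; sentence-transformers adds mean pooling on top.
model = SentenceTransformer("microsoft/mpnet-base")
model.max_seq_length = 128  # training used sequences of 128 word pieces

# Toy stand-in for the ~1.17 billion (anchor, positive) pairs used in the sprint.
train_examples = [
    InputExample(texts=["how do I reset my password", "steps to reset a forgotten password"]),
    InputExample(texts=["what is the capital of france", "paris is the capital of france"]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)  # real batch size: 1,024
train_loss = losses.MultipleNegativesRankingLoss(model, scale=20.0)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=500,               # linear warmup, as in the table above
    optimizer_params={"lr": 2e-5},  # AdamW at 2e-5
)
```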
The team used Multiple Negatives Ranking Loss (also called InfoNCE or NTXent) over scaled cosine similarity:
```
loss = -1/n * sum_i log( exp(C * cos(a_i, p_i)) / sum_j exp(C * cos(a_i, p_j)) )
```
For each anchor a_i the matched positive p_i must score higher than every other sentence in the batch. The scale factor C = 20 (an inverse temperature) sharpens the distribution. Because there is no explicit negative mining, all other items in the batch act as in-batch negatives, which is why the team used a batch size of 1,024. This contrastive recipe is the same one that powers most modern dual-encoder retrievers.
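Written out as code, the loss above is a cross-entropy over a batch-by-batch matrix of scaled cosine similarities. A minimal PyTorch sketch (not the sprint's training code):

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(anchor_emb, positive_emb, scale=20.0):
    # anchor_emb, positive_emb: (batch, dim) tensors; row i of each forms a matched pair.
    a = F.normalize(anchor_emb, p=2, dim=1)
    p = F.normalize(positive_emb, p=2, dim=1)
    scores = scale * a @ p.T                     # scores[i, j] = C * cos(a_i, p_j)
    labels = torch.arange(scores.size(0), device=scores.device)
    # Cross-entropy pushes the diagonal entry (the true pair) above every in-batch negative.
    return F.cross_entropy(scores, labels)
```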
The team mixed roughly 1.17 billion sentence pairs from over thirty sources. The largest contributors are listed below.
| Dataset | Pairs | Type |
|---|---|---|
| Reddit comments (2015 to 2018) | 726,484,430 | Conversational |
| S2ORC citation pairs (abstracts) | 116,288,806 | Scientific |
| WikiAnswers duplicate questions | 77,427,422 | Question paraphrase |
| PAQ (question, answer) | 64,371,441 | Open-domain QA |
| S2ORC citation pairs (titles) | 52,603,982 | Scientific |
| S2ORC (title, abstract) | 41,769,185 | Scientific |
| Stack Exchange (title, body) | 25,316,456 | Technical Q&A |
| Stack Exchange (title+body, answer) | 21,396,559 | Technical Q&A |
| Stack Exchange (title, answer) | 21,396,559 | Technical Q&A |
| MS MARCO triplets | 9,144,553 | Web search |
| GOOAQ | 3,012,496 | Web Q&A |
| Yahoo Answers (title, answer) | 1,198,260 | Community Q&A |
| CodeSearchNet | 1,151,414 | Code, docstring |
| COCO image captions | 828,395 | Captions |
| SPECTER citation triplets | 684,100 | Scientific |
| SearchQA | 582,261 | Web QA |
| ELI5 | 325,475 | Long-form QA |
| Flickr 30k | 317,695 | Captions |
| Stack Exchange duplicate questions | 304,525 | Question paraphrase |
| AllNLI (SNLI + MultiNLI) | 277,230 | Natural language inference |
| Sentence Compression | 180,000 | Paraphrase |
| WikiHow | 128,542 | Procedural |
| AltLex | 112,696 | Causal paraphrase |
| Quora Question Triplets | 103,663 | Question paraphrase |
| Simple Wikipedia | 102,225 | Paraphrase |
| Natural Questions | 100,231 | Open-domain QA |
| SQuAD 2.0 | 87,599 | Reading comprehension |
| TriviaQA | 73,346 | Trivia QA |
The mixture is heavy on Reddit (about 62 percent of all pairs) and scientific text from S2ORC (about 18 percent combined). That bias is part of why the model performs strongly on conversational and academic retrieval but is mediocre on long, formal documents.
During training, batches were assembled with a sampling strategy that drew from at least two datasets at a time, mixing in-batch negatives across domains so that the model would not collapse on any single distribution.
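The sprint's exact sampler is not spelled out in the model card; purely as an illustration of the idea, a hypothetical batching loop that draws each batch from two randomly chosen sources might look like this (the two-source split and the data layout are assumptions):

```python
import random

def mixed_batches(datasets, batch_size=1024, sources_per_batch=2):
    """Yield batches of (anchor, positive) pairs drawn from several sources at once,
    so that in-batch negatives span more than one domain."""
    names = list(datasets)
    per_source = batch_size // sources_per_batch
    while True:
        chosen = random.sample(names, k=sources_per_batch)
        batch = []
        for name in chosen:
            # Sample with replacement for simplicity; a real loader would iterate without it.
            batch.extend(random.choices(datasets[name], k=per_source))
        random.shuffle(batch)
        yield batch
```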
With the Sentence Transformers library installed, encoding and similarity scoring take a few lines:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

sentences = [
    "That is a happy person",
    "That is a happy dog",
    "That is a very happy person",
    "Today is a sunny day",
]

embeddings = model.encode(sentences)
print(embeddings.shape)  # (4, 768)

similarities = model.similarity(embeddings, embeddings)
print(similarities)
```
SentenceTransformer.encode already handles tokenization, mean pooling, and L2 normalization. The default similarity function is cosine.
If the sentence-transformers library is not available, the same model can be loaded through the base transformers library, but pooling and normalization must be done by hand.
```python
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F


def mean_pooling(model_output, attention_mask):
    # First element of the model output holds the last hidden state (token embeddings).
    token_embeddings = model_output[0]
    mask = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * mask, 1) / torch.clamp(mask.sum(1), min=1e-9)


sentences = ["This is an example sentence", "Each sentence is converted"]

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-mpnet-base-v2")
model = AutoModel.from_pretrained("sentence-transformers/all-mpnet-base-v2")

encoded = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    output = model(**encoded)

embeddings = mean_pooling(output, encoded["attention_mask"])
embeddings = F.normalize(embeddings, p=2, dim=1)
```
The full checkpoint can be cloned from the Hub. Git LFS is required for the model weights.
```
git lfs install
git clone https://huggingface.co/sentence-transformers/all-mpnet-base-v2
```
To download only metadata and pointers, prepend GIT_LFS_SKIP_SMUDGE=1.
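For example, to fetch only the small files and LFS pointer stubs:

```
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/sentence-transformers/all-mpnet-base-v2
```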
On the original MTEB English benchmark (Muennighoff et al., 2022), all-mpnet-base-v2 scored an average of about 57.8 across 56 tasks. When that leaderboard launched in 2022, it ranked near the top among public models of comparable size.
By 2026 the model has been overtaken on the absolute leaderboard. Top current models such as Qwen3-Embedding-8B (around 70 on MTEB v2), NV-Embed-v2 (about 72 on MTEB v1), Google's Gemini Embedding 001 (around 68), and the BGE and GTE families all use larger encoders and instruction tuning. Despite this, all-mpnet-base-v2 remains a common baseline because of its small size, permissive license, and extremely fast CPU inference relative to billion-parameter rivals.
The Sentence Transformers documentation positions all-mpnet-base-v2 at the high-quality end of the general-purpose family.
| Model | Embedding dim | Params | Max tokens | Relative speed | Quality position |
|---|---|---|---|---|---|
| all-mpnet-base-v2 | 768 | ~110M | 384 | 1x | Highest in the all-* family |
| all-distilroberta-v1 | 768 | ~82M | 512 | ~3x | Slightly below mpnet |
| all-MiniLM-L12-v2 | 384 | ~33M | 128 | ~6x | Mid-tier |
| all-MiniLM-L6-v2 | 384 | ~22M | 256 | ~14x | Good for speed-critical use |
The official sbert.net guidance is that all-mpnet-base-v2 provides the best quality among the all-* models, while all-MiniLM-L6-v2 is roughly five times faster on GPU and still competitive on most retrieval tasks. The relative-speed numbers above are taken from the documentation's CPU and GPU throughput tables.
For English retrieval at moderate scale, all-mpnet-base-v2 sits at a useful inflection point: the next jump in quality (BGE-M3, GTE-large, NV-Embed-v2, Qwen3-Embedding) costs three to ten times the parameters and often requires task-specific instructions. The next jump in speed (all-MiniLM-L6-v2) gives up roughly two to three points of average MTEB score in exchange for a 5x to 14x speed-up and half the embedding dimension.
Because the model produces a single fixed-size vector per input and supports cosine similarity directly, it slots into a wide range of retrieval and similarity workflows.
| Use case | Notes |
|---|---|
| Semantic search over English documents | Default starter model in many tutorials |
| Dense passage retrieval for retrieval augmented generation | Common index for LangChain, LlamaIndex, Haystack quickstarts |
| Clustering | Combine with k-means, HDBSCAN, or BERTopic |
| Paraphrase mining | Mine duplicate questions or near-duplicate documents |
| FAQ matching | Match user queries against a small bank of answers |
| Topic discovery | BERTopic uses it as a default backbone |
| Re-ranking candidates | First-stage retrieval, before a cross-encoder |
| Zero-shot classification | Score class names against a query embedding |
The single 768-dim vector also makes the model trivial to store in vector databases such as pgvector, Pinecone, Qdrant, Weaviate, Milvus, and Chroma.
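As one concrete instance of the zero-shot classification row in the table above, class names can be embedded once and each incoming query scored against them with cosine similarity. A minimal sketch (the labels and query are made up):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

labels = ["billing question", "technical support", "account cancellation"]
label_embeddings = model.encode(labels)

query = "I was charged twice for last month"
query_embedding = model.encode([query])

# Cosine similarity between the query and every class name; pick the best match.
scores = model.similarity(query_embedding, label_embeddings)  # shape (1, 3)
print(labels[int(scores.argmax())])  # expected: "billing question"
```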
For non-English or multilingual input, the usual recommendation is paraphrase-multilingual-mpnet-base-v2 or BGE-M3 instead.
Despite all of the above, all-mpnet-base-v2 is still the most-cited starter model in 2026 RAG and search tutorials. Three reasons account for this. First, it works out of the box without API keys, payments, or special hardware. Second, the quality is good enough for prototypes and for small-scale production, and the failure modes are well understood. Third, every major vector database, evaluation harness, and tutorial uses it as the reference, so reproducing a published experiment usually means installing this exact model.
The model's place in the ecosystem is closer to that of bert-base-uncased than to a cutting-edge release: not the strongest option on any benchmark, but the one almost everyone has loaded at some point.