Sentence-BERT (SBERT)
Sentence-BERT, almost always abbreviated SBERT, is a modification of the pretrained BERT transformer network designed to produce semantically meaningful, fixed-size sentence embeddings that can be compared with simple cosine similarity. It was introduced by Nils Reimers and Iryna Gurevych at the Ubiquitous Knowledge Processing (UKP) Lab at the Technical University of Darmstadt in their EMNLP 2019 paper Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. SBERT solved a critical efficiency problem that had been holding back BERT from large-scale semantic similarity tasks, and the open-source library that implements it, sentence-transformers, has since become one of the most widely deployed pieces of NLP software in the world. It powers a substantial fraction of the embedding pipelines behind modern semantic search systems, retrieval augmented generation stacks, clustering tools, and recommendation engines.[1][2]
The core idea is conceptually simple but pragmatically transformative. Vanilla BERT, although extremely strong on sentence-pair classification, requires both sentences to be concatenated and fed through the network jointly. This is fine for a single comparison but catastrophic for tasks like nearest-neighbor search over a corpus, because the cost grows with the square of the corpus size. SBERT replaces this cross-encoder formulation with a bi-encoder built from a Siamese (or triplet) configuration of BERT models. After fine-tuning on natural language inference and semantic similarity data, each sentence can be encoded once into a single dense vector, and then arbitrary pairs of vectors can be compared in microseconds. The reduction the original paper reports is striking: finding the most similar pair in a collection of 10,000 sentences with vanilla BERT takes roughly 65 hours; the same task with SBERT takes about 5 seconds, while preserving most of the accuracy advantage that BERT brought over earlier sentence embedding methods.[1]
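As a concrete sketch of this encode-once, compare-many workflow (the three-sentence corpus is illustrative; all-MiniLM-L6-v2 is one of the public checkpoints discussed later):

```python
import torch
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# Each sentence is encoded exactly once: n forward passes instead of the
# n*(n-1)/2 joint passes a cross-encoder would need for all pairs.
sentences = ["A man is eating food.", "A man eats a meal.", "The girl plays violin."]
embeddings = model.encode(sentences, convert_to_tensor=True)

# All-pairs cosine similarity is a single matrix product over cached vectors.
scores = util.cos_sim(embeddings, embeddings)
scores.fill_diagonal_(-1.0)  # ignore self-similarity

i, j = divmod(int(scores.argmax()), scores.size(1))
print(f"Most similar pair: {sentences[i]!r} / {sentences[j]!r} ({float(scores[i, j]):.3f})")
```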
Before SBERT, the field had a fragmented set of options for representing whole sentences as fixed-length vectors. Methods like averaged word embeddings (averaging GloVe or word2vec vectors), Skip-Thought, InferSent, and the Universal Sentence Encoder were each useful but plateaued well below what large pretrained transformers seemed to promise. When BERT arrived in late 2018, it set new state-of-the-art results across an enormous range of NLP benchmarks, including sentence-pair tasks like STS-B, MRPC, and natural language inference. However, BERT's design treated sentence pairs as a joint input: the two sentences were concatenated with a [SEP] token, fed together through the transformer stack, and a classification head on top of the [CLS] token produced a similarity score or label. This cross-encoding approach is what produced the strong scores, but it makes BERT unsuitable for tasks where you need a stand-alone vector representation of a single sentence.[1]
The naive workaround, which many practitioners initially tried, was to feed individual sentences through BERT and either take the [CLS] token output or average the contextualized token vectors, treating the result as a sentence embedding. Reimers and Gurevych showed that this naive approach actually performed worse than averaging GloVe embeddings on the standard semantic textual similarity tasks: averaging BERT-base outputs achieved an average Spearman correlation (×100) of just 54.81 across STS12 through STS16, STS Benchmark, and SICK-R, while averaged GloVe embeddings reached 61.32. This counterintuitive result motivated the entire SBERT project. BERT's representations were rich, but its training objectives (masked language modeling and next-sentence prediction) had not aligned the geometry of the embedding space with semantic similarity.[1][3]
SBERT keeps the BERT (or RoBERTa, DistilBERT, ALBERT, MPNet, etc.) transformer encoder as its backbone but adds two key elements on top: a pooling layer that converts the variable-length sequence of token vectors into a fixed-size sentence vector, and a Siamese training setup that fine-tunes the encoder so that semantically similar sentences end up close together in the resulting vector space.[1]
A standard BERT encoder emits a sequence of contextualized token embeddings whose length depends on the input. SBERT pools this sequence into a single vector. The original paper explored three pooling operations and found that the choice matters, especially when training with a regression-style objective.[1]
| Pooling strategy | What it does | Notes |
|---|---|---|
| CLS | Use the embedding of the special [CLS] token as the sentence vector. | Simple, matches how BERT is fine-tuned for classification, but underperforms mean pooling for semantic similarity. |
| MEAN | Average all token embeddings, optionally weighted by the attention mask so padding does not bias the result. | Default in sentence-transformers. Best overall tradeoff between quality and stability across tasks. |
| MAX | Take an element-wise maximum over the sequence dimension (max-over-time). | Performs noticeably worse than CLS or MEAN when the model is trained with a regression objective. |
Mean pooling has become the de facto standard, and almost every popular SBERT-style model on the Hugging Face Hub uses it. Newer variants sometimes use a learned weighted pool or a small attention head, but mean pooling remains a strong baseline.[1][4]
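For illustration, the MEAN strategy can be reproduced on raw Hugging Face transformers outputs in a few lines; this sketch assumes a bert-base-uncased backbone and mirrors the mask-aware averaging described in the table above:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

batch = tokenizer(["A man is eating food.", "The girl plays violin."],
                  padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    token_embeddings = encoder(**batch).last_hidden_state  # (batch, seq, hidden)

# Zero out padding positions before averaging so they do not bias the mean.
mask = batch["attention_mask"].unsqueeze(-1).float()       # (batch, seq, 1)
sentence_embeddings = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
```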
The Siamese setup is what allows SBERT to learn embeddings that are directly comparable. During fine-tuning, two copies of the same BERT encoder (with shared weights, hence the term Siamese) process the two sentences in a pair independently. Each side runs the encoder forward, applies pooling, and produces a single vector. The two vectors are then combined according to whatever objective is being optimized. Because the weights are tied, a single model is effectively being trained to produce representations that compare well in pairs, yet at inference time it can be applied to one sentence at a time. This is the same architectural pattern used in the two-tower model family that underpins modern dense retrieval and recommendation systems.[1]
For data that comes in triplets (anchor, positive, negative), SBERT uses a triplet network with three weight-sharing copies of the encoder.[1]
The original paper proposed three loss functions to fine-tune the Siamese architecture, and the choice of loss is closely coupled to the format of the available training data.[1]
- **Classification objective.** For sentence embeddings u and v, the head computes the concatenation (u, v, |u - v|), multiplies it by a trainable weight matrix, and applies softmax cross-entropy over the three NLI labels. The element-wise absolute difference |u - v| is the crucial design choice, providing a magnitude signal the linear layer can exploit.
- **Regression objective.** Computes the cosine similarity between u and v and minimizes the mean-squared error against the target similarity score. This objective is much more sensitive to pooling choice; MAX pooling collapses badly, while MEAN and CLS both work well.
- **Triplet objective.** Operates on triples (anchor, positive, negative). The loss enforces that the distance between anchor and positive is smaller than the distance between anchor and negative by some fixed margin.

Later work in the same library generalized these into a much broader catalog of more than 20 loss functions, but two are particularly important in modern practice: the original softmax-over-NLI loss and the Multiple Negatives Ranking (MNR) loss, sometimes called in-batch negatives loss, which computes cross-entropy over cosine similarities within a batch and treats every other example in the batch as a negative for each anchor. MNR loss has become the dominant choice and consistently outperforms the original softmax classification loss (see the sketch below).[5][6]
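As from-scratch sketches of these two objectives (the library's built-in equivalents are losses.SoftmaxLoss and losses.MultipleNegativesRankingLoss; the scale value follows the library default):

```python
import torch
import torch.nn.functional as F

class SoftmaxHead(torch.nn.Module):
    """Sketch of the paper's classification objective: concatenate
    (u, v, |u - v|) and classify into the three NLI labels."""
    def __init__(self, dim: int, num_labels: int = 3):
        super().__init__()
        self.linear = torch.nn.Linear(3 * dim, num_labels)

    def forward(self, u: torch.Tensor, v: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        features = torch.cat([u, v, torch.abs(u - v)], dim=-1)
        return F.cross_entropy(self.linear(features), labels)

def mnr_loss(anchor: torch.Tensor, positive: torch.Tensor, scale: float = 20.0) -> torch.Tensor:
    """Sketch of MNR / in-batch negatives loss: for anchor i, the matching
    positive is column i of the similarity matrix; every other column in
    that row acts as a negative."""
    scores = F.normalize(anchor, dim=-1) @ F.normalize(positive, dim=-1).T * scale
    labels = torch.arange(scores.size(0), device=scores.device)
    return F.cross_entropy(scores, labels)
```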
The original SBERT paper trained on a combination of two NLI corpora and then fine-tuned further on STS-B for tasks where supervised similarity scores were available.[1]
The combined SNLI plus MultiNLI corpus is often referred to as AllNLI in the sentence-transformers documentation, and it remains a common starting point for fine-tuning new sentence encoders. In subsequent years, the community discovered that using NLI contradictions as hard negatives in a triplet or MNR setup substantially improves quality. Modern training recipes for state-of-the-art embedding models layer additional supervision from question-answer pairs, paraphrase corpora, web search click data, and synthetic data generated by large language models.[1][5][6]
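A minimal fine-tuning sketch in the classic sentence-transformers training style; the triple below is an illustrative stand-in for AllNLI-derived data (entailment as positive, contradiction as hard negative), and the hyperparameters are placeholders:

```python
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses

model = SentenceTransformer("all-MiniLM-L6-v2")

# (anchor, positive, hard negative); MNR treats the third text as a hard negative.
train_examples = [
    InputExample(texts=["A man is eating food.",
                        "A man eats a meal.",
                        "The man is fasting."]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)
```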
The Reimers and Gurevych paper evaluated SBERT on the standard suite of unsupervised STS tasks (STS12 through STS16, STS Benchmark, and SICK-R) using Spearman rank correlation. The headline numbers from the paper, averaged across the seven tasks, were as follows.[1]
| Method | Average Spearman correlation (×100) |
|---|---|
| Average GloVe embeddings | 61.32 |
| Average BERT-base token embeddings | 54.81 |
| InferSent (GloVe) | 65.01 |
| Universal Sentence Encoder | 71.22 |
| SBERT-NLI-base | 74.89 |
| SBERT-NLI-large | 76.55 |
| SRoBERTa-NLI-large | 76.68 |
SBERT-NLI-base improved on InferSent by nearly 10 points and on Universal Sentence Encoder by about 3.7 points without any task-specific supervision; the best configuration, SRoBERTa-NLI-large, widened those gaps to roughly 11.7 and 5.5 points. SBERT also improved over BERT-as-a-bi-encoder by more than 20 points, demonstrating that the contribution was not the BERT backbone alone but the Siamese fine-tuning that adapted the geometry of the embedding space.[1]
The sentence-transformers library

Alongside the paper, Reimers and Gurevych released an open-source PyTorch library called sentence-transformers, originally hosted under the UKPLab GitHub organization. The library bundles the model architecture, training code, evaluation harnesses, a wide catalog of pretrained checkpoints, and utility functions for cosine similarity, semantic search, paraphrase mining, and clustering. Its API famously reduces the experience of using a strong sentence encoder to a few lines of Python:
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-mpnet-base-v2")
embeddings = model.encode(["A man is eating food.", "A man eats a meal."])
```
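Comparing the resulting vectors is just as terse; a typical follow-up using the library's util helpers:

```python
from sentence_transformers import util

# Cosine similarity on the cached vectors; no further forward passes needed.
score = util.cos_sim(embeddings[0], embeddings[1])
```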
This ergonomic simplicity, combined with a steadily expanding zoo of pretrained checkpoints uploaded to the Hugging Face Hub under the sentence-transformers/ namespace, is what made the library so dominant. By 2024 it was being installed millions of times per month from PyPI and was a near-universal dependency in production semantic search and RAG stacks.[2][7]
In late 2024 the project was officially transferred from the UKP Lab to Hugging Face, with maintainer Tom Aarsen (who had already been the de facto lead since late 2023) continuing to drive development under the huggingface/sentence-transformers repository. Subsequent releases brought tighter integration with the Hugging Face Trainer, support for new losses (Matryoshka Loss, Cached MNR, GISTEmbed), and better tooling for fine-tuning custom embedding models.[2][7]
The sentence-transformers namespace on the Hugging Face Hub hosts hundreds of pretrained checkpoints. A handful are dominant in production usage because they hit useful sweet spots between speed, quality, and language coverage.[4]
| Model | Backbone | Output dim | Parameters | Notes |
|---|---|---|---|---|
| all-mpnet-base-v2 | MPNet base | 768 | ~110M | Long the default high-quality general-purpose English encoder, fine-tuned on more than one billion sentence pairs with MNR loss. Top performer among general-purpose SBERT checkpoints on STS-B and many MTEB English tasks. |
| all-MiniLM-L6-v2 | MiniLM (6 layers, distilled) | 384 | ~22M | The most-downloaded sentence encoder ever shipped on the Hugging Face Hub. Much faster than mpnet-base, with only a few percentage points lost on most benchmarks. The default choice when latency or memory is constrained. |
| all-MiniLM-L12-v2 | MiniLM (12 layers, distilled) | 384 | ~33M | Slightly larger MiniLM with marginally better quality, still extremely fast. |
| paraphrase-multilingual-MiniLM-L12-v2 | MiniLM (multilingual) | 384 | ~118M | Trained on parallel data covering 50+ languages so that translations of the same sentence end up close in embedding space. The standard go-to multilingual SBERT checkpoint. |
| paraphrase-multilingual-mpnet-base-v2 | MPNet (multilingual) | 768 | ~278M | Higher-quality multilingual variant. Used widely for cross-lingual semantic search. |
| multi-qa-mpnet-base-dot-v1 | MPNet base | 768 | ~110M | Tuned specifically for asymmetric question-passage retrieval. Uses dot-product rather than cosine similarity. |
| msmarco-distilbert-base-v4 | DistilBERT | 768 | ~66M | Trained on the MS MARCO passage retrieval dataset, an early standard for dense retrieval. |
The all- prefix indicates that the model was trained on the combined billion-pair dataset assembled by the sentence-transformers maintainers, while paraphrase-, multi-qa-, and msmarco- prefixes indicate domain-targeted training data.[4]
A persistent practical question is when to use the SBERT-style bi-encoder formulation versus a BERT-style cross-encoder. The answer turns on the difference between recall and re-ranking.[8]
- **Bi-encoders (SBERT-style).** Each input is encoded independently, so encoding is O(n) for a corpus of size n, and similarity comparison is O(1) per pair, with billion-scale search made tractable by approximate nearest neighbor indexes like FAISS. Quality is somewhat lower than cross-encoders because the model never sees the two inputs together.
- **Cross-encoders (BERT-style).** The two inputs are concatenated and scored jointly, which yields higher accuracy but requires a full forward pass per pair: O(n) per query, which is unworkable for retrieval over millions of documents.

The canonical production pattern that emerged is retrieve-then-rerank: an SBERT-style bi-encoder retrieves the top-k candidates with k-nearest-neighbors search, and a smaller but per-pair more expensive cross-encoder re-scores those candidates, as sketched below. The sentence-transformers library packages cross-encoders alongside bi-encoders for exactly this workflow.[8]
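A sketch of that two-stage pattern using the library's own primitives; the corpus and query are placeholders, and both model names are public sentence-transformers / cross-encoder releases:

```python
from sentence_transformers import CrossEncoder, SentenceTransformer, util

bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

corpus = ["SBERT encodes sentences into dense vectors.",
          "The weather in Darmstadt is mild in spring.",
          "Bi-encoders embed each input independently."]
corpus_emb = bi_encoder.encode(corpus, convert_to_tensor=True)

query = "How do bi-encoders represent text?"
query_emb = bi_encoder.encode(query, convert_to_tensor=True)

# Stage 1: cheap recall with the bi-encoder (top-k nearest neighbors).
hits = util.semantic_search(query_emb, corpus_emb, top_k=2)[0]

# Stage 2: expensive, accurate re-scoring of only the k candidates.
pairs = [(query, corpus[hit["corpus_id"]]) for hit in hits]
rerank_scores = cross_encoder.predict(pairs)
```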
SBERT and its descendants are used wherever a fixed-size dense vector representation of text is useful: semantic search, retrieval for RAG pipelines, clustering and deduplication, paraphrase mining, and recommendation.[2][8]
Over the years, the community has settled on a few standard evaluations for sentence and document encoders, most notably the unsupervised STS suite described above, the BEIR benchmark for zero-shot retrieval, and the Massive Text Embedding Benchmark (MTEB).
The years since 2019 have seen an explosion of embedding models that build on the SBERT recipe. Most retain the bi-encoder architecture and contrastive training objective; the main innovations are larger and more diverse training data, better backbone models (often initialized from instruction-tuned LLMs rather than BERT), and techniques like Matryoshka representation learning that allow a single model to produce embeddings at multiple dimensionalities.[9]
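As an illustration of the Matryoshka idea, consuming such embeddings amounts to truncating to a prefix and re-normalizing; this sketch assumes the producing model was trained with a Matryoshka objective, so the leading dimensions carry most of the signal:

```python
import numpy as np

def truncate_embedding(embedding: np.ndarray, dim: int) -> np.ndarray:
    """Keep only the first `dim` dimensions of a Matryoshka-trained embedding
    and re-normalize so cosine similarity remains meaningful."""
    prefix = embedding[:dim]
    return prefix / np.linalg.norm(prefix)
```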
| Model family | Provider | Default dim | Open weights | Notes |
|---|---|---|---|---|
| all-mpnet-base-v2 (SBERT) | UKP Lab / Hugging Face | 768 | Yes | The classic open-source baseline. Widely used as a default. |
| BGE (bge-large-en-v1.5, bge-m3) | BAAI (Beijing Academy of AI) | 1024 (large), variable (m3) | Yes | The BGE series became the open-source state-of-the-art in 2023 and 2024. bge-m3 supports dense, sparse, and ColBERT-style multi-vector retrieval in one model with 100+ language coverage and 8K context. |
| GTE (gte-large-en-v1.5, gte-Qwen2-1.5B-instruct) | Alibaba DAMO | 1024+ | Yes | Strong open-source competitor; recent variants are LLM-initialized. |
| Jina Embeddings (jina-embeddings-v3) | Jina AI | 1024 (truncatable via Matryoshka) | Yes | Long-context (8K tokens) and multilingual. Designed for production retrieval. |
| OpenAI text-embedding-3-small and text-embedding-3-large | OpenAI | 1536 / 3072 (truncatable) | No | Closed API; replaced ada-002 in 2024. Uses Matryoshka so dimensions can be shortened. The small model is the dominant production embedding API by usage. |
| Voyage voyage-3-large, voyage-4 | Voyage AI (MongoDB) | 1024 default, up to 2048 | No | Top performer on retrieval benchmarks. Supports int8 and binary quantization for very low storage cost. |
| Cohere embed-v4 | Cohere | 1024+ | No | Multilingual and multimodal embeddings. Strong on enterprise retrieval workloads. |
| Google Gemini Embedding | Google DeepMind | up to 3072 (truncatable) | No | Multimodal embeddings spanning text, image, video, audio, and PDF. |
| NV-Embed, Llama-Embed-Nemotron | NVIDIA | 4096 | Yes | Decoder-LLM-initialized embeddings that top current MTEB tables, at significant compute cost. |
Despite the proliferation of newer models, SBERT itself and the simple all-MiniLM-L6-v2 in particular remain extraordinarily popular because they are tiny, fast, and good enough for many real workloads. A small embedding model with a well-tuned cross-encoder reranker often beats a huge embedding model used alone, both on quality and on cost.[8][9]
A few practical lessons have accumulated in the years since the SBERT paper.
- Most checkpoints are trained for cosine similarity on normalized vectors, but some retrieval-tuned models (such as multi-qa-*-dot-v1) are trained for raw dot product and should not be normalized.
- Input length is a real constraint: classic SBERT checkpoints truncate inputs at a few hundred tokens, while newer models like bge-m3 and jina-embeddings-v3 support 8K context.

It is hard to overstate the practical impact of the SBERT line of work. The paper itself has accumulated more than 10,000 citations, but the influence is more visible in deployed systems than in the citation graph: sentence-transformers is a hidden dependency of essentially every production semantic search system built between 2020 and 2024, the BEIR and MTEB benchmarks were created in part to evaluate SBERT-style models, and the bi-encoder plus cross-encoder reranker pattern that SBERT crystallized has become the canonical architecture for dense retrieval. The current crop of state-of-the-art embedding models (BGE, GTE, Voyage, Jina, OpenAI text-embedding-3, Cohere embed-v4, NV-Embed) all retain the bi-encoder Siamese-fine-tuning recipe at their core; they are, in a real sense, scaled-up and refined SBERTs.[1][2][8][9]
SBERT also did something important sociologically. By packaging an empirically strong recipe in a clean Python API and seeding the Hugging Face Hub with high-quality pretrained checkpoints, it made dense semantic embeddings genuinely accessible to developers who were not NLP researchers. That accessibility, more than any single benchmark number, is why SBERT remains the default starting point for anyone who needs a sentence embedding in 2026.