Sentence-BERT (SBERT)
Sentence-BERT, almost always abbreviated SBERT, is a modification of the pretrained BERT transformer network designed to produce semantically meaningful, fixed-size sentence embeddings that can be compared with simple cosine similarity. It was introduced by Nils Reimers and Iryna Gurevych at the Ubiquitous Knowledge Processing (UKP) Lab at the Technical University of Darmstadt in their EMNLP 2019 paper Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. SBERT solved a critical efficiency problem that had been holding back BERT from large-scale semantic similarity tasks, and the open-source library that implements it, sentence-transformers, has since become one of the most widely deployed pieces of NLP software in the world. It powers a substantial fraction of the embedding pipelines behind modern semantic search systems, retrieval augmented generation stacks, clustering tools, and recommendation engines.[1][2]
The core idea is conceptually simple but pragmatically transformative. Vanilla BERT, although extremely strong on sentence-pair classification, requires both sentences to be concatenated and fed through the network jointly. This is fine for a single comparison but catastrophic for tasks like nearest-neighbor search over a corpus, because the cost grows with the square of the corpus size. SBERT replaces this cross-encoder formulation with a bi-encoder built from a Siamese (or triplet) configuration of BERT models. After fine-tuning on natural language inference and semantic similarity data, each sentence can be encoded once into a single dense vector, and then arbitrary pairs of vectors can be compared in microseconds. The reduction the original paper reports is striking: finding the most similar pair in a collection of 10,000 sentences with vanilla BERT takes roughly 65 hours; the same task with SBERT takes about 5 seconds, while preserving most of the accuracy advantage that BERT brought over earlier sentence embedding methods.[1]
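As a concrete sketch of this encode-once, compare-many workflow (the three-sentence corpus is illustrative; all-MiniLM-L6-v2 is one of the public checkpoints discussed later):

```python
import torch
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# Each sentence is encoded exactly once: n forward passes instead of the
# n*(n-1)/2 joint passes a cross-encoder would need for all pairs.
sentences = ["A man is eating food.", "A man eats a meal.", "The girl plays violin."]
embeddings = model.encode(sentences, convert_to_tensor=True)

# All-pairs cosine similarity is a single matrix product over cached vectors.
scores = util.cos_sim(embeddings, embeddings)
scores.fill_diagonal_(-1.0)  # ignore self-similarity

i, j = divmod(int(scores.argmax()), scores.size(1))
print(f"Most similar pair: {sentences[i]!r} / {sentences[j]!r} ({float(scores[i, j]):.3f})")
```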
Before SBERT, the field had a fragmented set of options for representing whole sentences as fixed-length vectors. Methods like averaged word embeddings (averaging GloVe or word2vec vectors), Skip-Thought, InferSent, and the Universal Sentence Encoder were each useful but plateaued well below what large pretrained transformers seemed to promise. When BERT arrived in late 2018, it set new state-of-the-art results across an enormous range of NLP benchmarks, including sentence-pair tasks like STS-B, MRPC, and natural language inference. However, BERT's design treated sentence pairs as a joint input: the two sentences were concatenated with a [SEP] token, fed together through the transformer stack, and a classification head on top of the [CLS] token produced a similarity score or label. This cross-encoding approach is what produced the strong scores, but it makes BERT unsuitable for tasks where you need a stand-alone vector representation of a single sentence.[1]
The naive workaround, which many practitioners initially tried, was to feed individual sentences through BERT and either take the [CLS] token output or average the contextualized token vectors, treating the result as a sentence embedding. Reimers and Gurevych showed that this naive approach actually performed worse than averaging GloVe embeddings on the standard semantic textual similarity tasks: averaging BERT-base outputs achieved an average Spearman correlation (×100) of just 54.81 across STS12 through STS16, STS Benchmark, and SICK-R, while averaged GloVe embeddings reached 61.32. This counterintuitive result motivated the entire SBERT project. BERT's representations were rich, but its training objectives (masked language modeling and next-sentence prediction) had not aligned the geometry of the embedding space with semantic similarity.[1][3]
SBERT keeps the BERT (or RoBERTa, DistilBERT, ALBERT, MPNet, etc.) transformer encoder as its backbone but adds two key elements on top: a pooling layer that converts the variable-length sequence of token vectors into a fixed-size sentence vector, and a Siamese training setup that fine-tunes the encoder so that semantically similar sentences end up close together in the resulting vector space.[1]
A standard BERT encoder emits a sequence of contextualized token embeddings whose length depends on the input. SBERT pools this sequence into a single vector. The original paper explored three pooling operations and found that the choice matters, especially when training with a regression-style objective.[1]
| Pooling strategy | What it does | Notes |
|---|---|---|
| CLS | Use the embedding of the special [CLS] token as the sentence vector. | Simple, matches how BERT is fine-tuned for classification, but underperforms mean pooling for semantic similarity. |
| MEAN | Average all token embeddings, optionally weighted by the attention mask so padding does not bias the result. | Default in sentence-transformers. Best overall tradeoff between quality and stability across tasks. |
| MAX | Take an element-wise maximum over the sequence dimension (max-over-time). | Performs noticeably worse than CLS or MEAN when the model is trained with a regression objective. |
Mean pooling has become the de facto standard, and almost every popular SBERT-style model on the Hugging Face Hub uses it. Newer variants sometimes use a learned weighted pool or a small attention head, but mean pooling remains a strong baseline.[1][4]
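For illustration, the MEAN strategy can be reproduced on raw Hugging Face transformers outputs in a few lines; this sketch assumes a bert-base-uncased backbone and mirrors the mask-aware averaging described in the table above:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

batch = tokenizer(["A man is eating food.", "The girl plays violin."],
                  padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    token_embeddings = encoder(**batch).last_hidden_state  # (batch, seq, hidden)

# Zero out padding positions before averaging so they do not bias the mean.
mask = batch["attention_mask"].unsqueeze(-1).float()       # (batch, seq, 1)
sentence_embeddings = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
```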
The Siamese setup is what allows SBERT to learn embeddings that are directly comparable. During fine-tuning, two copies of the same BERT encoder (with shared weights, hence the term Siamese) process the two sentences in a pair independently. Each side runs the encoder forward, applies pooling, and produces a single vector. The two vectors are then combined according to whatever objective is being optimized. Because the weights are tied, a single model is effectively being trained to produce representations that compare well in pairs, yet at inference time it can be applied to one sentence at a time. This is the same architectural pattern used in the two-tower model family that underpins modern dense retrieval and recommendation systems.[1]
For data that comes in triplets (anchor, positive, negative), SBERT uses a triplet network with three weight-sharing copies of the encoder.[1]
The original paper proposed three loss functions to fine-tune the Siamese architecture, and the choice of loss is closely coupled to the format of the available training data.[1]
- **Classification objective.** For sentence embeddings u and v, the head computes the concatenation (u, v, |u - v|), multiplies it by a trainable weight matrix, and applies softmax cross-entropy over the three NLI labels. The element-wise absolute difference |u - v| is the crucial design choice, providing a magnitude signal the linear layer can exploit.
- **Regression objective.** Computes the cosine similarity between u and v and minimizes the mean-squared error against the target similarity score. This objective is much more sensitive to pooling choice; MAX pooling collapses badly, while MEAN and CLS both work well.
- **Triplet objective.** Operates on triples (anchor, positive, negative). The loss enforces that the distance between anchor and positive is smaller than the distance between anchor and negative by some fixed margin.

Later work in the same library generalized these into a much broader catalog of more than 20 loss functions, but two are particularly important in modern practice: the original softmax-over-NLI loss and the Multiple Negatives Ranking (MNR) loss, sometimes called in-batch negatives loss, which computes cross-entropy over cosine similarities within a batch and treats every other example in the batch as a negative for each anchor. MNR loss has become the dominant choice and consistently outperforms the original softmax classification loss (see the sketch below).[5][6]
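As from-scratch sketches of these two objectives (the library's built-in equivalents are losses.SoftmaxLoss and losses.MultipleNegativesRankingLoss; the scale value follows the library default):

```python
import torch
import torch.nn.functional as F

class SoftmaxHead(torch.nn.Module):
    """Sketch of the paper's classification objective: concatenate
    (u, v, |u - v|) and classify into the three NLI labels."""
    def __init__(self, dim: int, num_labels: int = 3):
        super().__init__()
        self.linear = torch.nn.Linear(3 * dim, num_labels)

    def forward(self, u: torch.Tensor, v: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        features = torch.cat([u, v, torch.abs(u - v)], dim=-1)
        return F.cross_entropy(self.linear(features), labels)

def mnr_loss(anchor: torch.Tensor, positive: torch.Tensor, scale: float = 20.0) -> torch.Tensor:
    """Sketch of MNR / in-batch negatives loss: for anchor i, the matching
    positive is column i of the similarity matrix; every other column in
    that row acts as a negative."""
    scores = F.normalize(anchor, dim=-1) @ F.normalize(positive, dim=-1).T * scale
    labels = torch.arange(scores.size(0), device=scores.device)
    return F.cross_entropy(scores, labels)
```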
The original SBERT paper trained on a combination of two NLI corpora and then fine-tuned further on STS-B for tasks where supervised similarity scores were available.[1]
The combined SNLI plus MultiNLI corpus is often referred to as AllNLI in the sentence-transformers documentation, and it remains a common starting point for fine-tuning new sentence encoders. In subsequent years, the community discovered that using NLI contradictions as hard negatives in a triplet or MNR setup substantially improves quality. Modern training recipes for state-of-the-art embedding models layer additional supervision from question-answer pairs, paraphrase corpora, web search click data, and synthetic data generated by large language models.[1][5][6]
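A minimal fine-tuning sketch in the classic sentence-transformers training style; the triple below is an illustrative stand-in for AllNLI-derived data (entailment as positive, contradiction as hard negative), and the hyperparameters are placeholders:

```python
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses

model = SentenceTransformer("all-MiniLM-L6-v2")

# (anchor, positive, hard negative); MNR treats the third text as a hard negative.
train_examples = [
    InputExample(texts=["A man is eating food.",
                        "A man eats a meal.",
                        "The man is fasting."]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)
```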
The Reimers and Gurevych paper evaluated SBERT on the standard suite of unsupervised STS tasks (STS12 through STS16, STS Benchmark, and SICK-R) using Spearman rank correlation. The headline numbers from the paper, averaged across the seven tasks, were as follows.[1]
| Method | Average Spearman correlation (×100) |
|---|---|
| Average GloVe embeddings | 61.32 |
| Average BERT-base token embeddings | 54.81 |
| InferSent (GloVe) | 65.01 |
| Universal Sentence Encoder | 71.22 |
| SBERT-NLI-base | 74.89 |
| SBERT-NLI-large | 76.55 |
| SRoBERTa-NLI-large | 76.68 |
SBERT-NLI-base improved on InferSent by nearly 10 points and on Universal Sentence Encoder by about 3.7 points without any task-specific supervision; the best configuration, SRoBERTa-NLI-large, widened those gaps to roughly 11.7 and 5.5 points. SBERT also improved over BERT-as-a-bi-encoder by more than 20 points, demonstrating that the contribution was not the BERT backbone alone but the Siamese fine-tuning that adapted the geometry of the embedding space.[1]
The sentence-transformers library

Alongside the paper, Reimers and Gurevych released an open-source PyTorch library called sentence-transformers, originally hosted under the UKPLab GitHub organization. The library bundles the model architecture, training code, evaluation harnesses, a wide catalog of pretrained checkpoints, and utility functions for cosine similarity, semantic search, paraphrase mining, and clustering. Its API famously reduces the experience of using a strong sentence encoder to a few lines of Python:
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-mpnet-base-v2")
embeddings = model.encode(["A man is eating food.", "A man eats a meal."])
```
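Comparing the resulting vectors is just as terse; a typical follow-up using the library's util helpers:

```python
from sentence_transformers import util

# Cosine similarity on the cached vectors; no further forward passes needed.
score = util.cos_sim(embeddings[0], embeddings[1])
```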
This ergonomic simplicity, combined with a steadily expanding zoo of pretrained checkpoints uploaded to the Hugging Face Hub under the sentence-transformers/ namespace, is what made the library so dominant. By 2024 it was being installed millions of times per month from PyPI and was a near-universal dependency in production semantic search and RAG stacks.[2][7]
In late 2024 the project was officially transferred from the UKP Lab to Hugging Face, with maintainer Tom Aarsen (who had already been the de facto lead since late 2023) continuing to drive development under the huggingface/sentence-transformers repository. Subsequent releases brought tighter integration with the Hugging Face Trainer, support for new losses (Matryoshka Loss, Cached MNR, GISTEmbed), and better tooling for fine-tuning custom embedding models.[2][7]
The sentence-transformers namespace on the Hugging Face Hub hosts hundreds of pretrained checkpoints. A handful are dominant in production usage because they hit useful sweet spots between speed, quality, and language coverage.[4]
| Model | Backbone | Output dim | Parameters | Notes |
|---|---|---|---|---|
| all-mpnet-base-v2 | MPNet base | 768 | ~110M | Long the default high-quality general-purpose English encoder, fine-tuned on more than one billion sentence pairs with MNR loss. Top performer among general-purpose SBERT checkpoints on STS-B and many MTEB English tasks. |
| all-MiniLM-L6-v2 | MiniLM (6 layers, distilled) | 384 | ~22M | The most-downloaded sentence encoder ever shipped on the Hugging Face Hub. Much faster than mpnet-base, with only a few percentage points lost on most benchmarks. The default choice when latency or memory is constrained. |
| all-MiniLM-L12-v2 | MiniLM (12 layers, distilled) | 384 | ~33M | Slightly larger MiniLM with marginally better quality, still extremely fast. |
| paraphrase-multilingual-MiniLM-L12-v2 | MiniLM (multilingual) | 384 | ~118M | Trained on parallel data covering 50+ languages so that translations of the same sentence end up close in embedding space. The standard go-to multilingual SBERT checkpoint. |
| paraphrase-multilingual-mpnet-base-v2 | MPNet (multilingual) | 768 | ~278M | Higher-quality multilingual variant. Used widely for cross-lingual semantic search. |
| multi-qa-mpnet-base-dot-v1 | MPNet base | 768 | ~110M | Tuned specifically for asymmetric question-passage retrieval. Uses dot-product rather than cosine similarity. |
| msmarco-distilbert-base-v4 | DistilBERT | 768 | ~66M | Trained on the MS MARCO passage retrieval dataset, an early standard for dense retrieval. |
The all- prefix indicates that the model was trained on the combined billion-pair dataset assembled by the sentence-transformers maintainers, while paraphrase-, multi-qa-, and msmarco- prefixes indicate domain-targeted training data.[4]
A persistent practical question is when to use the SBERT-style bi-encoder formulation versus a BERT-style cross-encoder. The answer turns on the difference between recall and re-ranking.[8]
- **Bi-encoders (SBERT-style).** Each input is encoded independently, so encoding is O(n) for a corpus of size n, and similarity comparison is O(1) per pair, with billion-scale search made tractable by approximate nearest neighbor indexes like FAISS. Quality is somewhat lower than cross-encoders because the model never sees the two inputs together.
- **Cross-encoders (BERT-style).** The two inputs are concatenated and scored jointly, which yields higher accuracy but requires a full forward pass per pair: O(n) per query, which is unworkable for retrieval over millions of documents.

The canonical production pattern that emerged is retrieve-then-rerank: an SBERT-style bi-encoder retrieves the top-k candidates with k-nearest-neighbors search, and a smaller but per-pair more expensive cross-encoder re-scores those candidates, as sketched below. The sentence-transformers library packages cross-encoders alongside bi-encoders for exactly this workflow.[8]
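A sketch of that two-stage pattern using the library's own primitives; the corpus and query are placeholders, and both model names are public sentence-transformers / cross-encoder releases:

```python
from sentence_transformers import CrossEncoder, SentenceTransformer, util

bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

corpus = ["SBERT encodes sentences into dense vectors.",
          "The weather in Darmstadt is mild in spring.",
          "Bi-encoders embed each input independently."]
corpus_emb = bi_encoder.encode(corpus, convert_to_tensor=True)

query = "How do bi-encoders represent text?"
query_emb = bi_encoder.encode(query, convert_to_tensor=True)

# Stage 1: cheap recall with the bi-encoder (top-k nearest neighbors).
hits = util.semantic_search(query_emb, corpus_emb, top_k=2)[0]

# Stage 2: expensive, accurate re-scoring of only the k candidates.
pairs = [(query, corpus[hit["corpus_id"]]) for hit in hits]
rerank_scores = cross_encoder.predict(pairs)
```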
SBERT and its descendants are used wherever a fixed-size dense vector representation of text is useful: semantic search, retrieval for RAG pipelines, clustering and deduplication, paraphrase mining, and recommendation.[2][8]
Over the years, the community has settled on a few standard evaluations for sentence and document encoders, most notably the unsupervised STS suite described above, the BEIR benchmark for zero-shot retrieval, and the Massive Text Embedding Benchmark (MTEB).
The years since 2019 have seen an explosion of embedding models that build on the SBERT recipe. Most retain the bi-encoder architecture and contrastive training objective; the main innovations are larger and more diverse training data, better backbone models (often initialized from instruction-tuned LLMs rather than BERT), and techniques like Matryoshka representation learning that allow a single model to produce embeddings at multiple dimensionalities.[9]
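As an illustration of the Matryoshka idea, consuming such embeddings amounts to truncating to a prefix and re-normalizing; this sketch assumes the producing model was trained with a Matryoshka objective, so the leading dimensions carry most of the signal:

```python
import numpy as np

def truncate_embedding(embedding: np.ndarray, dim: int) -> np.ndarray:
    """Keep only the first `dim` dimensions of a Matryoshka-trained embedding
    and re-normalize so cosine similarity remains meaningful."""
    prefix = embedding[:dim]
    return prefix / np.linalg.norm(prefix)
```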
| Model family | Provider | Default dim | Open weights | Notes |
|---|---|---|---|---|
| all-mpnet-base-v2 (SBERT) | UKP Lab / Hugging Face | 768 | Yes | The classic open-source baseline. Widely used as a default. |
| BGE (bge-large-en-v1.5, bge-m3) | BAAI (Beijing Academy of AI) | 1024 (large), variable (m3) | Yes | The BGE series became the open-source state-of-the-art in 2023 and 2024. bge-m3 supports dense, sparse, and ColBERT-style multi-vector retrieval in one model with 100+ language coverage and 8K context. |
| GTE (gte-large-en-v1.5, gte-Qwen2-1.5B-instruct) | Alibaba DAMO | 1024+ | Yes | Strong open-source competitor; recent variants are LLM-initialized. |
| Jina Embeddings (jina-embeddings-v3) | Jina AI | 1024 (truncatable via Matryoshka) | Yes | Long-context (8K tokens) and multilingual. Designed for production retrieval. |
| OpenAI text-embedding-3-small and text-embedding-3-large | OpenAI | 1536 / 3072 (truncatable) | No | Closed API; replaced ada-002 in 2024. Uses Matryoshka so dimensions can be shortened. The small model is the dominant production embedding API by usage. |
| Voyage voyage-3-large, voyage-4 | Voyage AI (MongoDB) | 1024 default, up to 2048 | No | Top performer on retrieval benchmarks. Supports int8 and binary quantization for very low storage cost. |
| Cohere embed-v4 | Cohere | 1024+ | No | Multilingual and multimodal embeddings. Strong on enterprise retrieval workloads. |
| Google Gemini Embedding | Google DeepMind | up to 3072 (truncatable) | No | Multimodal embeddings spanning text, image, video, audio, and PDF. |
| NV-Embed, Llama-Embed-Nemotron | NVIDIA | 4096 | Yes | Decoder-LLM-initialized embeddings that top current MTEB tables, at significant compute cost. |
Despite the proliferation of newer models, SBERT itself and the simple all-MiniLM-L6-v2 in particular remain extraordinarily popular because they are tiny, fast, and good enough for many real workloads. A small embedding model with a well-tuned cross-encoder reranker often beats a huge embedding model used alone, both on quality and on cost.[8][9]
A few practical lessons have accumulated in the years since the SBERT paper.
- Most checkpoints are trained for cosine similarity on normalized vectors, but some retrieval-tuned models (such as multi-qa-*-dot-v1) are trained for raw dot product and should not be normalized.
- Input length is a real constraint: classic SBERT checkpoints truncate inputs at a few hundred tokens, while newer models like bge-m3 and jina-embeddings-v3 support 8K context.

It is hard to overstate the practical impact of the SBERT line of work. The paper itself has accumulated more than 10,000 citations, but the influence is more visible in deployed systems than in the citation graph: sentence-transformers is a hidden dependency of essentially every production semantic search system built between 2020 and 2024, the BEIR and MTEB benchmarks were created in part to evaluate SBERT-style models, and the bi-encoder plus cross-encoder reranker pattern that SBERT crystallized has become the canonical architecture for dense retrieval. The current crop of state-of-the-art embedding models (BGE, GTE, Voyage, Jina, OpenAI text-embedding-3, Cohere embed-v4, NV-Embed) all retain the bi-encoder Siamese-fine-tuning recipe at their core; they are, in a real sense, scaled-up and refined SBERTs.[1][2][8][9]
SBERT also did something important sociologically. By packaging an empirically strong recipe in a clean Python API and seeding the Hugging Face Hub with high-quality pretrained checkpoints, it made dense semantic embeddings genuinely accessible to developers who were not NLP researchers. That accessibility, more than any single benchmark number, is why SBERT remains the default starting point for anyone who needs a sentence embedding in 2026.