# Sentence-BERT (SBERT)

> Source: https://aiwiki.ai/wiki/sentence-bert
> Updated: 2026-06-21
> Categories: AI Models, Natural Language Processing
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

**Sentence-BERT (SBERT)** is a modification of the pretrained [BERT](/wiki/bert) [transformer](/wiki/transformer) network that produces semantically meaningful, fixed-size sentence embeddings comparable with simple cosine similarity. Introduced by Nils Reimers and Iryna Gurevych at the Technical University of Darmstadt in their EMNLP 2019 paper *Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks*, it reduced the cost of finding the most similar sentence pair in a collection of 10,000 sentences from about 65 hours with vanilla BERT to roughly 5 seconds with SBERT, while preserving most of BERT's accuracy.[1] The open-source library that implements it, `sentence-transformers`, has become one of the most widely deployed pieces of NLP software in the world, serving more than one million monthly unique users and hosting over 16,000 models on the [Hugging Face](/wiki/hugging_face) Hub as of 2025.[2][7] It powers a substantial fraction of the embedding pipelines behind modern [semantic search](/wiki/semantic_search) systems, [retrieval augmented generation](/wiki/retrieval_augmented_generation) stacks, clustering tools, and recommendation engines.[1][2]

The core idea is conceptually simple but pragmatically transformative. Vanilla BERT, although extremely strong on sentence-pair classification, requires both sentences to be concatenated and fed through the network jointly. This is fine for a single comparison but catastrophic for tasks like nearest-neighbor search over a corpus, because the cost grows with the square of the corpus size. SBERT replaces this cross-encoder formulation with a bi-encoder built from a Siamese (or triplet) configuration of BERT models. After fine-tuning on natural language inference and semantic similarity data, each sentence can be encoded once into a single dense vector, and then arbitrary pairs of vectors can be compared in microseconds. As the authors put it, SBERT "uses siamese and triplet network structures to derive semantically meaningful sentence embeddings that can be compared using cosine-similarity," reducing the effort to find the most similar pair "from 65 hours with BERT / RoBERTa to about 5 seconds with SBERT, while maintaining the accuracy from BERT."[1]
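The quadratic blow-up can be checked with a little arithmetic. The sketch below counts the unordered pairs in a collection and, taking the 65-hour figure quoted above at face value, backs out the implied cost of a single cross-encoder comparison. The helper name `pair_count` is illustrative and does not come from the paper or the library.

```python
def pair_count(n: int) -> int:
    """Number of unordered sentence pairs in a collection of n sentences."""
    return n * (n - 1) // 2


n_sentences = 10_000
pairs = pair_count(n_sentences)  # joint forward passes a cross-encoder would need

# Implied cost per comparison if scoring every pair takes about 65 hours.
seconds_per_pair = 65 * 3600 / pairs

print(f"{pairs:,} pairs, ~{seconds_per_pair * 1000:.1f} ms per pair")
```

A bi-encoder needs only 10,000 encoder passes for the same collection, one per sentence, after which every comparison is a cheap vector operation.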

## What problem did SBERT solve?

Before SBERT, the field had a fragmented set of options for representing whole sentences as fixed-length vectors. Methods like averaged word embeddings (averaging GloVe or word2vec vectors), Skip-Thought, InferSent, and the Universal Sentence Encoder were each useful but plateaued well below what large pretrained transformers seemed to promise. When BERT arrived in late 2018, it set new state-of-the-art results across an enormous range of NLP benchmarks, including sentence-pair tasks like STS-B, MRPC, and natural language inference. However, BERT's design treated sentence pairs as a joint input: the two sentences were concatenated with a `[SEP]` token, fed together through the transformer stack, and a classification head on top of the `[CLS]` token produced a similarity score or label. This cross-encoding approach is what produced the strong scores, but it makes BERT unsuitable for tasks where you need a stand-alone vector representation of a single sentence.[1]

The naive workaround, which many practitioners initially tried, was to feed individual sentences through BERT and either take the `[CLS]` token output or average the contextualized token vectors and treat the result as a sentence embedding. Reimers and Gurevych showed that both variants actually performed *worse* than averaging GloVe embeddings on the standard semantic textual similarity tasks. Averaging BERT-base token outputs achieved an average Spearman correlation of only 54.81 across STS12 through STS16, STSbenchmark, and SICK-R, and the `[CLS]` vector only 29.19, while averaged GloVe embeddings reached 61.32. This counterintuitive result motivated the entire SBERT project. BERT's representations were rich, but its training objectives (masked language modeling and next-sentence prediction) had not aligned the geometry of the embedding space with semantic similarity.[1][3]

## How does the SBERT architecture work?

SBERT keeps the BERT (or RoBERTa, DistilBERT, ALBERT, MPNet, etc.) transformer encoder as its backbone but adds two key elements on top: a pooling layer that converts the variable-length sequence of token vectors into a fixed-size sentence vector, and a Siamese training setup that fine-tunes the encoder so that semantically similar sentences end up close together in the resulting vector space.[1]

### Pooling strategies

A standard BERT encoder emits a sequence of contextualized token embeddings whose length depends on the input. SBERT pools this sequence into a single vector. The original paper explored three pooling operations and found that the choice matters, especially when training with a regression-style objective.[1]

| Pooling strategy | What it does | Notes |
|---|---|---|
| CLS | Use the embedding of the special `[CLS]` token as the sentence vector. | Simple, matches how BERT is fine-tuned for classification, but underperforms mean pooling for semantic similarity. |
| MEAN | Average all token embeddings, optionally weighted by the attention mask so padding does not bias the result. | Default in `sentence-transformers`. Best overall tradeoff between quality and stability across tasks. |
| MAX | Take an element-wise maximum over the sequence dimension (max-over-time). | Performs noticeably worse than CLS or MEAN when the model is trained with a regression objective. |

Mean pooling has become the de facto standard, and almost every popular SBERT-style model on the [Hugging Face](/wiki/hugging_face) Hub uses it. Newer variants sometimes use a learned weighted pool or a small attention head, but mean pooling remains a strong baseline.[1][4]
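The three pooling operations in the table can be sketched in plain Python. This is a toy illustration on nested lists rather than the library's implementation; the function names and the sample values are made up for the example, and the first position is assumed to hold the `[CLS]` token.

```python
from typing import List

Vector = List[float]


def cls_pool(tokens: List[Vector]) -> Vector:
    """Use the first token's embedding ([CLS] in BERT-style inputs)."""
    return list(tokens[0])


def mean_pool(tokens: List[Vector], mask: List[int]) -> Vector:
    """Average the embeddings of real tokens only (mask == 1)."""
    kept = [t for t, m in zip(tokens, mask) if m]
    return [sum(column) / len(kept) for column in zip(*kept)]


def max_pool(tokens: List[Vector], mask: List[int]) -> Vector:
    """Element-wise maximum over real tokens (max-over-time)."""
    kept = [t for t, m in zip(tokens, mask) if m]
    return [max(column) for column in zip(*kept)]


# Toy input: three real tokens followed by one padding position, dimension 2.
token_embeddings = [[1.0, 0.0], [3.0, 2.0], [2.0, 4.0], [9.0, 9.0]]
attention_mask = [1, 1, 1, 0]

print(cls_pool(token_embeddings))                   # [1.0, 0.0]
print(mean_pool(token_embeddings, attention_mask))  # [2.0, 2.0]
print(max_pool(token_embeddings, attention_mask))   # [3.0, 4.0]
```

Without the mask, the padding vector `[9.0, 9.0]` would leak into both the mean and the max, which is why masked pooling matters for batched inputs.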

### Siamese and triplet networks

The Siamese setup is what allows SBERT to learn embeddings that are directly comparable. During fine-tuning, two copies of the same BERT encoder (with shared weights, hence the term *Siamese*) process the two sentences in a pair independently. Each side runs the encoder forward, applies pooling, and produces a single vector. The two vectors are then combined according to whatever objective is being optimized. Because the weights are tied, what is really happening is that one model is being trained to produce representations that work well in pairs, but at inference time the model can be applied to a single sentence at a time. This is the same architectural pattern used in the foundational [two-tower model](/wiki/two-tower_model) family that underpins modern dense retrieval and recommendation systems.[1]

For data that comes in triplets (anchor, positive, negative), SBERT uses a triplet network with three weight-sharing copies of the encoder.[1]

### Training objectives

The original paper proposed three loss functions to fine-tune the Siamese architecture, and the choice of loss is closely coupled to the format of the available training data.[1]

- **Classification objective**: Used for natural language inference (NLI) data with discrete labels (entailment, contradiction, neutral). Given two sentence vectors `u` and `v`, the head computes the concatenation `(u, v, |u - v|)`, multiplies it by a trainable weight matrix, and applies softmax cross-entropy over the three NLI labels. The element-wise absolute difference `|u - v|` is the crucial design choice, providing a magnitude signal the linear layer can exploit.
- **Regression objective**: Used for STS-B style data where each sentence pair carries a real-valued similarity score. The loss is mean squared error between the cosine similarity of `u` and `v` and the target score. This objective is much more sensitive to pooling choice; MAX pooling collapses badly, while MEAN and CLS both work well.
- **Triplet objective**: Used when training data comes as triplets `(anchor, positive, negative)`. The loss enforces that the distance between anchor and positive is smaller than the distance between anchor and negative by some fixed margin.
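The three objectives can be written down for a single training example as follows. This is a minimal sketch in plain Python, not the library's code: the trainable softmax layer of the classification objective is left out (only its input features are built), the regression target is assumed to be already rescaled to the range of cosine similarity (for example a 0 to 5 score divided by 5), and the Euclidean distance with a margin of 1 reflects the setup reported in the paper rather than a requirement.

```python
import math
from typing import List

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def classification_features(u: Vector, v: Vector) -> Vector:
    """Concatenate (u, v, |u - v|); a trainable softmax layer would consume this."""
    return u + v + [abs(a - b) for a, b in zip(u, v)]


def regression_loss(u: Vector, v: Vector, target: float) -> float:
    """Squared error between cosine similarity and the gold score for one pair."""
    return (cosine(u, v) - target) ** 2


def triplet_loss(anchor: Vector, positive: Vector, negative: Vector,
                 margin: float = 1.0) -> float:
    """max(d(a, p) - d(a, n) + margin, 0) with Euclidean distance d."""
    gap = math.dist(anchor, positive) - math.dist(anchor, negative)
    return max(gap + margin, 0.0)


u, v = [1.0, 0.0], [0.0, 1.0]
print(classification_features(u, v))                        # [1.0, 0.0, 0.0, 1.0, 1.0, 1.0]
print(regression_loss(u, v, target=0.0))                    # 0.0
print(triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0]))     # 0.0, negative is already far away
print(triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0]))     # 5.0, positive is farther than negative
```

The feature vector has three times the embedding dimension, so the classification head's weight matrix has shape `3n x k` for embedding size `n` and `k` labels.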

Later work in the same library generalized these into a much broader catalog of more than 20 loss functions, but two are particularly important in modern practice: the original softmax-over-NLI loss and the **Multiple Negatives Ranking Loss (MNR)**, sometimes called in-batch negatives loss, which computes cross entropy over cosine similarities within a batch and treats every other example in the batch as a negative for each anchor. MNR loss has become the dominant choice and consistently outperforms the original softmax classification loss.[5][6]
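To make the in-batch negatives idea concrete, here is a small sketch of an MNR-style loss in plain Python. It is an illustration under stated assumptions, not the library's implementation: the function name `mnr_loss` is hypothetical, and the scale factor of 20 applied to the cosine similarities is an assumed value chosen to mirror a commonly used default.

```python
import math
from typing import List

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def mnr_loss(anchors: List[Vector], positives: List[Vector], scale: float = 20.0) -> float:
    """Mean cross-entropy where positives[i] is the correct match for anchors[i]
    and every other positives[j] in the batch acts as a negative."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, p) for p in positives]
        # Numerically stable log-sum-exp over the whole batch.
        top = max(logits)
        log_norm = top + math.log(sum(math.exp(x - top) for x in logits))
        total += log_norm - logits[i]
    return total / len(anchors)


anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]   # each anchor is closest to its own positive
swapped = [[0.0, 1.0], [1.0, 0.0]]   # each anchor is closest to the wrong positive

print(mnr_loss(anchors, aligned) < mnr_loss(anchors, swapped))  # True
```

Because every other item in the batch serves as a negative, larger batches give each anchor more negatives at no extra labeling cost.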

## What data was SBERT trained on?

The original SBERT paper trained on a combination of two NLI corpora and then fine-tuned further on STS-B for tasks where supervised similarity scores were available.[1]

- **SNLI (Stanford Natural Language Inference)**: 570,000 sentence pairs labeled as entailment, contradiction, or neutral.
- **MultiNLI**: 430,000 sentence pairs covering more genres than SNLI.
- **STS-B (Semantic Textual Similarity Benchmark)**: 8,628 sentence pairs annotated with continuous similarity scores from 0 to 5.

The combined SNLI plus MultiNLI corpus is often referred to as **AllNLI** in the `sentence-transformers` documentation, and it remains a common starting point for fine-tuning new sentence encoders. In subsequent years, the community discovered that using NLI contradictions as *hard negatives* in a triplet or MNR setup substantially improves quality. Modern training recipes for state-of-the-art embedding models layer additional supervision from question-answer pairs, paraphrase corpora, web search click data, and synthetic data generated by [large language models](/wiki/large_language_model).[1][5][6]

## What results did the original paper report?

The Reimers and Gurevych paper evaluated SBERT on the standard suite of unsupervised STS tasks (STS12 through STS16, STS Benchmark, and SICK-R) using Spearman rank correlation. The headline numbers from the paper, averaged across the seven tasks, were as follows.[1]

| Method | Average Spearman correlation |
|---|---|
| Average GloVe embeddings | 61.32 |
| Average BERT-base token embeddings | 54.81 |
| BERT-base `[CLS]` vector | 29.19 |
| InferSent (GloVe) | 65.01 |
| Universal Sentence Encoder | 71.22 |
| SBERT-NLI-base | 74.89 |
| SBERT-NLI-large | 76.55 |
| SRoBERTa-NLI-large | 76.68 |

The best model, SRoBERTa-NLI-large, improved on InferSent by about 11.7 points and on Universal Sentence Encoder by about 5.5 points without any task-specific supervision; the smaller SBERT-NLI-base still led them by about 9.9 and 3.7 points respectively. SBERT-NLI-base also improved over averaged BERT-base embeddings used as a bi-encoder by about 20 points, demonstrating that the contribution was not the BERT backbone alone but the Siamese fine-tuning that adapted the geometry of the embedding space.[1]

## The `sentence-transformers` library

Alongside the paper, Reimers and Gurevych released an open-source PyTorch library called `sentence-transformers`, originally hosted under the `UKPLab` GitHub organization. The library bundles the model architecture, training code, evaluation harnesses, a wide catalog of pretrained checkpoints, and utility functions for cosine similarity, semantic search, paraphrase mining, and clustering. Its API famously reduces the experience of using a strong sentence encoder to a few lines of Python:

```python
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-mpnet-base-v2")
embeddings = model.encode(["A man is eating food.", "A man eats a meal."])
```

This ergonomic simplicity, combined with a steadily expanding zoo of pretrained checkpoints uploaded to the Hugging Face Hub under the `sentence-transformers/` namespace, is what made the library so dominant. The project is released under the Apache 2.0 license, and by 2025 it was serving more than one million monthly unique users, with over 16,000 Sentence Transformers models publicly available on the Hugging Face Hub. By then it had become a near-universal dependency in production semantic search and RAG stacks.[2][7]

In October 2025 the project was officially transferred from the UKP Lab to Hugging Face, with maintainer Tom Aarsen (who had already been the de facto lead since late 2023) continuing to drive development under the `huggingface/sentence-transformers` repository. The transfer announcement described Sentence Transformers as a project that has "evolved from an innovative research project into a key technology" used by more than a million people every month, and it remains a community-driven open-source project under the same Apache 2.0 license. Releases under Aarsen's maintenance have brought tighter integration with the Hugging Face Trainer, support for newer losses (Matryoshka Loss, Cached MNR, GISTEmbed), and better tooling for fine-tuning custom embedding models.[2][7]

## Which pretrained SBERT models are most popular?

The `sentence-transformers` namespace on the Hugging Face Hub hosts hundreds of pretrained checkpoints. A handful are dominant in production usage because they hit useful sweet spots between speed, quality, and language coverage.[4]

| Model | Backbone | Output dim | Parameters | Notes |
|---|---|---|---|---|
| all-mpnet-base-v2 | MPNet base | 768 | ~110M | Long the default high-quality general-purpose English encoder, fine-tuned on more than one billion sentence pairs with MNR loss. Top performer among general-purpose SBERT checkpoints on STS-B and many MTEB English tasks. |
| all-MiniLM-L6-v2 | MiniLM (6 layers, distilled) | 384 | ~22M | The most-downloaded sentence encoder ever shipped on the Hugging Face Hub. Much faster than mpnet-base, with only a few percentage points lost on most benchmarks. The default choice when latency or memory is constrained. |
| all-MiniLM-L12-v2 | MiniLM (12 layers, distilled) | 384 | ~33M | Slightly larger MiniLM with marginally better quality, still extremely fast. |
| paraphrase-multilingual-MiniLM-L12-v2 | MiniLM (multilingual) | 384 | ~118M | Trained on parallel data covering 50+ languages so that translations of the same sentence end up close in embedding space. The standard go-to multilingual SBERT checkpoint. |
| paraphrase-multilingual-mpnet-base-v2 | XLM-RoBERTa base (distilled from an MPNet teacher) | 768 | ~278M | Higher-quality multilingual variant. Used widely for cross-lingual semantic search. |
| multi-qa-mpnet-base-dot-v1 | MPNet base | 768 | ~110M | Tuned specifically for asymmetric question-passage retrieval. Uses dot-product rather than cosine similarity. |
| msmarco-distilbert-base-v4 | DistilBERT | 768 | ~66M | Trained on the MS MARCO passage retrieval dataset, an early standard for dense retrieval. |

The `all-` prefix indicates that the model was trained on the combined billion-pair dataset assembled by the `sentence-transformers` maintainers, while `paraphrase-`, `multi-qa-`, and `msmarco-` prefixes indicate domain-targeted training data.[4]

## How does SBERT differ from a cross-encoder?

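A persistent practical question is when to use the SBERT-style bi-encoder formulation versus a BERT-style cross-encoder. The answer turns on the difference between first-stage *retrieval*, where recall over a large corpus matters most, and *re-ranking*, where precision on a short candidate list matters most.[8]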

- A **bi-encoder** like SBERT encodes each input independently and compares vectors with cosine similarity (or dot product). Indexing a corpus of size `n` costs `O(n)` encoder passes, paid once; at query time only the query is encoded, and each comparison is a cheap vector operation. Billion-scale search is made tractable by approximate nearest neighbor indexes like [FAISS](/wiki/faiss). Quality is somewhat lower than cross-encoders because the model never sees the two inputs together.
- A **cross-encoder** concatenates the query and document, runs them through the transformer jointly, and emits a single relevance score. Quality is consistently higher because self-attention can exchange information between the two sequences. Cost, however, is `O(n)` full transformer passes per query (and `O(n^2)` for all-pairs comparison), with nothing precomputable, which is unworkable for retrieval over millions of documents.

The canonical production pattern that emerged is *retrieve-then-rerank*: an SBERT-style bi-encoder retrieves the top-`k` candidates with [k-nearest neighbors](/wiki/k_nearest_neighbors) search, and a slower but more accurate cross-encoder re-scores only those candidates. The `sentence-transformers` library packages cross-encoders alongside bi-encoders for exactly this workflow.[8]
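The two-stage pattern can be sketched independently of any particular model. In the example below the document vectors are assumed to be precomputed by a bi-encoder, and `toy_cross_score` is a deliberately crude word-overlap stand-in for a real cross-encoder; both the function names and the toy data are hypothetical.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def retrieve_then_rerank(query: str, query_vec: Vector, docs: List[str],
                         doc_vecs: List[Vector],
                         cross_score: Callable[[str, str], float],
                         k: int = 2) -> List[Tuple[str, float]]:
    """Stage 1: rank all documents by cosine similarity of precomputed vectors.
    Stage 2: re-score only the top-k candidates with the pairwise scorer."""
    by_cosine = sorted(range(len(docs)),
                       key=lambda i: cosine(query_vec, doc_vecs[i]), reverse=True)
    candidates = by_cosine[:k]
    rescored = [(docs[i], cross_score(query, docs[i])) for i in candidates]
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)


def toy_cross_score(query: str, doc: str) -> float:
    """Stand-in for a cross-encoder: counts shared words (illustration only)."""
    return float(len(set(query.split()) & set(doc.split())))


docs = ["cats purr", "dogs bark loudly", "stock prices fell"]
doc_vecs = [[0.9, 0.1], [0.8, 0.3], [0.0, 1.0]]  # pretend bi-encoder outputs

results = retrieve_then_rerank("why do dogs bark", [1.0, 0.0], docs, doc_vecs,
                               toy_cross_score, k=2)
print(results)  # [('dogs bark loudly', 2.0), ('cats purr', 0.0)]
```

The expensive scorer runs only `k` times per query regardless of corpus size, and it can promote a candidate that the vector stage ranked second.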

## What is SBERT used for?

SBERT and its descendants are used wherever a fixed-size dense vector representation of text is useful.[2][8]

- **Semantic search**: Encode every document in a corpus once, then encode each query at request time and retrieve nearest neighbors with FAISS, ScaNN, HNSW, or a managed vector database such as Pinecone, Weaviate, Qdrant, Milvus, or pgvector.
- **Retrieval augmented generation (RAG)**: The default retrieval substrate for almost every open-source RAG implementation, including LangChain, LlamaIndex, Haystack, and most enterprise stacks built on top of them.
- **Clustering and topic discovery**: Tools like BERTopic use SBERT embeddings as the foundation for unsupervised topic modeling.
- **Paraphrase mining and deduplication**: Find near-duplicate documents in a corpus by computing pairwise similarities and thresholding.
- **Cross-lingual retrieval**: Multilingual SBERT variants align embedding spaces across languages so that an English query can retrieve documents written in Spanish, Mandarin, or Arabic.
- **Recommendation and matching**: Match users to items, candidates to job postings, or questions to existing answers in a help center.
- **Zero-shot and few-shot classification**: Compare a new sentence to the embeddings of label descriptions, or to embedding centroids of a handful of labeled examples, without retraining the encoder.

## How is SBERT benchmarked?

Over the years, the community has settled on a few standard evaluations for sentence and document encoders.

- **STS-B, SICK-R, and the STS12 through STS16 tasks**: The original benchmarks from the SBERT paper. Reported as Spearman correlation between cosine similarity and human-rated similarity.
- **BEIR**: A heterogeneous zero-shot information retrieval benchmark covering tasks like fact verification, question answering, news retrieval, and biomedical search. Models that do well on STS often do not transfer perfectly to BEIR-style retrieval, which exposes the difference between symmetric similarity and asymmetric query-passage retrieval.
- **MTEB (Massive Text Embedding Benchmark)**: Maintained under the `embeddings-benchmark` organization on GitHub, MTEB aggregates more than 50 datasets across retrieval, clustering, classification, semantic similarity, and reranking. The MTEB leaderboard on Hugging Face has become the central scoreboard for embedding models. As of 2026, top spots are held by very large models trained on much larger and more diverse data than the original SBERT, but the gap on many practical tasks is smaller than the absolute scores would suggest.[9]
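The STS-style metric is straightforward to reproduce: rank the model's cosine similarities, rank the human scores, and take the Pearson correlation of the two rank lists. The sketch below does this in plain Python with average ranks for ties; the function names and sample scores are illustrative, and in practice a library routine such as SciPy's `spearmanr` would normally be used instead.

```python
import math
from typing import List


def average_ranks(values: List[float]) -> List[float]:
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation of two equal-length, non-constant lists."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def spearman(x: List[float], y: List[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


model_cosines = [0.91, 0.35, 0.62, 0.10]  # made-up model outputs
gold_scores = [4.8, 1.5, 3.0, 0.2]        # made-up human ratings on a 0 to 5 scale

print(round(spearman(model_cosines, gold_scores), 4))  # 1.0
```

Because only the ordering matters, the metric is indifferent to whether gold scores run from 0 to 5 or 0 to 1; papers typically report the value multiplied by 100.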

## Successors and the modern embedding landscape

The years since 2019 have seen an explosion of embedding models that build on the SBERT recipe. Most retain the bi-encoder architecture and contrastive training objective; the main innovations are larger and more diverse training data, better backbone models (often initialized from instruction-tuned LLMs rather than BERT), and techniques like [Matryoshka representation learning](https://arxiv.org/abs/2205.13147) that allow a single model to produce embeddings at multiple dimensionalities.[9]

| Model family | Provider | Default dim | Open weights | Notes |
|---|---|---|---|---|
| `all-mpnet-base-v2` (SBERT) | UKP Lab / Hugging Face | 768 | Yes | The classic open-source baseline. Widely used as a default. |
| BGE (`bge-large-en-v1.5`, `bge-m3`) | BAAI (Beijing Academy of AI) | 1024 | Yes | The BGE series became the open-source state-of-the-art in 2023 and 2024. `bge-m3` supports dense, sparse, and ColBERT-style multi-vector retrieval in one model with 100+ language coverage and 8K context. |
| GTE (`gte-large-en-v1.5`, `gte-Qwen2-1.5B-instruct`) | Alibaba DAMO | 1024+ | Yes | Strong open-source competitor; recent variants are LLM-initialized. |
| Jina Embeddings (`jina-embeddings-v3`) | Jina AI | 1024 (truncatable via Matryoshka) | Yes | Long-context (8K tokens) and multilingual. Designed for production retrieval. |
| OpenAI `text-embedding-3-small` and `text-embedding-3-large` | OpenAI | 1536 / 3072 (truncatable) | No | Closed API, introduced in January 2024 as the successor to `text-embedding-ada-002`. Uses Matryoshka so dimensions can be shortened. The small variant is among the most widely used production embedding APIs. |
| Voyage `voyage-3-large`, `voyage-4` | Voyage AI (MongoDB) | 1024 default, up to 2048 | No | Top performer on retrieval benchmarks. Supports int8 and binary quantization for very low storage cost. |
| Cohere `embed-v4` | Cohere | 1024+ | No | Multilingual and multimodal embeddings. Strong on enterprise retrieval workloads. |
| Google Gemini Embedding | Google DeepMind | up to 3072 (truncatable) | No | Multimodal embeddings spanning text, image, video, audio, and PDF. |
| NV-Embed, Llama-Embed-Nemotron | NVIDIA | 4096 | Yes | Decoder-LLM-initialized embeddings that top current MTEB tables, at significant compute cost. |

Despite the proliferation of newer models, SBERT itself and the simple all-MiniLM-L6-v2 in particular remain extraordinarily popular because they are tiny, fast, and good enough for many real workloads. A small embedding model with a well-tuned cross-encoder reranker often beats a huge embedding model used alone, both on quality and on cost.[8][9]

## Practical considerations

A few practical lessons have accumulated in the years since the SBERT paper.

- **Normalize embeddings**. Most SBERT models are trained against cosine similarity, so it is conventional to L2-normalize the output vectors and use either cosine similarity or, equivalently for unit vectors, dot product. Some checkpoints (notably `multi-qa-*-dot-v1`) are trained for raw dot product and should not be normalized.
- **Mind the maximum sequence length**. Most original SBERT checkpoints truncate input somewhere between 128 and 512 tokens, depending on the model. Long documents need to be chunked, and the chunking strategy materially affects retrieval quality. Modern variants like `bge-m3` and `jina-embeddings-v3` support 8K context.
- **Use approximate nearest neighbor indexes for scale**. Exact cosine similarity over a million vectors is feasible but wasteful. FAISS, HNSW, ScaNN, and managed vector databases provide sub-linear retrieval with negligible quality loss.
- **Fine-tune on in-domain data**. Off-the-shelf SBERT checkpoints are general-purpose. A few thousand in-domain query-passage pairs, optionally augmented with hard negatives mined from a base retriever, often delivers a substantial quality boost via MNR-loss fine-tuning.
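- **Quantize when storage matters**. Quantization-aware training (as used in Voyage and several recent open models) and Matryoshka truncation can shrink an index dramatically. Binary quantization alone stores 32x more vectors per gigabyte than float32, and combining it with truncation to a fraction of the original dimensionality is how savings on the order of 100x to 200x are reached, usually with only a small quality loss. The result is a much lower vector database bill.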
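Two of the points above lend themselves to a quick check: that dot product on L2-normalized vectors equals cosine similarity, and how quantization and truncation multiply into storage savings. The sketch below uses plain Python; the helper names are hypothetical, and the 3072 and 512 dimensions are illustrative choices rather than a recommendation.

```python
import math
from typing import List


def dot(u: List[float], v: List[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def l2_normalize(v: List[float]) -> List[float]:
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine(u: List[float], v: List[float]) -> float:
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


def bytes_per_vector(dim: int, bits_per_value: int) -> int:
    """Raw storage for one embedding, ignoring index overhead."""
    return dim * bits_per_value // 8


u, v = [3.0, 4.0], [4.0, 3.0]
# For unit vectors, the plain dot product is the cosine similarity.
print(round(cosine(u, v), 6), round(dot(l2_normalize(u), l2_normalize(v)), 6))  # 0.96 0.96

full = bytes_per_vector(3072, 32)    # float32 at full dimensionality
binary = bytes_per_vector(3072, 1)   # binary quantization only
compact = bytes_per_vector(512, 1)   # binary plus truncation to 512 dimensions
print(full // binary, full // compact)  # 32 192
```

Normalizing once at indexing time lets a vector database use its fastest inner-product path while still returning cosine rankings; this should be skipped for checkpoints trained for raw dot product.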

## Influence and legacy

It is hard to overstate the practical impact of the SBERT line of work. The paper itself has accumulated more than 10,000 citations, but the influence is more visible in deployed systems than in the citation graph: `sentence-transformers` is a direct or transitive dependency of a large share of the production semantic search systems built between 2020 and 2024, the BEIR and MTEB benchmarks were created in part to evaluate SBERT-style models, and the bi-encoder plus cross-encoder reranker pattern that SBERT crystallized has become the canonical architecture for dense retrieval. The current crop of state-of-the-art embedding models (BGE, GTE, Voyage, Jina, OpenAI text-embedding-3, Cohere embed-v4, NV-Embed) all retain the bi-encoder Siamese-fine-tuning recipe at their core; they are, in a real sense, scaled-up and refined SBERTs.[1][2][8][9]

SBERT also did something important sociologically. By packaging an empirically strong recipe in a clean Python API and seeding the Hugging Face Hub with high-quality pretrained checkpoints, it made dense semantic embeddings genuinely accessible to developers who were not NLP researchers. That accessibility, more than any single benchmark number, is why SBERT remains the default starting point for anyone who needs a sentence embedding in 2026.

## See also

- [BERT](/wiki/bert)
- [Transformer](/wiki/transformer)
- [Large language model](/wiki/large_language_model)
- [Hugging Face](/wiki/hugging_face)
- [Semantic search](/wiki/semantic_search)
- [Retrieval augmented generation](/wiki/retrieval_augmented_generation)
- [FAISS](/wiki/faiss)
- [Two-tower model](/wiki/two-tower_model)
- [k-nearest neighbors](/wiki/k_nearest_neighbors)

## References

1. Reimers, N. and Gurevych, I. (2019). *Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks*. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, November 2019, 3982-3992. https://aclanthology.org/D19-1410/ and https://arxiv.org/abs/1908.10084
2. Hugging Face. *sentence-transformers* organization page on the Hugging Face Hub. https://huggingface.co/sentence-transformers
3. Pennington, J., Socher, R. and Manning, C. (2014). *GloVe: Global Vectors for Word Representation*. EMNLP 2014.
4. Sentence Transformers documentation. *Pretrained Models*. https://www.sbert.net/docs/sentence_transformer/pretrained_models.html
5. Sentence Transformers documentation. *Training on Natural Language Inference data*. https://sbert.net/examples/sentence_transformer/training/nli/README.html
6. Pinecone Learn. *Next-Gen Sentence Embeddings with Multiple Negatives Ranking Loss*. https://www.pinecone.io/learn/series/nlp/fine-tune-sentence-transformers-mnr/
7. Hugging Face. *Sentence Transformers is joining Hugging Face* (October 22, 2025). https://huggingface.co/blog/sentence-transformers-joins-hf
8. Sentence Transformers documentation. *Cross-Encoders*. https://sbert.net/examples/cross_encoder/applications/README.html
9. Muennighoff, N. et al. *MTEB: Massive Text Embedding Benchmark*. https://huggingface.co/spaces/mteb/leaderboard and https://github.com/embeddings-benchmark/mteb
10. Kusupati, A. et al. (2022). *Matryoshka Representation Learning*. https://arxiv.org/abs/2205.13147
11. Xiao, S. et al. *FlagEmbedding: A One-Stop Retrieval Toolkit*. https://github.com/FlagOpen/FlagEmbedding
12. Voyage AI. *voyage-3-large: the new state-of-the-art general-purpose embedding model*. https://blog.voyageai.com/2025/01/07/voyage-3-large/

