Sentence Similarity Models
Last reviewed
May 11, 2026
Sources
22 citations
Review status
Source-backed
Revision
v2 · 2,488 words
Sentence similarity models are machine learning systems that map natural language sentences to fixed-length numerical vectors (sentence embeddings) so that semantically related sentences sit close together in the vector space. Similarity between two sentences is then measured with a distance or angle metric, most commonly cosine similarity. These models power many natural language processing applications, including semantic search, retrieval for retrieval-augmented generation, clustering, paraphrase mining, deduplication, and recommendation.
See also: Natural Language Processing Models
A sentence embedding is a dense vector, typically between 256 and 4096 dimensions, that encodes the meaning of a sentence or short passage. Two sentences with the same meaning should yield vectors whose cosine similarity is close to 1, while unrelated sentences approach 0. The related research area of semantic textual similarity (STS) measures how well a system predicts the human-judged similarity score for a sentence pair, scored 0 to 5 in the SemEval shared tasks.
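As a minimal sketch of the comparison described above, the function below computes cosine similarity in plain Python. The three 4-dimensional vectors and their variable names are made-up toy values, not the output of any real model. Cosine similarity ranges from -1 to 1 in general.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy "embeddings" (illustrative values only, not from a real encoder).
cat_sat = [0.9, 0.1, 0.3, 0.0]
feline_rested = [0.8, 0.2, 0.4, 0.1]
stock_fell = [0.0, 0.9, -0.1, 0.7]

print(cosine_similarity(cat_sat, feline_rested))  # close to 1: similar direction
print(cosine_similarity(cat_sat, stock_fell))     # close to 0: nearly orthogonal
```

Many pipelines L2-normalize embeddings once at encoding time, after which a plain dot product gives the same value as cosine similarity.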
In practice, these models are deployed as encoders inside larger pipelines. A search system encodes every document or passage once and stores the vectors in a vector database. At query time, the question is encoded with the same model and nearest neighbors are returned by approximate nearest-neighbor search. This bi-encoder pattern made dense retrieval practical at web scale and is the default building block for retrieval-augmented generation systems.
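The encode-once, search-at-query-time pattern can be sketched as follows. This is an illustration under stated assumptions: `toy_encode` is a hypothetical stand-in (hashed bag-of-words) for a real sentence embedding model, and `search` uses exact brute-force scoring where a vector database would typically use approximate nearest-neighbor search.

```python
import math
import zlib

DIM = 64  # toy dimension; real models use hundreds to thousands


def toy_encode(text):
    """Hypothetical stand-in encoder: hashed bag-of-words, L2-normalized."""
    vec = [0.0] * DIM
    for token in text.lower().split():
        vec[zlib.crc32(token.encode("utf-8")) % DIM] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


def build_index(documents):
    """Offline step: encode every document once and keep the vectors."""
    return [(doc, toy_encode(doc)) for doc in documents]


def search(index, query, k=2):
    """Online step: encode the query with the same encoder, score all
    stored vectors by dot product, and return the k best (score, doc) pairs."""
    q = toy_encode(query)
    scored = [(sum(a * b for a, b in zip(q, vec)), doc) for doc, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


docs = [
    "how to reset a forgotten password",
    "shipping times for international orders",
    "reset your router to factory settings",
]
index = build_index(docs)
for score, doc in search(index, "forgotten password reset"):
    print(f"{score:.3f}  {doc}")
```

The key property is that document vectors are computed once in `build_index`; only the query is encoded at request time.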
Early work treated documents as sparse bag-of-words vectors, sometimes weighted by term frequency-inverse document frequency (TF-IDF). Latent semantic analysis (LSA), introduced by Deerwester, Dumais, Furnas, Landauer, and Harshman in 1990, applied singular value decomposition to a term-by-document matrix to project sparse counts into a roughly 100-dimensional latent space that captured synonymy through co-occurrence. LSA is often regarded as the first practical dense representation for text retrieval.
The distributed word representation era, kicked off by Word2Vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014), produced strong word embeddings but only weak sentence embeddings when word vectors were averaged. Le and Mikolov (2014) extended the idea with Paragraph Vector (doc2vec) in PV-DM and PV-DBOW variants, training a per-document vector alongside word vectors to predict words in a passage.
Deep sentence encoders began with Skip-Thought vectors (Kiros et al., 2015), which trained an encoder-decoder to predict surrounding sentences. Conneau, Kiela, Schwenk, Barrault, and Bordes published InferSent in 2017, showing a BiLSTM trained on the Stanford Natural Language Inference (SNLI) corpus produced general-purpose sentence representations beating unsupervised baselines on transfer tasks. Cer and colleagues at Google released the Universal Sentence Encoder (USE) in 2018, packaging a Transformer encoder and a Deep Averaging Network through TensorFlow Hub.
The single biggest leap was Sentence-BERT (SBERT), published by Nils Reimers and Iryna Gurevych at EMNLP-IJCNLP 2019. SBERT fine-tunes a pretrained BERT model in a Siamese / triplet architecture with mean pooling, producing fixed-size sentence embeddings comparable through cosine similarity. The headline result was practical: finding the most similar pair in a 10,000-sentence collection dropped from about 65 hours with vanilla BERT cross-encoding to roughly 5 seconds with SBERT, with similar accuracy. SBERT's authors also released the sentence-transformers Python library, which became the de facto standard for training and serving sentence embedding models.
Gao, Yao, and Chen introduced SimCSE in 2021, a contrastive learning recipe with simple ingredients. The unsupervised variant passes the same sentence through a Transformer twice with independent dropout masks and treats the two outputs as a positive pair, with other in-batch sentences as negatives. The supervised variant uses SNLI/MNLI entailment pairs as positives and contradictions as hard negatives. Supervised SimCSE pushed the average STS Spearman correlation above 81 percent with a base-sized model.
From late 2021 onward, web-scale contrastive pretraining dominated. Ni and colleagues at Google released GTR (Generalizable T5 Retrieval) in late 2021, showing that scaling a T5 encoder to billions of parameters with a 768-dimensional bottleneck produced strong zero-shot retrieval. Wang and colleagues at Microsoft released the E5 family in December 2022, trained contrastively on a 270M-pair web dataset (CCPairs) with prefixes such as "query:" and "passage:". The Beijing Academy of Artificial Intelligence (BAAI) released the BGE (BAAI General Embedding) line in 2023 with the C-Pack technical report. Alibaba's GTE family followed, along with Nomic AI's open-data Nomic Embed, OpenAI's text-embedding-3 models in January 2024, Cohere Embed v3, Voyage AI embeddings, and Snowflake Arctic Embed. By 2024 the top of the public MTEB leaderboard was held by LLM-derived encoders such as E5-Mistral-7B-instruct and gte-Qwen2-7B-instruct.
Two broad architectures dominate. A bi-encoder (or dual encoder) processes each sentence independently and produces one vector per sentence; comparison is a cheap dot product or cosine similarity. A cross-encoder feeds a sentence pair jointly through a Transformer and outputs a similarity score, which is more accurate but cannot be precomputed: every comparison needs its own forward pass, so comparing all pairs in a collection of n sentences takes n(n-1)/2 passes, which grows quadratically with n. In production, the two are often combined: a bi-encoder retrieves top-k candidates, then a cross-encoder reranks them.
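The two-stage pattern can be sketched as below. Both scoring functions are hypothetical placeholders rather than real models: `toy_bi_score` stands in for a dot product of precomputed embeddings, and `toy_cross_score` stands in for a cross-encoder forward pass on the joint pair. The helper `all_pairs` shows the pair count behind the cost of exhaustive cross-encoding.

```python
def token_set(text):
    return set(text.lower().split())


def toy_bi_score(query, doc):
    """Placeholder for a cheap embedding dot product: shared-token count."""
    return len(token_set(query) & token_set(doc))


def toy_cross_score(query, doc):
    """Placeholder for an expensive joint forward pass: Jaccard overlap."""
    q, d = token_set(query), token_set(doc)
    return len(q & d) / len(q | d)


def retrieve_then_rerank(query, documents, k=2):
    # Stage 1: score every document cheaply and keep the k best candidates.
    candidates = sorted(
        documents, key=lambda d: toy_bi_score(query, d), reverse=True
    )[:k]
    # Stage 2: run the expensive pairwise scorer on those k candidates only.
    return sorted(candidates, key=lambda d: toy_cross_score(query, d), reverse=True)


def all_pairs(n):
    """Unordered pairs a cross-encoder must score to compare n sentences exhaustively."""
    return n * (n - 1) // 2


docs = [
    "how to reset a forgotten password on the mobile app today",
    "reset a forgotten password",
    "shipping times for international orders",
]
print(retrieve_then_rerank("reset forgotten password", docs, k=2))
print(all_pairs(10_000))  # 49995000
```

In this toy run the first stage ties the two password documents, and the second stage moves the shorter, more focused one to the top. The reranker is called k times per query regardless of collection size.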
Given a Transformer backbone, a pooling layer turns a sequence of token vectors into a single sentence vector. The three common choices are CLS pooling (the first token's hidden state), mean pooling (average all tokens with attention-mask weighting), and max pooling. SBERT and most successors found that mean pooling outperforms CLS for similarity tasks on BERT-style encoders.
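The three pooling choices can be illustrated in plain Python. The hidden states below are made-up numbers for a four-position sequence whose last position is padding; the function names are illustrative and do not come from any particular library.

```python
def cls_pool(token_vectors, attention_mask):
    """Use the first token's hidden state as the sentence vector."""
    return list(token_vectors[0])


def mean_pool(token_vectors, attention_mask):
    """Average hidden states over real tokens, skipping padding (mask == 0)."""
    count = sum(attention_mask)
    if count == 0:
        raise ValueError("attention mask selects no tokens")
    dim = len(token_vectors[0])
    return [
        sum(vec[i] * m for vec, m in zip(token_vectors, attention_mask)) / count
        for i in range(dim)
    ]


def max_pool(token_vectors, attention_mask):
    """Element-wise maximum over real tokens, skipping padding."""
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m]
    if not kept:
        raise ValueError("attention mask selects no tokens")
    return [max(column) for column in zip(*kept)]


# Made-up hidden states: [CLS], two word tokens, one padding position.
hidden = [
    [1.0, 0.0],  # [CLS]
    [3.0, 2.0],  # token 1
    [2.0, 4.0],  # token 2
    [9.0, 9.0],  # [PAD], must not influence the result
]
mask = [1, 1, 1, 0]

print(cls_pool(hidden, mask))   # [1.0, 0.0]
print(mean_pool(hidden, mask))  # [2.0, 2.0]
print(max_pool(hidden, mask))   # [3.0, 4.0]
```

Without the attention mask, the padding row would pull the mean and the maximum toward 9.0, which is why mask-weighted pooling matters for batched inputs.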
Contemporary models train with contrastive objectives. In-batch negatives reuse the other examples in a minibatch as negatives, which makes the loss cheap to compute but ties the number of negatives to the batch size. Hard negative mining adds difficult, semantically nearby examples (typically discovered with a current retriever or BM25) to sharpen decision boundaries. Many systems train with a temperature-scaled InfoNCE loss popularized by SimCLR and CLIP.
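A temperature-scaled InfoNCE loss with in-batch negatives can be written in a few lines. This is a minimal sketch, assuming L2-normalized vectors and an illustrative temperature of 0.05; real training code would compute the same quantity on tensors with automatic differentiation.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def info_nce(anchors, positives, temperature=0.05):
    """Mean InfoNCE loss over a batch.

    anchors[i] and positives[i] form a positive pair; every positives[j]
    with j != i acts as an in-batch negative for anchors[i]. Vectors are
    assumed L2-normalized, so the dot product equals cosine similarity.
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [dot(anchor, p) / temperature for p in positives]
        # Log-sum-exp with max subtraction for numerical stability.
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(x - top) for x in logits))
        # Cross-entropy with the matching positive as the correct class.
        total += log_denominator - logits[i]
    return total / len(anchors)


anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]   # each anchor matches its own positive
swapped = [[0.0, 1.0], [1.0, 0.0]]   # each anchor matches the wrong positive

print(info_nce(anchors, aligned))  # near 0
print(info_nce(anchors, swapped))  # large
```

Because each row's denominator sums over the whole batch, a larger batch supplies more negatives per anchor at no extra encoding cost.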
Matryoshka Representation Learning (MRL), introduced by Kusupati and colleagues at NeurIPS 2022, trains an embedding so that every prefix (first 64, 128, 256, 512, ... dimensions) is itself a useful representation. A single model can then serve embeddings at multiple dimensions, trading retrieval cost against quality without retraining. OpenAI's text-embedding-3 family and Nomic Embed v1.5 both expose MRL truncation through their APIs.
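On the consumer side, using a Matryoshka embedding at a smaller size amounts to keeping a prefix and re-normalizing it. The helper below is a sketch with a hypothetical name and a toy vector; it is only meaningful for embeddings from a model trained with MRL, since prefixes of ordinary embeddings carry no such guarantee.

```python
import math


def truncate_embedding(vector, dim):
    """Keep the first `dim` components and re-normalize to unit length."""
    if not 0 < dim <= len(vector):
        raise ValueError("dim must be between 1 and the vector length")
    prefix = vector[:dim]
    norm = math.sqrt(sum(x * x for x in prefix))
    if norm == 0.0:
        raise ValueError("prefix has zero norm")
    return [x / norm for x in prefix]


full = [0.5, 0.5, 0.5, 0.5]          # toy unit-length vector
short = truncate_embedding(full, 2)  # first two dimensions only
print(short)
print(math.sqrt(sum(x * x for x in short)))  # unit length again, up to rounding
```

Re-normalizing matters when downstream code assumes unit-length vectors, for example when it uses a dot product as cosine similarity.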
Four recipes account for most modern sentence encoders. NLI supervision (InferSent, SBERT) trains a classifier on premise-hypothesis pairs from SNLI and MNLI. Contrastive training with in-batch negatives (SimCSE, GTR, E5, BGE) pulls positive pairs together and pushes random negatives apart. Retrieval-style training (MS MARCO and BEIR-style hard negatives) tunes the model on query-passage pairs from real search logs. Instruction tuning (E5-Mistral, BGE-en-icl, gte-Qwen2-instruct) prepends a natural-language task description so one model can switch between retrieval, classification, clustering, and STS at inference time.
Top models combine several stages: weakly supervised contrastive pretraining on hundreds of millions of noisy web pairs, supervised fine-tuning on a curated mix of MS MARCO, NLI, and synthetic data, then instruction tuning on diverse task formats.
| Model | Year | Parameters | Embedding dim | Notes |
|---|---|---|---|---|
| InferSent | 2017 | ~38M | 4096 | BiLSTM trained on SNLI |
| Universal Sentence Encoder | 2018 | ~147M (Transformer) | 512 | Two variants, TF Hub |
| Sentence-BERT bert-base-nli-mean-tokens | 2019 | 110M | 768 | First widely used SBERT model |
| all-MiniLM-L6-v2 | 2021 | 22M | 384 | Compact, very fast default |
| all-mpnet-base-v2 | 2021 | 109M | 768 | Popular high-quality default |
| SimCSE-RoBERTa-large | 2021 | 355M | 1024 | Supervised contrastive |
| GTR-T5-large | 2021 | 335M | 768 | Dual encoder from T5 |
| Sentence-T5 (ST5) | 2021 | up to 11B | 768 | Multi-stage T5 sentence encoder |
| E5-large-v2 | 2022 | 335M | 1024 | Microsoft weakly supervised |
| BGE-large-en-v1.5 | 2023 | 335M | 1024 | BAAI, MTEB leader at release |
| GTE-large | 2023 | 335M | 1024 | Alibaba general text embedding |
| E5-Mistral-7B-instruct | 2023 | 7.1B | 4096 | LLM-based, instruction-tuned |
| BGE-M3 | 2024 | 568M | 1024 | 100+ languages, dense/sparse/multi-vector |
| Nomic Embed v1 | 2024 | 137M | 768 | Open weights and open data, 8192 context |
| OpenAI text-embedding-3-small | 2024 | not disclosed | up to 1536 | Matryoshka, API only |
| OpenAI text-embedding-3-large | 2024 | not disclosed | up to 3072 | Matryoshka, API only |
| Cohere Embed v3 multilingual | 2023 | not disclosed | 1024 | Multilingual, API only |
| gte-Qwen2-7B-instruct | 2024 | 7.6B | 3584 | Top of MTEB at release, 32K context |
| Nomic Embed v2 MoE | 2025 | 305M (active) | 768 (MRL to 256) | First general-purpose MoE embedder |
| Snowflake arctic-embed-l | 2024 | 335M | 1024 | Retrieval-optimized open model |
| Snowflake arctic-embed-l-v2.0 | 2024 | 568M | 1024 | Multilingual, December 2024 |
| Voyage-3 | 2024 | not disclosed | 1024 | 32K context, multilingual |
| Voyage-3-large | 2025 | not disclosed | 1024 (MRL) | 32K context, multi-domain |
| Benchmark | Year | Coverage | Focus |
|---|---|---|---|
| STS-12 through STS-16 | 2012-2016 | English | Sentence pair similarity |
| SICK | 2014 | English | Compositional similarity and entailment |
| STS Benchmark (STSb) | 2017 | English | Curated dev/test from SemEval data |
| SemEval-2017 Task 1 | 2017 | 5 language pairs | Multilingual and cross-lingual STS |
| SentEval | 2018 | English | 17 transfer tasks for sentence embeddings |
| MS MARCO | 2018-2019 | English | Web search passage ranking |
| BEIR | 2021 | English, 18 datasets | Zero-shot heterogeneous retrieval |
| MIRACL | 2022 | 18 languages | Multilingual ad-hoc retrieval |
| MTEB | 2022 | 56+ datasets, 112 languages | 8 task types (retrieval, STS, classification, clustering, reranking, pair classification, summarization, bitext mining) |
| MTEB-French, MTEB-Chinese, MTEB-Polish | 2023+ | Per-language | Language-specific MTEB tracks |
The Massive Text Embedding Benchmark (MTEB), introduced by Muennighoff, Tazi, Magne, and Reimers in 2022 (arXiv:2210.07316, EACL 2023), is the de facto industry benchmark. Its public leaderboard at huggingface.co/spaces/mteb/leaderboard is where new embedding releases compete, and as of 2024 top positions are usually held by LLM-based encoders such as gte-Qwen2-7B-instruct and SFR-Embedding-Mistral. BEIR (Thakur et al., NeurIPS 2021) remains the standard zero-shot retrieval benchmark and is incorporated as the retrieval slice of MTEB.
Semantic search is the most visible application: pre-computed embeddings of documents, paragraphs, FAQs, or product descriptions enable fast cosine-similarity lookup of the most relevant items for a query. The same retrieval pipeline is the first stage of most retrieval-augmented generation systems, in which retrieved passages are concatenated into the prompt of a generative model.
Clustering and topic modeling group documents by embedding similarity (often combined with HDBSCAN and UMAP, as in BERTopic) without requiring a fixed taxonomy. Paraphrase detection, duplicate question detection (as on Quora and Stack Overflow), and plagiarism checking rely on threshold-based similarity comparisons. Recommendation systems use embeddings to match users with content, posts, or other users. Data engineering teams also use sentence embeddings for near-duplicate detection during dataset construction. Related embedding-based generation metrics such as BERTScore compare contextual token embeddings rather than single sentence vectors.
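Threshold-based duplicate detection can be sketched as an all-pairs comparison. The vectors and the 0.9 threshold are illustrative assumptions; in practice the threshold is tuned per model and dataset, and large collections use nearest-neighbor search instead of scoring every pair.

```python
import math
from itertools import combinations


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def find_duplicates(embeddings, threshold=0.9):
    """Return index pairs (i, j) with i < j whose cosine similarity is
    at least `threshold`. Cost grows quadratically with collection size."""
    return [
        (i, j)
        for (i, u), (j, v) in combinations(enumerate(embeddings), 2)
        if cosine(u, v) >= threshold
    ]


# Toy embeddings: items 0 and 1 point in almost the same direction.
vectors = [
    [1.0, 0.0],
    [0.99, 0.1],
    [0.0, 1.0],
]
print(find_duplicates(vectors))  # [(0, 1)]
```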
Three trends define the current state. First, LLM-derived encoders dominate quality benchmarks: E5-Mistral-7B-instruct, gte-Qwen2-7B-instruct, SFR-Embedding-Mistral, and Linq-Embed-Mistral start from a 7B-parameter decoder model and fine-tune it on synthetic and curated retrieval data. They reach top MTEB scores at the cost of much higher serving latency than the 110M-parameter encoders that preceded them.
Second, adaptive dimensions through Matryoshka Representation Learning have become standard. OpenAI text-embedding-3, Nomic Embed v1.5 and v2, Snowflake arctic-embed-l-v2.0, and Voyage-3-large let users truncate vectors to 256 or fewer dimensions, cutting storage in vector databases with little accuracy loss.
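The storage effect of truncation is simple arithmetic. The sketch below assumes float32 values (4 bytes each) and a hypothetical collection of one million vectors, and it ignores index overhead and metadata.

```python
def index_bytes(num_vectors, dim, bytes_per_value=4):
    """Raw storage for dense vectors, ignoring index overhead and metadata."""
    return num_vectors * dim * bytes_per_value


full = index_bytes(1_000_000, 3072)  # full-size vectors
small = index_bytes(1_000_000, 256)  # truncated to 256 dimensions

print(full / 1e9)     # 12.288 (GB)
print(small / 1e9)    # 1.024 (GB)
print(full // small)  # 12
```

Under these assumptions, truncating from 3072 to 256 dimensions cuts raw vector storage by a factor of 12.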
Third, multilingual coverage has expanded sharply. BGE-M3 spans more than 100 languages and supports dense, sparse, and multi-vector retrieval from one model. Cohere Embed v3 multilingual covers about 100 languages at 1024 dimensions. Snowflake arctic-embed-l-v2.0 (December 2024) and Nomic Embed v2 MoE (early 2025) continue the trend. Domain-tuned variants for code, finance, and law have also appeared.
Domain transfer is brittle: an encoder trained on web text often performs poorly on legal contracts or clinical notes without further fine-tuning. Long-context degradation is real: even models that accept 8K or 32K tokens often capture mainly the early part of a long input in the embedding. Embedding models can amplify training-data biases. Evaluation has pitfalls: heavy reliance on the public MTEB leaderboard has been criticized for encouraging benchmark-specific overfitting, and the classification and clustering slices rely on simple downstream procedures (such as a linear probe or k-means) whose results shift with small setup changes. Coverage for low-resource languages remains uneven, and embeddings still cannot reliably encode logical conditions such as numeric ranges, negations, or date arithmetic, which are common failure modes in retrieval-augmented generation.