Sentence Similarity Models

Updated Jul 16, 2026

Sentence similarity models are machine learning systems that map natural language sentences to fixed-length numerical vectors (sentence embeddings) so that semantically related sentences sit close together in the vector space. Similarity between two sentences is then measured with a distance or angle metric, most commonly cosine similarity.^[5] These models power many natural language processing applications, including semantic search, retrieval for retrieval-augmented generation, clustering, paraphrase mining, deduplication, and recommendation.

The field grew directly from earlier feature extraction research, but with a distinct goal: rather than producing representations for classification or generation, sentence similarity models are optimized specifically so that the geometry of the vector space reflects human-perceived semantic relatedness. This constraint shapes every design choice, from the objective function used during training to the pooling strategy applied at inference time.

Definition and core idea

A sentence embedding is a dense vector, typically between 256 and 4096 dimensions, that encodes the meaning of a sentence or short passage. Two sentences with the same meaning should yield vectors whose cosine similarity is close to 1, while unrelated sentences approach 0. The related research area of semantic textual similarity (STS) measures how well a system predicts the human-judged similarity score for a sentence pair, scored 0 to 5 in the SemEval shared tasks.^[19] High STS Spearman correlation is a necessary but not sufficient condition for good retrieval performance, which is why modern evaluation suites such as MTEB cover both STS and information retrieval tasks in parallel.^[10]

In practice, these models are deployed as encoders inside larger pipelines. A search system encodes every document or passage once and stores the vectors in a vector database. At query time, the question is encoded with the same model and nearest neighbors are returned by approximate nearest-neighbor search. This bi-encoder pattern made dense retrieval practical at web scale and is the default building block for retrieval-augmented generation systems.
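
The encode-once, search-at-query-time pattern can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a production index: the three-dimensional vectors are hypothetical stand-ins for real encoder output, the helper names (`normalize`, `build_index`, `search`) are invented for this example, and exhaustive search replaces approximate nearest-neighbor search. Plain Python lists keep the sketch dependency-free; a real system would use an array library and a vector database.

```python
import math

def normalize(vec):
    """Scale a vector to unit length so dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def build_index(doc_vectors):
    """Encode-once step: normalize and store every document vector."""
    return [normalize(v) for v in doc_vectors]

def search(index, query_vector, k=2):
    """Exhaustive nearest-neighbor search; returns (doc_index, score) pairs."""
    q = normalize(query_vector)
    scored = [(i, sum(a * b for a, b in zip(q, d))) for i, d in enumerate(index)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical 3-dimensional vectors standing in for real encoder output.
doc_vectors = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.2], [0.8, 0.3, 0.1]]
index = build_index(doc_vectors)
hits = search(index, [1.0, 0.2, 0.0], k=2)  # documents 0 and 2 rank highest
```

The document vectors are normalized once at indexing time, so the only per-query work is one normalization and one pass of dot products.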

History

Early work treated documents as sparse bag-of-words vectors, sometimes weighted by term frequency-inverse document frequency (TF-IDF). Latent semantic analysis (LSA), introduced by Deerwester, Dumais, Furnas, Landauer, and Harshman in 1990, applied singular value decomposition to a term-by-document matrix to project sparse counts into a roughly 100-dimensional latent space that captured synonymy through co-occurrence.^[1] LSA is considered the first practical dense representation for text retrieval.

The distributed word representation era, kicked off by Word2Vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014), produced strong word embeddings but only weak sentence embeddings when word vectors were averaged. Le and Mikolov (2014) extended the idea with Paragraph Vector (doc2vec) in PV-DM and PV-DBOW variants, training a per-document vector alongside word vectors to predict words in a passage.^[2]

Deep sentence encoders began with Skip-Thought vectors (Kiros et al., 2015), which trained an encoder-decoder to predict surrounding sentences. Conneau, Kiela, Schwenk, Barrault, and Bordes published InferSent in 2017, showing a BiLSTM trained on the Stanford Natural Language Inference (SNLI) corpus produced general-purpose sentence representations beating unsupervised baselines on transfer tasks.^[3] Cer and colleagues at Google released the Universal Sentence Encoder (USE) in 2018, packaging a Transformer encoder and a Deep Averaging Network through TensorFlow Hub.^[4]

The single biggest leap was Sentence-BERT (SBERT), published by Nils Reimers and Iryna Gurevych at EMNLP-IJCNLP 2019.^[5] SBERT fine-tunes a pretrained BERT model in a Siamese / triplet architecture with mean pooling, producing fixed-size sentence embeddings comparable through cosine similarity.^[5] The headline result was practical: finding the most similar pair in a 10,000-sentence collection dropped from about 65 hours with vanilla BERT cross-encoding to roughly 5 seconds with SBERT, with similar accuracy.^[5] SBERT's authors also released the sentence-transformers Python library, which became the de facto standard for training and serving sentence embedding models.^[18]

Gao, Yao, and Chen introduced SimCSE in 2021, a contrastive learning recipe with simple ingredients.^[6] The unsupervised variant passes the same sentence through a Transformer twice with independent dropout masks and treats the outputs as a positive pair, with other in-batch sentences as negatives. The key theoretical insight was that dropout acts as minimal data augmentation: removing it causes representation collapse, while applying it twice introduces just enough noise to create a non-trivial learning signal.^[6] The supervised variant uses SNLI/MNLI entailment pairs as positives and contradictions as hard negatives. SimCSE also used alignment and uniformity, two properties introduced by Wang and Isola (2020), as diagnostic axes for sentence embeddings: alignment measures how close positive pairs are, while uniformity measures how evenly the embeddings cover the unit hypersphere.^[6] SimCSE pushed average STS Spearman correlation above 81 percent with a base-sized model.^[6]

From 2022 onward, web-scale contrastive pretraining dominated. Ni and colleagues at Google released GTR (Generalizable T5 Retrieval) in late 2021, showing that scaling a T5 encoder to billions of parameters with a 768-dimensional bottleneck produced strong zero-shot retrieval.^[7] Wang and colleagues at Microsoft released the E5 family in December 2022, trained contrastively on a 270M-pair web dataset (CCPairs) with prefixes such as "query:" and "passage:".^[11] The Beijing Academy of Artificial Intelligence (BAAI) released the BGE (BAAI General Embedding) line in 2023 with the C-Pack technical report.^[12] Alibaba's GTE family followed, along with Nomic AI's open-data Nomic Embed,^[15] OpenAI's text-embedding-3 models in January 2024,^[16] Cohere Embed v3, Voyage AI embeddings,^[21] and Snowflake Arctic Embed.^[20] By 2024 the top of the public MTEB leaderboard was held by LLM-derived encoders such as E5-Mistral-7B-instruct and gte-Qwen2-7B-instruct.^[13]^[22] NVIDIA's NV-Embed-v2, built on Mistral-7B-v0.1 with a novel latent-attention pooling mechanism, topped the overall MTEB leaderboard in August 2024 with a score of 72.31 across 56 tasks.^[23] The Qwen3-Embedding family (2025) and Microsoft's Harrier-OSS-v1 (March 2026) extended this trend, with the Harrier 27B model reaching the top of the multilingual MTEB v2 benchmark.^[24]^[25]

Architectures

Two broad architectures dominate. A bi-encoder (or dual encoder) processes each sentence independently and produces one vector per sentence; comparison is a cheap dot product or cosine similarity. A cross-encoder feeds a sentence pair jointly through a Transformer and outputs a similarity score, which is more accurate but cannot be precomputed: every pair needs its own forward pass, so comparing all pairs among n sentences takes on the order of n² passes. In production, the two are often combined: a bi-encoder retrieves top-k candidates, then a cross-encoder reranks them.

Bi-encoders

The bi-encoder architecture is the foundation of scalable dense retrieval. Because each document is encoded once and stored, the only variable cost at query time is encoding the query (a single forward pass) and performing approximate nearest-neighbor (ANN) search over the stored vectors. Libraries such as FAISS, ScaNN, and HNSW-based indexes make the ANN step fast enough for billion-scale corpora. The tradeoff is that bi-encoders miss fine-grained token-level interactions between a query and a passage because the two are never jointly attended to during encoding.

Bi-encoders are often optimized for asymmetric retrieval tasks where queries are short (a few words or a sentence) and passages are longer paragraphs. Some production models use different pooling configurations or separately trained query and passage towers for best asymmetric performance, while others share one encoder and mark the role of each input with a text prefix, as in the "query:" and "passage:" convention in E5.^[11]

Cross-encoders

A cross-encoder concatenates both sentences (separated by a [SEP] token) and runs a full Transformer forward pass over the combined sequence. Every token in one sentence can attend to every token in the other, so the model captures nuanced lexical and semantic interactions that bi-encoders miss. The output is a scalar relevance score. Cross-encoders consistently outperform bi-encoders on pairwise similarity tasks but cannot precompute scores for a corpus: scoring N documents against one query costs N full forward passes.^[5]

The standard deployment pattern for high-quality retrieval is therefore a two-stage pipeline. The bi-encoder retrieves top-100 or top-200 candidates in milliseconds, and the cross-encoder reranks them to produce a final top-5 or top-10 list. Models designed specifically for the reranking role, such as ms-marco-MiniLM-L-6-v2 from the sentence-transformers library and Cohere Rerank, are trained with pairwise or listwise ranking losses over query-passage relevance judgments.^[18]
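
The two-stage pattern can be sketched as follows. Everything here is illustrative: `toy_pair_scorer` is a hypothetical word-overlap function standing in for a real cross-encoder, the two-dimensional vectors are made up, and the function name `retrieve_then_rerank` is invented for this example.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def retrieve_then_rerank(query_vec, query_text, doc_vecs, doc_texts,
                         pair_scorer, n_candidates=3, top_k=2):
    # Stage 1: cheap vector scores over the whole corpus (bi-encoder role).
    by_vector = sorted(range(len(doc_vecs)),
                       key=lambda i: dot(query_vec, doc_vecs[i]), reverse=True)
    candidates = by_vector[:n_candidates]
    # Stage 2: expensive joint scoring, applied only to the shortlist.
    reranked = sorted(candidates,
                      key=lambda i: pair_scorer(query_text, doc_texts[i]),
                      reverse=True)
    return reranked[:top_k]

def toy_pair_scorer(query, passage):
    """Hypothetical stand-in for a cross-encoder: fraction of query words present."""
    q, p = set(query.lower().split()), set(passage.lower().split())
    return len(q & p) / len(q) if q else 0.0

doc_texts = ["cats sleep all day", "how to train a dog",
             "dog training tips for puppies", "stock market news"]
doc_vecs = [[0.1, 0.9], [0.8, 0.3], [0.9, 0.2], [0.0, 0.1]]  # made-up embeddings
result = retrieve_then_rerank([1.0, 0.1], "train a dog", doc_vecs, doc_texts,
                              toy_pair_scorer)
```

In this toy run the vector stage ranks document 2 first, and the pairwise scorer promotes document 1 because it shares more words with the query. The shortlist size bounds the number of expensive scoring calls regardless of corpus size.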

Pooling strategies

Given a Transformer backbone, a pooling layer turns a sequence of token vectors into a single sentence vector. The three common choices are:

CLS pooling: uses the hidden state of the first [CLS] token. BERT was pre-trained with a next-sentence prediction objective that updates the CLS token, but without fine-tuning for similarity this vector captures only weak sentence-level semantics.^[5]
Mean pooling: averages all token hidden states weighted by the attention mask, so padding tokens do not contribute. SBERT and most later models find mean pooling outperforms CLS for similarity tasks on encoder-only architectures.^[5]
Max pooling: takes the element-wise maximum across all token hidden states, capturing the strongest activation per dimension. Useful when specific tokens dominate meaning, but it tends to ignore contextual relationships.
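
The three pooling choices above can be written out directly. This is a minimal sketch on plain Python lists, assuming `token_vecs` holds one hidden-state vector per token with the [CLS] position first and `mask` holds 1 for real tokens and 0 for padding; the function names and the tiny example values are hypothetical.

```python
def cls_pool(token_vecs, mask):
    """Use the first token's hidden state as the sentence vector."""
    return list(token_vecs[0])

def mean_pool(token_vecs, mask):
    """Average hidden states over real tokens only; padding is skipped."""
    dim = len(token_vecs[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vecs, mask):
        if keep:
            count += 1
            for d in range(dim):
                total[d] += vec[d]
    return [t / max(count, 1) for t in total]

def max_pool(token_vecs, mask):
    """Element-wise maximum over real tokens (assumes at least one real token)."""
    kept = [vec for vec, keep in zip(token_vecs, mask) if keep]
    return [max(column) for column in zip(*kept)]

tokens = [[1.0, 0.0], [3.0, 2.0], [9.0, 9.0]]  # last row is a padding position
mask = [1, 1, 0]
pooled = {
    "cls": cls_pool(tokens, mask),    # [1.0, 0.0]
    "mean": mean_pool(tokens, mask),  # [2.0, 1.0]
    "max": max_pool(tokens, mask),    # [3.0, 2.0]
}
```

The padding row has large values on purpose: if the mask were ignored, both the mean and the max would be distorted by it.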

For decoder-only LLM-based encoders (E5-Mistral, NV-Embed, Harrier), last-token pooling or a learned latent-attention layer is more common because the autoregressive architecture does not produce a natural CLS token. NV-Embed introduced a latent-attention pooling layer that allows the LLM to attend to a set of learned latent vectors before producing the final embedding, consistently outperforming mean pooling on retrieval benchmarks.^[23]

Contrastive training objectives

Contemporary models train with contrastive objectives. The most widely used is the temperature-scaled InfoNCE loss (closely related to the NT-Xent loss used in SimCLR). For a minibatch of N (query, positive passage) pairs, each query q_i is contrasted against the N-1 other passages in the batch, which serve as in-batch negatives. The loss for the i-th pair is:

L_i = -log [ exp(sim(q_i, p_i) / τ) / Σ_{j=1..N} exp(sim(q_i, p_j) / τ) ]

where sim denotes cosine similarity, τ is a temperature hyperparameter (typically 0.05 to 0.1), q_i is a query embedding, p_i is its positive passage, and the batch loss is the mean of L_i over i.^[6] NT-Xent, which also treats the other examples on the same side of the batch as negatives, raises the count to 2(N-1) negatives per anchor. Lower temperature sharpens the distribution and forces the model to distinguish harder negatives, but too low a temperature makes training unstable.
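
The formula can be evaluated directly on a toy batch. The sketch below implements the one-directional form as written, with the sum running over the passages in the batch; the function name `info_nce`, the default temperature, and the two-dimensional vectors are assumptions made for illustration. It computes the loss value only and does not perform gradient updates.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def info_nce(queries, passages, tau=0.05):
    """Mean InfoNCE loss; passages[i] is the positive for queries[i]."""
    losses = []
    for i, q in enumerate(queries):
        logits = [cosine(q, p) / tau for p in passages]
        m = max(logits)  # subtract the max before exponentiating for stability
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)

queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.1], [0.1, 1.0]]   # each positive points the same way as its query
swapped = [aligned[1], aligned[0]]   # positives deliberately mismatched
loss_aligned = info_nce(queries, aligned)
loss_swapped = info_nce(queries, swapped)
```

The aligned batch yields a loss near zero and the mismatched batch a large one, which is the signal that pulls positives together during training.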

Hard negative mining adds difficult, semantically nearby examples to sharpen decision boundaries. Hard negatives are typically discovered with a current retriever or BM25 on each training query, or by using a cross-encoder teacher to score candidates and select those that score highly but are not positives. The Multiple Negatives Ranking Loss (MNR), implemented in sentence-transformers, is a variant that efficiently trains on (query, positive) pairs by treating all other positives in the batch as negatives.^[18]

Matryoshka Representation Learning

Matryoshka Representation Learning (MRL), introduced by Kusupati and colleagues at NeurIPS 2022, trains an embedding so that every prefix (first 64, 128, 256, 512, ... dimensions) is itself a useful representation.^[9] A single model can then serve embeddings at multiple dimensions, trading retrieval cost against quality without retraining.^[9] OpenAI's text-embedding-3 family and Nomic Embed v1.5 both expose MRL truncation through their APIs.^[15]^[16] The practical effect is significant: truncating from 3072 to 256 dimensions typically drops MTEB retrieval nDCG@10 by only 2 to 4 points while reducing storage cost by 12x.
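
On the consumer side, using a Matryoshka embedding amounts to keeping a prefix of the vector. The sketch below assumes the vector came from an MRL-trained model (truncating an ordinary embedding this way is not expected to work well) and re-normalizes after truncation so that dot product and cosine similarity still agree; the helper names and the float32 storage assumption are illustrative.

```python
import math

def truncate_embedding(vec, dims):
    """Keep the first `dims` coordinates and rescale to unit length."""
    head = list(vec[:dims])
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm > 0 else head

def storage_bytes(n_vectors, dims, bytes_per_value=4):
    """Raw storage for float32 vectors, ignoring index overhead."""
    return n_vectors * dims * bytes_per_value

short = truncate_embedding([3.0, 4.0, 12.0], 2)  # [0.6, 0.8]

full_cost = storage_bytes(1_000_000, 3072)   # 12,288,000,000 bytes
small_cost = storage_bytes(1_000_000, 256)   # 1,024,000,000 bytes
ratio = full_cost / small_cost               # 12.0
```

The ratio matches the 12x figure above: 3072 / 256 = 12, independent of the number of vectors stored.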

Training paradigms

Four recipes account for most modern sentence encoders. NLI supervision (InferSent, SBERT) trains a classifier on premise-hypothesis pairs from SNLI and MNLI.^[3]^[5] Contrastive training with in-batch negatives (SimCSE, GTR, E5, BGE) pulls positive pairs together and pushes random negatives apart.^[6]^[7]^[11]^[12] Retrieval-style training (MS MARCO and BEIR-style hard negatives) tunes the model on query-passage pairs from real search logs.^[8] Instruction tuning (E5-Mistral, BGE-en-icl, gte-Qwen2-instruct) prepends a natural-language task description so one model can switch between retrieval, classification, clustering, and STS at inference time.^[13]^[22]

Top models combine several stages: weakly supervised contrastive pretraining on hundreds of millions of noisy web pairs, supervised fine-tuning on a curated mix of MS MARCO, NLI, and synthetic data, then instruction tuning on diverse task formats.^[11]^[13]

Evaluation metrics

Sentence embedding quality is assessed with different metrics depending on the task.

Cosine similarity

Cosine similarity measures the cosine of the angle between two vectors:

cos(u, v) = (u · v) / (||u|| ||v||)

It ranges from -1 (opposite directions) to 1 (identical directions) and is invariant to vector magnitude. Because most sentence embedding models L2-normalize their output vectors before returning them, cosine similarity on those outputs is numerically equal to the dot product. This property is exploited in ANN libraries: inner product search on normalized vectors gives cosine rankings.

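Cosine similarity is usually preferred over Euclidean distance for high-dimensional sentence embeddings because it compares direction rather than magnitude. In count-based representations, a document that uses a word once and one that uses it ten times may differ greatly in magnitude but be semantically equivalent, and cosine similarity captures this. For L2-normalized vectors the two measures produce identical rankings, because the squared Euclidean distance equals 2 - 2 cos(u, v).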
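
A small numeric check makes these properties concrete. The sketch below verifies, on two hypothetical 2-dimensional vectors, that cosine similarity ignores scale, that it equals the dot product after L2 normalization, and that the standard identity ||u - v||² = 2 - 2 cos(u, v) holds for unit vectors.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def cosine(u, v):
    return dot(u, v) / (norm(u) * norm(v))

def unit(u):
    n = norm(u)
    return [x / n for x in u]

def squared_euclidean(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

u, v = [3.0, 4.0], [4.0, 3.0]
c = cosine(u, v)                        # 24 / 25 = 0.96
c_scaled = cosine([30.0, 40.0], v)      # same angle, ten times the magnitude
dot_unit = dot(unit(u), unit(v))        # equals c after normalization
dist_unit = squared_euclidean(unit(u), unit(v))  # equals 2 - 2c = 0.08
```

Because the distance on unit vectors is a decreasing function of the cosine, ranking neighbors by either measure gives the same order.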

Spearman correlation on STS

The standard scalar for STS benchmarks is Spearman's rank correlation (rho) between model-predicted cosine similarities and human-assigned gold similarity scores. Spearman correlation measures rank agreement rather than linear agreement, making it robust to non-linear relationships between raw cosine scores and human judgments. The SemEval STS shared tasks originally ranked systems by Pearson correlation; sentence embedding work since SBERT generally reports Spearman as the headline number, sometimes with the Pearson coefficient alongside for completeness.^[19]

A model that achieves Spearman rho = 0.90 on STSb dev has predicted the relative ordering of sentence pair similarities in close, though not perfect, agreement with human annotators. State-of-the-art models (2024) reach 0.92 to 0.93 on the STSb test set.
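
Spearman's rho is the Pearson correlation of the ranks, which can be computed without any statistics library. The sketch below is a minimal version with average ranks for ties; the gold scores and predicted cosines are invented for illustration and do not come from any benchmark.

```python
import math

def average_ranks(values):
    """1-based ranks; tied values share the mean of the positions they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    return pearson(average_ranks(x), average_ranks(y))

gold = [0.0, 1.5, 3.0, 4.2, 5.0]           # invented human scores on the 0 to 5 scale
predicted = [0.10, 0.35, 0.30, 0.80, 0.95]  # invented model cosine similarities
rho = spearman(gold, predicted)             # 0.9: one adjacent pair is out of order
rho_monotone = spearman([1, 2, 3], [1, 8, 27])  # 1.0 despite a non-linear relation
```

The second call shows why rank correlation suits STS: a model whose cosines are a monotone but non-linear function of the human scores still gets full credit.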

Retrieval metrics

For retrieval tasks (MS MARCO, BEIR, MTEB retrieval), the standard metrics are:

nDCG@k (normalized discounted cumulative gain): the primary metric for MTEB retrieval; rewards placing relevant results higher in the ranking.^[10]
Recall@k: the fraction of relevant documents found in the top-k results; directly relevant for RAG pipelines that need to capture the correct passage.
MRR (mean reciprocal rank): the average reciprocal position of the first relevant result, commonly used on MS MARCO where each query has a single relevant passage.

MTEB reports nDCG@10 as the primary retrieval metric and averages across its 15 retrieval datasets for the leaderboard retrieval column score.^[10]
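
The three metrics can be computed for a single query as shown below. This is a sketch under stated assumptions: the document ids and relevance labels are invented, nDCG uses the linear-gain form (some evaluations use an exponential gain instead), and MRR would be the mean of `reciprocal_rank` over all queries.

```python
import math

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top k."""
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids) if relevant_ids else 0.0

def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / position of the first relevant result, or 0 if none is retrieved."""
    for position, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / position
    return 0.0

def ndcg_at_k(ranked_ids, gains, k):
    """gains maps doc id to graded relevance; ids not in gains count as 0."""
    dcg = sum(gains.get(doc_id, 0) / math.log2(position + 1)
              for position, doc_id in enumerate(ranked_ids[:k], start=1))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(position + 1)
               for position, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0

ranked = ["d3", "d1", "d7", "d2"]   # invented system output, best first
gains = {"d1": 1, "d2": 1}          # invented relevance judgments
recall3 = recall_at_k(ranked, set(gains), 3)   # 0.5: d1 found, d2 missed
rr = reciprocal_rank(ranked, set(gains))       # 0.5: first hit at position 2
ndcg3 = ndcg_at_k(ranked, gains, 3)
```

Recall only asks whether a relevant document made the cut, while nDCG also penalizes this ranking for placing the relevant document second rather than first.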

Alignment and uniformity

Following Wang and Isola (2020), Gao and colleagues analyzed embedding quality along two axes.^[6] Alignment measures the average distance between embeddings of positive pairs (lower is better). Uniformity measures how evenly points are distributed on the unit hypersphere using a Gaussian potential kernel (lower is also better, because more uniform distributions use the space more efficiently and suffer less from representation degeneration).^[6] These two axes make the evaluation geometric rather than purely correlation-based and help diagnose whether a model collapses to a narrow cone (poor uniformity) or fails to cluster related sentences (poor alignment).
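
Both quantities are short to compute on unit-norm embeddings. The sketch below uses the commonly cited forms, assumed here rather than taken from this article: alignment as the mean squared distance between positive pairs, and uniformity as the log of the mean Gaussian potential exp(-t · ||x - y||²) over distinct pairs with t = 2. The example points are hypothetical.

```python
import math

def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def alignment(positive_pairs):
    """Mean squared distance between unit-norm positive pairs (lower is better)."""
    return sum(squared_distance(a, b) for a, b in positive_pairs) / len(positive_pairs)

def uniformity(points, t=2.0):
    """Log of the mean Gaussian potential over distinct pairs (lower is better)."""
    potentials = [math.exp(-t * squared_distance(points[i], points[j]))
                  for i in range(len(points))
                  for j in range(i + 1, len(points))]
    return math.log(sum(potentials) / len(potentials))

spread = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]   # evenly spaced on the circle
collapsed = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]  # degenerate: one point

u_spread = uniformity(spread)        # clearly negative
u_collapsed = uniformity(collapsed)  # 0.0, the worst possible value
a_perfect = alignment([([1.0, 0.0], [1.0, 0.0])])  # 0.0: the pair coincides
```

A fully collapsed model scores perfectly on alignment and worst on uniformity, which is why the two numbers are read together.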

Notable models

Model	Year	Parameters	Embedding dim	Notes
InferSent	2017	~38M	4096	BiLSTM trained on SNLI^[3]
Universal Sentence Encoder	2018	~147M (Transformer)	512	Two variants, TF Hub^[4]
Sentence-BERT bert-base-nli-mean-tokens	2019	110M	768	First widely used SBERT model^[5]
all-MiniLM-L6-v2	2021	22M	384	Compact, very fast default^[18]
all-mpnet-base-v2	2021	109M	768	Popular high-quality default^[18]
SimCSE-RoBERTa-large	2021	355M	1024	Supervised contrastive^[6]
GTR-T5-large	2021	335M	768	Dual encoder from T5^[7]
Sentence-T5 (ST5)	2021	up to 11B	768	Multi-stage T5 sentence encoder
E5-large-v2	2022	335M	1024	Microsoft weakly supervised^[11]
BGE-large-en-v1.5	2023	335M	1024	BAAI, MTEB leader at release^[12]
GTE-large	2023	335M	1024	Alibaba general text embedding
E5-Mistral-7B-instruct	2023	7.1B	4096	LLM-based, instruction-tuned^[13]
BGE-M3	2024	568M	1024	100+ languages, dense/sparse/multi-vector^[14]
Nomic Embed v1	2024	137M	768	Open weights and open data, 8192 context^[15]
OpenAI text-embedding-3-small	2024	not disclosed	up to 1536	Matryoshka, API only^[16]
OpenAI text-embedding-3-large	2024	not disclosed	up to 3072	Matryoshka, API only^[16]
Cohere Embed v3 multilingual	2023	not disclosed	1024	Multilingual, API only
gte-Qwen2-7B-instruct	2024	7.6B	3584	Top of MTEB at release, 32K context^[22]
Nomic Embed v2 MoE	2024	305M (active)	768 (MRL to 256)	First general-purpose MoE embedder
Snowflake arctic-embed-l	2024	335M	1024	Retrieval-optimized open model^[20]
Snowflake arctic-embed-l-v2.0	2024	568M	1024	Multilingual, December 2024
Voyage-3	2024	not disclosed	1024	32K context, multilingual^[21]
Voyage-3-large	2025	not disclosed	1024 (MRL)	32K context, multi-domain
NV-Embed-v2	2024	7.8B (Mistral-based)	4096	MTEB overall leader at release (72.31); latent-attention pooling^[23]
Qwen3-Embedding-8B	2025	8B	variable	MTEB multilingual No.1 as of June 2025 (70.58); instruction-tuned^[24]
Qwen3-Embedding-0.6B	2025	0.6B	variable	Compact variant; strong multilingual performance^[24]
Microsoft Harrier-OSS-v1 27B	2026	27B	4096	Tops multilingual MTEB v2 (mid-70s score); MIT license; 32K context^[25]
Gemini Embedding 2	2026	not disclosed	variable	Google's first all-modality embedding model (March 2026)

Benchmarks

Benchmark	Year	Coverage	Focus
STS-12 through STS-16	2012-2016	English	Sentence pair similarity
SICK	2014	English	Compositional similarity and entailment
STS Benchmark (STSb)	2017	English	Curated dev/test from SemEval data^[19]
SemEval-2017 Task 1	2017	5 language pairs	Multilingual and cross-lingual STS^[19]
SentEval	2018	English	17 transfer tasks for sentence embeddings
MS MARCO	2018-2019	English	Web search passage ranking
BEIR	2021	English, 18 datasets	Zero-shot heterogeneous retrieval^[8]
MIRACL	2022	18 languages	Multilingual ad-hoc retrieval
MTEB	2022	56+ datasets, 112 languages	8 task types (retrieval, STS, classification, clustering, reranking, pair classification, summarization, bitext mining)^[10]
MTEB-French, MTEB-Chinese, MTEB-Polish	2023+	Per-language	Language-specific MTEB tracks

The Massive Text Embedding Benchmark (MTEB), introduced by Muennighoff, Tazi, Magne, and Reimers in 2022 (arXiv:2210.07316, EACL 2023), is the de facto industry benchmark.^[10] Its public leaderboard at huggingface.co/spaces/mteb/leaderboard is where new embedding releases compete.^[17] As of mid-2025 and into 2026, top positions across the MTEB leaderboards are held by LLM-based encoders: NV-Embed-v2 (NVIDIA) led the overall leaderboard through late 2024,^[23] Qwen3-Embedding-8B topped the multilingual leaderboard as of June 2025,^[24] and Microsoft Harrier-OSS-v1 27B topped the multilingual MTEB v2 as of late March 2026.^[25] BEIR (Thakur et al., NeurIPS 2021) remains the standard zero-shot retrieval benchmark and is incorporated as the retrieval slice of MTEB.^[8]

A persistent criticism of MTEB-centric evaluation is that leaderboard pressure encourages benchmark-specific overfitting. The classification and clustering slices use simple linear probes that can shift substantially with small hyperparameter changes. Several groups have noted that models trained on data derived from MTEB-adjacent tasks can inflate scores without producing genuinely better embeddings for downstream applications. MTEB v2, launched to address some of these concerns, expanded coverage to 131 tasks across languages and introduced harder retrieval splits designed to reduce overlap with training data sources.^[26]

Applications

Semantic search is the most visible application: pre-computed embeddings of documents, paragraphs, FAQs, or product descriptions enable fast cosine-similarity lookup of the most relevant items for a query. The same retrieval pipeline is the first stage of most retrieval-augmented generation systems, in which retrieved passages are concatenated into the prompt of a generative model.

Clustering and topic modeling group documents by embedding similarity (often combined with HDBSCAN and UMAP, as in BERTopic) without requiring a fixed taxonomy. Paraphrase detection, duplicate question detection (as on Quora and Stack Overflow), and plagiarism checking rely on threshold-based similarity comparisons. Recommendation systems use embeddings to match users with content, posts, or other users. Data engineering teams also use sentence embeddings for near-duplicate detection during dataset construction and for similarity-based generation metrics such as BERTScore.

The major practical applications break down as follows:

Semantic search and RAG retrieval. In a retrieval-augmented generation system, the retrieval component encodes a knowledge base into a vector store and retrieves the top-k passages for each user query. The quality of retrieved passages directly affects the faithfulness of the generated answer, making embedding model selection a critical decision. Most production RAG systems pair a bi-encoder for recall-oriented retrieval with a cross-encoder reranker for precision-oriented final ranking.

Dataset deduplication and quality filtering. Training large language models requires curating datasets with minimal near-duplicate content. Sentence embedding similarity, usually with cosine thresholds around 0.8 to 0.95, is used to flag near-duplicate sentences and paragraphs at scale. This was a key step in constructing datasets such as The Pile and various instruction-tuning corpora.
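
Threshold-based near-duplicate detection can be sketched as below. The threshold of 0.9 is one value from the range mentioned above, the vectors are hypothetical and assumed non-zero, and the all-pairs loop is quadratic, so a large corpus would need ANN search or hashing instead of this brute-force pass.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def near_duplicate_pairs(vectors, threshold=0.9):
    """All index pairs (i, j), i < j, whose cosine similarity meets the threshold."""
    return [(i, j)
            for i in range(len(vectors))
            for j in range(i + 1, len(vectors))
            if cosine(vectors[i], vectors[j]) >= threshold]

def deduplicate(vectors, threshold=0.9):
    """Greedy pass: keep an item only if it is not too close to anything already kept."""
    kept = []
    for i, vec in enumerate(vectors):
        if all(cosine(vec, vectors[j]) < threshold for j in kept):
            kept.append(i)
    return kept

vectors = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]  # items 0 and 1 are near-duplicates
pairs = near_duplicate_pairs(vectors)  # [(0, 1)]
kept = deduplicate(vectors)            # [0, 2]
```

The greedy pass keeps the first item of each near-duplicate group, so the result depends on input order; sorting by a quality score first is one way to control which copy survives.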

Cross-lingual alignment. Multilingual embedding models such as BGE-M3 and the Qwen3-Embedding family project sentences from different languages into a shared space, enabling cross-lingual retrieval: a query in English can retrieve relevant passages in French, Chinese, or Arabic without translation.^[14]^[24] The bitext mining task in MTEB measures this: given two corpora in different languages, how accurately can the model identify translation pairs?^[10]

Code search and technical retrieval. Embedding models trained on code-text pairs (such as CodeBERT or specialized variants of E5 and BGE) can match natural-language queries to code snippets. This powers features like GitHub Copilot's context retrieval and enterprise code search tools.

Evaluation metrics for generative models. BERTScore, an automated metric for evaluating generated text quality, greedily matches each token embedding in the reference to its most similar token embedding in the candidate (and vice versa) by cosine similarity, then aggregates the matches into recall, precision, and F1 scores. Sentence-level BERTScore extensions use sentence embeddings directly to measure passage-level similarity between generated and gold summaries.
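
The greedy-matching idea behind BERTScore can be sketched on toy token vectors. This is a simplified illustration, not the reference implementation: it omits refinements such as importance weighting and score rescaling, the function name `greedy_match` is invented, and the 2-dimensional token embeddings are hypothetical.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def greedy_match(ref_tokens, cand_tokens):
    """Return (precision, recall, f1) from best-match cosine similarities."""
    # Recall: how well each reference token is covered by the candidate.
    recall = sum(max(cosine(r, c) for c in cand_tokens)
                 for r in ref_tokens) / len(ref_tokens)
    # Precision: how well each candidate token is supported by the reference.
    precision = sum(max(cosine(c, r) for r in ref_tokens)
                    for c in cand_tokens) / len(cand_tokens)
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f1

ref = [[1.0, 0.0], [0.0, 1.0]]   # two reference tokens
cand = [[1.0, 0.0]]              # candidate covers only the first one
precision, recall, f1 = greedy_match(ref, cand)  # 1.0, 0.5, about 0.667
```

The candidate says nothing unsupported (precision 1.0) but misses half of the reference (recall 0.5), and F1 balances the two.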

Modern landscape (2024-2026)

Several converging trends define the current state of sentence similarity and text embedding research.

LLM-derived encoders

The first major trend is the dominance of LLM-derived encoders at the top of quality benchmarks. E5-Mistral-7B-instruct, gte-Qwen2-7B-instruct, SFR-Embedding-Mistral, and Linq-Embed-Mistral all start from a 7B-parameter decoder model and fine-tune it on synthetic and curated retrieval data.^[13]^[22] NVIDIA's NV-Embed-v2 extends the Mistral-7B base with a custom latent-attention pooling layer.^[23] These models reach top MTEB scores but impose much higher serving latency and memory requirements than the 110M-parameter encoders that preceded them. In 2025 and 2026 the trend extended to even larger models: Qwen3-Embedding-8B (Alibaba, 2025) and Microsoft Harrier-OSS-v1 at 27B parameters continued pushing quality benchmarks further at the cost of higher inference compute.^[24]^[25]

A practical consequence is a clearer two-tier market. For high-stakes retrieval in enterprise RAG pipelines, LLM-scale encoders are used despite their cost. For high-throughput embedding of large corpora, smaller models in the 100M to 500M parameter range remain dominant because they can be batched efficiently on commodity GPUs or even CPUs.

Adaptive embedding dimensions

The second trend is the widespread adoption of Matryoshka Representation Learning for adaptive embedding dimensions.^[9] OpenAI text-embedding-3, Nomic Embed v1.5 and v2, Snowflake arctic-embed-l-v2.0, and Voyage-3-large all let users truncate vectors to 256 or fewer dimensions, cutting storage and ANN index costs in vector databases with only small accuracy losses.^[16] MRL has become a standard feature request for new model releases, and most models published from 2024 onward include it.

Multilingual and multimodal expansion

The third trend is the sharp expansion of multilingual and multimodal coverage. BGE-M3 spans more than 100 languages and supports three retrieval modalities from a single model: dense embeddings (cosine similarity), sparse embeddings (BM25-style lexical matching through learned sparse vectors), and multi-vector embeddings (ColBERT-style late interaction).^[14] The Qwen3-Embedding family inherits the multilingual capabilities of the Qwen3 foundation models and topped the MTEB multilingual leaderboard in 2025.^[24] Microsoft Harrier-OSS-v1 (March 2026) supports a 32,768-token context window across languages and is released under an MIT license.^[25] Google's Gemini Embedding 2 (March 2026) is the first production embedding model to support all five modalities (text, image, audio, video, and document) from a single model, reflecting a broader trend toward unified multimodal retrieval.

Domain-tuned variants have also proliferated. Specialized embedding models for code (CodeBERT-based and LLM-derived), legal text, clinical notes, and financial documents are now published by both academic groups and commercial vendors, reflecting the brittleness of general-domain models when applied to specialized corpora.

Synthetic training data

A fourth trend, closely tied to the rise of LLM-derived encoders, is the use of synthetic training data generated by large language models. The E5-Mistral paper demonstrated that prompting GPT-4 to generate query-passage pairs for 93 task types across multiple languages produced training data that, when used for contrastive fine-tuning, yielded state-of-the-art embeddings without requiring manually labeled retrieval datasets.^[13]^[29] This technique has since been adopted widely, reducing dependence on expensive human annotation and enabling rapid domain adaptation.

Limitations

Sentence embedding models share several well-documented failure modes that constrain their applicability.

Domain transfer. An encoder trained on web text often performs poorly on legal contracts, clinical notes, or scientific papers without further fine-tuning.^[8] The STS and retrieval scores for a general-domain model can fall sharply when applied to specialized vocabulary or document structures that differ significantly from training data. Domain-specific fine-tuning on even a few thousand labeled pairs usually recovers most of the performance.

Negation blindness. Sentence embeddings frequently fail to distinguish affirmative from negated statements. "The drug is effective" and "The drug is not effective" often yield high cosine similarity because both sentences share the same salient content words ("drug", "effective"). Research through 2025 confirmed that LLMs and their embedding derivatives systematically underestimate the impact of negation tokens on meaning. This is a direct failure mode in retrieval-augmented generation when queries or passages contain negations or contraindications.

Numeracy gap. A 2024 study ("Revealing the Numeracy Gap") found that embedding models cannot reliably distinguish semantically important numeric differences.^[27] "The stock rose by 2%" and "The stock fell by 2%" produce embeddings that are very close despite having opposite implications, because the magnitude of the numbers is the same.^[27] Numeric reasoning is a consistent weak spot for embedding-based retrieval in finance, medicine, and engineering contexts.

Long-context degradation. Even models advertised as accepting 8K or 32K tokens tend to over-weight content from the beginning of the input (primacy bias) and to degrade in quality for truly long documents. Single-vector embedding compresses all content into one fixed-size vector, which is a lossy compression that becomes progressively worse as document length increases. Multi-vector approaches such as ColBERT-style late interaction partially address this at the cost of higher storage and retrieval complexity.

Evaluation overfitting. Heavy reliance on the public MTEB leaderboard has been criticized for encouraging benchmark-specific overfitting. Models whose training data overlaps with MTEB tasks can inflate scores. The classification and clustering slices use simple linear probes that are sensitive to hyperparameter choices. MTEB v2, with 131 tasks and harder splits, was designed to reduce this problem but itself becomes a target for optimization over time.^[26]

Logical and relational reasoning. Embedding models cannot reliably encode logical operators, orderings, or compositional constraints. Queries that require reasoning about multiple conditions (for instance, "documents about X that were published after 2020 and are not about Y") are poorly served by pure embedding retrieval, which is why hybrid systems combining embeddings with structured filters are common in production.^[28]
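
A hybrid system of this kind applies the logical conditions as exact filters and leaves only the fuzzy topical match to the embedding. The sketch below is illustrative: the metadata fields (`year`, `topics`), the document records, and the function name `hybrid_search` are all hypothetical.

```python
def hybrid_search(query_vec, docs, predicate, k=2):
    """Apply a structured filter first, then rank the survivors by dot product."""
    survivors = [d for d in docs if predicate(d)]
    survivors.sort(key=lambda d: sum(a * b for a, b in zip(query_vec, d["vec"])),
                   reverse=True)
    return [d["id"] for d in survivors[:k]]

docs = [
    {"id": "a", "year": 2019, "topics": {"x"}, "vec": [0.90, 0.10]},
    {"id": "b", "year": 2022, "topics": {"x", "y"}, "vec": [0.95, 0.05]},
    {"id": "c", "year": 2023, "topics": {"x"}, "vec": [0.70, 0.30]},
    {"id": "d", "year": 2021, "topics": {"x"}, "vec": [0.80, 0.10]},
]

# "About X, published after 2020, and not about Y": the date and the negation
# are handled by the filter, the topical match by the vectors.
result = hybrid_search(
    [1.0, 0.0], docs,
    predicate=lambda d: d["year"] > 2020 and "y" not in d["topics"],
)
```

Document "b" has the highest vector score but is excluded by the filter, which pure embedding retrieval would have no reliable way to do.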

Bias amplification. Sentence embedding models trained on web-scale data inherit and can amplify social biases present in that data, for instance linking names typical of certain genders or ethnicities with particular concepts. Audits such as SEAT (Sentence Embedding Association Test) and debiasing techniques such as projection-based bias removal have been developed but are not universally applied.

Theoretical limitations of single-vector retrieval. A 2025 theoretical analysis ("On the Theoretical Limitations of Embedding-Based Retrieval") showed that single-vector bi-encoder retrieval is provably unable to correctly rank sets of documents defined by logical combinations of conditions, even with an ideal embedding model.^[28] This motivates multi-vector representations and hybrid retrieval architectures for semantically complex queries.

Sentence similarity models are a specialized subtype within the broader category of feature extraction models. Feature extraction models produce general-purpose text representations that can be used for any downstream task, including classification, generation, and structured prediction. Sentence similarity models are designed specifically so that the geometry of the output vector space reflects semantic relatedness: models are trained, evaluated, and deployed primarily for distance-based comparisons rather than for feeding into task-specific heads.

In practice, many models serve both roles. SBERT, E5, and BGE are used both as sentence similarity models (for STS and retrieval) and as feature extractors (input features for classifiers). However, a feature extraction model that has not been fine-tuned with a similarity objective, such as a raw BERT encoder, will produce embeddings whose geometry does not reliably reflect semantic similarity.^[5] This distinction motivates the separate evaluation tracks in MTEB, which tests both similarity-focused tasks (STS, pair classification) and representation quality on downstream tasks (classification, clustering).^[10]

Cross-encoders, discussed above in the architecture section, overlap with models commonly categorized under natural language inference and text classification rather than sentence similarity, since they produce a scalar score rather than a vector embedding. In practice they function as the high-precision stage of retrieval pipelines and are shipped alongside bi-encoders in libraries such as sentence-transformers.^[18]
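
The two-stage pattern can be sketched with placeholder scorers. In the sketch below, `embed` stands in for a bi-encoder and `pair_score` for a cross-encoder; both toy implementations (word counts and Jaccard overlap) are hypothetical stand-ins chosen so the example runs without any model, and they are not how either model type actually scores text.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve_then_rerank(query, docs, embed, pair_score, first_stage_k=3, final_k=2):
    """Shortlist by embedding similarity, then reorder the shortlist with a pairwise scorer."""
    q_vec = embed(query)
    shortlist = sorted(docs, key=lambda d: cosine(q_vec, embed(d)), reverse=True)[:first_stage_k]
    return sorted(shortlist, key=lambda d: pair_score(query, d), reverse=True)[:final_k]

VOCAB = ["cat", "dog", "mat", "car"]

def toy_embed(text):
    """Stand-in bi-encoder: counts of a few vocabulary words."""
    words = text.split()
    return [float(words.count(w)) for w in VOCAB]

def toy_pair_score(query, doc):
    """Stand-in cross-encoder: Jaccard overlap of the two word sets."""
    q, d = set(query.split()), set(doc.split())
    return len(q & d) / len(q | d)

CORPUS = ["cat cat cat mat", "the cat sat on the mat", "dog in car", "mat"]
top = retrieve_then_rerank("cat mat", CORPUS, toy_embed, toy_pair_score)
```

The first stage only needs one embedding per document, which can be precomputed, while the pairwise scorer runs once per shortlisted candidate; this is why the expensive stage is restricted to a small shortlist.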

Sparse retrieval systems such as BM25 and its learned successors (for example SPLADE and SPADE) are often treated as alternatives or complements to dense sentence embeddings.^[31] BGE-M3 combines these approaches: a single model produces dense embeddings, learned sparse embeddings, and multi-vector representations, so dense, sparse, and late-interaction scoring are available without running separate retrieval systems.^[14]
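
When dense and sparse retrievers are run side by side, their ranked lists have to be merged. One common option is reciprocal rank fusion (RRF), sketched below as an assumption: it is not described in the sources above and is not how BGE-M3 combines its own scores. The document ids and the smoothing constant `k=60` are illustrative.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists of document ids by summing 1 / (k + rank) over the lists."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

# Hypothetical outputs of a dense retriever and a sparse retriever for one query.
dense_ranking = ["d2", "d1", "d3"]
sparse_ranking = ["d1", "d4", "d2"]
fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
```

Because RRF uses only rank positions, it avoids having to calibrate cosine similarities against BM25-style scores, which live on different scales.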

References

Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., and Harshman, R. *Indexing by Latent Semantic Analysis*. Journal of the American Society for Information Science, 1990. https://asistdl.onlinelibrary.wiley.com/doi/10.1002/(SICI)1097-4571(199009)41:6%3C391::AID-ASI1%3E3.0.CO;2-9
Le, Q. and Mikolov, T. *Distributed Representations of Sentences and Documents*. ICML 2014. https://arxiv.org/abs/1405.4053
Conneau, A., Kiela, D., Schwenk, H., Barrault, L., and Bordes, A. *Supervised Learning of Universal Sentence Representations from Natural Language Inference Data*. EMNLP 2017 (InferSent). https://arxiv.org/abs/1705.02364
Cer, D. et al. *Universal Sentence Encoder*. arXiv:1803.11175, 2018. https://arxiv.org/abs/1803.11175
Reimers, N. and Gurevych, I. *Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks*. EMNLP-IJCNLP 2019. https://arxiv.org/abs/1908.10084
Gao, T., Yao, X., and Chen, D. *SimCSE: Simple Contrastive Learning of Sentence Embeddings*. EMNLP 2021. https://arxiv.org/abs/2104.08821
Ni, J. et al. *Large Dual Encoders Are Generalizable Retrievers* (GTR). arXiv:2112.07899, 2021. https://arxiv.org/abs/2112.07899
Thakur, N., Reimers, N., Rücklé, A., Srivastava, A., and Gurevych, I. *BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models*. NeurIPS 2021 Datasets and Benchmarks. https://arxiv.org/abs/2104.08663
Kusupati, A. et al. *Matryoshka Representation Learning*. NeurIPS 2022. https://arxiv.org/abs/2205.13147
Muennighoff, N., Tazi, N., Magne, L., and Reimers, N. *MTEB: Massive Text Embedding Benchmark*. EACL 2023. https://arxiv.org/abs/2210.07316
Wang, L., Yang, N., Huang, X., Jiao, B., Yang, L., Jiang, D., Majumder, R., and Wei, F. *Text Embeddings by Weakly-Supervised Contrastive Pre-training* (E5). arXiv:2212.03533, 2022. https://arxiv.org/abs/2212.03533
Xiao, S., Liu, Z., Zhang, P., and Muennighoff, N. *C-Pack: Packed Resources for General Chinese Embeddings* (BGE). arXiv:2309.07597, 2023. https://arxiv.org/abs/2309.07597
Wang, L. et al. *Improving Text Embeddings with Large Language Models* (E5-Mistral). arXiv:2401.00368, 2024. https://arxiv.org/abs/2401.00368
Chen, J., Xiao, S., Zhang, P., Luo, K., Lian, D., and Liu, Z. *M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation* (BGE-M3). arXiv:2402.03216, 2024. https://arxiv.org/abs/2402.03216
Nussbaum, Z., Morris, J. X., Duderstadt, B., and Mulyar, A. *Nomic Embed: Training a Reproducible Long Context Text Embedder*. arXiv:2402.01613, 2024. https://arxiv.org/abs/2402.01613
OpenAI. *New embedding models and API updates* (text-embedding-3). January 25, 2024. https://openai.com/index/new-embedding-models-and-api-updates/
Hugging Face. *MTEB leaderboard*. https://huggingface.co/spaces/mteb/leaderboard
Reimers, N. *sentence-transformers documentation and pretrained models*. https://www.sbert.net/docs/sentence_transformer/pretrained_models.html
Cer, D. et al. *SemEval-2017 Task 1: Semantic Textual Similarity Multilingual and Crosslingual Focused Evaluation*. ACL 2017. https://aclanthology.org/S17-2001/
Merrick, L. et al. *Arctic-Embed: Scalable, Efficient, and Accurate Text Embedding Models*. arXiv:2405.05374, 2024. https://arxiv.org/abs/2405.05374
Voyage AI. *voyage-3 and voyage-3-lite: a new generation of small yet mighty general-purpose embedding models*. September 18, 2024. https://blog.voyageai.com/2024/09/18/voyage-3/
Alibaba-NLP. *gte-Qwen2-7B-instruct model card*. https://huggingface.co/Alibaba-NLP/gte-Qwen2-7B-instruct
Lee, J. et al. *NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models*. arXiv:2405.17428, 2024. https://arxiv.org/abs/2405.17428
Qwen Team. *Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models*. Qwen Blog, 2025. https://qwenlm.github.io/blog/qwen3-embedding/
Microsoft. *Harrier-OSS-v1: Industry-Leading Embedding Model*. Bing Blog, April 2026. https://blogs.bing.com/search/April-2026/Microsoft-Open-Sources-Industry-Leading-Embedding-Model
Muennighoff, N. et al. *MTEB v2: Scaling the Massive Text Embedding Benchmark*. Hugging Face, 2025. https://huggingface.co/spaces/mteb/leaderboard
Zhu, Y. et al. *Revealing the Numeracy Gap: An Empirical Investigation of Text Embedding Models*. arXiv:2509.05691, 2025. https://arxiv.org/abs/2509.05691
Weller, O. et al. *On the Theoretical Limitations of Embedding-Based Retrieval*. arXiv:2508.21038, 2025. https://arxiv.org/abs/2508.21038
Wang, L. et al. *Improving Text Embeddings with Large Language Models* (E5 synthetic data). ACL 2024. https://arxiv.org/abs/2401.00368
Reimers, N. and Gurevych, I. *Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation*. EMNLP 2020. https://arxiv.org/abs/2004.09813
Formal, T., Piwowarski, B., and Clinchant, S. *SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking*. SIGIR 2021. https://arxiv.org/abs/2107.05720
