Sentence Similarity Models
Last reviewed
May 31, 2026
Sources
31 citations
Review status
Source-backed
Revision
v3 · 5,369 words
Sentence similarity models are machine learning systems that map natural language sentences to fixed-length numerical vectors (sentence embeddings) so that semantically related sentences sit close together in the vector space. Similarity between two sentences is then measured with a distance or angle metric, most commonly cosine similarity. These models power many natural language processing applications, including semantic search, retrieval for retrieval-augmented generation, clustering, paraphrase mining, deduplication, and recommendation.
The field grew directly from earlier feature extraction research, but with a distinct goal: rather than producing representations for classification or generation, sentence similarity models are optimized specifically so that the geometry of the vector space reflects human-perceived semantic relatedness. This constraint shapes every design choice, from the objective function used during training to the pooling strategy applied at inference time.
See also: Natural Language Processing Models, Feature Extraction Models
A sentence embedding is a dense vector, typically between 256 and 4096 dimensions, that encodes the meaning of a sentence or short passage. Two sentences with the same meaning should yield vectors whose cosine similarity is close to 1, while unrelated sentences should score near 0. The related research area of semantic textual similarity (STS) measures how well a system predicts the human-judged similarity score for a sentence pair, scored 0 to 5 in the SemEval shared tasks. A high STS Spearman correlation does not by itself guarantee good retrieval performance, which is why modern evaluation suites such as MTEB cover both STS and information retrieval tasks.
In practice, these models are deployed as encoders inside larger pipelines. A search system encodes every document or passage once and stores the vectors in a vector database. At query time, the question is encoded with the same model and nearest neighbors are returned by approximate nearest-neighbor search. This bi-encoder pattern made dense retrieval practical at web scale and is the default building block for retrieval-augmented generation systems.
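The encode-once, search-at-query-time pattern described above can be sketched with exact (brute-force) search standing in for an ANN index. The vectors here are hypothetical placeholders for encoder output, and `l2_normalize` and `top_k` are illustrative helper names rather than part of any library.

```python
import numpy as np

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so dot products equal cosine similarities."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)

def top_k(query_vec: np.ndarray, doc_matrix: np.ndarray, k: int = 3):
    """Return (index, score) pairs for the k documents most similar to the query."""
    scores = doc_matrix @ query_vec      # cosine similarity, since rows are unit length
    order = np.argsort(-scores)[:k]      # exact search; an ANN index would approximate this
    return [(int(i), float(scores[i])) for i in order]

# Hypothetical precomputed document embeddings (in practice, encoder output
# stored in a vector database).
doc_vectors = l2_normalize(np.array([
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.2],
    [0.8, 0.3, 0.1],
    [0.1, 0.0, 1.0],
]))

# Hypothetical query embedding, produced by the same encoder at query time.
query_vector = l2_normalize(np.array([[1.0, 0.2, 0.0]]))[0]

results = top_k(query_vector, doc_vectors, k=2)
print(results)
```

Exact search is linear in the corpus size, which is why production systems replace the `argsort` step with an approximate index once the corpus grows large.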
Early work treated documents as sparse bag-of-words vectors, sometimes weighted by term frequency-inverse document frequency (TF-IDF). Latent semantic analysis (LSA), introduced by Deerwester, Dumais, Furnas, Landauer, and Harshman in 1990, applied singular value decomposition to a term-by-document matrix to project sparse counts into a roughly 100-dimensional latent space that captured synonymy through co-occurrence. LSA is considered the first practical dense representation for text retrieval.
The distributed word representation era, kicked off by Word2Vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014), produced strong word embeddings but only weak sentence embeddings when word vectors were averaged. Le and Mikolov (2014) extended the idea with Paragraph Vector (doc2vec) in PV-DM and PV-DBOW variants, training a per-document vector alongside word vectors to predict words in a passage.
Deep sentence encoders began with Skip-Thought vectors (Kiros et al., 2015), which trained an encoder-decoder to predict surrounding sentences. Conneau, Kiela, Schwenk, Barrault, and Bordes published InferSent in 2017, showing a BiLSTM trained on the Stanford Natural Language Inference (SNLI) corpus produced general-purpose sentence representations beating unsupervised baselines on transfer tasks. Cer and colleagues at Google released the Universal Sentence Encoder (USE) in 2018, packaging a Transformer encoder and a Deep Averaging Network through TensorFlow Hub.
The single biggest leap was Sentence-BERT (SBERT), published by Nils Reimers and Iryna Gurevych at EMNLP-IJCNLP 2019. SBERT fine-tunes a pretrained BERT model in a Siamese / triplet architecture with mean pooling, producing fixed-size sentence embeddings comparable through cosine similarity. The headline result was practical: finding the most similar pair in a 10,000-sentence collection dropped from about 65 hours with vanilla BERT cross-encoding to roughly 5 seconds with SBERT, with similar accuracy. SBERT's authors also released the sentence-transformers Python library, which became the de facto standard for training and serving sentence embedding models.
Gao, Yao, and Chen introduced SimCSE in 2021, a contrastive learning recipe with simple ingredients. The unsupervised variant passes the same sentence through a Transformer twice with independent dropout masks and treats the outputs as a positive pair, with other in-batch sentences as negatives. The key insight was that dropout acts as minimal data augmentation: removing it, or reusing the same mask for both passes, makes the two views identical and sharply degrades the learned representations, while two independent masks introduce just enough noise to create a non-trivial learning signal. The supervised variant uses SNLI/MNLI entailment pairs as positives and contradictions as hard negatives. SimCSE also popularized alignment and uniformity, two measures proposed by Wang and Isola (2020), as diagnostic axes for sentence embeddings: alignment measures how close positive pairs are, while uniformity measures how evenly the embeddings cover the unit hypersphere. SimCSE pushed average STS Spearman correlation above 81 percent with a base-sized model.
From 2022 onward, web-scale contrastive pretraining dominated. Ni and colleagues at Google released GTR (Generalizable T5 Retrieval) in late 2021, showing that scaling a T5 encoder to billions of parameters with a 768-dimensional bottleneck produced strong zero-shot retrieval. Wang and colleagues at Microsoft released the E5 family in December 2022, trained contrastively on a 270M-pair web dataset (CCPairs) with prefixes such as "query:" and "passage:". The Beijing Academy of Artificial Intelligence (BAAI) released the BGE (BAAI General Embedding) line in 2023 with the C-Pack technical report. Alibaba's GTE family followed, along with Nomic AI's open-data Nomic Embed, OpenAI's text-embedding-3 models in January 2024, Cohere Embed v3, Voyage AI embeddings, and Snowflake Arctic Embed. By 2024 the top of the public MTEB leaderboard was held by LLM-derived encoders such as E5-Mistral-7B-instruct and gte-Qwen2-7B-instruct. NVIDIA's NV-Embed-v2, built on Mistral-7B-v0.1 with a novel latent-attention pooling mechanism, topped the overall MTEB leaderboard in August 2024 with a score of 72.31 across 56 tasks. The Qwen3-Embedding family (2025) and Microsoft's Harrier-OSS-v1 (March 2026) extended this trend, with the Harrier 27B model reaching the top of the multilingual MTEB v2 benchmark.
Two broad architectures dominate. A bi-encoder (or dual encoder) processes each sentence independently and produces one vector per sentence; comparison is a cheap dot product or cosine similarity. A cross-encoder feeds a sentence pair jointly through a Transformer and outputs a similarity score, which is more accurate but cannot be precomputed: every candidate pair needs its own forward pass, so comparing all pairs among n sentences costs on the order of n² passes. In production, the two are often combined: a bi-encoder retrieves top-k candidates, then a cross-encoder reranks them.
The bi-encoder architecture is the foundation of scalable dense retrieval. Because each document is encoded once and stored, the only variable cost at query time is encoding the query (a single forward pass) and performing approximate nearest-neighbor (ANN) search over the stored vectors. Libraries such as FAISS and ScaNN, along with HNSW-based indexes, make the ANN step fast enough for billion-scale corpora. The tradeoff is that bi-encoders miss fine-grained token-level interactions between a query and a passage because the two are never jointly attended to during encoding.
Bi-encoders are widely used for asymmetric retrieval tasks, where queries are short (a few words or a sentence) and passages are longer paragraphs. Some production models use different pooling configurations or separately trained query and passage towers for the two sides; others share one encoder and mark each input's role with a text prefix, as in E5's "query:" and "passage:" convention.
A cross-encoder concatenates both sentences (separated by a [SEP] token) and runs a full Transformer forward pass over the combined sequence. Every token in one sentence can attend to every token in the other, so the model captures nuanced lexical and semantic interactions that bi-encoders miss. The output is a scalar relevance score. Cross-encoders consistently outperform bi-encoders on pairwise similarity tasks but cannot precompute scores for a corpus: scoring N documents against one query costs N full forward passes.
The standard deployment pattern for high-quality retrieval is therefore a two-stage pipeline. The bi-encoder retrieves top-100 or top-200 candidates in milliseconds, and the cross-encoder reranks them to produce a final top-5 or top-10 list. Models designed specifically for the reranking role, such as ms-marco-MiniLM-L-6-v2 from the sentence-transformers library and Cohere Rerank, are trained with pairwise or listwise ranking losses over query-passage relevance judgments.
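The two-stage pattern can be sketched as follows. The two scoring functions are deliberately crude stand-ins chosen so the example runs without any model: `word_overlap` plays the role of the cheap bi-encoder score and `bigram_overlap` plays the role of the more expensive cross-encoder score. All names here are hypothetical.

```python
from typing import Callable, List, Sequence

Scorer = Callable[[str, str], float]

def retrieve_then_rerank(
    query: str,
    docs: Sequence[str],
    cheap_score: Scorer,
    pair_score: Scorer,
    first_stage_k: int,
    final_k: int,
) -> List[int]:
    """Return indices of the final_k documents after a two-stage ranking."""
    # Stage 1: cheap scoring over the whole corpus, keep a candidate shortlist.
    candidates = sorted(
        range(len(docs)), key=lambda i: cheap_score(query, docs[i]), reverse=True
    )[:first_stage_k]
    # Stage 2: expensive pairwise scoring, applied to the shortlist only.
    reranked = sorted(
        candidates, key=lambda i: pair_score(query, docs[i]), reverse=True
    )
    return reranked[:final_k]

def word_overlap(a: str, b: str) -> float:
    """Stand-in for a bi-encoder score: number of shared words."""
    return float(len(set(a.split()) & set(b.split())))

def bigram_overlap(a: str, b: str) -> float:
    """Stand-in for a cross-encoder score: number of shared word bigrams."""
    def bigrams(text: str) -> set:
        words = text.split()
        return set(zip(words, words[1:]))
    return float(len(bigrams(a) & bigrams(b)))

docs = [
    "the mat was sat on by the cat",
    "a dog chased the cat",
    "stock prices rose sharply",
    "the cat sat on the mat",
]
final = retrieve_then_rerank(
    "cat sat on the mat", docs, word_overlap, bigram_overlap,
    first_stage_k=3, final_k=2,
)
print(final)
```

In this toy corpus the first stage cannot separate the two documents that share all query words, and the second stage promotes the one that also preserves word order, which mirrors the recall-then-precision division of labor.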
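Given a Transformer backbone, a pooling layer turns a sequence of token vectors into a single sentence vector. The three common choices are CLS-token pooling (take the vector of the special first token), mean pooling (average all token vectors), and max pooling (take the element-wise maximum over tokens).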
For decoder-only LLM-based encoders (E5-Mistral, NV-Embed, Harrier), last-token pooling or a learned latent-attention layer is more common because the autoregressive architecture does not produce a natural CLS token. NV-Embed introduced a latent-attention pooling layer that allows the LLM to attend to a set of learned latent vectors before producing the final embedding, consistently outperforming mean pooling on retrieval benchmarks.
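A minimal sketch of three pooling strategies mentioned in this section, operating on a single sequence of token vectors. It assumes right-padded input with an attention mask of ones for real tokens and zeros for padding; the function names are illustrative, and the learned latent-attention layer is not shown.

```python
import numpy as np

def mean_pool(token_vecs: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Average the token vectors, ignoring padded positions."""
    weights = mask[:, None].astype(token_vecs.dtype)
    return (token_vecs * weights).sum(axis=0) / max(weights.sum(), 1.0)

def cls_pool(token_vecs: np.ndarray) -> np.ndarray:
    """Use the first token's vector (the CLS position in BERT-style encoders)."""
    return token_vecs[0]

def last_token_pool(token_vecs: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Use the last non-padded token, as is common for decoder-only encoders."""
    last_index = int(np.nonzero(mask)[0][-1])
    return token_vecs[last_index]

# Four token positions with 2-dimensional hidden states; the last one is padding.
token_vecs = np.array([[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [9.0, 9.0]])
mask = np.array([1, 1, 1, 0])

print(mean_pool(token_vecs, mask))        # average of the first three rows
print(cls_pool(token_vecs))               # first row
print(last_token_pool(token_vecs, mask))  # third row, the last real token
```

Forgetting the mask in mean pooling is a common bug: the padded row would otherwise pull the average toward whatever values the padding positions happen to hold.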
Contemporary models train with contrastive objectives. The most widely used is the temperature-scaled InfoNCE loss (closely related to the NT-Xent loss used in SimCLR). For a minibatch of N (query, passage) pairs, each query's positive passage is contrasted against the N-1 other passages in the batch, which serve as in-batch negatives; symmetric variants such as NT-Xent also score the reverse direction and therefore see 2(N-1) negatives per example. The loss for pair i is:
L_i = -log [ exp(sim(q_i, p_i) / τ) / Σ_j exp(sim(q_i, p_j) / τ) ]
where sim denotes cosine similarity, τ is a temperature hyperparameter (typically 0.05 to 0.1), q_i is a query embedding, p_i is its positive passage, and the sum over j runs over all N passages in the batch. The batch loss is the mean of L_i over i. Lower temperature sharpens the distribution and forces the model to distinguish harder negatives, but too low a temperature makes training unstable.
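The loss can be written in a few lines of NumPy. This is a sketch of the one-directional form, in which each query is scored against its own passage and the other passages in the batch; it computes the forward value only (no gradients), and the function name is illustrative.

```python
import numpy as np

def info_nce(queries: np.ndarray, passages: np.ndarray, temperature: float = 0.05) -> float:
    """Mean InfoNCE loss; row i of `passages` is the positive for row i of `queries`."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    p = passages / np.linalg.norm(passages, axis=1, keepdims=True)
    logits = (q @ p.T) / temperature                      # cosine similarities scaled by 1/tau
    logits = logits - logits.max(axis=1, keepdims=True)   # subtract row max for numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))            # positives sit on the diagonal

eye = np.eye(3)
aligned_loss = info_nce(eye, eye)                         # each query matches its own passage
mismatched_loss = info_nce(eye, np.roll(eye, 1, axis=0))  # positives point at the wrong passage
uniform_loss = info_nce(np.ones((3, 2)), np.ones((3, 2))) # every pair equally similar

print(aligned_loss, mismatched_loss, uniform_loss)
```

When every passage looks equally similar to every query the loss equals log N, which is the value a model starts near before it has learned anything useful.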
Hard negative mining adds difficult, semantically nearby examples to sharpen decision boundaries. Hard negatives are typically discovered with a current retriever or BM25 on each training query, or by using a cross-encoder teacher to score candidates and select those that score highly but are not positives. The Multiple Negatives Ranking Loss (MNR), implemented in sentence-transformers, is a variant that efficiently trains on (query, positive) pairs by treating all other positives in the batch as negatives.
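The selection step of hard negative mining can be sketched as follows, assuming a retriever or cross-encoder teacher has already scored the candidates for one query. The function name and the document ids are hypothetical.

```python
from typing import Dict, List, Set

def mine_hard_negatives(scores: Dict[str, float], positive_ids: Set[str], n: int) -> List[str]:
    """Pick the n highest-scoring candidates that are not labeled positives."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [doc_id for doc_id in ranked if doc_id not in positive_ids][:n]

# Hypothetical teacher scores for one training query.
candidate_scores = {"d1": 0.92, "d2": 0.88, "d3": 0.15, "d4": 0.80}
hard_negatives = mine_hard_negatives(candidate_scores, positive_ids={"d1"}, n=2)
print(hard_negatives)
```

High-scoring unlabeled candidates are sometimes true positives that the annotators missed, so pipelines often apply an additional filter before trusting them as negatives.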
Matryoshka Representation Learning (MRL), introduced by Kusupati and colleagues at NeurIPS 2022, trains an embedding so that every prefix (first 64, 128, 256, 512, ... dimensions) is itself a useful representation. A single model can then serve embeddings at multiple dimensions, trading retrieval cost against quality without retraining. OpenAI's text-embedding-3 family and Nomic Embed v1.5 both expose MRL truncation through their APIs. The practical effect is significant: truncating from 3072 to 256 dimensions typically drops MTEB retrieval nDCG@10 by only 2 to 4 points while reducing storage cost by 12x.
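On the consumer side, using a Matryoshka embedding amounts to keeping a prefix and renormalizing. The sketch below uses a random vector as a placeholder, so it illustrates only the mechanics and the 12x storage arithmetic; truncation preserves quality only for models actually trained with MRL. The helper name is illustrative.

```python
import numpy as np

def truncate_embedding(vec: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` coordinates and rescale to unit length."""
    prefix = vec[:dim]
    return prefix / np.linalg.norm(prefix)

rng = np.random.default_rng(0)
full = rng.normal(size=3072)            # placeholder for a 3072-dimensional embedding
small = truncate_embedding(full, 256)

storage_ratio = full.size / small.size  # 3072 / 256
print(small.shape, storage_ratio)
```

Renormalizing after truncation matters because the prefix of a unit vector is generally shorter than unit length, and downstream inner-product search assumes normalized inputs.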
Four recipes account for most modern sentence encoders. NLI supervision (InferSent, SBERT) trains a classifier on premise-hypothesis pairs from SNLI and MNLI. Contrastive training with in-batch negatives (SimCSE, GTR, E5, BGE) pulls positive pairs together and pushes random negatives apart. Retrieval-style training (MS MARCO and BEIR-style hard negatives) tunes the model on query-passage pairs from real search logs. Instruction tuning (E5-Mistral, BGE-en-icl, gte-Qwen2-instruct) prepends a natural-language task description so one model can switch between retrieval, classification, clustering, and STS at inference time.
Top models combine several stages: weakly supervised contrastive pretraining on hundreds of millions of noisy web pairs, supervised fine-tuning on a curated mix of MS MARCO, NLI, and synthetic data, then instruction tuning on diverse task formats.
Sentence embedding quality is assessed with different metrics depending on the task.
Cosine similarity measures the cosine of the angle between two vectors:
cos(u, v) = (u · v) / (||u|| ||v||)
It ranges from -1 (opposite directions) to 1 (identical directions) and is invariant to vector magnitude. Many sentence embedding models L2-normalize their output vectors before returning them, in which case cosine similarity equals the dot product. ANN libraries exploit this property: inner product search on normalized vectors gives cosine rankings.
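Cosine similarity is usually preferred over Euclidean distance for unnormalized sentence embeddings because it compares direction rather than magnitude. With count-based vectors, for example, a document that uses a word once and one that uses it ten times may differ greatly in magnitude while being semantically equivalent, and cosine similarity ignores that difference. Once vectors are L2-normalized the two measures produce identical rankings, since ||u - v||² = 2 - 2 cos(u, v).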
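The relationships among cosine similarity, the dot product, and Euclidean distance can be checked numerically. The vectors below are arbitrary examples, and `cosine_similarity` is an illustrative helper.

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine of the angle between u and v."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

u = np.array([1.0, 2.0, 2.0])
v = np.array([2.0, 0.0, 1.0])

cos_uv = cosine_similarity(u, v)
cos_scaled = cosine_similarity(10.0 * u, v)   # rescaling u leaves the cosine unchanged

u_hat = u / np.linalg.norm(u)
v_hat = v / np.linalg.norm(v)
dot_hat = float(u_hat @ v_hat)                # dot product of the unit vectors
sq_dist = float(np.sum((u_hat - v_hat) ** 2)) # squared Euclidean distance of the unit vectors

print(cos_uv, cos_scaled, dot_hat, sq_dist)
```

For unit vectors the squared Euclidean distance equals 2 - 2 cos(u, v), so ranking by smallest distance and ranking by largest cosine give the same order.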
The standard summary statistic for STS benchmarks is Spearman's rank correlation (rho) between model-predicted cosine similarities and human-assigned gold similarity scores. Spearman correlation measures rank agreement rather than linear agreement, making it robust to non-linear relationships between raw cosine scores and human judgments. The SemEval STS tasks originally ranked systems by Pearson correlation; most sentence embedding papers since SBERT report Spearman, sometimes with Pearson alongside for completeness.
A model that achieves Spearman rho = 0.90 on STSb dev orders sentence pairs in close agreement with the ranking implied by human annotators. State-of-the-art models (2024) reach 0.92 to 0.93 on the STSb test set.
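Spearman's rho is the Pearson correlation of the ranks, which a short standard-library sketch makes concrete. The gold scores and predicted cosines below are made-up numbers for illustration, and tied values receive their average rank.

```python
from typing import List, Sequence

def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    std_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (std_x * std_y)

def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: Pearson correlation computed on ranks."""
    return pearson(average_ranks(x), average_ranks(y))

gold = [0.0, 1.5, 3.0, 4.2, 5.0]             # hypothetical human scores on the 0 to 5 scale
predicted = [0.10, 0.35, 0.30, 0.80, 0.95]   # hypothetical model cosine similarities

rho = spearman(gold, predicted)
rho_cubed = spearman(gold, [p ** 3 for p in predicted])  # monotone transform of the predictions
print(rho, rho_cubed)
```

The second call shows why the metric suits cosine scores: any monotone rescaling of the predictions leaves rho unchanged, so the model is judged on ordering alone.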
For retrieval tasks (MS MARCO, BEIR, MTEB retrieval), the standard metrics are rank-based, most commonly nDCG@k, MRR@k, and Recall@k.
MTEB reports nDCG@10 as the primary retrieval metric and averages across its 15 retrieval datasets for the leaderboard retrieval column score.
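A minimal sketch of nDCG@k for a single query follows. It assumes the input list holds the graded relevance of every judged document in the order the system ranked them, and it uses linear gain; some implementations use an exponential gain of 2^rel - 1 instead. The function names are illustrative.

```python
import math
from typing import Sequence

def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain: relevance divided by log2 of (position + 1)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances: Sequence[float], k: int = 10) -> float:
    """DCG of the system ranking divided by the DCG of the ideal ranking."""
    ideal_dcg = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0
    return dcg_at_k(ranked_relevances, k) / ideal_dcg

perfect = ndcg_at_k([1, 1, 0])   # both relevant documents ranked first
partial = ndcg_at_k([1, 0, 1])   # one relevant document pushed to rank 3
worse = ndcg_at_k([0, 1, 1])     # no relevant document at rank 1
print(perfect, partial, worse)
```

A benchmark score is then the mean of this value over all queries in a dataset.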
Wang and Isola (2020) proposed measuring embedding quality along two axes, which Gao and colleagues later applied to sentence embeddings in SimCSE. Alignment measures the average distance between embeddings of positive pairs (lower is better). Uniformity measures how evenly points are distributed on the unit hypersphere using a Gaussian potential kernel (lower is also better, because more uniform distributions use the space more efficiently and suffer less from representation degeneration). These two axes make the evaluation geometric rather than purely correlation-based and help diagnose whether a model collapses to a narrow cone (poor uniformity) or fails to cluster related sentences (poor alignment).
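Both quantities are straightforward to compute for a batch of unit-norm embeddings. The sketch below assumes the commonly used settings of squared distance for alignment and a Gaussian kernel with t = 2 for uniformity; those parameter values and the function names are assumptions made for illustration.

```python
import numpy as np

def alignment(x: np.ndarray, y: np.ndarray, alpha: float = 2.0) -> float:
    """Mean distance (raised to alpha) between paired embeddings; lower is better."""
    return float(np.mean(np.linalg.norm(x - y, axis=1) ** alpha))

def uniformity(x: np.ndarray, t: float = 2.0) -> float:
    """Log of the mean Gaussian potential over distinct pairs; lower is better."""
    sq_dists = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    upper = np.triu_indices(len(x), k=1)   # each unordered pair once
    return float(np.log(np.mean(np.exp(-t * sq_dists[upper]))))

# Four unit vectors spread evenly around the circle.
spread = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
# Four copies of the same unit vector: a fully collapsed representation.
collapsed = np.tile(np.array([1.0, 0.0]), (4, 1))

print(uniformity(spread), uniformity(collapsed))
print(alignment(spread, spread), alignment(spread, collapsed))
```

The collapsed set scores 0 on uniformity, the worst possible value, while trivially achieving perfect alignment with itself, which is why the two measures are read together.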
| Model | Year | Parameters | Embedding dim | Notes |
|---|---|---|---|---|
| InferSent | 2017 | ~38M | 4096 | BiLSTM trained on SNLI |
| Universal Sentence Encoder | 2018 | ~147M (Transformer) | 512 | Two variants, TF Hub |
| Sentence-BERT bert-base-nli-mean-tokens | 2019 | 110M | 768 | First widely used SBERT model |
| all-MiniLM-L6-v2 | 2021 | 22M | 384 | Compact, very fast default |
| all-mpnet-base-v2 | 2021 | 109M | 768 | Popular high-quality default |
| SimCSE-RoBERTa-large | 2021 | 355M | 1024 | Supervised contrastive |
| GTR-T5-large | 2021 | 335M | 768 | Dual encoder from T5 |
| Sentence-T5 (ST5) | 2021 | up to 11B | 768 | Multi-stage T5 sentence encoder |
| E5-large-v2 | 2022 | 335M | 1024 | Microsoft weakly supervised |
| BGE-large-en-v1.5 | 2023 | 335M | 1024 | BAAI, MTEB leader at release |
| GTE-large | 2023 | 335M | 1024 | Alibaba general text embedding |
| E5-Mistral-7B-instruct | 2023 | 7.1B | 4096 | LLM-based, instruction-tuned |
| BGE-M3 | 2024 | 568M | 1024 | 100+ languages, dense/sparse/multi-vector |
| Nomic Embed v1 | 2024 | 137M | 768 | Open weights and open data, 8192 context |
| OpenAI text-embedding-3-small | 2024 | not disclosed | up to 1536 | Matryoshka, API only |
| OpenAI text-embedding-3-large | 2024 | not disclosed | up to 3072 | Matryoshka, API only |
| Cohere Embed v3 multilingual | 2023 | not disclosed | 1024 | Multilingual, API only |
| gte-Qwen2-7B-instruct | 2024 | 7.6B | 3584 | Top of MTEB at release, 32K context |
| Nomic Embed v2 MoE | 2024 | 305M (active) | 768 (MRL to 256) | First general-purpose MoE embedder |
| Snowflake arctic-embed-l | 2024 | 335M | 1024 | Retrieval-optimized open model |
| Snowflake arctic-embed-l-v2.0 | 2024 | 568M | 1024 | Multilingual, December 2024 |
| Voyage-3 | 2024 | not disclosed | 1024 | 32K context, multilingual |
| Voyage-3-large | 2025 | not disclosed | 1024 (MRL) | 32K context, multi-domain |
| NV-Embed-v2 | 2024 | 7.8B (Mistral-based) | 4096 | MTEB overall leader at release (72.31); latent-attention pooling |
| Qwen3-Embedding-8B | 2025 | 8B | variable | MTEB multilingual No.1 as of June 2025 (70.58); instruction-tuned |
| Qwen3-Embedding-0.6B | 2025 | 0.6B | variable | Compact variant; strong multilingual performance |
| Microsoft Harrier-OSS-v1 27B | 2026 | 27B | 4096 | Tops multilingual MTEB v2 (mid-70s score); MIT license; 32K context |
| Gemini Embedding 2 | 2026 | not disclosed | variable | Google's first all-modality embedding model (March 2026) |

| Benchmark | Year | Coverage | Focus |
|---|---|---|---|
| STS-12 through STS-16 | 2012-2016 | English | Sentence pair similarity |
| SICK | 2014 | English | Compositional similarity and entailment |
| STS Benchmark (STSb) | 2017 | English | Curated dev/test from SemEval data |
| SemEval-2017 Task 1 | 2017 | 5 language pairs | Multilingual and cross-lingual STS |
| SentEval | 2018 | English | 17 transfer tasks for sentence embeddings |
| MS MARCO | 2018-2019 | English | Web search passage ranking |
| BEIR | 2021 | English, 18 datasets | Zero-shot heterogeneous retrieval |
| MIRACL | 2022 | 18 languages | Multilingual ad-hoc retrieval |
| MTEB | 2022 | 56+ datasets, 112 languages | 8 task types (retrieval, STS, classification, clustering, reranking, pair classification, summarization, bitext mining) |
| MTEB-French, MTEB-Chinese, MTEB-Polish | 2023+ | Per-language | Language-specific MTEB tracks |
The Massive Text Embedding Benchmark (MTEB), introduced by Muennighoff, Tazi, Magne, and Reimers in 2022 (arXiv:2210.07316, EACL 2023), is the de facto industry benchmark. Its public leaderboard at huggingface.co/spaces/mteb/leaderboard is where new embedding releases compete. As of mid-2025 and into 2026, top positions on the MTEB leaderboards are held by LLM-based encoders: NV-Embed-v2 (NVIDIA) led the overall leaderboard through late 2024, Qwen3-Embedding-8B topped the multilingual leaderboard as of June 2025, and Microsoft Harrier-OSS-v1 27B topped the multilingual MTEB v2 as of late March 2026. BEIR (Thakur et al., NeurIPS 2021) remains the standard zero-shot retrieval benchmark and is incorporated as the retrieval slice of MTEB.
A persistent criticism of MTEB-centric evaluation is that leaderboard pressure encourages benchmark-specific overfitting. The classification and clustering slices use simple linear probes that can shift substantially with small hyperparameter changes. Several groups have noted that models trained on data derived from MTEB-adjacent tasks can inflate scores without producing genuinely better embeddings for downstream applications. MTEB v2, launched to address some of these concerns, expanded coverage to 131 tasks across languages and introduced harder retrieval splits designed to reduce overlap with training data sources.
Semantic search is the most visible application: pre-computed embeddings of documents, paragraphs, FAQs, or product descriptions enable fast cosine-similarity lookup of the most relevant items for a query. The same retrieval pipeline is the first stage of most retrieval-augmented generation systems, in which retrieved passages are concatenated into the prompt of a generative model.
Clustering and topic modeling group documents by embedding similarity (often combined with HDBSCAN and UMAP, as in BERTopic) without requiring a fixed taxonomy. Paraphrase detection, duplicate question detection (as on Quora and Stack Overflow), and plagiarism checking rely on threshold-based similarity comparisons. Recommendation systems use embeddings to match users with content, posts, or other users. Data engineering teams also use sentence embeddings for near-duplicate detection during dataset construction and for similarity-based generation metrics such as BERTScore.
The major practical applications break down as follows:
Semantic search and RAG retrieval. In a retrieval-augmented generation system, the retrieval component encodes a knowledge base into a vector store and retrieves the top-k passages for each user query. The quality of retrieved passages directly affects the faithfulness of the generated answer, making embedding model selection a critical decision. Most production RAG systems pair a bi-encoder for recall-oriented retrieval with a cross-encoder reranker for precision-oriented final ranking.
Dataset deduplication and quality filtering. Training large language models requires curating datasets with minimal near-duplicate content. Sentence embedding similarity, usually with cosine thresholds around 0.8 to 0.95, is used to flag near-duplicate sentences and paragraphs at scale. Deduplication is a key step in constructing datasets such as The Pile and various instruction-tuning corpora, although large-scale pipelines often rely on lexical methods such as MinHash instead of, or in addition to, embedding similarity.
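Threshold-based deduplication can be sketched as a greedy filter that keeps an item only if it is not too similar to anything already kept. The embeddings are hypothetical, the 0.9 threshold is one value from the range mentioned above, and the quadratic loop is for illustration only; at scale the inner comparison would use an ANN index.

```python
import numpy as np
from typing import List

def dedupe(embeddings: np.ndarray, threshold: float = 0.9) -> List[int]:
    """Return indices of items kept after greedy near-duplicate removal."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    kept: List[int] = []
    for i, vec in enumerate(unit):
        # Keep the item only if its cosine similarity to every kept item is below the threshold.
        if all(float(vec @ unit[j]) < threshold for j in kept):
            kept.append(i)
    return kept

# Hypothetical embeddings: items 0 and 1 are near-duplicates, as are items 2 and 3.
embeddings = np.array([
    [1.00, 0.00],
    [0.99, 0.05],
    [0.00, 1.00],
    [0.02, 1.00],
])
kept_indices = dedupe(embeddings, threshold=0.9)
print(kept_indices)
```

The greedy order matters: the first item in each near-duplicate group survives, so pipelines usually sort by a quality signal before filtering.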
Cross-lingual alignment. Multilingual embedding models such as BGE-M3 and the Qwen3-Embedding family project sentences from different languages into a shared space, enabling cross-lingual retrieval: a query in English can retrieve relevant passages in French, Chinese, or Arabic without translation. The bitext mining task in MTEB measures this: given two corpora in different languages, how accurately can the model identify translation pairs?
Code search and technical retrieval. Embedding models trained on code-text pairs (such as CodeBERT or specialized variants of E5 and BGE) can match natural-language queries to code snippets. This powers features like GitHub Copilot's context retrieval and enterprise code search tools.
Evaluation metrics for generative models. BERTScore, an automated metric for evaluating generated text quality, greedily matches each token in the candidate to its most similar token in the reference, and vice versa, using cosine similarity between contextual token embeddings, then aggregates the matches into precision, recall, and F1. Sentence-level BERTScore extensions use sentence embeddings directly to measure passage-level similarity between generated and gold summaries.
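The greedy matching at the core of BERTScore can be sketched on toy token vectors. This simplified version omits the optional IDF weighting and baseline rescaling of the published metric, and the token embeddings are hypothetical two-dimensional placeholders.

```python
import numpy as np

def greedy_match_f1(ref_tokens: np.ndarray, cand_tokens: np.ndarray) -> float:
    """F1 from greedy cosine matching between reference and candidate token vectors."""
    ref = ref_tokens / np.linalg.norm(ref_tokens, axis=1, keepdims=True)
    cand = cand_tokens / np.linalg.norm(cand_tokens, axis=1, keepdims=True)
    sim = ref @ cand.T                          # pairwise cosine similarities
    recall = float(sim.max(axis=1).mean())      # best candidate match for each reference token
    precision = float(sim.max(axis=0).mean())   # best reference match for each candidate token
    return 2 * precision * recall / (precision + recall)

reference = np.array([[1.0, 0.0], [0.0, 1.0]])
identical = greedy_match_f1(reference, reference)
partial = greedy_match_f1(reference, np.array([[1.0, 0.0], [1.0, 0.0]]))
print(identical, partial)
```

In the second call the candidate covers only one of the two reference tokens, so precision stays at 1 while recall drops to 0.5.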
Several converging trends define the current state of sentence similarity and text embedding research.
The first major trend is the dominance of LLM-derived encoders at the top of quality benchmarks. E5-Mistral-7B-instruct, gte-Qwen2-7B-instruct, SFR-Embedding-Mistral, and Linq-Embed-Mistral all start from a 7B-parameter decoder model and fine-tune it on synthetic and curated retrieval data. NVIDIA's NV-Embed-v2 extends the Mistral-7B base with a custom latent-attention pooling layer. These models reach top MTEB scores but impose much higher serving latency and memory requirements than the 110M-parameter encoders that preceded them. In 2025 and 2026 the trend extended to even larger models: Qwen3-Embedding-8B (Alibaba, 2025) and Microsoft Harrier-OSS-v1 at 27B parameters continued pushing quality benchmarks further at the cost of higher inference compute.
A practical consequence is a clearer two-tier market. For high-stakes retrieval in enterprise RAG pipelines, LLM-scale encoders are used despite their cost. For high-throughput embedding of large corpora, smaller models in the 100M to 500M parameter range remain dominant because they can be batched efficiently on commodity GPUs or even CPUs.
The second trend is the widespread adoption of Matryoshka Representation Learning for adaptive embedding dimensions. OpenAI text-embedding-3, Nomic Embed v1.5 and v2, Snowflake arctic-embed-l-v2.0, and Voyage-3-large all let users truncate vectors to 256 or fewer dimensions, cutting storage and ANN index costs in vector databases with only small accuracy losses. MRL has become a standard feature request for new model releases, and most models published from 2024 onward include it.
The third trend is the sharp expansion of multilingual and multimodal coverage. BGE-M3 spans more than 100 languages and supports three retrieval modes from a single model: dense embeddings (cosine similarity), sparse embeddings (BM25-style lexical matching through learned sparse vectors), and multi-vector embeddings (ColBERT-style late interaction). The Qwen3-Embedding family inherits the multilingual capabilities of the Qwen3 foundation models and topped the MTEB multilingual leaderboard in 2025. Microsoft Harrier-OSS-v1 (March 2026) supports a 32,768-token context window across languages and is released under an MIT license. Google's Gemini Embedding 2 (March 2026) is the first production embedding model to support all five modalities (text, image, audio, video, and document) from a single model, reflecting a broader trend toward unified multimodal retrieval.
Domain-tuned variants have also proliferated. Specialized embedding models for code (CodeBERT-based and LLM-derived), legal text, clinical notes, and financial documents are now published by both academic groups and commercial vendors, reflecting the brittleness of general-domain models when applied to specialized corpora.
A fourth trend, closely tied to the rise of LLM-derived encoders, is the use of synthetic training data generated by large language models. The E5-Mistral paper demonstrated that prompting GPT-4 to generate query-passage pairs for 93 task types across multiple languages produced training data that, when used for contrastive fine-tuning, yielded state-of-the-art embeddings without requiring manually labeled retrieval datasets. This technique has since been adopted widely, reducing dependence on expensive human annotation and enabling rapid domain adaptation.
Sentence embedding models share several well-documented failure modes that constrain their applicability.
Domain transfer. An encoder trained on web text often performs poorly on legal contracts, clinical notes, or scientific papers without further fine-tuning. The STS and retrieval scores for a general-domain model can fall sharply when applied to specialized vocabulary or document structures that differ significantly from training data. Domain-specific fine-tuning on even a few thousand labeled pairs usually recovers most of the performance.
Negation blindness. Sentence embeddings frequently fail to distinguish affirmative from negated statements. "The drug is effective" and "The drug is not effective" often yield high cosine similarity because both sentences share the same salient content words ("drug", "effective"). Research through 2025 confirmed that LLMs and their embedding derivatives systematically underestimate the impact of negation tokens on meaning. This is a direct failure mode in retrieval-augmented generation when queries or passages contain negations or contraindications.
Numeracy gap. A 2024 study ("Revealing the Numeracy Gap") found that embedding models cannot reliably distinguish semantically important numeric differences. "The stock rose by 2%" and "The stock fell by 2%" produce embeddings that are very close despite having opposite implications, because the magnitude of the numbers is the same. Numeric reasoning is a consistent weak spot for embedding-based retrieval in finance, medicine, and engineering contexts.
Long-context degradation. Even models advertised as accepting 8K or 32K tokens tend to over-weight content from the beginning of the input (primacy bias) and to degrade in quality for truly long documents. Single-vector embedding compresses all content into one fixed-size vector, which is a lossy compression that becomes progressively worse as document length increases. Multi-vector approaches such as ColBERT-style late interaction partially address this at the cost of higher storage and retrieval complexity.
Evaluation overfitting. Heavy reliance on the public MTEB leaderboard has been criticized for encouraging benchmark-specific overfitting. Models whose training data overlaps with MTEB tasks can inflate scores. The classification and clustering slices use simple linear probes that are sensitive to hyperparameter choices. MTEB v2, with 131 tasks and harder splits, was designed to reduce this problem but itself becomes a target for optimization over time.
Logical and relational reasoning. Embedding models cannot reliably encode logical operators, orderings, or compositional constraints. Queries that require reasoning about multiple conditions (for instance, "documents about X that were published after 2020 and are not about Y") are poorly served by pure embedding retrieval, which is why hybrid systems combining embeddings with structured filters are common in production.
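A hybrid system of this kind applies the hard constraints as a metadata filter and uses embeddings only to rank what survives. The sketch below assumes each document carries hypothetical `year` and `topic` fields alongside its vector; the field names, the data, and `hybrid_search` are all illustrative.

```python
import numpy as np
from typing import Callable, Dict, List, Sequence

def hybrid_search(
    query_vec: Sequence[float],
    docs: List[Dict],
    predicate: Callable[[Dict], bool],
    k: int = 2,
) -> List[str]:
    """Filter documents with a structured predicate, then rank survivors by cosine similarity."""
    q = np.asarray(query_vec, dtype=float)
    q = q / np.linalg.norm(q)
    scored = []
    for doc in docs:
        if not predicate(doc):               # logical constraints are handled outside the embedding
            continue
        v = np.asarray(doc["vec"], dtype=float)
        scored.append((float(q @ v / np.linalg.norm(v)), doc["id"]))
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]

docs = [
    {"id": "a", "year": 2019, "topic": "X", "vec": [1.0, 0.0]},
    {"id": "b", "year": 2022, "topic": "X", "vec": [0.9, 0.1]},
    {"id": "c", "year": 2023, "topic": "Y", "vec": [1.0, 0.05]},
    {"id": "d", "year": 2021, "topic": "X", "vec": [0.2, 0.9]},
]

# "Published after 2020 and not about Y" expressed as an exact filter.
hits = hybrid_search([1.0, 0.0], docs, lambda d: d["year"] > 2020 and d["topic"] != "Y")
print(hits)
```

Documents "a" and "c" are the closest vectors to the query, yet neither is returned, because the filter enforces conditions that similarity alone would not.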
Bias amplification. Sentence embedding models trained on web-scale data inherit and can amplify social biases present in that data, for instance linking names typical of certain genders or ethnicities with particular concepts. Audits such as SEAT (Sentence Embedding Association Test) measure these associations, and mitigation techniques such as projection-based bias removal have been developed, but neither is universally applied.
Theoretical limitations of single-vector retrieval. A 2025 theoretical analysis ("On the Theoretical Limitations of Embedding-Based Retrieval") showed that single-vector bi-encoder retrieval is provably unable to correctly rank sets of documents defined by logical combinations of conditions, even with an ideal embedding model. This motivates multi-vector representations and hybrid retrieval architectures for semantically complex queries.
Sentence similarity models are a specialized subtype within the broader category of feature extraction models. Feature extraction models produce general-purpose text representations that can be used for any downstream task, including classification, generation, and structured prediction. Sentence similarity models are designed specifically so that the geometry of the output vector space reflects semantic relatedness: models are trained, evaluated, and deployed primarily for distance-based comparisons rather than for feeding into task-specific heads.
In practice, many models serve both roles. SBERT, E5, and BGE are used both as sentence similarity models (for STS and retrieval) and as feature extractors (input features for classifiers). However, a feature extraction model that has not been fine-tuned with a similarity objective, such as a raw BERT encoder, will produce embeddings whose geometry does not reliably reflect semantic similarity. This distinction motivates the separate evaluation tracks in MTEB, which tests both similarity-focused tasks (STS, pair classification) and representation quality on downstream tasks (classification, clustering).
Cross-encoders, discussed above in the architecture section, overlap with models commonly categorized under natural language inference and text classification rather than sentence similarity, since they produce a scalar score rather than a vector embedding. In practice they function as the high-precision stage of retrieval pipelines and are shipped alongside bi-encoders in libraries such as sentence-transformers.
Sparse retrieval systems such as BM25 and its learned successors (for example SPLADE and SPADE) are often treated as alternatives or complements to dense sentence embeddings. BGE-M3 unifies these approaches: it produces dense embeddings, learned sparse embeddings, and multi-vector representations from a single model, enabling hybrid and late-interaction scoring without requiring separate retrieval systems.