Vector embeddings

Updated Jul 11, 2026

See also: AI terms

Vector embeddings are dense numerical representations of objects (text, images, audio, video, code, graphs, or any structured data) that map them into a continuous vector space such that semantic similarity between objects corresponds to geometric proximity. They form the mathematical substrate of modern machine learning, natural language processing, and information retrieval, and they sit at the core of nearly every production retrieval-augmented generation (RAG) pipeline, semantic search engine, recommendation system, and multimodal AI model deployed today.^[1]^[2]

An embedding model takes an input (a sentence, a JPEG, a 30-second audio clip, a SMILES molecular string) and produces a fixed-length vector, typically of 256 to 4,096 dimensions. The values inside the vector have no individually interpretable meaning. Their utility comes from the geometry of the resulting space: cosine similarity between two embeddings approximates the semantic relatedness of the two inputs.^[3] On the Massive Text Embedding Benchmark (MTEB), OpenAI's text-embedding-3-large scores about 64.6% averaged across 56 datasets, roughly 3.6 points above the earlier text-embedding-ada-002 model, illustrating how quickly text embedding quality advanced between 2022 and 2024.^[16]^[21]

What is a vector embedding?

Formally, an embedding is a learned function $f: X \to \mathbb{R}^d$ that maps an input space $X$ (words, sentences, images, etc.) into a real-valued $d$-dimensional vector space, with the property that a chosen distance metric (cosine, dot product, or Euclidean) between $f(a)$ and $f(b)$ reflects a meaningful notion of similarity between $a$ and $b$.^[1]

Key properties of modern embeddings:

Property	Description
Dense	All or nearly all dimensions are non-zero, in contrast to sparse one-hot or bag-of-words representations.
Distributed	Meaning is spread across many dimensions; no single dimension corresponds to a human concept.
Fixed-length	Output dimensionality d is constant for a given model regardless of input length.
Continuous	Lives in $\mathbb{R}^d$, allowing arithmetic operations (addition, subtraction, interpolation).
Learned	Produced by a neural network trained on a large corpus with a self-supervised or contrastive objective.

History: how did vector embeddings develop?

Distributed representations (1986)

The conceptual foundation comes from Geoffrey Hinton's 1986 paper Learning distributed representations of concepts, which argued that concepts should be represented by patterns of activity over many units rather than by single dedicated symbols.^[4] This idea, that meaning emerges from the joint values of many features, is the philosophical ancestor of every modern embedding.

Latent semantic analysis (1990)

The first widely used dense vector representation of text was latent semantic analysis (LSA), introduced by Scott Deerwester, Susan Dumais, George Furnas, Thomas Landauer, and Richard Harshman in their 1990 paper Indexing by latent semantic analysis.^[5] LSA applies singular value decomposition (SVD) to a term-document matrix, producing low-rank approximations in which documents and terms are represented as vectors in the same reduced space. Documents with overlapping topical structure end up close in the SVD space even when they share no surface vocabulary, which lets LSA retrieve relevant documents that contain only synonyms of the query terms.

LSA established that a continuous, low-dimensional vector space could capture latent semantic structure, but its training relied on a full matrix decomposition whose cost grows roughly cubically with the matrix dimension ($O(n^3)$ for an $n \times n$ matrix). That did not scale to web-sized corpora, and the resulting static representations could not easily be updated incrementally.^[5]

Word2Vec (2013)

The modern era of embeddings began with Word2Vec, released by Tomas Mikolov and colleagues at Google in two 2013 papers, Efficient estimation of word representations in vector space and Distributed representations of words and phrases and their compositionality.^[6]^[7] The first of these set out to learn high-quality word vectors cheaply: as its abstract states, 'The quality of these representations is measured in a word similarity task, and the results are compared to the previously best performing techniques based on different types of neural networks.'^[6] Word2Vec offered two shallow neural architectures:

CBOW (Continuous Bag-of-Words) predicts a target word from the surrounding context.
Skip-gram predicts the surrounding context words from a given target word.

Trained with negative sampling on a 100-billion-word Google News corpus, Word2Vec produced 300-dimensional word vectors that exhibited the now-famous analogy property: vector('king') minus vector('man') plus vector('woman') yields a vector closest to that of 'queen'.^[7] This compositional behavior was a striking demonstration that distributed representations could encode relational structure.

Word2Vec's contribution was as much engineering as scientific. The negative-sampling trick reduced the cost of each training example from $O(V)$ to $O(k)$ (where $V$ is vocabulary size and $k$ is a small constant number of sampled negatives), allowing an optimized single-machine implementation to train on more than 100 billion words in about a day.^[7]

GloVe (2014)

In 2014, Jeffrey Pennington, Richard Socher, and Christopher Manning at Stanford introduced GloVe (Global Vectors for Word Representation) at EMNLP.^[8] GloVe combines the global matrix-factorization view of LSA with the local context-window view of Word2Vec. It factorizes the logarithm of the word-word co-occurrence matrix using a weighted least-squares objective, producing vectors that match or exceed Word2Vec on word analogy and similarity benchmarks while training faster on the same corpus.

The pretrained GloVe vectors trained on Common Crawl (840 billion tokens, 2.2 million vocabulary, 300 dimensions) became a de facto standard input for neural network NLP models from 2014 to 2017.^[8]

FastText (2016 to 2017)

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov at Facebook AI Research extended Word2Vec with subword information in 2017's Enriching word vectors with subword information.^[9] FastText represents each word as a bag of character n-grams (typically n = 3 to 6) taken from the word wrapped in boundary markers, so the vector for 'unhappiness' is built from sub-units like '<un', 'unh', 'happi', and 'ness>'. This gives the model two important properties: it can produce vectors for out-of-vocabulary words by composing their character n-grams, and it captures morphological regularities (singular versus plural, tense, derivation) that Word2Vec misses.
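The n-gram decomposition can be sketched in a few lines. This is an illustrative helper, not the FastText library's code: the function name `char_ngrams` is hypothetical, and the real implementation additionally keeps the whole word as its own unit and hashes n-grams into a fixed number of buckets.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return all character n-grams of `word` wrapped in '<' and '>' markers."""
    marked = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        # Slide a window of width n across the marked word.
        for start in range(len(marked) - n + 1):
            grams.append(marked[start:start + n])
    return grams


grams = char_ngrams("unhappiness")
print(len(grams), grams[:5])
```

A word vector is then the sum (or average) of the vectors of these n-grams, which is why an unseen word still receives a representation as long as its n-grams were seen during training.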

The FastText project released pretrained vectors for 157 languages, each trained separately on text in that language, which made it one of the first practical sources of word embeddings across many languages at scale.^[9]

Contextual embeddings: ELMo, BERT, and after (2018)

Word2Vec, GloVe, and FastText all produce a single vector per word type regardless of context, so 'bank' in 'river bank' gets the same vector as in 'savings bank'. Peters and colleagues' ELMo (2018) was the first widely adopted contextual model: it ran a bidirectional LSTM over the sentence and produced a different vector for each token depending on its full sentential context.^[10]

The decisive break came later in 2018 with BERT (Bidirectional Encoder Representations from Transformers) by Devlin and colleagues at Google.^[11] BERT replaced the LSTM with a Transformer encoder trained on masked language modeling and next-sentence prediction over 3.3 billion words. The hidden states of a frozen or fine-tuned BERT became the default text representation for nearly every NLP benchmark from 2018 to 2020.^[11]

Sentence embeddings

BERT's token-level vectors are not directly useful for sentence-level tasks like clustering or semantic search: pooling them naively (mean pooling or taking the [CLS] token) produces representations that perform worse than averaged GloVe vectors on the STS benchmarks.^[12] Several models were built to produce sentence vectors directly:

InferSent (Conneau et al., 2017) trained a BiLSTM on the SNLI natural language inference dataset to produce sentence vectors.^[13]
Universal Sentence Encoder (Cer et al., 2018) from Google offered a Transformer-based and a deep averaging network variant for general-purpose sentence embeddings.^[14]
Sentence-BERT (Reimers and Gurevych, EMNLP 2019) reframed BERT into a Siamese network fine-tuned on NLI and STS data.^[12] The resulting model produced semantically meaningful sentence embeddings that could be compared with cosine similarity, reducing the cost of finding the most similar pair in a 10,000-sentence collection from 65 hours with vanilla BERT to about 5 seconds. Sentence-BERT (often called SBERT) became the foundation of most open-source embedding models that followed.

Modern embedding APIs (2022 to 2026)

From 2022 onward, commercial providers shipped embedding endpoints as first-class API products. OpenAI launched text-embedding-ada-002 in December 2022 as a single-model replacement for five earlier embedding endpoints, then released text-embedding-3-small and text-embedding-3-large on January 25, 2024 with native dimension shortening (the Matryoshka representation learning trick).^[15]^[16] Cohere, Voyage AI, Google, NVIDIA, Mistral, Jina AI, Mixedbread, and the open-source BGE, GTE, and E5 model families all entered the market. Anthropic does not produce its own embedding model and recommends Voyage AI for use with Claude-based RAG systems.^[17]

How are embeddings created?

Every modern embedding model consists of three pieces: an encoder neural network, a pooling strategy, and a training objective.

Encoder

The encoder is almost always a Transformer: typically a BERT-style bidirectional encoder, though decoder-only LLMs are increasingly used as embedding backbones, as in RepLLaMA, E5-Mistral, and similar models.^[18] It maps a tokenized input into a sequence of hidden states.

Pooling

The variable-length sequence of hidden states is reduced to a single fixed-length vector using one of:

[CLS] pooling takes the hidden state of the special classification token.
Mean pooling averages all token hidden states, using the attention mask to exclude padding tokens from the average. This is the most common choice in modern models.
Last-token pooling takes the hidden state of the final token, used by decoder-only embedding models.
Weighted mean pooling assigns higher weights to later tokens, which has shown small improvements with autoregressive backbones.^[18]
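The first three strategies can be sketched on plain Python lists. This is a minimal illustration that assumes right-padded inputs (padding at the end) and a [CLS] token at position 0; the function names are hypothetical, and real models do this on batched tensors.

```python
def cls_pool(hidden_states):
    """Take the hidden state of the first ([CLS]) token."""
    return list(hidden_states[0])


def mean_pool(hidden_states, attention_mask):
    """Average the hidden states of real tokens (mask == 1), skipping padding."""
    dim = len(hidden_states[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(hidden_states, attention_mask):
        if keep:
            count += 1
            for j in range(dim):
                total[j] += vec[j]
    if count == 0:
        raise ValueError("attention_mask selects no tokens")
    return [t / count for t in total]


def last_token_pool(hidden_states, attention_mask):
    """Take the hidden state of the last real (non-padding) token."""
    last = max(i for i, keep in enumerate(attention_mask) if keep)
    return list(hidden_states[last])


# Four token positions with 2-dimensional hidden states; the last is padding.
hidden = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 1, 0]

print(cls_pool(hidden))
print(mean_pool(hidden, mask))
print(last_token_pool(hidden, mask))
```

Without the mask, the padding row would pull the mean toward `[9.0, 9.0]`, which is why masked averaging matters for batches of unequal-length inputs.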

Training objective

Modern embedding models are trained with a contrastive learning objective, most commonly the InfoNCE loss. Each training example is a (query, positive) pair plus a batch of in-batch negatives or hard-mined negatives. The loss pulls the query embedding toward the positive and pushes it away from the negatives:^[19]

$$L = -\log \frac{\exp(\mathrm{sim}(q, p^+) / \tau)}{\sum_i \exp(\mathrm{sim}(q, p_i) / \tau)}$$

where $\mathrm{sim}$ is cosine similarity, $\tau$ is a temperature hyperparameter, and the sum over $p_i$ runs over the positive $p^+$ together with all negatives. Training data typically combines hundreds of millions of weak pairs (question-answer pairs from Reddit, query-title pairs from search logs, citation pairs from academic papers) with millions of high-quality human-annotated pairs from datasets like MS MARCO, NLI, and STS.^[19]
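A minimal sketch of the loss for a single query, assuming the denominator includes the positive together with the negatives. The function names and the temperature value of 0.05 are illustrative assumptions; production code computes this over a whole batch with a tensor library.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_nce(query, positive, negatives, tau=0.05):
    """InfoNCE loss for one query: -log softmax probability of the positive."""
    logits = [cosine(query, positive) / tau]
    logits += [cosine(query, neg) / tau for neg in negatives]
    # Subtract the max logit before exponentiating for numerical stability.
    m = max(logits)
    log_sum = math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - (logits[0] - m)


query = [1.0, 0.0]
good = info_nce(query, [0.9, 0.1], [[0.0, 1.0], [-1.0, 0.0]])
bad = info_nce(query, [0.0, 1.0], [[0.9, 0.1], [-1.0, 0.0]])
print(good, bad)
```

The loss is near zero when the positive is the closest candidate and grows when a negative is closer to the query than the positive, which is the gradient signal that shapes the embedding space.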

Matryoshka representation learning

Introduced by Kusupati and colleagues in 2022, Matryoshka representation learning (MRL) trains a single model to produce embeddings whose prefix sub-vectors (the first 64, 128, 256, ... dimensions) are themselves valid embeddings.^[20] Users can truncate the embedding to fit storage and latency budgets without retraining. OpenAI's text-embedding-3-large supports dimensions from 256 to 3,072 via MRL, and Nomic, Mixedbread, and Jina v3 use the same trick.^[16] OpenAI reports that a 256-dimension text-embedding-3-large vector still outperforms a full 1,536-dimension text-embedding-ada-002 vector on MTEB, a direct demonstration of the storage-versus-quality flexibility MRL provides.^[16]
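On the client side, truncation amounts to keeping a prefix and rescaling it. The sketch below assumes the vector came from an MRL-trained model (a prefix of an ordinary embedding is not guaranteed to be useful) and that downstream search expects unit-length vectors; `truncate_and_renormalize` is a hypothetical helper name.

```python
import math


def truncate_and_renormalize(vec, dims):
    """Keep the first `dims` values and rescale the result to unit length."""
    if not 0 < dims <= len(vec):
        raise ValueError("dims must be between 1 and len(vec)")
    prefix = vec[:dims]
    norm = math.sqrt(sum(x * x for x in prefix))
    if norm == 0.0:
        raise ValueError("cannot normalize an all-zero prefix")
    return [x / norm for x in prefix]


full = [0.5, 0.5, 0.5, 0.5]          # a unit-length 4-dimensional toy vector
short = truncate_and_renormalize(full, 2)
print(short)
```

Renormalizing matters because a prefix of a unit vector is generally shorter than unit length, so dot-product scores would otherwise no longer equal cosine similarity.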

Word, sentence, document, and beyond

Embeddings exist at multiple granularities, each with different use cases.

Granularity	Typical models	Common uses
Word	Word2Vec, GloVe, FastText	Lexical similarity, analogy tasks, feature inputs to older NLP models
Subword / token	BERT, GPT tokenizer outputs	Token classification, named entity recognition, sequence labeling
Sentence	Sentence-BERT, Universal Sentence Encoder, all-MiniLM-L6-v2	Semantic search, paraphrase detection, clustering
Passage / paragraph	E5, BGE, Voyage, OpenAI text-embedding-3	RAG, question answering, document retrieval
Document	SPECTER, SciNCL, custom long-context models	Citation prediction, scientific paper similarity
Code	CodeBERT, CodeT5+, OpenAI text-embedding-3 (handles code), Voyage-code-3	Code search, duplicate detection, vulnerability detection
Image	CLIP, DINOv2, SigLIP	Image retrieval, zero-shot classification, text-to-image generation conditioning
Audio	wav2vec 2.0, CLAP, AudioMAE	Music similarity, speaker identification, audio tagging
Multimodal	CLIP, ImageBind, Voyage-multimodal-3	Cross-modal retrieval, multimodal RAG
Graph	node2vec, DeepWalk, GraphSAGE	Link prediction, node classification, recommendation

Which embedding models are most used in 2026?

Major commercial and open-source models (2024 to 2026)

Model	Provider	Year	Dimensions	Max context (tokens)	License
text-embedding-3-large	OpenAI	2024	256 to 3,072 (MRL)	8,191	Proprietary API
text-embedding-3-small	OpenAI	2024	512 to 1,536 (MRL)	8,191	Proprietary API
voyage-3-large	Voyage AI	2025	256 to 2,048 (MRL)	32,000	Proprietary API
voyage-3	Voyage AI	2024	1,024	32,000	Proprietary API
voyage-code-3	Voyage AI	2024	256 to 2,048 (MRL)	32,000	Proprietary API
voyage-multimodal-3	Voyage AI	2024	1,024	32,000	Proprietary API
embed-v3	Cohere	2023	384 / 1,024	512	Proprietary API
embed-multilingual-v3	Cohere	2023	1,024	512	Proprietary API
text-embedding-005	Google Vertex AI	2024	768	2,048	Proprietary API
gemini-embedding-001	Google	2025	768 to 3,072 (MRL)	8,192	Proprietary API
NV-Embed-v2	NVIDIA	2024	4,096	32,768	Open weights (CC-BY-NC)
BGE-M3	BAAI	2024	1,024	8,192	Open (MIT)
BGE-large-en-v1.5	BAAI	2023	1,024	512	Open (MIT)
GTE-large-en-v1.5	Alibaba	2024	1,024	8,192	Open (Apache 2.0)
mxbai-embed-large-v1	Mixedbread	2024	1,024	512	Open (Apache 2.0)
jina-embeddings-v3	Jina AI	2024	32 to 1,024 (MRL)	8,192	Open (CC-BY-NC)
Nomic Embed v1.5	Nomic AI	2024	64 to 768 (MRL)	8,192	Open (Apache 2.0)
E5-mistral-7b-instruct	Microsoft	2024	4,096	32,768	Open (MIT)
all-MiniLM-L6-v2	UKP / Hugging Face	2021	384	256	Open (Apache 2.0)

Sources: provider documentation, model cards on Hugging Face, and the MTEB leaderboard.^[16]^[17]^[21]^[22]^[23]

The trend across 2023 to 2026 is clear: models keep growing in context length (512 to 32,000+ tokens), are increasingly built on decoder-only LLM backbones, and almost universally support Matryoshka dimension truncation. Open-source models on the MTEB leaderboard now match or exceed the best proprietary APIs on most retrieval tasks.^[21]

Multimodal embeddings

CLIP (Contrastive Language-Image Pre-training), released by OpenAI in February 2021 in Learning transferable visual models from natural language supervision, is the canonical multimodal embedding model.^[24] CLIP trains a vision encoder (a Vision Transformer or ResNet) and a text encoder (a Transformer) jointly on 400 million image-caption pairs scraped from the web with a contrastive objective: each image embedding should be close to its caption's embedding and far from the embeddings of every other caption in the batch.

The result is a shared image-text embedding space in which a picture of a golden retriever and the string 'a photo of a golden retriever' produce embeddings with high cosine similarity. CLIP enables zero-shot image classification (rank images against a list of class-name prompts), text-to-image retrieval, and conditioning for diffusion models like Stable Diffusion and DALL-E 2.^[24]

ImageBind, released by Meta in May 2023, extended the CLIP idea to six modalities: images, text, audio, depth, thermal, and IMU motion data.^[25] ImageBind uses image-paired data (image-text, image-audio, image-depth, etc.) and shows that pairwise alignment with images is sufficient to learn a shared embedding space across all six modalities, enabling cross-modal retrieval (e.g., finding images from audio queries) without ever training on direct audio-text pairs.

Other important multimodal embedding models include:

Model	Modalities	Year	Notes
CLIP	Image + text	2021	OpenAI, 400M pairs, foundational
SigLIP	Image + text	2023	Google, sigmoid loss replaces softmax, more compute-efficient
BLIP-2	Image + text	2023	Bridges frozen vision encoder with LLM via Q-Former
OpenCLIP	Image + text	2022	LAION reproduction of CLIP, open weights
DINOv2	Image	2023	Meta, self-supervised vision-only embeddings
ImageBind	6 modalities	2023	Meta, image-anchored joint space
LanguageBind	6 modalities	2023	PKU, language-anchored alternative to ImageBind
Voyage-multimodal-3	Image + text	2024	Production API, mixed image-text documents
ColBERT-vision (ColPali)	Image + text	2024	Late-interaction multimodal retrieval for documents

How are embedding models benchmarked? MTEB and beyond

The Massive Text Embedding Benchmark (MTEB), introduced by Niklas Muennighoff, Nouamane Tazi, Loic Magne, and Nils Reimers at EACL 2023, is the standard evaluation suite for English text embedding models.^[21] MTEB covers 56 datasets across 8 task categories: bitext mining, classification, clustering, pair classification, reranking, retrieval, semantic textual similarity (STS), and summarization. A central finding of the paper is that breadth matters: the authors report that 'no particular text embedding method dominates across all tasks,' which is why practitioners are advised to benchmark candidate models on their own data rather than trusting a single leaderboard rank.^[21]

The MTEB leaderboard, hosted on Hugging Face Spaces, has become the field's de facto scoreboard. As of 2025 to 2026 it has expanded to include MMTEB (Massive Multilingual Text Embedding Benchmark) covering more than 250 languages and over 500 tasks, and specialized leaderboards for code, law, and long documents.^[21]^[22]

Key lessons from MTEB:

No single model wins every task. Strong retrieval models can underperform on STS tasks and vice versa.
Returns to embedding dimension diminish: as a rough rule of thumb, doubling dimensions buys about 1 to 2 absolute points of average score, a logarithmic rather than linear relationship.
Fine-tuning a base model on in-domain pairs typically beats picking a different general-purpose model for that domain.^[21]

Other important benchmarks include BEIR (zero-shot retrieval, Thakur et al. 2021), LoCo (long-context retrieval), CodeSearchNet (code retrieval), and MIRACL (multilingual retrieval).^[26]

Vector databases

Storing and efficiently searching billions of embeddings has spawned a new category of infrastructure: the vector database. These systems implement approximate nearest neighbor (ANN) search algorithms (HNSW, IVF, ScaNN, DiskANN) and add metadata filtering, hybrid keyword-plus-vector search, and operational tooling around them.

Database	Type	Year	Index	Notable features
Pinecone	Managed cloud	2019	Proprietary	Serverless, fully managed, namespaces
Weaviate	Open-source / managed	2019	HNSW	Built-in vectorization modules, GraphQL API
Qdrant	Open-source / managed	2021	HNSW	Rust-based, payload filtering, scalar quantization
Chroma	Open-source	2022	HNSW	Designed for LangChain and prototyping
Milvus	Open-source / managed	2019	HNSW, IVF, DiskANN	LF AI graduate, GPU acceleration, billion-scale
pgvector	PostgreSQL extension	2021	IVFFlat, HNSW	Brings vector search into existing PostgreSQL deployments
Vespa	Open-source / managed	2017	HNSW	Yahoo origin, hybrid retrieval and ranking
LanceDB	Embedded / serverless	2023	IVF-PQ	Columnar Lance format, multimodal-friendly
Turbopuffer	Managed cloud	2023	Custom	S3-backed, low cost per gigabyte
Elasticsearch / OpenSearch	Search engine	2022+	HNSW	Vector support added to existing keyword search
Redis Vector	Key-value store	2022	HNSW, FLAT	Vector search inside Redis
MongoDB Atlas Vector Search	Document DB	2023	HNSW	Vector search inside MongoDB
Azure AI Search	Managed cloud	2023	HNSW	Microsoft, integrated with Azure OpenAI

Sources: vendor documentation and ANN-Benchmarks results.^[27]^[28]

Most production systems combine ANN search with a metadata filter (e.g., 'customer_id = X AND created_at > Y'). Algorithms like Filtered DiskANN and the post-filter / pre-filter strategies in HNSW are active areas of research because naive filtering breaks the graph-walk assumptions of HNSW.^[28]

How is similarity between embeddings measured?

Three distance functions dominate practical use of embeddings.

Metric	Formula	Range	When to use
Cosine similarity	$\frac{a \cdot b}{\lVert a \rVert \lVert b \rVert}$	-1 to 1	Default for text embeddings; magnitude-invariant
Dot product	$a \cdot b$	unbounded	When embeddings are pre-normalized to unit length, equivalent to cosine; otherwise rewards larger vectors
Euclidean (L2) distance	$\sqrt{\sum_i (a_i - b_i)^2}$	0 to $\infty$	Image embeddings, geometric problems, k-means clustering

Most modern text embedding models (OpenAI text-embedding-3, BGE, GTE, Sentence-BERT) output unit-normalized vectors. In that case cosine similarity equals the dot product and the squared Euclidean distance equals $2 - 2\,(a \cdot b)$, so the three metrics are monotonically related and yield identical rankings.^[3] Choosing one over another is then a matter of compute cost: dot product is the cheapest, then cosine, then Euclidean.
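For unit vectors, expanding $\lVert a - b \rVert^2 = \lVert a \rVert^2 + \lVert b \rVert^2 - 2\,(a \cdot b)$ gives $2 - 2\,(a \cdot b)$, which is why the rankings agree. A small numeric check, using made-up three-dimensional vectors:

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


query = normalize([1.0, 2.0, 3.0])
docs = [normalize(d) for d in ([1.0, 2.0, 2.5], [3.0, -1.0, 0.5], [-1.0, -2.0, -3.0])]

# Rank by descending dot product and by ascending squared distance.
by_dot = sorted(range(len(docs)), key=lambda i: -dot(query, docs[i]))
by_l2 = sorted(range(len(docs)), key=lambda i: squared_l2(query, docs[i]))
print(by_dot, by_l2)
```

If the vectors are not normalized, the identity no longer holds and the three metrics can rank documents differently.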

Aggregate similarity for sets

When comparing sets of embeddings (e.g., a multi-vector representation of a long document) more sophisticated metrics apply: ColBERT's late interaction sums the maximum cosine similarity from each query token to any document token; SPLADE uses sparse lexical interaction; cross-encoders compare the joint representation of a (query, document) concatenation.^[29]
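The late-interaction score described above can be sketched as follows. This is an illustration of the scoring rule only, with toy two-dimensional token vectors; the function name `maxsim_score` is an assumption, and real systems batch this on a GPU.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def maxsim_score(query_tokens, doc_tokens):
    """Sum, over query tokens, of the best cosine match among document tokens."""
    return sum(max(cosine(q, d) for d in doc_tokens) for q in query_tokens)


query_tokens = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # covers both query tokens
doc_b = [[1.0, 0.0]]                           # covers only the first

score_a = maxsim_score(query_tokens, doc_a)
score_b = maxsim_score(query_tokens, doc_b)
print(score_a, score_b)
```

Because every query token looks for its own best match, a document is rewarded for covering all parts of the query, something a single pooled vector can blur.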

What are vector embeddings used for?

Semantic search

Semantic search replaces keyword matching with vector similarity. The query and corpus documents are embedded with the same model; documents whose embeddings have the highest cosine similarity to the query are returned. This handles synonymy ('automobile' matches 'car'), paraphrase ('how to fix a leaky faucet' matches 'plumbing repair'), and conceptual relatedness ('quiet electric vehicles' matches a Tesla review).^[1]

Production systems usually combine semantic search with traditional BM25 keyword search using reciprocal rank fusion or a learned reranker on top, since neither approach dominates the other across all queries.^[26]
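Reciprocal rank fusion needs only the ranked lists, not the raw scores, which is what makes it convenient for mixing BM25 and vector results. A minimal sketch, using the commonly cited smoothing constant of 60 and hypothetical document IDs:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists: each document scores sum(1 / (k + rank)) over the lists."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


bm25_ranking = ["d1", "d2", "d3"]     # keyword search results, best first
dense_ranking = ["d3", "d1", "d4"]    # vector search results, best first

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)
```

Documents that appear near the top of both lists rise to the top of the fused list, while documents found by only one retriever are kept but ranked lower.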

Retrieval-augmented generation (RAG)

Retrieval-augmented generation, introduced by Lewis and colleagues in 2020, has become the dominant pattern for building LLM applications over private data.^[30] The pipeline:

1. Chunk source documents into passages (typically 200 to 1,000 tokens each).
2. Embed every chunk with an embedding model and store in a vector database.
3. At query time, embed the user question and retrieve the top-k most similar chunks.
4. Prepend retrieved chunks to the LLM prompt as context.
5. Generate the answer.

The quality of step 3 dominates end-to-end answer quality, which is why embedding model choice is among the highest-leverage decisions in RAG system design.^[31]
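The retrieval half of the pipeline can be sketched end to end in plain Python. Everything here is a stand-in: `toy_embed` is a bag-of-words counter used only so the example runs without a model (a real system would call an embedding model and a vector database), chunking is by words rather than tokens, and the final generation call is omitted.

```python
import math
from collections import Counter


def chunk(text, size=8):
    """Split text into passages of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def toy_embed(text):
    """Placeholder for a real embedding model: sparse word-count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, index, k=2):
    """Return the k chunk texts most similar to the question."""
    q = toy_embed(question)
    ranked = sorted(index, key=lambda item: -cosine(q, item[1]))
    return [text for text, _ in ranked[:k]]


def build_prompt(question, chunks):
    context = "\n".join(f"- {c}" for c in chunks)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


documents = [
    "The warranty covers battery defects for two years.",
    "Shipping takes five business days within Europe.",
]
index = [(c, toy_embed(c)) for d in documents for c in chunk(d)]

question = "how long is the battery warranty"
top = retrieve(question, index, k=1)
prompt = build_prompt(question, top)
print(prompt)
```

Swapping `toy_embed` for a real model changes nothing else in the structure, which is why the embedding model can be evaluated and replaced independently of the rest of the pipeline.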

Classification and clustering

Embeddings act as feature inputs for downstream classifiers. A logistic regression or small MLP trained on top of frozen embeddings often matches or exceeds fine-tuning the base model when labeled data is scarce. The MTEB classification subset measures exactly this scenario.^[21]

For unsupervised analysis, embeddings combined with k-means, HDBSCAN, or agglomerative clustering produce thematic groupings of large text corpora, which is the basis of tools like BERTopic for topic modeling.^[32]

Recommendation

User and item embeddings, often trained jointly with a two-tower architecture, power recommendation systems at scale. YouTube's deep neural network recommender (Covington et al. 2016) was an early influential public design; modern systems at TikTok, Spotify, Netflix, Amazon, and Pinterest all rely on dense vector retrieval as the candidate-generation stage.^[33]

Anomaly detection

Objects whose embeddings sit far from their cluster centroid (or have low density in the embedding space) are flagged as outliers. This is used in fraud detection, content moderation, and drug discovery (where unusual molecular embeddings can indicate novel chemistry).^[1]
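A minimal sketch of the centroid-distance approach. The rule used here (flag anything farther than three times the median distance) is an illustrative assumption rather than a standard threshold, and the two-dimensional points stand in for real embeddings.

```python
import math
import statistics


def centroid(vectors):
    dims = len(vectors[0])
    return [sum(v[j] for v in vectors) / len(vectors) for j in range(dims)]


def flag_outliers(vectors, factor=3.0):
    """Return indices whose distance to the centroid exceeds factor x median."""
    center = centroid(vectors)
    distances = [math.dist(v, center) for v in vectors]
    cutoff = factor * statistics.median(distances)
    return [i for i, d in enumerate(distances) if d > cutoff]


points = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [5.0, 5.0]]
flagged = flag_outliers(points)
print(flagged)
```

Using the median rather than the mean for the cutoff keeps a single extreme point from inflating the threshold that is supposed to catch it.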

Deduplication and near-duplicate detection

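Locality-sensitive hashing on top of embeddings (random-hyperplane schemes in the SimHash family; MinHash itself operates on token sets rather than dense vectors) or direct ANN nearest-neighbor search finds near-duplicate documents, web pages, or images. Deduplication is a standard preprocessing step for web-scale datasets such as Common Crawl derivatives, LAION, and C4, and embedding-based methods are one option alongside exact and hash-based matching.^[34]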

Cross-lingual transfer

Multilingual embedding models (LaBSE, multilingual-E5, BGE-M3, Cohere embed-multilingual-v3) map sentences in different languages into the same vector space. This enables zero-shot cross-lingual retrieval (an English query retrieves Japanese documents) and bitext mining for machine translation training data.^[9]^[35]

Drug discovery and protein embeddings

Molecular embeddings from models like ChemBERTa, MolFormer, and Uni-Mol, and protein embeddings from ESM-2 and AlphaFold, apply the same dense-vector ideas to chemistry and biology, supporting binding affinity prediction, protein function annotation, and reaction yield prediction.^[36]

Visualization and dimensionality reduction

Embeddings of 768 to 4,096 dimensions cannot be plotted directly. Three dimensionality reduction techniques are commonly applied to project them into 2 or 3 dimensions for human inspection.

Technique	Year	Preserves	Strengths	Weaknesses
PCA (Principal Component Analysis)	1901	Global linear variance	Fast, deterministic, invertible	Misses nonlinear structure
t-SNE (t-distributed Stochastic Neighbor Embedding)	2008 (van der Maaten and Hinton)	Local neighborhoods	Reveals tight clusters	Slow, hyperparameter-sensitive, distorts global geometry
UMAP (Uniform Manifold Approximation and Projection)	2018 (McInnes, Healy, Melville)	Local and some global	Faster than t-SNE, better global structure, can transform new data	Stochastic; cluster sizes and distances are not directly meaningful

UMAP has largely replaced t-SNE as the default tool for embedding visualization in domains like single-cell RNA sequencing and large NLP corpora because it is roughly an order of magnitude faster on millions of points and tends to preserve more of the global structure.^[37]^[38]

Interactive visualization platforms like the TensorFlow Embedding Projector, Nomic Atlas, and Embedding Atlas (hosted on Hugging Face Spaces) let users explore embedding spaces with hover, search, and color-by-metadata features.^[39]

What are the limitations of vector embeddings?

Embeddings have well-documented failure modes that practitioners must design around:

Anisotropy. Out-of-the-box BERT embeddings occupy a narrow cone in the high-dimensional space, which compresses cosine similarities into a small range and degrades discrimination. Whitening, contrastive fine-tuning (Sentence-BERT), and post-hoc normalization fix this.^[40]
Bias. Embeddings inherit the biases of their training data. Bolukbasi and colleagues' 2016 paper Man is to Computer Programmer as Woman is to Homemaker? showed that Word2Vec embeddings encode gender stereotypes that propagate into downstream applications.^[41]
Domain mismatch. A model trained on web text often underperforms on legal contracts, medical records, or scientific papers without domain-specific fine-tuning.
Out-of-distribution inputs. Embeddings of inputs very different from training data (random byte sequences, adversarial text) can produce misleading similarities.
Stability across model versions. Re-embedding an entire corpus is expensive. Switching from text-embedding-ada-002 to text-embedding-3-large requires rebuilding the index, which has driven interest in compatibility-aware embedding training.^[16]
Information leakage. Recent work has shown that embeddings can be partially inverted to recover the original text, with implications for privacy when raw embeddings are shared.^[42]

What does it cost to run embeddings at scale?

Embedding spend has two components: embedding generation (charged per token by API providers) and storage plus query in the vector database (charged per gigabyte of vectors and per query).

As of early 2026 the OpenAI text-embedding-3-small endpoint costs about $0.02 per million tokens and text-embedding-3-large costs about $0.13 per million tokens. Voyage, Cohere, and Google charge in a similar range.^[16]^[17] Embedding 100 million 500-token documents with text-embedding-3-small costs roughly $1,000.
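The arithmetic behind that estimate, using the per-million-token prices quoted above (the function name is hypothetical, and prices change, so treat the constants as inputs rather than facts):

```python
def embedding_cost_usd(num_docs, tokens_per_doc, usd_per_million_tokens):
    """Cost of embedding a corpus once at a flat per-token price."""
    total_tokens = num_docs * tokens_per_doc
    return total_tokens / 1_000_000 * usd_per_million_tokens


small = embedding_cost_usd(100_000_000, 500, 0.02)   # text-embedding-3-small price
large = embedding_cost_usd(100_000_000, 500, 0.13)   # text-embedding-3-large price
print(f"small: ${small:,.0f}  large: ${large:,.0f}")
```

The same 50 billion tokens cost about 6.5 times more with the larger model, and the bill recurs every time the corpus has to be re-embedded.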

Storage cost is dominated by dimensionality. A 1,536-dimensional float32 embedding takes about 6 KB (1,536 values at 4 bytes each is 6,144 bytes); a billion of them is about 6 TB. Standard mitigations include:

Technique	Storage savings	Quality impact
Float32 to float16	2x	Negligible
Float32 to int8 quantization	4x	Small (typically less than 1% on MTEB)
Float32 to binary (1-bit) quantization	32x	Moderate; recoverable with rerank
Matryoshka truncation (3,072 to 768)	4x	Small to moderate, depends on task
Product quantization (PQ)	8x to 32x	Tunable trade-off

Mixedbread, Cohere, and Voyage all support int8 and binary embedding outputs natively, often combined with a final cosine rerank on the top-100 candidates to recover full-precision quality.^[27]
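A rough sketch of the storage arithmetic and of two quantizers from the table. The symmetric per-vector int8 scaling and sign-based binarization shown here are common simple schemes chosen for illustration, not necessarily what any named provider uses; all function names are hypothetical.

```python
def storage_bytes(num_vectors, dims, bits_per_dim):
    """Raw vector storage, ignoring index overhead and metadata."""
    return num_vectors * dims * bits_per_dim // 8


def quantize_int8(vec):
    """Map floats to integers in [-127, 127] with one scale per vector."""
    peak = max(abs(x) for x in vec)
    scale = peak / 127 if peak > 0 else 1.0
    return [round(x / scale) for x in vec], scale


def quantize_binary(vec):
    """Keep only the sign of each dimension (1 bit per dimension)."""
    return [1 if x > 0 else 0 for x in vec]


def hamming_distance(bits_a, bits_b):
    """Number of differing bits; the cheap first-pass metric for binary codes."""
    return sum(a != b for a, b in zip(bits_a, bits_b))


float32_total = storage_bytes(1_000_000_000, 1536, 32)
int8_total = storage_bytes(1_000_000_000, 1536, 8)
binary_total = storage_bytes(1_000_000_000, 1536, 1)

codes, scale = quantize_int8([1.27, -0.64, 0.0])
bits = quantize_binary([1.27, -0.64, 0.0])
print(float32_total, int8_total, binary_total, codes, bits)
```

In a rerank setup, the binary codes are used to shortlist candidates by Hamming distance, and only the shortlisted vectors are rescored with higher-precision values.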

Explain Vector embeddings Like I'm 5 (ELI5)

Imagine you have a box of different toys like cars, dolls, and balls. Now, we want to sort these toys based on how similar they are. We can use something called 'vector embedding' to help us with this. Vector embedding is like giving each toy a secret code made of numbers. Toys that are similar will have secret codes that are very close to each other, and toys that are not similar will have secret codes that are very different.

For example, let's say we have a red car, a blue car, and a doll. We can give them secret codes like this:

Red car: [1, 2, 3]

Blue car: [1, 2, 4]

Doll: [5, 6, 7]

See how the red car and the blue car have secret codes that are very close to each other, while the doll has a different secret code? That's because the cars are more similar to each other than the doll.

Vector embedding can also be used for words, pictures, sounds, and many other things. It helps computers understand and sort these things by how similar they are, just like we sorted the toys.

References

1. Mikolov, T., Sutskever, I., Chen, K., Corrado, G., and Dean, J. (2013). 'Distributed representations of words and phrases and their compositionality.' *Advances in Neural Information Processing Systems* 26.
2. Karpukhin, V., Oguz, B., Min, S., Lewis, P., Wu, L., Edunov, S., Chen, D., and Yih, W. (2020). 'Dense passage retrieval for open-domain question answering.' *EMNLP 2020*.
3. Manning, C. D., Raghavan, P., and Schütze, H. (2008). *Introduction to information retrieval*. Cambridge University Press, chapter 6.
4. Hinton, G. E. (1986). 'Learning distributed representations of concepts.' *Proceedings of the Eighth Annual Conference of the Cognitive Science Society*.
5. Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., and Harshman, R. (1990). 'Indexing by latent semantic analysis.' *Journal of the American Society for Information Science*, 41(6), 391 to 407.
6. Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013). 'Efficient estimation of word representations in vector space.' arXiv:1301.3781.
7. Mikolov, T., Sutskever, I., Chen, K., Corrado, G., and Dean, J. (2013). 'Distributed representations of words and phrases and their compositionality.' arXiv:1310.4546.
8. Pennington, J., Socher, R., and Manning, C. D. (2014). 'GloVe: Global vectors for word representation.' *EMNLP 2014*.
9. Bojanowski, P., Grave, E., Joulin, A., and Mikolov, T. (2017). 'Enriching word vectors with subword information.' *Transactions of the Association for Computational Linguistics*, 5, 135 to 146.
10. Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., and Zettlemoyer, L. (2018). 'Deep contextualized word representations.' *NAACL 2018*.
11. Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019). 'BERT: Pre-training of deep bidirectional transformers for language understanding.' *NAACL 2019*.
12. Reimers, N. and Gurevych, I. (2019). 'Sentence-BERT: Sentence embeddings using Siamese BERT-networks.' *EMNLP 2019*.
13. Conneau, A., Kiela, D., Schwenk, H., Barrault, L., and Bordes, A. (2017). 'Supervised learning of universal sentence representations from natural language inference data.' *EMNLP 2017*.
14. Cer, D. et al. (2018). 'Universal sentence encoder.' arXiv:1803.11175.
15. OpenAI (2022). 'New and improved embedding model.' OpenAI blog, December 15, 2022.
16. OpenAI (2024). 'New embedding models and API updates.' OpenAI blog, January 25, 2024. https://openai.com/index/new-embedding-models-and-api-updates/.
17. Anthropic. 'Embeddings.' Anthropic Claude documentation, https://docs.anthropic.com/claude/docs/embeddings.
18. Wang, L., Yang, N., Huang, X., Yang, L., Majumder, R., and Wei, F. (2024). 'Improving text embeddings with large language models.' *ACL 2024*.
19. Izacard, G., Caron, M., Hosseini, L., Riedel, S., Bojanowski, P., Joulin, A., and Grave, E. (2022). 'Unsupervised dense information retrieval with contrastive learning.' *Transactions on Machine Learning Research*.
20. Kusupati, A., Bhatt, G., Rege, A., Wallingford, M., Sinha, A., Ramanujan, V., Howard-Snyder, W., Chen, K., Kakade, S., Jain, P., and Farhadi, A. (2022). 'Matryoshka representation learning.' *NeurIPS 2022*.
21. Muennighoff, N., Tazi, N., Magne, L., and Reimers, N. (2023). 'MTEB: Massive text embedding benchmark.' *EACL 2023*. https://aclanthology.org/2023.eacl-main.148/.
22. Hugging Face. 'MTEB Leaderboard.' https://huggingface.co/spaces/mteb/leaderboard.
23. Voyage AI (2024 to 2025). 'Voyage embedding models documentation.' https://docs.voyageai.com.
24. Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., and Sutskever, I. (2021). 'Learning transferable visual models from natural language supervision.' *ICML 2021*.
25. Girdhar, R., El-Nouby, A., Liu, Z., Singh, M., Alwala, K. V., Joulin, A., and Misra, I. (2023). 'ImageBind: One embedding space to bind them all.' *CVPR 2023*.
26. Thakur, N., Reimers, N., Rücklé, A., Srivastava, A., and Gurevych, I. (2021). 'BEIR: A heterogeneous benchmark for zero-shot evaluation of information retrieval models.' *NeurIPS 2021 Datasets and Benchmarks*.
27. ANN-Benchmarks (2024). 'Benchmarking nearest neighbor search algorithms.' https://ann-benchmarks.com.
28. Malkov, Y. A. and Yashunin, D. A. (2018). 'Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs.' *IEEE Transactions on Pattern Analysis and Machine Intelligence*.
29. Khattab, O. and Zaharia, M. (2020). 'ColBERT: Efficient and effective passage search via contextualized late interaction over BERT.' *SIGIR 2020*.
30. Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W., Rocktäschel, T., Riedel, S., and Kiela, D. (2020). 'Retrieval-augmented generation for knowledge-intensive NLP tasks.' *NeurIPS 2020*.
31. Gao, Y., Xiong, Y., Gao, X., Jia, K., Pan, J., Bi, Y., Dai, Y., Sun, J., and Wang, H. (2024). 'Retrieval-augmented generation for large language models: A survey.' arXiv:2312.10997.
32. Grootendorst, M. (2022). 'BERTopic: Neural topic modeling with a class-based TF-IDF procedure.' arXiv:2203.05794.
33. Covington, P., Adams, J., and Sargin, E. (2016). 'Deep neural networks for YouTube recommendations.' *RecSys 2016*.
34. Lee, K., Ippolito, D., Nystrom, A., Zhang, C., Eck, D., Callison-Burch, C., and Carlini, N. (2022). 'Deduplicating training data makes language models better.' *ACL 2022*.
35. Feng, F., Yang, Y., Cer, D., Arivazhagan, N., and Wang, W. (2022). 'Language-agnostic BERT sentence embedding.' *ACL 2022*.
36. Lin, Z., Akin, H., Rao, R., Hie, B., Zhu, Z., Lu, W., Smetanin, N., Verkuil, R., Kabeli, O., Shmueli, Y., dos Santos Costa, A., Fazel-Zarandi, M., Sercu, T., Candido, S., and Rives, A. (2023). 'Evolutionary-scale prediction of atomic-level protein structure with a language model.' *Science*, 379(6637), 1123 to 1130.
37. van der Maaten, L. and Hinton, G. (2008). 'Visualizing data using t-SNE.' *Journal of Machine Learning Research*, 9, 2579 to 2605.
38. McInnes, L., Healy, J., and Melville, J. (2018). 'UMAP: Uniform manifold approximation and projection for dimension reduction.' arXiv:1802.03426.
39. Smilkov, D., Thorat, N., Nicholson, C., Reif, E., Viégas, F. B., and Wattenberg, M. (2016). 'Embedding projector: Interactive visualization and interpretation of embeddings.' *NIPS 2016 Workshop on Interpretable ML*.
40. Li, B., Zhou, H., He, J., Wang, M., Yang, Y., and Li, L. (2020). 'On the sentence embeddings from pre-trained language models.' *EMNLP 2020*.
41. Bolukbasi, T., Chang, K.-W., Zou, J. Y., Saligrama, V., and Kalai, A. T. (2016). 'Man is to computer programmer as woman is to homemaker? Debiasing word embeddings.' *NeurIPS 2016*.
42. Morris, J. X., Kuleshov, V., Shmatikov, V., and Rush, A. M. (2023). 'Text embeddings reveal (almost) as much as text.' *EMNLP 2023*.
