# Vector embeddings

> Source: https://aiwiki.ai/wiki/vector_embeddings
> Updated: 2026-06-21
> Categories: Information Retrieval, Natural Language Processing
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

*See also: [AI terms](/wiki/ai_terms)*

**Vector embeddings** are dense numerical representations of objects (text, images, audio, video, code, graphs, or any structured data) that map them into a continuous vector space such that semantic similarity between objects corresponds to geometric proximity. They form the mathematical substrate of modern [machine learning](/wiki/machine_learning), [natural language processing](/wiki/natural_language_processing), and [information retrieval](/wiki/information_retrieval), and they sit at the core of nearly every production [retrieval-augmented generation](/wiki/retrieval_augmented_generation) (RAG) pipeline, [semantic search](/wiki/semantic_search) engine, recommendation system, and [multimodal](/wiki/multimodal) AI model deployed today.[1][2]

An embedding model takes an input (a sentence, a JPEG, a 30-second audio clip, a SMILES molecular string) and produces a fixed-length vector, typically of 256 to 4,096 dimensions. The values inside the vector have no individually interpretable meaning. Their utility comes from the geometry of the resulting space: cosine similarity between two embeddings approximates the semantic relatedness of the two inputs.[3] On the [Massive Text Embedding Benchmark](/wiki/mteb) (MTEB), [OpenAI](/wiki/openai)'s text-embedding-3-large scores about 64.6% averaged across 56 datasets, roughly 3.6 points above the earlier text-embedding-ada-002 model, illustrating how quickly text embedding quality advanced between 2022 and 2024.[16][21]

## What is a vector embedding?

Formally, an embedding is a learned function f: X to R^d that maps an input space X (words, sentences, images, etc.) into a real-valued d-dimensional vector space, with the property that a chosen distance metric (cosine, dot product, or Euclidean) between f(a) and f(b) reflects a meaningful notion of similarity between a and b.[1]

Key properties of modern embeddings:

| Property | Description |
| --- | --- |
| **Dense** | All or nearly all dimensions are non-zero, in contrast to sparse one-hot or [bag-of-words](/wiki/bag_of_words) representations. |
| **Distributed** | Meaning is spread across many dimensions; no single dimension corresponds to a human concept. |
| **Fixed-length** | Output dimensionality d is constant for a given model regardless of input length. |
| **Continuous** | Lives in R^d, allowing arithmetic operations (addition, subtraction, interpolation). |
| **Learned** | Produced by a [neural network](/wiki/neural_network) trained on a large [corpus](/wiki/corpus) with a self-supervised or contrastive objective. |

## History: how did vector embeddings develop?

### Distributed representations (1986)

The conceptual foundation comes from Geoffrey Hinton's 1986 paper *Learning distributed representations of concepts*, which argued that concepts should be represented by patterns of activity over many units rather than by single dedicated symbols.[4] This idea, that meaning emerges from the joint values of many features, is the philosophical ancestor of every modern embedding.

### Latent semantic analysis (1990)

The first widely used dense vector representation of text was [latent semantic analysis](/wiki/latent_semantic_analysis) (LSA), introduced by Scott Deerwester, Susan Dumais, George Furnas, Thomas Landauer, and Richard Harshman in their 1990 paper *Indexing by latent semantic analysis*.[5] LSA applies [singular value decomposition](/wiki/singular_value_decomposition) (SVD) to a term-document matrix, producing low-rank approximations in which documents and terms are represented as vectors in the same reduced space. Documents with overlapping topical structure end up close in the SVD space even when they share no surface vocabulary, which lets LSA retrieve relevant documents that contain only synonyms of the query terms.

LSA established that a continuous, low-dimensional vector space could capture latent semantic structure, but its training was an O(n^3) matrix decomposition that did not scale to web-sized corpora and produced static representations that could not be incrementally updated.[5]

### Word2Vec (2013)

The modern era of embeddings began with [Word2Vec](/wiki/word2vec), released by Tomas Mikolov and colleagues at Google in two 2013 papers, *Efficient estimation of word representations in vector space* and *Distributed representations of words and phrases and their compositionality*.[6][7] The first of these set out to learn high-quality word vectors cheaply: as its abstract states, 'The quality of these representations is measured in a word similarity task, and the results are compared to the previously best performing techniques based on different types of neural networks.'[6] Word2Vec offered two shallow neural architectures:

- **CBOW (Continuous Bag-of-Words)** predicts a target word from the surrounding context.
- **Skip-gram** predicts the surrounding context words from a given target word.

Trained with negative sampling on a 100-billion-word Google News corpus, Word2Vec produced 300-dimensional word vectors that exhibited the now-famous analogy property: vector('king') minus vector('man') plus vector('woman') yields a vector closest to that of 'queen'.[7] This compositional behavior was a striking demonstration that distributed representations could encode relational structure.

Word2Vec's contribution was as much engineering as scientific. The negative-sampling trick reduced the per-example cost of the output layer from O(V) to O(k) (where V is vocabulary size and k is a small constant number of sampled negatives), allowing training on billions of words on a single machine in about a day rather than weeks.[7]

### GloVe (2014)

In 2014, Jeffrey Pennington, Richard Socher, and Christopher Manning at Stanford introduced [GloVe](/wiki/glove) (Global Vectors for Word Representation) at EMNLP.[8] GloVe combines the global matrix-factorization view of LSA with the local context-window view of Word2Vec. It factorizes the logarithm of the word-word co-occurrence matrix using a weighted least-squares objective, producing vectors that match or exceed Word2Vec on word analogy and similarity benchmarks while training faster on the same corpus.

The pretrained GloVe vectors trained on Common Crawl (840 billion tokens, 2.2 million vocabulary, 300 dimensions) became a de facto standard input for [neural network](/wiki/neural_network) NLP models in 2014 to 2017.[8]

### FastText (2016 to 2017)

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov at Facebook AI Research extended Word2Vec with subword information in 2017's *Enriching word vectors with subword information*.[9] [FastText](/wiki/fasttext) represents each word as a bag of character n-grams (typically n = 3 to 6), so the vector for 'unhappiness' is built from sub-units like `<un`, `unh`, `happi`, and `ness>`, where `<` and `>` mark the word boundaries. This gives the model two important properties: it can produce vectors for out-of-vocabulary words by composing their character n-grams, and it captures morphological regularities (singular versus plural, tense, derivation) that Word2Vec misses.
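The subword decomposition can be sketched in a few lines. This is an illustrative helper, not the FastText library's own code: the function name `char_ngrams` is hypothetical, and the real implementation additionally hashes n-grams into a fixed number of buckets and keeps the whole word as one extra unit.

```python
def char_ngrams(word: str, n_min: int = 3, n_max: int = 6) -> list[str]:
    """Return the character n-grams of `word`, with boundary markers added."""
    marked = f"<{word}>"  # '<' and '>' mark the start and end of the word
    grams = []
    for n in range(n_min, n_max + 1):
        # Slide a window of width n across the marked word.
        for start in range(len(marked) - n + 1):
            grams.append(marked[start:start + n])
    return grams


grams = char_ngrams("unhappiness")
print(len(grams))
print(grams[:5])
```

A word vector is then formed by summing the vectors of its n-grams, which is why an unseen word still receives a usable representation as long as its n-grams were observed during training.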

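FastText later released pretrained vectors for 157 languages, which made it one of the first practical sources of word embeddings across many languages at scale, although each language's vectors occupy a separate space unless they are explicitly aligned.[9]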

### Contextual embeddings: ELMo, BERT, and after (2018)

Word2Vec, GloVe, and FastText all produce a single vector per word type regardless of context, so 'bank' in 'river bank' gets the same vector as in 'savings bank'. Peters and colleagues' [ELMo](/wiki/elmo) (2018) was the first widely adopted contextual model: it ran a bidirectional [LSTM](/wiki/lstm) over the sentence and produced a different vector for each token depending on its full sentential context.[10]

The decisive break came later in 2018 with [BERT](/wiki/bert) (Bidirectional Encoder Representations from Transformers) by Devlin and colleagues at Google.[11] BERT replaced the LSTM with a [Transformer](/wiki/transformer) encoder trained on masked language modeling and next-sentence prediction over 3.3 billion words. The hidden states of a frozen or fine-tuned BERT became the default text representation for nearly every NLP benchmark from 2018 to 2020.[11]

### Sentence embeddings

BERT's token-level vectors are not directly useful for sentence-level tasks like clustering or [semantic search](/wiki/semantic_search): pooling them naively (mean-pool, [CLS] token) produces representations that perform worse than averaged GloVe vectors on the STS benchmarks.[12] Several models were built specifically to produce sentence-level vectors:

- **InferSent** (Conneau et al., 2017) trained a BiLSTM on the SNLI natural language inference dataset to produce sentence vectors.[13]
- **Universal Sentence Encoder** (Cer et al., 2018) from Google offered a Transformer-based and a deep averaging network variant for general-purpose sentence embeddings.[14]
- **Sentence-BERT** (Reimers and Gurevych, EMNLP 2019) reframed BERT into a Siamese network fine-tuned on NLI and STS data.[12] The resulting model produced semantically meaningful sentence embeddings that could be compared with cosine similarity, reducing the cost of finding the most similar pair in a 10,000-sentence collection from 65 hours with vanilla BERT to about 5 seconds. Sentence-BERT (often called SBERT) became the foundation of most open-source embedding models that followed.

### Modern embedding APIs (2022 to 2026)

From 2022 onward, commercial providers shipped embedding endpoints as first-class API products. OpenAI launched text-embedding-ada-002 in December 2022 as a single-model replacement for five earlier embedding endpoints, then released text-embedding-3-small and text-embedding-3-large on January 25, 2024 with native dimension shortening (the Matryoshka representation learning trick).[15][16] Cohere, Voyage AI, Google, NVIDIA, Mistral, Jina AI, Mixedbread, and the open-source [BGE](/wiki/bge), GTE, and E5 model families all entered the market. Anthropic does not produce its own embedding model and recommends Voyage AI for use with Claude-based RAG systems.[17]

## How are embeddings created?

Every modern embedding model consists of three pieces: an encoder neural network, a pooling strategy, and a training objective.

### Encoder

The encoder is almost always a [Transformer](/wiki/transformer) (typically a BERT-style bidirectional encoder, though decoder-only LLMs are increasingly used as embedding backbones, as in RepLLaMA, E5-Mistral, and similar models).[18] It maps a tokenized input into a sequence of hidden states.

### Pooling

The variable-length sequence of hidden states is reduced to a single fixed-length vector using one of:

- **[CLS] pooling** takes the hidden state of the special classification token.
- **Mean pooling** averages all token hidden states, using the attention mask to exclude padding tokens. This is the most common choice in modern models.
- **Last-token pooling** takes the hidden state of the final token, used by decoder-only embedding models.
- **Weighted mean pooling** assigns higher weights to later tokens, which has shown small improvements with autoregressive backbones.[18]
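The four pooling strategies above can be sketched on a toy sequence of hidden states. The values are made up for illustration, the sequence is assumed to be right-padded, and the linear position weights in `weighted_mean_pool` are one possible weighting scheme chosen for this sketch rather than a prescribed one.

```python
def cls_pool(hidden, mask):
    """Take the hidden state of the first ([CLS]) token."""
    return hidden[0]


def mean_pool(hidden, mask):
    """Average the hidden states of real tokens, skipping padding (mask == 0)."""
    kept = [h for h, m in zip(hidden, mask) if m == 1]
    dim = len(hidden[0])
    return [sum(h[j] for h in kept) / len(kept) for j in range(dim)]


def last_token_pool(hidden, mask):
    """Take the hidden state of the last non-padding token."""
    last = max(i for i, m in enumerate(mask) if m == 1)
    return hidden[last]


def weighted_mean_pool(hidden, mask):
    """Average with weights that grow linearly with position (assumed scheme)."""
    weights = [(i + 1) * m for i, m in enumerate(mask)]
    total = sum(weights)
    dim = len(hidden[0])
    return [sum(w * h[j] for w, h in zip(weights, hidden)) / total for j in range(dim)]


# Toy input: 3 real tokens plus 1 padding token, hidden size 2.
hidden_states = [
    [1.0, 0.0],  # [CLS]
    [0.0, 2.0],
    [2.0, 4.0],
    [9.0, 9.0],  # padding, must not influence the result
]
attention_mask = [1, 1, 1, 0]

print(cls_pool(hidden_states, attention_mask))
print(mean_pool(hidden_states, attention_mask))
print(last_token_pool(hidden_states, attention_mask))
print(weighted_mean_pool(hidden_states, attention_mask))
```

In every variant the padding row is ignored, which is the practical reason the attention mask is passed to the pooling step.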

### Training objective

Modern embedding models are trained with a [contrastive learning](/wiki/contrastive_learning) objective, most commonly the InfoNCE loss. Each training example is a (query, positive) pair plus a batch of in-batch negatives or hard-mined negatives. The loss pulls the query embedding toward the positive and pushes it away from the negatives:[19]

L = -log(exp(sim(q, p+) / tau) / sum_i exp(sim(q, p_i) / tau))

where sim is cosine similarity, tau is a temperature hyperparameter, and the sum in the denominator runs over the positive and all negatives. Training data typically combines hundreds of millions of weak pairs (question-answer pairs from Reddit, query-title pairs from search logs, citation pairs from academic papers) with millions of high-quality human-annotated pairs from datasets like MS MARCO, NLI, and STS.[19]
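As a minimal sketch, the loss for a single query can be computed directly from the formula. The function names and the default temperature of 0.05 are assumptions made for this example; real training code operates on batched tensors in a deep learning framework.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def info_nce(query, positive, negatives, tau=0.05):
    """InfoNCE loss for one query; the positive is candidate 0 in the denominator."""
    candidates = [positive] + negatives
    logits = [cosine(query, c) / tau for c in candidates]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return -(logits[0] - log_denominator)


query = [1.0, 0.0]
aligned = info_nce(query, [1.0, 0.1], [[0.0, 1.0], [-1.0, 0.0]])
misaligned = info_nce(query, [0.0, 1.0], [[1.0, 0.1], [-1.0, 0.0]])
print(aligned, misaligned)
```

The loss is near zero when the positive is the closest candidate and grows when a negative is closer, which is the pull-and-push behavior described above. A smaller tau sharpens the softmax and penalizes near-miss negatives more heavily.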

### Matryoshka representation learning

Introduced by Kusupati and colleagues in 2022, [Matryoshka representation learning](/wiki/matryoshka_representation_learning) (MRL) trains a single model to produce embeddings whose prefix sub-vectors (the first 64, 128, 256, ... dimensions) are themselves valid embeddings.[20] Users can truncate the embedding to fit storage and latency budgets without retraining. OpenAI's text-embedding-3-large supports dimensions from 256 to 3,072 via MRL, and Nomic, Mixedbread, and Jina v3 use the same trick.[16] OpenAI reports that a 256-dimension text-embedding-3-large vector still outperforms a full 1,536-dimension text-embedding-ada-002 vector on MTEB, a direct demonstration of the storage-versus-quality flexibility MRL provides.[16]
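On the consumer side, Matryoshka truncation amounts to keeping a prefix and renormalizing it. The sketch below assumes the vector came from an MRL-trained model; applying it to an ordinary embedding will usually degrade quality sharply. The function name is hypothetical.

```python
import math


def truncate_embedding(vec, dims):
    """Keep the first `dims` values and rescale the prefix to unit length."""
    prefix = vec[:dims]
    norm = math.sqrt(sum(x * x for x in prefix))
    if norm == 0.0:
        raise ValueError("cannot normalize an all-zero prefix")
    return [x / norm for x in prefix]


full = [3.0, 4.0, 12.0, 1.0]  # stand-in for a much longer embedding
short = truncate_embedding(full, 2)
print(short)
```

Renormalizing matters because a truncated prefix is no longer unit length, so dot-product scores would otherwise stop being comparable to cosine similarity.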

## Word, sentence, document, and beyond

Embeddings exist at multiple granularities, each with different use cases.

| Granularity | Typical models | Common uses |
| --- | --- | --- |
| [Word](/wiki/word_embedding) | [Word2Vec](/wiki/word2vec), [GloVe](/wiki/glove), [FastText](/wiki/fasttext) | Lexical similarity, [analogy tasks](/wiki/analogy), feature inputs to older NLP models |
| Subword / token | [BERT](/wiki/bert), GPT token-level hidden states | Token classification, named entity recognition, sequence labeling |
| Sentence | [Sentence-BERT](/wiki/sentence-bert), Universal Sentence Encoder, all-MiniLM-L6-v2 | Semantic search, paraphrase detection, clustering |
| Passage / paragraph | E5, BGE, Voyage, OpenAI text-embedding-3 | RAG, question answering, document retrieval |
| Document | SPECTER, SciNCL, custom long-context models | Citation prediction, scientific paper similarity |
| Code | CodeBERT, CodeT5+, OpenAI text-embedding-3 (handles code), Voyage-code-3 | Code search, duplicate detection, [vulnerability detection](/wiki/vulnerability_detection) |
| Image | [CLIP](/wiki/clip), DINOv2, SigLIP | Image retrieval, [zero-shot classification](/wiki/zero_shot_one_shot_and_few_shot_learning), [text-to-image](/wiki/text-to-image_models) generation conditioning |
| Audio | wav2vec 2.0, CLAP, AudioMAE | Music similarity, speaker identification, audio tagging |
| Multimodal | [CLIP](/wiki/clip), [ImageBind](/wiki/imagebind), Voyage-multimodal-3 | Cross-modal retrieval, multimodal RAG |
| Graph | node2vec, DeepWalk, GraphSAGE | Link prediction, node classification, recommendation |

## Which embedding models are most used in 2026?

### Major commercial and open-source models (2024 to 2026)

| Model | Provider | Year | Dimensions | Max context (tokens) | License |
| --- | --- | --- | --- | --- | --- |
| text-embedding-3-large | [OpenAI](/wiki/openai) | 2024 | 256 to 3,072 (MRL) | 8,191 | Proprietary API |
| text-embedding-3-small | [OpenAI](/wiki/openai) | 2024 | 512 to 1,536 (MRL) | 8,191 | Proprietary API |
| voyage-3-large | [Voyage AI](/wiki/voyage_ai) | 2025 | 256 to 2,048 (MRL) | 32,000 | Proprietary API |
| voyage-3 | [Voyage AI](/wiki/voyage_ai) | 2024 | 1,024 | 32,000 | Proprietary API |
| voyage-code-3 | [Voyage AI](/wiki/voyage_ai) | 2024 | 256 to 2,048 (MRL) | 32,000 | Proprietary API |
| voyage-multimodal-3 | [Voyage AI](/wiki/voyage_ai) | 2024 | 1,024 | 32,000 | Proprietary API |
| embed-v3 | [Cohere](/wiki/cohere) | 2023 | 384 / 1,024 | 512 | Proprietary API |
| embed-multilingual-v3 | [Cohere](/wiki/cohere) | 2023 | 1,024 | 512 | Proprietary API |
| text-embedding-005 | [Google](/wiki/google) Vertex AI | 2024 | 768 | 2,048 | Proprietary API |
| gemini-embedding-001 | [Google](/wiki/google) | 2025 | 768 to 3,072 (MRL) | 8,192 | Proprietary API |
| NV-Embed-v2 | [NVIDIA](/wiki/nvidia) | 2024 | 4,096 | 32,768 | Open weights (CC-BY-NC) |
| BGE-M3 | BAAI | 2024 | 1,024 | 8,192 | Open (MIT) |
| BGE-large-en-v1.5 | BAAI | 2023 | 1,024 | 512 | Open (MIT) |
| GTE-large-en-v1.5 | Alibaba | 2024 | 1,024 | 8,192 | Open (Apache 2.0) |
| mxbai-embed-large-v1 | Mixedbread | 2024 | 1,024 | 512 | Open (Apache 2.0) |
| jina-embeddings-v3 | Jina AI | 2024 | 32 to 1,024 (MRL) | 8,192 | Open (CC-BY-NC) |
| Nomic Embed v1.5 | Nomic AI | 2024 | 64 to 768 (MRL) | 8,192 | Open (Apache 2.0) |
| E5-mistral-7b-instruct | Microsoft | 2024 | 4,096 | 32,768 | Open (MIT) |
| all-MiniLM-L6-v2 | UKP / Hugging Face | 2021 | 384 | 256 | Open (Apache 2.0) |

Sources: provider documentation, model cards on Hugging Face, and the MTEB leaderboard.[16][17][21][22][23]

The trend across 2023 to 2026 is clear: models keep growing in context length (512 to 32,000+ tokens), are increasingly built on decoder-only [LLM](/wiki/large_language_model) backbones, and almost universally support Matryoshka dimension truncation. Open-source models on the MTEB leaderboard now match or exceed the best proprietary APIs on most retrieval tasks.[21]

### Multimodal embeddings

[CLIP](/wiki/clip) (Contrastive Language-Image Pre-training), released by OpenAI in February 2021 in *Learning transferable visual models from natural language supervision*, is the canonical multimodal embedding model.[24] CLIP trains a vision encoder (a [Vision Transformer](/wiki/vision_transformer) or ResNet) and a text encoder (a Transformer) jointly on 400 million image-caption pairs scraped from the web with a contrastive objective: each image embedding should be close to its caption's embedding and far from the embeddings of every other caption in the batch.

The result is a shared image-text embedding space in which a picture of a golden retriever and the string 'a photo of a golden retriever' produce embeddings with high cosine similarity. CLIP enables zero-shot image classification (rank images against a list of class-name prompts), text-to-image retrieval, and conditioning for [diffusion models](/wiki/diffusion_models) like [Stable Diffusion](/wiki/stable_diffusion) and [DALL-E 2](/wiki/dall_e).[24]

[ImageBind](/wiki/imagebind), released by Meta in May 2023, extended the CLIP idea to six modalities: images, text, audio, depth, thermal, and IMU motion data.[25] ImageBind uses image-paired data (image-text, image-audio, image-depth, etc.) and shows that pairwise alignment with images is sufficient to learn a shared embedding space across all six modalities, enabling cross-modal retrieval (e.g., finding images from audio queries) without ever training on direct audio-text pairs.

Other important multimodal embedding models include:

| Model | Modalities | Year | Notes |
| --- | --- | --- | --- |
| [CLIP](/wiki/clip) | Image + text | 2021 | OpenAI, 400M pairs, foundational |
| [SigLIP](/wiki/siglip) | Image + text | 2023 | Google, sigmoid loss replaces softmax, more compute-efficient |
| [BLIP-2](/wiki/blip2) | Image + text | 2023 | Bridges frozen vision encoder with LLM via Q-Former |
| OpenCLIP | Image + text | 2022 | LAION reproduction of CLIP, open weights |
| [DINOv2](/wiki/dinov2) | Image | 2023 | Meta, self-supervised vision-only embeddings |
| [ImageBind](/wiki/imagebind) | 6 modalities | 2023 | Meta, image-anchored joint space |
| LanguageBind | 6 modalities | 2023 | PKU, language-anchored alternative to ImageBind |
| Voyage-multimodal-3 | Image + text | 2024 | Production API, mixed image-text documents |
| ColPali | Image + text | 2024 | ColBERT-style late-interaction retrieval over document page images |

## How are embedding models benchmarked? MTEB and beyond

The [Massive Text Embedding Benchmark](/wiki/mteb) (MTEB), introduced by Niklas Muennighoff, Nouamane Tazi, Loic Magne, and Nils Reimers at EACL 2023, is the standard evaluation suite for English text embedding models.[21] MTEB covers 58 datasets across 8 task categories: bitext mining, classification, clustering, pair classification, reranking, retrieval, semantic textual similarity (STS), and summarization. The 56 English datasets, which span every category except bitext mining, make up the headline English average. A central finding of the paper is that breadth matters: the authors report that 'no particular text embedding method dominates across all tasks,' which is why practitioners are advised to benchmark candidate models on their own data rather than trusting a single leaderboard rank.[21]

The MTEB leaderboard, hosted on Hugging Face Spaces, has become the field's de facto scoreboard. As of 2025 to 2026 it has expanded to include [MMTEB](/wiki/mmteb) (Massive Multilingual Text Embedding Benchmark) covering more than 250 languages and over 500 tasks, and specialized leaderboards for code, law, and long documents.[21][22]

Key lessons from MTEB:

- No single model wins every task. Strong retrieval models can underperform on STS tasks and vice versa.
- Embedding dimensions show diminishing returns: as a rough rule of thumb, doubling dimensions buys only about 1 to 2 absolute points of average score, so quality grows roughly logarithmically with dimensionality.
- Fine-tuning a base model on in-domain pairs typically beats picking a different general-purpose model for that domain.[21]

Other important benchmarks include BEIR (zero-shot retrieval, Thakur et al. 2021), LoCo (long-context retrieval), CodeSearchNet (code retrieval), and MIRACL (multilingual retrieval).[26]

## Vector databases

Storing and efficiently searching billions of embeddings has spawned a new category of infrastructure: the [vector database](/wiki/vector_database). These systems implement [approximate nearest neighbor](/wiki/ann) (ANN) search algorithms (HNSW, IVF, ScaNN, DiskANN) and add metadata filtering, hybrid keyword-plus-vector search, and operational tooling around them.

| Database | Type | Year | Index | Notable features |
| --- | --- | --- | --- | --- |
| [Pinecone](/wiki/pinecone) | Managed cloud | 2019 | Proprietary | Serverless, fully managed, namespaces |
| [Weaviate](/wiki/weaviate) | Open-source / managed | 2019 | HNSW | Built-in vectorization modules, GraphQL API |
| [Qdrant](/wiki/qdrant) | Open-source / managed | 2021 | HNSW | Rust-based, payload filtering, scalar quantization |
| [Chroma](/wiki/chroma) | Open-source | 2022 | HNSW | Designed for [LangChain](/wiki/langchain) and prototyping |
| [Milvus](/wiki/milvus) | Open-source / managed | 2019 | HNSW, IVF, DiskANN | LF AI graduate, GPU acceleration, billion-scale |
| [pgvector](/wiki/pgvector) | PostgreSQL extension | 2021 | IVFFlat, HNSW | Brings vector search into existing PostgreSQL deployments |
| Vespa | Open-source / managed | 2017 | HNSW | Yahoo origin, hybrid retrieval and ranking |
| LanceDB | Embedded / serverless | 2023 | IVF-PQ | Columnar Lance format, multimodal-friendly |
| Turbopuffer | Managed cloud | 2023 | Custom | S3-backed, low cost per gigabyte |
| Elasticsearch / OpenSearch | Search engine | 2022+ | HNSW | Vector support added to existing keyword search |
| Redis Vector | Key-value store | 2022 | HNSW, FLAT | Vector search inside Redis |
| MongoDB Atlas Vector Search | Document DB | 2023 | HNSW | Vector search inside MongoDB |
| Azure AI Search | Managed cloud | 2023 | HNSW | Microsoft, integrated with Azure OpenAI |

Sources: vendor documentation and ANN-Benchmarks results.[27][28]

Most production systems combine ANN search with a metadata filter (e.g., 'customer_id = X AND created_at > Y'). Algorithms like Filtered DiskANN and the post-filter / pre-filter strategies in HNSW are active areas of research because naive filtering breaks the graph-walk assumptions of HNSW.[28]

## How is similarity between embeddings measured?

Three distance functions dominate practical use of embeddings.

| Metric | Formula | Range | When to use |
| --- | --- | --- | --- |
| **Cosine similarity** | (a . b) / (||a|| * ||b||) | -1 to 1 | Default for text embeddings; magnitude-invariant |
| **Dot product** | a . b | unbounded | When embeddings are pre-normalized to unit length, equivalent to cosine; otherwise rewards larger vectors |
| **Euclidean (L2) distance** | sqrt(sum_i (a_i - b_i)^2) | 0 to infinity | Image embeddings, geometric problems, [k-means clustering](/wiki/k-means) |

Most modern text embedding models (OpenAI text-embedding-3, BGE, GTE, Sentence-BERT) output unit-normalized vectors. In that case cosine similarity equals the dot product, and the squared Euclidean distance satisfies ||a - b||^2 = 2 - 2 (a . b), so all three metrics are monotonically related and yield identical rankings.[3] Choosing one over another is then a matter of compute cost: dot product is the cheapest, then cosine, then Euclidean.
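The relationship between the three metrics on unit-length vectors can be checked numerically. This is a small sketch with made-up two-dimensional vectors; the helper names are chosen for this example.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def cosine(a, b):
    return dot(a, b) / (norm(a) * norm(b))


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def normalize(a):
    n = norm(a)
    return [x / n for x in a]


a = normalize([3.0, 4.0])
b = normalize([4.0, 3.0])

print(cosine(a, b))          # equals dot(a, b) for unit vectors
print(dot(a, b))
print(euclidean(a, b) ** 2)  # equals 2 - 2 * dot(a, b) for unit vectors
```

Because the squared distance is a decreasing function of the dot product, sorting neighbors by ascending Euclidean distance or by descending dot product produces the same order.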

### Aggregate similarity for sets

When comparing sets of embeddings (e.g., a multi-vector representation of a long document) more sophisticated metrics apply: ColBERT's late interaction sums the maximum cosine similarity from each query token to any document token; SPLADE uses sparse [lexical](/wiki/lexicon) interaction; cross-encoders compare the joint representation of a (query, document) concatenation.[29]
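The late-interaction score described for ColBERT can be sketched as a sum of per-query-token maxima. The token vectors below are invented for illustration and the function name `maxsim` is an assumption of this example; a production system computes the same quantity with batched matrix operations.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def maxsim(query_tokens, doc_tokens):
    """Sum, over query tokens, of the best cosine match among document tokens."""
    return sum(max(cosine(q, d) for d in doc_tokens) for q in query_tokens)


query_tokens = [[1.0, 0.0], [0.0, 1.0]]
doc_tokens = [[1.0, 0.0], [1.0, 1.0]]
score = maxsim(query_tokens, doc_tokens)
print(score)
```

Each query token is matched independently, so a document can score well by covering different query terms in different places, which a single pooled vector cannot express.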

## What are vector embeddings used for?

### Semantic search

[Semantic search](/wiki/semantic_search) replaces keyword matching with vector similarity. The query and corpus documents are embedded with the same model; documents whose embeddings have the highest cosine similarity to the query are returned. This handles synonymy ('automobile' matches 'car'), paraphrase ('how to fix a leaky faucet' matches 'plumbing repair'), and conceptual relatedness ('quiet electric vehicles' matches a Tesla review).[1]

Production systems usually combine semantic search with traditional [BM25](/wiki/bm25) keyword search using reciprocal rank fusion or a learned reranker on top, since neither approach dominates the other across all queries.[26]
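Reciprocal rank fusion needs only the ranked lists, not the raw scores, which is why it is a convenient way to merge BM25 and vector results. The sketch below uses the commonly cited constant k = 60; the document IDs are placeholders.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of IDs; each appearance adds 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


bm25_ranking = ["d1", "d2", "d3"]
vector_ranking = ["d3", "d1", "d4"]
fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
print(fused)
```

Documents that appear near the top of both lists rise to the top of the fused list, while documents found by only one retriever are kept but ranked lower.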

### Retrieval-augmented generation (RAG)

[Retrieval-augmented generation](/wiki/retrieval_augmented_generation), introduced by Lewis and colleagues in 2020, has become the dominant pattern for building LLM applications over private data.[30] The pipeline:

1. Chunk source documents into passages (typically 200 to 1,000 tokens each).
2. Embed every chunk with an embedding model and store in a vector database.
3. At query time, embed the user question and retrieve the top-k most similar chunks.
4. Prepend retrieved chunks to the LLM prompt as context.
5. Generate the answer.

The quality of step 3 dominates end-to-end answer quality, which is why embedding model choice is among the highest-leverage decisions in RAG system design.[31]
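Steps 1 to 4 of the pipeline can be sketched end to end with the standard library. Everything here is a stand-in: `toy_embed` is a hashed bag-of-words vector rather than a learned embedding model, the "vector database" is a plain list searched exhaustively, chunking is by word count rather than tokens, and step 5 (the LLM call) is left out.

```python
import math
import re
import zlib

DIM = 4096  # size of the toy hashed vector space (assumption for this sketch)


def toy_embed(text):
    """Stand-in for an embedding model: normalized hashed bag-of-words."""
    vec = [0.0] * DIM
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        vec[zlib.crc32(token.encode("utf-8")) % DIM] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def chunk(text, max_words=50):
    """Step 1: split a document into passages of at most `max_words` words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def build_index(documents):
    """Step 2: embed every chunk and store (passage, vector) pairs."""
    return [(p, toy_embed(p)) for doc in documents for p in chunk(doc)]


def retrieve(index, question, k=2):
    """Step 3: return the k passages most similar to the question."""
    q = toy_embed(question)
    scored = [(sum(a * b for a, b in zip(q, vec)), passage) for passage, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [passage for _, passage in scored[:k]]


def build_prompt(question, passages):
    """Step 4: prepend the retrieved passages to the question."""
    context = "\n\n".join(passages)
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"


documents = [
    "HNSW is a graph index for approximate nearest neighbor search.",
    "Mean pooling averages token hidden states into one vector.",
]
index = build_index(documents)
top = retrieve(index, "What is HNSW?", k=1)
prompt = build_prompt("What is HNSW?", top)
print(prompt)
```

Swapping `toy_embed` for a real embedding model and the list for an ANN index changes the quality and scale of step 3 but not the shape of the pipeline.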

### Classification and clustering

Embeddings act as feature inputs for downstream classifiers. A logistic regression or small [MLP](/wiki/perceptron) trained on top of frozen embeddings often matches or exceeds fine-tuning the base model when labeled data is scarce. The MTEB classification subset measures exactly this scenario.[21]

For unsupervised analysis, embeddings combined with [k-means](/wiki/k-means), HDBSCAN, or agglomerative clustering produce thematic groupings of large text corpora, which is the basis of tools like [BERTopic](/wiki/bertopic) for topic modeling.[32]

### Recommendation

User and item embeddings, often trained jointly with a two-tower architecture, power recommendation systems at scale. YouTube's deep neural network recommender (Covington et al. 2016) was an early influential public design; modern systems at TikTok, Spotify, Netflix, Amazon, and Pinterest all rely on dense vector retrieval as the candidate-generation stage.[33]

### Anomaly detection

Objects whose embeddings sit far from their cluster centroid (or have low density in the embedding space) are flagged as outliers. This is used in [fraud detection](/wiki/fraud_detection), content moderation, and [drug discovery](/wiki/ai_drug_discovery) (where unusual molecular embeddings can indicate novel chemistry).[1]
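A centroid-distance detector is the simplest version of this idea. The sketch below flags points whose distance from the centroid exceeds the mean distance by a chosen number of standard deviations; the threshold of 1.5 and the two-dimensional points are arbitrary choices for illustration.

```python
import math
import statistics


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[j] for v in vectors) / len(vectors) for j in range(dim)]


def find_outliers(vectors, z_threshold=1.5):
    """Return indices whose distance to the centroid is unusually large."""
    center = centroid(vectors)
    distances = [math.dist(v, center) for v in vectors]
    mean = statistics.fmean(distances)
    spread = statistics.pstdev(distances)
    cutoff = mean + z_threshold * spread
    return [i for i, d in enumerate(distances) if d > cutoff]


points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [10.0, 10.0]]
print(find_outliers(points))
```

One caveat of this sketch is that the outlier itself pulls the centroid and inflates the spread, so robust statistics or density-based methods are usually preferred on real data.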

### Deduplication and near-duplicate detection

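Locality-sensitive hashing over embeddings (for example random-hyperplane hashing, the cosine-similarity counterpart of MinHash) or direct ANN nearest-neighbor search finds near-duplicate documents, web pages, or images. Deduplication is a standard preprocessing step for web-scale corpora built from Common Crawl, such as LAION and C4, and embedding-based methods extend it from exact and lexical matches to semantic near-duplicates.[34]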

### Cross-lingual transfer

Multilingual embedding models (LaBSE, multilingual-E5, BGE-M3, Cohere embed-multilingual-v3) map sentences in different languages into the same vector space. This enables zero-shot cross-lingual retrieval (an English query retrieves Japanese documents) and bitext mining for [machine translation](/wiki/machine_translation) training data.[9][35]

### Drug discovery and protein embeddings

Molecular embeddings from models like ChemBERTa, MolFormer, and Uni-Mol, and protein embeddings from [ESM-2](/wiki/esm_2) and [AlphaFold](/wiki/alphafold), apply the same dense-vector ideas to chemistry and biology, supporting binding affinity prediction, protein function annotation, and reaction yield prediction.[36]

## Visualization and dimensionality reduction

Embeddings of 768 to 4,096 dimensions cannot be plotted directly. Three dimensionality reduction techniques are commonly applied to project them into 2 or 3 dimensions for human inspection.

| Technique | Year | Preserves | Strengths | Weaknesses |
| --- | --- | --- | --- | --- |
| **[PCA](/wiki/pca)** (Principal Component Analysis) | 1901 | Global linear variance | Fast, deterministic, invertible | Misses nonlinear structure |
| **[t-SNE](/wiki/t_sne)** (t-distributed Stochastic Neighbor Embedding) | 2008 (van der Maaten and Hinton) | Local neighborhoods | Reveals tight clusters | Slow, hyperparameter-sensitive, distorts global geometry |
| **[UMAP](/wiki/umap)** (Uniform Manifold Approximation and Projection) | 2018 (McInnes, Healy, Melville) | Local and some global | Faster than t-SNE, better global structure, can transform new data | Stochastic; cluster sizes and distances are not directly meaningful |

UMAP has largely replaced t-SNE as the default tool for embedding visualization in domains like [single-cell RNA sequencing](/wiki/single_cell_rna_sequencing) and large NLP corpora because it is roughly an order of magnitude faster on millions of points and tends to preserve more of the global structure.[37][38]

Interactive visualization platforms like the TensorFlow [Embedding Projector](/wiki/embedding_projector), Nomic Atlas, and the Hugging Face Spaces hosted Embedding Atlas let users explore embedding spaces with hover, search, and color-by-metadata features.[39]

## What are the limitations of vector embeddings?

Embeddings have well-documented failure modes that practitioners must design around:

- **Anisotropy.** Out-of-the-box BERT embeddings occupy a narrow cone in the high-dimensional space, which compresses cosine similarities into a small range and degrades discrimination. Whitening, contrastive fine-tuning (Sentence-BERT), and post-hoc normalization fix this.[40]
- **Bias.** Embeddings inherit the biases of their training data. Bolukbasi and colleagues' 2016 paper *Man is to Computer Programmer as Woman is to Homemaker?* showed that Word2Vec embeddings encode gender stereotypes that propagate into downstream applications.[41]
- **Domain mismatch.** A model trained on web text often underperforms on legal contracts, medical records, or scientific papers without domain-specific fine-tuning.
- **Out-of-distribution inputs.** Embeddings of inputs very different from training data (random byte sequences, adversarial text) can produce misleading similarities.
- **Stability across model versions.** Re-embedding an entire corpus is expensive. Switching from text-embedding-ada-002 to text-embedding-3-large requires rebuilding the index, which has driven interest in compatibility-aware embedding training.[16]
- **Information leakage.** Recent work has shown that embeddings can be partially inverted to recover the original text, with implications for privacy when raw embeddings are shared.[42]

## What does it cost to run embeddings at scale?

Embedding spend has two components: embedding generation (charged per token by API providers) and storage plus query in the vector database (charged per gigabyte of vectors and per query).

As of early 2026 the OpenAI text-embedding-3-small endpoint costs about $0.02 per million tokens and text-embedding-3-large costs about $0.13 per million tokens. Voyage, Cohere, and Google charge in a similar range.[16][17] Embedding 100 million 500-token documents with text-embedding-3-small costs roughly $1,000.

Storage cost is dominated by dimensionality. A 1,536-dimensional float32 embedding takes 6 KB; a billion of them is 6 TB. Standard mitigations include:

| Technique | Storage savings | Quality impact |
| --- | --- | --- |
| Float32 to float16 | 2x | Negligible |
| Float32 to int8 quantization | 4x | Small (typically less than 1% on MTEB) |
| Float32 to binary (1-bit) quantization | 32x | Moderate; recoverable with rerank |
| Matryoshka truncation (3,072 to 768) | 4x | Small to moderate, depends on task |
| Product quantization (PQ) | 8x to 32x | Tunable trade-off |
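The savings in the table follow directly from bits per value and dimension count. The sketch below covers raw vector storage only; index structures, metadata, and replication are excluded, and `storage_bytes` is a hypothetical helper written for this illustration.

```python
BITS_PER_VALUE = {"float32": 32, "float16": 16, "int8": 8, "binary": 1}

def storage_bytes(num_vectors: int, dims: int, dtype: str = "float32") -> float:
    """Raw storage for the vectors alone, with no index or metadata overhead."""
    return num_vectors * dims * BITS_PER_VALUE[dtype] / 8

NUM_VECTORS = 1_000_000_000
baseline = storage_bytes(NUM_VECTORS, 1536)

# Effect of lowering precision at a fixed 1,536 dimensions.
for dtype in BITS_PER_VALUE:
    size = storage_bytes(NUM_VECTORS, 1536, dtype)
    print(f"{dtype:>8}: {size / 1e12:6.3f} TB ({baseline / size:.0f}x smaller)")

# Effect of Matryoshka truncation from 3,072 to 768 dimensions at float32.
full = storage_bytes(NUM_VECTORS, 3072)
truncated = storage_bytes(NUM_VECTORS, 768)
print(f"truncation 3072 -> 768: {full / truncated:.0f}x smaller")
```

Precision reduction and truncation multiply, so a 768-dimensional int8 vector is 16 times smaller than its 3,072-dimensional float32 original.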

Mixedbread, Cohere, and Voyage all support int8 and binary embedding outputs natively, often combined with a final full-precision cosine rerank of the top 100 candidates to recover most of the full-precision quality.[27]
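A minimal sketch of that two-stage pattern, assuming sign-based binarization and random synthetic vectors in place of real embeddings. It is not any vendor's implementation; `binarize`, `hamming_distance`, and `search` are illustrative names.

```python
import numpy as np

def binarize(x: np.ndarray) -> np.ndarray:
    """Keep only the sign of each dimension and pack 8 dimensions per byte."""
    return np.packbits(x > 0, axis=-1)

def hamming_distance(codes: np.ndarray, query_code: np.ndarray) -> np.ndarray:
    """Number of differing bits between the query code and each stored code."""
    return np.unpackbits(codes ^ query_code, axis=-1).sum(axis=-1)

def search(query, corpus, corpus_codes, top_k=10, shortlist=100):
    # Stage 1: cheap candidate generation on the 1-bit codes.
    dists = hamming_distance(corpus_codes, binarize(query))
    candidates = np.argsort(dists)[:shortlist]
    # Stage 2: rerank the shortlist with full-precision cosine similarity.
    cand = corpus[candidates]
    sims = cand @ query / (np.linalg.norm(cand, axis=1) * np.linalg.norm(query))
    return candidates[np.argsort(-sims)[:top_k]]

rng = np.random.default_rng(0)
corpus = rng.normal(size=(10_000, 256)).astype(np.float32)
corpus_codes = binarize(corpus)  # 32 bytes per vector instead of 1,024

# A query that is a slightly perturbed copy of row 42.
query = corpus[42] + 0.1 * rng.normal(size=256).astype(np.float32)
result = search(query, corpus, corpus_codes)
print(result[:3])
```

Only the shortlist is scored at full precision, so the expensive comparison touches 100 vectors rather than the whole corpus. In a production system the full-precision vectors for the shortlist would typically be fetched from slower storage.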

## Explain vector embeddings like I'm 5 (ELI5)

Imagine you have a box of different toys like cars, dolls, and balls. Now, we want to sort these toys based on how similar they are. We can use something called 'vector embedding' to help us with this. Vector embedding is like giving each toy a secret code made of numbers. Toys that are similar will have secret codes that are very close to each other, and toys that are not similar will have secret codes that are very different.

For example, let's say we have a red car, a blue car, and a doll. We can give them secret codes like this:

Red car: [1, 2, 3]

Blue car: [1, 2, 4]

Doll: [5, 6, 7]

See how the red car and the blue car have secret codes that are very close to each other, while the doll has a different secret code? That's because the cars are more similar to each other than the doll.
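For readers who want to check the numbers, here is a small sketch that measures how far apart the secret codes are, using straight-line (Euclidean) distance. The codes are the made-up ones from the example, not output from a real model.

```python
import math

# The made-up secret codes from the toy example.
toys = {
    "red car": [1, 2, 3],
    "blue car": [1, 2, 4],
    "doll": [5, 6, 7],
}

# Straight-line distance between two codes: smaller means more similar.
car_to_car = math.dist(toys["red car"], toys["blue car"])
car_to_doll = math.dist(toys["red car"], toys["doll"])

print(f"red car to blue car: {car_to_car:.2f}")
print(f"red car to doll:     {car_to_doll:.2f}")
```

The two cars are 1.00 apart, while the red car and the doll are about 6.93 apart.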

Vector embedding can also be used for words, pictures, sounds, and many other things. It helps computers understand and sort these things by how similar they are, just like we sorted the toys.

## See also

- [Word2Vec](/wiki/word2vec)
- [GloVe](/wiki/glove)
- [FastText](/wiki/fasttext)
- [BERT](/wiki/bert)
- [Sentence-BERT](/wiki/sentence-bert)
- [CLIP](/wiki/clip)
- [ImageBind](/wiki/imagebind)
- [Vector database](/wiki/vector_database)
- [Approximate nearest neighbor](/wiki/ann)
- [Retrieval-augmented generation](/wiki/retrieval_augmented_generation)
- [Semantic search](/wiki/semantic_search)
- [MTEB](/wiki/mteb)
- [Cosine similarity](/wiki/cosine_similarity)
- [UMAP](/wiki/umap)
- [t-SNE](/wiki/t_sne)
- [Latent semantic analysis](/wiki/latent_semantic_analysis)
- [Contrastive learning](/wiki/contrastive_learning)
- [Matryoshka representation learning](/wiki/matryoshka_representation_learning)

## References

1. Mikolov, T., Sutskever, I., Chen, K., Corrado, G., and Dean, J. (2013). 'Distributed representations of words and phrases and their compositionality.' *Advances in Neural Information Processing Systems* 26.
2. Karpukhin, V., Oğuz, B., Min, S., Lewis, P., Wu, L., Edunov, S., Chen, D., and Yih, W. (2020). 'Dense passage retrieval for open-domain question answering.' *EMNLP 2020*.
3. Manning, C. D., Raghavan, P., and Schütze, H. (2008). *Introduction to information retrieval*. Cambridge University Press, chapter 6.
4. Hinton, G. E. (1986). 'Learning distributed representations of concepts.' *Proceedings of the Eighth Annual Conference of the Cognitive Science Society*.
5. Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., and Harshman, R. (1990). 'Indexing by latent semantic analysis.' *Journal of the American Society for Information Science*, 41(6), 391 to 407.
6. Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013). 'Efficient estimation of word representations in vector space.' arXiv:1301.3781.
7. Mikolov, T., Sutskever, I., Chen, K., Corrado, G., and Dean, J. (2013). 'Distributed representations of words and phrases and their compositionality.' arXiv:1310.4546.
8. Pennington, J., Socher, R., and Manning, C. D. (2014). 'GloVe: Global vectors for word representation.' *EMNLP 2014*.
9. Bojanowski, P., Grave, E., Joulin, A., and Mikolov, T. (2017). 'Enriching word vectors with subword information.' *Transactions of the Association for Computational Linguistics*, 5, 135 to 146.
10. Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., and Zettlemoyer, L. (2018). 'Deep contextualized word representations.' *NAACL 2018*.
11. Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019). 'BERT: Pre-training of deep bidirectional transformers for language understanding.' *NAACL 2019*.
12. Reimers, N. and Gurevych, I. (2019). 'Sentence-BERT: Sentence embeddings using Siamese BERT-networks.' *EMNLP 2019*.
13. Conneau, A., Kiela, D., Schwenk, H., Barrault, L., and Bordes, A. (2017). 'Supervised learning of universal sentence representations from natural language inference data.' *EMNLP 2017*.
14. Cer, D. et al. (2018). 'Universal sentence encoder.' arXiv:1803.11175.
15. OpenAI (2022). 'New and improved embedding model.' OpenAI blog, December 15, 2022.
16. OpenAI (2024). 'New embedding models and API updates.' OpenAI blog, January 25, 2024. https://openai.com/index/new-embedding-models-and-api-updates/.
17. Anthropic. 'Embeddings.' Anthropic Claude documentation, https://docs.anthropic.com/claude/docs/embeddings.
18. Wang, L., Yang, N., Huang, X., Yang, L., Majumder, R., and Wei, F. (2024). 'Improving text embeddings with large language models.' *ACL 2024*.
19. Izacard, G., Caron, M., Hosseini, L., Riedel, S., Bojanowski, P., Joulin, A., and Grave, E. (2022). 'Unsupervised dense information retrieval with contrastive learning.' *Transactions on Machine Learning Research*.
20. Kusupati, A., Bhatt, G., Rege, A., Wallingford, M., Sinha, A., Ramanujan, V., Howard-Snyder, W., Chen, K., Kakade, S., Jain, P., and Farhadi, A. (2022). 'Matryoshka representation learning.' *NeurIPS 2022*.
21. Muennighoff, N., Tazi, N., Magne, L., and Reimers, N. (2023). 'MTEB: Massive text embedding benchmark.' *EACL 2023*. https://aclanthology.org/2023.eacl-main.148/.
22. Hugging Face. 'MTEB Leaderboard.' https://huggingface.co/spaces/mteb/leaderboard.
23. Voyage AI (2024 to 2025). 'Voyage embedding models documentation.' https://docs.voyageai.com.
24. Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., and Sutskever, I. (2021). 'Learning transferable visual models from natural language supervision.' *ICML 2021*.
25. Girdhar, R., El-Nouby, A., Liu, Z., Singh, M., Alwala, K. V., Joulin, A., and Misra, I. (2023). 'ImageBind: One embedding space to bind them all.' *CVPR 2023*.
26. Thakur, N., Reimers, N., Rücklé, A., Srivastava, A., and Gurevych, I. (2021). 'BEIR: A heterogeneous benchmark for zero-shot evaluation of information retrieval models.' *NeurIPS 2021 Datasets and Benchmarks*.
27. ANN-Benchmarks (2024). 'Benchmarking nearest neighbor search algorithms.' https://ann-benchmarks.com.
28. Malkov, Y. A. and Yashunin, D. A. (2018). 'Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs.' *IEEE Transactions on Pattern Analysis and Machine Intelligence*.
29. Khattab, O. and Zaharia, M. (2020). 'ColBERT: Efficient and effective passage search via contextualized late interaction over BERT.' *SIGIR 2020*.
30. Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W., Rocktäschel, T., Riedel, S., and Kiela, D. (2020). 'Retrieval-augmented generation for knowledge-intensive NLP tasks.' *NeurIPS 2020*.
31. Gao, Y., Xiong, Y., Gao, X., Jia, K., Pan, J., Bi, Y., Dai, Y., Sun, J., and Wang, H. (2024). 'Retrieval-augmented generation for large language models: A survey.' arXiv:2312.10997.
32. Grootendorst, M. (2022). 'BERTopic: Neural topic modeling with a class-based TF-IDF procedure.' arXiv:2203.05794.
33. Covington, P., Adams, J., and Sargin, E. (2016). 'Deep neural networks for YouTube recommendations.' *RecSys 2016*.
34. Lee, K., Ippolito, D., Nystrom, A., Zhang, C., Eck, D., Callison-Burch, C., and Carlini, N. (2022). 'Deduplicating training data makes language models better.' *ACL 2022*.
35. Feng, F., Yang, Y., Cer, D., Arivazhagan, N., and Wang, W. (2022). 'Language-agnostic BERT sentence embedding.' *ACL 2022*.
36. Lin, Z., Akin, H., Rao, R., Hie, B., Zhu, Z., Lu, W., Smetanin, N., Verkuil, R., Kabeli, O., Shmueli, Y., dos Santos Costa, A., Fazel-Zarandi, M., Sercu, T., Candido, S., and Rives, A. (2023). 'Evolutionary-scale prediction of atomic-level protein structure with a language model.' *Science*, 379(6637), 1123 to 1130.
37. van der Maaten, L. and Hinton, G. (2008). 'Visualizing data using t-SNE.' *Journal of Machine Learning Research*, 9, 2579 to 2605.
38. McInnes, L., Healy, J., and Melville, J. (2018). 'UMAP: Uniform manifold approximation and projection for dimension reduction.' arXiv:1802.03426.
39. Smilkov, D., Thorat, N., Nicholson, C., Reif, E., Viégas, F. B., and Wattenberg, M. (2016). 'Embedding projector: Interactive visualization and interpretation of embeddings.' *NIPS 2016 Workshop on Interpretable ML*.
40. Li, B., Zhou, H., He, J., Wang, M., Yang, Y., and Li, L. (2020). 'On the sentence embeddings from pre-trained language models.' *EMNLP 2020*.
41. Bolukbasi, T., Chang, K.-W., Zou, J. Y., Saligrama, V., and Kalai, A. T. (2016). 'Man is to computer programmer as woman is to homemaker? Debiasing word embeddings.' *NeurIPS 2016*.
42. Morris, J. X., Kuleshov, V., Shmatikov, V., and Rush, A. M. (2023). 'Text embeddings reveal (almost) as much as text.' *EMNLP 2023*.

