Vector embeddings
Last reviewed
Apr 30, 2026
Sources
42 citations
Review status
Source-backed
Revision
v7 · 5,590 words
See also: AI terms
Vector embeddings are dense numerical representations of objects (text, images, audio, video, code, graphs, or any structured data) that map them into a continuous vector space such that semantic similarity between objects corresponds to geometric proximity. They form the mathematical substrate of modern machine learning, natural language processing, and information retrieval, and they sit at the core of nearly every production retrieval-augmented generation (RAG) pipeline, semantic search engine, recommendation system, and multimodal AI model deployed today.[1][2]
An embedding model takes an input (a sentence, a JPEG, a 30-second audio clip, a SMILES molecular string) and produces a fixed-length vector, typically of 256 to 4,096 dimensions. The values inside the vector have no individually interpretable meaning. Their utility comes from the geometry of the resulting space: cosine similarity between two embeddings approximates the semantic relatedness of the two inputs.[3]
Formally, an embedding is a learned function f: X to R^d that maps an input space X (words, sentences, images, etc.) into a real-valued d-dimensional vector space, with the property that a chosen distance metric (cosine, dot product, or Euclidean) between f(a) and f(b) reflects a meaningful notion of similarity between a and b.[1]
Key properties of modern embeddings:
| Property | Description |
|---|---|
| Dense | All or nearly all dimensions are non-zero, in contrast to sparse one-hot or bag-of-words representations. |
| Distributed | Meaning is spread across many dimensions; no single dimension corresponds to a human concept. |
| Fixed-length | Output dimensionality d is constant for a given model regardless of input length. |
| Continuous | Lives in R^d, allowing arithmetic operations (addition, subtraction, interpolation). |
| Learned | Produced by a neural network trained on a large corpus with a self-supervised or contrastive objective. |
The conceptual foundation comes from Geoffrey Hinton's 1986 paper Learning distributed representations of concepts, which argued that concepts should be represented by patterns of activity over many units rather than by single dedicated symbols.[4] This idea, that meaning emerges from the joint values of many features, is the philosophical ancestor of every modern embedding.
The first widely used dense vector representation of text was latent semantic analysis (LSA), introduced by Scott Deerwester, Susan Dumais, George Furnas, Thomas Landauer, and Richard Harshman in their 1990 paper Indexing by latent semantic analysis.[5] LSA applies singular value decomposition (SVD) to a term-document matrix, producing low-rank approximations in which documents and terms are represented as vectors in the same reduced space. Documents with overlapping topical structure end up close in the SVD space even when they share no surface vocabulary, which lets LSA retrieve relevant documents that contain only synonyms of the query terms.
LSA established that a continuous, low-dimensional vector space could capture latent semantic structure, but its training was an O(n^3) matrix decomposition that did not scale to web-sized corpora and produced static representations that could not be incrementally updated.[5]
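The modern era of embeddings began with Word2Vec, released by Tomas Mikolov and colleagues at Google in two 2013 papers, Efficient estimation of word representations in vector space and Distributed representations of words and phrases and their compositionality.[6][7] Word2Vec offered two shallow neural architectures: continuous bag-of-words (CBOW), which predicts a target word from its surrounding context words, and skip-gram, which predicts the surrounding context words from a target word.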
Trained with negative sampling on a 100-billion-word Google News corpus, Word2Vec produced 300-dimensional word vectors that exhibited the now-famous analogy property: vector('king') minus vector('man') plus vector('woman') yields a vector closest to that of 'queen'.[7] This compositional behavior was a striking demonstration that distributed representations could encode relational structure.
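The analogy arithmetic can be illustrated with a minimal sketch. The vectors below are hypothetical 3-dimensional toy values chosen by hand, not real Word2Vec output, and the helper names `cosine` and `analogy` are illustrative only.

```python
import math

# Hypothetical toy vectors (real Word2Vec vectors have 300 dimensions).
# The axes loosely stand for "royalty", "maleness", "femaleness".
toy_vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c, vectors):
    """Return the word closest to vector(a) - vector(b) + vector(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

result = analogy("king", "man", "woman", toy_vectors)
print(result)  # queen
```

Excluding the three input words from the candidate set mirrors how the analogy task is usually evaluated; without that exclusion the nearest vector is often one of the inputs.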
Word2Vec's contribution was as much engineering as scientific. The negative-sampling trick reduced training cost from O(V) to O(k) per example (where V is vocabulary size and k is a small constant), allowing training on billions of words on a single CPU in hours rather than weeks.[7]
In 2014, Jeffrey Pennington, Richard Socher, and Christopher Manning at Stanford introduced GloVe (Global Vectors for Word Representation) at EMNLP.[8] GloVe combines the global matrix-factorization view of LSA with the local context-window view of Word2Vec. It factorizes the logarithm of the word-word co-occurrence matrix using a weighted least-squares objective, producing vectors that match or exceed Word2Vec on word analogy and similarity benchmarks while training faster on the same corpus.
The pretrained GloVe vectors trained on Common Crawl (840 billion tokens, 2.2 million vocabulary, 300 dimensions) became a de facto standard input for neural network NLP models in 2014 to 2017.[8]
Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov at Facebook AI Research extended Word2Vec with subword information in 2017's Enriching word vectors with subword information.[9] FastText represents each word as a bag of character n-grams (typically n = 3 to 6) taken from the word wrapped in the boundary markers '<' and '>', so the vector for 'unhappiness' is built from sub-units like '<un', 'unh', 'happi', 'ness>'. This gives the model two important properties: it can produce vectors for out-of-vocabulary words by composing their character n-grams, and it captures morphological regularities (singular versus plural, tense, derivation) that Word2Vec misses.
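A minimal sketch of the n-gram extraction step, assuming the '<' and '>' word-boundary markers described in the FastText paper. The function name `char_ngrams` is illustrative. The real library additionally keeps the whole word as its own unit and hashes n-grams into a fixed number of buckets, both of which this sketch omits.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of a word wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams

grams = char_ngrams("unhappiness")
print(len(grams))   # 38
print(grams[:3])    # ['<un', 'unh', 'nha']
```

A word vector is then the sum of the vectors of its n-grams, which is why an unseen word such as 'unhappier' still receives a usable vector: it shares most of its n-grams with words seen in training.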
FastText released pretrained vectors for 157 languages, which made it the first practical multilingual embedding system at scale.[9]
Word2Vec, GloVe, and FastText all produce a single vector per word type regardless of context, so 'bank' in 'river bank' gets the same vector as in 'savings bank'. Peters and colleagues' ELMo (2018) was the first widely adopted contextual model: it ran a bidirectional LSTM over the sentence and produced a different vector for each token depending on its full sentential context.[10]
The decisive break came later in 2018 with BERT (Bidirectional Encoder Representations from Transformers) by Devlin and colleagues at Google.[11] BERT replaced the LSTM with a Transformer encoder trained on masked language modeling and next-sentence prediction over 3.3 billion words. The hidden states of a frozen or fine-tuned BERT became the default text representation for nearly every NLP benchmark from 2018 to 2020.[11]
BERT's token-level vectors are not directly useful for sentence-level tasks like clustering or semantic search: pooling them naively (mean-pool, [CLS] token) produces representations that perform worse than averaged GloVe vectors on the STS benchmarks.[12]
From 2022 onward, commercial providers shipped embedding endpoints as first-class API products. OpenAI launched text-embedding-ada-002 in December 2022 as a single-model replacement for five earlier embedding endpoints, then released text-embedding-3-small and text-embedding-3-large in January 2024 with native dimension shortening (the Matryoshka representation learning trick).[15][16] Cohere, Voyage AI, Google, NVIDIA, Mistral, Jina AI, Mixedbread, and the open-source BGE, GTE, and E5 model families all entered the market. Anthropic does not produce its own embedding model and recommends Voyage AI for use with Claude-based RAG systems.[17]
Every modern embedding model consists of three pieces: an encoder neural network, a pooling strategy, and a training objective.
The encoder is almost always a Transformer (typically a BERT-style bidirectional encoder, though decoder-only LLMs are increasingly used as embedding backbones, as in RepLLaMA, E5-Mistral, and similar models).[18] It maps a tokenized input into a sequence of hidden states.
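The variable-length sequence of hidden states is reduced to a single fixed-length vector by a pooling step, for example mean pooling over the token states or taking the hidden state of the [CLS] token; decoder-only backbones commonly use the hidden state of the last token instead.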
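Two common pooling strategies, mean pooling and [CLS] pooling, can be sketched as follows. The hidden states and attention mask are hypothetical hand-written values standing in for encoder output, and the function names are illustrative.

```python
def mean_pool(hidden_states, attention_mask):
    """Average the hidden states of real (non-padding) tokens."""
    kept = [h for h, m in zip(hidden_states, attention_mask) if m == 1]
    dim = len(hidden_states[0])
    return [sum(h[d] for h in kept) / len(kept) for d in range(dim)]

def cls_pool(hidden_states):
    """Use the hidden state of the first token, which is [CLS] in BERT-style encoders."""
    return list(hidden_states[0])

# Hypothetical hidden states for 4 token positions, dimension 2.
# The last position is padding and must not influence the mean.
hidden = [[1.0, 0.0], [3.0, 2.0], [2.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 1, 0]

mean_vec = mean_pool(hidden, mask)
cls_vec = cls_pool(hidden)
print(mean_vec)  # [2.0, 2.0]
print(cls_vec)   # [1.0, 0.0]
```

Either way the output length equals the hidden dimension, regardless of how many tokens the input had.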
Modern embedding models are trained with a contrastive learning objective, most commonly the InfoNCE loss. Each training example is a (query, positive) pair plus a batch of in-batch negatives or hard-mined negatives. The loss pulls the query embedding toward the positive and pushes it away from the negatives:[19]
L = -log(exp(sim(q, p+) / tau) / sum_i exp(sim(q, p_i) / tau))
where sim is cosine similarity and tau is a temperature hyperparameter. Training data typically combines hundreds of millions of weak pairs (question-answer pairs from Reddit, query-title pairs from search logs, citation pairs from academic papers) with millions of high-quality human-annotated pairs from datasets like MS MARCO, NLI, and STS.[19]
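The loss above can be written out for a single query as a minimal sketch. It assumes the sum in the denominator runs over the positive and all negatives, and it uses a temperature of 0.05 purely as an illustrative value. Real training code computes this over whole batches on a GPU.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def info_nce(query, candidates, positive_index, tau=0.05):
    """InfoNCE loss for one query; `candidates` holds the positive and the negatives."""
    logits = [cosine(query, c) / tau for c in candidates]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    top = max(logits)
    log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
    return log_denominator - logits[positive_index]

# Hypothetical 2-dimensional embeddings: index 0 is the positive, the rest are negatives.
candidates = [[0.9, 0.1], [0.0, 1.0], [-1.0, 0.2]]

loss_aligned = info_nce([1.0, 0.0], candidates, positive_index=0)     # query near the positive
loss_misaligned = info_nce([0.0, 1.0], candidates, positive_index=0)  # query near a negative
print(loss_aligned < loss_misaligned)  # True
```

The loss approaches zero when the query is far more similar to its positive than to any negative, and grows when a negative outranks the positive.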
Introduced by Kusupati and colleagues in 2022, Matryoshka representation learning (MRL) trains a single model to produce embeddings whose prefix sub-vectors (the first 64, 128, 256, ... dimensions) are themselves valid embeddings.[20] Users can truncate the embedding to fit storage and latency budgets without retraining. OpenAI's text-embedding-3-large supports dimensions from 256 to 3,072 via MRL, and Nomic, Mixedbread, and Jina v3 use the same trick.[16]
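On the consumer side, truncation amounts to keeping a prefix and rescaling it to unit length. The sketch below assumes the embedding came from a model trained with MRL; applying it to an ordinary embedding will usually hurt quality badly. The function name and the sample vector are hypothetical.

```python
import math

def truncate_embedding(vec, dims):
    """Keep the first `dims` values and rescale to unit length.

    Assumes the prefix is not all zeros.
    """
    prefix = vec[:dims]
    norm = math.sqrt(sum(x * x for x in prefix))
    return [x / norm for x in prefix]

# Hypothetical 4-dimensional embedding shortened to 2 dimensions.
full = [3.0, 4.0, 1.0, 2.0]
short = truncate_embedding(full, 2)
print(len(short))  # 2
```

Renormalizing matters because downstream indexes often assume unit-length vectors, and a raw prefix of a unit vector is shorter than unit length.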
Embeddings exist at multiple granularities, each with different use cases.
| Granularity | Typical models | Common uses |
|---|---|---|
| Word | Word2Vec, GloVe, FastText | Lexical similarity, analogy tasks, feature inputs to older NLP models |
| Subword / token | BERT, GPT tokenizer outputs | Token classification, named entity recognition, sequence labeling |
| Sentence | Sentence-BERT, Universal Sentence Encoder, all-MiniLM-L6-v2 | Semantic search, paraphrase detection, clustering |
| Passage / paragraph | E5, BGE, Voyage, OpenAI text-embedding-3 | RAG, question answering, document retrieval |
| Document | SPECTER, SciNCL, custom long-context models | Citation prediction, scientific paper similarity |
| Code | CodeBERT, CodeT5+, OpenAI text-embedding-3 (handles code), Voyage-code-3 | Code search, duplicate detection, vulnerability detection |
| Image | CLIP, DINOv2, SigLIP | Image retrieval, zero-shot classification, text-to-image generation conditioning |
| Audio | wav2vec 2.0, CLAP, AudioMAE | Music similarity, speaker identification, audio tagging |
| Multimodal | CLIP, ImageBind, Voyage-multimodal-3 | Cross-modal retrieval, multimodal RAG |
| Graph | node2vec, DeepWalk, GraphSAGE | Link prediction, node classification, recommendation |
| Model | Provider | Year | Dimensions | Max context (tokens) | License |
|---|---|---|---|---|---|
| text-embedding-3-large | OpenAI | 2024 | 256 to 3,072 (MRL) | 8,191 | Proprietary API |
| text-embedding-3-small | OpenAI | 2024 | 512 to 1,536 (MRL) | 8,191 | Proprietary API |
| voyage-3-large | Voyage AI | 2025 | 256 to 2,048 (MRL) | 32,000 | Proprietary API |
| voyage-3 | Voyage AI | 2024 | 1,024 | 32,000 | Proprietary API |
| voyage-code-3 | Voyage AI | 2024 | 256 to 2,048 (MRL) | 32,000 | Proprietary API |
| voyage-multimodal-3 | Voyage AI | 2024 | 1,024 | 32,000 | Proprietary API |
| embed-v3 | Cohere | 2023 | 384 / 1,024 | 512 | Proprietary API |
| embed-multilingual-v3 | Cohere | 2023 | 1,024 | 512 | Proprietary API |
| text-embedding-005 | Google Vertex AI | 2024 | 768 | 2,048 | Proprietary API |
| gemini-embedding-001 | Google | 2025 | 768 to 3,072 (MRL) | 8,192 | Proprietary API |
| NV-Embed-v2 | NVIDIA | 2024 | 4,096 | 32,768 | Open weights (CC-BY-NC) |
| BGE-M3 | BAAI | 2024 | 1,024 | 8,192 | Open (MIT) |
| BGE-large-en-v1.5 | BAAI | 2023 | 1,024 | 512 | Open (MIT) |
| GTE-large-en-v1.5 | Alibaba | 2024 | 1,024 | 8,192 | Open (Apache 2.0) |
| mxbai-embed-large-v1 | Mixedbread | 2024 | 1,024 | 512 | Open (Apache 2.0) |
| jina-embeddings-v3 | Jina AI | 2024 | 32 to 1,024 (MRL) | 8,192 | Open (CC-BY-NC) |
| Nomic Embed v1.5 | Nomic AI | 2024 | 64 to 768 (MRL) | 8,192 | Open (Apache 2.0) |
| E5-mistral-7b-instruct | Microsoft | 2024 | 4,096 | 32,768 | Open (MIT) |
| all-MiniLM-L6-v2 | UKP / Hugging Face | 2021 | 384 | 256 | Open (Apache 2.0) |
Sources: provider documentation, model cards on Hugging Face, and the MTEB leaderboard.[16][17][21][22][23]
The trend across 2023 to 2026 is clear: models keep growing in context length (512 to 32,000+ tokens), are increasingly built on decoder-only LLM backbones, and almost universally support Matryoshka dimension truncation. Open-source models on the MTEB leaderboard now match or exceed the best proprietary APIs on most retrieval tasks.[21]
CLIP (Contrastive Language-Image Pre-training), released by OpenAI in February 2021 in Learning transferable visual models from natural language supervision, is the canonical multimodal embedding model.[24] CLIP trains a vision encoder (a Vision Transformer or ResNet) and a text encoder (a Transformer) jointly on 400 million image-caption pairs scraped from the web with a contrastive objective: each image embedding should be close to its caption's embedding and far from the embeddings of every other caption in the batch.
The result is a shared image-text embedding space in which a picture of a golden retriever and the string 'a photo of a golden retriever' produce embeddings with high cosine similarity. CLIP enables zero-shot image classification (score an image against a list of class-name prompts and pick the best match), text-to-image retrieval, and conditioning for diffusion models like Stable Diffusion and DALL-E 2.[24]
ImageBind, released by Meta in May 2023, extended the CLIP idea to six modalities: images, text, audio, depth, thermal, and IMU motion data.[25] ImageBind uses image-paired data (image-text, image-audio, image-depth, etc.) and shows that pairwise alignment with images is sufficient to learn a shared embedding space across all six modalities, enabling cross-modal retrieval (e.g., finding images from audio queries) without ever training on direct audio-text pairs.
Other important multimodal embedding models include:
| Model | Modalities | Year | Notes |
|---|---|---|---|
| CLIP | Image + text | 2021 | OpenAI, 400M pairs, foundational |
| SigLIP | Image + text | 2023 | Google, sigmoid loss replaces softmax, more compute-efficient |
| BLIP-2 | Image + text | 2023 | Bridges frozen vision encoder with LLM via Q-Former |
| OpenCLIP | Image + text | 2022 | LAION reproduction of CLIP, open weights |
| DINOv2 | Image | 2023 | Meta, self-supervised vision-only embeddings |
| ImageBind | 6 modalities | 2023 | Meta, image-anchored joint space |
| LanguageBind | 6 modalities | 2023 | PKU, language-anchored alternative to ImageBind |
| Voyage-multimodal-3 | Image + text | 2024 | Production API, mixed image-text documents |
| ColBERT-vision (ColPali) | Image + text | 2024 | Late-interaction multimodal retrieval for documents |
The Massive Text Embedding Benchmark (MTEB), introduced by Niklas Muennighoff, Nouamane Tazi, Loic Magne, and Nils Reimers at EACL 2023, is the standard evaluation suite for English text embedding models.[21] MTEB covers 56 datasets across 8 task categories: bitext mining, classification, clustering, pair classification, reranking, retrieval, semantic textual similarity (STS), and summarization.
The MTEB leaderboard, hosted on Hugging Face Spaces, has become the field's de facto scoreboard. As of 2025 to 2026 it has expanded to include MMTEB (Massive Multilingual Text Embedding Benchmark) covering more than 250 languages and over 500 tasks, and specialized leaderboards for code, law, and long documents.[21][22]
Key lessons from MTEB:
Other important benchmarks include BEIR (zero-shot retrieval, Thakur et al. 2021), LoCo (long-context retrieval), CodeSearchNet (code retrieval), and MIRACL (multilingual retrieval).[26]
Storing and efficiently searching billions of embeddings has spawned a new category of infrastructure: the vector database. These systems implement approximate nearest neighbor (ANN) search algorithms (HNSW, IVF, ScaNN, DiskANN) and add metadata filtering, hybrid keyword-plus-vector search, and operational tooling around them.
| Database | Type | Year | Index | Notable features |
|---|---|---|---|---|
| Pinecone | Managed cloud | 2019 | Proprietary | Serverless, fully managed, namespaces |
| Weaviate | Open-source / managed | 2019 | HNSW | Built-in vectorization modules, GraphQL API |
| Qdrant | Open-source / managed | 2021 | HNSW | Rust-based, payload filtering, scalar quantization |
| Chroma | Open-source | 2022 | HNSW | Designed for LangChain and prototyping |
| Milvus | Open-source / managed | 2019 | HNSW, IVF, DiskANN | LF AI graduate, GPU acceleration, billion-scale |
| pgvector | PostgreSQL extension | 2021 | IVFFlat, HNSW | Brings vector search into existing PostgreSQL deployments |
| Vespa | Open-source / managed | 2017 | HNSW | Yahoo origin, hybrid retrieval and ranking |
| LanceDB | Embedded / serverless | 2023 | IVF-PQ | Columnar Lance format, multimodal-friendly |
| Turbopuffer | Managed cloud | 2023 | Custom | S3-backed, low cost per gigabyte |
| Elasticsearch / OpenSearch | Search engine | 2022+ | HNSW | Vector support added to existing keyword search |
| Redis Vector | Key-value store | 2022 | HNSW, FLAT | Vector search inside Redis |
| MongoDB Atlas Vector Search | Document DB | 2023 | HNSW | Vector search inside MongoDB |
| Azure AI Search | Managed cloud | 2023 | HNSW | Microsoft, integrated with Azure OpenAI |
Sources: vendor documentation and ANN-Benchmarks results.[27][28]
Most production systems combine ANN search with a metadata filter (e.g., 'customer_id = X AND created_at > Y'). Algorithms like Filtered DiskANN and the post-filter / pre-filter strategies in HNSW are active areas of research because naive filtering breaks the graph-walk assumptions of HNSW.[28]
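The difference between post-filtering and pre-filtering can be shown with a small sketch. It uses exact brute-force ranking in place of a real ANN index, and the item layout (`id`, `vector`, `customer_id`) and function names are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def post_filter_search(query, items, predicate, k, fetch):
    """Take the top `fetch` items by similarity, then drop those failing the filter."""
    ranked = sorted(items, key=lambda it: dot(query, it["vector"]), reverse=True)[:fetch]
    return [it["id"] for it in ranked if predicate(it)][:k]

def pre_filter_search(query, items, predicate, k):
    """Apply the filter first, then rank the survivors exactly."""
    allowed = [it for it in items if predicate(it)]
    ranked = sorted(allowed, key=lambda it: dot(query, it["vector"]), reverse=True)
    return [it["id"] for it in ranked[:k]]

# Hypothetical items: the closest vectors belong to the wrong customer.
items = [
    {"id": "a", "vector": [1.0, 0.0], "customer_id": 1},
    {"id": "b", "vector": [0.9, 0.1], "customer_id": 1},
    {"id": "c", "vector": [0.8, 0.2], "customer_id": 2},
    {"id": "d", "vector": [0.1, 0.9], "customer_id": 2},
]
query = [1.0, 0.0]
only_customer_2 = lambda it: it["customer_id"] == 2

post = post_filter_search(query, items, only_customer_2, k=2, fetch=2)
pre = pre_filter_search(query, items, only_customer_2, k=2)
print(post)  # []
print(pre)   # ['c', 'd']
```

Post-filtering returns too few results when the filter is selective, while pre-filtering is exact but gives up the index when done naively. That tension is what filter-aware ANN algorithms try to resolve.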
Three distance functions dominate practical use of embeddings.
| Metric | Formula | Range | When to use |
|---|---|---|---|
| Cosine similarity | (a . b) / (‖a‖ ‖b‖) | -1 to 1 | Default for text embeddings; compares direction and ignores vector magnitude |
| Dot product | a . b | unbounded | When embeddings are pre-normalized to unit length, equivalent to cosine; otherwise rewards larger vectors |
| Euclidean (L2) distance | sqrt(sum_i (a_i - b_i)^2) | 0 to infinity | Image embeddings, geometric problems, k-means clustering |
Most modern text embedding models (OpenAI text-embedding-3, BGE, GTE, Sentence-BERT) output unit-normalized vectors, in which case cosine similarity, dot product, and 2 minus the squared Euclidean distance are monotonically related and yield identical rankings.[3] Choosing one over another in that case is a matter of compute cost: dot product is the cheapest, then cosine, then Euclidean.
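The ranking equivalence follows from the identity ‖a - b‖^2 = 2 - 2 (a . b) for unit vectors. The sketch below checks it on a few hypothetical vectors; the document names and values are made up for illustration.

```python
import math

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Hypothetical embeddings, normalized to unit length.
query = normalize([1.0, 2.0, 0.5])
docs = {
    "d1": normalize([1.0, 1.8, 0.4]),
    "d2": normalize([-1.0, 0.5, 2.0]),
    "d3": normalize([0.2, 1.0, 1.0]),
}

by_dot = sorted(docs, key=lambda d: dot(query, docs[d]), reverse=True)
by_cosine = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
by_l2 = sorted(docs, key=lambda d: squared_l2(query, docs[d]))  # smaller is closer
print(by_dot, by_cosine, by_l2)  # all three: ['d1', 'd3', 'd2']
```

Note the sort direction: similarities are ranked descending, distances ascending.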
When comparing sets of embeddings (e.g., a multi-vector representation of a long document) more sophisticated metrics apply: ColBERT's late interaction sums the maximum cosine similarity from each query token to any document token; SPLADE uses sparse lexical interaction; cross-encoders compare the joint representation of a (query, document) concatenation.[29]
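The late-interaction score described for ColBERT can be sketched in a few lines. The token vectors below are hypothetical, and the function name `maxsim_score` is illustrative rather than taken from the ColBERT codebase.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def maxsim_score(query_tokens, doc_tokens):
    """Sum, over query tokens, of the best cosine match among the document tokens."""
    return sum(max(cosine(q, d) for d in doc_tokens) for q in query_tokens)

# Hypothetical per-token embeddings.
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # has an exact match for each query token
doc_b = [[1.0, 1.0]]                           # a single "average" token

score_a = maxsim_score(query_tokens, doc_a)
score_b = maxsim_score(query_tokens, doc_b)
print(score_a > score_b)  # True
```

Document B is what a single pooled vector tends to look like; the multi-vector representation of document A scores higher because each query token finds its own best match.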
Semantic search replaces keyword matching with vector similarity. The query and corpus documents are embedded with the same model; documents whose embeddings have the highest cosine similarity to the query are returned. This handles synonymy ('automobile' matches 'car'), paraphrase ('how to fix a leaky faucet' matches 'plumbing repair'), and conceptual relatedness ('quiet electric vehicles' matches a Tesla review).[1]
Production systems usually combine semantic search with traditional BM25 keyword search using reciprocal rank fusion or a learned reranker on top, since neither approach dominates the other across all queries.[26]
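Reciprocal rank fusion needs only the rank positions from each retriever, not their raw scores. A minimal sketch, assuming the commonly used constant k = 60 and two hypothetical result lists:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists: each document scores the sum of 1 / (k + rank) over the lists."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical top-3 results from a keyword retriever and a vector retriever.
bm25_results = ["d1", "d2", "d3"]
vector_results = ["d3", "d1", "d4"]

fused = reciprocal_rank_fusion([bm25_results, vector_results])
print(fused)  # ['d1', 'd3', 'd2', 'd4']
```

Documents that appear in both lists rise to the top, which is why the method works without calibrating BM25 scores against cosine similarities.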
Retrieval-augmented generation, introduced by Lewis and colleagues in 2020, has become the dominant pattern for building LLM applications over private data.[30] The pipeline:
The quality of step 3 dominates end-to-end answer quality, which is why embedding model choice is among the highest-leverage decisions in RAG system design.[31]
Embeddings act as feature inputs for downstream classifiers. A logistic regression or small MLP trained on top of frozen embeddings often matches or exceeds fine-tuning the base model when labeled data is scarce. The MTEB classification subset measures exactly this scenario.[21]
For unsupervised analysis, embeddings combined with k-means, HDBSCAN, or agglomerative clustering produce thematic groupings of large text corpora, which is the basis of tools like BERTopic for topic modeling.[32]
User and item embeddings, often trained jointly with a two-tower architecture, power recommendation systems at scale. YouTube's deep neural network recommender (Covington et al. 2016) was an early influential public design; modern systems at TikTok, Spotify, Netflix, Amazon, and Pinterest all rely on dense vector retrieval as the candidate-generation stage.[33]
Objects whose embeddings sit far from their cluster centroid (or have low density in the embedding space) are flagged as outliers. This is used in fraud detection, content moderation, and drug discovery (where unusual molecular embeddings can indicate novel chemistry).[1]
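A minimal sketch of the centroid-distance variant. The vectors are hypothetical, and the rule of flagging anything more than twice the median distance from the centroid is an illustrative threshold, not a standard one.

```python
import math
import statistics

def centroid(vectors):
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

def flag_outliers(items, factor=2.0):
    """Return ids whose distance from the centroid exceeds factor * median distance."""
    center = centroid(list(items.values()))
    distances = {key: math.dist(vec, center) for key, vec in items.items()}
    threshold = factor * statistics.median(distances.values())
    return [key for key, dist in distances.items() if dist > threshold]

# Hypothetical 2-dimensional embeddings: four similar items and one far away.
items = {
    "a": [1.0, 1.0],
    "b": [1.1, 0.9],
    "c": [0.9, 1.1],
    "d": [1.0, 1.2],
    "x": [5.0, 5.0],
}

flagged = flag_outliers(items)
print(flagged)  # ['x']
```

Using the median rather than the mean keeps the threshold from being dragged upward by the very outliers it is meant to catch.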
MinHash-style locality-sensitive hashing on top of embeddings (or direct ANN nearest-neighbor search) finds near-duplicate documents, web pages, or images. Common Crawl, LAION, and the C4 dataset all use embedding-based deduplication as part of their preprocessing pipelines.[34]
Multilingual embedding models (LaBSE, multilingual-E5, BGE-M3, Cohere embed-multilingual-v3) map sentences in different languages into the same vector space. This enables zero-shot cross-lingual retrieval (an English query retrieves Japanese documents) and bitext mining for machine translation training data.[9][35]
Molecular embeddings from models like ChemBERTa, MolFormer, and Uni-Mol, and protein embeddings from ESM-2 and AlphaFold, apply the same dense-vector ideas to chemistry and biology, supporting binding affinity prediction, protein function annotation, and reaction yield prediction.[36]
Embeddings of 768 to 4,096 dimensions cannot be plotted directly. Three dimensionality reduction techniques are commonly applied to project them into 2 or 3 dimensions for human inspection.
| Technique | Year | Preserves | Strengths | Weaknesses |
|---|---|---|---|---|
| PCA (Principal Component Analysis) | 1901 | Global linear variance | Fast, deterministic, invertible | Misses nonlinear structure |
| t-SNE (t-distributed Stochastic Neighbor Embedding) | 2008 (van der Maaten and Hinton) | Local neighborhoods | Reveals tight clusters | Slow, hyperparameter-sensitive, distorts global geometry |
| UMAP (Uniform Manifold Approximation and Projection) | 2018 (McInnes, Healy, Melville) | Local and some global | Faster than t-SNE, better global structure, can transform new data | Stochastic; cluster sizes and distances are not directly meaningful |
UMAP has largely replaced t-SNE as the default tool for embedding visualization in domains like single-cell RNA sequencing and large NLP corpora because it is roughly an order of magnitude faster on millions of points and tends to preserve more of the global structure.[37][38]
Interactive visualization platforms like the TensorFlow Embedding Projector, Nomic Atlas, and the Embedding Atlas hosted on Hugging Face Spaces let users explore embedding spaces with hover, search, and color-by-metadata features.[39]
Embeddings have well-documented failure modes that practitioners must design around:
Embedding spend has two components: embedding generation (charged per token by API providers) and storage plus query in the vector database (charged per gigabyte of vectors and per query).
As of early 2026 the OpenAI text-embedding-3-small endpoint costs about $0.02 per million tokens and text-embedding-3-large costs about $0.13 per million tokens. Voyage, Cohere, and Google charge in a similar range.[16][17] Embedding 100 million 500-token documents with text-embedding-3-small costs roughly $1,000.
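The generation-cost arithmetic is a single multiplication, sketched here with the per-token prices quoted above. The function name is illustrative, and the prices should be treated as inputs that change over time.

```python
def embedding_cost_usd(num_docs, tokens_per_doc, usd_per_million_tokens):
    """One-off cost of embedding a corpus at a flat per-token price."""
    total_tokens = num_docs * tokens_per_doc
    return total_tokens / 1_000_000 * usd_per_million_tokens

# 100 million documents of 500 tokens each is 50 billion tokens.
cost_small = embedding_cost_usd(100_000_000, 500, 0.02)  # about $1,000
cost_large = embedding_cost_usd(100_000_000, 500, 0.13)  # about $6,500
print(round(cost_small), round(cost_large))  # 1000 6500
```

Re-embedding the corpus after a model change incurs the same cost again, so it is worth budgeting for more than one pass.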
Storage cost is dominated by dimensionality. A 1,536-dimensional float32 embedding takes 6 KB; a billion of them is 6 TB. Standard mitigations include:
| Technique | Storage savings | Quality impact |
|---|---|---|
| Float32 to float16 | 2x | Negligible |
| Float32 to int8 quantization | 4x | Small (typically less than 1% on MTEB) |
| Float32 to binary (1-bit) quantization | 32x | Moderate; recoverable with rerank |
| Matryoshka truncation (3,072 to 768) | 4x | Small to moderate, depends on task |
| Product quantization (PQ) | 8x to 32x | Tunable trade-off |
Mixedbread, Cohere, and Voyage all support int8 and binary embedding outputs natively, often combined with a final cosine rerank on the top-100 candidates to recover full-precision quality.[27]
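The binary-quantize-then-rerank pattern can be sketched as follows. It assumes sign-based binarization (one bit per dimension) and recomputes the bits on every call for brevity; a real system would precompute and bit-pack them. All vectors and names are hypothetical.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def binarize(vec):
    """One bit per dimension: 1 where the value is positive, else 0."""
    return [1 if x > 0 else 0 for x in vec]

def hamming(a, b):
    """Number of positions at which two bit vectors differ."""
    return sum(x != y for x, y in zip(a, b))

def search_with_rerank(query, docs, shortlist_size, k):
    """Shortlist by Hamming distance on binary codes, then rerank by full-precision cosine."""
    q_bits = binarize(query)
    shortlist = sorted(docs, key=lambda d: hamming(q_bits, binarize(docs[d])))[:shortlist_size]
    return sorted(shortlist, key=lambda d: cosine(query, docs[d]), reverse=True)[:k]

# Hypothetical 4-dimensional embeddings.
docs = {
    "a": [0.8, 0.2, -0.2, 0.1],
    "b": [0.1, 0.9, -0.1, 0.3],
    "c": [-0.7, -0.2, 0.4, -0.1],
    "d": [0.5, -0.4, -0.6, 0.3],
}
query = [0.9, 0.1, -0.3, 0.2]

top = search_with_rerank(query, docs, shortlist_size=3, k=2)
print(top)  # ['a', 'd']
```

In this example the binary codes cannot distinguish documents a and b from each other (both are at Hamming distance 0 from the query), and the full-precision rerank is what moves d ahead of b.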
Imagine you have a box of different toys like cars, dolls, and balls. Now, we want to sort these toys based on how similar they are. We can use something called 'vector embedding' to help us with this. Vector embedding is like giving each toy a secret code made of numbers. Toys that are similar will have secret codes that are very close to each other, and toys that are not similar will have secret codes that are very different.
For example, let's say we have a red car, a blue car, and a doll. We can give them secret codes like this:
Red car: [1, 2, 3]
Blue car: [1, 2, 4]
Doll: [5, 6, 7]
See how the red car and the blue car have secret codes that are very close to each other, while the doll has a different secret code? That's because the cars are more similar to each other than the doll.
Vector embedding can also be used for words, pictures, sounds, and many other things. It helps computers understand and sort these things by how similar they are, just like we sorted the toys.