Embeddings are dense vector representations of data in a continuous vector space, where semantically similar items are mapped to nearby points. In machine learning and artificial intelligence, embeddings serve as the bridge between raw data (words, sentences, images, audio) and the numerical representations that algorithms can process. Rather than treating data as discrete symbols, embeddings capture meaning, relationships, and context by encoding information into fixed-length arrays of floating-point numbers.
The concept of embeddings has become foundational to modern AI systems. From natural language processing (NLP) to computer vision, recommendation engines to retrieval-augmented generation (RAG), embeddings enable machines to reason about similarity, analogy, and relatedness across virtually any type of data. The development of embedding techniques over the past decade represents one of the most impactful shifts in how AI systems represent and process information.
Before embeddings became standard, machine learning systems typically represented text using sparse, high-dimensional vectors. Bag-of-words models and TF-IDF (term frequency-inverse document frequency) representations created vectors with as many dimensions as there were unique words in the vocabulary, often tens or hundreds of thousands. These sparse representations suffered from several problems: they treated every word as independent of every other word, they could not capture synonymy or polysemy, and they required enormous amounts of memory.
Early work on distributed representations dates back to the 1980s, when Geoffrey Hinton introduced the idea of representing concepts as patterns of activity across multiple processing units in a neural network. Yoshua Bengio and colleagues advanced this idea in 2003 with their neural probabilistic language model, which learned continuous word representations as part of a language modeling task. However, it was not until 2013, when Tomas Mikolov and his team at Google published Word2Vec, that embeddings became practical for large-scale applications.
Word embeddings assign a dense vector to each word in a vocabulary such that words with similar meanings end up close together in the vector space. The major breakthroughs in this area came from three models: Word2Vec, GloVe, and FastText.
Word2Vec was introduced by Tomas Mikolov, Kai Chen, Greg Corrado, and Jeff Dean at Google in 2013. The key insight was that a shallow neural network trained on a simple word prediction task could learn rich semantic representations. Word2Vec offered two training architectures:

- CBOW (Continuous Bag-of-Words): predicts a target word from its surrounding context words; fast to train and slightly better for frequent words.
- Skip-gram: predicts the surrounding context words from a target word; slower to train but better for rare words and small corpora.
Both architectures used a vocabulary-sized softmax output layer, which was computationally expensive. Mikolov introduced two approximation techniques to make training feasible on large corpora: hierarchical softmax (organizing the vocabulary as a binary tree) and negative sampling (training the model to distinguish real context pairs from randomly generated negative pairs).
Word2Vec models were typically trained with 100 to 300 dimensions. Google released a pre-trained model with 300-dimensional vectors trained on roughly 100 billion words from Google News, covering a vocabulary of 3 million words and phrases. One of the most celebrated properties of Word2Vec embeddings was their ability to capture analogies through vector arithmetic. The classic example is: vector("king") - vector("man") + vector("woman") produces a vector closest to vector("queen").
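The analogy arithmetic is easy to reproduce with NumPy. The sketch below uses hand-picked 4-dimensional toy vectors (real Word2Vec vectors are learned from a corpus and have 100 to 300 dimensions):

```python
import numpy as np

# Toy vectors, hand-picked so the analogy works; real Word2Vec
# vectors are learned, not constructed.
vocab = {
    "king":   np.array([0.9, 0.8, 0.1, 0.0]),
    "queen":  np.array([0.9, 0.1, 0.8, 0.0]),
    "man":    np.array([0.1, 0.9, 0.0, 0.1]),
    "woman":  np.array([0.1, 0.1, 0.9, 0.1]),
    "prince": np.array([0.8, 0.7, 0.15, 0.0]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def analogy(a, b, c):
    """Return the word closest to vector(a) - vector(b) + vector(c)."""
    target = vocab[a] - vocab[b] + vocab[c]
    # Exclude the input words themselves, as Word2Vec evaluations do
    candidates = [w for w in vocab if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vocab[w], target))

print(analogy("king", "man", "woman"))  # queen
```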
GloVe (Global Vectors for Word Representation) was developed by Jeffrey Pennington, Richard Socher, and Christopher Manning at Stanford University in 2014. While Word2Vec learned embeddings through local context windows, GloVe combined local context with global corpus statistics.
GloVe constructs a word-word co-occurrence matrix from the entire corpus, where each entry records how frequently two words appear within a specified window of each other. The model then factorizes this co-occurrence matrix using a weighted least-squares objective. The key insight was that ratios of co-occurrence probabilities encode meaning: for instance, the ratio of P(ice | solid) to P(ice | gas) is much larger than 1, while the ratio of P(steam | solid) to P(steam | gas) is much smaller than 1. GloVe's objective function was designed to preserve these ratios in the learned vector space.
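The first step, building the co-occurrence counts, can be sketched in a few lines (toy corpus and window size for illustration; the weighted least-squares factorization that GloVe performs on these counts is omitted):

```python
from collections import defaultdict

def cooccurrence(corpus, window=2):
    """Count how often word pairs appear within `window` tokens of each other."""
    counts = defaultdict(int)
    for sentence in corpus:
        tokens = sentence.lower().split()
        for i, w in enumerate(tokens):
            # Look at neighbors within the window, skipping the word itself
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if i != j:
                    counts[(w, tokens[j])] += 1
    return counts

corpus = ["ice is solid", "steam is gas", "ice is cold solid water"]
counts = cooccurrence(corpus)
print(counts[("ice", "solid")], counts[("ice", "is")])
```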
The Stanford NLP Group released several pre-trained GloVe models:
| Training corpus | Tokens | Vocabulary size | Available dimensions |
|---|---|---|---|
| Wikipedia 2014 + Gigaword 5 | 6B | 400K | 50, 100, 200, 300 |
| Common Crawl (42B) | 42B | 1.9M | 300 |
| Common Crawl (840B) | 840B | 2.2M | 300 |
| Twitter (2B tweets) | 27B | 1.2M | 25, 50, 100, 200 |
GloVe achieved competitive or superior performance to Word2Vec on word analogy, word similarity, and named entity recognition tasks. The 300-dimensional vectors trained on 840 billion tokens became one of the most widely used pre-trained embedding resources in NLP.
FastText was developed by Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov at Facebook AI Research (FAIR) and released in 2016. FastText extended the Word2Vec Skip-gram architecture by incorporating subword information.
Instead of learning a single vector per word, FastText represents each word as a bag of character n-grams (by default, n-grams of length 3 to 6) plus the word itself. The embedding for a word is the sum of the vectors for all its constituent n-grams. This approach provided two major advantages over Word2Vec and GloVe:

- Out-of-vocabulary handling: words never seen during training still receive embeddings, assembled from the n-grams they share with known words.
- Morphological awareness: related forms such as "run", "running", and "runner" share n-grams and therefore share parts of their representations, which especially benefits morphologically rich languages.
Facebook released pre-trained FastText vectors for 157 languages, each with 300 dimensions, trained on Wikipedia and Common Crawl data. FastText also included a text classification component that achieved accuracy competitive with deep learning models while training orders of magnitude faster.
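The subword decomposition is simple to reproduce. A sketch of FastText-style n-gram extraction, using the `<` and `>` boundary markers from the paper:

```python
def char_ngrams(word, n_min=3, n_max=6):
    """FastText-style subword units: boundary markers plus character n-grams."""
    marked = f"<{word}>"  # boundary markers distinguish prefixes and suffixes
    grams = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(marked) - n + 1):
            grams.add(marked[i:i + n])
    grams.add(marked)  # the full word itself is also a unit
    return grams

# "where" with n-grams of length 3-4 only, to keep the output short
print(sorted(char_ngrams("where", 3, 4)))
```

A word's vector is then the sum of its n-gram vectors, which is why an unseen word such as "whereish" would still receive a sensible embedding.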
| Model | Year | Developers | Method | Subword support | Typical dimensions | Key strength |
|---|---|---|---|---|---|---|
| Word2Vec | 2013 | Mikolov et al. (Google) | CBOW / Skip-gram | No | 100-300 | Efficient training, analogy properties |
| GloVe | 2014 | Pennington et al. (Stanford) | Co-occurrence matrix factorization | No | 50-300 | Global statistics, strong benchmarks |
| FastText | 2016 | Bojanowski et al. (Facebook) | Skip-gram with char n-grams | Yes | 100-300 | OOV handling, morphological awareness |
Word2Vec, GloVe, and FastText all produce a single fixed vector for each word, regardless of context. The word "bank" receives the same embedding whether it appears in "river bank" or "investment bank." This fundamental limitation, known as the polysemy problem, motivated the development of contextual embeddings.
Contextual embeddings generate different vector representations for the same word depending on the surrounding text. This approach captures polysemy and allows models to represent word meaning in a context-dependent way.
ELMo (Embeddings from Language Models) was introduced by Matthew Peters and colleagues at the Allen Institute for AI in February 2018. ELMo was the first widely adopted model to produce contextual word embeddings.
ELMo uses a two-layer bidirectional LSTM (biLSTM) trained as a language model on a large text corpus (the 1 Billion Word Benchmark). The forward LSTM predicts the next word given the preceding context, while the backward LSTM predicts the previous word given the following context. The two directions are trained independently and their outputs are concatenated.
For each token, ELMo produces three layers of representations: the character-based word embedding (from a character CNN), the first biLSTM layer output, and the second biLSTM layer output. The final ELMo embedding is a task-specific weighted combination of all three layers, where the weights are learned during fine-tuning on downstream tasks. Research showed that lower layers tend to capture syntactic information, while higher layers capture more semantic information.
ELMo embeddings improved the state of the art across six NLP tasks when its paper was published, and the work received the Best Paper Award at NAACL 2018. However, ELMo's bidirectionality was shallow: the forward and backward LSTMs were trained separately and only concatenated, rather than jointly attending to both left and right context at every layer.
BERT (Bidirectional Encoder Representations from Transformers), introduced by Jacob Devlin and colleagues at Google in October 2018, represented a major leap forward in contextual embeddings. Unlike ELMo's separate forward and backward passes, BERT uses a transformer encoder that attends to the full input sequence simultaneously at every layer, producing deeply bidirectional representations.
BERT was pre-trained on two objectives: Masked Language Modeling (MLM), where the model predicts randomly masked tokens from both left and right context, and Next Sentence Prediction (NSP), where the model learns relationships between sentence pairs. The resulting hidden states can be used as contextual embeddings for downstream tasks.
BERT-Base produces 768-dimensional embeddings, while BERT-Large produces 1024-dimensional embeddings. Researchers found that different layers capture different types of information, and a common practice is to average or concatenate the last four hidden layers to produce general-purpose token embeddings.
The success of BERT led to a family of encoder models that produce contextual embeddings, including RoBERTa, ALBERT, ELECTRA, and DeBERTa. These models are widely used as the backbone for embedding-based applications.
While word and token-level embeddings are useful for many tasks, applications like semantic search, document clustering, and sentence similarity require fixed-length representations of entire sentences or documents.
Sentence-BERT was introduced by Nils Reimers and Iryna Gurevych at the Technical University of Darmstadt in 2019. The authors identified a critical limitation of using BERT directly for sentence similarity: comparing two sentences with BERT requires feeding both sentences into the network simultaneously, making it computationally prohibitive at scale. Finding the most similar pair among 10,000 sentences would require roughly 50 million inference computations, taking about 65 hours.
SBERT solved this by fine-tuning BERT using a siamese network architecture. Two identical BERT models (sharing weights) independently encode two sentences, and a pooling operation (typically mean pooling over token embeddings) produces fixed-size sentence vectors. The network is trained using either a classification objective (with a softmax classifier on the concatenated sentence representations) for NLI data, or a regression objective (minimizing the mean squared error between predicted and gold similarity scores) for STS data.
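Mean pooling itself is straightforward. A NumPy sketch that averages token vectors while ignoring padding positions (the array shapes are illustrative):

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, ignoring padding positions (mask == 0)."""
    mask = attention_mask[:, :, None].astype(float)   # (batch, seq, 1)
    summed = (token_embeddings * mask).sum(axis=1)    # (batch, dim)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)    # avoid divide-by-zero
    return summed / counts

# Two "sentences" of 3 tokens each; the second has one padded position.
tok = np.ones((2, 3, 4))
tok[1, 2] = 100.0                        # padding junk that must be ignored
mask = np.array([[1, 1, 1], [1, 1, 0]])
pooled = mean_pool(tok, mask)
print(pooled[1])  # all ones: the padded position was excluded
```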
The result was a model that could encode sentences into vectors that are directly comparable using cosine similarity. SBERT reduced the time to find the most similar pair from 65 hours to about 5 seconds, while maintaining accuracy comparable to BERT. The Sentence Transformers library, built on top of Hugging Face Transformers, has become the standard framework for training and using sentence embedding models.
The Universal Sentence Encoder (USE) was published by Daniel Cer and colleagues at Google in 2018. It encodes sentences into 512-dimensional vectors and was designed specifically for transfer learning across a wide range of NLP tasks.
USE offered two model variants:

- A transformer-based encoder, which achieves higher accuracy at higher computational cost.
- A Deep Averaging Network (DAN), which averages word and bigram embeddings and feeds the result through a feedforward network, trading some accuracy for much faster inference.
USE was released through TensorFlow Hub and gained popularity for its simplicity: users could encode any English sentence into a fixed-length vector with a single function call. Google later released a multilingual version supporting 16 languages.
Embeddings are not limited to text. In computer vision, image embeddings represent images as dense vectors that capture visual content, style, and semantic meaning.
Convolutional neural networks (CNNs) trained on large-scale image classification tasks like ImageNet learn hierarchical visual features. Early layers detect edges and textures, middle layers capture parts and patterns, and deeper layers represent high-level semantic concepts. Removing the final classification layer from a pre-trained CNN and using the output of the penultimate layer (the layer before the softmax) as an embedding has been a standard technique since the mid-2010s.
Commonly used CNN architectures for image embeddings include:
| Architecture | Year | Penultimate layer dimensions | Notable feature |
|---|---|---|---|
| VGG-16 / VGG-19 | 2014 | 4096 | Simple, widely used baseline |
| ResNet-50 | 2015 | 2048 | Residual connections, deeper networks |
| Inception v3 | 2015 | 2048 | Multi-scale feature extraction |
| EfficientNet | 2019 | 1280-1792 | Compound scaling, efficient |
These CNN-derived embeddings enable applications like image similarity search, visual recommendation systems, and transfer learning for specialized domains (medical imaging, satellite imagery, industrial inspection).
The Vision Transformer (ViT), introduced by Alexey Dosovitskiy and colleagues at Google in 2020, adapted the transformer architecture from NLP to image classification. ViT splits an image into fixed-size patches (typically 16x16 pixels), linearly projects each patch into a vector, and processes the sequence of patch embeddings through a standard transformer encoder. The output of the [CLS] token serves as the image embedding.
ViT-based embeddings have largely supplanted CNN features for many applications, particularly when combined with large-scale pre-training on datasets like LAION-5B or using self-supervised methods like DINOv2.
Multimodal embeddings map data from different modalities (text, images, audio) into a shared vector space, enabling cross-modal comparisons. If a photo of a dog and the text "a photo of a dog" are both embedded into the same space, their vectors should be close together.
CLIP (Contrastive Language-Image Pre-training) was developed by OpenAI and released in January 2021. CLIP jointly trains an image encoder and a text encoder on 400 million image-text pairs collected from the internet, using a contrastive learning objective.
During training, CLIP receives a batch of image-text pairs. The image encoder (a Vision Transformer or ResNet) produces image embeddings, and the text encoder (a transformer) produces text embeddings. The model is trained to maximize the cosine similarity between matching image-text pairs while minimizing similarity between non-matching pairs. Both encoders produce vectors in the same 512-dimensional space (for ViT-B/32) or 768-dimensional space (for ViT-L/14).
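The symmetric contrastive objective can be sketched in NumPy. The `temperature` value and the toy batch below are illustrative; a real implementation would use a framework with automatic differentiation:

```python
import numpy as np

def clip_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of matching image-text pairs."""
    # L2-normalize so dot products are cosine similarities
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature      # (batch, batch) similarity matrix
    labels = np.arange(len(logits))         # pair i matches pair i (diagonal)

    def cross_entropy(l, y):
        l = l - l.max(axis=1, keepdims=True)            # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(y)), y].mean()

    # Average the image-to-text and text-to-image directions
    return (cross_entropy(logits, labels) + cross_entropy(logits.T, labels)) / 2

rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 8))
print(clip_loss(emb, emb))  # perfectly aligned pairs give a low loss
```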
CLIP's shared embedding space enables several applications:

- Zero-shot image classification: class names are embedded as text prompts (such as "a photo of a dog"), and an image is assigned to the class whose text embedding is closest to its image embedding.
- Text-to-image retrieval: searching an image collection with natural-language queries.
- Cross-modal similarity: comparing images to captions, or images to images, within the same vector space.
CLIP's embeddings became the foundation for many downstream systems, including DALL-E 2 (which used CLIP embeddings to guide image generation) and numerous image search engines.
ImageBind was released by Meta AI in May 2023 as a model that learns a joint embedding space across six modalities: images, text, audio, depth, thermal, and inertial measurement unit (IMU) data. The key insight behind ImageBind is that not all modality pairs need to be trained together. By using images as a "binding" modality (since image-paired data exists for each of the other modalities), ImageBind can align all six modalities into a single vector space.
This approach enables emergent cross-modal capabilities that were never explicitly trained. For example, ImageBind can retrieve audio clips given a text query (even though it was never trained on text-audio pairs directly) because both text and audio are aligned to images in the shared space. ImageBind appeared at CVPR 2023 as a highlighted paper.
The embedding model ecosystem has expanded rapidly since 2023, with both commercial API providers and open-source projects releasing increasingly capable models. Modern embedding models typically support Matryoshka Representation Learning (MRL), which trains embeddings so that the first N dimensions of a vector form a useful lower-dimensional embedding on their own. This allows users to truncate embeddings to smaller sizes with minimal loss in quality, reducing storage and computation costs.
OpenAI released its third-generation embedding models in January 2024:

- text-embedding-3-small: 1,536 dimensions, optimized for cost and latency.
- text-embedding-3-large: 3,072 dimensions, higher accuracy at a higher price.

Both models expose a dimensions API parameter that truncates embeddings Matryoshka-style, trading quality for storage and speed.
These models replaced the earlier text-embedding-ada-002 (released in December 2022), which produced 1,536-dimensional embeddings and was one of the first widely adopted commercial embedding APIs.
Cohere's Embed v4, released in 2025, is a multimodal embedding model supporting both text and images. It handles interleaved text and image content, making it suitable for document understanding and visual search. Key specifications include a default dimension of 1,536 (with MRL support for 256, 512, and 1,024 dimensions), a context window of approximately 128,000 tokens, and support for multiple output formats including float, int8, uint8, binary, and ubinary precision.
Voyage AI, acquired by Anthropic, has released several competitive embedding models. Voyage-3-large (January 2025) offers 1,024-dimensional embeddings with MRL support (256, 512, 2,048 dimensions) and has been shown to outperform OpenAI's text-embedding-3-large by an average of 9.74% across evaluated domains. Voyage-3.5 (May 2025) further improved quality with 2,048-dimensional embeddings and a 32,000-token context window, reducing vector database costs by 83% compared to OpenAI's text-embedding-3-large when using int8 quantization.
The BGE series from the Beijing Academy of Artificial Intelligence (BAAI) has been a leading open-source embedding family. BGE-M3 (released January 2024) stands for Multi-linguality, Multi-granularity, and Multi-functionality:

- Multi-linguality: supports more than 100 working languages.
- Multi-granularity: handles inputs ranging from short sentences to documents of up to 8,192 tokens.
- Multi-functionality: performs dense retrieval, sparse (lexical) retrieval, and multi-vector (ColBERT-style) retrieval within a single model.
BGE-M3 produces 1,024-dimensional dense embeddings and has ranked at or near the top of the MTEB (Massive Text Embedding Benchmark) leaderboard.
| Model | Developer | Dimensions | Max tokens | Key feature |
|---|---|---|---|---|
| Jina Embeddings v3 | Jina AI | 1024 (MRL: 32-1024) | 8,192 | Task-specific LoRA adapters, 570M params |
| Nomic Embed Text v1.5 | Nomic AI | 768 (MRL: 64-768) | 8,192 | Fully open-source, reproducible training |
| Gemini Embedding | Google | 3,072 | 8,192 | Distilled from Gemini LLM, multimodal v2 |
| mixedbread mxbai-embed-large | mixedbread.ai | 1,024 | 512 | Top open-source MTEB performer |
The following table summarizes the major embedding models available as of early 2026:
| Model | Provider | Release | Dimensions | Max tokens | Open source | MRL support | Multimodal |
|---|---|---|---|---|---|---|---|
| text-embedding-3-small | OpenAI | Jan 2024 | 1,536 | 8,191 | No | Yes | No |
| text-embedding-3-large | OpenAI | Jan 2024 | 3,072 | 8,191 | No | Yes | No |
| Embed v4 | Cohere | 2025 | 1,536 | ~128,000 | No | Yes | Yes |
| Voyage-3-large | Voyage AI | Jan 2025 | 1,024 | 32,000 | No | Yes | No |
| Voyage-3.5 | Voyage AI | May 2025 | 2,048 | 32,000 | No | Yes | No |
| BGE-M3 | BAAI | Jan 2024 | 1,024 | 8,192 | Yes | No | No |
| Jina Embeddings v3 | Jina AI | Sep 2024 | 1,024 | 8,192 | Yes | Yes | No |
| Nomic Embed v1.5 | Nomic AI | 2024 | 768 | 8,192 | Yes | Yes | No |
| Gemini Embedding 001 | Google | 2024 | 3,072 | 8,192 | No | Yes | No |
| Gemini Embedding 2 | Google | 2025 | 3,072 | 8,192 | No | Yes | Yes |
The dimensionality of an embedding vector is one of the most important design choices in any embedding-based system. It affects retrieval quality, storage costs, computation speed, and memory usage.
Each dimension in an embedding vector represents a learned feature or concept. In a well-trained embedding model, individual dimensions rarely correspond to interpretable human concepts, but collectively the dimensions form a coordinate system where geometric relationships (distances and angles) encode semantic relationships. Higher-dimensional spaces can, in principle, capture finer-grained distinctions between concepts, because there are more axes along which items can differ.
Increasing embedding dimensions generally improves retrieval quality, but with diminishing returns. Research and benchmarks have consistently shown that:

- Moving from very low dimensions (64 to 128) into the mid range (256 to 768) yields large accuracy gains.
- Beyond roughly 1,000 dimensions, gains on general-purpose retrieval benchmarks become marginal.
- The largest dimension counts mainly help on specialized or fine-grained tasks where subtle distinctions matter.
For most production RAG systems, 384 to 768 dimensions deliver a strong balance of accuracy, speed, and cost. The accuracy curve typically flattens somewhere between 768 and 1,024 dimensions for general-purpose retrieval.
Storage and computation costs scale linearly with dimensions. A concrete example: a collection of 10 million vectors costs roughly $3.75 per month at 384 dimensions versus $30 per month at 3,072 dimensions (using float32 storage). Query latency also scales with dimension count; computing similarity between 384-dimensional vectors is roughly 4 times faster than for 1,536-dimensional vectors.
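The linear scaling is easy to sanity-check. The sketch below computes raw float32 storage only, ignoring index overhead and any particular provider's pricing:

```python
def storage_gb(num_vectors, dims, bytes_per_value=4):
    """Raw storage for float32 vectors, ignoring index overhead."""
    return num_vectors * dims * bytes_per_value / 1e9

# 10 million vectors at several common dimension counts
for dims in (384, 768, 1536, 3072):
    print(f"{dims:>5} dims: {storage_gb(10_000_000, dims):8.2f} GB")
```

Doubling the dimension count doubles both the storage footprint and the number of multiply-adds per similarity computation.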
Matryoshka Representation Learning (MRL), introduced by Aditya Kusupati and colleagues at NeurIPS 2022, addresses the dimension tradeoff by training a single model to produce embeddings that are useful at multiple scales simultaneously. The name refers to Russian nesting dolls (matryoshka): the first 256 dimensions of a 1,024-dimensional embedding form a useful 256-dimensional embedding on their own.
MRL works by computing the training loss at multiple dimension checkpoints during training. For example, the model computes the contrastive loss for the first 64, 128, 256, 512, and 1,024 dimensions separately and sums the losses. This forces the model to pack the most important information into the first dimensions.
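The idea can be sketched as follows, substituting a simple (1 - cosine) pair loss for the real contrastive objective at each dimension checkpoint:

```python
import numpy as np

def mrl_loss(anchor, positive, dims=(64, 128, 256, 512, 1024)):
    """Sum a simplified (1 - cosine) loss over nested prefix lengths.

    Real MRL computes a contrastive loss at each checkpoint; the pair
    loss here just keeps the sketch short.
    """
    total = 0.0
    for d in dims:
        a, p = anchor[:d], positive[:d]  # truncate to the prefix
        cos = a @ p / (np.linalg.norm(a) * np.linalg.norm(p))
        total += 1.0 - cos
    return total

rng = np.random.default_rng(0)
a = rng.normal(size=1024)
print(mrl_loss(a, a))  # identical vectors: near-zero loss at every prefix
```

Because every prefix contributes to the loss, the model is penalized unless the most discriminative information lands in the earliest dimensions.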
The practical benefits of MRL are substantial:

- A single model serves many dimension budgets, so separate models need not be trained for different cost targets.
- Stored vectors can be truncated after the fact, cutting storage and search costs roughly in proportion to the dimension reduction with only a small loss in quality.
- Retrieval can be staged: a candidate shortlist is computed with truncated vectors, then reranked with full-length vectors.
Most modern commercial embedding models (OpenAI text-embedding-3, Cohere Embed v4, Voyage, Jina v3, Nomic Embed v1.5) support MRL.
Once data is embedded into vector space, comparing items requires a distance or similarity function. The three most common measures for embeddings are cosine similarity, dot product, and Euclidean distance.
Cosine similarity measures the angle between two vectors, ignoring their magnitudes:
cosine_similarity(A, B) = (A . B) / (||A|| * ||B||)
Values range from -1 (opposite directions) to 1 (same direction), with 0 indicating orthogonality (no similarity). Cosine similarity is the most widely used metric for text embeddings because it normalizes for vector length, making it robust to differences in document length or embedding magnitude. Most embedding model providers recommend cosine similarity as the default metric.
The dot product (inner product) computes the sum of element-wise products of two vectors:
dot_product(A, B) = sum(A_i * B_i)
Unlike cosine similarity, the dot product is affected by both the angle and the magnitude of the vectors. This makes it useful when magnitude carries meaning, for example, in recommendation systems where a larger magnitude might indicate stronger user preference or item popularity. When vectors are L2-normalized (unit vectors), cosine similarity and dot product produce identical rankings.
Euclidean distance measures the straight-line distance between two points in the vector space:
euclidean_distance(A, B) = sqrt(sum((A_i - B_i)^2))
Smaller distances indicate greater similarity. Euclidean distance is sensitive to both direction and magnitude and works well when absolute differences in feature values matter, such as in clustering tasks or when comparing feature vectors with count-based attributes.
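The three measures can be compared side by side in NumPy, using a pair of vectors that point in the same direction but differ in magnitude:

```python
import numpy as np

def cosine_similarity(a, b):
    # Angle only: magnitudes cancel out
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def dot_product(a, b):
    # Angle and magnitude both contribute
    return float(a @ b)

def euclidean_distance(a, b):
    # Straight-line distance between the two points
    return float(np.linalg.norm(a - b))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])  # same direction, twice the magnitude

print(cosine_similarity(a, b))   # ~1.0: identical direction
print(dot_product(a, b))         # 28.0: magnitude matters
print(euclidean_distance(a, b))  # ~3.74: the points are apart
```

Note that after L2-normalizing both vectors, the dot product equals the cosine similarity, which is why the two metrics rank results identically for unit vectors.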
| Metric | Range | Considers magnitude | Best for | Notes |
|---|---|---|---|---|
| Cosine similarity | -1 to 1 | No | Text search, semantic similarity | Most common for text embeddings |
| Dot product | -inf to +inf | Yes | Recommendations, when magnitude matters | Equivalent to cosine for normalized vectors |
| Euclidean distance | 0 to +inf | Yes | Clustering, spatial analysis | Lower values mean more similar |
A practical rule of thumb: match the similarity metric to the one used during the embedding model's training. Most text embedding models are trained with cosine similarity or a contrastive objective that normalizes embeddings, so cosine similarity is usually the right default.
Embeddings underpin a wide range of modern AI applications.
Traditional keyword search relies on exact or fuzzy string matching, which fails when users phrase queries differently from the stored documents. Semantic search uses embeddings to match queries to documents based on meaning rather than keywords. Both the query and the corpus documents are embedded using the same model, and the documents most similar to the query embedding (by cosine similarity) are returned as results.
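A minimal sketch of the ranking step, with random vectors standing in for real document and query embeddings produced by a model:

```python
import numpy as np

def search(query_vec, doc_vecs, top_k=3):
    """Rank documents by cosine similarity to the query embedding."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q                          # cosine similarity per document
    top = np.argsort(-scores)[:top_k]
    return [(int(i), float(scores[i])) for i in top]

rng = np.random.default_rng(42)
docs = rng.normal(size=(100, 384))            # stand-in document embeddings
query = docs[17] + 0.1 * rng.normal(size=384)  # a query near document 17
print(search(query, docs))  # document 17 ranks first
```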
Semantic search handles synonyms ("car" matches "automobile"), paraphrases ("how to fix a flat tire" matches a document titled "tire repair instructions"), and even cross-lingual queries when multilingual embeddings are used. Companies like Google, Bing, and Spotify use embedding-based semantic search in their core products.
Recommendation systems use embeddings to represent users and items (products, movies, songs, articles) in a shared vector space. Users who have similar tastes end up with similar user embeddings, and items that are frequently consumed together end up with similar item embeddings. Recommendations are generated by finding items whose embeddings are closest to the user's embedding.
Collaborative filtering with embeddings has largely replaced explicit feature engineering in modern recommendation systems. Platforms like Netflix, YouTube, and Amazon use embedding-based approaches to power their recommendation engines.
Embeddings enable clustering of documents, sentences, or images by grouping similar vectors together using algorithms like k-means, DBSCAN, or HDBSCAN. This is useful for topic discovery, customer feedback analysis, content organization, and anomaly detection. Because embeddings capture semantic meaning, clusters formed from embeddings tend to be more meaningful than those formed from bag-of-words representations.
Embeddings serve as input features for classification tasks. Rather than training a model from raw text or images, practitioners encode the data into embeddings and train a lightweight classifier (logistic regression, SVM, or a small neural network) on top. This approach, sometimes called "embedding + head," is fast to train, requires fewer labeled examples (because the embedding model has already learned general representations), and often achieves competitive accuracy.
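A sketch of the "embedding + head" pattern, here with a nearest-centroid head in plain NumPy (logistic regression or an SVM would be more common in practice) and random vectors standing in for frozen embeddings:

```python
import numpy as np

class CentroidClassifier:
    """Lightweight head over frozen embeddings: label = nearest class centroid."""

    def fit(self, embeddings, labels):
        labels = np.asarray(labels)
        self.classes = sorted(set(labels.tolist()))
        self.centroids = np.stack(
            [embeddings[labels == c].mean(axis=0) for c in self.classes])
        return self

    def predict(self, embeddings):
        dists = np.linalg.norm(
            embeddings[:, None] - self.centroids[None], axis=2)
        return [self.classes[i] for i in np.argmin(dists, axis=1)]

rng = np.random.default_rng(0)
# Stand-ins for embeddings of two classes, separated in embedding space
pos = rng.normal(loc=+1.0, size=(50, 16))
neg = rng.normal(loc=-1.0, size=(50, 16))
X = np.vstack([pos, neg])
y = [1] * 50 + [0] * 50
clf = CentroidClassifier().fit(X, y)
print(clf.predict(rng.normal(loc=+1.0, size=(5, 16))))  # mostly 1s
```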
RAG is one of the most important applications of embeddings in the era of large language models. A RAG system combines a retrieval component (powered by embeddings) with a generative LLM to produce answers grounded in external knowledge. Embeddings are central to the retrieval step.
The RAG pipeline consists of two phases:

Indexing phase (offline):
1. Split source documents into chunks (for example, a few hundred tokens each).
2. Embed each chunk with an embedding model.
3. Store the chunk vectors, along with the chunk text and metadata, in a vector database.

Query phase (online):
1. Embed the user's query with the same embedding model.
2. Retrieve the top-k chunks whose vectors are most similar to the query vector.
3. Insert the retrieved chunks into the LLM prompt as context.
4. The LLM generates an answer grounded in the retrieved content.
The quality of a RAG system depends heavily on the quality of its embeddings. Better embeddings produce more relevant retrievals, which lead to more accurate and grounded LLM responses. Organizations deploying RAG systems increasingly fine-tune their embedding models on domain-specific data to improve retrieval quality for their particular use case.
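The two phases described above can be sketched end to end. The `embed` function below is a deliberately crude stand-in (hashed character trigrams) for a real embedding model, and the chunks are illustrative:

```python
import hashlib
import numpy as np

def embed(text, dim=64):
    """Stand-in embedder: hash character trigrams into a fixed-size vector.
    A real system would call an embedding model here."""
    v = np.zeros(dim)
    t = text.lower()
    for i in range(len(t) - 2):
        h = int(hashlib.md5(t[i:i + 3].encode()).hexdigest(), 16)
        v[h % dim] += 1.0
    n = np.linalg.norm(v)
    return v / n if n else v

# Indexing phase (offline): chunk, embed, store
chunks = [
    "Password reset: go to Settings and choose 'Forgot password'.",
    "Billing questions are handled by the finance team.",
    "The API rate limit is 100 requests per minute.",
]
index = np.stack([embed(c) for c in chunks])

# Query phase (online): embed the query, retrieve, build the prompt
query = "how do I reset my password"
scores = index @ embed(query)                 # cosine similarities
best = chunks[int(np.argmax(scores))]
prompt = f"Context: {best}\n\nQuestion: {query}\nAnswer:"
print(prompt)
```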
Embeddings can identify outliers in datasets by flagging items whose embedding vectors are far from any cluster center or from the distribution of normal examples. This technique is applied in fraud detection, network security, quality control in manufacturing, and monitoring of ML model inputs for data drift.
Near-duplicate detection uses embeddings to find items that are semantically identical or nearly identical, even if they differ in surface form. This is applied in data cleaning, content moderation (detecting reposted content), and dataset curation for training ML models.
As embedding-based applications have grown, so has the need for specialized databases optimized for storing, indexing, and querying high-dimensional vectors at scale. Vector databases (sometimes called vector stores) provide efficient nearest-neighbor search over millions or billions of embedding vectors.
Vector databases use approximate nearest neighbor (ANN) algorithms to make similarity search fast. Exact nearest-neighbor search requires comparing a query vector against every stored vector, which is O(n) and impractical at scale. ANN algorithms trade a small amount of accuracy for large speed improvements by building index structures that allow the database to quickly narrow down the search space.
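A toy IVF-style index illustrates the tradeoff: cluster the vectors once, then search only the few clusters nearest the query. All sizes here are illustrative, and a production system would use a library such as FAISS rather than this sketch:

```python
import numpy as np

def build_ivf(vectors, n_lists=8, n_iter=10, seed=0):
    """Toy IVF index: k-means centroids plus one posting list per centroid."""
    rng = np.random.default_rng(seed)
    centroids = vectors[rng.choice(len(vectors), n_lists, replace=False)]
    for _ in range(n_iter):
        assign = np.argmin(
            np.linalg.norm(vectors[:, None] - centroids[None], axis=2), axis=1)
        for c in range(n_lists):
            members = vectors[assign == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    # Final assignment against the final centroids
    assign = np.argmin(
        np.linalg.norm(vectors[:, None] - centroids[None], axis=2), axis=1)
    lists = {c: np.where(assign == c)[0] for c in range(n_lists)}
    return centroids, lists

def ivf_search(query, vectors, centroids, lists, n_probe=2):
    """Probe only the n_probe posting lists nearest to the query."""
    nearest = np.argsort(np.linalg.norm(centroids - query, axis=1))[:n_probe]
    candidates = np.concatenate([lists[c] for c in nearest])
    dists = np.linalg.norm(vectors[candidates] - query, axis=1)
    return int(candidates[np.argmin(dists)])

rng = np.random.default_rng(1)
data = rng.normal(size=(1000, 32))
centroids, lists = build_ivf(data)
query = data[123] + 0.01 * rng.normal(size=32)  # a query near vector 123
print(ivf_search(query, data, centroids, lists))
```

With `n_probe=2` of 8 lists, each query scans only about a quarter of the vectors; raising `n_probe` trades speed back for recall.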
Common ANN indexing algorithms include:

- HNSW (Hierarchical Navigable Small World): a layered proximity graph navigated greedily from coarse to fine; high recall and speed, but memory-intensive.
- IVF (Inverted File Index): clusters the vectors and searches only the clusters nearest the query.
- PQ (Product Quantization): compresses vectors into compact codes, often combined with IVF (IVF-PQ) to reduce memory.
- LSH (Locality-Sensitive Hashing): hashes similar vectors into the same buckets.
- DiskANN: a graph-based index designed to serve billion-scale datasets from SSD.
| Database | Type | Language | ANN algorithm | Key strength | Scale |
|---|---|---|---|---|---|
| Pinecone | Managed cloud service | - | Proprietary | Fully managed, zero-ops | Billions of vectors |
| Weaviate | Open source / cloud | Go | HNSW | Hybrid search, GraphQL API | Billions of vectors |
| Milvus | Open source / cloud (Zilliz) | Go / C++ | IVF, HNSW, DiskANN | Distributed, high throughput | Billions of vectors |
| Chroma | Open source | Python | HNSW | Developer-friendly, lightweight | Millions of vectors |
| Qdrant | Open source / cloud | Rust | HNSW | Fast filtering, Rust performance | Billions of vectors |
| pgvector | PostgreSQL extension | C | IVFFlat, HNSW | Integrates with existing Postgres | Tens of millions of vectors |
| FAISS | Library (not a database) | C++ / Python | IVF, HNSW, PQ | Facebook AI research library, very fast | Billions of vectors |
Pinecone is a fully managed vector database that handles scaling, indexing, and infrastructure automatically. It is popular among teams that want to build embedding-based applications without managing database infrastructure. Pinecone supports metadata filtering, namespaces for multi-tenancy, and serverless deployment.
Weaviate is an open-source vector database written in Go that combines vector search with structured filtering and a GraphQL API. Weaviate includes built-in vectorization modules that can automatically embed data on ingestion using models like OpenAI, Cohere, or Hugging Face transformers. Its hybrid search feature combines dense vector search with BM25 keyword search.
Milvus is a distributed open-source vector database originally developed by Zilliz. It is designed for high-throughput scenarios and supports multiple index types (IVF-Flat, IVF-PQ, HNSW, DiskANN). Milvus can handle datasets with billions of vectors through its distributed architecture with separate storage and compute nodes. Zilliz Cloud is the managed version.
Chroma is a lightweight, developer-friendly vector database designed for rapid prototyping and small to medium-scale applications. It runs in-process (embedded mode) or as a standalone server and is a popular choice for developers building RAG applications quickly. Chroma is not designed for billion-scale datasets.
Qdrant is an open-source vector database written in Rust, optimized for performance and reliability. It supports HNSW indexing with quantization, advanced filtering with payload (metadata) indexes, and distributed deployment. Qdrant provides ACID-compliant transactions and horizontal scaling.
pgvector is a PostgreSQL extension that adds vector similarity search capabilities to existing PostgreSQL databases. It supports IVFFlat and HNSW indexing. The main advantage of pgvector is that it allows teams already using PostgreSQL to add vector search without introducing a new database into their stack. However, pgvector is generally suited for datasets up to tens of millions of vectors; larger datasets may require a dedicated vector database.
FAISS (Facebook AI Similarity Search) is a library, not a database, developed by Meta AI Research. It provides highly optimized implementations of several ANN algorithms and is often used as the search engine underlying other vector databases. FAISS supports GPU acceleration and can handle billion-scale datasets.
Pre-trained embedding models produce general-purpose representations that work well across many tasks. However, for specific domains or applications, fine-tuning the embedding model on domain-specific data can yield significant quality improvements. Benchmark studies from 2025 indicate that domain-specialized embedding models can outperform general-purpose models by 12 to 30 percent on industry-specific retrieval tasks.
General-purpose embedding models are trained on broad internet-scale data. They may not accurately represent the vocabulary, concepts, or similarity relationships specific to a particular domain. In legal text, for example, "consideration" is a term of art in contract law, while a general-purpose model will weight its everyday meaning of thoughtfulness.
Fine-tuning adapts the model's representations to better capture domain-specific semantics.
Contrastive learning is the most common approach for fine-tuning embedding models. The model is trained to produce similar embeddings for semantically related items (positive pairs) and dissimilar embeddings for unrelated items (negative pairs). Common contrastive loss functions include:
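The in-batch contrastive objective (the idea behind losses such as InfoNCE, also known in the Sentence Transformers library as multiple-negatives ranking loss) can be sketched in a few lines of NumPy. This is an illustrative formulation, not a specific library's implementation; the temperature value is a common but arbitrary choice.

```python
import numpy as np

def info_nce_loss(query_emb: np.ndarray, doc_emb: np.ndarray,
                  temperature: float = 0.05) -> float:
    """In-batch contrastive (InfoNCE) loss.

    Row i of query_emb pairs with row i of doc_emb (the positive);
    every other row in the batch serves as an in-batch negative.
    """
    # L2-normalize so dot products are cosine similarities
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    d = doc_emb / np.linalg.norm(doc_emb, axis=1, keepdims=True)
    logits = q @ d.T / temperature                   # (batch, batch)
    # Cross-entropy with the diagonal (the true pair) as the target class
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

rng = np.random.default_rng(0)
docs = rng.standard_normal((8, 32))
aligned = docs + 0.05 * rng.standard_normal((8, 32))  # matched pairs
mismatched = np.roll(docs, 1, axis=0)                 # every pair is wrong
print(info_nce_loss(aligned, docs) < info_nce_loss(mismatched, docs))  # True
```

Training pushes the loss down by pulling each query toward its positive document and away from the other documents in the batch, which is why larger batch sizes (more in-batch negatives) generally improve embedding quality.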
Hard negative mining improves fine-tuning quality by selecting negative examples that are challenging for the model (items that are similar but not relevant). Random negatives are often too easy, providing little learning signal. Hard negatives force the model to learn more nuanced distinctions.
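A minimal sketch of hard negative mining: use the current model's embeddings to find the corpus items most similar to each query that are not the labeled positive. Function and variable names here are our own.

```python
import numpy as np

def mine_hard_negatives(query_emb: np.ndarray, corpus_emb: np.ndarray,
                        positive_idx: np.ndarray, n_hard: int = 5) -> np.ndarray:
    """For each query, return the n_hard most similar corpus items
    that are NOT the labeled positive -- the 'hard' negatives."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    c = corpus_emb / np.linalg.norm(corpus_emb, axis=1, keepdims=True)
    sims = q @ c.T                                    # (n_queries, n_corpus)
    # Exclude each query's known positive from consideration
    sims[np.arange(len(q)), positive_idx] = -np.inf
    # Indices of the n_hard highest-similarity remaining items
    return np.argsort(-sims, axis=1)[:, :n_hard]

rng = np.random.default_rng(1)
corpus = rng.standard_normal((100, 16))
queries = corpus[:4] + 0.1 * rng.standard_normal((4, 16))
hard = mine_hard_negatives(queries, corpus, positive_idx=np.arange(4))
print(hard.shape)  # (4, 5): five hard negatives per query
```

In practice this mining step is often run with a pre-trained model before fine-tuning, or iteratively between training rounds as the model's notion of similarity improves.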
Knowledge distillation uses a larger, more capable model (teacher) to generate soft labels that a smaller model (student) learns to replicate. This approach is used by models like Gecko, which distills retrieval knowledge from large language models into a compact embedding model.
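One common way to express the distillation objective for embeddings is to have the student match the teacher's similarity distribution over a batch (soft labels) via KL divergence. This is one possible formulation for illustration; actual recipes, including Gecko's, differ in their details.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sim_matrix(emb: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarities within a batch."""
    n = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    return n @ n.T

def distillation_loss(student_emb: np.ndarray, teacher_emb: np.ndarray,
                      temperature: float = 2.0) -> float:
    """KL(teacher || student) between softened similarity distributions:
    the student learns to reproduce the teacher's notion of relatedness."""
    t = softmax(sim_matrix(teacher_emb) / temperature)
    s = softmax(sim_matrix(student_emb) / temperature)
    return float(np.mean(np.sum(t * (np.log(t) - np.log(s)), axis=1)))

rng = np.random.default_rng(0)
teacher = rng.standard_normal((6, 16))
student_init = rng.standard_normal((6, 16))
print(distillation_loss(student_init, teacher))  # > 0: to be minimized
print(distillation_loss(teacher, teacher))       # 0.0: perfect match
```

The student can be much smaller (fewer dimensions, fewer layers) than the teacher, since it only needs to reproduce the teacher's similarity structure, not its full behavior.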
Fine-tuning embeddings typically requires pairs or triplets of examples:
| Data format | Structure | Example |
|---|---|---|
| Positive pairs | (query, relevant document) | ("how to reset password", "Password reset instructions: Go to Settings...") |
| Triplets | (anchor, positive, negative) | ("python list sort", "Use the sort() method...", "Python was created by Guido...") |
| Scored pairs | (sentence A, sentence B, similarity score) | ("The cat sat on the mat", "A feline rested on the rug", 0.85) |
Datasets of 10,000 to 100,000 examples are typically sufficient for meaningful improvements, though more data generally helps. The Sentence Transformers library provides built-in support for loading these data formats and training with various loss functions.
As embedding-based systems scale to millions or billions of vectors, storage and computation costs become significant. Several techniques reduce these costs.
Scalar quantization converts 32-bit floating-point values to lower-precision formats. Converting from float32 to int8 (8-bit integers) reduces storage by 4x with minimal quality loss (typically less than 0.3 percent degradation). Binary quantization (1-bit) provides 32x compression but with more noticeable quality loss. Cohere Embed v4 supports float, int8, uint8, binary, and ubinary output formats natively.
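A minimal sketch of symmetric scalar quantization from float32 to int8, using a single per-dataset scale factor (production systems often use per-dimension or per-vector scales):

```python
import numpy as np

def quantize_int8(vectors: np.ndarray):
    """Map float32 values to int8 with one global scale factor.
    Storage drops from 4 bytes to 1 byte per value (4x compression)."""
    scale = float(np.abs(vectors).max()) / 127.0
    q = np.round(vectors / scale).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
emb = rng.standard_normal((1000, 256)).astype(np.float32)
q, scale = quantize_int8(emb)
recon = dequantize_int8(q, scale)
print(q.nbytes / emb.nbytes)       # 0.25 -> 4x smaller
print(float(np.abs(emb - recon).max()))  # bounded by scale / 2
```

Because the worst-case error per value is half the quantization step, similarity rankings are largely preserved, which is why the measured quality loss is so small.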
Product quantization (PQ) divides each vector into subvectors and quantizes each subvector independently using a learned codebook. This technique, widely used in FAISS and Milvus, can achieve 8 to 64x compression ratios while maintaining reasonable search accuracy.
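The mechanics of PQ can be sketched end to end: split each vector into m subvectors, learn a small codebook per subspace (here with a deliberately minimal k-means), then store only the codeword indices. This is a toy illustration; FAISS and Milvus use far more refined training.

```python
import numpy as np

def kmeans(data: np.ndarray, k: int, iters: int = 10, seed: int = 0) -> np.ndarray:
    """Minimal Lloyd's k-means -- just enough for a demo codebook."""
    rng = np.random.default_rng(seed)
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(iters):
        d = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        assign = d.argmin(1)
        for j in range(k):
            pts = data[assign == j]
            if len(pts):
                centroids[j] = pts.mean(0)
    return centroids

def pq_train(vectors: np.ndarray, m: int = 4, k: int = 16) -> list:
    """Learn one k-entry codebook per subspace (m subvectors per vector)."""
    subdim = vectors.shape[1] // m
    return [kmeans(vectors[:, i*subdim:(i+1)*subdim], k, seed=i) for i in range(m)]

def pq_encode(vectors: np.ndarray, codebooks: list) -> np.ndarray:
    """Replace each subvector with the index of its nearest codeword."""
    m, subdim = len(codebooks), vectors.shape[1] // len(codebooks)
    codes = np.empty((len(vectors), m), dtype=np.uint8)
    for i, cb in enumerate(codebooks):
        sub = vectors[:, i*subdim:(i+1)*subdim]
        d = ((sub[:, None, :] - cb[None, :, :]) ** 2).sum(-1)
        codes[:, i] = d.argmin(1)
    return codes

def pq_decode(codes: np.ndarray, codebooks: list) -> np.ndarray:
    return np.hstack([cb[codes[:, i]] for i, cb in enumerate(codebooks)])

rng = np.random.default_rng(0)
emb = rng.standard_normal((500, 32)).astype(np.float32)
cbs = pq_train(emb, m=4, k=16)
codes = pq_encode(emb, cbs)        # 4 bytes per vector instead of 128
recon = pq_decode(codes, cbs)
print(codes.nbytes / emb.nbytes)   # 1/32 compression
```

At query time, search proceeds over the compressed codes (often via precomputed query-to-codeword distance tables) rather than the reconstructed vectors, which is where PQ's speed advantage comes from.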
Beyond MRL (described above), traditional dimensionality reduction techniques like PCA (Principal Component Analysis) can be applied post-hoc to reduce embedding dimensions. Combining float8 quantization with PCA (retaining 50 percent of dimensions) can yield 8x total compression with less quality degradation than int8 quantization alone.
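Post-hoc PCA reduction can be sketched with an SVD on the centered embedding matrix; the synthetic data below (embeddings whose variance lives mostly in a low-dimensional subspace) is an assumption for illustration.

```python
import numpy as np

def pca_reduce(vectors: np.ndarray, n_components: int):
    """Project embeddings onto their top principal components.
    Keeps the directions of highest variance; returns the projection,
    the components, and the mean needed to apply it to new vectors."""
    mean = vectors.mean(axis=0)
    centered = vectors - mean
    # Right singular vectors of the centered data are the principal axes
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    return centered @ components.T, components, mean

rng = np.random.default_rng(0)
# Simulate 768-d embeddings whose variance lives mostly in 50 directions
latent = rng.standard_normal((2000, 50))
proj = rng.standard_normal((50, 768))
emb = latent @ proj + 0.01 * rng.standard_normal((2000, 768))
reduced, comps, mean = pca_reduce(emb, n_components=384)
print(reduced.shape)   # (2000, 384): half the storage per vector
```

New vectors (queries) must be reduced with the same `components` and `mean` learned from the indexed corpus, or query and document vectors will no longer live in the same space.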
Embedding models are evaluated on standardized benchmarks to compare their quality across tasks.
MTEB is the most comprehensive benchmark for text embedding models. It covers 8 task categories: classification, clustering, pair classification, reranking, retrieval, semantic textual similarity (STS), summarization, and bitext mining. MTEB evaluates models across dozens of datasets in multiple languages. The MTEB leaderboard on Hugging Face is the primary reference for comparing embedding model quality.
BEIR is a heterogeneous benchmark for evaluating information retrieval models across 18 datasets spanning diverse domains (biomedical, financial, scientific, etc.) and task types (question answering, fact checking, entity retrieval). BEIR specifically tests zero-shot generalization, measuring how well retrieval models perform on domains and tasks not seen during training.
The Semantic Textual Similarity Benchmark measures how well embeddings capture sentence-level semantic similarity. Pairs of sentences are scored for similarity by human annotators, and the model's cosine similarity scores are compared to these judgments using Spearman or Pearson correlation.
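The evaluation loop is simple enough to sketch: compute cosine similarity for each sentence pair, then rank-correlate against the human scores. The synthetic "embeddings" and scores below are placeholders for a real model's output and a real STS dataset; the Spearman implementation ignores ties for brevity.

```python
import numpy as np

def spearman(a: np.ndarray, b: np.ndarray) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks.
    (No tie handling -- fine for continuous scores.)"""
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    return float(np.corrcoef(ra, rb)[0, 1])

def sts_score(emb_a: np.ndarray, emb_b: np.ndarray, human_scores) -> float:
    """Cosine similarity per sentence pair, correlated with human judgments."""
    na = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    nb = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    cos = (na * nb).sum(axis=1)
    return spearman(cos, np.asarray(human_scores, dtype=float))

rng = np.random.default_rng(0)
base = rng.standard_normal((10, 256))
noise = np.linspace(0.0, 2.0, 10)          # more noise -> less similar pair
pair = base + noise[:, None] * rng.standard_normal((10, 256))
human = 5.0 - 2.0 * noise                  # human scores track similarity
print(sts_score(base, pair, human))        # close to 1: sims track judgments
```

Spearman correlation is preferred over Pearson when only the ranking of similarities matters, not the linear relationship between cosine values and the human rating scale.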
Building an embedding-based application involves several decisions and steps.
Key considerations when selecting an embedding model include retrieval quality on relevant benchmarks (such as MTEB), embedding dimensionality and its storage cost, maximum input length, inference latency and cost, language and modality coverage, and licensing (open-weight versus API-only).
For documents longer than the embedding model's context window, text must be split into chunks before embedding. Common chunking strategies include fixed-size windows by token count, sentence- or paragraph-based splitting, recursive splitting on structural separators (headings, then paragraphs, then sentences), and semantic chunking that groups topically related sentences.
Chunk size affects retrieval quality. Smaller chunks are more precise but may lack context. Larger chunks provide more context but may dilute the relevance signal. A chunk size of 256 to 512 tokens with 10 to 20 percent overlap is a common starting point.
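A fixed-size sliding-window chunker with overlap can be sketched in standard-library Python. Words stand in for tokens here; a real pipeline would count tokens with the embedding model's own tokenizer.

```python
def chunk_text(text: str, chunk_size: int = 400, overlap: int = 60) -> list[str]:
    """Split text into fixed-size word windows with the given overlap.
    (Words approximate tokens; use the model's tokenizer in production.)"""
    words = text.split()
    if len(words) <= chunk_size:
        return [text]
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reaches the end of the document
    return chunks

doc = " ".join(f"w{i}" for i in range(1000))
chunks = chunk_text(doc, chunk_size=400, overlap=60)
print(len(chunks))  # 3 windows: [0:400], [340:740], [680:1000]
```

The overlap ensures that a sentence falling on a chunk boundary appears intact in at least one chunk, at the cost of embedding (and storing) some text twice.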
After embedding, vectors are loaded into a vector database with appropriate metadata (document ID, section title, source URL). An ANN index is built, and the system is ready to serve queries. In production, embedding APIs and vector databases are typically deployed behind a service layer that handles authentication, rate limiting, caching of frequent queries, and result post-processing (reranking, filtering, deduplication).
Despite their utility, embeddings have several known limitations.
Semantic collapse: Embedding models can sometimes map semantically different items to very similar vectors, particularly for out-of-distribution inputs or adversarial examples. This can lead to irrelevant search results.
Lack of interpretability: Individual dimensions of an embedding vector generally do not correspond to human-interpretable features. Understanding why two items have similar embeddings requires additional analysis techniques like probing classifiers or attention visualization.
Bias: Embeddings inherit biases present in their training data. Word2Vec embeddings trained on Google News famously exhibited gender stereotypes (e.g., "man" is to "computer programmer" as "woman" is to "homemaker"). Debiasing techniques exist but do not fully eliminate the problem.
Temporal drift: Embeddings trained on a static corpus do not reflect changes in language, culture, or knowledge that occur after training. The meaning of terms like "COVID" or "GPT" changed significantly over short periods, and static embeddings cannot capture these shifts.
Cross-model incompatibility: Embeddings produced by different models are not interchangeable. You cannot mix embeddings from OpenAI's text-embedding-3-large with embeddings from Cohere Embed v4 in the same vector database, because they occupy different vector spaces.
Several trends are shaping the future of embeddings.
Longer context windows: Embedding models are supporting increasingly long inputs (Cohere Embed v4 supports ~128K tokens), reducing or eliminating the need for chunking.
Multimodal unification: Following CLIP and ImageBind, embedding models are expanding to jointly embed text, images, audio, video, and code in shared spaces.
Late interaction and multi-vector approaches: Models like ColBERT represent queries and documents as sets of token-level vectors rather than single vectors, enabling more fine-grained matching at the cost of higher storage. BGE-M3 supports this approach alongside traditional dense retrieval.
Learned sparse embeddings: Combining dense embeddings with learned sparse representations (as in SPLADE and BGE-M3's sparse mode) provides hybrid retrieval that captures both semantic similarity and exact keyword matching.
Embedding-native architectures: New model architectures are being designed specifically for producing high-quality embeddings rather than being adapted from language models trained for text generation.