Embeddings are dense vector representations of data in a continuous vector space, where semantically similar items are mapped to nearby points. In machine learning and artificial intelligence, embeddings serve as the bridge between raw data (words, sentences, images, audio) and the numerical representations that algorithms can process. Rather than treating data as discrete symbols, embeddings capture meaning, relationships, and context by encoding information into fixed-length arrays of floating-point numbers.
The concept of embeddings has become foundational to modern AI systems. From natural language processing (NLP) to computer vision, recommendation engines to retrieval-augmented generation (RAG), embeddings enable machines to reason about similarity, analogy, and relatedness across virtually any type of data. The development of embedding techniques over the past decade represents one of the most impactful shifts in how AI systems represent and process information.
Before embeddings became standard, machine learning systems typically represented text using sparse, high-dimensional vectors. Bag-of-words models and TF-IDF (term frequency-inverse document frequency) representations created vectors with as many dimensions as there were unique words in the vocabulary, often tens or hundreds of thousands. These sparse representations suffered from several problems: they treated every word as independent of every other word, they could not capture synonymy or polysemy, and they required enormous amounts of memory.
Early work on distributed representations dates back to the 1980s, when Geoffrey Hinton introduced the idea of representing concepts as patterns of activity across multiple processing units in a neural network. Yoshua Bengio and colleagues advanced this idea in 2003 with their neural probabilistic language model, which learned continuous word representations as part of a language modeling task. However, it was not until 2013, when Tomas Mikolov and his team at Google published Word2Vec, that embeddings became practical for large-scale applications.
Word embeddings assign a dense vector to each word in a vocabulary such that words with similar meanings end up close together in the vector space. The major breakthroughs in this area came from three models: Word2Vec, GloVe, and FastText.
Word2Vec was introduced by Tomas Mikolov, Kai Chen, Greg Corrado, and Jeff Dean at Google in 2013. The key insight was that a shallow neural network trained on a simple word prediction task could learn rich semantic representations. Word2Vec offered two training architectures:

- CBOW (Continuous Bag-of-Words): predicts a target word from its surrounding context words; fast to train and slightly better for frequent words.
- Skip-gram: predicts the surrounding context words from a target word; slower to train but better for rare words and small corpora.
Both architectures used a vocabulary-sized softmax output layer, which was computationally expensive. Mikolov introduced two approximation techniques to make training feasible on large corpora: hierarchical softmax (organizing the vocabulary as a binary tree) and negative sampling (training the model to distinguish real context pairs from randomly generated negative pairs).
Word2Vec models were typically trained with 100 to 300 dimensions. Google released a pre-trained model with 300-dimensional vectors trained on roughly 100 billion words from Google News, covering a vocabulary of 3 million words and phrases. One of the most celebrated properties of Word2Vec embeddings was their ability to capture analogies through vector arithmetic. The classic example is: vector("king") - vector("man") + vector("woman") produces a vector closest to vector("queen").
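The analogy arithmetic is easy to reproduce with NumPy. The sketch below uses hand-picked 4-dimensional toy vectors (real Word2Vec vectors are learned from a corpus and have 100 to 300 dimensions):

```python
import numpy as np

# Toy vectors, hand-picked so the analogy works; real Word2Vec
# vectors are learned, not constructed.
vocab = {
    "king":   np.array([0.9, 0.8, 0.1, 0.0]),
    "queen":  np.array([0.9, 0.1, 0.8, 0.0]),
    "man":    np.array([0.1, 0.9, 0.0, 0.1]),
    "woman":  np.array([0.1, 0.1, 0.9, 0.1]),
    "prince": np.array([0.8, 0.7, 0.15, 0.0]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def analogy(a, b, c):
    """Return the word closest to vector(a) - vector(b) + vector(c)."""
    target = vocab[a] - vocab[b] + vocab[c]
    # Exclude the input words themselves, as Word2Vec evaluations do
    candidates = [w for w in vocab if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vocab[w], target))

print(analogy("king", "man", "woman"))  # queen
```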
GloVe (Global Vectors for Word Representation) was developed by Jeffrey Pennington, Richard Socher, and Christopher Manning at Stanford University in 2014. While Word2Vec learned embeddings through local context windows, GloVe combined local context with global corpus statistics.
GloVe constructs a word-word co-occurrence matrix from the entire corpus, where each entry records how frequently two words appear within a specified window of each other. The model then factorizes this co-occurrence matrix using a weighted least-squares objective. The key insight was that ratios of co-occurrence probabilities encode meaning: for instance, the ratio of P(ice | solid) to P(ice | gas) is much larger than 1, while the ratio of P(steam | solid) to P(steam | gas) is much smaller than 1. GloVe's objective function was designed to preserve these ratios in the learned vector space.
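The first step, building the co-occurrence counts, can be sketched in a few lines (toy corpus and window size for illustration; the weighted least-squares factorization that GloVe performs on these counts is omitted):

```python
from collections import defaultdict

def cooccurrence(corpus, window=2):
    """Count how often word pairs appear within `window` tokens of each other."""
    counts = defaultdict(int)
    for sentence in corpus:
        tokens = sentence.lower().split()
        for i, w in enumerate(tokens):
            # Look at neighbors within the window, skipping the word itself
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if i != j:
                    counts[(w, tokens[j])] += 1
    return counts

corpus = ["ice is solid", "steam is gas", "ice is cold solid water"]
counts = cooccurrence(corpus)
print(counts[("ice", "solid")], counts[("ice", "is")])
```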
The Stanford NLP Group released several pre-trained GloVe models:
| Training corpus | Tokens | Vocabulary size | Available dimensions |
|---|---|---|---|
| Wikipedia 2014 + Gigaword 5 | 6B | 400K | 50, 100, 200, 300 |
| Common Crawl (42B) | 42B | 1.9M | 300 |
| Common Crawl (840B) | 840B | 2.2M | 300 |
| Twitter (2B tweets) | 27B | 1.2M | 25, 50, 100, 200 |
GloVe achieved competitive or superior performance to Word2Vec on word analogy, word similarity, and named entity recognition tasks. The 300-dimensional vectors trained on 840 billion tokens became one of the most widely used pre-trained embedding resources in NLP.
FastText was developed by Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov at Facebook AI Research (FAIR) and released in 2016. FastText extended the Word2Vec Skip-gram architecture by incorporating subword information.
Instead of learning a single vector per word, FastText represents each word as a bag of character n-grams (by default, n-grams of length 3 to 6) plus the word itself. The embedding for a word is the sum of the vectors for all its constituent n-grams. This approach provided two major advantages over Word2Vec and GloVe:

- Out-of-vocabulary handling: words never seen during training still receive embeddings, assembled from the n-grams they share with known words.
- Morphological awareness: related forms such as "run", "running", and "runner" share n-grams and therefore share parts of their representations, which especially benefits morphologically rich languages.
Facebook released pre-trained FastText vectors for 157 languages, each with 300 dimensions, trained on Wikipedia and Common Crawl data. FastText also included a text classification component that achieved accuracy competitive with deep learning models while training orders of magnitude faster.
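The subword decomposition is simple to reproduce. A sketch of FastText-style n-gram extraction, using the `<` and `>` boundary markers from the paper:

```python
def char_ngrams(word, n_min=3, n_max=6):
    """FastText-style subword units: boundary markers plus character n-grams."""
    marked = f"<{word}>"  # boundary markers distinguish prefixes and suffixes
    grams = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(marked) - n + 1):
            grams.add(marked[i:i + n])
    grams.add(marked)  # the full word itself is also a unit
    return grams

# "where" with n-grams of length 3-4 only, to keep the output short
print(sorted(char_ngrams("where", 3, 4)))
```

A word's vector is then the sum of its n-gram vectors, which is why an unseen word such as "whereish" would still receive a sensible embedding.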
| Model | Year | Developers | Method | Subword support | Typical dimensions | Key strength |
|---|---|---|---|---|---|---|
| Word2Vec | 2013 | Mikolov et al. (Google) | CBOW / Skip-gram | No | 100-300 | Efficient training, analogy properties |
| GloVe | 2014 | Pennington et al. (Stanford) | Co-occurrence matrix factorization | No | 50-300 | Global statistics, strong benchmarks |
| FastText | 2016 | Bojanowski et al. (Facebook) | Skip-gram with char n-grams | Yes | 100-300 | OOV handling, morphological awareness |
Word2Vec, GloVe, and FastText all produce a single fixed vector for each word, regardless of context. The word "bank" receives the same embedding whether it appears in "river bank" or "investment bank." This fundamental limitation, known as the polysemy problem, motivated the development of contextual embeddings.
Contextual embeddings generate different vector representations for the same word depending on the surrounding text. This approach captures polysemy and allows models to represent word meaning in a context-dependent way.
ELMo (Embeddings from Language Models) was introduced by Matthew Peters and colleagues at the Allen Institute for AI in February 2018. ELMo was the first widely adopted model to produce contextual word embeddings.
ELMo uses a two-layer bidirectional LSTM (biLSTM) trained as a language model on a large text corpus (the 1 Billion Word Benchmark). The forward LSTM predicts the next word given the preceding context, while the backward LSTM predicts the previous word given the following context. The two directions are trained independently and their outputs are concatenated.
For each token, ELMo produces three layers of representations: the character-based word embedding (from a character CNN), the first biLSTM layer output, and the second biLSTM layer output. The final ELMo embedding is a task-specific weighted combination of all three layers, where the weights are learned during fine-tuning on downstream tasks. Research showed that lower layers tend to capture syntactic information, while higher layers capture more semantic information.
ELMo embeddings improved the state of the art across six NLP tasks when its paper was published, and the work received the Best Paper Award at NAACL 2018. However, ELMo's bidirectionality was shallow: the forward and backward LSTMs were trained separately and only concatenated, rather than jointly attending to both left and right context at every layer.
BERT (Bidirectional Encoder Representations from Transformers), introduced by Jacob Devlin and colleagues at Google in October 2018, represented a major leap forward in contextual embeddings. Unlike ELMo's separate forward and backward passes, BERT uses a transformer encoder that attends to the full input sequence simultaneously at every layer, producing deeply bidirectional representations.
BERT was pre-trained on two objectives: Masked Language Modeling (MLM), where the model predicts randomly masked tokens from both left and right context, and Next Sentence Prediction (NSP), where the model learns relationships between sentence pairs. The resulting hidden states can be used as contextual embeddings for downstream tasks.
BERT-Base produces 768-dimensional embeddings, while BERT-Large produces 1024-dimensional embeddings. Researchers found that different layers capture different types of information, and a common practice is to average or concatenate the last four hidden layers to produce general-purpose token embeddings.
The success of BERT led to a family of encoder models that produce contextual embeddings, including RoBERTa, ALBERT, ELECTRA, and DeBERTa. These models are widely used as the backbone for embedding-based applications.
While word and token-level embeddings are useful for many tasks, applications like semantic search, document clustering, and sentence similarity require fixed-length representations of entire sentences or documents.
Sentence-BERT was introduced by Nils Reimers and Iryna Gurevych at the Technical University of Darmstadt in 2019. The authors identified a critical limitation of using BERT directly for sentence similarity: comparing two sentences with BERT requires feeding both sentences into the network simultaneously, making it computationally prohibitive at scale. Finding the most similar pair among 10,000 sentences would require roughly 50 million inference computations, taking about 65 hours.
SBERT solved this by fine-tuning BERT using a siamese network architecture. Two identical BERT models (sharing weights) independently encode two sentences, and a pooling operation (typically mean pooling over token embeddings) produces fixed-size sentence vectors. The network is trained using either a classification objective (with a softmax classifier on the concatenated sentence representations) for NLI data, or a regression objective (minimizing the mean squared error between predicted and gold similarity scores) for STS data.
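Mean pooling itself is straightforward. A NumPy sketch that averages token vectors while ignoring padding positions (the array shapes are illustrative):

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, ignoring padding positions (mask == 0)."""
    mask = attention_mask[:, :, None].astype(float)   # (batch, seq, 1)
    summed = (token_embeddings * mask).sum(axis=1)    # (batch, dim)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)    # avoid divide-by-zero
    return summed / counts

# Two "sentences" of 3 tokens each; the second has one padded position.
tok = np.ones((2, 3, 4))
tok[1, 2] = 100.0                        # padding junk that must be ignored
mask = np.array([[1, 1, 1], [1, 1, 0]])
pooled = mean_pool(tok, mask)
print(pooled[1])  # all ones: the padded position was excluded
```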
The result was a model that could encode sentences into vectors that are directly comparable using cosine similarity. SBERT reduced the time to find the most similar pair from 65 hours to about 5 seconds, while maintaining accuracy comparable to BERT. The Sentence Transformers library, built on top of Hugging Face Transformers, has become the standard framework for training and using sentence embedding models.
The Universal Sentence Encoder (USE) was published by Daniel Cer and colleagues at Google in 2018. It encodes sentences into 512-dimensional vectors and was designed specifically for transfer learning across a wide range of NLP tasks.
USE offered two model variants:

- A transformer-based encoder, which achieves higher accuracy at higher computational cost.
- A Deep Averaging Network (DAN), which averages word and bigram embeddings and feeds the result through a feedforward network, trading some accuracy for much faster inference.
USE was released through TensorFlow Hub and gained popularity for its simplicity: users could encode any English sentence into a fixed-length vector with a single function call. Google later released a multilingual version supporting 16 languages.
Embeddings are not limited to text. In computer vision, image embeddings represent images as dense vectors that capture visual content, style, and semantic meaning.
Convolutional neural networks (CNNs) trained on large-scale image classification tasks like ImageNet learn hierarchical visual features. Early layers detect edges and textures, middle layers capture parts and patterns, and deeper layers represent high-level semantic concepts. Removing the final classification layer from a pre-trained CNN and using the output of the penultimate layer (the layer before the softmax) as an embedding has been a standard technique since the mid-2010s.
Commonly used CNN architectures for image embeddings include:
| Architecture | Year | Penultimate layer dimensions | Notable feature |
|---|---|---|---|
| VGG-16 / VGG-19 | 2014 | 4096 | Simple, widely used baseline |
| ResNet-50 | 2015 | 2048 | Residual connections, deeper networks |
| Inception v3 | 2015 | 2048 | Multi-scale feature extraction |
| EfficientNet | 2019 | 1280-1792 | Compound scaling, efficient |
These CNN-derived embeddings enable applications like image similarity search, visual recommendation systems, and transfer learning for specialized domains (medical imaging, satellite imagery, industrial inspection).
The Vision Transformer (ViT), introduced by Alexey Dosovitskiy and colleagues at Google in 2020, adapted the transformer architecture from NLP to image classification. ViT splits an image into fixed-size patches (typically 16x16 pixels), linearly projects each patch into a vector, and processes the sequence of patch embeddings through a standard transformer encoder. The output of the [CLS] token serves as the image embedding.
ViT-based embeddings have largely supplanted CNN features for many applications, particularly when combined with large-scale pre-training on datasets like LAION-5B or using self-supervised methods like DINOv2.
Multimodal embeddings map data from different modalities (text, images, audio) into a shared vector space, enabling cross-modal comparisons. If a photo of a dog and the text "a photo of a dog" are both embedded into the same space, their vectors should be close together.
CLIP (Contrastive Language-Image Pre-training) was developed by OpenAI and released in January 2021. CLIP jointly trains an image encoder and a text encoder on 400 million image-text pairs collected from the internet, using a contrastive learning objective.
During training, CLIP receives a batch of image-text pairs. The image encoder (a Vision Transformer or ResNet) produces image embeddings, and the text encoder (a transformer) produces text embeddings. The model is trained to maximize the cosine similarity between matching image-text pairs while minimizing similarity between non-matching pairs. Both encoders produce vectors in the same 512-dimensional space (for ViT-B/32) or 768-dimensional space (for ViT-L/14).
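The symmetric contrastive objective can be sketched in NumPy. The `temperature` value and the toy batch below are illustrative; a real implementation would use a framework with automatic differentiation:

```python
import numpy as np

def clip_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of matching image-text pairs."""
    # L2-normalize so dot products are cosine similarities
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature      # (batch, batch) similarity matrix
    labels = np.arange(len(logits))         # pair i matches pair i (diagonal)

    def cross_entropy(l, y):
        l = l - l.max(axis=1, keepdims=True)            # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(y)), y].mean()

    # Average the image-to-text and text-to-image directions
    return (cross_entropy(logits, labels) + cross_entropy(logits.T, labels)) / 2

rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 8))
print(clip_loss(emb, emb))  # perfectly aligned pairs give a low loss
```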
CLIP's shared embedding space enables several applications:

- Zero-shot image classification: class names are embedded as text prompts (such as "a photo of a dog"), and an image is assigned to the class whose text embedding is closest to its image embedding.
- Text-to-image retrieval: searching an image collection with natural-language queries.
- Cross-modal similarity: comparing images to captions, or images to images, within the same vector space.
CLIP's embeddings became the foundation for many downstream systems, including DALL-E 2 (which used CLIP embeddings to guide image generation) and numerous image search engines.
ImageBind was released by Meta AI in May 2023 as a model that learns a joint embedding space across six modalities: images, text, audio, depth, thermal, and inertial measurement unit (IMU) data. The key insight behind ImageBind is that not all modality pairs need to be trained together. By using images as a "binding" modality (since image-paired data exists for each of the other modalities), ImageBind can align all six modalities into a single vector space.
This approach enables emergent cross-modal capabilities that were never explicitly trained. For example, ImageBind can retrieve audio clips given a text query (even though it was never trained on text-audio pairs directly) because both text and audio are aligned to images in the shared space. ImageBind appeared at CVPR 2023 as a highlighted paper.
The embedding model ecosystem has expanded rapidly since 2023, with both commercial API providers and open-source projects releasing increasingly capable models. Modern embedding models typically support Matryoshka Representation Learning (MRL), which trains embeddings so that the first N dimensions of a vector form a useful lower-dimensional embedding on their own. This allows users to truncate embeddings to smaller sizes with minimal loss in quality, reducing storage and computation costs.
OpenAI released its third-generation embedding models in January 2024:

- text-embedding-3-small: 1,536 dimensions, optimized for cost and latency.
- text-embedding-3-large: 3,072 dimensions, higher accuracy at a higher price.

Both models expose a dimensions API parameter that truncates embeddings Matryoshka-style, trading quality for storage and speed.
These models replaced the earlier text-embedding-ada-002 (released in December 2022), which produced 1,536-dimensional embeddings and was one of the first widely adopted commercial embedding APIs.
Cohere's Embed v4, released in 2025, is a multimodal embedding model supporting both text and images. It handles interleaved text and image content, making it suitable for document understanding and visual search. Key specifications include a default dimension of 1,536 (with MRL support for 256, 512, and 1,024 dimensions), a context window of approximately 128,000 tokens, and support for multiple output formats including float, int8, uint8, binary, and ubinary precision.
Voyage AI, acquired by Anthropic, has released several competitive embedding models. Voyage-3-large (January 2025) offers 1,024-dimensional embeddings with MRL support (256, 512, 2,048 dimensions) and has been shown to outperform OpenAI's text-embedding-3-large by an average of 9.74% across evaluated domains. Voyage-3.5 (May 2025) further improved quality with 2,048-dimensional embeddings and a 32,000-token context window, reducing vector database costs by 83% compared to OpenAI's text-embedding-3-large when using int8 quantization.
The BGE series from the Beijing Academy of Artificial Intelligence (BAAI) has been a leading open-source embedding family. BGE-M3 (released January 2024) stands for Multi-linguality, Multi-granularity, and Multi-functionality:

- Multi-linguality: supports more than 100 working languages.
- Multi-granularity: handles inputs ranging from short sentences to documents of up to 8,192 tokens.
- Multi-functionality: performs dense retrieval, sparse (lexical) retrieval, and multi-vector (ColBERT-style) retrieval within a single model.
BGE-M3 produces 1,024-dimensional dense embeddings and has ranked at or near the top of the MTEB (Massive Text Embedding Benchmark) leaderboard.
| Model | Developer | Dimensions | Max tokens | Key feature |
|---|---|---|---|---|
| Jina Embeddings v3 | Jina AI | 1024 (MRL: 32-1024) | 8,192 | Task-specific LoRA adapters, 570M params |
| Nomic Embed Text v1.5 | Nomic AI | 768 (MRL: 64-768) | 8,192 | Fully open-source, reproducible training |
| Gemini Embedding | Google | 3,072 | 8,192 | Distilled from Gemini LLM, multimodal v2 |
| mixedbread mxbai-embed-large | mixedbread.ai | 1,024 | 512 | Top open-source MTEB performer |
The following table summarizes the major embedding models available as of early 2026:
| Model | Provider | Release | Dimensions | Max tokens | Open source | MRL support | Multimodal |
|---|---|---|---|---|---|---|---|
| text-embedding-3-small | OpenAI | Jan 2024 | 1,536 | 8,191 | No | Yes | No |
| text-embedding-3-large | OpenAI | Jan 2024 | 3,072 | 8,191 | No | Yes | No |
| Embed v4 | Cohere | 2025 | 1,536 | ~128,000 | No | Yes | Yes |
| Voyage-3-large | Voyage AI | Jan 2025 | 1,024 | 32,000 | No | Yes | No |
| Voyage-3.5 | Voyage AI | May 2025 | 2,048 | 32,000 | No | Yes | No |
| BGE-M3 | BAAI | Jan 2024 | 1,024 | 8,192 | Yes | No | No |
| Jina Embeddings v3 | Jina AI | Sep 2024 | 1,024 | 8,192 | Yes | Yes | No |
| Nomic Embed v1.5 | Nomic AI | 2024 | 768 | 8,192 | Yes | Yes | No |
| Gemini Embedding 001 | Google | 2024 | 3,072 | 8,192 | No | Yes | No |
| Gemini Embedding 2 | Google | 2025 | 3,072 | 8,192 | No | Yes | Yes |
The dimensionality of an embedding vector is one of the most important design choices in any embedding-based system. It affects retrieval quality, storage costs, computation speed, and memory usage.
Each dimension in an embedding vector represents a learned feature or concept. In a well-trained embedding model, individual dimensions rarely correspond to interpretable human concepts, but collectively the dimensions form a coordinate system where geometric relationships (distances and angles) encode semantic relationships. Higher-dimensional spaces can, in principle, capture finer-grained distinctions between concepts, because there are more axes along which items can differ.
Increasing embedding dimensions generally improves retrieval quality, but with diminishing returns. Research and benchmarks have consistently shown that:

- Moving from very low dimensions (64 to 128) into the mid range (256 to 768) yields large accuracy gains.
- Beyond roughly 1,000 dimensions, gains on general-purpose retrieval benchmarks become marginal.
- The largest dimension counts mainly help on specialized or fine-grained tasks where subtle distinctions matter.
For most production RAG systems, 384 to 768 dimensions deliver a strong balance of accuracy, speed, and cost. The accuracy curve typically flattens somewhere between 768 and 1,024 dimensions for general-purpose retrieval.
Storage and computation costs scale linearly with dimensions. A concrete example: a collection of 10 million vectors costs roughly $3.75 per month at 384 dimensions versus $30 per month at 3,072 dimensions (using float32 storage). Query latency also scales with dimension count; computing similarity between 384-dimensional vectors is roughly 4 times faster than for 1,536-dimensional vectors.
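The linear scaling is easy to sanity-check. The sketch below computes raw float32 storage only, ignoring index overhead and any particular provider's pricing:

```python
def storage_gb(num_vectors, dims, bytes_per_value=4):
    """Raw storage for float32 vectors, ignoring index overhead."""
    return num_vectors * dims * bytes_per_value / 1e9

# 10 million vectors at several common dimension counts
for dims in (384, 768, 1536, 3072):
    print(f"{dims:>5} dims: {storage_gb(10_000_000, dims):8.2f} GB")
```

Doubling the dimension count doubles both the storage footprint and the number of multiply-adds per similarity computation.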
Matryoshka Representation Learning (MRL), introduced by Aditya Kusupati and colleagues at NeurIPS 2022, addresses the dimension tradeoff by training a single model to produce embeddings that are useful at multiple scales simultaneously. The name refers to Russian nesting dolls (matryoshka): the first 256 dimensions of a 1,024-dimensional embedding form a useful 256-dimensional embedding on their own.
MRL works by computing the training loss at multiple dimension checkpoints during training. For example, the model computes the contrastive loss for the first 64, 128, 256, 512, and 1,024 dimensions separately and sums the losses. This forces the model to pack the most important information into the first dimensions.
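The idea can be sketched as follows, substituting a simple (1 - cosine) pair loss for the real contrastive objective at each dimension checkpoint:

```python
import numpy as np

def mrl_loss(anchor, positive, dims=(64, 128, 256, 512, 1024)):
    """Sum a simplified (1 - cosine) loss over nested prefix lengths.

    Real MRL computes a contrastive loss at each checkpoint; the pair
    loss here just keeps the sketch short.
    """
    total = 0.0
    for d in dims:
        a, p = anchor[:d], positive[:d]  # truncate to the prefix
        cos = a @ p / (np.linalg.norm(a) * np.linalg.norm(p))
        total += 1.0 - cos
    return total

rng = np.random.default_rng(0)
a = rng.normal(size=1024)
print(mrl_loss(a, a))  # identical vectors: near-zero loss at every prefix
```

Because every prefix contributes to the loss, the model is penalized unless the most discriminative information lands in the earliest dimensions.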
The practical benefits of MRL are substantial:

- A single model serves many dimension budgets, so separate models need not be trained for different cost targets.
- Stored vectors can be truncated after the fact, cutting storage and search costs roughly in proportion to the dimension reduction with only a small loss in quality.
- Retrieval can be staged: a candidate shortlist is computed with truncated vectors, then reranked with full-length vectors.
Most modern commercial embedding models (OpenAI text-embedding-3, Cohere Embed v4, Voyage, Jina v3, Nomic Embed v1.5) support MRL.
Once data is embedded into vector space, comparing items requires a distance or similarity function. The three most common measures for embeddings are cosine similarity, dot product, and Euclidean distance.
Cosine similarity measures the angle between two vectors, ignoring their magnitudes:
cosine_similarity(A, B) = (A . B) / (||A|| * ||B||)
Values range from -1 (opposite directions) to 1 (same direction), with 0 indicating orthogonality (no similarity). Cosine similarity is the most widely used metric for text embeddings because it normalizes for vector length, making it robust to differences in document length or embedding magnitude. Most embedding model providers recommend cosine similarity as the default metric.
The dot product (inner product) computes the sum of element-wise products of two vectors:
dot_product(A, B) = sum(A_i * B_i)
Unlike cosine similarity, the dot product is affected by both the angle and the magnitude of the vectors. This makes it useful when magnitude carries meaning, for example, in recommendation systems where a larger magnitude might indicate stronger user preference or item popularity. When vectors are L2-normalized (unit vectors), cosine similarity and dot product produce identical rankings.
Euclidean distance measures the straight-line distance between two points in the vector space:
euclidean_distance(A, B) = sqrt(sum((A_i - B_i)^2))
Smaller distances indicate greater similarity. Euclidean distance is sensitive to both direction and magnitude and works well when absolute differences in feature values matter, such as in clustering tasks or when comparing feature vectors with count-based attributes.
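The three measures can be compared side by side in NumPy, using a pair of vectors that point in the same direction but differ in magnitude:

```python
import numpy as np

def cosine_similarity(a, b):
    # Angle only: magnitudes cancel out
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def dot_product(a, b):
    # Angle and magnitude both contribute
    return float(a @ b)

def euclidean_distance(a, b):
    # Straight-line distance between the two points
    return float(np.linalg.norm(a - b))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])  # same direction, twice the magnitude

print(cosine_similarity(a, b))   # ~1.0: identical direction
print(dot_product(a, b))         # 28.0: magnitude matters
print(euclidean_distance(a, b))  # ~3.74: the points are apart
```

Note that after L2-normalizing both vectors, the dot product equals the cosine similarity, which is why the two metrics rank results identically for unit vectors.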
| Metric | Range | Considers magnitude | Best for | Notes |
|---|---|---|---|---|
| Cosine similarity | -1 to 1 | No | Text search, semantic similarity | Most common for text embeddings |
| Dot product | -inf to +inf | Yes | Recommendations, when magnitude matters | Equivalent to cosine for normalized vectors |
| Euclidean distance | 0 to +inf | Yes | Clustering, spatial analysis | Lower values mean more similar |
A practical rule of thumb: match the similarity metric to the one used during the embedding model's training. Most text embedding models are trained with cosine similarity or a contrastive objective that normalizes embeddings, so cosine similarity is usually the right default.
Embeddings underpin a wide range of modern AI applications.
Traditional keyword search relies on exact or fuzzy string matching, which fails when users phrase queries differently from the stored documents. Semantic search uses embeddings to match queries to documents based on meaning rather than keywords. Both the query and the corpus documents are embedded using the same model, and the documents most similar to the query embedding (by cosine similarity) are returned as results.
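A minimal sketch of the ranking step, with random vectors standing in for real document and query embeddings produced by a model:

```python
import numpy as np

def search(query_vec, doc_vecs, top_k=3):
    """Rank documents by cosine similarity to the query embedding."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q                          # cosine similarity per document
    top = np.argsort(-scores)[:top_k]
    return [(int(i), float(scores[i])) for i in top]

rng = np.random.default_rng(42)
docs = rng.normal(size=(100, 384))            # stand-in document embeddings
query = docs[17] + 0.1 * rng.normal(size=384)  # a query near document 17
print(search(query, docs))  # document 17 ranks first
```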
Semantic search handles synonyms ("car" matches "automobile"), paraphrases ("how to fix a flat tire" matches a document titled "tire repair instructions"), and even cross-lingual queries when multilingual embeddings are used. Companies like Google, Bing, and Spotify use embedding-based semantic search in their core products.
Recommendation systems use embeddings to represent users and items (products, movies, songs, articles) in a shared vector space. Users who have similar tastes end up with similar user embeddings, and items that are frequently consumed together end up with similar item embeddings. Recommendations are generated by finding items whose embeddings are closest to the user's embedding.
Collaborative filtering with embeddings has largely replaced explicit feature engineering in modern recommendation systems. Platforms like Netflix, YouTube, and Amazon use embedding-based approaches to power their recommendation engines.
Embeddings enable clustering of documents, sentences, or images by grouping similar vectors together using algorithms like k-means, DBSCAN, or HDBSCAN. This is useful for topic discovery, customer feedback analysis, content organization, and anomaly detection. Because embeddings capture semantic meaning, clusters formed from embeddings tend to be more meaningful than those formed from bag-of-words representations.
Embeddings serve as input features for classification tasks. Rather than training a model from raw text or images, practitioners encode the data into embeddings and train a lightweight classifier (logistic regression, SVM, or a small neural network) on top. This approach, sometimes called "embedding + head," is fast to train, requires fewer labeled examples (because the embedding model has already learned general representations), and often achieves competitive accuracy.
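A sketch of the "embedding + head" pattern, here with a nearest-centroid head in plain NumPy (logistic regression or an SVM would be more common in practice) and random vectors standing in for frozen embeddings:

```python
import numpy as np

class CentroidClassifier:
    """Lightweight head over frozen embeddings: label = nearest class centroid."""

    def fit(self, embeddings, labels):
        labels = np.asarray(labels)
        self.classes = sorted(set(labels.tolist()))
        self.centroids = np.stack(
            [embeddings[labels == c].mean(axis=0) for c in self.classes])
        return self

    def predict(self, embeddings):
        dists = np.linalg.norm(
            embeddings[:, None] - self.centroids[None], axis=2)
        return [self.classes[i] for i in np.argmin(dists, axis=1)]

rng = np.random.default_rng(0)
# Stand-ins for embeddings of two classes, separated in embedding space
pos = rng.normal(loc=+1.0, size=(50, 16))
neg = rng.normal(loc=-1.0, size=(50, 16))
X = np.vstack([pos, neg])
y = [1] * 50 + [0] * 50
clf = CentroidClassifier().fit(X, y)
print(clf.predict(rng.normal(loc=+1.0, size=(5, 16))))  # mostly 1s
```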
RAG is one of the most important applications of embeddings in the era of large language models. A RAG system combines a retrieval component (powered by embeddings) with a generative LLM to produce answers grounded in external knowledge. Embeddings are central to the retrieval step.
The RAG pipeline consists of two phases:

Indexing phase (offline):
1. Split source documents into chunks (for example, a few hundred tokens each).
2. Embed each chunk with an embedding model.
3. Store the chunk vectors, along with the chunk text and metadata, in a vector database.

Query phase (online):
1. Embed the user's query with the same embedding model.
2. Retrieve the top-k chunks whose vectors are most similar to the query vector.
3. Insert the retrieved chunks into the LLM prompt as context.
4. The LLM generates an answer grounded in the retrieved content.
The quality of a RAG system depends heavily on the quality of its embeddings. Better embeddings produce more relevant retrievals, which lead to more accurate and grounded LLM responses. Organizations deploying RAG systems increasingly fine-tune their embedding models on domain-specific data to improve retrieval quality for their particular use case.
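The two phases described above can be sketched end to end. The `embed` function below is a deliberately crude stand-in (hashed character trigrams) for a real embedding model, and the chunks are illustrative:

```python
import hashlib
import numpy as np

def embed(text, dim=64):
    """Stand-in embedder: hash character trigrams into a fixed-size vector.
    A real system would call an embedding model here."""
    v = np.zeros(dim)
    t = text.lower()
    for i in range(len(t) - 2):
        h = int(hashlib.md5(t[i:i + 3].encode()).hexdigest(), 16)
        v[h % dim] += 1.0
    n = np.linalg.norm(v)
    return v / n if n else v

# Indexing phase (offline): chunk, embed, store
chunks = [
    "Password reset: go to Settings and choose 'Forgot password'.",
    "Billing questions are handled by the finance team.",
    "The API rate limit is 100 requests per minute.",
]
index = np.stack([embed(c) for c in chunks])

# Query phase (online): embed the query, retrieve, build the prompt
query = "how do I reset my password"
scores = index @ embed(query)                 # cosine similarities
best = chunks[int(np.argmax(scores))]
prompt = f"Context: {best}\n\nQuestion: {query}\nAnswer:"
print(prompt)
```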
Embeddings can identify outliers in datasets by flagging items whose embedding vectors are far from any cluster center or from the distribution of normal examples. This technique is applied in fraud detection, network security, quality control in manufacturing, and monitoring of ML model inputs for data drift.
Near-duplicate detection uses embeddings to find items that are semantically identical or nearly identical, even if they differ in surface form. This is applied in data cleaning, content moderation (detecting reposted content), and dataset curation for training ML models.
As embedding-based applications have grown, so has the need for specialized databases optimized for storing, indexing, and querying high-dimensional vectors at scale. Vector databases (sometimes called vector stores) provide efficient nearest-neighbor search over millions or billions of embedding vectors.
Vector databases use approximate nearest neighbor (ANN) algorithms to make similarity search fast. Exact nearest-neighbor search requires comparing a query vector against every stored vector, which is O(n) and impractical at scale. ANN algorithms trade a small amount of accuracy for large speed improvements by building index structures that allow the database to quickly narrow down the search space.
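A toy IVF-style index illustrates the tradeoff: cluster the vectors once, then search only the few clusters nearest the query. All sizes here are illustrative, and a production system would use a library such as FAISS rather than this sketch:

```python
import numpy as np

def build_ivf(vectors, n_lists=8, n_iter=10, seed=0):
    """Toy IVF index: k-means centroids plus one posting list per centroid."""
    rng = np.random.default_rng(seed)
    centroids = vectors[rng.choice(len(vectors), n_lists, replace=False)]
    for _ in range(n_iter):
        assign = np.argmin(
            np.linalg.norm(vectors[:, None] - centroids[None], axis=2), axis=1)
        for c in range(n_lists):
            members = vectors[assign == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    # Final assignment against the final centroids
    assign = np.argmin(
        np.linalg.norm(vectors[:, None] - centroids[None], axis=2), axis=1)
    lists = {c: np.where(assign == c)[0] for c in range(n_lists)}
    return centroids, lists

def ivf_search(query, vectors, centroids, lists, n_probe=2):
    """Probe only the n_probe posting lists nearest to the query."""
    nearest = np.argsort(np.linalg.norm(centroids - query, axis=1))[:n_probe]
    candidates = np.concatenate([lists[c] for c in nearest])
    dists = np.linalg.norm(vectors[candidates] - query, axis=1)
    return int(candidates[np.argmin(dists)])

rng = np.random.default_rng(1)
data = rng.normal(size=(1000, 32))
centroids, lists = build_ivf(data)
query = data[123] + 0.01 * rng.normal(size=32)  # a query near vector 123
print(ivf_search(query, data, centroids, lists))
```

With `n_probe=2` of 8 lists, each query scans only about a quarter of the vectors; raising `n_probe` trades speed back for recall.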
Common ANN indexing algorithms include:

- HNSW (Hierarchical Navigable Small World): a layered proximity graph navigated greedily from coarse to fine; high recall and speed, but memory-intensive.
- IVF (Inverted File Index): clusters the vectors and searches only the clusters nearest the query.
- PQ (Product Quantization): compresses vectors into compact codes, often combined with IVF (IVF-PQ) to reduce memory.
- LSH (Locality-Sensitive Hashing): hashes similar vectors into the same buckets.
- DiskANN: a graph-based index designed to serve billion-scale datasets from SSD.
| Database | Type | Language | ANN algorithm | Key strength | Scale |
|---|---|---|---|---|---|
| Pinecone | Managed cloud service | - | Proprietary | Fully managed, zero-ops | Billions of vectors |
| Weaviate | Open source / cloud | Go | HNSW | Hybrid search, GraphQL API | Billions of vectors |
| Milvus | Open source / cloud (Zilliz) | Go / C++ | IVF, HNSW, DiskANN | Distributed, high throughput | Billions of vectors |
| Chroma | Open source | Python | HNSW | Developer-friendly, lightweight | Millions of vectors |
| Qdrant | Open source / cloud | Rust | HNSW | Fast filtering, Rust performance | Billions of vectors |
| pgvector | PostgreSQL extension | C | IVFFlat, HNSW | Integrates with existing Postgres | Tens of millions of vectors |
| FAISS | Library (not a database) | C++ / Python | IVF, HNSW, PQ | Facebook AI research library, very fast | Billions of vectors |
Pinecone is a fully managed vector database that handles scaling, indexing, and infrastructure automatically. It is popular among teams that want to build embedding-based applications without managing database infrastructure. Pinecone supports metadata filtering, namespaces for multi-tenancy, and serverless deployment.
Weaviate is an open-source vector database written in Go that combines vector search with structured filtering and a GraphQL API. Weaviate includes built-in vectorization modules that can automatically embed data on ingestion using models like OpenAI, Cohere, or Hugging Face transformers. Its hybrid search feature combines dense vector search with BM25 keyword search.
Milvus is a distributed open-source vector database originally developed by Zilliz. It is designed for high-throughput scenarios and supports multiple index types (IVF-Flat, IVF-PQ, HNSW, DiskANN). Milvus can handle datasets with billions of vectors through its distributed architecture with separate storage and compute nodes. Zilliz Cloud is the managed version.
Chroma is a lightweight, developer-friendly vector database designed for rapid prototyping and small to medium-scale applications. It runs in-process (embedded mode) or as a standalone server and is a popular choice for developers building RAG applications quickly. Chroma is not designed for billion-scale datasets.
Qdrant is an open-source vector database written in Rust, optimized for performance and reliability. It supports HNSW indexing with quantization, advanced filtering with payload (metadata) indexes, and distributed deployment. Qdrant provides ACID-compliant transactions and horizontal scaling.
pgvector is a PostgreSQL extension that adds vector similarity search capabilities to existing PostgreSQL databases. It supports IVFFlat and HNSW indexing. The main advantage of pgvector is that it allows teams already using PostgreSQL to add vector search without introducing a new database into their stack. However, pgvector is generally suited for datasets up to tens of millions of vectors; larger datasets may require a dedicated vector database.
FAISS (Facebook AI Similarity Search) is a library, not a database, developed by Meta AI Research. It provides highly optimized implementations of several ANN algorithms and is often used as the search engine underlying other vector databases. FAISS supports GPU acceleration and can handle billion-scale datasets.
Pre-trained embedding models produce general-purpose representations that work well across many tasks. However, for specific domains or applications, fine-tuning the embedding model on domain-specific data can yield significant quality improvements. Benchmark studies from 2025 indicate that domain-specialized embedding models can outperform general-purpose models by 12 to 30 percent on industry-specific retrieval tasks.
General-purpose embedding models are trained on broad internet-scale data. They may not accurately represent the vocabulary, concepts, or similarity relationships specific to a particular domain. In legal text, for example, "consideration" is a term of art in contract law, while a general-purpose model will weight its everyday meaning of thoughtfulness.
Fine-tuning adapts the model's representations to better capture domain-specific semantics.
Contrastive learning is the most common approach for fine-tuning embedding models. The model is trained to produce similar embeddings for semantically related items (positive pairs) and dissimilar embeddings for unrelated items (negative pairs). Common contrastive loss functions include:
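The in-batch contrastive objective (the idea behind losses such as InfoNCE, also known in the Sentence Transformers library as multiple-negatives ranking loss) can be sketched in a few lines of NumPy. This is an illustrative formulation, not a specific library's implementation; the temperature value is a common but arbitrary choice.

```python
import numpy as np

def info_nce_loss(query_emb: np.ndarray, doc_emb: np.ndarray,
                  temperature: float = 0.05) -> float:
    """In-batch contrastive (InfoNCE) loss.

    Row i of query_emb pairs with row i of doc_emb (the positive);
    every other row in the batch serves as an in-batch negative.
    """
    # L2-normalize so dot products are cosine similarities
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    d = doc_emb / np.linalg.norm(doc_emb, axis=1, keepdims=True)
    logits = q @ d.T / temperature                   # (batch, batch)
    # Cross-entropy with the diagonal (the true pair) as the target class
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

rng = np.random.default_rng(0)
docs = rng.standard_normal((8, 32))
aligned = docs + 0.05 * rng.standard_normal((8, 32))  # matched pairs
mismatched = np.roll(docs, 1, axis=0)                 # every pair is wrong
print(info_nce_loss(aligned, docs) < info_nce_loss(mismatched, docs))  # True
```

Training pushes the loss down by pulling each query toward its positive document and away from the other documents in the batch, which is why larger batch sizes (more in-batch negatives) generally improve embedding quality.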
Hard negative mining improves fine-tuning quality by selecting negative examples that are challenging for the model (items that are similar but not relevant). Random negatives are often too easy, providing little learning signal. Hard negatives force the model to learn more nuanced distinctions.
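A minimal sketch of hard negative mining: use the current model's embeddings to find the corpus items most similar to each query that are not the labeled positive. Function and variable names here are our own.

```python
import numpy as np

def mine_hard_negatives(query_emb: np.ndarray, corpus_emb: np.ndarray,
                        positive_idx: np.ndarray, n_hard: int = 5) -> np.ndarray:
    """For each query, return the n_hard most similar corpus items
    that are NOT the labeled positive -- the 'hard' negatives."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    c = corpus_emb / np.linalg.norm(corpus_emb, axis=1, keepdims=True)
    sims = q @ c.T                                    # (n_queries, n_corpus)
    # Exclude each query's known positive from consideration
    sims[np.arange(len(q)), positive_idx] = -np.inf
    # Indices of the n_hard highest-similarity remaining items
    return np.argsort(-sims, axis=1)[:, :n_hard]

rng = np.random.default_rng(1)
corpus = rng.standard_normal((100, 16))
queries = corpus[:4] + 0.1 * rng.standard_normal((4, 16))
hard = mine_hard_negatives(queries, corpus, positive_idx=np.arange(4))
print(hard.shape)  # (4, 5): five hard negatives per query
```

In practice this mining step is often run with a pre-trained model before fine-tuning, or iteratively between training rounds as the model's notion of similarity improves.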
Knowledge distillation uses a larger, more capable model (teacher) to generate soft labels that a smaller model (student) learns to replicate. This approach is used by models like Gecko, which distills retrieval knowledge from large language models into a compact embedding model.
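One common way to express the distillation objective for embeddings is to have the student match the teacher's similarity distribution over a batch (soft labels) via KL divergence. This is one possible formulation for illustration; actual recipes, including Gecko's, differ in their details.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sim_matrix(emb: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarities within a batch."""
    n = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    return n @ n.T

def distillation_loss(student_emb: np.ndarray, teacher_emb: np.ndarray,
                      temperature: float = 2.0) -> float:
    """KL(teacher || student) between softened similarity distributions:
    the student learns to reproduce the teacher's notion of relatedness."""
    t = softmax(sim_matrix(teacher_emb) / temperature)
    s = softmax(sim_matrix(student_emb) / temperature)
    return float(np.mean(np.sum(t * (np.log(t) - np.log(s)), axis=1)))

rng = np.random.default_rng(0)
teacher = rng.standard_normal((6, 16))
student_init = rng.standard_normal((6, 16))
print(distillation_loss(student_init, teacher))  # > 0: to be minimized
print(distillation_loss(teacher, teacher))       # 0.0: perfect match
```

The student can be much smaller (fewer dimensions, fewer layers) than the teacher, since it only needs to reproduce the teacher's similarity structure, not its full behavior.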
Fine-tuning embeddings typically requires pairs or triplets of examples:
| Data format | Structure | Example |
|---|---|---|
| Positive pairs | (query, relevant document) | ("how to reset password", "Password reset instructions: Go to Settings...") |
| Triplets | (anchor, positive, negative) | ("python list sort", "Use the sort() method...", "Python was created by Guido...") |
| Scored pairs | (sentence A, sentence B, similarity score) | ("The cat sat on the mat", "A feline rested on the rug", 0.85) |
Datasets of 10,000 to 100,000 examples are typically sufficient for meaningful improvements, though more data generally helps. The Sentence Transformers library provides built-in support for loading these data formats and training with various loss functions.
As embedding-based systems scale to millions or billions of vectors, storage and computation costs become significant. Several techniques reduce these costs.
Scalar quantization converts 32-bit floating-point values to lower-precision formats. Converting from float32 to int8 (8-bit integers) reduces storage by 4x with minimal quality loss (typically less than 0.3 percent degradation). Binary quantization (1-bit) provides 32x compression but with more noticeable quality loss. Cohere Embed v4 supports float, int8, uint8, binary, and ubinary output formats natively.
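A minimal sketch of symmetric scalar quantization from float32 to int8, using a single per-dataset scale factor (production systems often use per-dimension or per-vector scales):

```python
import numpy as np

def quantize_int8(vectors: np.ndarray):
    """Map float32 values to int8 with one global scale factor.
    Storage drops from 4 bytes to 1 byte per value (4x compression)."""
    scale = float(np.abs(vectors).max()) / 127.0
    q = np.round(vectors / scale).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
emb = rng.standard_normal((1000, 256)).astype(np.float32)
q, scale = quantize_int8(emb)
recon = dequantize_int8(q, scale)
print(q.nbytes / emb.nbytes)       # 0.25 -> 4x smaller
print(float(np.abs(emb - recon).max()))  # bounded by scale / 2
```

Because the worst-case error per value is half the quantization step, similarity rankings are largely preserved, which is why the measured quality loss is so small.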
Product quantization (PQ) divides each vector into subvectors and quantizes each subvector independently using a learned codebook. This technique, widely used in FAISS and Milvus, can achieve 8 to 64x compression ratios while maintaining reasonable search accuracy.
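The mechanics of PQ can be sketched end to end: split each vector into m subvectors, learn a small codebook per subspace (here with a deliberately minimal k-means), then store only the codeword indices. This is a toy illustration; FAISS and Milvus use far more refined training.

```python
import numpy as np

def kmeans(data: np.ndarray, k: int, iters: int = 10, seed: int = 0) -> np.ndarray:
    """Minimal Lloyd's k-means -- just enough for a demo codebook."""
    rng = np.random.default_rng(seed)
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(iters):
        d = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        assign = d.argmin(1)
        for j in range(k):
            pts = data[assign == j]
            if len(pts):
                centroids[j] = pts.mean(0)
    return centroids

def pq_train(vectors: np.ndarray, m: int = 4, k: int = 16) -> list:
    """Learn one k-entry codebook per subspace (m subvectors per vector)."""
    subdim = vectors.shape[1] // m
    return [kmeans(vectors[:, i*subdim:(i+1)*subdim], k, seed=i) for i in range(m)]

def pq_encode(vectors: np.ndarray, codebooks: list) -> np.ndarray:
    """Replace each subvector with the index of its nearest codeword."""
    m, subdim = len(codebooks), vectors.shape[1] // len(codebooks)
    codes = np.empty((len(vectors), m), dtype=np.uint8)
    for i, cb in enumerate(codebooks):
        sub = vectors[:, i*subdim:(i+1)*subdim]
        d = ((sub[:, None, :] - cb[None, :, :]) ** 2).sum(-1)
        codes[:, i] = d.argmin(1)
    return codes

def pq_decode(codes: np.ndarray, codebooks: list) -> np.ndarray:
    return np.hstack([cb[codes[:, i]] for i, cb in enumerate(codebooks)])

rng = np.random.default_rng(0)
emb = rng.standard_normal((500, 32)).astype(np.float32)
cbs = pq_train(emb, m=4, k=16)
codes = pq_encode(emb, cbs)        # 4 bytes per vector instead of 128
recon = pq_decode(codes, cbs)
print(codes.nbytes / emb.nbytes)   # 1/32 compression
```

At query time, search proceeds over the compressed codes (often via precomputed query-to-codeword distance tables) rather than the reconstructed vectors, which is where PQ's speed advantage comes from.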
Beyond MRL (described above), traditional dimensionality reduction techniques like PCA (Principal Component Analysis) can be applied post-hoc to reduce embedding dimensions. Combining float8 quantization with PCA (retaining 50 percent of dimensions) can yield 8x total compression with less quality degradation than int8 quantization alone.
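Post-hoc PCA reduction can be sketched with an SVD on the centered embedding matrix; the synthetic data below (embeddings whose variance lives mostly in a low-dimensional subspace) is an assumption for illustration.

```python
import numpy as np

def pca_reduce(vectors: np.ndarray, n_components: int):
    """Project embeddings onto their top principal components.
    Keeps the directions of highest variance; returns the projection,
    the components, and the mean needed to apply it to new vectors."""
    mean = vectors.mean(axis=0)
    centered = vectors - mean
    # Right singular vectors of the centered data are the principal axes
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    return centered @ components.T, components, mean

rng = np.random.default_rng(0)
# Simulate 768-d embeddings whose variance lives mostly in 50 directions
latent = rng.standard_normal((2000, 50))
proj = rng.standard_normal((50, 768))
emb = latent @ proj + 0.01 * rng.standard_normal((2000, 768))
reduced, comps, mean = pca_reduce(emb, n_components=384)
print(reduced.shape)   # (2000, 384): half the storage per vector
```

New vectors (queries) must be reduced with the same `components` and `mean` learned from the indexed corpus, or query and document vectors will no longer live in the same space.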
Embedding models are evaluated on standardized benchmarks to compare their quality across tasks.
MTEB is the most comprehensive benchmark for text embedding models. It covers 8 task categories: classification, clustering, pair classification, reranking, retrieval, semantic textual similarity (STS), summarization, and bitext mining. MTEB evaluates models across dozens of datasets in multiple languages. The MTEB leaderboard on Hugging Face is the primary reference for comparing embedding model quality.
BEIR is a heterogeneous benchmark for evaluating information retrieval models across 18 datasets spanning diverse domains (biomedical, financial, scientific, etc.) and task types (question answering, fact checking, entity retrieval). BEIR specifically tests zero-shot generalization, measuring how well retrieval models perform on domains and tasks not seen during training.
The Semantic Textual Similarity Benchmark measures how well embeddings capture sentence-level semantic similarity. Pairs of sentences are scored for similarity by human annotators, and the model's cosine similarity scores are compared to these judgments using Spearman or Pearson correlation.
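The evaluation loop is simple enough to sketch: compute cosine similarity for each sentence pair, then rank-correlate against the human scores. The synthetic "embeddings" and scores below are placeholders for a real model's output and a real STS dataset; the Spearman implementation ignores ties for brevity.

```python
import numpy as np

def spearman(a: np.ndarray, b: np.ndarray) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks.
    (No tie handling -- fine for continuous scores.)"""
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    return float(np.corrcoef(ra, rb)[0, 1])

def sts_score(emb_a: np.ndarray, emb_b: np.ndarray, human_scores) -> float:
    """Cosine similarity per sentence pair, correlated with human judgments."""
    na = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    nb = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    cos = (na * nb).sum(axis=1)
    return spearman(cos, np.asarray(human_scores, dtype=float))

rng = np.random.default_rng(0)
base = rng.standard_normal((10, 256))
noise = np.linspace(0.0, 2.0, 10)          # more noise -> less similar pair
pair = base + noise[:, None] * rng.standard_normal((10, 256))
human = 5.0 - 2.0 * noise                  # human scores track similarity
print(sts_score(base, pair, human))        # close to 1: sims track judgments
```

Spearman correlation is preferred over Pearson when only the ranking of similarities matters, not the linear relationship between cosine values and the human rating scale.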
Building an embedding-based application involves several decisions and steps.
Key considerations when selecting an embedding model include retrieval quality on relevant benchmarks (such as MTEB), embedding dimensionality and its storage cost, maximum input length, inference latency and cost, language and modality coverage, and licensing (open-weight versus API-only).
For documents longer than the embedding model's context window, text must be split into chunks before embedding. Common chunking strategies include fixed-size windows by token count, sentence- or paragraph-based splitting, recursive splitting on structural separators (headings, then paragraphs, then sentences), and semantic chunking that groups topically related sentences.
Chunk size affects retrieval quality. Smaller chunks are more precise but may lack context. Larger chunks provide more context but may dilute the relevance signal. A chunk size of 256 to 512 tokens with 10 to 20 percent overlap is a common starting point.
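A fixed-size sliding-window chunker with overlap can be sketched in standard-library Python. Words stand in for tokens here; a real pipeline would count tokens with the embedding model's own tokenizer.

```python
def chunk_text(text: str, chunk_size: int = 400, overlap: int = 60) -> list[str]:
    """Split text into fixed-size word windows with the given overlap.
    (Words approximate tokens; use the model's tokenizer in production.)"""
    words = text.split()
    if len(words) <= chunk_size:
        return [text]
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reaches the end of the document
    return chunks

doc = " ".join(f"w{i}" for i in range(1000))
chunks = chunk_text(doc, chunk_size=400, overlap=60)
print(len(chunks))  # 3 windows: [0:400], [340:740], [680:1000]
```

The overlap ensures that a sentence falling on a chunk boundary appears intact in at least one chunk, at the cost of embedding (and storing) some text twice.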
After embedding, vectors are loaded into a vector database with appropriate metadata (document ID, section title, source URL). An ANN index is built, and the system is ready to serve queries. In production, embedding APIs and vector databases are typically deployed behind a service layer that handles authentication, rate limiting, caching of frequent queries, and result post-processing (reranking, filtering, deduplication).
Despite their utility, embeddings have several known limitations.
Semantic collapse: Embedding models can sometimes map semantically different items to very similar vectors, particularly for out-of-distribution inputs or adversarial examples. This can lead to irrelevant search results.
Lack of interpretability: Individual dimensions of an embedding vector generally do not correspond to human-interpretable features. Understanding why two items have similar embeddings requires additional analysis techniques like probing classifiers or attention visualization.
Bias: Embeddings inherit biases present in their training data. Word2Vec embeddings trained on Google News famously exhibited gender stereotypes (e.g., "man" is to "computer programmer" as "woman" is to "homemaker"). Debiasing techniques exist but do not fully eliminate the problem.
Temporal drift: Embeddings trained on a static corpus do not reflect changes in language, culture, or knowledge that occur after training. The meaning of terms like "COVID" or "GPT" changed significantly over short periods, and static embeddings cannot capture these shifts.
Cross-model incompatibility: Embeddings produced by different models are not interchangeable. You cannot mix embeddings from OpenAI's text-embedding-3-large with embeddings from Cohere Embed v4 in the same vector database, because they occupy different vector spaces.
Several trends are shaping the future of embeddings.
Longer context windows: Embedding models are supporting increasingly long inputs (Cohere Embed v4 supports ~128K tokens), reducing or eliminating the need for chunking.
Multimodal unification: Following CLIP and ImageBind, embedding models are expanding to jointly embed text, images, audio, video, and code in shared spaces.
Late interaction and multi-vector approaches: Models like ColBERT represent queries and documents as sets of token-level vectors rather than single vectors, enabling more fine-grained matching at the cost of higher storage. BGE-M3 supports this approach alongside traditional dense retrieval.
Learned sparse embeddings: Combining dense embeddings with learned sparse representations (as in SPLADE and BGE-M3's sparse mode) provides hybrid retrieval that captures both semantic similarity and exact keyword matching.
Embedding-native architectures: New model architectures are being designed specifically for producing high-quality embeddings rather than being adapted from language models trained for text generation.