# Embedding vector

> Source: https://aiwiki.ai/wiki/embedding_vector
> Updated: 2026-07-11
> Categories: Natural Language Processing
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

An **embedding vector** is a dense, fixed-length array of real numbers that represents a discrete object (such as a word, sentence, image, audio clip, or graph node) as a point in a continuous vector space. Unlike sparse representations like [one-hot encoding](/wiki/one-hot_encoding) or bag-of-words, where most entries are zero, embedding vectors pack semantic information into every dimension. Objects with similar meanings or properties are mapped to nearby points in the [embedding space](/wiki/embedding_space), enabling algorithms to reason about similarity, perform arithmetic on meanings, and generalize across related inputs.

Embedding vectors have become one of the most important building blocks in modern artificial intelligence. They serve as the primary interface between raw data and [neural network](/wiki/neural_network) models across [natural language processing](/wiki/natural_language_processing), computer vision, speech, recommendation systems, and graph analytics. Whenever a model needs to compare two objects, search a large corpus, cluster items by topic, or condition a generation step on retrieved context, an embedding vector is almost always doing the work in the background. The same underlying object is referred to in different settings as a feature vector, a latent representation, a hidden state, or simply an [embedding](/wiki/embeddings); the term "embedding vector" emphasizes the concrete numerical artifact stored in memory. Modern embedding vectors typically range from 50 to 4,096 dimensions, and a single 1,536-dimensional float32 vector occupies about 6 kilobytes. The underlying idea dates to the 1986 "distributed representations" chapter of *Parallel Distributed Processing* by Geoffrey Hinton, James McClelland, and David Rumelhart, who argued that knowledge in connectionist networks should be encoded as "patterns of activity" across many simple units rather than by single dedicated nodes.[1]

## Explain like I'm 5

Imagine you have a big collection of LEGO bricks, and each brick represents a word, a picture, or some other thing. You want a robot to understand which bricks are similar. So you invent a secret code: every brick gets a short list of numbers, like [0.3, 0.8, 0.1]. Bricks that are alike (a red fire truck and a red car) get codes with numbers that are close together, and bricks that are very different (a fire truck and a banana) get codes that are far apart. The robot reads the codes and instantly knows how things are related, without ever needing to look at the bricks themselves. That list of numbers is the embedding vector.

## Definition and notation

Formally, an embedding is a function $$f: X \to \mathbb{R}^d$$ that maps every object $$x$$ in some discrete or structured set $$X$$ (a vocabulary, a set of users, a collection of images, a graph) into a real-valued vector of length $$d$$. The integer $$d$$ is called the embedding dimension. The image $$f(x)$$ is the embedding vector for $$x$$. The vector space $$\mathbb{R}^d$$, together with the distribution of all $$f(x)$$ for $$x \in X$$, is called the embedding space.

A few notational conventions are common. Vectors are written in lowercase bold ($$v$$, $$u$$) or with a vector arrow. The $$i$$-th component is written $$v_i$$. The L2 norm is written $$\lVert v \rVert$$. The dot product of two vectors is written $$v \cdot u$$. Cosine similarity is written $$\cos(v, u)$$. For batched computation, embedding vectors are usually stacked into a matrix $$E$$ of shape $$(N, d)$$, where $$N$$ is the number of items and each row is a single embedding. This matrix layout matches how vectors are stored in [vector databases](/wiki/vector_database) and how they are loaded into GPU memory for similarity search.

Embedding vectors are typically stored as 32-bit floating-point numbers (float32), although modern systems frequently use lower-precision formats (float16, bfloat16, int8, or even single-bit binary) to save memory. A single 1,536-dimensional float32 vector occupies about 6 kilobytes; storing one million such vectors costs roughly 6 gigabytes of RAM if no compression is applied.
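The storage figures above follow from simple arithmetic. The sketch below reproduces them under stated assumptions: raw vector payload only (no index or metadata overhead), 1 GB taken as 10^9 bytes, and helper names (`vector_bytes`, `corpus_gigabytes`) that are illustrative rather than part of any library.

```python
# Bytes per component for common storage formats.
# Binary embeddings pack 8 dimensions into one byte.
BYTES_PER_COMPONENT = {
    "float32": 4.0,
    "float16": 2.0,
    "bfloat16": 2.0,
    "int8": 1.0,
    "binary": 1.0 / 8.0,
}


def vector_bytes(dim: int, dtype: str = "float32") -> float:
    """Raw size of one embedding vector in bytes (payload only)."""
    return dim * BYTES_PER_COMPONENT[dtype]


def corpus_gigabytes(num_vectors: int, dim: int, dtype: str = "float32") -> float:
    """Raw size of a whole collection in gigabytes (1 GB = 10**9 bytes)."""
    return num_vectors * vector_bytes(dim, dtype) / 1e9


print(vector_bytes(1536))                 # 6144.0 bytes, about 6 KB
print(corpus_gigabytes(1_000_000, 1536))  # 6.144 GB for one million vectors
```

Swapping the `dtype` argument shows how much each lower-precision format saves for the same collection.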

## How do embedding vectors differ from sparse representations?

Before embedding vectors became the standard, most machine learning systems represented discrete objects using sparse encodings.

| Property | Sparse representation | Dense embedding vector |
|---|---|---|
| Example schemes | One-hot encoding, bag-of-words, TF-IDF | Word2Vec, GloVe, BERT, CLIP |
| Typical dimensionality | Equals vocabulary size (10,000 to 1,000,000+) | Fixed and compact (50 to 4,096) |
| Non-zero entries | 1 or very few per vector | Most or all entries are non-zero |
| Similarity information | One-hot vectors are all mutually orthogonal (equidistant); bag-of-words and TF-IDF capture only exact term overlap | Similar items are nearby; dissimilar items are far apart |
| Storage efficiency | Wasteful for large vocabularies | Compact and memory-friendly |
| Learned from data | No (hand-designed) | Yes (trained via neural networks or matrix factorization) |
| Interpretability | Each dimension is a single token | Dimensions usually do not correspond to human concepts |

A vocabulary of 50,000 words represented with one-hot encoding produces 50,000-dimensional vectors where all pairs are orthogonal, meaning the representation treats "dog" and "puppy" as equally unrelated as "dog" and "skyscraper." An embedding vector of 300 dimensions, by contrast, places "dog" and "puppy" close together while pushing "skyscraper" far away. This structure is learned automatically from data and encodes rich semantic relationships.
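The contrast can be made concrete with a toy three-word vocabulary. This is a minimal sketch: the dense vectors are hand-picked hypothetical values chosen only to illustrate the geometry, not the output of any trained model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def one_hot(index, size):
    """Sparse encoding: a single 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


vocab = ["dog", "puppy", "skyscraper"]
sparse = {word: one_hot(i, len(vocab)) for i, word in enumerate(vocab)}

# Hypothetical dense vectors (illustrative values, not learned).
dense = {
    "dog": [0.9, 0.8, 0.1],
    "puppy": [0.8, 0.9, 0.2],
    "skyscraper": [0.1, 0.0, 0.9],
}

# One-hot: every distinct pair has similarity exactly 0.
print(cosine(sparse["dog"], sparse["puppy"]))
print(cosine(sparse["dog"], sparse["skyscraper"]))

# Dense: "dog" is much closer to "puppy" than to "skyscraper".
print(cosine(dense["dog"], dense["puppy"]))
print(cosine(dense["dog"], dense["skyscraper"]))
```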

Sparse representations remain useful when exact term matches matter (legal search, code search for variable names) or when interpretability is paramount. In modern retrieval pipelines, sparse and dense vectors are often combined in a hybrid search that takes the best of both worlds.

## When were embedding vectors invented?

The idea that meaning can be carried by patterns of activity over many units, rather than by single dedicated units, predates the deep learning era by decades.

**Distributed representations (1986).** The conceptual foundation appears in chapter 3 of *Parallel Distributed Processing*, where Geoffrey Hinton, James McClelland, and David Rumelhart argued that knowledge in connectionist networks should be encoded as patterns of activation across many simple units.[1] In their framing, similar concepts share overlapping activation patterns, and generalization to novel inputs is automatic because nearby patterns trigger similar downstream behavior. This is the conceptual ancestor of every modern embedding method.

**Latent semantic analysis (1990).** Deerwester and colleagues introduced LSA, which applied truncated singular value decomposition to a term-document matrix to produce dense vectors for both words and documents.[2] LSA was the first widely adopted method for treating semantic similarity as geometric distance.

**Neural language models (2003).** Yoshua Bengio and collaborators trained a feedforward language model that jointly learned word embeddings and a probability distribution over the next word.[3] The network's input layer used a small dense lookup table, anticipating the embedding layer that became standard a decade later.

**Word2Vec (2013).** Tomas Mikolov and colleagues at Google released two efficient training algorithms (Skip-gram and Continuous Bag of Words) that could learn 300-dimensional word vectors from billions of tokens in hours.[4] Word2Vec popularized the idea that simple linear arithmetic on embeddings can capture analogies (king minus man plus woman is approximately queen) and triggered an explosion of follow-up work.

**GloVe (2014).** Jeffrey Pennington, Richard Socher, and Christopher Manning proposed Global Vectors, which factorize a global word-word co-occurrence matrix using a weighted log-bilinear loss.[5] GloVe combined the global statistics of LSA-style methods with the sliding-window structure of Word2Vec.

**FastText (2017).** Piotr Bojanowski and colleagues at Facebook AI Research extended Word2Vec to operate over character n-grams, allowing the model to produce embeddings for out-of-vocabulary words and to share information across morphologically related forms.[6]

**ELMo (2018).** Matthew Peters and collaborators at the Allen Institute introduced deep contextualized word representations using a bidirectional LSTM language model.[8] ELMo produced different embeddings for the same word in different contexts, addressing the polysemy problem inherent in static embeddings.

**BERT (2018).** Jacob Devlin and colleagues at Google released the [BERT](/wiki/bert_bidirectional_encoder_representations_from_transformers) transformer encoder pretrained with masked language modeling.[9] BERT could be fine-tuned to produce contextual word and sentence embeddings that dominated the NLP leaderboards for years.

**Sentence-BERT (2019).** Nils Reimers and Iryna Gurevych adapted BERT into a siamese network that produces sentence-level embeddings comparable with cosine similarity.[10] Sentence-BERT cut the time needed to find the most similar sentence pair in a 10,000-sentence collection from about 65 hours (using a BERT cross-encoder) to roughly 5 seconds.[10]

**CLIP (2021).** Alec Radford and colleagues at OpenAI trained a dual-encoder system that maps images and text into a shared 512-dimensional space using contrastive learning on 400 million image-caption pairs.[11] CLIP enabled zero-shot image classification and cross-modal retrieval at large scale.

**OpenAI text-embedding-3 (January 2024).** OpenAI released text-embedding-3-small (1,536 dimensions) and text-embedding-3-large (3,072 dimensions) on January 25, 2024.[25] Both models support [Matryoshka representation learning](https://arxiv.org/abs/2205.13147), allowing callers to truncate vectors to lower dimensions while keeping most of the quality. text-embedding-3-small is priced at $0.02 per million tokens and text-embedding-3-large at $0.13 per million tokens.[25]

**The 2024 to 2026 model wave.** The two years following text-embedding-3 saw a flood of competitive embedding models: BGE-M3 from the Beijing Academy of Artificial Intelligence (released January 28, 2024),[22] NV-Embed-v2 from NVIDIA (which reached the top of the MTEB leaderboard in August 2024 with a score of 72.31),[23] Jina Embeddings v3 (released September 18, 2024, with 89-language support and late chunking),[24] Voyage AI's voyage-3 family (the voyage-3-large model released January 7, 2025),[26] Cohere's Embed v4 (multimodal text plus image with Matryoshka dimensions of 256, 512, 1024, and 1536), and ColPali and ColQwen, which extended late-interaction retrieval to PDF page images.[21]

## Key properties of embedding vectors

### Semantic similarity

The defining property of a well-trained embedding space is that objects with similar meanings or functions occupy nearby regions. In a [word embedding](/wiki/word_embedding) space, synonyms like "big" and "large" have high cosine similarity, while unrelated words like "big" and "molecule" are distant. This property extends to images (photos of cats cluster together), users (people with similar tastes cluster together), and graph nodes (nodes in the same community cluster together).

### Vector arithmetic

Embedding vectors support meaningful arithmetic operations that capture relational structure. The most famous example from word embeddings is:

> vector("king") - vector("man") + vector("woman") is approximately vector("queen")

The offset between "king" and "man" captures the concept of royalty independent of gender, and applying that offset to "woman" yields a vector close to "queen."[4] Similar analogies work for geography (Paris minus France plus Italy is approximately Rome) and morphology (bigger minus big plus small is approximately smaller). This property arises because the training process encodes consistent relational patterns as approximately linear directions in the vector space. The same kind of arithmetic shows up in image embeddings: averaging the CLIP embeddings of "sunset" and "beach" produces a vector that retrieves photos of sunsets at beaches.[11]
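The analogy procedure can be sketched in a few lines. The vectors below are hypothetical 3-D values whose axes loosely stand for royalty, masculinity, and "fruitiness"; real embeddings have no such labeled axes, and the `analogy` helper is an illustrative name. Following common practice, the three input words are excluded from the candidate set.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy embeddings (hand-picked, not learned).
vectors = {
    "king": [0.9, 0.9, 0.0],
    "queen": [0.9, 0.1, 0.0],
    "man": [0.1, 0.9, 0.0],
    "woman": [0.1, 0.1, 0.0],
    "prince": [0.8, 0.9, 0.0],
    "apple": [0.0, 0.1, 0.9],
}


def analogy(a, b, c, vectors):
    """Return the word whose vector is closest to vector(a) - vector(b) + vector(c)."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))


print(analogy("king", "man", "woman", vectors))  # queen
```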

### Magnitude and direction

Every embedding vector has both a direction and a magnitude. In most modern text embedding pipelines, the direction is what carries the semantic content, and vectors are L2-normalized to unit length before storage. After normalization, cosine similarity, dot product, and 1 minus half the squared Euclidean distance all produce identical rankings, which is why production search systems often store normalized vectors and use the dot product (which is the cheapest operation on modern hardware).
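The equivalence follows from expanding the squared distance. For two unit-length vectors $$u$$ and $$v$$ (so $$\lVert u \rVert = \lVert v \rVert = 1$$):

$$
\lVert u - v \rVert^2 = \lVert u \rVert^2 + \lVert v \rVert^2 - 2\, u \cdot v = 2 - 2\, u \cdot v
\quad\Longrightarrow\quad
u \cdot v = \cos(u, v) = 1 - \tfrac{1}{2} \lVert u - v \rVert^2
$$

Because the three quantities are related by a monotonic transformation, sorting candidates by any one of them gives the same order (with Euclidean distance sorted ascending rather than descending).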

When magnitude is preserved, it can encode confidence, frequency, or popularity. Some recommendation systems intentionally leave item vectors unnormalized so that popular items have larger norms and naturally rank higher than obscure items with the same direction.

### Clustering and manifold structure

Embedding vectors naturally form clusters that correspond to meaningful categories, even when no category labels are provided during training. Countries group together, animals group together, and verbs of motion group together. More broadly, the [manifold hypothesis](https://en.wikipedia.org/wiki/Manifold_hypothesis) suggests that real-world data concentrates near lower-dimensional manifolds within the high-dimensional embedding space, and good embeddings learn to map data onto these manifolds. The intrinsic dimensionality of natural data (the dimension of the manifold it occupies) is typically much smaller than the ambient dimension of the embedding space, which is why dimension-reduction methods like PCA, t-SNE, and UMAP can produce useful 2D visualizations.

### Anisotropy

A subtle but important property of many embedding spaces is anisotropy: vectors are not uniformly distributed over the unit sphere but instead cluster in a narrow cone. This means that even random or unrelated pairs tend to have moderately high cosine similarity, and the absolute value of a similarity score is less informative than the relative ranking. Post-processing steps such as whitening, and training-time techniques such as isotropy regularization and contrastive objectives, aim to reduce anisotropy and make similarity scores more interpretable.
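A toy illustration of the narrow-cone effect, assuming four hypothetical 2-D vectors that share a large common offset. Subtracting the mean vector is only the first step of a full whitening transform, and the negative average after centering is an artifact of this tiny symmetric example rather than a general result.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_pairwise_cosine(vectors):
    """Average cosine similarity over all distinct pairs."""
    pairs = list(combinations(vectors, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


def center(vectors):
    """Subtract the mean vector from every vector."""
    dim = len(vectors[0])
    mean = [sum(v[j] for v in vectors) / len(vectors) for j in range(dim)]
    return [[v[j] - mean[j] for j in range(dim)] for v in vectors]


# Hypothetical vectors: small variations around a shared offset of (5, 5).
vectors = [[6.0, 5.0], [5.0, 6.0], [4.0, 5.0], [5.0, 4.0]]

before = mean_pairwise_cosine(vectors)         # close to 1: a narrow cone
after = mean_pairwise_cosine(center(vectors))  # spread around the origin
print(before, after)
```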

## Embedding layers in neural networks

In practice, embedding vectors are often produced by a dedicated **embedding layer** at the input of a neural network. In PyTorch, this is implemented as `torch.nn.Embedding(num_embeddings, embedding_dim)`, which creates a learnable lookup table. Each row in the table corresponds to one item in the vocabulary, and each row is a vector of length `embedding_dim`.

When the network receives an input index (for example, the integer ID for the word "cat"), the embedding layer looks up the corresponding row and returns its vector. During training, backpropagation adjusts these vectors to minimize the loss function, so the embedding layer learns representations that are useful for the task at hand. Embedding layers are equivalent to multiplying a one-hot vector by a weight matrix, but the lookup implementation is far more efficient because it avoids the explicit matrix multiplication.
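The equivalence between a lookup and a one-hot matrix product can be checked directly. This is a framework-free sketch with an arbitrary 4 x 3 table. The function names are illustrative, and a real embedding layer would also track gradients for the table.

```python
# Toy embedding table: 4 vocabulary items, embedding_dim = 3 (arbitrary values).
table = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
]


def lookup(table, index):
    """Embedding lookup: return the row for this index."""
    return table[index]


def one_hot_matmul(table, index):
    """Same result computed as one_hot(index) @ table, touching every row."""
    rows, dim = len(table), len(table[0])
    one_hot = [1.0 if i == index else 0.0 for i in range(rows)]
    return [sum(one_hot[i] * table[i][j] for i in range(rows)) for j in range(dim)]


print(lookup(table, 2))          # [0.7, 0.8, 0.9]
print(one_hot_matmul(table, 2))  # [0.7, 0.8, 0.9]
```

The lookup reads one row, while the matrix product performs work proportional to the whole vocabulary, which is why frameworks implement the layer as an index operation.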

Frameworks like TensorFlow provide the same functionality through `tf.keras.layers.Embedding`. The two implementations differ in their extras: PyTorch's layer accepts a `padding_idx` argument (the row for padding tokens is initialized to zeros and receives no gradient updates) and a `max_norm` argument that renormalizes rows whose norm exceeds a threshold, while the Keras layer offers `mask_zero` to treat index 0 as padding. Hugging Face Transformers exposes a feature-extraction pipeline that returns the hidden states of a pretrained model as embedding vectors, and the `sentence-transformers` library wraps this functionality, typically adding mean pooling and, for many models, L2 normalization to produce a single sentence vector per input.

## Types of embedding vectors

| Modality | Representative methods | Typical dimensions |
|---|---|---|
| Static word embeddings | Word2Vec, GloVe, FastText | 100 to 300 |
| Contextual word embeddings | ELMo, BERT, RoBERTa, T5 encoder | 768 to 4,096 |
| Sentence embeddings | Sentence-BERT, MPNet, MiniLM, GTE | 384 to 1,536 |
| Document embeddings | Doc2Vec, BGE-M3, late-chunking models | 768 to 1,536 |
| Image embeddings | ResNet features, DINO, CLIP image encoder | 512 to 2,048 |
| Multimodal embeddings | CLIP, ALIGN, SigLIP, Cohere Embed v4 | 512 to 1,536 |
| Code embeddings | CodeBERT, voyage-code-3, OpenAI code embeddings | 768 to 1,536 |
| Audio embeddings | wav2vec 2.0, HuBERT, Whisper encoder | 768 to 1,280 |
| Speaker embeddings | x-vectors, ECAPA-TDNN | 192 to 512 |
| Graph node embeddings | Node2Vec, GraphSAGE, DeepWalk | 64 to 256 |
| User and item embeddings | Matrix factorization, two-tower retrievers | 32 to 512 |

### Word embeddings

Word embeddings map individual words to dense vectors by training on large text corpora. The three foundational algorithms are:

- **Word2Vec** (Mikolov et al., 2013) uses shallow neural networks with two architectures: Skip-gram, which predicts context words from a target word, and Continuous Bag of Words (CBOW), which predicts a target word from context.[4] Negative sampling makes training efficient by updating only a small subset of weights per example.
- **GloVe** (Pennington et al., 2014) constructs a global word-word co-occurrence matrix from the corpus and factorizes it using a log-bilinear regression model, combining the benefits of global statistics and local context.[5]
- **FastText** (Bojanowski et al., 2017) extends Word2Vec by representing words as bags of character n-grams, allowing it to produce embeddings for out-of-vocabulary words and to capture morphological relationships.[6]

| Method | Year | Approach | OOV handling | Typical dimensions |
|---|---|---|---|---|
| [Word2Vec](/wiki/word2vec) | 2013 | Shallow neural net on local context | None | 100 to 300 |
| [GloVe](/wiki/glove) | 2014 | Global co-occurrence matrix factorization | None | 50 to 300 |
| [FastText](/wiki/fasttext) | 2017 | Subword character n-grams | Yes | 100 to 300 |

These are called **static** embeddings because each word receives a single vector regardless of context. The word "bank" has the same embedding whether it appears in "river bank" or "savings bank."

### Contextual word embeddings

Contextual models address the polysemy problem by giving each token a different vector depending on its surrounding context. ELMo (Peters et al., 2018) used a two-layer bidirectional LSTM language model on top of a character-based token layer and produced 1,024-dimensional embeddings as a learned weighted sum of the three layers' activations.[8] The Peters paper found that lower layers captured syntax while higher layers captured semantics, a pattern that BERT and later transformer models reproduced.[8]

[BERT](/wiki/bert_bidirectional_encoder_representations_from_transformers) (Devlin et al., 2018) replaced LSTMs with the transformer encoder and produced 768-dimensional (BERT-base) or 1,024-dimensional (BERT-large) contextual embeddings.[9] The vector for the special [CLS] token at the start of every input is sometimes used as a sentence embedding, although mean pooling over all token vectors usually performs better. Successor models including RoBERTa, ALBERT, ELECTRA, DeBERTa, and the encoder of T5 follow the same recipe with refinements to the training objective and architecture.

### Sentence and document embeddings

Many applications require a single vector for an entire sentence, paragraph, or document rather than individual words.

- **Doc2Vec** (Le and Mikolov, 2014) extends Word2Vec by introducing a paragraph vector that is trained alongside word vectors, producing fixed-length representations for variable-length text.[7]
- **Sentence-BERT** (Reimers and Gurevych, 2019) fine-tunes BERT using a siamese network architecture to produce sentence embeddings optimized for [similarity measure](/wiki/similarity_measure) tasks like semantic textual similarity and information retrieval.[10] The companion `sentence-transformers` library is the most widely used open-source toolkit for sentence embeddings.
- **MPNet, MiniLM, and GTE** are widely used open-source families that produce 384 to 768-dimensional vectors with strong MTEB scores.
- **Late chunking** (Jina AI, 2024) is a recent technique in which an entire long document is encoded once, and then mean pooling is applied within chunk boundaries afterward.[24] This preserves cross-chunk context that disappears when each chunk is embedded independently.
- **Simple aggregation** methods (averaging or max-pooling word vectors across a sentence) provide a surprisingly strong baseline for tasks like document classification, although they discard word order.
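The pooling steps mentioned in the list above can be sketched as follows. The token vectors are hypothetical stand-ins for the contextual outputs of an encoder that has already processed the whole document, and `late_chunk` is an illustrative helper name rather than an API of any library.

```python
def mean_pool(token_vectors):
    """Average a list of token vectors into one fixed-length vector."""
    count = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[j] for vec in token_vectors) / count for j in range(dim)]


def late_chunk(token_vectors, boundaries):
    """Pool already-contextualized token vectors within each chunk.

    `boundaries` is a list of (start, end) token index pairs, end exclusive.
    """
    return [mean_pool(token_vectors[start:end]) for start, end in boundaries]


# Hypothetical contextual token vectors for a 4-token document.
tokens = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0], [2.0, 2.0]]

sentence_vector = mean_pool(tokens)
chunk_vectors = late_chunk(tokens, [(0, 2), (2, 4)])
print(sentence_vector)  # [1.5, 2.0]
print(chunk_vectors)    # [[2.0, 1.0], [1.0, 3.0]]
```

The key difference from conventional chunking lies upstream of this code: here the encoder sees the full document before pooling, whereas independent chunk embedding would run the encoder once per chunk.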

### Image embeddings

In computer vision, embedding vectors represent images as points in a continuous space for tasks like classification, retrieval, and generation.

- **CNN feature vectors.** Convolutional [neural networks](/wiki/convolutional_neural_network) like ResNet, VGG, and EfficientNet produce embedding vectors at their penultimate layer. A ResNet-50, for instance, outputs a 2,048-dimensional vector before its final classification head. Inside the network, early layers respond to edges and textures while deeper layers respond to high-level object parts, so the penultimate-layer vector summarizes the image's high-level content.
- **DINO and DINOv2** (Caron et al., 2021; Oquab et al., 2023) train Vision Transformers with self-supervised distillation and produce image embeddings that work well for retrieval, segmentation, and depth estimation without any labels.
- **CLIP image encoder.** [CLIP](/wiki/clip) (Contrastive Language-Image Pretraining), developed by OpenAI in 2021, uses a dual-encoder architecture with a Vision [Transformer](/wiki/transformer) (or ResNet) for images and a Transformer for text.[11] Both encoders map their inputs into a shared 512-dimensional embedding space, trained on 400 million image-text pairs using contrastive learning. CLIP enables zero-shot image classification and cross-modal retrieval, where a text query can find semantically matching images.
- **SigLIP** (Zhai et al., 2023) replaces CLIP's softmax contrastive loss with a sigmoid loss that scales better to larger batches and produces stronger zero-shot transfer.

### Audio and speech embeddings

Self-supervised speech models like wav2vec 2.0 and HuBERT produce frame-level embeddings (typically 768 or 1,024 dimensions) that capture phonetic and prosodic information. The encoder of OpenAI's Whisper model is widely repurposed as an audio embedding extractor for music tagging, speaker identification, and audio retrieval. Specialized speaker embedding models such as x-vectors and ECAPA-TDNN produce 192 to 512-dimensional vectors that are nearly identical for two recordings of the same speaker and easily distinguishable across different speakers, which underlies modern speaker verification systems.

### Multimodal embeddings

Multimodal embedding models project two or more modalities into a single shared space so that similarity can be computed across modalities. CLIP and ALIGN aligned text with images. Subsequent work has expanded this to text plus video (VideoCLIP, ImageBind), text plus audio (CLAP), and text plus PDF page screenshots (ColPali, ColQwen, Cohere Embed v4). The hallmark of a multimodal embedding is that a text query and a relevant image (or audio clip, or PDF page) end up close to each other in the same vector space, enabling cross-modal search with a single similarity computation.

### Code embeddings

Code-specific embedding models (CodeBERT, GraphCodeBERT, CodeT5, voyage-code-3, and OpenAI's code embeddings) are trained on programming language corpora and tuned to put semantically equivalent snippets near each other regardless of variable names or formatting. They power code search inside IDEs, duplicate-code detection, and the retrieval step of coding agents.

### Graph embeddings

Graph embeddings represent nodes, edges, or entire graphs as vectors, capturing structural relationships in networks.

- **DeepWalk** (Perozzi et al., 2014) uses uniform random walks and Skip-gram, serving as a precursor to Node2Vec.
- **Node2Vec** (Grover and Leskovec, 2016) extends Word2Vec to graphs by simulating biased random walks that balance between breadth-first (exploring local neighborhoods) and depth-first (exploring distant parts of the graph) strategies.[12] The resulting node sequences are treated like sentences and fed into a Skip-gram model. Node2Vec is transductive, meaning it cannot produce embeddings for nodes unseen during training.
- **GraphSAGE** (Hamilton et al., 2017) is an inductive framework that generates embeddings by sampling and aggregating features from a node's local neighborhood using functions like mean, LSTM, or max-pooling.[13] Because GraphSAGE learns an aggregation function rather than fixed per-node embeddings, it generalizes to new nodes and even new graphs.

Graph embeddings are applied to social network analysis, knowledge graph completion, drug-protein interaction prediction, and fraud detection.
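The walk-generation step shared by DeepWalk and Node2Vec can be sketched with uniform random walks over a small hypothetical graph. This is the DeepWalk-style variant; Node2Vec would replace the uniform `rng.choice` with a biased transition rule, and the Skip-gram training on the resulting "sentences" is omitted.

```python
import random


def random_walk(graph, start, length, rng):
    """Uniform random walk of up to `length` nodes starting at `start`."""
    walk = [start]
    while len(walk) < length:
        neighbors = graph[walk[-1]]
        if not neighbors:  # dead end: stop early
            break
        walk.append(rng.choice(neighbors))
    return walk


def generate_walks(graph, walks_per_node, length, seed=0):
    """Generate several walks from every node, visiting nodes in shuffled order."""
    rng = random.Random(seed)
    walks = []
    for _ in range(walks_per_node):
        nodes = list(graph)
        rng.shuffle(nodes)
        for node in nodes:
            walks.append(random_walk(graph, node, length, rng))
    return walks


# Hypothetical undirected graph stored as adjacency lists.
graph = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["a", "b", "d"],
    "d": ["c"],
}

walks = generate_walks(graph, walks_per_node=2, length=5)
print(walks[0])  # one node sequence, treated like a sentence by Skip-gram
```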

## Embedding dimensionality

The number of dimensions in an embedding vector determines its capacity to represent information. Choosing the right dimensionality involves a trade-off between representational power and computational cost.

| Dimension range | Characteristics | Typical use cases |
|---|---|---|
| 50 to 128 | Compact, fast, low memory | Keyword matching, simple retrieval, visualization |
| 256 to 384 | Good balance for lightweight models | Mobile search, Sentence Transformers MiniLM |
| 512 to 768 | Strong for most NLP tasks | Semantic search, BERT-base, Sentence-BERT MPNet |
| 1,024 to 1,536 | High-quality representations | Enterprise retrieval, OpenAI ada-002, BGE-large |
| 2,048 to 4,096 | Maximum expressiveness | OpenAI text-embedding-3-large, NV-Embed-v2, research models |

Higher dimensions improve the model's ability to capture fine-grained distinctions but increase memory usage, latency, and the risk of overfitting with limited data. A 1,024-dimensional embedding for one million items requires approximately 4 GB of storage using 32-bit floats, while 256 dimensions would require roughly 1 GB.

### Matryoshka representation learning

[Matryoshka representation learning](https://arxiv.org/abs/2205.13147) (MRL), introduced by Aditya Kusupati and colleagues at NeurIPS 2022, trains a single embedding so that its leading slices, at a set of nested sizes chosen during training, also work as valid lower-dimensional embeddings.[17] A 2,048-dimensional MRL vector contains a usable 1,024-dimensional vector in its first half, a usable 512-dimensional vector in its first quarter, and so on, hence the comparison to nested Russian dolls. Truncation typically costs little quality but yields large savings in storage and search latency: the original paper reports up to 14 times smaller embeddings at the same ImageNet-1K accuracy and up to 14 times faster large-scale retrieval.[17]
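On the consumer side, using a Matryoshka embedding at a smaller size amounts to keeping a prefix and re-normalizing it. The sketch below assumes the vector came from a model trained with MRL; applying the same truncation to an ordinary embedding would generally lose much more quality. The function name is illustrative.

```python
import math


def truncate_and_renormalize(vector, dim):
    """Keep the first `dim` components and rescale them to unit length."""
    head = vector[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize an all-zero prefix")
    return [x / norm for x in head]


# Hypothetical 4-D "full" embedding, truncated to its first 2 dimensions.
full = [3.0, 4.0, 12.0, 0.0]
small = truncate_and_renormalize(full, 2)
print(small)  # [0.6, 0.8]
```

Re-normalizing matters because the prefix of a unit vector is usually shorter than unit length, which would otherwise distort dot-product scores.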

MRL has been adopted across the industry. OpenAI's text-embedding-3 family lets callers request 256, 512, 1024, or 1536 dimensions from text-embedding-3-small and any dimension up to 3,072 from text-embedding-3-large.[25] Voyage AI's voyage-3-large produces 256, 512, 1024, or 2048-dimensional vectors from a single model.[26] Cohere Embed v4 supports 256, 512, 1024, and 1536-dimensional outputs. Google's Gemini Embedding family also uses MRL. The practical effect is that a single API call can serve high-quality "big" vectors for re-ranking and small "sketch" vectors for first-pass retrieval at no additional inference cost.

### Quantized embeddings

A complementary approach to dimension reduction is precision reduction. Instead of using 32-bit floats, embeddings can be stored as 16-bit floats, 8-bit integers, or even single-bit binary values. Cohere announced native support for int8 and binary embeddings in March 2024, reporting 4x and 32x reductions in memory and up to 40x faster vector search while keeping 90 to 98 percent of the original retrieval quality.[27] For Wikipedia at scale, this brings the storage of 42 million 1,024-dimensional vectors from roughly 160 GB (float32) down to around 5 GB (binary).[27] Voyage AI's voyage-3-large reports that 512-dimensional binary embeddings outperform full-precision 3,072-dimensional OpenAI vectors while requiring 200 times less storage.[26] The combination of MRL and quantization-aware training has effectively decoupled embedding quality from storage cost.
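Two simple quantization schemes illustrate the idea. This is a generic sketch, not the method of any particular vendor: the int8 variant uses a per-vector symmetric scale, whereas production systems often calibrate ranges over a sample of the corpus, and the binary variant simply thresholds each component at zero.

```python
def quantize_int8(vector):
    """Symmetric scalar quantization: the largest |component| maps to 127."""
    scale = max(abs(x) for x in vector) / 127.0
    if scale == 0.0:
        return [0] * len(vector), 1.0
    codes = [round(x / scale) for x in vector]
    return codes, scale


def dequantize_int8(codes, scale):
    """Approximate reconstruction of the original floats."""
    return [c * scale for c in codes]


def quantize_binary(vector):
    """One bit per dimension: 1 for positive components, 0 otherwise."""
    return [1 if x > 0 else 0 for x in vector]


# Hypothetical 4-D embedding.
vector = [0.5, -1.0, 0.25, 0.0]

codes, scale = quantize_int8(vector)
bits = quantize_binary(vector)
print(codes, scale)
print(bits)  # [1, 0, 1, 0]

# Storage per 1,024-dimensional vector under each format, in bytes.
print(1024 * 4, 1024 * 1, 1024 // 8)  # 4096 1024 128
```

The last line shows where the 4x (int8) and 32x (binary) memory reductions come from.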

## How do you measure similarity between embedding vectors?

Measuring the distance or similarity between embedding vectors is central to almost every application. Three measures dominate in practice, with two more appearing in specialized settings.

### Cosine similarity

[Cosine similarity](/wiki/cosine_similarity) measures the angle between two vectors, ignoring their magnitudes:

$$
\cos(A, B) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}
$$

It ranges from -1 (opposite directions) to 1 (identical direction), with 0 indicating orthogonality. Cosine similarity is the default metric for text embeddings because it focuses on the direction of the vector (which encodes meaning) rather than its length (which can vary with document length or word frequency).

### Dot product

The [dot product](/wiki/dot_product) multiplies corresponding elements and sums them:

$$
\mathrm{dot}(A, B) = \sum_i a_i b_i
$$

Unlike cosine similarity, the dot product is sensitive to vector magnitude. When both direction and magnitude carry meaningful information (for example, when longer vectors indicate higher confidence or popularity), the dot product is appropriate. When vectors are L2-normalized to unit length, the dot product and cosine similarity produce identical rankings. Most production retrieval systems normalize once at insert time and then use the dot product at query time, because matrix-multiply hardware is heavily optimized for this operation.

### Euclidean distance

[Euclidean distance](/wiki/euclidean_distance) measures the straight-line distance between two points:

$$
d(A, B) = \sqrt{\sum_i (a_i - b_i)^2}
$$

It is sensitive to both direction and magnitude. Euclidean distance is useful in clustering scenarios (such as k-means) and when the absolute position in the space matters.

### Manhattan and Hamming distance

Manhattan distance (also called L1 or taxicab distance) sums the absolute differences across dimensions and is occasionally used in high-dimensional retrieval where individual feature differences matter more than the squared overall distance. Hamming distance counts the number of differing bits and is the natural metric for binary quantized embeddings; modern hardware can compute Hamming distance over 1,024-bit vectors with a handful of XOR and popcount instructions (for example, sixteen 64-bit words), which is what makes binary embeddings so fast.
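Both distances are short to write down. In this sketch, binary embeddings are packed into Python integers so that Hamming distance becomes an XOR followed by a bit count; the helper names are illustrative, and optimized libraries perform the same operation over machine words.

```python
def manhattan(u, v):
    """L1 distance: sum of absolute component differences."""
    return sum(abs(a - b) for a, b in zip(u, v))


def pack_bits(bits):
    """Pack a list of 0/1 values into a single integer, first bit most significant."""
    value = 0
    for bit in bits:
        value = (value << 1) | bit
    return value


def hamming(a, b):
    """Number of differing bits between two packed binary vectors."""
    return bin(a ^ b).count("1")


print(manhattan([1.0, 2.0, 3.0], [2.0, 0.0, 3.0]))  # 3.0

a = pack_bits([1, 0, 1, 1, 0, 0, 1, 0])
b = pack_bits([1, 1, 1, 0, 0, 0, 1, 1])
print(hamming(a, b))  # 3
```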

| Metric | Considers magnitude | Range | Best for |
|---|---|---|---|
| Cosine similarity | No | $$[-1, 1]$$ | Text similarity, semantic search |
| Dot product | Yes | $$(-\infty, +\infty)$$ | Recommendation, ranking with confidence |
| Euclidean distance | Yes | $$[0, +\infty)$$ | Clustering, spatial analysis |
| Manhattan distance | Yes | $$[0, +\infty)$$ | Robust distance with outliers |
| Hamming distance | n/a | $$[0, d]$$ | Binary quantized embeddings |

When vectors are normalized, cosine similarity, the dot product, and Euclidean distance produce equivalent rankings, so the choice matters most when embeddings are not normalized.
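A quick numerical check of this equivalence, using a hypothetical query and four hypothetical documents in two dimensions. After L2 normalization, ranking by cosine similarity, by dot product, and by ascending Euclidean distance gives the same order.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def rank(scores, descending=True):
    """Indices of the documents sorted by score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=descending)


query = normalize([1.0, 2.0])
docs = [normalize(d) for d in ([2.0, 1.0], [1.0, 3.0], [-1.0, 0.5], [3.0, 3.0])]

by_cosine = rank([cosine(query, d) for d in docs])
by_dot = rank([dot(query, d) for d in docs])
by_euclid = rank([euclidean(query, d) for d in docs], descending=False)

print(by_cosine, by_dot, by_euclid)  # [1, 3, 0, 2] for all three
```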

## What is the best embedding model?

The embedding landscape has advanced rapidly beyond Word2Vec and GloVe. Modern models are typically based on [transformer](/wiki/transformer) architectures, trained with contrastive objectives, and evaluated on the Massive Text Embedding Benchmark (MTEB).

| Model | Provider | Dimensions | MTEB / benchmark | Open source | Notable features |
|---|---|---|---|---|---|
| text-embedding-3-large | [OpenAI](/wiki/openai) | 3,072 (Matryoshka) | MTEB 64.6 | No | 8,191 token input, $0.13/M tokens |
| text-embedding-3-small | OpenAI | 1,536 (Matryoshka) | MTEB 62.3 | No | $0.02/M tokens |
| Embed v4 | Cohere | 1,536, 1,024, 512, 256 | Multimodal benchmark leader | No | Text + image, 128k context, int8 and binary outputs |
| BGE-M3 | BAAI | 1,024 | MIRACL state-of-the-art | Yes | 100+ languages, dense + sparse + multi-vector |
| NV-Embed-v2 | NVIDIA | 4,096 | MTEB 72.31 | Yes | Mistral-7B base, latent-attention pooling |
| jina-embeddings-v3 | Jina AI | 1,024 (Matryoshka) | MTEB 65 (sub-1B) | Yes | 89 languages, late chunking, 8,192 tokens |
| voyage-3-large | Voyage AI | 2,048 (Matryoshka) | +9.7% over OpenAI v3 large | No | int8 + binary outputs, 32k context |
| voyage-code-3 | Voyage AI | 1,024 (Matryoshka) | Code retrieval benchmark | No | Code-specialized, quantization-aware |
| E5-Mistral-7B-instruct | Microsoft | 4,096 | MTEB 66.6 | Yes | Instruction-tuned, Mistral 7B base |
| all-MiniLM-L6-v2 | Sentence-Transformers | 384 | MTEB 56.3 | Yes | 22M parameters, runs on CPU |
| Gemini Embedding | Google | up to 3,072 (Matryoshka) | Top of MTEB multilingual | No | Multilingual, gemini-embedding-001 |

As of the public MTEB leaderboard in early 2026, Google's gemini-embedding-001 held the top overall position with a multilingual task mean of 68.32, ahead of the next best model by about 5 points.[29] The Gemini Embedding technical report states that the model "substantially outperforms prior state-of-the-art models" and "achieves state-of-the-art performance across MMTEB's multilingual, English, and code benchmarks."[29] Open-source models continue to close the gap: NVIDIA's NV-Embed-v2 reached the top of the original English MTEB in August 2024 with a score of 72.31,[23] showing how quickly the leaderboard turns over. The two scores come from different task sets and are not directly comparable.

Key trends in the modern embedding model landscape include instruction-tuned embeddings (where the query includes a task description like "Retrieve passages about..."), multilingual and multimodal support, Matryoshka and quantization for storage, and the steady closing of the gap between open-source models and proprietary APIs.

## Vector databases and indexes

As embedding vectors have become central to AI applications, a new category of infrastructure called **vector databases** has emerged to store, index, and search over large collections of embeddings efficiently.

| System | Type | Key strengths | Typical scale |
|---|---|---|---|
| [Pinecone](/wiki/pinecone) | Managed cloud service | Ease of use, automatic scaling, serverless billing | Millions to billions |
| [Milvus](/wiki/milvus) | Open source (with Zilliz Cloud) | High throughput, distributed architecture | Billions of vectors |
| [Weaviate](/wiki/weaviate) | Open source | Hybrid search (vector + keyword), built-in modules | Millions to billions |
| [Chroma](/wiki/chroma) | Open source | Lightweight, easy local development, popular with LangChain | Thousands to millions |
| Qdrant | Open source | Rust-based, high performance, payload filtering | Millions to billions |
| pgvector | PostgreSQL extension | Integrates with existing Postgres infrastructure | Millions |
| Elasticsearch / OpenSearch | Search engine extension | Combines BM25 with dense vector search | Millions to billions |
| Vespa | Open source | Multi-vector and tensor support (good for ColBERT) | Billions |
| [FAISS](/wiki/faiss) | Library (Meta) | In-memory ANN search, GPU-accelerated | Millions to billions |
| ScaNN | Library (Google) | Anisotropic vector quantization | Millions to billions |

Vector databases use approximate nearest neighbor (ANN) algorithms to search through millions or billions of vectors in milliseconds rather than performing brute-force comparisons. The dominant algorithm in production is HNSW (Hierarchical Navigable Small World), introduced by Yury Malkov and Dmitry Yashunin in 2016, which incrementally builds a multi-layer proximity graph and achieves roughly logarithmic-complexity search by descending from the top layer downward.[18] HNSW is the default or a first-class index type in Milvus, Weaviate, Qdrant, pgvector, Elasticsearch, and OpenSearch, among others.

Other widely used algorithms include IVF (Inverted File Index), which partitions the space into Voronoi cells using k-means and probes the nearest cells at query time; product quantization (PQ), which compresses vectors by splitting them into subvectors and quantizing each subvector independently; and ScaNN, which combines a learned anisotropic quantizer with optimized SIMD scoring. Most production systems combine these methods (for example, IVF-PQ or HNSW-PQ) to balance recall, latency, and memory.
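The IVF idea described above can be sketched in a few lines of NumPy. This is a toy illustration under stated assumptions, not how any production library implements it: the function names (`build_ivf`, `ivf_search`, `sq_dists`) are hypothetical, the k-means step is a plain Lloyd loop with a fixed iteration count, and the data is random.

```python
import numpy as np

def sq_dists(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Squared Euclidean distances between rows of a and rows of b."""
    return ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)

def build_ivf(vectors: np.ndarray, n_cells: int, n_iters: int = 10, seed: int = 0):
    """Partition vectors into cells with a few k-means iterations; return
    the centroids and one inverted list of vector indices per cell."""
    rng = np.random.default_rng(seed)
    centroids = vectors[rng.choice(len(vectors), size=n_cells, replace=False)].copy()
    for _ in range(n_iters):
        assign = np.argmin(sq_dists(vectors, centroids), axis=1)
        for c in range(n_cells):
            members = vectors[assign == c]
            if len(members):                      # keep old centroid if cell is empty
                centroids[c] = members.mean(axis=0)
    assign = np.argmin(sq_dists(vectors, centroids), axis=1)
    lists = [np.flatnonzero(assign == c) for c in range(n_cells)]
    return centroids, lists

def ivf_search(query, vectors, centroids, lists, k=5, nprobe=2):
    """Scan only the nprobe cells whose centroids are nearest to the query."""
    cells = np.argsort(sq_dists(query[None, :], centroids)[0])[:nprobe]
    candidates = np.concatenate([lists[c] for c in cells])
    d = sq_dists(query[None, :], vectors[candidates])[0]
    return candidates[np.argsort(d)[:k]]

rng = np.random.default_rng(0)
data = rng.normal(size=(200, 16)).astype(np.float32)   # hypothetical corpus
query = rng.normal(size=16).astype(np.float32)
centroids, lists = build_ivf(data, n_cells=8)

approx = ivf_search(query, data, centroids, lists, k=5, nprobe=2)
full = ivf_search(query, data, centroids, lists, k=5, nprobe=8)
exact = np.argsort(sq_dists(query[None, :], data)[0])[:5]
print(approx, exact)
```

Raising `nprobe` trades latency for recall: probing every cell reproduces the brute-force result, while probing only a few cells scans a fraction of the corpus and may miss some true neighbors.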

Most vector databases also support metadata filtering, allowing queries like "find the 10 most similar documents to this query vector that were published after 2023 and tagged as health." Hybrid search combines dense vector search with sparse keyword search (BM25 or SPLADE) and a fusion step (typically reciprocal rank fusion) to recover exact term matches that pure dense retrieval can miss.
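Reciprocal rank fusion itself is small enough to show in full. The sketch below assumes each retriever returns a best-first list of document IDs; the IDs are made up, and `k=60` is a commonly used smoothing constant rather than a required value.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of document IDs (each ordered best-first).
    Each document scores the sum of 1 / (k + rank) over the lists it appears in."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)

dense = ["d3", "d1", "d7", "d2"]    # hypothetical dense-retrieval order
sparse = ["d1", "d9", "d3", "d4"]   # hypothetical BM25 order
fused = reciprocal_rank_fusion([dense, sparse])
print(fused)
```

Documents that rank well in both lists (here `d1` and `d3`) rise to the top, and because only ranks are used, the dense and sparse scores never need to be put on a common scale.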

## What are embedding vectors used for?

Embedding vectors power a broad range of practical AI systems.

**Semantic search.** Traditional keyword search fails when the query and document use different words for the same concept. Embedding-based search converts both queries and documents into vectors and finds documents whose vectors are closest to the query vector, enabling results based on meaning rather than exact keyword overlap. A search for "how to fix a leaky faucet" can return a document titled "Repairing a dripping tap" without any literal word overlap.

**Retrieval-augmented generation.** In [retrieval-augmented generation](/wiki/retrieval_augmented_generation) (RAG) systems, a user's question is converted to an embedding vector, the most relevant documents are retrieved from a vector database, and those documents are passed as context to a large [language model](/wiki/language_model) for answer generation. RAG reduces hallucination by grounding the model's output in factual retrieved content and is the standard architecture for building chatbots over private corpora, internal documentation, customer support knowledge bases, and legal or medical archives.

**Recommendation systems.** Users and items (products, movies, songs) are embedded in the same vector space. Recommendations are generated by finding items whose embeddings are closest to the user's embedding or to items the user has previously engaged with. Two-tower retrieval architectures, used at YouTube, TikTok, Pinterest, and Spotify, train a user encoder and an item encoder jointly so that the dot product of their outputs predicts engagement.

**Clustering and topic modeling.** Embedding vectors enable unsupervised grouping of documents, images, or users by applying clustering algorithms like k-means or DBSCAN directly in the embedding space. The combination of an embedding model with HDBSCAN clustering and class-based TF-IDF (BERTopic) has become a popular replacement for older topic models like LDA.

**Classification and zero-shot learning.** A simple linear classifier trained on top of frozen embeddings often matches or beats much more complex end-to-end models on small datasets. CLIP-style multimodal embeddings allow zero-shot classification: at inference time, the embedding of an input image is compared to the embeddings of candidate label phrases ("a photo of a cat", "a photo of a dog"), and the closest label wins, with no labeled training data required.

**Anomaly detection.** Data points whose embedding vectors are distant from all clusters may represent anomalies or novel inputs, making embedding-based approaches useful for fraud detection, network intrusion monitoring, and quality assurance in manufacturing.

**Cross-modal retrieval.** Multimodal embeddings (such as those from CLIP, SigLIP, or Cohere Embed v4) allow searching for images using text queries or finding text descriptions that match a given image, because both modalities share the same embedding space. Modern document retrieval systems built on ColPali and ColQwen extend this idea to PDF page images, eliminating the need for OCR and chunking pipelines.

**Memory for AI agents.** Many [AI agent](/wiki/ai_agent) frameworks store past observations, tool results, and conversational history as embedding vectors in a vector database, then retrieve the most relevant memories at each step. This gives the agent a form of long-term memory that scales beyond the model's context window.

**Bioinformatics and chemistry.** Protein language models (ESM, ProtT5) produce per-residue embeddings that have largely replaced hand-crafted sequence features for tasks like contact prediction and function annotation. Molecular embedding models (Mol2Vec, ChemBERTa, MolFormer) play an analogous role in drug discovery.

## Quality benchmarks

Embedding model quality is increasingly evaluated on standardized benchmarks rather than narrow downstream tasks.

**MTEB (Massive Text Embedding Benchmark).** Introduced by Niklas Muennighoff and colleagues in 2022 and published at EACL 2023, MTEB covers 8 task families (classification, clustering, pair classification, reranking, retrieval, semantic textual similarity, summarization, and bitext mining) across 58 datasets and 112 languages.[16] The original paper benchmarked 33 models and concluded that no single model dominated across all tasks.[16] MTEB has since become the de facto leaderboard for text embeddings, hosted by Hugging Face and continuously updated. In August 2024, NVIDIA's NV-Embed-v2 reached the top spot with a score of 72.31; the leaderboard continues to evolve as new models are submitted.[23]

**MIRACL.** A multilingual retrieval benchmark covering 18 languages, used to evaluate models for non-English search. BGE-M3 reported state-of-the-art MIRACL performance at its release.[22]

**BEIR.** A zero-shot information retrieval benchmark with 18 datasets covering question answering, fact checking, citation prediction, and other domains. BEIR is now folded into the retrieval portion of MTEB.

**LoCo.** A long-context retrieval benchmark used to evaluate embedding models like jina-embeddings-v3 and Voyage's long-context offerings.[24]

**MMTEB.** The Massive Multilingual Text Embedding Benchmark extends MTEB to more than 250 languages and is the benchmark on which models like Llama-Embed-Nemotron and the Gemini Embedding family report their headline results.

**Code retrieval.** CodeSearchNet and the more recent CoIR (Code Information Retrieval) benchmarks measure code-search quality and motivate code-specific models like voyage-code-3.

## Modern advances

Beyond the steady march of larger and better models, several specific innovations have shaped embedding research between 2022 and 2026.

**Late interaction (ColBERTv2, ColPali, ColQwen).** Standard embedding models pool a sequence of token vectors into a single vector before comparison. ColBERT (Khattab and Zaharia, 2020) instead keeps one vector per token and scores a query-document pair using the MaxSim operator, summing the maximum dot product between each query token and any document token.[19] ColBERTv2 (Santhanam et al., 2022) added denoised supervision and residual compression, reducing the index size by 6x to 10x while improving quality.[20] ColPali (Faysse et al., 2024) extended the same idea to vision-language models by treating each PDF page as an image and producing per-patch vectors, allowing PDF retrieval without OCR or layout parsing.[21] ColQwen replaces the underlying VLM with Qwen2-VL for multilingual support.
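The MaxSim operator described above can be written in a few lines. This is a minimal NumPy sketch with hand-picked toy token vectors; the function name `maxsim_score` is an assumption for illustration, and real systems add normalization, batching, and compressed storage on top.

```python
import numpy as np

def maxsim_score(query_tokens: np.ndarray, doc_tokens: np.ndarray) -> float:
    """Late-interaction score: for each query token, take its best-matching
    document token (maximum dot product), then sum over query tokens."""
    sim = query_tokens @ doc_tokens.T     # (n_query_tokens, n_doc_tokens)
    return float(sim.max(axis=1).sum())

# Toy per-token vectors (hypothetical values chosen for easy checking).
q = np.array([[1.0, 0.0], [0.0, 1.0]])
d = np.array([[0.5, 0.5], [1.0, 0.0], [0.0, -1.0]])
score = maxsim_score(q, d)
print(score)
```

Here the first query token matches the second document token exactly (1.0) and the second query token's best match is the first document token (0.5), so the pair scores 1.5.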

**Multi-vector embeddings.** Late-interaction is one example of a broader move away from single-vector representations. Multi-vector retrievers store several vectors per document (one per token, one per chunk, or one per aspect) and aggregate similarities at query time. Vector databases like Vespa and Milvus added native support for multi-vector documents to enable these workflows.

**Instruction-tuned embeddings.** Models like InstructOR, E5-Mistral, and the BGE instruction series accept a task description as part of the input ("Retrieve passages that answer this scientific question: ..."). The same backbone can specialize on retrieval, classification, clustering, or symmetric similarity by changing the instruction, which improves transfer to new tasks without retraining.

**LLM-based embedding models.** A 2024 trend was to bootstrap embedding models from large decoder-only LLMs like Mistral and Llama, often using contrastive fine-tuning with synthetic queries generated by another LLM. NV-Embed-v2, E5-Mistral, gte-Qwen2-7B-instruct, and Llama-Embed-Nemotron all follow this recipe.[23] The resulting models score substantially higher on MTEB than encoder-only baselines but require more memory at inference.

**Domain-adapted embeddings.** General-purpose embeddings often leave 5 to 15 percentage points of retrieval accuracy on the table for specialized domains like medicine, law, finance, and proprietary code. Contrastive fine-tuning on a few thousand domain pairs (or LLM-generated synthetic pairs) usually closes most of this gap.

**Quantization-aware training.** Rather than quantizing embeddings as a post-processing step, models like voyage-3-large and Cohere Embed v4 are trained from the start to produce vectors that survive int8 or binary quantization with minimal loss. Combined with Matryoshka, this lets a single model serve a range of cost-quality trade-offs.

## Operations on embedding vectors

Applications often manipulate embedding vectors directly using a small set of standard operations.

- **L2 normalization.** Dividing a vector by its norm produces a unit vector. Most retrieval systems normalize once and then use the dot product as a fast proxy for cosine similarity.
- **Mean pooling.** Averaging the vectors of multiple tokens, sentences, or items produces a representative vector for the group. This is the standard way to get a sentence vector from a stack of token vectors.
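- **Max pooling.** Taking the elementwise maximum across a set of vectors, used in some image retrieval pipelines. Late-interaction scoring (MaxSim) uses a related but distinct operation: it takes the maximum over similarity scores rather than over vector components.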
- **Concatenation.** Joining two or more vectors end-to-end produces a single longer vector, often used to combine modalities (text plus image plus metadata).
- **Element-wise operations.** Element-wise sums, differences, and products are used to build features for downstream classifiers (for example, the sentence-pair feature $$[u, v, \lvert u-v \rvert]$$ used in the Sentence-BERT classification training objective, which earlier sentence encoders such as InferSent extend with the product $$u \odot v$$).
- **Truncation.** Slicing the first k components of a Matryoshka-trained vector yields a valid lower-dimensional embedding.
- **Whitening and isotropy correction.** Linear transformations that decorrelate dimensions and rescale variance, often used to improve similarity calibration.
- **Quantization.** Mapping float32 components to int8 or single-bit binary values for storage and faster comparison.
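Several of these operations can be sketched together in NumPy. The helper names are hypothetical, and a few choices are simplifying assumptions rather than standard practice: truncation is followed by renormalization, int8 quantization uses a fixed scale of 127 (reasonable only because unit-norm components lie in [-1, 1]), and binary quantization keeps just the sign.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Divide each vector by its L2 norm (eps guards against zero vectors)."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def mean_pool(token_vectors, attention_mask):
    """Average token vectors per sequence, ignoring padded positions (mask == 0)."""
    mask = attention_mask[..., None].astype(token_vectors.dtype)
    return (token_vectors * mask).sum(axis=-2) / np.maximum(mask.sum(axis=-2), 1.0)

def truncate(x, k):
    """Keep the first k components and renormalize; only meaningful for
    models trained with a Matryoshka-style objective."""
    return l2_normalize(x[..., :k])

def quantize_int8(x):
    """Symmetric scalar quantization of unit-norm vectors to int8."""
    return np.clip(np.round(x * 127.0), -127, 127).astype(np.int8)

def quantize_binary(x):
    """One bit per dimension (the sign), packed 8 bits per byte."""
    return np.packbits(x > 0, axis=-1)

rng = np.random.default_rng(0)
tokens = rng.normal(size=(2, 4, 16))             # (batch, tokens, dim), toy data
mask = np.array([[1, 1, 1, 0], [1, 1, 0, 0]])    # 0 marks padding

sentence = l2_normalize(mean_pool(tokens, mask)) # (2, 16) unit vectors
short = truncate(sentence, 8)                    # (2, 8)
codes = quantize_int8(sentence)                  # (2, 16) int8
bits = quantize_binary(sentence)                 # (2, 2) uint8, 16 bits each
print(sentence.shape, short.shape, codes.dtype, bits.shape)
```

Production systems typically calibrate quantization ranges on a sample of real embeddings instead of using a fixed scale.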

## Visualizing embedding vectors

High-dimensional embedding vectors cannot be directly plotted, so [dimension reduction](/wiki/dimension_reduction) techniques are used to project them into two or three dimensions for visualization.

- **t-SNE** (t-Distributed Stochastic Neighbor Embedding), introduced by van der Maaten and Hinton in 2008, is a nonlinear technique that preserves local neighborhood structure.[14] It excels at revealing clusters in the data, but the exact algorithm has $$O(n^2)$$ time complexity (Barnes-Hut approximations reduce this to $$O(n \log n)$$) and it may distort global relationships between distant clusters.
- **UMAP** (Uniform Manifold Approximation and Projection), introduced by McInnes et al. in 2018, is another nonlinear technique that is generally reported to preserve more global structure than t-SNE while running significantly faster.[15] UMAP has become a preferred method for large-scale embedding visualization.
- **PCA.** [Principal component analysis](/wiki/principal_component_analysis) is a linear method that projects onto the directions of maximum variance. It is fast and deterministic but may miss nonlinear structure.
- **PaCMAP and TriMap** are newer methods that aim to preserve a balance of local and global structure with stable, reproducible outputs.

| Method | Preserves local structure | Preserves global structure | Speed | Best for |
|---|---|---|---|---|
| t-SNE | Excellent | Poor | Slow ($$O(n^2)$$) | Small to medium datasets, cluster discovery |
| UMAP | Excellent | Good | Fast | Large datasets, interactive exploration |
| PCA | Moderate | Good | Very fast | Quick overview, preprocessing step |
| PaCMAP | Good | Good | Medium | Reproducible, balanced visualization |

Visualization is useful for quality assurance (checking that semantically similar items cluster together), dataset exploration, and communicating results to non-technical stakeholders. Tools like the TensorFlow Embedding Projector, Atlas (Nomic AI), and the Weights & Biases built-in projector make interactive UMAP and PCA exploration straightforward.
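Of the methods above, PCA is simple enough to sketch directly with NumPy's SVD. The function name `pca_project` and the random stand-in embeddings are assumptions for illustration; t-SNE and UMAP need dedicated libraries and are not shown.

```python
import numpy as np

def pca_project(vectors: np.ndarray, n_components: int = 2) -> np.ndarray:
    """Project rows onto the top principal components via SVD of the centered data."""
    centered = vectors - vectors.mean(axis=0, keepdims=True)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(100, 32))   # stand-in for real embedding vectors
coords = pca_project(embeddings)          # (100, 2) points for a scatter plot
print(coords.shape)
```

The two output columns can be passed straight to any plotting library as x and y coordinates.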

## Fine-tuning embeddings

Pre-trained embedding models provide strong general-purpose representations, but fine-tuning on domain-specific data can significantly improve performance for specialized applications.

**Contrastive fine-tuning** is the most common approach. The model is trained on pairs (or triplets) of examples: positive pairs should be pulled closer together in the embedding space, and negative pairs should be pushed apart. For a legal document retrieval system, for example, positive pairs might be (legal question, relevant statute) while negative pairs are (legal question, irrelevant statute). Common losses include InfoNCE, multiple-negatives ranking loss, triplet loss, and the Matryoshka loss for nested-dimensional training.
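As an illustration of the pull-together, push-apart objective, the following NumPy sketch computes an in-batch InfoNCE loss. It only evaluates the loss on fixed vectors (no gradients or training loop), and the function name, the temperature of 0.05, and the toy data are assumptions rather than settings from any particular model.

```python
import numpy as np

def info_nce_loss(queries: np.ndarray, positives: np.ndarray,
                  temperature: float = 0.05) -> float:
    """In-batch InfoNCE: row i of `positives` is the positive for row i of
    `queries`; every other row in the batch acts as a negative."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = (q @ p.T) / temperature               # (batch, batch) scaled cosines
    logits -= logits.max(axis=1, keepdims=True)    # stabilize the softmax
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))     # cross-entropy on the diagonal

rng = np.random.default_rng(0)
anchors = rng.normal(size=(8, 32))
aligned = anchors + 0.01 * rng.normal(size=(8, 32))   # near-duplicate positives
loss_aligned = info_nce_loss(anchors, aligned)
loss_random = info_nce_loss(anchors, rng.normal(size=(8, 32)))
print(loss_aligned, loss_random)
```

The loss is low when each query is closest to its own positive and high when the pairing is no better than chance, which is the signal a fine-tuning run minimizes.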

**When to fine-tune.** General-purpose embeddings often fall short in domains with specialized vocabulary, such as medicine, law, finance, or technical engineering. Domain-specific fine-tuning is often reported to improve retrieval accuracy by 5 to 15 percentage points with only a few thousand training examples.

**Synthetic data for fine-tuning.** When labeled training pairs are scarce, large language models can generate synthetic query-document pairs for contrastive training. This approach, sometimes called LLM-augmented retrieval, has proven effective for bootstrapping domain-specific embedding models. The same recipe also produces hard negatives by perturbing positive examples in plausible-but-wrong ways.

**Distillation.** Smaller models can be trained to mimic the embedding outputs of larger teacher models, producing fast student models with much of the teacher's quality. The MiniLM and bge-small families were trained this way.

## Implementation

A modern embedding pipeline involves a small number of widely used libraries and APIs.

- **sentence-transformers** (Python) is the de facto open-source library for sentence and document embeddings. A typical workflow is `model = SentenceTransformer("BAAI/bge-large-en-v1.5"); vectors = model.encode(texts, normalize_embeddings=True)`.
- **Hugging Face Transformers** exposes a feature-extraction pipeline and gives access to any pretrained encoder for custom pooling.
- **OpenAI Python SDK** provides a one-call interface: `client.embeddings.create(model="text-embedding-3-large", input=texts, dimensions=1024)`. The optional `dimensions` argument leverages Matryoshka truncation.
- **Cohere, Voyage AI, and Jina SDKs** offer similar one-call interfaces with model-specific options for input type, encoding format (float, int8, binary), and instruction prefixes.
- **LangChain and LlamaIndex** wrap dozens of embedding providers behind a common interface and integrate with vector stores for retrieval pipelines.
- **FAISS, hnswlib, ScaNN, and Annoy** are battle-tested libraries for in-process ANN search; vector databases and search engines like Pinecone, Weaviate, Milvus, Qdrant, Vespa, pgvector, and Elasticsearch handle the same job at the service level.

## Embedding spaces

An [embedding space](/wiki/embedding_space) is the continuous vector space in which embedding vectors reside. The geometry of this space encodes relationships between the objects being represented.

Well-trained embedding spaces exhibit several structural properties. Linear directions in the space correspond to semantic relationships (the "king minus man plus woman is approximately queen" phenomenon). Distances between points reflect semantic similarity. Subspaces may correspond to specific attributes (gender, tense, formality, sentiment polarity). These properties are related to the manifold hypothesis, which posits that high-dimensional data tends to concentrate near lower-dimensional manifolds, and good embedding models learn to map data onto these manifolds.

The quality of an embedding space depends on the training objective, the diversity and size of the training data, and the model architecture. Contrastive learning objectives (such as those used in CLIP and modern text embedding models) tend to produce well-structured spaces where similarity-based retrieval works reliably, while pure language modeling objectives produce more anisotropic spaces that often need additional whitening or contrastive fine-tuning before they work well for retrieval.

## What are the limitations of embedding vectors?

Despite their ubiquity, embedding vectors have several important limitations.

- **Bias.** Embedding vectors trained on human-generated data inherit the biases present in that data, including gender, racial, and cultural stereotypes. Bolukbasi et al. (2016) famously demonstrated that word2vec embeddings trained on Google News text encode the analogy "man is to computer programmer as woman is to homemaker."[28] Debiasing techniques exist (projecting out specific bias directions, augmenting training data, post-hoc calibration) but do not fully eliminate the problem.
- **Interpretability.** Individual dimensions of an embedding vector rarely correspond to human-interpretable features. Sparse autoencoders and dictionary-learning techniques have started to recover human-readable features from embedding spaces, but interpretability remains an open research area.
- **Domain mismatch.** Embeddings trained on one domain (for example, news articles) may perform poorly on another domain (for example, biomedical text or social media) without fine-tuning.
- **Static vs. contextualized.** Static embedding methods (Word2Vec, GloVe) assign a single vector per word, failing to distinguish senses of polysemous words. Contextualized models like BERT address this but at significantly higher computational cost.
- **Dimensionality-accuracy trade-off.** Lower dimensions reduce cost but sacrifice representational capacity. Higher dimensions improve accuracy but increase storage, latency, and memory requirements. Matryoshka and quantization mitigate this trade-off but do not eliminate it.
- **Lost information.** Compressing a long document into a single 1,024-dimensional vector necessarily discards detail. Long documents often need to be chunked and embedded separately, and the resulting chunk vectors may lose cross-chunk context unless techniques like late chunking are used.
- **Catastrophic forgetting.** Fine-tuning embeddings for a new domain can degrade performance on previously learned domains unless careful techniques like continual learning, replay, or multi-task training are applied.
- **Anisotropy and similarity calibration.** Many embedding spaces concentrate vectors in a narrow cone, so absolute similarity values are hard to interpret across models or even across query types within the same model. Threshold-based decisions usually need to be calibrated per use case.
- **Adversarial vulnerability.** Like other neural network outputs, embedding vectors can be manipulated by adversarial inputs (small perturbations to text or images that change the embedding substantially), which has security implications for embedding-based search and moderation systems.

## See also

- [Embeddings](/wiki/embeddings)
- [Embedding space](/wiki/embedding_space)
- [Word embedding](/wiki/word_embedding)
- [Word2Vec](/wiki/word2vec)
- [GloVe](/wiki/glove)
- [BERT](/wiki/bert_bidirectional_encoder_representations_from_transformers)
- [Sentence-BERT](/wiki/sentence-bert)
- [CLIP](/wiki/clip)
- [Cosine similarity](/wiki/cosine_similarity)
- [Vector database](/wiki/vector_database)
- [Retrieval-augmented generation](/wiki/retrieval_augmented_generation)
- [FAISS](/wiki/faiss)
- [Pinecone](/wiki/pinecone)

## References

1. Hinton, G.E., McClelland, J.L., & Rumelhart, D.E. (1986). "Distributed Representations." In *Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Volume 1* (chapter 3). MIT Press.
2. Deerwester, S., Dumais, S.T., Furnas, G.W., Landauer, T.K., & Harshman, R. (1990). "Indexing by Latent Semantic Analysis." *Journal of the American Society for Information Science*, 41(6), pp. 391-407.
3. Bengio, Y., Ducharme, R., Vincent, P., & Jauvin, C. (2003). "A Neural Probabilistic Language Model." *Journal of Machine Learning Research*, 3, pp. 1137-1155.
4. Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). "Efficient Estimation of Word Representations in Vector Space." *arXiv:1301.3781*.
5. Pennington, J., Socher, R., & Manning, C.D. (2014). "GloVe: Global Vectors for Word Representation." *Proceedings of EMNLP 2014*, pp. 1532-1543.
6. Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2017). "Enriching Word Vectors with Subword Information." *Transactions of the Association for Computational Linguistics*, 5, pp. 135-146.
7. Le, Q., & Mikolov, T. (2014). "Distributed Representations of Sentences and Documents." *Proceedings of ICML 2014*.
8. Peters, M.E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., & Zettlemoyer, L. (2018). "Deep Contextualized Word Representations." *Proceedings of NAACL-HLT 2018*.
9. Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." *arXiv:1810.04805*.
10. Reimers, N., & Gurevych, I. (2019). "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks." *Proceedings of EMNLP-IJCNLP 2019*.
11. Radford, A., Kim, J.W., Hallacy, C., et al. (2021). "Learning Transferable Visual Models From Natural Language Supervision." *Proceedings of ICML 2021*.
12. Grover, A., & Leskovec, J. (2016). "node2vec: Scalable Feature Learning for Networks." *Proceedings of KDD 2016*, pp. 855-864.
13. Hamilton, W.L., Ying, R., & Leskovec, J. (2017). "Inductive Representation Learning on Large Graphs." *Advances in Neural Information Processing Systems*, 30.
14. van der Maaten, L., & Hinton, G. (2008). "Visualizing Data using t-SNE." *Journal of Machine Learning Research*, 9, pp. 2579-2605.
15. McInnes, L., Healy, J., & Melville, J. (2018). "UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction." *arXiv:1802.03426*.
16. Muennighoff, N., Tazi, N., Magne, L., & Reimers, N. (2023). "MTEB: Massive Text Embedding Benchmark." *Proceedings of EACL 2023*.
17. Kusupati, A., Bhatt, G., Rege, A., et al. (2022). "Matryoshka Representation Learning." *Advances in Neural Information Processing Systems*, 35.
18. Malkov, Y.A., & Yashunin, D.A. (2016). "Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs." *arXiv:1603.09320*.
19. Khattab, O., & Zaharia, M. (2020). "ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT." *Proceedings of SIGIR 2020*.
20. Santhanam, K., Khattab, O., Saad-Falcon, J., Potts, C., & Zaharia, M. (2022). "ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction." *Proceedings of NAACL 2022*.
21. Faysse, M., Sibille, H., Wu, T., et al. (2024). "ColPali: Efficient Document Retrieval with Vision Language Models." *arXiv:2407.01449*.
22. Chen, J., Xiao, S., Zhang, P., Luo, K., Lian, D., & Liu, Z. (2024). "BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation." *arXiv:2402.03216*.
23. Lee, C., Roy, R., Xu, M., Raiman, J., Shoeybi, M., Catanzaro, B., & Ping, W. (2024). "NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models." *arXiv:2405.17428*.
24. Sturua, S., Mohr, I., Akram, M.K., et al. (2024). "jina-embeddings-v3: Multilingual Embeddings With Task LoRA." *arXiv:2409.10173*.
25. OpenAI. (2024). "New embedding models and API updates." Released January 25, 2024.
26. Voyage AI. (2025). "voyage-3-large: the new state-of-the-art general-purpose embedding model." Released January 7, 2025.
27. Cohere. (2024). "Cohere int8 and binary Embeddings: Scale Your Vector Database to Large Datasets." March 26, 2024.
28. Bolukbasi, T., Chang, K.-W., Zou, J.Y., Saligrama, V., & Kalai, A.T. (2016). "Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings." *Advances in Neural Information Processing Systems*, 29.
29. Lee, J., Chen, F., Dua, S., et al. (2025). "Gemini Embedding: Generalizable Embeddings from Gemini." *arXiv:2503.07891*. Leaderboard score from the public MTEB leaderboard, Hugging Face, accessed 2026.