An embedding space is a continuous, typically high-dimensional vector space in which data objects (words, sentences, images, users, or other entities) are represented as dense numerical vectors called embedding vectors. The core principle is that the geometric relationships between vectors in this space encode meaningful semantic, structural, or functional relationships between the objects they represent. Items that are similar in some task-relevant sense are mapped to nearby points, while dissimilar items are mapped far apart.
Embedding spaces are fundamental to modern machine learning and underpin applications ranging from natural language processing and computer vision to recommendation systems and information retrieval. Rather than working with raw, sparse, or symbolic data, models project inputs into a shared continuous space where distances and directions carry meaning. This enables efficient computation of similarity measures, supports generalization to unseen data, and allows different data modalities to be compared directly.
To build intuition, imagine you have a huge toy box full of different Lego pieces. Some are red, some are blue; some are big, some are small; some are flat, some are tall. Now imagine you could create a magical map where every Lego piece gets its own spot. Pieces that are alike (same color, same shape) sit close together on the map, and pieces that are very different sit far apart.
In machine learning, an embedding space is that magical map. Instead of Lego pieces, a computer places words, pictures, or songs onto the map. The word "happy" would sit near "joyful" but far from "sad." A photo of a cat would sit near other cat photos but far from pictures of trucks. The computer uses these maps to understand that things close together are related, which helps it do jobs like translating languages, recommending movies, or searching for similar images.
Mathematically, an embedding is a function f: X → ℝ^d that maps elements from a discrete or high-dimensional input space X into a d-dimensional real-valued vector space ℝ^d. The dimensionality d is typically much smaller than the original input dimensionality, though the term "embedding space" applies regardless of whether dimensionality is reduced. Common embedding dimensions include 128, 256, 512, 768, and 1,024, depending on the model architecture and task.
The embedding function f is usually learned through training a neural network on a task-specific objective. During training, the network adjusts the mapping so that the resulting vector space satisfies desired properties, such as placing semantically similar inputs near each other according to cosine similarity or Euclidean distance.
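As a minimal illustration (a PyTorch sketch with arbitrary sizes, not any particular model's architecture), the simplest learned embedding function is a lookup table whose rows are the embedding vectors and are adjusted during training:

```python
# Minimal sketch: an embedding layer is a learnable lookup table mapping
# discrete IDs to d-dimensional vectors; training moves these vectors so that
# task-relevant similarity becomes geometric proximity.
import torch
import torch.nn as nn

vocab_size, d = 10_000, 256               # illustrative sizes
embedding = nn.Embedding(vocab_size, d)

token_ids = torch.tensor([42, 1337, 7])   # three arbitrary item IDs
vectors = embedding(token_ids)            # shape: (3, 256)
print(vectors.shape)
```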
Embedding spaces exhibit several important properties that make them useful for machine learning.
Similar items cluster together in embedding space. For example, in a word embedding space trained on English text, words like "king," "queen," "prince," and "princess" form a cluster distinct from words like "car," "truck," and "bicycle." This clustering emerges automatically from the training objective without explicit supervision about word categories.
The distance between two points in an embedding space reflects their degree of similarity or relatedness. Two common measures are cosine similarity (the cosine of the angle between vectors) and Euclidean distance (the straight-line distance between points). Cosine similarity is the more widely used measure in practice because it is invariant to vector magnitude and focuses on directional similarity.
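A small NumPy sketch of both measures, using toy vectors purely for illustration:

```python
# Comparing two embedding vectors with the two measures described above.
import numpy as np

a = np.array([0.2, 0.7, 0.1])    # toy embedding vectors
b = np.array([0.25, 0.6, 0.05])

cosine_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
euclidean_dist = np.linalg.norm(a - b)

print(f"cosine similarity: {cosine_sim:.3f}")     # near 1.0 -> similar direction
print(f"euclidean distance: {euclidean_dist:.3f}")  # near 0.0 -> nearby points
```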
Well-trained embedding spaces support meaningful arithmetic operations on vectors. The most famous example comes from Word2Vec: the vector operation king - man + woman yields a vector close to queen. This property, sometimes called the "parallelogram rule," shows that embedding spaces can encode relational concepts as consistent vector offsets. The relationship "male to female" is captured by approximately the same direction in the space regardless of which word pair is considered.
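The sketch below illustrates this offset arithmetic using gensim's pre-packaged GloVe vectors; the dataset name and the download step are assumptions about the local environment, and any similar pre-trained word vectors would behave comparably:

```python
# Vector-offset analogy sketch with pre-trained GloVe vectors via gensim.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")   # downloads ~100-dim GloVe vectors

# king - man + woman ~= queen
result = vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3)
print(result)   # 'queen' is typically the top hit
```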
Embedding spaces are continuous, meaning that small movements in the space correspond to small, gradual changes in the represented concept. This continuity is what enables interpolation between points in generative models and supports the generalization ability of downstream classifiers.
The effectiveness of embedding spaces is closely related to the manifold hypothesis, which states that real-world high-dimensional data tends to lie on or near low-dimensional manifolds embedded within the higher-dimensional ambient space. For instance, the set of all natural images occupies only a tiny fraction of the space of all possible pixel arrangements. Embedding models learn to identify and parameterize these low-dimensional manifolds, mapping data to a space where the intrinsic structure is made explicit.
This perspective explains why dimension reduction techniques work in practice. The apparent high dimensionality of raw data (millions of pixels, tens of thousands of vocabulary tokens) masks a much lower intrinsic dimensionality dictated by the underlying factors of variation. Neural networks learn embedding functions that capture these factors, discarding noise and irrelevant variation.
Different domains and tasks produce embedding spaces with distinct characteristics.
Word embedding models like Word2Vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014) map individual words to dense vectors, typically of 100 to 300 dimensions. Word2Vec learns embeddings by predicting context words surrounding a target word (skip-gram) or predicting a target word from its context (CBOW). GloVe takes a different approach: it factorizes the global word co-occurrence matrix so that the dot product of two word vectors approximates the logarithm of their co-occurrence probability.
Both methods produce spaces where semantic relationships are encoded geometrically. Synonyms cluster together, and analogical relationships appear as parallel vector offsets. These word embedding spaces laid the groundwork for modern NLP, though they have been largely superseded by contextual embeddings from models like BERT and GPT.
| Model | Training approach | Key property | Typical dimensions |
|---|---|---|---|
| Word2Vec | Predict context words (skip-gram/CBOW) | Local context patterns; vector analogies | 100-300 |
| GloVe | Factorize global co-occurrence matrix | Captures global statistics; log-bilinear model | 50-300 |
| FastText | Subword n-gram skip-gram | Handles out-of-vocabulary words via subword information | 100-300 |
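As a rough illustration of how such embeddings are trained, the following sketch fits a skip-gram Word2Vec model with gensim on a toy corpus; a real corpus would need millions of sentences before the geometry becomes meaningful:

```python
# Minimal skip-gram training sketch with gensim (toy corpus for illustration).
from gensim.models import Word2Vec

corpus = [
    ["the", "king", "ruled", "the", "kingdom"],
    ["the", "queen", "ruled", "the", "kingdom"],
    ["a", "truck", "drove", "down", "the", "road"],
]

model = Word2Vec(
    sentences=corpus,
    vector_size=100,   # embedding dimensionality d
    window=5,          # context window size
    sg=1,              # 1 = skip-gram, 0 = CBOW
    min_count=1,       # keep rare words (needed for a tiny toy corpus)
)
print(model.wv["king"].shape)   # (100,)
```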
Sentence-BERT (Reimers and Gurevych, 2019) and similar models extend word-level embeddings to full sentences. Sentence-BERT uses a siamese network architecture with a pre-trained BERT backbone to produce fixed-size sentence embeddings where cosine similarity directly corresponds to semantic similarity. This makes operations like semantic search and clustering computationally efficient: finding the most similar pair in a collection of 10,000 sentences drops from roughly 65 hours with a BERT cross-encoder to about 5 seconds with Sentence-BERT, while maintaining comparable accuracy.
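A minimal sketch with the sentence-transformers library; the checkpoint name is an assumption, and any Sentence-BERT-style model would work similarly:

```python
# Sentence embeddings whose cosine similarity tracks semantic similarity.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")   # assumed example checkpoint

sentences = [
    "A man is playing a guitar.",
    "Someone is performing music on a guitar.",
    "The stock market fell sharply today.",
]
embeddings = model.encode(sentences, convert_to_tensor=True)

print(util.cos_sim(embeddings[0], embeddings[1]))  # high: paraphrases
print(util.cos_sim(embeddings[0], embeddings[2]))  # low: unrelated topics
```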
In computer vision, convolutional neural networks and vision transformers learn hierarchical feature representations that form image embedding spaces. The penultimate layer of a trained image classifier (before the classification head) typically serves as a general-purpose image embedding. Models like ResNet, EfficientNet, and Vision Transformers produce embeddings where visually and semantically similar images are nearby. These embeddings power reverse image search, visual recommendation, and few-shot image classification.
CLIP (Radford et al., 2021) introduced a joint embedding space for images and text by training an image encoder and a text encoder simultaneously with a contrastive loss. The training objective maximizes cosine similarity between matched image-text pairs while minimizing it for mismatched pairs. The resulting space allows direct comparison between images and text: a photo of a dog is close to the text description "a photo of a dog."
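A hedged sketch of scoring image-text similarity in this joint space, using the Hugging Face transformers implementation of CLIP; the checkpoint name and the local image path are assumptions:

```python
# Score an image against candidate text descriptions in CLIP's shared space.
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("dog.jpg")                       # any local image
texts = ["a photo of a dog", "a photo of a truck"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Higher score = higher image-text similarity in the joint embedding space.
print(outputs.logits_per_image.softmax(dim=-1))
```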
Meta's ImageBind (Girdhar et al., 2023) extended this concept to six modalities: images, text, audio, depth, thermal, and IMU (inertial measurement unit) data. A key insight of ImageBind is that images naturally co-occur with many other modalities, so using images as a "binding" modality allows all six to be aligned into a single embedding space without requiring paired data for every combination. This enables emergent cross-modal capabilities, such as retrieving audio clips using text queries or generating images from audio inputs.
| System | Modalities | Training approach | Notable capability |
|---|---|---|---|
| CLIP | Image, text | Contrastive learning on 400M image-text pairs | Zero-shot image classification |
| ALIGN | Image, text | Contrastive learning on 1.8B noisy image-text pairs | Robust to noisy training data |
| ImageBind | Image, text, audio, depth, thermal, IMU | Image-paired contrastive learning | Cross-modal retrieval across six modalities |
| CLAP | Audio, text | Contrastive audio-language pre-training | Zero-shot audio classification |
Autoencoders and variational autoencoders (VAEs) learn latent spaces that serve as compressed embedding spaces for their training data. A VAE encoder maps inputs to a probability distribution over the latent space, and the decoder maps samples from this distribution back to the data space. Two key properties make VAE latent spaces useful for generation. First, continuity: nearby points in the latent space decode to similar outputs. Second, completeness: any point sampled from the latent space decodes to a plausible output.
These properties enable smooth interpolation between data points. For example, in a VAE trained on face images, interpolating between the latent vectors of two faces produces a smooth morphing sequence. Similarly, in the latent space of a text-to-image diffusion model, interpolating between the embeddings of two text prompts produces images that gradually blend the two concepts.
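A minimal interpolation sketch is shown below; the latent vectors are random stand-ins for the outputs of a trained encoder, and in a real VAE each interpolated point would be passed through the decoder to produce one frame of the morphing sequence:

```python
# Linear interpolation between two latent vectors (toy stand-ins for encoder outputs).
import numpy as np

def interpolate(z_start, z_end, steps=8):
    """Return `steps` points evenly spaced along the line from z_start to z_end."""
    return [(1 - t) * z_start + t * z_end for t in np.linspace(0.0, 1.0, steps)]

z_a = np.random.randn(64)   # stand-in for encoder(image_a)
z_b = np.random.randn(64)   # stand-in for encoder(image_b)

path = interpolate(z_a, z_b)
print(len(path), path[0].shape)   # 8 latent points of dimension 64
# In a real VAE: frames = [decoder(z) for z in path]   # smooth morph
```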
An isotropic embedding space is one where vectors are uniformly distributed across all directions, meaning no direction is preferred over another. In practice, many pre-trained language models produce highly anisotropic embedding spaces. Ethayarajh (2019) demonstrated that embeddings from BERT, ELMo, and GPT-2 occupy a narrow cone in the vector space rather than being spread uniformly. This "cone effect" means that randomly sampled word embeddings have unexpectedly high cosine similarity, which degrades the usefulness of cosine similarity as a semantic measure.
Several techniques have been proposed to address anisotropy. Whitening transformations can redistribute embeddings more uniformly, and post-processing methods like normalizing flows can map the anisotropic distribution to a more isotropic one. These corrections improve performance on semantic similarity benchmarks.
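A simple whitening sketch along these lines: center the embeddings and rescale along principal directions (via an SVD of the covariance matrix) so the transformed set has approximately identity covariance, i.e. becomes more isotropic. The toy data here is deliberately anisotropic:

```python
# Whitening: map an anisotropic embedding cloud to one with identity covariance.
import numpy as np

def whiten(embeddings):
    mu = embeddings.mean(axis=0, keepdims=True)
    cov = np.cov((embeddings - mu).T)
    u, s, _ = np.linalg.svd(cov)
    w = u @ np.diag(1.0 / np.sqrt(s + 1e-8))   # whitening matrix
    return (embeddings - mu) @ w

x = np.random.randn(1000, 768) * np.array([5.0] + [0.1] * 767)  # anisotropic toy data
x_white = whiten(x)
print(np.cov(x_white.T).round(2)[:2, :2])   # approximately the identity
```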
Standard embedding spaces use Euclidean geometry, which works well for data without strong hierarchical structure. However, tree-like and hierarchical data (such as taxonomies, organizational charts, or knowledge graphs) can be more faithfully represented in hyperbolic space. Nickel and Kiela (2017) introduced Poincaré embeddings, which embed data into the Poincaré ball model of hyperbolic space.
Hyperbolic space expands exponentially with distance from the origin, much like a tree expands exponentially with depth. This means that hierarchical structures that would require a high-dimensional Euclidean space can be embedded with low distortion in a low-dimensional hyperbolic space. In experiments, Poincaré embeddings in just 5 dimensions outperformed Euclidean embeddings in 200 dimensions for representing the WordNet noun hierarchy.
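For concreteness, distance in the Poincaré ball is given by d(u, v) = arcosh(1 + 2‖u − v‖² / ((1 − ‖u‖²)(1 − ‖v‖²))), and can be computed directly; the example points below are illustrative only:

```python
# Poincaré-ball distance: grows rapidly as points approach the boundary (norm -> 1),
# mirroring how a tree branches exponentially with depth.
import numpy as np

def poincare_distance(u, v):
    sq = np.sum((u - v) ** 2)
    denom = (1.0 - np.sum(u ** 2)) * (1.0 - np.sum(v ** 2))
    return np.arccosh(1.0 + 2.0 * sq / denom)

root = np.array([0.0, 0.0])       # near the origin: a general concept
leaf_a = np.array([0.85, 0.30])   # near the boundary: specific concepts
leaf_b = np.array([0.30, 0.85])

print(poincare_distance(root, leaf_a))    # moderate
print(poincare_distance(leaf_a, leaf_b))  # larger: the two leaves are farther
                                          # from each other than from the root
```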
| Geometry | Best suited for | Key advantage | Example application |
|---|---|---|---|
| Euclidean | Flat, non-hierarchical data | Simple distance computations; well-understood optimization | Word similarity, image retrieval |
| Hyperbolic | Tree-like, hierarchical data | Exponential volume growth matches tree branching | Taxonomy embedding, knowledge graphs |
| Spherical | Data with periodic or directional structure | Natural for cosine similarity; unit-norm constraints | Sentence embeddings, CLIP |
Different languages trained independently produce separate embedding spaces with similar internal structures but incompatible coordinate systems. Cross-lingual alignment maps these spaces into a shared space so that translations are nearby. Facebook's MUSE library (Conneau et al., 2018) aligns monolingual fastText embeddings for 30 languages using either a small bilingual dictionary (supervised) or adversarial training (unsupervised). The alignment is typically an orthogonal transformation, which preserves the internal structure of each monolingual space while rotating and reflecting them into agreement.
This enables training a classifier in one language and applying it directly to another. For example, a sentiment classifier trained on English data can classify German text if both languages share an aligned embedding space.
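In the supervised setting, the orthogonal map can be found in closed form as the solution to the orthogonal Procrustes problem; the sketch below uses synthetic stand-ins for dictionary-paired word vectors rather than real fastText embeddings:

```python
# Orthogonal Procrustes: find the rotation W that best maps source-language
# vectors X onto their dictionary translations Y.
import numpy as np

def procrustes_alignment(X, Y):
    """Return the orthogonal matrix W minimizing ||X @ W - Y||_F."""
    u, _, vt = np.linalg.svd(X.T @ Y)
    return u @ vt

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 300))                      # source-language embeddings
true_rotation, _ = np.linalg.qr(rng.standard_normal((300, 300)))
Y = X @ true_rotation                                    # target space = rotated source space

W = procrustes_alignment(X, Y)
print(np.allclose(X @ W, Y, atol=1e-6))                  # True: spaces are aligned
```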
Cross-modal alignment brings different data types (text, images, audio) into a shared embedding space. CLIP achieves this through contrastive training on image-text pairs. However, research has shown that CLIP's embedding space contains a "modality gap," where image embeddings and text embeddings cluster in separate regions of the hypersphere rather than fully interleaving. Recent work, such as AlignCLIP, addresses this gap through shared encoder parameters and regularized training objectives.
Because embedding spaces typically have hundreds of dimensions, visualization requires projecting them into two or three dimensions. The two most popular techniques for this are t-SNE and UMAP.
t-SNE (t-Distributed Stochastic Neighbor Embedding), developed by van der Maaten and Hinton (2008), converts high-dimensional pairwise distances into probability distributions and minimizes the KL divergence between the high-dimensional and low-dimensional distributions. t-SNE excels at preserving local neighborhood structure, making it effective for revealing clusters. However, it does not reliably preserve global distances; clusters that appear far apart in a t-SNE plot may not actually be far apart in the original space.
UMAP (Uniform Manifold Approximation and Projection), developed by McInnes et al. (2018), is grounded in topological data analysis and Riemannian geometry. UMAP is significantly faster than t-SNE, scales better to large datasets, and tends to preserve more global structure while still capturing local clusters. It has become the preferred tool for exploratory visualization of embedding spaces in many applications.
| Method | Preserves local structure | Preserves global structure | Speed | Scalability |
|---|---|---|---|---|
| t-SNE | Excellent | Limited | Slow for large datasets | Moderate |
| UMAP | Excellent | Good | Fast | High |
| PCA | Moderate | Good (linear only) | Very fast | Very high |
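A minimal projection sketch using scikit-learn's t-SNE and the separately installed umap-learn package, applied to random stand-ins for real embeddings:

```python
# Project a set of high-dimensional embeddings to 2-D for plotting.
import numpy as np
from sklearn.manifold import TSNE
import umap

embeddings = np.random.randn(500, 768)   # stand-in for real embeddings

tsne_2d = TSNE(n_components=2, perplexity=30).fit_transform(embeddings)
umap_2d = umap.UMAP(n_components=2, n_neighbors=15).fit_transform(embeddings)

print(tsne_2d.shape, umap_2d.shape)      # (500, 2) (500, 2)
```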
Embedding spaces enable a wide range of practical applications across machine learning.
Semantic search and retrieval. Documents, queries, and passages are embedded into a shared space, and retrieval is performed by finding the nearest neighbors to the query embedding. This approach, known as dense retrieval, powers modern search engines and retrieval-augmented generation (RAG) systems. Vector databases like Pinecone, Weaviate, and Milvus are specifically designed to perform fast nearest-neighbor search over large collections of embeddings.
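A brute-force version of this lookup is only a few lines of NumPy; at scale, a vector database replaces the exhaustive scan with an approximate index. The embeddings below are random stand-ins for vectors produced by a real encoder:

```python
# Dense retrieval sketch: rank documents by cosine similarity to the query embedding.
import numpy as np

def top_k(query_vec, doc_matrix, k=3):
    # Normalize so the dot product equals cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    scores = d @ q
    idx = np.argsort(-scores)[:k]
    return idx, scores[idx]

docs = np.random.randn(10_000, 384)   # stand-in for precomputed document embeddings
query = np.random.randn(384)          # stand-in for the embedded query
print(top_k(query, docs))
```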
Recommendation systems. Users and items (movies, products, songs) are embedded into the same space. Recommendations are generated by finding items whose embeddings are closest to a user's embedding. Collaborative filtering models and two-tower neural architectures both produce embeddings used for this purpose.
Clustering and topic modeling. Embedding text documents and then applying clustering algorithms (such as k-means or HDBSCAN) to the resulting vectors is a common approach for discovering topics, grouping similar documents, and performing unsupervised categorization.
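A minimal sketch with scikit-learn's k-means, again on random stand-ins for document embeddings; the cluster count is a modeling choice that practitioners tune:

```python
# Cluster document embeddings into topic-like groups with k-means.
import numpy as np
from sklearn.cluster import KMeans

doc_embeddings = np.random.randn(1000, 384)   # stand-in for real document embeddings
labels = KMeans(n_clusters=10, n_init=10).fit_predict(doc_embeddings)
print(labels[:20])   # cluster assignment for the first 20 documents
```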
Transfer learning. Pre-trained embedding spaces serve as a foundation for downstream tasks. Rather than training a model from scratch, practitioners use embeddings from models like BERT, CLIP, or ResNet as input features for task-specific classifiers or regressors. This transfer of learned representations dramatically reduces the amount of task-specific training data required.
Anomaly detection. In an embedding space trained on normal data, anomalous inputs map to regions far from the dense clusters of normal data. This distance-based approach to anomaly detection is used in fraud detection, manufacturing quality control, and cybersecurity.
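One simple scoring rule, sketched below, is the mean distance to the k nearest known-normal embeddings; the data and any alerting threshold are illustrative assumptions:

```python
# Distance-based anomaly score relative to embeddings of known-normal data.
import numpy as np

def anomaly_score(x, normal_embeddings, k=5):
    dists = np.linalg.norm(normal_embeddings - x, axis=1)
    return np.sort(dists)[:k].mean()   # mean distance to the k nearest normals

normal = np.random.randn(5000, 128)    # stand-in for normal-data embeddings
typical = np.random.randn(128)         # resembles the normal data
outlier = np.random.randn(128) + 8.0   # shifted far from the normal cluster

print(anomaly_score(typical, normal))  # small score
print(anomaly_score(outlier, normal))  # large score -> flag as anomalous
```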
Despite their utility, embedding spaces present several challenges. The curse of dimensionality affects nearest-neighbor search in very high-dimensional spaces, where distances between points become increasingly uniform. Approximate nearest-neighbor algorithms (such as HNSW and IVF) keep search over large collections tractable, but they trade some retrieval accuracy for speed.
Embedding spaces also inherit and can amplify biases present in training data. Word embedding spaces trained on web text have been shown to encode gender, racial, and other social biases as geometric relationships. Debiasing techniques exist but remain an active area of research.
Finally, the interpretability of embedding dimensions is limited. Unlike hand-crafted features, individual dimensions of a learned embedding typically do not correspond to identifiable concepts, making it difficult to explain why two items are considered similar.