Embedding Space
An embedding space is a continuous, typically high-dimensional vector space in which data objects (words, sentences, images, users, audio clips, code, or other entities) are represented as dense numerical vectors called embedding vectors. The core principle is that the geometric relationships between vectors in this space encode meaningful semantic, structural, or functional relationships between the objects they represent. Items that are similar in some task-relevant sense are mapped to nearby points, while dissimilar items are mapped far apart.
Embedding spaces are fundamental to modern machine learning and underpin applications ranging from natural language processing and computer vision to recommendation systems, semantic search, and retrieval-augmented generation. Rather than working with raw, sparse, or symbolic data, models project inputs into a shared continuous space where distances and directions carry meaning. This enables efficient computation of similarity measures, supports generalization to unseen data, and allows different data modalities to be compared directly.
The field has changed substantially since the original Word2Vec paper in 2013. Where early embeddings produced static 100- to 300-dimensional vectors trained on co-occurrence statistics, modern systems generate context-aware vectors with dimensions ranging from 384 to 4,096, sometimes with Matryoshka representations that allow truncation to smaller sizes without retraining. Commercial embedding APIs from OpenAI, Voyage AI, Cohere, Google, and others now handle enormous volumes of inference calls, mostly in service of retrieval-augmented generation (RAG) and search workloads.
Imagine you have a huge toy box full of different Lego pieces. Some are red, some are blue; some are big, some are small; some are flat, some are tall. Now imagine you could create a magical map where every Lego piece gets its own spot. Pieces that are alike (same color, same shape) sit close together on the map, and pieces that are very different sit far apart.
In machine learning, an embedding space is that magical map. Instead of Lego pieces, a computer places words, pictures, or songs onto the map. The word "happy" would sit near "joyful" but far from "sad." A photo of a cat would sit near other cat photos but far from pictures of trucks. The computer uses these maps to understand that things close together are related, which helps it do jobs like translating languages, recommending movies, or searching for similar images.
Mathematically, an embedding is a function f : X → R^d that maps elements from a discrete or high-dimensional input space X into a d-dimensional real-valued vector space R^d. The dimensionality d is typically much smaller than the original input dimensionality, though the term "embedding space" applies regardless of whether dimensionality is reduced. Common embedding dimensions include 128, 256, 384, 512, 768, 1,024, 1,536, 3,072, and 4,096, depending on the model architecture and task.
The embedding function f is usually learned through training a neural network on a task-specific objective. During training, the network adjusts the mapping so that the resulting vector space satisfies desired properties, such as placing semantically similar inputs near each other according to cosine similarity or Euclidean distance. For text, the dominant architecture is now a transformer encoder fine-tuned with contrastive losses on query-document pairs, with attention layers producing token-level vectors that get pooled into a single fixed-size representation.
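As a concrete illustration of that pipeline, the sketch below loads a small sentence-transformer checkpoint through Hugging Face transformers, mean-pools the token-level vectors into one fixed-size embedding, and L2-normalizes the result. The checkpoint name and the helper function are illustrative choices, not a prescribed implementation.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Illustrative checkpoint; any encoder-style model designed for mean pooling works similarly.
model_name = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

def embed(texts):
    # Tokenize a batch of strings and run the transformer encoder.
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        token_vectors = model(**batch).last_hidden_state      # (batch, seq_len, hidden)
    # Mean-pool token vectors, ignoring padding positions.
    mask = batch["attention_mask"].unsqueeze(-1).float()
    pooled = (token_vectors * mask).sum(dim=1) / mask.sum(dim=1)
    # L2-normalize so cosine similarity reduces to a dot product.
    return torch.nn.functional.normalize(pooled, p=2, dim=1)

vecs = embed(["a photo of a dog", "an image of a puppy"])
print(vecs @ vecs.T)  # cosine similarity matrix
```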
The choice of distance metric defines what "close" means in an embedding space. Three measures dominate practical use.
Cosine similarity measures the cosine of the angle between two vectors. It is bounded between -1 and 1, ignores vector magnitude, and is the default for almost all modern text embeddings. Most production embedding APIs return L2-normalized vectors, which makes cosine similarity equivalent to a dot product and lets approximate nearest neighbor (ANN) indexes use either metric interchangeably.
Euclidean distance (L2 distance) measures the straight-line distance between two points in the space. It is sensitive to vector magnitude. For unit-norm embeddings, Euclidean distance and cosine similarity rank pairs in the same order, so the choice rarely matters. For non-normalized embeddings (some image features, some user-item collaborative filtering models), Euclidean distance can capture meaningful magnitude differences that cosine ignores.
Dot product is the simplest measure: the sum of element-wise products. For unit vectors it equals cosine similarity. Many vector databases use raw dot product as their fastest scoring function. Some embedding models (like OpenAI's text-embedding-ada-002 and the text-embedding-3 family) explicitly optimize for inner product on normalized vectors.
Less common but useful in specific settings: Manhattan distance (L1) for sparse robustness, Hamming distance for binary or quantized embeddings, and Jaccard similarity for set-valued representations.
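A small NumPy sketch makes the relationships between these measures concrete; the two vectors below are arbitrary placeholders.

```python
import numpy as np

a = np.array([0.2, 0.1, 0.4])
b = np.array([0.3, 0.0, 0.5])

cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))   # angle only, bounded in [-1, 1]
euclidean = np.linalg.norm(a - b)                          # straight-line, magnitude-sensitive
dot = a @ b                                                # equals cosine only for unit vectors

# After L2 normalization the three measures rank neighbors identically.
a_n, b_n = a / np.linalg.norm(a), b / np.linalg.norm(b)
assert np.isclose(a_n @ b_n, cosine)
assert np.isclose(np.linalg.norm(a_n - b_n) ** 2, 2 - 2 * cosine)
```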
Embedding spaces exhibit several important properties that make them useful for machine learning.
Similar items cluster together in embedding space. For example, in a word embedding space trained on English text, words like "king," "queen," "prince," and "princess" form a cluster distinct from words like "car," "truck," and "bicycle." This clustering emerges automatically from the training objective without explicit supervision about word categories.
The distance between two points in an embedding space reflects their degree of similarity or relatedness. Cosine similarity is the most widely used metric in practice because it is invariant to vector magnitude and focuses on directional similarity.
Well-trained embedding spaces support meaningful arithmetic operations on vectors. The most famous example comes from Word2Vec: the vector operation king - man + woman yields a vector close to queen. This property, sometimes called the "parallelogram rule," shows that embedding spaces can encode relational concepts as consistent vector offsets. The relationship "male to female" is captured by approximately the same direction in the space regardless of which word pair is considered. Modern contextual embeddings exhibit this less cleanly than static word vectors, because their representations depend on context, but the underlying intuition still holds for averaged or pooled vectors.
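The analogy can be reproduced with pre-trained static vectors, for example through gensim's downloader; the GloVe checkpoint named here is one of several that work, and loading it requires a one-time download.

```python
import gensim.downloader as api

# Downloads pre-trained GloVe vectors on first use (roughly 130 MB).
vectors = api.load("glove-wiki-gigaword-100")

# king - man + woman ≈ queen, found by nearest-neighbor search over the vocabulary.
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```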
Embedding spaces are continuous, meaning that small movements in the space correspond to small, gradual changes in the represented concept. This continuity is what enables interpolation between points in generative models and supports the generalization ability of downstream classifiers.
The effectiveness of embedding spaces is closely related to the manifold hypothesis, which states that real-world high-dimensional data tends to lie on or near low-dimensional manifolds embedded within the higher-dimensional ambient space. For instance, the set of all natural images occupies only a tiny fraction of the space of all possible pixel arrangements. Embedding models learn to identify and parameterize these low-dimensional manifolds, mapping data to a space where the intrinsic structure is made explicit.
This perspective explains why dimension reduction techniques work in practice. The apparent high dimensionality of raw data (millions of pixels, tens of thousands of vocabulary tokens) masks a much lower intrinsic dimensionality dictated by the underlying factors of variation. Neural networks learn embedding functions that capture these factors, discarding noise and irrelevant variation.
Different domains and tasks produce embedding spaces with distinct characteristics.
Word embedding models like Word2Vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014) map individual words to dense vectors, typically of 100 to 300 dimensions. Word2Vec learns embeddings by predicting context words surrounding a target word (skip-gram) or predicting a target word from its context (CBOW). GloVe takes a different approach: it factorizes the global word co-occurrence matrix so that the dot product of two word vectors approximates the logarithm of their co-occurrence probability. FastText (Bojanowski et al., 2017) extended Word2Vec with subword n-gram features, producing better embeddings for morphologically rich languages and out-of-vocabulary words.
Both methods produce spaces where semantic relationships are encoded geometrically. Synonyms cluster together, and analogical relationships appear as parallel vector offsets. These word embedding spaces laid the groundwork for modern NLP, though they have been largely superseded by contextual embeddings from models like BERT and GPT.
| Model | Year | Training approach | Key property | Typical dimensions |
|---|---|---|---|---|
| Word2Vec | 2013 | Predict context words (skip-gram/CBOW) | Local context patterns; vector analogies | 100-300 |
| GloVe | 2014 | Factorize global co-occurrence matrix | Captures global statistics; log-bilinear model | 50-300 |
| FastText | 2017 | Subword n-gram skip-gram | Handles out-of-vocabulary words via subword information | 100-300 |
Sentence-BERT (Reimers and Gurevych, 2019) and similar sentence transformer models extend word-level embeddings to full sentences. Sentence-BERT uses a siamese network architecture with a pre-trained BERT backbone to produce fixed-size sentence embeddings where cosine similarity directly corresponds to semantic similarity. This makes operations like semantic search and clustering computationally efficient: finding the most similar pair in a collection of 10,000 sentences drops from roughly 65 hours with a BERT cross-encoder, which must score every pair, to about 5 seconds with Sentence-BERT, which embeds each sentence once and compares with cosine similarity, while maintaining comparable accuracy.
The Sentence-BERT design (encode each item independently, compare with cosine) is the template for almost every modern text embedding model. The training recipe has evolved: contrastive losses with hard negatives, multi-stage training on web pairs followed by labeled triples, and the InfoNCE loss are now standard. The popular open-source sentence-transformers library on Hugging Face hosts thousands of variants fine-tuned for specific languages and domains.
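A minimal semantic-search sketch with the sentence-transformers library; the checkpoint, corpus, and query below are placeholders.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

corpus = [
    "The cat sat on the mat.",
    "Quarterly revenue grew 12 percent.",
    "A kitten is sleeping on the rug.",
]
query = "a cat resting on a carpet"

# Encode once, normalize so cosine similarity is a plain dot product.
corpus_emb = model.encode(corpus, normalize_embeddings=True)
query_emb = model.encode(query, normalize_embeddings=True)

# Top-2 nearest corpus sentences for the query.
hits = util.semantic_search(query_emb, corpus_emb, top_k=2)
print(hits[0])  # list of {"corpus_id": ..., "score": ...}
```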
In computer vision, convolutional neural networks and vision transformers learn hierarchical feature representations that form image embedding spaces. The penultimate layer of a trained image classifier (before the classification head) typically serves as a general-purpose image embedding. Models like ResNet, EfficientNet, and Vision Transformers produce embeddings where visually and semantically similar images are nearby. These embeddings power reverse image search, visual recommendation, and few-shot image classification.
More recent self-supervised vision models produce stronger image embeddings without label supervision. DINOv2 (Oquab et al., 2023, Meta) trains a vision transformer with a teacher-student distillation objective on a curated 142-million-image dataset. Its embeddings work well out of the box for retrieval, segmentation, and depth estimation without fine-tuning. SigLIP (Zhai et al., 2023, Google) replaced CLIP's softmax-based contrastive loss with a sigmoid loss, which scales better and produced stronger zero-shot classification at modest compute. JinaCLIP from Jina AI offers a unified text-image encoder optimized for multimodal retrieval.
CLIP (Radford et al., 2021) introduced a joint embedding space for images and text by training an image encoder and a text encoder simultaneously with a contrastive loss. The training objective maximizes cosine similarity between matched image-text pairs while minimizing it for mismatched pairs. The resulting space allows direct comparison between images and text: a photo of a dog is close to the text description "a photo of a dog."
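A sketch of zero-shot classification in CLIP's joint space using the Hugging Face transformers wrappers; the image path and candidate captions are placeholders.

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")  # placeholder path to a local image
captions = ["a photo of a dog", "a photo of a truck"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# Image-text similarity logits; softmax turns them into zero-shot class probabilities.
print(outputs.logits_per_image.softmax(dim=-1))
```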
Meta's ImageBind (Girdhar et al., 2023) extended this concept to six modalities: images, text, audio, depth, thermal, and IMU (inertial measurement unit) data. A key insight of ImageBind is that images naturally co-occur with many other modalities, so using images as a "binding" modality allows all six to be aligned into a single embedding space without requiring paired data for every combination. This enables emergent cross-modal capabilities, such as retrieving audio clips using text queries or generating images from audio inputs.
| System | Modalities | Training approach | Notable capability |
|---|---|---|---|
| CLIP | Image, text | Contrastive learning on 400M image-text pairs | Zero-shot image classification |
| ALIGN | Image, text | Contrastive learning on 1.8B noisy image-text pairs | Robust to noisy training data |
| SigLIP / SigLIP 2 | Image, text | Sigmoid contrastive loss | Better scaling, higher zero-shot accuracy |
| ImageBind | Image, text, audio, depth, thermal, IMU | Image-paired contrastive learning | Cross-modal retrieval across six modalities |
| CLAP | Audio, text | Contrastive audio-language pre-training | Zero-shot audio classification |
| DINOv2 | Image (self-supervised) | Teacher-student distillation | Strong dense features without labels |
Autoencoders and variational autoencoders (VAEs) learn latent spaces that serve as compressed embedding spaces for their training data. A VAE encoder maps inputs to a probability distribution over the latent space, and the decoder maps samples from this distribution back to the data space. Two key properties make VAE latent spaces useful for generation. First, continuity: nearby points in the latent space decode to similar outputs. Second, completeness: any point sampled from the latent space decodes to a plausible output.
These properties enable smooth interpolation between data points. For example, in a VAE trained on face images, interpolating between the latent vectors of two faces produces a smooth morphing sequence. Similarly, in the latent space of a text-to-image diffusion model, interpolating between the embeddings of two text prompts produces images that gradually blend the two concepts.
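Interpolation itself is just a weighted average of latent vectors; a minimal sketch follows, where the decode step stands in for whatever decoder the trained model provides.

```python
import numpy as np

def interpolate(z_start, z_end, steps=8):
    """Linear interpolation between two latent vectors."""
    return [(1 - t) * z_start + t * z_end for t in np.linspace(0.0, 1.0, steps)]

# Each intermediate latent would then be passed through the model's decoder, e.g.:
# frames = [decoder(z) for z in interpolate(z_a, z_b)]   # decoder is hypothetical here
```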
The text embedding market has consolidated around a handful of commercial APIs and a long tail of open-source checkpoints on Hugging Face. Most production systems pick a model based on three factors: retrieval quality on the MTEB benchmark, max input length, and price per million tokens.
| Model | Provider | Released | Dimensions | Max tokens | Price (per 1M tokens) |
|---|---|---|---|---|---|
| text-embedding-ada-002 | OpenAI | Dec 2022 | 1,536 | 8,191 | $0.10 |
| text-embedding-3-small | OpenAI | Jan 2024 | 1,536 (Matryoshka, down to 256) | 8,191 | $0.02 |
| text-embedding-3-large | OpenAI | Jan 2024 | 3,072 (Matryoshka, down to 256) | 8,191 | $0.13 |
| voyage-3 | Voyage AI | Sep 2024 | 1,024 (Matryoshka 256/512/1024/2048) | 32,000 | $0.06 |
| voyage-3-large | Voyage AI | Jan 2025 | 1,024 (Matryoshka) | 32,000 | $0.18 |
| voyage-code-3 | Voyage AI | Dec 2024 | 1,024 | 32,000 | $0.18 |
| voyage-multimodal-3 | Voyage AI | 2024 | 1,024 | 32,000 | $0.12 |
| Embed v3 (English) | Cohere | Nov 2023 | 1,024 | 512 | $0.10 |
| Embed v3 (Multilingual) | Cohere | Nov 2023 | 1,024 | 512 | $0.10 |
| Embed v4 | Cohere | Apr 2025 | 256/512/1024/1536 (Matryoshka) | 128,000 | $0.12 |
| gemini-embedding-001 | Google | 2025 | 768/1536/3072 (Matryoshka) | 8,192 | $0.15 |
| text-embedding-005 (Vertex) | Google | 2024 | 768 | 2,048 | $0.025 |
| Mistral Embed | Mistral AI | 2024 | 1,024 | 8,192 | $0.10 |
| nomic-embed-text-v1.5 | Nomic | Feb 2024 | 768 (Matryoshka 64/128/256/512/768) | 8,192 | Open weights |
| jina-embeddings-v3 | Jina AI | Sep 2024 | 1,024 (Matryoshka, down to 32) | 8,192 | Open + API |
| BGE-M3 | BAAI | 2024 | 1,024 | 8,192 | Open weights |
| GTE-large-en-v1.5 | Alibaba | 2024 | 1,024 | 8,192 | Open weights |
| Snowflake Arctic Embed L 2.0 | Snowflake | 2024 | 1,024 (Matryoshka) | 8,192 | Open weights |
A few notes on the table. Pricing is current as of early 2026 and changes frequently; check provider pricing pages before budgeting. Max token windows refer to the encoder's input length; some APIs accept longer inputs and truncate silently, which is a common source of silent quality regression. The Matryoshka annotations in the Dimensions column indicate which output sizes can be safely truncated to without retraining.
The practical pattern in production: start with a hosted API for ease of use, switch to an open model on dedicated GPUs once volumes pass roughly 100 million embeddings per month, and revisit the choice when a new MTEB leader appears with meaningfully better domain accuracy.
The text-embedding-3-small and text-embedding-3-large models, released in January 2024, replaced the long-running text-embedding-ada-002. Both use Matryoshka representation learning: the model is trained so that any prefix of the output vector is itself a valid (lower quality) embedding. Users can request a shorter output (down to 256 dimensions) to save storage and bandwidth without retraining or re-embedding. text-embedding-3-large hit 64.6% on MTEB at launch and remains a strong general-purpose baseline. The pricing drop from $0.10 per million tokens (ada-002) to $0.02 per million tokens (3-small) reflected the broader collapse in inference costs across 2023-2024.
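Requesting a truncated Matryoshka output is a single parameter in the OpenAI Python client; the sketch assumes an API key is available in the environment.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

resp = client.embeddings.create(
    model="text-embedding-3-small",
    input=["How do Matryoshka embeddings work?"],
    dimensions=256,  # request a truncated prefix instead of the full 1,536 dimensions
)

vec = resp.data[0].embedding
print(len(vec))  # 256
```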
Voyage AI, acquired by MongoDB in early 2025, ships the voyage-3 family along with code-specialized (voyage-code-3) and multimodal (voyage-multimodal-3) variants. As of MTEB v2 in early 2025, voyage-3-large led most retrieval categories, often by 5 to 10 percentage points over OpenAI text-embedding-3-large, particularly on long-context retrieval and code search. The 32,000 token context window is unusually long for an embedding model and is well-matched to chunking schemes that keep whole documents intact.
Cohere Embed v3 (November 2023) introduced compression-aware training: embeddings remained accurate when quantized to int8 or binary, which let Cohere offer compressed variants directly through the API. Embed v4 (April 2025) extended the family to 128,000 token context, multilingual (over 100 languages), and Matryoshka outputs at 256/512/1024/1536. Embed v4 added native multimodal support, encoding images and text into the same space without a separate vision encoder call.
Google ships embeddings through Vertex AI. The textembedding-gecko line evolved through several versions; the current production model, text-embedding-005, is fine-tuned for retrieval and outputs 768-dimensional vectors. The newer gemini-embedding-001 model, derived from Gemini, offers Matryoshka outputs at 768/1536/3072 dimensions and led MTEB at launch in 2025.
Four open-source families dominate the practitioner conversation: BGE (BAAI, China), GTE (Alibaba), E5 (Microsoft), and Nomic Embed. BGE-M3 stands out for its support of dense, sparse, and multi-vector outputs in a single model. GTE-large-en-v1.5 is a popular default for English RAG when budget rules out paid APIs. Snowflake Arctic Embed and Jina embeddings v3 round out the top tier. The upside of open weights is obvious: no per-token cost, full control over the inference path, and freedom to fine-tune. The downside is that you carry the GPU bill and the engineering work of operating a serving cluster, and the leader board moves every few months.
Several domains have benefited from specialized embedding models. Code embeddings (voyage-code-3, CodeT5+, jina-embeddings-v2-base-code) are trained on source code with adjacent natural language descriptions, producing better retrieval for code search and copilot grounding. Scientific embeddings (SciBERT, SPECTER for academic papers) are tuned on research text. Medical and legal embeddings (BioBERT, LegalBERT, voyage-law-2, Cohere Embed v3 for legal) are common in regulated domains where general models miss specialist vocabulary. Multilingual embeddings (LaBSE, multilingual-e5, jina-embeddings-v3, Cohere Embed Multilingual v3) cover cross-lingual retrieval where queries and documents may live in different languages.
A vector database stores embedding vectors and supports fast nearest-neighbor search at scale. The category did not exist as a distinct product in 2018; by 2026 it includes a dozen well-known names, several venture-funded startups, and vector indexes added to existing relational and search engines.
| Database | Type | Indexing algorithms | Hosting | Notable users / scale |
|---|---|---|---|---|
| Pinecone | Managed cloud-native | Proprietary (HNSW-derived) | Fully managed SaaS | Notion, Gong, Shopify; petabyte scale |
| Weaviate | Open source + cloud | HNSW, flat | Self-hosted or managed | Stack Overflow, Unbody |
| Qdrant | Open source + cloud | HNSW with payload filtering | Self-hosted or managed | Cloudflare AutoRAG backend |
| Milvus / Zilliz | Open source + cloud | HNSW, IVF, DiskANN, GPU CAGRA | Self-hosted or managed | Walmart, IKEA, eBay |
| Chroma | Open source + cloud | HNSW | Embedded or hosted | LangChain default backend |
| pgvector | Postgres extension | HNSW, IVFFlat | Wherever Postgres runs | Supabase, Neon, Timescale |
| LanceDB | Embedded / serverless | IVF-PQ, HNSW | Local file or S3 | Roblox, character.ai (reportedly) |
| Vespa | Open source search engine | HNSW + structured filters | Self-hosted or Vespa Cloud | Yahoo, Spotify |
| Elasticsearch / OpenSearch | Search engine + vectors | HNSW (Lucene) | Self-hosted or Elastic Cloud | Wikipedia search, GitHub Enterprise |
| Redis Vector | In-memory database | FLAT, HNSW | Redis Cloud or self-host | Used for low-latency RAG caches |
| MongoDB Atlas Vector Search | Document database | HNSW | Atlas managed | Bundles with Voyage AI embeddings |
| Apache Cassandra (Astra DB) | Distributed NoSQL | DiskANN | DataStax managed | Netflix, Capital One |
The vector database market is fragmented and pricing varies widely. Public estimates of market share are unreliable and not worth quoting as specific percentages; what is clear from job postings, GitHub stars, and conference talks is that Pinecone, Milvus / Zilliz, Qdrant, and Weaviate dominate dedicated vector workloads, while pgvector and Atlas Vector Search dominate "vectors next to my existing data" workloads. Elasticsearch and OpenSearch dominate when teams already operate a Lucene cluster and want hybrid keyword-plus-vector search.
The practical decision tree is short. If your data already lives in Postgres and your collection is under roughly 50 million vectors, pgvector with HNSW is hard to beat on operational cost. If you need horizontal scaling past a billion vectors, Milvus or Pinecone are battle-tested. If you need rich metadata filtering and you are starting fresh, Qdrant has the cleanest filter pushdown story. If you want hybrid lexical-plus-vector search out of the box, Vespa, Elasticsearch, or OpenSearch beat the specialized vector databases on this dimension.
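For the pgvector path, a minimal sketch using the psycopg driver is shown below; the connection string, table name, and 3-dimensional placeholder vector are illustrative, and a real query vector would come from the same embedding model used at index time.

```python
import psycopg  # assumes a Postgres instance with the pgvector extension available

conn = psycopg.connect("dbname=rag user=postgres")  # placeholder connection string
query_vec = [0.12, -0.03, 0.57]  # placeholder; use a real query embedding of matching dimension

with conn, conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
    cur.execute(
        "CREATE TABLE IF NOT EXISTS chunks ("
        "id bigserial PRIMARY KEY, body text, embedding vector(3))"
    )
    cur.execute(
        "CREATE INDEX IF NOT EXISTS chunks_hnsw ON chunks "
        "USING hnsw (embedding vector_cosine_ops)"
    )
    # <=> is pgvector's cosine-distance operator; smaller means closer.
    cur.execute(
        "SELECT id, body FROM chunks ORDER BY embedding <=> %s::vector LIMIT 5",
        ("[" + ",".join(map(str, query_vec)) + "]",),
    )
    print(cur.fetchall())
```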
Exact nearest-neighbor search over millions of high-dimensional vectors is too slow for interactive use. Approximate nearest neighbor (ANN) algorithms trade a small drop in recall for a large speedup. Five families dominate.
| Algorithm | Year | Approach | Strengths | Trade-offs |
|---|---|---|---|---|
| HNSW (Hierarchical Navigable Small World) | 2016 | Multi-layer proximity graph | Best recall-speed Pareto for in-memory data; default in most VDBs | Memory hungry; slow to build; no easy delete |
| IVF (Inverted File) | 2010s | Partition vectors with k-means, search top-N partitions | Simple, GPU friendly, works well with PQ | Lower recall than HNSW at the same speed |
| Product Quantization (PQ) | 2010 (Jegou) | Compress vectors into product of subquantizers | 10-100x memory reduction; combines with IVF | Loss in recall; not great alone |
| DiskANN | 2019 (Microsoft) | Single-pass graph index that lives on SSD | Billion-scale on a single machine; cheap storage | Higher latency than in-memory HNSW |
| ScaNN | 2020 (Google) | Anisotropic quantization + IVF | Strong recall at very low latency on Google hardware | Less common in third-party VDBs |
In practice, HNSW is the default for vector databases that fit in RAM (Pinecone serverless, Qdrant, Weaviate, Chroma, pgvector). DiskANN is the default for systems that want to scale past memory (Milvus, Cassandra). PQ is bolted onto IVF or HNSW to compress the vectors themselves; binary quantization (1 bit per dimension) is increasingly common for embedding models that survive the compression, including Cohere Embed v3 and the text-embedding-3 family at lower Matryoshka sizes.
The FAISS library from Meta deserves its own mention. It implements all of these algorithms, runs on CPU and GPU, and powers many of the vector databases above either directly or as the inspiration for their indexing code.
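A minimal FAISS HNSW sketch; random vectors stand in for real embeddings, and the parameter values are common choices rather than recommendations.

```python
import faiss
import numpy as np

d = 768                                              # embedding dimensionality
xb = np.random.rand(100_000, d).astype("float32")    # corpus vectors (placeholder data)
xq = np.random.rand(5, d).astype("float32")          # query vectors

# HNSW graph index: 32 neighbors per node is a common default.
index = faiss.IndexHNSWFlat(d, 32)
index.hnsw.efConstruction = 200                      # build-time search depth
index.add(xb)

index.hnsw.efSearch = 64                             # query-time recall/speed knob
distances, ids = index.search(xq, 10)                # top-10 approximate neighbors per query
print(ids.shape)                                     # (5, 10)
```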
Retrieval-augmented generation is the application that turned embeddings from a research curiosity into a billion-dollar API category. The pattern is simple to state: chunk the corpus, embed each chunk, store in a vector database, embed the user query at runtime, retrieve the top-k closest chunks, and prepend them to the prompt for a large language model. The details are where most teams spend their time.
Most embedding models have a max context window between 512 and 32,000 tokens, but retrieval quality is usually better with smaller chunks because each chunk represents a tighter semantic unit. Common chunk sizes range from 256 to 1,024 tokens. Chunking strategies include fixed-size sliding windows (with overlap of 10 to 20 percent), recursive character splitting (split by paragraph, then sentence, then word until chunks fit), semantic chunking (split where embedding similarity drops between adjacent sentences), and structured chunking (one chunk per markdown section, code function, or PDF page). The right choice depends on the source documents. Legal contracts and code want structured chunking; informal customer support transcripts respond well to semantic chunking.
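A fixed-size sliding-window chunker is only a few lines; this sketch operates on a pre-tokenized list and uses illustrative default sizes.

```python
def sliding_window_chunks(tokens, chunk_size=512, overlap=64):
    """Split a token list into overlapping fixed-size chunks."""
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the final chunk already reaches the end of the document
    return chunks
```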
Dense embedding retrieval is strong on paraphrase and synonym handling but weak on rare terms, exact identifiers, and recent named entities. Lexical retrieval (BM25, TF-IDF) is the opposite: strong on exact match, weak on paraphrase. Hybrid search combines both with a fusion step. The two common fusion methods are reciprocal rank fusion (RRF), which sums 1/(k + rank) across the two lists, and weighted score fusion. Cohere, Pinecone, Weaviate, Vespa, and Elasticsearch all ship hybrid search out of the box; on benchmarks like BEIR, hybrid usually beats dense alone by 3 to 8 points.
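Reciprocal rank fusion is simple enough to show in full; this sketch fuses any number of ranked id lists and uses k=60, the constant from the original RRF paper.

```python
def reciprocal_rank_fusion(result_lists, k=60):
    """Fuse ranked lists of document ids into a single ranking."""
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: fuse a dense-retrieval list with a BM25 list.
fused = reciprocal_rank_fusion([["d3", "d1", "d7"], ["d1", "d9", "d3"]])
print(fused)
```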
A bi-encoder embedding model produces independent vectors for query and document, which means it cannot model interaction between specific query terms and specific document spans. A cross-encoder reranker concatenates query and document and runs a full transformer pass, scoring each pair directly. This is too expensive to run over the whole corpus, but it works well on the top 50 to 100 results from a vector or hybrid search. Cohere Rerank, Voyage rerank-2, and the open BGE Reranker family are common choices. Adding a reranker typically lifts NDCG@10 by 5 to 15 points over vector search alone.
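A reranking sketch with the sentence-transformers CrossEncoder class; the checkpoint, query, and candidate passages are placeholders.

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "how do I rotate API keys?"
candidates = [
    "Rotating credentials is done from the admin console under Security.",
    "Quarterly revenue grew 12 percent year over year.",
]

# Score each (query, document) pair with a full transformer pass, then sort.
scores = reranker.predict([(query, doc) for doc in candidates])
reranked = [doc for _, doc in sorted(zip(scores, candidates), reverse=True)]
print(reranked[0])
```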
A modern production RAG pipeline looks roughly like this:

1. Chunk the corpus and embed each chunk with the chosen embedding model.
2. Store the vectors, along with metadata, in a vector database behind an ANN index.
3. At query time, embed the query with the same model and run hybrid dense-plus-lexical retrieval.
4. Rerank the top 50 to 100 candidates with a cross-encoder.
5. Prepend the surviving chunks to the prompt and let the large language model generate the answer.
Each step in this pipeline has its own embedding model decisions. The embedding model used for indexing must match the one used at query time, which is a real constraint when the underlying API releases a new version with a different vector geometry.
Multimodal retrieval has lagged text retrieval by several years, but the gap closed quickly in 2024 and 2025.
The earliest practical multimodal embeddings were CLIP and its successors (OpenCLIP, MetaCLIP, EVA-CLIP, SigLIP, SigLIP 2). These produce a shared image-text space that supports zero-shot classification ("a photo of a dog" versus the image), reverse image search, and image-conditioned text retrieval. They are now standard inside larger vision-language models, where the CLIP-style image encoder feeds patch embeddings into a language model.
Dedicated multimodal embedding APIs followed in 2024. Voyage multimodal-3 produces a single vector for an arbitrary mix of text and images, including interleaved documents like PDFs with figures and PowerPoint slides. Cohere Embed v4 (April 2025) is multimodal by default, encoding text and images into one space without a separate vision call. Vertex AI multimodal embeddings (multimodalembedding@001) accept text, image, and short video clips, returning a 1,408-dimensional shared vector.
The killer application for multimodal embeddings is multimodal RAG over enterprise documents. Real-world PDFs are full of charts, screenshots, and diagrams whose meaning is lost when you OCR them to plain text. Multimodal embeddings let the figure itself be retrieved and shown to a vision-capable LLM, which then explains it. Anecdotally, this pattern has driven much of the growth in Voyage and Cohere's multimodal embedding usage through 2025.
Matryoshka representation learning, introduced by Kusupati et al. (2022), trains an embedding model so that any prefix of the output vector is itself a valid embedding, with quality degrading gracefully as the prefix shrinks. The training objective adds a weighted sum of contrastive losses computed at multiple truncation lengths, typically 64, 128, 256, 512, 768, and 1024 dimensions. At inference, users pick the shortest length that meets their accuracy bar.
The practical impact is significant. A team can store 256-dimensional vectors for a fast first-stage retrieval over billions of chunks, then re-rank the top candidates using the full 1,024 or 3,072 dimensional vectors. Storage and memory drop by 4 to 12x; recall stays close to the full-precision baseline. OpenAI text-embedding-3, Voyage voyage-3 and voyage-3-large, Cohere Embed v4, Nomic Embed, and Snowflake Arctic Embed all use Matryoshka training. Combined with binary quantization (one bit per dimension), Matryoshka outputs can compress a 1,024-dimensional vector to as little as 32 bytes with single-digit recall loss.
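Truncation is a slice plus a re-normalization; a NumPy sketch, assuming row vectors from a Matryoshka-trained model.

```python
import numpy as np

def truncate_matryoshka(vectors, dims=256):
    """Keep the first `dims` coordinates, then re-normalize so cosine scores stay comparable."""
    truncated = vectors[:, :dims]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    return truncated / np.clip(norms, 1e-12, None)

full = np.random.rand(1000, 1024).astype("float32")  # placeholder for real Matryoshka embeddings
small = truncate_matryoshka(full, dims=256)          # 4x smaller index footprint
```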
An isotropic embedding space is one where vectors are uniformly distributed across all directions, meaning no direction is preferred over another. In practice, many pre-trained language models produce highly anisotropic embedding spaces. Ethayarajh (2019) demonstrated that embeddings from BERT, ELMo, and GPT-2 occupy a narrow cone in the vector space rather than being spread uniformly. This "cone effect" means that randomly sampled word embeddings have unexpectedly high cosine similarity, which degrades the usefulness of cosine similarity as a semantic measure.
Several techniques have been proposed to address anisotropy. Whitening transformations can redistribute embeddings more uniformly, and post-processing methods like normalizing flows can map the anisotropic distribution to a more isotropic one. These corrections improve performance on semantic similarity benchmarks. Modern contrastive training reduces but does not eliminate anisotropy; well-trained sentence embedding models like the text-embedding-3 family are noticeably more isotropic than raw BERT.
Standard embedding spaces use Euclidean geometry, which works well for data without strong hierarchical structure. However, tree-like and hierarchical data (such as taxonomies, organizational charts, or knowledge graphs) can be more faithfully represented in hyperbolic space. Nickel and Kiela (2017) introduced Poincare embeddings, which embed data into the Poincare ball model of hyperbolic space.
Hyperbolic space expands exponentially with distance from the origin, much like a tree expands exponentially with depth. This means that hierarchical structures that require a high-dimensional Euclidean space can be embedded with low distortion in a low-dimensional hyperbolic space. In experiments, Poincare embeddings in just 5 dimensions outperformed Euclidean embeddings in 200 dimensions for representing the WordNet noun hierarchy.
| Geometry | Best suited for | Key advantage | Example application |
|---|---|---|---|
| Euclidean | Flat, non-hierarchical data | Simple distance computations; well-understood optimization | Word similarity, image retrieval |
| Hyperbolic | Tree-like, hierarchical data | Exponential volume growth matches tree branching | Taxonomy embedding, knowledge graphs |
| Spherical | Data with periodic or directional structure | Natural for cosine similarity; unit-norm constraints | Sentence embeddings, CLIP |
Different languages trained independently produce separate embedding spaces with similar internal structures but incompatible coordinate systems. Cross-lingual alignment maps these spaces into a shared space so that translations are nearby. Facebook's MUSE library (Conneau et al., 2018) aligns monolingual fastText embeddings for 30 languages using either a small bilingual dictionary (supervised) or adversarial training (unsupervised). The alignment is typically an orthogonal transformation, which preserves the internal structure of each monolingual space while rotating and reflecting them into agreement.
This enables training a classifier in one language and applying it directly to another. For example, a sentiment classifier trained on English data can classify German text if both languages share an aligned embedding space. Modern multilingual embedding models (LaBSE, multilingual-e5, Cohere Embed Multilingual v3, jina-embeddings-v3) skip the post-hoc alignment by training jointly on parallel corpora across languages.
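The orthogonal alignment has a closed-form solution (orthogonal Procrustes); a NumPy sketch follows, where X and Y hold embeddings for the two sides of a small bilingual dictionary.

```python
import numpy as np

def orthogonal_alignment(X, Y):
    """
    Solve min_W ||X W - Y|| over orthogonal W (orthogonal Procrustes).
    X, Y: (n, d) matrices of source- and target-language vectors for n dictionary pairs.
    """
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

# Usage: rotate all source-language vectors into the target space.
# W = orthogonal_alignment(X_dictionary, Y_dictionary)
# aligned_source = source_vectors @ W
```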
Cross-modal alignment brings different data types (text, images, audio) into a shared embedding space. CLIP achieves this through contrastive training on image-text pairs. However, research has shown that CLIP's embedding space contains a "modality gap," where image embeddings and text embeddings cluster in separate regions of the hypersphere rather than fully interleaving. Recent work, such as AlignCLIP, addresses this gap through shared encoder parameters and regularized training objectives.
Because embedding spaces typically have hundreds of dimensions, visualization requires projecting them into two or three dimensions. The two most popular techniques for this are t-SNE and UMAP.
t-SNE (t-Distributed Stochastic Neighbor Embedding), developed by van der Maaten and Hinton (2008), converts high-dimensional pairwise distances into probability distributions and minimizes the KL divergence between the high-dimensional and low-dimensional distributions. t-SNE excels at preserving local neighborhood structure, making it effective for revealing clusters. However, it does not reliably preserve global distances; clusters that appear far apart in a t-SNE plot may not actually be far apart in the original space.
UMAP (Uniform Manifold Approximation and Projection), developed by McInnes et al. (2018), is grounded in topological data analysis and Riemannian geometry. UMAP is significantly faster than t-SNE, scales better to large datasets, and tends to preserve more global structure while still capturing local clusters. It has become the preferred tool for exploratory visualization of embedding spaces in many applications.
PCA (Principal Component Analysis) is the simplest option. It is a linear projection that maximizes preserved variance. PCA is fast and reversible but cannot capture nonlinear structure, which makes it less useful than t-SNE or UMAP for clustering visualization. It still has its place as a preprocessing step before more expensive nonlinear methods, and as a quick sanity check on the spread of a new embedding model.
| Method | Preserves local structure | Preserves global structure | Speed | Scalability |
|---|---|---|---|---|
| t-SNE | Excellent | Limited | Slow for large datasets | Moderate |
| UMAP | Excellent | Good | Fast | High |
| PCA | Moderate | Good (linear only) | Very fast | Very high |
A recurring caution: these methods are for visualization, not retrieval. The 2D projection throws away most of the high-dimensional structure that made the embedding useful in the first place. Do not run nearest-neighbor search on UMAP outputs.
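A minimal visualization sketch with umap-learn and matplotlib; the random array stands in for real embeddings, and the parameters are common choices rather than recommendations.

```python
import numpy as np
import umap                      # provided by the umap-learn package
import matplotlib.pyplot as plt

embeddings = np.random.rand(2_000, 768).astype("float32")  # placeholder for real embeddings

reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, metric="cosine", random_state=42)
coords = reducer.fit_transform(embeddings)                  # (2000, 2) projection

plt.scatter(coords[:, 0], coords[:, 1], s=3)
plt.title("2D UMAP projection (visualization only, not for retrieval)")
plt.show()
```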
Embedding models are evaluated on a small set of standardized benchmarks. The most influential is MTEB.
The Massive Text Embedding Benchmark (MTEB, Muennighoff et al., 2022) covers 56 datasets across eight task categories: classification, clustering, pair classification, reranking, retrieval, semantic textual similarity (STS), summarization, and bitext mining. It became the de facto leaderboard for English text embeddings within months of release. The Hugging Face MTEB leaderboard now lists hundreds of models; the top of the board churns roughly every quarter as new open and closed models ship.
The benchmark family has expanded. C-MTEB covers Chinese, MTEB-French, MTEB-Polish and others cover specific languages, and MMTEB (Massive Multilingual Text Embedding Benchmark, 2024) extends the methodology to over 250 languages. As of early 2026, the top of the MTEB v2 English retrieval leaderboard typically includes commercial models such as voyage-3-large and gemini-embedding-001 alongside open-weight models such as NV-Embed-v2 (Nvidia), Linq-Embed-Mistral, and SFR-Embedding-2 (Salesforce). The exact ordering shifts month to month, and the deltas between top models are usually under one point on the average score, so no single ranking stays authoritative for long. Pick the model based on your specific task category, not the average.
BEIR (Thakur et al., 2021) is a heterogeneous retrieval benchmark covering 18 datasets, including BioASQ, NaturalQuestions, MS MARCO, FEVER, and several specialty domains. It predates MTEB and remains the standard for retrieval-only evaluation. BEIR introduced the now-common practice of zero-shot evaluation: train on MS MARCO, evaluate everywhere else. This format exposes domain transfer failures that in-domain training conceals.
For code retrieval, CoIR (Code Information Retrieval) and CSN (CodeSearchNet) are common. For long-context retrieval, LoCo and BRIGHT (with reasoning-heavy queries) probe weaknesses that BEIR misses. Domain-specific benchmarks exist for legal (LegalBench), medical (MedQA retrieval splits), and scientific (SciDocs, SciFact) text.
Embedding spaces enable a wide range of practical applications across machine learning.
Semantic search and retrieval. Documents, queries, and passages are embedded into a shared space, and retrieval is performed by finding the nearest neighbors to the query embedding. This approach, known as dense retrieval, powers modern search engines and retrieval-augmented generation systems.
Retrieval-augmented generation. As described above, RAG is the largest single use of embeddings in production today.
Recommendation systems. Users and items (movies, products, songs) are embedded into the same space. Recommendations are generated by finding items whose embeddings are closest to a user's embedding. Two-tower neural architectures, used at YouTube, TikTok, Pinterest, and Spotify, follow the same design principle as the Sentence-BERT siamese network: encode each side independently, then compare with a dot product.
Clustering and topic modeling. Embedding text documents and then applying clustering algorithms (such as k-means, HDBSCAN, or BERTopic) to the resulting vectors is a common approach for discovering topics, grouping similar documents, and performing unsupervised categorization.
Classification with few examples. Pre-trained embeddings serve as input features for lightweight classifiers, often a single linear layer trained on a few hundred examples. This pattern is common for content moderation, intent detection, and ticket routing.
Deduplication and near-duplicate detection. Embedding similarity catches paraphrased duplicates that exact-match hashing misses. This shows up in dataset cleaning for LLM training, in plagiarism detection, and in near-duplicate user-generated content filtering.
Anomaly detection. In an embedding space trained on normal data, anomalous inputs map to regions far from the dense clusters of normal data. This distance-based approach to anomaly detection is used in fraud detection, manufacturing quality control, and cybersecurity.
Transfer learning. Pre-trained embedding spaces serve as a foundation for downstream tasks. Rather than training a model from scratch, practitioners use embeddings from models like BERT, CLIP, or ResNet as input features for task-specific classifiers or regressors. This transfer of learned representations dramatically reduces the amount of task-specific training data required.
The largest text embedding models have billions of parameters and produce 3,072 or 4,096 dimensional vectors, which can be expensive to serve at scale. Distillation produces a smaller student model that mimics the teacher's embeddings on a curated set of inputs. The classic recipe trains the student with a mean-squared-error loss against teacher vectors, sometimes paired with a contrastive loss against the original triples. The result is a smaller, faster model whose embeddings are interchangeable with the teacher's for similarity tasks.
Many of the popular small embedding models (all-MiniLM-L6-v2, paraphrase-MiniLM-L3-v2, gtr-t5-base) are distilled from larger teachers. Cohere, Voyage, and OpenAI all ship distilled "small" or "lite" tiers built this way.
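A minimal sketch of the mean-squared-error part of that recipe in PyTorch; `student` stands for any encoder that returns pooled vectors, and the teacher vectors are assumed to be precomputed offline.

```python
import torch.nn.functional as F

def distillation_step(student, batch, teacher_vecs, optimizer):
    """One MSE distillation step: push student embeddings toward frozen teacher vectors."""
    student_vecs = student(batch)                   # (batch, d); hypothetical student forward pass
    loss = F.mse_loss(student_vecs, teacher_vecs)   # teacher_vecs computed offline by the teacher
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```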
Despite their utility, embedding spaces present several challenges.
The curse of dimensionality affects nearest-neighbor search in very high-dimensional spaces, where distances between points become increasingly uniform. Approximate nearest-neighbor algorithms (such as HNSW and IVF) mitigate this but introduce a speed-accuracy tradeoff.
Out-of-domain performance can drop sharply. An embedding model trained on web text may struggle with legal contracts, scientific papers, or proprietary internal jargon. The standard mitigation is fine-tuning on in-domain pairs, which usually requires a few thousand labeled query-document examples. Voyage and Cohere both offer managed fine-tuning services for this reason.
Long-text handling remains hard. Most embedding models truncate or chunk inputs longer than their context window, which loses information. Even models with 32,000 token windows show retrieval quality drops on very long inputs because the pooled vector cannot represent every part of the document equally. Late interaction models like ColBERT and the multi-vector mode of BGE-M3 partially address this by storing token-level vectors and computing fine-grained similarity at query time, at the cost of much higher storage.
Cross-lingual challenges. Multilingual embedding models do well on high-resource languages and degrade on low-resource ones. Code-switched text, mixed-script documents, and rare languages remain weak spots.
Hallucinated similarity. Two documents may have high cosine similarity for surface reasons (shared boilerplate, similar formatting) without sharing real semantic content. Reranking and grounding citations help, but the underlying issue is that cosine similarity is a learned signal, not a guarantee.
Embedding model drift across versions. When a provider releases a new embedding model, the vector geometry usually changes, which means a corpus indexed with the old model is incompatible with queries from the new model. Re-embedding terabytes of documents is expensive. Some teams pin to a specific model version and only upgrade in coordinated rollouts; others maintain side-by-side indexes during transitions.
Bias and fairness. Embedding spaces inherit and can amplify biases present in training data. Word embedding spaces trained on web text have been shown to encode gender, racial, and other social biases as geometric relationships. Debiasing techniques exist but remain an active area of research.
Interpretability. Unlike hand-crafted features, individual dimensions of a learned embedding typically do not correspond to identifiable concepts, making it difficult to explain why two items are considered similar.