Cosine similarity is a measure of similarity between two non-zero vectors that calculates the cosine of the angle between them. It is defined as the dot product of the vectors divided by the product of their magnitudes, producing a value between -1 and 1. In artificial intelligence and machine learning, cosine similarity has become one of the most widely used similarity metrics, particularly for comparing text embeddings, measuring semantic relatedness, powering vector search in retrieval-augmented generation (RAG) systems, and defining training objectives in contrastive learning. Its popularity stems from a key property: it measures the orientation (direction) of vectors while being invariant to their magnitude, making it well-suited for comparing learned representations where direction encodes meaning.
Given two vectors A and B in n-dimensional space, cosine similarity is defined as:
cos(theta) = (A . B) / (||A|| * ||B||)
where A . B is the dot product of the vectors, and ||A|| and ||B|| are their Euclidean (L2) norms (magnitudes).
Expanding the formula for two n-dimensional vectors:
cos(theta) = (A_1*B_1 + A_2*B_2 + ... + A_n*B_n) / (sqrt(A_1^2 + A_2^2 + ... + A_n^2) * sqrt(B_1^2 + B_2^2 + ... + B_n^2))
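The formula can be implemented directly. A minimal pure-Python sketch (the function name is illustrative):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Parallel vectors: similarity 1.0 (up to floating-point rounding),
# regardless of magnitude.
print(cosine_similarity([1, 2, 3], [2, 4, 6]))  # ~1.0
# Orthogonal vectors: similarity 0.0.
print(cosine_similarity([1, 0], [0, 1]))        # 0.0
# Cosine distance is simply 1 - cosine_similarity(a, b).
```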
| Cosine Value | Angle | Interpretation |
|---|---|---|
| 1.0 | 0 degrees | Vectors point in exactly the same direction (identical orientation) |
| 0.5 to 0.9 | 26 to 60 degrees | Moderately to highly similar |
| 0.0 | 90 degrees | Vectors are orthogonal (no similarity) |
| -0.5 to -0.1 | 96 to 120 degrees | Moderately dissimilar |
| -1.0 | 180 degrees | Vectors point in exactly opposite directions |
For most applications with neural network embeddings, similarity values cluster in the positive range (roughly 0.0 to 1.0), because embedding spaces tend to use only a portion of the available directional space.
A related quantity, cosine distance, is defined as 1 - cos(theta), and ranges from 0 (identical) to 2 (opposite). Cosine distance is used when a distance metric (lower is more similar) is needed rather than a similarity metric (higher is more similar).
The most important property of cosine similarity for AI applications is its invariance to vector magnitude. Two vectors that point in the same direction have cosine similarity 1.0, regardless of their lengths. This matters because in many representation learning settings, the direction of a vector encodes semantic meaning while the magnitude may reflect irrelevant factors such as document length, word frequency, or arbitrary scale differences in the embedding model.
For example, in traditional TF-IDF document representations, a document that is twice as long as another might have TF-IDF vectors with roughly double the magnitude, even if both documents discuss the same topics. Cosine similarity correctly identifies them as similar by focusing on the proportions of terms rather than their absolute counts [1].
In high-dimensional spaces (hundreds or thousands of dimensions, as is typical for neural embeddings), Euclidean distance becomes less discriminative because all points tend to be roughly equidistant. This phenomenon, known as the "curse of dimensionality," makes absolute distance values difficult to interpret. Cosine similarity, by focusing on angular relationships rather than absolute distances, remains discriminative and interpretable in high dimensions [2].
Cosine similarity requires only a dot product and two norm computations, all of which are highly optimized on modern hardware. When vectors are pre-normalized to unit length (||A|| = ||B|| = 1), cosine similarity reduces to a simple dot product A . B, which can be computed extremely efficiently using SIMD instructions, GPU tensor cores, or specialized vector database hardware.
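This optimization can be sketched with NumPy: normalize all vectors once, after which an entire matrix of cosine similarities is a single matrix product (the sizes below are arbitrary placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
db = rng.normal(size=(1000, 384))      # e.g. 1000 stored embeddings
queries = rng.normal(size=(5, 384))    # 5 query embeddings

# Normalize once at indexing time; afterwards cosine similarity == dot product.
db_unit = db / np.linalg.norm(db, axis=1, keepdims=True)
q_unit = queries / np.linalg.norm(queries, axis=1, keepdims=True)

# (5, 1000) matrix of cosine similarities via one matmul.
sims = q_unit @ db_unit.T
```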
Cosine similarity is the standard metric for comparing vector representations (embeddings) across natural language processing, information retrieval, and recommendation systems.
The use of cosine similarity for measuring word relatedness became widespread with word2vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014). These models map words to dense vectors where semantic relationships are encoded as geometric relationships [3].
Classic examples of cosine similarity in word embedding spaces:
| Word Pair | Typical Cosine Similarity | Relationship |
|---|---|---|
| king, queen | ~0.75 | Semantic (gender variant) |
| cat, dog | ~0.76 | Semantic (both animals) |
| car, automobile | ~0.85 | Near-synonym |
| king, carrot | ~0.15 | Unrelated |
| good, bad | ~0.45 | Antonyms (still somewhat similar because they appear in similar contexts) |
A celebrated property of word2vec embeddings is that semantic analogies correspond to vector arithmetic. The classic example "king - man + woman is close to queen" works because cosine similarity in the embedding space captures these regular semantic relationships [3].
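As a toy illustration of this arithmetic, the sketch below uses hand-picked 3-dimensional "embeddings" chosen purely for demonstration; real word2vec vectors are typically 300-dimensional and learned from data:

```python
import math

# Hand-picked toy vectors for illustration only -- not real word2vec embeddings.
emb = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
}

def cos(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# king - man + woman, compared against the vocabulary by cosine similarity.
target = [k - m + w for k, m, w in zip(emb["king"], emb["man"], emb["woman"])]
nearest = max(emb, key=lambda word: cos(emb[word], target))
print(nearest)  # -> queen
```

With real embeddings, the source words of the analogy are usually excluded from the candidate set; in this toy example "queen" wins outright.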
Modern embedding models such as Sentence-BERT (Reimers and Gurevych, 2019), E5 (Wang et al., 2022), and OpenAI's text-embedding models produce fixed-size vectors for entire sentences or documents. These models are specifically trained so that cosine similarity between embeddings reflects semantic similarity between the corresponding texts [4].
| Embedding Model | Dimensions | Typical Use | Similarity Metric |
|---|---|---|---|
| text-embedding-3-large (OpenAI) | 3072 | General-purpose text similarity | Cosine |
| E5-large-v2 | 1024 | Search and retrieval | Cosine |
| all-MiniLM-L6-v2 (SBERT) | 384 | Lightweight semantic similarity | Cosine |
| BGE-large-en-v1.5 | 1024 | Search and retrieval | Cosine |
| Cohere embed-v3 | 1024 | Multilingual search | Cosine or dot product |
These models are typically evaluated on benchmarks like MTEB (Massive Text Embedding Benchmark), where cosine similarity is used as the default similarity function for retrieval, clustering, and semantic textual similarity tasks.
Contrastive learning methods train models to produce embeddings where similar items have high cosine similarity and dissimilar items have low cosine similarity. The training objective directly optimizes cosine similarity (or equivalently, the dot product of L2-normalized vectors).
CLIP (Contrastive Language-Image Pre-training), developed by OpenAI (Radford et al., 2021), learns a shared embedding space for images and text [5]. During training, CLIP processes batches of (image, text) pairs and computes the cosine similarity between every image embedding and every text embedding in the batch. The model is trained to maximize cosine similarity for matching pairs and minimize it for non-matching pairs, using a symmetric cross-entropy loss over the similarity matrix.
At inference time, CLIP enables zero-shot image classification by computing the cosine similarity between an image embedding and text embeddings of candidate class descriptions (e.g., "a photo of a cat," "a photo of a dog"). The class with the highest cosine similarity to the image is selected as the prediction [5].
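The selection logic can be sketched as follows. The embeddings here are random placeholders standing in for the outputs of CLIP's trained image and text encoders; only the argmax-over-cosine-similarities step is the point:

```python
import numpy as np

rng = np.random.default_rng(42)
# Placeholder embeddings; a real pipeline would compute these with CLIP.
image_emb = rng.normal(size=512)
class_prompts = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
text_embs = rng.normal(size=(3, 512))

def normalize(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Cosine similarity between the image and each candidate caption,
# then pick the caption with the highest similarity.
sims = normalize(text_embs) @ normalize(image_emb)
prediction = class_prompts[int(np.argmax(sims))]
```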
SimCLR (Chen et al., 2020) is a framework for self-supervised visual representation learning that uses cosine similarity as the core component of its NT-Xent (Normalized Temperature-scaled Cross-Entropy) loss [6]. Given a batch of images, SimCLR creates two augmented views of each image, encodes them, and projects them to a lower-dimensional space. The loss function for a positive pair (i, j) is:
loss_ij = -log(exp(sim(z_i, z_j) / tau) / sum(exp(sim(z_i, z_k) / tau) for all k != i))
where sim(z_i, z_j) = (z_i . z_j) / (||z_i|| * ||z_j||) is the cosine similarity between L2-normalized representations, and tau is a temperature parameter that controls the sharpness of the distribution.
The temperature parameter is critical: lower temperatures make the model focus more on the hardest negative examples, while higher temperatures smooth the distribution. SimCLR's ablations found temperatures around tau = 0.1 to work best, and values in the 0.1 to 0.5 range are common across the contrastive learning literature [6].
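The loss above can be sketched in NumPy for a single positive pair within a batch of projections (a simplified illustration of the per-pair term, not the full two-view batch formulation):

```python
import numpy as np

def nt_xent_pair_loss(z, i, j, tau=0.5):
    """NT-Xent loss for positive pair (i, j) over a batch of projections z.

    z: (N, d) array; rows are L2-normalized inside, so z @ z.T gives
    pairwise cosine similarities.
    """
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    sims = z @ z.T / tau                 # cosine similarities / temperature
    exp_sims = np.exp(sims[i])
    # Denominator sums over all k != i (the anchor's self-similarity excluded).
    denom = exp_sims.sum() - exp_sims[i]
    return -np.log(exp_sims[j] / denom)

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))             # toy batch of 8 projections
loss = nt_xent_pair_loss(z, 0, 1)
```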
Cosine similarity is central to the training objectives of many other contrastive and self-supervised methods:
| Method | Domain | How Cosine Similarity Is Used |
|---|---|---|
| CLIP | Vision-language | Cross-modal matching of image and text embeddings |
| SimCLR | Vision | Self-supervised learning with augmented view pairs |
| MoCo | Vision | Momentum contrast with cosine similarity queue |
| SimCSE | NLP | Sentence embedding learning with dropout augmentation |
| ALIGN | Vision-language | Image-text alignment (similar to CLIP) |
| DINO | Vision | Self-distillation with cosine similarity |
| Barlow Twins | Vision | Cross-correlation matrix objective (related to cosine) |
Retrieval-augmented generation (RAG) systems use cosine similarity as the primary mechanism for finding relevant documents to include in a large language model's context. The process works as follows:

1. Documents are split into chunks, and each chunk is encoded into an embedding vector and stored in a vector database.
2. At query time, the user's question is encoded with the same embedding model.
3. The query embedding is compared against the stored chunk embeddings by cosine similarity, and the top-k most similar chunks are retrieved.
4. The retrieved chunks are inserted into the language model's prompt as context for generating the answer.
Vector databases like Pinecone, Weaviate, Qdrant, Milvus, and Chroma support cosine similarity as a built-in distance metric. At scale, exact cosine similarity computation with every vector becomes expensive, so these systems use approximate nearest neighbor (ANN) algorithms such as HNSW (Hierarchical Navigable Small World) and IVF (Inverted File Index) to perform fast approximate searches [7].
| Vector Database | Supported Metrics | Default Metric |
|---|---|---|
| Pinecone | Cosine, dot product, Euclidean | Cosine |
| Weaviate | Cosine, dot product, L2 | Cosine |
| Qdrant | Cosine, dot product, Euclidean | Cosine |
| Milvus | Cosine, IP, L2 | Varies |
| Chroma | Cosine, L2, IP | L2 |
| pgvector | Cosine, L2, inner product | Varies |
Semantic search more broadly relies on cosine similarity. Unlike keyword-based search (which matches exact terms), semantic search compares the meaning of queries and documents via their embeddings. A query like "how to fix a flat tire" can match a document titled "changing a punctured tire" because their embeddings point in similar directions, even though they share few words.
Cosine similarity is one of several metrics used to compare vectors. Understanding the relationships and differences between these metrics is important for choosing the right one.
The dot product (inner product) of two vectors A and B is:
A . B = sum(A_i * B_i) = ||A|| * ||B|| * cos(theta)
The dot product equals cosine similarity multiplied by the magnitudes of both vectors. When vectors are L2-normalized (unit vectors), the dot product and cosine similarity are identical. Many contrastive learning frameworks normalize embeddings before computing similarities, making the choice between dot product and cosine similarity irrelevant in practice.
However, when vectors are not normalized, the dot product incorporates magnitude information. This can be useful in some settings. For example, in recommendation systems, the magnitude of a user or item embedding might encode popularity or confidence, and the dot product captures both the relevance (direction) and strength (magnitude) of the match.
Euclidean distance (L2 distance) between vectors A and B is:
d(A, B) = sqrt(sum((A_i - B_i)^2))
For L2-normalized vectors, Euclidean distance and cosine similarity are monotonically related:
d(A, B)^2 = 2 - 2 * cos(theta) = 2 * (1 - cos(theta))
This means that for normalized vectors, ranking by cosine similarity and ranking by Euclidean distance produce identical results. The choice between them in this case is purely a matter of convention.
For non-normalized vectors, the relationship breaks down. Euclidean distance is sensitive to magnitude: two vectors pointing in the same direction but with different magnitudes will have a large Euclidean distance despite having cosine similarity of 1.0.
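The identity for normalized vectors is easy to verify numerically (a NumPy sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=128)
b = rng.normal(size=128)
a /= np.linalg.norm(a)   # L2-normalize both vectors
b /= np.linalg.norm(b)

cos_theta = a @ b
d_squared = np.sum((a - b) ** 2)

# For unit vectors: d(A, B)^2 == 2 * (1 - cos(theta)).
assert np.isclose(d_squared, 2 * (1 - cos_theta))
```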
| Property | Cosine Similarity | Dot Product | Euclidean Distance |
|---|---|---|---|
| Range | [-1, 1] | (-inf, +inf) | [0, +inf) |
| Magnitude invariant | Yes | No | No |
| Equivalent for normalized vectors | Yes (= dot product) | Yes (= cosine sim) | Yes (monotonically related) |
| Interpretability | High (bounded, intuitive) | Moderate | Moderate |
| Common use case | Text embeddings, semantic search | Recommendations, attention scores | Clustering, kNN |
| Captures direction only | Yes | No | No |
The standard self-attention mechanism in transformers computes attention scores as the scaled dot product between query and key vectors: score = Q K^T / sqrt(d_k). This is closely related to but distinct from cosine similarity: the division by sqrt(d_k) keeps the logits from growing with dimension, but it does not normalize by the magnitudes of the individual query and key vectors [8].
Some architectures have experimented with using explicit cosine similarity for attention. In particular, query-key normalization (where queries and keys are L2-normalized before computing attention scores) has been explored to stabilize training and reduce the need for learning rate warmup. This makes the attention logits exactly cosine similarities (up to scaling), bounding them to the range [-1, 1] before the temperature scaling.
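The difference can be sketched as follows (a NumPy illustration; in practice QK-normalized attention often uses a learned scale rather than the fixed one assumed here):

```python
import numpy as np

def attention_scores(Q, K, qk_norm=False):
    """Scaled dot-product attention logits, optionally with QK normalization.

    With qk_norm=True, queries and keys are L2-normalized first, so each
    logit is exactly a cosine similarity, bounded to [-1, 1].
    """
    d_k = Q.shape[-1]
    if qk_norm:
        Q = Q / np.linalg.norm(Q, axis=-1, keepdims=True)
        K = K / np.linalg.norm(K, axis=-1, keepdims=True)
        return Q @ K.T                   # cosine similarities in [-1, 1]
    return (Q @ K.T) / np.sqrt(d_k)      # standard scaled dot product

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 64))
K = rng.normal(size=(6, 64))
logits = attention_scores(Q, K, qk_norm=True)
assert np.all(np.abs(logits) <= 1.0 + 1e-6)  # bounded, unlike standard logits
```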
Computing cosine similarity requires one dot product and two norm computations, each O(n) for n-dimensional vectors, so the total cost is linear in the vector dimension. For batch computations, the full similarity matrix between m query vectors and k database vectors can be computed as a single matrix multiplication (after normalizing each vector), leveraging highly optimized BLAS routines and GPU tensor operations.
A common optimization is to L2-normalize all vectors once at indexing time. After normalization, cosine similarity reduces to a dot product, eliminating the per-query normalization cost. Most vector databases and embedding libraries apply this optimization by default when cosine similarity is selected as the metric.
When computing cosine similarity in floating-point arithmetic, care must be taken with very similar vectors. The difference between cosine similarity values of 0.9999 and 0.99999 can be significant for ranking but may be lost to floating-point rounding, especially in float16 or bfloat16. For precision-sensitive applications (such as deduplication), using float32 for similarity computation is recommended even when embeddings are stored in lower precision.
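The rounding effect is easy to demonstrate: the float16 spacing near 1.0 is about 0.00049, larger than the gap between the two example values, so both round to exactly 1.0:

```python
import numpy as np

a = np.float32(0.9999)
b = np.float32(0.99999)

assert a != b                                  # distinguishable in float32
assert np.float16(a) == np.float16(b) == 1.0   # both collapse to 1.0 in float16
```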
For databases with millions or billions of vectors, exact cosine similarity search (comparing the query to every vector) becomes impractical. ANN algorithms trade a small amount of accuracy for orders-of-magnitude speedup:
| ANN Algorithm | Approach | Typical Recall@10 | Build Time |
|---|---|---|---|
| HNSW | Graph-based navigation | 95-99% | Moderate |
| IVF-PQ | Clustering + product quantization | 85-95% | Fast |
| ScaNN | Anisotropic vector quantization | 95-99% | Fast |
| FAISS (flat) | Exact brute force | 100% | N/A (no index) |
These algorithms are designed to find vectors with high cosine similarity to a query without examining every vector in the database. They work by organizing vectors into data structures that allow pruning large portions of the search space.
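For comparison, the exact brute-force search that these methods approximate can be sketched in a few lines (assuming NumPy; the embeddings below are random placeholders standing in for a real index):

```python
import numpy as np

rng = np.random.default_rng(1)
chunk_embs = rng.normal(size=(100, 256))  # placeholder database of 100 vectors
query_emb = rng.normal(size=256)          # placeholder query vector

def top_k_cosine(query, docs, k=3):
    """Indices of the k database vectors with highest cosine similarity."""
    docs_n = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    query_n = query / np.linalg.norm(query)
    sims = docs_n @ query_n               # cosine similarity to every vector
    return np.argsort(-sims)[:k]          # exact, O(N) in database size

top_ids = top_k_cosine(query_emb, chunk_embs, k=3)
```

ANN indexes return (approximately) the same `top_ids` without scoring all N vectors.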
Despite its widespread use, cosine similarity has several known limitations:

- It discards magnitude entirely, which loses information when vector length is meaningful (for example, when it encodes confidence, frequency, or popularity).
- Similarity scores are not calibrated and are not comparable across embedding models: a score of 0.8 from one model does not mean the same thing as 0.8 from another.
- Many embedding spaces are anisotropic (vectors occupy a narrow cone of the space), which compresses similarity scores into a narrow range and makes absolute thresholds unreliable.
- Cosine distance is not a true metric, since it violates the triangle inequality, which restricts the indexing structures that can rely on it.
- For models not explicitly trained with a cosine-based objective, cosine similarity between their representations is not guaranteed to reflect semantic similarity.
Cosine similarity has been used in information retrieval since the 1960s as part of the Vector Space Model (VSM) proposed by Gerard Salton [10]. In the VSM, documents and queries are represented as term-frequency vectors, and relevance is measured by cosine similarity. This was one of the earliest applications of geometric reasoning to text, and the fundamental idea (represent text as vectors, compare by angle) remains at the core of modern semantic search, just with neural embeddings replacing term-frequency vectors.
The measure gained renewed importance in the deep learning era with the rise of word embeddings, sentence embeddings, and multimodal embeddings, all of which use cosine similarity as their primary comparison metric. Today, cosine similarity underlies billions of daily similarity computations in search engines, recommendation systems, and AI applications worldwide.