k-Nearest Neighbors (often abbreviated k-NN or KNN) is a non-parametric, instance-based supervised learning algorithm used for both classification and regression. The algorithm classifies a new data point by looking at the k closest examples in the training set under some distance metric, then assigning the majority class (for classification) or the average target value (for regression). KNN is one of the oldest and conceptually simplest algorithms in machine learning, with origins tracing back to a 1951 technical report by Evelyn Fix and Joseph Hodges, but it remains foundational to modern artificial intelligence systems through its role in vector database similarity search and retrieval-augmented generation.
KNN is sometimes described as a lazy learner because it performs no explicit training step. The model effectively is the training set: when a query arrives, the algorithm computes distances from the query to all stored examples and uses the nearest ones to make a prediction. This contrasts with eager learners such as logistic regression, support vector machines, or neural networks, which fit a compact set of parameters at training time and discard the data afterward. The trade-off is direct: KNN has zero training cost but pays at inference time, both in computation and in memory.
The standard k-NN procedure for a query point x is straightforward: compute the distance from x to every example in the training set under the chosen metric, select the k examples with the smallest distances, and aggregate their targets, by plurality vote for classification or by averaging for regression.
KNN never represents its decision boundary explicitly: the boundary is the locus of points equidistant from nearest neighbors of different classes. With k = 1, the boundary follows the faces of the Voronoi tessellation of the training set that separate points of different classes. As k increases, the boundary becomes smoother and less sensitive to individual training points.
In the classification setting, given training pairs (x_1, y_1), ..., (x_n, y_n) with discrete labels y_i in {1, ..., C}, the prediction for a new query x is:
y_hat = argmax_c sum over i in N_k(x) of 1{y_i = c}
where N_k(x) is the set of indices of the k nearest neighbors of x, and 1{...} is the indicator function. For binary classification, choosing an odd value of k avoids ties.
For regression with continuous targets y_i in R, the standard prediction is the mean of the neighbors:
y_hat = (1/k) * sum over i in N_k(x) of y_i
Median or mode aggregations are sometimes used when the target distribution is skewed or contains outliers. KNN regression produces piecewise-constant predictions, which makes it a useful baseline but limits its accuracy on smooth functions compared to methods like linear regression or gradient boosting.
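The whole procedure fits in a few lines of NumPy. The sketch below is a minimal brute-force implementation of both the voting and averaging rules described above; the array names and the toy data are illustrative rather than taken from any particular library.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, query, k=5, mode="classify"):
    """Brute-force k-NN: compute all distances, keep the k closest,
    then take a plurality vote (classification) or the mean (regression)."""
    # Euclidean distance from the query to every stored training example
    dists = np.linalg.norm(X_train - query, axis=1)
    # Indices of the k nearest neighbors, i.e. the set N_k(x)
    nearest = np.argsort(dists)[:k]
    if mode == "classify":
        return Counter(y_train[nearest]).most_common(1)[0][0]
    return y_train[nearest].mean()

# Toy usage with illustrative data
X_train = np.array([[0.0, 0.0], [1.0, 1.0], [0.9, 1.1], [5.0, 5.0]])
y_train = np.array([0, 1, 1, 0])
print(knn_predict(X_train, y_train, np.array([1.0, 0.9]), k=3))  # -> 1
```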
The table below summarizes the differences between the two main use cases of KNN.
| Aspect | KNN Classification | KNN Regression |
|---|---|---|
| Output | Discrete class label | Continuous numeric value |
| Aggregation | Plurality vote among k neighbors | Mean, median, or weighted average of neighbor targets |
| Tie-breaking | Choose odd k for binary problems; reduce k or use distance weighting otherwise | Not usually an issue, but mean is sensitive to outliers |
| Loss connection | Approximates the Bayes classifier under mild conditions | Approximates the conditional expectation E[Y \| X = x] |
| Weighting | Uniform or inverse-distance weights | Uniform mean or distance-weighted average (Shepard interpolation) |
| Common metric | Accuracy, F1, AUC | Mean squared error, mean absolute error |
| Decision surface | Piecewise-constant in label space | Piecewise-constant in target space |
The k-nearest neighbor rule was first introduced in a 1951 unpublished technical report titled Discriminatory Analysis: Nonparametric Discrimination, Consistency Properties by Evelyn Fix and Joseph Lawson Hodges Jr. at the University of California, Berkeley. The report was prepared for the United States Air Force School of Aviation Medicine and circulated only as an internal document. Fix and Hodges studied a closely related procedure, the k-NN density estimator, and proposed using it as the basis for a non-parametric classification rule. Although the report was not formally published until decades later, it established the theoretical foundation for the family of nearest-neighbor methods.
Fix (1904 to 1965) was a pioneering woman in mathematical statistics, and Hodges (1922 to 2000) had been working with the U.S. Air Force on statistical problems since World War II. Their report quietly shaped the field of pattern recognition for years before the broader research community took notice.
The nearest neighbor rule entered the mainstream of pattern recognition research with the publication of Nearest Neighbor Pattern Classification by Thomas Cover and Peter Hart in the IEEE Transactions on Information Theory, volume 13, pages 21 to 27, in 1967. Cover and Hart proved a striking asymptotic result that became one of the most cited findings in early pattern recognition theory.
The Cover-Hart theorem states that as the number of training examples n approaches infinity, the error rate R of the 1-nearest-neighbor classifier satisfies:
R* <= R <= R* * (2 - M * R* / (M - 1))
where R* is the Bayes error rate (the irreducible minimum error rate of any classifier with knowledge of the true class-conditional distributions) and M is the number of classes. In the binary case, this simplifies to R <= 2 * R*, often quoted as the 2x Bayes bound: with infinitely many training examples, the simple 1-NN rule never has more than twice the error rate of the optimal Bayes classifier. For larger k, the bound improves further, approaching R* as k grows (provided k/n approaches zero so the classifier remains local).
This result was revolutionary because it provided strong theoretical guarantees for an algorithm that requires no training and no knowledge of the underlying probability distributions. It gave nearest-neighbor methods a place at the foundation of statistical pattern recognition alongside more elaborate parametric methods.
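To make the bound concrete with assumed numbers: for a binary problem (M = 2) with a Bayes error of R* = 0.05, the theorem gives R <= 0.05 * (2 - 2 * 0.05 / 1) = 0.095, marginally tighter than the frequently quoted 2 * R* = 0.10.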
Following Cover and Hart, researchers extended the theory and practice of KNN in several directions, including consistency proofs for k growing with n, data-reduction schemes such as condensed and edited nearest neighbors, distance-weighted voting, and fast search structures such as k-d trees. These contributions kept KNN relevant through decades when more elaborate algorithms such as neural networks and support vector machines dominated headlines.
The choice of distance function determines what similarity means for KNN. A poor metric can cause the algorithm to find neighbors that are not semantically related to the query, while a well-chosen metric encodes domain knowledge into the model implicitly.
Let x = (x_1, ..., x_d) and y = (y_1, ..., y_d) be two points in R^d.
The Euclidean distance is the most familiar, equal to the straight-line distance between two points:
d(x, y) = sqrt(sum from i=1 to d of (x_i - y_i)^2)
Euclidean distance is the default in most KNN implementations. It assumes that all features are on the same scale, so feature standardization (zero mean, unit variance) is usually required before applying it.
The Manhattan distance, also called taxicab or city-block distance, is the sum of absolute differences along each dimension:
d(x, y) = sum from i=1 to d of |x_i - y_i|
Manhattan distance is more robust to outliers than Euclidean distance and is often preferred in high-dimensional settings, where the L1 norm tends to preserve more meaningful contrast between points than L2.
The Minkowski distance is a family of distances parameterized by p that generalizes both Euclidean (p = 2) and Manhattan (p = 1):
d(x, y) = (sum from i=1 to d of |x_i - y_i|^p)^(1/p)
For p approaching infinity, the Minkowski distance reduces to the Chebyshev distance (the maximum absolute difference along any single dimension). The Minkowski formulation is useful because the parameter p can be tuned by cross-validation as another hyperparameter.
Cosine similarity measures the cosine of the angle between two vectors, ignoring their magnitudes:
cos(x, y) = (x . y) / (||x|| * ||y||)
The corresponding cosine distance is 1 minus this value. Cosine distance is the standard choice for high-dimensional sparse data such as text embeddings, TF-IDF vectors, and the dense vector representations produced by transformer models. It is the distance metric used by most modern vector databases for semantic search.
The Mahalanobis distance accounts for the covariance structure of the data:
d(x, y) = sqrt((x - y)^T * S^(-1) * (x - y))
where S is the sample covariance matrix. Normalizing by the covariance makes the Mahalanobis distance scale-invariant and adjusts for correlations between features. It is equivalent to Euclidean distance after a whitening transformation. The metric is widely used in anomaly detection and in cases where features are correlated.
For categorical or binary features, the Hamming distance counts the number of positions at which two strings differ. It is the standard choice for categorical KNN and for comparing binary hash codes used in approximate nearest-neighbor search.
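As a quick reference, the sketch below computes several of these metrics with SciPy's distance module; the toy vectors and the sample used to estimate the covariance are illustrative only.

```python
import numpy as np
from scipy.spatial import distance

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 0.0, 4.0])

print(distance.euclidean(x, y))        # L2: straight-line distance
print(distance.cityblock(x, y))        # L1 / Manhattan: sum of absolute differences
print(distance.minkowski(x, y, p=3))   # Lp with p = 3
print(distance.chebyshev(x, y))        # L-infinity: largest per-dimension difference
print(distance.cosine(x, y))           # 1 - cosine similarity

# Mahalanobis needs the inverse covariance matrix of a data sample
sample = np.random.default_rng(0).normal(size=(100, 3))
VI = np.linalg.inv(np.cov(sample, rowvar=False))
print(distance.mahalanobis(x, y, VI))

# Hamming: SciPy reports the *fraction* of mismatched positions
print(distance.hamming([1, 0, 1, 1], [1, 1, 0, 1]))  # 0.5 -> 2 of 4 positions differ
```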
| Metric | Formula (informal) | Best for | Caveats |
|---|---|---|---|
| Euclidean (L2) | Square root of summed squared differences | Continuous, low-dimensional, rescaled features | Sensitive to feature scale and to the curse of dimensionality |
| Manhattan (L1) | Sum of absolute differences | High-dimensional or sparse data, robustness to outliers | Less geometrically intuitive than Euclidean |
| Minkowski (Lp) | Generalization of L1 and L2 with parameter p | When p can be tuned by cross-validation | Cost of an additional hyperparameter |
| Chebyshev (L-infinity) | Maximum absolute difference | Game-board distances, infinity-norm constraints | Often dominated by a single feature |
| Cosine | One minus cosine of angle between vectors | Dense and sparse high-dimensional embeddings, text | Ignores vector magnitude |
| Mahalanobis | Whitened Euclidean using inverse covariance | Correlated features, anomaly detection | Requires estimating and inverting a covariance matrix |
| Hamming | Count of mismatched positions | Binary or categorical data, hash codes | Not meaningful for continuous values |
The number of neighbors k is the central hyperparameter of KNN, controlling the bias-variance trade-off of the model.
With very small k (especially k = 1), the model has low bias but high variance. The decision boundary is jagged and snakes around every training point, perfectly classifying the training set but reacting strongly to noise. The model is said to overfit.
With very large k (approaching n), the model has high bias but low variance. The classifier averages over a large neighborhood and approaches a single global majority vote in the limit, ignoring local structure. The model is said to underfit. In the extreme case k = n, every query receives the same prediction.
The optimal k typically lies between these extremes and depends on the dataset, the noise level, and the dimensionality. As a rough rule of thumb, k around sqrt(n) is sometimes suggested as a starting point, but in practice the value should always be tuned empirically.
The standard procedure for selecting k is to perform k-fold cross-validation over a grid of candidate values and pick the value that minimizes validation error. Tools such as GridSearchCV in scikit-learn automate this search. It is important to perform cross-validation on the training set only, not on the test set, to avoid optimistic bias.
When the number of classes is two, choosing an odd k guarantees that there can be no tie in the majority vote, making the prediction deterministic. For multiclass problems, odd k does not eliminate ties, so a tie-breaking strategy (random selection, falling back to the closest neighbor, or distance weighting) must be specified.
For the Cover-Hart consistency results to hold beyond the 2x Bayes bound, k must be allowed to grow with n while satisfying k/n approaches 0. Practical heuristics such as k = sqrt(n) or k = log(n) approximate this asymptotic regime.
In the basic algorithm, all k neighbors contribute equally to the prediction. A common refinement is to weight each neighbor by a function of its distance, so closer neighbors carry more influence.
With uniform weighting, every neighbor contributes a vote of 1, regardless of how close or far it is. This is the default in most libraries and works well when the k neighbors are roughly equally informative.
With inverse-distance weighting, neighbors are weighted by w_i = 1 / d(x, x_i) (sometimes 1 / d^2 or another power). This gives nearby neighbors much more influence, which can sharpen predictions near complex decision boundaries. A small constant is usually added to the denominator to handle the case where d = 0.
Donald Shepard's 1968 method generalized inverse-distance weighting to arbitrary power exponents. The interpolated value at a query point is a weighted average of the surrounding values with weights w_i = 1 / d^p. With large p, the result becomes a Voronoi-like mosaic with nearly constant interpolated value within each cell. Shepard interpolation is widely used in geostatistics and is a special case of distance-weighted KNN regression.
More generally, any non-negative kernel function K(d) (Gaussian, Epanechnikov, tricube) can be used to weight neighbors. This connects KNN to kernel regression and Nadaraya-Watson estimation. With a Gaussian kernel and an appropriate bandwidth, the weighted KNN regressor approximates kernel density estimation of the conditional expectation.
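The sketch below contrasts uniform, inverse-distance, and Gaussian-kernel weighting using scikit-learn's KNeighborsRegressor, whose weights parameter accepts "uniform", "distance", or a callable; the bandwidth value is an arbitrary illustrative choice.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

def gaussian_weights(distances, bandwidth=0.5):
    """Kernel weighting: closer neighbors get exponentially larger weight."""
    return np.exp(-(distances ** 2) / (2 * bandwidth ** 2))

# Noisy 1-D regression problem
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, size=(200, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

uniform = KNeighborsRegressor(n_neighbors=10, weights="uniform").fit(X, y)
inverse = KNeighborsRegressor(n_neighbors=10, weights="distance").fit(X, y)
kernel = KNeighborsRegressor(n_neighbors=10, weights=gaussian_weights).fit(X, y)

x_new = np.array([[2.5]])
print(uniform.predict(x_new), inverse.predict(x_new), kernel.predict(x_new))
```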
A naive KNN implementation computes the distance from the query to every training example, leading to O(n * d) work per query for n training points in d dimensions. For large datasets this is prohibitively slow, motivating a long line of research into faster exact and approximate nearest-neighbor search.
The brute-force approach simply scans the entire training set. It is the most straightforward implementation, parallelizes well on GPUs, and is the only tractable option for very high-dimensional data where tree-based indices fail. Frameworks like PyTorch and JAX make brute-force KNN on the GPU surprisingly fast for moderate dataset sizes.
A k-d tree (Bentley, 1975) is a binary tree that recursively splits the data along axis-aligned hyperplanes. Each internal node selects a feature and a split value, and points are routed left or right depending on whether their value is below or above the split. Nearest-neighbor queries can be answered in expected O(log n) time when the data are reasonably uniform. k-d trees work best for low-dimensional data (d < 20). In higher dimensions, the backtracking required to guarantee correct answers degrades query time toward O(n).
A ball tree partitions the data into nested hyperspheres rather than axis-aligned boxes. Each node represents a ball with a center and a radius, and child balls collectively contain all points of the parent. Ball trees handle higher dimensions better than k-d trees because they do not depend on axis alignment. They are particularly useful when the data lie on a low-dimensional manifold within a high-dimensional ambient space.
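A brief sketch of exact tree-based search with scikit-learn's KDTree and BallTree classes; the dataset size and dimensionality are arbitrary.

```python
import numpy as np
from sklearn.neighbors import BallTree, KDTree

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 8))          # low-dimensional data suits a k-d tree
queries = rng.normal(size=(5, 8))

kd = KDTree(X, leaf_size=40)
dist_kd, idx_kd = kd.query(queries, k=5)  # distances and indices of the 5 nearest points

ball = BallTree(X, leaf_size=40)          # better behaved as dimensionality grows
dist_b, idx_b = ball.query(queries, k=5)

# Both structures perform exact search, so they return the same neighbors
assert (idx_kd == idx_b).all()
```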
Locality-sensitive hashing (LSH), introduced by Piotr Indyk and Rajeev Motwani in 1998, hashes input points so that similar points collide with high probability. To answer an approximate nearest-neighbor query, the algorithm hashes the query and considers only the points in the same hash bucket. LSH provides theoretical guarantees on approximation quality and is the conceptual ancestor of many modern approximate methods.
For billion-scale datasets typical of modern AI systems, even tree-based exact search is too slow. Approximate nearest neighbor (ANN) algorithms trade a small loss in recall for orders-of-magnitude faster queries and lower memory footprint. ANN search is the engine behind vector databases and the retrieval step in retrieval-augmented generation.
HNSW, introduced by Yury Malkov and Dmitry Yashunin in Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs (IEEE TPAMI, 2018; arXiv 2016), is a graph-based ANN algorithm that builds a multi-layer proximity graph. Each point is inserted into a random subset of layers with exponentially decaying probability, producing a sparse top layer for long-range navigation and a dense bottom layer for fine-grained search. Queries start at the top layer, greedily walk toward the query, then descend layer by layer, gathering candidate neighbors at each level. HNSW achieves logarithmic query complexity and is consistently among the top performers on the ANN-Benchmarks suite. It is the default index in Pinecone, Weaviate, Qdrant, and many other vector databases.
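A minimal sketch using hnswlib, one widely used HNSW implementation; the M, ef_construction, and ef settings shown are generic starting points rather than tuned values.

```python
import numpy as np
import hnswlib

dim, n = 128, 100_000
data = np.random.default_rng(0).normal(size=(n, dim)).astype(np.float32)

# Build a cosine-space HNSW index: M is the graph degree,
# ef_construction the build-time search effort
index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=n, ef_construction=200, M=16)
index.add_items(data, np.arange(n))

index.set_ef(64)  # query-time effort: higher ef means better recall, slower queries
labels, distances = index.knn_query(data[:5], k=10)
print(labels.shape, distances.shape)  # (5, 10) neighbor ids and cosine distances
```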
FAISS (Facebook AI Similarity Search) is an open-source library released by Meta in 2017, accompanying the paper Billion-Scale Similarity Search with GPUs by Jeff Johnson, Matthijs Douze, and Herve Jegou. FAISS provides a unified interface to multiple index types, including flat (brute-force), inverted file (IVF), product quantization (PQ), and HNSW. Its key innovation is GPU-accelerated brute-force and IVF-PQ search, which together enable approximate nearest-neighbor queries on indices containing billions of vectors. FAISS popularized the use of product quantization, which compresses each high-dimensional vector into a short code while preserving distance estimates, making in-memory billion-scale search feasible.
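The sketch below builds a small IVF-PQ index with FAISS following the library's usual train/add/search pattern; the nlist, sub-quantizer, and nprobe settings are illustrative.

```python
import numpy as np
import faiss

d, n = 64, 100_000
rng = np.random.default_rng(0)
xb = rng.normal(size=(n, d)).astype("float32")   # database vectors
xq = rng.normal(size=(10, d)).astype("float32")  # query vectors

nlist, m = 1024, 8                 # 1024 coarse cells; 8 sub-quantizers per vector
quantizer = faiss.IndexFlatL2(d)   # coarse quantizer for the inverted file
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, 8)

index.train(xb)                    # learn coarse centroids and PQ codebooks
index.add(xb)                      # compress and store the database vectors
index.nprobe = 16                  # number of coarse cells visited per query

D, I = index.search(xq, 10)        # approximate distances and neighbor ids
print(I.shape)                     # (10, 10)
```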
ScaNN (Scalable Nearest Neighbors), released by Google Research in 2020, is based on the paper Accelerating Large-Scale Inference with Anisotropic Vector Quantization by Ruiqi Guo, Philip Sun, and colleagues. ScaNN's main contribution is anisotropic vector quantization, a quantization scheme that minimizes inner-product error rather than reconstruction error, aligning the loss function with the actual goal of maximum inner product search. On the glove-100-angular benchmark, ScaNN handles roughly twice as many queries per second as HNSW at comparable accuracy. It is widely used inside Google products that require fast embedding-based retrieval.
ANNOY (Approximate Nearest Neighbors Oh Yeah), developed by Erik Bernhardsson at Spotify, is a tree-based ANN library built on randomized projection trees. ANNOY indices are memory-mapped, which means very large indices can be loaded almost instantly and shared across processes without memory duplication. Spotify uses ANNOY to power music recommendations across hundreds of millions of users.
DiskANN, introduced by Microsoft Research in 2019 (Subramanya et al., NeurIPS), is a graph-based ANN system designed to operate primarily out of solid-state storage. Unlike in-memory systems such as HNSW and FAISS, DiskANN can index a billion-point database on a single workstation with only 64 GB of RAM and an inexpensive SSD. Its underlying Vamana graph is constructed to have a small search radius, minimizing the number of disk accesses per query. DiskANN serves more than 5,000 queries per second with sub-3-millisecond latency at 95%+ recall on the SIFT1B benchmark.
| Library | Approach | Strength | Trade-off |
|---|---|---|---|
| HNSW (hnswlib) | Hierarchical proximity graph | Top-tier accuracy and latency, simple to tune | Memory grows with edge count |
| FAISS | Inverted file + product quantization, GPU support | Billion-scale on a single GPU; flexible index zoo | Steeper learning curve; quantization loss |
| ScaNN | Anisotropic vector quantization + tree partition | Fastest published QPS at high recall on many benchmarks | Mainly tuned for inner-product search |
| ANNOY | Random projection forest, memory-mapped | Tiny memory footprint, instant index load | Lower recall than HNSW or ScaNN |
| DiskANN | Vamana graph stored on SSD | Billion-scale on a single workstation with 64 GB RAM | Higher latency than fully in-memory systems |
| pgvector (PostgreSQL extension) | IVFFlat and HNSW indices inside PostgreSQL | Integrates vector search into a relational store | Often slower than dedicated vector DBs |
Although KNN was once considered a textbook curiosity in the deep learning era, it has returned to the center of AI infrastructure thanks to the rise of dense embeddings and large language models.
A modern vector database stores high-dimensional embedding vectors (often 384 to 4096 dimensions) and answers approximate KNN queries at low latency. Systems such as Pinecone, Milvus, Weaviate, Qdrant, Chroma, and the pgvector extension for PostgreSQL all rely on ANN algorithms (most commonly HNSW or IVF-PQ) under the hood. Vector databases are essentially production-grade KNN servers, hardened with sharding, replication, hybrid filtering, and tenancy.
Retrieval-augmented generation (RAG), introduced by Patrick Lewis and colleagues in Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (NeurIPS, 2020), pairs a parametric language model with a non-parametric retrieval index. The pipeline embeds a query, performs a KNN search over a corpus of document embeddings, and conditions the language model on the retrieved passages when generating the answer. RAG enables LLMs to cite up-to-date facts, ground their answers in trustworthy sources, and reduce hallucination. At the heart of every RAG system is an approximate KNN search, typically over millions or billions of dense embedding vectors.
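Schematically, the retrieval step of a RAG pipeline reduces to an embedding lookup plus a KNN search. In the sketch below, embed() and generate() are hypothetical placeholders for an embedding model and an LLM call, and a flat FAISS index stands in for a production vector database.

```python
import numpy as np
import faiss

def embed(texts, dim=384):
    """Placeholder embedder returning random unit vectors; a real pipeline would
    call a sentence-embedding model or an embedding API here."""
    rng = np.random.default_rng(0)
    vecs = rng.normal(size=(len(texts), dim)).astype("float32")
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

def generate(prompt):
    """Placeholder for the LLM call that would produce the final answer."""
    return f"[answer conditioned on]\n{prompt}"

documents = ["Passage about KNN history.", "Passage about HNSW.", "Passage about FAISS."]
doc_vectors = embed(documents)

index = faiss.IndexFlatIP(doc_vectors.shape[1])  # inner product == cosine on unit vectors
index.add(doc_vectors)

def rag_answer(question, k=2):
    q = embed([question])
    _, ids = index.search(q, k)                  # KNN retrieval over document embeddings
    context = "\n\n".join(documents[i] for i in ids[0])
    return generate(f"Context:\n{context}\n\nQuestion: {question}")

print(rag_answer("Who introduced HNSW?"))
```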
Urvashi Khandelwal and colleagues introduced kNN-LM in Generalization through Memorization: Nearest Neighbor Language Models (ICLR, 2020). The idea is to interpolate the next-token distribution of a parametric LM with a distribution derived from the nearest neighbors of the current context in a stored datastore. Concretely, the model embeds each context in the training corpus once, builds a FAISS index over those embeddings, and at inference time looks up the nearest contexts to the current one and reads off their next tokens. kNN-LM achieved a state-of-the-art perplexity of 15.79 on Wikitext-103 with no additional training, and the technique enables effective domain adaptation by simply swapping the datastore. It revived interest in the idea that explicit memorization, rather than only parameter-based generalization, is a powerful tool in language modeling.
More recent interpretability research has identified specific retrieval heads in transformer language models, attention heads that behave like lightweight nearest-neighbor lookups by attending to past tokens that match the current context. Disabling these heads degrades long-context retrieval and in-context learning, suggesting that something resembling KNN is implemented inside large neural networks even without an explicit external index.
KNN-style methods power many recommender systems. In user-based collaborative filtering, the algorithm finds the k users most similar to a target user (typically by cosine similarity over their rating vectors) and recommends items those neighbors enjoyed. In item-based collaborative filtering, the algorithm computes item-to-item similarity and recommends items similar to those the user already liked. The item-based variant, popularized by Amazon in the early 2000s, scales better because the item-item matrix is more stable than the user-user matrix and can be precomputed offline. Modern recommendation systems often combine deep embeddings with ANN-powered KNN retrieval to serve recommendations in milliseconds across billions of items.
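A toy NumPy sketch of the item-based variant: item-to-item cosine similarity is precomputed from the rating matrix, and unrated items are scored by their similarity to what the user already rated; the matrix and user index are purely illustrative.

```python
import numpy as np

# Rows = users, columns = items; 0 means unrated
ratings = np.array([
    [5, 4, 0, 1, 0],
    [4, 5, 1, 0, 0],
    [0, 1, 5, 4, 4],
    [1, 0, 4, 5, 5],
], dtype=float)

# Item-item cosine similarity computed from the rating columns
norms = np.linalg.norm(ratings, axis=0, keepdims=True)
item_sim = (ratings.T @ ratings) / (norms.T @ norms)

def recommend(user_idx, top_n=2):
    """Score unrated items by similarity to the items this user already rated."""
    user = ratings[user_idx]
    scores = item_sim @ user          # similarity-weighted sum of the user's ratings
    scores[user > 0] = -np.inf        # never re-recommend items the user already rated
    return np.argsort(scores)[::-1][:top_n]

print(recommend(0))  # item indices recommended for user 0
```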
Reverse image search, visual product matching, and multimodal retrieval all rely on KNN over image embeddings extracted by convolutional neural networks or vision transformers. Models like CLIP embed images and text into a shared vector space, after which finding the most relevant images for a text query reduces to an ANN search.
KNN is a natural anomaly detection tool: data points that lie far from their k nearest neighbors are likely anomalies. The Local Outlier Factor (LOF), introduced by Breunig et al. in 2000, refines this idea by comparing the local density of a point to the local density of its neighbors, identifying points whose local neighborhood is much sparser than expected. KNN-based anomaly detection is widely used in network intrusion detection, credit card fraud detection, and industrial process monitoring.
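A short scikit-learn sketch of this idea using LocalOutlierFactor; the synthetic data and the contamination setting are assumptions for illustration.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
inliers = rng.normal(size=(200, 2))                  # dense cluster
outliers = rng.uniform(low=-8, high=8, size=(5, 2))  # scattered anomalies
X = np.vstack([inliers, outliers])

lof = LocalOutlierFactor(n_neighbors=20, contamination=0.05)
labels = lof.fit_predict(X)             # +1 for inliers, -1 for flagged outliers
scores = lof.negative_outlier_factor_   # more negative means more anomalous

print(np.where(labels == -1)[0])        # indices of the points flagged as outliers
```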