k-Nearest Neighbors (often abbreviated k-NN or KNN) is a non-parametric, instance-based supervised learning algorithm used for both classification and regression. The algorithm classifies a new data point by looking at the k closest examples in the training set under some distance metric, then assigning the majority class (for classification) or the average target value (for regression). KNN is one of the oldest and conceptually simplest algorithms in machine learning, with origins tracing back to a 1951 technical report by Evelyn Fix and Joseph Hodges, but it remains foundational to modern artificial intelligence systems through its role in vector database similarity search and retrieval-augmented generation.
KNN is sometimes described as a lazy learner because it performs no explicit training step. The model effectively is the training set: when a query arrives, the algorithm computes distances from the query to all stored examples and uses the nearest ones to make a prediction. This contrasts with eager learners such as logistic regression, support vector machines, or neural networks, which fit a compact set of parameters at training time and discard the data afterward. The trade-off is direct: KNN has zero training cost but pays at inference time, both in computation and in memory.
The standard k-NN procedure for a query point x is straightforward: compute the distance from x to every example in the training set under the chosen metric, select the k examples with the smallest distances, and aggregate their targets, by plurality vote for classification or by averaging for regression.
KNN never represents its decision boundary explicitly: the boundary is the locus of points equidistant from nearest neighbors of different classes. With k = 1, the boundary follows the faces of the Voronoi tessellation of the training set that separate points of different classes. As k increases, the boundary becomes smoother and less sensitive to individual training points.
In the classification setting, given training pairs (x_1, y_1), ..., (x_n, y_n) with discrete labels y_i in {1, ..., C}, the prediction for a new query x is:
y_hat = argmax_c sum over i in N_k(x) of 1{y_i = c}
where N_k(x) is the set of indices of the k nearest neighbors of x, and 1{...} is the indicator function. For binary classification, choosing an odd value of k avoids ties.
For regression with continuous targets y_i in R, the standard prediction is the mean of the neighbors:
y_hat = (1/k) * sum over i in N_k(x) of y_i
Median or mode aggregations are sometimes used when the target distribution is skewed or contains outliers. KNN regression produces piecewise-constant predictions, which makes it a useful baseline but limits its accuracy on smooth functions compared to methods like linear regression or gradient boosting.
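The whole procedure fits in a few lines of NumPy. The sketch below is a minimal brute-force implementation of both the voting and averaging rules described above; the array names and the toy data are illustrative rather than taken from any particular library.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, query, k=5, mode="classify"):
    """Brute-force k-NN: compute all distances, keep the k closest,
    then take a plurality vote (classification) or the mean (regression)."""
    # Euclidean distance from the query to every stored training example
    dists = np.linalg.norm(X_train - query, axis=1)
    # Indices of the k nearest neighbors, i.e. the set N_k(x)
    nearest = np.argsort(dists)[:k]
    if mode == "classify":
        return Counter(y_train[nearest]).most_common(1)[0][0]
    return y_train[nearest].mean()

# Toy usage with illustrative data
X_train = np.array([[0.0, 0.0], [1.0, 1.0], [0.9, 1.1], [5.0, 5.0]])
y_train = np.array([0, 1, 1, 0])
print(knn_predict(X_train, y_train, np.array([1.0, 0.9]), k=3))  # -> 1
```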
The table below summarizes the differences between the two main use cases of KNN.
| Aspect | KNN Classification | KNN Regression |
|---|---|---|
| Output | Discrete class label | Continuous numeric value |
| Aggregation | Plurality vote among k neighbors | Mean, median, or weighted average of neighbor targets |
| Tie-breaking | Choose odd k for binary problems; reduce k or use distance weighting otherwise | Not usually an issue, but mean is sensitive to outliers |
| Loss connection | Approximates the Bayes classifier under mild conditions | Approximates the conditional expectation E[Y \| X = x] |
| Weighting | Uniform or inverse-distance weights | Uniform mean or distance-weighted average (Shepard interpolation) |
| Common metric | Accuracy, F1, AUC | Mean squared error, mean absolute error |
| Decision surface | Piecewise-constant in label space | Piecewise-constant in target space |
The k-nearest neighbor rule was first introduced in a 1951 unpublished technical report titled Discriminatory Analysis: Nonparametric Discrimination, Consistency Properties by Evelyn Fix and Joseph Lawson Hodges Jr. at the University of California, Berkeley. The report was prepared for the United States Air Force School of Aviation Medicine and circulated only as an internal document. Fix and Hodges studied a closely related procedure, the k-NN density estimator, and proposed using it as the basis for a non-parametric classification rule. Although the report was not formally published until decades later, it established the theoretical foundation for the family of nearest-neighbor methods.
Fix (1904 to 1965) was a pioneering woman in mathematical statistics, and Hodges (1922 to 2000) had been working with the U.S. Air Force on statistical problems since World War II. Their report quietly shaped the field of pattern recognition for years before the broader research community took notice.
The nearest neighbor rule entered the mainstream of pattern recognition research with the publication of Nearest Neighbor Pattern Classification by Thomas Cover and Peter Hart in the IEEE Transactions on Information Theory, volume 13, pages 21 to 27, in 1967. Cover and Hart proved a striking asymptotic result that became one of the most cited findings in early pattern recognition theory.
The Cover-Hart theorem states that as the number of training examples n approaches infinity, the error rate R of the 1-nearest-neighbor classifier satisfies:
R* <= R <= R* * (2 - M * R* / (M - 1))
where R* is the Bayes error rate (the irreducible minimum error rate of any classifier with knowledge of the true class-conditional distributions) and M is the number of classes. In the binary case, this simplifies to R <= 2 * R*, often quoted as the 2x Bayes bound: with infinitely many training examples, the simple 1-NN rule never has more than twice the error rate of the optimal Bayes classifier. For larger k, the bound improves further, approaching R* as k grows (provided k/n approaches zero so the classifier remains local).
This result was revolutionary because it provided strong theoretical guarantees for an algorithm that requires no training and no knowledge of the underlying probability distributions. It gave nearest-neighbor methods a place at the foundation of statistical pattern recognition alongside more elaborate parametric methods.
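To make the bound concrete with assumed numbers: for a binary problem (M = 2) with a Bayes error of R* = 0.05, the theorem gives R <= 0.05 * (2 - 2 * 0.05 / 1) = 0.095, marginally tighter than the frequently quoted 2 * R* = 0.10.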
Following Cover and Hart, researchers extended the theory and practice of KNN in several directions, including consistency proofs for k growing with n, data-reduction schemes such as condensed and edited nearest neighbors, distance-weighted voting, and fast search structures such as k-d trees. These contributions kept KNN relevant through decades when more elaborate algorithms such as neural networks and support vector machines dominated headlines.
The choice of distance function determines what similarity means for KNN. A poor metric can cause the algorithm to find neighbors that are not semantically related to the query, while a well-chosen metric encodes domain knowledge into the model implicitly.
Let x = (x_1, ..., x_d) and y = (y_1, ..., y_d) be two points in R^d.
The Euclidean distance is the most familiar, equal to the straight-line distance between two points:
d(x, y) = sqrt(sum from i=1 to d of (x_i - y_i)^2)
Euclidean distance is the default in most KNN implementations. It assumes that all features are on the same scale, so feature standardization (zero mean, unit variance) is usually required before applying it.
The Manhattan distance, also called taxicab or city-block distance, is the sum of absolute differences along each dimension:
d(x, y) = sum from i=1 to d of |x_i - y_i|
Manhattan distance is more robust to outliers than Euclidean distance and is often preferred in high-dimensional settings, where the L1 norm tends to preserve more meaningful contrast between points than L2.
The Minkowski distance is a family of distances parameterized by p that generalizes both Euclidean (p = 2) and Manhattan (p = 1):
d(x, y) = (sum from i=1 to d of |x_i - y_i|^p)^(1/p)
For p approaching infinity, the Minkowski distance reduces to the Chebyshev distance (the maximum absolute difference along any single dimension). The Minkowski formulation is useful because the parameter p can be tuned by cross-validation as another hyperparameter.
Cosine similarity measures the cosine of the angle between two vectors, ignoring their magnitudes:
cos(x, y) = (x . y) / (||x|| * ||y||)
The corresponding cosine distance is 1 minus this value. Cosine distance is the standard choice for high-dimensional sparse data such as text embeddings, TF-IDF vectors, and the dense vector representations produced by transformer models. It is the distance metric used by most modern vector databases for semantic search.
The Mahalanobis distance accounts for the covariance structure of the data:
d(x, y) = sqrt((x - y)^T * S^(-1) * (x - y))
where S is the sample covariance matrix. Normalizing by the covariance makes the Mahalanobis distance scale-invariant and adjusts for correlations between features. It is equivalent to Euclidean distance after a whitening transformation. The metric is widely used in anomaly detection and in cases where features are correlated.
For categorical or binary features, the Hamming distance counts the number of positions at which two strings differ. It is the standard choice for categorical KNN and for comparing binary hash codes used in approximate nearest-neighbor search.
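As a quick reference, the sketch below computes several of these metrics with SciPy's distance module; the toy vectors and the sample used to estimate the covariance are illustrative only.

```python
import numpy as np
from scipy.spatial import distance

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 0.0, 4.0])

print(distance.euclidean(x, y))        # L2: straight-line distance
print(distance.cityblock(x, y))        # L1 / Manhattan: sum of absolute differences
print(distance.minkowski(x, y, p=3))   # Lp with p = 3
print(distance.chebyshev(x, y))        # L-infinity: largest per-dimension difference
print(distance.cosine(x, y))           # 1 - cosine similarity

# Mahalanobis needs the inverse covariance matrix of a data sample
sample = np.random.default_rng(0).normal(size=(100, 3))
VI = np.linalg.inv(np.cov(sample, rowvar=False))
print(distance.mahalanobis(x, y, VI))

# Hamming: SciPy reports the *fraction* of mismatched positions
print(distance.hamming([1, 0, 1, 1], [1, 1, 0, 1]))  # 0.5 -> 2 of 4 positions differ
```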
| Metric | Formula (informal) | Best for | Caveats |
|---|---|---|---|
| Euclidean (L2) | Square root of summed squared differences | Continuous, low-dimensional, rescaled features | Sensitive to feature scale and to the curse of dimensionality |
| Manhattan (L1) | Sum of absolute differences | High-dimensional or sparse data, robustness to outliers | Less geometrically intuitive than Euclidean |
| Minkowski (Lp) | Generalization of L1 and L2 with parameter p | When p can be tuned by cross-validation | Cost of an additional hyperparameter |
| Chebyshev (L-infinity) | Maximum absolute difference | Game-board distances, infinity-norm constraints | Often dominated by a single feature |
| Cosine | One minus cosine of angle between vectors | Dense and sparse high-dimensional embeddings, text | Ignores vector magnitude |
| Mahalanobis | Whitened Euclidean using inverse covariance | Correlated features, anomaly detection | Requires estimating and inverting a covariance matrix |
| Hamming | Count of mismatched positions | Binary or categorical data, hash codes | Not meaningful for continuous values |
The number of neighbors k is the central hyperparameter of KNN, controlling the bias-variance trade-off of the model.
With very small k (especially k = 1), the model has low bias but high variance. The decision boundary is jagged and snakes around every training point, perfectly classifying the training set but reacting strongly to noise. The model is said to overfit.
With very large k (approaching n), the model has high bias but low variance. The classifier averages over a large neighborhood and approaches a single global majority vote in the limit, ignoring local structure. The model is said to underfit. In the extreme case k = n, every query receives the same prediction.
The optimal k typically lies between these extremes and depends on the dataset, the noise level, and the dimensionality. As a rough rule of thumb, k around sqrt(n) is sometimes suggested as a starting point, but in practice the value should always be tuned empirically.
The standard procedure for selecting k is to perform k-fold cross-validation over a grid of candidate values and pick the value that minimizes validation error. Tools such as GridSearchCV in scikit-learn automate this search. It is important to perform cross-validation on the training set only, not on the test set, to avoid optimistic bias.
When the number of classes is two, choosing an odd k guarantees that there can be no tie in the majority vote, making the prediction deterministic. For multiclass problems, odd k does not eliminate ties, so a tie-breaking strategy (random selection, falling back to the closest neighbor, or distance weighting) must be specified.
For the Cover-Hart consistency results to hold beyond the 2x Bayes bound, k must be allowed to grow with n while satisfying k/n approaches 0. Practical heuristics such as k = sqrt(n) or k = log(n) approximate this asymptotic regime.
In the basic algorithm, all k neighbors contribute equally to the prediction. A common refinement is to weight each neighbor by a function of its distance, so closer neighbors carry more influence.
With uniform weighting, every neighbor contributes a vote of 1, regardless of how close or far it is. This is the default in most libraries and works well when the k neighbors are roughly equally informative.
With inverse-distance weighting, neighbors are weighted by w_i = 1 / d(x, x_i) (sometimes 1 / d^2 or another power). This gives nearby neighbors much more influence, which can sharpen predictions near complex decision boundaries. A small constant is usually added to the denominator to handle the case where d = 0.
Donald Shepard's 1968 method generalized inverse-distance weighting to arbitrary power exponents. The interpolated value at a query point is a weighted average of the surrounding values with weights w_i = 1 / d^p. With large p, the result becomes a Voronoi-like mosaic with nearly constant interpolated value within each cell. Shepard interpolation is widely used in geostatistics and is a special case of distance-weighted KNN regression.
More generally, any non-negative kernel function K(d) (Gaussian, Epanechnikov, tricube) can be used to weight neighbors. This connects KNN to kernel regression and Nadaraya-Watson estimation. With a Gaussian kernel and an appropriate bandwidth, the weighted KNN regressor approximates kernel density estimation of the conditional expectation.
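The sketch below contrasts uniform, inverse-distance, and Gaussian-kernel weighting using scikit-learn's KNeighborsRegressor, whose weights parameter accepts "uniform", "distance", or a callable; the bandwidth value is an arbitrary illustrative choice.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

def gaussian_weights(distances, bandwidth=0.5):
    """Kernel weighting: closer neighbors get exponentially larger weight."""
    return np.exp(-(distances ** 2) / (2 * bandwidth ** 2))

# Noisy 1-D regression problem
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, size=(200, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

uniform = KNeighborsRegressor(n_neighbors=10, weights="uniform").fit(X, y)
inverse = KNeighborsRegressor(n_neighbors=10, weights="distance").fit(X, y)
kernel = KNeighborsRegressor(n_neighbors=10, weights=gaussian_weights).fit(X, y)

x_new = np.array([[2.5]])
print(uniform.predict(x_new), inverse.predict(x_new), kernel.predict(x_new))
```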
A naive KNN implementation computes the distance from the query to every training example, leading to O(n * d) work per query for n training points in d dimensions. For large datasets this is prohibitively slow, motivating a long line of research into faster exact and approximate nearest-neighbor search.
The brute-force approach simply scans the entire training set. It is the most straightforward implementation, parallelizes well on GPUs, and is the only tractable option for very high-dimensional data where tree-based indices fail. Frameworks like PyTorch and JAX make brute-force KNN on the GPU surprisingly fast for moderate dataset sizes.
A k-d tree (Bentley, 1975) is a binary tree that recursively splits the data along axis-aligned hyperplanes. Each internal node selects a feature and a split value, and points are routed left or right depending on whether their value is below or above the split. Nearest-neighbor queries can be answered in expected O(log n) time when the data are reasonably uniform. k-d trees work best for low-dimensional data (d < 20). In higher dimensions, the backtracking required to guarantee correct answers degrades query time toward O(n).
A ball tree partitions the data into nested hyperspheres rather than axis-aligned boxes. Each node represents a ball with a center and a radius, and child balls collectively contain all points of the parent. Ball trees handle higher dimensions better than k-d trees because they do not depend on axis alignment. They are particularly useful when the data lie on a low-dimensional manifold within a high-dimensional ambient space.
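A brief sketch of exact tree-based search with scikit-learn's KDTree and BallTree classes; the dataset size and dimensionality are arbitrary.

```python
import numpy as np
from sklearn.neighbors import BallTree, KDTree

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 8))          # low-dimensional data suits a k-d tree
queries = rng.normal(size=(5, 8))

kd = KDTree(X, leaf_size=40)
dist_kd, idx_kd = kd.query(queries, k=5)  # distances and indices of the 5 nearest points

ball = BallTree(X, leaf_size=40)          # better behaved as dimensionality grows
dist_b, idx_b = ball.query(queries, k=5)

# Both structures perform exact search, so they return the same neighbors
assert (idx_kd == idx_b).all()
```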
Locality-sensitive hashing (LSH), introduced by Piotr Indyk and Rajeev Motwani in 1998, hashes input points so that similar points collide with high probability. To answer an approximate nearest-neighbor query, the algorithm hashes the query and considers only the points in the same hash bucket. LSH provides theoretical guarantees on approximation quality and is the conceptual ancestor of many modern approximate methods.
For billion-scale datasets typical of modern AI systems, even tree-based exact search is too slow. Approximate nearest neighbor (ANN) algorithms trade a small loss in recall for orders-of-magnitude faster queries and lower memory footprint. ANN search is the engine behind vector databases and the retrieval step in retrieval-augmented generation.
HNSW, introduced by Yury Malkov and Dmitry Yashunin in Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs (IEEE TPAMI, 2018; arXiv 2016), is a graph-based ANN algorithm that builds a multi-layer proximity graph. Each point is inserted into a random subset of layers with exponentially decaying probability, producing a sparse top layer for long-range navigation and a dense bottom layer for fine-grained search. Queries start at the top layer, greedily walk toward the query, then descend layer by layer, gathering candidate neighbors at each level. HNSW achieves logarithmic query complexity and is consistently among the top performers on the ANN-Benchmarks suite. It is the default index in Pinecone, Weaviate, Qdrant, and many other vector databases.
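A minimal sketch using hnswlib, one widely used HNSW implementation; the M, ef_construction, and ef settings shown are generic starting points rather than tuned values.

```python
import numpy as np
import hnswlib

dim, n = 128, 100_000
data = np.random.default_rng(0).normal(size=(n, dim)).astype(np.float32)

# Build a cosine-space HNSW index: M is the graph degree,
# ef_construction the build-time search effort
index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=n, ef_construction=200, M=16)
index.add_items(data, np.arange(n))

index.set_ef(64)  # query-time effort: higher ef means better recall, slower queries
labels, distances = index.knn_query(data[:5], k=10)
print(labels.shape, distances.shape)  # (5, 10) neighbor ids and cosine distances
```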
FAISS (Facebook AI Similarity Search) is an open-source library released by Meta in 2017, accompanying the paper Billion-Scale Similarity Search with GPUs by Jeff Johnson, Matthijs Douze, and Herve Jegou. FAISS provides a unified interface to multiple index types, including flat (brute-force), inverted file (IVF), product quantization (PQ), and HNSW. Its key innovation is GPU-accelerated brute-force and IVF-PQ search, which together enable approximate nearest-neighbor queries on indices containing billions of vectors. FAISS popularized the use of product quantization, which compresses each high-dimensional vector into a short code while preserving distance estimates, making in-memory billion-scale search feasible.
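The sketch below builds a small IVF-PQ index with FAISS following the library's usual train/add/search pattern; the nlist, sub-quantizer, and nprobe settings are illustrative.

```python
import numpy as np
import faiss

d, n = 64, 100_000
rng = np.random.default_rng(0)
xb = rng.normal(size=(n, d)).astype("float32")   # database vectors
xq = rng.normal(size=(10, d)).astype("float32")  # query vectors

nlist, m = 1024, 8                 # 1024 coarse cells; 8 sub-quantizers per vector
quantizer = faiss.IndexFlatL2(d)   # coarse quantizer for the inverted file
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, 8)

index.train(xb)                    # learn coarse centroids and PQ codebooks
index.add(xb)                      # compress and store the database vectors
index.nprobe = 16                  # number of coarse cells visited per query

D, I = index.search(xq, 10)        # approximate distances and neighbor ids
print(I.shape)                     # (10, 10)
```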
ScaNN (Scalable Nearest Neighbors), released by Google Research in 2020, is based on the paper Accelerating Large-Scale Inference with Anisotropic Vector Quantization by Ruiqi Guo, Philip Sun, and colleagues. ScaNN's main contribution is anisotropic vector quantization, a quantization scheme that minimizes inner-product error rather than reconstruction error, aligning the loss function with the actual goal of maximum inner product search. On the glove-100-angular benchmark, ScaNN handles roughly twice as many queries per second as HNSW at comparable accuracy. It is widely used inside Google products that require fast embedding-based retrieval.
ANNOY (Approximate Nearest Neighbors Oh Yeah), developed by Erik Bernhardsson at Spotify, is a tree-based ANN library built on randomized projection trees. ANNOY indices are memory-mapped, which means very large indices can be loaded almost instantly and shared across processes without memory duplication. Spotify uses ANNOY to power music recommendations across hundreds of millions of users.
DiskANN, introduced by Microsoft Research in 2019 (Subramanya et al., NeurIPS), is a graph-based ANN system designed to operate primarily out of solid-state storage. Unlike in-memory systems such as HNSW and FAISS, DiskANN can index a billion-point database on a single workstation with only 64 GB of RAM and an inexpensive SSD. Its underlying Vamana graph is constructed to have a small search radius, minimizing the number of disk accesses per query. DiskANN serves more than 5,000 queries per second with sub-3-millisecond latency at 95%+ recall on the SIFT1B benchmark.
| Library | Approach | Strength | Trade-off |
|---|---|---|---|
| HNSW (hnswlib) | Hierarchical proximity graph | Top-tier accuracy and latency, simple to tune | Memory grows with edge count |
| FAISS | Inverted file + product quantization, GPU support | Billion-scale on a single GPU; flexible index zoo | Steeper learning curve; quantization loss |
| ScaNN | Anisotropic vector quantization + tree partition | Fastest published QPS at high recall on many benchmarks | Mainly tuned for inner-product search |
| ANNOY | Random projection forest, memory-mapped | Tiny memory footprint, instant index load | Lower recall than HNSW or ScaNN |
| DiskANN | Vamana graph stored on SSD | Billion-scale on a single workstation with 64 GB RAM | Higher latency than fully in-memory systems |
| pgvector (PostgreSQL extension) | IVFFlat and HNSW indices inside PostgreSQL | Integrates vector search into a relational store | Often slower than dedicated vector DBs |
Although KNN was once considered a textbook curiosity in the deep learning era, it has returned to the center of AI infrastructure thanks to the rise of dense embeddings and large language models.
A modern vector database stores high-dimensional embedding vectors (often 384 to 4096 dimensions) and answers approximate KNN queries at low latency. Systems such as Pinecone, Milvus, Weaviate, Qdrant, Chroma, and the pgvector extension for PostgreSQL all rely on ANN algorithms (most commonly HNSW or IVF-PQ) under the hood. Vector databases are essentially production-grade KNN servers, hardened with sharding, replication, hybrid filtering, and tenancy.
Retrieval-augmented generation (RAG), introduced by Patrick Lewis and colleagues in Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (NeurIPS, 2020), pairs a parametric language model with a non-parametric retrieval index. The pipeline embeds a query, performs a KNN search over a corpus of document embeddings, and conditions the language model on the retrieved passages when generating the answer. RAG enables LLMs to cite up-to-date facts, ground their answers in trustworthy sources, and reduce hallucination. At the heart of every RAG system is an approximate KNN search, typically over millions or billions of dense embedding vectors.
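Schematically, the retrieval step of a RAG pipeline reduces to an embedding lookup plus a KNN search. In the sketch below, embed() and generate() are hypothetical placeholders for an embedding model and an LLM call, and a flat FAISS index stands in for a production vector database.

```python
import numpy as np
import faiss

def embed(texts, dim=384):
    """Placeholder embedder returning random unit vectors; a real pipeline would
    call a sentence-embedding model or an embedding API here."""
    rng = np.random.default_rng(0)
    vecs = rng.normal(size=(len(texts), dim)).astype("float32")
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

def generate(prompt):
    """Placeholder for the LLM call that would produce the final answer."""
    return f"[answer conditioned on]\n{prompt}"

documents = ["Passage about KNN history.", "Passage about HNSW.", "Passage about FAISS."]
doc_vectors = embed(documents)

index = faiss.IndexFlatIP(doc_vectors.shape[1])  # inner product == cosine on unit vectors
index.add(doc_vectors)

def rag_answer(question, k=2):
    q = embed([question])
    _, ids = index.search(q, k)                  # KNN retrieval over document embeddings
    context = "\n\n".join(documents[i] for i in ids[0])
    return generate(f"Context:\n{context}\n\nQuestion: {question}")

print(rag_answer("Who introduced HNSW?"))
```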
Urvashi Khandelwal and colleagues introduced kNN-LM in Generalization through Memorization: Nearest Neighbor Language Models (ICLR, 2020). The idea is to interpolate the next-token distribution of a parametric LM with a distribution derived from the nearest neighbors of the current context in a stored datastore. Concretely, the model embeds each context in the training corpus once, builds a FAISS index over those embeddings, and at inference time looks up the nearest contexts to the current one and reads off their next tokens. kNN-LM achieved a state-of-the-art perplexity of 15.79 on Wikitext-103 with no additional training, and the technique enables effective domain adaptation by simply swapping the datastore. It revived interest in the idea that explicit memorization, rather than only parameter-based generalization, is a powerful tool in language modeling.
More recent interpretability research has identified specific retrieval heads in transformer language models, attention heads that behave like lightweight nearest-neighbor lookups by attending to past tokens that match the current context. Disabling these heads degrades long-context retrieval and in-context learning, suggesting that something resembling KNN is implemented inside large neural networks even without an explicit external index.
KNN-style methods power many recommender systems. In user-based collaborative filtering, the algorithm finds the k users most similar to a target user (typically by cosine similarity over their rating vectors) and recommends items those neighbors enjoyed. In item-based collaborative filtering, the algorithm computes item-to-item similarity and recommends items similar to those the user already liked. The item-based variant, popularized by Amazon in the early 2000s, scales better because the item-item matrix is more stable than the user-user matrix and can be precomputed offline. Modern recommendation systems often combine deep embeddings with ANN-powered KNN retrieval to serve recommendations in milliseconds across billions of items.
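A toy NumPy sketch of the item-based variant: item-to-item cosine similarity is precomputed from the rating matrix, and unrated items are scored by their similarity to what the user already rated; the matrix and user index are purely illustrative.

```python
import numpy as np

# Rows = users, columns = items; 0 means unrated
ratings = np.array([
    [5, 4, 0, 1, 0],
    [4, 5, 1, 0, 0],
    [0, 1, 5, 4, 4],
    [1, 0, 4, 5, 5],
], dtype=float)

# Item-item cosine similarity computed from the rating columns
norms = np.linalg.norm(ratings, axis=0, keepdims=True)
item_sim = (ratings.T @ ratings) / (norms.T @ norms)

def recommend(user_idx, top_n=2):
    """Score unrated items by similarity to the items this user already rated."""
    user = ratings[user_idx]
    scores = item_sim @ user          # similarity-weighted sum of the user's ratings
    scores[user > 0] = -np.inf        # never re-recommend items the user already rated
    return np.argsort(scores)[::-1][:top_n]

print(recommend(0))  # item indices recommended for user 0
```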
Reverse image search, visual product matching, and multimodal retrieval all rely on KNN over image embeddings extracted by convolutional neural networks or vision transformers. Models like CLIP embed images and text into a shared vector space, after which finding the most relevant images for a text query reduces to an ANN search.
KNN is a natural anomaly detection tool: data points that lie far from their k nearest neighbors are likely anomalies. The Local Outlier Factor (LOF), introduced by Breunig et al. in 2000, refines this idea by comparing the local density of a point to the local density of its neighbors, identifying points whose local neighborhood is much sparser than expected. KNN-based anomaly detection is widely used in network intrusion detection, credit card fraud detection, and industrial process monitoring.
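A short scikit-learn sketch of this idea using LocalOutlierFactor; the synthetic data and the contamination setting are assumptions for illustration.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
inliers = rng.normal(size=(200, 2))                  # dense cluster
outliers = rng.uniform(low=-8, high=8, size=(5, 2))  # scattered anomalies
X = np.vstack([inliers, outliers])

lof = LocalOutlierFactor(n_neighbors=20, contamination=0.05)
labels = lof.fit_predict(X)             # +1 for inliers, -1 for flagged outliers
scores = lof.negative_outlier_factor_   # more negative means more anomalous

print(np.where(labels == -1)[0])        # indices of the points flagged as outliers
```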