UMAP (Uniform Manifold Approximation and Projection) is a nonlinear dimensionality reduction technique developed by Leland McInnes, John Healy, and James Melville. First published in 2018, UMAP constructs a high-dimensional topological representation of data and then optimizes a low-dimensional layout to match that topology as closely as possible. It is widely used for visualizing high-dimensional datasets and as a general-purpose preprocessing step in machine learning pipelines. UMAP is competitive with t-SNE in visualization quality while offering faster runtimes, better preservation of global structure, and no restriction on embedding dimensionality.
Imagine you have a huge pile of LEGO bricks in hundreds of different colors and shapes. You want to arrange them on your desk so that similar bricks end up near each other, but your desk is much smaller than the pile. UMAP is like a smart helper that first figures out which bricks are "neighbors" in the big pile (which ones look alike), then carefully places them on the desk so that bricks that were close together in the pile stay close on the desk, too. It does this by building a kind of invisible web connecting nearby bricks, then slowly sliding them around on the desk until the web on the desk looks as much as possible like the web in the original pile.
Before UMAP, the dominant method for nonlinear dimensionality reduction and visualization was t-SNE, introduced by Laurens van der Maaten and Geoffrey Hinton in 2008. t-SNE produces compelling two-dimensional visualizations of complex datasets, but it has several practical limitations: it is slow on large datasets, it does not scale well to embedding dimensions higher than two or three, and it tends to lose global structure in favor of local cluster fidelity. Earlier linear methods like PCA are fast and preserve global variance, but they cannot capture the nonlinear relationships that characterize many real-world datasets.
McInnes, Healy, and Melville introduced UMAP in a 2018 arXiv preprint (later published in the Journal of Open Source Software) specifically to address these shortcomings. They grounded the algorithm in a formal mathematical framework drawing on Riemannian geometry and algebraic topology, rather than building it through purely empirical experimentation. This theoretical foundation distinguishes UMAP from most other manifold learning algorithms and provides principled justifications for each design decision in the algorithm.[1]
UMAP's design rests on two branches of mathematics: Riemannian geometry and the theory of fuzzy simplicial sets from algebraic topology.
UMAP assumes the input data lies on or near a manifold embedded in high-dimensional space. A manifold is a geometric object that locally resembles Euclidean space but may have a complex global shape (for example, the surface of a sphere is a two-dimensional manifold embedded in three dimensions).
Rather than using the ambient Euclidean distance between data points directly, UMAP constructs a local Riemannian metric around each point. Specifically, UMAP assumes the data is uniformly distributed with respect to this local metric. Because real data is never truly uniform in the ambient space, the algorithm adapts the notion of distance at each point so that, from the perspective of that point, the data appears locally uniform. In practice, this means normalizing distances by the distance to each point's nearest neighbors. A key lemma in the UMAP paper shows that if a Riemannian manifold has a locally constant diagonal metric, then geodesic distance can be approximated by Euclidean distance scaled by a local radius factor.[1]
Once local metrics have been defined, UMAP converts the resulting distance information into a topological representation using the language of simplicial sets and fuzzy set theory.
A simplicial set is a combinatorial object used in algebraic topology to describe topological spaces. UMAP extends this concept into the "fuzzy" setting, where membership in an open set is not binary (0 or 1) but a continuous value between 0 and 1. In the UMAP framework, each data point defines a local fuzzy simplicial set in which nearby points have membership strengths close to 1 and distant points have membership strengths close to 0. The membership strength of an edge connecting points x and y is computed as:
membership(x, y) = exp(-(max(0, d(x,y) - rho) / sigma))
where d(x,y) is the distance between the two points, rho is the distance from x to its nearest neighbor (ensuring local connectivity), and sigma is a normalization constant chosen so that the sum of membership strengths over the k nearest neighbors equals log2(k).[1]
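Given rho and sigma for a point, the membership function is a one-liner. The following NumPy sketch uses illustrative distance values; the function name `membership_strength` is not part of umap-learn:

```python
import numpy as np

def membership_strength(d, rho, sigma):
    """Fuzzy edge weight from a point to a neighbor at distance d.

    rho is the distance to the point's nearest neighbor and sigma is the
    local normalization factor, as defined in the UMAP paper."""
    return np.exp(-np.maximum(0.0, d - rho) / sigma)

# Distances from one point to its 4 nearest neighbors (illustrative values).
dists = np.array([0.5, 0.8, 1.2, 2.0])
rho = dists.min()                        # distance to the closest neighbor
weights = membership_strength(dists, rho, sigma=0.6)
print(weights)                           # the nearest neighbor always gets weight 1.0
```

Because of the `rho` offset, each point's nearest neighbor always receives membership 1, which is what guarantees local connectivity of the graph.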
The UMAP paper defines two category-theoretic functors that translate between metric spaces and fuzzy simplicial sets: FinSing (finite fuzzy singular set functor), which converts a finite metric space into a fuzzy simplicial set, and FinReal (finite fuzzy realization functor), which goes in the other direction. A theorem in the paper establishes that these functors are adjoint, providing a formal mathematical basis for moving between distance-based and topology-based representations of data.[1]
At a practical level, UMAP operates in two main phases: constructing a weighted graph in the high-dimensional space and optimizing a low-dimensional layout.
Nearest neighbor computation. For each data point, UMAP finds its k approximate nearest neighbors. The reference implementation uses the Nearest Neighbor Descent algorithm by Dong, Moses, and Li (2011), which empirically runs in O(N^1.14) time rather than the O(N^2) brute-force approach.[2]
Local distance normalization. For each point x_i, the algorithm computes rho_i (the distance to its closest neighbor) and sigma_i (a smoothing factor). The value sigma_i is found by binary search so that the sum of exponentially-weighted distances to the k nearest neighbors equals log2(k). This step implements the uniform distribution assumption: from the perspective of each point, the local density of neighbors appears the same.
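The binary search for sigma_i can be sketched in a few lines of NumPy. This is a simplified stand-in for umap-learn's internal routine (the name `smooth_sigma` and the example distances are illustrative; the real implementation adds tolerances and edge-case handling):

```python
import numpy as np

def smooth_sigma(dists, rho, k, n_iter=64):
    """Binary-search sigma so that sum_j exp(-(d_j - rho) / sigma) = log2(k).

    `dists` holds the distances from one point to its k nearest neighbors."""
    target = np.log2(k)
    lo, hi, sigma = 0.0, np.inf, 1.0
    for _ in range(n_iter):
        total = np.sum(np.exp(-np.maximum(0.0, dists - rho) / sigma))
        if np.isclose(total, target):
            break
        if total > target:   # weights too heavy: shrink sigma
            hi = sigma
        else:                # weights too light: grow sigma
            lo = sigma
        sigma = sigma * 2.0 if np.isinf(hi) else (lo + hi) / 2.0
    return sigma

# Distances from one point to its k = 5 nearest neighbors (illustrative).
dists = np.array([0.5, 0.8, 1.2, 2.0, 2.5])
sigma = smooth_sigma(dists, rho=dists.min(), k=len(dists))
```

The sum of weights is monotone in sigma (from 1 up to k), so the search always converges to the log2(k) target.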
Directed weighted graph. The algorithm constructs a directed graph G = (V, E, w) where V is the set of data points, E contains edges from each point to its k nearest neighbors, and edge weights are defined by the membership function above. Setting the nearest neighbor distance to zero through rho ensures local connectivity.
Symmetrization via fuzzy union. Because each point defines its own local metric, the directed graph contains inconsistent edge weights (the weight from x to y may differ from the weight from y to x). UMAP resolves this by applying a probabilistic t-conorm to produce a symmetric adjacency matrix: B = A + A^T - A * A^T (where * denotes the element-wise product). The entry B_ij represents the probability that at least one of the two directed edges (i to j, j to i) exists.
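The fuzzy union is a single matrix expression, illustrated here on a small directed weight matrix with made-up values:

```python
import numpy as np

# Directed membership matrix: A[i, j] is the fuzzy weight of edge i -> j.
# The directed weights are deliberately inconsistent (A[0, 1] != A[1, 0]).
A = np.array([[0.0, 0.9, 0.2],
              [0.4, 0.0, 0.7],
              [0.0, 0.5, 0.0]])

# Probabilistic t-conorm: B_ij = A_ij + A_ji - A_ij * A_ji, the probability
# that at least one of the two directed edges exists.
B = A + A.T - A * A.T

print(np.allclose(B, B.T))   # True: the result is symmetric
```

Note that `A * A.T` is the element-wise (Hadamard) product in NumPy, matching the `*` in the formula above, not a matrix multiplication.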
Initialization. The low-dimensional coordinates are initialized using spectral embedding: the algorithm computes the symmetric normalized graph Laplacian L = D^(-1/2)(D - A)D^(-1/2), extracts the eigenvectors corresponding to its smallest nontrivial eigenvalues, and uses these as the initial positions. The normalized Laplacian serves as a discrete approximation of the Laplace-Beltrami operator on the manifold.[1]
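A dense NumPy sketch of this initialization follows (the helper name `spectral_init` is illustrative; umap-learn uses sparse eigensolvers to scale to large graphs):

```python
import numpy as np

def spectral_init(B, n_components=2):
    """Spectral embedding of a symmetric weighted adjacency matrix B.

    Forms the symmetric normalized Laplacian L = D^(-1/2) (D - B) D^(-1/2)
    (equivalently I - D^(-1/2) B D^(-1/2)) and uses the eigenvectors of its
    smallest nontrivial eigenvalues as initial coordinates."""
    d = B.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    L = np.eye(len(B)) - d_inv_sqrt[:, None] * B * d_inv_sqrt[None, :]
    vals, vecs = np.linalg.eigh(L)       # eigenvalues in ascending order
    # Skip the trivial constant eigenvector (eigenvalue ~ 0).
    return vecs[:, 1:n_components + 1]

# 5-node cycle graph as a tiny connected example.
B = np.array([[0, 1, 0, 0, 1],
              [1, 0, 1, 0, 0],
              [0, 1, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [1, 0, 0, 1, 0]], dtype=float)
init = spectral_init(B)
print(init.shape)   # (5, 2)
```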
Low-dimensional membership function. In the low-dimensional space, UMAP defines a smooth, differentiable membership function:
Phi(x, y) = (1 + a * ||x - y||^(2b))^(-1)
The parameters a and b are determined by fitting this function (via nonlinear least squares) to a piecewise target function that equals 1 when ||x - y|| is less than min_dist and decays exponentially beyond that threshold. This family of curves is related to the Student-t distribution used by t-SNE, but UMAP's formulation is more flexible.
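The fit itself is short enough to sketch with SciPy, mirroring the procedure described above (the names `low_d_kernel` and `fit_ab` are illustrative, not umap-learn's):

```python
import numpy as np
from scipy.optimize import curve_fit

def low_d_kernel(d, a, b):
    """Smooth low-dimensional membership: (1 + a * d^(2b))^(-1)."""
    return 1.0 / (1.0 + a * d ** (2.0 * b))

def fit_ab(spread=1.0, min_dist=0.1):
    """Fit a and b to UMAP's piecewise target: 1 for d < min_dist,
    exp(-(d - min_dist) / spread) beyond it."""
    d = np.linspace(0.0, 3.0 * spread, 300)
    target = np.where(d < min_dist, 1.0, np.exp(-(d - min_dist) / spread))
    (a, b), _ = curve_fit(low_d_kernel, d, target)
    return a, b

a, b = fit_ab()   # for the defaults, roughly a ~ 1.58 and b ~ 0.90
```

With the default spread and min_dist, the fitted values come out near a = 1.58 and b = 0.90, which is why those constants often appear in discussions of UMAP's gradients.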
Cross-entropy minimization. UMAP optimizes the fuzzy set cross-entropy between the high-dimensional membership strengths (mu) and the low-dimensional membership strengths (nu):
C = sum[ mu(a) * log(mu(a)/nu(a)) + (1 - mu(a)) * log((1 - mu(a))/(1 - nu(a))) ]
Since the first term is constant during optimization, the effective objective reduces to:
-sum[ mu(a) * log(nu(a)) + (1 - mu(a)) * log(1 - nu(a)) ]
This is optimized using stochastic gradient descent with negative sampling. Attractive forces pull connected points together (sampled proportionally to edge weight mu(a)), while repulsive forces push randomly sampled non-neighbors apart. The learning rate decreases linearly from 1.0 to 0 over the course of training. This sampling-based approach gives the optimization O(kN) complexity per epoch, where k is the number of neighbors.
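The loop can be sketched as follows, using the standard UMAP gradient coefficients for the attractive and repulsive terms. This is a didactic, unvectorized sketch under illustrative settings; umap-learn's Numba kernels additionally anneal the learning rate and schedule updates per edge, but apply the same gradients and the same clipping of each update to [-4, 4]:

```python
import numpy as np

rng = np.random.default_rng(42)

def sgd_epoch(emb, edges, weights, a=1.577, b=0.895, lr=1.0, n_neg=5):
    """One pass of UMAP-style SGD with negative sampling.

    Edges are sampled in proportion to their membership weight for
    attractive updates; each sampled point is then pushed away from
    n_neg randomly chosen negative samples."""
    n = len(emb)
    probs = weights / weights.sum()
    for _ in range(len(edges)):
        i, j = edges[rng.choice(len(edges), p=probs)]
        diff = emb[i] - emb[j]
        dist2 = diff @ diff
        if dist2 > 0.0:
            # Gradient coefficient of the attractive (log nu) term.
            coef = -2.0 * a * b * dist2 ** (b - 1.0) / (1.0 + a * dist2 ** b)
            emb[i] += lr * np.clip(coef * diff, -4.0, 4.0)
            emb[j] -= lr * np.clip(coef * diff, -4.0, 4.0)
        for _ in range(n_neg):
            k = rng.integers(n)
            if k == i:
                continue
            diff = emb[i] - emb[k]
            dist2 = diff @ diff
            # Gradient coefficient of the repulsive (log(1 - nu)) term.
            coef = 2.0 * b / ((0.001 + dist2) * (1.0 + a * dist2 ** b))
            emb[i] += lr * np.clip(coef * diff, -4.0, 4.0)
    return emb

# Tiny example: a 10-point embedding connected in a ring.
emb = rng.normal(size=(10, 2))
edges = np.array([(i, (i + 1) % 10) for i in range(10)])
emb = sgd_epoch(emb, edges, weights=np.ones(10))
```

The 0.001 term in the repulsive coefficient prevents division by zero when two points coincide, and the clipping keeps individual updates bounded.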
UMAP exposes several hyperparameters that control the balance between local and global structure, the tightness of clusters, and the target dimensionality.
| Parameter | Default | Effect |
|---|---|---|
| n_neighbors | 15 | Number of nearest neighbors used to construct the high-dimensional graph. Lower values emphasize local structure; higher values capture more global structure. |
| min_dist | 0.1 | Minimum distance between points in the low-dimensional embedding. Lower values (e.g., 0.0) produce tighter clusters; higher values (e.g., 0.5) spread points more evenly. |
| n_components | 2 | Dimensionality of the target embedding. Unlike t-SNE, UMAP scales well to higher embedding dimensions (10, 50, or more). |
| metric | euclidean | Distance function used in the input space. Supports Euclidean, cosine, Manhattan, Chebyshev, Mahalanobis, Hamming, and many others. Cosine distance is common for text and embedding data. |
| n_epochs | None (auto) | Number of training epochs for the SGD optimization. Automatically set based on dataset size if not specified. |
| learning_rate | 1.0 | Initial learning rate for the SGD optimizer. |
| spread | 1.0 | Scale of the embedding. Together with min_dist, determines the parameters a and b of the low-dimensional membership function. |
| random_state | None | Seed for reproducibility. Because UMAP uses stochastic optimization, different seeds produce slightly different embeddings. |
| Scenario | Recommended n_neighbors | Recommended min_dist |
|---|---|---|
| Small dataset (fewer than 1,000 samples) | 5 to 10 | 0.0 to 0.1 |
| Medium dataset (1,000 to 10,000 samples) | 15 to 30 | 0.1 to 0.25 |
| Large dataset (more than 10,000 samples) | 30 to 50 | 0.1 to 0.5 |
| Fine-grained local structure needed | 5 to 15 | 0.0 to 0.05 |
| Global structure and topology needed | 50 to 200 | 0.25 to 0.5 |
The two most impactful parameters are n_neighbors and min_dist. They interact with each other: lower n_neighbors combined with lower min_dist produces highly detailed, tightly clustered embeddings, while higher n_neighbors with higher min_dist produces broader, smoother views of the data topology.
| Aspect | UMAP | t-SNE |
|---|---|---|
| Theoretical basis | Riemannian geometry and fuzzy topology | Probability distributions (Gaussian in high-D, Student-t in low-D) |
| Global structure | Better preservation of inter-cluster relationships | Primarily preserves local neighborhoods; inter-cluster distances may be meaningless |
| Speed | Faster; MNIST (70,000 points) embeds in under 1 minute | Slower; scikit-learn t-SNE takes approximately 45 minutes on MNIST |
| Scalability | Handles millions of points (especially with GPU acceleration) | Struggles beyond hundreds of thousands of points without approximations like FIt-SNE |
| Embedding dimension | No restriction; works well in 2, 3, 10, 50, or more dimensions | Typically limited to 2 or 3 dimensions |
| Stability | More stable across runs (lower Procrustes distance under subsampling) | Less stable; different runs can produce visually different layouts |
| Transform support | Can project new, unseen data via .transform() | No built-in out-of-sample extension |
| Use as preprocessing | Suitable for downstream clustering or classification tasks at higher dimensions | Primarily a visualization tool |
In the original UMAP paper, the authors benchmarked both methods on datasets including COIL-20, MNIST, Fashion-MNIST, and GoogleNews word vectors. UMAP matched or exceeded t-SNE on k-nearest-neighbor classifier accuracy in the embedded space, particularly for larger values of k (80 to 160), where UMAP showed significantly higher accuracy. Runtime comparisons showed UMAP was consistently faster: for example, 87 seconds on MNIST versus 1,450 seconds for t-SNE, and 361 seconds on GoogleNews versus 16,906 seconds for t-SNE.[1]
| Aspect | UMAP | PCA |
|---|---|---|
| Linearity | Nonlinear | Linear |
| Structure captured | Local and global nonlinear manifold structure | Global linear variance |
| Interpretability | Axes have no direct interpretation | Each component is a linear combination of original features with explained variance |
| Speed | Fast, but slower than PCA | Very fast (closed-form eigendecomposition) |
| Determinism | Stochastic (results vary across runs) | Deterministic |
| Best for | Complex, nonlinear relationships | Linear relationships, initial exploration, noise reduction |
In practice, PCA and UMAP are often used together rather than as substitutes. A common workflow applies PCA first to reduce dimensionality to 50 or 100 components (removing noise and speeding up computation), then applies UMAP for further reduction to 2 or 3 dimensions for visualization.
While standard UMAP is an unsupervised learning method, the algorithm supports supervised and semi-supervised modes that incorporate label information.
In supervised mode, UMAP accepts target labels (passed as the y parameter in the sklearn-compatible API). The algorithm constructs a second graph based on the label information and combines it with the data-derived graph. The result is an embedding where the structural properties of the data are preserved while known classes are cleanly separated. This makes supervised UMAP useful for metric learning, where the goal is to learn a distance function that respects class boundaries.[3]
When only a subset of data points have labels, UMAP supports semi-supervised learning. Unlabeled points are assigned a label of -1 (following the scikit-learn convention). The algorithm uses the available labels to guide the embedding while letting the unlabeled points find their positions based on data structure alone. This is valuable when labeling is expensive or when labeled data is limited but unlabeled data is abundant.[3]
Parametric UMAP, introduced by Sainburg, McInnes, and Gentner in 2021, replaces the direct coordinate optimization in the second phase of UMAP with a neural network. Instead of learning embedding coordinates directly, a deep network learns a parametric mapping from the input space to the embedding space. This variant offers several advantages:[4]
Most notably, new, unseen data can be embedded with a single forward pass through the trained network, which is substantially faster than the optimization-based .transform() method, and the parametric mapping can be combined with additional training objectives such as autoencoder reconstruction losses. The Parametric UMAP implementation uses Keras and TensorFlow as a backend.
UMAP has become the standard visualization method for single-cell RNA sequencing (scRNA-seq) data. In a 2019 Nature Biotechnology paper, Becht, McInnes, Healy, and colleagues systematically compared UMAP with five other dimensionality reduction tools on mass cytometry and scRNA-seq datasets. They found that UMAP provided the fastest runtimes, the highest reproducibility, and the most meaningful organization of cell clusters.[5] Since that publication, UMAP has been widely adopted in the genomics community for tasks including cell type identification, developmental trajectory analysis, and disease state visualization. Tools like Scanpy and Seurat include UMAP as a default visualization method.
In NLP, UMAP is used to visualize high-dimensional word and document embeddings. Researchers use it to inspect the structure of word embeddings (such as Word2Vec, GloVe, or contextual embeddings from BERT and GPT models), identify semantic clusters, detect anomalies in embedding spaces, and debug language models. Because UMAP supports cosine distance as a metric, it works naturally with normalized embedding vectors.
In computer vision, UMAP helps researchers visualize learned feature representations from convolutional neural networks or vision transformers. It is commonly used to inspect the latent spaces of image classifiers, generative adversarial networks, and autoencoders.
By projecting data into a low-dimensional space while preserving topological structure, UMAP can make anomalies visually apparent as isolated points or small clusters far from the main data mass. It is frequently used as a preprocessing step before applying clustering-based anomaly detection methods.
Unlike t-SNE, UMAP's ability to embed into arbitrary dimensions (not just 2 or 3) makes it useful as a general-purpose feature engineering tool. Practitioners sometimes use UMAP to reduce feature dimensionality before feeding data into random forests, support vector machines, or gradient boosting models. In this role, UMAP can capture nonlinear feature interactions that linear methods like PCA would miss.
DuBois and colleagues (2020) demonstrated in Nature Communications that UMAP can visualize physical and genetic interaction networks, revealing biologically meaningful structure that other methods obscure.[6]
UMAP's computational complexity is dominated by two operations: nearest neighbor search and SGD optimization.
| Operation | Complexity | Notes |
|---|---|---|
| Nearest neighbor search (Nearest Neighbor Descent) | Empirically O(N^1.14) | Much faster than brute-force O(N^2); approximate but accurate |
| Graph construction (symmetrization) | O(kN) | k is the number of neighbors |
| SGD optimization | O(kN) per epoch | Linear in dataset size per epoch |
| Overall | Approximately O(N^1.14) | Dominated by the nearest neighbor step |
In practice, UMAP embeds the 70,000-point MNIST dataset in under one minute on a single CPU core. For larger datasets, GPU-accelerated implementations such as RAPIDS cuML provide dramatic speedups. NVIDIA reported a 311x speedup on a 20-million-point dataset with 384 dimensions, reducing total runtime from 10 hours to 2 minutes. The cuML implementation can handle datasets of 50 million points with 768 dimensions (approximately 150 GB of data) on a single GPU.[7]
UMAP is available in multiple programming languages and frameworks.
| Implementation | Language | Notes |
|---|---|---|
| umap-learn | Python | Reference implementation by McInnes; scikit-learn compatible API |
| RAPIDS cuML | Python (GPU) | GPU-accelerated implementation from NVIDIA; supports large-scale datasets |
| uwot | R | R implementation with similar API |
| umap (CRAN) | R | Alternative R package with a simpler interface |
| UMAP.jl | Julia | Julia implementation |
| Parametric UMAP | Python (TensorFlow/Keras) | Neural network variant; part of the umap-learn package |
The Python reference implementation (umap-learn) is installed via pip (pip install umap-learn) and follows the scikit-learn transformer API with fit, transform, and fit_transform methods.
```python
import umap
from sklearn.datasets import load_digits

digits = load_digits()

reducer = umap.UMAP(
    n_neighbors=15,
    min_dist=0.1,
    n_components=2,
    metric='euclidean'
)
embedding = reducer.fit_transform(digits.data)
```
Despite its popularity, UMAP has several limitations that users should be aware of.
UMAP preserves local neighborhoods well, but global distances in the embedding can be distorted. Two clusters that appear far apart in a UMAP plot may actually be close in the original space, or vice versa. Researchers should not interpret the spacing between clusters as a measure of biological or semantic similarity.[8]
Because UMAP relies on stochastic optimization and random initialization (when spectral initialization fails or is not used), different runs with different random seeds produce different embeddings. While the general structure tends to be consistent, the specific layout (rotations, reflections, relative positions of clusters) can change.
The appearance of a UMAP embedding can change substantially with different settings of n_neighbors and min_dist. Users may inadvertently choose parameters that emphasize or suppress certain structures in the data. There is no universally correct parameter setting; the appropriate values depend on the dataset and the analysis goals.
A common criticism, raised by Chari and Pachter (2023) and others, is that researchers sometimes interpret visual features of UMAP plots (cluster shapes, trajectories, densities) as if they directly reflected biological or physical reality. The nonlinear transformation applied by UMAP can create visual artifacts, and the curvature or shape of clusters in the embedding may be algorithmic artifacts rather than reflections of true data structure.[9]
While UMAP has a mathematical framework motivating its design, the algorithm does not come with formal performance guarantees. There is no proven bound on how well the embedding preserves distances, topology, or any other specific geometric property of the data.[10]
For very small datasets (fewer than a few hundred points), the overhead of graph construction and optimization may not be justified, and simpler methods like PCA or even direct visualization may be more appropriate.
UMAP is related to several other dimensionality reduction and manifold learning methods.
| Algorithm | Relationship to UMAP |
|---|---|
| t-SNE | UMAP uses a similar attract/repel optimization but with a different cost function (cross-entropy vs. KL divergence) and graph construction. UMAP's low-dimensional kernel is more flexible than t-SNE's fixed Student-t distribution. |
| Laplacian Eigenmaps | UMAP's spectral initialization step is closely related to Laplacian Eigenmaps. The authors note that UMAP can be viewed as a generalization of Laplacian Eigenmaps. |
| Isomap | Both assume data lies on a manifold, but Isomap preserves geodesic distances globally while UMAP uses local fuzzy topological representations. |
| LargeVis | UMAP's SGD optimization with negative sampling is similar to LargeVis (Tang et al., 2016), but UMAP differs in graph construction and has a stronger theoretical foundation. |
| PCA | Linear method that UMAP extends into the nonlinear domain. PCA is often used as a preprocessing step before UMAP. |
| TriMAP | A later method (Amid and Warmuth, 2019) that uses triplet-based objectives; competes with UMAP for global structure preservation. |
| PHATE | Designed specifically for biological trajectory data; uses diffusion-based distances rather than UMAP's fuzzy topological approach. |