UMAP (Uniform Manifold Approximation and Projection) is a nonlinear dimensionality reduction technique developed by Leland McInnes, John Healy, and James Melville. First published in 2018, UMAP constructs a high-dimensional topological representation of data and then optimizes a low-dimensional layout to match that topology as closely as possible. It is widely used for visualizing high-dimensional datasets and as a general-purpose preprocessing step in machine learning pipelines. UMAP is competitive with t-SNE in visualization quality while offering faster runtimes, better preservation of global structure, and no restriction on embedding dimensionality.
Imagine you have a huge pile of LEGO bricks in hundreds of different colors and shapes. You want to arrange them on your desk so that similar bricks end up near each other, but your desk is much smaller than the pile. UMAP is like a smart helper that first figures out which bricks are "neighbors" in the big pile (which ones look alike), then carefully places them on the desk so that bricks that were close together in the pile stay close on the desk, too. It does this by building a kind of invisible web connecting nearby bricks, then slowly sliding them around on the desk until the web on the desk looks as much as possible like the web in the original pile.
Before UMAP, the dominant method for nonlinear dimensionality reduction and visualization was t-SNE, introduced by Laurens van der Maaten and Geoffrey Hinton in 2008. t-SNE produces compelling two-dimensional visualizations of complex datasets, but it has several practical limitations: it is slow on large datasets, it does not scale well to embedding dimensions higher than two or three, and it tends to lose global structure in favor of local cluster fidelity. Earlier linear methods like PCA are fast and preserve global variance, but they cannot capture the nonlinear relationships that characterize many real-world datasets.
McInnes, Healy, and Melville introduced UMAP in a 2018 arXiv preprint (later published in the Journal of Open Source Software) specifically to address these shortcomings. They grounded the algorithm in a formal mathematical framework drawing on Riemannian geometry and algebraic topology, rather than building it through purely empirical experimentation. This theoretical foundation distinguishes UMAP from most other manifold learning algorithms and provides principled justifications for each design decision in the algorithm.[1]
UMAP's design rests on two branches of mathematics: Riemannian geometry and the theory of fuzzy simplicial sets from algebraic topology.
UMAP assumes the input data lies on or near a manifold embedded in high-dimensional space. A manifold is a geometric object that locally resembles Euclidean space but may have a complex global shape (for example, the surface of a sphere is a two-dimensional manifold embedded in three dimensions).
Rather than using the ambient Euclidean distance between data points directly, UMAP constructs a local Riemannian metric around each point. Specifically, UMAP assumes the data is uniformly distributed with respect to this local metric. Because real data is never truly uniform in the ambient space, the algorithm adapts the notion of distance at each point so that, from the perspective of that point, the data appears locally uniform. In practice, this means normalizing distances by the distance to each point's nearest neighbors. A key lemma in the UMAP paper shows that if a Riemannian manifold has a locally constant diagonal metric, then geodesic distance can be approximated by Euclidean distance scaled by a local radius factor.[1]
Once local metrics have been defined, UMAP converts the resulting distance information into a topological representation using the language of simplicial sets and fuzzy set theory.
A simplicial set is a combinatorial object used in algebraic topology to describe topological spaces. UMAP extends this concept into the "fuzzy" setting, where membership in an open set is not binary (0 or 1) but a continuous value between 0 and 1. In the UMAP framework, each data point defines a local fuzzy simplicial set in which nearby points have membership strengths close to 1 and distant points have membership strengths close to 0. The membership strength of an edge connecting points x and y is computed as:
membership(x, y) = exp(-(max(0, d(x,y) - rho) / sigma))
where d(x,y) is the distance between the two points, rho is the distance from x to its nearest neighbor (ensuring local connectivity), and sigma is a normalization constant chosen so that the sum of membership strengths over the k nearest neighbors equals log2(k).[1]
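Given rho and sigma for a point, the membership function is a one-liner. The following NumPy sketch uses illustrative distance values; the function name `membership_strength` is not part of umap-learn:

```python
import numpy as np

def membership_strength(d, rho, sigma):
    """Fuzzy edge weight from a point to a neighbor at distance d.

    rho is the distance to the point's nearest neighbor and sigma is the
    local normalization factor, as defined in the UMAP paper."""
    return np.exp(-np.maximum(0.0, d - rho) / sigma)

# Distances from one point to its 4 nearest neighbors (illustrative values).
dists = np.array([0.5, 0.8, 1.2, 2.0])
rho = dists.min()                        # distance to the closest neighbor
weights = membership_strength(dists, rho, sigma=0.6)
print(weights)                           # the nearest neighbor always gets weight 1.0
```

Because of the `rho` offset, each point's nearest neighbor always receives membership 1, which is what guarantees local connectivity of the graph.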
The UMAP paper defines two category-theoretic functors that translate between metric spaces and fuzzy simplicial sets: FinSing (finite fuzzy singular set functor), which converts a finite metric space into a fuzzy simplicial set, and FinReal (finite fuzzy realization functor), which goes in the other direction. A theorem in the paper establishes that these functors are adjoint, providing a formal mathematical basis for moving between distance-based and topology-based representations of data.[1]
At a practical level, UMAP operates in two main phases: constructing a weighted graph in the high-dimensional space and optimizing a low-dimensional layout.
Nearest neighbor computation. For each data point, UMAP finds its k approximate nearest neighbors. The reference implementation uses the Nearest Neighbor Descent algorithm by Dong, Moses, and Li (2011), which empirically runs in O(N^1.14) time rather than the O(N^2) brute-force approach.[2]
Local distance normalization. For each point x_i, the algorithm computes rho_i (the distance to its closest neighbor) and sigma_i (a smoothing factor). The value sigma_i is found by binary search so that the sum of exponentially-weighted distances to the k nearest neighbors equals log2(k). This step implements the uniform distribution assumption: from the perspective of each point, the local density of neighbors appears the same.
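The binary search for sigma_i can be sketched in a few lines of NumPy. This is a simplified stand-in for umap-learn's internal routine (the name `smooth_sigma` and the example distances are illustrative; the real implementation adds tolerances and edge-case handling):

```python
import numpy as np

def smooth_sigma(dists, rho, k, n_iter=64):
    """Binary-search sigma so that sum_j exp(-(d_j - rho) / sigma) = log2(k).

    `dists` holds the distances from one point to its k nearest neighbors."""
    target = np.log2(k)
    lo, hi, sigma = 0.0, np.inf, 1.0
    for _ in range(n_iter):
        total = np.sum(np.exp(-np.maximum(0.0, dists - rho) / sigma))
        if np.isclose(total, target):
            break
        if total > target:   # weights too heavy: shrink sigma
            hi = sigma
        else:                # weights too light: grow sigma
            lo = sigma
        sigma = sigma * 2.0 if np.isinf(hi) else (lo + hi) / 2.0
    return sigma

# Distances from one point to its k = 5 nearest neighbors (illustrative).
dists = np.array([0.5, 0.8, 1.2, 2.0, 2.5])
sigma = smooth_sigma(dists, rho=dists.min(), k=len(dists))
```

The sum of weights is monotone in sigma (from 1 up to k), so the search always converges to the log2(k) target.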
Directed weighted graph. The algorithm constructs a directed graph G = (V, E, w) where V is the set of data points, E contains edges from each point to its k nearest neighbors, and edge weights are defined by the membership function above. Setting the nearest neighbor distance to zero through rho ensures local connectivity.
Symmetrization via fuzzy union. Because each point defines its own local metric, the directed graph contains inconsistent edge weights (the weight from x to y may differ from the weight from y to x). UMAP resolves this by applying a probabilistic t-conorm to produce a symmetric adjacency matrix: B = A + A^T - A * A^T (where * denotes the element-wise product). The entry B_ij represents the probability that at least one of the two directed edges (i to j, j to i) exists.
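The fuzzy union is a single matrix expression, illustrated here on a small directed weight matrix with made-up values:

```python
import numpy as np

# Directed membership matrix: A[i, j] is the fuzzy weight of edge i -> j.
# The directed weights are deliberately inconsistent (A[0, 1] != A[1, 0]).
A = np.array([[0.0, 0.9, 0.2],
              [0.4, 0.0, 0.7],
              [0.0, 0.5, 0.0]])

# Probabilistic t-conorm: B_ij = A_ij + A_ji - A_ij * A_ji, the probability
# that at least one of the two directed edges exists.
B = A + A.T - A * A.T

print(np.allclose(B, B.T))   # True: the result is symmetric
```

Note that `A * A.T` is the element-wise (Hadamard) product in NumPy, matching the `*` in the formula above, not a matrix multiplication.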
Initialization. The low-dimensional coordinates are initialized using spectral embedding: the algorithm computes the symmetric normalized graph Laplacian L = D^(-1/2)(D - A)D^(-1/2), extracts the eigenvectors corresponding to its smallest nontrivial eigenvalues, and uses these as the initial positions. The normalized Laplacian serves as a discrete approximation of the Laplace-Beltrami operator on the manifold.[1]
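A dense NumPy sketch of this initialization follows (the helper name `spectral_init` is illustrative; umap-learn uses sparse eigensolvers to scale to large graphs):

```python
import numpy as np

def spectral_init(B, n_components=2):
    """Spectral embedding of a symmetric weighted adjacency matrix B.

    Forms the symmetric normalized Laplacian L = D^(-1/2) (D - B) D^(-1/2)
    (equivalently I - D^(-1/2) B D^(-1/2)) and uses the eigenvectors of its
    smallest nontrivial eigenvalues as initial coordinates."""
    d = B.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    L = np.eye(len(B)) - d_inv_sqrt[:, None] * B * d_inv_sqrt[None, :]
    vals, vecs = np.linalg.eigh(L)       # eigenvalues in ascending order
    # Skip the trivial constant eigenvector (eigenvalue ~ 0).
    return vecs[:, 1:n_components + 1]

# 5-node cycle graph as a tiny connected example.
B = np.array([[0, 1, 0, 0, 1],
              [1, 0, 1, 0, 0],
              [0, 1, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [1, 0, 0, 1, 0]], dtype=float)
init = spectral_init(B)
print(init.shape)   # (5, 2)
```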
Low-dimensional membership function. In the low-dimensional space, UMAP defines a smooth, differentiable membership function:
Phi(x, y) = (1 + a * ||x - y||^(2b))^(-1)
The parameters a and b are determined by fitting this function (via nonlinear least squares) to a piecewise target function that equals 1 when ||x - y|| is less than min_dist and decays exponentially beyond that threshold. This family of curves is related to the Student-t distribution used by t-SNE, but UMAP's formulation is more flexible.
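The fit itself is short enough to sketch with SciPy, mirroring the procedure described above (the names `low_d_kernel` and `fit_ab` are illustrative, not umap-learn's):

```python
import numpy as np
from scipy.optimize import curve_fit

def low_d_kernel(d, a, b):
    """Smooth low-dimensional membership: (1 + a * d^(2b))^(-1)."""
    return 1.0 / (1.0 + a * d ** (2.0 * b))

def fit_ab(spread=1.0, min_dist=0.1):
    """Fit a and b to UMAP's piecewise target: 1 for d < min_dist,
    exp(-(d - min_dist) / spread) beyond it."""
    d = np.linspace(0.0, 3.0 * spread, 300)
    target = np.where(d < min_dist, 1.0, np.exp(-(d - min_dist) / spread))
    (a, b), _ = curve_fit(low_d_kernel, d, target)
    return a, b

a, b = fit_ab()   # for the defaults, roughly a ~ 1.58 and b ~ 0.90
```

With the default spread and min_dist, the fitted values come out near a = 1.58 and b = 0.90, which is why those constants often appear in discussions of UMAP's gradients.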
Cross-entropy minimization. UMAP optimizes the fuzzy set cross-entropy between the high-dimensional membership strengths (mu) and the low-dimensional membership strengths (nu):
C = sum[ mu(a) * log(mu(a)/nu(a)) + (1 - mu(a)) * log((1 - mu(a))/(1 - nu(a))) ]
Since the first term is constant during optimization, the effective objective reduces to:
-sum[ mu(a) * log(nu(a)) + (1 - mu(a)) * log(1 - nu(a)) ]
This is optimized using stochastic gradient descent with negative sampling. Attractive forces pull connected points together (sampled proportionally to edge weight mu(a)), while repulsive forces push randomly sampled non-neighbors apart. The learning rate decreases linearly from 1.0 to 0 over the course of training. This sampling-based approach gives the optimization O(kN) complexity per epoch, where k is the number of neighbors.
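The loop can be sketched as follows, using the standard UMAP gradient coefficients for the attractive and repulsive terms. This is a didactic, unvectorized sketch under illustrative settings; umap-learn's Numba kernels additionally anneal the learning rate and schedule updates per edge, but apply the same gradients and the same clipping of each update to [-4, 4]:

```python
import numpy as np

rng = np.random.default_rng(42)

def sgd_epoch(emb, edges, weights, a=1.577, b=0.895, lr=1.0, n_neg=5):
    """One pass of UMAP-style SGD with negative sampling.

    Edges are sampled in proportion to their membership weight for
    attractive updates; each sampled point is then pushed away from
    n_neg randomly chosen negative samples."""
    n = len(emb)
    probs = weights / weights.sum()
    for _ in range(len(edges)):
        i, j = edges[rng.choice(len(edges), p=probs)]
        diff = emb[i] - emb[j]
        dist2 = diff @ diff
        if dist2 > 0.0:
            # Gradient coefficient of the attractive (log nu) term.
            coef = -2.0 * a * b * dist2 ** (b - 1.0) / (1.0 + a * dist2 ** b)
            emb[i] += lr * np.clip(coef * diff, -4.0, 4.0)
            emb[j] -= lr * np.clip(coef * diff, -4.0, 4.0)
        for _ in range(n_neg):
            k = rng.integers(n)
            if k == i:
                continue
            diff = emb[i] - emb[k]
            dist2 = diff @ diff
            # Gradient coefficient of the repulsive (log(1 - nu)) term.
            coef = 2.0 * b / ((0.001 + dist2) * (1.0 + a * dist2 ** b))
            emb[i] += lr * np.clip(coef * diff, -4.0, 4.0)
    return emb

# Tiny example: a 10-point embedding connected in a ring.
emb = rng.normal(size=(10, 2))
edges = np.array([(i, (i + 1) % 10) for i in range(10)])
emb = sgd_epoch(emb, edges, weights=np.ones(10))
```

The 0.001 term in the repulsive coefficient prevents division by zero when two points coincide, and the clipping keeps individual updates bounded.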
UMAP exposes several hyperparameters that control the balance between local and global structure, the tightness of clusters, and the target dimensionality.
| Parameter | Default | Effect |
|---|---|---|
| n_neighbors | 15 | Number of nearest neighbors used to construct the high-dimensional graph. Lower values emphasize local structure; higher values capture more global structure. |
| min_dist | 0.1 | Minimum distance between points in the low-dimensional embedding. Lower values (e.g., 0.0) produce tighter clusters; higher values (e.g., 0.5) spread points more evenly. |
| n_components | 2 | Dimensionality of the target embedding. Unlike t-SNE, UMAP scales well to higher embedding dimensions (10, 50, or more). |
| metric | euclidean | Distance function used in the input space. Supports Euclidean, cosine, Manhattan, Chebyshev, Mahalanobis, Hamming, and many others. Cosine distance is common for text and embedding data. |
| n_epochs | None (auto) | Number of training epochs for the SGD optimization. Automatically set based on dataset size if not specified. |
| learning_rate | 1.0 | Initial learning rate for the SGD optimizer. |
| spread | 1.0 | Scale of the embedding. Together with min_dist, determines the parameters a and b of the low-dimensional membership function. |
| random_state | None | Seed for reproducibility. Because UMAP uses stochastic optimization, different seeds produce slightly different embeddings. |
| Scenario | Recommended n_neighbors | Recommended min_dist |
|---|---|---|
| Small dataset (fewer than 1,000 samples) | 5 to 10 | 0.0 to 0.1 |
| Medium dataset (1,000 to 10,000 samples) | 15 to 30 | 0.1 to 0.25 |
| Large dataset (more than 10,000 samples) | 30 to 50 | 0.1 to 0.5 |
| Fine-grained local structure needed | 5 to 15 | 0.0 to 0.05 |
| Global structure and topology needed | 50 to 200 | 0.25 to 0.5 |
The two most impactful parameters are n_neighbors and min_dist. They interact with each other: lower n_neighbors combined with lower min_dist produces highly detailed, tightly clustered embeddings, while higher n_neighbors with higher min_dist produces broader, smoother views of the data topology.
| Aspect | UMAP | t-SNE |
|---|---|---|
| Theoretical basis | Riemannian geometry and fuzzy topology | Probability distributions (Gaussian in high-D, Student-t in low-D) |
| Global structure | Better preservation of inter-cluster relationships | Primarily preserves local neighborhoods; inter-cluster distances may be meaningless |
| Speed | Faster; MNIST (70,000 points) embeds in under 1 minute | Slower; scikit-learn t-SNE takes approximately 45 minutes on MNIST |
| Scalability | Handles millions of points (especially with GPU acceleration) | Struggles beyond hundreds of thousands of points without approximations like FIt-SNE |
| Embedding dimension | No restriction; works well in 2, 3, 10, 50, or more dimensions | Typically limited to 2 or 3 dimensions |
| Stability | More stable across runs (lower Procrustes distance under subsampling) | Less stable; different runs can produce visually different layouts |
| Transform support | Can project new, unseen data via .transform() | No built-in out-of-sample extension |
| Use as preprocessing | Suitable for downstream clustering or classification tasks at higher dimensions | Primarily a visualization tool |
In the original UMAP paper, the authors benchmarked both methods on datasets including COIL-20, MNIST, Fashion-MNIST, and GoogleNews word vectors. UMAP matched or exceeded t-SNE on k-nearest-neighbor classifier accuracy in the embedded space, particularly for larger values of k (80 to 160), where UMAP showed significantly higher accuracy. Runtime comparisons showed UMAP was consistently faster: for example, 87 seconds on MNIST versus 1,450 seconds for t-SNE, and 361 seconds on GoogleNews versus 16,906 seconds for t-SNE.[1]
| Aspect | UMAP | PCA |
|---|---|---|
| Linearity | Nonlinear | Linear |
| Structure captured | Local and global nonlinear manifold structure | Global linear variance |
| Interpretability | Axes have no direct interpretation | Each component is a linear combination of original features with explained variance |
| Speed | Fast, but slower than PCA | Very fast (closed-form eigendecomposition) |
| Determinism | Stochastic (results vary across runs) | Deterministic |
| Best for | Complex, nonlinear relationships | Linear relationships, initial exploration, noise reduction |
In practice, PCA and UMAP are often used together rather than as substitutes. A common workflow applies PCA first to reduce dimensionality to 50 or 100 components (removing noise and speeding up computation), then applies UMAP for further reduction to 2 or 3 dimensions for visualization.
While standard UMAP is an unsupervised learning method, the algorithm supports supervised and semi-supervised modes that incorporate label information.
In supervised mode, UMAP accepts target labels (passed as the y parameter in the sklearn-compatible API). The algorithm constructs a second graph based on the label information and combines it with the data-derived graph. The result is an embedding where the structural properties of the data are preserved while known classes are cleanly separated. This makes supervised UMAP useful for metric learning, where the goal is to learn a distance function that respects class boundaries.[3]
When only a subset of data points have labels, UMAP supports semi-supervised learning. Unlabeled points are assigned a label of -1 (following the scikit-learn convention). The algorithm uses the available labels to guide the embedding while letting the unlabeled points find their positions based on data structure alone. This is valuable when labeling is expensive or when labeled data is limited but unlabeled data is abundant.[3]
Parametric UMAP, introduced by Sainburg, McInnes, and Gentner in 2021, replaces the direct coordinate optimization in the second phase of UMAP with a neural network. Instead of learning embedding coordinates directly, a deep network learns a parametric mapping from the input space to the embedding space. This variant offers several advantages:[4]
Most notably, new, unseen data can be embedded with a single forward pass through the trained network, which is substantially faster than the optimization-based .transform() method, and the parametric mapping can be combined with additional training objectives such as autoencoder reconstruction losses. The Parametric UMAP implementation uses Keras and TensorFlow as a backend.
UMAP has become the standard visualization method for single-cell RNA sequencing (scRNA-seq) data. In a 2019 Nature Biotechnology paper, Becht, McInnes, Healy, and colleagues systematically compared UMAP with five other dimensionality reduction tools on mass cytometry and scRNA-seq datasets. They found that UMAP provided the fastest runtimes, the highest reproducibility, and the most meaningful organization of cell clusters.[5] Since that publication, UMAP has been widely adopted in the genomics community for tasks including cell type identification, developmental trajectory analysis, and disease state visualization. Tools like Scanpy and Seurat include UMAP as a default visualization method.
In NLP, UMAP is used to visualize high-dimensional word and document embeddings. Researchers use it to inspect the structure of word embeddings (such as Word2Vec, GloVe, or contextual embeddings from BERT and GPT models), identify semantic clusters, detect anomalies in embedding spaces, and debug language models. Because UMAP supports cosine distance as a metric, it works naturally with normalized embedding vectors.
In computer vision, UMAP helps researchers visualize learned feature representations from convolutional neural networks or vision transformers. It is commonly used to inspect the latent spaces of image classifiers, generative adversarial networks, and autoencoders.
By projecting data into a low-dimensional space while preserving topological structure, UMAP can make anomalies visually apparent as isolated points or small clusters far from the main data mass. It is frequently used as a preprocessing step before applying clustering-based anomaly detection methods.
Unlike t-SNE, UMAP's ability to embed into arbitrary dimensions (not just 2 or 3) makes it useful as a general-purpose feature engineering tool. Practitioners sometimes use UMAP to reduce feature dimensionality before feeding data into random forests, support vector machines, or gradient boosting models. In this role, UMAP can capture nonlinear feature interactions that linear methods like PCA would miss.
DuBois and colleagues (2020) demonstrated in Nature Communications that UMAP can visualize physical and genetic interaction networks, revealing biologically meaningful structure that other methods obscure.[6]
UMAP's computational complexity is dominated by two operations: nearest neighbor search and SGD optimization.
| Operation | Complexity | Notes |
|---|---|---|
| Nearest neighbor search (Nearest Neighbor Descent) | Empirically O(N^1.14) | Much faster than brute-force O(N^2); approximate but accurate |
| Graph construction (symmetrization) | O(kN) | k is the number of neighbors |
| SGD optimization | O(kN) per epoch | Linear in dataset size per epoch |
| Overall | Approximately O(N^1.14) | Dominated by the nearest neighbor step |
In practice, UMAP embeds the 70,000-point MNIST dataset in under one minute on a single CPU core. For larger datasets, GPU-accelerated implementations such as RAPIDS cuML provide dramatic speedups. NVIDIA reported a 311x speedup on a 20-million-point dataset with 384 dimensions, reducing total runtime from 10 hours to 2 minutes. The cuML implementation can handle datasets of 50 million points with 768 dimensions (approximately 150 GB of data) on a single GPU.[7]
UMAP is available in multiple programming languages and frameworks.
| Implementation | Language | Notes |
|---|---|---|
| umap-learn | Python | Reference implementation by McInnes; scikit-learn compatible API |
| RAPIDS cuML | Python (GPU) | GPU-accelerated implementation from NVIDIA; supports large-scale datasets |
| uwot | R | R implementation with similar API |
| umap (CRAN) | R | Alternative R package with a simpler interface |
| UMAP.jl | Julia | Julia implementation |
| Parametric UMAP | Python (TensorFlow/Keras) | Neural network variant; part of the umap-learn package |
The Python reference implementation (umap-learn) is installed via pip (pip install umap-learn) and follows the scikit-learn transformer API with fit, transform, and fit_transform methods.
```python
import umap
from sklearn.datasets import load_digits

digits = load_digits()

reducer = umap.UMAP(
    n_neighbors=15,
    min_dist=0.1,
    n_components=2,
    metric='euclidean'
)
embedding = reducer.fit_transform(digits.data)
```
Despite its popularity, UMAP has several limitations that users should be aware of.
UMAP preserves local neighborhoods well, but global distances in the embedding can be distorted. Two clusters that appear far apart in a UMAP plot may actually be close in the original space, or vice versa. Researchers should not interpret the spacing between clusters as a measure of biological or semantic similarity.[8]
Because UMAP relies on stochastic optimization and random initialization (when spectral initialization fails or is not used), different runs with different random seeds produce different embeddings. While the general structure tends to be consistent, the specific layout (rotations, reflections, relative positions of clusters) can change.
The appearance of a UMAP embedding can change substantially with different settings of n_neighbors and min_dist. Users may inadvertently choose parameters that emphasize or suppress certain structures in the data. There is no universally correct parameter setting; the appropriate values depend on the dataset and the analysis goals.
A common criticism, raised by Chari and Pachter (2023) and others, is that researchers sometimes interpret visual features of UMAP plots (cluster shapes, trajectories, densities) as if they directly reflected biological or physical reality. The nonlinear transformation applied by UMAP can create visual artifacts, and the curvature or shape of clusters in the embedding may be algorithmic artifacts rather than reflections of true data structure.[9]
While UMAP has a mathematical framework motivating its design, the algorithm does not come with formal performance guarantees. There is no proven bound on how well the embedding preserves distances, topology, or any other specific geometric property of the data.[10]
For very small datasets (fewer than a few hundred points), the overhead of graph construction and optimization may not be justified, and simpler methods like PCA or even direct visualization may be more appropriate.
UMAP is related to several other dimensionality reduction and manifold learning methods.
| Algorithm | Relationship to UMAP |
|---|---|
| t-SNE | UMAP uses a similar attract/repel optimization but with a different cost function (cross-entropy vs. KL divergence) and graph construction. UMAP's low-dimensional kernel is more flexible than t-SNE's fixed Student-t distribution. |
| Laplacian Eigenmaps | UMAP's spectral initialization step is closely related to Laplacian Eigenmaps. The authors note that UMAP can be viewed as a generalization of Laplacian Eigenmaps. |
| Isomap | Both assume data lies on a manifold, but Isomap preserves geodesic distances globally while UMAP uses local fuzzy topological representations. |
| LargeVis | UMAP's SGD optimization with negative sampling is similar to LargeVis (Tang et al., 2016), but UMAP differs in graph construction and has a stronger theoretical foundation. |
| PCA | Linear method that UMAP extends into the nonlinear domain. PCA is often used as a preprocessing step before UMAP. |
| TriMAP | A later method (Amid and Warmuth, 2019) that uses triplet-based objectives; competes with UMAP for global structure preservation. |
| PHATE | Designed specifically for biological trajectory data; uses diffusion-based distances rather than UMAP's fuzzy topological approach. |