Manifold learning is a class of nonlinear dimensionality reduction techniques in machine learning and statistics that recovers a low-dimensional structure assumed to lie within high-dimensional data. The field rests on the manifold hypothesis, which proposes that real-world high-dimensional datasets, such as images of faces, gene expression profiles, or the activations of a neural network, do not fill their ambient space uniformly but instead concentrate on or near a much lower-dimensional smooth surface, called a manifold, embedded inside that space. By estimating the geometry of this surface from a finite sample of observations, manifold learning algorithms produce a low-dimensional representation in which distances, neighborhoods, or topological relationships approximate those on the original manifold.
Manifold learning emerged as a distinct subfield in 2000, when two papers appeared side by side in the same December issue of Science: Isomap by Tenenbaum, de Silva, and Langford, and Locally Linear Embedding by Roweis and Saul. These methods extended the older techniques of principal component analysis and classical multidimensional scaling to settings where the underlying structure is curved rather than flat. Over the following two decades, the field grew to include Laplacian Eigenmaps, Hessian LLE, Local Tangent Space Alignment, t-distributed Stochastic Neighbor Embedding (t-SNE), and Uniform Manifold Approximation and Projection (UMAP), with the latter two becoming the standard tools for visualizing embeddings produced by deep learning systems.
A mathematical manifold is a topological space that locally resembles Euclidean space near each point. A two-dimensional sphere, for example, is a curved surface in three-dimensional space, but any small patch of the sphere can be charted by a flat map, much as a flat page of an atlas approximates a small region of the spherical Earth. The dimension of the manifold is the dimension of these local charts: a surface is two-dimensional even when curved through three or more ambient dimensions, and a curve is one-dimensional even when it loops through high-dimensional space.
The manifold hypothesis claims that natural data, despite being represented by very large feature vectors, has only a small number of true degrees of freedom. A grayscale face image stored as a 100 by 100 pixel array lives in a 10,000-dimensional vector space, but the set of possible faces is parameterized by a much smaller list of factors such as identity, pose, expression, lighting, and age. As these factors vary continuously, the corresponding image vectors trace out a smooth surface inside the pixel space whose intrinsic dimension is roughly the number of independent factors, far smaller than the ambient dimension.
The hypothesis matters because uniform sampling of high-dimensional spaces is impossible in practice. The number of samples needed to cover a space at a fixed resolution grows exponentially with dimension, so no finite dataset can densely populate even a 30-dimensional space, a phenomenon known as the curse of dimensionality. The fact that learning algorithms succeed at all is taken as empirical evidence that the manifold hypothesis holds, at least approximately, for the kinds of data humans care about.
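The concentration effect behind the curse of dimensionality can be seen directly in a few lines of NumPy (a minimal sketch; the sample size of 500 and the dimensions 2, 30, and 1000 are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# As the dimension grows, the nearest and farthest points in a fixed-size
# sample become almost equidistant from any reference point, so the sample
# covers the space less and less well.
ratios = {}
for d in (2, 30, 1000):
    x = rng.random((500, d))                      # 500 points in the unit hypercube
    dists = np.linalg.norm(x[0] - x[1:], axis=1)  # distances from the first point
    ratios[d] = dists.min() / dists.max()
    print(f"d={d:4d}  nearest/farthest distance ratio = {ratios[d]:.2f}")
```

The ratio approaching 1 in high dimensions means the notion of a "close" neighbor degrades, which is exactly why a low intrinsic dimension is needed for neighborhood-based methods to work.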
Before the rise of manifold learning, dimensionality reduction was dominated by linear methods. Principal component analysis projects data onto the directions of greatest variance, and classical multidimensional scaling embeds points so that pairwise Euclidean distances are preserved. Both produce a flat subspace, equivalent to fitting a hyperplane through the data cloud. When the true low-dimensional structure is itself flat, linear methods recover it exactly. When the structure is curved, projecting onto a flat subspace destroys the geometry.
The canonical illustration is the swiss roll, a synthetic dataset that has become the standard benchmark for manifold learning research. Points are sampled uniformly from a flat rectangle and then rolled into three-dimensional space along a spiral, producing a curled sheet that resembles the cross-section of a jelly roll. The intrinsic dimension is two, and a successful manifold learner should unroll the sheet into the original flat rectangle. PCA, however, projects the swiss roll onto its plane of greatest variance, collapsing spiral layers on top of each other and destroying the local neighborhood structure.
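The collapse can be demonstrated with scikit-learn's built-in swiss roll generator (a sketch; `n_samples=1500` and the layer-separation threshold of 5 on the roll parameter are illustrative choices):

```python
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import PCA

# X holds 3-D swiss-roll points; t is the angle along the spiral and
# doubles as the coordinate along the unrolled sheet.
X, t = make_swiss_roll(n_samples=1500, random_state=0)
X2 = PCA(n_components=2).fit_transform(X)

# Find a pair of points that are close in the PCA plane even though they
# sit on different turns of the spiral (|t_i - t_j| > 5): exactly the
# "collapsed layers" described above.
d = np.linalg.norm(X2[:, None] - X2[None, :], axis=-1)
d[np.abs(t[:, None] - t[None, :]) <= 5] = np.inf
i, j = np.unravel_index(np.argmin(d), d.shape)
print(f"PCA distance {d[i, j]:.2f} for points at t={t[i]:.1f} and t={t[j]:.1f}")
```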
| Aspect | Linear methods (PCA, classical MDS) | Manifold learning (nonlinear) |
|---|---|---|
| Underlying assumption | Data lies near a hyperplane | Data lies near a curved manifold |
| Geometry preserved | Global Euclidean distance, variance | Local neighborhoods or geodesic distance |
| Computational cost | Fast, closed-form via eigendecomposition or SVD | Slower, often involves nearest-neighbor graph and large eigenproblem |
| Out-of-sample extension | Trivial via linear projection | Often requires retraining or auxiliary methods |
| Behavior on swiss roll | Collapses spiral layers together | Unrolls the sheet to a flat rectangle |
| Interpretability | Components are linear combinations of features | Coordinates are abstract and difficult to label |
| Typical use | Compression, denoising, regression preprocessing | Visualization, exploratory data analysis |
The key insight that separates manifold learning from linear approaches is that distances should be measured along the surface of the manifold, not through the ambient space. On a sphere, the distance between two cities is measured along the great circle connecting them, not through the chord piercing the planet. Manifold learning algorithms approximate these intrinsic distances or local relationships from data and reconstruct an embedding that respects them.
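The sphere example can be made concrete in a few lines of NumPy (a sketch; the mean Earth radius of 6371 km and the sample directions are purely illustrative):

```python
import numpy as np

def chord_and_geodesic(u, v, radius=6371.0):
    """Straight-line (chord) vs. along-the-surface (great-circle) distance
    between two directions u and v on a sphere of the given radius."""
    u = u / np.linalg.norm(u)
    v = v / np.linalg.norm(v)
    cos_angle = np.clip(np.dot(u, v), -1.0, 1.0)
    chord = radius * np.linalg.norm(u - v)     # cuts through the interior
    geodesic = radius * np.arccos(cos_angle)   # follows the surface
    return chord, geodesic

# For nearly antipodal points the chord tunnels through the planet, so the
# intrinsic (geodesic) distance is substantially longer.
chord, geo = chord_and_geodesic(np.array([1.0, 0.0, 0.0]), np.array([-1.0, 0.2, 0.0]))
print(f"chord {chord:.0f} km, geodesic {geo:.0f} km")
```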
Manifold learning algorithms differ in which geometric quantities they preserve and how they compute the embedding. The early generation, developed around 2000, produced global embeddings via large eigenproblems. The later generation, exemplified by t-SNE and UMAP, optimized cost functions emphasizing local neighborhood preservation, sacrificing some global structure to produce visually compelling clusters.
Isomap, short for Isometric Mapping, was introduced by Joshua Tenenbaum, Vin de Silva, and John Langford in 2000. The algorithm proceeds in three steps: it constructs a neighborhood graph by connecting each data point to its k nearest neighbors, estimates geodesic distances between every pair of points by computing shortest-path distances in the graph via Dijkstra's or the Floyd-Warshall algorithm, and applies classical multidimensional scaling to the resulting geodesic distance matrix to produce an embedding in which Euclidean distances approximate the geodesics.
The motivating intuition is that the ambient Euclidean distance between two distant points on a curved manifold is meaningless because the straight line connecting them cuts across empty space. The geodesic distance, defined as the length of the shortest curve along the manifold, is the correct notion. By approximating geodesics as paths through a dense neighborhood graph, Isomap recovers global structure that PCA cannot. On the swiss roll, Isomap unrolls the spiral into a flat rectangle whose dimensions correspond to the original parameter space. Isomap is sensitive to neighborhood size: too small disconnects the graph, while too large allows shortcuts that violate the manifold structure. The eigendecomposition step has cubic complexity, limiting Isomap to a few thousand examples without approximations such as landmark Isomap.
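The three steps can be run end to end with scikit-learn's `Isomap` (a sketch; `n_neighbors=10` and `n_samples=1000` are illustrative choices, and the correlation check is only a rough sanity test of the unrolling):

```python
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap

X, t = make_swiss_roll(n_samples=1000, random_state=0)

# k-NN graph -> graph shortest paths -> classical MDS, all wrapped in one
# estimator; n_neighbors is the k whose choice the text warns about.
emb = Isomap(n_neighbors=10, n_components=2).fit_transform(X)

# If the sheet is unrolled correctly, the first Isomap coordinate should
# track position along the roll, i.e. be nearly monotone in t.
corr = abs(np.corrcoef(emb[:, 0], t)[0, 1])
print(f"|correlation| between first Isomap coordinate and t: {corr:.2f}")
```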
Locally Linear Embedding, or LLE, was introduced by Sam Roweis and Lawrence Saul in the same December 2000 issue of Science as Isomap. Rather than estimating global geodesic distances, LLE assumes that each point can be reconstructed as a linear combination of its nearest neighbors, and that the same reconstruction weights remain valid in the low-dimensional embedding. The algorithm first identifies the k nearest neighbors of each point, then computes weights that best reconstruct each point from its neighbors subject to a sum-to-one constraint, and finally finds the low-dimensional embedding that best preserves these reconstruction weights via the bottom eigenvectors of a sparse matrix.
LLE is conceptually elegant because it never explicitly references geodesic distance: the local linear patches that approximate the manifold are stitched together implicitly through shared weights. The method is computationally lighter than Isomap because the relevant eigenproblem involves a sparse matrix. However, LLE has known failure modes, including collapsing outputs into a single point when the regularization is poorly chosen, and struggling with manifolds of nonuniform sampling density.
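The same swiss roll can be embedded with scikit-learn's implementation of LLE (a sketch; `n_neighbors=12` and `n_samples=1000` are illustrative choices):

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

X, _ = make_swiss_roll(n_samples=1000, random_state=0)

# Reconstruction weights are computed from each point's 12 nearest
# neighbors under the sum-to-one constraint; the embedding then comes from
# the bottom eigenvectors of a sparse matrix, which keeps the eigenproblem
# cheaper than Isomap's dense one.
lle = LocallyLinearEmbedding(n_neighbors=12, n_components=2, random_state=0)
emb = lle.fit_transform(X)
print(emb.shape, f"reconstruction error: {lle.reconstruction_error_:.2e}")
```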
Laplacian Eigenmaps were introduced by Mikhail Belkin and Partha Niyogi in 2003. The method begins with a nearest-neighbor graph, weights edges by a heat kernel that decays with Euclidean distance, constructs the graph Laplacian (a discrete analog of the Laplace-Beltrami operator on a Riemannian manifold), and computes its bottom eigenvectors as the embedding coordinates. Points connected by heavy edges in the graph end up close in the embedding, providing a smooth low-dimensional representation that respects local neighborhoods. The method has direct connections to spectral clustering and forms part of the mathematical foundation for later methods like UMAP.
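In scikit-learn, Laplacian Eigenmaps are exposed as `SpectralEmbedding` (a sketch; `n_neighbors=10` and `n_samples=1000` are illustrative choices):

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import SpectralEmbedding

X, _ = make_swiss_roll(n_samples=1000, random_state=0)

# SpectralEmbedding builds a k-NN affinity graph, forms its graph
# Laplacian, and takes the bottom nontrivial eigenvectors as the
# embedding coordinates, as described above.
emb = SpectralEmbedding(n_components=2, n_neighbors=10,
                        random_state=0).fit_transform(X)
print(emb.shape)
```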
Hessian LLE, proposed by David Donoho and Carrie Grimes in 2003, modifies LLE to better recover manifolds with nonconvex parameter spaces such as a square with a hole removed. It replaces the reconstruction weights with a Hessian-based functional that vanishes on locally linear functions, providing tighter recovery guarantees in theory. Local Tangent Space Alignment (LTSA), introduced by Zhenyue Zhang and Hongyuan Zha in 2004, estimates the local tangent space at each point via principal component analysis on its neighborhood and then aligns these local tangent spaces into a global coordinate system. Both methods share the spectral-method structure of building a neighborhood graph, encoding local geometry in a matrix, and extracting the embedding from a low-rank eigendecomposition. They are implemented in scikit-learn but are rarely the first choice for new applications because they are slower and less robust than t-SNE and UMAP.
t-distributed Stochastic Neighbor Embedding, abbreviated t-SNE, was introduced by Laurens van der Maaten and Geoffrey Hinton in 2008 as a refinement of an earlier method called Stochastic Neighbor Embedding by Hinton and Roweis. Unlike the eigendecomposition-based methods of the early 2000s, t-SNE optimizes a probabilistic objective via gradient descent. It defines a Gaussian probability distribution over neighbors of each point in the high-dimensional space, capturing the likelihood that one point would pick another as its neighbor. It then defines a Student t-distribution over neighbors in the low-dimensional embedding, with heavier tails that allow distant clusters to spread apart. The algorithm minimizes the Kullback-Leibler divergence between these two distributions, pushing the embedding toward configurations in which local neighborhood structure is preserved.
The heavy-tailed Student t-distribution is the key innovation that gives t-SNE its name and visual appeal. It allows the algorithm to faithfully represent local clusters while letting unrelated regions repel each other into well-separated blobs, producing the now-iconic t-SNE plot in which classes of digits, cell types, or document topics appear as distinct, often colorful islands. For more than a decade, t-SNE has been the default method for visualizing high-dimensional embeddings in machine learning papers and biology dashboards.
t-SNE has well-known limitations. The optimization is nonconvex and stochastic, so different runs produce different embeddings. The cluster sizes in a t-SNE plot do not correspond to true sizes in the original space, and the distances between clusters are not meaningful. The perplexity hyperparameter must be tuned by the user, and small changes can produce qualitatively different plots. The original implementation scales as the square of the number of points, although the Barnes-Hut approximation introduced by van der Maaten in 2014 reduced the cost to O(n log n).
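A typical invocation via scikit-learn looks as follows (a sketch; the digits dataset, the subset size of 500, and `perplexity=30` are illustrative choices):

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

# 8x8 handwritten-digit images (64-D) reduced to 2-D for plotting.
# perplexity is the neighborhood-size hyperparameter discussed above;
# rerunning with another random_state or perplexity will change the plot.
X, _ = load_digits(return_X_y=True)
emb = TSNE(n_components=2, perplexity=30, init="pca",
           random_state=0).fit_transform(X[:500])
print(emb.shape)
```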
Uniform Manifold Approximation and Projection, or UMAP, was introduced by Leland McInnes, John Healy, and James Melville in 2018. The method is grounded in a theoretical framework drawn from Riemannian geometry and algebraic topology, although in practice it is closely related to Laplacian Eigenmaps and t-SNE. UMAP first constructs a fuzzy simplicial set over the data, which can be understood as a weighted nearest-neighbor graph whose edge weights represent the probability that two points are connected at a given scale. It then optimizes a low-dimensional embedding so that its own fuzzy simplicial set matches the high-dimensional one as closely as possible, using cross-entropy as the loss function and stochastic gradient descent with negative sampling.
UMAP has several practical advantages over t-SNE that explain its rapid adoption since 2018. It is dramatically faster, often by an order of magnitude on large datasets, thanks to negative-sampling optimization and approximate nearest neighbor search. It preserves more global structure, so the relative positions of clusters in a UMAP plot are more interpretable than in a t-SNE plot. It scales to millions of points and supports embeddings of arbitrary dimension. It also offers an inverse transform mapping points back to the high-dimensional space, and a parametric variant that learns a neural network for fast inference on new data.
| Property | t-SNE | UMAP |
|---|---|---|
| Year introduced | 2008 | 2018 |
| Theoretical basis | Probabilistic divergence between neighbor distributions | Cross-entropy of fuzzy simplicial sets, Riemannian geometry |
| Optimization | Gradient descent on KL divergence | Stochastic gradient descent with negative sampling |
| Speed on 100k points | Minutes to hours | Seconds to minutes |
| Scaling | O(n log n) with Barnes-Hut, struggles past 100k | Scales to millions of points |
| Global structure preservation | Limited, distances between clusters not meaningful | Better, relative cluster positions more reliable |
| Embedding dimensions | Typically 2 or 3, higher rarely used | 2, 3, or higher; supports general-purpose reduction |
| Out-of-sample mapping | Not native, requires parametric variants | Built-in transform, plus parametric UMAP |
| Hyperparameters | Perplexity, learning rate, iterations | Number of neighbors, minimum distance, components |
| Reproducibility | Stochastic, results vary across runs | Stochastic but generally more stable |
| Common implementation | openTSNE, scikit-learn, Rtsne | umap-learn (Python), uwot (R) |
Potential of Heat-diffusion for Affinity-based Transition Embedding, abbreviated PHATE, was introduced by Kevin Moon and colleagues in 2019. The method was designed for biological data, particularly single-cell genomics, where datasets often contain branching structures representing developmental trajectories of cells. PHATE constructs a Markov random walk on the data, computes diffusion-based distances over many time steps, and embeds the resulting potential distances. Compared with t-SNE and UMAP, PHATE often does a better job of preserving long, branching trajectories where points along a continuum should appear as connected curves rather than discrete clusters.
TriMap, introduced by Ehsan Amid and Manfred Warmuth in 2019, uses triplet constraints rather than pairwise neighbor probabilities, with each triplet specifying that two anchor points should be closer to each other than to a third negative point. PaCMAP, introduced by Yingfan Wang and colleagues in 2020, generalizes the triplet idea to a combination of near, mid-range, and farther pairs, with the goal of preserving both local and global structure. Both methods are competitive with UMAP on standard benchmarks.
Manifold learning research relies on a small set of synthetic datasets. The swiss roll is the most famous: a two-dimensional rectangular sheet rolled into three dimensions along a spiral whose radius grows with the angle. Variants include the swiss roll with a hole removed before rolling, used to test whether algorithms can handle nonconvex parameter spaces. The S-curve is a sheet bent into the shape of the letter S, and the severed sphere removes a polar cap to test recovery of a manifold with boundary. Real benchmarks include MNIST handwritten digits, the COIL-20 object image set, the Frey faces dataset, and various subsets of single-cell RNA sequencing data.
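The main synthetic benchmarks are one import away in scikit-learn (a sketch; the sample sizes and noise level are illustrative choices):

```python
from sklearn.datasets import make_s_curve, make_swiss_roll

# Each generator returns the 3-D point cloud plus the intrinsic coordinate
# along the sheet; noise adds Gaussian jitter off the manifold.  Recent
# scikit-learn versions also accept hole=True in make_swiss_roll for the
# variant with a hole removed before rolling.
X_roll, t_roll = make_swiss_roll(n_samples=2000, noise=0.05, random_state=0)
X_s, t_s = make_s_curve(n_samples=2000, noise=0.05, random_state=0)
print(X_roll.shape, X_s.shape)
```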
The practical role of manifold learning has shifted substantially since the early 2000s. The eigendecomposition methods of the first generation are still taught and implemented but are rarely the first choice for production work, with their primary modern role being pedagogical. The methods that dominate today are t-SNE and UMAP, both used overwhelmingly for visualization rather than downstream supervised learning.
The single most common use of manifold learning in modern machine learning is the visualization of representations learned by neural networks. When a word2vec model produces 300-dimensional vectors for words in a vocabulary, or a transformer language model produces 4,096-dimensional hidden states for tokens, the only practical way to inspect those representations is to project them down to two or three dimensions for plotting. t-SNE and UMAP plots of word embeddings reveal clusters of semantically related words, plots of transformer activations have been used to study how language models encode syntax and topic, and vision researchers project the activations of convolutional networks at various layers to study what features each layer detects. These visualizations have become a standard component of machine learning papers, although critics warn that overinterpretation of cluster sizes and shapes is widespread.
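A common practical recipe for visualizing such representations is to reduce to a few dozen dimensions with PCA before running t-SNE, which is cheaper and mildly denoising. The sketch below uses scikit-learn's digits dataset standing in for a matrix of learned embeddings; the dataset, the subset size, and all parameter values are illustrative:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Rows = items, columns = embedding dimensions; a real pipeline would use
# e.g. transformer hidden states or word vectors here instead.
X, y = load_digits(return_X_y=True)

# Step 1: PCA to a few dozen dimensions.  Step 2: t-SNE to 2-D.
X_reduced = PCA(n_components=30, random_state=0).fit_transform(X)
emb = TSNE(n_components=2, perplexity=30,
           random_state=0).fit_transform(X_reduced[:500])
print(emb.shape)  # ready to scatter-plot, colored by the labels y[:500]
```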
Single-cell RNA sequencing produces datasets in which each row is a cell and each column is the expression level of a gene, with tens of thousands of cells and genes per experiment. The cells form a complex landscape of types, subtypes, and developmental states, and visualizing this landscape is essential for biological interpretation. UMAP has effectively replaced t-SNE as the default visualization in single-cell analysis pipelines such as Seurat (R) and Scanpy (Python), with PHATE used when developmental trajectories are of interest. The UMAP plot in which each point is a cell colored by cluster is now the canonical figure in cell biology papers. Manifold learning is also used in chemoinformatics for visualizing chemical space, in network analysis for visualizing graph embeddings, in finance for portfolios and market regimes, and in cybersecurity for clustering malware samples. In each case, the role is exploratory: the embedding helps the analyst see structure that would be invisible in the raw data.
A notable feature of modern manifold learning is its limited role in feeding downstream supervised models. In the early 2000s, before deep learning, low-dimensional embeddings produced by Isomap or LLE were sometimes used as input features for classifiers. With the rise of deep learning, this use case has largely disappeared, because modern neural networks already produce excellent low-dimensional representations as a byproduct of supervised or self-supervised training. The representations produced by t-SNE and UMAP are also generally unsuitable as features for supervised learning: they are stochastic and unstable, sensitive to hyperparameters, and not designed to preserve information relevant to any particular prediction task. New points cannot be added without retraining or using parametric variants. As a result, t-SNE and UMAP are almost always used as the last step of an analysis pipeline for visualization, rather than as a preprocessing step for further modeling.
A mature ecosystem of open-source software supports manifold learning across languages and platforms.
| Tool | Language | Methods supported | Notes |
|---|---|---|---|
| scikit-learn manifold module | Python | Isomap, LLE, Hessian LLE, LTSA, Laplacian Eigenmaps, MDS, t-SNE | Standard reference implementations, slower than specialized tools on large data |
| openTSNE | Python | t-SNE | Fast t-SNE with multicore Barnes-Hut and FIt-SNE backends, supports adding new points |
| umap-learn | Python | UMAP, parametric UMAP | Reference UMAP implementation by the original authors, widely used in single-cell analysis |
| uwot | R | UMAP | Fast UMAP and LargeVis implementations for R |
| Rtsne | R | t-SNE | Standard t-SNE for R via Barnes-Hut |
| Scanpy | Python | UMAP, PHATE, t-SNE | Single-cell genomics toolkit |
| Seurat | R | UMAP, t-SNE | Dominant single-cell analysis package in R |
| FIt-SNE | C++/Python/Matlab | t-SNE | Fast Fourier transform-based acceleration for very large data |
| TensorFlow Embedding Projector | Web | PCA, t-SNE, UMAP | Browser-based interactive visualization |
The choice of tool is driven primarily by the language ecosystem of the analysis pipeline rather than by methodological differences.
Manifold learning rests on several assumptions that are rarely tested in practice. The data must lie close to a smooth manifold, a condition that may fail when the data is contaminated by noise or outliers. The manifold must have a well-defined intrinsic dimension, which is itself a quantity that must be estimated from the data. The sampling must be dense enough to estimate local geometry reliably. When these conditions are violated, manifold learning algorithms can produce artifacts that mislead the analyst.
A persistent concern is the difficulty of validating manifold learning results. Unlike supervised learning, where predictions can be tested against ground truth, manifold embeddings are typically evaluated by visual inspection or by ad hoc preservation metrics with weak theoretical grounding. The fact that t-SNE and UMAP produce visually compelling plots even on random data has led to a literature of cautionary papers warning against overinterpretation. Researchers are encouraged to compare results across multiple methods, to vary hyperparameters systematically, and to confirm visualization findings with independent quantitative tests.
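scikit-learn ships one such preservation metric, `trustworthiness`, which can serve as a quantitative complement to visual inspection (a sketch; the S-curve data, `n_neighbors=10`, and the choice of PCA and Isomap as the methods compared are illustrative):

```python
from sklearn.datasets import make_s_curve
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap, trustworthiness

X, _ = make_s_curve(n_samples=800, random_state=0)

# trustworthiness scores how well small neighborhoods survive the
# embedding (1.0 = local structure perfectly preserved).
pca_emb = PCA(n_components=2).fit_transform(X)
iso_emb = Isomap(n_neighbors=10, n_components=2).fit_transform(X)
t_pca = trustworthiness(X, pca_emb, n_neighbors=10)
t_iso = trustworthiness(X, iso_emb, n_neighbors=10)
print(f"PCA    trustworthiness: {t_pca:.3f}")
print(f"Isomap trustworthiness: {t_iso:.3f}")
```

Comparing such scores across methods and hyperparameter settings is one concrete way to implement the cross-checks recommended above.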
From a theoretical standpoint, manifold learning connects to broader questions in geometry, topology, and high-dimensional probability. The Whitney embedding theorem guarantees that a smooth d-manifold can be embedded in 2d-dimensional Euclidean space, and the Nash embedding theorem guarantees that the embedding can be made isometric. Spectral methods in manifold learning are connected to the heat equation and the Laplace-Beltrami operator, and the convergence of graph Laplacians to their continuum counterparts has been studied extensively.
Deep learning has transformed the role of manifold learning. On the one hand, the success of deep neural networks is sometimes cited as confirmation of the manifold hypothesis, since the hidden layers of a trained network appear to discover compact representations of complex data. Autoencoders, variational autoencoders, and self-supervised methods such as SimCLR and DINO can be seen as parametric, learnable forms of manifold learning that scale to vast datasets and produce reusable embeddings. The contractive autoencoder of Rifai and colleagues was explicitly motivated by manifold-tangent considerations.
On the other hand, classical methods such as Isomap and LLE have been largely displaced for representation learning. Where the goal is to produce a low-dimensional space useful for classification, retrieval, or generation, neural networks trained on the appropriate task or pretext objective consistently outperform spectral manifold methods. The classical methods retain a niche in visualization, in pedagogy, and in settings where data is small and labels are unavailable, but they are no longer at the frontier of representation learning research.