Dimensionality reduction is the process of transforming data from a high-dimensional space into a lower-dimensional representation that retains as much of the meaningful structure of the original data as possible. It is a foundational technique in machine learning, statistics, signal processing, and data visualization, used to combat the curse of dimensionality, reveal hidden structure, compress data, and accelerate downstream learning algorithms.
Formally, dimensionality reduction is a mapping $f: \mathbb{R}^D \rightarrow \mathbb{R}^d$ (with $d \ll D$) applied to a dataset $X = \{x_1, x_2, \dots, x_n\} \subset \mathbb{R}^D$, producing a new dataset $Y = \{y_1, y_2, \dots, y_n\} \subset \mathbb{R}^d$. The goal is for $Y$ to preserve certain properties of $X$, such as variance, pairwise distances, local neighborhoods, class separability, or probability distributions. Depending on the method, the mapping $f$ may be linear or nonlinear, parametric or non-parametric, supervised or unsupervised, and invertible (reconstructive) or not.
Dimensionality reduction methods are broadly divided into two families:

- Feature extraction (feature projection) methods, which construct new features as linear or nonlinear transformations of the original ones.
- Feature selection methods, which retain a subset of the original features and discard the rest.
Both families aim to reduce $D$ while preserving predictive or descriptive information.
Imagine you have a huge spreadsheet with 10,000 columns describing each of your friends: their height, favorite color, birthday, shoe size, how loudly they laugh, and thousands more things. It is way too much to look at. Dimensionality reduction is like squishing that spreadsheet down to just two or three columns that still let you tell your friends apart. It throws away the repeated or boring stuff and keeps the interesting differences, sort of like making a cartoon of a face: you only need a few strokes for the eyes, nose, and mouth, and the cartoon still looks like the person.
The phrase curse of dimensionality, coined by Richard E. Bellman in 1961 while studying dynamic programming, refers to the collection of phenomena that make high-dimensional spaces behave in counterintuitive ways. As dimensionality grows, the volume of the space grows exponentially, so a fixed-size sample of data becomes vanishingly sparse. Several concrete problems follow:

- Distance concentration: as $D$ grows, the distances from a query point to its nearest and farthest neighbors become nearly equal, so distance-based methods lose discriminative power.
- Sample sparsity: the number of samples needed to cover the space at a fixed density grows exponentially in $D$, so any realistic dataset leaves most of the space empty.
- Overfitting: with many features and few samples, models can fit noise instead of signal, demanding stronger regularization or more data.
- Computational cost: the time and memory requirements of many algorithms grow quickly with $D$.
Dimensionality reduction mitigates these problems by projecting data into a space whose dimension is closer to its intrinsic dimensionality, the number of degrees of freedom that actually generate the variation observed in the data.
Humans perceive at most three spatial dimensions. To inspect, cluster, or sanity-check large datasets visually, practitioners routinely project data to two or three dimensions using methods such as t-SNE, UMAP, or the first two principal components. Good visualization embeddings let researchers spot clusters, outliers, trajectories, and mislabeled points that would be impossible to notice in the raw data.
Many learning algorithms scale poorly with input dimension. K-nearest neighbors, kernel methods, Gaussian processes, and dense neural networks all become slower and more memory-hungry as $D$ grows. Reducing the dimensionality to a compact code accelerates training and inference, and sometimes improves accuracy by removing noise and redundancy.
High-dimensional data often lies near a lower-dimensional manifold, with deviations from that manifold corresponding to noise. Projecting to the manifold, or to its principal directions of variation, tends to filter out this noise. This is why methods like principal component analysis and autoencoders are common preprocessing steps in image, audio, and sensor applications.
A grayscale image with 2048 pixels lives in a 2048-dimensional vector space, but realistic images occupy only a small manifold within that space. Dimensionality reduction exploits this redundancy. Lossy compression schemes based on discrete cosine transforms (JPEG), wavelets, and learned autoencoders are all forms of dimensionality reduction applied to media.
Real-world features are often correlated. Dimensionality reduction combines correlated features into independent or less-correlated components, stabilizing regression coefficients and making downstream models more interpretable when the reduced components themselves have meaningful interpretations.
There is no single axis along which to classify dimensionality reduction methods, but the following table summarizes the main algorithms and a few key properties. Supervised indicates whether label information is used, Linear indicates whether the mapping is linear in the input features, and Global/Local indicates which type of structure the method is designed to preserve.
| Method | Year | Key reference | Linear | Supervised | Structure preserved |
|---|---|---|---|---|---|
| Principal Component Analysis | 1901 | Pearson | Yes | No | Global variance |
| Factor Analysis | 1904 | Spearman | Yes | No | Latent common factors |
| Multidimensional Scaling (classical) | 1938 | Young and Householder | Yes | No | Global pairwise distances |
| Linear Discriminant Analysis | 1936 | Fisher | Yes | Yes | Class separability |
| Independent Component Analysis | 1994 | Comon | Yes | No | Statistical independence |
| Sammon Mapping | 1969 | Sammon | No | No | Global distances (weighted) |
| Kernel PCA | 1998 | Schölkopf, Smola, Müller | No | No | Nonlinear variance |
| Isomap | 2000 | Tenenbaum, de Silva, Langford | No | No | Geodesic distances |
| Locally Linear Embedding | 2000 | Roweis, Saul | No | No | Local linearity |
| Laplacian Eigenmaps | 2001 | Belkin, Niyogi | No | No | Local neighborhoods |
| Diffusion Maps | 2005 | Coifman, Lafon | No | No | Random-walk geometry |
| Random Projection | 1984 | Johnson, Lindenstrauss | Yes | No | Pairwise distances (approx) |
| Probabilistic PCA | 1999 | Tipping, Bishop | Yes | No | Gaussian latent variable |
| Deep Autoencoder | 2006 | Hinton, Salakhutdinov | No | No | Reconstructive code |
| t-SNE | 2008 | van der Maaten, Hinton | No | No | Local neighborhoods |
| UMAP | 2018 | McInnes, Healy, Melville | No | No | Local and some global |
Orthogonal to this taxonomy, one can also distinguish:

- Parametric methods, which learn an explicit map $f$ that can project new points without re-fitting (PCA, LDA, autoencoders), from non-parametric methods, which produce coordinates only for the training points (t-SNE, LLE, classical Isomap).
- Invertible (reconstructive) methods, which provide a decoder or inverse map so the data can be approximately reconstructed from the code (PCA, autoencoders), from non-invertible methods, which discard the information needed for reconstruction.
Principal component analysis (PCA) is the oldest and most widely used dimensionality reduction technique. It was introduced by Karl Pearson in his 1901 paper On Lines and Planes of Closest Fit to Systems of Points in Space, and developed independently by Harold Hotelling in 1933. Pearson formulated the method geometrically, as the problem of finding the best-fitting low-dimensional affine subspace to a cloud of points. Hotelling reformulated it algebraically in terms of eigenvectors of the covariance matrix.
Given centered data $X \in \mathbb{R}^{n \times D}$ (each column has zero mean), the sample covariance matrix is
$$\Sigma = \frac{1}{n-1} X^\top X.$$
PCA computes the eigendecomposition $\Sigma = V \Lambda V^\top$, where the columns of $V$ are orthonormal eigenvectors (the principal components or loading vectors) and $\Lambda$ is a diagonal matrix of nonnegative eigenvalues sorted in decreasing order. The projection of a data point $x$ onto the top $d$ components is
$$y = V_d^\top x,$$
where $V_d$ contains the first $d$ columns of $V$. Equivalently, PCA can be computed from the singular value decomposition $X = U S V^\top$, which is numerically preferable because it avoids forming $\Sigma$ explicitly.
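As a concrete illustration, here is a minimal NumPy sketch of PCA computed from the SVD of the centered data matrix, as described above; the function name and the example data are illustrative, not taken from any particular library.

```python
import numpy as np

def pca_svd(X, d):
    """PCA of an (n, D) data matrix via SVD, returning the d-dimensional
    projection, the principal axes, and the explained-variance ratios."""
    X_centered = X - X.mean(axis=0)              # center each feature
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:d]                          # top-d principal axes, (d, D)
    Y = X_centered @ components.T                # projected coordinates, (n, d)
    var = S**2 / (X.shape[0] - 1)                # eigenvalues of the covariance
    explained_ratio = var[:d] / var.sum()
    return Y, components, explained_ratio

# Example: 500 correlated points in 50 dimensions reduced to 2
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 50)) @ rng.normal(size=(50, 50))
Y, V_d, ratios = pca_svd(X, d=2)
print(Y.shape, ratios)
```

The squared singular values divided by $n - 1$ are exactly the eigenvalues $\lambda_i$ of $\Sigma$, so the last line of the function implements the explained-variance rule discussed next.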
PCA has three equivalent interpretations:

- Variance maximization: the top $d$ components span the $d$-dimensional subspace in which the projected data has maximal variance (Hotelling's algebraic view).
- Reconstruction-error minimization: the same subspace minimizes the mean squared distance between the points and their projections, i.e., it is the best-fitting affine subspace (Pearson's geometric view).
- Decorrelation: the projected coordinates are mutually uncorrelated, with diagonal covariance matrix $\Lambda_d$.
The fraction of variance captured by the first $d$ components is $\sum_{i=1}^d \lambda_i / \sum_{i=1}^D \lambda_i$, which gives an intuitive rule for picking $d$ (e.g., retain enough components to explain 95% of variance).
Strengths: simple, fast, optimal under the Frobenius norm, closed-form solution, parametric so new points can be projected trivially, invertible up to the truncation.
Limitations: purely linear, sensitive to feature scaling (features with larger units dominate), sensitive to outliers (because it uses squared error), cannot capture nonlinear manifold structure such as a Swiss roll.
The singular value decomposition factorizes any real matrix $X$ as $X = U S V^\top$, where $U$ and $V$ are orthogonal and $S$ is a diagonal matrix of nonnegative singular values in decreasing order. Truncated SVD keeps only the top $d$ singular triplets, yielding the best rank-$d$ approximation of $X$ in both the Frobenius and spectral norms (Eckart-Young-Mirsky theorem).
Truncated SVD is closely related to PCA but does not require centering, which makes it practical for sparse matrices (centering would destroy sparsity). In natural language processing it is the core computation behind latent semantic analysis, which applies truncated SVD to a term-document matrix to obtain low-dimensional document and term embeddings.
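A short sketch of how this might look in code, assuming SciPy's sparse SVD routine `scipy.sparse.linalg.svds` and a randomly generated stand-in for a term-document matrix:

```python
import numpy as np
from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import svds

# Hypothetical sparse term-document matrix: 10,000 terms x 2,000 documents
A = sparse_random(10_000, 2_000, density=0.001, format="csr", random_state=0)

# Top-100 singular triplets, computed without centering (sparsity preserved)
U, S, Vt = svds(A, k=100)
order = np.argsort(S)[::-1]          # svds returns singular values ascending
U, S, Vt = U[:, order], S[order], Vt[order]

term_embeddings = U * S              # (10000, 100) LSA-style term vectors
doc_embeddings = Vt.T * S            # (2000, 100) LSA-style document vectors
```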
Linear Discriminant Analysis (LDA), introduced by Ronald A. Fisher in his 1936 paper The Use of Multiple Measurements in Taxonomic Problems, is a supervised linear method. Given labeled data with $C$ classes, LDA seeks directions that maximize between-class variance relative to within-class variance. Formally, LDA maximizes the Rayleigh quotient
$$J(w) = \frac{w^\top S_B w}{w^\top S_W w},$$
where $S_B$ is the between-class scatter and $S_W$ is the within-class scatter. The optimal $w$ vectors are the generalized eigenvectors of $S_W^{-1} S_B$. LDA can extract at most $C - 1$ components and is primarily used as a supervised preprocessing step for classification tasks.
Whereas PCA seeks directions that explain variance in the data as a whole, LDA seeks directions that separate the classes. The two can give very different projections on the same dataset.
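The following sketch computes LDA directions from the generalized eigenproblem described above, assuming NumPy and SciPy and a nonsingular $S_W$ (in practice $S_W$ is often regularized or shrunk):

```python
import numpy as np
from scipy.linalg import eigh

def lda_directions(X, y, d):
    """Top-d LDA directions as generalized eigenvectors of (S_B, S_W).
    Minimal sketch; assumes S_W is nonsingular and d <= C - 1."""
    mean_all = X.mean(axis=0)
    D = X.shape[1]
    S_W = np.zeros((D, D))
    S_B = np.zeros((D, D))
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        S_W += (Xc - mc).T @ (Xc - mc)           # within-class scatter
        diff = (mc - mean_all)[:, None]
        S_B += len(Xc) * (diff @ diff.T)         # between-class scatter
    # Generalized eigenproblem S_B w = lambda S_W w; eigh returns ascending
    vals, vecs = eigh(S_B, S_W)
    return vecs[:, ::-1][:, :d]                  # directions maximizing J(w)
```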
Independent Component Analysis (ICA), formalized by Pierre Comon in 1994, seeks a linear transformation such that the resulting components are as statistically independent as possible, not merely uncorrelated. ICA assumes the observed signals are linear mixtures of independent, non-Gaussian source signals and attempts to invert the mixing. A classic application is the cocktail party problem, separating individual voices from microphone recordings. Algorithms such as FastICA and Infomax estimate the unmixing matrix by maximizing non-Gaussianity measures like negentropy or kurtosis.
ICA is useful as dimensionality reduction when interpretable source signals (rather than variance-capturing components) are desired, for example in EEG, fMRI, and audio processing.
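A toy cocktail-party example, assuming scikit-learn's `FastICA` implementation; the signal shapes and mixing matrix are invented for illustration:

```python
import numpy as np
from sklearn.decomposition import FastICA

# Two non-Gaussian sources observed through two "microphones"
t = np.linspace(0, 8, 2000)
S = np.c_[np.sin(3 * t), np.sign(np.sin(5 * t))]   # sine + square wave
A = np.array([[1.0, 0.5], [0.4, 1.0]])             # unknown mixing matrix
X = S @ A.T                                        # observed mixtures

ica = FastICA(n_components=2, random_state=0)
S_est = ica.fit_transform(X)     # estimated sources (up to scale and order)
```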
Factor analysis, originating with Charles Spearman's 1904 work on intelligence testing, assumes that observed variables are linear combinations of a smaller number of latent common factors plus independent noise:
$$x = W z + \mu + \epsilon, \quad z \sim \mathcal{N}(0, I), \quad \epsilon \sim \mathcal{N}(0, \Psi),$$
where $\Psi$ is diagonal. The loadings $W$ are estimated by maximum likelihood or the method of principal factors. Because the likelihood is unchanged when the factors are rotated ($W \to WR$ for any orthogonal $R$), the loadings are identified only up to rotation, and a rotation criterion (varimax, promax) is typically applied afterward to improve interpretability. Unlike PCA, factor analysis permits a separate noise variance for each feature, which makes it more appropriate when observation noise differs across features.
Random projection is a surprisingly powerful family of methods in which the projection matrix is drawn at random rather than learned from data. Its theoretical foundation is the Johnson-Lindenstrauss lemma, proved in 1984 by William B. Johnson and Joram Lindenstrauss, which states:
For any $0 < \varepsilon < 1$ and any set $X$ of $n$ points in $\mathbb{R}^D$, there exists a linear map $f : \mathbb{R}^D \to \mathbb{R}^m$ with $m = O(\varepsilon^{-2} \log n)$ such that for all $u, v \in X$,
$$(1 - \varepsilon) \|u - v\|^2 \le \|f(u) - f(v)\|^2 \le (1 + \varepsilon) \|u - v\|^2.$$
The target dimension depends only logarithmically on the number of points and not at all on the original dimension. Moreover, the classical proof is constructive: with high probability, a matrix whose entries are i.i.d. Gaussian (or Rademacher) random variables, suitably scaled, realizes the map. Practical implementations include Gaussian random projection and the sparse random projection of Achlioptas.
Random projection is attractive when $D$ is extremely large, for streaming or one-pass settings, and as a preprocessing step before other dimensionality reduction methods. It is data-oblivious, which is both its strength (no training) and its weakness (it does not adapt to the data's structure).
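A minimal NumPy sketch of Gaussian random projection under the Johnson-Lindenstrauss scaling; the dimensions and the distortion check are illustrative:

```python
import numpy as np
from scipy.spatial.distance import pdist

def gaussian_random_projection(X, m, seed=0):
    """Data-oblivious projection to m dimensions. With m = O(eps^-2 log n),
    pairwise distances are preserved within (1 +/- eps) with high probability."""
    rng = np.random.default_rng(seed)
    R = rng.normal(size=(X.shape[1], m)) / np.sqrt(m)   # i.i.d. Gaussian, scaled
    return X @ R

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 10_000))       # 500 points in 10,000 dimensions
Y = gaussian_random_projection(X, m=500)

# Maximum relative distance distortion on a subset of pairs
orig, proj = pdist(X[:50]), pdist(Y[:50])
print(np.abs(proj / orig - 1).max())     # small, on the order of 0.1 here
```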
Multidimensional scaling (MDS), in its classical form introduced by Young and Householder in 1938 and popularized by Torgerson and Gower, takes as input a matrix of pairwise dissimilarities between $n$ objects and produces coordinates in $\mathbb{R}^d$ such that the Euclidean distances approximate the input dissimilarities. Classical (metric) MDS solves this via an eigendecomposition of the doubly centered squared-distance matrix, and in the special case where the dissimilarities are Euclidean distances it is equivalent to PCA.
Non-metric MDS (Shepard, Kruskal) preserves only the rank order of the dissimilarities and is solved iteratively. MDS is widely used in psychometrics, ecology, and marketing research where only similarity judgments are available.
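A sketch of the classical MDS computation, double centering followed by an eigendecomposition, assuming NumPy and a precomputed dissimilarity matrix:

```python
import numpy as np

def classical_mds(Dmat, d):
    """Classical (Torgerson) MDS from an (n, n) matrix of dissimilarities."""
    n = Dmat.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    B = -0.5 * J @ (Dmat**2) @ J             # doubly centered Gram matrix
    vals, vecs = np.linalg.eigh(B)           # eigenvalues in ascending order
    idx = np.argsort(vals)[::-1][:d]
    scale = np.sqrt(np.maximum(vals[idx], 0))  # clip tiny negative eigenvalues
    return vecs[:, idx] * scale              # (n, d) embedding coordinates
```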
Probabilistic PCA (PPCA), introduced by Tipping and Bishop in 1999, embeds PCA in a probabilistic latent variable framework:
$$x = Wz + \mu + \epsilon, \quad z \sim \mathcal{N}(0, I), \quad \epsilon \sim \mathcal{N}(0, \sigma^2 I).$$
Maximum likelihood estimation recovers the same principal subspace as classical PCA, but PPCA offers additional advantages: it handles missing data naturally through the EM algorithm, it provides a proper likelihood for model comparison and mixture modeling, and it generalizes smoothly to factor analysis (by replacing $\sigma^2 I$ with a diagonal $\Psi$) and to Bayesian PCA.
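The maximum-likelihood solution has a closed form: $\sigma^2$ is the mean of the discarded eigenvalues and $W = V_d(\Lambda_d - \sigma^2 I)^{1/2}$. A NumPy sketch, assuming $n > D$ so the sample covariance is well behaved:

```python
import numpy as np

def ppca_ml(X, d):
    """Closed-form maximum-likelihood PPCA (Tipping & Bishop)."""
    n, D = X.shape
    mu = X.mean(axis=0)
    cov = (X - mu).T @ (X - mu) / (n - 1)
    vals, vecs = np.linalg.eigh(cov)
    vals, vecs = vals[::-1], vecs[:, ::-1]          # descending eigenvalues
    sigma2 = vals[d:].mean()                        # ML noise variance
    W = vecs[:, :d] * np.sqrt(np.maximum(vals[:d] - sigma2, 0))
    return W, mu, sigma2
```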
Linear methods are insufficient when data lies on a curved manifold embedded in high-dimensional space. The canonical toy example is the Swiss roll: a two-dimensional sheet rolled into three dimensions. PCA will flatten the roll and confuse points that are far apart along the sheet but close together in the ambient space. Nonlinear or manifold learning methods aim to recover the underlying low-dimensional coordinates while respecting the curvature of the manifold. See manifold learning for an overview.
Kernel PCA, introduced by Bernhard Schölkopf, Alexander Smola, and Klaus-Robert Müller in a 1998 paper in Neural Computation, generalizes PCA via the kernel trick. Instead of computing the eigendecomposition of the covariance matrix in input space, kernel PCA computes the eigendecomposition of the $n \times n$ centered kernel (Gram) matrix $K$ with entries $K_{ij} = k(x_i, x_j)$, where $k$ is a positive semidefinite kernel such as the Gaussian (RBF) kernel $k(x, y) = \exp(-\|x - y\|^2 / 2\sigma^2)$ or a polynomial kernel.
This corresponds to performing standard PCA in a (possibly infinite-dimensional) feature space induced by the kernel, without ever explicitly computing the feature map. Kernel PCA can capture nonlinear structure, but it scales as $O(n^2)$ in memory and $O(n^3)$ in time for the eigendecomposition, and out-of-sample extension requires re-applying the kernel.
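A minimal NumPy sketch of kernel PCA with an RBF kernel, following the Gram-matrix centering and eigendecomposition described above; the parameter `gamma` corresponds to $1/2\sigma^2$:

```python
import numpy as np

def kernel_pca(X, d, gamma=1.0):
    """Kernel PCA with an RBF kernel: eigendecomposition of the
    doubly centered (n, n) Gram matrix."""
    sq = np.sum(X**2, axis=1)
    K = np.exp(-gamma * (sq[:, None] + sq[None, :] - 2 * X @ X.T))
    n = K.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    Kc = J @ K @ J                                # center in feature space
    vals, vecs = np.linalg.eigh(Kc)
    idx = np.argsort(vals)[::-1][:d]
    # Embedded coordinates are eigenvectors scaled by sqrt of eigenvalues
    return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0))
```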
Isomap (isometric feature mapping), introduced by Joshua Tenenbaum, Vin de Silva, and John Langford in a 2000 Science paper, extends classical MDS by replacing Euclidean distances with geodesic distances measured along the data manifold. The algorithm has three steps:

1. Construct a neighborhood graph by connecting each point to its $k$ nearest neighbors (or to all points within radius $\epsilon$), weighting edges by Euclidean distance.
2. Estimate the geodesic distance between every pair of points as the shortest-path distance in this graph, computed with Dijkstra's or Floyd-Warshall's algorithm.
3. Apply classical MDS to the matrix of geodesic distances to obtain the low-dimensional embedding.
Isomap recovers the true intrinsic geometry of a manifold when the data is sufficiently dense, converges to the true geodesic distances in the limit of infinite data, and on the Swiss roll unrolls the sheet into a flat rectangle. Its weaknesses are sensitivity to the choice of $k$ or $\epsilon$ (too large causes short-circuiting across folds of the manifold), high computational cost on large datasets, and the requirement that the manifold be isometric to a convex subset of Euclidean space.
Locally linear embedding (LLE), introduced by Sam Roweis and Lawrence Saul in the same 2000 issue of Science as Isomap, takes a local-to-global approach. The intuition is that a smooth manifold looks locally linear, so each point can be approximated as a linear combination of its neighbors, and the same linear weights should describe the point in the low-dimensional embedding. The algorithm has three steps:

1. Find the $k$ nearest neighbors of each point $x_i$.
2. Compute weights $W_{ij}$ that best reconstruct each point from its neighbors, minimizing $\|x_i - \sum_j W_{ij} x_j\|^2$ subject to $\sum_j W_{ij} = 1$, with $W_{ij} = 0$ for non-neighbors.
3. Fix the weights and find coordinates $y_i$ minimizing $\sum_i \|y_i - \sum_j W_{ij} y_j\|^2$; the solution is given by the bottom nonzero eigenvectors of $(I - W)^\top (I - W)$.
LLE is appealing because its optimization has no local minima and it captures the local geometry faithfully, but it can distort global structure and is sensitive to the neighborhood parameter and to noise. Variants include Hessian LLE, Modified LLE, and Local Tangent Space Alignment (LTSA).
Laplacian eigenmaps, introduced by Mikhail Belkin and Partha Niyogi (conference version 2001, journal version 2003 in Neural Computation), embed the data by solving a generalized eigenvalue problem on the graph Laplacian of a neighborhood graph. The algorithm constructs a weighted adjacency matrix $W$ (often using a Gaussian heat kernel on neighbor pairs), forms the graph Laplacian $L = D - W$ where $D$ is the diagonal degree matrix, and computes the eigenvectors corresponding to the smallest nonzero eigenvalues of the generalized problem $L y = \lambda D y$. These eigenvectors are the low-dimensional coordinates.
The method is motivated by the observation that on a Riemannian manifold, the Laplace-Beltrami operator governs heat diffusion and its eigenfunctions provide a natural basis of smooth, locality-preserving functions. Laplacian eigenmaps therefore produce embeddings in which neighboring points in the graph are close in the embedding, which is a natural criterion for clustering-friendly projections. Laplacian eigenmaps is closely connected to spectral clustering and to normalized cuts.
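A dense sketch of this procedure, assuming NumPy, SciPy, and scikit-learn's `kneighbors_graph` helper; the parameter values are illustrative:

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.neighbors import kneighbors_graph

def laplacian_eigenmaps(X, d, k=10, t=1.0):
    """Heat-kernel weights on a symmetrized k-NN graph, then the
    generalized eigenproblem L y = lambda D y."""
    A = kneighbors_graph(X, k, mode="distance").toarray()
    A = np.maximum(A, A.T)                        # symmetrize the graph
    W = np.where(A > 0, np.exp(-A**2 / t), 0.0)   # Gaussian heat kernel
    Dg = np.diag(W.sum(axis=1))                   # degree matrix
    L = Dg - W                                    # graph Laplacian
    vals, vecs = eigh(L, Dg)                      # ascending eigenvalues
    return vecs[:, 1:d + 1]                       # skip the constant eigenvector
```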
Hessian eigenmaps (Donoho and Grimes, 2003) replaces the Laplacian with a Hessian-based operator and can recover the manifold's coordinate chart even when the manifold is not convex, removing one of Isomap's limitations. Diffusion maps (Coifman and Lafon, 2005) view the data as a Markov chain and embed points so that Euclidean distances in the embedding approximate the diffusion distance (distance between probability distributions after several random-walk steps). The diffusion distance averages over all paths between two points and is therefore robust to noise and short-circuiting.
Sammon mapping (John Sammon, 1969) is an early nonlinear MDS variant that minimizes a weighted stress function
$$E = \frac{1}{\sum_{i<j} d_{ij}} \sum_{i<j} \frac{(d_{ij} - \delta_{ij})^2}{d_{ij}},$$
where $d_{ij}$ are the original distances and $\delta_{ij}$ are distances in the embedding. The weighting $1/d_{ij}$ emphasizes preserving small distances over large ones, which makes Sammon mapping a precursor to modern locality-preserving methods.
Two algorithms dominate the practice of data visualization in the 2010s and 2020s: t-SNE and UMAP.
t-distributed stochastic neighbor embedding (t-SNE), introduced by Laurens van der Maaten and Geoffrey Hinton in the 2008 Journal of Machine Learning Research paper Visualizing Data using t-SNE, is a probabilistic method specifically designed for two- and three-dimensional visualization. It is a variant of Stochastic Neighbor Embedding (Hinton and Roweis, 2002) that addresses the crowding problem of earlier methods.
The algorithm operates as follows:

1. In the input space, convert pairwise distances into conditional probabilities $p_{j \mid i}$ using Gaussian kernels, with a per-point bandwidth chosen so that the effective number of neighbors matches a user-specified perplexity; symmetrize these to joint probabilities $p_{ij}$.
2. In the low-dimensional embedding, define similarities $q_{ij}$ using a Student-t distribution with one degree of freedom, whose heavy tails let moderately distant points spread out and thereby alleviate the crowding problem.
3. Minimize the Kullback-Leibler divergence $\mathrm{KL}(P \,\|\, Q)$ over the embedding coordinates by gradient descent, typically with momentum and an early-exaggeration phase.
t-SNE excels at revealing cluster structure and is ubiquitous in scientific visualization. However it has well-known limitations: the objective is non-convex, so different random initializations produce different layouts; the method does not preserve global distances (clusters that appear far apart in a t-SNE plot may not actually be far apart); and its original complexity is $O(n^2)$, though Barnes-Hut t-SNE and FIt-SNE accelerate it to $O(n \log n)$ and $O(n)$ respectively. Users must also interpret t-SNE plots carefully because the relative sizes and distances of clusters are not meaningful.
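In practice t-SNE is rarely implemented from scratch; a typical call through scikit-learn's implementation looks like the following sketch (defaults differ across versions, so the parameters shown are illustrative):

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)        # 1,797 points, 64 dimensions
Y = TSNE(n_components=2, perplexity=30, init="pca",
         random_state=0).fit_transform(X)
print(Y.shape)                             # (1797, 2)
```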
Uniform Manifold Approximation and Projection (UMAP), introduced by Leland McInnes, John Healy, and James Melville in a 2018 arXiv preprint (and a 2018 Journal of Open Source Software paper), is built on a mathematical framework grounded in Riemannian geometry and algebraic topology. In practice UMAP behaves much like t-SNE but runs faster and tends to preserve more global structure.
UMAP constructs a weighted $k$-nearest-neighbor graph in which edge weights encode fuzzy simplicial set memberships derived from locally adaptive distance scaling. An equivalent graph is constructed in the low-dimensional embedding space, and UMAP minimizes a cross-entropy between the two fuzzy simplicial sets using stochastic gradient descent with negative sampling. The objective includes attractive forces along the graph edges and repulsive forces on randomly sampled non-neighbors.
Compared to t-SNE, UMAP has several practical advantages: faster runtime (scales to millions of points), better preservation of global structure, the ability to embed into more than three dimensions for use as a general-purpose feature extractor, a parametric variant that can project new points, and supervised and semi-supervised variants that incorporate labels. Its main hyperparameters are the number of neighbors (`n_neighbors`), which trades off local versus global structure, and the minimum distance (`min_dist`), which controls how tightly points pack together. Like t-SNE, UMAP visualizations can mislead if interpreted too literally, particularly regarding the absolute sizes and distances between clusters.
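A typical invocation, assuming the `umap-learn` package:

```python
import umap                        # the umap-learn package
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)
reducer = umap.UMAP(n_neighbors=15,    # local vs. global structure trade-off
                    min_dist=0.1,      # how tightly points may pack
                    n_components=2,
                    random_state=0)
Y = reducer.fit_transform(X)           # (1797, 2) embedding
```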
An autoencoder is a neural network trained to reconstruct its input through a bottleneck layer whose dimension is smaller than the input. If $f_\theta : \mathbb{R}^D \to \mathbb{R}^d$ is the encoder and $g_\phi : \mathbb{R}^d \to \mathbb{R}^D$ is the decoder, the autoencoder is trained to minimize a reconstruction loss such as
$$\mathcal{L}(\theta, \phi) = \frac{1}{n} \sum_{i=1}^n \|x_i - g_\phi(f_\theta(x_i))\|^2.$$
The bottleneck activation $z = f_\theta(x)$ is the low-dimensional code. With linear activations and squared loss, an autoencoder recovers the same subspace as PCA (up to rotation). With nonlinear activations and multiple layers, autoencoders can learn highly nonlinear manifolds that PCA cannot.
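A minimal sketch of such an autoencoder, assuming PyTorch; the layer sizes, input dimension, and single-batch training loop are illustrative only:

```python
import torch
from torch import nn

# Nonlinear autoencoder for D=784 inputs (e.g. flattened 28x28 images)
# with a d=32 bottleneck code
encoder = nn.Sequential(nn.Linear(784, 256), nn.ReLU(),
                        nn.Linear(256, 32))
decoder = nn.Sequential(nn.Linear(32, 256), nn.ReLU(),
                        nn.Linear(256, 784))
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()),
                       lr=1e-3)
loss_fn = nn.MSELoss()

x = torch.rand(64, 784)              # stand-in batch of inputs
for _ in range(100):                 # one batch repeated, for brevity
    opt.zero_grad()
    z = encoder(x)                   # low-dimensional code
    loss = loss_fn(decoder(z), x)    # reconstruction error
    loss.backward()
    opt.step()
```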
The landmark 2006 Science paper Reducing the Dimensionality of Data with Neural Networks by Geoffrey Hinton and Ruslan Salakhutdinov demonstrated that deep autoencoders, when initialized by greedy layer-wise pretraining with restricted Boltzmann machines, learned low-dimensional codes for images and documents that substantially outperformed PCA on reconstruction error. This result is often cited as one of the catalysts of the modern deep learning era.
Several regularized variants exist:

- Denoising autoencoders corrupt the input with noise and train the network to reconstruct the clean original, forcing the code to capture robust structure rather than the identity map.
- Sparse autoencoders add a penalty that keeps most code units inactive for any given input.
- Contractive autoencoders penalize the Frobenius norm of the encoder's Jacobian, making the code insensitive to small perturbations of the input.
Variational autoencoders (VAEs), introduced by Diederik Kingma and Max Welling in their 2013 paper Auto-Encoding Variational Bayes, combine autoencoding with variational Bayesian inference. The encoder outputs the parameters of an approximate posterior $q_\phi(z \mid x)$, typically a diagonal Gaussian, and the decoder parameterizes the likelihood $p_\theta(x \mid z)$. Training maximizes the evidence lower bound (ELBO), which is the sum of an expected reconstruction term and a KL divergence that regularizes the posterior toward a prior $p(z)$, typically $\mathcal{N}(0, I)$.
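Written out, the ELBO for a single data point $x$ is

$$\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] - \mathrm{KL}\big(q_\phi(z \mid x) \,\|\, p(z)\big),$$

so training trades reconstruction fidelity against the divergence of the approximate posterior from the prior; the reparameterization trick makes the expectation differentiable with respect to $\phi$.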
VAEs provide a principled generative model in addition to a dimensionality reduction, and their latent spaces tend to be smooth and interpolable. They are widely used in generative modeling of images, molecules, audio, and scientific data.
Modern self-supervised representation learning methods treat large pretrained neural networks as general-purpose dimensionality reducers. Methods like SimCLR (Chen et al., 2020), MoCo (He et al., 2020), and BYOL (Grill et al., 2020) train encoders so that different augmented views of the same image map to similar embeddings while views of different images map to dissimilar ones. The resulting learned features, often hundreds or thousands of dimensions extracted from an image originally represented by hundreds of thousands of pixels, form the basis for transfer learning and few-shot recognition. See contrastive learning and SimCLR for details. The broader area is representation learning.
Word embeddings such as word2vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014) map discrete words into dense vectors of a few hundred dimensions by exploiting co-occurrence statistics in large text corpora. Although they operate on originally categorical inputs, word embeddings can be viewed as a dimensionality reduction of the very high-dimensional one-hot or co-occurrence representation of a vocabulary, and they inherit many theoretical connections with matrix factorization. Sentence and document encoders (Sentence-BERT, SimCSE, and modern large language model encoders) extend the same idea to longer pieces of text.
Feature selection is dimensionality reduction by removing features rather than transforming them. Selected features retain their original meaning, which is a major advantage for interpretability, and feature selection integrates cleanly with sparse models. The three main approaches, formalized by Kohavi and John (1997) and refined by Guyon and Elisseeff (2003), are:

- Filter methods, which score each feature with a learner-independent statistic;
- Wrapper methods, which search over feature subsets guided by a learner's validation performance; and
- Embedded methods, which perform selection as part of model training.
Filter methods score each feature using a statistic that is computed independently of any learner, then keep the top-scoring features. Common scores include:

- Pearson correlation with the target (for regression);
- Mutual information between feature and target;
- The chi-squared statistic or ANOVA F-statistic (for classification);
- Variance thresholds, which drop near-constant features.
Filter methods are fast and model-agnostic but ignore feature interactions and model-specific relevance.
Wrapper methods treat feature selection as a search over subsets, evaluating each candidate subset by the cross-validation performance of an actual learner. Classical search strategies include forward selection (start empty, add features greedily), backward elimination (start full, remove features greedily), and recursive feature elimination. Wrapper methods typically achieve the best accuracy for the target model but are computationally expensive because they retrain the model many times.
Embedded methods perform feature selection during model training. The most famous example is LASSO regression (Tibshirani, 1996), which adds an $L_1$ penalty $\lambda \|w\|_1$ to the ordinary least squares objective. The non-smoothness of the $L_1$ norm at zero drives many coefficients to be exactly zero, yielding a sparse solution in which only a subset of features have nonzero weights. Related methods include elastic net (an $L_1 + L_2$ blend), group LASSO, tree-based feature importances (Random Forest, gradient boosted trees), and regularized logistic regression.
Embedded methods strike a balance between the speed of filter methods and the model-specific accuracy of wrapper methods.
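A small sketch of LASSO-based selection with scikit-learn; the synthetic data, in which only 5 of 100 features carry signal, is invented for illustration:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 100))
w_true = np.zeros(100)
w_true[:5] = [3, -2, 1.5, -1, 0.5]          # only 5 informative features
y = X @ w_true + 0.1 * rng.normal(size=200)

model = Lasso(alpha=0.1).fit(X, y)
print(np.flatnonzero(model.coef_))          # indices of the selected features
```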
Judging the quality of a dimensionality reduction is surprisingly subtle because there is rarely a single ground-truth embedding. The appropriate metric depends on the downstream goal.
| Metric | What it measures | Used for |
|---|---|---|
| Reconstruction error | $\|X - \hat{X}\|_F^2$ after projecting and inverting | PCA, autoencoders |
| Explained variance ratio | Fraction of variance captured by top components | PCA |
| Trustworthiness | Are neighbors in the embedding also neighbors in the input? | t-SNE, UMAP, manifold methods |
| Continuity | Are neighbors in the input also neighbors in the embedding? | t-SNE, UMAP, manifold methods |
| Stress | Weighted mismatch of distances | MDS, Sammon |
| Residual variance | $1 - R^2$ between input and geodesic distances | Isomap |
| k-NN classification accuracy | Downstream classifier accuracy on embedded features | Supervised evaluation |
| Silhouette score | Cluster compactness and separation in embedding | Cluster preservation |
| Mean reciprocal rank / retrieval recall | Retrieval quality in the embedded space | Information retrieval |
Trustworthiness and continuity, formalized by Venna and Kaski, are the standard metrics for neighborhood-preserving embeddings. Trustworthiness penalizes points that appear close in the embedding but are far in the original space (avoiding false neighbors), while continuity penalizes points that are close in the original space but far in the embedding (avoiding missing neighbors). Both range from 0 to 1, and most methods must trade one against the other.
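As a sketch of how these metrics are used in practice, scikit-learn exposes a `trustworthiness` function that can compare embeddings of the same data:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE, trustworthiness

X, _ = load_digits(return_X_y=True)
Y_pca = PCA(n_components=2).fit_transform(X)
Y_tsne = TSNE(n_components=2, random_state=0).fit_transform(X)

# Fraction (in [0, 1]) of embedded neighbors that are true input-space neighbors
print(trustworthiness(X, Y_pca, n_neighbors=10))
print(trustworthiness(X, Y_tsne, n_neighbors=10))
```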
Intrinsic dimensionality estimation asks, independent of any specific embedding, how many degrees of freedom the data actually has. Estimators include correlation dimension, maximum likelihood estimation, two-nearest-neighbors (TwoNN), and the method based on the Hill estimator. Knowing the intrinsic dimension helps pick a sensible target dimension for the reduction and provides a lower bound on reconstruction error.
Downstream task performance is in practice the most honest evaluation: train a classifier, regressor, clustering algorithm, or retrieval system on the reduced representation and measure its performance on held-out data. If the reduction throws away information the downstream task needs, that shows up here regardless of how good the unsupervised reconstruction or visualization metrics look.
Dimensionality reduction is the workhorse of exploratory data analysis. PCA scatter plots, t-SNE maps, and UMAP projections are standard figures in scientific papers across essentially every quantitative discipline. In genomics, a UMAP plot of cells is now expected in virtually every single-cell study. In deep learning, visualizing learned embeddings in two dimensions helps debug models, diagnose mode collapse, and communicate results.
Reducing the input dimensionality before training a downstream classifier, regressor, or cluster model can improve generalization, reduce training time, and stabilize optimization. PCA is a default preprocessing step for nearest-neighbor classifiers, Gaussian mixture models, and support vector machines on very wide tabular data. Random projection is used before locality-sensitive hashing in large-scale search.
Single-cell RNA sequencing produces expression matrices with tens of thousands of genes measured across hundreds of thousands or millions of cells. Dimensionality reduction is essential for clustering cell types, visualizing developmental trajectories, and identifying marker genes. A typical workflow applies PCA to the filtered and normalized count matrix to compact the data into a few dozen principal components, then uses UMAP or t-SNE on those components to produce a two-dimensional plot. Specialized methods like diffusion maps are popular for trajectory inference. See bioinformatics.
Latent semantic analysis (Deerwester et al., 1990) applies truncated SVD to a term-document matrix to produce compact document embeddings that capture topical structure and mitigate synonymy and polysemy. Modern word and sentence embeddings generalize this to nonlinear and contextual models. In search, dimensionality reduction compresses document and query vectors for fast nearest-neighbor retrieval; approximate nearest neighbor indexes like HNSW, IVF, and product quantization rely on low-dimensional or quantized representations.
Before deep learning, PCA-based methods like eigenfaces (Turk and Pentland, 1991) were state of the art for face recognition: each face image was projected onto the top few principal components of a training set and compared to enrolled faces in that compact space. Today, learned convolutional and transformer features have replaced eigenfaces, but PCA, autoencoders, and contrastive embeddings remain ubiquitous preprocessing and representation tools.
Recommender systems based on matrix factorization decompose a sparse user-item interaction matrix into low-rank user and item embeddings, effectively reducing both the user space and the item space to a common low-dimensional latent space. Simon Funk's SVD-based solution for the 2006 Netflix Prize popularized this approach, which remains a building block of modern hybrid recommenders.
Reconstruction error from a dimensionality reduction model is a natural anomaly score: normal points, which lie on the learned manifold, reconstruct well, while anomalies reconstruct poorly. PCA-based anomaly detection and autoencoder-based anomaly detection are standard tools for fraud detection, network intrusion detection, and industrial equipment monitoring.
Transform coding, the core idea of JPEG, MP3, and related codecs, is dimensionality reduction by a fixed linear transform (DCT, wavelet) followed by quantization. Learned compression with autoencoders and vector-quantized models pushes this further, adapting the transform to the data distribution.
Learned state representations are essential for reinforcement learning in high-dimensional observation spaces like pixels. Methods that combine autoencoders or contrastive learning with policy learning (such as CURL, SPR, and DreamerV2) rely on dimensionality reduction to turn raw observations into compact states over which policies are easier to learn.
No dimensionality reduction method is free of tradeoffs.
A rough practical guide based on data properties and goals:
| Goal | Data property | Recommended starting point |
|---|---|---|
| Quick look, interpretable axes | Tabular, roughly linear | PCA |
| Visualization with clusters | Any | UMAP (then t-SNE for comparison) |
| Visualization of continuous trajectories | Manifold data | Diffusion maps, Isomap, UMAP |
| Preprocessing for classifier | Wide tabular | PCA or LDA (if labels) |
| Extreme dimensionality, streaming | Very large D | Random projection |
| Nonlinear reconstruction | Images, audio | Deep autoencoder |
| Generative latent space | Images, molecules | Variational autoencoder |
| Text, vocabulary reduction | Sparse term-document | Truncated SVD (LSA) or word embeddings |
| Interpretable feature subset | Need to keep original features | LASSO or tree-based importance |
| Single-cell RNA-seq visualization | High-dimensional sparse counts | PCA then UMAP |
| Source separation | Mixed signals | Independent component analysis |
When possible, run several methods, inspect the embeddings visually and with trustworthiness and continuity metrics, and validate on a downstream task. Do not trust any single embedding as the representation of your data.