See also: Machine learning terms
Dimensionality reduction, also known as dimension reduction, is a fundamental technique in machine learning and data analysis that reduces the number of features (variables) in a dataset while preserving its underlying structure and information. High-dimensional data arises naturally in fields ranging from genomics and computer vision to natural language processing and finance. Working directly with hundreds or thousands of features introduces statistical, computational, and interpretive challenges that dimensionality reduction is designed to address.
There are two primary approaches to dimensionality reduction: feature selection and feature extraction. Feature selection chooses a subset of the original features, while feature extraction constructs new features by combining or transforming the originals. Both approaches aim to retain the most informative aspects of the data in a lower-dimensional representation. The choice between them depends on the goals of the analysis, the nature of the data, and whether interpretability of individual features is required.
Imagine you have a big box of differently shaped and colored LEGO bricks. You want to build something cool, but there are so many pieces that it is hard to know where to start. Dimensionality reduction is like sorting those LEGO bricks by the things that matter most, maybe just their shape and color. Instead of keeping track of every tiny detail about each brick (its weight, its exact shade, whether it has scratches), you focus on just a couple of important things. Now you can see which bricks are similar and which are different, and building becomes much easier and faster.
The curse of dimensionality is a term coined by Richard Bellman in 1961 to describe a collection of problems that arise when working with high-dimensional data. As the number of dimensions (features) in a dataset increases, several issues emerge that make analysis, modeling, and computation significantly harder.
| Problem | Description |
|---|---|
| Data sparsity | As dimensions increase, the volume of the space grows exponentially, causing data points to become increasingly sparse. In high-dimensional space, even large datasets may have very few data points near any given point. |
| Distance concentration | In high dimensions, the distance between any two points tends to converge to similar values. This makes distance-based algorithms like k-nearest neighbors and k-means less effective, because the notion of "nearest neighbor" becomes less meaningful. |
| Overfitting | With many features relative to the number of samples, models can easily memorize training data rather than learning generalizable patterns. The more dimensions, the more parameters a model must learn, and the more training data is needed to estimate those parameters reliably. |
| Computational cost | Training time and memory requirements grow with the number of features. Algorithms with complexity proportional to the number of dimensions become impractical in very high-dimensional spaces. |
| Visualization difficulty | Humans can perceive at most three spatial dimensions. Visualizing data in hundreds or thousands of dimensions requires reducing to 2D or 3D. |
As a rule of thumb, for reliable statistical estimation, the number of samples needed grows exponentially with the number of dimensions. A dataset with 100 features may require vastly more samples than one with 10 features to achieve the same level of model accuracy.
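The distance-concentration effect described in the table above can be demonstrated in a few lines of NumPy and SciPy. The following sketch is illustrative only; the point count, the dimensions tested, and the random seed are arbitrary choices. It draws uniform random points and shows how the relative spread of pairwise distances shrinks as the dimension grows.

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)

# Distance concentration: as dimensionality grows, the relative spread of
# pairwise distances between random points shrinks toward zero.
for d in (2, 10, 100, 1000):
    X = rng.random((500, d))          # 500 points uniform in the d-dimensional unit cube
    dists = pdist(X)                  # all pairwise Euclidean distances
    spread = (dists.max() - dists.min()) / dists.mean()
    print(f"d={d:4d}  relative spread of distances = {spread:.3f}")
```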
Beyond mitigating the curse of dimensionality, dimensionality reduction serves several practical purposes, including visualization, noise reduction, data compression, and faster model training; these applications are discussed in more detail later in this article.
Feature selection identifies and retains only the most relevant features from the original dataset. Because the selected features remain in their original form, this approach preserves interpretability.
Filter methods evaluate the relevance of each feature independently, typically by measuring its relationship with the target variable. These methods are computationally efficient and do not rely on any specific machine learning model.
| Method | How it works | Strengths |
|---|---|---|
| Pearson correlation | Measures linear relationship between feature and target | Fast; works well for linearly related features |
| Mutual information | Measures the amount of information gained about the target from the feature | Captures non-linear relationships |
| Chi-squared test | Tests statistical independence between categorical features and the target | Well-suited for categorical data |
| ANOVA F-test | Compares the means of a feature across target classes | Good for selecting numerical features for classification tasks |
| Variance threshold | Removes features with variance below a specified level | Simple; removes constant or near-constant features |
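As a minimal sketch of filter-based selection with scikit-learn (the breast cancer dataset and the choice of k=10 features are arbitrary illustrations, not recommendations):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, VarianceThreshold, mutual_info_classif

X, y = load_breast_cancer(return_X_y=True)

# Drop (near-)constant features first.
X_var = VarianceThreshold(threshold=0.0).fit_transform(X)

# Keep the 10 features with the highest mutual information with the target.
selector = SelectKBest(score_func=mutual_info_classif, k=10)
X_selected = selector.fit_transform(X_var, y)
print(X.shape, "->", X_selected.shape)
```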
Wrapper methods use a specific machine learning model to evaluate the usefulness of feature subsets. The selection process is performed iteratively, adding or removing features until an optimal subset is found. While these methods can produce better results than filter methods, they are more computationally expensive.
Embedded methods incorporate feature selection as part of the training process of a machine learning model. Common examples are L1 (lasso) regularization, which drives the coefficients of irrelevant features to zero, and the feature importances computed by tree-based models.
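A brief sketch of both approaches in scikit-learn (the dataset, the target of 10 features, and the lasso regularization strength are illustrative assumptions):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE, SelectFromModel
from sklearn.linear_model import Lasso, LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_std = StandardScaler().fit_transform(X)

# Wrapper: recursive feature elimination, repeatedly dropping the weakest
# feature according to a logistic-regression model until 10 remain.
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=10)
X_rfe = rfe.fit_transform(X_std, y)

# Embedded: L1 (lasso) regularization zeroes out coefficients of irrelevant
# features; SelectFromModel keeps the features with non-zero weights.
embedded = SelectFromModel(Lasso(alpha=0.01))
X_lasso = embedded.fit_transform(X_std, y)

print(X.shape, "->", X_rfe.shape, "and", X_lasso.shape)
```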
Feature extraction, in contrast, constructs new features rather than selecting existing ones, and its techniques fall broadly into linear and non-linear families. Linear techniques assume that the data can be represented in a lower-dimensional space using linear combinations of the original features.
Principal component analysis (PCA) is the most widely used linear dimensionality reduction technique. Developed by Karl Pearson in 1901 and independently by Harold Hotelling in the 1930s, PCA identifies the directions of maximum variance in the data (called principal components) and projects the data onto these directions.
How PCA works:
1. Standardize (or at least center) the data so that each feature has zero mean.
2. Compute the covariance matrix of the features.
3. Compute the eigenvectors and eigenvalues of the covariance matrix (or, equivalently, the singular value decomposition of the centered data matrix).
4. Sort the eigenvectors by decreasing eigenvalue; these eigenvectors are the principal components.
5. Project the data onto the top k principal components to obtain the reduced representation.
The eigenvalue associated with each principal component represents the amount of variance explained by that component. The cumulative explained variance ratio helps determine how many components to retain. A common heuristic is to keep enough components to explain 95% of the total variance. Alternatively, a scree plot (eigenvalues plotted against component index) can reveal an "elbow" where adding more components yields diminishing returns.
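To make these steps concrete, the following NumPy-only sketch performs the eigendecomposition by hand and applies the 95% variance heuristic. The toy data is arbitrary; in practice the scikit-learn implementation shown later in this article is preferred.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))            # toy data: 200 samples, 5 features

# 1. Center the data (standardize as well if feature scales differ).
Xc = X - X.mean(axis=0)

# 2. Covariance matrix and its eigendecomposition.
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh: symmetric matrix, ascending order

# 3. Sort components by decreasing eigenvalue (variance explained).
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 4. Cumulative explained variance ratio: how many components to keep?
explained = np.cumsum(eigvals) / eigvals.sum()
k = int(np.searchsorted(explained, 0.95)) + 1
print("components needed for 95% variance:", k)

# 5. Project the data onto the top-k principal components.
X_reduced = Xc @ eigvecs[:, :k]
print(X.shape, "->", X_reduced.shape)
```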
| Property | Description |
|---|---|
| Type | Linear, unsupervised |
| Preserves | Global variance structure |
| Computational complexity | O(min(n^3, d^3)) for n samples and d dimensions |
| Scalability | Good; efficient implementations exist for large datasets (randomized SVD, incremental PCA) |
| Limitations | Cannot capture non-linear relationships; sensitive to feature scaling |
| Common variants | Kernel PCA, Incremental PCA, Sparse PCA, Robust PCA |
PCA is deterministic and produces interpretable results: each principal component is a linear combination of original features, and the loadings reveal which features contribute most. Because PCA relies on variance, it is important to standardize features to unit variance before applying it when features have different scales.
Linear discriminant analysis (LDA) is a supervised dimensionality reduction technique that finds the linear combinations of features that best separate two or more classes. Unlike PCA, which maximizes variance, LDA maximizes the ratio of between-class variance to within-class variance (the Fisher criterion). LDA can reduce the dimensionality to at most C-1 dimensions, where C is the number of classes.
LDA is particularly useful as a preprocessing step for classification tasks because it explicitly optimizes class separability. However, it assumes that each class has the same covariance matrix (homoscedasticity), and its performance degrades when this assumption is violated.
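A minimal scikit-learn sketch, using the Iris dataset purely as an illustration: with three classes, LDA can produce at most two discriminant axes.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

# With C = 3 classes, LDA yields at most C - 1 = 2 discriminant components.
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)
print(X.shape, "->", X_lda.shape)
```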
Factor analysis is a statistical method that explains variability among observed, correlated variables in terms of a smaller number of unobserved latent variables called factors. While PCA extracts components that maximize variance, factor analysis models the data as arising from latent factors plus noise.
The key distinction is conceptual. PCA is a data reduction technique that finds orthogonal directions of maximum variance. Factor analysis, by contrast, assumes a generative model: each observed variable is a linear combination of latent factors plus an error term specific to that variable. The correlation between a variable and a factor is called the variable's factor loading.
Factor analysis is widely used in psychology, social sciences, and marketing where the goal is to identify interpretable latent constructs (such as "intelligence" or "customer satisfaction") from a large set of measured variables.
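A short scikit-learn sketch of factor analysis; the Iris data and the choice of two factors are arbitrary assumptions, and the loadings matrix shows how strongly each observed variable relates to each latent factor.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import FactorAnalysis
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X_std = StandardScaler().fit_transform(X)

# Model the 4 observed variables as arising from 2 latent factors plus noise.
fa = FactorAnalysis(n_components=2, random_state=0)
X_factors = fa.fit_transform(X_std)
print("factor loadings (variables x factors):\n", fa.components_.T)
```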
Singular value decomposition (SVD) factorizes a matrix into three component matrices: U, Sigma, and V^T. Truncated SVD retains only the top k singular values and their corresponding vectors, providing a low-rank approximation of the data. SVD is closely related to PCA and is particularly useful for sparse data, since it does not require computing the full covariance matrix. Truncated SVD is the standard approach for latent semantic analysis (LSA) in natural language processing.
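A tiny LSA-style sketch with scikit-learn; the toy documents and the rank of 2 are illustrative assumptions.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "stock markets fell sharply today",
    "investors worry about interest rates",
]

# Sparse TF-IDF term-document matrix, then a rank-2 approximation (a tiny LSA).
tfidf = TfidfVectorizer().fit_transform(docs)
svd = TruncatedSVD(n_components=2, random_state=0)
doc_topics = svd.fit_transform(tfidf)
print(tfidf.shape, "->", doc_topics.shape)
```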
Random projection is a computationally efficient, data-agnostic method for dimensionality reduction. It works by projecting data onto a randomly generated lower-dimensional subspace. The theoretical foundation is the Johnson-Lindenstrauss lemma (1984), which states that a set of n points in high-dimensional Euclidean space can be embedded into a space of dimension O(log(n) / epsilon^2) such that all pairwise distances are preserved within a factor of (1 +/- epsilon).
The key advantage of random projections is that the projection matrix depends only on the number of points and the desired distortion tolerance, not on the data itself. This makes random projections extremely fast and suitable for very large, high-dimensional datasets where computing eigenvectors would be prohibitively expensive.
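scikit-learn exposes the Johnson-Lindenstrauss bound directly. The sketch below shows how weakly the required target dimension depends on the number of points; eps = 0.1 is an arbitrary distortion tolerance.

```python
from sklearn.random_projection import johnson_lindenstrauss_min_dim

# Minimum target dimension that preserves all pairwise distances among
# n points to within +/- eps, regardless of the original dimensionality.
for n in (1_000, 100_000, 10_000_000):
    k = johnson_lindenstrauss_min_dim(n_samples=n, eps=0.1)
    print(f"n = {n:>10,d}  ->  target dimension ~ {k}")
```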
Two common implementations are available in scikit-learn: GaussianRandomProjection, which uses a dense Gaussian random matrix, and SparseRandomProjection, which uses a sparse random matrix for additional memory and speed savings.
Independent component analysis (ICA) is a technique that separates a multivariate signal into additive, statistically independent components. While PCA finds uncorrelated components that maximize variance, ICA finds components that are as statistically independent as possible, using measures of non-Gaussianity.
ICA was originally developed for blind source separation, the classic example being the "cocktail party problem" where multiple speech signals must be separated from a recording of overlapping conversations. In the context of dimensionality reduction, ICA is often applied to fMRI data, EEG signals, and financial time series where the underlying sources are assumed to be independent.
Common algorithms include FastICA (Hyvarinen and Oja, 1997) and JADE. ICA is typically preceded by PCA to reduce the number of dimensions before the independence optimization.
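A minimal blind-source-separation sketch with scikit-learn's FastICA; the two synthetic sources and the mixing matrix are made up purely for illustration.

```python
import numpy as np
from sklearn.decomposition import FastICA

t = np.linspace(0, 8, 2000)

# Two independent sources (a sine wave and a square wave), mixed linearly
# into two observed signals -- a toy "cocktail party" setup.
S = np.c_[np.sin(2 * t), np.sign(np.sin(3 * t))]
A = np.array([[1.0, 0.5], [0.5, 2.0]])        # mixing matrix
X_mixed = S @ A.T

ica = FastICA(n_components=2, random_state=0)
S_estimated = ica.fit_transform(X_mixed)      # recovered sources (up to scale/order)
print(S_estimated.shape)
```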
Non-linear techniques are employed when the data's underlying structure lies on a curved manifold that cannot be adequately represented by linear combinations.
t-SNE, introduced by Laurens van der Maaten and Geoffrey Hinton in 2008, is a non-linear dimensionality reduction technique designed primarily for visualization of high-dimensional data in two or three dimensions.
How t-SNE works:
1. For each pair of points in the high-dimensional space, compute a conditional probability that one point would pick the other as a neighbor, using a Gaussian kernel whose bandwidth is set per point from the perplexity parameter.
2. Define an analogous set of pairwise similarities for the points in the low-dimensional embedding, using a heavy-tailed Student t-distribution with one degree of freedom.
3. Adjust the positions of the embedded points by gradient descent to minimize the Kullback-Leibler (KL) divergence between the two distributions.
The use of the t-distribution in the low-dimensional space helps alleviate the "crowding problem," where points that are moderately distant in high-dimensional space are forced too close together in the lower-dimensional embedding.
| Property | Description |
|---|---|
| Type | Non-linear, unsupervised |
| Preserves | Local neighborhood structure |
| Key hyperparameter | Perplexity (roughly corresponds to the number of nearest neighbors considered; typical values range from 5 to 50) |
| Computational complexity | O(n^2) for naive implementation; O(n log n) with Barnes-Hut approximation |
| Limitations | Non-deterministic (random initialization); does not preserve global structure well; cannot be applied to new data points without re-running; slow for very large datasets |
Important caveats when interpreting t-SNE plots: the sizes of clusters do not indicate their density, the distances between clusters do not reflect their true separation, and different runs with different random seeds can produce visually different results. It is advisable to run t-SNE multiple times and to experiment with different perplexity values.
UMAP, introduced by Leland McInnes, John Healy, and James Melville in 2018, is a non-linear dimensionality reduction technique grounded in Riemannian geometry and algebraic topology. UMAP constructs a topological representation of the high-dimensional data and then optimizes a low-dimensional embedding that preserves this structure.
How UMAP works:
1. Build a weighted k-nearest-neighbor graph of the data and convert it into a fuzzy topological representation (a fuzzy simplicial set) of the high-dimensional manifold.
2. Construct an analogous fuzzy topological representation of a candidate low-dimensional embedding.
3. Optimize the embedding coordinates, typically with stochastic gradient descent, to minimize the cross-entropy between the two representations.
UMAP is founded on three core assumptions about the data: (1) the data is uniformly distributed on a Riemannian manifold, (2) the Riemannian metric is locally constant or can be approximated as such, and (3) the manifold is locally connected.
| Property | Description |
|---|---|
| Type | Non-linear, unsupervised |
| Preserves | Both local and global structure (better global preservation than t-SNE) |
| Key hyperparameters | n_neighbors (controls local vs. global balance), min_dist (controls tightness of embedding) |
| Computational complexity | O(n^1.14) with approximate nearest neighbor search |
| Advantages over t-SNE | Faster; better preservation of global structure; supports embedding into arbitrary dimensions; can transform new data points |
The following table compares the three most commonly used dimensionality reduction techniques for different use cases.
| Aspect | PCA | t-SNE | UMAP |
|---|---|---|---|
| Type | Linear | Non-linear | Non-linear |
| Speed | Very fast | Slow on large datasets | Fast |
| Global structure | Well preserved | Poorly preserved | Better preserved than t-SNE |
| Local structure | Not emphasized | Excellent | Excellent |
| Embedding dimensions | Any | Typically 2 or 3 | Any |
| New data points | Supports transform | Requires re-running on full dataset | Supports transform |
| Determinism | Deterministic | Non-deterministic | Non-deterministic (more stable with same seed) |
| Interpretability | High (loadings are meaningful) | Low (distances between clusters not meaningful) | Low (distances between clusters partially meaningful) |
| Best for | Preprocessing, noise reduction, when interpretability matters | Visualizing cluster structure in 2D | Visualization, preprocessing, large-scale analysis |
| Theoretical basis | Linear algebra (eigendecomposition) | Information theory (KL divergence) | Riemannian geometry and algebraic topology |
Autoencoders are a class of neural networks that learn to compress data into a lower-dimensional representation (the latent space) and then reconstruct the original data from this compressed form. An autoencoder consists of two parts: an encoder, which maps the input to the latent representation, and a decoder, which reconstructs the input from that representation.
The bottleneck layer (the layer with the fewest neurons) defines the latent space. The network is trained to minimize the reconstruction error (the difference between the input and the reconstructed output), which forces the latent space to capture the most important features of the data.
Advantages over PCA:
- Non-linear activation functions allow autoencoders to capture complex, non-linear structure that PCA cannot.
- They scale to very large datasets through minibatch training on GPUs.
- The architecture is flexible: convolutional or recurrent encoders can exploit the structure of images, sequences, and other data types.
Disadvantages:
- They require substantially more data, computation, and hyperparameter tuning than PCA.
- Training is non-deterministic and can converge to different solutions on different runs.
- The learned latent dimensions are generally harder to interpret than principal components.
A simple linear autoencoder with one hidden layer and mean squared error loss learns the same subspace as PCA. The power of autoencoders comes from using non-linear activation functions and deeper architectures, which allow them to learn complex, non-linear manifold structures.
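The following sketch outlines a small fully-connected autoencoder in Keras, assuming TensorFlow is available. The 784-dimensional input (e.g. flattened 28x28 images scaled to [0, 1]), the layer sizes, and the hypothetical X_train array are illustrative assumptions, not a reference architecture.

```python
import tensorflow as tf

# Illustrative sizes: 784-dimensional inputs compressed to a 32-dimensional latent space.
input_dim, latent_dim = 784, 32

encoder = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(input_dim,)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(latent_dim, activation="relu"),   # bottleneck layer
])
decoder = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(latent_dim,)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(input_dim, activation="sigmoid"),
])

autoencoder = tf.keras.Sequential([encoder, decoder])
autoencoder.compile(optimizer="adam", loss="mse")   # minimize reconstruction error

# Training uses the input as its own target (X_train is a hypothetical array):
# autoencoder.fit(X_train, X_train, epochs=20, batch_size=256)
# X_latent = encoder.predict(X_train)   # the reduced representation
```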
Kernel PCA (Scholkopf, Smola, and Muller, 1998) extends PCA to non-linear data by first mapping the data into a higher-dimensional feature space via a kernel function and then performing PCA in that space. Common kernels include the radial basis function (RBF), polynomial, and sigmoid kernels. The kernel trick avoids explicitly computing the high-dimensional mapping, making the method computationally tractable.
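A brief scikit-learn sketch on a toy dataset of two concentric circles, a structure that ordinary PCA cannot separate but an RBF kernel can; the gamma value is an arbitrary choice.

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

# Two concentric circles: a non-linear structure in 2D.
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10)
X_kpca = kpca.fit_transform(X)
print(X_kpca.shape)
```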
Several classical manifold learning algorithms assume the data lies on or near a low-dimensional manifold embedded in the high-dimensional space.
| Technique | Year | Key idea |
|---|---|---|
| Multidimensional scaling (MDS) | 1952 | Preserves pairwise distances between points in the lower-dimensional space |
| Isomap | 2000 | Preserves geodesic distances on the data manifold using shortest-path graph distances |
| Locally linear embedding (LLE) | 2000 | Preserves local linear relationships between neighboring points |
| Laplacian eigenmaps | 2001 | Uses the graph Laplacian to find a low-dimensional embedding that preserves local distances |
| Diffusion maps | 2006 | Uses random walks on a graph to capture the multi-scale geometry of the data |
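As a sketch, scikit-learn's manifold module applies these ideas to the classic swiss-roll dataset (a 2D sheet curled up in 3D); the neighbor count of 10 is an arbitrary choice.

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap, LocallyLinearEmbedding

# The swiss roll: a 2D manifold embedded in 3D space.
X, _ = make_swiss_roll(n_samples=1000, random_state=0)

X_iso = Isomap(n_neighbors=10, n_components=2).fit_transform(X)
X_lle = LocallyLinearEmbedding(n_neighbors=10, n_components=2).fit_transform(X)
print(X.shape, "->", X_iso.shape, X_lle.shape)
```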
Recent work has introduced methods that aim to better balance the preservation of local and global structure.
Understanding the difference between feature selection and feature extraction is important for choosing the right approach.
| Aspect | Feature selection | Feature extraction |
|---|---|---|
| Output features | Subset of original features | New, transformed features |
| Interpretability | High (original feature meanings preserved) | Low to moderate (new features are combinations of originals) |
| Information loss | May discard useful information | Retains most information in fewer dimensions |
| Computational cost | Varies (filter methods are fast; wrapper methods are slow) | Varies (PCA is fast; autoencoders are slow) |
| Handling redundancy | Removes redundant features | Combines redundant features into compact representations |
| Best for | When interpretability is important; when features have clear domain meaning | When features are highly correlated; when visualization is the goal |
| Reversibility | Retained features keep their original values; discarded features cannot be recovered | Approximate reconstruction of the original features is possible (e.g., PCA inverse transform) |
In practice, feature selection and feature extraction are often complementary. A common pipeline uses feature engineering to create candidate features, feature selection to remove clearly irrelevant ones, and then PCA or another extraction method to reduce redundancy among the remaining features.
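One way such a pipeline might look in scikit-learn (the dataset, the ordering of steps, and the final classifier are illustrative assumptions):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.feature_selection import VarianceThreshold
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Selection (drop uninformative features), then extraction (PCA), then a classifier.
pipe = Pipeline([
    ("variance", VarianceThreshold(threshold=0.0)),
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=0.95)),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X, y)
print("training accuracy:", pipe.score(X, y))
```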
Selecting a dimensionality reduction technique is not a one-size-fits-all decision. The choice depends on the goals of the analysis, data characteristics, and computational constraints.
| Criterion | Recommendation |
|---|---|
| Goal is visualization | Use t-SNE or UMAP for 2D scatter plots. Use PCA if you also need interpretable axes. |
| Goal is preprocessing for a downstream model | Use PCA (or truncated SVD for sparse data). UMAP can also serve as a preprocessing step. |
| Data is linearly separable or relationships are linear | PCA or LDA (if labels are available). |
| Data lies on a non-linear manifold | UMAP, t-SNE, Isomap, or autoencoders. |
| Dataset is very large (millions of samples) | PCA (fast), random projections (very fast), or UMAP (scales well). Avoid t-SNE unless using accelerated implementations. |
| Interpretability is critical | PCA (loadings are interpretable) or feature selection methods. |
| Need to transform new data points | PCA, UMAP, random projections, or autoencoders. Avoid t-SNE and classical manifold methods that lack out-of-sample extensions. |
| Supervised setting with class labels | LDA for linear reduction; supervised UMAP for non-linear reduction. |
| Very high-dimensional sparse data (text, counts) | Truncated SVD (LSA) or sparse random projections. |
A practical workflow is to start with PCA to get a quick baseline and understand the linear structure, then try UMAP or t-SNE if the PCA visualization does not reveal clear patterns. When PCA is used as preprocessing before t-SNE, reducing to 50 dimensions first can significantly speed up t-SNE without losing important structure.
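Sketched in code, assuming X_scaled is a standardized feature matrix with more than 50 features:

```python
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Step 1: PCA down to 50 dimensions removes noise and speeds up t-SNE.
X_pca = PCA(n_components=50).fit_transform(X_scaled)

# Step 2: t-SNE on the PCA output for a 2D visualization.
X_2d = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(X_pca)
```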
Assessing how well a dimensionality reduction method has performed is essential but not straightforward, because the "correct" low-dimensional representation is usually unknown. Several metrics and approaches exist.
For methods that support reconstruction (PCA, autoencoders), the mean squared error between the original data and its reconstruction from the reduced representation is a natural quality measure. Lower reconstruction error indicates better preservation of the original information.
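A short sketch of this measure with PCA, assuming X_scaled is a standardized data matrix; the choice of 10 components is arbitrary.

```python
import numpy as np
from sklearn.decomposition import PCA

# Reconstruct the data from its 10-dimensional PCA representation.
pca = PCA(n_components=10).fit(X_scaled)
X_reconstructed = pca.inverse_transform(pca.transform(X_scaled))

# Mean squared reconstruction error: lower means less information lost.
reconstruction_mse = np.mean((X_scaled - X_reconstructed) ** 2)
print(f"reconstruction MSE with 10 components: {reconstruction_mse:.4f}")
```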
Specific to PCA, the cumulative explained variance ratio measures the fraction of total variance captured by the selected components. A ratio of 0.95 means the chosen components explain 95% of the variance in the data.
Trustworthiness and continuity, introduced by Venna and Kaski (2006), evaluate local neighborhood preservation. Trustworthiness penalizes points that appear as neighbors in the embedding but were not neighbors in the original space, while continuity penalizes original neighbors that end up separated in the embedding.
Scikit-learn provides a trustworthiness function that computes this metric directly.
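A sketch of its use, assuming X_scaled is a standardized high-dimensional data matrix:

```python
from sklearn.manifold import TSNE, trustworthiness

# Embed the data in 2D, then check how well local neighborhoods survived.
X_2d = TSNE(n_components=2, random_state=42).fit_transform(X_scaled)

# Returns a score between 0 and 1; values close to 1 indicate few "false
# neighbors" were introduced by the embedding.
score = trustworthiness(X_scaled, X_2d, n_neighbors=5)
print(f"trustworthiness: {score:.3f}")
```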
Often the most practical evaluation is to measure how well the reduced data performs in a downstream task. For example, if dimensionality reduction is used as a preprocessing step for clustering, the quality of the resulting clusters (measured by silhouette score, adjusted Rand index, or similar metrics) provides an indirect assessment of the reduction quality.
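A sketch of this idea, assuming X_scaled is a standardized data matrix and three clusters is an arbitrary guess:

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

# Compare clustering quality at several levels of reduction.
for n_components in (2, 10, 50):
    X_red = PCA(n_components=n_components).fit_transform(X_scaled)
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_red)
    print(n_components, "components -> silhouette:",
          round(silhouette_score(X_red, labels), 3))
```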
The co-ranking matrix (Lee and Verleysen, 2009) compares the rank-order of neighbors in the original and reduced spaces. The local continuity meta-criterion (LCMC) summarizes the co-ranking matrix into a single score, capturing both local and global preservation.
Scikit-learn provides implementations of many dimensionality reduction techniques under the sklearn.decomposition and sklearn.manifold modules.
```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# X: a (n_samples, n_features) data matrix (assumed to be defined)

# Standardize features before PCA
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Fit PCA retaining 95% of variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)

print(f"Reduced from {X.shape[1]} to {X_reduced.shape[1]} dimensions")
print(f"Explained variance: {pca.explained_variance_ratio_.sum():.3f}")
```
t-SNE is available in sklearn.manifold and follows the same fit_transform pattern:

```python
from sklearn.manifold import TSNE

# Reduce to 2D for visualization
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_2d = tsne.fit_transform(X_scaled)
```
UMAP is available through the umap-learn package rather than scikit-learn, but it follows the same API.
```python
import umap

reducer = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1, random_state=42)
X_2d = reducer.fit_transform(X_scaled)

# Transform new data points
X_new_2d = reducer.transform(X_new)
```
Sparse random projection, from sklearn.random_projection, also follows the fit_transform pattern:

```python
from sklearn.random_projection import SparseRandomProjection

transformer = SparseRandomProjection(n_components=100, random_state=42)
X_reduced = transformer.fit_transform(X)
```
Dimensionality reduction is used across many domains.
One of the most common applications is projecting high-dimensional data into 2D or 3D for visualization. t-SNE and UMAP are especially popular for visualizing clusters in single-cell RNA sequencing data, word embeddings, image features, and other complex datasets.
By projecting data onto its principal components or a learned latent space, dimensionality reduction can filter out noise that exists in minor dimensions. In image processing, retaining only the top principal components removes high-frequency noise while preserving the essential structure of the image.
Dimensionality reduction is used in data and image compression. PCA and autoencoders can reduce the storage requirements for large datasets by encoding them in fewer dimensions, with the option to approximately reconstruct the original data when needed.
Reducing the number of features before training a model can improve training speed, reduce overfitting, and sometimes improve model performance. PCA is commonly applied as a preprocessing step for support vector machines and other algorithms sensitive to the curse of dimensionality.
In genomics, datasets may contain expression levels for tens of thousands of genes across relatively few samples. Dimensionality reduction is essential for identifying meaningful patterns, such as cell types in single-cell RNA-seq analysis, where UMAP has become the standard visualization tool.
Collaborative filtering and matrix factorization methods for recommender systems rely on dimensionality reduction to identify latent factors that explain user preferences and item characteristics. SVD and non-negative matrix factorization (NMF) are the backbone of many recommendation algorithms.
In NLP, text data is often represented as high-dimensional vectors (one dimension per vocabulary term). Latent semantic analysis (via truncated SVD) reduces this space to capture semantic relationships between terms. More recently, learned embeddings from neural models project words and documents into dense, lower-dimensional spaces.
Dimensionality reduction can reveal outliers that are not apparent in the full feature space. Points that have high reconstruction error under PCA or an autoencoder, or that appear isolated in a t-SNE/UMAP plot, may be anomalies worth investigating.