See also: Machine learning terms
Dimensionality reduction, also known as dimension reduction, is a fundamental technique in machine learning and data analysis that reduces the number of features (variables) in a dataset while preserving its underlying structure and information. High-dimensional data arises naturally in fields ranging from genomics and computer vision to natural language processing and finance. Working directly with hundreds or thousands of features introduces statistical, computational, and interpretive challenges that dimensionality reduction is designed to address.
There are two primary approaches to dimensionality reduction: feature selection and feature extraction. Feature selection chooses a subset of the original features, while feature extraction constructs new features by combining or transforming the originals. Both approaches aim to retain the most informative aspects of the data in a lower-dimensional representation. The choice between them depends on the goals of the analysis, the nature of the data, and whether interpretability of individual features is required.
Imagine you have a big box of differently shaped and colored LEGO bricks. You want to build something cool, but there are so many pieces that it is hard to know where to start. Dimensionality reduction is like sorting those LEGO bricks by the things that matter most, maybe just their shape and color. Instead of keeping track of every tiny detail about each brick (its weight, its exact shade, whether it has scratches), you focus on just a couple of important things. Now you can see which bricks are similar and which are different, and building becomes much easier and faster.
The curse of dimensionality is a term coined by Richard Bellman in 1961 to describe a collection of problems that arise when working with high-dimensional data. As the number of dimensions (features) in a dataset increases, several issues emerge that make analysis, modeling, and computation significantly harder.
| Problem | Description |
|---|---|
| Data sparsity | As dimensions increase, the volume of the space grows exponentially, causing data points to become increasingly sparse. In high-dimensional space, even large datasets may have very few data points near any given point. |
| Distance concentration | In high dimensions, the distance between any two points tends to converge to similar values. This makes distance-based algorithms like k-nearest neighbors and k-means less effective, because the notion of "nearest neighbor" becomes less meaningful. |
| Overfitting | With many features relative to the number of samples, models can easily memorize training data rather than learning generalizable patterns. The more dimensions, the more parameters a model must learn, and the more training data is needed to estimate those parameters reliably. |
| Computational cost | Training time and memory requirements grow with the number of features. Algorithms with complexity proportional to the number of dimensions become impractical in very high-dimensional spaces. |
| Visualization difficulty | Humans can perceive at most three spatial dimensions. Visualizing data in hundreds or thousands of dimensions requires reducing to 2D or 3D. |
As a rule of thumb, for reliable statistical estimation, the number of samples needed grows exponentially with the number of dimensions. A dataset with 100 features may require vastly more samples than one with 10 features to achieve the same level of model accuracy.
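The distance-concentration effect described in the table above can be demonstrated in a few lines of NumPy and SciPy. The following sketch is illustrative only; the point count, the dimensions tested, and the random seed are arbitrary choices. It draws uniform random points and shows how the relative spread of pairwise distances shrinks as the dimension grows.

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)

# Distance concentration: as dimensionality grows, the relative spread of
# pairwise distances between random points shrinks toward zero.
for d in (2, 10, 100, 1000):
    X = rng.random((500, d))          # 500 points uniform in the d-dimensional unit cube
    dists = pdist(X)                  # all pairwise Euclidean distances
    spread = (dists.max() - dists.min()) / dists.mean()
    print(f"d={d:4d}  relative spread of distances = {spread:.3f}")
```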
Beyond mitigating the curse of dimensionality, dimensionality reduction serves several practical purposes, including visualization, noise reduction, data compression, and faster model training; these applications are discussed in more detail later in this article.
Feature selection identifies and retains only the most relevant features from the original dataset. Because the selected features remain in their original form, this approach preserves interpretability.
Filter methods evaluate the relevance of each feature independently, typically by measuring its relationship with the target variable. These methods are computationally efficient and do not rely on any specific machine learning model.
| Method | How it works | Strengths |
|---|---|---|
| Pearson correlation | Measures linear relationship between feature and target | Fast; works well for linearly related features |
| Mutual information | Measures the amount of information gained about the target from the feature | Captures non-linear relationships |
| Chi-squared test | Tests statistical independence between categorical features and the target | Well-suited for categorical data |
| ANOVA F-test | Compares the means of a feature across target classes | Good for selecting numerical features for classification tasks |
| Variance threshold | Removes features with variance below a specified level | Simple; removes constant or near-constant features |
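As a minimal sketch of filter-based selection with scikit-learn (the breast cancer dataset and the choice of k=10 features are arbitrary illustrations, not recommendations):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, VarianceThreshold, mutual_info_classif

X, y = load_breast_cancer(return_X_y=True)

# Drop (near-)constant features first.
X_var = VarianceThreshold(threshold=0.0).fit_transform(X)

# Keep the 10 features with the highest mutual information with the target.
selector = SelectKBest(score_func=mutual_info_classif, k=10)
X_selected = selector.fit_transform(X_var, y)
print(X.shape, "->", X_selected.shape)
```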
Wrapper methods use a specific machine learning model to evaluate the usefulness of feature subsets. The selection process is performed iteratively, adding or removing features until an optimal subset is found. While these methods can produce better results than filter methods, they are more computationally expensive.
Embedded methods incorporate feature selection as part of the training process of a machine learning model. Common examples are L1 (lasso) regularization, which drives the coefficients of irrelevant features to zero, and the feature importances computed by tree-based models.
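A brief sketch of both approaches in scikit-learn (the dataset, the target of 10 features, and the lasso regularization strength are illustrative assumptions):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE, SelectFromModel
from sklearn.linear_model import Lasso, LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_std = StandardScaler().fit_transform(X)

# Wrapper: recursive feature elimination, repeatedly dropping the weakest
# feature according to a logistic-regression model until 10 remain.
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=10)
X_rfe = rfe.fit_transform(X_std, y)

# Embedded: L1 (lasso) regularization zeroes out coefficients of irrelevant
# features; SelectFromModel keeps the features with non-zero weights.
embedded = SelectFromModel(Lasso(alpha=0.01))
X_lasso = embedded.fit_transform(X_std, y)

print(X.shape, "->", X_rfe.shape, "and", X_lasso.shape)
```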
Feature extraction, in contrast, constructs new features rather than selecting existing ones, and its techniques fall broadly into linear and non-linear families. Linear techniques assume that the data can be represented in a lower-dimensional space using linear combinations of the original features.
Principal component analysis (PCA) is the most widely used linear dimensionality reduction technique. Developed by Karl Pearson in 1901 and independently by Harold Hotelling in the 1930s, PCA identifies the directions of maximum variance in the data (called principal components) and projects the data onto these directions.
How PCA works:
1. Standardize (or at least center) the data so that each feature has zero mean.
2. Compute the covariance matrix of the features.
3. Compute the eigenvectors and eigenvalues of the covariance matrix (or, equivalently, the singular value decomposition of the centered data matrix).
4. Sort the eigenvectors by decreasing eigenvalue; these eigenvectors are the principal components.
5. Project the data onto the top k principal components to obtain the reduced representation.
The eigenvalue associated with each principal component represents the amount of variance explained by that component. The cumulative explained variance ratio helps determine how many components to retain. A common heuristic is to keep enough components to explain 95% of the total variance. Alternatively, a scree plot (eigenvalues plotted against component index) can reveal an "elbow" where adding more components yields diminishing returns.
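To make these steps concrete, the following NumPy-only sketch performs the eigendecomposition by hand and applies the 95% variance heuristic. The toy data is arbitrary; in practice the scikit-learn implementation shown later in this article is preferred.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))            # toy data: 200 samples, 5 features

# 1. Center the data (standardize as well if feature scales differ).
Xc = X - X.mean(axis=0)

# 2. Covariance matrix and its eigendecomposition.
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh: symmetric matrix, ascending order

# 3. Sort components by decreasing eigenvalue (variance explained).
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 4. Cumulative explained variance ratio: how many components to keep?
explained = np.cumsum(eigvals) / eigvals.sum()
k = int(np.searchsorted(explained, 0.95)) + 1
print("components needed for 95% variance:", k)

# 5. Project the data onto the top-k principal components.
X_reduced = Xc @ eigvecs[:, :k]
print(X.shape, "->", X_reduced.shape)
```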
| Property | Description |
|---|---|
| Type | Linear, unsupervised |
| Preserves | Global variance structure |
| Computational complexity | O(min(n^3, d^3)) for n samples and d dimensions |
| Scalability | Good; efficient implementations exist for large datasets (randomized SVD, incremental PCA) |
| Limitations | Cannot capture non-linear relationships; sensitive to feature scaling |
| Common variants | Kernel PCA, Incremental PCA, Sparse PCA, Robust PCA |
PCA is deterministic and produces interpretable results: each principal component is a linear combination of original features, and the loadings reveal which features contribute most. Because PCA relies on variance, it is important to standardize features to unit variance before applying it when features have different scales.
Linear discriminant analysis (LDA) is a supervised dimensionality reduction technique that finds the linear combinations of features that best separate two or more classes. Unlike PCA, which maximizes variance, LDA maximizes the ratio of between-class variance to within-class variance (the Fisher criterion). LDA can reduce the dimensionality to at most C-1 dimensions, where C is the number of classes.
LDA is particularly useful as a preprocessing step for classification tasks because it explicitly optimizes class separability. However, it assumes that each class has the same covariance matrix (homoscedasticity), and its performance degrades when this assumption is violated.
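A minimal scikit-learn sketch, using the Iris dataset purely as an illustration: with three classes, LDA can produce at most two discriminant axes.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

# With C = 3 classes, LDA yields at most C - 1 = 2 discriminant components.
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)
print(X.shape, "->", X_lda.shape)
```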
Factor analysis is a statistical method that explains variability among observed, correlated variables in terms of a smaller number of unobserved latent variables called factors. While PCA extracts components that maximize variance, factor analysis models the data as arising from latent factors plus noise.
The key distinction is conceptual. PCA is a data reduction technique that finds orthogonal directions of maximum variance. Factor analysis, by contrast, assumes a generative model: each observed variable is a linear combination of latent factors plus an error term specific to that variable. The correlation between a variable and a factor is called the variable's factor loading.
Factor analysis is widely used in psychology, social sciences, and marketing where the goal is to identify interpretable latent constructs (such as "intelligence" or "customer satisfaction") from a large set of measured variables.
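A short scikit-learn sketch of factor analysis; the Iris data and the choice of two factors are arbitrary assumptions, and the loadings matrix shows how strongly each observed variable relates to each latent factor.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import FactorAnalysis
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X_std = StandardScaler().fit_transform(X)

# Model the 4 observed variables as arising from 2 latent factors plus noise.
fa = FactorAnalysis(n_components=2, random_state=0)
X_factors = fa.fit_transform(X_std)
print("factor loadings (variables x factors):\n", fa.components_.T)
```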
Singular value decomposition (SVD) factorizes a matrix into three component matrices: U, Sigma, and V^T. Truncated SVD retains only the top k singular values and their corresponding vectors, providing a low-rank approximation of the data. SVD is closely related to PCA and is particularly useful for sparse data, since it does not require computing the full covariance matrix. Truncated SVD is the standard approach for latent semantic analysis (LSA) in natural language processing.
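A tiny LSA-style sketch with scikit-learn; the toy documents and the rank of 2 are illustrative assumptions.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "stock markets fell sharply today",
    "investors worry about interest rates",
]

# Sparse TF-IDF term-document matrix, then a rank-2 approximation (a tiny LSA).
tfidf = TfidfVectorizer().fit_transform(docs)
svd = TruncatedSVD(n_components=2, random_state=0)
doc_topics = svd.fit_transform(tfidf)
print(tfidf.shape, "->", doc_topics.shape)
```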
Random projection is a computationally efficient, data-agnostic method for dimensionality reduction. It works by projecting data onto a randomly generated lower-dimensional subspace. The theoretical foundation is the Johnson-Lindenstrauss lemma (1984), which states that a set of n points in high-dimensional Euclidean space can be embedded into a space of dimension O(log(n) / epsilon^2) such that all pairwise distances are preserved within a factor of (1 +/- epsilon).
The key advantage of random projections is that the projection matrix depends only on the number of points and the desired distortion tolerance, not on the data itself. This makes random projections extremely fast and suitable for very large, high-dimensional datasets where computing eigenvectors would be prohibitively expensive.
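scikit-learn exposes the Johnson-Lindenstrauss bound directly. The sketch below shows how weakly the required target dimension depends on the number of points; eps = 0.1 is an arbitrary distortion tolerance.

```python
from sklearn.random_projection import johnson_lindenstrauss_min_dim

# Minimum target dimension that preserves all pairwise distances among
# n points to within +/- eps, regardless of the original dimensionality.
for n in (1_000, 100_000, 10_000_000):
    k = johnson_lindenstrauss_min_dim(n_samples=n, eps=0.1)
    print(f"n = {n:>10,d}  ->  target dimension ~ {k}")
```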
Two common implementations are available in scikit-learn: GaussianRandomProjection, which uses a dense Gaussian random matrix, and SparseRandomProjection, which uses a sparse random matrix for additional memory and speed savings.
Independent component analysis (ICA) is a technique that separates a multivariate signal into additive, statistically independent components. While PCA finds uncorrelated components that maximize variance, ICA finds components that are as statistically independent as possible, using measures of non-Gaussianity.
ICA was originally developed for blind source separation, the classic example being the "cocktail party problem" where multiple speech signals must be separated from a recording of overlapping conversations. In the context of dimensionality reduction, ICA is often applied to fMRI data, EEG signals, and financial time series where the underlying sources are assumed to be independent.
Common algorithms include FastICA (Hyvarinen and Oja, 1997) and JADE. ICA is typically preceded by PCA to reduce the number of dimensions before the independence optimization.
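A minimal blind-source-separation sketch with scikit-learn's FastICA; the two synthetic sources and the mixing matrix are made up purely for illustration.

```python
import numpy as np
from sklearn.decomposition import FastICA

t = np.linspace(0, 8, 2000)

# Two independent sources (a sine wave and a square wave), mixed linearly
# into two observed signals -- a toy "cocktail party" setup.
S = np.c_[np.sin(2 * t), np.sign(np.sin(3 * t))]
A = np.array([[1.0, 0.5], [0.5, 2.0]])        # mixing matrix
X_mixed = S @ A.T

ica = FastICA(n_components=2, random_state=0)
S_estimated = ica.fit_transform(X_mixed)      # recovered sources (up to scale/order)
print(S_estimated.shape)
```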
Non-linear techniques are employed when the data's underlying structure lies on a curved manifold that cannot be adequately represented by linear combinations.
t-SNE, introduced by Laurens van der Maaten and Geoffrey Hinton in 2008, is a non-linear dimensionality reduction technique designed primarily for visualization of high-dimensional data in two or three dimensions.
How t-SNE works:
1. For each pair of points in the high-dimensional space, compute a conditional probability that one point would pick the other as a neighbor, using a Gaussian kernel whose bandwidth is set per point from the perplexity parameter.
2. Define an analogous set of pairwise similarities for the points in the low-dimensional embedding, using a heavy-tailed Student t-distribution with one degree of freedom.
3. Adjust the positions of the embedded points by gradient descent to minimize the Kullback-Leibler (KL) divergence between the two distributions.
The use of the t-distribution in the low-dimensional space helps alleviate the "crowding problem," where points that are moderately distant in high-dimensional space are forced too close together in the lower-dimensional embedding.
| Property | Description |
|---|---|
| Type | Non-linear, unsupervised |
| Preserves | Local neighborhood structure |
| Key hyperparameter | Perplexity (roughly corresponds to the number of nearest neighbors considered; typical values range from 5 to 50) |
| Computational complexity | O(n^2) for naive implementation; O(n log n) with Barnes-Hut approximation |
| Limitations | Non-deterministic (random initialization); does not preserve global structure well; cannot be applied to new data points without re-running; slow for very large datasets |
Important caveats when interpreting t-SNE plots: the sizes of clusters do not indicate their density, the distances between clusters do not reflect their true separation, and different runs with different random seeds can produce visually different results. It is advisable to run t-SNE multiple times and to experiment with different perplexity values.
UMAP, introduced by Leland McInnes, John Healy, and James Melville in 2018, is a non-linear dimensionality reduction technique grounded in Riemannian geometry and algebraic topology. UMAP constructs a topological representation of the high-dimensional data and then optimizes a low-dimensional embedding that preserves this structure.
How UMAP works:
1. Build a weighted k-nearest-neighbor graph of the data and convert it into a fuzzy topological representation (a fuzzy simplicial set) of the high-dimensional manifold.
2. Construct an analogous fuzzy topological representation of a candidate low-dimensional embedding.
3. Optimize the embedding coordinates, typically with stochastic gradient descent, to minimize the cross-entropy between the two representations.
UMAP is founded on three core assumptions about the data: (1) the data is uniformly distributed on a Riemannian manifold, (2) the Riemannian metric is locally constant or can be approximated as such, and (3) the manifold is locally connected.
| Property | Description |
|---|---|
| Type | Non-linear, unsupervised |
| Preserves | Both local and global structure (better global preservation than t-SNE) |
| Key hyperparameters | n_neighbors (controls local vs. global balance), min_dist (controls tightness of embedding) |
| Computational complexity | O(n^1.14) with approximate nearest neighbor search |
| Advantages over t-SNE | Faster; better preservation of global structure; supports embedding into arbitrary dimensions; can transform new data points |
The following table compares the three most commonly used dimensionality reduction techniques for different use cases.
| Aspect | PCA | t-SNE | UMAP |
|---|---|---|---|
| Type | Linear | Non-linear | Non-linear |
| Speed | Very fast | Slow on large datasets | Fast |
| Global structure | Well preserved | Poorly preserved | Better preserved than t-SNE |
| Local structure | Not emphasized | Excellent | Excellent |
| Embedding dimensions | Any | Typically 2 or 3 | Any |
| New data points | Supports transform | Requires re-running on full dataset | Supports transform |
| Determinism | Deterministic | Non-deterministic | Non-deterministic (more stable with same seed) |
| Interpretability | High (loadings are meaningful) | Low (distances between clusters not meaningful) | Low (distances between clusters partially meaningful) |
| Best for | Preprocessing, noise reduction, when interpretability matters | Visualizing cluster structure in 2D | Visualization, preprocessing, large-scale analysis |
| Theoretical basis | Linear algebra (eigendecomposition) | Information theory (KL divergence) | Riemannian geometry and algebraic topology |
Autoencoders are a class of neural networks that learn to compress data into a lower-dimensional representation (the latent space) and then reconstruct the original data from this compressed form. An autoencoder consists of two parts: an encoder, which maps the input to the latent representation, and a decoder, which reconstructs the input from that representation.
The bottleneck layer (the layer with the fewest neurons) defines the latent space. The network is trained to minimize the reconstruction error (the difference between the input and the reconstructed output), which forces the latent space to capture the most important features of the data.
Advantages over PCA:
- Non-linear activation functions allow autoencoders to capture complex, non-linear structure that PCA cannot.
- They scale to very large datasets through minibatch training on GPUs.
- The architecture is flexible: convolutional or recurrent encoders can exploit the structure of images, sequences, and other data types.
Disadvantages:
- They require substantially more data, computation, and hyperparameter tuning than PCA.
- Training is non-deterministic and can converge to different solutions on different runs.
- The learned latent dimensions are generally harder to interpret than principal components.
A simple linear autoencoder with one hidden layer and mean squared error loss learns the same subspace as PCA. The power of autoencoders comes from using non-linear activation functions and deeper architectures, which allow them to learn complex, non-linear manifold structures.
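The following sketch outlines a small fully-connected autoencoder in Keras, assuming TensorFlow is available. The 784-dimensional input (e.g. flattened 28x28 images scaled to [0, 1]), the layer sizes, and the hypothetical X_train array are illustrative assumptions, not a reference architecture.

```python
import tensorflow as tf

# Illustrative sizes: 784-dimensional inputs compressed to a 32-dimensional latent space.
input_dim, latent_dim = 784, 32

encoder = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(input_dim,)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(latent_dim, activation="relu"),   # bottleneck layer
])
decoder = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(latent_dim,)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(input_dim, activation="sigmoid"),
])

autoencoder = tf.keras.Sequential([encoder, decoder])
autoencoder.compile(optimizer="adam", loss="mse")   # minimize reconstruction error

# Training uses the input as its own target (X_train is a hypothetical array):
# autoencoder.fit(X_train, X_train, epochs=20, batch_size=256)
# X_latent = encoder.predict(X_train)   # the reduced representation
```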
Kernel PCA (Scholkopf, Smola, and Muller, 1998) extends PCA to non-linear data by first mapping the data into a higher-dimensional feature space via a kernel function and then performing PCA in that space. Common kernels include the radial basis function (RBF), polynomial, and sigmoid kernels. The kernel trick avoids explicitly computing the high-dimensional mapping, making the method computationally tractable.
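A brief scikit-learn sketch on a toy dataset of two concentric circles, a structure that ordinary PCA cannot separate but an RBF kernel can; the gamma value is an arbitrary choice.

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

# Two concentric circles: a non-linear structure in 2D.
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10)
X_kpca = kpca.fit_transform(X)
print(X_kpca.shape)
```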
Several classical manifold learning algorithms assume the data lies on or near a low-dimensional manifold embedded in the high-dimensional space.
| Technique | Year | Key idea |
|---|---|---|
| Multidimensional scaling (MDS) | 1952 | Preserves pairwise distances between points in the lower-dimensional space |
| Isomap | 2000 | Preserves geodesic distances on the data manifold using shortest-path graph distances |
| Locally linear embedding (LLE) | 2000 | Preserves local linear relationships between neighboring points |
| Laplacian eigenmaps | 2001 | Uses the graph Laplacian to find a low-dimensional embedding that preserves local distances |
| Diffusion maps | 2006 | Uses random walks on a graph to capture the multi-scale geometry of the data |
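As a sketch, scikit-learn's manifold module applies these ideas to the classic swiss-roll dataset (a 2D sheet curled up in 3D); the neighbor count of 10 is an arbitrary choice.

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap, LocallyLinearEmbedding

# The swiss roll: a 2D manifold embedded in 3D space.
X, _ = make_swiss_roll(n_samples=1000, random_state=0)

X_iso = Isomap(n_neighbors=10, n_components=2).fit_transform(X)
X_lle = LocallyLinearEmbedding(n_neighbors=10, n_components=2).fit_transform(X)
print(X.shape, "->", X_iso.shape, X_lle.shape)
```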
Recent work has introduced methods that aim to better balance the preservation of local and global structure.
Understanding the difference between feature selection and feature extraction is important for choosing the right approach.
| Aspect | Feature selection | Feature extraction |
|---|---|---|
| Output features | Subset of original features | New, transformed features |
| Interpretability | High (original feature meanings preserved) | Low to moderate (new features are combinations of originals) |
| Information loss | May discard useful information | Retains most information in fewer dimensions |
| Computational cost | Varies (filter methods are fast; wrapper methods are slow) | Varies (PCA is fast; autoencoders are slow) |
| Handling redundancy | Removes redundant features | Combines redundant features into compact representations |
| Best for | When interpretability is important; when features have clear domain meaning | When features are highly correlated; when visualization is the goal |
| Reversibility | Retained features keep their original values; discarded features cannot be recovered | Approximate reconstruction of the original features is possible (e.g., PCA inverse transform) |
In practice, feature selection and feature extraction are often complementary. A common pipeline uses feature engineering to create candidate features, feature selection to remove clearly irrelevant ones, and then PCA or another extraction method to reduce redundancy among the remaining features.
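One way such a pipeline might look in scikit-learn (the dataset, the ordering of steps, and the final classifier are illustrative assumptions):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.feature_selection import VarianceThreshold
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Selection (drop uninformative features), then extraction (PCA), then a classifier.
pipe = Pipeline([
    ("variance", VarianceThreshold(threshold=0.0)),
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=0.95)),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X, y)
print("training accuracy:", pipe.score(X, y))
```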
Selecting a dimensionality reduction technique is not a one-size-fits-all decision. The choice depends on the goals of the analysis, data characteristics, and computational constraints.
| Criterion | Recommendation |
|---|---|
| Goal is visualization | Use t-SNE or UMAP for 2D scatter plots. Use PCA if you also need interpretable axes. |
| Goal is preprocessing for a downstream model | Use PCA (or truncated SVD for sparse data). UMAP can also serve as a preprocessing step. |
| Data is linearly separable or relationships are linear | PCA or LDA (if labels are available). |
| Data lies on a non-linear manifold | UMAP, t-SNE, Isomap, or autoencoders. |
| Dataset is very large (millions of samples) | PCA (fast), random projections (very fast), or UMAP (scales well). Avoid t-SNE unless using accelerated implementations. |
| Interpretability is critical | PCA (loadings are interpretable) or feature selection methods. |
| Need to transform new data points | PCA, UMAP, random projections, or autoencoders. Avoid t-SNE and classical manifold methods that lack out-of-sample extensions. |
| Supervised setting with class labels | LDA for linear reduction; supervised UMAP for non-linear reduction. |
| Very high-dimensional sparse data (text, counts) | Truncated SVD (LSA) or sparse random projections. |
A practical workflow is to start with PCA to get a quick baseline and understand the linear structure, then try UMAP or t-SNE if the PCA visualization does not reveal clear patterns. When PCA is used as preprocessing before t-SNE, reducing to 50 dimensions first can significantly speed up t-SNE without losing important structure.
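Sketched in code, assuming X_scaled is a standardized feature matrix with more than 50 features:

```python
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Step 1: PCA down to 50 dimensions removes noise and speeds up t-SNE.
X_pca = PCA(n_components=50).fit_transform(X_scaled)

# Step 2: t-SNE on the PCA output for a 2D visualization.
X_2d = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(X_pca)
```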
Assessing how well a dimensionality reduction method has performed is essential but not straightforward, because the "correct" low-dimensional representation is usually unknown. Several metrics and approaches exist.
For methods that support reconstruction (PCA, autoencoders), the mean squared error between the original data and its reconstruction from the reduced representation is a natural quality measure. Lower reconstruction error indicates better preservation of the original information.
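A short sketch of this measure with PCA, assuming X_scaled is a standardized data matrix; the choice of 10 components is arbitrary.

```python
import numpy as np
from sklearn.decomposition import PCA

# Reconstruct the data from its 10-dimensional PCA representation.
pca = PCA(n_components=10).fit(X_scaled)
X_reconstructed = pca.inverse_transform(pca.transform(X_scaled))

# Mean squared reconstruction error: lower means less information lost.
reconstruction_mse = np.mean((X_scaled - X_reconstructed) ** 2)
print(f"reconstruction MSE with 10 components: {reconstruction_mse:.4f}")
```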
Specific to PCA, the cumulative explained variance ratio measures the fraction of total variance captured by the selected components. A ratio of 0.95 means the chosen components explain 95% of the variance in the data.
Trustworthiness and continuity, introduced by Venna and Kaski (2006), evaluate local neighborhood preservation. Trustworthiness penalizes points that appear as neighbors in the embedding but were not neighbors in the original space, while continuity penalizes original neighbors that end up separated in the embedding.
Scikit-learn provides a trustworthiness function that computes this metric directly.
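A sketch of its use, assuming X_scaled is a standardized high-dimensional data matrix:

```python
from sklearn.manifold import TSNE, trustworthiness

# Embed the data in 2D, then check how well local neighborhoods survived.
X_2d = TSNE(n_components=2, random_state=42).fit_transform(X_scaled)

# Returns a score between 0 and 1; values close to 1 indicate few "false
# neighbors" were introduced by the embedding.
score = trustworthiness(X_scaled, X_2d, n_neighbors=5)
print(f"trustworthiness: {score:.3f}")
```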
Often the most practical evaluation is to measure how well the reduced data performs in a downstream task. For example, if dimensionality reduction is used as a preprocessing step for clustering, the quality of the resulting clusters (measured by silhouette score, adjusted Rand index, or similar metrics) provides an indirect assessment of the reduction quality.
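A sketch of this idea, assuming X_scaled is a standardized data matrix and three clusters is an arbitrary guess:

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

# Compare clustering quality at several levels of reduction.
for n_components in (2, 10, 50):
    X_red = PCA(n_components=n_components).fit_transform(X_scaled)
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_red)
    print(n_components, "components -> silhouette:",
          round(silhouette_score(X_red, labels), 3))
```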
The co-ranking matrix (Lee and Verleysen, 2009) compares the rank-order of neighbors in the original and reduced spaces. The local continuity meta-criterion (LCMC) summarizes the co-ranking matrix into a single score, capturing both local and global preservation.
Scikit-learn provides implementations of many dimensionality reduction techniques under the sklearn.decomposition and sklearn.manifold modules.
```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# X: a (n_samples, n_features) data matrix (assumed to be defined)

# Standardize features before PCA
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Fit PCA retaining 95% of variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)

print(f"Reduced from {X.shape[1]} to {X_reduced.shape[1]} dimensions")
print(f"Explained variance: {pca.explained_variance_ratio_.sum():.3f}")
```
t-SNE is available in sklearn.manifold and follows the same fit_transform pattern:

```python
from sklearn.manifold import TSNE

# Reduce to 2D for visualization
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_2d = tsne.fit_transform(X_scaled)
```
UMAP is available through the umap-learn package rather than scikit-learn, but it follows the same API.
```python
import umap

reducer = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1, random_state=42)
X_2d = reducer.fit_transform(X_scaled)

# Transform new data points
X_new_2d = reducer.transform(X_new)
```
Sparse random projection, from sklearn.random_projection, also follows the fit_transform pattern:

```python
from sklearn.random_projection import SparseRandomProjection

transformer = SparseRandomProjection(n_components=100, random_state=42)
X_reduced = transformer.fit_transform(X)
```
Dimensionality reduction is used across many domains.
One of the most common applications is projecting high-dimensional data into 2D or 3D for visualization. t-SNE and UMAP are especially popular for visualizing clusters in single-cell RNA sequencing data, word embeddings, image features, and other complex datasets.
By projecting data onto its principal components or a learned latent space, dimensionality reduction can filter out noise that exists in minor dimensions. In image processing, retaining only the top principal components removes high-frequency noise while preserving the essential structure of the image.
Dimensionality reduction is used in data and image compression. PCA and autoencoders can reduce the storage requirements for large datasets by encoding them in fewer dimensions, with the option to approximately reconstruct the original data when needed.
Reducing the number of features before training a model can improve training speed, reduce overfitting, and sometimes improve model performance. PCA is commonly applied as a preprocessing step for support vector machines and other algorithms sensitive to the curse of dimensionality.
In genomics, datasets may contain expression levels for tens of thousands of genes across relatively few samples. Dimensionality reduction is essential for identifying meaningful patterns, such as cell types in single-cell RNA-seq analysis, where UMAP has become the standard visualization tool.
Collaborative filtering and matrix factorization methods for recommender systems rely on dimensionality reduction to identify latent factors that explain user preferences and item characteristics. SVD and non-negative matrix factorization (NMF) are the backbone of many recommendation algorithms.
In NLP, text data is often represented as high-dimensional vectors (one dimension per vocabulary term). Latent semantic analysis (via truncated SVD) reduces this space to capture semantic relationships between terms. More recently, learned embeddings from neural models project words and documents into dense, lower-dimensional spaces.
Dimensionality reduction can reveal outliers that are not apparent in the full feature space. Points that have high reconstruction error under PCA or an autoencoder, or that appear isolated in a t-SNE/UMAP plot, may be anomalies worth investigating.