# Dimension Reduction

> Source: https://aiwiki.ai/wiki/dimension_reduction
> Updated: 2026-06-21
> Categories: Data & Datasets, Machine Learning
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

*See also: [Machine learning terms](/wiki/machine_learning_terms)*

## Introduction

Dimensionality reduction, also known as dimension reduction, is the process of transforming data from a high-dimensional space into a lower-dimensional space while retaining as much of the meaningful structure and information as possible. It is a fundamental technique in [machine learning](/wiki/machine_learning) and data analysis, used to combat the curse of dimensionality, speed up computation, remove noise, and enable visualization of data with hundreds or thousands of [features](/wiki/feature). The two broad families of methods are feature selection, which keeps a subset of the original features, and feature extraction, which constructs new features by combining or transforming the originals.

High-dimensional data arises naturally in fields ranging from genomics and computer vision to natural language processing and finance. Working directly with hundreds or thousands of features introduces statistical, computational, and interpretive challenges that dimensionality reduction is designed to address. The choice between feature selection and feature extraction depends on the goals of the analysis, the nature of the data, and whether interpretability of individual features is required.

The field spans more than a century of work, from Karl Pearson's 1901 formulation of [principal component analysis](/wiki/principal_component_analysis) [1] to modern non-linear methods such as t-SNE (2008) [11] and UMAP (2018) [12], the latter of which the authors describe as "competitive with t-SNE for visualization quality" while "arguably preserv[ing] more of the global structure with superior run time performance." [12]

## Explain like I'm 5 (ELI5)

Imagine you have a big box of differently shaped and colored LEGO bricks. You want to build something cool, but there are so many pieces that it is hard to know where to start. Dimensionality reduction is like sorting those LEGO bricks by the things that matter most, maybe just their shape and color. Instead of keeping track of every tiny detail about each brick (its weight, its exact shade, whether it has scratches), you focus on just a couple of important things. Now you can see which bricks are similar and which are different, and building becomes much easier and faster.

## Why does dimensionality reduction matter?

### The curse of dimensionality

The **curse of dimensionality** is a term coined by Richard Bellman in his 1961 book *Adaptive Control Processes: A Guided Tour* to describe a collection of problems that arise when working with high-dimensional data. [2] As the number of dimensions (features) in a dataset increases, several issues emerge that make analysis, modeling, and computation significantly harder.

| Problem | Description |
|---|---|
| **Data sparsity** | As dimensions increase, the volume of the space grows exponentially, causing data points to become increasingly sparse. In high-dimensional space, even large datasets may have very few data points near any given point. |
| **Distance concentration** | In high dimensions, the distance between any two points tends to converge to similar values. This makes distance-based algorithms like [k-nearest neighbors](/wiki/k_nearest_neighbors) and [k-means](/wiki/k-means) less effective, because the notion of "nearest neighbor" becomes less meaningful. |
| **[Overfitting](/wiki/overfitting)** | With many features relative to the number of samples, models can easily memorize training data rather than learning generalizable patterns. The more dimensions, the more parameters a model must learn, and the more training data is needed to estimate those parameters reliably. |
| **Computational cost** | Training time and memory requirements grow with the number of features. Algorithms with complexity proportional to the number of dimensions become impractical in very high-dimensional spaces. |
| **Visualization difficulty** | Humans can perceive at most three spatial dimensions. Visualizing data in hundreds or thousands of dimensions requires reducing to 2D or 3D. |

As a rule of thumb, for reliable statistical estimation the number of samples needed grows exponentially with the number of dimensions. [2] A concrete illustration: covering a unit interval with a grid spaced every 0.1 needs 10 points in 1D, but the same grid resolution needs 100 points in 2D and 1,000 points in 3D, so the data required to maintain a fixed density grows as 10 raised to the number of dimensions. A dataset with 100 features may therefore require vastly more samples than one with 10 features to achieve the same level of model accuracy.

### Additional motivations

Beyond the curse of dimensionality, dimensionality reduction serves several practical purposes:

- **Noise reduction.** Minor dimensions often capture noise rather than signal. Projecting data onto its principal components or a learned latent space filters out irrelevant variation.
- **Data compression.** Storing and transmitting high-dimensional data is expensive. Compact representations reduce storage requirements while preserving approximate reconstructability.
- **Improved model performance.** Removing redundant or irrelevant features can improve the generalization of downstream models, especially when the sample size is limited.
- **Faster training and inference.** Fewer features translate directly to faster computation in many algorithms.

## Feature selection

Feature selection identifies and retains only the most relevant features from the original dataset. Because the selected features remain in their original form, this approach preserves interpretability.

### Filter methods

Filter methods evaluate the relevance of each feature independently, typically by measuring its relationship with the target variable. These methods are computationally efficient and do not rely on any specific machine learning model.

| Method | How it works | Strengths |
|---|---|---|
| **Pearson correlation** | Measures linear relationship between feature and target | Fast; works well for linearly related features |
| **Mutual information** | Measures the amount of information gained about the target from the feature | Captures non-linear relationships |
| **Chi-squared test** | Tests statistical independence between categorical features and the target | Well-suited for categorical data |
| **ANOVA F-test** | Compares the means of a feature across target classes | Good for selecting numerical features for classification tasks |
| **Variance threshold** | Removes features with variance below a specified level | Simple; removes constant or near-constant features |

### Wrapper methods

Wrapper methods use a specific machine learning model to evaluate the usefulness of feature subsets. The selection process is performed iteratively, adding or removing features until an optimal subset is found. While these methods can produce better results than filter methods, they are more computationally expensive.

- **Forward selection.** Begins with an empty feature set and adds features one at a time, selecting the feature that provides the greatest improvement in model performance at each step.
- **Backward elimination.** Begins with the full feature set and removes one feature at a time, dropping the feature whose removal causes the least degradation in performance.
- **Recursive feature elimination (RFE).** Trains a model, ranks features by importance, removes the least important features, and repeats until the desired number remains.

### Embedded methods

Embedded methods incorporate feature selection as part of the training process of a machine learning model.

- **[LASSO](/wiki/lasso_regression) (Least Absolute Shrinkage and Selection Operator).** Applies L1 [regularization](/wiki/regularization), driving coefficients of irrelevant features to exactly zero.
- **[Decision tree](/wiki/decision_tree)-based methods.** [Random forests](/wiki/random_forest) and [gradient boosting](/wiki/gradient_boosting) machines compute feature importance scores based on impurity reduction or permutation importance.
- **[Elastic net](/wiki/elastic_net).** Combines L1 and L2 regularization, balancing feature selection with coefficient shrinkage.

## Feature extraction: linear methods

Linear techniques assume that the data can be represented in a lower-dimensional space using linear combinations of the original features.

### Principal component analysis (PCA)

[Principal component analysis](/wiki/principal_component_analysis) (PCA) is the most widely used linear dimensionality reduction technique. Developed by Karl Pearson in 1901 [1] and independently by Harold Hotelling in the 1930s, PCA identifies the directions of maximum variance in the data (called principal components) and projects the data onto these directions. Pearson framed the goal geometrically, as finding the "lines and planes of closest fit to systems of points in space." [1]

**How PCA works:**

1. Center the data by subtracting the mean of each feature.
2. Compute the covariance matrix of the centered data.
3. Calculate the eigenvectors and eigenvalues of the covariance matrix (eigenvalue decomposition).
4. Sort the eigenvectors by their corresponding eigenvalues in descending order.
5. Select the top *k* eigenvectors as the principal components.
6. Project the data onto the selected principal components.

The eigenvalue associated with each principal component represents the amount of variance explained by that component. The **cumulative explained variance ratio** helps determine how many components to retain. A common heuristic is to keep enough components to explain 95% of the total variance. [7] Alternatively, a scree plot (eigenvalues plotted against component index) can reveal an "elbow" where adding more components yields diminishing returns.

| Property | Description |
|---|---|
| Type | Linear, [unsupervised](/wiki/unsupervised_machine_learning) |
| Preserves | Global variance structure |
| Computational complexity | O(min(n^3, d^3)) for n samples and d dimensions |
| Scalability | Good; efficient implementations exist for large datasets (randomized SVD, incremental PCA) |
| Limitations | Cannot capture non-linear relationships; sensitive to feature scaling |
| Common variants | Kernel PCA, Incremental PCA, Sparse PCA, Robust PCA |

PCA is deterministic and produces interpretable results: each principal component is a linear combination of original features, and the loadings reveal which features contribute most. [7] Because PCA relies on variance, it is important to standardize features to unit variance before applying it when features have different scales.

### Linear discriminant analysis (LDA)

[Linear discriminant analysis](/wiki/linear_discriminant_analysis) (LDA) is a supervised dimensionality reduction technique that finds the linear combinations of features that best separate two or more classes. Unlike PCA, which maximizes variance, LDA maximizes the ratio of between-class variance to within-class variance (the Fisher criterion). LDA can reduce the dimensionality to at most C-1 dimensions, where C is the number of classes.

LDA is particularly useful as a preprocessing step for classification tasks because it explicitly optimizes class separability. However, it assumes that each class has the same covariance matrix (homoscedasticity), and its performance degrades when this assumption is violated.

### Factor analysis

Factor analysis is a statistical method that explains variability among observed, correlated variables in terms of a smaller number of unobserved latent variables called **factors**. While PCA extracts components that maximize variance, factor analysis models the data as arising from latent factors plus noise.

The key distinction is conceptual. PCA is a data reduction technique that finds orthogonal directions of maximum variance. Factor analysis, by contrast, assumes a generative model: each observed variable is a linear combination of latent factors plus an error term specific to that variable. The correlation between a variable and a factor is called the variable's **factor loading**.

Factor analysis is widely used in psychology, social sciences, and marketing where the goal is to identify interpretable latent constructs (such as "intelligence" or "customer satisfaction") from a large set of measured variables.

### Singular value decomposition (SVD)

[Singular value decomposition](/wiki/singular_value_decomposition) (SVD) factorizes a matrix into three component matrices: U, Sigma, and V^T. Truncated SVD retains only the top *k* singular values and their corresponding vectors, providing a low-rank approximation of the data. SVD is closely related to PCA and is particularly useful for sparse data, since it does not require computing the full covariance matrix. Truncated SVD is the standard approach for latent semantic analysis (LSA) in natural language processing.

### Random projections (Johnson-Lindenstrauss)

Random projection is a computationally efficient, data-agnostic method for dimensionality reduction. It works by projecting data onto a randomly generated lower-dimensional subspace. The theoretical foundation is the **Johnson-Lindenstrauss lemma** (1984), which states that a set of n points in high-dimensional Euclidean space can be embedded into a space of dimension O(log(n) / epsilon^2) such that all pairwise distances are preserved within a factor of (1 +/- epsilon). [3]

The key advantage of random projections is that the projection matrix depends only on the number of points and the desired distortion tolerance, not on the data itself. This makes random projections extremely fast and suitable for very large, high-dimensional datasets where computing eigenvectors would be prohibitively expensive.

Two common implementations are available in scikit-learn:

- **Gaussian random projection.** Uses a projection matrix with entries drawn from a normal distribution.
- **Sparse random projection.** Uses a sparse matrix with entries from {-1, 0, +1}, which is more memory-efficient. Achlioptas (2003) showed that this "database-friendly" approach preserves the theoretical guarantees of the lemma. [8]

### Independent component analysis (ICA)

Independent component analysis (ICA) is a technique that separates a multivariate signal into additive, statistically independent components. While PCA finds uncorrelated components that maximize variance, ICA finds components that are as statistically independent as possible, using measures of non-Gaussianity.

ICA was originally developed for blind source separation, the classic example being the "cocktail party problem" where multiple speech signals must be separated from a recording of overlapping conversations. In the context of dimensionality reduction, ICA is often applied to fMRI data, EEG signals, and financial time series where the underlying sources are assumed to be independent.

Common algorithms include FastICA (Hyvarinen and Oja, 1997) and JADE. ICA is typically preceded by PCA to reduce the number of dimensions before the independence optimization.

## Feature extraction: non-linear methods

Non-linear techniques are employed when the data's underlying structure lies on a curved manifold that cannot be adequately represented by linear combinations.

### t-distributed stochastic neighbor embedding (t-SNE)

t-SNE, introduced by Laurens van der Maaten and Geoffrey Hinton in 2008 in the *Journal of Machine Learning Research* (volume 9, pages 2579-2605), is a non-linear dimensionality reduction technique designed primarily for visualization of high-dimensional data in two or three dimensions. [11] The authors describe it as a method that "gives each datapoint a location in a two or three-dimensional map" and that "is much easier to optimize, and produces significantly better visualizations by reducing the tendency to crowd points together in the centre of the map" than the original stochastic neighbor embedding. [11]

**How t-SNE works:**

1. Compute pairwise similarities between data points in the high-dimensional space using Gaussian kernels. Nearby points receive high similarity; distant points receive low similarity.
2. Define a similar probability distribution in the low-dimensional space using a Student's t-distribution (which has heavier tails than a Gaussian).
3. Minimize the Kullback-Leibler (KL) divergence between the two distributions using [gradient descent](/wiki/gradient_descent).

The use of the t-distribution in the low-dimensional space helps alleviate the "crowding problem," where points that are moderately distant in high-dimensional space are forced too close together in the lower-dimensional embedding.

| Property | Description |
|---|---|
| Type | Non-linear, unsupervised |
| Preserves | Local neighborhood structure |
| Key hyperparameter | Perplexity (roughly corresponds to the number of nearest neighbors considered; typical values range from 5 to 50) [11] |
| Computational complexity | O(n^2) for naive implementation; O(n log n) with Barnes-Hut approximation |
| Limitations | Non-deterministic (random initialization); does not preserve global structure well; cannot be applied to new data points without re-running; slow for very large datasets |

Important caveats when interpreting t-SNE plots: the sizes of clusters do not indicate their density, the distances between clusters do not reflect their true separation, and different runs with different random seeds can produce visually different results. It is advisable to run t-SNE multiple times and to experiment with different perplexity values.

### Uniform manifold approximation and projection (UMAP)

UMAP, introduced by Leland McInnes, John Healy, and James Melville in 2018, is a non-linear dimensionality reduction technique grounded in Riemannian geometry and algebraic topology. [12] UMAP constructs a topological representation of the high-dimensional data and then optimizes a low-dimensional [embedding](/wiki/embedding_space) that preserves this structure. The authors state that "the UMAP algorithm is competitive with t-SNE for visualization quality, and arguably preserves more of the global structure with superior run time performance," and that it "has no computational restrictions on embedding dimension, making it viable as a general purpose dimension reduction technique for machine learning." [12]

**How UMAP works:**

1. Construct a weighted k-nearest-neighbor graph in the high-dimensional space.
2. Build a fuzzy simplicial complex that captures the topological structure of the data.
3. Optimize a low-dimensional embedding by minimizing the cross-entropy between the high-dimensional and low-dimensional fuzzy simplicial complexes.

UMAP is founded on three core assumptions about the data: (1) the data is uniformly distributed on a Riemannian manifold, (2) the Riemannian metric is locally constant or can be approximated as such, and (3) the manifold is locally connected. [12]

| Property | Description |
|---|---|
| Type | Non-linear, unsupervised |
| Preserves | Both local and global structure (better global preservation than t-SNE) |
| Key hyperparameters | n_neighbors (controls local vs. global balance), min_dist (controls tightness of embedding) |
| Computational complexity | O(n^1.14) with approximate nearest neighbor search |
| Advantages over t-SNE | Faster; better preservation of global structure; supports embedding into arbitrary dimensions; can transform new data points |

### How do PCA, t-SNE, and UMAP differ?

The following table compares the three most commonly used dimensionality reduction techniques for different use cases.

| Aspect | PCA | t-SNE | UMAP |
|---|---|---|---|
| Type | Linear | Non-linear | Non-linear |
| Speed | Very fast | Slow on large datasets | Fast |
| Global structure | Well preserved | Poorly preserved | Better preserved than t-SNE |
| Local structure | Not emphasized | Excellent | Excellent |
| Embedding dimensions | Any | Typically 2 or 3 | Any |
| New data points | Supports transform | Requires re-running on full dataset | Supports transform |
| Determinism | Deterministic | Non-deterministic | Non-deterministic (more stable with same seed) |
| Interpretability | High (loadings are meaningful) | Low (distances between clusters not meaningful) | Low (distances between clusters partially meaningful) |
| Best for | Preprocessing, noise reduction, when interpretability matters | Visualizing cluster structure in 2D | Visualization, preprocessing, large-scale analysis |
| Theoretical basis | Linear algebra (eigendecomposition) | Information theory (KL divergence) | Riemannian geometry and algebraic topology |

### Autoencoders for dimensionality reduction

[Autoencoders](/wiki/autoencoder) are a class of [neural networks](/wiki/neural_network) that learn to compress data into a lower-dimensional representation (the latent space) and then reconstruct the original data from this compressed form. In a landmark 2006 paper in *Science*, Geoffrey Hinton and Ruslan Salakhutdinov showed that deep autoencoder networks could learn low-dimensional codes that "work much better than principal components analysis as a tool to reduce the dimensionality of data." [9] An autoencoder consists of two parts:

- **Encoder.** Maps the input data to a lower-dimensional latent representation. The encoder progressively reduces the number of neurons in each layer, forcing the network to learn a compressed representation.
- **Decoder.** Reconstructs the original data from the latent representation. The decoder mirrors the encoder, progressively increasing the number of neurons.

The bottleneck layer (the layer with the fewest neurons) defines the latent space. The network is trained to minimize the reconstruction error (the difference between the input and the reconstructed output), which forces the latent space to capture the most important features of the data.

**Advantages over PCA:**

- Autoencoders can capture non-linear relationships, whereas PCA is limited to linear projections.
- Variational autoencoders (VAEs) learn a continuous, structured latent space that can be used for generative tasks.
- Denoising autoencoders learn representations that are robust to noise.

**Disadvantages:**

- Require more data and computational resources to train.
- The learned representation is harder to interpret than PCA components.
- Prone to overfitting on small datasets.

A simple linear autoencoder with one hidden layer and mean squared error loss learns the same subspace as PCA. The power of autoencoders comes from using non-linear activation functions and deeper architectures, which allow them to learn complex, non-linear manifold structures. [9]

### Kernel PCA

Kernel PCA (Scholkopf, Smola, and Muller, 1998) extends PCA to non-linear data by first mapping the data into a higher-dimensional feature space via a kernel function and then performing PCA in that space. [4] Common kernels include the radial basis function (RBF), polynomial, and sigmoid kernels. The kernel trick avoids explicitly computing the high-dimensional mapping, making the method computationally tractable.

### Manifold learning methods

Several classical manifold learning algorithms assume the data lies on or near a low-dimensional manifold embedded in the high-dimensional space.

| Technique | Year | Key idea |
|---|---|---|
| **Multidimensional scaling (MDS)** | 1952 | Preserves pairwise distances between points in the lower-dimensional space |
| **Isomap** | 2000 | Preserves geodesic distances on the data manifold using shortest-path graph distances [5] |
| **Locally linear embedding (LLE)** | 2000 | Preserves local linear relationships between neighboring points [6] |
| **Laplacian eigenmaps** | 2001 | Uses the graph Laplacian to find a low-dimensional embedding that preserves local distances |
| **Diffusion maps** | 2006 | Uses random walks on a graph to capture the multi-scale geometry of the data |

### Newer non-linear methods

Recent work has introduced methods that aim to better balance the preservation of local and global structure.

- **TriMap** (Amid and Warmuth, 2019) uses triplet constraints to preserve both local neighborhoods and global arrangement of the data.
- **PaCMAP** (Wang et al., 2021) introduces "mid-near pairs" that dynamically guide the optimization to capture global structure before refining local structure. In comprehensive benchmarks, PaCMAP consistently ranks among the top methods for preserving both local and global data geometry. [14]

## Feature selection vs. feature extraction: what is the difference?

Understanding the difference between feature selection and feature extraction is important for choosing the right approach.

| Aspect | Feature selection | Feature extraction |
|---|---|---|
| Output features | Subset of original features | New, transformed features |
| Interpretability | High (original feature meanings preserved) | Low to moderate (new features are combinations of originals) |
| Information loss | May discard useful information | Retains most information in fewer dimensions |
| Computational cost | Varies (filter methods are fast; wrapper methods are slow) | Varies (PCA is fast; autoencoders are slow) |
| Handling redundancy | Removes redundant features | Combines redundant features into compact representations |
| Best for | When interpretability is important; when features have clear domain meaning | When features are highly correlated; when visualization is the goal |
| Reversibility | Fully reversible (original features are retained) | Partially reversible (approximate reconstruction possible) |

In practice, feature selection and feature extraction are often complementary. A common pipeline uses [feature engineering](/wiki/feature_engineering) to create candidate features, feature selection to remove clearly irrelevant ones, and then PCA or another extraction method to reduce redundancy among the remaining features.

## How do you choose the right method?

Selecting a dimensionality reduction technique is not a one-size-fits-all decision. The choice depends on the goals of the analysis, data characteristics, and computational constraints. [13]

| Criterion | Recommendation |
|---|---|
| **Goal is visualization** | Use t-SNE or UMAP for 2D scatter plots. Use PCA if you also need interpretable axes. |
| **Goal is preprocessing for a downstream model** | Use PCA (or truncated SVD for sparse data). UMAP can also serve as a preprocessing step. |
| **Data is linearly separable or relationships are linear** | PCA or LDA (if labels are available). |
| **Data lies on a non-linear manifold** | UMAP, t-SNE, Isomap, or autoencoders. |
| **Dataset is very large (millions of samples)** | PCA (fast), random projections (very fast), or UMAP (scales well). Avoid t-SNE unless using accelerated implementations. |
| **Interpretability is critical** | PCA (loadings are interpretable) or feature selection methods. |
| **Need to transform new data points** | PCA, UMAP, random projections, or autoencoders. Avoid t-SNE and classical manifold methods that lack out-of-sample extensions. |
| **Supervised setting with class labels** | LDA for linear reduction; supervised UMAP for non-linear reduction. |
| **Very high-dimensional sparse data (text, counts)** | Truncated SVD (LSA) or sparse random projections. |

A practical workflow is to start with PCA to get a quick baseline and understand the linear structure, then try UMAP or t-SNE if the PCA visualization does not reveal clear patterns. When PCA is used as preprocessing before t-SNE, reducing to 50 dimensions first can significantly speed up t-SNE without losing important structure. [11]

## How is the quality of dimensionality reduction evaluated?

Assessing how well a dimensionality reduction method has performed is essential but not straightforward, because the "correct" low-dimensional representation is usually unknown. Several metrics and approaches exist.

### Reconstruction error

For methods that support reconstruction (PCA, autoencoders), the mean squared error between the original data and its reconstruction from the reduced representation is a natural quality measure. Lower reconstruction error indicates better preservation of the original information.

### Explained variance ratio

Specific to PCA, the cumulative explained variance ratio measures the fraction of total variance captured by the selected components. A ratio of 0.95 means the chosen components explain 95% of the variance in the data. [7]

### Trustworthiness and continuity

These metrics, introduced by Venna and Kaski (2006), evaluate local neighborhood preservation. [10]

- **Trustworthiness** measures whether points that appear close in the low-dimensional embedding were also close in the original space. Low trustworthiness indicates false neighbors (points that appear close but were actually far apart).
- **Continuity** measures whether points that were close in the original space remain close in the embedding. Low continuity indicates torn neighborhoods (nearby points that were separated by the reduction).

Scikit-learn provides a `trustworthiness` function that computes this metric directly. [15]

### Downstream task performance

Often the most practical evaluation is to measure how well the reduced data performs in a downstream task. For example, if dimensionality reduction is used as a preprocessing step for [clustering](/wiki/clustering), the quality of the resulting clusters (measured by silhouette score, adjusted Rand index, or similar metrics) provides an indirect assessment of the reduction quality.

### Co-ranking matrix and LCMC

The co-ranking matrix (Lee and Verleysen, 2009) compares the rank-order of neighbors in the original and reduced spaces. The local continuity meta-criterion (LCMC) summarizes the co-ranking matrix into a single score, capturing both local and global preservation.

## Implementation in scikit-learn

Scikit-learn provides implementations of many dimensionality reduction techniques under the `sklearn.decomposition` and `sklearn.manifold` modules. [15]

### PCA example

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize features before PCA
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Fit PCA retaining 95% of variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)

print(f"Reduced from {X.shape[1]} to {X_reduced.shape[1]} dimensions")
print(f"Explained variance: {pca.explained_variance_ratio_.sum():.3f}")
```

### t-SNE example

```python
from sklearn.manifold import TSNE

# Reduce to 2D for visualization
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_2d = tsne.fit_transform(X_scaled)
```

### UMAP example

UMAP is available through the `umap-learn` package rather than scikit-learn, but it follows the same API.

```python
import umap

reducer = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1, random_state=42)
X_2d = reducer.fit_transform(X_scaled)

# Transform new data points
X_new_2d = reducer.transform(X_new)
```

### Random projections example

```python
from sklearn.random_projection import SparseRandomProjection

transformer = SparseRandomProjection(n_components=100, random_state=42)
X_reduced = transformer.fit_transform(X)
```

## What is dimensionality reduction used for?

Dimensionality reduction is used across many domains.

### Data visualization

One of the most common applications is projecting high-dimensional data into 2D or 3D for visualization. t-SNE and UMAP are especially popular for visualizing clusters in single-cell RNA sequencing data, [word embeddings](/wiki/word_embedding), image features, and other complex datasets.

### Noise reduction

By projecting data onto its principal components or a learned latent space, dimensionality reduction can filter out noise that exists in minor dimensions. In image processing, retaining only the top principal components removes high-frequency noise while preserving the essential structure of the image.

### Compression

Dimensionality reduction is used in data and image compression. PCA and autoencoders can reduce the storage requirements for large datasets by encoding them in fewer dimensions, with the option to approximately reconstruct the original data when needed.

### Preprocessing for machine learning

Reducing the number of features before training a model can improve training speed, reduce overfitting, and sometimes improve model performance. PCA is commonly applied as a preprocessing step for [support vector machines](/wiki/support_vector_machine_svm) and other algorithms sensitive to the curse of dimensionality.

### Genomics and bioinformatics

In genomics, datasets may contain expression levels for tens of thousands of genes across relatively few samples. Dimensionality reduction is essential for identifying meaningful patterns, such as cell types in single-cell RNA-seq analysis, where UMAP has become the standard visualization tool.

### Recommender systems

[Collaborative filtering](/wiki/collaborative_filtering) and matrix factorization methods for [recommender systems](/wiki/recommender_system) rely on dimensionality reduction to identify latent factors that explain user preferences and item characteristics. SVD and non-negative matrix factorization (NMF) are the backbone of many recommendation algorithms.

### Natural language processing

In NLP, text data is often represented as high-dimensional vectors (one dimension per vocabulary term). Latent semantic analysis (via truncated SVD) reduces this space to capture semantic relationships between terms. More recently, learned embeddings from neural models project words and documents into dense, lower-dimensional spaces.

### Anomaly detection

Dimensionality reduction can reveal outliers that are not apparent in the full feature space. Points that have high reconstruction error under PCA or an autoencoder, or that appear isolated in a t-SNE/UMAP plot, may be anomalies worth investigating.

## References

1. Pearson, K. (1901). "On Lines and Planes of Closest Fit to Systems of Points in Space." *Philosophical Magazine*, 2(11), 559-572.
2. Bellman, R. (1961). *Adaptive Control Processes: A Guided Tour*. Princeton University Press.
3. Johnson, W.B. and Lindenstrauss, J. (1984). "Extensions of Lipschitz mappings into a Hilbert space." *Contemporary Mathematics*, 26, 189-206.
4. Scholkopf, B., Smola, A., and Muller, K.R. (1998). "Nonlinear Component Analysis as a Kernel Eigenvalue Problem." *Neural Computation*, 10(5), 1299-1319.
5. Tenenbaum, J.B., de Silva, V., and Langford, J.C. (2000). "A Global Geometric Framework for Nonlinear Dimensionality Reduction." *Science*, 290(5500), 2319-2323.
6. Roweis, S.T. and Saul, L.K. (2000). "Nonlinear Dimensionality Reduction by Locally Linear Embedding." *Science*, 290(5500), 2323-2326.
7. Jolliffe, I.T. (2002). *Principal Component Analysis*. 2nd Edition. Springer.
8. Achlioptas, D. (2003). "Database-friendly random projections: Johnson-Lindenstrauss with binary coins." *Journal of Computer and System Sciences*, 66(4), 671-687.
9. Hinton, G.E. and Salakhutdinov, R.R. (2006). "Reducing the Dimensionality of Data with Neural Networks." *Science*, 313(5786), 504-507.
10. Venna, J. and Kaski, S. (2006). "Local multidimensional scaling." *Neural Networks*, 19(6-7), 889-899.
11. van der Maaten, L. and Hinton, G. (2008). "Visualizing Data using t-SNE." *Journal of Machine Learning Research*, 9, 2579-2605. https://www.jmlr.org/papers/v9/vandermaaten08a.html
12. McInnes, L., Healy, J., and Melville, J. (2018). "UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction." *arXiv preprint arXiv:1802.03426*. https://arxiv.org/abs/1802.03426
13. Nguyen, L.H. and Holmes, S. (2019). "Ten quick tips for effective dimensionality reduction." *PLOS Computational Biology*, 15(6), e1006907.
14. Wang, Y., Huang, H., Rudin, C., and Shaposhnik, Y. (2021). "Understanding How Dimension Reduction Tools Work: An Empirical Approach to Deciphering t-SNE, UMAP, TriMap, and PaCMAP for Data Visualization." *Journal of Machine Learning Research*, 22(201), 1-73.
15. Pedregosa, F. et al. (2011). "Scikit-learn: Machine Learning in Python." *Journal of Machine Learning Research*, 12, 2825-2830. https://scikit-learn.org/stable/modules/decomposition.html
