Principal component analysis (PCA) is an unsupervised learning technique for dimensionality reduction that identifies the directions of maximum variance in high-dimensional data and projects it onto a lower-dimensional subspace. It transforms a set of possibly correlated variables into a set of linearly uncorrelated variables called principal components, ordered so that the first component captures the most variance, the second captures the next most, and so on. PCA is one of the oldest and most widely used methods in machine learning, statistics, and data analysis, with applications ranging from noise reduction and visualization to feature engineering and face recognition [1].
PCA was invented by Karl Pearson in 1901 in his paper "On Lines and Planes of Closest Fit to Systems of Points in Space," published in the Philosophical Magazine [1]. Pearson approached the problem from a geometric perspective, seeking the line or plane that best fits a cloud of data points in a least-squares sense. He framed the technique as an extension of the principal axis theorem in mechanics, but he never used the term "principal components."
Harold Hotelling independently developed and formalized the method in 1933 in his paper "Analysis of a Complex of Statistical Variables into Principal Components," published in the Journal of Educational Psychology [2]. Hotelling gave the technique its modern name and recast it in statistical terms: finding orthogonal linear combinations of the original variables that have maximum variance. While Pearson's approach was purely geometric, Hotelling's formulation was statistical and introduced the eigenvalue-eigenvector framework that is standard today.
Throughout the mid-twentieth century, PCA saw growing use in psychometrics (factor analysis), meteorology, and economics. The development of electronic computers in the 1960s and 1970s made it practical to apply PCA to larger datasets, and it became a standard tool in multivariate statistics. In the 1990s and 2000s, PCA became a default preprocessing step in many machine learning pipelines, and variants such as kernel PCA and probabilistic PCA extended its reach to nonlinear and Bayesian settings.
PCA operates through a series of well-defined steps that transform raw data into a new coordinate system aligned with the directions of greatest variability.
Given a dataset with n observations and p features, arrange it as an n x p matrix X. Compute the mean of each feature (column) and subtract it from every observation so that each feature has zero mean. This centering step ensures that the principal components pass through the origin of the data cloud.
The covariance matrix C is a p x p symmetric matrix where each entry C(i,j) represents the covariance between feature i and feature j. It is computed as:
C = (1 / (n - 1)) * X^T X
The diagonal entries are the variances of each feature, and the off-diagonal entries capture how features co-vary. The covariance matrix encodes the second-order statistical structure of the data.
Compute the eigenvalues and eigenvectors of the covariance matrix C. Because C is symmetric and positive semi-definite, all eigenvalues are non-negative and the eigenvectors are orthogonal. Each eigenvector defines a direction in the original feature space, and its corresponding eigenvalue quantifies the variance of the data along that direction.
Sort the eigenvectors by their eigenvalues in descending order. The eigenvector with the largest eigenvalue is the first principal component (PC1), the one with the second-largest eigenvalue is PC2, and so on.
Select the top k eigenvectors (where k < p) and form a p x k projection matrix W. The reduced dataset is:
Z = X * W
Each row of Z is the projection of the original observation onto the k-dimensional subspace spanned by the top principal components. This lower-dimensional representation preserves as much variance as possible given the constraint of using only k dimensions.
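The full pipeline above can be sketched in NumPy. This is a from-scratch illustration for small datasets, not a production implementation; names like `pca` are chosen for this example only:

```python
import numpy as np

def pca(X, k):
    """Project X (n x p) onto its top-k principal components."""
    # Step 1: center each feature (column)
    X_centered = X - X.mean(axis=0)
    # Step 2: p x p covariance matrix of the centered data
    C = X_centered.T @ X_centered / (X.shape[0] - 1)
    # Step 3: eigendecomposition (eigh exploits the symmetry of C)
    eigenvalues, eigenvectors = np.linalg.eigh(C)
    # Step 4: sort eigenpairs by eigenvalue in descending order
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
    # Step 5: project onto the top-k eigenvectors
    W = eigenvectors[:, :k]
    return X_centered @ W, eigenvalues

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
Z, eigenvalues = pca(X, k=2)
print(Z.shape)  # (100, 2)
```

Note that the total variance of the data (sum of per-feature variances) equals the sum of the returned eigenvalues, as stated later in the text.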
More formally, PCA seeks to find a set of orthonormal vectors w_1, w_2, ..., w_p that maximize the variance of the projections of the data onto those vectors.
The first principal component is the solution to:
w_1 = argmax_{||w|| = 1} { Var(Xw) } = argmax_{||w|| = 1} { w^T C w }
This is a constrained optimization problem solved via Lagrange multipliers, and the solution is the eigenvector of C corresponding to its largest eigenvalue. Each subsequent principal component maximizes variance subject to the constraint that it is orthogonal to all preceding components:
w_k = argmax_{||w|| = 1, w ⊥ w_1, ..., w_{k-1}} { w^T C w }
The total variance of the data equals the sum of all eigenvalues. The fraction of total variance captured by the first k components is:
Explained variance ratio = (lambda_1 + lambda_2 + ... + lambda_k) / (lambda_1 + lambda_2 + ... + lambda_p)
This ratio is the primary criterion for deciding how many components to retain.
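The ratio can be computed directly from the eigenvalues of the covariance matrix. In this sketch the synthetic data is built around one shared latent factor (the data-generating choices are illustrative), so the first component should capture most of the variance:

```python
import numpy as np

rng = np.random.default_rng(1)
# Four correlated features driven by a single shared latent factor
latent = rng.normal(size=(200, 1))
X = latent + 0.3 * rng.normal(size=(200, 4))

X_centered = X - X.mean(axis=0)
C = X_centered.T @ X_centered / (X.shape[0] - 1)
eigenvalues = np.sort(np.linalg.eigvalsh(C))[::-1]  # descending

# Fraction of total variance captured by the first k components
k = 1
ratio = eigenvalues[:k].sum() / eigenvalues.sum()
print(f"PC1 explains {ratio:.1%} of the variance")
```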
A scree plot displays the eigenvalues (or the percentage of variance explained by each component) in descending order on the y-axis against the component number on the x-axis. The name comes from the geological term "scree," which refers to the loose rocks at the base of a cliff. In a scree plot, the principal components that capture meaningful structure appear as the steep part of the curve (the cliff), while the remaining components form the flat tail (the rubble) [3].
Common rules of thumb for selecting the number of components include:
| Method | Description |
|---|---|
| Cumulative variance threshold | Retain enough components to explain a target percentage of variance (commonly 90% or 95%) |
| Elbow method | Look for the "elbow" in the scree plot where the curve transitions from steep descent to a flat plateau |
| Kaiser criterion | Retain only components with eigenvalues greater than 1 (applicable when data is standardized) |
| Parallel analysis | Compare observed eigenvalues with eigenvalues from randomly generated data of the same dimensions |
In practice, the choice of k often involves a trade-off between compression (fewer components) and information preservation (more variance retained). Domain knowledge and downstream task performance also play a role.
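The cumulative variance threshold is supported directly by scikit-learn: passing a float in (0, 1) as `n_components` keeps just enough components to reach that fraction of explained variance. A small sketch with illustrative synthetic data (three latent factors mixed into ten noisy features):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
# Ten features that are noisy mixtures of three latent factors
latent = rng.normal(size=(300, 3))
mixing = rng.normal(size=(3, 10))
X = latent @ mixing + 0.1 * rng.normal(size=(300, 10))

# A float n_components means: keep enough components to reach
# that cumulative explained-variance threshold
pca = PCA(n_components=0.95)
Z = pca.fit_transform(X)
print(Z.shape[1], pca.explained_variance_ratio_.sum())
```

Because the noise variance is small relative to the three latent factors, roughly three components should suffice to cross the 95% threshold.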
PCA is most appropriate in the following situations:
| Scenario | Reason |
|---|---|
| High-dimensional data with many correlated features | PCA decorrelates features and reduces redundancy |
| Visualization of high-dimensional data | Projecting to 2 or 3 dimensions enables plotting |
| Preprocessing before applying other ML algorithms | Reducing dimensionality can speed up training and reduce overfitting |
| Noise reduction | Low-variance components often correspond to noise and can be discarded |
| Multicollinearity in regression | PCA removes collinearity by producing uncorrelated components |
| Exploratory data analysis | PCA reveals the dominant modes of variation in a dataset |
PCA is not appropriate when the relationships in the data are fundamentally nonlinear, when interpretability of individual features matters, or when the data does not have a meaningful covariance structure (for example, categorical data without ordinal encoding).
One of the most common uses of PCA is projecting high-dimensional data into two or three dimensions for visual inspection. By plotting observations along the first two principal components, analysts can identify clusters, outliers, and trends that would be invisible in the original feature space. This is standard practice in genomics, where researchers use PCA plots to visualize population structure from genotyping data, and in natural language processing, where word embeddings can be reduced for visualization.
When data contains noise, the noise tends to be spread across all components, while the true signal is concentrated in the top components. By reconstructing the data using only the top-k components and discarding the rest, PCA acts as a denoising filter. This technique is used in signal processing, image compression, and spectroscopy, where noisy measurements can be cleaned by projecting onto the principal subspace and back.
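A minimal denoising sketch: generate a low-rank signal, corrupt it with noise, project onto the top-k principal subspace, and reconstruct. The rank, noise level, and dimensions below are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
# Rank-2 signal corrupted by additive Gaussian noise
signal = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 8))
noisy = signal + 0.2 * rng.normal(size=(200, 8))

# Project onto the top-k principal components, then map back
mean = noisy.mean(axis=0)
Xc = noisy - mean
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 2
denoised = Xc @ Vt[:k].T @ Vt[:k] + mean

# The reconstruction keeps noise only inside the k-dim subspace,
# so it should be closer to the clean signal than the noisy data
err_noisy = np.linalg.norm(noisy - signal)
err_denoised = np.linalg.norm(denoised - signal)
print(err_denoised < err_noisy)
```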
PCA is frequently used as a preprocessing step before applying classification or regression algorithms. Reducing the number of features can speed up training, reduce memory usage, and help prevent overfitting, especially when the number of features is comparable to or exceeds the number of observations. Algorithms like k-nearest neighbors, which suffer from the curse of dimensionality, often benefit from PCA preprocessing.
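A typical preprocessing setup in scikit-learn chains PCA and a classifier in a pipeline. The choice of 20 components and k-nearest neighbors here is illustrative, not a recommendation:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Digits: 64 pixel features per image; PCA compresses them before k-NN
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = make_pipeline(PCA(n_components=20), KNeighborsClassifier())
model.fit(X_train, y_train)
score = model.score(X_test, y_test)
print(f"test accuracy with 20 of 64 dimensions: {score:.3f}")
```

Fitting PCA inside the pipeline ensures the projection is learned from the training split only, avoiding leakage into the test set.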
One of the most celebrated applications of PCA is the eigenfaces method for face recognition, introduced by Turk and Pentland in 1991 [4]. In this approach, a set of face images is treated as high-dimensional vectors (one dimension per pixel), and PCA is applied to find the principal components of the face space. These principal components, called eigenfaces, represent the dominant patterns of variation in facial appearance (lighting direction, facial expression, presence of glasses, and so on).
To recognize a new face, the system projects it onto the eigenface basis and compares the resulting low-dimensional representation against stored representations using a distance metric. Despite its simplicity, the eigenface method was one of the first practical automated face recognition systems and remains a pedagogically important example of PCA in action.
PCA is widely used in population genetics to visualize genetic variation across individuals. The first few principal components of genome-wide genotyping data often correspond to geographic or ancestral groupings, making PCA an essential tool for quality control and population structure analysis in genome-wide association studies.
In quantitative finance, PCA is applied to yield curve modeling, where the first three principal components of bond yields are commonly interpreted as "level," "slope," and "curvature" factors. PCA is also used for portfolio risk analysis, where the top components of stock return covariance matrices represent the dominant risk factors.
PCA and singular value decomposition (SVD) are closely related. Given a centered data matrix X (n x p), its SVD is:
X = U * S * V^T
where U (n x n) and V (p x p) are orthogonal matrices and S (n x p) is a rectangular diagonal matrix of singular values. The columns of V are the eigenvectors of X^T X, which is proportional to the covariance matrix. Therefore, the right singular vectors of X are the principal component directions, and the singular values are related to the eigenvalues of the covariance matrix by:
lambda_i = sigma_i^2 / (n - 1)
where sigma_i is the i-th singular value [5].
In practice, PCA is almost always computed via SVD rather than by explicitly forming and decomposing the covariance matrix. SVD is numerically more stable, especially when features are much more numerous than observations (the "wide data" case common in genomics and text analysis). Libraries like scikit-learn use truncated SVD internally when computing PCA.
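The equivalence is easy to verify numerically: the squared singular values of the centered data matrix, divided by n - 1, match the eigenvalues of the covariance matrix. A quick check on random data:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(50, 6))
Xc = X - X.mean(axis=0)
n = Xc.shape[0]

# Route 1: eigendecomposition of the covariance matrix
C = Xc.T @ Xc / (n - 1)
eigvals = np.sort(np.linalg.eigvalsh(C))[::-1]  # descending

# Route 2: SVD of the centered data matrix
s = np.linalg.svd(Xc, compute_uv=False)  # singular values, descending

# lambda_i = sigma_i^2 / (n - 1)
print(np.allclose(eigvals, s**2 / (n - 1)))
```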
Standard PCA is a linear method: it can only capture linear relationships among features. Kernel PCA, introduced by Schölkopf, Smola, and Müller in 1998, extends PCA to nonlinear settings using the kernel trick [6]. Instead of computing the covariance matrix in the original feature space, kernel PCA implicitly maps the data into a high-dimensional (potentially infinite-dimensional) feature space through a kernel function and performs PCA in that space.
Common kernels include:
| Kernel | Formula | Use case |
|---|---|---|
| Polynomial | (x^T y + c)^d | Moderate nonlinearity, combinatorial feature interactions |
| Radial basis function (RBF) | exp(-gamma * ||x - y||^2) | General-purpose nonlinear mapping |
| Sigmoid | tanh(alpha * x^T y + c) | Neural network-like transformation |
Kernel PCA can uncover nonlinear structure that standard PCA misses, such as circular or spiral-shaped clusters. However, it comes with trade-offs: the choice of kernel and its hyperparameters can be difficult, the method does not provide explicit principal components in the original feature space (making interpretation harder), and it requires computing an n x n kernel matrix, which can be expensive for large datasets.
Despite its widespread use, PCA has several well-known limitations.
Linearity. PCA assumes that the principal components are linear combinations of the original features. If the underlying structure of the data is nonlinear (for example, a Swiss roll manifold), PCA will fail to capture it. Kernel PCA addresses this partially, but at greater computational cost and with less interpretability.
Loss of interpretability. Each principal component is a weighted sum of all original features. Unlike the original features, which often have clear physical or semantic meaning, the principal components are abstract directions of variance. This makes it difficult to interpret what a given component represents, especially in domains where feature-level explanations are important.
Sensitivity to scaling. PCA is sensitive to the relative scales of features. A feature measured in thousands will dominate the variance and thus dominate the first principal component, even if it is not the most informative feature. For this reason, it is standard practice to standardize features (subtract the mean and divide by the standard deviation) before applying PCA.
Assumes variance equals importance. PCA equates the importance of a direction with its variance. In some problems, the most discriminative directions for classification may not align with the directions of greatest variance. For example, in a two-class problem, the direction that best separates the classes might have low overall variance. Linear discriminant analysis (LDA) addresses this by finding directions that maximize class separation rather than total variance.
Outlier sensitivity. Because PCA relies on the covariance matrix, it can be heavily influenced by outliers. Robust PCA variants (such as those based on the minimum covariance determinant estimator) have been developed to mitigate this.
For visualization of high-dimensional data, two nonlinear dimensionality reduction methods have largely supplanted PCA as the default tool: t-distributed stochastic neighbor embedding (t-SNE) and uniform manifold approximation and projection (UMAP).
t-SNE, introduced by van der Maaten and Hinton in 2008, converts pairwise similarities between data points into probability distributions and minimizes the divergence between the high-dimensional and low-dimensional distributions [7]. It excels at revealing local cluster structure and is widely used for visualizing single-cell RNA sequencing data, word embeddings, and image features. However, t-SNE can be slow on large datasets, is non-deterministic, does not preserve global distances well, and the resulting coordinates have no straightforward interpretation.
UMAP, introduced by McInnes, Healy, and Melville in 2018, constructs a topological representation of the high-dimensional data and optimizes a low-dimensional embedding to match that topology [8]. UMAP is generally faster than t-SNE, scales better to large datasets, and tends to preserve more of the global structure. It has become a popular alternative for exploratory analysis in biology, astronomy, and other fields.
| Feature | PCA | t-SNE | UMAP |
|---|---|---|---|
| Type | Linear | Nonlinear | Nonlinear |
| Preserves global structure | Yes | Poorly | Moderately |
| Preserves local structure | Moderately | Excellently | Excellently |
| Scalability | Excellent | Poor to moderate | Good |
| Interpretability of components | Moderate | Low | Low |
| Deterministic | Yes | No | No |
| Typical use | Preprocessing, denoising, general exploration | Cluster visualization | Cluster visualization, embedding |
It is worth noting that PCA and these nonlinear methods are not mutually exclusive. A common workflow is to first reduce data with PCA to 50 or 100 dimensions (which is fast and removes noise) and then apply t-SNE or UMAP for final 2D visualization.
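That two-stage workflow looks like the following sketch with scikit-learn (30 intermediate dimensions is an arbitrary choice for this small dataset; 50 to 100 is more typical for larger ones):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X, _ = load_digits(return_X_y=True)  # 1797 observations, 64 features

# Stage 1: PCA to a moderate dimension (fast, removes noise)
X_reduced = PCA(n_components=30, random_state=0).fit_transform(X)

# Stage 2: t-SNE down to 2 dimensions for plotting
X_2d = TSNE(n_components=2, random_state=0).fit_transform(X_reduced)
print(X_2d.shape)  # (1797, 2)
```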
Several variants of PCA have been developed to address specific needs, including kernel PCA for nonlinear structure, probabilistic PCA for Bayesian settings, and robust PCA for outlier-contaminated data, as discussed above.
Despite being over 120 years old, PCA remains one of the most frequently used tools in data analysis and machine learning. It is a standard preprocessing step in many ML pipelines, a default diagnostic tool in genomics and social science, and a building block for more complex methods. In the era of deep learning and large language models, PCA continues to find use in analyzing the internal representations of neural networks, compressing embedding spaces, and serving as a fast dimensionality reduction step before applying more computationally intensive methods.
The method's simplicity, speed, mathematical elegance, and widespread software support (built into NumPy, scikit-learn, R, MATLAB, and essentially every statistical computing environment) ensure that PCA will remain a foundational technique for years to come.