See also: dimensionality reduction, singular value decomposition, eigenvalue decomposition, covariance matrix, feature extraction
Principal component analysis (PCA) is a statistical technique for reducing the dimensionality of a dataset by transforming the original variables into a new set of uncorrelated variables called principal components. These components are ordered so that the first principal component captures the largest possible share of the total variance in the data, the second component captures the largest share of the remaining variance while being orthogonal to the first, and so on. PCA is one of the most widely used methods in machine learning, data science, and applied statistics for visualization, noise reduction, feature extraction, and exploratory data analysis.
The technique works by finding the directions (axes) in the original feature space along which the data varies the most. Mathematically, this amounts to computing the eigenvectors and eigenvalues of the data's covariance matrix, or equivalently performing a singular value decomposition (SVD) of the centered data matrix. By retaining only the top k components (those associated with the largest eigenvalues), analysts can approximate the original high-dimensional data in a much lower-dimensional space while preserving most of its structure.
PCA is used across nearly every quantitative discipline: genomics, finance, neuroscience, image processing, natural language processing, climate science, and many others. Despite its age (it dates back to 1901), it remains a default first step in many data analysis pipelines because of its simplicity, speed, and strong theoretical foundations.
Imagine you have a big pile of toy blocks scattered across the floor. Each block has a position described by how far it is to the left, how far forward it is, and how high up it is. That is three numbers for every block.
Now suppose you look at the blocks from directly above. You lose the height information, but you can still see the general pattern of where the blocks are spread out. You just went from three numbers to two numbers for each block, and you picked the view that shows the most spread. That is basically what PCA does: it finds the "best camera angle" so that when you flatten the data down to fewer numbers, you keep as much of the spread (variance) as possible.
The first principal component is like the direction in which the blocks are most spread out. The second principal component is the next most spread-out direction, but it has to be perpendicular (at a right angle) to the first. PCA finds these directions automatically by looking at how all the numbers in your data relate to each other.
PCA has a long history that spans multiple disciplines. Its development involved two independent lines of work that converged over the 20th century.
Karl Pearson introduced the core idea in his 1901 paper "On Lines and Planes of Closest Fit to Systems of Points in Space," published in Philosophical Magazine. Pearson approached the problem from a geometric perspective: given a cloud of points in p-dimensional space, he sought the line (or plane) that minimized the sum of squared perpendicular distances from the points to that line. This is equivalent to finding the direction of maximum variance. Pearson framed the problem as an analogue of the principal axis theorem in mechanics, where the principal axes of a rigid body are the axes about which rotation is simplest.
Harold Hotelling independently developed the method in his 1933 paper "Analysis of a Complex of Statistical Variables into Principal Components," published in the Journal of Educational Psychology. Hotelling took an algebraic approach, defining principal components as linear combinations of the original variables that maximize variance subject to orthogonality constraints. He also gave the method its modern name. Hotelling's formulation, based on eigendecomposition of the correlation or covariance matrix, became the standard algebraic derivation taught in textbooks.
The method acquired different names in different fields:
| Name | Field | Notes |
|---|---|---|
| Karhunen-Loève transform (KLT) | Signal processing | Named after Kari Karhunen (1946) and Michel Loève (1948), who developed the continuous analogue for stochastic processes |
| Hotelling transform | Statistics | Named after Harold Hotelling |
| Proper orthogonal decomposition (POD) | Mechanical engineering, fluid dynamics | Used for analyzing turbulence and structural vibrations |
| Empirical orthogonal functions (EOF) | Meteorology, climate science | Introduced by Edward Lorenz in the 1950s for weather pattern analysis |
| Discrete Kosambi-Karhunen-Loève transform | Some engineering texts | Acknowledges the contribution of D. D. Kosambi (1943) |
The computational feasibility of PCA improved dramatically with the development of efficient algorithms for eigendecomposition and SVD in the second half of the 20th century. The LAPACK library (1992) and iterative methods such as the Lanczos algorithm made PCA practical for large datasets. Modern randomized SVD algorithms, such as those described by Halko, Martinsson, and Tropp (2011), further extended PCA to datasets with millions of observations and thousands of features.
This section presents the mathematical foundations of PCA. There are two equivalent perspectives: variance maximization (Hotelling's approach) and minimum reconstruction error (Pearson's approach).
Consider a dataset of n observations, each described by p variables. Arrange the data into an n x p matrix X, where each row is an observation and each column is a variable. Before applying PCA, the data must be centered by subtracting the column means:
X_centered = X - 1 * mu^T
where mu is the p-dimensional mean vector and 1 is a column vector of ones. After centering, each column of X_centered has zero mean.
The sample covariance matrix of the centered data is:
C = (1 / (n - 1)) * X_centered^T * X_centered
This is a p x p symmetric positive semidefinite matrix. The diagonal entries are the variances of individual variables, and the off-diagonal entries are the covariances between pairs of variables.
The first principal component direction w_1 is the unit vector that maximizes the variance of the projected data:
w_1 = argmax_{||w|| = 1} w^T C w
By the theory of Lagrange multipliers, the solution satisfies C w_1 = lambda_1 w_1, meaning w_1 is the eigenvector of C corresponding to the largest eigenvalue lambda_1. The variance of the data projected onto w_1 equals lambda_1.
The k-th principal component direction w_k is the unit vector that maximizes variance subject to being orthogonal to all previous directions w_1, ..., w_{k-1}:
w_k = argmax_{||w|| = 1, w^T w_j = 0 for j < k} w^T C w
The solution is the eigenvector of C corresponding to the k-th largest eigenvalue lambda_k.
Since C is symmetric and positive semidefinite, it admits the eigendecomposition:
C = W Lambda W^T
where W = [w_1, w_2, ..., w_p] is an orthogonal matrix whose columns are the eigenvectors, and Lambda = diag(lambda_1, lambda_2, ..., lambda_p) is a diagonal matrix of eigenvalues sorted in decreasing order (lambda_1 >= lambda_2 >= ... >= lambda_p >= 0).
The principal component scores (the transformed data) are computed as:
T = X_centered * W
The columns of T are the principal components. Each principal component t_k = X_centered * w_k is a linear combination of the original centered variables.
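The derivation above (centering, covariance, eigendecomposition, projection) can be sketched in NumPy. This is an illustrative walkthrough on synthetic data, not an optimized implementation:

```python
import numpy as np

# Synthetic data: 200 observations of 3 correlated variables
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) @ np.array([[3.0, 0.0, 0.0],
                                          [1.0, 1.0, 0.0],
                                          [0.0, 0.0, 0.2]])

# Center by subtracting the column means
X_centered = X - X.mean(axis=0)

# Sample covariance matrix C = X_centered^T X_centered / (n - 1)
n = X.shape[0]
C = X_centered.T @ X_centered / (n - 1)

# Eigendecomposition; eigh returns eigenvalues in ascending order,
# so reorder to get lambda_1 >= lambda_2 >= lambda_3
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
eigvals, W = eigvals[order], eigvecs[:, order]

# Principal component scores T = X_centered W
T = X_centered @ W

# The variance of the k-th score column equals lambda_k
assert np.allclose(T.var(axis=0, ddof=1), eigvals)
```

The final assertion checks the central fact from the derivation: projecting onto w_k yields a variable whose variance is exactly the eigenvalue lambda_k.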
Pearson's formulation seeks the rank-L approximation of the data matrix that minimizes reconstruction error. If we project the data onto L principal component directions and then project back, the reconstruction is:
X_hat = T_L * W_L^T
where T_L contains the first L columns of T and W_L contains the first L columns of W. The Eckart-Young-Mirsky theorem guarantees that this rank-L approximation minimizes the squared Frobenius norm of the difference X_centered - X_hat among all rank-L matrices.
PCA is closely related to the SVD of the centered data matrix. The SVD of X_centered is:
X_centered = U Sigma V^T
where U is n x n orthogonal, Sigma is n x p diagonal with singular values sigma_1 >= sigma_2 >= ... >= sigma_{min(n,p)} >= 0, and V is p x p orthogonal.
The connections are:
| PCA quantity | SVD equivalent |
|---|---|
| Eigenvectors of covariance matrix (W) | Right singular vectors (V) |
| Eigenvalues of covariance matrix (lambda_k) | sigma_k^2 / (n - 1) |
| Principal component scores (T) | U * Sigma |
| Loading matrix | V |
In practice, computing PCA via SVD of X_centered is numerically more stable than explicitly forming and eigendecomposing the covariance matrix C, especially when p is large. Most software implementations (NumPy, scikit-learn, MATLAB, R) use SVD internally.
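The correspondences in the table can be verified numerically; a small NumPy sketch with synthetic data:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
Xc = X - X.mean(axis=0)
n = Xc.shape[0]

# Route 1: eigendecomposition of the covariance matrix
C = Xc.T @ Xc / (n - 1)
eigvals = np.sort(np.linalg.eigvalsh(C))[::-1]

# Route 2: SVD of the centered data matrix
U, sigma, Vt = np.linalg.svd(Xc, full_matrices=False)

# Eigenvalues: lambda_k = sigma_k^2 / (n - 1)
assert np.allclose(eigvals, sigma**2 / (n - 1))

# Scores: T = Xc V = U Sigma
assert np.allclose(Xc @ Vt.T, U * sigma)
```

Both routes agree to floating-point precision, but the SVD route never forms the p x p matrix C, which is why it is preferred in practice.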
The proportion of total variance captured by the k-th principal component is:
PVE_k = lambda_k / (lambda_1 + lambda_2 + ... + lambda_p)
The cumulative proportion of variance explained by the first L components is:
CPVE_L = (lambda_1 + lambda_2 + ... + lambda_L) / (lambda_1 + lambda_2 + ... + lambda_p)
This quantity is central to choosing the number of components to retain.
The standard PCA procedure, as implemented in most software libraries, follows these steps:
Standardize the data (optional but recommended). If the variables are measured on different scales, standardize each variable to have zero mean and unit variance. This prevents variables with larger scales from dominating the principal components. When all variables share the same units and scale, centering alone (subtracting the mean) is sufficient.
Compute the covariance (or correlation) matrix. For centered data X_centered, compute C = (1/(n-1)) * X_centered^T * X_centered. If the data was standardized, this equals the correlation matrix.
Compute eigenvalues and eigenvectors. Solve C w = lambda w to obtain eigenvalue-eigenvector pairs. Sort them in decreasing order of eigenvalue. Alternatively (and more commonly), compute the SVD of X_centered directly.
Choose the number of components L. Use one or more criteria (see next section) to decide how many components to retain.
Form the projection matrix. Construct W_L = [w_1, ..., w_L], the p x L matrix of the first L eigenvectors.
Transform the data. Compute the L-dimensional representation: T_L = X_centered * W_L.
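The six steps above can be traced in a short NumPy sketch. The synthetic data and the 90% variance threshold are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(2)
# Four variables with very different scales
X = rng.normal(size=(150, 4)) * np.array([100.0, 1.0, 0.5, 10.0])
n = X.shape[0]

# Step 1: standardize each column to zero mean, unit variance
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Steps 2-3: SVD of the standardized data (preferred over forming C)
U, s, Vt = np.linalg.svd(Z, full_matrices=False)
eigvals = s**2 / (n - 1)

# Step 4: retain enough components to explain 90% of total variance
cpve = np.cumsum(eigvals) / eigvals.sum()
L = int(np.searchsorted(cpve, 0.90) + 1)

# Steps 5-6: projection matrix and the L-dimensional representation
W_L = Vt[:L].T          # p x L
T_L = Z @ W_L           # n x L
assert T_L.shape == (150, L)
```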
The time complexity depends on the method used:
| Method | Time complexity | Best when |
|---|---|---|
| Full eigendecomposition of covariance matrix | O(p^3) after O(np^2) to form covariance matrix | p is small |
| Full SVD of data matrix | O(min(n^2 p, n p^2)) | n and p are moderate |
| Truncated SVD (e.g., Lanczos, ARPACK) | O(npL) for L components | L is much smaller than min(n, p) |
| Randomized SVD | O(npL) with small constants | Very large datasets |
Selecting the right number of components L is a practical decision that balances dimensionality reduction against information loss. Several criteria are commonly used.
A scree plot graphs the eigenvalues (or proportion of variance explained) against the component number. The name comes from the geological term for rubble at the base of a cliff. The plot typically starts high on the left and drops off, eventually flattening into a "tail" of small eigenvalues. The analyst looks for the "elbow" or "knee" point where the curve transitions from steep decline to a flat tail. Components before the elbow are retained; those after it are discarded. The method is subjective but widely used as a visual aid.
The Kaiser-Guttman criterion retains all components whose eigenvalue exceeds 1 (when PCA is applied to the correlation matrix, i.e., standardized data). The rationale is that a component with an eigenvalue below 1 explains less variance than a single original variable and is therefore not worth retaining. This rule was proposed by Henry Kaiser in 1960 and is one of the most commonly applied stopping rules, though it can overestimate the number of components in datasets with many variables.
A common rule of thumb retains enough components to explain a specified percentage of total variance, often 80%, 90%, or 95%. For example, if the first five components collectively explain 92% of the total variance, an analyst targeting 90% would retain five components. This approach is straightforward but arbitrary in the choice of threshold.
Parallel analysis (Horn, 1965) compares the observed eigenvalues to eigenvalues from random data of the same dimensions. Components are retained only if their eigenvalue exceeds the corresponding value from the random data. This method has been shown to outperform both the scree plot and the Kaiser criterion in simulation studies.
In predictive modeling settings, the number of components can be chosen by cross-validation. The data is split into training and test folds, PCA is fit on the training data, and reconstruction error or downstream task performance is evaluated on the test data for different values of L. The value of L that minimizes test error (or maximizes task performance) is selected.
| Method | Type | Strengths | Weaknesses |
|---|---|---|---|
| Scree plot | Visual | Intuitive, widely understood | Subjective, ambiguous elbow |
| Kaiser criterion | Rule-based | Simple to apply | Can overestimate; only valid for correlation matrix |
| Cumulative variance | Threshold | Easy to explain | Threshold choice is arbitrary |
| Parallel analysis | Statistical | Empirically validated | Requires generating random data; less well known |
| Cross-validation | Data-driven | Accounts for downstream task | Computationally expensive |
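Of these methods, parallel analysis is the least standardized across software. A minimal NumPy sketch of Horn's procedure follows; the simulation count and the use of mean simulated eigenvalues (rather than a percentile) are conventional choices, not the only valid ones:

```python
import numpy as np

def parallel_analysis(X, n_sim=50, seed=0):
    """Retain components whose correlation-matrix eigenvalue exceeds
    the mean eigenvalue from random normal data of the same shape."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    obs = np.sort(np.linalg.eigvalsh(np.corrcoef(X, rowvar=False)))[::-1]
    sim = np.empty((n_sim, p))
    for i in range(n_sim):
        R = rng.normal(size=(n, p))
        sim[i] = np.sort(np.linalg.eigvalsh(np.corrcoef(R, rowvar=False)))[::-1]
    return int(np.sum(obs > sim.mean(axis=0)))

# Six variables driven by two latent factors plus noise
rng = np.random.default_rng(3)
F = rng.normal(size=(300, 2))
X = F @ rng.normal(size=(2, 6)) + 0.3 * rng.normal(size=(300, 6))
k = parallel_analysis(X)   # typically recovers the two factors
```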
The classical PCA algorithm assumes linearity, works in batch mode, and produces dense loadings. Several extensions relax these assumptions.
Kernel PCA, introduced by Schölkopf, Smola, and Müller in 1998 ("Nonlinear Component Analysis as a Kernel Eigenvalue Problem," Neural Computation), extends PCA to capture nonlinear structure in data. Instead of computing the covariance matrix in the original input space, kernel PCA implicitly maps the data into a high-dimensional (possibly infinite-dimensional) feature space via a kernel function and performs PCA there.
The key insight is that PCA in the feature space can be expressed entirely in terms of inner products between data points. The kernel trick replaces these inner products with a kernel function k(x_i, x_j) = phi(x_i)^T phi(x_j), where phi is the (implicit) mapping. This avoids computing the feature space coordinates explicitly.
The algorithm constructs an n x n kernel matrix K with entries K_{ij} = k(x_i, x_j), centers it, and eigendecomposes it. Common kernel choices include the radial basis function (RBF, or Gaussian) kernel, the polynomial kernel, and the sigmoid kernel.
Kernel PCA is useful when the data lies on a curved manifold and linear PCA fails to capture the underlying structure. However, it scales as O(n^2) in memory and O(n^3) in time due to the n x n kernel matrix, making it impractical for very large datasets without approximation techniques.
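The algorithm described above can be sketched directly in NumPy. The RBF bandwidth and the concentric-rings test data below are illustrative choices:

```python
import numpy as np

def rbf_kernel_pca(X, gamma=1.0, n_components=2):
    # Kernel matrix: K_ij = exp(-gamma * ||x_i - x_j||^2)
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-gamma * sq)

    # Center the kernel matrix in feature space
    n = K.shape[0]
    one = np.full((n, n), 1.0 / n)
    Kc = K - one @ K - K @ one + one @ K @ one

    # Eigendecompose and project the training points
    eigvals, eigvecs = np.linalg.eigh(Kc)
    idx = np.argsort(eigvals)[::-1][:n_components]
    alphas = eigvecs[:, idx] / np.sqrt(np.maximum(eigvals[idx], 1e-12))
    return Kc @ alphas

# Two concentric rings: no linear projection separates them
rng = np.random.default_rng(4)
t = rng.uniform(0.0, 2.0 * np.pi, 200)
r = np.repeat([1.0, 3.0], 100)
X = np.c_[r * np.cos(t), r * np.sin(t)] + 0.05 * rng.normal(size=(200, 2))
scores = rbf_kernel_pca(X, gamma=2.0, n_components=2)
```

Note that the entire computation operates on the n x n matrix Kc, which is the source of the O(n^2) memory and O(n^3) time costs discussed below.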
Probabilistic PCA (PPCA), developed by Tipping and Bishop (1999) in their paper "Probabilistic Principal Component Analysis" (Journal of the Royal Statistical Society, Series B), formulates PCA as a latent variable model. The generative model assumes:
x = W z + mu + epsilon
where z is an L-dimensional latent variable drawn from a standard normal distribution N(0, I), W is a p x L loading matrix, mu is the mean vector, and epsilon is isotropic Gaussian noise with variance sigma^2.
The parameters W and sigma^2 can be estimated by maximum likelihood. The maximum likelihood solution for W is related to the leading eigenvectors of the covariance matrix, scaled by the corresponding eigenvalues minus the noise variance.
Probabilistic PCA has several advantages over classical PCA: it provides a proper likelihood function for model comparison, it handles missing data naturally through the EM algorithm, and it can be extended to mixture models for clustering. The expectation-maximization (EM) algorithm for PPCA is computationally attractive when p is much larger than L, because it avoids the full p x p eigendecomposition.
Sparse PCA modifies the standard PCA objective to produce principal components with sparse loadings (many zero entries). Standard PCA loadings are dense, making interpretation difficult when p is large. Sparse PCA adds a sparsity-inducing penalty (typically an L1 or elastic net penalty) to the optimization problem.
The formulation by Zou, Hastie, and Tibshirani (2006), titled "Sparse Principal Component Analysis" (Journal of Computational and Graphical Statistics), recasts PCA as a regression problem and adds an elastic net penalty:
minimize ||X - X B A^T||_F^2 + lambda * sum_j ||b_j||_2^2 + sum_j lambda_{1,j} * ||b_j||_1
subject to A^T A = I, where the columns b_j of B play the role of (sparse) loading vectors.
The resulting components are easier to interpret because each component depends on only a subset of the original variables. This is valuable in fields like genomics, where identifying which genes contribute to each component is scientifically meaningful.
Incremental PCA (IPCA) processes the data in mini-batches rather than loading the entire dataset into memory. It updates the eigendecomposition incrementally as each batch arrives. This makes PCA feasible for datasets that do not fit in memory. The algorithm produces results that closely approximate batch PCA. Implementations are available in scikit-learn (IncrementalPCA), where it accepts a batch_size parameter controlling how many observations are processed at a time.
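A usage sketch with scikit-learn's IncrementalPCA; the batch sizes and data dimensions here are arbitrary, and the random batches stand in for chunks streamed from disk:

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

rng = np.random.default_rng(5)
ipca = IncrementalPCA(n_components=3, batch_size=200)

# Feed the data in chunks, as if streaming from disk
for _ in range(5):
    batch = rng.normal(size=(200, 10))
    ipca.partial_fit(batch)

# Project new observations onto the learned components
T = ipca.transform(rng.normal(size=(50, 10)))
assert T.shape == (50, 3)
```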
Robust PCA addresses the sensitivity of classical PCA to outliers. The best-known formulation, by Candès, Li, Ma, and Wright (2011), decomposes the data matrix into a low-rank component plus a sparse component:
X = L + S
where L is low-rank (representing the principal components) and S is sparse (capturing outliers or corrupted entries). The decomposition is obtained by solving a convex optimization problem known as principal component pursuit (PCP). Robust PCA has applications in video surveillance (separating static background from moving foreground), face recognition, and any setting where data corruption is expected.
PCA is applied across a remarkably wide range of disciplines. This section covers some of the most prominent use cases.
One of the most celebrated applications of PCA in computer vision is the eigenfaces method for face recognition, introduced by Turk and Pentland in 1991. The approach treats each face image (say, 100 x 100 pixels) as a point in a 10,000-dimensional space. PCA is used to find the principal components of the set of training face images. These principal components, when reshaped back into image form, resemble ghostly faces and are therefore called "eigenfaces."
To recognize a new face, the system projects it onto the eigenface basis and compares its coordinates to those of known individuals using a distance metric. The method reduces the storage and computation required from thousands of pixel values to tens or hundreds of component scores.
Limitations of the eigenfaces method include sensitivity to lighting, pose, and facial expression. The most significant eigenfaces often encode illumination variation rather than identity-specific features. Subsequent methods such as Fisherfaces (using linear discriminant analysis) and deep learning approaches have improved on eigenfaces, but the method remains an instructive example of PCA in practice.
PCA is a standard tool in genomics for analyzing genetic variation across populations. In genome-wide association studies (GWAS), PCA is used to correct for population stratification, where systematic ancestry differences between case and control groups can cause spurious associations between genetic variants and disease.
The landmark paper by Price et al. (2006), "Principal components analysis corrects for stratification in genome-wide association studies" (Nature Genetics), demonstrated that including the top principal components as covariates in regression models effectively controls for confounding due to population structure. Typically, the top 10 to 20 principal components of a genotype matrix (individuals x SNPs) are included as covariates.
PCA of genetic data also reveals population structure: when the first two or three principal components are plotted, individuals from different geographic regions or ethnic groups form distinct clusters. This makes PCA a valuable exploratory tool in population genetics.
In natural language processing, PCA and the closely related truncated SVD are used to reduce the dimensionality of word and document embeddings, to visualize embedding spaces in two or three dimensions, and, in the form of latent semantic analysis (LSA), to factor term-document matrices into a small number of semantic dimensions.
In quantitative finance, PCA is used to analyze the covariance structure of asset returns. The first few principal components of bond yield curves, for example, correspond to interpretable factors: the first component captures the overall level of interest rates, the second captures the slope (difference between long- and short-term rates), and the third captures curvature. PCA-based factor models are used for risk management, portfolio construction, and yield curve modeling.
Neuroscience researchers use PCA to analyze high-dimensional neural recordings. When recording from hundreds or thousands of neurons simultaneously, PCA identifies the dominant patterns of co-activation. The resulting low-dimensional "neural trajectories" reveal how brain activity evolves during sensory processing, motor planning, and decision-making. PCA is often the first step before applying more specialized methods like factor analysis or Gaussian process factor analysis.
PCA can be used for lossy image compression. By expressing an image (or a collection of image patches) in terms of principal components and retaining only the top L components, the image can be stored using far fewer numbers than the original pixel values. While PCA-based compression is not competitive with modern codecs (JPEG, WebP), it illustrates the principle of optimal linear approximation and is commonly used as a teaching example.
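A toy illustration of the idea, using a synthetic "image" and a rank-L SVD approximation (real codecs work very differently; this only demonstrates optimal linear approximation):

```python
import numpy as np

rng = np.random.default_rng(6)
# Stand-in image: a smooth, nearly low-rank pattern plus a little noise
u = np.sin(np.linspace(0.0, 3.0, 64))
v = np.cos(np.linspace(0.0, 3.0, 64))
img = np.outer(u, v) + 0.01 * rng.normal(size=(64, 64))

# Keep only the top L singular components
U, s, Vt = np.linalg.svd(img, full_matrices=False)
L = 5
img_hat = (U[:, :L] * s[:L]) @ Vt[:L]

# Storage: roughly L * (64 + 64 + 1) numbers instead of 64 * 64
rel_err = np.linalg.norm(img - img_hat) / np.linalg.norm(img)
assert rel_err < 0.05
```

Because the underlying pattern is essentially rank one, five components reconstruct it almost perfectly; natural images need far more components, which is why PCA compression is mainly pedagogical.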
Climate scientists use empirical orthogonal functions (the climate science name for PCA) to identify dominant spatial patterns of variability in atmospheric and oceanic data. For example, EOF analysis of sea surface temperature data reveals patterns such as the El Niño-Southern Oscillation (ENSO), the Pacific Decadal Oscillation, and the Atlantic Multidecadal Oscillation.
PCA is connected to many other techniques in linear algebra, statistics, and machine learning.
As described above, PCA is essentially equivalent to truncated SVD applied to centered data. The distinction in software (e.g., scikit-learn's PCA vs. TruncatedSVD) is that PCA centers the data first while TruncatedSVD does not, making the latter suitable for sparse matrices where centering would destroy sparsity.
Factor analysis and PCA are often confused, but they rest on different models. PCA is a variance-maximizing projection; it does not assume a generative model. Factor analysis assumes that the observed variables are linear combinations of a smaller number of latent factors plus variable-specific noise. Factor analysis estimates factor loadings, communalities, and unique variances. In practice, PCA and factor analysis often produce similar results when the noise variance is small relative to the signal, but they answer different questions.
Linear discriminant analysis is a supervised dimensionality reduction method that seeks projections maximizing class separation rather than total variance. While PCA is unsupervised, LDA uses class labels. For classification tasks, LDA projections often outperform PCA projections because PCA may discard low-variance directions that happen to carry discriminative information.
A linear autoencoder with a single hidden layer and mean squared error loss learns the same subspace as PCA. Specifically, the weights of such an autoencoder converge to span the principal subspace (though the individual weight vectors are not necessarily the eigenvectors themselves). Nonlinear autoencoders with multiple hidden layers and nonlinear activation functions generalize PCA by capturing nonlinear relationships, similar in spirit to kernel PCA but with greater flexibility. Autoencoders trained by backpropagation can handle much larger datasets and more complex data distributions than kernel PCA.
Independent component analysis seeks components that are statistically independent rather than merely uncorrelated. While PCA components are uncorrelated (second-order statistics), ICA components are independent (all orders of statistics). ICA is commonly used in signal processing applications such as blind source separation, for example separating mixed audio signals or removing artifacts from EEG recordings.
Classical MDS applied to a Euclidean distance matrix produces the same embedding as PCA (up to rotation and reflection). MDS generalizes to non-Euclidean dissimilarities, making it useful when only pairwise distances (not feature vectors) are available.
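This equivalence is easy to check numerically. A sketch on a small synthetic dataset (each embedding axis is determined only up to sign, which the comparison accounts for):

```python
import numpy as np

rng = np.random.default_rng(9)
X = rng.normal(size=(40, 3))
Xc = X - X.mean(axis=0)

# PCA scores via SVD: T = U Sigma
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
T = U * s

# Classical MDS: double-center the squared-distance matrix
D2 = np.sum((Xc[:, None] - Xc[None, :]) ** 2, axis=-1)
n = len(X)
J = np.eye(n) - np.full((n, n), 1.0 / n)
B = -0.5 * J @ D2 @ J          # equals the Gram matrix Xc Xc^T
eigvals, eigvecs = np.linalg.eigh(B)
idx = np.argsort(eigvals)[::-1][:3]
Y = eigvecs[:, idx] * np.sqrt(np.maximum(eigvals[idx], 0.0))

# Identical embeddings up to a sign flip of each axis
for k in range(3):
    assert np.allclose(np.abs(T[:, k]), np.abs(Y[:, k]), atol=1e-6)
```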
Mean centering is a prerequisite for PCA. Without centering, the first principal component tends to point toward the data centroid rather than capturing the direction of maximum variance. Virtually all PCA implementations center the data automatically.
When variables are measured on different scales (e.g., height in centimeters and weight in kilograms), variables with larger numerical ranges dominate the principal components. Standardizing each variable to zero mean and unit variance (i.e., working with the correlation matrix instead of the covariance matrix) prevents this. As a rule of thumb: if the variables have the same units, use the covariance matrix; if they have different units or wildly different scales, use the correlation matrix.
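The effect is easy to demonstrate. In this sketch (synthetic heights and weights), merely changing height's unit from centimeters to millimeters redirects the first principal component:

```python
import numpy as np

rng = np.random.default_rng(8)
height_cm = rng.normal(170.0, 10.0, 500)
weight_kg = rng.normal(70.0, 15.0, 500)

def first_pc(M):
    Mc = M - M.mean(axis=0)
    _, _, Vt = np.linalg.svd(Mc, full_matrices=False)
    return Vt[0]

# In cm, weight (std 15) has more variance than height (std 10),
# so the first PC points mostly along the weight axis
X_cm = np.c_[height_cm, weight_kg]
assert abs(first_pc(X_cm)[1]) > 0.9

# Same data with height in millimeters: height now dominates
X_mm = np.c_[10.0 * height_cm, weight_kg]
assert abs(first_pc(X_mm)[0]) > 0.99
```

Standardizing both columns before the SVD (the correlation-matrix route) makes the result identical for either choice of units.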
Classical PCA requires complete data. Common strategies for handling missing values include deleting incomplete observations, imputing missing entries (with column means or an iterative model-based method), and using formulations that accommodate missingness directly, such as the EM algorithm for probabilistic PCA.
The loading matrix W shows how each original variable contributes to each principal component. A large absolute loading indicates that the variable is strongly associated with that component. Analysts examine loadings to give interpretive labels to components (e.g., "size factor" vs. "shape factor" in morphometrics). However, loadings from standard PCA are typically dense and hard to interpret, which motivates sparse PCA.
Given the truncated scores T_L and loadings W_L, the original data can be approximately reconstructed:
X_reconstructed = T_L * W_L^T + mu
The reconstruction error is the sum of the discarded eigenvalues: sum_{k=L+1}^{p} lambda_k. This provides a natural measure of information loss.
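This identity can be verified directly. A NumPy sketch, using the (n - 1)-normalized covariance convention from earlier in the article:

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(120, 6)) @ rng.normal(size=(6, 6))
n = X.shape[0]
mu = X.mean(axis=0)
Xc = X - mu

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
eigvals = s**2 / (n - 1)

# Reconstruct from the first L components, then add the mean back
L = 3
T_L = Xc @ Vt[:L].T
X_rec = T_L @ Vt[:L] + mu

# Normalized squared reconstruction error equals the sum of the
# discarded eigenvalues lambda_{L+1} + ... + lambda_p
err = np.sum((X - X_rec) ** 2) / (n - 1)
assert np.allclose(err, eigvals[L:].sum())
```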
Despite its wide use, PCA has well-known limitations.
Linearity. PCA finds linear projections. If the data lies on a curved manifold, PCA may not capture the underlying structure. Kernel PCA, t-SNE, UMAP, and autoencoders are better suited to nonlinear data.
Variance does not equal importance. PCA assumes that the directions of greatest variance are the most informative. This is not always true. In classification tasks, for example, the most discriminative features may lie along low-variance directions that PCA would discard.
Sensitivity to outliers. Because PCA minimizes squared distances, outliers can disproportionately influence the results. A single extreme observation can distort the first principal component. Robust PCA methods address this.
Scale dependence. As noted above, PCA is sensitive to the relative scaling of variables. The results change if a variable is rescaled from meters to millimeters.
Orthogonality constraint. All principal components must be mutually orthogonal. This is mathematically convenient but may not reflect the true structure of the data. Oblique rotation methods from factor analysis relax this constraint.
Interpretability. Each principal component is a linear combination of all original variables, making it difficult to assign a clear meaning to each component. Sparse PCA addresses this issue.
Stationarity assumption. When applied to time series data, standard PCA assumes that the covariance structure is constant over time. If the data is non-stationary, the principal components may be misleading.
No class information. PCA is unsupervised and ignores class labels. For supervised learning tasks, methods like LDA or supervised PCA may be more appropriate.
PCA is available in all major scientific computing environments.
| Library / Tool | Language | Function / Class | Notes |
|---|---|---|---|
| scikit-learn | Python | sklearn.decomposition.PCA | Also offers IncrementalPCA, KernelPCA, SparsePCA |
| NumPy | Python | numpy.linalg.svd or numpy.linalg.eigh | Low-level; users must center data manually |
| R | R | prcomp(), princomp() | prcomp uses SVD (preferred); princomp uses eigendecomposition |
| MATLAB | MATLAB | pca() | Built-in since R2012b |
| Spark MLlib | Scala / Python | PCA in pyspark.ml.feature | Distributed PCA for big data |
| TensorFlow | Python | tf.linalg.svd | Can run on GPU |
| Julia | Julia | MultivariateStats.fit(PCA, X) | Part of the MultivariateStats.jl package |