See also: dimensionality reduction, singular value decomposition, eigenvalue decomposition, covariance matrix, feature extraction
Principal component analysis (PCA) is a statistical technique for reducing the dimensionality of a dataset by transforming the original variables into a new set of uncorrelated variables called principal components. These components are ordered so that the first principal component captures the largest possible share of the total variance in the data, the second component captures the largest share of the remaining variance while being orthogonal to the first, and so on. PCA is one of the most widely used methods in machine learning, data science, and applied statistics for visualization, noise reduction, feature extraction, and exploratory data analysis.
The technique works by finding the directions (axes) in the original feature space along which the data varies the most. Mathematically, this amounts to computing the eigenvectors and eigenvalues of the data's covariance matrix, or equivalently performing a singular value decomposition (SVD) of the centered data matrix. By retaining only the top k components (those associated with the largest eigenvalues), analysts can approximate the original high-dimensional data in a much lower-dimensional space while preserving most of its structure.
PCA is used across nearly every quantitative discipline: genomics, finance, neuroscience, image processing, natural language processing, climate science, and many others. Despite its age (it dates back to 1901), it remains a default first step in many data analysis pipelines because of its simplicity, speed, and strong theoretical foundations.
Imagine you have a big pile of toy blocks scattered across the floor. Each block has a position described by how far it is to the left, how far forward it is, and how high up it is. That is three numbers for every block.
Now suppose you look at the blocks from directly above. You lose the height information, but you can still see the general pattern of where the blocks are spread out. You just went from three numbers to two numbers for each block, and you picked the view that shows the most spread. That is basically what PCA does: it finds the "best camera angle" so that when you flatten the data down to fewer numbers, you keep as much of the spread (variance) as possible.
The first principal component is like the direction in which the blocks are most spread out. The second principal component is the next most spread-out direction, but it has to be perpendicular (at a right angle) to the first. PCA finds these directions automatically by looking at how all the numbers in your data relate to each other.
PCA has a long history that spans multiple disciplines. Its development involved two independent lines of work that converged over the 20th century.
Karl Pearson introduced the core idea in his 1901 paper "On Lines and Planes of Closest Fit to Systems of Points in Space," published in Philosophical Magazine. Pearson approached the problem from a geometric perspective: given a cloud of points in p-dimensional space, he sought the line (or plane) that minimized the sum of squared perpendicular distances from the points to that line. This is equivalent to finding the direction of maximum variance. Pearson framed the problem as an analogue of the principal axis theorem in mechanics, where the principal axes of a rigid body are the axes about which rotation is simplest.
Harold Hotelling independently developed the method in his 1933 paper "Analysis of a Complex of Statistical Variables into Principal Components," published in the Journal of Educational Psychology. Hotelling took an algebraic approach, defining principal components as linear combinations of the original variables that maximize variance subject to orthogonality constraints. He also gave the method its modern name. Hotelling's formulation, based on eigendecomposition of the correlation or covariance matrix, became the standard algebraic derivation taught in textbooks.
The method acquired different names in different fields:
| Name | Field | Notes |
|---|---|---|
| Karhunen-Loève transform (KLT) | Signal processing | Named after Kari Karhunen (1946) and Michel Loève (1948), who developed the continuous analogue for stochastic processes |
| Hotelling transform | Statistics | Named after Harold Hotelling |
| Proper orthogonal decomposition (POD) | Mechanical engineering, fluid dynamics | Used for analyzing turbulence and structural vibrations |
| Empirical orthogonal functions (EOF) | Meteorology, climate science | Introduced by Edward Lorenz in the 1950s for weather pattern analysis |
| Discrete Kosambi-Karhunen-Loève transform | Some engineering texts | Acknowledges the contribution of D. D. Kosambi (1943) |
The computational feasibility of PCA improved dramatically with the development of efficient algorithms for eigendecomposition and SVD in the second half of the 20th century. The LAPACK library (1992) and iterative methods such as the Lanczos algorithm made PCA practical for large datasets. Modern randomized SVD algorithms, such as those described by Halko, Martinsson, and Tropp (2011), further extended PCA to datasets with millions of observations and thousands of features.
This section presents the mathematical foundations of PCA. There are two equivalent perspectives: variance maximization (Hotelling's approach) and minimum reconstruction error (Pearson's approach).
Consider a dataset of n observations, each described by p variables. Arrange the data into an n x p matrix X, where each row is an observation and each column is a variable. Before applying PCA, the data must be centered by subtracting the column means:
X_centered = X - 1 * mu^T
where mu is the p-dimensional mean vector and 1 is a column vector of ones. After centering, each column of X_centered has zero mean.
The sample covariance matrix of the centered data is:
C = (1 / (n - 1)) * X_centered^T * X_centered
This is a p x p symmetric positive semidefinite matrix. The diagonal entries are the variances of individual variables, and the off-diagonal entries are the covariances between pairs of variables.
The first principal component direction w_1 is the unit vector that maximizes the variance of the projected data:
w_1 = argmax_{||w|| = 1} w^T C w
By the theory of Lagrange multipliers, the solution satisfies C w_1 = lambda_1 w_1, meaning w_1 is the eigenvector of C corresponding to the largest eigenvalue lambda_1. The variance of the data projected onto w_1 equals lambda_1.
The k-th principal component direction w_k is the unit vector that maximizes variance subject to being orthogonal to all previous directions w_1, ..., w_{k-1}:
w_k = argmax_{||w|| = 1, w^T w_j = 0 for j < k} w^T C w
The solution is the eigenvector of C corresponding to the k-th largest eigenvalue lambda_k.
Since C is symmetric and positive semidefinite, it admits the eigendecomposition:
C = W Lambda W^T
where W = [w_1, w_2, ..., w_p] is an orthogonal matrix whose columns are the eigenvectors, and Lambda = diag(lambda_1, lambda_2, ..., lambda_p) is a diagonal matrix of eigenvalues sorted in decreasing order (lambda_1 >= lambda_2 >= ... >= lambda_p >= 0).
The principal component scores (the transformed data) are computed as:
T = X_centered * W
The columns of T are the principal components. Each principal component t_k = X_centered * w_k is a linear combination of the original centered variables.
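The derivation above (centering, covariance, eigendecomposition, projection) can be sketched in NumPy. This is an illustrative walkthrough on synthetic data, not an optimized implementation:

```python
import numpy as np

# Synthetic data: 200 observations of 3 correlated variables
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) @ np.array([[3.0, 0.0, 0.0],
                                          [1.0, 1.0, 0.0],
                                          [0.0, 0.0, 0.2]])

# Center by subtracting the column means
X_centered = X - X.mean(axis=0)

# Sample covariance matrix C = X_centered^T X_centered / (n - 1)
n = X.shape[0]
C = X_centered.T @ X_centered / (n - 1)

# Eigendecomposition; eigh returns eigenvalues in ascending order,
# so reorder to get lambda_1 >= lambda_2 >= lambda_3
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
eigvals, W = eigvals[order], eigvecs[:, order]

# Principal component scores T = X_centered W
T = X_centered @ W

# The variance of the k-th score column equals lambda_k
assert np.allclose(T.var(axis=0, ddof=1), eigvals)
```

The final assertion checks the central fact from the derivation: projecting onto w_k yields a variable whose variance is exactly the eigenvalue lambda_k.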
Pearson's formulation seeks the rank-L approximation of the data matrix that minimizes reconstruction error. If we project the data onto L principal component directions and then project back, the reconstruction is:
X_hat = T_L * W_L^T
where T_L contains the first L columns of T and W_L contains the first L columns of W. The Eckart-Young-Mirsky theorem guarantees that this rank-L approximation minimizes the squared Frobenius norm of the difference X_centered - X_hat among all rank-L matrices.
PCA is closely related to the SVD of the centered data matrix. The SVD of X_centered is:
X_centered = U Sigma V^T
where U is n x n orthogonal, Sigma is n x p diagonal with singular values sigma_1 >= sigma_2 >= ... >= sigma_{min(n,p)} >= 0, and V is p x p orthogonal.
The connections are:
| PCA quantity | SVD equivalent |
|---|---|
| Eigenvectors of covariance matrix (W) | Right singular vectors (V) |
| Eigenvalues of covariance matrix (lambda_k) | sigma_k^2 / (n - 1) |
| Principal component scores (T) | U * Sigma |
| Loading matrix | V |
In practice, computing PCA via SVD of X_centered is numerically more stable than explicitly forming and eigendecomposing the covariance matrix C, especially when p is large. Most software implementations (NumPy, scikit-learn, MATLAB, R) use SVD internally.
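The correspondences in the table can be verified numerically; a small NumPy sketch with synthetic data:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
Xc = X - X.mean(axis=0)
n = Xc.shape[0]

# Route 1: eigendecomposition of the covariance matrix
C = Xc.T @ Xc / (n - 1)
eigvals = np.sort(np.linalg.eigvalsh(C))[::-1]

# Route 2: SVD of the centered data matrix
U, sigma, Vt = np.linalg.svd(Xc, full_matrices=False)

# Eigenvalues: lambda_k = sigma_k^2 / (n - 1)
assert np.allclose(eigvals, sigma**2 / (n - 1))

# Scores: T = Xc V = U Sigma
assert np.allclose(Xc @ Vt.T, U * sigma)
```

Both routes agree to floating-point precision, but the SVD route never forms the p x p matrix C, which is why it is preferred in practice.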
The proportion of total variance captured by the k-th principal component is:
PVE_k = lambda_k / (lambda_1 + lambda_2 + ... + lambda_p)
The cumulative proportion of variance explained by the first L components is:
CPVE_L = (lambda_1 + lambda_2 + ... + lambda_L) / (lambda_1 + lambda_2 + ... + lambda_p)
This quantity is central to choosing the number of components to retain.
The standard PCA procedure, as implemented in most software libraries, follows these steps:
Standardize the data (optional but recommended). If the variables are measured on different scales, standardize each variable to have zero mean and unit variance. This prevents variables with larger scales from dominating the principal components. When all variables share the same units and scale, centering alone (subtracting the mean) is sufficient.
Compute the covariance (or correlation) matrix. For centered data X_centered, compute C = (1/(n-1)) * X_centered^T * X_centered. If the data was standardized, this equals the correlation matrix.
Compute eigenvalues and eigenvectors. Solve C w = lambda w to obtain eigenvalue-eigenvector pairs. Sort them in decreasing order of eigenvalue. Alternatively (and more commonly), compute the SVD of X_centered directly.
Choose the number of components L. Use one or more criteria (see next section) to decide how many components to retain.
Form the projection matrix. Construct W_L = [w_1, ..., w_L], the p x L matrix of the first L eigenvectors.
Transform the data. Compute the L-dimensional representation: T_L = X_centered * W_L.
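The six steps above can be traced in a short NumPy sketch. The synthetic data and the 90% variance threshold are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(2)
# Four variables with very different scales
X = rng.normal(size=(150, 4)) * np.array([100.0, 1.0, 0.5, 10.0])
n = X.shape[0]

# Step 1: standardize each column to zero mean, unit variance
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Steps 2-3: SVD of the standardized data (preferred over forming C)
U, s, Vt = np.linalg.svd(Z, full_matrices=False)
eigvals = s**2 / (n - 1)

# Step 4: retain enough components to explain 90% of total variance
cpve = np.cumsum(eigvals) / eigvals.sum()
L = int(np.searchsorted(cpve, 0.90) + 1)

# Steps 5-6: projection matrix and the L-dimensional representation
W_L = Vt[:L].T          # p x L
T_L = Z @ W_L           # n x L
assert T_L.shape == (150, L)
```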
The time complexity depends on the method used:
| Method | Time complexity | Best when |
|---|---|---|
| Full eigendecomposition of covariance matrix | O(p^3) after O(np^2) to form covariance matrix | p is small |
| Full SVD of data matrix | O(min(n^2 p, n p^2)) | n and p are moderate |
| Truncated SVD (e.g., Lanczos, ARPACK) | O(npL) for L components | L is much smaller than min(n, p) |
| Randomized SVD | O(npL) with small constants | Very large datasets |
Selecting the right number of components L is a practical decision that balances dimensionality reduction against information loss. Several criteria are commonly used.
A scree plot graphs the eigenvalues (or proportion of variance explained) against the component number. The name comes from the geological term for rubble at the base of a cliff. The plot typically starts high on the left and drops off, eventually flattening into a "tail" of small eigenvalues. The analyst looks for the "elbow" or "knee" point where the curve transitions from steep decline to a flat tail. Components before the elbow are retained; those after it are discarded. The method is subjective but widely used as a visual aid.
The Kaiser-Guttman criterion retains all components whose eigenvalue exceeds 1 (when PCA is applied to the correlation matrix, i.e., standardized data). The rationale is that a component with an eigenvalue below 1 explains less variance than a single original variable and is therefore not worth retaining. This rule was proposed by Henry Kaiser in 1960 and is one of the most commonly applied stopping rules, though it can overestimate the number of components in datasets with many variables.
A common rule of thumb retains enough components to explain a specified percentage of total variance, often 80%, 90%, or 95%. For example, if the first five components collectively explain 92% of the total variance, an analyst targeting 90% would retain five components. This approach is straightforward but arbitrary in the choice of threshold.
Parallel analysis (Horn, 1965) compares the observed eigenvalues to eigenvalues from random data of the same dimensions. Components are retained only if their eigenvalue exceeds the corresponding value from the random data. This method has been shown to outperform both the scree plot and the Kaiser criterion in simulation studies.
In predictive modeling settings, the number of components can be chosen by cross-validation. The data is split into training and test folds, PCA is fit on the training data, and reconstruction error or downstream task performance is evaluated on the test data for different values of L. The value of L that minimizes test error (or maximizes task performance) is selected.
| Method | Type | Strengths | Weaknesses |
|---|---|---|---|
| Scree plot | Visual | Intuitive, widely understood | Subjective, ambiguous elbow |
| Kaiser criterion | Rule-based | Simple to apply | Can overestimate; only valid for correlation matrix |
| Cumulative variance | Threshold | Easy to explain | Threshold choice is arbitrary |
| Parallel analysis | Statistical | Empirically validated | Requires generating random data; less well known |
| Cross-validation | Data-driven | Accounts for downstream task | Computationally expensive |
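Of these methods, parallel analysis is the least standardized across software. A minimal NumPy sketch of Horn's procedure follows; the simulation count and the use of mean simulated eigenvalues (rather than a percentile) are conventional choices, not the only valid ones:

```python
import numpy as np

def parallel_analysis(X, n_sim=50, seed=0):
    """Retain components whose correlation-matrix eigenvalue exceeds
    the mean eigenvalue from random normal data of the same shape."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    obs = np.sort(np.linalg.eigvalsh(np.corrcoef(X, rowvar=False)))[::-1]
    sim = np.empty((n_sim, p))
    for i in range(n_sim):
        R = rng.normal(size=(n, p))
        sim[i] = np.sort(np.linalg.eigvalsh(np.corrcoef(R, rowvar=False)))[::-1]
    return int(np.sum(obs > sim.mean(axis=0)))

# Six variables driven by two latent factors plus noise
rng = np.random.default_rng(3)
F = rng.normal(size=(300, 2))
X = F @ rng.normal(size=(2, 6)) + 0.3 * rng.normal(size=(300, 6))
k = parallel_analysis(X)   # typically recovers the two factors
```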
The classical PCA algorithm assumes linearity, works in batch mode, and produces dense loadings. Several extensions relax these assumptions.
Kernel PCA, introduced by Schölkopf, Smola, and Müller in 1998 ("Nonlinear Component Analysis as a Kernel Eigenvalue Problem," Neural Computation), extends PCA to capture nonlinear structure in data. Instead of computing the covariance matrix in the original input space, kernel PCA implicitly maps the data into a high-dimensional (possibly infinite-dimensional) feature space via a kernel function and performs PCA there.
The key insight is that PCA in the feature space can be expressed entirely in terms of inner products between data points. The kernel trick replaces these inner products with a kernel function k(x_i, x_j) = phi(x_i)^T phi(x_j), where phi is the (implicit) mapping. This avoids computing the feature space coordinates explicitly.
The algorithm constructs an n x n kernel matrix K with entries K_{ij} = k(x_i, x_j), centers it, and eigendecomposes it. Common kernel choices include the radial basis function (RBF, or Gaussian) kernel, the polynomial kernel, and the sigmoid kernel.
Kernel PCA is useful when the data lies on a curved manifold and linear PCA fails to capture the underlying structure. However, it scales as O(n^2) in memory and O(n^3) in time due to the n x n kernel matrix, making it impractical for very large datasets without approximation techniques.
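The algorithm described above can be sketched directly in NumPy. The RBF bandwidth and the concentric-rings test data below are illustrative choices:

```python
import numpy as np

def rbf_kernel_pca(X, gamma=1.0, n_components=2):
    # Kernel matrix: K_ij = exp(-gamma * ||x_i - x_j||^2)
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-gamma * sq)

    # Center the kernel matrix in feature space
    n = K.shape[0]
    one = np.full((n, n), 1.0 / n)
    Kc = K - one @ K - K @ one + one @ K @ one

    # Eigendecompose and project the training points
    eigvals, eigvecs = np.linalg.eigh(Kc)
    idx = np.argsort(eigvals)[::-1][:n_components]
    alphas = eigvecs[:, idx] / np.sqrt(np.maximum(eigvals[idx], 1e-12))
    return Kc @ alphas

# Two concentric rings: no linear projection separates them
rng = np.random.default_rng(4)
t = rng.uniform(0.0, 2.0 * np.pi, 200)
r = np.repeat([1.0, 3.0], 100)
X = np.c_[r * np.cos(t), r * np.sin(t)] + 0.05 * rng.normal(size=(200, 2))
scores = rbf_kernel_pca(X, gamma=2.0, n_components=2)
```

Note that the entire computation operates on the n x n matrix Kc, which is the source of the O(n^2) memory and O(n^3) time costs discussed below.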
Probabilistic PCA (PPCA), developed by Tipping and Bishop (1999) in their paper "Probabilistic Principal Component Analysis" (Journal of the Royal Statistical Society, Series B), formulates PCA as a latent variable model. The generative model assumes:
x = W z + mu + epsilon
where z is an L-dimensional latent variable drawn from a standard normal distribution N(0, I), W is a p x L loading matrix, mu is the mean vector, and epsilon is isotropic Gaussian noise with variance sigma^2.
The parameters W and sigma^2 can be estimated by maximum likelihood. The maximum likelihood solution for W is related to the leading eigenvectors of the covariance matrix, scaled by the corresponding eigenvalues minus the noise variance.
Probabilistic PCA has several advantages over classical PCA: it provides a proper likelihood function for model comparison, it handles missing data naturally through the EM algorithm, and it can be extended to mixture models for clustering. The expectation-maximization (EM) algorithm for PPCA is computationally attractive when p is much larger than L, because it avoids the full p x p eigendecomposition.
Sparse PCA modifies the standard PCA objective to produce principal components with sparse loadings (many zero entries). Standard PCA loadings are dense, making interpretation difficult when p is large. Sparse PCA adds a sparsity-inducing penalty (typically an L1 or elastic net penalty) to the optimization problem.
The formulation by Zou, Hastie, and Tibshirani (2006), titled "Sparse Principal Component Analysis" (Journal of Computational and Graphical Statistics), recasts PCA as a regression problem and adds an elastic net penalty:
minimize ||X - X B A^T||_F^2 + lambda * sum_j ||b_j||_2^2 + sum_j lambda_{1,j} * ||b_j||_1
subject to A^T A = I, where the columns b_j of B play the role of (sparse) loading vectors.
The resulting components are easier to interpret because each component depends on only a subset of the original variables. This is valuable in fields like genomics, where identifying which genes contribute to each component is scientifically meaningful.
Incremental PCA (IPCA) processes the data in mini-batches rather than loading the entire dataset into memory. It updates the eigendecomposition incrementally as each batch arrives. This makes PCA feasible for datasets that do not fit in memory. The algorithm produces results that closely approximate batch PCA. Implementations are available in scikit-learn (IncrementalPCA), where it accepts a batch_size parameter controlling how many observations are processed at a time.
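A usage sketch with scikit-learn's IncrementalPCA; the batch sizes and data dimensions here are arbitrary, and the random batches stand in for chunks streamed from disk:

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

rng = np.random.default_rng(5)
ipca = IncrementalPCA(n_components=3, batch_size=200)

# Feed the data in chunks, as if streaming from disk
for _ in range(5):
    batch = rng.normal(size=(200, 10))
    ipca.partial_fit(batch)

# Project new observations onto the learned components
T = ipca.transform(rng.normal(size=(50, 10)))
assert T.shape == (50, 3)
```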
Robust PCA addresses the sensitivity of classical PCA to outliers. The best-known formulation, by Candès, Li, Ma, and Wright (2011), decomposes the data matrix into a low-rank component plus a sparse component:
X = L + S
where L is low-rank (representing the principal components) and S is sparse (capturing outliers or corrupted entries). The decomposition is obtained by solving a convex optimization problem known as principal component pursuit (PCP). Robust PCA has applications in video surveillance (separating static background from moving foreground), face recognition, and any setting where data corruption is expected.
PCA is applied across a remarkably wide range of disciplines. This section covers some of the most prominent use cases.
One of the most celebrated applications of PCA in computer vision is the eigenfaces method for face recognition, introduced by Turk and Pentland in 1991. The approach treats each face image (say, 100 x 100 pixels) as a point in a 10,000-dimensional space. PCA is used to find the principal components of the set of training face images. These principal components, when reshaped back into image form, resemble ghostly faces and are therefore called "eigenfaces."
To recognize a new face, the system projects it onto the eigenface basis and compares its coordinates to those of known individuals using a distance metric. The method reduces the storage and computation required from thousands of pixel values to tens or hundreds of component scores.
Limitations of the eigenfaces method include sensitivity to lighting, pose, and facial expression. The most significant eigenfaces often encode illumination variation rather than identity-specific features. Subsequent methods such as Fisherfaces (using linear discriminant analysis) and deep learning approaches have improved on eigenfaces, but the method remains an instructive example of PCA in practice.
PCA is a standard tool in genomics for analyzing genetic variation across populations. In genome-wide association studies (GWAS), PCA is used to correct for population stratification, where systematic ancestry differences between case and control groups can cause spurious associations between genetic variants and disease.
The landmark paper by Price et al. (2006), "Principal components analysis corrects for stratification in genome-wide association studies" (Nature Genetics), demonstrated that including the top principal components as covariates in regression models effectively controls for confounding due to population structure. Typically, the top 10 to 20 principal components of a genotype matrix (individuals x SNPs) are included as covariates.
PCA of genetic data also reveals population structure: when the first two or three principal components are plotted, individuals from different geographic regions or ethnic groups form distinct clusters. This makes PCA a valuable exploratory tool in population genetics.
In natural language processing, PCA and the closely related truncated SVD are used to reduce the dimensionality of word and document embeddings, to visualize embedding spaces in two or three dimensions, and, in the form of latent semantic analysis (LSA), to factor term-document matrices into a small number of semantic dimensions.
In quantitative finance, PCA is used to analyze the covariance structure of asset returns. The first few principal components of bond yield curves, for example, correspond to interpretable factors: the first component captures the overall level of interest rates, the second captures the slope (difference between long- and short-term rates), and the third captures curvature. PCA-based factor models are used for risk management, portfolio construction, and yield curve modeling.
Neuroscience researchers use PCA to analyze high-dimensional neural recordings. When recording from hundreds or thousands of neurons simultaneously, PCA identifies the dominant patterns of co-activation. The resulting low-dimensional "neural trajectories" reveal how brain activity evolves during sensory processing, motor planning, and decision-making. PCA is often the first step before applying more specialized methods like factor analysis or Gaussian process factor analysis.
PCA can be used for lossy image compression. By expressing an image (or a collection of image patches) in terms of principal components and retaining only the top L components, the image can be stored using far fewer numbers than the original pixel values. While PCA-based compression is not competitive with modern codecs (JPEG, WebP), it illustrates the principle of optimal linear approximation and is commonly used as a teaching example.
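A toy illustration of the idea, using a synthetic "image" and a rank-L SVD approximation (real codecs work very differently; this only demonstrates optimal linear approximation):

```python
import numpy as np

rng = np.random.default_rng(6)
# Stand-in image: a smooth, nearly low-rank pattern plus a little noise
u = np.sin(np.linspace(0.0, 3.0, 64))
v = np.cos(np.linspace(0.0, 3.0, 64))
img = np.outer(u, v) + 0.01 * rng.normal(size=(64, 64))

# Keep only the top L singular components
U, s, Vt = np.linalg.svd(img, full_matrices=False)
L = 5
img_hat = (U[:, :L] * s[:L]) @ Vt[:L]

# Storage: roughly L * (64 + 64 + 1) numbers instead of 64 * 64
rel_err = np.linalg.norm(img - img_hat) / np.linalg.norm(img)
assert rel_err < 0.05
```

Because the underlying pattern is essentially rank one, five components reconstruct it almost perfectly; natural images need far more components, which is why PCA compression is mainly pedagogical.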
Climate scientists use empirical orthogonal functions (the climate science name for PCA) to identify dominant spatial patterns of variability in atmospheric and oceanic data. For example, EOF analysis of sea surface temperature data reveals patterns such as the El Niño-Southern Oscillation (ENSO), the Pacific Decadal Oscillation, and the Atlantic Multidecadal Oscillation.
PCA is connected to many other techniques in linear algebra, statistics, and machine learning.
As described above, PCA is essentially equivalent to truncated SVD applied to centered data. The distinction in software (e.g., scikit-learn's PCA vs. TruncatedSVD) is that PCA centers the data first while TruncatedSVD does not, making the latter suitable for sparse matrices where centering would destroy sparsity.
Factor analysis and PCA are often confused, but they rest on different models. PCA is a variance-maximizing projection; it does not assume a generative model. Factor analysis assumes that the observed variables are linear combinations of a smaller number of latent factors plus variable-specific noise. Factor analysis estimates factor loadings, communalities, and unique variances. In practice, PCA and factor analysis often produce similar results when the noise variance is small relative to the signal, but they answer different questions.
Linear discriminant analysis is a supervised dimensionality reduction method that seeks projections maximizing class separation rather than total variance. While PCA is unsupervised, LDA uses class labels. For classification tasks, LDA projections often outperform PCA projections because PCA may discard low-variance directions that happen to carry discriminative information.
A linear autoencoder with a single hidden layer and mean squared error loss learns the same subspace as PCA. Specifically, the weights of such an autoencoder converge to span the principal subspace (though the individual weight vectors are not necessarily the eigenvectors themselves). Nonlinear autoencoders with multiple hidden layers and nonlinear activation functions generalize PCA by capturing nonlinear relationships, similar in spirit to kernel PCA but with greater flexibility. Autoencoders trained by backpropagation can handle much larger datasets and more complex data distributions than kernel PCA.
Independent component analysis seeks components that are statistically independent rather than merely uncorrelated. While PCA components are uncorrelated (second-order statistics), ICA components are independent (all orders of statistics). ICA is commonly used in signal processing applications such as blind source separation, for example separating mixed audio signals or removing artifacts from EEG recordings.
Classical MDS applied to a Euclidean distance matrix produces the same embedding as PCA (up to rotation and reflection). MDS generalizes to non-Euclidean dissimilarities, making it useful when only pairwise distances (not feature vectors) are available.
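This equivalence is easy to check numerically. A sketch on a small synthetic dataset (each embedding axis is determined only up to sign, which the comparison accounts for):

```python
import numpy as np

rng = np.random.default_rng(9)
X = rng.normal(size=(40, 3))
Xc = X - X.mean(axis=0)

# PCA scores via SVD: T = U Sigma
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
T = U * s

# Classical MDS: double-center the squared-distance matrix
D2 = np.sum((Xc[:, None] - Xc[None, :]) ** 2, axis=-1)
n = len(X)
J = np.eye(n) - np.full((n, n), 1.0 / n)
B = -0.5 * J @ D2 @ J          # equals the Gram matrix Xc Xc^T
eigvals, eigvecs = np.linalg.eigh(B)
idx = np.argsort(eigvals)[::-1][:3]
Y = eigvecs[:, idx] * np.sqrt(np.maximum(eigvals[idx], 0.0))

# Identical embeddings up to a sign flip of each axis
for k in range(3):
    assert np.allclose(np.abs(T[:, k]), np.abs(Y[:, k]), atol=1e-6)
```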
Mean centering is a prerequisite for PCA. Without centering, the first principal component tends to point toward the data centroid rather than capturing the direction of maximum variance. Virtually all PCA implementations center the data automatically.
When variables are measured on different scales (e.g., height in centimeters and weight in kilograms), variables with larger numerical ranges dominate the principal components. Standardizing each variable to zero mean and unit variance (i.e., working with the correlation matrix instead of the covariance matrix) prevents this. As a rule of thumb: if the variables have the same units, use the covariance matrix; if they have different units or wildly different scales, use the correlation matrix.
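The effect is easy to demonstrate. In this sketch (synthetic heights and weights), merely changing height's unit from centimeters to millimeters redirects the first principal component:

```python
import numpy as np

rng = np.random.default_rng(8)
height_cm = rng.normal(170.0, 10.0, 500)
weight_kg = rng.normal(70.0, 15.0, 500)

def first_pc(M):
    Mc = M - M.mean(axis=0)
    _, _, Vt = np.linalg.svd(Mc, full_matrices=False)
    return Vt[0]

# In cm, weight (std 15) has more variance than height (std 10),
# so the first PC points mostly along the weight axis
X_cm = np.c_[height_cm, weight_kg]
assert abs(first_pc(X_cm)[1]) > 0.9

# Same data with height in millimeters: height now dominates
X_mm = np.c_[10.0 * height_cm, weight_kg]
assert abs(first_pc(X_mm)[0]) > 0.99
```

Standardizing both columns before the SVD (the correlation-matrix route) makes the result identical for either choice of units.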
Classical PCA requires complete data. Common strategies for handling missing values include deleting incomplete observations, imputing missing entries (with column means or an iterative model-based method), and using formulations that accommodate missingness directly, such as the EM algorithm for probabilistic PCA.
The loading matrix W shows how each original variable contributes to each principal component. A large absolute loading indicates that the variable is strongly associated with that component. Analysts examine loadings to give interpretive labels to components (e.g., "size factor" vs. "shape factor" in morphometrics). However, loadings from standard PCA are typically dense and hard to interpret, which motivates sparse PCA.
Given the truncated scores T_L and loadings W_L, the original data can be approximately reconstructed:
X_reconstructed = T_L * W_L^T + mu
The reconstruction error is the sum of the discarded eigenvalues: sum_{k=L+1}^{p} lambda_k. This provides a natural measure of information loss.
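This identity can be verified directly. A NumPy sketch, using the (n - 1)-normalized covariance convention from earlier in the article:

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(120, 6)) @ rng.normal(size=(6, 6))
n = X.shape[0]
mu = X.mean(axis=0)
Xc = X - mu

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
eigvals = s**2 / (n - 1)

# Reconstruct from the first L components, then add the mean back
L = 3
T_L = Xc @ Vt[:L].T
X_rec = T_L @ Vt[:L] + mu

# Normalized squared reconstruction error equals the sum of the
# discarded eigenvalues lambda_{L+1} + ... + lambda_p
err = np.sum((X - X_rec) ** 2) / (n - 1)
assert np.allclose(err, eigvals[L:].sum())
```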
Despite its wide use, PCA has well-known limitations.
Linearity. PCA finds linear projections. If the data lies on a curved manifold, PCA may not capture the underlying structure. Kernel PCA, t-SNE, UMAP, and autoencoders are better suited to nonlinear data.
Variance does not equal importance. PCA assumes that the directions of greatest variance are the most informative. This is not always true. In classification tasks, for example, the most discriminative features may lie along low-variance directions that PCA would discard.
Sensitivity to outliers. Because PCA minimizes squared distances, outliers can disproportionately influence the results. A single extreme observation can distort the first principal component. Robust PCA methods address this.
Scale dependence. As noted above, PCA is sensitive to the relative scaling of variables. The results change if a variable is rescaled from meters to millimeters.
Orthogonality constraint. All principal components must be mutually orthogonal. This is mathematically convenient but may not reflect the true structure of the data. Oblique rotation methods from factor analysis relax this constraint.
Interpretability. Each principal component is a linear combination of all original variables, making it difficult to assign a clear meaning to each component. Sparse PCA addresses this issue.
Stationarity assumption. When applied to time series data, standard PCA assumes that the covariance structure is constant over time. If the data is non-stationary, the principal components may be misleading.
No class information. PCA is unsupervised and ignores class labels. For supervised learning tasks, methods like LDA or supervised PCA may be more appropriate.
PCA is available in all major scientific computing environments.
| Library / Tool | Language | Function / Class | Notes |
|---|---|---|---|
| scikit-learn | Python | sklearn.decomposition.PCA | Also offers IncrementalPCA, KernelPCA, SparsePCA |
| NumPy | Python | numpy.linalg.svd or numpy.linalg.eigh | Low-level; users must center data manually |
| R | R | prcomp(), princomp() | prcomp uses SVD (preferred); princomp uses eigendecomposition |
| MATLAB | MATLAB | pca() | Built-in since R2012b |
| Spark MLlib | Scala / Python | PCA in pyspark.ml.feature | Distributed PCA for big data |
| TensorFlow | Python | tf.linalg.svd | Can run on GPU |
| Julia | Julia | MultivariateStats.fit(PCA, X) | Part of the MultivariateStats.jl package |