Principal component analysis (PCA) is an unsupervised learning technique for dimensionality reduction that identifies the orthogonal directions of maximum variance in high-dimensional data and projects observations onto a lower-dimensional subspace spanned by those directions. It transforms a set of possibly correlated variables into a set of linearly uncorrelated variables called principal components, ordered so that the first component captures the greatest variance, the second captures the next most under an orthogonality constraint, and so on. PCA is one of the oldest and most widely used methods in machine learning, statistics, and data analysis, with applications spanning noise reduction, visualization, feature engineering, face recognition, population genetics, finance, and the inspection of internal representations inside modern neural networks [1][2][5].
PCA can be understood from three equivalent viewpoints: as the eigendecomposition of the sample covariance matrix, as the singular value decomposition of the centered data matrix, or as the linear projection that minimizes squared reconstruction error. Each view leads to the same set of components, but they motivate different algorithms and different generalizations such as kernel PCA, probabilistic PCA, sparse PCA, and robust PCA. Despite being more than 120 years old, PCA remains a default first-pass tool for exploratory analysis, a workhorse preprocessing step in many ML pipelines, and a building block inside more recent methods.
PCA was invented by Karl Pearson in 1901 in his paper "On Lines and Planes of Closest Fit to Systems of Points in Space," published in The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, volume 2, issue 11, pages 559 to 572 [1]. Pearson approached the problem from a purely geometric perspective: given a cloud of points in space, what line or plane best represents the data in the sense of minimizing the sum of squared perpendicular distances from the points to the fit? He framed the technique as an extension of the principal axis theorem in mechanics. Although Pearson derived what we now call the first principal component and discussed extensions to higher dimensions, he never used the term "principal components," and his treatment was geometric rather than statistical. Pearson's paper predates the wide availability of mechanical and electronic computing, so the method existed as a mathematical idea decades before practical large-scale use became possible.
Harold Hotelling independently rediscovered and formalized the method in 1933 in his paper "Analysis of a Complex of Statistical Variables into Principal Components," published in two parts in the Journal of Educational Psychology, volume 24, pages 417 to 441 and 498 to 520 [2]. Hotelling was motivated by problems in psychometrics, in particular the question of whether multiple measured indicators of mental ability (reading speed, arithmetic skill, memory tests, and so on) could be summarized by a smaller number of latent factors. He gave the technique its modern name, recast it in statistical terms as the search for orthogonal linear combinations of the original variables with maximum variance, and explicitly developed the eigenvalue and eigenvector machinery that is standard today. Hotelling also worked out the algebraic and inferential structure of the method in considerable detail, including computational shortcuts suitable for the desk calculators of the era.
Throughout the mid-twentieth century, PCA spread through psychometrics (where it was often confused with the closely related but distinct factor analysis), meteorology (where the components are sometimes called empirical orthogonal functions, or EOFs), and economics. The method became practical for moderately sized problems with the arrival of electronic computers in the 1960s and 1970s, and it became a textbook fixture in multivariate statistics. The publication of Ian Jolliffe's monograph Principal Component Analysis in 1986 (with a substantially expanded second edition in 2002) consolidated the method's place as a foundational tool [5]. By the 1990s and 2000s, PCA had become a default preprocessing step in many machine learning pipelines, and a wave of variants extended it to nonlinear, sparse, probabilistic, robust, and streaming settings.
PCA can be derived in several equivalent ways. The two most common derivations are the variance-maximization view and the reconstruction-error-minimization view; both produce the same components.
Let X be an n by p data matrix with one observation per row and one feature per column. The sample mean of feature j is the average of column j, and the centered data matrix is
X_c = X - 1 mu^T
where 1 is an n by 1 vector of ones and mu is the p by 1 vector of feature means. After centering, every feature has zero mean. Centering is essential because PCA is defined in terms of variance about the mean; without it, the first component would tend to point toward the centroid of the data rather than capturing the dominant axis of variation. When features are measured on different scales, it is also standard to divide each feature by its sample standard deviation, producing a standardized data matrix; otherwise large-scale features dominate the covariance matrix purely because of their units.
The p by p sample covariance matrix is
C = (1 / (n - 1)) X_c^T X_c
The diagonal entries of C are the per-feature variances and the off-diagonal entries are pairwise covariances. Because C is symmetric and positive semi-definite, it admits an eigendecomposition
C = V Lambda V^T
where V is a p by p orthogonal matrix whose columns are the eigenvectors of C and Lambda is a diagonal matrix whose entries lambda_1 >= lambda_2 >= ... >= lambda_p >= 0 are the corresponding eigenvalues sorted in descending order. The columns of V are the principal axes (or loadings) and the eigenvalue lambda_k is exactly the variance of the data along the k-th axis. The first k columns of V form the projection matrix W_k, and the projected data are
Z = X_c W_k
Each row of Z is a k-dimensional principal component score for the corresponding observation. The total variance of the data equals the trace of C, which equals the sum of the eigenvalues, so the fraction of total variance retained by the first k components is
explained variance ratio = (lambda_1 + ... + lambda_k) / (lambda_1 + ... + lambda_p)
This ratio is the most common quantitative criterion for choosing k.
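The derivation above condenses to a few lines of NumPy. The following is a minimal sketch of the covariance-eigendecomposition route (the function name and toy dimensions are illustrative, not from any library):

```python
import numpy as np

def pca_eig(X, k):
    """PCA via eigendecomposition of the sample covariance matrix."""
    mu = X.mean(axis=0)                   # p-vector of feature means
    Xc = X - mu                           # centered data matrix
    C = Xc.T @ Xc / (X.shape[0] - 1)      # p x p sample covariance
    lam, V = np.linalg.eigh(C)            # eigh returns ascending eigenvalues
    order = np.argsort(lam)[::-1]         # reorder to descending
    lam, V = lam[order], V[:, order]
    W_k = V[:, :k]                        # principal axes / loadings (p x k)
    Z = Xc @ W_k                          # principal component scores (n x k)
    evr = lam[:k].sum() / lam.sum()       # explained variance ratio
    return W_k, Z, evr
```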
PCA can also be derived as a sequence of constrained optimization problems. The first principal axis is the unit vector w_1 that maximizes the variance of the projected scores:
w_1 = argmax over ||w|| = 1 of Var(X_c w) = argmax over ||w|| = 1 of w^T C w
The Lagrangian for this constrained problem is L(w, lambda) = w^T C w - lambda (||w||^2 - 1), and setting its gradient to zero gives C w = lambda w, the eigenvalue equation. The solution is the eigenvector of C with the largest eigenvalue, and the maximized variance equals lambda_1. The k-th principal axis solves the same problem subject to the additional constraint that w_k is orthogonal to w_1, ..., w_(k-1), and the solution is the eigenvector with the k-th largest eigenvalue.
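The variance-maximization view also yields a direct algorithm: because w_1 is the top eigenvector of C, power iteration on the covariance matrix converges to it whenever lambda_1 > lambda_2. A minimal sketch (illustrative only; production libraries use the SVD route described below):

```python
import numpy as np

def first_axis_power_iteration(Xc, n_iter=200, seed=0):
    """Find w_1 = argmax over ||w|| = 1 of w^T C w by power iteration.

    Xc is assumed already centered. Repeatedly applying C to a random
    vector and renormalizing converges to the top eigenvector when
    lambda_1 > lambda_2 (the convergence rate depends on the gap).
    """
    n, p = Xc.shape
    C = Xc.T @ Xc / (n - 1)
    w = np.random.default_rng(seed).normal(size=p)
    for _ in range(n_iter):
        w = C @ w
        w /= np.linalg.norm(w)       # project back onto the unit sphere
    return w, w @ C @ w              # the axis and its variance (~ lambda_1)
```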
A third perspective characterizes PCA as the linear projection that minimizes squared reconstruction error. Among all rank-k linear projections, PCA produces the projection P_k = W_k W_k^T such that
E = sum over i of ||x_i - mu - P_k (x_i - mu)||^2
is minimized. This is a consequence of the Eckart-Young-Mirsky theorem on best low-rank approximation in the Frobenius norm. From this viewpoint PCA is a lossy compression scheme: storing the k component scores and the p by k loading matrix lets one approximately reconstruct each original observation, with mean squared reconstruction error equal to the sum of the discarded eigenvalues.
In practice, PCA is almost never computed by explicitly forming the covariance matrix and running an eigendecomposition. The preferred route is the singular value decomposition of the centered data matrix
X_c = U Sigma V^T
where U is n by min(n, p) with orthonormal columns, Sigma is a min(n, p) by min(n, p) diagonal matrix of non-negative singular values sigma_1 >= sigma_2 >= ..., and V is p by min(n, p) with orthonormal columns. Substituting into the covariance matrix gives
C = (1 / (n - 1)) V Sigma^2 V^T
so the right singular vectors V are exactly the principal axes and the eigenvalues are related to the singular values by lambda_k = sigma_k^2 / (n - 1). The principal component scores can be read off directly as Z = U_k Sigma_k, where U_k and Sigma_k are the truncations to the first k columns and rows.
Going through SVD is preferred for numerical reasons. Forming X_c^T X_c squares the condition number of X_c, so any rounding noise in the input is amplified when the covariance matrix is computed explicitly. SVD operates on X_c directly and is therefore much more accurate when features are nearly collinear or when n is much smaller than p. SVD is also the natural route when n is smaller than p: the p by p covariance matrix is then rank-deficient, but its nonzero eigenvalues coincide with those of the n by n Gram matrix X_c X_c^T, and SVD recovers them automatically. Finally, randomized SVD provides a fast route to the top components for large matrices (see the section on randomized PCA below).
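The identities above are easy to verify numerically. A short NumPy sketch on synthetic correlated data (shapes and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6)) @ rng.normal(size=(6, 6))  # correlated toy data
Xc = X - X.mean(axis=0)

# Thin SVD of the centered data matrix: X_c = U Sigma V^T.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Eigenvalues of C from the singular values: lambda_k = sigma_k^2 / (n - 1).
lam_from_svd = s**2 / (Xc.shape[0] - 1)
lam_direct = np.linalg.eigvalsh(Xc.T @ Xc / (Xc.shape[0] - 1))[::-1]
assert np.allclose(lam_from_svd, lam_direct)

# Scores read off as Z = U_k Sigma_k, identical to projecting onto V_k.
k = 2
Z = U[:, :k] * s[:k]
assert np.allclose(Z, Xc @ Vt[:k].T)
```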
Deciding how many components to keep is typically the most subjective step in a PCA workflow. Several rules of thumb are in widespread use, and a careful analyst usually consults more than one.
| Method | How it works | Strengths | Weaknesses |
|---|---|---|---|
| Cumulative variance threshold | Retain the smallest k such that cumulative explained variance reaches a target (commonly 90 or 95 percent) | Simple, interpretable, application-aligned | Threshold choice is arbitrary |
| Scree plot / elbow | Plot eigenvalues in descending order and pick the "elbow" where the curve flattens | Visual, often obvious in practice | Subjective; not all data have a clear elbow |
| Kaiser criterion | Keep components with eigenvalue greater than 1 on standardized data | No subjective threshold | Tends to over-retain components when p is large |
| Parallel analysis | Compare observed eigenvalues against eigenvalues from random data with the same shape and keep components that exceed the random baseline | Statistically motivated, robust in practice | Requires simulation or permutation |
| Cross-validation | Choose k by held-out reconstruction error or downstream task accuracy | Tied to the actual use case | Computationally expensive |
| Bayesian model selection | Place priors on component variances and let the data switch off unneeded components, as in Bayesian PCA | Automatic, principled | Requires fitting a probabilistic model |
The scree plot, originally proposed by Raymond Cattell in 1966, takes its name from the geological term for the loose rocks at the base of a cliff: meaningful components form the steep cliff face and the rest form a flat scree of small eigenvalues [3]. In modern practice, cumulative variance and cross-validation against a downstream task are the most common criteria; the Kaiser criterion is now widely viewed as too liberal except in tightly controlled psychometric settings.
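Parallel analysis in particular is straightforward to implement by simulation. A minimal sketch of the Horn-style procedure (function name and defaults are illustrative):

```python
import numpy as np

def parallel_analysis(X, n_sims=100, quantile=0.95, seed=0):
    """Keep component k if its observed eigenvalue exceeds the chosen
    quantile of eigenvalue k computed from random Gaussian data of the
    same shape (eigenvalues of the correlation matrix, descending)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    obs = np.linalg.eigvalsh(np.corrcoef(X, rowvar=False))[::-1]
    sims = np.empty((n_sims, p))
    for i in range(n_sims):
        R = rng.normal(size=(n, p))
        sims[i] = np.linalg.eigvalsh(np.corrcoef(R, rowvar=False))[::-1]
    threshold = np.quantile(sims, quantile, axis=0)  # random-data baseline
    return int(np.sum(obs > threshold))              # suggested k
```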
Once the principal components have been computed, several closely related transforms are useful in different contexts.
Reconstruction. Given the loadings W_k and the scores Z, one can approximately reconstruct the original data as X_hat = Z W_k^T + 1 mu^T. The expected squared reconstruction error per observation equals the sum of the discarded eigenvalues, lambda_(k+1) + ... + lambda_p. For images, audio, or other measurements, reconstruction with a small number of components produces a smoothed, denoised approximation that retains the dominant patterns while discarding small-amplitude variation.
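A sketch of the compress-and-reconstruct round trip, reusing the SVD route from the previous section, with a numerical check of the error identity:

```python
import numpy as np

def pca_reconstruct(X, k):
    """Rank-k PCA reconstruction X_hat = Z W_k^T + mu (mu broadcasts over rows)."""
    mu = X.mean(axis=0)
    Xc = X - mu
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    W_k = Vt[:k].T                        # p x k loadings
    Z = Xc @ W_k                          # n x k scores
    X_hat = Z @ W_k.T + mu                # smoothed, denoised approximation
    # Under the 1/(n-1) covariance convention, the total squared error
    # divided by (n - 1) equals the sum of the discarded eigenvalues.
    err = np.sum((X - X_hat) ** 2) / (X.shape[0] - 1)
    discarded = np.sum(s[k:] ** 2) / (X.shape[0] - 1)
    assert np.isclose(err, discarded)
    return X_hat
```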
PCA whitening. Whitening rescales the principal component scores so that each whitened feature has unit variance and the components are uncorrelated. The PCA whitening transform is
x_PCAwhite = Lambda^(-1/2) V^T x_c
where x_c is a centered observation. After this transform the covariance matrix of the whitened data is the identity, which is sometimes a desirable preprocessing step for downstream models that assume isotropic inputs [10].
ZCA whitening. ZCA (zero-phase component analysis) whitening uses an extra rotation back into the original feature basis:
x_ZCAwhite = V Lambda^(-1/2) V^T x_c
The covariance is still the identity, but ZCA-whitened images stay close to the originals in pixel space, whereas PCA-whitened images look scrambled because they are expressed in the eigenvector basis. ZCA whitening was popular as a preprocessing step for early convolutional networks on small image datasets such as CIFAR-10, and it is the unique whitening transform that minimizes mean squared distance from the original data [10].
Sphering and Mahalanobis distance. PCA whitening is equivalent to expressing distances in the Mahalanobis metric of the original data, which makes whitened Euclidean distance a natural similarity measure for downstream nearest-neighbor or clustering algorithms.
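Both whitening transforms share all of their work with a plain PCA and differ only in a final rotation, as the following minimal NumPy sketch shows (the eps guard against near-zero eigenvalues is a practical assumption, not part of the definition):

```python
import numpy as np

def whiten(X, mode="pca", eps=1e-8):
    """PCA or ZCA whitening of the rows of X (a minimal sketch)."""
    Xc = X - X.mean(axis=0)                          # center
    lam, V = np.linalg.eigh(Xc.T @ Xc / (X.shape[0] - 1))
    scale = 1.0 / np.sqrt(lam + eps)                 # Lambda^(-1/2)
    W = (V * scale).T                                # PCA: Lambda^(-1/2) V^T
    if mode == "zca":
        W = V @ W                                    # ZCA: V Lambda^(-1/2) V^T
    return Xc @ W.T                                  # whitened rows; covariance ~ I
```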
A large family of methods generalizes standard PCA to handle nonlinearity, sparsity, uncertainty, outliers, streaming data, very large matrices, and structured data such as functions or signals. The most influential are summarized below.
| Variant | Year and authors | Core idea | Typical use |
|---|---|---|---|
| Standard PCA | Pearson 1901; Hotelling 1933 | Eigendecomposition of covariance, or SVD of centered data | General linear dimensionality reduction |
| Kernel PCA | Schoelkopf, Smola, Mueller 1998 | PCA in a kernel-induced feature space via the kernel trick | Nonlinear feature extraction, denoising, novelty detection |
| Probabilistic PCA | Tipping and Bishop 1999 | Gaussian latent variable model with isotropic noise; closed-form maximum likelihood and EM | Missing data, mixture models, principled likelihood |
| Bayesian PCA | Bishop 1999 | Place priors over loadings, use ARD or variational inference to switch off unneeded components | Automatic dimensionality selection |
| Sparse PCA | Zou, Hastie, Tibshirani 2006 | Add an L1 penalty on loadings to force most entries to zero | Interpretable components in genomics, finance |
| Robust PCA | Candes, Li, Ma, Wright 2011 | Decompose a matrix into a low-rank part plus a sparse outlier part by minimizing nuclear norm plus L1 norm | Background subtraction in video, outlier-robust analysis |
| Incremental PCA | Levy and Lindenbaum 2000; Ross et al. 2008 | Update an existing PCA basis as new observations arrive in mini-batches | Streaming data, datasets too large to fit in memory |
| Randomized PCA | Halko, Martinsson, Tropp 2011 | Approximate the top components using random projections and a small number of passes over the matrix | Very large or out-of-core matrices |
| Functional PCA | Ramsay and Silverman 1990s onward | PCA on smooth functional observations such as growth curves or spectra | Functional data analysis |
| Common spatial patterns (CSP) | Koles 1990s; widely used in BCI | Joint diagonalization of two class-conditional covariance matrices; reduces to PCA when one covariance is the identity | EEG-based brain-computer interfaces |
Standard PCA only captures linear structure. Kernel PCA, introduced by Bernhard Schoelkopf, Alex Smola, and Klaus-Robert Mueller in 1998, lifts PCA into a high-dimensional feature space implicitly defined by a kernel function k(x, y), then runs PCA in that space [6]. Concretely, kernel PCA computes the n by n centered kernel matrix K with K_ij = k(x_i, x_j), eigendecomposes it, and uses the eigenvectors to express each principal component as a linear combination of the input observations. The components capture nonlinear directions of variance in the original space, including curved or circular structure that linear PCA cannot represent. Common kernels include the polynomial kernel (x^T y + c)^d, the radial basis function kernel exp(-gamma ||x - y||^2), and the sigmoid kernel tanh(alpha x^T y + c). The trade-off is that the n by n kernel matrix can be expensive to form and store for large datasets, the choice of kernel and its hyperparameters is itself a modeling problem, and there is no straightforward way to recover an explicit principal axis in the original feature space, which limits interpretability.
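A standard illustration uses two concentric circles, which linear PCA cannot separate. The sketch below uses scikit-learn's KernelPCA; the kernel width gamma is hand-picked for this toy geometry:

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA, PCA

# Two concentric rings: there is no linear direction that separates them,
# so linear PCA leaves the classes entangled while an RBF kernel unfolds them.
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

Z_lin = PCA(n_components=2).fit_transform(X)
Z_rbf = KernelPCA(n_components=2, kernel="rbf", gamma=10.0).fit_transform(X)

# After kernel PCA, the first component alone separates the two rings.
print(Z_rbf[y == 0, 0].mean(), Z_rbf[y == 1, 0].mean())
```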
Michael Tipping and Christopher Bishop's 1999 paper "Probabilistic Principal Component Analysis" recast PCA as a latent variable model with a Gaussian latent variable z of dimension k and a Gaussian observation model x = W z + mu + epsilon, where epsilon has isotropic noise with covariance sigma^2 I [7]. The maximum likelihood solution recovers the standard PCA loadings up to rotation, but the probabilistic formulation has several practical advantages: it provides a principled likelihood that can be used for model comparison, it can be fit with the expectation-maximization algorithm even when some entries of the data matrix are missing, it generalizes naturally to mixtures of probabilistic PCAs for clustering, and it sets the stage for Bayesian extensions in which priors over the loadings allow automatic relevance determination of unneeded components. Bishop's 1999 "Variational Principal Components" paper used variational inference to make the Bayesian version computationally efficient and to switch off unneeded components automatically through continuous hyperparameters rather than discrete model search [11].
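The maximum likelihood solution is available in closed form from the eigendecomposition, which the following sketch implements directly (following the formulas in [7]; the arbitrary rotation is taken to be the identity):

```python
import numpy as np

def ppca_ml(X, k):
    """Closed-form maximum likelihood PPCA (Tipping and Bishop), a sketch.

    sigma^2 is the average of the discarded eigenvalues, and the ML
    loadings are W = V_k (Lambda_k - sigma^2 I)^(1/2), up to rotation.
    """
    Xc = X - X.mean(axis=0)
    lam, V = np.linalg.eigh(Xc.T @ Xc / (X.shape[0] - 1))
    lam, V = lam[::-1], V[:, ::-1]               # descending order
    sigma2 = lam[k:].mean()                      # ML isotropic noise variance
    W = V[:, :k] * np.sqrt(lam[:k] - sigma2)     # ML loadings (rotation = I)
    return W, sigma2
```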
A standard principal component is a dense linear combination of all original features, which can be hard to interpret in domains such as genomics where the loadings on individual genes are themselves of scientific interest. Hui Zou, Trevor Hastie, and Robert Tibshirani's 2006 paper "Sparse Principal Component Analysis" reformulated PCA as a regression problem and added an elastic net penalty on the loadings, forcing most entries to zero and producing components with a small number of nonzero loadings [8]. The result is interpretable at the cost of slightly reduced explained variance and loss of strict orthogonality among the components. Sparse PCA is widely used in gene-expression analysis, finance, and any setting where the goal is feature selection within an unsupervised dimensionality reduction step.
Standard PCA is sensitive to outliers because a single anomalous observation can shift the covariance matrix substantially. Emmanuel Candes, Xiaodong Li, Yi Ma, and John Wright's 2011 paper "Robust Principal Component Analysis?" proposed decomposing a data matrix M into a low-rank component L and a sparse component S by solving the convex program known as Principal Component Pursuit, which minimizes the nuclear norm of L (a convex surrogate for rank) plus a weighted L1 norm of S (a convex surrogate for sparsity) [9]. Under suitable incoherence conditions, exact recovery is guaranteed even when a positive fraction of the entries of M are arbitrarily corrupted. Robust PCA has become a standard tool for background subtraction in video (where the static background is the low-rank part and moving objects are the sparse part), face recognition under variable illumination, and outlier-resistant analysis in many applied fields.
For large datasets that do not fit in memory, two complementary techniques are widely used. Incremental PCA processes the data in mini-batches and updates an existing PCA basis as new batches arrive, building on the sequential Karhunen-Loeve algorithm of Levy and Lindenbaum (2000) and the visual-tracking work of David Ross and colleagues (2008). Incremental PCA has memory complexity proportional to batch_size times p rather than n times p, which makes it the standard choice for streaming data and out-of-core computation in libraries such as scikit-learn. Randomized PCA, based on the framework of Nathan Halko, Per-Gunnar Martinsson, and Joel Tropp (2011), draws a small Gaussian random matrix Omega, computes Y = X Omega, orthonormalizes Y to get an approximate range basis Q, and then runs an exact but small SVD on Q^T X [4]. The resulting algorithm requires only a few passes over the matrix, scales to billions of rows, and produces approximations whose error is provably close to the optimal truncated SVD. Scikit-learn defaults to randomized SVD inside its PCA implementation when the input is at least 500 by 500 and the requested number of components is less than 80 percent of the smaller dimension [12].
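The core of the randomized algorithm fits in a dozen lines of NumPy. A sketch of the range-finder-plus-small-SVD scheme (oversampling and power-iteration defaults are illustrative choices; apply it to the centered data matrix to get PCA):

```python
import numpy as np

def randomized_top_k(X, k, n_oversample=10, n_power_iter=2, seed=0):
    """Approximate truncated SVD in the style of Halko et al. [4]."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    Omega = rng.normal(size=(p, k + n_oversample))   # Gaussian test matrix
    Y = X @ Omega                                    # sample the range of X
    for _ in range(n_power_iter):                    # power iterations sharpen
        Y = X @ (X.T @ Y)                            # the spectral decay
    Q, _ = np.linalg.qr(Y)                           # orthonormal range basis
    B = Q.T @ X                                      # small (k+q) x p matrix
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ Ub)[:, :k], s[:k], Vt[:k]            # approximate U_k, Sigma_k, V_k^T
```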
When each observation is a function rather than a vector (a growth curve, a temperature profile over a year, a spectrum), James Ramsay and Bernard Silverman's framework of functional data analysis treats observations as elements of a function space, expands them in a smooth basis such as splines, and applies PCA to the basis coefficients to extract dominant modes of variation. The resulting functional principal components are themselves functions, and they have proved valuable in chemometrics, growth biology, and longitudinal medical data. Other structured variants include tensor PCA for multi-way arrays, sparse functional PCA, multilevel PCA for hierarchical data, and common spatial patterns (CSP) in EEG analysis. CSP is mathematically a joint diagonalization of two class-conditional covariance matrices and reduces exactly to PCA when one of those matrices is the identity; it remains the dominant feature extraction method for motor-imagery brain-computer interfaces.
PCA sits in a much larger family of dimensionality reduction and manifold learning techniques. The main alternatives are summarized below.
| Method | Linear or nonlinear | Supervised | Deterministic | Preserves global structure | Preserves local structure | Scalability |
|---|---|---|---|---|---|---|
| PCA | Linear | No | Yes | Yes | Moderately | Excellent (millions of rows) |
| Linear discriminant analysis (LDA) | Linear | Yes | Yes | Yes | Moderately | Excellent |
| Kernel PCA | Nonlinear | No | Yes | Moderately | Moderately | Moderate (n by n kernel matrix) |
| Multidimensional scaling (MDS) | Nonlinear | No | Yes (classical) or No | Yes (classical) | Moderately | Moderate |
| Isomap | Nonlinear | No | Yes | Yes | Yes | Moderate |
| Locally linear embedding (LLE) | Nonlinear | No | Yes | Poorly | Yes | Moderate |
| t-SNE | Nonlinear | No | No | Poorly | Excellently | Poor to moderate |
| UMAP | Nonlinear | No | No | Moderately | Excellently | Good |
| Autoencoders | Nonlinear (or linear) | No (typically) | No | Depends on architecture | Depends on architecture | Good with GPUs |
A few comparisons deserve explicit comment.
PCA versus LDA. PCA is unsupervised: it picks the directions of greatest variance regardless of any class labels. Linear discriminant analysis is supervised: it picks the directions that maximize the ratio of between-class variance to within-class variance and is therefore better suited to classification. The directions chosen by PCA and LDA can be very different: a feature with high overall variance but no class information will dominate PCA but be ignored by LDA.
PCA versus t-SNE and UMAP. t-SNE (van der Maaten and Hinton 2008) and UMAP (McInnes, Healy, and Melville 2018) are nonlinear methods built specifically for low-dimensional visualization of cluster structure. They tend to reveal local groupings far more vividly than PCA, but they distort global distances, are stochastic, and provide no closed-form mapping for new points. PCA is linear, deterministic, fast, and provides an explicit projection that can be applied to new data, but it cannot unfold a curved manifold. The two families are complementary: a common workflow is to run PCA first to compress to about 50 components (which removes noise and speeds up the rest of the pipeline) and then apply t-SNE or UMAP to those components for the final 2D visualization.
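A sketch of that two-stage workflow in scikit-learn, with a random matrix standing in for real high-dimensional embeddings:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 512))     # stand-in for high-dimensional embeddings

# Compress with PCA first (removes noise and speeds up t-SNE),
# then embed the 50 components into 2D for visualization.
X50 = PCA(n_components=50).fit_transform(X)
X2 = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X50)
```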
PCA versus autoencoders. A linear autoencoder with squared-error loss and an information bottleneck is mathematically equivalent to PCA in the following sense: the optimal weights span the same subspace as the top principal axes, although the individual weight vectors are determined only up to an invertible linear transformation and need not align with the axes themselves. Nonlinear autoencoders with hidden layers and nonlinear activations can learn nonlinear codes that PCA cannot represent. The trade-offs are familiar from deep learning more generally: autoencoders need more data, more compute, and careful regularization, but can capture richer structure when those resources are available.
Projecting onto the first two or three principal components is the simplest and often most informative way to look at a high-dimensional dataset. Plots of PC1 versus PC2 reveal clusters, outliers, gradients, and the overall shape of the data cloud. In single-cell RNA sequencing, in word embedding analysis, and in the analysis of large image datasets, PCA plots are a standard first diagnostic; analysts often follow up with t-SNE or UMAP for finer cluster visualization but consult the PCA plot first to understand the global geometry.
Reducing dimensionality with PCA before training a downstream model can speed up training, reduce memory consumption, decorrelate the inputs, and act as a mild regularizer that mitigates overfitting in the small-n large-p regime. Algorithms that suffer from the curse of dimensionality (k-nearest neighbors, kernel density estimation, Gaussian mixture models) often benefit most. Many tabular ML pipelines also use PCA to remove multicollinearity before applying linear or logistic regression.
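A sketch of such a pipeline in scikit-learn (dataset and component count are illustrative); fitting the scaler and PCA inside the pipeline keeps the projection from leaking information across cross-validation folds:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Standardize, compress and decorrelate with PCA, then fit a linear classifier.
pipe = make_pipeline(StandardScaler(),
                     PCA(n_components=10),
                     LogisticRegression(max_iter=1000))
print(cross_val_score(pipe, X, y, cv=5).mean())
```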
When a signal of interest is concentrated in the top components and noise is spread across the rest, reconstructing from the top components and discarding the rest yields a denoised version of the data. This idea underlies PCA-based denoising in image processing, spectroscopy, magnetic resonance imaging, and seismic analysis, where the low-eigenvalue "tail" is dominated by measurement noise.
Storing the loadings W_k and the scores Z is far cheaper than storing the full data matrix when k is much smaller than p. PCA-based compression is rarely competitive with hand-tuned codecs for natural images or audio, but it is a generic, application-agnostic compression scheme that has been used for everything from hyperspectral imagery to gene expression panels.
The most celebrated application of PCA in computer vision is the eigenfaces method introduced by Matthew Turk and Alex Pentland in 1991 [13]. Each face image is treated as a single high-dimensional vector (one entry per pixel). PCA is applied to a training set of aligned face images, and the leading principal components are reshaped back into images called eigenfaces. These eigenfaces represent dominant patterns of variation across faces (lighting direction, presence of glasses, facial hair, expression). To recognize a new face, the system projects it onto the eigenface basis and compares the resulting low-dimensional code against stored codes using a distance metric. The eigenface system was one of the first practical automated face recognition methods, and although it has been superseded by deep convolutional networks for state-of-the-art performance, it remains a pedagogically central example of PCA in action.
PCA is the standard tool for exploring population structure from genome-wide single nucleotide polymorphism (SNP) data. The leading principal components of a SNP matrix often correspond to geographic or ancestral groupings; the famous Novembre et al. (2008) study showed that PC1 versus PC2 of European genotypes recovers an almost perfect map of Europe. PCA is also used as a routine quality-control step in genome-wide association studies, where adjusting for the leading components controls for population stratification.
In quantitative finance, PCA is applied to interest rate curves to extract dominant factors, with the first three components conventionally interpreted as parallel shifts (level), tilts (slope), and changes in convexity (curvature). PCA is also a building block for statistical risk models and for portfolio construction strategies that target exposure to specific risk factors while hedging others.
In modern deep learning, PCA is used to inspect the high-dimensional representations produced by neural networks. Word embeddings such as word2vec and GloVe are commonly visualized by projecting them to two or three principal components. Inside large language models, researchers in mechanistic interpretability use PCA on hidden activations to identify low-dimensional subspaces that correspond to interpretable concepts such as truthfulness, sentiment, or refusal direction. Recent work from Anthropic, Google DeepMind, and academic groups has found that many concepts are encoded along approximately linear directions in activation space, which makes PCA an unexpectedly useful diagnostic for understanding what a network has learned.
PCA is widely useful but has well-known weaknesses that should shape how it is applied.
Linearity. PCA can only capture linear directions of variance. Data lying on a curved manifold (a Swiss roll, a circle, the surface of a sphere) cannot be unfolded by PCA. Kernel PCA, autoencoders, and methods such as Isomap, t-SNE, and UMAP address this at greater computational and interpretive cost.
Scale sensitivity. A feature measured in thousands will dominate the total variance of an unscaled dataset, and hence the first principal component, regardless of whether it carries useful information. Standardizing each feature to zero mean and unit variance before applying PCA is the standard remedy and is often built into off-the-shelf pipelines.
Variance is not always importance. PCA equates the importance of a direction with its variance. In supervised settings the most discriminative direction may have low overall variance, in which case PCA throws away exactly the information you want to keep; this is one motivation for LDA and related supervised projections.
Outlier sensitivity. Because the covariance matrix is sensitive to extreme values, a small number of outliers can pull the leading components toward themselves. Robust PCA, the minimum covariance determinant estimator, and trimmed PCA variants mitigate this.
Loss of interpretability. Each principal component is a dense linear combination of all original features, which makes it hard to attach a clean meaning to a given component. Sparse PCA produces interpretable components at the cost of explained variance.
Gaussian assumptions. Although PCA is defined without any explicit distributional assumption, several of its statistical properties (optimality of the truncated reconstruction, equivalence to maximum likelihood under Tipping and Bishop's probabilistic PCA) rely on a Gaussian model. Heavy-tailed or strongly non-Gaussian data may be better served by methods such as independent component analysis (ICA) or nonlinear alternatives.
Categorical data. PCA assumes continuous, additive features. For categorical data, methods such as multiple correspondence analysis (MCA) play an analogous role and are mathematically related to PCA on a suitably transformed contingency table.
PCA is implemented in essentially every numerical computing environment in widespread use. The most common interfaces are listed below.
| Library | Function or class | Notes |
|---|---|---|
| scikit-learn | sklearn.decomposition.PCA | Default svd_solver = 'auto' picks between full LAPACK SVD and randomized SVD based on problem size [12] |
| scikit-learn | sklearn.decomposition.IncrementalPCA | Mini-batch PCA for out-of-core data |
| scikit-learn | sklearn.decomposition.KernelPCA | Polynomial, RBF, sigmoid, cosine, and precomputed kernels |
| scikit-learn | sklearn.decomposition.SparsePCA, MiniBatchSparsePCA | L1-penalized loadings |
| scikit-learn | sklearn.decomposition.TruncatedSVD | Truncated SVD on sparse matrices, equivalent to PCA without centering |
| NumPy | numpy.linalg.svd | Full SVD; underlying LAPACK routine is gesdd |
| SciPy | scipy.sparse.linalg.svds | ARPACK-based partial SVD for sparse matrices |
| SciPy | scipy.linalg.svd | Wrapper around LAPACK with extra options |
| PyTorch | torch.pca_lowrank, torch.svd_lowrank | GPU-accelerated randomized PCA / SVD |
| TensorFlow | tf.linalg.svd | Full SVD on tensors, including GPU/TPU |
| R | prcomp, princomp | prcomp uses SVD and is preferred over princomp |
| MATLAB | pca, svd | Built-in toolbox functions |
| Julia | LinearAlgebra.svd, MultivariateStats.fit(PCA, X) | High-performance native implementations |
For most use cases, the recommended workflow is to standardize the features, call the library's PCA function with a target number of components or a target explained variance, and inspect the resulting scree plot before deciding what to do downstream. For very large matrices, prefer randomized or incremental variants; for sparse inputs, prefer truncated SVD; for nonlinear structure, prefer kernel PCA or an autoencoder.
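In scikit-learn, the explained-variance target can be passed directly: a float in (0, 1) for n_components asks for the smallest number of components whose cumulative explained variance reaches that fraction. A sketch with placeholder data:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = np.random.default_rng(0).normal(size=(500, 40))  # placeholder data
Xs = StandardScaler().fit_transform(X)

# Keep the smallest k reaching 95 percent cumulative explained variance.
pca = PCA(n_components=0.95, svd_solver="full").fit(Xs)
print(pca.n_components_)                       # chosen k
print(pca.explained_variance_ratio_.cumsum())  # scree-plot-style diagnostic
```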
The Iris dataset is the canonical pedagogical example. It contains 150 observations of three species of iris flower (setosa, versicolor, virginica), with four measured features: sepal length, sepal width, petal length, and petal width, all in centimeters. After standardizing each feature, the covariance matrix has eigenvalues approximately 2.93, 0.92, 0.15, and 0.02, so the first two principal components together capture about (2.93 + 0.92) / 4 = 96 percent of the variance. PC1 has roughly equal positive loadings on petal length, petal width, and sepal length and a small negative loading on sepal width; it is essentially a "size" axis that cleanly separates setosa (small petals) from the other two species. PC2 is dominated by sepal width and separates the two larger species less cleanly. A scatter plot of PC1 versus PC2 shows three loosely clustered groups, with setosa well separated and versicolor and virginica overlapping at their boundaries. This single 2D plot conveys most of the relevant structure in the dataset and is a textbook illustration of why PCA remains a default first move in exploratory analysis.
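The numbers quoted above can be reproduced in a few lines (a sketch using scikit-learn's bundled copy of the dataset):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
Xs = StandardScaler().fit_transform(X)          # standardize the four features

pca = PCA().fit(Xs)
print(pca.explained_variance_)                  # approx [2.93, 0.92, 0.15, 0.02]
print(pca.explained_variance_ratio_[:2].sum())  # approx 0.96

Z = pca.transform(Xs)[:, :2]                    # PC1 vs PC2 scores for plotting
```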
More than 120 years after Pearson's original paper, PCA is still one of the most heavily used tools in data analysis, statistics, and machine learning. It is a standard preprocessing step in scikit-learn pipelines and in the typical undergraduate ML curriculum, a default diagnostic in genomics and the social sciences, and a building block inside more recent methods such as randomized SVD-based recommender systems and low-rank adaptation of large neural networks. In the era of large language models and foundation models, PCA continues to find use in three roles: as a fast first-pass dimensionality reduction step before applying more expensive nonlinear methods, as a probe for inspecting the internal representations of neural networks (mechanistic interpretability research relies heavily on PCA of activation tensors), and as a building block inside hybrid pipelines that combine classical and learned representations. Despite the rise of t-SNE, UMAP, autoencoders, and contrastive embeddings, PCA's combination of simplicity, speed, mathematical transparency, deterministic output, and universal software support ensures that it will remain a foundational technique for the foreseeable future.
[1] Pearson, K. (1901). "On Lines and Planes of Closest Fit to Systems of Points in Space." The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 2(11), 559-572. https://doi.org/10.1080/14786440109462720
[2] Hotelling, H. (1933). "Analysis of a Complex of Statistical Variables into Principal Components." Journal of Educational Psychology, 24(6), 417-441 and 24(7), 498-520. https://doi.org/10.1037/h0071325
[3] Cattell, R.B. (1966). "The Scree Test for the Number of Factors." Multivariate Behavioral Research, 1(2), 245-276. https://doi.org/10.1207/s15327906mbr0102_10
[4] Halko, N., Martinsson, P.-G., and Tropp, J.A. (2011). "Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions." SIAM Review, 53(2), 217-288. https://doi.org/10.1137/090771806
[5] Jolliffe, I.T. and Cadima, J. (2016). "Principal component analysis: a review and recent developments." Philosophical Transactions of the Royal Society A, 374, 20150202. https://doi.org/10.1098/rsta.2015.0202
[6] Schoelkopf, B., Smola, A., and Mueller, K.R. (1998). "Nonlinear Component Analysis as a Kernel Eigenvalue Problem." Neural Computation, 10(5), 1299-1319. https://doi.org/10.1162/089976698300017467
[7] Tipping, M.E. and Bishop, C.M. (1999). "Probabilistic Principal Component Analysis." Journal of the Royal Statistical Society: Series B, 61(3), 611-622. https://doi.org/10.1111/1467-9868.00196
[8] Zou, H., Hastie, T., and Tibshirani, R. (2006). "Sparse Principal Component Analysis." Journal of Computational and Graphical Statistics, 15(2), 265-286. https://doi.org/10.1198/106186006X113430
[9] Candes, E.J., Li, X., Ma, Y., and Wright, J. (2011). "Robust Principal Component Analysis?" Journal of the ACM, 58(3), Article 11. https://doi.org/10.1145/1970392.1970395
[10] Kessy, A., Lewin, A., and Strimmer, K. (2018). "Optimal Whitening and Decorrelation." The American Statistician, 72(4), 309-314. https://doi.org/10.1080/00031305.2016.1277159
[11] Bishop, C.M. (1999). "Variational Principal Components." Proceedings of the Ninth International Conference on Artificial Neural Networks (ICANN'99), 509-514. https://www.microsoft.com/en-us/research/publication/variational-principal-components/
[12] Pedregosa, F. et al. (2011). "Scikit-learn: Machine Learning in Python." Journal of Machine Learning Research, 12, 2825-2830. https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
[13] Turk, M. and Pentland, A. (1991). "Eigenfaces for Recognition." Journal of Cognitive Neuroscience, 3(1), 71-86. https://doi.org/10.1162/jocn.1991.3.1.71
[14] van der Maaten, L. and Hinton, G. (2008). "Visualizing Data using t-SNE." Journal of Machine Learning Research, 9, 2579-2605. https://jmlr.org/papers/v9/vandermaaten08a.html
[15] McInnes, L., Healy, J., and Melville, J. (2018). "UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction." arXiv preprint arXiv:1802.03426. https://arxiv.org/abs/1802.03426
[16] Ross, D., Lim, J., Lin, R.-S., and Yang, M.-H. (2008). "Incremental Learning for Robust Visual Tracking." International Journal of Computer Vision, 77(1-3), 125-141. https://doi.org/10.1007/s11263-007-0075-7
[17] Ramsay, J.O. and Silverman, B.W. (2005). Functional Data Analysis (2nd ed.). Springer. https://doi.org/10.1007/b98888
[18] Novembre, J. et al. (2008). "Genes mirror geography within Europe." Nature, 456, 98-101. https://doi.org/10.1038/nature07331