Linear Discriminant Analysis
Last reviewed
Apr 28, 2026
Sources
17 citations
Review status
Source-backed
Revision
v1 · 3,937 words
Linear Discriminant Analysis (LDA) is a classical statistical method for classification and dimensionality reduction that finds a linear combination of features which best separates two or more classes. It was introduced by the British statistician and geneticist Ronald A. Fisher in his 1936 paper "The Use of Multiple Measurements in Taxonomic Problems," where he applied it to the iris dataset compiled by botanist Edgar Anderson [1]. LDA serves two related purposes: as a classifier that assigns observations to one of several groups, and as a feature extraction technique that projects high dimensional data onto a lower dimensional subspace chosen to maximize between class separation [2][3].
Under the assumption that each class is drawn from a multivariate Gaussian distribution with a common covariance matrix, LDA coincides with the Bayes optimal classifier and produces linear decision boundaries. When that common covariance assumption is dropped, the related method Quadratic Discriminant Analysis produces quadratic boundaries. As a dimensionality reduction tool, LDA is conceptually distinct from Principal Component Analysis: PCA maximizes total variance and ignores class labels, while LDA seeks directions that separate the classes [4].
The abbreviation LDA is also used for Latent Dirichlet Allocation, an unrelated topic model introduced by Blei, Ng, and Jordan in 2003. The two methods share only the acronym. This article describes Fisher's Linear Discriminant Analysis.
Linear Discriminant Analysis originates with Ronald Fisher, who in 1936 published "The Use of Multiple Measurements in Taxonomic Problems" in the Annals of Eugenics [1]. Fisher discriminated between two species of iris using four floral measurements (sepal length, sepal width, petal length, petal width) collected by Edgar Anderson in the Gaspé Peninsula. His solution was a linear combination of the four measurements whose values for the two species were maximally separated relative to within species spread. The resulting linear function became Fisher's linear discriminant, and Anderson's data became the iris dataset, one of the most cited datasets in pattern recognition.
Fisher's original derivation was geometric: he maximized a ratio of between class to within class variance, with no explicit distributional assumption. The probabilistic interpretation, in which LDA arises as the Bayes optimal rule under a Gaussian model with shared covariance, came later and is now the standard textbook framing.
The extension to more than two classes was developed by C. R. Rao in 1948 in his paper "The Utilization of Multiple Measurements in Problems of Biological Classification" [5]. Rao introduced what is now called multiple discriminant analysis, replacing the single Fisher discriminant with a set of up to c-1 discriminant directions for c classes. This formulation gave LDA its modern matrix form involving the within class scatter matrix S_w and the between class scatter matrix S_b.
Through the 1950s and 1960s LDA became a workhorse in statistics, biology, and economics. Edward Altman's 1968 Z-score for corporate bankruptcy used five financial ratios in a discriminant function and is among the best known applied LDA models [6]. In the 1990s LDA found a second life in computer vision through the Fisherfaces method of Belhumeur, Hespanha, and Kriegman, which combined PCA with LDA for face recognition [7]. More recently, Probabilistic LDA became central to speaker verification systems built on i-vectors and x-vectors.
LDA can be derived in two equivalent ways: as Fisher's variance ratio criterion, or as the Bayes optimal classifier under a Gaussian model with shared covariance. The two derivations yield the same projection direction and the same decision rule.
Let the training data consist of feature vectors x ∈ R^d belonging to one of two classes. Denote the class conditional mean vectors as μ_1 and μ_2, and the class conditional covariance matrices as Σ_1 and Σ_2. The classical formulation treats the two covariances as equal, Σ_1 = Σ_2 = Σ_w, where Σ_w is the common within class covariance estimated by pooling both classes.
Fisher's discriminant seeks a direction w ∈ R^d such that, when each point is projected to y = w^T x, the projected class means are as far apart as possible relative to the projected within class spread. This is Fisher's criterion:
J(w) = ( w^T (μ_1 - μ_2) )^2 / ( w^T Σ_w w )
The numerator is the squared distance between the projected means, often called the between class variance along w. The denominator is the projected within class variance. Maximizing J(w) with respect to w gives the closed form solution
w ∝ Σ_w^{-1} (μ_1 - μ_2)
This direction is called Fisher's linear discriminant. To turn it into a classifier, one chooses a threshold c and assigns a new point x to class 1 if w^T x > c and to class 2 otherwise. Under the Gaussian model with equal covariances, and with w scaled as w = Σ_w^{-1} (μ_1 - μ_2), the optimal threshold derived from Bayes' theorem is
c = (1/2) w^T (μ_1 + μ_2) - log( π_1 / π_2 )
where π_1 and π_2 are the class priors. The scaling matters because the log prior term does not rescale with w. With equal priors the threshold lies halfway between the projected means.
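The two class recipe above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function name `fisher_two_class` and the synthetic data are assumptions made for the example, and the pooled covariance is estimated with the usual N - 2 divisor.

```python
import numpy as np

def fisher_two_class(X1, X2, prior1=0.5, prior2=0.5):
    """Return (w, c) for the rule: class 1 if w @ x > c, else class 2."""
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    n1, n2 = len(X1), len(X2)
    # Pooled within class covariance (divides by N - 2 for two classes).
    S1 = (X1 - mu1).T @ (X1 - mu1)
    S2 = (X2 - mu2).T @ (X2 - mu2)
    sigma_w = (S1 + S2) / (n1 + n2 - 2)
    # Solve sigma_w w = mu1 - mu2 instead of forming the inverse explicitly.
    w = np.linalg.solve(sigma_w, mu1 - mu2)
    # Bayes threshold; valid for this exact scaling of w.
    c = 0.5 * w @ (mu1 + mu2) - np.log(prior1 / prior2)
    return w, c

# Synthetic data (illustrative only): two Gaussian blobs with equal spread.
rng = np.random.default_rng(0)
X1 = rng.normal(loc=[2.0, 0.0], scale=0.5, size=(50, 2))
X2 = rng.normal(loc=[-2.0, 0.0], scale=0.5, size=(50, 2))
w, c = fisher_two_class(X1, X2)

x_new = np.array([1.5, 0.3])
predicted_class = 1 if w @ x_new > c else 2
```

Raising `prior1` lowers the threshold c, so more points are assigned to class 1, which matches the sign of the log prior term in the formula.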
For c classes, C. R. Rao generalized Fisher's criterion using two scatter matrices [5]. Let N_i be the number of training samples in class i, with class mean μ_i, and let μ be the overall mean. The within class scatter matrix is
S_w = Σ_{i=1}^{c} Σ_{x ∈ class i} (x - μ_i) (x - μ_i)^T
and the between class scatter matrix is
S_b = Σ_{i=1}^{c} N_i (μ_i - μ) (μ_i - μ)^T
The total scatter satisfies S_t = S_w + S_b. The multi class objective is to find a projection matrix W ∈ R^{d × k} that maximizes a ratio of determinants or a trace ratio such as
W^* = argmax_W tr( (W^T S_w W)^{-1} (W^T S_b W) )
The solution is given by the eigenvectors of S_w^{-1} S_b corresponding to its k largest eigenvalues. Because S_b has rank at most c - 1 (the c class means span an affine subspace of dimension at most c - 1), there are at most c - 1 non zero generalized eigenvalues, and LDA can therefore reduce dimensionality to at most c - 1 features regardless of how large d is. For a binary problem this means LDA always projects to one dimension; for ten classes it can project to at most nine.
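The scatter matrices and the rank argument can be checked numerically. The sketch below uses synthetic data with c = 3 classes in d = 5 dimensions; the helper name `scatter_matrices` and the data are illustrative assumptions. It solves the eigenproblem of S_w^{-1} S_b directly, which is fine for a small demonstration but is not the numerically preferred route.

```python
import numpy as np

def scatter_matrices(X, y):
    """Within class (S_w) and between class (S_b) scatter matrices."""
    mu = X.mean(axis=0)
    d = X.shape[1]
    S_w = np.zeros((d, d))
    S_b = np.zeros((d, d))
    for k in np.unique(y):
        Xk = X[y == k]
        mu_k = Xk.mean(axis=0)
        S_w += (Xk - mu_k).T @ (Xk - mu_k)
        S_b += len(Xk) * np.outer(mu_k - mu, mu_k - mu)
    return S_w, S_b

# Synthetic data: 3 classes, 5 features, 40 samples per class.
rng = np.random.default_rng(0)
centers = np.array([[0, 0, 0, 0, 0], [3, 0, 0, 0, 0], [0, 3, 0, 0, 0]], dtype=float)
X = np.vstack([rng.normal(m, 1.0, size=(40, 5)) for m in centers])
y = np.repeat([0, 1, 2], 40)

S_w, S_b = scatter_matrices(X, y)
S_t = (X - X.mean(axis=0)).T @ (X - X.mean(axis=0))  # total scatter

# Eigenvectors of S_w^{-1} S_b, sorted by decreasing eigenvalue.
eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_w, S_b))
order = np.argsort(eigvals.real)[::-1]
eigvals = eigvals.real[order]
W = eigvecs.real[:, order[:2]]   # keep c - 1 = 2 discriminant directions
Z = X @ W                        # projected data, shape (120, 2)
```

Only the first two eigenvalues are meaningfully different from zero here, so a third direction would carry no class information.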
In practice the eigendecomposition of S_w^{-1} S_b is replaced by a numerically stable procedure: whiten the data with respect to S_w using Cholesky or singular value decomposition, then eigendecompose the transformed S_b. This avoids forming S_w^{-1} explicitly and is the route taken by most production implementations.
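One way to realize the whitening route is sketched below, assuming S_w is positive definite so that a Cholesky factor exists. The function name `lda_directions_whitened` is hypothetical, and `np.linalg.solve` stands in for a dedicated triangular solver to keep the example NumPy only; actual library implementations differ in detail.

```python
import numpy as np

def lda_directions_whitened(S_w, S_b, k):
    """Top-k generalized eigenpairs of (S_b, S_w) via Cholesky whitening."""
    L = np.linalg.cholesky(S_w)            # S_w = L @ L.T
    A = np.linalg.solve(L, S_b)            # L^{-1} S_b
    M = np.linalg.solve(L, A.T)            # L^{-1} S_b L^{-T}, symmetric in exact arithmetic
    M = 0.5 * (M + M.T)                    # remove round-off asymmetry
    eigvals, V = np.linalg.eigh(M)         # symmetric eigenproblem, ascending order
    order = np.argsort(eigvals)[::-1][:k]
    W = np.linalg.solve(L.T, V[:, order])  # map back: w = L^{-T} v
    return eigvals[order], W

# Synthetic data: 3 classes, 3 features, 30 samples per class.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(m, 1.0, size=(30, 3)) for m in ([0, 0, 0], [4, 0, 0], [0, 4, 0])])
y = np.repeat([0, 1, 2], 30)
mu = X.mean(axis=0)
S_w = sum((X[y == k] - X[y == k].mean(axis=0)).T @ (X[y == k] - X[y == k].mean(axis=0))
          for k in range(3))
S_b = sum(30 * np.outer(X[y == k].mean(axis=0) - mu, X[y == k].mean(axis=0) - mu)
          for k in range(3))

eigvals, W = lda_directions_whitened(S_w, S_b, k=2)
```

The returned columns satisfy S_b w = λ S_w w and are normalized so that W^T S_w W is the identity, which means the projected within class scatter is already whitened.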
LDA admits a clean probabilistic derivation as a generative classifier. Assume the class conditional density is multivariate Gaussian with mean μ_k and shared covariance Σ:
p(x | y = k) = (2π)^{-d/2} |Σ|^{-1/2} exp( -(1/2) (x - μ_k)^T Σ^{-1} (x - μ_k) )
Let π_k = P(y = k) denote the class prior. By Bayes' theorem the posterior is p(y = k | x) ∝ π_k p(x | y = k). Taking the logarithm and dropping terms that do not depend on the class label gives the discriminant function
δ_k(x) = x^T Σ^{-1} μ_k - (1/2) μ_k^T Σ^{-1} μ_k + log π_k
The decision rule assigns x to the class with the largest δ_k(x). Notice that the quadratic term x^T Σ^{-1} x cancels because the covariance is shared across classes, so each δ_k is linear in x. The set of points where δ_k(x) = δ_l(x) is therefore a hyperplane, which is the geometric reason LDA produces linear decision boundaries.
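The cancellation can be made explicit by expanding the quadratic form. The block below is a worked version of that step in LaTeX, using the same symbols as the formulas above; the last line is the hyperplane equation obtained by setting the difference of two discriminants to zero.

```latex
\begin{aligned}
\log\bigl[\pi_k\, p(x \mid y = k)\bigr]
  &= \log \pi_k - \tfrac{d}{2}\log(2\pi) - \tfrac{1}{2}\log|\Sigma|
     - \tfrac{1}{2}(x - \mu_k)^T \Sigma^{-1} (x - \mu_k) \\
  &= \underbrace{-\tfrac{d}{2}\log(2\pi) - \tfrac{1}{2}\log|\Sigma|
     - \tfrac{1}{2} x^T \Sigma^{-1} x}_{\text{same for every class}}
     + \underbrace{x^T \Sigma^{-1} \mu_k - \tfrac{1}{2}\mu_k^T \Sigma^{-1} \mu_k
     + \log \pi_k}_{\delta_k(x)} \\[4pt]
\delta_k(x) - \delta_l(x)
  &= x^T \Sigma^{-1} (\mu_k - \mu_l)
     - \tfrac{1}{2}(\mu_k + \mu_l)^T \Sigma^{-1} (\mu_k - \mu_l)
     + \log\frac{\pi_k}{\pi_l}
\end{aligned}
```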
This derivation also shows that LDA is the Bayes optimal classifier under its assumptions: if the data really are Gaussian with shared covariance and the priors are correct, no other classifier can achieve lower expected error. In practice the assumptions are rarely exact, but LDA still performs well as a low variance estimator when training data is limited.
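The parameters are estimated from training data. The priors and means are maximum likelihood estimates, and the shared covariance is the pooled, bias corrected estimate:
π_k = N_k / N
μ_k = (1 / N_k) Σ_{x in class k} x
Σ = (1 / (N - c)) Σ_{k=1}^{c} Σ_{x in class k} (x - μ_k)(x - μ_k)^T
Dividing by N - c rather than N makes the pooled covariance estimator unbiased; the strict maximum likelihood estimate divides by N.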
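Putting the estimators and the discriminant function δ_k(x) together gives a complete classifier. The following is a minimal sketch under the stated Gaussian assumptions; the function names `lda_fit`, `lda_scores`, and `lda_predict` are illustrative, and the data are synthetic.

```python
import numpy as np

def lda_fit(X, y):
    """Estimate priors, class means, and the pooled covariance (divisor N - c)."""
    classes = np.unique(y)
    N, d = X.shape
    priors = np.array([np.mean(y == k) for k in classes])
    means = np.array([X[y == k].mean(axis=0) for k in classes])
    pooled = np.zeros((d, d))
    for k, mu_k in zip(classes, means):
        R = X[y == k] - mu_k
        pooled += R.T @ R
    sigma = pooled / (N - len(classes))
    return classes, priors, means, sigma

def lda_scores(X, priors, means, sigma):
    """delta_k(x) = x^T S^{-1} mu_k - 0.5 mu_k^T S^{-1} mu_k + log pi_k, for all k."""
    A = np.linalg.solve(sigma, means.T)                  # columns are S^{-1} mu_k
    const = -0.5 * np.sum(means.T * A, axis=0) + np.log(priors)
    return X @ A + const                                 # shape (n_samples, c)

def lda_predict(X, classes, priors, means, sigma):
    return classes[np.argmax(lda_scores(X, priors, means, sigma), axis=1)]

# Synthetic data: three well separated Gaussian classes with a shared covariance.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [6.0, 0.0], [0.0, 6.0]])
X = np.vstack([rng.normal(m, 1.0, size=(50, 2)) for m in centers])
y = np.repeat([0, 1, 2], 50)

classes, priors, means, sigma = lda_fit(X, y)
pred = lda_predict(X, classes, priors, means, sigma)
accuracy = np.mean(pred == y)
```

Training is a single pass over the data and prediction is one matrix product plus an argmax, which is why LDA needs no iterative optimizer.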
Quadratic Discriminant Analysis (QDA) shares LDA's generative Gaussian framework but allows each class to have its own covariance matrix Σ_k. The discriminant function becomes
δ_k(x) = -(1/2) log |Σ_k| - (1/2) (x - μ_k)^T Σ_k^{-1} (x - μ_k) + log π_k
which is quadratic in x. The decision boundary between any two classes is a quadric surface (an ellipsoid, hyperboloid, paraboloid, or pair of hyperplanes) rather than a hyperplane.
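The QDA discriminant can be evaluated directly from its formula. The sketch below uses a hypothetical helper `qda_score` and hand picked parameters: two classes that share the same mean but differ in spread, a case where no linear boundary can separate them but a quadratic one can.

```python
import numpy as np

def qda_score(x, mu_k, sigma_k, prior_k):
    """delta_k(x) = -0.5 log|S_k| - 0.5 (x - mu_k)^T S_k^{-1} (x - mu_k) + log pi_k."""
    diff = x - mu_k
    _, logdet = np.linalg.slogdet(sigma_k)           # stable log-determinant
    maha = diff @ np.linalg.solve(sigma_k, diff)     # squared Mahalanobis distance
    return -0.5 * logdet - 0.5 * maha + np.log(prior_k)

# Illustrative parameters: same mean, different covariances.
mu = np.zeros(2)
sigma_tight = 0.25 * np.eye(2)
sigma_wide = 4.0 * np.eye(2)

def classify(x):
    s_tight = qda_score(x, mu, sigma_tight, 0.5)
    s_wide = qda_score(x, mu, sigma_wide, 0.5)
    return "tight" if s_tight > s_wide else "wide"

near = classify(np.array([0.1, 0.0]))   # close to the shared mean
far = classify(np.array([3.0, 0.0]))    # far from the shared mean
```

Points near the common mean go to the tight class and distant points to the wide class, so the decision boundary here is a circle around the origin.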
| Property | LDA | QDA |
|---|---|---|
| Covariance assumption | Single shared Σ | Per class Σ_k |
| Decision boundary | Linear (hyperplane) | Quadratic (quadric) |
| Number of covariance parameters | d(d+1)/2 | c · d(d+1)/2 |
| Bias | Higher when covariances really differ | Lower |
| Variance | Lower (fewer parameters) | Higher (more parameters) |
| Typical regime where it wins | Small N, similar covariances | Large N, clearly different covariances |
| Required minimum samples per class | Few | At least d + 1 per class for invertible Σ_k |
| Reduces dimensionality | Yes (to c - 1) | No native reduction |
LDA is preferred when training data is scarce or when classes have similar shapes; QDA is preferred when there is enough data per class to estimate per class covariances reliably and those covariances clearly differ. Regularized Discriminant Analysis (RDA), introduced by Friedman in 1989, uses two tuning parameters: one interpolates between QDA and LDA, and the other shrinks the resulting covariance toward a multiple of the identity [8].
Principal Component Analysis is the most common alternative dimensionality reduction technique and is sometimes confused with LDA. Both produce linear projections, both involve eigendecompositions, and both are heavily used as preprocessing steps. They differ in their objectives.
| Property | PCA | LDA |
|---|---|---|
| Supervised | No, ignores labels | Yes, uses class labels |
| Objective | Maximum total variance | Maximum class separability |
| Eigenproblem | Cov(X) | S_w^{-1} S_b |
| Output dimensionality | Up to min(N - 1, d) | At most c - 1 |
| Useful for | Compression, denoising, visualization | Class discrimination, feature extraction for classification |
| Robust to label noise | Yes | No, depends directly on labels |
A common pitfall: PCA can discard precisely the directions that distinguish classes if those directions account for only a small share of the total variance. LDA avoids this because its objective explicitly references the class structure. Conversely, PCA is valuable when class labels are unavailable or noisy, or when the goal is data exploration.
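The pitfall is easy to reproduce on synthetic data. In the sketch below (all values are illustrative assumptions), feature 0 has large variance but carries no class information, while feature 1 has small variance and separates the two classes.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
# Feature 0: high variance noise. Feature 1: low variance, class means at +1 and -1.
X1 = np.column_stack([rng.normal(0, 10, n), rng.normal(+1, 0.3, n)])
X2 = np.column_stack([rng.normal(0, 10, n), rng.normal(-1, 0.3, n)])
X = np.vstack([X1, X2])

# PCA: leading eigenvector of the total covariance (labels are ignored).
evals, evecs = np.linalg.eigh(np.cov(X, rowvar=False))
pca_dir = evecs[:, -1]                     # eigh returns eigenvalues in ascending order

# LDA: Fisher direction from the pooled within class covariance.
mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
sigma_w = ((X1 - mu1).T @ (X1 - mu1) + (X2 - mu2).T @ (X2 - mu2)) / (2 * n - 2)
lda_dir = np.linalg.solve(sigma_w, mu1 - mu2)
lda_dir /= np.linalg.norm(lda_dir)
```

The first principal component points almost entirely along feature 0, while the Fisher direction points almost entirely along feature 1, so a one dimensional PCA projection would throw away nearly all of the class separation.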
It is common to combine the two. The Fisherfaces method [7] runs PCA first to reduce dimension below the sample size, ensuring S_w is invertible, then applies LDA in the PCA subspace. This pipeline is a robust default in genomics and small sample image problems.
Logistic regression is the classic discriminative counterpart to LDA. Both produce linear decision boundaries in the two class case, but they differ in how those boundaries are estimated.
Under the equal covariance Gaussian model, LDA and logistic regression both yield log posterior odds of the form log[ P(y = 1 | x) / P(y = 0 | x) ] = β_0 + β^T x. LDA estimates the coefficients indirectly, by fitting class means and a pooled covariance and then solving for β. Logistic regression estimates the coefficients directly by maximum likelihood on the conditional distribution P(y | x), without modeling p(x).
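The indirect route from Gaussian parameters to the coefficients can be verified numerically. Under the shared covariance model the coefficients are β = Σ^{-1}(μ_1 - μ_0) and β_0 = -(1/2)(μ_1 + μ_0)^T β + log(π_1 / π_0). The sketch below uses made up parameter values and a hypothetical helper `gaussian_logpdf` to confirm that this linear form equals the log odds computed from the full densities.

```python
import numpy as np

def gaussian_logpdf(x, mu, sigma):
    """Log density of a multivariate Gaussian N(mu, sigma) at x."""
    d = len(mu)
    diff = x - mu
    _, logdet = np.linalg.slogdet(sigma)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + diff @ np.linalg.solve(sigma, diff))

# Illustrative parameters (not estimated from any dataset).
mu0, mu1 = np.array([0.0, 0.0]), np.array([1.0, 2.0])
sigma = np.array([[1.0, 0.3], [0.3, 2.0]])
pi0, pi1 = 0.7, 0.3

# Coefficients implied by the LDA model.
beta = np.linalg.solve(sigma, mu1 - mu0)
beta0 = -0.5 * (mu1 + mu0) @ beta + np.log(pi1 / pi0)

x = np.array([0.4, -1.2])
log_odds_linear = beta0 + beta @ x
log_odds_bayes = (np.log(pi1) + gaussian_logpdf(x, mu1, sigma)) \
               - (np.log(pi0) + gaussian_logpdf(x, mu0, sigma))
posterior_class1 = 1.0 / (1.0 + np.exp(-log_odds_linear))   # logistic link
```

Logistic regression would fit β_0 and β directly from labeled data instead of deriving them from means, covariance, and priors.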
The practical implications are summarized in The Elements of Statistical Learning [4]. When the Gaussian assumption holds, LDA is more efficient because it also uses the information in the marginal distribution of x: in the worst case, logistic regression loses roughly 30 percent efficiency asymptotically in the error rate. When the Gaussian assumption is badly violated, for instance when features are categorical or heavy tailed, logistic regression tends to be more robust because it makes no assumption about the input distribution. On high dimensional problems with text or count features, regularized logistic regression usually outperforms LDA.
LDA handles multi class classification natively through the log linear discriminants δ_k(x), whereas logistic regression must be extended via a softmax (multinomial logistic) formulation. Both methods admit regularization, though regularized LDA via shrinkage of the covariance matrix is less commonly taught than ridge or lasso logistic regression.
LDA's strong performance depends on several assumptions that should be checked before applying it:
- When the number of training samples is smaller than the number of features (N < d), the within class scatter S_w becomes singular and cannot be inverted. Remedies include shrinkage, regularization, or a PCA preprocessing step.

A common practical issue is singular within class scatter when N < d, the small sample size problem. The classical fixes are regularized LDA, which replaces S_w with S_w + λI for some λ > 0; shrinkage LDA, which uses an analytically derived intensity such as the Ledoit Wolf or Oracle Approximating Shrinkage estimator; and PCA LDA, which projects to a subspace where S_w is non singular before applying LDA [7][8].
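The small sample size problem and the S_w + λI remedy can be demonstrated directly. This is a sketch on random data with N = 10 samples and d = 50 features; the value of λ is an arbitrary assumption for illustration and would normally be tuned or replaced by an analytic shrinkage estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 10, 50                       # fewer samples than features
X = rng.normal(size=(N, d))
y = np.repeat([0, 1], N // 2)

# Within class scatter: rank is at most N - c = 8, far below d = 50.
S_w = np.zeros((d, d))
for k in (0, 1):
    R = X[y == k] - X[y == k].mean(axis=0)
    S_w += R.T @ R
rank = np.linalg.matrix_rank(S_w)

# Regularized LDA: add a ridge term so the matrix becomes positive definite.
lam = 0.1                           # hypothetical value, chosen for illustration
S_reg = S_w + lam * np.eye(d)
mu_diff = X[y == 0].mean(axis=0) - X[y == 1].mean(axis=0)
w = np.linalg.solve(S_reg, mu_diff)
```

Larger λ pulls the solution toward the plain difference of class means, while smaller λ trusts the estimated scatter more.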
Many generalizations of LDA have been developed to address its limitations or to extend it to new settings.
- Diagonal LDA restricts the shared covariance Σ to be diagonal. Useful for high dimensional small sample problems such as microarray classification.

LDA has been applied across many fields, often as a competitive baseline that is hard to beat without significantly more data or model complexity.
Linear Discriminant Analysis is included in essentially every major statistical and machine learning package.
- scikit-learn (Python): sklearn.discriminant_analysis.LinearDiscriminantAnalysis and QuadraticDiscriminantAnalysis. The LDA class supports svd, lsqr (least squares), and eigen solvers, with built in Ledoit Wolf shrinkage available for the lsqr and eigen solvers.
- R: MASS::lda and MASS::qda. The klaR package adds klaR::rda for regularized discriminant analysis, and mda adds mixture discriminant analysis.
- MATLAB: fitcdiscr in the Statistics and Machine Learning Toolbox, supporting linear, quadratic, diagonal, and pseudoquadratic discriminants.
- Stata: discrim lda and discrim qda.
- SAS: PROC DISCRIM, and PROC STEPDISC for variable selection.
- … LDA class is for Latent Dirichlet Allocation), but linear algebra primitives can approximate one.

Consider a simplified two class version of Fisher's iris dataset with two features, sepal length and sepal width, and two species, Iris setosa (class 1) and Iris versicolor (class 2). Suppose the estimated class means are μ_1 = (5.0, 3.4) for setosa and μ_2 = (5.9, 2.8) for versicolor, and the pooled within class covariance is
Σ_w = [[0.20, 0.05], [0.05, 0.10]]
The difference of means is μ_1 - μ_2 = (-0.9, 0.6). To find Fisher's direction, compute Σ_w^{-1} (μ_1 - μ_2). The inverse of the pooled covariance is approximately
Σ_w^{-1} ≈ [[5.71, -2.86], [-2.86, 11.43]]
Multiplying out gives w ≈ (-6.86, 9.43) (up to rounding). To classify a new flower with measurements x = (5.5, 3.1), the LDA discriminant compares
δ_1(x) - δ_2(x) = w^T (x - (μ_1 + μ_2)/2) + log(π_1 / π_2)
With equal priors π_1 = π_2 = 0.5 and the midpoint (μ_1 + μ_2)/2 = (5.45, 3.10), the term (x - midpoint) = (0.05, 0.0), and w^T (x - midpoint) ≈ -6.86 · 0.05 + 9.43 · 0.0 ≈ -0.34. Since this is negative, the example is closer to the versicolor side and would be classified as Iris versicolor. (In Fisher's original paper using all four features the two species are linearly separable with zero training error, which is part of why the iris example became iconic.)
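The arithmetic of this worked example can be reproduced with NumPy. The numbers below are the ones given in the text; only the variable names are introduced here.

```python
import numpy as np

mu1 = np.array([5.0, 3.4])            # setosa mean (sepal length, sepal width)
mu2 = np.array([5.9, 2.8])            # versicolor mean
sigma_w = np.array([[0.20, 0.05],
                    [0.05, 0.10]])    # pooled within class covariance
pi1 = pi2 = 0.5                       # equal priors

# Fisher direction: solve sigma_w w = mu1 - mu2.
w = np.linalg.solve(sigma_w, mu1 - mu2)

# Discriminant difference delta_1(x) - delta_2(x) for the new flower.
x = np.array([5.5, 3.1])
score = w @ (x - (mu1 + mu2) / 2) + np.log(pi1 / pi2)
label = "setosa" if score > 0 else "versicolor"
```

Changing the priors shifts `score` by log(π_1 / π_2) without changing w, so a sufficiently strong prior for setosa would flip this borderline prediction.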
This calculation captures the essence of LDA. Training reduces to estimating means, a covariance, and class priors; inference reduces to a dot product and a threshold comparison. There are no iterative optimizers, learning rates, or randomness, which is part of LDA's enduring appeal as a baseline.
LDA is no longer the leading method on the most demanding modern benchmarks. The rise of support vector machines in the late 1990s and of gradient boosting and deep learning in the 2010s pushed LDA out of the headline ML competitions, where the equal Gaussian assumption is rarely realistic.
Nevertheless, LDA continues to be used heavily and intentionally in several settings:
- The c - 1 dimensional projection is used as a low dimensional summary of class structure for visualization or as input to a downstream model.

LDA sits at the intersection of three traditions in classification: the geometric (Fisher's variance ratio), the probabilistic (Bayes optimal generative model), and the spectral (eigendecomposition of S_w^{-1} S_b). Even when LDA is not the final model, the concepts of feature extraction, discriminative projection, and class conditional generative modeling that grew out of it permeate contemporary statistical learning.