# Linear Discriminant Analysis

> Source: https://aiwiki.ai/wiki/linear_discriminant_analysis
> Updated: 2026-06-23
> Categories: Machine Learning, Statistics
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

**Linear Discriminant Analysis** (**LDA**) is a classical statistical method for [classification](/wiki/classification) and [dimensionality reduction](/wiki/dimensionality_reduction) that finds the linear combination of features which best separates two or more classes by maximizing the ratio of between class scatter to within class scatter. It was introduced by the British statistician and geneticist [Ronald A. Fisher](/wiki/ronald_fisher) in his 1936 paper "The Use of Multiple Measurements in Taxonomic Problems," published in the *Annals of Eugenics* (vol. 7, no. 2, pp. 179-188), where he applied it to the [iris dataset](/wiki/iris_dataset) of 150 flowers compiled by botanist Edgar Anderson [1]. LDA serves two related purposes: as a **classifier** that assigns observations to one of several groups, and as a supervised **feature extraction** technique that projects high dimensional data onto a lower dimensional subspace, of dimension at most `c - 1` for `c` classes, chosen to maximize between class separation [2][3][16].

Under the assumption that each class is drawn from a multivariate [Gaussian distribution](/wiki/gaussian_distribution) with a common covariance matrix, LDA coincides with the Bayes optimal classifier and produces linear decision boundaries. The scikit-learn user guide notes that LDA "is a special case of QDA, where the Gaussians for each class are assumed to share the same covariance matrix," and that as a result "LDA has a linear decision surface" [16]. When that common covariance assumption is dropped, the related method [Quadratic Discriminant Analysis](/wiki/quadratic_discriminant_analysis) produces quadratic boundaries. As a dimensionality reduction tool, LDA is conceptually distinct from [Principal Component Analysis](/wiki/principal_component_analysis_pca): [PCA](/wiki/pca) maximizes total variance and ignores class labels, while LDA seeks directions that separate the classes [4].

The abbreviation LDA is also used for **[Latent Dirichlet Allocation](/wiki/latent_dirichlet_allocation)**, an unrelated [topic model](/wiki/topic_model) introduced by Blei, Ng, and Jordan in 2003. The two methods share only the acronym. This article describes Fisher's Linear Discriminant Analysis.

## Who invented LDA and when?

Linear Discriminant Analysis originates with [Ronald Fisher](/wiki/ronald_fisher), who in 1936 published "The Use of Multiple Measurements in Taxonomic Problems" in the *Annals of Eugenics* [1]. Fisher discriminated between two species of iris using four floral measurements (sepal length, sepal width, petal length, petal width) collected by Edgar Anderson in the Gaspe Peninsula. According to the data description, the samples were "all from the same pasture, and picked on the same day and measured at the same time by the same person with the same apparatus" [17]. His solution was a linear combination of the four measurements whose values for the two species were maximally separated relative to within species spread. The resulting linear function became **Fisher's linear discriminant**, and Anderson's data, 150 flowers spanning three species (*Iris setosa*, *Iris versicolor*, *Iris virginica*) at 50 samples each, became the [iris dataset](/wiki/iris_dataset), one of the most cited datasets in pattern recognition [17].

Fisher's original derivation was geometric: he maximized a ratio of between class to within class variance, with no explicit distributional assumption. The probabilistic interpretation, in which LDA arises as the Bayes optimal rule under a Gaussian model with shared covariance, came later and is now the standard textbook framing.

The extension to more than two classes was developed by [C. R. Rao](/wiki/c_r_rao) in 1948 in his paper "The Utilization of Multiple Measurements in Problems of Biological Classification" [5]. Rao introduced what is now called **multiple discriminant analysis**, replacing the single Fisher discriminant with a set of up to `c-1` discriminant directions for `c` classes. This formulation gave LDA its modern matrix form involving the within class scatter matrix `S_w` and the between class scatter matrix `S_b`.

Through the 1950s and 1960s LDA became a workhorse in statistics, biology, and economics. Edward Altman's 1968 Z score for corporate bankruptcy used five financial ratios in a discriminant function and is probably the best known applied LDA model [6]. In the 1990s LDA found a second life in computer vision through the [Fisherfaces](/wiki/fisherfaces) method of Belhumeur, Hespanha, and Kriegman, which combined PCA with LDA for [face recognition](/wiki/face_recognition) [7]. More recently Probabilistic LDA became central to [speaker verification](/wiki/speaker_verification) systems built on [i vectors](/wiki/i_vector) and [x vectors](/wiki/x_vector).

## How does the mathematics of LDA work?

LDA can be derived in two equivalent ways: as Fisher's variance ratio criterion, or as the Bayes optimal classifier under a Gaussian model with shared covariance. The two derivations yield the same projection direction and the same decision rule.

### Two class case

Let the training data consist of feature vectors `x ∈ R^d` belonging to one of two classes. Denote the class conditional mean vectors as `μ_1` and `μ_2`, and the class conditional covariance matrices as `Σ_1` and `Σ_2`. LDA treats the two covariances as equal, `Σ_1 = Σ_2 = Σ_w`, where `Σ_w` is the common within class covariance, estimated in practice by pooling both classes.

Fisher's discriminant seeks a direction `w ∈ R^d` such that, when each point is projected to `y = w^T x`, the projected class means are as far apart as possible relative to the projected within class spread. This is **Fisher's criterion**:

```
J(w) = ( w^T (μ_1 - μ_2) )^2  /  ( w^T Σ_w w )
```

The numerator is the squared distance between the projected means, often called the between class variance along `w`. The denominator is the projected within class variance. Maximizing `J(w)` with respect to `w` gives the closed form solution

```
w  ∝  Σ_w^{-1} (μ_1 - μ_2)
```

This direction is called Fisher's linear discriminant. To turn it into a classifier, one chooses a scalar threshold `c` (unrelated to the class count `c` used elsewhere in this article) and assigns a new point `x` to class 1 if `w^T x > c` and to class 2 otherwise. Under the equal covariance Gaussian assumption, and taking `w = Σ_w^{-1} (μ_1 - μ_2)` exactly rather than up to scale, the optimal threshold derived from [Bayes theorem](/wiki/bayes_theorem) is

```
c = (1/2) w^T (μ_1 + μ_2)  -  log( π_1 / π_2 )
```

where `π_1` and `π_2` are the class priors. With equal priors the threshold lies halfway between the projected means.
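As a rough numerical check of the closed form above, the sketch below evaluates Fisher's criterion for `w = Σ_w^{-1} (μ_1 - μ_2)` and compares it with many random directions, then builds the threshold rule. The means, covariance, and priors are made up for illustration and are not taken from any dataset; the function names are hypothetical.

```python
import numpy as np

def fisher_criterion(w, mu1, mu2, sigma_w):
    """J(w): squared projected mean gap over projected within class variance."""
    gap = w @ (mu1 - mu2)
    return gap**2 / (w @ sigma_w @ w)

# Illustrative (made up) parameters.
mu1 = np.array([1.0, 2.0])
mu2 = np.array([3.0, 1.0])
sigma_w = np.array([[1.0, 0.6],
                    [0.6, 2.0]])

# Closed form direction: solve Sigma_w w = (mu1 - mu2) instead of inverting.
w_star = np.linalg.solve(sigma_w, mu1 - mu2)
j_star = fisher_criterion(w_star, mu1, mu2, sigma_w)

# No random direction should score higher than the closed form solution.
rng = np.random.default_rng(0)
random_dirs = rng.normal(size=(1000, 2))
best_random = max(fisher_criterion(w, mu1, mu2, sigma_w) for w in random_dirs)

# Threshold for the rule "class 1 if w^T x > threshold", with assumed priors.
pi1, pi2 = 0.5, 0.5
threshold = 0.5 * w_star @ (mu1 + mu2) - np.log(pi1 / pi2)

def classify(x):
    """Return 1 or 2 according to the thresholded projection."""
    return 1 if w_star @ x > threshold else 2
```

The maximum of `J` equals the squared Mahalanobis distance between the two means, which is why `j_star` coincides with `(mu1 - mu2) @ w_star`.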

### Multi class case

For `c` classes, [C. R. Rao](/wiki/c_r_rao) generalized Fisher's criterion using two scatter matrices [5]. Let `N_i` be the number of training samples in class `i`, with class mean `μ_i`, and let `μ` be the overall mean. The **within class scatter matrix** is

```
S_w  =  Σ_{i=1}^{c}  Σ_{x ∈ class i}  (x - μ_i) (x - μ_i)^T
```

and the **between class scatter matrix** is

```
S_b  =  Σ_{i=1}^{c}  N_i (μ_i - μ) (μ_i - μ)^T
```

The total scatter satisfies `S_t = S_w + S_b`. The multi class objective is to find a projection matrix `W ∈ R^{d × k}` that maximizes a ratio of determinants or a trace ratio such as

```
W^*  =  argmax_W  tr( (W^T S_w W)^{-1} (W^T S_b W) )
```

The solution is given by the [eigenvectors](/wiki/eigenvector) of `S_w^{-1} S_b` corresponding to its largest eigenvalues. Because `S_b` has rank at most `c - 1` (the `c` class means span an affine subspace of dimension at most `c - 1`), there are at most `c - 1` non zero generalized eigenvalues, and LDA can therefore reduce dimensionality to at most `c - 1` features regardless of how large `d` is. The scikit-learn documentation puts it directly: "implicit in the LDA classifier, there is a dimensionality reduction by linear projection onto a K-1 dimensional space" [16]. For a binary problem this means LDA always projects to one dimension; for ten classes it can project to at most nine.

In practice the [eigendecomposition](/wiki/eigendecomposition) of `S_w^{-1} S_b` is replaced by a numerically stable procedure: whiten the data with respect to `S_w` using Cholesky or [singular value decomposition](/wiki/singular_value_decomposition), then eigendecompose the transformed `S_b`. This avoids forming `S_w^{-1}` explicitly and is the route taken by most production implementations.
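The whitening route described above can be sketched in NumPy as follows. This is a minimal illustration on synthetic data, assuming `S_w` is positive definite so that its Cholesky factor exists; the function names are hypothetical and the sketch is not the implementation used by any particular library.

```python
import numpy as np

def scatter_matrices(X, y):
    """Within class and between class scatter, as defined above."""
    d = X.shape[1]
    mu = X.mean(axis=0)
    S_w = np.zeros((d, d))
    S_b = np.zeros((d, d))
    for cls in np.unique(y):
        Xc = X[y == cls]
        mu_c = Xc.mean(axis=0)
        centered = Xc - mu_c
        S_w += centered.T @ centered
        S_b += len(Xc) * np.outer(mu_c - mu, mu_c - mu)
    return S_w, S_b

def lda_directions(S_w, S_b):
    """Solve S_b w = lambda S_w w by whitening, without forming S_w^{-1}."""
    L = np.linalg.cholesky(S_w)             # S_w = L L^T
    half = np.linalg.solve(L, S_b)          # L^{-1} S_b
    M = np.linalg.solve(L, half.T)          # L^{-1} S_b L^{-T}
    M = 0.5 * (M + M.T)                     # symmetrize against rounding
    evals, V = np.linalg.eigh(M)
    order = np.argsort(evals)[::-1]         # largest eigenvalues first
    W = np.linalg.solve(L.T, V[:, order])   # map back: w = L^{-T} v
    return evals[order], W

# Synthetic data: c = 3 classes in d = 4 dimensions (made up means).
rng = np.random.default_rng(0)
class_means = np.array([[0, 0, 0, 0], [3, 0, 1, 0], [0, 3, 0, 1]], dtype=float)
X = np.vstack([rng.normal(loc=m, size=(40, 4)) for m in class_means])
y = np.repeat([0, 1, 2], 40)

S_w, S_b = scatter_matrices(X, y)
evals, W = lda_directions(S_w, S_b)
Z = X @ W[:, :2]   # keep the c - 1 = 2 informative directions
```

Only the first `c - 1` eigenvalues are meaningfully non zero here, and the columns of `W` come out normalized so that `W^T S_w W` is the identity.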

## Why does LDA produce a linear decision boundary?

LDA admits a clean probabilistic derivation as a generative classifier, and that derivation explains exactly why its boundaries are linear. Assume the class conditional density is multivariate Gaussian with mean `μ_k` and shared covariance `Σ`:

```
p(x | y = k)  =  (2π)^{-d/2} |Σ|^{-1/2} exp( -(1/2) (x - μ_k)^T Σ^{-1} (x - μ_k) )
```

Let `π_k = P(y = k)` denote the class prior. By [Bayes theorem](/wiki/bayes_theorem) the posterior is `p(y = k | x) ∝ π_k p(x | y = k)`. Taking the logarithm and dropping terms that do not depend on the class label gives the **discriminant function**

```
δ_k(x)  =  x^T Σ^{-1} μ_k  -  (1/2) μ_k^T Σ^{-1} μ_k  +  log π_k
```

The decision rule assigns `x` to the class with the largest `δ_k(x)`. Notice that the quadratic term `x^T Σ^{-1} x` cancels because the covariance is shared across classes, so each `δ_k` is **linear in x**. The set of points where `δ_k(x) = δ_l(x)` is therefore a hyperplane, which is the geometric reason LDA produces linear decision boundaries.

This derivation also shows that LDA is the **Bayes optimal classifier** under its assumptions: if the data really are Gaussian with shared covariance and the priors are correct, no other classifier can achieve lower expected error. In practice the assumptions are rarely exact, but LDA still performs well as a low variance estimator when training data is limited.

The parameters are estimated from training data by plugging in sample quantities:

```
π_k    =  N_k / N
μ_k    =  (1 / N_k)  Σ_{x in class k}  x
Σ       =  (1 / (N - c))  Σ_{k=1}^{c}  Σ_{x in class k}  (x - μ_k)(x - μ_k)^T
```

The priors and class means are the maximum likelihood estimates. The pooled covariance divides by `N - c` rather than the maximum likelihood divisor `N`, which makes it an unbiased estimate.
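A minimal NumPy sketch of these estimates and of the discriminant functions `δ_k(x)` is given below. The function names are hypothetical and the data are synthetic; it is meant to illustrate the formulas, not to replace a library implementation.

```python
import numpy as np

def fit_lda(X, y):
    """Estimate priors, class means, and the pooled covariance (divisor N - c)."""
    classes = np.unique(y)
    N, d = X.shape
    c = len(classes)
    priors = np.array([np.mean(y == k) for k in classes])
    means = np.array([X[y == k].mean(axis=0) for k in classes])
    pooled = np.zeros((d, d))
    for k, mu_k in zip(classes, means):
        centered = X[y == k] - mu_k
        pooled += centered.T @ centered
    pooled /= (N - c)
    return classes, priors, means, pooled

def discriminants(X, priors, means, pooled):
    """delta_k(x) for every row of X and every class; shape (n, c)."""
    A = np.linalg.solve(pooled, means.T)         # columns are Sigma^{-1} mu_k
    linear = X @ A                               # x^T Sigma^{-1} mu_k
    const = -0.5 * np.sum(means.T * A, axis=0)   # -(1/2) mu_k^T Sigma^{-1} mu_k
    return linear + const + np.log(priors)

# Synthetic three class problem with made up means and unit covariance.
rng = np.random.default_rng(1)
true_means = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
X = np.vstack([rng.normal(loc=m, size=(50, 2)) for m in true_means])
y = np.repeat([0, 1, 2], 50)

classes, priors, means, pooled = fit_lda(X, y)
scores = discriminants(X, priors, means, pooled)
pred = classes[np.argmax(scores, axis=1)]        # assign to the largest delta_k
accuracy = np.mean(pred == y)
```

Differences `δ_k(x) - δ_l(x)` computed this way agree with differences of the full Gaussian log posteriors, since the dropped terms do not depend on the class.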

## How does LDA differ from QDA?

[Quadratic Discriminant Analysis](/wiki/quadratic_discriminant_analysis) (QDA) shares LDA's generative Gaussian framework but allows each class to have its own covariance matrix `Σ_k`. The discriminant function becomes

```
δ_k(x)  =  -(1/2) log |Σ_k|  -  (1/2) (x - μ_k)^T Σ_k^{-1} (x - μ_k)  +  log π_k
```

which is **quadratic** in `x`. The scikit-learn guide summarizes the practical consequence: "Linear Discriminant Analysis can only learn linear boundaries, while Quadratic Discriminant Analysis can learn quadratic boundaries and is therefore more flexible" [16]. The decision boundary between any two classes is a quadric surface (an ellipsoid, hyperboloid, paraboloid, or pair of hyperplanes) rather than a hyperplane.

| Property | LDA | QDA |
| --- | --- | --- |
| Covariance assumption | Single shared `Σ` | Per class `Σ_k` |
| Decision boundary | Linear (hyperplane) | Quadratic (quadric) |
| Number of covariance parameters | `d(d+1)/2` | `c · d(d+1)/2` |
| Bias | Higher when covariances really differ | Lower |
| Variance | Lower (fewer parameters) | Higher (more parameters) |
| Typical regime where it wins | Small `N`, similar covariances | Large `N`, clearly different covariances |
| Required minimum samples per class | Few | At least `d + 1` per class for invertible `Σ_k` |
| Reduces dimensionality | Yes (to `c - 1`) | No native reduction |

LDA is preferred when training data is scarce or when classes have similar shapes; QDA is preferred when there is enough data per class to estimate per class covariances reliably and those covariances clearly differ. **Regularized Discriminant Analysis** (RDA), introduced by Friedman in 1989, interpolates smoothly between QDA, LDA, and a scaled identity covariance via two tuning parameters: one shrinks each class covariance toward the pooled covariance, and the other shrinks the result toward a multiple of the identity [8].
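The QDA discriminant given earlier in this section can be sketched directly in NumPy. The example uses two made up classes that share a mean but differ in spread, a case where no hyperplane can separate them but a quadratic boundary can; the function names are hypothetical.

```python
import numpy as np

def qda_discriminant(x, mu_k, sigma_k, pi_k):
    """delta_k(x) for a class with its own covariance Sigma_k."""
    diff = x - mu_k
    _, logdet = np.linalg.slogdet(sigma_k)          # log |Sigma_k|
    maha = diff @ np.linalg.solve(sigma_k, diff)    # (x - mu)^T Sigma^{-1} (x - mu)
    return -0.5 * logdet - 0.5 * maha + np.log(pi_k)

def qda_predict(x, mus, sigmas, pis):
    """Index of the class with the largest discriminant."""
    scores = [qda_discriminant(x, m, s, p) for m, s, p in zip(mus, sigmas, pis)]
    return int(np.argmax(scores))

# Same mean, different spread: class 0 is tight, class 1 is wide.
mus = [np.zeros(2), np.zeros(2)]
sigmas = [np.eye(2), 9.0 * np.eye(2)]
pis = [0.5, 0.5]

near = qda_predict(np.array([0.1, 0.0]), mus, sigmas, pis)  # close to the mean
far = qda_predict(np.array([5.0, 0.0]), mus, sigmas, pis)   # far from the mean
```

Points near the common mean go to the tight class and distant points to the wide class, so the boundary is a circle. When all classes are given the same covariance, differences of these discriminants reduce to the linear LDA differences.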

## How does LDA differ from PCA?

[Principal Component Analysis](/wiki/principal_component_analysis_pca) is the most common alternative dimensionality reduction technique and is sometimes confused with LDA. The key distinction is supervision: PCA is unsupervised and ignores class labels, while LDA is supervised and uses them. Both produce linear projections, both involve eigendecompositions, and both are heavily used as preprocessing steps. They differ in their objectives.

| Property | PCA | LDA |
| --- | --- | --- |
| Supervised | No, ignores labels | Yes, uses class labels |
| Objective | Maximum total variance | Maximum class separability |
| Eigenproblem | `Cov(X)` | `S_w^{-1} S_b` |
| Output dimensionality | Up to `min(N - 1, d)` | At most `min(c - 1, d)` |
| Useful for | Compression, denoising, visualization | Class discrimination, feature extraction for classification |
| Robust to label noise | Yes | No, depends directly on labels |

A common pitfall: PCA can discard precisely the directions that distinguish classes if those directions carry little of the total variance, for example when the class means differ along an axis on which the data barely vary. LDA avoids this because its objective explicitly references the class structure. Conversely PCA is valuable when class labels are unavailable or noisy, or when the goal is data exploration.
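The pitfall can be reproduced on synthetic data. In the sketch below, which uses made up parameters, the two classes differ only along the second axis while almost all of the variance lies along the first; the leading principal component then ignores the class structure, whereas Fisher's direction recovers it.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
cov = np.diag([25.0, 0.25])   # large spread on axis 0, small on axis 1
X0 = rng.multivariate_normal([0.0, -1.0], cov, size=n)
X1 = rng.multivariate_normal([0.0, 1.0], cov, size=n)
X = np.vstack([X0, X1])
y = np.repeat([0, 1], n)

# PCA: leading eigenvector of the total covariance (labels unused).
evals, evecs = np.linalg.eigh(np.cov(X, rowvar=False))
pca_dir = evecs[:, -1]        # eigh sorts eigenvalues in ascending order

# LDA: Fisher's direction from the pooled within class covariance.
mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
S_w = (X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)
sigma_w = S_w / (2 * n - 2)
lda_dir = np.linalg.solve(sigma_w, mu1 - mu0)
lda_dir /= np.linalg.norm(lda_dir)

def projected_accuracy(direction):
    """Accuracy of a midpoint threshold on the one dimensional projection."""
    z = X @ direction
    threshold = 0.5 * (z[y == 0].mean() + z[y == 1].mean())
    acc = np.mean((z > threshold).astype(int) == y)
    return max(acc, 1.0 - acc)   # the sign of a direction is arbitrary

pca_acc = projected_accuracy(pca_dir)
lda_acc = projected_accuracy(lda_dir)
```

Here `pca_dir` points almost entirely along the high variance axis and separates the classes little better than chance, while `lda_dir` points along the axis on which the means differ.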

It is common to combine the two. The **Fisherfaces** method [7] runs PCA first to reduce dimension below the sample size, ensuring `S_w` is invertible, then applies LDA in the PCA subspace. This pipeline is a robust default in genomics and small sample image problems.

## How does LDA differ from logistic regression?

[Logistic regression](/wiki/logistic_regression) is the classic discriminative counterpart to LDA. Both produce linear decision boundaries in the two class case, but they differ in how those boundaries are estimated.

Under the equal covariance Gaussian model, LDA and logistic regression both yield log posterior odds of the form `log( P(y = 1 | x) / P(y = 0 | x) ) = β_0 + β^T x`. LDA estimates the coefficients indirectly, by fitting class means and a pooled covariance and then solving for `β`. Logistic regression estimates the coefficients directly by maximum likelihood on the conditional distribution `P(y | x)`, without modeling `p(x)`.
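To make the indirect route concrete, the sketch below derives `β` and `β_0` from assumed Gaussian parameters and checks that the resulting linear log odds match the log odds computed from Bayes theorem with the full densities. The parameter values are made up, and `log_gaussian` is a hypothetical helper written for this example.

```python
import numpy as np

def log_gaussian(X, mu, sigma):
    """Log density of N(mu, sigma) evaluated at each row of X."""
    d = len(mu)
    diff = X - mu
    maha = np.sum(diff * np.linalg.solve(sigma, diff.T).T, axis=1)
    _, logdet = np.linalg.slogdet(sigma)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + maha)

# Assumed model: two Gaussian classes with a shared covariance.
mu0, mu1 = np.array([0.0, 0.0]), np.array([2.0, 1.0])
sigma = np.array([[1.0, 0.3],
                  [0.3, 0.5]])
pi0, pi1 = 0.7, 0.3

# Coefficients implied by LDA: beta = Sigma^{-1} (mu1 - mu0).
beta = np.linalg.solve(sigma, mu1 - mu0)
beta0 = -0.5 * (mu1 + mu0) @ beta + np.log(pi1 / pi0)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))

# Log odds from Bayes theorem versus the linear form beta0 + beta^T x.
log_odds_bayes = (np.log(pi1) + log_gaussian(X, mu1, sigma)) - (
    np.log(pi0) + log_gaussian(X, mu0, sigma))
log_odds_linear = beta0 + X @ beta
posterior_class1 = 1.0 / (1.0 + np.exp(-log_odds_linear))
```

Logistic regression would instead fit `beta0` and `beta` directly from labeled data, without ever estimating the means or the covariance.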

The efficiency tradeoff was quantified by Bradley Efron in 1975. When the Gaussian assumption holds, LDA is the more efficient estimator: Efron found that logistic regression is "between one half and two thirds as effective as normal discrimination for statistically interesting values of the parameters," that is, an asymptotic relative efficiency of roughly 50 to 67 percent [18]. The practical implications are also summarized in *The Elements of Statistical Learning* [2]. When the Gaussian assumption is badly violated, for instance when features are categorical or heavy tailed, logistic regression tends to be more robust because it makes no assumption about the input distribution. On high dimensional problems with text or count features, regularized logistic regression usually outperforms LDA.

LDA handles multi class classification natively through the log linear discriminants `δ_k(x)`, whereas logistic regression must be extended via a softmax (multinomial logistic) formulation. Both methods admit [regularization](/wiki/regularization), though regularized LDA via [shrinkage](/wiki/shrinkage_estimator) of the covariance matrix is less commonly taught than ridge or lasso logistic regression.

## What are the assumptions and limitations of LDA?

LDA's strong performance depends on several assumptions that should be checked before applying it:

- **Multivariate Gaussian class conditional densities.** Each class is assumed to be approximately normal in the feature space. Skewed or heavy tailed data can degrade the model. Power transforms or rank based features sometimes help.
- **Equal covariance across classes.** This is the assumption that makes the decision boundary linear. When violated, [QDA](/wiki/quadratic_discriminant_analysis) or RDA is preferable.
- **Continuous features.** Categorical or binary features violate the Gaussian assumption. [Naive Bayes](/wiki/naive_bayes) with appropriate likelihoods, or logistic regression, are usually better fits.
- **Sufficient samples per class.** The within class scatter `S_w` has rank at most `N - c`, so with fewer samples than features (more precisely, whenever `N - c < d`) it is singular and cannot be inverted. Remedies include shrinkage, regularization, or a PCA preprocessing step.
- **Outlier sensitivity.** Class means and covariances are sensitive to outliers. Robust variants based on the minimum covariance determinant estimator help in the presence of contamination.
- **Linear separability in the projected space.** If the optimal boundary is highly nonlinear, no linear projection will recover it; kernel LDA or nonlinear methods become necessary.

A common practical issue is **singular within class scatter** when `N - c < d` (and in particular whenever `N < d`), the small sample size problem. The classical fixes are **regularized LDA**, replacing `S_w` with `S_w + λI` for some `λ > 0`; **shrinkage LDA**, using an analytically derived intensity such as the [Ledoit Wolf](/wiki/ledoit_wolf) or Oracle Approximating Shrinkage estimator; or **PCA plus LDA**, which projects to a subspace where `S_w` is non singular before applying LDA [7][8][15].
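The small sample size problem and the `S_w + λI` fix can be illustrated with a few lines of NumPy. The data are synthetic and the value of `λ` is an arbitrary assumption; in practice it would be tuned, for example by cross validation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_per_class, d, c = 5, 20, 2          # N = 10 samples in 20 dimensions
X0 = rng.normal(loc=0.0, size=(n_per_class, d))
X1 = rng.normal(loc=1.0, size=(n_per_class, d))
mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)

# Within class scatter: its rank cannot exceed N - c = 8, so it is singular.
S_w = (X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)
rank = np.linalg.matrix_rank(S_w)

# Regularized LDA: add lambda * I so the system becomes solvable.
lam = 0.1                              # assumed value, not a recommendation
S_reg = S_w + lam * np.eye(d)
w = np.linalg.solve(S_reg, mu0 - mu1)  # regularized Fisher direction
smallest_eig = np.linalg.eigvalsh(S_reg).min()
```

Adding `λI` lifts every eigenvalue of `S_w` by `λ`, so the regularized matrix is positive definite even though `S_w` itself has a null space of dimension at least `d - (N - c)`.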

## Variants and extensions

Many generalizations of LDA have been developed to address its limitations or to extend it to new settings.

- **Quadratic Discriminant Analysis (QDA).** Drops the equal covariance assumption; produces quadratic boundaries.
- **Regularized Discriminant Analysis (RDA).** Friedman 1989 [8]. Interpolates between LDA, QDA, and a scaled identity covariance via two regularization parameters.
- **Diagonal LDA.** Constrains `Σ` to be diagonal. Useful for high dimensional small sample problems such as microarray classification.
- **Shrinkage LDA.** Replaces the sample covariance with a shrinkage estimator. The Ledoit Wolf and Oracle Approximating Shrinkage estimators provide closed form intensities and are available in common libraries such as scikit-learn [15][16].
- **Fisherfaces.** Belhumeur, Hespanha, and Kriegman 1997 [7]. PCA followed by LDA, applied to face images. A landmark method in [face recognition](/wiki/face_recognition) through the late 1990s and 2000s.
- **Kernel Fisher Discriminant (KFD).** Mika et al. 1999 [9]. Performs LDA in a feature space defined by a [kernel method](/wiki/kernel_method), enabling nonlinear class boundaries. Closely related to support vector machines.
- **Generalized Discriminant Analysis (GDA).** Baudat and Anouar 2000 [10]. Another formulation of kernel LDA with multi class extensions.
- **Local Fisher Discriminant Analysis (LFDA).** Sugiyama 2006 [11]. Combines Fisher's criterion with locality preserving structure, useful when classes have multimodal distributions.
- **Heteroscedastic LDA (HLDA).** Kumar and Andreou 1998. Allows class specific covariances in the retained directions, while the discarded directions share a common distribution across classes; popular in speech recognition.
- **Probabilistic LDA (PLDA).** Prince and Elder 2007 [12]. A latent variable formulation widely used in face and [speaker verification](/wiki/speaker_verification) systems, modeling each observation as a latent identity vector plus within class noise.
- **Sparse LDA.** Imposes an L1 penalty on the discriminant directions for interpretable variable selection.
- **Incremental and online LDA.** Extends LDA to streaming data without recomputing the eigendecomposition from scratch.

## What is LDA used for?

LDA has been applied across many fields, often as a competitive baseline that is hard to beat without significantly more data or model complexity.

- **Bankruptcy prediction.** Altman's 1968 Z score is the canonical financial application of LDA, using five accounting ratios to discriminate solvent from bankrupt firms [6]. It was estimated on a sample of 66 manufacturing companies, 33 that had filed for bankruptcy and 33 that had not, and a firm scoring below 1.81 is flagged as distressed while a score above 2.99 is considered safe [6]. It is still taught in finance and used as a screening tool. See [Altman Z score](/wiki/altman_z_score).
- **Face recognition.** Fisherfaces [7] use LDA on PCA reduced face images and led the field through the late 1990s and 2000s. Even after deep learning, LDA remains useful as a scoring layer on top of learned face embeddings.
- **Speaker verification.** Modern speaker recognition pipelines extract [i vectors](/wiki/i_vector) or neural [x vectors](/wiki/x_vector) and score pairs with [PLDA](/wiki/plda). LDA is also frequently used as a discriminative dimensionality reduction step before PLDA.
- **Genomics and bioinformatics.** Diagonal or shrinkage LDA is a standard classifier for gene expression microarrays and other very high dimensional small sample problems.
- **Medical diagnosis.** Discrimination between disease states using clinical biomarkers, biopsy measurements, or imaging features.
- **Brain computer interfaces and EEG.** Shrinkage LDA is widely used in BCI research because EEG features are roughly Gaussian and training data is limited.
- **Marketing.** Segmenting customers into known classes based on demographic and transactional features.
- **Chemometrics and remote sensing.** Classifying products or land cover from spectroscopic or multispectral measurements.
- **Handwritten digit recognition.** A historical benchmark; LDA was an early standard on MNIST style tasks before convolutional networks took over.

## Implementations

Linear Discriminant Analysis is included in essentially every major statistical and machine learning package.

- **[scikit-learn](/wiki/scikit_learn)** provides `sklearn.discriminant_analysis.LinearDiscriminantAnalysis` and `QuadraticDiscriminantAnalysis`. The LDA class offers three solvers, `svd` (the default), `lsqr`, and `eigen`; the `lsqr` and `eigen` solvers support covariance shrinkage, including automatic Ledoit Wolf shrinkage via `shrinkage="auto"` [16].
- **[R](/wiki/r_programming_language)** ships LDA in `MASS::lda` and `MASS::qda`. The `klaR` package adds `klaR::rda` for regularized discriminant analysis, and `mda` adds mixture discriminant analysis.
- **[MATLAB](/wiki/matlab)** provides `fitcdiscr` in the Statistics and Machine Learning Toolbox, supporting linear, quadratic, diagonal, and pseudoquadratic discriminants.
- **Stata** provides `discrim lda` and `discrim qda`.
- **SAS** provides `PROC DISCRIM` and `PROC STEPDISC` for variable selection.
- **SPSS** offers Discriminant Analysis as a menu driven procedure, and **Weka** ships discriminant classifiers in its functions package.
- **Spark MLlib** has no LDA classifier (its `LDA` class is for Latent Dirichlet Allocation), but one can be assembled from its linear algebra primitives.
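As a brief usage sketch for the scikit-learn entry above, the snippet below fits LDA on the iris data bundled with the library, projects it to `c - 1 = 2` dimensions, and fits a second model with automatic shrinkage. It assumes scikit-learn is installed and reports training accuracy only, which is optimistic compared with held out data.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)   # 150 samples, 4 features, 3 classes

# Default SVD solver: a classifier and a projection to at most c - 1 = 2 dims.
lda = LinearDiscriminantAnalysis(solver="svd", n_components=2)
Z = lda.fit(X, y).transform(X)
train_accuracy = lda.score(X, y)

# Shrinkage requires the "lsqr" or "eigen" solver; "auto" picks the
# Ledoit Wolf intensity analytically.
shrunk = LinearDiscriminantAnalysis(solver="lsqr", shrinkage="auto").fit(X, y)
shrunk_pred = shrunk.predict(X)
```

The `svd` solver does not accept a `shrinkage` argument, which is why the second model switches solvers.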

## Worked example

Consider a simplified two class version of Fisher's [iris dataset](/wiki/iris_dataset) with two features, sepal length and sepal width, and two species, *Iris setosa* (class 1) and *Iris versicolor* (class 2). Suppose the estimated class means are `μ_1 = (5.0, 3.4)` for setosa and `μ_2 = (5.9, 2.8)` for versicolor, and the pooled within class covariance is

```
Σ_w = [ 0.20   0.05 ]
      [ 0.05   0.10 ]
```

The difference of means is `μ_1 - μ_2 = (-0.9, 0.6)`. To find Fisher's direction, compute `Σ_w^{-1} (μ_1 - μ_2)`. The inverse of the pooled covariance is approximately

```
Σ_w^{-1} ≈ [  5.71  -2.86 ]
            [ -2.86  11.43 ]
```

Multiplying out gives `w ≈ (-6.86, 9.43)` (up to rounding). To classify a new flower with measurements `x = (5.5, 3.1)`, the LDA discriminant compares

```
δ_1(x) - δ_2(x)  =  w^T (x - (μ_1 + μ_2)/2)  +  log(π_1 / π_2)
```

With equal priors `π_1 = π_2 = 0.5` and the midpoint `(μ_1 + μ_2)/2 = (5.45, 3.10)`, the term `(x - midpoint) = (0.05, 0.0)`, and `w^T (x - midpoint) ≈ -6.86 · 0.05 + 9.43 · 0.0 ≈ -0.34`. Since this is negative, the example is closer to the versicolor side and would be classified as *Iris versicolor*. (In Fisher's original paper using all four features, *Iris setosa* is linearly separable from the other two species with zero training error, which is part of why the iris example became iconic.)

This calculation captures the essence of LDA. Training reduces to estimating means, a covariance, and class priors; inference reduces to a dot product and a threshold comparison. There are no iterative optimizers, learning rates, or randomness, which is part of LDA's enduring appeal as a baseline.
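The arithmetic of this worked example can be replayed in NumPy. The sketch uses exactly the illustrative means, covariance, and priors stated above, which are supposed values rather than estimates computed from the real iris data.

```python
import numpy as np

# Values from the worked example (illustrative, not fitted to real data).
mu1 = np.array([5.0, 3.4])            # setosa
mu2 = np.array([5.9, 2.8])            # versicolor
sigma_w = np.array([[0.20, 0.05],
                    [0.05, 0.10]])
pi1 = pi2 = 0.5

# Fisher's direction: solve Sigma_w w = (mu1 - mu2).
w = np.linalg.solve(sigma_w, mu1 - mu2)

# Discriminant difference delta_1(x) - delta_2(x) for the new flower.
x = np.array([5.5, 3.1])
midpoint = 0.5 * (mu1 + mu2)
score = w @ (x - midpoint) + np.log(pi1 / pi2)

label = "setosa" if score > 0 else "versicolor"
print(np.round(w, 2), round(float(score), 2), label)
```

The computed direction and score agree with the rounded values in the text, `w ≈ (-6.86, 9.43)` and a score of about `-0.34`, so the flower is assigned to *Iris versicolor*.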

## Is LDA still used today?

LDA is no longer the leading method on the most demanding modern benchmarks. The rise of [support vector machines](/wiki/support_vector_machine_svm) in the late 1990s and of [gradient boosting](/wiki/gradient_boosting) and deep learning in the 2010s pushed LDA out of the headline ML competitions, where the equal Gaussian assumption is rarely realistic.

Nevertheless, LDA continues to be used heavily and intentionally in several settings:

- **Speech and speaker recognition.** PLDA scoring on x vector embeddings remains the dominant approach in many speaker verification systems, including call center authentication and forensic voice comparison.
- **Bankruptcy prediction and credit scoring.** Altman's Z score and its descendants remain in active use, valued for interpretability and regulatory acceptability.
- **Brain computer interfaces.** Shrinkage LDA is a standard classifier in motor imagery and P300 BCIs because of its stability with very few labeled trials.
- **High dimensional small sample problems.** In genomics and chemometrics, regularized or diagonal LDA frequently matches or beats more complex models when data is scarce.
- **Education and benchmarking.** LDA is one of the cleanest pedagogical examples of a generative classifier and the bias variance tradeoff, and it serves as a sanity check baseline.
- **Feature engineering.** LDA's `c - 1` projection is used as a low dimensional summary of class structure for visualization or as input to a downstream model.

LDA sits at the intersection of three traditions in classification: the geometric (Fisher's variance ratio), the probabilistic (Bayes optimal generative model), and the spectral (eigendecomposition of `S_w^{-1} S_b`). Even when LDA is not the final model, the concepts of [feature extraction](/wiki/feature_extraction), discriminative projection, and class conditional generative modeling that grew out of it permeate contemporary statistical learning.

## See also

- [Quadratic Discriminant Analysis](/wiki/quadratic_discriminant_analysis)
- [Principal Component Analysis](/wiki/principal_component_analysis_pca)
- [Logistic regression](/wiki/logistic_regression)
- [Naive Bayes](/wiki/naive_bayes)
- [Bayes theorem](/wiki/bayes_theorem)
- [Fisherfaces](/wiki/fisherfaces)
- [Eigenfaces](/wiki/eigenfaces)
- [PLDA](/wiki/plda)
- [Latent Dirichlet Allocation](/wiki/latent_dirichlet_allocation)
- [Iris dataset](/wiki/iris_dataset)
- [Ronald Fisher](/wiki/ronald_fisher)
- [C. R. Rao](/wiki/c_r_rao)

## References

1. Fisher, R. A. (1936). "The Use of Multiple Measurements in Taxonomic Problems." *Annals of Eugenics*, 7(2), 179-188. https://onlinelibrary.wiley.com/doi/10.1111/j.1469-1809.1936.tb02137.x
2. Hastie, T., Tibshirani, R., & Friedman, J. (2009). *The Elements of Statistical Learning* (2nd ed.). Springer. Chapter 4. https://hastie.su.domains/ElemStatLearn/
3. Bishop, C. M. (2006). *Pattern Recognition and Machine Learning*. Springer. Chapter 4. https://www.microsoft.com/en-us/research/people/cmbishop/prml-book/
4. Duda, R. O., Hart, P. E., & Stork, D. G. (2001). *Pattern Classification* (2nd ed.). Wiley. https://www.wiley.com/en-us/Pattern+Classification%2C+2nd+Edition-p-9780471056690
5. Rao, C. R. (1948). "The Utilization of Multiple Measurements in Problems of Biological Classification." *Journal of the Royal Statistical Society, Series B*, 10(2), 159-203. https://www.jstor.org/stable/2983775
6. Altman, E. I. (1968). "Financial Ratios, Discriminant Analysis and the Prediction of Corporate Bankruptcy." *Journal of Finance*, 23(4), 589-609. https://onlinelibrary.wiley.com/doi/10.1111/j.1540-6261.1968.tb00843.x
7. Belhumeur, P. N., Hespanha, J. P., & Kriegman, D. J. (1997). "Eigenfaces vs. Fisherfaces: Recognition Using Class Specific Linear Projection." *IEEE Transactions on Pattern Analysis and Machine Intelligence*, 19(7), 711-720. https://ieeexplore.ieee.org/document/598228
8. Friedman, J. H. (1989). "Regularized Discriminant Analysis." *Journal of the American Statistical Association*, 84(405), 165-175. https://www.tandfonline.com/doi/abs/10.1080/01621459.1989.10478752
9. Mika, S., Ratsch, G., Weston, J., Scholkopf, B., & Muller, K. R. (1999). "Fisher Discriminant Analysis with Kernels." *IEEE Neural Networks for Signal Processing IX Workshop*, 41-48. https://ieeexplore.ieee.org/document/788121
10. Baudat, G., & Anouar, F. (2000). "Generalized Discriminant Analysis Using a Kernel Approach." *Neural Computation*, 12(10), 2385-2404. https://direct.mit.edu/neco/article/12/10/2385/6385
11. Sugiyama, M. (2006). "Local Fisher Discriminant Analysis for Supervised Dimensionality Reduction." *Proceedings of the 23rd International Conference on Machine Learning*, 905-912. https://dl.acm.org/doi/10.1145/1143844.1143958
12. Prince, S. J. D., & Elder, J. H. (2007). "Probabilistic Linear Discriminant Analysis for Inferences About Identity." *IEEE 11th International Conference on Computer Vision*, 1-8. https://ieeexplore.ieee.org/document/4409052
13. Welling, M. (2005). "Fisher Linear Discriminant Analysis." Tutorial, University of Toronto. https://www.ics.uci.edu/~welling/classnotes/papers_class/Fisher-LDA.pdf
14. McLachlan, G. J. (2004). *Discriminant Analysis and Statistical Pattern Recognition*. Wiley. https://onlinelibrary.wiley.com/doi/book/10.1002/0471725293
15. Ledoit, O., & Wolf, M. (2004). "A Well Conditioned Estimator for Large Dimensional Covariance Matrices." *Journal of Multivariate Analysis*, 88(2), 365-411. https://www.sciencedirect.com/science/article/pii/S0047259X03000964
16. scikit-learn developers. "Linear and Quadratic Discriminant Analysis." scikit-learn user guide. https://scikit-learn.org/stable/modules/lda_qda.html
17. Wikipedia contributors. "Iris flower data set." https://en.wikipedia.org/wiki/Iris_flower_data_set
18. Efron, B. (1975). "The Efficiency of Logistic Regression Compared to Normal Discriminant Analysis." *Journal of the American Statistical Association*, 70(352), 892-898. https://www.tandfonline.com/doi/abs/10.1080/01621459.1975.10480319