Normal distribution
Last reviewed
May 2, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 6,252 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
May 2, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 6,252 words
Add missing citations, update stale details, or suggest a clearer explanation.
The normal distribution, also called the Gaussian distribution, is a continuous probability distribution defined by two parameters: a mean μ and a variance σ². Its density is the bell curve that almost every introductory statistics course pins to its first chapter, and the one that almost every machine learning paper assumes somewhere in the fine print.
In shorthand, a random variable X is normally distributed when X ~ N(μ, σ²) and its probability distribution has density:
f(x; μ, σ²) = (1 / (σ √(2π))) · exp(-(x - μ)² / (2σ²))
The distribution is symmetric around μ, has thin (sub-exponential) tails, and is closed under linear combinations. That last property is what makes it cooperate so well with everything else in statistics. Sums of independent normals stay normal. Linear transformations of normals stay normal. Marginals and conditionals of multivariate normals stay normal. There is no other distribution that is this well-behaved while still being non-trivial, which is part of why it has become the default model for noise, errors, prior beliefs, latent variables, and weight initializations across modern machine learning.
The other reason for its dominance is the central limit theorem, which says that the sum of many independent finite-variance contributions tends to a normal distribution regardless of the shape of the individual pieces. So even when individual sources of randomness are not normal, their aggregates often are. That is why measurement errors, polling results, residuals from a well-specified linear regression, and many natural phenomena all end up looking Gaussian.
Whether they actually are Gaussian is a separate question, and one this article also tries to take seriously. Plenty of real-world data is not normal, and assuming it is can blow up in your face.
A random variable X follows a normal distribution with mean μ and variance σ² > 0 if its probability density function is:
f(x; μ, σ²) = (1 / (σ √(2π))) · exp(-(x - μ)² / (2σ²))
for x ∈ ℝ. The two parameters have the usual interpretation. The mean μ is a real number that locates the center of the distribution. The variance σ² is positive and controls the spread; the standard deviation σ = √(σ²) is the more common scale measure because it shares units with x.
The density integrates to 1 over the real line, which can be verified using the Gaussian integral ∫ exp(-x²) dx = √π over (-∞, ∞). The normalizing constant 1 / (σ √(2π)) exists precisely to make this integral equal 1.
The standard normal distribution is the special case μ = 0, σ² = 1. It is written N(0, 1) and its density is usually denoted φ(z):
φ(z) = (1 / √(2π)) · exp(-z² / 2)
Any normal X ~ N(μ, σ²) can be standardized by subtracting the mean and dividing by the standard deviation. The standardized variable Z = (X - μ) / σ follows N(0, 1), and the value z is called the z-score of x. Statistical tables, and the entire ecosystem of hypothesis tests built around the normal, lean on this trick because there is only one standard normal distribution to tabulate.
The CDF is:
Φ(z) = ∫_{-∞}^{z} φ(t) dt
Unlike the density, the CDF has no closed-form expression in elementary functions. It is usually written in terms of the error function erf:
Φ(z) = (1/2) · (1 + erf(z / √2))
This is mostly a notation issue. Numerical libraries compute Φ via rational approximations, asymptotic expansions for the tails, or piecewise polynomial fits. NumPy uses scipy.special.ndtr, which calls into a C implementation derived from Cephes; PyTorch's torch.special.ndtr does the analogous thing on the GPU.
For the normal distribution, the mean, median, and mode all coincide at μ. This is one of the few distributions where this is true exactly, and it is part of why "average" is so often treated as a synonym for "typical" in everyday speech. The variance is σ², so the standard deviation is σ. Skewness is 0 because the density is symmetric, and the kurtosis is 3 (excess kurtosis 0). Kurtosis 3 is the reference value other distributions are compared against, which is why excess kurtosis is reported relative to it.
A useful rule of thumb for the normal distribution is that about 68% of the mass falls within one standard deviation of the mean, about 95% within two, and about 99.7% within three. The full quantile picture is a bit more nuanced.
| Interval around μ | Probability |
|---|---|
| μ ± 1σ | ≈ 0.6827 |
| μ ± 2σ | ≈ 0.9545 |
| μ ± 3σ | ≈ 0.9973 |
| μ ± 4σ | ≈ 0.999937 |
| μ ± 5σ | ≈ 0.9999994 |
| μ ± 6σ | ≈ 0.999999998 |
The last few rows are why physicists insist on "five sigma" for discoveries (about 1 in 3.5 million under the null) and why finance practitioners get nervous when they see "six sigma" events happening in a single decade. If your data is actually Gaussian, six-sigma days should not happen on a human timescale. If they do, your data is not Gaussian.
The round numbers people quote (68, 95, 99.7) are slightly off from the actual values. The two-sigma probability is closer to 95.45%, not 95%. The exact two-sided 95% interval uses 1.95996σ, and the exact 99% interval uses 2.5758σ. Most reference tables stop at four decimal places of z because that is all anyone ever needs in practice.
The earliest derivation of the normal density appears in Abraham de Moivre's 1733 paper Approximatio ad Summam Terminorum Binomii (a + b)^n in Seriem expansi, a private supplement to his 1738 second edition of The Doctrine of Chances. De Moivre was studying the binomial distribution, the probability of getting k heads in n flips of a fair or biased coin, and he wanted to evaluate the central terms when n is large.
For a symmetric binomial with p = 1/2, the central term C(n, n/2) · 2^(-n) is hard to compute directly when n is large, so de Moivre used Stirling's approximation (which Stirling and de Moivre had both worked on) and showed that the binomial probability mass near the center decays approximately as exp(-2k² / n) where k measures distance from the mean in standard-deviation units. This is the normal density, although de Moivre did not call it that, did not parameterize it in modern (μ, σ²) notation, and saw it as a tool for binomial calculations rather than a distribution in its own right.
De Moivre's result is the first instance of what would later be called the central limit theorem for the special case of i.i.d. Bernoulli sums.
The more general statement came from Pierre-Simon Laplace (Laplace). In his 1774 Mémoire sur la probabilité des causes par les événements and culminating in the 1812 Théorie analytique des probabilités, Laplace proved that sums of independent random variables, suitably normalized, converge to the normal distribution under fairly broad conditions. He used this result to argue that astronomical observation errors, which are sums of many small independent perturbations, should be expected to follow what we now call a Gaussian distribution.
Laplace's contribution is often underrated next to Gauss's, but the central limit theorem is the deeper reason the normal distribution shows up everywhere. De Moivre's coin-flipping result is a special case; Laplace's theorem is the principle.
The distribution gets its most common name from Carl Friedrich Gauss (Gauss). In his 1809 Theoria Motus Corporum Coelestium in Sectionibus Conicis Solem Ambientium, Gauss developed the method of least squares as a tool for fitting orbits to noisy astronomical observations. He showed that if observation errors are normally distributed, the maximum-likelihood estimate of the orbital parameters coincides with the least-squares estimate, and that the resulting estimator is in a precise sense optimal.
Gauss arrived at the normal density essentially by reverse-engineering the assumption: if the arithmetic mean of repeated measurements is the most probable value of the true quantity (a postulate that was already widely used by astronomers), then the error distribution must take the form exp(-h² · ε²) up to a constant. The argument is short, surprisingly clever, and not quite a proof of the centrality of the normal distribution, but it gave astronomers a principled reason to use least squares and the bell curve became permanently associated with his name.
The Gauss-Laplace synthesis was already complete by the 1810s. By the time the 19th century was over, the normal distribution had become the default model for measurement error in essentially every quantitative science.
Adolphe Quetelet in the 1830s and 1840s did something that, depending on how charitable you are, was either inspired or deeply confused. He took the normal distribution out of astronomy and applied it to human characteristics: heights of conscripts, chest sizes of Scottish soldiers, crime rates, suicide rates. He observed that many of these distributions looked roughly bell-shaped, and proposed his concept of l'homme moyen, the "average man," as a kind of statistical ideal type whose deviations represented social or biological error.
This was the moment when the normal distribution became a generic model for empirical phenomena rather than a specifically astronomical tool. Quetelet's interpretation of variation as "error around the average man" has not aged well, but his methodological move (that the normal distribution applies to social data as well as physical measurements) was enormously influential and helped launch the field of mathematical statistics.
The word "normal" in this technical sense was coined relatively late. Karl Pearson is usually credited with popularizing it in 1893, although Charles Sanders Peirce, Francis Galton, and Wilhelm Lexis had all used variants of the term in the 1870s and 1880s. Pearson himself later expressed regret about the choice, writing in 1920: "Many years ago I called the Laplace-Gaussian curve the normal curve, which name, while it avoids an international question of priority, has the disadvantage of leading people to believe that all other distributions of frequency are in one sense or another 'abnormal.'"
He was right. The connotation has caused real damage in fields that uncritically apply the Gaussian as a default. Calling something "normal" suggests that deviations from it are pathological, which is exactly the wrong intuition for distributions like the Cauchy or the t-distribution where the "abnormality" is simply having a heavier tail.
If X ~ N(μ_X, σ_X²) and Y ~ N(μ_Y, σ_Y²) are independent, then for any constants a and b:
aX + bY ~ N(a μ_X + b μ_Y, a² σ_X² + b² σ_Y²)
This property generalizes to any finite number of independent normals, and it is the workhorse identity behind almost every analytic calculation involving Gaussians. It is also why the sample mean of a normal sample is exactly normal, not merely approximately normal: a sum of n independent N(μ, σ²) variables divided by n is N(μ, σ²/n) on the nose.
Among all probability distributions on the real line with a given mean and variance, the normal distribution has the maximum differential entropy. This is a precise sense in which the Gaussian is the "least informative" or "most uncertain" distribution consistent with first and second moments. From the perspective of information theory, if all you know about a quantity is its mean and variance, the most honest distribution to assume is the normal one.
This is one of the deepest justifications for the Gaussian as a default model. It is not just convenient, it is the unique solution to a maximum-entropy problem under second-moment constraints.
The normal distribution is one of only three stable distributions with finite variance (the others being trivial cases). Stability here means that a sum of i.i.d. normals is itself normal up to a location and scale shift. This is the algebraic version of the central limit theorem: if you keep adding finite-variance contributions, you converge to the unique stable law with finite variance, which is the normal.
The normal distribution is its own conjugate prior under Gaussian likelihood with known variance. If you assume X | μ ~ N(μ, σ²) with σ² known, and you place a prior μ ~ N(μ_0, τ_0²), then after observing n samples with sample mean x̄ the posterior is also normal:
μ | data ~ N(μ_n, τ_n²)
where the posterior mean μ_n is a precision-weighted average of the prior mean and the data mean, and the posterior precision 1/τ_n² is the sum of the prior precision 1/τ_0² and the data precision n/σ². This is the cleanest possible update rule in Bayesian inference and is the reason Gaussian assumptions show up so often in Bayesian models.
The normal distribution has tails that decay faster than any exponential. For large z, P(Z > z) ≈ φ(z) / z, which means the density falls off as exp(-z²/2). This is much faster decay than the exponential or Laplace distributions, and it is one of the reasons the Gaussian is sometimes a bad model for real data: extreme events that look impossible under a Gaussian assumption may actually be common in heavier-tailed distributions.
The extension of the normal distribution to d dimensions is the multivariate normal. A random vector X ∈ ℝ^d is multivariate normal with mean vector μ ∈ ℝ^d and covariance matrix Σ ∈ ℝ^(d×d) (symmetric positive definite) if its joint density is:
f(x; μ, Σ) = (1 / ((2π)^(d/2) · |Σ|^(1/2))) · exp(-(1/2) · (x - μ)^T Σ^(-1) (x - μ))
The quantity (x - μ)^T Σ^(-1) (x - μ) is the squared Mahalanobis distance from x to μ. Level sets of the density are ellipsoids whose principal axes are the eigenvectors of Σ and whose lengths are proportional to √(λ_i), where λ_i are the eigenvalues.
Marginal and conditional distributions of a multivariate normal are themselves normal, with means and covariances given by closed-form expressions in terms of the original μ and Σ. This is the property that makes Gaussian processes work, since a GP can be defined as a collection of random variables of which any finite subset has a joint multivariate normal distribution, and conditioning on observations is just a linear-algebra operation.
The special case Σ = σ² I gives an isotropic Gaussian, which is what people mean when they say "sample from N(0, I)." In high dimensions, isotropic Gaussians have a lot of geometry that is not at all obvious from the 1D picture: most of the mass concentrates in a thin spherical shell at radius √d, and any two random samples are nearly orthogonal. This concentration of measure is one of the reasons high-dimensional Gaussians behave so differently from intuition built on 2D bell curves.
The central limit theorem is the main reason the Gaussian is everywhere. The classical Lindeberg-Lévy version says: if X_1, X_2, ..., X_n are i.i.d. random variables with mean μ and finite variance σ², then the standardized sample mean
√n · (X̄_n - μ) / σ
converges in distribution to N(0, 1) as n → ∞.
The theorem has dozens of generalizations. Lyapunov and Lindeberg-Feller relax the i.i.d. assumption to allow non-identically-distributed variables under a uniform integrability condition. Martingale CLTs handle dependent sequences. Stable distribution theory tells you what happens when the finite-variance assumption fails: the limit is no longer Gaussian but a stable law with infinite variance.
What this means in practice for machine learning: any quantity that is a sum or average of many small independent contributions (gradients across a mini-batch, neuron activations under random weights, errors in an ensemble of predictors) will tend toward a Gaussian as the number of contributions grows. This is why Gaussian noise models often work even when the underlying data-generating process is not itself Gaussian; you are typically modeling residuals or aggregates rather than raw observations.
The theorem can also mislead. The CLT only kicks in once n is large, and the rate of convergence depends on higher moments of the underlying distribution. For heavy-tailed inputs the convergence is painfully slow, and using a Gaussian approximation for n = 30 (the textbook rule of thumb) on a power-law-tailed dataset will give wildly wrong tail probabilities.
Most computational work with normals starts from a uniform random number generator and converts uniforms to normals. The classical methods are:
The Box-Muller transform, published by George E. P. Box and Mervin Muller in 1958, takes two independent uniform samples U_1, U_2 ∈ (0, 1) and produces two independent N(0, 1) samples:
Z_1 = √(-2 ln U_1) · cos(2π U_2) Z_2 = √(-2 ln U_1) · sin(2π U_2)
The derivation is a polar-coordinate change of variables on the joint density of two independent standard normals. It is exact (modulo floating-point), but it requires calls to log, sqrt, sin, and cos, which used to be expensive on older hardware.
George Marsaglia's polar method is a rejection-sampling variant of Box-Muller that avoids the trigonometric calls. You sample U_1, U_2 uniformly in (-1, 1), reject if U_1² + U_2² > 1, and otherwise compute the Box-Muller transform using U_1 / √s and U_2 / √s as the cosine and sine, where s = U_1² + U_2². It is faster than Box-Muller in practice on most hardware but has a rejection rate of about 21%.
The Ziggurat algorithm, published by Marsaglia and Wai Wan Tsang in 2000, is the method most modern numerical libraries use under the hood. It approximates the normal density with a stack of horizontal rectangles and a tail region, samples uniformly from the union, and accepts most samples without any expensive function evaluation. NumPy's default_rng().standard_normal() and PyTorch's torch.randn both use Ziggurat-style methods, sometimes with vendor-specific tweaks.
For multivariate normals N(μ, Σ), the standard procedure is to compute the Cholesky factor Σ = L L^T, draw Z ~ N(0, I) using one of the methods above, and return X = μ + L Z. For very high-dimensional Σ this is the bottleneck, and various structured-covariance tricks (low-rank plus diagonal, Toeplitz, banded) are used in practice to avoid the full O(d³) Cholesky.
Given n i.i.d. samples x_1, ..., x_n from N(μ, σ²), the maximum likelihood estimators are:
μ̂ = (1/n) · Σ x_i = x̄ σ̂² = (1/n) · Σ (x_i - x̄)²
The MLE for the variance is biased: E[σ̂²] = ((n - 1) / n) · σ². The unbiased estimator divides by n - 1 instead of n:
s² = (1 / (n - 1)) · Σ (x_i - x̄)²
This is Bessel's correction, named after Friedrich Bessel. The factor of (n - 1) compensates for the fact that x̄ is itself estimated from the data, which uses up one degree of freedom. The biased and unbiased estimators differ noticeably for small n and become indistinguishable as n → ∞.
For linear regression with Gaussian errors, the MLE for the regression coefficients coincides with ordinary least squares. This is the connection Gauss exploited in 1809: maximizing a Gaussian likelihood is mathematically the same as minimizing the mean squared error. Squared error is not just a convenient loss function, it is the negative log likelihood under a Gaussian noise model.
If you know σ in advance, the z-test uses the fact that (X̄_n - μ_0) / (σ / √n) ~ N(0, 1) under the null hypothesis μ = μ_0. You compare the observed z to the standard normal CDF and read off a p-value.
In practice σ is almost never known, so you estimate it from the sample using s, and the resulting test statistic (X̄_n - μ_0) / (s / √n) follows Student's t-distribution with n - 1 degrees of freedom. This is the t-test, derived by William Sealy Gosset in 1908 under the pen name "Student." For n much larger than 30 the t-distribution is essentially indistinguishable from the standard normal, but for small samples the t has heavier tails to account for the additional uncertainty in s.
The F-test for variance ratios and the chi-squared test for variance both also assume Gaussianity. All three of these tests are standard fare in statistical practice and they all rest on the same set of distributional assumptions about the underlying data.
A two-sided 95% confidence interval for μ when σ is known is:
x̄ ± 1.96 · σ / √n
When σ is unknown, replace 1.96 with the appropriate t-quantile (for n = 30 this is about 2.04; for n = 10 it is 2.26). The interval is constructed so that, over many repeated samples, 95% of the constructed intervals will contain the true μ.
A persistent problem with normality assumptions is that they are usually not checked. The following table summarizes the standard tests practitioners use to decide whether a normal model is reasonable.
| Test | Statistic / approach | Strengths | Weaknesses |
|---|---|---|---|
| Shapiro-Wilk | W = (Σ a_i x_(i))² / Σ (x_i - x̄)² | Most powerful for small to moderate samples (n < 5000) | Less powerful in very large samples; rejects on tiny deviations |
| Kolmogorov-Smirnov | sup | F_n(x) - F(x) | |
| Anderson-Darling | weighted KS putting more mass on tails | Better than KS at detecting tail departures | Reference distribution depends on parameters being estimated or known |
| Lilliefors | KS variant for estimated μ, σ | Adjusts KS for parameter estimation | Lower power than Shapiro-Wilk |
| Jarque-Bera | n · (skew²/6 + (kurt - 3)²/24) | Simple, common in econometrics | Asymptotic; unreliable for small n |
| QQ plot | scatter of sample quantiles vs theoretical | Visual, intuitive, shows where deviations occur | Subjective |
The two most common tests in machine learning practice are Shapiro-Wilk (built into scipy.stats.shapiro) and the visual QQ plot. The QQ plot in particular is often more useful than a hypothesis test, because it tells you not just whether the data deviates from normal but where (in the tails, in the center, only in one direction) and by how much.
There is a subtle problem with normality tests in modern data: with n = 10000 or more, virtually any real dataset will fail a Shapiro-Wilk test even if the deviation from normal is small enough to be irrelevant for downstream inference. The hypothesis test is not the right question; the right question is whether the deviation matters for whatever you are doing.
The Gaussian shows up in machine learning in more places than would fit in any one section. The following list covers the major ones.
| Application | How the normal shows up |
|---|---|
| Weight initialization | Xavier/Glorot, He, and Kaiming initializations all sample weights from a normal (or related uniform) with variance tuned to the layer's fan-in |
| Batch normalization | Layer outputs are normalized to roughly N(0, 1) before being scaled and shifted, which speeds up training |
| Variational autoencoder (VAE) | Latent variables modeled as multivariate normals, KL term computed in closed form against an N(0, I) prior |
| Gaussian processes | Function-space prior is multivariate normal, posterior under Gaussian likelihood is also multivariate normal |
| Diffusion models / DDPM | Forward process adds Gaussian noise; reverse process is a learned Gaussian transition |
| Reparameterization trick | ε ~ N(0, I) reparameterizes z = μ + σ · ε to enable backprop through stochastic nodes |
| Stochastic optimization | SGD-like methods inject Gaussian noise (SGLD, noisy gradients) for exploration and Bayesian sampling |
| Random Fourier features | Use Gaussian samples to approximate RBF kernel inner products in linear time |
| Linear regression | OLS = MLE under Gaussian noise; closed-form posterior under Gaussian prior |
| Gaussian mixture model | Density modeled as a weighted sum of Gaussians; trained via EM |
When you initialize the weights of a deep neural network, you almost always sample them from some kind of Gaussian (or symmetric uniform with matched variance). Xavier Glorot and Yoshua Bengio's 2010 paper Understanding the difficulty of training deep feedforward neural networks showed that for tanh networks, sampling weights from N(0, 2 / (n_in + n_out)) keeps the variance of activations and gradients roughly constant across layers, which prevents the vanishing or exploding gradient problem. Kaiming He et al.'s 2015 follow-up Delving Deep into Rectifiers derived the analogous variance for ReLU networks, N(0, 2 / n_in), accounting for the fact that ReLU zeros out about half its inputs.
The practical effect is dramatic. Without principled initialization, training networks deeper than about ten layers was essentially impossible in 2010. With Glorot or He initialization, plus residual connections and normalization layers, modern networks can be hundreds of layers deep and still trainable. The mathematical content of the initialization is just a Gaussian with carefully chosen variance, but the engineering consequences were enormous.
Batch normalization, introduced by Sergey Ioffe and Christian Szegedy in 2015, normalizes the pre-activations of each layer to have approximately zero mean and unit variance over each mini-batch, then applies a learned affine transformation. The Gaussian connection is partly motivational (the original paper appealed to "internal covariate shift" and the idea that activations should have a stable distribution) and partly mechanical (the normalization step rescales to N(0, 1)-like statistics).
The theoretical justification has shifted over the years. The original "covariate shift" story turned out to be incomplete, and follow-up work (Santurkar et al. 2018, How Does Batch Normalization Help Optimization?) argued that the real benefit comes from smoothing the loss landscape. The empirical fact that batch norm works extremely well in practice has not been seriously contested.
VAEs (Kingma and Welling, 2013, Auto-Encoding Variational Bayes) parameterize the encoder as q(z | x) = N(μ_φ(x), diag(σ_φ²(x))) and the prior as p(z) = N(0, I). The KL divergence between two diagonal Gaussians has a closed-form expression, which makes the training objective tractable. The reparameterization trick (sample ε ~ N(0, I) and set z = μ + σ · ε) is what allows the encoder gradients to flow through the sampling step.
Without the Gaussian assumption, none of this works in closed form. You can substitute other distributions (Bernoulli for discrete latents, mixture of Gaussians for multimodal latents), but every alternative pays a tractability cost.
A Gaussian process is a stochastic process where any finite collection of values has a joint multivariate normal distribution. Specifying a GP requires a mean function m(x) and a kernel function k(x, x'), and posterior inference under Gaussian likelihood reduces to a linear system involving the kernel matrix.
GPs are the canonical example of nonparametric Bayesian regression and are widely used in Bayesian optimization, where the GP serves as a surrogate model for an expensive black-box function. The catch is the O(n³) scaling of the kernel-matrix inversion, which limits exact GPs to a few thousand training points; sparse approximations and inducing-point methods extend this to hundreds of thousands.
Diffusion models, and specifically DDPMs (Ho, Jain, Abbeel 2020, Denoising Diffusion Probabilistic Models), define a forward process that gradually corrupts data x_0 by adding Gaussian noise:
q(x_t | x_(t-1)) = N(x_t; √(1 - β_t) · x_(t-1), β_t · I)
After T steps the data becomes pure Gaussian noise. The model learns to invert this process by predicting the noise (or equivalently, the score) at each step. Both the forward process and the conditional distributions in the reverse process are Gaussian, which makes the entire framework analytically tractable. Score-based models and continuous-time formulations (Song et al. 2021) extend the Gaussian noise to a stochastic differential equation, but the Gaussian structure is the entire reason the math works out.
The reparameterization trick is one of those small ideas with outsized consequences. If you want to compute the gradient of an expectation E_{z ~ q_φ(z)}[f(z)] with respect to φ, and q is the family q_φ(z) = N(μ_φ, σ_φ²), you can rewrite the sample as z = μ_φ + σ_φ · ε where ε ~ N(0, I). The gradient ∇_φ then passes through the deterministic transformation, and you avoid the high-variance score-function estimator. Almost every modern stochastic latent variable model uses this trick somewhere, and almost every implementation relies on the fact that N(μ, σ²) decomposes into μ + σ · N(0, 1).
The stochastic gradient Langevin dynamics algorithm (Welling and Teh, 2011) injects Gaussian noise into SGD updates:
θ_(t+1) = θ_t + (ε_t / 2) · ∇ log p(θ_t | data) + N(0, ε_t · I)
This turns SGD into an approximate posterior sampler, providing a bridge between optimization and Markov chain Monte Carlo. The Gaussian noise term is what gives the chain its mixing properties, and the magnitude is chosen to balance the deterministic drift against the stochastic exploration.
More generally, SGD itself is sometimes analyzed as a stochastic differential equation with Gaussian-like gradient noise, which produces useful intuitions about generalization and the implicit regularization of large learning rates.
Ali Rahimi and Ben Recht's 2007 paper Random Features for Large-Scale Kernel Machines showed that for shift-invariant kernels (in particular the Gaussian RBF kernel), the kernel can be approximated by an inner product of random feature maps where the random directions are sampled from a Gaussian distribution. This converts kernel methods, which scale O(n²) in memory and O(n³) in time, into linear methods that scale O(nD) for D random features.
The Gaussian sampling here is a direct consequence of Bochner's theorem applied to the RBF kernel: the spectral measure of an RBF kernel is itself a Gaussian, so to sample random frequencies that approximate the kernel you sample from that Gaussian.
The most common failure mode in applied statistics is assuming normality where it does not hold. A few areas where this has caused real problems:
Finance. Asset returns are notoriously not normal. Daily log returns of equities have heavy tails, with extreme moves (the 1987 crash, the 2008 collapse, the 2020 March drawdown) being orders of magnitude more frequent than a Gaussian model predicts. Long-Term Capital Management famously blew up in 1998 in part because their risk models assumed Gaussian returns and the 1998 Russian default produced moves that, under their model, should have happened roughly once in the lifetime of the universe. The standard fix is to use Student's t-distribution, generalized hyperbolic distributions, or stable distributions with infinite variance.
Network traffic and outliers in computing. Inter-arrival times of network packets, file sizes on the web, sizes of scientific computations: these all tend to follow heavy-tailed distributions (Pareto, log-normal, power-law) rather than Gaussians. Modeling them as Gaussian leads to wildly wrong capacity-planning decisions.
High-dimensional ML data. Pixel values in images, embedding norms, gradient magnitudes during training, all of these often have non-Gaussian distributions, and assuming Gaussianity in the wrong place can produce surprising failures. Gradient clipping in deep learning exists partly because gradient magnitudes have heavier tails than a Gaussian model would suggest.
Bayesian inference with heavy-tailed priors. Robust priors like Cauchy or Student-t are often preferable to Gaussian priors when you genuinely do not have strong prior information, because they put more weight on extreme values being possible.
The right response to non-normality depends on what you are doing. For point estimation with reasonable sample sizes, the central limit theorem often saves you. For tail-risk estimation, value-at-risk calculations, or anywhere extreme events matter, Gaussian assumptions can be catastrophic and you should reach for a heavier-tailed distribution explicitly.
| Distribution | How it differs from normal | Common use |
|---|---|---|
| Multivariate normal | Vector-valued, covariance matrix replaces variance | GPs, multivariate regression, latent variables |
| Standard normal | μ = 0, σ = 1 | Reference distribution for z-scores and tests |
| Half-normal | Folded over zero, only takes nonnegative values | Modeling magnitudes, scale priors |
| Truncated normal | Restricted to a bounded interval | Constrained latent variables, censored data |
| Log-normal | exp of a normal; nonnegative, right-skewed | File sizes, biological traits, financial volatility |
| Skew normal | Adds a skewness parameter | Mildly asymmetric data |
| Gaussian mixture model | Weighted sum of Gaussians | Clustering, density estimation |
| Student's t-distribution | Heavier tails; converges to normal as df → ∞ | Robust modeling, small-sample inference |
| Cauchy | Special case of t with df = 1; no mean or variance | Robust statistics, scale priors |
| Chi-squared | Sum of squared standard normals | Variance estimation, goodness-of-fit |
| F | Ratio of chi-squared variables | Variance ratio tests, ANOVA |
Most of these are connected to the normal distribution by simple transformations (Cholesky, exponentiation, summing of squares), which is one reason the Gaussian sits at the center of so much classical statistics.
At the risk of belaboring the point, here is a short list of why Gaussians dominate machine learning specifically:
None of this means you should always use a Gaussian. It does mean that when you can use one without doing real damage, doing so will save you a lot of work.