Gaussian Process
Last reviewed
Apr 28, 2026
Sources
20 citations
Review status
Source-backed
Revision
v1 ยท 5,174 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
Apr 28, 2026
Sources
20 citations
Review status
Source-backed
Revision
v1 ยท 5,174 words
Add missing citations, update stale details, or suggest a clearer explanation.
A Gaussian process (GP) is a probabilistic machine learning model defined as a collection of random variables, any finite number of which have a joint Gaussian distribution. A Gaussian process generalizes the multivariate Gaussian distribution from finite-dimensional vectors to functions, providing a flexible non-parametric prior over function space. Specified completely by a mean function m(x) and a covariance (or kernel) function k(x, x'), Gaussian processes are widely used for regression, classification, Bayesian optimization, geostatistics, and surrogate modeling of expensive simulations [1].
The modern formulation of Gaussian processes for machine learning was popularized by Carl Edward Rasmussen and Christopher K. I. Williams in their canonical 2006 textbook Gaussian Processes for Machine Learning, although the underlying theory traces back to the geostatistical work of Danie Krige and Georges Matheron in the 1950s and 1960s under the name of kriging [1][2]. Gaussian processes occupy a distinctive position in modern machine learning because they offer principled uncertainty quantification, automatic complexity control through the marginal likelihood, and a clean mathematical framework for combining prior knowledge with observed data.
Formally, a Gaussian process f(x) is fully specified by two functions:
Given any finite collection of inputs X = {x_1, x_2, ..., x_n}, the corresponding function values f = [f(x_1), f(x_2), ..., f(x_n)] follow a multivariate Gaussian distribution with mean vector m(X) and covariance matrix K, where K_ij = k(x_i, x_j). This finite-dimensional consistency property gives Gaussian processes their tractability: any prediction or inference can be reduced to operations on Gaussian random vectors, which have closed-form expressions for conditioning and marginalization [1].
The mean function is often set to zero (after centering the data), in which case the entire prior over functions is encoded by the kernel. The kernel determines properties such as smoothness, periodicity, length scales of variation, and stationarity. Choosing an appropriate kernel is therefore one of the most important modeling decisions when working with Gaussian processes.
The theoretical foundations of stochastic processes that include Gaussian processes were laid by Andrey Kolmogorov in the early 20th century as part of his axiomatic treatment of probability theory. However, the practical use of Gaussian processes for prediction predates their adoption in machine learning by several decades.
In the 1950s, the South African mining engineer Danie Krige developed an empirical method for estimating gold ore grades at unsampled locations in the Witwatersrand reef complex based on weighted averages of nearby samples. The French mathematician Georges Matheron formalized this approach in 1960 as a rigorous statistical method, naming it kriging in honour of Krige and founding the field of geostatistics [3]. Kriging is mathematically equivalent to Gaussian process regression with a particular choice of mean and covariance functions, although the geostatistical tradition emphasizes spatial interpolation while the machine learning tradition emphasizes general function approximation [3].
Gaussian processes entered mainstream machine learning in the 1990s through the work of researchers including David MacKay, Radford Neal, and Christopher Williams. Neal showed in 1996 that a neural network with a single hidden layer of infinite width and Gaussian-distributed weights converges to a Gaussian process, establishing a deep theoretical link between deep learning and kernel methods. The 2006 textbook by Rasmussen and Williams synthesized the field and remains the canonical reference [1].
Gaussian process regression (GPR) is the most common application of GPs and exemplifies the elegance of the framework. Given training data {(x_i, y_i)}_{i=1}^n with y_i = f(x_i) + epsilon_i where epsilon_i is independent Gaussian noise with variance sigma_n^2, the goal is to predict f(x*) at a new test input x*.
Under a zero-mean GP prior with kernel k, the joint distribution of the observed targets y and the test function value f(x*) is multivariate Gaussian. Conditioning on the observations yields a Gaussian posterior with closed-form mean and variance:
where k* is the vector of covariances between x* and the training inputs. The posterior mean serves as the point prediction, while the posterior variance provides a calibrated uncertainty estimate. This combination of prediction and uncertainty is one of the most attractive features of Gaussian processes for scientific and engineering applications, where knowing the confidence in a prediction is often as important as the prediction itself [1].
The noise variance sigma_n^2 is part of the model and can be learned alongside the kernel hyperparameters by maximizing the marginal likelihood (see below). When the noise variance is set to zero, the GP interpolates the training data exactly; when it is positive, the GP performs smoothing, allowing for noisy observations.
The kernel encodes assumptions about the function being modeled and is the principal lever for shaping the behavior of a Gaussian process. A valid kernel must produce a positive semi-definite Gram matrix for any finite set of inputs. Many kernels of practical interest can be combined through addition or multiplication to form richer covariance structures, and most modern GP libraries include a kernel zoo of standard building blocks.
| Kernel | Form | Smoothness | Typical Use |
|---|---|---|---|
| RBF (Squared Exponential) | exp(-d^2 / (2 l^2)) | Infinitely differentiable | Smooth regression, default choice |
| Matern 1/2 (Exponential) | exp(-d / l) | Continuous, not differentiable | Rough functions, Brownian motion |
| Matern 3/2 | (1 + sqrt(3) d / l) exp(-sqrt(3) d / l) | Once differentiable | Moderately rough functions |
| Matern 5/2 | (1 + sqrt(5) d / l + 5 d^2 / (3 l^2)) exp(-sqrt(5) d / l) | Twice differentiable | Smooth but not infinitely so |
| Periodic (ExpSineSquared) | exp(-2 sin^2(pi d / p) / l^2) | Periodic | Seasonal patterns, oscillations |
| Linear | sigma_b^2 + sigma_v^2 (x - c)(x' - c) | N/A | Linear trends, drift |
| Rational Quadratic | (1 + d^2 / (2 alpha l^2))^{-alpha} | Mixture of length scales | Multi-scale variation |
| Spectral Mixture | sum of Gaussians in frequency domain | Tunable | Pattern discovery, extrapolation |
| White Noise | sigma^2 delta(x, x') | None | Independent noise |
| Constant | c | N/A | Bias term, prior mean offset |
The RBF kernel (also called squared exponential or Gaussian kernel) is the default choice in many libraries because it produces extremely smooth functions and is well-behaved analytically. However, its infinite differentiability is often unrealistic for real data, leading practitioners to prefer the Matern family for many applications. The Matern kernel is parameterized by a smoothness parameter nu that controls the differentiability of sample functions: nu = 1/2 corresponds to the exponential kernel and produces continuous but non-differentiable paths, while nu approaches infinity recovers the RBF kernel. The Matern 5/2 kernel offers a good compromise between flexibility and smoothness and is the default in many Bayesian optimization libraries.
The periodic kernel, sometimes called the ExpSineSquared kernel, models functions that repeat with a known or learnable period. It is commonly used for seasonal effects in time series. The linear kernel corresponds to Bayesian linear regression and can be combined with other kernels to model functions with both linear trends and non-linear deviations.
The spectral mixture kernel, introduced by Andrew Gordon Wilson in 2013, parameterizes the spectral density (Fourier transform) of the kernel as a mixture of Gaussians. This construction is universal in the sense that it can approximate any stationary kernel arbitrarily well, and it has been shown to outperform the RBF, Matern, rational quadratic, and periodic kernels on extrapolation and pattern discovery tasks [4].
A distinctive advantage of Gaussian processes is that kernel hyperparameters (length scales, signal variances, periods, noise variances) can be learned directly from data by maximizing the log marginal likelihood, also called the log evidence. Under the GP model, the marginal likelihood is the probability of observing the training targets y given the inputs X and the hyperparameters theta, integrated over all possible functions:
log p(y | X, theta) = -1/2 y^T (K_theta + sigma_n^2 I)^{-1} y - 1/2 log |K_theta + sigma_n^2 I| - n/2 log(2 pi)
This expression has three interpretable terms. The first is a data fit term that rewards models that explain the observations well. The second is a complexity penalty (sometimes called the Occam factor) that penalizes models that are too flexible, automatically guarding against overfitting. The third is a normalization constant. The natural balance between fit and complexity makes the marginal likelihood a principled criterion for model selection without requiring cross-validation [1][5].
Marginal likelihood optimization is typically performed using gradient-based methods such as L-BFGS or Adam, with gradients computed via automatic differentiation in modern libraries. The objective surface is generally non-convex, so multiple restarts from different initial hyperparameter values are often needed to find a good local optimum. The computational cost is dominated by the O(n^3) Cholesky factorization of the covariance matrix at each evaluation, which constrains exact GPs to datasets of a few thousand points.
When the response variable is discrete rather than continuous, a Gaussian process can still serve as a latent function whose values are mapped through a sigmoid (binary) or softmax (multiclass) link function to produce class probabilities. Unlike GP regression, GP classification has no closed-form posterior because the non-Gaussian likelihood makes the integrals intractable [6].
Approximate inference methods are therefore required. The two most widely used approaches are:
Empirical comparisons by Kuss and Rasmussen and others have shown that EP typically outperforms Laplace approximation in terms of predictive accuracy and uncertainty calibration, with EP being the method of choice unless the computational budget is very tight [6]. GP classification is less commonly used than GP regression in practice, partly because of these inference challenges and partly because for many classification problems other methods such as deep neural networks are more competitive.
Gaussian processes are the dominant surrogate model in Bayesian optimization, a sequential strategy for global optimization of expensive black-box functions. The setup is to find the input x* that maximizes (or minimizes) an objective function f(x) when each evaluation of f is costly, such as training a deep learning model with a particular hyperparameter configuration, running a physics simulation, or conducting a wet-lab experiment. The GP serves as a probabilistic surrogate for f, and an acquisition function derived from the GP posterior decides which point to evaluate next [7].
Common acquisition functions include:
A pivotal early application was the 2012 paper by Jasper Snoek, Hugo Larochelle, and Ryan Adams on practical Bayesian optimization of machine learning algorithms, which demonstrated that GP-based Bayesian optimization could automatically tune the hyperparameters of deep learning models with state-of-the-art results. This work spawned the Spearmint library and established Bayesian optimization as a standard tool for hyperparameter tuning [7].
| Library | Backend | Key Features | Year |
|---|---|---|---|
| Spearmint | Python (NumPy) | MCMC over hyperparameters, parallel evaluation, constraints | 2012 |
| GPyOpt | GPy / NumPy | Mixed-type variables, batch optimization, local penalization | 2016 |
| BoTorch | PyTorch / GPyTorch | Monte Carlo acquisition, differentiable, multi-objective | 2019 |
| Ax | BoTorch | High-level adaptive experimentation platform | 2019 |
| scikit-optimize | scikit-learn | Lightweight, easy to use, GP and forest surrogates | 2016 |
| Dragonfly | Python | Massive scale, multi-fidelity, distributed | 2018 |
| Cornell-MOE | Python / C++ | Knowledge gradient, parallel | 2014 |
| Emukit | Python | Decision-making under uncertainty, modular | 2019 |
Among these, BoTorch has emerged as a leading research framework because it is built on PyTorch and GPyTorch, supports gradient-based optimization of acquisition functions through reparameterization tricks, and integrates seamlessly with the Ax adaptive experimentation platform from Meta. BoTorch's algorithms tend to achieve greater sample efficiency than older libraries and are widely used in industrial applications [8].
Gaussian process regression in spatial statistics is known as kriging, the original application of the technique. Kriging is used to interpolate values of a spatial variable (such as ore grade, pollutant concentration, or soil moisture) at unsampled locations based on observations at nearby sampled locations. The covariance function in kriging is typically called a variogram or covariogram and is fitted to empirical data using methods such as the method of moments or restricted maximum likelihood [3].
Variants of kriging differ in how the mean function is treated:
Kriging is the workhorse of mining, petroleum geology, hydrology, soil science, and environmental monitoring. The geostatistical literature has accumulated decades of practical wisdom on issues such as variogram modeling, anisotropy, nested structures, and large-scale spatial inference that informs modern Gaussian process practice [3].
The principal limitation of Gaussian processes is their O(n^3) computational complexity and O(n^2) memory complexity, where n is the number of training points. The cubic cost arises from the Cholesky factorization of the n by n covariance matrix required for inference and marginal likelihood evaluation. For datasets larger than a few thousand points, exact GPs become impractical, motivating extensive research into approximations [9].
The most influential family of approximations introduces a small set of m inducing points (with m much less than n) that summarize the dataset and reduce computation to O(n m^2). Key milestones include:
Beyond inducing points, other strategies for scaling Gaussian processes include:
The relationship between Gaussian processes and deep learning is rich and bidirectional. Several research threads connect the two paradigms:
Radford Neal showed in 1996 that a single-hidden-layer neural network with infinite width and i.i.d. Gaussian weights converges to a Gaussian process. This result was extended to deep networks by Lee, Bahri, and others in 2018, giving rise to the Neural Network Gaussian Process (NNGP) kernel that exactly characterizes infinitely wide deep networks. The related Neural Tangent Kernel (NTK) describes the training dynamics of wide networks under gradient descent. These connections have provided theoretical insights into why deep learning works and have produced practical kernels for GP regression [11].
Deep kernel learning, introduced by Andrew Gordon Wilson and colleagues in 2016, combines the expressiveness of deep learning with the calibrated uncertainty of Gaussian processes by using a neural network to transform inputs before applying a base kernel such as the spectral mixture or RBF. The neural network and GP hyperparameters are jointly trained by maximizing the marginal likelihood. Deep kernel learning achieves both representation learning and probabilistic uncertainty in a unified framework and has been applied to datasets with millions of examples [12].
Deep Gaussian processes, introduced by Andreas Damianou and Neil Lawrence in 2013, stack multiple GP layers in a hierarchy where the output of one GP becomes the input to the next. This composition produces models with greater expressive power than a single GP, particularly for non-stationary or hierarchical data. Inference in deep GPs is more challenging because the layered composition is no longer a Gaussian process, requiring variational approximations. Deep GPs have shown competitive performance on tasks where representation learning is important but uncertainty matters [13].
Neural processes (NPs), introduced by Marta Garnelo and colleagues at DeepMind in 2018, are a family of neural network models inspired by Gaussian processes that aim to learn distributions over functions directly. A neural process consists of an encoder that maps observed context points to a global latent representation, and a decoder that produces a distribution over function values at any query point conditioned on the latent. Unlike GPs, neural processes scale linearly with the number of context points and can leverage deep learning representations, but they sacrifice some of the calibrated uncertainty properties of true GPs. Conditional neural processes, attentive neural processes, and convolutional conditional neural processes are subsequent developments. Research has shown that neural processes are mathematically related to Gaussian processes with deep kernels under certain conditions [14].
| Aspect | Gaussian Processes | Neural Networks |
|---|---|---|
| Model type | Non-parametric | Parametric (fixed weights) |
| Uncertainty quantification | Built-in, calibrated posterior variance | Requires extensions (MC dropout, ensembles, BNNs) |
| Training | Marginal likelihood (closed-form for regression) | Stochastic gradient descent on loss |
| Hyperparameters | Few (kernel parameters, noise) | Many (weights, biases, architecture) |
| Scalability | O(n^3), needs sparse approximations beyond ~10K points | Scales linearly with data, billions of points feasible |
| Interpretability | Kernel structure encodes assumptions | Often opaque, requires post-hoc interpretation |
| Sample efficiency | Excellent for small data | Typically requires large datasets |
| Flexibility | Limited by chosen kernel | Universal function approximators |
| Out-of-distribution behavior | Reverts to prior, often degrades gracefully | Can produce overconfident wrong predictions |
| Best use cases | Small data, uncertainty critical, expensive evaluations | Large data, perception, generation |
Gaussian processes and neural networks occupy complementary niches. GPs shine when data is scarce, when uncertainty quantification matters (such as in active learning, Bayesian optimization, scientific applications, and safety-critical decision-making), and when the user wants principled control over inductive biases via the kernel. Neural networks dominate when data is abundant and when learned representations are critical, such as in image, speech, and language modeling. Hybrid approaches such as deep kernel learning and neural processes attempt to combine the strengths of both paradigms.
It is worth noting the conceptual link to other Gaussian-based probabilistic models. A Gaussian mixture model (GMM) is a finite weighted sum of Gaussians used for density estimation and clustering, while a Gaussian process is an infinite-dimensional distribution over functions. Both are foundational in Bayesian inference, but they address different problems: GMMs cluster data, while GPs predict functions.
A mature ecosystem of open-source libraries supports Gaussian process modeling across the major scientific computing platforms.
| Library | Backend | Key Features | Best For |
|---|---|---|---|
| GPy | Python (NumPy) | Mature, comprehensive, extensive kernel zoo | Research, prototyping |
| GPflow | TensorFlow | Variational methods, multi-output, easy customization | Modern research, production |
| GPyTorch | PyTorch | GPU acceleration, scalable inference, modular | Large-scale, deep kernel learning |
| BoTorch | PyTorch / GPyTorch | Bayesian optimization, Monte Carlo acquisition | Hyperparameter tuning, BO |
| scikit-learn | Python (NumPy) | Simple API, integrated with sklearn ecosystem | Beginners, small datasets |
| Stan | Stan | Full Bayesian inference with HMC | Hierarchical models, full posteriors |
| PyMC | Python | Probabilistic programming, GPs as random variables | Bayesian workflow integration |
| Edward2 / TensorFlow Probability | TensorFlow | GPs as distributions, deep probabilistic models | Probabilistic programming on TF |
| celerite / celerite2 | Python | O(n) inference for 1D problems with semi-separable kernels | Time series, astronomy |
| Pyro | PyTorch | Probabilistic programming with GPs as building blocks | Variational inference, hybrid models |
GPyTorch has become the dominant library for GP research because of its scalable inference engine based on conjugate gradients and Lanczos iteration, GPU acceleration, and tight integration with PyTorch's automatic differentiation. GPflow offers a similar feature set on TensorFlow with particularly strong support for multi-output GPs and variational methods. scikit-learn provides an entry point with a simple API suitable for educational purposes and small problems. Stan is preferred when full posterior inference over hyperparameters via Hamiltonian Monte Carlo is required, common in hierarchical models and astronomical applications [15].
The combination of probabilistic predictions, principled uncertainty quantification, and a flexible non-parametric formulation has led to Gaussian processes being adopted across a remarkably broad set of application areas.
Bayesian optimization with Gaussian process surrogates is a standard tool for tuning hyperparameters of expensive machine learning models, including deep learning systems. Major frameworks such as Google Vizier, Meta Ax, Microsoft NNI, and Amazon SageMaker Automatic Model Tuning include GP-based search strategies. Automated machine learning (AutoML) systems frequently rely on GPs for inner-loop optimization.
When each evaluation of a physical or computational simulation is expensive (taking minutes to days per run), a Gaussian process trained on a small set of simulation outputs can serve as a fast surrogate for downstream tasks such as design optimization, sensitivity analysis, and uncertainty quantification. GP surrogates are widely used in aerospace, automotive, energy, and pharmaceutical engineering [16].
Kriging and its variants remain the standard interpolation method in mining, soil science, hydrology, meteorology, and air quality monitoring. GPs are used to model temperature, precipitation, pollutant concentrations, and ecological variables across landscapes.
Gaussian processes can model time series by treating time as the input variable. Composing periodic, trend, and noise kernels produces flexible models that handle seasonality and irregular sampling. GP time series models are popular in finance, demand forecasting, and astronomy where calibrated uncertainty bands are valued.
In model-based reinforcement learning, GPs are used to learn dynamics models from limited interaction data. The PILCO algorithm by Marc Deisenroth and Carl Rasmussen demonstrated that GP-based dynamics learning could solve continuous control problems with far fewer samples than model-free methods. GPs are also used in safe Bayesian optimization for robotic control where exploration must respect safety constraints.
The posterior variance of a GP provides a natural acquisition criterion for selecting informative measurements. This is exploited in active learning, optimal experimental design, drug discovery, and materials science to minimize the number of expensive experiments needed to characterize a system.
Gaussian processes are widely used in exoplanet detection (modeling stellar variability that obscures planetary signals), gravitational wave analysis, and cosmological parameter estimation. The celerite library provides O(n) GP inference for the structured kernels common in astronomical time series.
GPs are used in robotics for state estimation, terrain classification, mapping, and policy learning. Their uncertainty estimates enable robots to identify when they are operating in unfamiliar conditions and to seek out informative observations.
Despite their many strengths, Gaussian processes have several well-known limitations.
Research on Gaussian processes continues to advance on several fronts. Modern directions include: