# Gaussian Process

> Source: https://aiwiki.ai/wiki/gaussian_process
> Updated: 2026-06-23
> Categories: Machine Learning, Statistics
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

A **Gaussian process (GP)** is a probabilistic [machine learning](/wiki/machine_learning) model defined as a collection of random variables, any finite number of which have a joint Gaussian distribution. Carl Edward Rasmussen and Christopher K. I. Williams give the canonical definition in their 2006 textbook: "A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution." [1] A Gaussian process generalizes the multivariate Gaussian distribution from finite-dimensional vectors to functions, providing a flexible non-parametric prior over function space. It is specified completely by a mean function m(x) and a covariance (or kernel) function k(x, x'), and is widely used for regression, classification, [Bayesian optimization](/wiki/bayesian_optimization), geostatistics, and surrogate modeling of expensive simulations [1].

Because conditioning a joint Gaussian on observed data is closed-form, Gaussian process regression returns both a point prediction and a calibrated uncertainty (the posterior variance) at every input. The principal practical cost is scalability: exact inference scales as O(n^3) in time and O(n^2) in memory for n training points, which limits exact GPs to a few thousand observations and motivates sparse and scalable approximations [1][9]. The modern formulation for [machine learning](/wiki/machine_learning) was popularized by the 2006 Rasmussen and Williams book, although the underlying theory traces back to the geostatistical work of Danie Krige and Georges Matheron in the 1950s and 1960s under the name of *kriging* [1][3]. Gaussian processes occupy a distinctive position in modern machine learning because they offer principled uncertainty quantification, automatic complexity control through the marginal likelihood, and a clean mathematical framework for combining prior knowledge with observed data.

## How is a Gaussian process defined?

Formally, a Gaussian process f(x) is fully specified by two functions:

- A mean function m(x) = E[f(x)], which encodes the expected value of the function at any input x.
- A covariance function k(x, x') = E[(f(x) - m(x))(f(x') - m(x'))], which encodes how function values at different inputs are correlated.

Given any finite collection of inputs X = {x_1, x_2, ..., x_n}, the corresponding function values f = [f(x_1), f(x_2), ..., f(x_n)] follow a multivariate Gaussian distribution with mean vector m(X) and covariance matrix K, where K_ij = k(x_i, x_j). This finite-dimensional consistency property gives Gaussian processes their tractability: any prediction or inference can be reduced to operations on Gaussian random vectors, which have closed-form expressions for conditioning and marginalization [1].

The mean function is often set to zero (after centering the data), in which case the entire prior over functions is encoded by the kernel. The kernel determines properties such as smoothness, periodicity, length scales of variation, and stationarity. Choosing an appropriate kernel is therefore one of the most important modeling decisions when working with Gaussian processes.

## Where did Gaussian processes come from?

The theoretical foundations of stochastic processes that include Gaussian processes were laid by Andrey Kolmogorov in the early 20th century as part of his axiomatic treatment of probability theory. However, the practical use of Gaussian processes for prediction predates their adoption in machine learning by several decades.

In the 1950s, the South African mining engineer Danie Krige developed an empirical method for estimating gold ore grades at unsampled locations in the Witwatersrand reef complex based on weighted averages of nearby samples. The French mathematician Georges Matheron formalized this approach in 1960 as a rigorous statistical method, naming it *kriging* in honour of Krige and founding the field of geostatistics [3]. Kriging is mathematically equivalent to Gaussian process regression with a particular choice of mean and covariance functions, although the geostatistical tradition emphasizes spatial interpolation while the machine learning tradition emphasizes general function approximation [3].

Gaussian processes entered mainstream machine learning in the 1990s through the work of researchers including David MacKay, Radford Neal, and Christopher Williams. Neal showed in his 1996 book *Bayesian Learning for Neural Networks* that a [neural network](/wiki/neural_network) with a single hidden layer of infinite width and Gaussian-distributed weights converges to a Gaussian process, establishing a deep theoretical link between [deep learning](/wiki/deep_learning) and kernel methods [18]. The 2006 textbook by Rasmussen and Williams synthesized the field and remains the canonical reference [1].

## What is Gaussian process regression?

Gaussian process regression (GPR) is the most common application of GPs and exemplifies the elegance of the framework. Given training data {(x_i, y_i)}_{i=1}^n with y_i = f(x_i) + epsilon_i where epsilon_i is independent Gaussian noise with variance sigma_n^2, the goal is to predict f(x*) at a new test input x*.

Under a zero-mean GP prior with kernel k, the joint distribution of the observed targets y and the test function value f(x*) is multivariate Gaussian. Conditioning on the observations yields a Gaussian posterior with closed-form mean and variance:

- Posterior mean: mu*(x*) = k*^T (K + sigma_n^2 I)^{-1} y
- Posterior variance: sigma*^2(x*) = k(x*, x*) - k*^T (K + sigma_n^2 I)^{-1} k*

where k* is the vector of covariances between x* and the training inputs. The posterior mean serves as the point prediction, while the posterior variance provides a calibrated uncertainty estimate. This combination of prediction and uncertainty is one of the most attractive features of Gaussian processes for scientific and engineering applications, where knowing the confidence in a prediction is often as important as the prediction itself [1].

The noise variance sigma_n^2 is part of the model and can be learned alongside the kernel hyperparameters by maximizing the marginal likelihood (see below). When the noise variance is set to zero, the GP interpolates the training data exactly; when it is positive, the GP performs smoothing, allowing for noisy observations.

## Kernel functions: the kernel zoo

The kernel encodes assumptions about the function being modeled and is the principal lever for shaping the behavior of a Gaussian process. A valid kernel must produce a positive semi-definite Gram matrix for any finite set of inputs. Many kernels of practical interest can be combined through addition or multiplication to form richer covariance structures, and most modern GP libraries include a kernel zoo of standard building blocks.

| Kernel | Form | Smoothness | Typical Use |
|--------|------|------------|-------------|
| RBF (Squared Exponential) | exp(-d^2 / (2 l^2)) | Infinitely differentiable | Smooth regression, default choice |
| Matern 1/2 (Exponential) | exp(-d / l) | Continuous, not differentiable | Rough functions, Brownian motion |
| Matern 3/2 | (1 + sqrt(3) d / l) exp(-sqrt(3) d / l) | Once differentiable | Moderately rough functions |
| Matern 5/2 | (1 + sqrt(5) d / l + 5 d^2 / (3 l^2)) exp(-sqrt(5) d / l) | Twice differentiable | Smooth but not infinitely so |
| Periodic (ExpSineSquared) | exp(-2 sin^2(pi d / p) / l^2) | Periodic | Seasonal patterns, oscillations |
| Linear | sigma_b^2 + sigma_v^2 (x - c)(x' - c) | N/A | Linear trends, drift |
| Rational Quadratic | (1 + d^2 / (2 alpha l^2))^{-alpha} | Mixture of length scales | Multi-scale variation |
| Spectral Mixture | sum of Gaussians in frequency domain | Tunable | Pattern discovery, extrapolation |
| White Noise | sigma^2 delta(x, x') | None | Independent noise |
| Constant | c | N/A | Bias term, prior mean offset |

The **RBF kernel** (also called squared exponential or Gaussian kernel) is the default choice in many libraries because it produces extremely smooth functions and is well-behaved analytically. However, its infinite differentiability is often unrealistic for real data, leading practitioners to prefer the **Matern family** for many applications. The Matern kernel is parameterized by a smoothness parameter nu that controls the differentiability of sample functions: nu = 1/2 corresponds to the exponential kernel and produces continuous but non-differentiable paths, while nu approaches infinity recovers the RBF kernel. The Matern 5/2 kernel offers a good compromise between flexibility and smoothness and is the default in many [Bayesian optimization](/wiki/bayesian_optimization) libraries.

The **periodic kernel**, sometimes called the ExpSineSquared kernel, models functions that repeat with a known or learnable period. It is commonly used for seasonal effects in time series. The **linear kernel** corresponds to Bayesian linear regression and can be combined with other kernels to model functions with both linear trends and non-linear deviations.

The **spectral mixture kernel**, introduced by Andrew Gordon Wilson in 2013, parameterizes the spectral density (Fourier transform) of the kernel as a mixture of Gaussians. This construction is universal in the sense that it can approximate any stationary kernel arbitrarily well, and it has been shown to outperform the RBF, Matern, rational quadratic, and periodic kernels on extrapolation and pattern discovery tasks [4].

## How are kernel hyperparameters learned?

A distinctive advantage of Gaussian processes is that kernel hyperparameters (length scales, signal variances, periods, noise variances) can be learned directly from data by maximizing the *log marginal likelihood*, also called the log evidence. Under the GP model, the marginal likelihood is the probability of observing the training targets y given the inputs X and the hyperparameters theta, integrated over all possible functions:

log p(y | X, theta) = -1/2 y^T (K_theta + sigma_n^2 I)^{-1} y - 1/2 log |K_theta + sigma_n^2 I| - n/2 log(2 pi)

This expression has three interpretable terms. The first is a *data fit* term that rewards models that explain the observations well. The second is a *complexity penalty* (sometimes called the Occam factor) that penalizes models that are too flexible, automatically guarding against overfitting. The third is a normalization constant. The natural balance between fit and complexity makes the marginal likelihood a principled criterion for model selection without requiring cross-validation [1][5].

Marginal likelihood optimization is typically performed using gradient-based methods such as L-BFGS or Adam, with gradients computed via automatic differentiation in modern libraries. The objective surface is generally non-convex, so multiple restarts from different initial hyperparameter values are often needed to find a good local optimum. The computational cost is dominated by the O(n^3) Cholesky factorization of the covariance matrix at each evaluation, which constrains exact GPs to datasets of a few thousand points [9].

## Gaussian process classification

When the response variable is discrete rather than continuous, a Gaussian process can still serve as a latent function whose values are mapped through a sigmoid (binary) or softmax (multiclass) link function to produce class probabilities. Unlike GP regression, GP classification has no closed-form posterior because the non-Gaussian likelihood makes the integrals intractable [6].

Approximate inference methods are therefore required. The two most widely used approaches are:

- **Laplace approximation**: A Gaussian approximation centered at the mode of the posterior. Fast and simple but can be inaccurate, particularly when the posterior is skewed or multimodal.
- **Expectation propagation (EP)**: An iterative moment-matching scheme that approximates each likelihood factor with a Gaussian. EP is generally more accurate than Laplace but more expensive and can fail to converge in some settings.
- **Variational inference**: Optimizes a Gaussian approximation to the posterior by minimizing the [KL divergence](/wiki/kl_divergence) to the true posterior. Variational methods underpin most scalable modern implementations.
- **Markov chain Monte Carlo (MCMC)**: Provides asymptotically exact inference at high computational cost.

Empirical comparisons by Kuss and Rasmussen and others have shown that EP typically outperforms Laplace approximation in terms of predictive accuracy and uncertainty calibration, with EP being the method of choice unless the computational budget is very tight [6]. GP classification is less commonly used than GP regression in practice, partly because of these inference challenges and partly because for many classification problems other methods such as deep [neural networks](/wiki/neural_network) are more competitive.

## What is Gaussian process Bayesian optimization?

Gaussian processes are the dominant surrogate model in [Bayesian optimization](/wiki/bayesian_optimization), a sequential strategy for global optimization of expensive black-box functions. The setup is to find the input x* that maximizes (or minimizes) an objective function f(x) when each evaluation of f is costly, such as training a [deep learning](/wiki/deep_learning) model with a particular hyperparameter configuration, running a physics simulation, or conducting a wet-lab experiment. The GP serves as a probabilistic surrogate for f, and an *acquisition function* derived from the GP posterior decides which point to evaluate next [7].

Common acquisition functions include:

- **Expected Improvement (EI)**: Chooses the point with the largest expected improvement over the current best observation. EI naturally balances exploration and exploitation and has strong empirical track record.
- **Upper Confidence Bound (UCB)**: Chooses the point that maximizes mu(x) + beta * sigma(x), where beta controls the exploration-exploitation tradeoff. UCB has strong regret bounds in theory.
- **Probability of Improvement (PI)**: Chooses the point most likely to exceed the current best. Tends to over-exploit unless tuned carefully.
- **Thompson sampling**: Draws a sample function from the GP posterior and selects the maximizer of that sample. Naturally handles batch parallelism.
- **Knowledge Gradient (KG)**: Looks ahead to the expected improvement in the posterior maximizer after observing a point. More expensive but often more sample-efficient.
- **Predictive Entropy Search (PES)**: Chooses the point that maximally reduces uncertainty about the location of the optimum. Theoretically appealing but computationally heavy.

A pivotal early application was the 2012 paper by Jasper Snoek, Hugo Larochelle, and Ryan Adams on *practical Bayesian optimization of machine learning algorithms*, which demonstrated that GP-based Bayesian optimization could automatically tune the hyperparameters of [deep learning](/wiki/deep_learning) models with state-of-the-art results. This work spawned the **Spearmint** library and established Bayesian optimization as a standard tool for hyperparameter tuning [7].

### Bayesian Optimization Libraries

| Library | Backend | Key Features | Year |
|---------|---------|--------------|------|
| Spearmint | Python (NumPy) | MCMC over hyperparameters, parallel evaluation, constraints | 2012 |
| GPyOpt | GPy / NumPy | Mixed-type variables, batch optimization, local penalization | 2016 |
| BoTorch | PyTorch / GPyTorch | Monte Carlo acquisition, differentiable, multi-objective | 2019 |
| Ax | BoTorch | High-level adaptive experimentation platform | 2019 |
| scikit-optimize | scikit-learn | Lightweight, easy to use, GP and forest surrogates | 2016 |
| Dragonfly | Python | Massive scale, multi-fidelity, distributed | 2018 |
| Cornell-MOE | Python / C++ | Knowledge gradient, parallel | 2014 |
| Emukit | Python | Decision-making under uncertainty, modular | 2019 |

Among these, **BoTorch** has emerged as a leading research framework. Its authors describe it as "a modern programming framework for Bayesian optimization that combines Monte-Carlo (MC) acquisition functions, a novel sample average approximation optimization approach, auto-differentiation, and variance reduction techniques." [8] Because it is built on PyTorch and GPyTorch, it supports gradient-based optimization of acquisition functions through reparameterization tricks and integrates seamlessly with the **Ax** adaptive experimentation platform from Meta. BoTorch's algorithms tend to achieve greater sample efficiency than older libraries and are widely used in industrial applications [8].

## Geostatistics and kriging

Gaussian process regression in spatial statistics is known as **kriging**, the original application of the technique. Kriging is used to interpolate values of a spatial variable (such as ore grade, pollutant concentration, or soil moisture) at unsampled locations based on observations at nearby sampled locations. The covariance function in kriging is typically called a *variogram* or *covariogram* and is fitted to empirical data using methods such as the method of moments or restricted maximum likelihood [3].

Variants of kriging differ in how the mean function is treated:

- **Simple kriging** assumes a known constant mean.
- **Ordinary kriging** assumes an unknown constant mean estimated from data.
- **Universal kriging** allows the mean to be a parametric function of location, such as a polynomial trend.
- **Co-kriging** jointly models multiple correlated variables.
- **Indicator kriging** handles categorical or threshold-exceedance data.

Kriging is the workhorse of mining, petroleum geology, hydrology, soil science, and environmental monitoring. The geostatistical literature has accumulated decades of practical wisdom on issues such as variogram modeling, anisotropy, nested structures, and large-scale spatial inference that informs modern Gaussian process practice [3].

## How do sparse Gaussian processes scale to big data?

The principal limitation of Gaussian processes is their **O(n^3) computational complexity** and **O(n^2) memory complexity**, where n is the number of training points. The cubic cost arises from the Cholesky factorization of the n by n covariance matrix required for inference and marginal likelihood evaluation. For datasets larger than a few thousand points, exact GPs become impractical, motivating extensive research into approximations [9].

### Inducing Point Methods

The most influential family of approximations introduces a small set of m inducing points (with m much less than n) that summarize the dataset and reduce computation to O(n m^2). Key milestones include:

- **Fully Independent Training Conditional (FITC)** by Snelson and Ghahramani (2006): Approximates the GP prior by assuming independence among training points conditional on the inducing points. FITC reduces cost to O(n m^2) and was the first widely-adopted sparse GP method [17].
- **Variational Free Energy (VFE)** by Titsias (2009): Treats the inducing points as variational parameters and derives a rigorous lower bound on the marginal likelihood. VFE is provably consistent in the sense that as inducing points are added, the bound becomes tight against the full GP. Empirically VFE is more conservative and accurate than FITC, which can place inducing points in pathological configurations [9].
- **Stochastic Variational Gaussian Process (SVGP)** by James Hensman, Nicolo Fusi, and Neil Lawrence (2013): Reformulates the variational lower bound so that it factorizes over training points, enabling stochastic gradient optimization with mini-batches. SVGP scales GPs to millions of data points and underpins most modern large-scale GP applications [10].

### Other Scalability Approaches

Beyond inducing points, other strategies for scaling Gaussian processes include:

- **Structured kernel interpolation (KISS-GP)**: Exploits Toeplitz or Kronecker structure in gridded inputs.
- **Stochastic gradient descent on kernel matrices**: Iterative methods avoid forming the full Kronecker matrix.
- **Random Fourier features**: Approximates stationary kernels with finite-dimensional feature maps.
- **Local GP experts and mixtures**: Partitions the input space and fits a separate GP per region.
- **GPU-accelerated linear algebra**: Libraries such as GPyTorch use Lanczos iteration and conjugate gradients to scale exact GPs further on GPUs [15].

## How are Gaussian processes related to deep learning?

The relationship between Gaussian processes and [deep learning](/wiki/deep_learning) is rich and bidirectional. Several research threads connect the two paradigms:

### Neural Networks as Gaussian Processes

Radford Neal showed in 1996 that a single-hidden-layer [neural network](/wiki/neural_network) with infinite width and i.i.d. Gaussian weights converges to a Gaussian process [18]. This result was extended to deep networks by Lee, Bahri, and others in 2018, giving rise to the **Neural Network Gaussian Process (NNGP) kernel** that exactly characterizes infinitely wide deep networks [11]. The related **Neural Tangent Kernel (NTK)** describes the training dynamics of wide networks under gradient descent. These connections have provided theoretical insights into why deep learning works and have produced practical kernels for GP regression [11].

### Deep Kernel Learning

**Deep kernel learning**, introduced by Andrew Gordon Wilson and colleagues in 2016, combines the expressiveness of [deep learning](/wiki/deep_learning) with the calibrated uncertainty of Gaussian processes by using a [neural network](/wiki/neural_network) to transform inputs before applying a base kernel such as the spectral mixture or RBF. The neural network and GP hyperparameters are jointly trained by maximizing the marginal likelihood. Deep kernel learning achieves both representation learning and probabilistic uncertainty in a unified framework and has been applied to datasets with millions of examples [12].

### Deep Gaussian Processes

**Deep Gaussian processes**, introduced by Andreas Damianou and Neil Lawrence in 2013, stack multiple GP layers in a hierarchy where the output of one GP becomes the input to the next. This composition produces models with greater expressive power than a single GP, particularly for non-stationary or hierarchical data. Inference in deep GPs is more challenging because the layered composition is no longer a Gaussian process, requiring variational approximations. Deep GPs have shown competitive performance on tasks where representation learning is important but uncertainty matters [13].

### Neural Processes

**Neural processes (NPs)**, introduced by Marta Garnelo and colleagues at DeepMind in 2018, are a family of [neural network](/wiki/neural_network) models inspired by Gaussian processes that aim to learn distributions over functions directly. A neural process consists of an encoder that maps observed context points to a global latent representation, and a decoder that produces a distribution over function values at any query point conditioned on the latent. Unlike GPs, neural processes scale linearly with the number of context points and can leverage [deep learning](/wiki/deep_learning) representations, but they sacrifice some of the calibrated uncertainty properties of true GPs. Conditional neural processes, attentive neural processes, and convolutional conditional neural processes are subsequent developments. Research has shown that neural processes are mathematically related to Gaussian processes with deep kernels under certain conditions [14].

## How do Gaussian processes differ from neural networks?

| Aspect | Gaussian Processes | Neural Networks |
|--------|--------------------|-----------------|
| Model type | Non-parametric | Parametric (fixed weights) |
| Uncertainty quantification | Built-in, calibrated posterior variance | Requires extensions (MC dropout, ensembles, BNNs) |
| Training | Marginal likelihood (closed-form for regression) | Stochastic gradient descent on loss |
| Hyperparameters | Few (kernel parameters, noise) | Many (weights, biases, architecture) |
| Scalability | O(n^3), needs sparse approximations beyond ~10K points | Scales linearly with data, billions of points feasible |
| Interpretability | Kernel structure encodes assumptions | Often opaque, requires post-hoc interpretation |
| Sample efficiency | Excellent for small data | Typically requires large datasets |
| Flexibility | Limited by chosen kernel | Universal function approximators |
| Out-of-distribution behavior | Reverts to prior, often degrades gracefully | Can produce overconfident wrong predictions |
| Best use cases | Small data, uncertainty critical, expensive evaluations | Large data, perception, generation |

Gaussian processes and neural networks occupy complementary niches. GPs shine when data is scarce, when uncertainty quantification matters (such as in active learning, [Bayesian optimization](/wiki/bayesian_optimization), scientific applications, and safety-critical decision-making), and when the user wants principled control over inductive biases via the kernel. Neural networks dominate when data is abundant and when learned representations are critical, such as in image, speech, and language modeling. Hybrid approaches such as deep kernel learning and neural processes attempt to combine the strengths of both paradigms.

It is worth noting the conceptual link to other Gaussian-based probabilistic models. A [Gaussian mixture model](/wiki/gaussian_mixture_model) (GMM) is a finite weighted sum of Gaussians used for density estimation and clustering, while a Gaussian process is an infinite-dimensional distribution over functions. Both are foundational in [Bayesian inference](/wiki/bayesian_inference), but they address different problems: GMMs cluster data, while GPs predict functions.

## Software and libraries

A mature ecosystem of open-source libraries supports Gaussian process modeling across the major scientific computing platforms.

| Library | Backend | Key Features | Best For |
|---------|---------|--------------|----------|
| GPy | Python (NumPy) | Mature, comprehensive, extensive kernel zoo | Research, prototyping |
| GPflow | TensorFlow | Variational methods, multi-output, easy customization | Modern research, production |
| GPyTorch | PyTorch | GPU acceleration, scalable inference, modular | Large-scale, deep kernel learning |
| BoTorch | PyTorch / GPyTorch | Bayesian optimization, Monte Carlo acquisition | Hyperparameter tuning, BO |
| scikit-learn | Python (NumPy) | Simple API, integrated with sklearn ecosystem | Beginners, small datasets |
| Stan | Stan | Full Bayesian inference with HMC | Hierarchical models, full posteriors |
| PyMC | Python | Probabilistic programming, GPs as random variables | Bayesian workflow integration |
| Edward2 / TensorFlow Probability | TensorFlow | GPs as distributions, deep probabilistic models | Probabilistic programming on TF |
| celerite / celerite2 | Python | O(n) inference for 1D problems with semi-separable kernels | Time series, astronomy |
| Pyro | PyTorch | Probabilistic programming with GPs as building blocks | Variational inference, hybrid models |

**GPyTorch** has become the dominant library for GP research because of its scalable inference engine based on conjugate gradients and Lanczos iteration, GPU acceleration, and tight integration with PyTorch's automatic differentiation [15]. **GPflow** offers a similar feature set on TensorFlow with particularly strong support for multi-output GPs and variational methods. **scikit-learn** provides an entry point with a simple API suitable for educational purposes and small problems. **Stan** is preferred when full posterior inference over hyperparameters via Hamiltonian Monte Carlo is required, common in hierarchical models and astronomical applications.

## What are Gaussian processes used for?

The combination of probabilistic predictions, principled uncertainty quantification, and a flexible non-parametric formulation has led to Gaussian processes being adopted across a remarkably broad set of application areas.

### Hyperparameter Tuning

Bayesian optimization with Gaussian process surrogates is a standard tool for tuning hyperparameters of expensive [machine learning](/wiki/machine_learning) models, including [deep learning](/wiki/deep_learning) systems. Major frameworks such as Google Vizier, Meta Ax, Microsoft NNI, and Amazon SageMaker Automatic Model Tuning include GP-based search strategies. Automated machine learning (AutoML) systems frequently rely on GPs for inner-loop optimization.

### Surrogate Modeling for Expensive Simulations

When each evaluation of a physical or computational simulation is expensive (taking minutes to days per run), a Gaussian process trained on a small set of simulation outputs can serve as a fast surrogate for downstream tasks such as design optimization, sensitivity analysis, and uncertainty quantification. GP surrogates are widely used in aerospace, automotive, energy, and pharmaceutical engineering [16].

### Spatial and Environmental Statistics

Kriging and its variants remain the standard interpolation method in mining, soil science, hydrology, meteorology, and air quality monitoring. GPs are used to model temperature, precipitation, pollutant concentrations, and ecological variables across landscapes.

### Time Series Forecasting

Gaussian processes can model time series by treating time as the input variable. Composing periodic, trend, and noise kernels produces flexible models that handle seasonality and irregular sampling. GP time series models are popular in finance, demand forecasting, and astronomy where calibrated uncertainty bands are valued.

### Reinforcement Learning and Control

In model-based reinforcement learning, GPs are used to learn dynamics models from limited interaction data. The PILCO algorithm by Marc Deisenroth and Carl Rasmussen demonstrated that GP-based dynamics learning could solve continuous control problems with far fewer samples than model-free methods [19]. GPs are also used in safe Bayesian optimization for robotic control where exploration must respect safety constraints.

### Active Learning and Experimental Design

The posterior variance of a GP provides a natural acquisition criterion for selecting informative measurements. This is exploited in active learning, optimal experimental design, drug discovery, and materials science to minimize the number of expensive experiments needed to characterize a system.

### Astronomy and Astrophysics

Gaussian processes are widely used in exoplanet detection (modeling stellar variability that obscures planetary signals), gravitational wave analysis, and cosmological parameter estimation. The celerite library provides O(n) GP inference for the structured kernels common in astronomical time series [20].

### Robotics

GPs are used in robotics for state estimation, terrain classification, mapping, and policy learning. Their uncertainty estimates enable robots to identify when they are operating in unfamiliar conditions and to seek out informative observations.

## What are the limitations of Gaussian processes?

Despite their many strengths, Gaussian processes have several well-known limitations.

- **Cubic computational complexity**: Exact inference scales as O(n^3) in time and O(n^2) in memory, limiting exact GPs to a few thousand training points. Sparse and scalable methods address this but introduce approximation error [9].
- **Sensitivity to kernel choice**: The kernel encodes strong assumptions about the function. A poorly chosen kernel can produce poor predictions even with abundant data, and selecting an appropriate kernel can require domain expertise.
- **Difficulty with high-dimensional inputs**: Standard kernels suffer from the curse of dimensionality. As input dimension grows, distance-based kernels become less informative, and GPs may underperform compared to neural networks for high-dimensional perceptual data.
- **Stationarity assumption**: Most common kernels are stationary, meaning the covariance depends only on the distance between inputs. Non-stationary functions, where smoothness varies across the input space, require either non-stationary kernels (which are harder to specify) or warping techniques.
- **Marginal likelihood non-convexity**: Hyperparameter optimization is non-convex and can converge to poor local optima, particularly for complex compound kernels or sparse approximations.
- **Limited representation learning**: Unlike deep networks, vanilla GPs do not learn hierarchical representations from raw data. Hybrid approaches such as deep kernel learning attempt to address this.
- **Implementation challenges**: Numerical stability of Cholesky factorization and gradient computation requires careful handling. Adding small jitter to the diagonal of the covariance matrix is a common practical fix.
- **Less mature multi-output support**: Modeling multiple correlated outputs with full covariance structure is more complex than for single-output GPs and an active area of research.

## Recent Developments

Research on Gaussian processes continues to advance on several fronts. Modern directions include:

- **Scalable inference on GPUs**: Libraries such as GPyTorch leverage Krylov subspace methods and Lanczos quadrature to scale exact GP inference to hundreds of thousands of points on a single GPU [15].
- **GP and ensemble hybrids**: Treed GPs, mixtures of GP experts, and ensembles of GPs combine local models for improved scalability and flexibility.
- **Multi-fidelity GPs**: Methods such as deep multi-fidelity models combine cheap low-fidelity simulations with expensive high-fidelity ones to accelerate surrogate construction.
- **Constrained and physics-informed GPs**: GPs that enforce physical laws, monotonicity, or other constraints are increasingly used in scientific [machine learning](/wiki/machine_learning).
- **Causal Gaussian processes**: GPs in causal inference for treatment effect estimation, particularly in continuous treatment settings.
- **GPs for set-valued and structured outputs**: Extensions to model functions with set, graph, or tree-valued outputs.
- **Probabilistic numerics**: Casting numerical algorithms (integration, ODE solving, linear algebra) as GP inference, providing uncertainty over computational results.

## See Also

- [Bayesian optimization](/wiki/bayesian_optimization)
- [Bayesian inference](/wiki/bayesian_inference)
- [Machine learning](/wiki/machine_learning)
- [Gaussian mixture model](/wiki/gaussian_mixture_model)
- [Neural network](/wiki/neural_network)
- [Deep learning](/wiki/deep_learning)
- [KL divergence](/wiki/kl_divergence)

## References

1. Rasmussen, C. E. and Williams, C. K. I. (2006). *Gaussian Processes for Machine Learning*. MIT Press. Available at https://gaussianprocess.org/gpml/
2. Rasmussen, C. E. (2004). "Gaussian Processes in Machine Learning". *Lecture Notes in Computer Science*, vol 3176. Springer. https://mlg.eng.cam.ac.uk/pub/pdf/Ras04.pdf
3. Cressie, N. (1990). "The origins of kriging". *Mathematical Geology*, 22(3), 239-252; Matheron, G. (1963). "Principles of geostatistics". *Economic Geology*, 58(8), 1246-1266.
4. Wilson, A. G. and Adams, R. P. (2013). "Gaussian Process Kernels for Pattern Discovery and Extrapolation". *International Conference on Machine Learning (ICML)*. arXiv:1302.4245.
5. MacKay, D. J. C. (2003). *Information Theory, Inference, and Learning Algorithms*. Cambridge University Press, chapter 45.
6. Kuss, M. and Rasmussen, C. E. (2005). "Assessing Approximate Inference for Binary Gaussian Process Classification". *Journal of Machine Learning Research*, 6, 1679-1704.
7. Snoek, J., Larochelle, H., and Adams, R. P. (2012). "Practical Bayesian Optimization of Machine Learning Algorithms". *Advances in Neural Information Processing Systems (NeurIPS)*. arXiv:1206.2944.
8. Balandat, M., Karrer, B., Jiang, D. R., Daulton, S., Letham, B., Wilson, A. G., and Bakshy, E. (2020). "BoTorch: A Framework for Efficient Monte-Carlo Bayesian Optimization". *Advances in Neural Information Processing Systems (NeurIPS)*. arXiv:1910.06403.
9. Titsias, M. (2009). "Variational Learning of Inducing Variables in Sparse Gaussian Processes". *Artificial Intelligence and Statistics (AISTATS)*.
10. Hensman, J., Fusi, N., and Lawrence, N. D. (2013). "Gaussian Processes for Big Data". *Conference on Uncertainty in Artificial Intelligence (UAI)*. arXiv:1309.6835.
11. Lee, J., Bahri, Y., Novak, R., Schoenholz, S. S., Pennington, J., and Sohl-Dickstein, J. (2018). "Deep Neural Networks as Gaussian Processes". *International Conference on Learning Representations (ICLR)*. arXiv:1711.00165.
12. Wilson, A. G., Hu, Z., Salakhutdinov, R., and Xing, E. P. (2016). "Deep Kernel Learning". *Artificial Intelligence and Statistics (AISTATS)*. arXiv:1511.02222.
13. Damianou, A. C. and Lawrence, N. D. (2013). "Deep Gaussian Processes". *Artificial Intelligence and Statistics (AISTATS)*. arXiv:1211.0358.
14. Garnelo, M., Schwarz, J., Rosenbaum, D., Viola, F., Rezende, D. J., Eslami, S. M. A., and Teh, Y. W. (2018). "Neural Processes". *ICML Workshop on Theoretical Foundations and Applications of Deep Generative Models*. arXiv:1807.01622.
15. Gardner, J. R., Pleiss, G., Weinberger, K. Q., Bindel, D., and Wilson, A. G. (2018). "GPyTorch: Blackbox Matrix-Matrix Gaussian Process Inference with GPU Acceleration". *Advances in Neural Information Processing Systems (NeurIPS)*. arXiv:1809.11165.
16. Gramacy, R. B. (2020). *Surrogates: Gaussian Process Modeling, Design and Optimization for the Applied Sciences*. CRC Press.
17. Snelson, E. and Ghahramani, Z. (2006). "Sparse Gaussian Processes using Pseudo-inputs". *Advances in Neural Information Processing Systems (NeurIPS)*.
18. Neal, R. M. (1996). *Bayesian Learning for Neural Networks*. Lecture Notes in Statistics, Springer.
19. Deisenroth, M. P. and Rasmussen, C. E. (2011). "PILCO: A Model-Based and Data-Efficient Approach to Policy Search". *International Conference on Machine Learning (ICML)*.
20. Foreman-Mackey, D., Agol, E., Ambikasaran, S., and Angus, R. (2017). "Fast and Scalable Gaussian Process Modeling with Applications to Astronomical Time Series". *The Astronomical Journal*, 154(6), 220. arXiv:1703.09710.
