A probabilistic regression model is a regression model that outputs a full probability distribution over possible target values rather than a single point estimate. By modeling the conditional distribution p(y | x) of a continuous target variable y given input features x, these models capture the uncertainty inherent in predictions. This makes them especially valuable in applications where understanding the range and likelihood of outcomes is as important as the prediction itself, such as in medical diagnosis, weather forecasting, financial risk assessment, and engineering safety analysis.
Unlike standard regression methods (such as ordinary least squares) that return only a best-guess value, probabilistic regression models provide prediction intervals, variance estimates, or entire density functions. This additional information allows practitioners to make risk-aware decisions and to quantify how confident a model is in each of its predictions.
Imagine you want to guess how tall a sunflower will grow. A regular guess gives you one number, like "6 feet." But a probabilistic guess is more like saying, "It will probably be between 5 and 7 feet tall, and most likely around 6 feet." The guess includes a range, and it tells you how sure you are. If you have seen lots of sunflowers before, your range is small because you know what to expect. If you have never grown one, your range is bigger because you are less certain. That is what a probabilistic regression model does: it gives not just a single answer but a whole picture of what could happen and how likely each outcome is.
In a standard regression setting, we model the expected value of y given x:
$$\hat{y} = f(x)$$
A probabilistic regression model instead estimates the full conditional distribution:
$$p(y \mid x, \theta)$$
where θ represents the model parameters. For many probabilistic regression methods, the output distribution is assumed to be Gaussian:
$$p(y \mid x) = \mathcal{N}(y; \mu(x), \sigma^2(x))$$
Here, both the mean μ(x) and the variance σ²(x) are functions of the input. The model is typically trained by maximizing the log-likelihood:
$$\mathcal{L}(\theta) = \sum_{i=1}^{N} \log p(y_i \mid x_i, \theta)$$
For a Gaussian output, the negative log-likelihood reduces to:
$$-\mathcal{L}(\theta) = \frac{1}{2} \sum_{i=1}^{N} \left[ \log \sigma^2(x_i) + \frac{(y_i - \mu(x_i))^2}{\sigma^2(x_i)} \right] + \text{const}$$
This formulation naturally penalizes both inaccurate mean predictions and poorly calibrated variance estimates.
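As a concrete illustration, here is a minimal NumPy sketch of this Gaussian negative log-likelihood (dropping the constant term); the arrays `y`, `mu`, and `var` are hypothetical stand-ins for the observed targets and the model's outputs.

```python
import numpy as np

def gaussian_nll(y, mu, var):
    """Gaussian negative log-likelihood (constant term omitted).

    y   : observed targets, shape (N,)
    mu  : predicted means mu(x_i), shape (N,)
    var : predicted variances sigma^2(x_i), shape (N,), must be positive
    """
    return 0.5 * np.sum(np.log(var) + (y - mu) ** 2 / var)

# A confident (small-variance) but inaccurate prediction is penalized more heavily
y = np.array([1.0, 2.0, 3.0])
print(gaussian_nll(y, mu=np.array([1.1, 2.0, 2.8]), var=np.array([0.5, 0.5, 0.5])))
print(gaussian_nll(y, mu=np.array([1.1, 2.0, 2.8]), var=np.array([0.01, 0.01, 0.01])))
```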
Probabilistic regression models distinguish between two fundamental sources of uncertainty:
| Type | Also known as | Source | Reducible? | Example |
|---|---|---|---|---|
| Aleatoric uncertainty | Data uncertainty, statistical uncertainty | Inherent noise in the data-generating process | No | Measurement sensor noise, natural variability in patient outcomes |
| Epistemic uncertainty | Model uncertainty, knowledge uncertainty | Lack of knowledge due to limited training data | Yes (with more data) | Predictions in regions of feature space with few training examples |
Aleatoric uncertainty is irreducible and stems from the randomness inherent in the system being modeled. For example, even with perfect knowledge of all relevant factors, there is still random variation in daily stock prices. Aleatoric uncertainty can be further divided into homoscedastic uncertainty (constant noise across the input space) and heteroscedastic uncertainty (noise that varies with input).
Epistemic uncertainty arises from incomplete knowledge and can be reduced by collecting more training data or by improving the model. Bayesian methods naturally capture epistemic uncertainty through posterior distributions over model parameters. In regions where few training examples exist, the posterior is broad, signaling high epistemic uncertainty.
A well-designed probabilistic regression model should ideally capture both types. Total predictive uncertainty is often expressed as the sum of aleatoric and epistemic components:
$$\text{Var}[y \mid x] = \underbrace{\mathbb{E}[\sigma^2(x)]}_{\text{aleatoric}} + \underbrace{\text{Var}[\mu(x)]}_{\text{epistemic}}$$
Bayesian linear regression extends classical linear regression by treating the regression coefficients as random variables with prior distributions rather than fixed unknown parameters. Given a linear model y = Xβ + ε with Gaussian noise ε ~ N(0, σ²I), a Gaussian prior is placed on the weights:
$$\beta \sim \mathcal{N}(\mu_0, \Lambda_0^{-1})$$
Using Bayes' theorem, the posterior distribution over the weights given observed data (X, y) is:
$$p(\beta \mid X, y) = \mathcal{N}(\mu_n, \Lambda_n^{-1})$$
where:
$$\Lambda_n = \Lambda_0 + \sigma^{-2} X^T X, \qquad \mu_n = \Lambda_n^{-1} \left( \Lambda_0 \mu_0 + \sigma^{-2} X^T y \right)$$
The posterior mean is a weighted combination of the prior mean and the maximum likelihood estimate, with the weighting determined by the relative precision of each. The posterior predictive distribution for a new input x* is:
$$p(y^* \mid x^*, X, y) = \mathcal{N}\left(x^{*T} \mu_n,\; \sigma^2 + x^{*T} \Lambda_n^{-1} x^*\right)$$
This prediction captures two sources of variance: the noise variance σ² (aleatoric) and the parameter uncertainty x*^T Λ_n^{-1} x* (epistemic). As more data is observed, Λ_n grows and the epistemic component shrinks.
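The updates above can be implemented in a few lines. The following NumPy sketch assumes a known noise variance `sigma2` and an isotropic Gaussian prior; all variable names are illustrative.

```python
import numpy as np

def bayes_linreg_posterior(X, y, sigma2, prior_mean, prior_precision):
    """Posterior over weights for y = X beta + eps, eps ~ N(0, sigma2 I)."""
    Lambda_n = prior_precision + X.T @ X / sigma2            # posterior precision
    mu_n = np.linalg.solve(Lambda_n,
                           prior_precision @ prior_mean + X.T @ y / sigma2)
    return mu_n, Lambda_n

def posterior_predictive(x_star, mu_n, Lambda_n, sigma2):
    """Predictive mean and variance at a new input x_star."""
    mean = x_star @ mu_n
    var = sigma2 + x_star @ np.linalg.solve(Lambda_n, x_star)  # aleatoric + epistemic
    return mean, var

# Toy usage with a N(0, I) prior and known noise variance 0.09
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=50)
mu_n, Lambda_n = bayes_linreg_posterior(X, y, 0.09, np.zeros(3), np.eye(3))
print(posterior_predictive(np.array([1.0, 0.0, 0.0]), mu_n, Lambda_n, 0.09))
```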
Bayesian linear regression also enables principled model comparison through the marginal likelihood (model evidence), which can be computed in closed form for conjugate priors.
Gaussian process (GP) regression, also known as kriging, is a nonparametric Bayesian approach that places a prior distribution directly over functions rather than over a finite set of parameters. A Gaussian process is fully specified by a mean function m(x) and a covariance (kernel) function k(x, x'):
$$f(x) \sim \mathcal{GP}(m(x), k(x, x'))$$
Given n training observations (X, y) and assuming Gaussian noise with variance σ², the predictive distribution at a new test point x* is Gaussian:
$$p(f^* \mid x^*, X, y) = \mathcal{N}(\bar{f}^*, \text{cov}(f^*))$$
with:
$$\bar{f}^* = k_*^T (K + \sigma^2 I)^{-1} y, \qquad \text{cov}(f^*) = k(x^*, x^*) - k_*^T (K + \sigma^2 I)^{-1} k_*$$
where K is the n-by-n covariance matrix over the training inputs and k_* is the vector of covariances between x* and each training input.
The posterior mean acts as a weighted combination of the observed outputs, while the posterior variance depends only on the input locations and decreases in regions near the training data.
The choice of kernel encodes assumptions about the underlying function, such as smoothness, periodicity, and length scale. In the table below, d = |x - x'| denotes the distance between inputs.
| Kernel | Formula | Properties |
|---|---|---|
| Squared Exponential (RBF) | k(x, x') = σ_f² exp(-d² / (2ℓ²)) | Infinitely differentiable; very smooth functions |
| Matérn (ν = 3/2) | k(x, x') = σ_f² (1 + √3 d/ℓ) exp(-√3 d/ℓ) | Once differentiable; realistic for many physical processes |
| Matérn (ν = 5/2) | k(x, x') = σ_f² (1 + √5 d/ℓ + 5d²/(3ℓ²)) exp(-√5 d/ℓ) | Twice differentiable; good balance of smoothness |
| Periodic | k(x, x') = σ_f² exp(-2 sin²(π d / p) / ℓ²) | Captures periodic patterns with period p |
| Linear | k(x, x') = σ_b² + σ_v² (x - c)(x' - c) | Equivalent to Bayesian linear regression |
| Rational Quadratic | k(x, x') = σ_f² (1 + d² / (2αℓ²))^(-α) | Mixture of RBF kernels with different length scales |
Kernel hyperparameters (such as the length scale ℓ and signal variance σ_f²) are typically optimized by maximizing the log marginal likelihood. Because this objective can be non-convex, multiple restarts of the optimizer are often used to avoid local optima.
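As a rough illustration of this workflow, the sketch below fits a GP with an RBF kernel plus a white-noise term on synthetic 1-D data using scikit-learn's GaussianProcessRegressor, which optimizes the kernel hyperparameters by maximizing the log marginal likelihood (with restarts).

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Synthetic 1-D data: noisy observations of a sine function
rng = np.random.default_rng(42)
X_train = rng.uniform(0, 10, size=(30, 1))
y_train = np.sin(X_train).ravel() + rng.normal(scale=0.2, size=30)

# RBF kernel for the latent function plus a white-noise term for the noise variance
kernel = 1.0 * RBF(length_scale=1.0) + WhiteKernel(noise_level=0.04)
gp = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=5)
gp.fit(X_train, y_train)   # hyperparameters tuned via the log marginal likelihood

# Predictive mean and standard deviation at test points
X_test = np.linspace(0, 12, 100).reshape(-1, 1)
mean, std = gp.predict(X_test, return_std=True)
# std grows for x > 10, outside the training range (epistemic uncertainty)
```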
A major limitation of standard GP regression is its computational cost: training requires O(n³) time and O(n²) memory due to the inversion of the n-by-n covariance matrix. This has motivated the development of sparse GP approximations, including inducing point methods and the Vecchia approximation, which reduce complexity to O(nm²) where m is the number of inducing points (m << n).
Generalized linear models (GLMs) extend linear regression by allowing the response variable to follow any distribution from the exponential family (such as Gaussian, Poisson, Gamma, or Inverse-Gaussian). A GLM has three components: a random component specifying the exponential-family distribution of the response, a systematic component given by the linear predictor η = x^T β, and a link function g connecting the two through g(E[y | x]) = η.
For probabilistic regression, GLMs with appropriate link functions and distributions can model non-Gaussian response variables while still providing distributional predictions. For instance, a Gamma GLM with a log link is commonly used for modeling strictly positive, right-skewed continuous outcomes.
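A minimal sketch of such a Gamma GLM with a log link, using statsmodels on synthetic data (the link class name may differ slightly across statsmodels versions):

```python
import numpy as np
import statsmodels.api as sm

# Synthetic strictly positive, right-skewed outcomes
rng = np.random.default_rng(0)
x = rng.uniform(0, 2, size=200)
mu = np.exp(0.5 + 0.8 * x)                    # log link: log(mu) is linear in x
y = rng.gamma(shape=2.0, scale=mu / 2.0)      # Gamma noise with mean mu

X = sm.add_constant(x)
model = sm.GLM(y, X, family=sm.families.Gamma(link=sm.families.links.Log()))
result = model.fit()
print(result.summary())

# Predicted conditional mean of the Gamma distribution at new inputs
print(result.predict(sm.add_constant(np.array([0.5, 1.5]))))
```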
Quantile regression, introduced by Koenker and Bassett in 1978, estimates conditional quantiles of the response variable rather than the conditional mean. For a given quantile level τ (where 0 < τ < 1), the τ-th conditional quantile function q_τ(x) is estimated by minimizing the pinball loss (also called the check loss):
$$L_\tau(y, \hat{q}) = \begin{cases} \tau (y - \hat{q}) & \text{if } y \geq \hat{q} \\ (1 - \tau)(\hat{q} - y) & \text{if } y < \hat{q} \end{cases}$$
By fitting separate quantile functions for multiple values of τ (for example, τ = 0.05, 0.5, 0.95), quantile regression constructs prediction intervals without assuming any parametric distribution for the errors. This makes it particularly useful for heteroscedastic data, where the spread of y varies with x. Unlike methods that assume Gaussian errors, quantile regression is robust to outliers and non-normal error distributions.
The special case of τ = 0.5 yields the conditional median, which minimizes the sum of absolute deviations rather than squared errors.
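A minimal sketch of this interval-construction recipe using scikit-learn's GradientBoostingRegressor with the pinball ("quantile") loss on synthetic heteroscedastic data:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Heteroscedastic synthetic data: noise grows with x
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(500, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1 + 0.05 * X.ravel())

# One model per quantile level; the pinball loss is selected via loss="quantile"
models = {}
for tau in (0.05, 0.5, 0.95):
    gbr = GradientBoostingRegressor(loss="quantile", alpha=tau, n_estimators=200)
    models[tau] = gbr.fit(X, y)

X_test = np.array([[2.0], [8.0]])
lower = models[0.05].predict(X_test)
median = models[0.5].predict(X_test)
upper = models[0.95].predict(X_test)
# The [lower, upper] interval is wider at x = 8 than at x = 2,
# reflecting the input-dependent noise.
```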
Nix and Weigend (1994) proposed one of the earliest neural network approaches to probabilistic regression. The network outputs two values for each input: a predicted mean μ(x) and a predicted variance σ²(x). The model is trained by minimizing the Gaussian negative log-likelihood:
$$\mathcal{L} = \frac{1}{2N} \sum_{i=1}^{N} \left[ \log \sigma^2(x_i) + \frac{(y_i - \mu(x_i))^2}{\sigma^2(x_i)} \right]$$
This allows the model to learn input-dependent (heteroscedastic) noise, producing wider prediction intervals where the data is noisier. The variance output is typically parameterized through a softplus or exponential activation to ensure positivity.
One known issue with this approach is that the variance predictions can become overconfident, especially in regions with limited data. Regularization techniques and careful initialization help mitigate this problem.
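A minimal PyTorch sketch of such a mean-variance network, trained with the Gaussian negative log-likelihood (the architecture, hyperparameters, and data are illustrative):

```python
import torch
import torch.nn as nn

class MeanVarianceNet(nn.Module):
    """Outputs a predicted mean and a positive variance for each input."""
    def __init__(self, in_dim, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                  nn.Linear(hidden, hidden), nn.ReLU())
        self.mean_head = nn.Linear(hidden, 1)
        self.var_head = nn.Linear(hidden, 1)

    def forward(self, x):
        h = self.body(x)
        mean = self.mean_head(h)
        var = nn.functional.softplus(self.var_head(h)) + 1e-6  # ensure positivity
        return mean, var

model = MeanVarianceNet(in_dim=1)
loss_fn = nn.GaussianNLLLoss()                 # Gaussian negative log-likelihood
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.randn(256, 1)
y = torch.sin(x) + 0.3 * torch.randn(256, 1) * (1 + x.abs())  # heteroscedastic noise

for _ in range(500):
    mean, var = model(x)
    loss = loss_fn(mean, y, var)               # expects (input, target, var)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```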
Mixture density networks (MDNs), introduced by Christopher Bishop in 1994, combine a neural network with a Gaussian mixture model. The network outputs the parameters of a mixture of K Gaussians:
$$p(y \mid x) = \sum_{k=1}^{K} \pi_k(x) \, \mathcal{N}(y; \mu_k(x), \sigma_k^2(x))$$
where π_k(x) are the mixing coefficients (summing to 1), and μ_k(x) and σ_k²(x) are the mean and variance of the k-th component. MDNs can represent multimodal predictive distributions, which is especially important for inverse problems and situations where multiple valid outputs exist for a single input. For example, in robotics, a single target position might be reachable through multiple joint configurations.
The network is trained end-to-end using maximum likelihood, and the mixing coefficients are typically produced via a softmax layer.
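A minimal PyTorch sketch of an MDN head and its mixture negative log-likelihood (the architecture, number of components K, and data are illustrative placeholders):

```python
import torch
import torch.nn as nn

class MDN(nn.Module):
    """Outputs mixing coefficients, means, and std devs for K Gaussians."""
    def __init__(self, in_dim, K=5, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(in_dim, hidden), nn.Tanh())
        self.pi = nn.Linear(hidden, K)           # mixing logits -> (log-)softmax
        self.mu = nn.Linear(hidden, K)           # component means
        self.log_sigma = nn.Linear(hidden, K)    # log std devs, for positivity

    def forward(self, x):
        h = self.body(x)
        return torch.log_softmax(self.pi(h), dim=-1), self.mu(h), self.log_sigma(h)

def mdn_nll(log_pi, mu, log_sigma, y):
    """Negative log-likelihood of y under the predicted Gaussian mixture."""
    dist = torch.distributions.Normal(mu, log_sigma.exp())
    log_component = dist.log_prob(y)             # shape (batch, K)
    return -torch.logsumexp(log_pi + log_component, dim=-1).mean()

model = MDN(in_dim=1)
x = torch.randn(128, 1)
y = torch.randn(128, 1)                          # placeholder targets
log_pi, mu, log_sigma = model(x)
loss = mdn_nll(log_pi, mu, log_sigma, y)         # train with an optimizer as usual
```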
Gal and Ghahramani (2016) showed that dropout, a regularization technique commonly used during training, can be reinterpreted as approximate Bayesian inference. By keeping dropout active during prediction and performing T stochastic forward passes, one obtains a set of predictions {ŷ_1, ..., ŷ_T} for each input. The predictive mean and variance are estimated as:
$$\hat{\mu}(x) = \frac{1}{T} \sum_{t=1}^{T} \hat{y}_t(x), \qquad \hat{\sigma}^2(x) = \frac{1}{T} \sum_{t=1}^{T} (\hat{y}_t(x) - \hat{\mu}(x))^2$$
This method is attractive because it requires no changes to the network architecture and can be applied to any model that already uses dropout. However, the quality of the uncertainty estimates depends on the dropout rate and may not fully capture the true posterior, particularly during extrapolation.
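A minimal PyTorch sketch of this procedure, assuming an already-trained network that contains dropout layers:

```python
import torch
import torch.nn as nn

# A network that uses dropout; assumed to be already trained
model = nn.Sequential(nn.Linear(1, 64), nn.ReLU(), nn.Dropout(p=0.1),
                      nn.Linear(64, 64), nn.ReLU(), nn.Dropout(p=0.1),
                      nn.Linear(64, 1))

def mc_dropout_predict(model, x, T=100):
    """Keep dropout active at prediction time and average T stochastic passes."""
    model.train()                       # train mode keeps dropout layers active
    with torch.no_grad():
        samples = torch.stack([model(x) for _ in range(T)])   # (T, batch, 1)
    return samples.mean(dim=0), samples.var(dim=0)

x_test = torch.linspace(-3, 3, 50).unsqueeze(-1)
mean, var = mc_dropout_predict(model, x_test)
```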
Lakshminarayanan, Pritzel, and Blundell (2017) proposed deep ensembles as a simple and scalable alternative to Bayesian neural networks. An ensemble of M independently trained networks (each with different random initializations) produces predictions that are combined as a mixture of Gaussians:
$$p(y \mid x) = \frac{1}{M} \sum_{m=1}^{M} \mathcal{N}(y; \mu_m(x), \sigma_m^2(x))$$
The predictive mean and total variance are:
$$\mu^*(x) = \frac{1}{M} \sum_{m=1}^{M} \mu_m(x)$$
$$\sigma^{*2}(x) = \frac{1}{M} \sum_{m=1}^{M} [\sigma_m^2(x) + \mu_m^2(x)] - \mu^{*2}(x)$$
Deep ensembles have been shown to produce well-calibrated uncertainty estimates and have become a widely used baseline for predictive uncertainty in deep learning. They tend to outperform Monte Carlo dropout on out-of-distribution detection and overall calibration. The main drawback is the increased computational cost of training and storing multiple models.
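A minimal sketch of the combination step, assuming per-member means and variances from M heteroscedastic networks are already available as tensors:

```python
import torch

def ensemble_moments(means, variances):
    """Combine per-member predictions, each of shape (M, batch), into the
    mixture mean and total variance given by the formulas above."""
    mu_star = means.mean(dim=0)
    sigma2_star = (variances + means ** 2).mean(dim=0) - mu_star ** 2
    return mu_star, sigma2_star

# Hypothetical predictions from M = 5 independently trained networks
M, batch = 5, 10
means = torch.randn(M, batch)
variances = torch.rand(M, batch) + 0.1
mu_star, sigma2_star = ensemble_moments(means, variances)
```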
Bayesian neural networks (BNNs) place probability distributions over the network weights rather than learning fixed point estimates. The posterior over weights p(w | D) is combined with the likelihood to produce predictive distributions:
$$p(y \mid x, D) = \int p(y \mid x, w) \, p(w \mid D) \, dw$$
Since this integral is intractable for all but the simplest networks, approximate inference methods are used, including variational inference (such as Bayes by Backprop), Markov chain Monte Carlo methods (such as Hamiltonian Monte Carlo), and the Laplace approximation.
BNNs naturally decompose uncertainty into aleatoric and epistemic components and provide principled uncertainty estimates, but they remain more difficult to train and scale than ensembles or dropout-based methods.
Amini et al. (2020) proposed deep evidential regression, which places an evidential prior (Normal Inverse-Gamma distribution) over the parameters of the Gaussian likelihood. The network outputs four parameters (γ, ν, α, β) that parameterize this higher-order distribution, allowing simultaneous estimation of aleatoric and epistemic uncertainty in a single forward pass without sampling.
The predictive distribution has mean γ and variance β(1 + 1/ν) / (α - 1). Epistemic uncertainty is captured through ν (inversely), while aleatoric uncertainty is captured through β/α. This approach is computationally efficient since it avoids the need for multiple forward passes or ensemble training.
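A small sketch of these predictive moments, taking the four evidential parameters as plain numbers (requires α > 1):

```python
def evidential_moments(gamma, nu, alpha, beta):
    """Predictive mean and uncertainty components from Normal Inverse-Gamma
    parameters, following the decomposition stated above (alpha > 1)."""
    mean = gamma
    aleatoric = beta / (alpha - 1)               # expected noise variance
    epistemic = beta / (nu * (alpha - 1))        # variance of the mean
    total_variance = aleatoric + epistemic       # = beta * (1 + 1/nu) / (alpha - 1)
    return mean, aleatoric, epistemic, total_variance

print(evidential_moments(gamma=2.0, nu=5.0, alpha=3.0, beta=1.5))
```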
NGBoost (Natural Gradient Boosting), developed by Duan et al. (2020) at Stanford, extends gradient boosting to probabilistic prediction. Traditional gradient boosting optimizes a scalar loss function, while NGBoost treats the parameters of a conditional probability distribution as the targets of a multiparameter boosting algorithm.
The key insight is the use of the natural gradient (rather than the ordinary gradient) to update the distribution parameters at each boosting iteration. The natural gradient accounts for the geometry of the parameter space of probability distributions, leading to more stable and efficient optimization. NGBoost is flexible in that it can be used with any base learner, any parametric distribution family, and any proper scoring rule.
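A minimal usage sketch with the ngboost Python package (listed in the software table below); exact attribute names may vary across versions:

```python
import numpy as np
from ngboost import NGBRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = X @ np.array([1.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=500)

# Default configuration: Normal output distribution, decision-tree base learner
ngb = NGBRegressor(n_estimators=300)
ngb.fit(X, y)

y_point = ngb.predict(X[:5])       # point predictions (distribution mean)
y_dist = ngb.pred_dist(X[:5])      # full predictive distributions
print(y_dist.params)               # e.g. {'loc': ..., 'scale': ...} for Normal
```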
Other popular gradient boosting frameworks, such as CatBoost and LightGBM, also support quantile regression objectives, enabling probabilistic predictions through estimated conditional quantiles. These approaches do not model a full parametric distribution but provide prediction intervals by fitting separate models for different quantile levels.
Conformal prediction is a framework for constructing prediction intervals with finite-sample coverage guarantees, regardless of the underlying data distribution or model architecture. Given a desired coverage level 1 - α (for example, 90%), conformal prediction produces intervals C(x) such that:
$$P(y \in C(x)) \geq 1 - \alpha$$
The method works by computing nonconformity scores on a held-out calibration set and using the empirical quantile of these scores to set the interval width. The only assumption required is exchangeability of the data (a weaker condition than the i.i.d. assumption).
Conformalized quantile regression (CQR), proposed by Romano, Patterson, and Candes (2019), combines quantile regression with conformal calibration. The base quantile regression model produces initial interval estimates, and conformal prediction adjusts these intervals to guarantee marginal coverage. CQR intervals can adapt to heteroscedastic noise because the underlying quantile regression model produces intervals of varying width.
Conformal prediction is model-agnostic, computationally efficient (no retraining needed), and provides valid coverage guarantees even when the model is misspecified. However, it guarantees only marginal coverage (averaged over the test distribution) rather than conditional coverage for each individual input.
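A minimal NumPy sketch of split conformal prediction with absolute-residual scores, assuming a fitted point predictor `model` exposing a `predict` method and a held-out calibration set:

```python
import numpy as np

def split_conformal_interval(model, X_cal, y_cal, X_test, alpha=0.1):
    """Symmetric prediction intervals with >= 1 - alpha marginal coverage."""
    # Nonconformity scores on the held-out calibration set
    scores = np.abs(y_cal - model.predict(X_cal))
    n = len(scores)
    # Finite-sample corrected empirical quantile of the scores
    q_level = np.ceil((n + 1) * (1 - alpha)) / n
    q = np.quantile(scores, min(q_level, 1.0))
    preds = model.predict(X_test)
    return preds - q, preds + q

# Usage (assuming `model`, calibration data, and test inputs already exist):
# lower, upper = split_conformal_interval(model, X_cal, y_cal, X_test, alpha=0.1)
```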
Evaluating probabilistic regression models requires metrics that assess the quality of the entire predicted distribution, not just the point prediction. A proper scoring rule is a function S(F, y) that assigns a score to a predictive distribution F when the true outcome is y, and for which the expected score is optimized when the forecaster reports the true distribution. A scoring rule is strictly proper if the optimum is unique.
| Scoring rule | Formula (simplified) | Properties |
|---|---|---|
| Log-likelihood (logarithmic score) | S(F, y) = log f(y) | Strictly proper; local (depends only on density at y); sensitive to tail behavior |
| Continuous Ranked Probability Score (CRPS) | CRPS(F, y) = ∫(F(z) - 1{y ≤ z})² dz | Strictly proper; generalizes MAE to distributional forecasts; robust to outliers |
| Interval Score | IS_α = (u - l) + (2/α)(l - y)1{y < l} + (2/α)(y - u)1{y > u} | Proper for prediction intervals at level (1 - α); penalizes width and miscoverage |
| Energy Score | ES(F, y) = E|Y - y| - (1/2)E|Y - Y'| | Strictly proper; multivariate generalization of CRPS |
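For a Gaussian predictive distribution, the CRPS in the table above has a well-known closed form; a small SciPy sketch:

```python
import numpy as np
from scipy.stats import norm

def crps_gaussian(y, mu, sigma):
    """CRPS of a Gaussian predictive distribution N(mu, sigma^2) at outcome y."""
    z = (y - mu) / sigma
    return sigma * (z * (2 * norm.cdf(z) - 1) + 2 * norm.pdf(z) - 1 / np.sqrt(np.pi))

# A sharper (smaller sigma) forecast scores better when it is also accurate
print(crps_gaussian(y=1.0, mu=1.1, sigma=0.5))
print(crps_gaussian(y=1.0, mu=1.1, sigma=2.0))
```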
A probabilistic regression model is calibrated if the predicted probability levels match the observed frequencies. For example, if a model predicts that 90% of outcomes will fall within a particular interval, then approximately 90% of observed outcomes should indeed fall within that interval. Calibration is typically assessed using reliability diagrams that compare nominal and empirical coverage across probability levels, probability integral transform (PIT) histograms, and the empirical coverage of prediction intervals at their nominal level.
A model can have good calibration but poor sharpness (overly wide intervals), or vice versa. The ideal model is both sharp (narrow intervals) and calibrated (correct coverage).
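A minimal sketch of checking these two properties empirically, given arrays of interval bounds and observed outcomes:

```python
import numpy as np

def empirical_coverage(y, lower, upper):
    """Fraction of observed outcomes falling inside their prediction intervals."""
    return np.mean((y >= lower) & (y <= upper))

def average_width(lower, upper):
    """Sharpness: mean interval width (smaller is sharper)."""
    return np.mean(upper - lower)

# For a nominal 90% interval, empirical_coverage should be close to 0.9;
# among calibrated models, prefer the one with the smaller average_width.
```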
| Method | Distributional assumption | Uncertainty type captured | Scalability | Multimodal support | Key advantage |
|---|---|---|---|---|---|
| Bayesian linear regression | Gaussian prior and likelihood | Both | O(d³) in dimensions | No | Closed-form posterior |
| Gaussian process regression | Prior over functions | Both | O(n³) training | No | Nonparametric flexibility |
| Quantile regression | Distribution-free | Aleatoric | O(n) per quantile | Partial (via multiple quantiles) | No distributional assumptions |
| Heteroscedastic NN | Gaussian output | Aleatoric only | GPU-scalable | No | Simple to implement |
| Mixture density network | Gaussian mixture output | Aleatoric | GPU-scalable | Yes | Models multimodal targets |
| Monte Carlo dropout | Approximate Bayesian | Both (approximate) | Same as base model | No | No architecture changes |
| Deep ensemble | Mixture of predictions | Both | M times base cost | Partial | Strong calibration |
| BNN (variational) | Full weight posterior | Both | GPU-scalable (approximate) | Depends on likelihood | Principled uncertainty |
| Deep evidential regression | Normal Inverse-Gamma prior | Both | Single forward pass | No | Fast inference |
| NGBoost | Parametric (user-chosen) | Both | O(n) per iteration | Depends on family | Works with any base learner |
| Conformal prediction | Distribution-free | Coverage guarantee | O(n log n) calibration | N/A (interval only) | Finite-sample guarantee |
Probabilistic regression models are used across many domains where quantifying predictive uncertainty is important.
Weather prediction is inherently uncertain due to the chaotic nature of atmospheric dynamics. Probabilistic weather forecasts provide ensemble predictions or distributional forecasts (for example, "there is a 70% chance that tomorrow's temperature will be between 18 and 24 degrees Celsius"). Modern numerical weather prediction systems use ensemble methods, post-processed with techniques such as quantile regression forests or Bayesian model averaging, to produce calibrated probabilistic forecasts. These forecasts support decision-making in agriculture, energy, aviation, and disaster preparedness.
In financial applications, probabilistic regression is used for value-at-risk (VaR) estimation, portfolio optimization, and option pricing. Quantile regression is especially common for estimating tail risks, where the focus is on the extreme percentiles of return distributions. Heteroscedastic models like GARCH (Generalized Autoregressive Conditional Heteroskedasticity) capture the time-varying volatility of financial returns, providing distributional forecasts that inform risk management strategies.
Probabilistic models in healthcare support clinical decision-making by estimating uncertainty in patient outcomes. Growth charts, used in pediatrics, are a classic example of quantile regression applied to medical data. Survival analysis models, such as Cox proportional hazards with Bayesian extensions, provide probabilistic predictions of patient prognosis. Gaussian processes have been applied to model patient trajectories in intensive care, where principled uncertainty estimates help clinicians identify deteriorating patients.
Self-driving vehicles and robotic systems use probabilistic regression to predict the future trajectories of other agents (pedestrians, vehicles). Mixture density networks are well-suited to this task because trajectory prediction is inherently multimodal: a pedestrian at a crosswalk might continue walking, stop, or turn. Uncertainty-aware predictions enable safer planning and control decisions.
Bayesian optimization, which relies on Gaussian process regression as a surrogate model, is widely used for experimental design and hyperparameter tuning. The GP posterior provides both a predicted value and an uncertainty estimate at each candidate point, and the acquisition function (such as expected improvement) uses both quantities to balance exploration and exploitation. In materials science, probabilistic models guide the search for materials with desired properties while minimizing costly experiments.
Probabilistic forecasting of renewable energy generation (solar irradiance, wind power) is critical for grid management and energy trading. Quantile regression and ensemble methods provide prediction intervals that help grid operators maintain supply-demand balance and plan reserve capacity under uncertainty.
Several widely used libraries provide implementations of probabilistic regression methods.
| Library | Language | Methods supported |
|---|---|---|
| scikit-learn | Python | Gaussian process regression, quantile regression (GradientBoostingRegressor) |
| GPyTorch | Python | Scalable Gaussian process regression with GPU support |
| PyTorch | Python | Heteroscedastic NNs, MDNs, BNNs, deep ensembles (via custom code) |
| TensorFlow Probability | Python | BNNs, variational inference, mixture density layers |
| Stan / PyStan | Python / R | Full Bayesian regression with MCMC and variational inference |
| PyMC | Python | Bayesian regression, Gaussian processes, hierarchical models |
| NGBoost | Python | Natural gradient boosting for probabilistic prediction |
| MAPIE | Python | Conformal prediction for regression and classification |
| brms | R | Bayesian regression with Gaussian process support |
| statsmodels | Python | Quantile regression, GLMs |
The development of probabilistic regression has drawn from statistics, machine learning, and geostatistics over several decades.
| Year | Contribution | Authors |
|---|---|---|
| 1763 | Bayes' theorem formulated (published posthumously) | Thomas Bayes |
| 1978 | Quantile regression introduced | Roger Koenker, Gilbert Bassett |
| 1994 | Heteroscedastic neural network regression proposed | David Nix, Andreas Weigend |
| 1994 | Mixture density networks introduced | Christopher Bishop |
| 2006 | "Gaussian Processes for Machine Learning" textbook | Carl Edward Rasmussen, Christopher Williams |
| 2011 | Practical variational inference for neural networks | Alex Graves |
| 2015 | Bayes by Backprop (weight uncertainty in neural networks) | Charles Blundell et al. |
| 2016 | Monte Carlo dropout as Bayesian approximation | Yarin Gal, Zoubin Ghahramani |
| 2017 | Deep ensembles for uncertainty estimation | Balaji Lakshminarayanan, Alexander Pritzel, Charles Blundell |
| 2019 | Conformalized quantile regression | Yaniv Romano, Evan Patterson, Emmanuel Candes |
| 2020 | NGBoost: natural gradient boosting for probabilistic prediction | Tony Duan, Anand Avati, et al. |
| 2020 | Deep evidential regression | Alexander Amini et al. |