Boltzmann machine

A Boltzmann machine is a recurrent neural network of binary stochastic units joined by symmetric connections. It defines an energy function over the joint configuration of visible and hidden units and assigns probabilities to configurations through the Boltzmann (Gibbs) distribution borrowed from statistical physics. The model was introduced by Geoffrey Hinton and Terrence Sejnowski in 1983 and presented in full in the 1985 Cognitive Science paper "A Learning Algorithm for Boltzmann Machines" by David Ackley, Hinton and Sejnowski. The machine takes its name from the physicist Ludwig Boltzmann, because the distribution it samples from has the same exponential form as the Boltzmann distribution that describes physical systems at thermal equilibrium.

Boltzmann machines are an early example of an energy-based generative model and a milestone in the history of neural networks. Their restricted variant, the Restricted Boltzmann Machine (RBM), proposed by Paul Smolensky in 1986 under the name harmonium, became central to the deep learning revival of the mid-2000s. Stacks of RBMs trained greedily form Deep Belief Networks (Hinton, Osindero and Teh, 2006), which were briefly the state of the art for unsupervised pre-training of deep feedforward networks. The 2024 Nobel Prize in Physics, awarded jointly to John Hopfield and Geoffrey Hinton, cited the Boltzmann machine and the Hopfield network as foundational discoveries that enabled machine learning with artificial neural networks.

History and origin

The Boltzmann machine arose directly from work on Hopfield networks (1982). Hopfield had shown that a fully connected, symmetric network of binary units with deterministic asynchronous updates could function as content-addressable associative memory, with stable states identified by minima of an Ising-like energy function. The deterministic dynamics, however, get trapped in local minima and the network has no learning rule for hidden units. In a 1995 interview Hinton recalled that, in February or March 1983, he was preparing a talk on simulated annealing applied to Hopfield-style networks and "had to design a learning algorithm for the talk". The result was the Boltzmann machine: a stochastic generalization of Hopfield's network in which units flip states probabilistically according to a temperature parameter, allowing the network to escape local minima and to be trained by a contrastive Hebbian rule.

The earliest references appear in technical reports and conference papers from 1983 and 1984:

Fahlman, Hinton and Sejnowski, "Massively Parallel Architectures for AI: NETL, Thistle, and Boltzmann Machines", AAAI-83 (August 1983).
Hinton, Sejnowski and Ackley, "Boltzmann Machines: Constraint Satisfaction Networks That Learn", CMU technical report CMU-CS-84-119 (May 1984).
Ackley, Hinton and Sejnowski, "A Learning Algorithm for Boltzmann Machines", Cognitive Science 9, 147-169 (1985). This is the canonical reference for the model and its learning rule.

The two ideas that made the Boltzmann machine novel were the introduction of stochastic binary units (so that the network defines a probability distribution rather than a single deterministic equilibrium) and the contrastive learning rule (positive phase minus negative phase) that follows the gradient of the data log-likelihood for both visible and hidden weights.

Definition and architecture

A Boltzmann machine is an undirected graphical model with binary units $s_i \in \{0,1\}$. The units are partitioned into a set of visible units $v$ that are clamped to data during training, and a set of hidden units $h$ that the model uses to capture latent structure. Connections are symmetric: the weight $W_{ij}$ between units $i$ and $j$ equals $W_{ji}$, and there are no self-connections.

In the general (fully connected) Boltzmann machine, every pair of units (visible-visible, hidden-hidden, and visible-hidden) can be connected. The energy of a joint configuration $(v,h)$ is

$$E(v, h) = -\sum_i b_i s_i - \sum_{i<j} W_{ij} s_i s_j$$

where $s$ ranges over all visible and hidden units, $b_i$ is a bias, and $W_{ij}$ is the symmetric weight between units $i$ and $j$.

The probability of a configuration is given by the Boltzmann distribution

$$P(v,h) = \frac{1}{Z} \exp\left( -\frac{E(v,h)}{T} \right)$$

where $T$ is a temperature parameter (often set to 1 in machine-learning contexts) and $Z = \sum_{v',h'} \exp(-E(v',h')/T)$ is the partition function summing over all $2^N$ configurations of $N$ binary units. Marginalising out the hidden units gives the model's distribution over visible data:

$$P(v) = \frac{1}{Z} \sum_h \exp\left( -\frac{E(v,h)}{T} \right)$$

The partition function is intractable in general because the sum has exponentially many terms.
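For a network small enough to enumerate, the definitions above can be checked directly. The following is a minimal sketch, assuming NumPy and a hypothetical three-unit network (two visible units, one hidden unit) whose weights and biases are made up for illustration. It computes the energy of every configuration, the partition function, and the marginal over the visible units.

```python
import itertools
import numpy as np

def energy(s, W, b):
    """E(s) = -sum_i b_i s_i - sum_{i<j} W_ij s_i s_j for symmetric W with zero diagonal."""
    s = np.asarray(s, dtype=float)
    # 0.5 * s^T W s counts each pair once because W is symmetric with zero diagonal.
    return -b @ s - 0.5 * s @ W @ s

def boltzmann_distribution(W, b, T=1.0):
    """Return all 2^N states, their probabilities, and Z by brute-force enumeration."""
    n = len(b)
    states = np.array(list(itertools.product([0, 1], repeat=n)), dtype=float)
    energies = np.array([energy(s, W, b) for s in states])
    unnormalised = np.exp(-energies / T)
    Z = unnormalised.sum()  # partition function: a sum of 2^N terms
    return states, unnormalised / Z, Z

# Hypothetical parameters: units 0 and 1 are visible, unit 2 is hidden.
W = np.array([[0.0, 1.0, -0.5],
              [1.0, 0.0, 2.0],
              [-0.5, 2.0, 0.0]])
b = np.array([0.1, -0.2, 0.0])

states, probs, Z = boltzmann_distribution(W, b)

# Marginal over the visible units: sum P(v, h) over the hidden unit.
p_visible = {}
for s, p in zip(states, probs):
    v = (int(s[0]), int(s[1]))
    p_visible[v] = p_visible.get(v, 0.0) + p
```

The enumeration doubles in cost with every added unit, which is why this approach is only usable for toy networks.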

A single unit's conditional probability of being on, given the states of all other units, takes the simple logistic form

$$P(s_i = 1 \mid s_{-i}) = \sigma\!\left( \frac{1}{T} \Big( b_i + \sum_{j \neq i} W_{ij} s_j \Big) \right)$$

where $\sigma(x) = 1/(1 + e^{-x})$. This local update rule, applied repeatedly to randomly chosen units, is a Gibbs sampler for the joint distribution.
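The local update can be sketched in a few lines. This is an illustrative example, assuming NumPy and the same kind of hypothetical three-unit network (the weights are arbitrary). The function `gibbs_sweep` is a name chosen here, not one from the literature. It visits each unit once in random order and resamples it from the logistic conditional.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_sweep(s, W, b, rng, T=1.0):
    """Resample every unit once, in random order, from its conditional distribution."""
    for i in rng.permutation(len(s)):
        # W has a zero diagonal, so W[i] @ s only involves the other units.
        p_on = sigmoid((b[i] + W[i] @ s) / T)
        s[i] = 1.0 if rng.random() < p_on else 0.0
    return s

rng = np.random.default_rng(0)
W = np.array([[0.0, 1.0, -0.5],
              [1.0, 0.0, 2.0],
              [-0.5, 2.0, 0.0]])
b = np.array([0.1, -0.2, 0.0])

s = rng.integers(0, 2, size=3).astype(float)
counts = {}
n_samples = 20000
for _ in range(n_samples):
    s = gibbs_sweep(s, W, b, rng)
    key = tuple(int(x) for x in s)
    counts[key] = counts.get(key, 0) + 1

# Empirical state frequencies; these should approach the Boltzmann probabilities.
empirical = {k: c / n_samples for k, c in counts.items()}
```

For a network this small, the empirical frequencies can be compared against the exact probabilities obtained by enumerating all eight states.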

Inference

Inference in a Boltzmann machine means drawing samples from the model distribution, computing expectations over hidden units given visible data, or computing marginals. Since exact computation is intractable for any non-trivial network, inference is performed with stochastic Gibbs sampling. Each unit is updated in turn (or asynchronously at random) by drawing a new value from its conditional Bernoulli distribution given the current states of its neighbours.

At high temperature the network mixes quickly between configurations but the distribution is close to uniform. At low temperature the distribution concentrates on low-energy configurations but mixing is slow. Simulated annealing schedules the temperature from high to low so that the network explores broadly at first and then settles into a low-energy region. For a fully connected Boltzmann machine, the time required to reach equilibrium grows steeply with network size, which is the central practical obstacle to using the model. This is one of the chief reasons the unrestricted form was largely abandoned in favour of more constrained variants.
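A simulated annealing run can be sketched as Gibbs sampling with a decreasing temperature. The example below assumes NumPy, a random symmetric weight matrix, and a geometric cooling schedule. The schedule and its endpoints (`T_start`, `T_end`) are illustrative choices, not values from the original papers.

```python
import numpy as np

def energy(s, W, b):
    return -b @ s - 0.5 * s @ W @ s

def anneal(W, b, rng, T_start=10.0, T_end=0.1, n_steps=200):
    """Gibbs sampling while the temperature falls geometrically from T_start to T_end."""
    n = len(b)
    s = rng.integers(0, 2, size=n).astype(float)
    for T in np.geomspace(T_start, T_end, n_steps):
        for i in rng.permutation(n):
            p_on = 1.0 / (1.0 + np.exp(-(b[i] + W[i] @ s) / T))
            s[i] = 1.0 if rng.random() < p_on else 0.0
    return s

rng = np.random.default_rng(0)
n = 8
A = rng.normal(size=(n, n))
W = (A + A.T) / 2.0          # symmetric weights
np.fill_diagonal(W, 0.0)     # no self-connections
b = rng.normal(size=n)

s_final = anneal(W, b, rng)
final_energy = energy(s_final, W, b)
```

At the final low temperature the sampler almost always moves each unit in the direction that lowers the energy, so the returned state is typically at or near a local minimum.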

Learning rule

The celebrated Boltzmann learning rule from Ackley, Hinton and Sejnowski (1985) follows the gradient of the log-likelihood of the data. For a weight $W_{ij}$ the gradient is

$$\frac{\partial \log P(v)}{\partial W_{ij}} = \langle s_i s_j \rangle_{\text{data}} - \langle s_i s_j \rangle_{\text{model}}$$

The rule has two phases:

Positive phase (data-driven): clamp the visible units to a training example, run Gibbs sampling over the hidden units until the network reaches thermal equilibrium, and record the average pairwise correlation $\langle s_i s_j \rangle_{\text{data}}$.
Negative phase (model-driven): let the entire network run freely, including the visible units, until it reaches equilibrium, and record the unconditional correlation $\langle s_i s_j \rangle_{\text{model}}$.

The weight update is the difference between the two: $\Delta W_{ij} \propto \langle s_i s_j \rangle_{\text{data}} - \langle s_i s_j \rangle_{\text{model}}$.
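The step from the log-likelihood to this difference of correlations can be sketched as follows, assuming $T = 1$ and writing $s_i$, $s_j$ for the states of units $i$ and $j$ in the joint configuration $(v, h)$. The log-likelihood of one training vector is

$$\log P(v) = \log \sum_h e^{-E(v,h)} - \log \sum_{v',h'} e^{-E(v',h')}$$

and, because $-\partial E / \partial W_{ij} = s_i s_j$, differentiating each term gives

$$\frac{\partial \log P(v)}{\partial W_{ij}} = \sum_h P(h \mid v)\, s_i s_j - \sum_{v',h'} P(v',h')\, s_i s_j$$

The first sum is the expected correlation with the visible units clamped to $v$; averaged over the training set it becomes $\langle s_i s_j \rangle_{\text{data}}$. The second sum is the expected correlation under the free-running model, $\langle s_i s_j \rangle_{\text{model}}$.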

This rule is local and Hebbian: it requires only the states of the two units a synapse connects, which made it appealing as a model of biological learning. It is also computationally infeasible for a general Boltzmann machine of useful size. The partition function never appears explicitly, but it is implicit in the model expectations, and estimating those requires running the negative-phase chain to equilibrium, which is prohibitively slow.
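On a network small enough to enumerate, both phases can be computed exactly rather than sampled, which makes the rule easy to inspect. The sketch below assumes NumPy, $T = 1$, a made-up dataset of two visible units that always agree, and one hidden unit. The step size and iteration count are arbitrary illustrative values.

```python
import itertools
import numpy as np

def all_states(n):
    return np.array(list(itertools.product([0, 1], repeat=n)), dtype=float)

def unnormalised(states, W, b):
    # exp(-E) at T = 1, with E = -b.s - 0.5 s^T W s for each row s.
    energies = -(states @ b) - 0.5 * np.einsum("ki,ij,kj->k", states, W, states)
    return np.exp(-energies)

def expectations(states, weights):
    """Weighted averages of s_i s_j (matrix) and s_i (vector) over the given states."""
    p = weights / weights.sum()
    return np.einsum("k,ki,kj->ij", p, states, states), p @ states

def log_likelihood(data, W, b, n_visible):
    states = all_states(len(b))
    u = unnormalised(states, W, b)
    total = 0.0
    for v in data:
        match = np.all(states[:, :n_visible] == v, axis=1)
        total += np.log(u[match].sum()) - np.log(u.sum())
    return total / len(data)

def exact_gradient(data, W, b, n_visible):
    states = all_states(len(b))
    u = unnormalised(states, W, b)
    # Negative phase: expectations under the free-running model.
    model_ss, model_s = expectations(states, u)
    # Positive phase: visibles clamped to each example, hidden units free.
    data_ss, data_s = np.zeros_like(W), np.zeros_like(b)
    for v in data:
        match = np.all(states[:, :n_visible] == v, axis=1)
        ss, s = expectations(states[match], u[match])
        data_ss += ss / len(data)
        data_s += s / len(data)
    grad_W = data_ss - model_ss
    np.fill_diagonal(grad_W, 0.0)  # no self-connections
    return grad_W, data_s - model_s

# Hypothetical toy data: two visible units that always agree; unit 2 is hidden.
data = np.array([[0, 0], [1, 1], [1, 1], [0, 0]], dtype=float)
n_visible = 2
rng = np.random.default_rng(0)
A = rng.normal(scale=0.1, size=(3, 3))
W = (A + A.T) / 2.0
np.fill_diagonal(W, 0.0)
b = np.zeros(3)

ll_before = log_likelihood(data, W, b, n_visible)
for _ in range(200):
    grad_W, grad_b = exact_gradient(data, W, b, n_visible)
    W += 0.5 * grad_W
    b += 0.5 * grad_b
ll_after = log_likelihood(data, W, b, n_visible)
```

In a realistic network the two `expectations` calls are exactly what must be replaced by long Gibbs sampling runs, which is where the cost of the rule comes from.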

Restricted Boltzmann Machine

The Restricted Boltzmann Machine (RBM) is a Boltzmann machine constrained to a bipartite graph: visible units connect only to hidden units, with no within-layer connections. Paul Smolensky introduced the architecture in 1986 in the chapter "Information Processing in Dynamical Systems: Foundations of Harmony Theory" (in Parallel Distributed Processing, vol. 1), where he called it the harmonium.

The RBM energy function for binary units $v \in \{0,1\}^m$ and $h \in \{0,1\}^n$ is

$$E(v, h) = -\sum_i a_i v_i - \sum_j b_j h_j - \sum_{i,j} v_i W_{ij} h_j$$

where $a_i$ are visible biases, $b_j$ are hidden biases, and $W_{ij}$ is the visible-to-hidden weight matrix.

The bipartite restriction has a striking consequence: the hidden units are conditionally independent given the visible units, and the visible units are conditionally independent given the hidden units. This means

$$P(h \mid v) = \prod_j P(h_j \mid v), \qquad P(v \mid h) = \prod_i P(v_i \mid h)$$

and each conditional is a Bernoulli with logistic activation:

$$P(h_j = 1 \mid v) = \sigma\!\left(b_j + \sum_i v_i W_{ij}\right), \quad P(v_i = 1 \mid h) = \sigma\!\left(a_i + \sum_j W_{ij} h_j\right)$$

This factorisation makes block Gibbs sampling easy: sample all hidden units in parallel given the visibles, then sample all visibles given the hiddens. The model is thus dramatically more tractable than a fully connected Boltzmann machine.
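Because each layer factorises given the other, one block Gibbs step reduces to two matrix products. The following is a minimal sketch, assuming NumPy, a weight matrix `W` of shape (visible, hidden), and randomly initialised parameters chosen only for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_hidden(v, W, b, rng):
    """Sample all hidden units in parallel. v: (batch, m), W: (m, n), b: (n,)."""
    p = sigmoid(b + v @ W)
    return (rng.random(p.shape) < p).astype(float), p

def sample_visible(h, W, a, rng):
    """Sample all visible units in parallel. h: (batch, n), a: (m,)."""
    p = sigmoid(a + h @ W.T)
    return (rng.random(p.shape) < p).astype(float), p

rng = np.random.default_rng(0)
m, n = 6, 4                      # hypothetical layer sizes
W = 0.1 * rng.normal(size=(m, n))
a = np.zeros(m)
b = np.zeros(n)

# Run a batch of five independent chains for ten block-Gibbs steps.
v = rng.integers(0, 2, size=(5, m)).astype(float)
for _ in range(10):
    h, p_h = sample_hidden(v, W, b, rng)
    v, p_v = sample_visible(h, W, a, rng)
```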

Contrastive Divergence

Even for an RBM, exact maximum-likelihood training is intractable because the negative phase still requires sampling from the model's marginal distribution. Hinton's 2002 paper "Training Products of Experts by Minimizing Contrastive Divergence" (Neural Computation, vol. 14, pp. 1771-1800) introduced the contrastive divergence (CD-k) algorithm, which became the standard training procedure for RBMs.

CD-k approximates the negative phase by running only $k$ steps of Gibbs sampling starting from the training data, rather than running the chain to equilibrium. The resulting weight update is

$$\Delta W_{ij} \approx \eta \big( \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{recon}} \big)$$

where $\langle \cdot \rangle_{\text{recon}}$ uses the visibles obtained after $k$ Gibbs steps. In practice, $k=1$ (CD-1) is almost always used and works surprisingly well, even though the resulting gradient is biased and is not exactly the gradient of any simple objective. CD-1 was the algorithmic engine that finally made stacked RBMs trainable and is widely cited as the technical breakthrough behind the 2006-2010 renewal of interest in deep architectures.
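A CD-k update for a binary RBM can be sketched as below. The example assumes NumPy, full-batch updates, and a made-up dataset of two complementary six-bit patterns. Using hidden probabilities rather than sampled hidden states in the statistics is a common variance-reduction choice and is an assumption of this sketch, as are the learning rate and layer sizes.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bernoulli(p, rng):
    return (rng.random(p.shape) < p).astype(float)

def cd_k_update(v0, W, a, b, rng, k=1, lr=0.1):
    """One CD-k step on a batch v0 of shape (batch, m); returns updated parameters."""
    # Positive phase: hidden probabilities with visibles clamped to data.
    ph0 = sigmoid(b + v0 @ W)
    h = bernoulli(ph0, rng)
    # Negative phase: k steps of block Gibbs sampling started at the data.
    for _ in range(k):
        v = bernoulli(sigmoid(a + h @ W.T), rng)
        ph = sigmoid(b + v @ W)
        h = bernoulli(ph, rng)
    batch = v0.shape[0]
    W = W + lr * (v0.T @ ph0 - v.T @ ph) / batch
    a = a + lr * (v0 - v).mean(axis=0)
    b = b + lr * (ph0 - ph).mean(axis=0)
    return W, a, b

def reconstruction_error(v, W, a, b):
    """Mean squared error of a deterministic up-down pass (a rough progress proxy)."""
    pv = sigmoid(a + sigmoid(b + v @ W) @ W.T)
    return float(np.mean((v - pv) ** 2))

rng = np.random.default_rng(0)
patterns = np.array([[1, 1, 1, 0, 0, 0],
                     [0, 0, 0, 1, 1, 1]], dtype=float)
data = np.repeat(patterns, 10, axis=0)   # hypothetical toy dataset

m, n = 6, 2
W = 0.01 * rng.normal(size=(m, n))
a = np.zeros(m)
b = np.zeros(n)

err_before = reconstruction_error(data, W, a, b)
for _ in range(2000):
    W, a, b = cd_k_update(data, W, a, b, rng, k=1)
err_after = reconstruction_error(data, W, a, b)
```

Reconstruction error is used here only as a cheap monitoring signal; it is not the quantity that CD optimises and should not be read as a likelihood estimate.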

Persistent Contrastive Divergence

Tijmen Tieleman's 2008 ICML paper "Training Restricted Boltzmann Machines using Approximations to the Likelihood Gradient" introduced Persistent Contrastive Divergence (PCD), also known as Stochastic Maximum Likelihood. PCD maintains a small set of persistent "fantasy particles" that are not reset to data points after each weight update. The negative-phase chain continues across updates, on the heuristic that small parameter changes leave the chain near the equilibrium of the new model. PCD costs about the same per update as CD-1, typically gives a better approximation to the likelihood gradient, and has become the default in many implementations (for example scikit-learn's BernoulliRBM).
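The difference from CD-1 is small in code: the negative-phase chain starts from the stored fantasy particles instead of from the data. The sketch below assumes NumPy and a made-up toy dataset. The function name `pcd_update`, the number of particles, and the learning rate are illustrative choices.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bernoulli(p, rng):
    return (rng.random(p.shape) < p).astype(float)

def pcd_update(v_data, v_fantasy, W, a, b, rng, lr=0.05):
    """One PCD step; returns updated parameters and the advanced fantasy particles."""
    # Positive phase: hidden probabilities with visibles clamped to data.
    ph_data = sigmoid(b + v_data @ W)
    # Negative phase: advance the persistent chains by one block-Gibbs step.
    h_f = bernoulli(sigmoid(b + v_fantasy @ W), rng)
    v_fantasy = bernoulli(sigmoid(a + h_f @ W.T), rng)
    ph_f = sigmoid(b + v_fantasy @ W)
    W = W + lr * (v_data.T @ ph_data / len(v_data)
                  - v_fantasy.T @ ph_f / len(v_fantasy))
    a = a + lr * (v_data.mean(axis=0) - v_fantasy.mean(axis=0))
    b = b + lr * (ph_data.mean(axis=0) - ph_f.mean(axis=0))
    return W, a, b, v_fantasy

rng = np.random.default_rng(0)
patterns = np.array([[1, 1, 1, 0, 0, 0],
                     [0, 0, 0, 1, 1, 1]], dtype=float)
data = np.repeat(patterns, 10, axis=0)   # hypothetical toy dataset

m, n = 6, 2
W = 0.01 * rng.normal(size=(m, n))
a = np.zeros(m)
b = np.zeros(n)

# The fantasy particles are initialised once and never reset to the data.
fantasy = rng.integers(0, 2, size=(16, m)).astype(float)
for _ in range(2000):
    W, a, b, fantasy = pcd_update(data, fantasy, W, a, b, rng)
```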

Deep Belief Networks

The Deep Belief Network (DBN), introduced by Geoffrey Hinton, Simon Osindero and Yee-Whye Teh in "A Fast Learning Algorithm for Deep Belief Nets" (Neural Computation, 18, 1527-1554, 2006), is a generative model formed by stacking RBMs.

A DBN with $L$ hidden layers has the following structure: the top two layers form an undirected RBM, while the lower layers form a directed (sigmoid) belief network that runs top-down. Approximate inference is a single bottom-up, layer-by-layer pass through recognition weights, which start out tied to the generative weights of each layer.

The key insight in the 2006 paper is the greedy layer-wise pre-training procedure. The first RBM is trained on raw data using CD-1. Once trained, the hidden activations of this RBM are treated as the data for a second RBM, which is trained the same way, and so on for as many layers as desired. This procedure is justified by the "complementary priors" argument: under appropriate weight tying, adding a layer can only increase a variational lower bound on the data log-likelihood. After greedy pre-training, the whole network can be fine-tuned with a contrastive version of the wake-sleep algorithm or, more commonly, used to initialise a feedforward network that is fine-tuned with backpropagation.
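The greedy procedure can be sketched as a loop over layers. This example assumes NumPy, full-batch CD-1, random binary data, and hypothetical layer sizes. Passing hidden probabilities (rather than sampled states) up to the next RBM is a common simplification and an assumption of this sketch. Fine-tuning is not shown.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, rng, epochs=500, lr=0.1):
    """Train one RBM with full-batch CD-1; returns (W, a, b)."""
    m = data.shape[1]
    W = 0.01 * rng.normal(size=(m, n_hidden))
    a, b = np.zeros(m), np.zeros(n_hidden)
    for _ in range(epochs):
        ph0 = sigmoid(b + data @ W)
        h0 = (rng.random(ph0.shape) < ph0).astype(float)
        pv1 = sigmoid(a + h0 @ W.T)
        v1 = (rng.random(pv1.shape) < pv1).astype(float)
        ph1 = sigmoid(b + v1 @ W)
        W += lr * (data.T @ ph0 - v1.T @ ph1) / len(data)
        a += lr * (data - v1).mean(axis=0)
        b += lr * (ph0 - ph1).mean(axis=0)
    return W, a, b

def pretrain_stack(data, layer_sizes, rng):
    """Greedily train one RBM per entry of layer_sizes, bottom to top."""
    layers, x = [], data
    for n_hidden in layer_sizes:
        W, a, b = train_rbm(x, n_hidden, rng)
        layers.append((W, a, b))
        # Hidden activation probabilities become the "data" for the next RBM.
        x = sigmoid(b + x @ W)
    return layers, x

rng = np.random.default_rng(0)
data = rng.integers(0, 2, size=(32, 8)).astype(float)  # hypothetical binary data
layers, top_features = pretrain_stack(data, layer_sizes=[4, 3], rng=rng)
```

The returned weight matrices could then initialise a feedforward network of the same shape before supervised fine-tuning.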

Applied to the MNIST handwritten digit dataset, a three-hidden-layer DBN reached a test error of 1.25%, which the authors reported as better than the best discriminative learning algorithms on the permutation-invariant version of the task. The result was widely interpreted as evidence that deep networks could be trained successfully and was the trigger for the 2006-2010 "deep learning thaw".

Deep Boltzmann Machines

A Deep Boltzmann Machine (DBM), introduced by Ruslan Salakhutdinov and Geoffrey Hinton at AISTATS 2009, is a fully undirected generative model with multiple hidden layers. Unlike a DBN, every layer interacts symmetrically with its neighbours; the network is a single deep undirected graphical model rather than a stack of RBMs with a directed lower portion.

A DBM can have any number of hidden layers $h^{(1)}, h^{(2)}, \ldots, h^{(L)}$ above the visible units $v$. For two hidden layers, and with bias terms omitted, the energy function is

$$E(v, h^{(1)}, h^{(2)}) = -v^\top W^{(1)} h^{(1)} - h^{(1)\top} W^{(2)} h^{(2)}$$

The even-odd structure of layers means that, conditional on the layers above and below, units within a layer remain independent, which keeps block Gibbs sampling efficient. Training combines two approximations:

Variational mean-field inference for the data-dependent (positive) statistics. The intractable posterior $P(h \mid v)$ is approximated with a factorised distribution whose parameters are optimised to maximise a variational bound.
Persistent Markov chains (PCD) for the data-independent (negative) statistics.

The combination, often initialised by greedy layer-wise RBM pre-training, makes DBMs with millions of parameters practical to train. Salakhutdinov and Hinton reported strong generative-model performance on MNIST and on the NORB three-dimensional object dataset.
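The mean-field part of the procedure can be sketched for the two-hidden-layer energy function given earlier. The example assumes NumPy, no bias terms, small random weights, and a fixed number of fixed-point iterations. The function name `mean_field` and the layer sizes are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mean_field(v, W1, W2, n_iters=50):
    """Fixed-point updates for a factorised q(h1) q(h2) in a two-hidden-layer DBM."""
    mu2 = np.full((v.shape[0], W2.shape[1]), 0.5)
    for _ in range(n_iters):
        # h1 receives bottom-up input from v and top-down input from h2.
        mu1 = sigmoid(v @ W1 + mu2 @ W2.T)
        # h2 receives input only from h1.
        mu2 = sigmoid(mu1 @ W2)
    return mu1, mu2

rng = np.random.default_rng(0)
W1 = 0.1 * rng.normal(size=(6, 5))   # visible to first hidden layer
W2 = 0.1 * rng.normal(size=(5, 3))   # first to second hidden layer
v = rng.integers(0, 2, size=(4, 6)).astype(float)

mu1, mu2 = mean_field(v, W1, W2)

# Data-dependent statistics for the weight gradients, averaged over the batch.
pos_W1 = v.T @ mu1 / len(v)
pos_W2 = mu1.T @ mu2 / len(v)
```

The negative statistics would come from persistent chains, analogous to PCD for a single RBM, and the gradient is the difference between the two sets of statistics.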

Variants

| Variant | Year | Key idea | Notable use |
| --- | --- | --- | --- |
| Boltzmann machine | 1985 | Fully connected, binary stochastic units | Original constraint-satisfaction toy problems |
| Restricted Boltzmann Machine (RBM) | 1986 | Bipartite visible/hidden graph | Pre-training, feature learning |
| Mean-field Boltzmann machine | 1987 | Replace stochastic units with mean activations | Faster but biased inference |
| Helmholtz machine | 1995 | Directed generative + recognition net trained with wake-sleep | Precursor to modern variational models |
| Conditional RBM | 2006 | Visible units conditioned on previous frames | Motion capture and time series |
| Gaussian-Bernoulli RBM | 2006 | Real-valued visible units, binary hiddens | Image and audio data |
| Replicated Softmax | 2009 | Softmax visible units replicated by document length | Topic modelling, document retrieval |
| Deep Belief Network (DBN) | 2006 | Stacked RBMs with directed lower layers | Pre-training of deep classifiers |
| Deep Boltzmann Machine (DBM) | 2009 | Fully undirected multi-layer Boltzmann machine | Generative model for images, NORB |
| Spike-and-slab RBM | 2011 | Binary spike + real-valued slab per hidden | Continuous data with sparse codes |
| Convolutional RBM/DBN | 2009 | Weight-shared local connectivity | Scalable image modelling |

Applications

Boltzmann machines and their descendants found practical use in several domains during the late 2000s.

Handwritten digit modelling. Hinton, Osindero and Teh (2006) trained a three-hidden-layer DBN on MNIST with around 1.7 million parameters and achieved 1.25% test error, which they reported as better than the best discriminative methods on the permutation-invariant version of the task and which became a key empirical justification for deep learning.

Collaborative filtering. Salakhutdinov, Mnih and Hinton's 2007 ICML paper "Restricted Boltzmann Machines for Collaborative Filtering" applied an RBM with softmax visible units (one unit per rated movie, with one state per possible rating) to the Netflix Prize dataset of 100 million user-movie ratings. RBMs slightly outperformed carefully tuned SVD baselines, and a linear blend of RBM and SVD predictions was used in the winning Netflix Prize ensemble.

Topic modelling and document retrieval. Salakhutdinov and Hinton's NIPS 2009 paper "Replicated Softmax: An Undirected Topic Model" replaced the Dirichlet prior of Latent Dirichlet Allocation with an RBM-style undirected model in which the number of softmax visible units equals the document length. It produced better held-out perplexity and better document retrieval than LDA on standard text corpora.

Speech and acoustic modelling. From 2009 to roughly 2012, DBN-initialised feedforward networks were the dominant acoustic models in research-grade automatic speech recognition. Mohamed, Dahl and Hinton (2009-2012) and the IBM, Google and Microsoft speech groups used DBN pre-training to push phone-recognition error rates down on TIMIT and large-vocabulary ASR benchmarks, work that is widely credited with bringing deep learning into industrial speech recognition.

Pre-training for deep feedforward networks. From 2006 to roughly 2012, the dominant recipe for training deep nets was greedy unsupervised pre-training with a stack of RBMs (or autoencoders), followed by supervised fine-tuning with backpropagation. The pre-training was thought to find weight regions near good local minima and to act as a regulariser.

Why Boltzmann machines fell out of favour

By 2012-2014, RBM- and DBN-based pre-training had largely disappeared from mainstream practice. Several developments contributed.

Better activations and initialisation made pre-training unnecessary. Glorot, Bordes and Bengio's 2011 AISTATS paper "Deep Sparse Rectifier Neural Networks" showed that simply replacing sigmoid or tanh activations with rectified linear units (ReLU) lets deep networks train well from random initialisation on large supervised tasks, eliminating the need for unsupervised pre-training. Improved initialisation schemes (Xavier and He), batch normalisation, and dropout (Srivastava et al., 2014) further closed the gap.

Big data and big GPUs. Krizhevsky, Sutskever and Hinton's 2012 ImageNet result with a purely supervised convolutional network (AlexNet) demonstrated that, given enough labelled data and GPU compute, supervised learning alone outperformed pre-trained networks. Subsequent vision and language work confirmed this.

Sampling-based training is slow and brittle. CD-1 has biased gradients and PCD-style chains can get stuck. Hyperparameter tuning (learning rates, momentum, weight decay, mini-batch size, number of Gibbs steps, type of CD) is delicate, and the resulting models are hard to evaluate because the partition function $Z$ cannot be computed exactly.

Better generative models arrived. Variational autoencoders (Kingma and Welling, 2013) and generative adversarial networks (Goodfellow et al., 2014) offered far more practical generative models. They scaled to higher-resolution images, were easier to evaluate qualitatively, and used the same gradient-based optimisation as the rest of deep learning rather than sampling-based contrastive methods.

By the time the modern wave of generative AI took off in the late 2010s, the Boltzmann machine was a historical landmark rather than a working tool.

Connection to physics

The Boltzmann machine sits at the intersection of statistical physics and machine learning. Its energy function has the same form as the Hamiltonian of an Ising spin glass, and the probability distribution over states is a Gibbs measure. Several physical models map directly onto Boltzmann machines:

The Ising model with external field corresponds to a Boltzmann machine with biases and pairwise interactions.
The Sherrington-Kirkpatrick spin-glass model corresponds to a fully connected Boltzmann machine with random symmetric weights.
The Hopfield network (1982) is the deterministic, zero-temperature limit of a Boltzmann machine and the immediate ancestor of the stochastic version.

In October 2024 the Royal Swedish Academy of Sciences awarded the Nobel Prize in Physics jointly to John J. Hopfield and Geoffrey Hinton "for foundational discoveries and inventions that enable machine learning with artificial neural networks". The Academy's press release explicitly names the Hopfield network and the Boltzmann machine: it describes Hinton as having used tools from statistical physics to create a network that can learn to recognise characteristic elements in data. The award is, in effect, a recognition of the Boltzmann machine as a piece of physics.

Modern resurgence

Although the explicit Boltzmann machine has not returned to wide use, several modern lines of research are continuous-state successors of the energy-based formulation it pioneered.

Energy-based models. Yann LeCun and collaborators have advocated for a broad family of energy-based models in which a parametric energy function $E_\theta(x)$ is shaped to take low values on data and high values elsewhere. The Joint Embedding Predictive Architecture (JEPA) family formalises self-supervised learning in this framework. RBM training is the historical template for this style of learning.

Score-based and diffusion models. Denoising score-matching and the diffusion models that grew out of it (Song and Ermon, 2019; Ho et al., 2020) can be viewed as continuous-state energy-based models. Training amounts to learning the gradient of the log density (the score), which sidesteps the partition function in much the same way contrastive divergence did, and sampling is performed by running Langevin or reverse-diffusion chains analogous to Gibbs sampling in a Boltzmann machine.
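The analogy with Gibbs sampling can be made concrete with unadjusted Langevin dynamics on a toy energy. The sketch below assumes NumPy and a one-dimensional Gaussian energy whose gradient is known in closed form. In a score-based model that gradient would instead come from a learned network. The step size, chain count, and the parameters `mu` and `sigma` are illustrative.

```python
import numpy as np

def grad_energy(x, mu=2.0, sigma=1.0):
    """Gradient of the toy energy E(x) = (x - mu)^2 / (2 sigma^2), i.e. minus the score."""
    return (x - mu) / sigma**2

def langevin_sample(n_chains, n_steps, step, rng):
    """Unadjusted Langevin dynamics: a gradient step on the energy plus Gaussian noise."""
    x = rng.normal(size=n_chains)  # arbitrary initialisation
    for _ in range(n_steps):
        noise = rng.normal(size=n_chains)
        x = x - 0.5 * step * grad_energy(x) + np.sqrt(step) * noise
    return x

rng = np.random.default_rng(0)
samples = langevin_sample(n_chains=5000, n_steps=500, step=0.05, rng=rng)
sample_mean = float(samples.mean())
sample_std = float(samples.std())
```

As with Gibbs sampling in a Boltzmann machine, only local information about the energy is needed and the partition function never has to be evaluated.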

Modern Hopfield networks. Ramsauer et al. (2020) showed that continuous-state Hopfield networks with the right energy function reproduce the attention mechanism of Transformers, reviving interest in the energy-based view of associative memory.

Comparison with other generative models

| Model | Training | Inference | Sample quality | Era of dominance | Key paper |
| --- | --- | --- | --- | --- | --- |
| Boltzmann machine | Contrastive Hebbian, MCMC | Slow Gibbs sampling | Limited (toy) | 1985-1990 | Ackley, Hinton, Sejnowski 1985 |
| Restricted Boltzmann Machine (RBM) | Contrastive Divergence | Block Gibbs | Modest | 2002-2012 | Smolensky 1986; Hinton 2002 |
| Deep Belief Network (DBN) | Greedy stacked RBMs + fine-tune | Layer-wise feedforward | Good for digits | 2006-2012 | Hinton, Osindero, Teh 2006 |
| Deep Boltzmann Machine (DBM) | Mean field + PCD | Mean-field iterations | Good but slow | 2009-2012 | Salakhutdinov, Hinton 2009 |
| Variational Autoencoder (VAE) | ELBO via reparameterisation | Single forward pass | Blurry but reliable | 2014-present | Kingma, Welling 2013 |
| Generative Adversarial Network (GAN) | Min-max game | Single forward pass | Sharp images | 2014-2020 | Goodfellow et al. 2014 |
| Normalising flow | Exact log-likelihood | Invertible forward | Tractable density | 2015-present | Dinh et al. 2014; Rezende, Mohamed 2015 |
| Diffusion model | Score matching, denoising | Iterative reverse process | State of the art | 2020-present | Ho, Jain, Abbeel 2020 |

Key papers timeline

| Year | Authors | Paper | Contribution |
| --- | --- | --- | --- |
| 1982 | Hopfield | Neural networks and physical systems with emergent collective computational abilities | Hopfield network, deterministic ancestor |
| 1983 | Fahlman, Hinton, Sejnowski | Massively Parallel Architectures for AI: NETL, Thistle, and Boltzmann Machines (AAAI-83) | First public mention |
| 1985 | Ackley, Hinton, Sejnowski | A Learning Algorithm for Boltzmann Machines (Cognitive Science) | Defines model and learning rule |
| 1986 | Smolensky | Information Processing in Dynamical Systems: Foundations of Harmony Theory | Introduces the harmonium / RBM |
| 2002 | Hinton | Training Products of Experts by Minimizing Contrastive Divergence (Neural Computation) | CD-k algorithm |
| 2006 | Hinton, Osindero, Teh | A Fast Learning Algorithm for Deep Belief Nets (Neural Computation) | DBN, layer-wise pre-training |
| 2007 | Salakhutdinov, Mnih, Hinton | Restricted Boltzmann Machines for Collaborative Filtering (ICML) | RBM for Netflix Prize |
| 2008 | Tieleman | Training RBMs using Approximations to the Likelihood Gradient (ICML) | Persistent CD |
| 2009 | Salakhutdinov, Hinton | Deep Boltzmann Machines (AISTATS) | DBM with mean-field + PCD |
| 2009 | Salakhutdinov, Hinton | Replicated Softmax (NIPS) | Topic model RBM |
| 2011 | Glorot, Bordes, Bengio | Deep Sparse Rectifier Neural Networks (AISTATS) | Made pre-training unnecessary |
| 2024 | Royal Swedish Academy | Nobel Prize in Physics | Awarded to Hopfield and Hinton, citing Boltzmann machines |

References

Ackley, D. H., Hinton, G. E., and Sejnowski, T. J. (1985). A Learning Algorithm for Boltzmann Machines. *Cognitive Science*, 9(1), 147-169.
Smolensky, P. (1986). Information Processing in Dynamical Systems: Foundations of Harmony Theory. In Rumelhart, D. E. and McClelland, J. L. (eds.), *Parallel Distributed Processing: Explorations in the Microstructure of Cognition*, Volume 1: Foundations, MIT Press, pp. 194-281.
Hinton, G. E. (2002). Training Products of Experts by Minimizing Contrastive Divergence. *Neural Computation*, 14(8), 1771-1800.
Hinton, G. E., Osindero, S., and Teh, Y. W. (2006). A Fast Learning Algorithm for Deep Belief Nets. *Neural Computation*, 18(7), 1527-1554.
Salakhutdinov, R., Mnih, A., and Hinton, G. (2007). Restricted Boltzmann Machines for Collaborative Filtering. *Proceedings of the 24th International Conference on Machine Learning (ICML)*, 791-798.
Tieleman, T. (2008). Training Restricted Boltzmann Machines using Approximations to the Likelihood Gradient. *Proceedings of the 25th International Conference on Machine Learning (ICML)*, 1064-1071.
Salakhutdinov, R., and Hinton, G. (2009). Deep Boltzmann Machines. *Proceedings of the 12th International Conference on Artificial Intelligence and Statistics (AISTATS)*, 5, 448-455.
Salakhutdinov, R., and Hinton, G. (2009). Replicated Softmax: An Undirected Topic Model. *Advances in Neural Information Processing Systems (NIPS)* 22.
Glorot, X., Bordes, A., and Bengio, Y. (2011). Deep Sparse Rectifier Neural Networks. *AISTATS*, 15, 315-323.
Goodfellow, I., Bengio, Y., and Courville, A. (2016). *Deep Learning*, MIT Press, Chapter 20: Deep Generative Models. deeplearningbook.org.
Hinton, G. E. (2007). Boltzmann machine. *Scholarpedia*, 2(5), 1668.
Royal Swedish Academy of Sciences (2024). Press release: The Nobel Prize in Physics 2024.
Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. *Proceedings of the National Academy of Sciences*, 79(8), 2554-2558.
