Boltzmann machine
Last reviewed
Apr 30, 2026
Sources
13 citations
Review status
Source-backed
Revision
v1 · 3,965 words
A Boltzmann machine is a stochastic recurrent neural network with binary units and symmetric connections. It defines an energy function over the joint configuration of visible and hidden units and assigns probabilities to configurations through the Boltzmann (Gibbs) distribution borrowed from statistical physics. The model was introduced by Geoffrey Hinton and Terrence Sejnowski in 1983 and presented in full in the 1985 Cognitive Science paper "A Learning Algorithm for Boltzmann Machines" by David Ackley, Hinton and Sejnowski. The machine takes its name from the physicist Ludwig Boltzmann, because the distribution it samples from has the same exponential form as the Boltzmann distribution that governs the states of physical systems at thermal equilibrium.
Boltzmann machines are an early example of an energy-based generative model and a milestone in the history of neural networks. Their restricted variant, the Restricted Boltzmann Machine (RBM), proposed by Paul Smolensky in 1986 under the name harmonium, became central to the deep learning revival of the mid-2000s. Stacks of RBMs trained greedily form Deep Belief Networks (Hinton, Osindero and Teh, 2006), which were briefly the state of the art for unsupervised pre-training of deep feedforward networks. The 2024 Nobel Prize in Physics, awarded jointly to John Hopfield and Geoffrey Hinton, cited the Boltzmann machine and the Hopfield network as foundational discoveries that enabled machine learning with artificial neural networks.
The Boltzmann machine arose directly from work on Hopfield networks (1982). Hopfield had shown that a fully connected, symmetric network of binary units with deterministic asynchronous updates could function as content-addressable associative memory, with stable states identified by minima of an Ising-like energy function. The deterministic dynamics, however, get trapped in local minima and the network has no learning rule for hidden units. In a 1995 interview Hinton recalled that, in February or March 1983, he was preparing a talk on simulated annealing applied to Hopfield-style networks and "had to design a learning algorithm for the talk". The result was the Boltzmann machine: a stochastic generalization of Hopfield's network in which units flip states probabilistically according to a temperature parameter, allowing the network to escape local minima and to be trained by a contrastive Hebbian rule.
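The earliest references appear in technical reports and conference papers from 1983 and 1984, including the AAAI-83 paper by Fahlman, Hinton and Sejnowski, "Massively Parallel Architectures for AI: NETL, Thistle, and Boltzmann Machines".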
The two ideas that made the Boltzmann machine novel were the introduction of stochastic binary units (so that the network defines a probability distribution rather than a single deterministic equilibrium) and the contrastive learning rule (positive phase minus negative phase) that follows the gradient of the data log-likelihood for both visible and hidden weights.
A Boltzmann machine is an undirected graphical model with binary units $s_i \in \{0,1\}$. The units are partitioned into a set of visible units $v$ that are clamped to data during training, and a set of hidden units $h$ that the model uses to capture latent structure. Connections are symmetric: the weight $W_{ij}$ between units $i$ and $j$ equals $W_{ji}$, and there are no self-connections.
In the general (fully connected) Boltzmann machine, every pair of units (visible-visible, hidden-hidden, and visible-hidden) can be connected. The energy of a joint configuration $(v,h)$ is
$$E(v, h) = -\sum_i b_i s_i - \sum_{i<j} W_{ij} s_i s_j$$
where $s$ ranges over all visible and hidden units, $b_i$ is a bias, and $W_{ij}$ is the symmetric weight between units $i$ and $j$.
The probability of a configuration is given by the Boltzmann distribution
$$P(v,h) = \frac{1}{Z} \exp\left( -\frac{E(v,h)}{T} \right)$$
where $T$ is a temperature parameter (often set to 1 in machine-learning contexts) and $Z = \sum_{v',h'} \exp(-E(v',h')/T)$ is the partition function summing over all $2^N$ configurations of $N$ binary units. Marginalising out the hidden units, and taking $T = 1$ as in the rest of this article unless stated otherwise, gives the model's distribution over visible data:
$$P(v) = \frac{1}{Z} \sum_h \exp(-E(v,h))$$
The partition function is intractable in general because the sum has exponentially many terms.
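For a machine with only a handful of units, the definitions above can be checked by brute force. The following is a minimal sketch, not a practical method: it enumerates all $2^N$ states, so it only works for tiny $N$. The function names and the toy weights are illustrative assumptions, not taken from any reference implementation.

```python
import itertools
import math

def energy(s, b, W):
    """E(s) = -sum_i b_i s_i - sum_{i<j} W_ij s_i s_j for a binary state tuple s."""
    n = len(s)
    e = -sum(b[i] * s[i] for i in range(n))
    e -= sum(W[i][j] * s[i] * s[j] for i in range(n) for j in range(i + 1, n))
    return e

def boltzmann_distribution(b, W, T=1.0):
    """Enumerate all 2^N states and return (states, probabilities, Z)."""
    states = list(itertools.product([0, 1], repeat=len(b)))
    weights = [math.exp(-energy(s, b, W) / T) for s in states]
    Z = sum(weights)  # partition function: exponentially many terms in N
    return states, [w / Z for w in weights], Z

def visible_marginal(states, probs, n_visible):
    """P(v): sum the joint over the hidden units (the trailing entries of each state)."""
    marginal = {}
    for s, p in zip(states, probs):
        v = s[:n_visible]
        marginal[v] = marginal.get(v, 0.0) + p
    return marginal

# Toy machine (hypothetical values): units 0 and 1 are visible, unit 2 is hidden.
b = [0.0, 0.0, 0.0]
W = [[0.0, 1.0, -1.0],
     [1.0, 0.0, 2.0],
     [-1.0, 2.0, 0.0]]
states, probs, Z = boltzmann_distribution(b, W)
p_v = visible_marginal(states, probs, n_visible=2)
for v, p in sorted(p_v.items()):
    print(v, round(p, 4))
```

Adding one unit doubles the number of terms in `Z`, which is the intractability the text refers to.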
A single unit's conditional probability of being on, given the states of all other units, takes the simple logistic form
$$P(s_i = 1 \mid s_{-i}) = \sigma\!\left( \frac{1}{T} \Big( b_i + \sum_{j \neq i} W_{ij} s_j \Big) \right)$$
where $\sigma(x) = 1/(1 + e^{-x})$. This local update rule, applied repeatedly to randomly chosen units, is a Gibbs sampler for the joint distribution.
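The local update can be written in a few lines. This is a sketch under simple assumptions: states are stored in a Python list, `W` is a symmetric list of lists with a zero diagonal, and units are visited in a fresh random order on each sweep. The name `gibbs_sweep` is hypothetical.

```python
import math
import random

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def gibbs_sweep(s, b, W, T=1.0, rng=random):
    """Resample every unit once, in random order, from P(s_i = 1 | all other units)."""
    order = list(range(len(s)))
    rng.shuffle(order)
    for i in order:
        # Net input to unit i from its bias and all other units.
        net = b[i] + sum(W[i][j] * s[j] for j in range(len(s)) if j != i)
        s[i] = 1 if rng.random() < sigmoid(net / T) else 0
    return s

# Estimate P(s_0 = 1) for a two-unit machine by averaging over many sweeps.
rng = random.Random(0)
b = [0.5, -0.5]
W = [[0.0, 1.0], [1.0, 0.0]]
state = [0, 0]
on_count = 0
n_sweeps = 5000
for _ in range(n_sweeps):
    gibbs_sweep(state, b, W, rng=rng)
    on_count += state[0]
print("estimated P(s_0 = 1):", on_count / n_sweeps)
```

Each update needs only the states of the units connected to unit `i`, which is what makes the sampler local.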
Inference in a Boltzmann machine means drawing samples from the model distribution, computing expectations over hidden units given visible data, or computing marginals. Since exact computation is intractable for any non-trivial network, inference is performed with stochastic Gibbs sampling. Each unit is updated in turn (or asynchronously at random) by drawing a new value from its conditional Bernoulli distribution given the current states of its neighbours.
At high temperature the network mixes quickly between configurations but the distribution is close to uniform. At low temperature the distribution concentrates on low-energy configurations but mixing is slow. Simulated annealing schedules the temperature from high to low so that the network explores broadly at first and then settles into a low-energy region. For a fully connected Boltzmann machine, the time required to reach equilibrium grows steeply with network size, which is the central practical obstacle to using the model. This is one of the chief reasons the unrestricted form was largely abandoned in favour of more constrained variants.
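A minimal annealing loop might look like the following. The geometric cooling schedule and all default parameter values are assumptions chosen for illustration; the text does not prescribe a particular schedule.

```python
import math
import random

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def energy(s, b, W):
    n = len(s)
    return (-sum(b[i] * s[i] for i in range(n))
            - sum(W[i][j] * s[i] * s[j] for i in range(n) for j in range(i + 1, n)))

def anneal(b, W, t_start=10.0, t_end=0.05, n_temps=30, sweeps_per_temp=5, seed=0):
    """Gibbs sampling while the temperature falls geometrically from t_start to t_end."""
    rng = random.Random(seed)
    n = len(b)
    s = [rng.randint(0, 1) for _ in range(n)]
    ratio = (t_end / t_start) ** (1.0 / (n_temps - 1))
    T = t_start
    for _ in range(n_temps):
        for _ in range(sweeps_per_temp):
            for i in rng.sample(range(n), n):  # random visiting order
                net = b[i] + sum(W[i][j] * s[j] for j in range(n) if j != i)
                s[i] = 1 if rng.random() < sigmoid(net / T) else 0
        T *= ratio
    return s

# Toy machine whose lowest-energy state is all units on.
b = [2.0, 2.0, 2.0]
W = [[0.0, 1.0, 1.0],
     [1.0, 0.0, 1.0],
     [1.0, 1.0, 0.0]]
final = anneal(b, W)
print(final, energy(final, b, W))
```

At the high starting temperature the updates are close to coin flips; by the final temperature they are nearly deterministic, so the state settles into a low-energy configuration.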
The celebrated Boltzmann learning rule from Ackley, Hinton and Sejnowski (1985) follows the gradient of the log-likelihood of the data. For a weight $W_{ij}$ the gradient is
$$\frac{\partial \log P(v)}{\partial W_{ij}} = \langle s_i s_j \rangle_{\text{data}} - \langle s_i s_j \rangle_{\text{model}}$$
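The rule has two phases. In the positive phase the visible units are clamped to a training vector and the hidden units are sampled at equilibrium, giving $\langle s_i s_j \rangle_{\text{data}}$. In the negative phase no units are clamped and the whole network runs freely to equilibrium, giving $\langle s_i s_j \rangle_{\text{model}}$.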
The weight update is the difference between the two: $\Delta W_{ij} \propto \langle s_i s_j \rangle_{\text{data}} - \langle s_i s_j \rangle_{\text{model}}$.
This rule is local and Hebbian: it requires only the states of the two units a synapse connects, which made it appealing as a model of biological learning. It is also computationally infeasible for a general Boltzmann machine of useful size. The partition function never appears explicitly in the update, but it is implicit in the model expectations, and estimating those requires running the negative-phase chain to equilibrium, which is prohibitively slow.
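On a machine small enough to enumerate, the gradient formula can be checked numerically. The sketch below assumes $T = 1$, a three-unit machine with two visible units, and a made-up data set. It compares the difference of correlations with a central finite difference of the mean log-likelihood; all names are hypothetical.

```python
import itertools
import math

def energy(s, b, W):
    n = len(s)
    return (-sum(b[i] * s[i] for i in range(n))
            - sum(W[i][j] * s[i] * s[j] for i in range(n) for j in range(i + 1, n)))

def joint(b, W):
    """Exact joint distribution over all states (T = 1)."""
    states = list(itertools.product([0, 1], repeat=len(b)))
    w = [math.exp(-energy(s, b, W)) for s in states]
    Z = sum(w)
    return {s: x / Z for s, x in zip(states, w)}

def mean_log_likelihood(data, b, W, n_vis):
    p = joint(b, W)
    return sum(math.log(sum(q for s, q in p.items() if s[:n_vis] == tuple(v)))
               for v in data) / len(data)

def correlations(data, b, W, n_vis, i, j):
    """Return (<s_i s_j>_data, <s_i s_j>_model) by exact enumeration."""
    p = joint(b, W)
    model = sum(q * s[i] * s[j] for s, q in p.items())
    clamped = 0.0
    for v in data:
        matches = [(s, q) for s, q in p.items() if s[:n_vis] == tuple(v)]
        p_v = sum(q for _, q in matches)
        # Expectation under P(h | v): visibles clamped, hiddens free.
        clamped += sum(q / p_v * s[i] * s[j] for s, q in matches)
    return clamped / len(data), model

b = [0.1, -0.2, 0.3]
W = [[0.0, 0.5, -0.4], [0.5, 0.0, 0.8], [-0.4, 0.8, 0.0]]
data = [(1, 1), (1, 0), (1, 1), (0, 0)]
i, j, eps = 1, 2, 1e-6

pos, neg = correlations(data, b, W, 2, i, j)
W_plus = [row[:] for row in W]
W_minus = [row[:] for row in W]
W_plus[i][j] += eps
W_minus[i][j] -= eps  # energy() reads only the upper triangle, so W[i][j] with i < j
finite_diff = (mean_log_likelihood(data, b, W_plus, 2)
               - mean_log_likelihood(data, b, W_minus, 2)) / (2 * eps)
print(pos - neg, finite_diff)
```

The two printed numbers should agree to several decimal places. For larger machines neither side can be computed exactly, which is why sampling is needed.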
The Restricted Boltzmann Machine (RBM) is a Boltzmann machine constrained to a bipartite graph: visible units connect only to hidden units, with no within-layer connections. Paul Smolensky introduced the architecture in 1986 in the chapter "Information Processing in Dynamical Systems: Foundations of Harmony Theory" (in Parallel Distributed Processing, vol. 1), where he called it the harmonium.
The RBM energy function for binary units $v \in \{0,1\}^m$ and $h \in \{0,1\}^n$ is
$$E(v, h) = -\sum_i a_i v_i - \sum_j b_j h_j - \sum_{i,j} v_i W_{ij} h_j$$
where $a_i$ are visible biases, $b_j$ are hidden biases, and $W_{ij}$ is the visible-to-hidden weight matrix.
The bipartite restriction has a striking consequence: the hidden units are conditionally independent given the visible units, and the visible units are conditionally independent given the hidden units. This means
$$P(h \mid v) = \prod_j P(h_j \mid v), \qquad P(v \mid h) = \prod_i P(v_i \mid h)$$
and each conditional is a Bernoulli with logistic activation:
$$P(h_j = 1 \mid v) = \sigma\!\left(b_j + \sum_i v_i W_{ij}\right), \quad P(v_i = 1 \mid h) = \sigma\!\left(a_i + \sum_j W_{ij} h_j\right)$$
This factorisation makes block Gibbs sampling easy: sample all hidden units in parallel given the visibles, then sample all visibles given the hiddens. The model is thus dramatically more tractable than a fully connected Boltzmann machine.
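The two factorised conditionals translate directly into matrix operations. The following NumPy sketch assumes `W` has shape `(m, n)` (visible by hidden) and that states are stored as rows of a mini-batch. The function names are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_hidden(v, W, b, rng):
    """P(h_j = 1 | v) for every hidden unit at once; returns (probabilities, samples)."""
    p = sigmoid(b + v @ W)
    return p, (rng.random(p.shape) < p).astype(np.float64)

def sample_visible(h, W, a, rng):
    """P(v_i = 1 | h) for every visible unit at once; returns (probabilities, samples)."""
    p = sigmoid(a + h @ W.T)
    return p, (rng.random(p.shape) < p).astype(np.float64)

def block_gibbs(v, W, a, b, rng, steps=1):
    """Alternate between sampling all hiddens and all visibles."""
    for _ in range(steps):
        _, h = sample_hidden(v, W, b, rng)
        _, v = sample_visible(h, W, a, rng)
    return v

rng = np.random.default_rng(0)
m, n = 6, 4
W = 0.1 * rng.standard_normal((m, n))
a, b = np.zeros(m), np.zeros(n)
v0 = (rng.random((3, m)) < 0.5).astype(np.float64)  # three random visible vectors
print(block_gibbs(v0, W, a, b, rng, steps=10))
```

A full layer is updated with one matrix product, in contrast to the unit-by-unit updates a fully connected machine requires.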
Even for an RBM, exact maximum-likelihood training is intractable because the negative phase still requires expectations under the model distribution, which can only be estimated by running a Gibbs chain towards equilibrium. Hinton's 2002 paper "Training Products of Experts by Minimizing Contrastive Divergence" (Neural Computation, vol. 14, pp. 1771-1800) introduced the contrastive divergence (CD-k) algorithm, which became the standard training procedure for RBMs.
CD-k approximates the negative phase by running only $k$ steps of Gibbs sampling starting from the training data, rather than running the chain to equilibrium. The resulting weight update is
$$\Delta W_{ij} \approx \eta \big( \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{recon}} \big)$$
where $\langle \cdot \rangle_{\text{recon}}$ uses the visibles obtained after $k$ Gibbs steps. In practice, $k=1$ (CD-1) is almost always used and works surprisingly well, even though the resulting gradient is biased and is not exactly the gradient of any simple objective. CD-1 was the algorithmic engine that finally made stacked RBMs trainable and is widely cited as the technical breakthrough behind the 2006-2010 renewal of interest in deep architectures.
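A single CD-k parameter update can be sketched as follows. This is one common formulation rather than a transcription of the 2002 paper: it assumes mini-batches stored as rows, uses hidden probabilities rather than sampled hidden states when accumulating statistics, and applies plain gradient ascent with no momentum or weight decay. The name `cd_k_update` is hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd_k_update(v0, W, a, b, rng, k=1, lr=0.1):
    """One CD-k step on a mini-batch v0 of shape (batch, m); returns new (W, a, b)."""
    # Positive phase: hidden probabilities with the visibles clamped to data.
    ph0 = sigmoid(b + v0 @ W)
    h = (rng.random(ph0.shape) < ph0).astype(np.float64)
    # Negative phase: k steps of block Gibbs sampling starting from the data.
    for _ in range(k):
        pv = sigmoid(a + h @ W.T)
        v = (rng.random(pv.shape) < pv).astype(np.float64)
        ph = sigmoid(b + v @ W)
        h = (rng.random(ph.shape) < ph).astype(np.float64)
    batch = v0.shape[0]
    dW = (v0.T @ ph0 - v.T @ ph) / batch  # <v h>_data - <v h>_recon
    da = (v0 - v).mean(axis=0)
    db = (ph0 - ph).mean(axis=0)
    return W + lr * dW, a + lr * da, b + lr * db

rng = np.random.default_rng(0)
data = np.array([[1, 1, 0, 0], [0, 0, 1, 1]] * 10, dtype=np.float64)
W = 0.01 * rng.standard_normal((4, 2))
a, b = np.zeros(4), np.zeros(2)
for _ in range(200):
    W, a, b = cd_k_update(data, W, a, b, rng, k=1)
print(np.round(W, 2))
```

Setting `k` larger moves the negative statistics closer to the true model expectation at proportionally higher cost per update.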
Tijmen Tieleman's 2008 ICML paper "Training Restricted Boltzmann Machines using Approximations to the Likelihood Gradient" introduced Persistent Contrastive Divergence (PCD), also known as Stochastic Maximum Likelihood. PCD maintains a small set of persistent "fantasy particles" that are not reset to data points after each weight update; the negative-phase chain is allowed to continue across updates, on the heuristic that small parameter changes leave the chain near equilibrium of the new model. PCD costs about the same per update as CD-1 and typically gives a better approximation to the likelihood gradient. It is the training method used in several implementations, for example scikit-learn's BernoulliRBM.
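The difference from CD-k is only in where the negative-phase chain starts. The sketch below is an illustrative NumPy version, not the code of any particular library: the function name, the learning rate, and the choice of one Gibbs step per update are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pcd_update(v0, chain_v, W, a, b, rng, lr=0.05):
    """One PCD step. chain_v holds the persistent fantasy particles (rows)."""
    # Positive phase: statistics with visibles clamped to the data batch.
    ph0 = sigmoid(b + v0 @ W)
    # Negative phase: advance the persistent chain by one block Gibbs step.
    # The chain continues from its previous state, not from the data.
    ph_chain = sigmoid(b + chain_v @ W)
    h_chain = (rng.random(ph_chain.shape) < ph_chain).astype(np.float64)
    pv = sigmoid(a + h_chain @ W.T)
    chain_v = (rng.random(pv.shape) < pv).astype(np.float64)
    ph_neg = sigmoid(b + chain_v @ W)
    dW = v0.T @ ph0 / len(v0) - chain_v.T @ ph_neg / len(chain_v)
    da = v0.mean(axis=0) - chain_v.mean(axis=0)
    db = ph0.mean(axis=0) - ph_neg.mean(axis=0)
    return W + lr * dW, a + lr * da, b + lr * db, chain_v

rng = np.random.default_rng(0)
data = np.array([[1, 1, 0, 0], [0, 0, 1, 1]] * 10, dtype=np.float64)
W = 0.01 * rng.standard_normal((4, 2))
a, b = np.zeros(4), np.zeros(2)
chain = (rng.random((8, 4)) < 0.5).astype(np.float64)  # 8 fantasy particles
for _ in range(200):
    W, a, b, chain = pcd_update(data, chain, W, a, b, rng)
print(chain)
```

The caller must carry `chain` from one update to the next; resetting it to the data on every call would turn this back into CD-1.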
The Deep Belief Network (DBN), introduced by Geoffrey Hinton, Simon Osindero and Yee-Whye Teh in "A Fast Learning Algorithm for Deep Belief Nets" (Neural Computation, 18, 1527-1554, 2006), is a generative model formed by stacking RBMs.
A DBN with $L$ hidden layers has the following structure: the top two layers form an undirected RBM, while the lower layers form a directed (sigmoid) belief network that runs top-down. Inference is approximate: a single layer-wise bottom-up pass through recognition weights that are tied to the generative weights of each layer.
The key insight in the 2006 paper is the greedy layer-wise pre-training procedure. The first RBM is trained on raw data using CD-1. Once trained, the hidden activations of this RBM are treated as the data for a second RBM, which is trained the same way, and so on for as many layers as desired. This procedure is justified by the "complementary priors" argument: under appropriate weight tying, adding a layer can only increase a variational lower bound on the data log-likelihood. After greedy pre-training, the whole network can be fine-tuned with a contrastive version of the wake-sleep algorithm or, more commonly, used to initialise a feedforward network that is fine-tuned with backpropagation.
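The greedy procedure can be sketched in a few functions. This is a simplified illustration, not the 2006 paper's code: it assumes full-batch CD-1 with a fixed learning rate, passes hidden probabilities (rather than sampled states) up to the next RBM, and omits the fine-tuning stage. All names and hyperparameters are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, rng, epochs=20, lr=0.1):
    """Train one RBM with full-batch CD-1; returns (W, a, b)."""
    n_visible = data.shape[1]
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))
    a, b = np.zeros(n_visible), np.zeros(n_hidden)
    for _ in range(epochs):
        ph0 = sigmoid(b + data @ W)
        h0 = (rng.random(ph0.shape) < ph0).astype(np.float64)
        pv1 = sigmoid(a + h0 @ W.T)
        v1 = (rng.random(pv1.shape) < pv1).astype(np.float64)
        ph1 = sigmoid(b + v1 @ W)
        W += lr * (data.T @ ph0 - v1.T @ ph1) / len(data)
        a += lr * (data - v1).mean(axis=0)
        b += lr * (ph0 - ph1).mean(axis=0)
    return W, a, b

def pretrain_stack(data, hidden_sizes, rng, **kwargs):
    """Train RBMs one at a time; each layer's hidden activations feed the next."""
    layers, x = [], data
    for n_hidden in hidden_sizes:
        W, a, b = train_rbm(x, n_hidden, rng, **kwargs)
        layers.append((W, a, b))
        x = sigmoid(b + x @ W)  # hidden probabilities become the next "data"
    return layers

def encode(data, layers):
    """Bottom-up pass through the stack using the learned weights."""
    x = data
    for W, _, b in layers:
        x = sigmoid(b + x @ W)
    return x

rng = np.random.default_rng(0)
data = (rng.random((50, 12)) < 0.5).astype(np.float64)
layers = pretrain_stack(data, hidden_sizes=[8, 4], rng=rng, epochs=10)
print(encode(data, layers).shape)
```

The weights and hidden biases returned by `pretrain_stack` are what would be copied into a feedforward network before supervised fine-tuning.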
Applied to the MNIST handwritten digit dataset, a three-hidden-layer DBN reached a test error of 1.25%, better than the discriminative methods it was compared against on the permutation-invariant version of the task, in which no knowledge of image geometry is used. The result was widely interpreted as evidence that deep networks could be trained successfully and helped trigger the 2006-2010 "deep learning thaw".
A Deep Boltzmann Machine (DBM), introduced by Ruslan Salakhutdinov and Geoffrey Hinton at AISTATS 2009, is a fully undirected generative model with multiple hidden layers. Unlike a DBN, every layer interacts symmetrically with its neighbours; the network is a single deep undirected graphical model rather than a stack of RBMs with a directed lower portion.
A DBM has visible units $v$ and hidden layers $h^{(1)}, h^{(2)}, \ldots, h^{(L)}$. For two hidden layers, with bias terms omitted for brevity, the energy function is
$$E(v, h^{(1)}, h^{(2)}) = -v^\top W^{(1)} h^{(1)} - h^{(1)\top} W^{(2)} h^{(2)}$$
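The even-odd structure of layers means that, conditional on the layers above and below, units within a layer remain independent, which keeps block Gibbs sampling efficient. Training combines two approximations: a mean-field variational approximation for the data-dependent expectations, and persistent Markov chains (in the style of PCD) for the model expectations.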
The combination, often initialised by greedy layer-wise RBM pre-training, makes DBMs with millions of parameters practical to train. Salakhutdinov and Hinton reported strong generative-model performance on MNIST and on the NORB three-dimensional object dataset.
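DBM training is usually described as using mean-field inference for the data-dependent statistics. For the two-hidden-layer energy above, the mean-field fixed-point equations follow from the conditionals $P(h^{(1)}_j = 1 \mid v, h^{(2)}) = \sigma\big(\sum_i W^{(1)}_{ij} v_i + \sum_k W^{(2)}_{jk} h^{(2)}_k\big)$ and $P(h^{(2)}_k = 1 \mid h^{(1)}) = \sigma\big(\sum_j W^{(2)}_{jk} h^{(1)}_j\big)$. The sketch below iterates them. The initialisation by a plain bottom-up pass and the fixed iteration count are assumptions made for simplicity.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dbm_energy(v, h1, h2, W1, W2):
    """E(v, h1, h2) = -v^T W1 h1 - h1^T W2 h2 for single state vectors (no biases)."""
    return -(v @ W1 @ h1) - (h1 @ W2 @ h2)

def dbm_mean_field(v, W1, W2, n_iters=25):
    """Approximate posterior means of h1 and h2 for a batch v of shape (batch, d)."""
    mu1 = sigmoid(v @ W1)      # simple bottom-up initialisation (assumption)
    mu2 = sigmoid(mu1 @ W2)
    for _ in range(n_iters):
        # h1 receives input from the layer below (v) and the layer above (mu2).
        mu1 = sigmoid(v @ W1 + mu2 @ W2.T)
        # h2 receives input only from h1 in a two-hidden-layer DBM.
        mu2 = sigmoid(mu1 @ W2)
    return mu1, mu2

rng = np.random.default_rng(0)
W1 = 0.5 * rng.standard_normal((6, 4))
W2 = 0.5 * rng.standard_normal((4, 3))
v = (rng.random((2, 6)) < 0.5).astype(np.float64)
mu1, mu2 = dbm_mean_field(v, W1, W2)
print(np.round(mu1, 3))
print(np.round(mu2, 3))
```

Unlike the single bottom-up pass in a DBN, the estimate for the first hidden layer here depends on the layer above it as well as on the data.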
| Variant | Year | Key idea | Notable use |
|---|---|---|---|
| Boltzmann machine | 1985 | Fully connected, binary stochastic units | Original constraint-satisfaction toy problems |
| Restricted Boltzmann Machine (RBM) | 1986 | Bipartite visible/hidden graph | Pre-training, feature learning |
| Mean-field Boltzmann machine | 1987 | Replace stochastic units with mean activations | Faster but biased inference |
| Helmholtz machine | 1995 | Directed generative + recognition net trained with wake-sleep | Precursor to modern variational models |
| Conditional RBM | 2006 | Visible units conditioned on previous frames | Motion capture and time series |
| Gaussian-Bernoulli RBM | 2006 | Real-valued visible units, binary hiddens | Image and audio data |
| Deep Belief Network (DBN) | 2006 | Stacked RBMs with directed lower layers | Pre-training of deep classifiers |
| Replicated Softmax | 2009 | Softmax visible units replicated by document length | Topic modelling, document retrieval |
| Deep Boltzmann Machine (DBM) | 2009 | Fully undirected multi-layer Boltzmann machine | Generative model for images, NORB |
| Convolutional RBM/DBN | 2009 | Weight-shared local connectivity | Scalable image modelling |
| Spike-and-slab RBM | 2011 | Binary spike + real-valued slab per hidden | Continuous data with sparse codes |
Boltzmann machines and their descendants found practical use in several domains during the late 2000s.
Handwritten digit modelling. Hinton, Osindero and Teh (2006) trained a three-hidden-layer DBN on MNIST with around 1.7 million parameters and achieved 1.25% test error, which the authors reported as better than the best discriminative methods on the permutation-invariant version of the task and which became a key empirical justification for deep learning.
Collaborative filtering. Salakhutdinov, Mnih and Hinton's 2007 ICML paper "Restricted Boltzmann Machines for Collaborative Filtering" applied an RBM with softmax visible units (one per movie rating) to the Netflix Prize dataset of 100 million user-movie ratings. RBMs slightly outperformed carefully tuned SVD baselines and a linear blend of RBM and SVD predictions was used in the winning Netflix Prize ensemble.
Topic modelling and document retrieval. Salakhutdinov and Hinton's NIPS 2009 paper "Replicated Softmax: An Undirected Topic Model" proposed an RBM-style undirected alternative to the directed Latent Dirichlet Allocation (LDA) model, in which the number of softmax visible units equals the document length. The authors reported better held-out perplexity and better document retrieval than LDA on standard text corpora.
Speech and acoustic modelling. From 2009 to roughly 2012, DBN-initialised feedforward networks were the dominant acoustic models in research-grade automatic speech recognition. Mohamed, Dahl and Hinton (2009-2012) and the IBM, Google and Microsoft speech groups used DBN pre-training to push phone-recognition error rates down on TIMIT and large-vocabulary ASR benchmarks, work that is widely credited with bringing deep learning into industrial speech recognition.
Pre-training for deep feedforward networks. From 2006 to roughly 2012, the dominant recipe for training deep nets was greedy unsupervised pre-training with a stack of RBMs (or autoencoders), followed by supervised fine-tuning with backpropagation. The pre-training was thought to find weight regions near good local minima and to act as a regulariser.
By 2012-2014, RBM- and DBN-based pre-training had largely disappeared from mainstream practice. Several developments contributed.
Better activations and initialisation made pre-training unnecessary. Glorot, Bordes and Bengio's 2011 AISTATS paper "Deep Sparse Rectifier Neural Networks" showed that simply replacing sigmoid or tanh activations with rectified linear units (ReLU) lets deep networks train well from random initialisation on large supervised tasks, eliminating the need for unsupervised pre-training. Improved initialisation schemes (Xavier and He), batch normalisation, and dropout (Srivastava et al., 2014) further closed the gap.
Big data and big GPUs. Krizhevsky, Sutskever and Hinton's 2012 ImageNet result with a purely supervised convolutional network (AlexNet) demonstrated that, given enough labelled data and GPU compute, supervised learning alone outperformed pre-trained networks. Subsequent vision and language work confirmed this.
Sampling-based training is slow and brittle. CD-1 has biased gradients and PCD-style chains can get stuck. Hyperparameter tuning (learning rates, momentum, weight decay, mini-batch size, number of Gibbs steps, type of CD) is delicate, and the resulting models are hard to evaluate because the partition function $Z$ cannot be computed exactly.
Better generative models arrived. Variational autoencoders (Kingma and Welling, 2013) and generative adversarial networks (Goodfellow et al., 2014) offered far more practical generative models. They scaled to higher-resolution images, were easier to evaluate qualitatively, and used the same gradient-based optimisation as the rest of deep learning rather than sampling-based contrastive methods.
By the time the modern wave of generative AI took off in the late 2010s, the Boltzmann machine was a historical landmark rather than a working tool.
The Boltzmann machine sits at the intersection of statistical physics and machine learning. Its energy function has the same form as the Hamiltonian of an Ising spin glass, and the probability distribution over states is a Gibbs measure. Several physical models map directly onto Boltzmann machines:
In October 2024 the Royal Swedish Academy of Sciences awarded the Nobel Prize in Physics jointly to John J. Hopfield and Geoffrey Hinton "for foundational discoveries and inventions that enable machine learning with artificial neural networks". The Academy's accompanying announcement explicitly mentions the Hopfield network and the Boltzmann machine: Hinton's contribution is described as having used "tools from statistical physics" in 1983-1985 to create a network "that can learn to recognise characteristic elements in a set of data". The award is, in effect, a recognition of the Boltzmann machine as a piece of physics.
Although the explicit Boltzmann machine has not returned to wide use, several modern lines of research are continuous-state successors of the energy-based formulation it pioneered.
Energy-based models. Yann LeCun and collaborators have advocated for a broad family of energy-based models in which a parametric energy function $E_\theta(x)$ is shaped to take low values on data and high values elsewhere. The Joint Embedding Predictive Architecture (JEPA) family formalises self-supervised learning in this framework. RBM training is the historical template for this style of learning.
Score-based and diffusion models. Denoising score-matching and the diffusion models that grew out of it (Song and Ermon, 2019; Ho et al., 2020) can be viewed as continuous-state energy-based models. Training amounts to learning the gradient of the log density (the score), which sidesteps the partition function in much the same way contrastive divergence did, and sampling is performed by running Langevin or reverse-diffusion chains analogous to Gibbs sampling in a Boltzmann machine.
Modern Hopfield networks. Ramsauer et al. (2020) showed that continuous-state Hopfield networks with the right energy function reproduce the attention mechanism of Transformers, reviving interest in the energy-based view of associative memory.
| Model | Training | Inference | Sample quality | Era of dominance | Key paper |
|---|---|---|---|---|---|
| Boltzmann machine | Contrastive Hebbian, MCMC | Slow Gibbs sampling | Limited (toy) | 1985-1990 | Ackley, Hinton, Sejnowski 1985 |
| Restricted Boltzmann Machine (RBM) | Contrastive Divergence | Block Gibbs | Modest | 2002-2012 | Smolensky 1986; Hinton 2002 |
| Deep Belief Network (DBN) | Greedy stacked RBMs + fine-tune | Layer-wise feedforward | Good for digits | 2006-2012 | Hinton, Osindero, Teh 2006 |
| Deep Boltzmann Machine (DBM) | Mean field + PCD | Mean-field iterations | Good but slow | 2009-2012 | Salakhutdinov, Hinton 2009 |
| Variational Autoencoder (VAE) | ELBO via reparameterisation | Single forward pass | Blurry but reliable | 2014-present | Kingma, Welling 2013 |
| Generative Adversarial Network (GAN) | Min-max game | Single forward pass | Sharp images | 2014-2020 | Goodfellow et al. 2014 |
| Normalising flow | Exact log-likelihood | Invertible forward | Tractable density | 2015-present | Dinh et al. 2014; Rezende, Mohamed 2015 |
| Diffusion model | Score matching, denoising | Iterative reverse process | State of the art | 2020-present | Ho, Jain, Abbeel 2020 |
| Year | Authors | Paper | Contribution |
|---|---|---|---|
| 1982 | Hopfield | Neural networks and physical systems with emergent collective computational abilities | Hopfield network, deterministic ancestor |
| 1983 | Fahlman, Hinton, Sejnowski | Massively Parallel Architectures for AI: NETL, Thistle, and Boltzmann Machines (AAAI-83) | First public mention |
| 1985 | Ackley, Hinton, Sejnowski | A Learning Algorithm for Boltzmann Machines (Cognitive Science) | Defines model and learning rule |
| 1986 | Smolensky | Information Processing in Dynamical Systems: Foundations of Harmony Theory | Introduces the harmonium / RBM |
| 2002 | Hinton | Training Products of Experts by Minimizing Contrastive Divergence (Neural Computation) | CD-k algorithm |
| 2006 | Hinton, Osindero, Teh | A Fast Learning Algorithm for Deep Belief Nets (Neural Computation) | DBN, layer-wise pre-training |
| 2007 | Salakhutdinov, Mnih, Hinton | Restricted Boltzmann Machines for Collaborative Filtering (ICML) | RBM for Netflix Prize |
| 2008 | Tieleman | Training RBMs using Approximations to the Likelihood Gradient (ICML) | Persistent CD |
| 2009 | Salakhutdinov, Hinton | Deep Boltzmann Machines (AISTATS) | DBM with mean-field + PCD |
| 2009 | Salakhutdinov, Hinton | Replicated Softmax (NIPS) | Topic model RBM |
| 2011 | Glorot, Bordes, Bengio | Deep Sparse Rectifier Neural Networks (AISTATS) | Made pre-training unnecessary |
| 2024 | Royal Swedish Academy | Nobel Prize in Physics | Awarded to Hopfield and Hinton, citing Boltzmann machines |