See also: Machine learning terms, Regularization
Dropout regularization is a technique for neural networks that prevents overfitting by randomly setting a fraction of neuron activations to zero during training. First proposed by Geoffrey Hinton and collaborators in 2012 and formalized by Srivastava et al. in 2014, dropout has become one of the most widely used regularization methods in deep learning. The core insight is that randomly deactivating neurons forces the network to learn redundant, distributed representations rather than relying on specific co-adapted features.
The idea of dropout was first introduced in a 2012 paper by Hinton, Srivastava, Krizhevsky, Sutskever, and Salakhutdinov titled "Improving neural networks by preventing co-adaptation of feature detectors." The paper demonstrated that randomly omitting half of the feature detectors on each training case significantly reduced overfitting and produced state-of-the-art results on speech and object recognition benchmarks.
The technique was then formalized and studied extensively in the 2014 paper "Dropout: A Simple Way to Prevent Neural Networks from Overfitting" by Srivastava, Hinton, Krizhevsky, Sutskever, and Salakhutdinov, published in the Journal of Machine Learning Research (JMLR). This paper provided a thorough empirical evaluation across vision, speech recognition, document classification, and computational biology tasks, alongside theoretical analysis connecting dropout to model averaging and Bayesian inference.
Hinton has noted that the biological inspiration for dropout came from the observation that sexual reproduction, which randomly combines half the genes from each parent, is more effective at producing robust organisms than asexual reproduction. In a similar way, dropout forces each neuron to be useful on its own and in combination with random subsets of other neurons.
Dropout operates differently during training and inference. Understanding both phases is essential for grasping why the technique is effective.
During each forward pass in training, dropout randomly "drops" (sets to zero) each neuron's output with probability p, called the dropout rate. Only the surviving neurons participate in the forward pass and the subsequent backpropagation step. The dropped neurons receive no gradient updates for that particular training iteration.
Mathematically, for a layer with output vector h, dropout applies an element-wise mask:
r_j ~ Bernoulli(1 - p)
h'_j = r_j * h_j
where r_j is a binary random variable that equals 1 with probability (1 - p) and 0 with probability p. The vector h' replaces h as input to the next layer. Because the mask is resampled for every training example (or mini-batch), the network effectively trains a different "thinned" sub-network on each step.
At test time, all neurons are active (no dropout is applied). To compensate for the fact that more neurons are active during inference than during any single training step, the outputs must be scaled. In the original formulation, each neuron's output is multiplied by (1 - p) at test time so that the expected output matches what was seen during training.
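As a concrete illustration of this original formulation, here is a minimal NumPy sketch that applies the Bernoulli mask during training and scales activations by (1 - p) at test time. The function names are illustrative, not from the original paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_train(h, p):
    # r_j ~ Bernoulli(1 - p): keep each activation with probability 1 - p
    mask = rng.binomial(1, 1.0 - p, size=h.shape)
    return h * mask              # h'_j = r_j * h_j

def dropout_test(h, p):
    # scale by (1 - p) so the expected output matches what was seen in training
    return h * (1.0 - p)

h = np.array([1.0, 2.0, 3.0, 4.0])
print(dropout_train(h, p=0.5))   # random, e.g. [0. 2. 3. 0.]
print(dropout_test(h, p=0.5))    # [0.5 1.  1.5 2. ]
```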
Modern deep learning frameworks use inverted dropout, which handles the scaling during training rather than at inference. During training, surviving activations are multiplied by 1 / (1 - p), so that the expected value of the output remains unchanged regardless of dropout. At test time, the network is used as-is with no modifications. This approach is preferred because it simplifies the inference code and avoids any test-time overhead. Both PyTorch and TensorFlow/Keras implement inverted dropout by default.
The inverted dropout formulation is:
r_j ~ Bernoulli(1 - p)
h'_j = (r_j * h_j) / (1 - p)
During inference, h'_j = h_j (no change needed).
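A minimal NumPy sketch of inverted dropout, mirroring the formulation above (names are again illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def inverted_dropout(h, p, training=True):
    if not training:
        return h                                # inference: identity, no scaling needed
    mask = rng.binomial(1, 1.0 - p, size=h.shape)
    return (h * mask) / (1.0 - p)               # h'_j = (r_j * h_j) / (1 - p)

h = np.array([1.0, 2.0, 3.0, 4.0])
print(inverted_dropout(h, p=0.5))                   # survivors scaled up, e.g. [0. 4. 6. 0.]
print(inverted_dropout(h, p=0.5, training=False))   # [1. 2. 3. 4.]
```

Because the surviving activations are scaled up by 1 / (1 - p) during training, the expected value of each output matches the undropped activation, and inference requires no change.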
The dropout rate p is a hyperparameter that controls how aggressively neurons are deactivated. The optimal value depends on the network architecture, dataset size, and the degree of overfitting observed.
| Layer type | Typical dropout rate (p) | Notes |
|---|---|---|
| Input layer | 0.0 to 0.2 | Dropping too many input features can lose critical information |
| Hidden layers (fully connected) | 0.3 to 0.5 | 0.5 is the most common default; higher values for large networks |
| Convolutional neural network layers | 0.1 to 0.3 | Weight sharing gives convolutional layers far fewer parameters, and activations are spatially correlated, so less dropout is needed |
| Recurrent neural network layers | 0.2 to 0.5 | Applied carefully to avoid disrupting temporal dependencies |
| Output layer | 0.0 | Dropout is almost never applied to the output layer |
Srivastava et al. (2014) found that a retention probability of 0.5 for hidden units and 0.8 for input units worked well across a wide range of tasks. A useful guideline is that if the network is underfitting, the dropout rate should be lowered (or dropout removed entirely), and if overfitting is severe, the rate can be increased or the network can be made wider to compensate.
Several complementary explanations have been proposed for why dropout is effective.
A neural network with n units that can be dropped has 2^n possible thinned sub-networks. Training with dropout can be viewed as training an exponential number of weight-sharing sub-networks simultaneously. At test time, using the full network with scaled weights approximates the geometric mean of the predictions from all these sub-networks. This is analogous to ensemble methods like bagging, where multiple models are trained on different subsets of data and their predictions are averaged. The key difference is that dropout achieves this effect within a single network without the computational cost of training separate models.
Without dropout, neurons can develop complex co-adaptations where a neuron becomes useful only in the presence of specific other neurons. This makes the network brittle because it depends on particular feature combinations. Dropout breaks these co-adaptations by ensuring that a neuron cannot rely on any specific set of partner neurons being present. Each neuron is forced to learn features that are individually useful across many random contexts, producing more robust and transferable representations.
Dropout adds noise to the training process, which acts as a form of regularization similar to adding noise to the weights or the gradients. Wager, Wang, and Liang (2013) showed that for generalized linear models, the regularization effect of dropout is approximately equivalent to an adaptive form of L2 regularization where the penalty depends on the Fisher information of each feature. This provides a theoretical link between dropout and classical regularization techniques.
Gal and Ghahramani (2016) established a formal connection between dropout and approximate Bayesian inference. They showed that a neural network with dropout applied before every weight layer is mathematically equivalent to an approximation of a deep Gaussian process. This interpretation opened the door to using dropout for uncertainty estimation (see the Monte Carlo dropout section below).
Since the introduction of standard dropout, researchers have developed several variants tailored to different architectures and use cases.
Standard element-wise dropout is not ideal for convolutional neural network layers because adjacent pixels in a feature map are highly correlated. Dropping individual pixels has little effect since the information can be recovered from neighboring activations. Spatial dropout (Tompson et al., 2015) addresses this by dropping entire feature maps (channels) rather than individual elements. If a feature map is dropped, all spatial locations within that map are set to zero simultaneously. This forces the network to learn to use information from multiple feature maps rather than relying on any single one.
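The difference from element-wise dropout can be sketched with a channel-wise mask that is broadcast across the spatial dimensions. This is a hypothetical NumPy illustration of the idea, not the implementation from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def spatial_dropout(x, p):
    """x has shape (batch, channels, height, width); drop whole channels at once."""
    # one keep/drop decision per (example, channel), broadcast over height and width
    keep = rng.binomial(1, 1.0 - p, size=(x.shape[0], x.shape[1], 1, 1))
    return x * keep / (1.0 - p)   # inverted scaling, as in standard dropout

x = np.ones((2, 4, 8, 8))
out = spatial_dropout(x, p=0.25)
# Each (example, channel) slice is now either all zeros or uniformly scaled by 1/0.75.
```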
DropConnect (Wan et al., 2013) generalizes dropout by randomly setting individual weights (connections) to zero rather than setting entire neuron activations to zero. While dropout masks the output of neurons, DropConnect masks the weight matrix itself. This means each neuron receives input from a random subset of the neurons in the previous layer. DropConnect is more fine-grained than dropout and was shown to improve performance on several image classification benchmarks, though it is computationally more expensive and less commonly used in practice.
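A minimal sketch of the idea, masking the weight matrix of a single linear layer rather than its outputs (illustrative names, not the authors' code; the original paper uses a moment-matching approximation at inference time, whereas simple weight scaling as shown here is a common simplification):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropconnect_linear(x, W, b, p, training=True):
    """Linear layer y = xW + b with each weight dropped independently with probability p."""
    if training:
        mask = rng.binomial(1, 1.0 - p, size=W.shape)
        W = (W * mask) / (1.0 - p)   # inverted scaling keeps the expected pre-activation unchanged
    return x @ W + b

x = rng.normal(size=(32, 100))   # batch of 32 examples, 100 input features
W = rng.normal(size=(100, 50))
b = np.zeros(50)
y = dropconnect_linear(x, W, b, p=0.5)
```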
DropBlock (Ghiasi, Lin, and Le, 2018) extends spatial dropout specifically for convolutional networks by dropping contiguous rectangular regions of a feature map. Rather than dropping individual elements or entire channels, DropBlock removes a block of spatially correlated units. This is more effective for convolutional layers because it forces the network to look at a wider spatial context rather than relying on localized features. DropBlock was shown to outperform both standard dropout and spatial dropout on ImageNet and COCO benchmarks.
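A simplified PyTorch sketch of the block-dropping idea follows. It omits details from the paper such as the scheduled drop rate, assumes an odd block size so the mask keeps its shape, and the helper name is hypothetical:

```python
import torch
import torch.nn.functional as F

def dropblock(x, block_size=5, drop_prob=0.1):
    """x: feature maps of shape (N, C, H, W); drop square blocks of side block_size."""
    if drop_prob == 0.0:
        return x
    _, _, h, w = x.shape
    # gamma: probability that a unit becomes a block centre, chosen (following the
    # paper's approximation) so the expected fraction of dropped units is ~drop_prob
    gamma = (drop_prob / block_size**2) * (h * w) / ((h - block_size + 1) * (w - block_size + 1))
    centres = (torch.rand_like(x) < gamma).float()
    # grow each centre into a block_size x block_size square of dropped units
    block_mask = F.max_pool2d(centres, kernel_size=block_size, stride=1, padding=block_size // 2)
    keep = 1.0 - block_mask
    # rescale so the expected total activation is preserved
    return x * keep * keep.numel() / keep.sum()
```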
Kingma, Salimans, and Welling (2015) proposed variational dropout, which frames dropout as a form of variational inference. Unlike standard dropout where the same dropout rate is applied uniformly, variational dropout can learn individual dropout rates for each weight or unit. Gal and Ghahramani (2016) extended this idea by showing that applying the same dropout mask at each time step of a recurrent neural network (rather than resampling at every step) corresponds to a form of approximate variational inference. This approach, sometimes called "locked" or "tied" dropout, prevents the recurrent connections from losing long-term memory due to accumulating dropout noise over time.
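The "locked" behaviour can be sketched as sampling one mask per sequence and reusing it at every time step. The module below is a hypothetical PyTorch illustration, not the authors' implementation:

```python
import torch
import torch.nn as nn

class LockedDropout(nn.Module):
    """Applies the same dropout mask to every time step of a (seq_len, batch, features) tensor."""
    def __init__(self, p=0.5):
        super().__init__()
        self.p = p

    def forward(self, x):
        if not self.training or self.p == 0.0:
            return x
        # one mask per (batch, feature), shared across the time dimension
        mask = x.new_empty(1, x.size(1), x.size(2)).bernoulli_(1 - self.p) / (1 - self.p)
        return x * mask
```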
Alpha dropout (Klambauer et al., 2017) was designed specifically for self-normalizing neural networks that use the Scaled Exponential Linear Unit (SELU) activation function. Standard dropout disrupts the mean and variance properties that SELU depends on for self-normalization. Alpha dropout instead sets dropped activations to the negative saturation value of SELU (rather than zero) and applies an affine transformation to maintain the desired mean and variance. This preserves the self-normalizing property throughout the network.
Concrete dropout (Gal, Hron, and Kendall, 2017) treats the dropout rate as a learnable parameter that is optimized jointly with the network weights during training. It uses a continuous relaxation of the discrete Bernoulli distribution (the "concrete" distribution) to allow gradient-based optimization of the dropout probability. This eliminates the need to manually tune dropout rates, which is particularly useful in Bayesian neural network settings where the dropout rate has a principled interpretation as a prior parameter.
| Variant | What is dropped | Key advantage | Best suited for |
|---|---|---|---|
| Standard dropout | Individual activations | Simple, general-purpose | Fully connected layers |
| Spatial dropout | Entire feature maps (channels) | Respects spatial correlations | Convolutional layers |
| DropConnect | Individual weights | More fine-grained noise | Fully connected layers |
| DropBlock | Contiguous spatial regions | Forces wider spatial context | Convolutional layers |
| Variational dropout | Activations (with learned rates) | Consistent mask across time steps | Recurrent layers |
| Alpha dropout | Activations (to SELU saturation value) | Preserves self-normalization | SELU-based networks |
| Concrete dropout | Activations (learned rate) | No manual tuning of dropout rate | Bayesian settings |
The transformer architecture applies dropout in several places. The original "Attention Is All You Need" paper (Vaswani et al., 2017) applied dropout to the output of each sub-layer before it is added to the residual connection and normalized (residual dropout), and to the sums of the token embeddings and positional encodings, using a rate of 0.1 for the base model. Typical dropout rates in transformers range from 0.1 to 0.3.
Most implementations additionally use attention dropout, which randomly zeros out entries in the attention weight matrix and encourages the model to distribute attention across a broader set of positions rather than concentrating on a few tokens, and many also apply dropout within the feed-forward sub-layers.
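A condensed sketch of where these dropouts sit in a single self-attention sub-layer is shown below. The class name is hypothetical and the layout is simplified; real implementations such as PyTorch's nn.TransformerEncoderLayer handle this internally:

```python
import torch
import torch.nn as nn

class AttentionSubLayer(nn.Module):
    """Self-attention sub-layer with attention dropout and residual dropout."""
    def __init__(self, d_model=512, n_heads=8, p=0.1):
        super().__init__()
        # the dropout argument of MultiheadAttention is the attention dropout
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=p)
        self.residual_dropout = nn.Dropout(p)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)
        # residual dropout: applied to the sub-layer output before the residual addition
        return self.norm(x + self.residual_dropout(attn_out))
```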
Large-scale language models such as GPT-2 and BERT use dropout as one of their primary regularization mechanisms. Interestingly, some very large models (GPT-3, for example) have been trained without dropout, relying instead on the sheer volume of training data to prevent overfitting. Recent work by Liu et al. (2023) showed that dropout can also reduce underfitting in certain training regimes when applied with a scheduled rate.
One of the most influential extensions of dropout is Monte Carlo (MC) dropout, proposed by Gal and Ghahramani (2016). The key insight is that keeping dropout active at inference time and running multiple forward passes with different random masks produces a distribution of predictions rather than a single point estimate.
The procedure is:
1. Train the network with dropout as usual.
2. At test time, keep the dropout layers active instead of disabling them.
3. Run T forward passes on the same input, each with a freshly sampled dropout mask.
4. Use the mean of the T predictions as the final prediction and their variance (or another measure of spread) as an uncertainty estimate.
Gal and Ghahramani proved that this procedure is mathematically equivalent to approximate inference in a deep Gaussian process, giving it a principled Bayesian interpretation. MC dropout is attractive in practice because it requires no changes to the model architecture or training procedure; the only cost is running multiple forward passes at test time. It has been applied in safety-critical domains such as medical imaging, autonomous driving, and robotics, where knowing how confident a model is can be as important as the prediction itself.
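A minimal PyTorch sketch of MC dropout for an already trained model (function names are illustrative):

```python
import torch
import torch.nn as nn

def enable_mc_dropout(model):
    """Put only the dropout layers into training mode so BN statistics stay fixed."""
    model.eval()
    for m in model.modules():
        if isinstance(m, (nn.Dropout, nn.Dropout2d, nn.Dropout3d)):
            m.train()

def mc_dropout_predict(model, x, n_samples=50):
    enable_mc_dropout(model)
    with torch.no_grad():
        # n_samples stochastic forward passes, each with a different dropout mask
        preds = torch.stack([model(x) for _ in range(n_samples)])
    # predictive mean and a simple per-output uncertainty estimate
    return preds.mean(dim=0), preds.std(dim=0)
```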
Batch normalization (BN) and dropout are both widely used regularization techniques, but combining them can be problematic. Li et al. (2019) published a paper titled "Understanding the Disharmony between Dropout and Batch Normalization by Variance Shift" that analyzed why the two techniques sometimes interfere.
The core issue is that dropout introduces a variance shift between training and inference. During training, the random zeroing of activations changes the distribution of inputs to the batch normalization layer. BN computes running statistics (mean and variance) during training that it uses at test time, but these statistics are computed on dropout-perturbed activations. At inference time, dropout is turned off, so the actual activation distribution differs from what BN expects. This mismatch can degrade performance.
Several strategies have been proposed to address this:
- Apply dropout only after the last batch normalization layer, typically just before the classifier, so that BN statistics are never computed on dropout-perturbed activations (sketched below).
- Lower the dropout rate, which shrinks the variance shift.
- Replace Bernoulli dropout with a multiplicative-noise variant that is less sensitive to the variance shift; Li et al. propose a uniform-noise form they call "Uout".
- Use only one of the two techniques, since batch normalization already has a regularizing effect of its own.
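For instance, the dropout-after-BN placement can look like the following hypothetical layout (not code from the paper):

```python
import torch.nn as nn

# All batch-normalized blocks come first; the single dropout layer sits after the
# last BN layer, just before the classifier, so BN running statistics are never
# computed on dropout-perturbed activations.
model = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1),
    nn.BatchNorm2d(64),
    nn.ReLU(),
    nn.Conv2d(64, 128, kernel_size=3, padding=1),
    nn.BatchNorm2d(128),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Dropout(p=0.5),
    nn.Linear(128, 10),
)
```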
Dropout is not universally beneficial and can hurt performance in certain situations:
- When the network is underfitting rather than overfitting, dropout removes capacity the model needs and should be reduced or removed.
- When the training set is very large relative to the model, overfitting is minimal and dropout mainly slows convergence without improving generalization.
- In convolutional networks that already rely on batch normalization and heavy data augmentation, adding dropout often provides little or no benefit and can interact badly with BN (see above).
- Because it injects noise into the gradients, dropout typically increases the number of epochs needed to converge.
- In very small networks, the thinned sub-networks may be too weak to learn the task unless the network is made wider to compensate.
Dropout is straightforward to implement in modern deep learning frameworks.
PyTorch provides several dropout modules:
```python
import torch.nn as nn

# Standard dropout (for fully connected layers)
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Dropout(p=0.5),   # drops 50% of activations
    nn.Linear(256, 128),
    nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(128, 10)
)

# Dropout2d (for convolutional layers - drops entire channels)
conv_block = nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=3, padding=1),
    nn.BatchNorm2d(128),
    nn.ReLU(),
    nn.Dropout2d(p=0.2)  # drops 20% of feature maps
)
```
PyTorch automatically disables dropout when model.eval() is called and re-enables it with model.train(). Additional variants include nn.Dropout1d for temporal data and nn.Dropout3d for volumetric data.
TensorFlow/Keras provides corresponding layers:
```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation='relu', input_shape=(784,)),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(10, activation='softmax')
])

# For convolutional layers (spatial dropout)
conv_model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(64, (3, 3), activation='relu'),
    tf.keras.layers.SpatialDropout2D(0.2),
    tf.keras.layers.Conv2D(128, (3, 3), activation='relu'),
    tf.keras.layers.SpatialDropout2D(0.2)
])

# Alpha dropout for SELU networks
selu_model = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation='selu'),
    tf.keras.layers.AlphaDropout(0.1)
])
```
Keras automatically handles the training/inference distinction: dropout is active during model.fit() and model.train_on_batch() but inactive during model.predict() and model.evaluate().
Imagine you are on a team of kids trying to solve a jigsaw puzzle. Every time you practice, some of your teammates have to sit out at random. Because you never know who will be missing, every kid on the team learns to be helpful on their own, not just when their best friend is there. When the real contest comes and everyone shows up, the whole team is much stronger because every member knows how to contribute.
Dropout works the same way in a neural network. During practice (training), random neurons are told to sit out. This forces the remaining neurons to learn useful features on their own. When the network takes a real test (inference), all neurons participate, and the network performs better because no single neuron is a weak link that the others depend on too much.