Dropout is a regularization technique for neural networks that randomly sets a fraction of neuron activations to zero during training. Proposed by Geoffrey Hinton and colleagues in 2012 [1] and described in full detail by Nitish Srivastava, Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov in a 2014 JMLR paper [2], dropout became one of the most important tools for preventing overfitting in deep learning. The idea is simple: at each training step, every neuron in a given layer has a probability p of being temporarily "dropped out" (its output set to zero), forcing the network to learn redundant representations that do not depend on any single neuron.
Dropout was a breakthrough when it was introduced. Before dropout, training deep neural networks on small to medium datasets almost always led to severe overfitting, where the model would memorize the training data but fail to generalize to new inputs. Weight decay, early stopping, and data augmentation helped somewhat, but dropout provided a qualitatively different kind of regularization that dramatically improved generalization across many tasks: image classification, speech recognition, document classification, and computational biology [2].
The idea behind dropout grew out of work in Hinton's lab at the University of Toronto in 2011 and 2012. Hinton has often described the inspiration coming from analogies to biological evolution and to bank fraud prevention, where rotating bank tellers reduces opportunities for collusion. Translated to neural networks, randomly removing units during training prevents groups of neurons from forming brittle conspiracies that fail when the input distribution shifts.
The technique was first publicly presented in a 2012 arXiv preprint by Hinton, Srivastava, Krizhevsky, Sutskever, and Salakhutdinov titled "Improving neural networks by preventing co-adaptation of feature detectors" [1]. That paper showed substantial gains on TIMIT phone recognition, CIFAR-10, MNIST, and Reuters document classification, and it was discussed at the NIPS 2012 deep learning workshop. The same year, Krizhevsky, Sutskever, and Hinton used dropout in AlexNet to win the ILSVRC ImageNet competition by a wide margin [11], which made dropout famous overnight in the computer vision community.
The full theoretical and empirical treatment was published in 2014 in the Journal of Machine Learning Research as "Dropout: A Simple Way to Prevent Neural Networks from Overfitting" [2]. That paper has since accumulated over 50,000 citations and is one of the most cited works in machine learning. It established dropout as a default ingredient of deep model training for nearly a decade, until very large data and architectural changes began to reduce its perceived necessity.
The core mechanism of dropout is straightforward. During each forward pass in training, each neuron in the dropout layer is independently zeroed out with probability p (typically called the dropout rate). The remaining active neurons have their outputs scaled by 1/(1-p) to maintain the expected sum of activations. During inference (test time), all neurons are active and no scaling is needed.
Consider a layer with output activations h = [h_1, h_2, ..., h_n]. During training with dropout rate p:

- sample an independent mask value m_i for each unit, equal to 0 with probability p and 1 with probability 1 - p;
- multiply each activation by its mask, so each unit's output is zeroed with probability p;
- with inverted dropout, additionally scale the surviving activations by 1/(1-p) so the expected activation is unchanged.
In equation form, the forward pass for a layer becomes y = f(W (x ⊙ m)) where m is the binary Bernoulli mask, ⊙ is element-wise multiplication, and f is the activation function. The mask is resampled at every training step, so a different random subset of neurons is active each time. This means the network effectively trains a different "thinned" sub-network on each mini-batch.
Without any rescaling, the expected value of a neuron's output drops by a factor of (1 - p) when dropout is active. If we want the expected output to match what a network without dropout would produce, we have two choices: scale at test time or scale at training time. Inverted dropout multiplies the surviving activations by 1/(1-p) so that E[h_out] = h, which keeps the inference graph identical to a normal forward pass. The original 2014 paper described the opposite convention (scale weights by (1 - p) at test time), but inverted dropout is now universal in PyTorch, TensorFlow, JAX, Keras, and other modern frameworks.
At test time, all neurons are active. If using inverted dropout, no modification is needed because the scaling was already applied during training. If using the original formulation, the weights coming out of a dropout layer must be multiplied by (1 - p) at test time to compensate for the fact that more neurons are active than during any single training step.
| Approach | During training | During inference |
|---|---|---|
| Standard dropout | Zero out neurons with probability p, no scaling | Multiply outgoing weights by (1 - p) |
| Inverted dropout | Zero out neurons with probability p, scale survivors by 1/(1-p) | No modification needed |
Inverted dropout is preferred in practice because it keeps the test-time forward pass identical to a normal forward pass, which simplifies deployment, model export, and inference optimizations like graph fusion or quantization. All major deep learning frameworks (PyTorch, TensorFlow, JAX) implement inverted dropout.
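To make the convention concrete, here is a minimal sketch of inverted dropout as a standalone function. It assumes a PyTorch tensor of activations; the function name is illustrative, not a framework API.

```python
import torch

def inverted_dropout(h: torch.Tensor, p: float = 0.5, training: bool = True) -> torch.Tensor:
    """Minimal inverted dropout on a tensor of activations h."""
    if not training or p == 0.0:
        return h  # inference: identity, no rescaling needed
    # Each unit survives independently with probability 1 - p.
    mask = (torch.rand_like(h) > p).float()
    # Scale survivors by 1/(1 - p) so the expected output matches h.
    return h * mask / (1.0 - p)
```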
Dropout's effectiveness has been explained through several complementary perspectives. None of them excludes the others; most authors treat them as different views of the same phenomenon.
The original motivation from Hinton et al. [1] was to prevent "co-adaptation" of feature detectors. In a standard neural network, neurons can develop complex co-dependencies where one neuron's useful representation depends on the specific outputs of several other neurons. If those partner neurons are absent at test time (because the input distribution has shifted slightly), the co-adapted feature becomes unreliable. By randomly removing neurons during training, dropout forces each neuron to learn features that are useful in combination with many different random subsets of other neurons, producing more robust individual features.
A network with n neurons that can be dropped has 2^n possible thinned sub-networks (each corresponding to a different dropout mask). Training with dropout can be viewed as training an exponentially large ensemble of these sub-networks simultaneously, with shared weights. At test time, using the full network with scaled weights approximates the geometric mean of the predictions of all these sub-networks [2]. Ensembles are well known to reduce variance and improve generalization, and dropout provides a computationally cheap way to approximate ensemble averaging without ever instantiating the full ensemble.
For a linear model with squared loss, this approximation is exact: the deterministic test-time prediction equals the expectation over masks. For nonlinear networks the test-time output is only an approximation of the true ensemble mean, but it works remarkably well in practice.
Dropout injects multiplicative noise into the hidden representations during training. This noise acts as a regularizer by preventing the network from relying too heavily on any particular activation pattern. For squared-error linear regression, adding multiplicative noise with mean 1 is equivalent to adding an L2 penalty whose strength depends on p, the input variance, and the weight magnitudes. This connection makes dropout closely related to weight decay and helps explain why combining the two can sometimes overshoot the optimal regularization strength.
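To make the connection concrete, consider a single linear prediction ŷ = wᵀ(m ⊙ x) with an inverted-dropout mask on the inputs. Taking the expectation of the squared error over the mask (a standard calculation, using independence of the m_i and Var(m_i) = p/(1-p)) gives:

```latex
\mathbb{E}_m\!\left[\big(y - w^\top (m \odot x)\big)^2\right]
  = \big(y - w^\top x\big)^2 + \frac{p}{1-p}\sum_i w_i^2 x_i^2,
\qquad m_i = \frac{b_i}{1-p},\; b_i \sim \mathrm{Bernoulli}(1-p).
```

The second term is a data-dependent L2 penalty on the weights whose strength grows with p, which is the equivalence described above.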
Srivastava et al. observed that networks trained with dropout tend to develop activations that are more sparse and less correlated than networks trained without dropout [2]. Each unit is forced to be useful on its own, which encourages it to detect a recognizable feature rather than fragments of features that only become meaningful in combination with specific neighbors. The result is reminiscent of biological neural systems, where neurons tend to fire sparsely.
Yarin Gal and Zoubin Ghahramani showed that training with dropout is mathematically equivalent to approximate variational inference in a deep Gaussian process [3]. Under this interpretation, the dropout mask samples correspond to samples from the approximate posterior distribution over the network's weights. This connection provides a theoretical foundation for using dropout not only as a regularizer but also as a tool for uncertainty estimation (see Monte Carlo dropout below) and ties dropout to the broader literature on Bayesian neural networks.
The dropout rate p (the probability of zeroing out a neuron) is the main hyperparameter for dropout. The optimal rate depends on the layer type, the network architecture, and the dataset size.
| Layer type | Recommended dropout rate | Rationale |
|---|---|---|
| Input layer | 0.0 to 0.2 | Dropping too many input features discards information; light dropout can help |
| Hidden layers (fully connected) | 0.5 | Original recommendation from Hinton et al.; balances regularization and capacity |
| Convolutional layers | 0.1 to 0.3 | Spatial redundancy in conv layers means less dropout is needed |
| Recurrent layers | 0.2 to 0.5 | Applied to non-recurrent connections; recurrent connections may use variational dropout |
| Transformer attention and FFN | 0.1 (typical) | Lower rates work well with the residual structure |
| Output layer | 0.0 | Dropping output neurons distorts the loss signal |
The rate of 0.5 for hidden layers was the original recommendation from the 2012 and 2014 papers and has a nice property: it maximizes the variance of a Bernoulli random variable, which maximizes the noise injected per neuron. In practice, modern architectures often use lower rates (0.1 to 0.3) because other forms of regularization are also applied.
Larger networks typically benefit from higher dropout rates because they have more capacity to overfit. Smaller networks may not tolerate high dropout rates because too much capacity is being removed at each training step. When the training set is very large relative to the model capacity, dropout becomes less necessary because overfitting is less of a concern.
| Model | Year | Dropout setting |
|---|---|---|
| AlexNet [11] | 2012 | 0.5 in fully connected layers |
| VGG-16 | 2014 | 0.5 in fully connected layers |
| Original Transformer [9] | 2017 | 0.1 attention, 0.1 residual, 0.1 embedding |
| BERT-Base | 2018 | 0.1 across attention and FFN |
| GPT-2 | 2019 | 0.1 |
| GPT-3 [10] | 2020 | 0.0 to 0.1 depending on model scale |
| PaLM | 2022 | 0.0 |
| Llama, Llama 2 | 2023 | 0.0 (no dropout in pretraining) [12] |
| ViT-B/16 | 2020 | 0.0 (default in JAX/Flax reference impl) |
| DeiT-B | 2020 | Stochastic depth, no standard dropout in attention |
| ConvNeXt | 2022 | Stochastic depth, no standard dropout |
The pattern is clear: as scale grew and training data exceeded model capacity, dropout rates dropped, often to zero. For smaller architectures or fine-tuning on limited downstream data, the older 0.1 to 0.5 ranges still apply.
Dropout is not always helpful. It can hurt accuracy or interact badly with other techniques in several situations discussed below: models that are already underfitting, convolutional layers where spatial redundancy makes unit-level dropout ineffective, networks that rely on batch normalization, and very large-data regimes where overfitting is not a concern.
A useful diagnostic: if training loss and validation loss are close, dropout is likely unnecessary or even harmful. Dropout earns its keep when training loss is much lower than validation loss, which is the classic overfitting signal.
Since the original dropout paper, researchers have developed many specialized variants for different architectures and settings. Several have become more common in modern practice than the original.
DropConnect (Wan et al., 2013) is a generalization of dropout that randomly zeros out individual weights rather than neuron activations [4]. While dropout sets entire neuron outputs to zero (dropping all outgoing connections from a neuron), DropConnect sets individual connections to zero. This provides a finer-grained form of regularization. DropConnect can be more effective than dropout in some settings, but it is also more computationally expensive because the dropout mask is applied to the weight matrix, which is much larger than the activation vector.
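A minimal sketch of the idea for a single linear layer follows. Note that this version rescales the surviving weights in the inverted-dropout style for simplicity; the original DropConnect paper instead uses a moment-matching approximation at inference time.

```python
import torch
import torch.nn.functional as F

def dropconnect_linear(x, weight, bias=None, p=0.5, training=True):
    """DropConnect sketch: mask individual weights instead of whole activations."""
    if not training or p == 0.0:
        return F.linear(x, weight, bias)
    # One Bernoulli draw per weight entry, so each connection is dropped independently.
    mask = (torch.rand_like(weight) > p).float()
    return F.linear(x, weight * mask / (1.0 - p), bias)
```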
Standard dropout drops individual activations, which is appropriate for fully connected layers but less effective for convolutional layers. In a convolutional neural network, adjacent spatial positions in a feature map are highly correlated, so dropping individual pixels has little effect because the surrounding pixels carry redundant information.
Spatial dropout (Tompson et al., 2015) addresses this by dropping entire feature maps (channels) rather than individual activations [5]. If a feature map is dropped, all spatial positions in that map are set to zero. This forces the network to learn redundant representations across different channels rather than relying on any single feature map. PyTorch implements this as nn.Dropout2d and TensorFlow as SpatialDropout2D.
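A usage sketch with PyTorch's channel-wise dropout (the tensor shapes are illustrative):

```python
import torch
import torch.nn as nn

spatial_drop = nn.Dropout2d(p=0.2)   # drops whole channels, not individual pixels

x = torch.randn(8, 64, 32, 32)       # (batch, channels, height, width)
y = spatial_drop(x)                  # in training mode, roughly 20% of the 64 channels
                                     # are zeroed per sample; survivors are scaled by 1/0.8
```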
DropBlock (Ghiasi et al., 2018) extends spatial dropout by dropping contiguous rectangular regions within feature maps [6]. Rather than dropping an entire channel (which can be too aggressive) or individual pixels (which is too weak), DropBlock drops a block of size block_size by block_size. The block size is a hyperparameter that controls the granularity of the dropped region. When block_size = 1, DropBlock reduces to standard dropout; when block_size covers the entire feature map, it reduces to spatial dropout.
DropBlock was shown to be particularly effective for training object detection and semantic segmentation models, where spatial structure is important and the network needs to be regularized at the right spatial scale.
The original dropout paper noted that naive application of dropout in recurrent neural networks is harmful, because resampling the mask at every time step injects noise into the recurrent connections and disrupts long-range memory. Gal and Ghahramani (2016) introduced "variational dropout" for RNNs, which uses the same dropout mask at every time step within a sequence (per layer) [3]. This provides regularization without destroying the recurrent signal and quickly became the standard way to apply dropout in LSTM and GRU language models.
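The core trick is simply to sample the mask once per sequence rather than once per time step. A sketch under the assumption of a hand-rolled recurrent loop (the helper name is illustrative):

```python
import torch

def per_sequence_mask(batch, features, p, device):
    """One inverted-dropout mask per sequence, reused at every time step."""
    keep = (torch.rand(batch, features, device=device) > p).float()
    return keep / (1.0 - p)

# Inside a manual recurrent loop:
#   mask_x = per_sequence_mask(B, input_size, 0.3, x.device)
#   mask_h = per_sequence_mask(B, hidden_size, 0.3, x.device)
#   for t in range(T):
#       h = cell(x[:, t] * mask_x, h * mask_h)   # same masks at every step t
```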
A different and somewhat overlapping line of work, also called variational dropout, treats the dropout rate itself as a learned parameter via variational inference. Kingma, Salimans, and Welling (2015) and Molchanov, Ashukha, and Vetrov (2017) frame dropout as variational Bayes with Gaussian noise and optimize the per-weight noise level [7]. Many weights converge to a dropout rate near 1, effectively pruning them from the network. This produces extreme sparsity (often 90% or more) with minimal accuracy loss and connects dropout directly to the literature on neural network compression.
Concrete dropout (Gal, Hron, and Kendall, 2017) uses a continuous relaxation of the Bernoulli distribution (the Concrete or Gumbel-Softmax distribution) to make the dropout rate differentiable [13]. With this relaxation, the dropout probability can be optimized directly by gradient descent rather than tuned by grid search. This is particularly valuable when dropout is used for uncertainty estimation, since well-calibrated uncertainty depends on choosing the right rate. Empirically, concrete dropout learns higher dropout rates for the second and final layers of small image classifiers and lower rates for the input and middle layers, a pattern that is hard to discover by hand.
Stochastic depth, introduced by Huang, Sun, Liu, Sedra, and Weinberger in 2016, drops entire residual blocks during training rather than individual neurons [14]. For each mini-batch, each residual block is replaced by an identity function with some probability, which means the network trains short paths and uses long paths at test time. Drop probabilities are usually set to increase linearly with depth so that earlier layers are kept and later layers are dropped more often. The same idea, often called DropPath in vision transformer literature, is used in DeiT, Swin Transformer, and ConvNeXt as a primary regularizer in place of standard dropout. The popular timm library implements both DropPath and DropBlock and applies them by default to many modern image models.
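A minimal sketch of DropPath for a residual branch, similar in spirit to the timm implementation (one Bernoulli draw per sample, broadcast over the remaining dimensions):

```python
import torch

def drop_path(x: torch.Tensor, drop_prob: float = 0.1, training: bool = True) -> torch.Tensor:
    """Zero the entire residual branch for a random subset of samples."""
    if not training or drop_prob == 0.0:
        return x
    keep_prob = 1.0 - drop_prob
    shape = (x.shape[0],) + (1,) * (x.ndim - 1)      # one draw per sample
    mask = x.new_empty(shape).bernoulli_(keep_prob)
    return x * mask / keep_prob

# Inside a residual block:  out = shortcut + drop_path(branch(x), 0.1, self.training)
```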
Semeniuta, Severyn, and Barth (2016) proposed a different recurrent dropout that drops the hidden state update inside an LSTM or GRU cell rather than the hidden state itself [15]. Because the dropped components are added to the cell state, the long-term memory is preserved and only the additive update is regularized. This works well in practice on language modeling and machine translation benchmarks and complements feed-forward dropout on the input and output of the recurrent layer.
In NLP, word dropout (sometimes called embedding dropout) sets entire word embeddings to zero with some probability before they enter the network. Because the same vector is used everywhere a word appears, dropping the vector once removes that word from the entire input sequence for that forward pass. Embedding dropout is part of the AWD-LSTM language model and several modern text classification setups. It complements positional dropout in transformers, where the sum of token and positional encoding vectors is dropped before the first attention layer.
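A sketch of word-level embedding dropout in the style of AWD-LSTM: rows of the embedding matrix are zeroed, so a dropped word disappears everywhere it occurs in that forward pass. The helper below is illustrative, not a library function.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def embedding_dropout(embed: nn.Embedding, tokens: torch.Tensor, p: float = 0.1,
                      training: bool = True) -> torch.Tensor:
    """Drop whole word vectors by masking rows of the embedding weight matrix."""
    if not training or p == 0.0:
        return embed(tokens)
    vocab_size = embed.weight.size(0)
    keep = (torch.rand(vocab_size, 1, device=embed.weight.device) > p).float() / (1.0 - p)
    return F.embedding(tokens, embed.weight * keep)
```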
AlphaDropout (Klambauer, Unterthiner, Mayr, Hochreiter, 2017) is a special dropout designed for self-normalizing neural networks that use the SELU activation [16]. Standard dropout sets values to zero, which destroys the zero-mean unit-variance property that SELU maintains. AlphaDropout sets the dropped values to the negative saturation value of SELU instead and applies an affine transformation that preserves both mean and variance. This keeps the self-normalizing dynamics intact and is implemented as nn.AlphaDropout in PyTorch and tf.keras.layers.AlphaDropout in Keras.
Targeted dropout (Gomez, Zhang, Kamalakara, Madaan, Swersky, Gal, Hinton, 2018) ranks weights or units by importance (typically by magnitude) and applies dropout only to the least important elements [17]. Because dropout already encourages the network to be robust to the absence of dropped components, applying it to the candidate-for-pruning set teaches the network to perform well after those elements are pruned. Targeted dropout consistently outperforms one-shot magnitude pruning at the same sparsity level, often matching the accuracy of an unregularized dense network at half the parameter count.
| Variant | What is dropped | Where it works best | Memory cost | Key advantage |
|---|---|---|---|---|
| Dropout [2] | Individual neuron activations | Fully connected layers | Low | Simple, effective, well-understood |
| DropConnect [4] | Individual weights | Fully connected layers | Higher | Finer-grained regularization |
| Spatial dropout [5] | Entire feature map channels | Convolutional layers | Low | Handles spatial correlation |
| DropBlock [6] | Contiguous spatial regions | Convolutional layers | Low | Tunable spatial granularity |
| Variational dropout (RNN) [3] | Same mask across time steps | LSTMs and GRUs | Low | Preserves long-term memory |
| Variational sparsifying [7] | Learned per-weight noise | Any layer | Moderate | Automatic pruning, sparsification |
| Concrete dropout [13] | Activations with learned rate | Bayesian models | Low | Learnable dropout probability |
| Stochastic depth / DropPath [14] | Entire residual blocks | Deep ResNets, ViTs, ConvNeXt | Low | Trains short, tests deep |
| Recurrent dropout [15] | LSTM hidden state update | LSTMs | Low | Preserves cell memory |
| Word/embedding dropout | Whole word embeddings | NLP models | Low | Token-level regularization |
| AlphaDropout [16] | Activations replaced with SELU saturation value | SELU networks | Low | Preserves self-normalization |
| Targeted dropout [17] | Low-importance weights or units | Pruning pipelines | Low | Yields prunable networks |
Dropout sits in a family of regularizers, and in modern practice it is rarely used alone.
Batch normalization (Ioffe and Szegedy, 2015) and dropout were both introduced as techniques to improve training of deep networks, but they interact in non-obvious ways.
Li et al. (2019) showed that applying dropout before batch normalization can cause a "variance shift" problem [8]. During training, dropout changes the variance of the layer's output (because some neurons are zeroed and others are scaled). Batch normalization computes running statistics of the mean and variance during training and uses these statistics at test time. The training statistics include the variance introduced by dropout, while the test statistics do not (because dropout is disabled at test time). This mismatch between training and test variance can hurt performance.
Several solutions have been proposed:

- apply dropout only after the last batch normalization layer (for example, just before the classifier head), so the running statistics never see dropout noise;
- replace Bernoulli dropout with a more variance-stable form of multiplicative noise, as proposed by Li et al. [8];
- simply avoid combining the two and rely on batch normalization's own regularizing effect.
In practice, many modern convolutional architectures (like ResNet) use batch normalization without dropout and achieve excellent results. The regularization provided by batch normalization, combined with data augmentation, is often sufficient.
Dropout and weight decay both shrink the effective capacity of the model, but they do it differently. Weight decay penalizes the L2 norm of the weights, pulling them toward zero throughout training. Dropout injects noise into activations, which encourages each unit to be useful on its own. Used together, they can be complementary, but at high settings they tend to compete: a model with both very strong weight decay and high dropout often underfits. A common pattern is to use weight decay around 1e-4 to 1e-2 with dropout around 0.1 to 0.3 in convolutional and transformer models.
Data augmentation attacks overfitting from the input side by showing the model many transformed versions of each example. Dropout attacks it from the inside by perturbing hidden activations. They address different failure modes and combine well, which is why most successful image classification recipes use both. For language models, dropout substitutes for the role data augmentation plays in vision because text is harder to augment without changing meaning.
Label smoothing softens the one-hot training labels by mixing in a small amount of uniform probability. It penalizes overconfident predictions, which is a different lever from dropout's noise injection. Modern transformer and vision pipelines often use both at modest settings, for example dropout 0.1 with label smoothing 0.1.
Dropout's connection to ensembles is real but approximate. Combining dropout with explicit ensembling (training several independently initialized models and averaging their outputs) usually still improves accuracy, which suggests dropout does not fully capture the variance reduction of true ensembling.
The role of dropout has evolved significantly as neural network architectures have changed.
During the era of AlexNet, VGGNet, and similar architectures, dropout was essential. These networks had large fully connected layers at the end (sometimes with tens of millions of parameters), which were highly prone to overfitting. Without dropout, these models would severely overfit on datasets like ImageNet. Dropout rates of 0.5 were standard in these fully connected layers.
As architectures shifted toward global average pooling (eliminating the large fully connected layers), the role of dropout diminished in convolutional networks. ResNet, for example, does not use dropout in its standard configuration. Batch normalization and data augmentation provided sufficient regularization. When dropout was used in convolutional networks, it was typically spatial dropout or DropBlock rather than standard dropout.
Normalizer-Free Networks (NFNets) by Brock, De, Smith, and Simonyan (2021) deliberately removed batch normalization and instead leaned heavily on dropout, gradient clipping, and stochastic depth to keep large image models stable [18]. This is one of the few recent settings where dropout was explicitly turned up rather than down.
In transformer architectures, dropout is used but in a more targeted and restrained manner. The original transformer paper (Vaswani et al., 2017) applied dropout in three places [9]:

- on the attention weights produced by the softmax, before they are applied to the values;
- on the output of each sub-layer, before it is added to the residual connection and layer-normalized;
- on the sums of the token embeddings and positional encodings in both the encoder and decoder.

The base model used a rate of 0.1 in all three places.
For large language models with billions of parameters trained on trillions of tokens, dropout is often reduced or eliminated entirely. The sheer volume of training data provides implicit regularization, and the model is far from overfitting on the training set. GPT-3, for example, used dropout rates of 0.0 to 0.1 depending on model size [10]. PaLM and Chinchilla report using no dropout at all during pre-training. Llama and Llama 2 also use zero dropout during pretraining [12], a choice that is now standard for frontier-scale single-epoch pretraining.
However, dropout remains important during fine-tuning, where a large pre-trained model is adapted to a smaller downstream dataset. In this setting, overfitting is a real risk, and dropout rates of 0.1 to 0.3 are commonly applied. The default attention_dropout in many Hugging Face fine-tuning recipes is set to a small nonzero value, while the same parameter is zero in the base pretraining configuration.
Vision Transformers (ViT) usually disable hidden_dropout_prob and attention_probs_dropout_prob in their reference Flax implementation and rely on stochastic depth, RandAugment, and Mixup for regularization. DeiT (Touvron et al., 2021) made this combination popular, using DropPath with a probability that increases linearly across the depth of the network rather than standard dropout. Steiner et al. (2021) studied the trade-off between data augmentation and model regularization in detail and found that augmentation usually pays off more than dropout for ViT-scale models. ConvNeXt and modern hybrid architectures follow the same recipe: stochastic depth instead of unit dropout, plus heavy data augmentation.
One of the most influential extensions of dropout is Monte Carlo (MC) dropout, proposed by Yarin Gal and Zoubin Ghahramani in 2016 [3]. MC dropout provides a practical method for estimating uncertainty in neural network predictions without modifying the model architecture or training procedure.
The key insight is simple: keep dropout enabled at test time, run multiple forward passes with different dropout masks, and treat the variation in predictions as a measure of uncertainty.
Specifically:

- keep the dropout layers active at test time instead of switching them off;
- run T stochastic forward passes on the same input, each with a freshly sampled dropout mask;
- use the mean of the T predictions as the final prediction and the spread across passes (the variance, or the predictive entropy for classification) as the uncertainty estimate.
Gal and Ghahramani proved that this procedure is mathematically equivalent to performing approximate variational inference in a deep Gaussian process, where the dropout distribution serves as an approximate posterior over the model weights [3].
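A minimal sketch of the procedure in PyTorch, assuming the model's dropout layers have been left active (see the framework notes below for how to toggle them); the helper name is illustrative:

```python
import torch

@torch.no_grad()
def mc_dropout_predict(model, x, T=30):
    """T stochastic forward passes; dropout must still be active inside `model`."""
    preds = torch.stack([model(x) for _ in range(T)])  # (T, batch, num_outputs)
    return preds.mean(dim=0), preds.std(dim=0)         # prediction and per-output uncertainty
```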
MC dropout has found applications in domains where knowing the model's confidence is as important as the prediction itself:
| Application domain | How uncertainty is used |
|---|---|
| Medical diagnosis | Flag low-confidence predictions for human review |
| Autonomous vehicles | Increase caution when perception uncertainty is high |
| Active learning | Select the most uncertain samples for labeling |
| Bayesian optimization | Balance exploration and exploitation using uncertainty |
| Anomaly detection | High uncertainty on unusual inputs signals potential anomalies |
The main cost of MC dropout is the need for multiple forward passes at test time. Typical choices of T range from 10 to 100 passes. This increases inference latency linearly, which may be unacceptable for real-time applications. Several methods have been proposed to reduce this cost, including learned uncertainty estimates, deep ensembles, single-pass approximations, and last-layer Bayesian methods.
The quality of the uncertainty estimates depends on the dropout rate and network architecture. Gal and Ghahramani recommend tuning the dropout rate as a model hyperparameter (it corresponds to the prior length-scale in the Gaussian process interpretation) rather than treating it purely as a regularization parameter. Concrete dropout was developed in part to make this tuning automatic [13].
Dropout is built into every major deep learning framework. The basic API is similar across frameworks: instantiate a dropout layer with a rate, place it in your model, and switch the model between training and evaluation modes.
PyTorch makes dropout straightforward:
```python
import torch
import torch.nn as nn

class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 512)
        self.dropout1 = nn.Dropout(p=0.5)
        self.fc2 = nn.Linear(512, 256)
        self.dropout2 = nn.Dropout(p=0.5)
        self.fc3 = nn.Linear(256, 10)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = self.dropout1(x)
        x = torch.relu(self.fc2(x))
        x = self.dropout2(x)
        x = self.fc3(x)
        return x

model = MyModel()
model.train()  # dropout is active
model.eval()   # dropout is bypassed
```
Note that nn.Dropout in PyTorch uses inverted dropout by default and is automatically disabled when model.eval() is called. PyTorch also offers nn.Dropout1d, nn.Dropout2d, and nn.Dropout3d for spatial dropout on 1D, 2D, and 3D feature maps respectively, plus nn.AlphaDropout for SELU networks. For MC dropout, you would keep the model in training mode (model.train()) or manually enable dropout during inference by toggling each dropout layer's training flag.
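A sketch of that selective toggle, so dropout stays stochastic while batch normalization and other layers remain in evaluation mode:

```python
import torch.nn as nn

model.eval()                              # standard inference behaviour everywhere
for module in model.modules():
    if isinstance(module, nn.Dropout):    # add Dropout2d/AlphaDropout here if used
        module.train()                    # keep sampling fresh masks at test time
```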
Keras provides tf.keras.layers.Dropout, tf.keras.layers.SpatialDropout1D, tf.keras.layers.SpatialDropout2D, tf.keras.layers.SpatialDropout3D, and tf.keras.layers.AlphaDropout. The Dropout layer takes a rate argument that is the probability of dropping a unit (the same convention as PyTorch). The training argument to model(inputs, training=True) controls whether dropout is active. Keras also automatically toggles dropout based on whether you call model.fit (training) or model.predict (inference).
In JAX-based libraries, dropout is implemented as a deterministic function of an input tensor and a PRNG key. Flax provides flax.linen.Dropout(rate=0.5) and Haiku provides haiku.dropout(rng, rate, x). Because JAX is functional, the user must explicitly pass a PRNG key to generate the dropout mask, and the layer must be told whether it is in training mode through a deterministic flag (in Flax) or by skipping the call entirely at inference (in Haiku).
The common pattern is the same in all three frameworks: dropout is an in-place perturbation applied during the forward pass, scaled to preserve expectation, and disabled at inference. The framework-specific differences are about how training versus inference mode is signaled and how randomness is plumbed.
A few common pitfalls: forgetting to call model.eval() in PyTorch before inference, which leaves dropout active and produces noisy, inconsistent predictions; and, conversely, forgetting that evaluation mode disables dropout when attempting MC dropout, which silently yields identical forward passes and zero estimated uncertainty.

The Srivastava et al. 2014 paper [2] reports a series of ablations that established dropout's value across domains: adding dropout improved results on MNIST, CIFAR-10 and CIFAR-100, SVHN, ImageNet, TIMIT phone recognition, Reuters document classification, and a computational biology task.
The paper also showed that the hidden representations learned with dropout are sparser than those learned without it and that the learned weights are smaller in magnitude, which is consistent with the implicit-L2 view of dropout.
Dropout was a pivotal contribution to deep learning, and its influence shows up in several distinct ways.
Practical impact. Dropout directly enabled training deeper and larger networks on the datasets available in the early 2010s. AlexNet (Krizhevsky, Sutskever, and Hinton, 2012), which kicked off the deep learning revolution by winning the 2012 ImageNet competition, relied heavily on dropout to prevent overfitting [11]. Without dropout, AlexNet's large fully connected layers would have overfit catastrophically. Dropout remained a default ingredient in image classification, speech recognition, and language modeling pipelines for nearly a decade.
Conceptual impact. Dropout introduced the idea that noise during training can be beneficial, not just tolerable. This concept influenced many subsequent techniques: data augmentation strategies, label smoothing, stochastic depth, mixup, cutmix, and the noise mechanisms in diffusion models. The view of training as implicit ensembling also influenced model averaging methods such as stochastic weight averaging (SWA).
Theoretical impact. The connection between dropout and Bayesian inference, formalized by Gal and Ghahramani, opened a bridge between deep learning and probabilistic modeling. It enabled practical uncertainty estimation in deep networks and motivated a wave of work on Bayesian deep learning, last-layer Bayesian methods, and deep ensembles.
The Srivastava et al. 2014 JMLR paper [2] has been cited over 50,000 times. While dropout's relative importance has decreased in the era of massive datasets and large language models (where the data itself provides regularization), it remains an essential tool for smaller-scale problems, fine-tuning, and uncertainty estimation. In modern practice the spirit of dropout lives on through stochastic depth, DropPath, DropBlock, and the various noise injection schemes in self-supervised and generative models.