Candidate sampling is a family of training-time optimization techniques used in machine learning to reduce the computational cost of models that must choose among a very large number of output classes. Instead of evaluating every possible class for each training example, candidate sampling selects a small subset of "candidate" classes (typically the true target plus a handful of randomly drawn negatives) and computes the loss function over that subset only. This makes it practical to train neural networks with output vocabularies ranging from hundreds of thousands to millions of classes, a situation that arises routinely in natural language processing, recommendation systems, and large-scale classification.
The idea was popularized by its use in word embedding models such as Word2Vec and has since become a standard component of retrieval systems, language models, and contrastive representation learning frameworks.
Imagine you are a teacher grading a multiple-choice test, but each question has one million answer choices. Reading through all one million options for every single question would take forever. Candidate sampling is like a shortcut: you look at the correct answer and just a few wrong answers picked at random. If the student can reliably tell the correct answer apart from those few wrong ones, they probably know the material. By only checking a small handful of choices instead of all one million, you finish grading much faster, and the student still learns effectively. That is essentially what candidate sampling does for a neural network during training.
Many practical machine learning tasks involve predicting one correct label from an enormous set of possibilities. In language models, the model predicts the next word from a vocabulary that can contain 50,000 to several million tokens. In recommendation systems, the model selects relevant items from a catalog of millions of products, videos, or songs.
The standard approach for multi-class prediction uses the softmax function to convert raw model outputs (logits) into a probability distribution over all classes:
P(y = c | x) = exp(z_c) / sum_j exp(z_j)
where z_c is the logit for class c and the sum in the denominator runs over every class in the vocabulary. Computing this sum, along with the corresponding gradient updates for all class parameters, requires O(|V|) work per training example, where |V| is the vocabulary or class size. When |V| is in the millions, the softmax computation dominates both the forward and backward passes, making training impractically slow.
Candidate sampling techniques address this bottleneck by replacing the full summation with a much smaller one that includes only a subset of classes. The true target class is always included, and a set of K negative (or "noise") classes is drawn from some proposal distribution Q. The loss is then computed over these K + 1 classes rather than all |V| classes, reducing per-example computation from O(|V|) to O(K), where K is typically between 5 and a few thousand.
All candidate sampling methods share the same high-level procedure: for each training example, draw K negative classes from a proposal distribution Q, compute logits for the true class and the sampled negatives, evaluate the loss over those K + 1 classes, and update only the parameters of the classes involved.
Because only K + 1 classes participate in each update, the vast majority of the output weight matrix is untouched on any given step. Over many iterations, all classes receive updates, but the per-step cost is dramatically reduced.
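As a concrete illustration, the following minimal PyTorch sketch implements this procedure for a simple linear output layer. The function name, tensor shapes, and the use of a plain cross-entropy over the candidate set are illustrative assumptions rather than any particular library's implementation; corrections such as the logQ adjustment and accidental-hit handling discussed later are omitted here.

```python
import torch
import torch.nn.functional as F

def candidate_sampling_step(hidden, target, output_weights, num_negatives, noise_probs):
    """Sketch of one candidate-sampling update: score only the target plus
    K sampled negatives instead of all |V| classes.

    hidden:         [batch, dim]        final hidden representations
    target:         [batch]             true class indices
    output_weights: [num_classes, dim]  output embedding matrix
    noise_probs:    [num_classes]       proposal distribution Q
    """
    # Draw K negative classes from Q (shared across the batch for efficiency).
    # Accidental collisions with the target are ignored here for brevity.
    negatives = torch.multinomial(noise_probs, num_negatives, replacement=True)   # [K]

    # Gather only the K + 1 rows of the output matrix that participate.
    cand = torch.cat([target.unsqueeze(1), negatives.expand(target.size(0), -1)], dim=1)
    cand_weights = output_weights[cand]                                  # [batch, K+1, dim]

    # Logits over the small candidate set; column 0 is always the true class.
    logits = torch.bmm(cand_weights, hidden.unsqueeze(2)).squeeze(2)     # [batch, K+1]
    labels = torch.zeros(target.size(0), dtype=torch.long)
    return F.cross_entropy(logits, labels)
```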
Several distinct loss formulations fall under the candidate sampling umbrella. They differ in how they use the sampled negatives and in their theoretical guarantees.
Noise contrastive estimation was introduced by Michael Gutmann and Aapo Hyvarinen in 2010 as a general-purpose method for estimating the parameters of unnormalized statistical models. The key idea is to convert the problem of density estimation into a binary classification problem: given a sample, decide whether it came from the true data distribution or from a known noise distribution.
For each training example with true class t, NCE draws K noise samples from distribution Q and sets up a logistic regression problem. The model learns to assign high probability to the event "this class is the real target" and low probability to the event "this class is noise." The NCE objective for a single example is:
L_NCE = -log(sigma(s(t) - log(K * Q(t)))) - sum_{k=1}^{K} log(sigma(-s(c_k) + log(K * Q(c_k))))
where s(c) is the model's raw score (logit) for class c, sigma is the sigmoid function, and Q(c) is the probability of sampling class c as noise.
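A minimal sketch of this objective, assuming the scores s and the noise probabilities Q for the target and sampled classes have already been gathered; the function and argument names are illustrative.

```python
import torch
import torch.nn.functional as F

def nce_loss(true_scores, noise_scores, true_q, noise_q):
    """NCE objective as defined above.

    true_scores:  [batch]    model scores s(t) for the true classes
    noise_scores: [batch, K] model scores s(c_k) for the K noise classes
    true_q:       [batch]    Q(t), noise probability of each true class
    noise_q:      [batch, K] Q(c_k) for each sampled noise class
    """
    k = noise_scores.size(1)
    # Positive term: classify the true class as "data", adjusted by log(K * Q(t)).
    pos = F.logsigmoid(true_scores - torch.log(k * true_q))
    # Negative terms: classify each noise sample as "noise".
    neg = F.logsigmoid(-(noise_scores - torch.log(k * noise_q))).sum(dim=1)
    return -(pos + neg).mean()
```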
An important theoretical property of NCE is that as the number of noise samples K increases, the NCE gradient converges to the gradient of the full softmax cross-entropy loss. Gutmann and Hyvarinen showed that NCE produces a consistent estimator, meaning it recovers the true model parameters given enough data. Empirical work has shown that roughly 25 noise samples can match the performance of the full softmax in language modeling tasks.
NCE also has the unique ability to learn the normalization constant (partition function) as a model parameter, sidestepping the need to compute it explicitly. This is valuable for energy-based models and other settings where the normalizing constant is intractable.
Negative sampling was introduced by Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean in their 2013 paper "Distributed Representations of Words and Phrases and their Compositionality." It is a simplified variant of NCE that was designed specifically for learning word embeddings.
The main simplification is that negative sampling fixes the normalization constant to 1, removing the need to estimate it. For each positive word-context pair (w, c), K negative context words are drawn from a noise distribution. Mikolov et al. found that raising the unigram frequency distribution to the 3/4 power worked best in practice:
Q(w) = freq(w)^(3/4) / sum_j freq(j)^(3/4)
This exponent gives rarer words a higher sampling probability than strict unigram sampling would, which helps the model learn better representations for less frequent words.
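A small sketch of the smoothed distribution on made-up toy counts (the frequencies here are assumptions purely for illustration):

```python
import torch

# Smoothed unigram noise distribution: Q(w) proportional to freq(w)^(3/4).
counts = torch.tensor([1_000_000., 250_000., 40_000., 1_200., 300.])  # toy word frequencies
q = counts.pow(0.75)
q = q / q.sum()
# Rare words receive a larger share of negative samples than under plain
# unigram sampling: compare q with counts / counts.sum().
```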
The negative sampling objective is:
L_NEG = -log(sigma(v_w . v_c)) - sum_{k=1}^{K} log(sigma(-v_w . v_{c_k}))
where v_w and v_c are the embedding vectors of the word and context.
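The following sketch computes this objective for a batch of word-context pairs, assuming the center, context, and negative-context embeddings have already been looked up; shapes and names are illustrative.

```python
import torch
import torch.nn.functional as F

def skipgram_negative_sampling_loss(word_vecs, ctx_vecs, neg_vecs):
    """Skip-gram negative sampling objective as defined above.

    word_vecs: [batch, dim]     embeddings v_w of the center words
    ctx_vecs:  [batch, dim]     embeddings v_c of the observed context words
    neg_vecs:  [batch, K, dim]  embeddings v_{c_k} of K sampled negative contexts
    """
    pos_score = (word_vecs * ctx_vecs).sum(dim=1)                        # v_w . v_c
    neg_score = torch.bmm(neg_vecs, word_vecs.unsqueeze(2)).squeeze(2)   # [batch, K]
    loss = -F.logsigmoid(pos_score) - F.logsigmoid(-neg_score).sum(dim=1)
    return loss.mean()
```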
While negative sampling lacks the theoretical consistency guarantees of NCE (it does not converge to maximum likelihood estimation as K grows), it proved extremely effective for learning word embeddings and became the default training method for Word2Vec's Skip-gram model. Mikolov et al. reported that negative sampling with K between 5 and 20 outperformed hierarchical softmax on word analogy tasks.
Sampled softmax (also called "sampled softmax cross-entropy") retains the structure of the full softmax loss but restricts the normalization to a sampled subset of classes. Instead of summing over all |V| classes in the denominator, the sum runs over the target class plus K randomly drawn negatives.
The sampled softmax loss for a single example is:
L_SS = -z_t + log(exp(z_t) + sum_{k=1}^{K} exp(z_{c_k} - log(K * Q(c_k) / (1 - Q(t)))))
The correction term log(K * Q(c_k) / (1 - Q(t))) adjusts for the fact that classes are not sampled uniformly from the full distribution but according to Q. This bias correction is sometimes called the "logQ correction."
Sampled softmax differs from NCE in that it keeps the multi-class softmax structure rather than reducing the problem to binary classification. It always includes the target class exactly once, which reduces variance compared to purely sampled approaches. The method is implemented in TensorFlow as tf.nn.sampled_softmax_loss and is widely used for training language models and retrieval systems with large output spaces.
For a scenario with 100,000 classes and 100 negative samples, sampled softmax reduces computation from evaluating 100,000 scores per example to evaluating just 101, a roughly 1000x speedup.
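A minimal PyTorch sketch of sampled softmax for a linear output layer: it draws one shared set of negatives per batch, applies the simpler correction log(K * Q(c_k)) that also appears in the unified view later in this article, and masks accidental hits. This is an illustration under those assumptions, not TensorFlow's exact implementation.

```python
import torch
import torch.nn.functional as F

def sampled_softmax_loss(hidden, target, output_weights, noise_probs, num_sampled):
    """Sampled softmax with a logQ correction.

    hidden:         [batch, dim]         final hidden states
    target:         [batch]              true class ids
    output_weights: [num_classes, dim]   output embedding matrix
    noise_probs:    [num_classes]        proposal distribution Q
    num_sampled:    K, number of negatives shared by the whole batch
    """
    # One shared set of K negatives per mini-batch (a single matmul below).
    negatives = torch.multinomial(noise_probs, num_sampled, replacement=True)    # [K]

    true_logits = (hidden * output_weights[target]).sum(dim=1, keepdim=True)     # [batch, 1]
    sampled_logits = hidden @ output_weights[negatives].t()                      # [batch, K]

    # logQ correction: subtract log(K * Q(c_k)) from each sampled logit.
    sampled_logits = sampled_logits - torch.log(num_sampled * noise_probs[negatives])

    # Mask "accidental hits" where a sampled negative equals the true class.
    hits = negatives.unsqueeze(0) == target.unsqueeze(1)                         # [batch, K]
    sampled_logits = sampled_logits.masked_fill(hits, float("-inf"))

    logits = torch.cat([true_logits, sampled_logits], dim=1)                     # [batch, K+1]
    labels = torch.zeros(target.size(0), dtype=torch.long)
    return F.cross_entropy(logits, labels)
```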
Importance sampling uses a proposal distribution Q to approximate the expensive expectation over all vocabulary words in the softmax denominator. Rather than computing scores for every class, it draws K samples from Q and reweights each sample by the ratio of the true probability to the sampling probability:
sum_j exp(z_j) ≈ (1/K) * sum_{k=1}^{K} exp(z_{c_k}) / Q(c_k)
This provides an unbiased estimate of the partition function, but can suffer from high variance when Q is a poor match for the true distribution. Adaptive importance sampling methods address this by adjusting Q during training to better match the model's evolving output distribution, achieving speedups of roughly 100x over full softmax.
Importance sampling for neural language models was explored by Bengio and Senecal (2003, 2008) and later applied to neural machine translation by Jean et al. (2015), who used it to train models with vocabularies of 500,000 words.
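A minimal sketch of this Monte Carlo estimator; the toy comparison at the end is only feasible because the example vocabulary is tiny, and all names and sizes are illustrative assumptions.

```python
import torch

def estimate_partition_function(logits, noise_probs, num_samples):
    """Importance-sampling estimate of sum_j exp(z_j).

    logits:      [num_classes] full logit vector z (only sampled entries are read)
    noise_probs: [num_classes] proposal distribution Q
    num_samples: K, number of Monte Carlo samples
    """
    idx = torch.multinomial(noise_probs, num_samples, replacement=True)   # c_k ~ Q
    weights = torch.exp(logits[idx]) / noise_probs[idx]                   # exp(z_{c_k}) / Q(c_k)
    return weights.mean()                                                 # (1/K) * sum_k ...

# Toy check against the exact value (possible only because |V| is tiny here).
logits = torch.randn(1000)
q = torch.full((1000,), 1.0 / 1000)
print(estimate_partition_function(logits, q, 200), torch.exp(logits).sum())
```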
Although not a sampling method in the strict sense, hierarchical softmax is often discussed alongside candidate sampling because it addresses the same computational bottleneck. Proposed by Morin and Bengio in 2005, hierarchical softmax replaces the flat softmax layer with a binary tree structure in which words sit at the leaves. Computing the probability of a word requires traversing a path from the root to the corresponding leaf, which takes O(log |V|) time instead of O(|V|).
Hierarchical softmax can yield speedups of 50x or more during training but requires careful construction of the tree. The original paper used WordNet to build the hierarchy. Later work by Mikolov et al. used Huffman coding, assigning shorter paths to more frequent words.
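A compact sketch of the path computation, assuming the tree paths (internal node ids and branch signs) have been precomputed, for example from a Huffman code; the names and shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def hierarchical_softmax_log_prob(hidden, node_vectors, path_nodes, path_signs):
    """log P(word | hidden) as a product of sigmoid branch decisions along the
    word's root-to-leaf path; only O(log |V|) nodes are touched.

    hidden:       [dim]         hidden state h
    node_vectors: [nodes, dim]  parameters of the internal tree nodes
    path_nodes:   [depth]       ids of the internal nodes on the word's path
    path_signs:   [depth]       +1.0 / -1.0 for the branch taken at each node
    """
    scores = node_vectors[path_nodes] @ hidden          # [depth]
    return F.logsigmoid(path_signs * scores).sum()
```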
Adaptive softmax, proposed by Grave, Joulin, Cisse, Grangier, and Jegou in 2017, exploits the highly skewed (Zipfian) distribution of word frequencies. It groups classes into clusters of varying size and assigns different embedding dimensions to each cluster. Frequent classes get more parameters and are evaluated first; rare classes share a smaller, cheaper representation. This approach is especially well suited to GPU architectures because it minimizes memory transfers. On the One Billion Word benchmark, adaptive softmax achieved accuracy close to full softmax while running significantly faster.
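PyTorch ships this method as torch.nn.AdaptiveLogSoftmaxWithLoss. The sketch below shows typical usage; the vocabulary size, cutoffs, and batch size are chosen purely for illustration, and class ids are assumed to be sorted by decreasing frequency as the module expects.

```python
import torch
import torch.nn as nn

vocab_size, dim = 200_000, 512
adaptive = nn.AdaptiveLogSoftmaxWithLoss(
    in_features=dim,
    n_classes=vocab_size,
    cutoffs=[2_000, 20_000, 100_000],  # head cluster plus three tail clusters
    div_value=4.0,                     # each tail cluster uses a smaller projection
)

hidden = torch.randn(32, dim)                      # final hidden states
targets = torch.randint(0, vocab_size, (32,))      # next-token ids
out = adaptive(hidden, targets)                    # namedtuple with .output and .loss
loss = out.loss
```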
| Method | Type | Theoretical guarantee | Learns partition function | Typical K | Complexity per example | Primary use case |
|---|---|---|---|---|---|---|
| Full softmax | Exact | N/A (exact) | Computed exactly | N/A | O(\|V\|) | Small vocabularies |
| Noise contrastive estimation (NCE) | Sampling | Consistent estimator; converges to MLE as K grows | Yes (as a parameter) | 25-100 | O(K) | Language modeling, density estimation |
| Negative sampling | Sampling | Not consistent for MLE | No (fixed to 1) | 5-20 | O(K) | Word embeddings (Word2Vec) |
| Sampled softmax | Sampling | Biased; bias decreases with K | No | 100-10,000 | O(K) | Language models, retrieval, recommendations |
| Importance sampling | Sampling | Unbiased estimate of partition function | No | 100-1,000 | O(K) | Neural machine translation |
| Hierarchical softmax | Structural | Exact (given the tree) | Computed via tree | N/A | O(log \|V\|) | Language models with large vocabularies |
| Adaptive softmax | Structural | Exact (approximate architecture) | Computed per cluster | N/A | O(\|V\|) worst case, much faster in practice | GPU-based language modeling |
The choice of noise distribution Q has a significant impact on training quality and convergence speed. Several options are commonly used:
| Distribution | Formula | Properties | Used in |
|---|---|---|---|
| Uniform | Q(c) = 1/\|V\| | Simple; treats all classes equally; may waste samples on irrelevant classes | Baseline comparisons |
| Unigram | Q(c) = freq(c) / N | Matches empirical frequency; popular classes sampled more often | Language modeling, recommendations |
| Smoothed unigram | Q(c) = freq(c)^alpha / Z, alpha < 1 | Gives rare classes more representation; alpha = 3/4 common | Word2Vec negative sampling |
| Log-uniform | Q(c) proportional to 1/(rank(c) + 1) | Approximates Zipfian distribution; used in TensorFlow's log_uniform_candidate_sampler | TensorFlow language models |
| Adaptive / learned | Q updated during training | Reduces variance; better matches true distribution | Adaptive importance sampling |
| In-batch negatives | Other examples in the same mini-batch serve as negatives | No extra sampling needed; efficient on GPUs; introduces popularity bias | Contrastive learning, two-tower retrieval |
When using in-batch negatives (as in many contrastive learning and retrieval systems), popular items appear as negatives more frequently than rare items, creating a sampling bias. The logQ correction addresses this by subtracting log(Q(c)) from the logit of each negative candidate, where Q(c) is estimated from the item's frequency in the training corpus.
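A minimal two-tower sketch of in-batch negatives with this correction. Following common practice, the correction is applied to every in-batch candidate's logit, and the frequency-based estimate of Q is assumed to be precomputed; all names and shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def in_batch_softmax_loss(user_emb, item_emb, item_log_q):
    """In-batch sampled softmax with a logQ correction.

    user_emb:   [batch, dim]  query-tower outputs
    item_emb:   [batch, dim]  item-tower outputs for each row's positive item
    item_log_q: [batch]       log of each item's sampling probability,
                              estimated from its frequency in the training data
    """
    logits = user_emb @ item_emb.t()            # [batch, batch]; row i's positive is column i
    logits = logits - item_log_q.unsqueeze(0)   # subtract log Q(c) from each candidate's logit
    labels = torch.arange(user_emb.size(0))
    return F.cross_entropy(logits, labels)
```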
Candidate sampling introduces bias because the loss is computed over a non-representative subset of classes. Several correction techniques mitigate this:
LogQ correction. For sampled softmax, the logit for each sampled negative is adjusted by subtracting log(K * Q(c)), where Q(c) is the probability of sampling class c. This correction accounts for the fact that some classes are sampled more frequently than others. If Q matched the true softmax distribution exactly, the correction would produce unbiased gradients, but since this is generally not the case, a residual bias remains.
Rejection sampling. When a sampled negative accidentally matches the target class, it should be discarded (rejected) and replaced with a new sample. Without this step, the target class would be simultaneously pushed up (as the positive) and pushed down (as a negative), introducing noise.
Importance weighting. Each sampled class's contribution to the loss is multiplied by the ratio of its true probability to its sampling probability. This produces an unbiased estimate in expectation but can have high variance if the sampling distribution is far from the target distribution.
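For rejection sampling specifically, a small sketch of the discard-and-replace variant for a single example is shown below; the masking approach used in the sampled softmax sketch earlier achieves the same effect without a loop. The function name and arguments are illustrative assumptions.

```python
import torch

def sample_negatives_excluding_target(noise_probs, num_negatives, target_class):
    """Draw K negatives from Q, redrawing any that collide with the true class."""
    negatives = torch.multinomial(noise_probs, num_negatives, replacement=True)
    hits = negatives == target_class
    while hits.any():
        # Replace only the colliding samples and re-check.
        negatives[hits] = torch.multinomial(noise_probs, int(hits.sum()), replacement=True)
        hits = negatives == target_class
    return negatives
```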
Candidate sampling first gained widespread adoption in NLP through Word2Vec. The Skip-gram model with negative sampling (SGNS) learns word embeddings by training a binary classifier to distinguish genuine word-context pairs from randomly corrupted ones. This approach, introduced in 2013, enabled training on billion-word corpora in hours rather than days.
NCE-based training was subsequently adopted for neural language models. Mnih and Teh (2012) showed that NCE could train neural language models with accuracy comparable to full softmax, and Zoph et al. (2016) applied it to large-scale recurrent language models. Importance sampling was used by Jean et al. (2015) to train neural machine translation models over vocabularies of 500,000 words.
Modern subword tokenization methods such as byte pair encoding (BPE) and SentencePiece have reduced vocabulary sizes to 30,000-100,000 tokens, which lessens the need for candidate sampling in standard language models. However, candidate sampling remains relevant for retrieval-augmented generation, entity linking, and other tasks where the output space is large.
In large-scale recommendation, the model must score millions of candidate items for each user. Two-tower architectures (also called dual-encoder models) are a common approach: one tower encodes the user and the other encodes the item, and relevance is measured by the dot product of their embeddings. During training, the full softmax over all items is replaced with sampled softmax or in-batch negative sampling.
Google's YouTube recommendation system uses a two-tower deep retrieval model trained with candidate sampling to retrieve personalized video suggestions from a corpus of tens of millions of videos. The system uses mixed negative sampling, combining in-batch negatives with uniformly sampled negatives to counteract the selection bias of implicit user feedback.
A common challenge in recommendation is the popularity bias introduced by in-batch negatives. The logQ correction, proposed by Yi et al. (2019) in the paper "Sampling-Bias-Corrected Neural Modeling for Large Corpus Item Recommendations," adjusts logits during training to prevent the model from unfairly penalizing popular items that appear as negatives more frequently.
Candidate sampling ideas underpin modern self-supervised learning methods. The InfoNCE loss, introduced by van den Oord et al. (2018) in Contrastive Predictive Coding (CPC), extends NCE to learn representations by maximizing a lower bound on the mutual information between different views of the same data point. The loss treats one positive pair and K negative pairs as inputs to a (K+1)-way classification problem:
L_InfoNCE = -log(exp(sim(q, k+)) / (exp(sim(q, k+)) + sum_{j=1}^{K} exp(sim(q, k_j-))))
where sim denotes a similarity function (typically cosine similarity or dot product).
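A minimal sketch of this loss using cosine similarity and a temperature, which is a common though not universal addition to the formula above; the explicit per-query negatives layout and all names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(query, positive_key, negative_keys, temperature=0.1):
    """InfoNCE as a (K+1)-way classification problem.

    query:         [batch, dim]
    positive_key:  [batch, dim]     one positive per query
    negative_keys: [batch, K, dim]  K negatives per query
    """
    q = F.normalize(query, dim=1)
    kp = F.normalize(positive_key, dim=1)
    kn = F.normalize(negative_keys, dim=2)

    pos = (q * kp).sum(dim=1, keepdim=True)            # [batch, 1] cosine similarities
    neg = torch.bmm(kn, q.unsqueeze(2)).squeeze(2)     # [batch, K]

    logits = torch.cat([pos, neg], dim=1) / temperature
    labels = torch.zeros(query.size(0), dtype=torch.long)   # positive is always column 0
    return F.cross_entropy(logits, labels)
```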
InfoNCE forms the foundation of influential frameworks including SimCLR, MoCo, and CLIP (see the timeline below).
The quality of negative samples matters significantly in contrastive learning. "Hard negatives" (negatives that are similar to the positive) provide stronger learning signals but can also destabilize training if not handled carefully.
Beyond contrastive learning, candidate sampling is used in object detection and image recognition tasks with very large label sets. Face recognition models, for example, must distinguish among millions of identities, and sampled softmax or ArcFace-style losses with candidate sampling make training feasible.
The number of negative samples K involves a tradeoff between computational cost and approximation quality. As general guidelines, negative sampling for word embeddings works well with K between 5 and 20, NCE for language modeling typically uses roughly 25 to 100 noise samples, and sampled softmax and importance sampling for retrieval and translation vocabularies commonly use several hundred to a few thousand negatives; larger K improves the approximation at the cost of more computation per step.
Candidate sampling provides the greatest benefit when the output space contains hundreds of thousands to millions of classes, so that the softmax over all classes dominates the cost of the forward and backward passes.
Subword tokenization has reduced vocabulary sizes in many NLP applications, diminishing the need for candidate sampling in standard language modeling. However, it remains essential for retrieval, recommendation, entity linking, extreme multi-label classification, and any task where the output space is inherently large.
A practical optimization is to draw one set of K negatives and share them across all examples in a mini-batch. This reduces the overhead of sampling and can be implemented efficiently as a single matrix multiplication. TensorFlow's sampled_softmax_loss and nce_loss functions use this strategy by default.
Candidate sampling is a training-time technique. At inference time, the model still needs to produce scores for all classes (or use an approximate nearest neighbor index for retrieval). Methods like hierarchical softmax, adaptive softmax, and differentiated softmax can provide speedups at both training and inference time, unlike sampling-based approaches that only accelerate training.
The different candidate sampling losses can be understood as points on a spectrum. Let s(c) denote the model's score for class c and Q(c) the noise distribution.
Full softmax cross-entropy:
L_softmax = -s(t) + log(sum_{c=1}^{|V|} exp(s(c)))
Sampled softmax: replaces the sum over |V| with a sum over the target and K sampled negatives, with bias correction:
L_SS ≈ -s(t) + log(exp(s(t)) + sum_{k=1}^{K} exp(s(c_k) - log(K * Q(c_k))))
NCE: reformulates as K + 1 binary classification problems:
L_NCE = -log(sigma(s(t) - log(K * Q(t)))) - sum_{k=1}^{K} log(sigma(-(s(c_k) - log(K * Q(c_k)))))
Negative sampling: simplifies NCE by setting the partition function to 1 (equivalent to dropping the log(K * Q) correction under certain assumptions):
L_NEG = -log(sigma(s(t))) - sum_{k=1}^{K} log(sigma(-s(c_k)))
As K grows, the sampled softmax loss approaches the full softmax loss and the NCE gradient approaches the full softmax gradient. Negative sampling does not share this convergence property, which is why it is effective for embedding learning but not suitable for tasks requiring calibrated probability estimates.
Major deep learning frameworks provide built-in support for candidate sampling:
| Framework | Function / Module | Method | Notes |
|---|---|---|---|
| TensorFlow | tf.nn.sampled_softmax_loss | Sampled softmax | Supports logQ correction; shared negatives across batch |
| TensorFlow | tf.nn.nce_loss | NCE | Binary classification formulation |
| TensorFlow | tf.random.log_uniform_candidate_sampler | Sampling utility | Log-uniform distribution for Zipfian vocabularies |
| TensorFlow | tf.random.fixed_unigram_candidate_sampler | Sampling utility | Custom unigram distribution |
| PyTorch | Custom implementation required | Various | No built-in sampled softmax; community libraries available |
| PyTorch | torch.nn.functional.cross_entropy with subset | Sampled softmax | Can be implemented manually by indexing into the weight matrix |
| JAX / Flax | Custom implementation | Various | Efficient with jax.numpy indexing |
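A hedged usage sketch of tf.nn.sampled_softmax_loss from the table above; the toy shapes and variable names are assumptions, and the per-example losses it returns are averaged manually.

```python
import tensorflow as tf

num_classes, dim, batch, num_sampled = 1_000_000, 256, 32, 100

weights = tf.Variable(tf.random.normal([num_classes, dim]))   # output embedding matrix
biases = tf.Variable(tf.zeros([num_classes]))
inputs = tf.random.normal([batch, dim])                        # final hidden states
labels = tf.random.uniform([batch, 1], maxval=num_classes, dtype=tf.int64)

loss = tf.reduce_mean(
    tf.nn.sampled_softmax_loss(
        weights=weights,
        biases=biases,
        labels=labels,                 # shape [batch, num_true]
        inputs=inputs,
        num_sampled=num_sampled,       # K negatives shared across the batch
        num_classes=num_classes,
        remove_accidental_hits=True,   # discard negatives that equal the target
    )
)
```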
| Year | Development | Authors |
|---|---|---|
| 2003 | Importance sampling for neural language models | Bengio and Senecal |
| 2005 | Hierarchical softmax | Morin and Bengio |
| 2010 | Noise contrastive estimation (NCE) | Gutmann and Hyvarinen |
| 2012 | NCE applied to neural language models | Mnih and Teh |
| 2013 | Negative sampling in Word2Vec | Mikolov, Sutskever, Chen, Corrado, Dean |
| 2014 | GloVe word embeddings | Pennington, Socher, Manning |
| 2015 | Importance sampling for neural machine translation | Jean, Cho, Memisevic, Bengio |
| 2017 | Adaptive softmax for GPUs | Grave, Joulin, Cisse, Grangier, Jegou |
| 2018 | InfoNCE and Contrastive Predictive Coding | van den Oord, Li, Vinyals |
| 2019 | Sampling-bias-corrected retrieval (logQ correction) | Yi, Yang, et al. (Google) |
| 2020 | SimCLR contrastive learning framework | Chen, Kornblith, Norouzi, Hinton |
| 2020 | MoCo momentum contrastive learning | He, Fan, Wu, Xie, Girshick |
| 2021 | CLIP cross-modal contrastive learning | Radford et al. (OpenAI) |