# Candidate Sampling

> Source: https://aiwiki.ai/wiki/candidate_sampling
> Updated: 2026-04-27
> Categories: Machine Learning, Natural Language Processing, Neural Networks, Training & Optimization
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

**Candidate sampling** is a family of training-time optimization techniques used in [machine learning](/wiki/machine_learning) to reduce the computational cost of models that must choose among a very large number of output classes. Instead of evaluating every possible class for each training example, candidate sampling selects a small subset of "candidate" classes (typically the true target plus a handful of randomly drawn negatives) and computes the [loss function](/wiki/loss_function) over that subset only. This makes it practical to train [neural networks](/wiki/neural_network) with output vocabularies ranging from hundreds of thousands to millions of classes, a situation that arises routinely in [natural language processing](/wiki/natural_language_understanding), [recommendation systems](/wiki/recommender_system), and large-scale classification.

The idea was popularized by its use in [word embedding](/wiki/word_embedding) models such as [Word2Vec](/wiki/word_embedding) and has since become a standard component of retrieval systems, [language models](/wiki/language_model), and contrastive representation learning frameworks.

## ELI5 (Explain like I'm five)

Imagine you are a teacher grading a multiple-choice test, but each question has one million answer choices. Reading through all one million options for every single question would take forever. Candidate sampling is like a shortcut: you look at the correct answer and just a few wrong answers picked at random. If the student can reliably tell the correct answer apart from those few wrong ones, they probably know the material. By only checking a small handful of choices instead of all one million, you finish grading much faster, and the student still learns effectively. That is essentially what candidate sampling does for a neural network during training.

## Motivation and problem setting

Many practical machine learning tasks involve predicting one correct label from an enormous set of possibilities. In [language models](/wiki/language_model), the model predicts the next word from a vocabulary that can contain 50,000 to several million tokens. In [recommendation systems](/wiki/recommender_system), the model selects relevant items from a catalog of millions of products, videos, or songs.

The standard approach for multi-class prediction uses the [softmax](/wiki/softmax) function to convert raw model outputs (logits) into a probability distribution over all classes:

```
P(y = c | x) = exp(z_c) / sum_j exp(z_j)
```

where `z_c` is the logit for class `c` and the sum in the denominator runs over every class in the vocabulary. Computing this sum, along with the corresponding [gradient](/wiki/gradient) updates for all class parameters, requires O(|V|) work per training example, where |V| is the vocabulary or class size. When |V| is in the millions, the softmax computation dominates both the forward and backward passes, making training impractically slow.

Candidate sampling techniques address this bottleneck by replacing the full summation with a much smaller one that includes only a subset of classes. The true target class is always included, and a set of K negative (or "noise") classes is drawn from some proposal distribution Q. The loss is then computed over these K + 1 classes rather than all |V| classes, reducing per-example computation from O(|V|) to O(K), where K is typically between 5 and a few thousand.

## How candidate sampling works

All candidate sampling methods share the same high-level procedure:

1. **Identify the target class.** For a given training example, the correct class (or classes) is known from the label.
2. **Sample negative candidates.** Draw K classes from a noise distribution Q(c). Common choices for Q include the uniform distribution, the unigram (frequency-based) distribution, or the unigram distribution raised to an exponent (e.g., the 3/4 power, as used in Word2Vec).
3. **Compute the loss.** Evaluate the model's scores only for the target class and the K sampled negatives. The exact form of the loss varies across methods (see below).
4. **Update parameters.** Backpropagate gradients only through the target and sampled classes, leaving all other class parameters unchanged for that training step.

Because only K + 1 classes participate in each update, the vast majority of the output weight matrix is untouched on any given step. Over many iterations, all classes receive updates, but the per-step cost is dramatically reduced.

## Types of candidate sampling

Several distinct loss formulations fall under the candidate sampling umbrella. They differ in how they use the sampled negatives and in their theoretical guarantees.

### Noise contrastive estimation (NCE)

[Noise contrastive estimation](/wiki/noise_contrastive_estimation) was introduced by Michael Gutmann and Aapo Hyvarinen in 2010 as a general-purpose method for estimating the parameters of unnormalized statistical models. The key idea is to convert the problem of density estimation into a binary classification problem: given a sample, decide whether it came from the true data distribution or from a known noise distribution.

For each training example with true class `t`, NCE draws K noise samples from distribution Q and sets up a logistic regression problem. The model learns to assign high probability to the event "this class is the real target" and low probability to the event "this class is noise." The NCE objective for a single example is:

```
L_NCE = -log(sigma(s(t) - log(K * Q(t)))) - sum_{k=1}^{K} log(sigma(-s(c_k) + log(K * Q(c_k))))
```

where `s(c)` is the model's raw score (logit) for class `c`, `sigma` is the [sigmoid function](/wiki/sigmoid_function), and `Q(c)` is the probability of sampling class `c` as noise.

An important theoretical property of NCE is that as the number of noise samples K increases, the NCE gradient converges to the gradient of the full softmax [cross-entropy](/wiki/cross-entropy) loss. Gutmann and Hyvarinen showed that NCE produces a consistent estimator, meaning it recovers the true model parameters given enough data. Empirical work has shown that roughly 25 noise samples can match the performance of the full softmax in language modeling tasks.

NCE also has the unique ability to learn the normalization constant (partition function) as a model parameter, sidestepping the need to compute it explicitly. This is valuable for energy-based models and other settings where the normalizing constant is intractable.

### Negative sampling

Negative sampling was introduced by Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean in their 2013 paper "Distributed Representations of Words and Phrases and their Compositionality." It is a simplified variant of NCE that was designed specifically for learning [word embeddings](/wiki/word_embedding).

The main simplification is that negative sampling fixes the normalization constant to 1, removing the need to estimate it. For each positive word-context pair (w, c), K negative context words are drawn from a noise distribution. Mikolov et al. found that raising the unigram frequency distribution to the 3/4 power worked best in practice:

```
Q(w) = freq(w)^(3/4) / sum_j freq(j)^(3/4)
```

This exponent gives rarer words a higher sampling probability than strict unigram sampling would, which helps the model learn better representations for less frequent words.

The negative sampling objective is:

```
L_NEG = -log(sigma(v_w . v_c)) - sum_{k=1}^{K} log(sigma(-v_w . v_{c_k}))
```

where `v_w` and `v_c` are the embedding vectors of the word and context.

While negative sampling lacks the theoretical consistency guarantees of NCE (it does not converge to maximum likelihood estimation as K grows), it proved extremely effective for learning word embeddings and became the default training method for Word2Vec's Skip-gram model. Mikolov et al. reported that negative sampling with K between 5 and 20 outperformed hierarchical softmax on word analogy tasks.

### Sampled softmax

Sampled softmax (also called "sampled softmax cross-entropy") retains the structure of the full softmax loss but restricts the normalization to a sampled subset of classes. Instead of summing over all |V| classes in the denominator, the sum runs over the target class plus K randomly drawn negatives.

The sampled softmax loss for a single example is:

```
L_SS = -z_t + log(exp(z_t) + sum_{k=1}^{K} exp(z_{c_k} - log(K * Q(c_k) / (1 - Q(t)))))
```

The correction term `log(K * Q(c_k) / (1 - Q(t)))` adjusts for the fact that classes are not sampled uniformly from the full distribution but according to Q. This bias correction is sometimes called the "logQ correction."

Sampled softmax differs from NCE in that it keeps the multi-class softmax structure rather than reducing the problem to binary classification. It always includes the target class exactly once, which reduces variance compared to purely sampled approaches. The method is implemented in TensorFlow as `tf.nn.sampled_softmax_loss` and is widely used for training language models and retrieval systems with large output spaces.

For a scenario with 100,000 classes and 100 negative samples, sampled softmax reduces computation from evaluating 100,000 scores per example to evaluating just 101, a roughly 1000x speedup.

### Importance sampling

Importance sampling uses a proposal distribution Q to approximate the expensive expectation over all vocabulary words in the softmax denominator. Rather than computing scores for every class, it draws K samples from Q and reweights each sample by the ratio of the true probability to the sampling probability:

```
sum_j exp(z_j) ≈ (1/K) * sum_{k=1}^{K} exp(z_{c_k}) / Q(c_k)
```

This provides an unbiased estimate of the partition function, but can suffer from high variance when Q is a poor match for the true distribution. Adaptive importance sampling methods address this by adjusting Q during training to better match the model's evolving output distribution, achieving speedups of roughly 100x over full softmax.

Importance sampling for neural language models was explored by Bengio and Senecal (2003, 2008) and later applied to neural machine translation by Jean et al. (2015), who used it to train models with vocabularies of 500,000 words.

### Hierarchical softmax

Although not a sampling method in the strict sense, [hierarchical softmax](/wiki/softmax) is often discussed alongside candidate sampling because it addresses the same computational bottleneck. Proposed by Morin and Bengio in 2005, hierarchical softmax replaces the flat softmax layer with a binary tree structure in which words sit at the leaves. Computing the probability of a word requires traversing a path from the root to the corresponding leaf, which takes O(log |V|) time instead of O(|V|).

Hierarchical softmax can yield speedups of 50x or more during training but requires careful construction of the tree. The original paper used WordNet to build the hierarchy. Later work by Mikolov et al. used Huffman coding, assigning shorter paths to more frequent words.

### Adaptive softmax

Adaptive softmax, proposed by Grave, Joulin, Cisse, Grangier, and Jegou in 2017, exploits the highly skewed (Zipfian) distribution of word frequencies. It groups classes into clusters of varying size and assigns different embedding dimensions to each cluster. Frequent classes get more parameters and are evaluated first; rare classes share a smaller, cheaper representation. This approach is especially well suited to GPU architectures because it minimizes memory transfers. On the One Billion Word benchmark, adaptive softmax achieved accuracy close to full softmax while running significantly faster.

## Comparison of methods

| Method | Type | Theoretical guarantee | Learns partition function | Typical K | Complexity per example | Primary use case |
|---|---|---|---|---|---|---|
| Full softmax | Exact | N/A (exact) | Computed exactly | N/A | O(\|V\|) | Small vocabularies |
| Noise contrastive estimation (NCE) | Sampling | Consistent estimator; converges to MLE as K grows | Yes (as a parameter) | 25-100 | O(K) | Language modeling, density estimation |
| Negative sampling | Sampling | Not consistent for MLE | No (fixed to 1) | 5-20 | O(K) | Word embeddings (Word2Vec) |
| Sampled softmax | Sampling | Biased; bias decreases with K | No | 100-10,000 | O(K) | Language models, retrieval, recommendations |
| Importance sampling | Sampling | Unbiased estimate of partition function | No | 100-1,000 | O(K) | Neural machine translation |
| Hierarchical softmax | Structural | Exact (given the tree) | Computed via tree | N/A | O(log \|V\|) | Language models with large vocabularies |
| Adaptive softmax | Structural | Exact (approximate architecture) | Computed per cluster | N/A | O(\|V\|) but much faster in practice | GPU-based language modeling |

## Sampling distributions

The choice of noise distribution Q has a significant impact on training quality and convergence speed. Several options are commonly used:

| Distribution | Formula | Properties | Used in |
|---|---|---|---|
| Uniform | Q(c) = 1/\|V\| | Simple; treats all classes equally; may waste samples on irrelevant classes | Baseline comparisons |
| Unigram | Q(c) = freq(c) / N | Matches empirical frequency; popular classes sampled more often | Language modeling, recommendations |
| Smoothed unigram | Q(c) = freq(c)^alpha / Z, alpha < 1 | Gives rare classes more representation; alpha = 3/4 common | Word2Vec negative sampling |
| Log-uniform | Q(c) proportional to 1/(rank(c) + 1) | Approximates Zipfian distribution; used in TensorFlow's `log_uniform_candidate_sampler` | TensorFlow language models |
| Adaptive / learned | Q updated during training | Reduces variance; better matches true distribution | Adaptive importance sampling |
| In-batch negatives | Other examples in the same mini-batch serve as negatives | No extra sampling needed; efficient on GPUs; introduces popularity bias | Contrastive learning, two-tower retrieval |

When using in-batch negatives (as in many contrastive learning and retrieval systems), popular items appear as negatives more frequently than rare items, creating a sampling bias. The logQ correction addresses this by subtracting log(Q(c)) from the logit of each negative candidate, where Q(c) is estimated from the item's frequency in the training corpus.

## Bias correction

Candidate sampling introduces bias because the loss is computed over a non-representative subset of classes. Several correction techniques mitigate this:

**LogQ correction.** For sampled softmax, the logit for each sampled negative is adjusted by subtracting log(K * Q(c)), where Q(c) is the probability of sampling class c. This correction accounts for the fact that some classes are sampled more frequently than others. If Q matched the true softmax distribution exactly, the correction would produce unbiased gradients, but since this is generally not the case, a residual bias remains.

**Rejection sampling.** When a sampled negative accidentally matches the target class, it should be discarded (rejected) and replaced with a new sample. Without this step, the target class would be simultaneously pushed up (as the positive) and pushed down (as a negative), introducing noise.

**Importance weighting.** Each sampled class's contribution to the loss is multiplied by the ratio of its true probability to its sampling probability. This produces an unbiased estimate in expectation but can have high variance if the sampling distribution is far from the target distribution.

## Applications

### Natural language processing

Candidate sampling first gained widespread adoption in NLP through [Word2Vec](/wiki/word_embedding). The Skip-gram model with negative sampling (SGNS) learns word embeddings by training a binary classifier to distinguish genuine word-context pairs from randomly corrupted ones. This approach, introduced in 2013, enabled training on billion-word corpora in hours rather than days.

NCE-based training was subsequently adopted for neural [language models](/wiki/language_model). Mnih and Teh (2012) showed that NCE could train neural language models with accuracy comparable to full softmax, and Zoph et al. (2016) applied it to large-scale recurrent language models. Importance sampling was used by Jean et al. (2015) to train neural machine translation models over vocabularies of 500,000 words.

Modern subword tokenization methods such as [byte pair encoding](/wiki/byte_pair_encoding) (BPE) and SentencePiece have reduced vocabulary sizes to 30,000-100,000 tokens, which lessens the need for candidate sampling in standard language models. However, candidate sampling remains relevant for retrieval-augmented generation, entity linking, and other tasks where the output space is large.

### Recommendation systems

In large-scale recommendation, the model must score millions of candidate items for each user. Two-tower architectures (also called dual-encoder models) are a common approach: one tower encodes the user and the other encodes the item, and relevance is measured by the dot product of their embeddings. During training, the full softmax over all items is replaced with sampled softmax or in-batch negative sampling.

Google's YouTube recommendation system uses a two-tower neural deep retrieval model trained with candidate sampling to retrieve personalized video suggestions from a corpus of tens of millions of videos. The system uses mixed negative sampling, combining in-batch negatives with uniformly sampled negatives to counteract the selection bias of implicit user feedback.

A common challenge in recommendation is the popularity bias introduced by in-batch negatives. The logQ correction, proposed by Yi et al. (2019) in the paper "Sampling-Bias-Corrected Neural Modeling for Large Corpus Item Recommendations," adjusts logits during training to prevent the model from unfairly penalizing popular items that appear as negatives more frequently.

### Contrastive representation learning

Candidate sampling ideas underpin modern [self-supervised learning](/wiki/self-supervised_learning) methods. The InfoNCE loss, introduced by van den Oord et al. (2018) in Contrastive Predictive Coding (CPC), extends NCE to learn representations by maximizing mutual information between different views of the same data point. The loss treats one positive pair and K negative pairs as inputs to a (K+1)-way classification problem:

```
L_InfoNCE = -log(exp(sim(q, k+)) / (exp(sim(q, k+)) + sum_{k=1}^{K} exp(sim(q, k_k-))))
```

where `sim` denotes a similarity function (typically cosine similarity or dot product).

InfoNCE forms the foundation of influential frameworks including:

- **SimCLR** (Chen et al., 2020): Uses augmented views of images as positive pairs and other images in the batch as negatives. Requires large batch sizes (4,096 or more) to provide enough negatives.
- **MoCo** (He et al., 2020): Maintains a momentum-updated queue of negatives, decoupling the number of negatives from batch size.
- **CLIP** (Radford et al., 2021): Applies contrastive learning across image-text pairs, using in-batch negatives from both modalities.

The quality of negative samples matters significantly in contrastive learning. "Hard negatives" (negatives that are similar to the positive) provide stronger learning signals but can also destabilize training if not handled carefully.

### Computer vision

Beyond contrastive learning, candidate sampling is used in [object detection](/wiki/object_detection) and [image recognition](/wiki/image_recognition) tasks with very large label sets. Face recognition models, for example, must distinguish among millions of identities, and sampled softmax or [ArcFace-style losses](/wiki/image_recognition) with candidate sampling make training feasible.

## Practical considerations

### Choosing the number of negative samples

The number of negative samples K involves a tradeoff between computational cost and approximation quality. General guidelines include:

- For word embeddings (negative sampling): K = 5 to 20. Mikolov et al. found K = 5 to 20 worked well for large datasets and K = 2 to 5 for small ones.
- For NCE in language modeling: K = 25 to 100. Mnih and Teh (2012) found that K = 25 was sufficient to match full softmax.
- For sampled softmax in retrieval: K = 100 to 10,000, depending on the catalog size and available compute.
- For contrastive learning (InfoNCE): Larger K generally improves performance, with SimCLR using effective values in the thousands via large batch sizes.

### When to use candidate sampling

Candidate sampling provides the greatest benefit when:

- The number of output classes exceeds roughly 100,000.
- The output projection layer (embedding matrix for all classes) dominates computation or memory usage.
- Training time is the primary bottleneck (candidate sampling typically helps only during training, not inference).

Subword tokenization has reduced vocabulary sizes in many NLP applications, diminishing the need for candidate sampling in standard language modeling. However, it remains essential for retrieval, recommendation, entity linking, extreme multi-label classification, and any task where the output space is inherently large.

### Sharing negatives across a batch

A practical optimization is to draw one set of K negatives and share them across all examples in a mini-batch. This reduces the overhead of sampling and can be implemented efficiently as a single matrix multiplication. TensorFlow's `sampled_softmax_loss` and `nce_loss` functions use this strategy by default.

### Candidate sampling at inference time

Candidate sampling is a training-time technique. At inference time, the model still needs to produce scores for all classes (or use an approximate nearest neighbor index for retrieval). Methods like hierarchical softmax, adaptive softmax, and differentiated softmax can provide speedups at both training and inference time, unlike sampling-based approaches that only accelerate training.

## Mathematical relationship between methods

The different candidate sampling losses can be understood as points on a spectrum. Let `s(c)` denote the model's score for class `c` and `Q(c)` the noise distribution.

**Full softmax cross-entropy:**
```
L_softmax = -s(t) + log(sum_{c=1}^{|V|} exp(s(c)))
```

**Sampled softmax:** replaces the sum over |V| with a sum over the target and K sampled negatives, with bias correction:
```
L_SS ≈ -s(t) + log(exp(s(t)) + sum_{k=1}^{K} exp(s(c_k) - log(K * Q(c_k))))
```

**NCE:** reformulates as K + 1 binary classification problems:
```
L_NCE = -log(sigma(s(t) - log(K * Q(t)))) - sum_{k=1}^{K} log(sigma(-(s(c_k) - log(K * Q(c_k)))))
```

**Negative sampling:** simplifies NCE by setting the partition function to 1 (equivalent to dropping the log(K * Q) correction under certain assumptions):
```
L_NEG = -log(sigma(s(t))) - sum_{k=1}^{K} log(sigma(-s(c_k)))
```

As K approaches |V|, both sampled softmax and NCE converge to the full softmax loss. Negative sampling does not share this convergence property, which is why it is effective for embedding learning but not suitable for tasks requiring calibrated probability estimates.

## Framework support

Major deep learning frameworks provide built-in support for candidate sampling:

| Framework | Function / Module | Method | Notes |
|---|---|---|---|
| [TensorFlow](/wiki/tensorflow) | `tf.nn.sampled_softmax_loss` | Sampled softmax | Supports logQ correction; shared negatives across batch |
| TensorFlow | `tf.nn.nce_loss` | NCE | Binary classification formulation |
| TensorFlow | `tf.random.log_uniform_candidate_sampler` | Sampling utility | Log-uniform distribution for Zipfian vocabularies |
| TensorFlow | `tf.random.fixed_unigram_candidate_sampler` | Sampling utility | Custom unigram distribution |
| [PyTorch](/wiki/pytorch) | Custom implementation required | Various | No built-in sampled softmax; community libraries available |
| PyTorch | `torch.nn.functional.cross_entropy` with subset | Sampled softmax | Can be implemented manually by indexing into the weight matrix |
| JAX / Flax | Custom implementation | Various | Efficient with `jax.numpy` indexing |

## Historical timeline

| Year | Development | Authors |
|---|---|---|
| 2003 | Importance sampling for neural language models | Bengio and Senecal |
| 2005 | Hierarchical softmax | Morin and Bengio |
| 2010 | Noise contrastive estimation (NCE) | Gutmann and Hyvarinen |
| 2012 | NCE applied to neural language models | Mnih and Teh |
| 2013 | Negative sampling in Word2Vec | Mikolov, Sutskever, Chen, Corrado, Dean |
| 2013 | [GloVe](/wiki/word_embedding) word embeddings | Pennington, Socher, Manning |
| 2015 | Importance sampling for neural machine translation | Jean, Cho, Memisevic, Bengio |
| 2017 | Adaptive softmax for GPUs | Grave, Joulin, Cisse, Grangier, Jegou |
| 2018 | InfoNCE and Contrastive Predictive Coding | van den Oord, Li, Vinyals |
| 2019 | Sampling-bias-corrected retrieval (logQ correction) | Yi, Yang, et al. (Google) |
| 2020 | SimCLR contrastive learning framework | Chen, Kornblith, Norouzi, Hinton |
| 2020 | MoCo momentum contrastive learning | He, Fan, Wu, Xie, Girshick |
| 2021 | CLIP cross-modal contrastive learning | Radford et al. (OpenAI) |

## See also

- [Softmax](/wiki/softmax)
- [Cross-entropy](/wiki/cross-entropy)
- [Word embedding](/wiki/word_embedding)
- [Loss function](/wiki/loss_function)
- [Self-supervised learning](/wiki/self-supervised_learning)
- [Recommendation system](/wiki/recommender_system)
- [Language model](/wiki/language_model)

## References

1. Gutmann, M. and Hyvarinen, A. (2010). "Noise-contrastive estimation: A new estimation principle for unnormalized statistical models." Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 297-304.
2. Mikolov, T., Sutskever, I., Chen, K., Corrado, G., and Dean, J. (2013). "Distributed Representations of Words and Phrases and their Compositionality." Advances in Neural Information Processing Systems 26 (NeurIPS 2013).
3. Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013). "Efficient Estimation of Word Representations in Vector Space." arXiv preprint arXiv:1301.3781.
4. Mnih, A. and Teh, Y.W. (2012). "A fast and simple algorithm for training neural probabilistic language models." Proceedings of the 29th International Conference on Machine Learning (ICML 2012).
5. Morin, F. and Bengio, Y. (2005). "Hierarchical Probabilistic Neural Network Language Model." Proceedings of the 10th International Workshop on Artificial Intelligence and Statistics (AISTATS 2005).
6. Bengio, Y. and Senecal, J.-S. (2003). "Quick Training of Probabilistic Neural Nets by Importance Sampling." Proceedings of the 9th International Workshop on Artificial Intelligence and Statistics (AISTATS 2003).
7. Jean, S., Cho, K., Memisevic, R., and Bengio, Y. (2015). "On Using Very Large Target Vocabulary for Neural Machine Translation." Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics (ACL 2015).
8. Grave, E., Joulin, A., Cisse, M., Grangier, D., and Jegou, H. (2017). "Efficient softmax approximation for GPUs." Proceedings of the 34th International Conference on Machine Learning (ICML 2017).
9. van den Oord, A., Li, Y., and Vinyals, O. (2018). "Representation Learning with Contrastive Predictive Coding." arXiv preprint arXiv:1807.03748.
10. Yi, X., Yang, J., Hong, L., Cheng, D.Z., Heldt, L., Kumthekar, A., Zhao, Z., Wei, L., and Chi, E. (2019). "Sampling-Bias-Corrected Neural Modeling for Large Corpus Item Recommendations." Proceedings of the 13th ACM Conference on Recommender Systems (RecSys 2019).
11. Chen, T., Kornblith, S., Norouzi, M., and Hinton, G. (2020). "A Simple Framework for Contrastive Learning of Visual Representations." Proceedings of the 37th International Conference on Machine Learning (ICML 2020).
12. Goldberg, Y. and Levy, O. (2014). "word2vec Explained: deriving Mikolov et al.'s negative-sampling word-embedding method." arXiv preprint arXiv:1402.3722.
13. Ma, Z. and Collins, M. (2018). "Noise Contrastive Estimation and Negative Sampling for Conditional Models: Consistency and Statistical Efficiency." Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP 2018).
14. Ruder, S. (2016). "Approximating the Softmax for Learning Word Embeddings." Blog post. https://www.ruder.io/word-embeddings-softmax/
15. TensorFlow Authors. "Candidate Sampling Algorithms Reference." https://www.tensorflow.org/extras/candidate_sampling.pdf
