Candidate sampling is a family of training-time optimization techniques used in machine learning to reduce the computational cost of models that must choose among a very large number of output classes. Instead of evaluating every possible class for each training example, candidate sampling selects a small subset of "candidate" classes (typically the true target plus a handful of randomly drawn negatives) and computes the loss function over that subset only. This makes it practical to train neural networks with output vocabularies ranging from hundreds of thousands to millions of classes, a situation that arises routinely in natural language processing, recommendation systems, and large-scale classification.
The idea was popularized by its use in word embedding models such as Word2Vec and has since become a standard component of retrieval systems, language models, and contrastive representation learning frameworks.
Imagine you are a teacher grading a multiple-choice test, but each question has one million answer choices. Reading through all one million options for every single question would take forever. Candidate sampling is like a shortcut: you look at the correct answer and just a few wrong answers picked at random. If the student can reliably tell the correct answer apart from those few wrong ones, they probably know the material. By only checking a small handful of choices instead of all one million, you finish grading much faster, and the student still learns effectively. That is essentially what candidate sampling does for a neural network during training.
Many practical machine learning tasks involve predicting one correct label from an enormous set of possibilities. In language models, the model predicts the next word from a vocabulary that can contain 50,000 to several million tokens. In recommendation systems, the model selects relevant items from a catalog of millions of products, videos, or songs.
The standard approach for multi-class prediction uses the softmax function to convert raw model outputs (logits) into a probability distribution over all classes:
P(y = c | x) = exp(z_c) / sum_j exp(z_j)
where z_c is the logit for class c and the sum in the denominator runs over every class in the vocabulary. Computing this sum, along with the corresponding gradient updates for all class parameters, requires O(|V|) work per training example, where |V| is the vocabulary or class size. When |V| is in the millions, the softmax computation dominates both the forward and backward passes, making training impractically slow.
Candidate sampling techniques address this bottleneck by replacing the full summation with a much smaller one that includes only a subset of classes. The true target class is always included, and a set of K negative (or "noise") classes is drawn from some proposal distribution Q. The loss is then computed over these K + 1 classes rather than all |V| classes, reducing per-example computation from O(|V|) to O(K), where K is typically between 5 and a few thousand.
All candidate sampling methods share the same high-level procedure: for each training example, draw K negative classes from a proposal distribution Q, compute logits for the true class and the sampled negatives, evaluate the loss over those K + 1 classes, and update only the parameters of the classes involved.
Because only K + 1 classes participate in each update, the vast majority of the output weight matrix is untouched on any given step. Over many iterations, all classes receive updates, but the per-step cost is dramatically reduced.
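As a concrete illustration, the following minimal PyTorch sketch implements this procedure for a simple linear output layer. The function name, tensor shapes, and the use of a plain cross-entropy over the candidate set are illustrative assumptions rather than any particular library's implementation; corrections such as the logQ adjustment and accidental-hit handling discussed later are omitted here.

```python
import torch
import torch.nn.functional as F

def candidate_sampling_step(hidden, target, output_weights, num_negatives, noise_probs):
    """Sketch of one candidate-sampling update: score only the target plus
    K sampled negatives instead of all |V| classes.

    hidden:         [batch, dim]        final hidden representations
    target:         [batch]             true class indices
    output_weights: [num_classes, dim]  output embedding matrix
    noise_probs:    [num_classes]       proposal distribution Q
    """
    # Draw K negative classes from Q (shared across the batch for efficiency).
    # Accidental collisions with the target are ignored here for brevity.
    negatives = torch.multinomial(noise_probs, num_negatives, replacement=True)   # [K]

    # Gather only the K + 1 rows of the output matrix that participate.
    cand = torch.cat([target.unsqueeze(1), negatives.expand(target.size(0), -1)], dim=1)
    cand_weights = output_weights[cand]                                  # [batch, K+1, dim]

    # Logits over the small candidate set; column 0 is always the true class.
    logits = torch.bmm(cand_weights, hidden.unsqueeze(2)).squeeze(2)     # [batch, K+1]
    labels = torch.zeros(target.size(0), dtype=torch.long)
    return F.cross_entropy(logits, labels)
```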
Several distinct loss formulations fall under the candidate sampling umbrella. They differ in how they use the sampled negatives and in their theoretical guarantees.
Noise contrastive estimation was introduced by Michael Gutmann and Aapo Hyvarinen in 2010 as a general-purpose method for estimating the parameters of unnormalized statistical models. The key idea is to convert the problem of density estimation into a binary classification problem: given a sample, decide whether it came from the true data distribution or from a known noise distribution.
For each training example with true class t, NCE draws K noise samples from distribution Q and sets up a logistic regression problem. The model learns to assign high probability to the event "this class is the real target" and low probability to the event "this class is noise." The NCE objective for a single example is:
L_NCE = -log(sigma(s(t) - log(K * Q(t)))) - sum_{k=1}^{K} log(sigma(-s(c_k) + log(K * Q(c_k))))
where s(c) is the model's raw score (logit) for class c, sigma is the sigmoid function, and Q(c) is the probability of sampling class c as noise.
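A minimal sketch of this objective, assuming the scores s and the noise probabilities Q for the target and sampled classes have already been gathered; the function and argument names are illustrative.

```python
import torch
import torch.nn.functional as F

def nce_loss(true_scores, noise_scores, true_q, noise_q):
    """NCE objective as defined above.

    true_scores:  [batch]    model scores s(t) for the true classes
    noise_scores: [batch, K] model scores s(c_k) for the K noise classes
    true_q:       [batch]    Q(t), noise probability of each true class
    noise_q:      [batch, K] Q(c_k) for each sampled noise class
    """
    k = noise_scores.size(1)
    # Positive term: classify the true class as "data", adjusted by log(K * Q(t)).
    pos = F.logsigmoid(true_scores - torch.log(k * true_q))
    # Negative terms: classify each noise sample as "noise".
    neg = F.logsigmoid(-(noise_scores - torch.log(k * noise_q))).sum(dim=1)
    return -(pos + neg).mean()
```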
An important theoretical property of NCE is that as the number of noise samples K increases, the NCE gradient converges to the gradient of the full softmax cross-entropy loss. Gutmann and Hyvarinen showed that NCE produces a consistent estimator, meaning it recovers the true model parameters given enough data. Empirical work has shown that roughly 25 noise samples can match the performance of the full softmax in language modeling tasks.
NCE also has the unique ability to learn the normalization constant (partition function) as a model parameter, sidestepping the need to compute it explicitly. This is valuable for energy-based models and other settings where the normalizing constant is intractable.
Negative sampling was introduced by Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean in their 2013 paper "Distributed Representations of Words and Phrases and their Compositionality." It is a simplified variant of NCE that was designed specifically for learning word embeddings.
The main simplification is that negative sampling fixes the normalization constant to 1, removing the need to estimate it. For each positive word-context pair (w, c), K negative context words are drawn from a noise distribution. Mikolov et al. found that raising the unigram frequency distribution to the 3/4 power worked best in practice:
Q(w) = freq(w)^(3/4) / sum_j freq(j)^(3/4)
This exponent gives rarer words a higher sampling probability than strict unigram sampling would, which helps the model learn better representations for less frequent words.
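A small sketch of the smoothed distribution on made-up toy counts (the frequencies here are assumptions purely for illustration):

```python
import torch

# Smoothed unigram noise distribution: Q(w) proportional to freq(w)^(3/4).
counts = torch.tensor([1_000_000., 250_000., 40_000., 1_200., 300.])  # toy word frequencies
q = counts.pow(0.75)
q = q / q.sum()
# Rare words receive a larger share of negative samples than under plain
# unigram sampling: compare q with counts / counts.sum().
```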
The negative sampling objective is:
L_NEG = -log(sigma(v_w . v_c)) - sum_{k=1}^{K} log(sigma(-v_w . v_{c_k}))
where v_w and v_c are the embedding vectors of the word and context.
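The following sketch computes this objective for a batch of word-context pairs, assuming the center, context, and negative-context embeddings have already been looked up; shapes and names are illustrative.

```python
import torch
import torch.nn.functional as F

def skipgram_negative_sampling_loss(word_vecs, ctx_vecs, neg_vecs):
    """Skip-gram negative sampling objective as defined above.

    word_vecs: [batch, dim]     embeddings v_w of the center words
    ctx_vecs:  [batch, dim]     embeddings v_c of the observed context words
    neg_vecs:  [batch, K, dim]  embeddings v_{c_k} of K sampled negative contexts
    """
    pos_score = (word_vecs * ctx_vecs).sum(dim=1)                        # v_w . v_c
    neg_score = torch.bmm(neg_vecs, word_vecs.unsqueeze(2)).squeeze(2)   # [batch, K]
    loss = -F.logsigmoid(pos_score) - F.logsigmoid(-neg_score).sum(dim=1)
    return loss.mean()
```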
While negative sampling lacks the theoretical consistency guarantees of NCE (it does not converge to maximum likelihood estimation as K grows), it proved extremely effective for learning word embeddings and became the default training method for Word2Vec's Skip-gram model. Mikolov et al. reported that negative sampling with K between 5 and 20 outperformed hierarchical softmax on word analogy tasks.
Sampled softmax (also called "sampled softmax cross-entropy") retains the structure of the full softmax loss but restricts the normalization to a sampled subset of classes. Instead of summing over all |V| classes in the denominator, the sum runs over the target class plus K randomly drawn negatives.
The sampled softmax loss for a single example is:
L_SS = -z_t + log(exp(z_t) + sum_{k=1}^{K} exp(z_{c_k} - log(K * Q(c_k) / (1 - Q(t)))))
The correction term log(K * Q(c_k) / (1 - Q(t))) adjusts for the fact that classes are not sampled uniformly from the full distribution but according to Q. This bias correction is sometimes called the "logQ correction."
Sampled softmax differs from NCE in that it keeps the multi-class softmax structure rather than reducing the problem to binary classification. It always includes the target class exactly once, which reduces variance compared to purely sampled approaches. The method is implemented in TensorFlow as tf.nn.sampled_softmax_loss and is widely used for training language models and retrieval systems with large output spaces.
For a scenario with 100,000 classes and 100 negative samples, sampled softmax reduces computation from evaluating 100,000 scores per example to evaluating just 101, a roughly 1000x speedup.
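A minimal PyTorch sketch of sampled softmax for a linear output layer: it draws one shared set of negatives per batch, applies the simpler correction log(K * Q(c_k)) that also appears in the unified view later in this article, and masks accidental hits. This is an illustration under those assumptions, not TensorFlow's exact implementation.

```python
import torch
import torch.nn.functional as F

def sampled_softmax_loss(hidden, target, output_weights, noise_probs, num_sampled):
    """Sampled softmax with a logQ correction.

    hidden:         [batch, dim]         final hidden states
    target:         [batch]              true class ids
    output_weights: [num_classes, dim]   output embedding matrix
    noise_probs:    [num_classes]        proposal distribution Q
    num_sampled:    K, number of negatives shared by the whole batch
    """
    # One shared set of K negatives per mini-batch (a single matmul below).
    negatives = torch.multinomial(noise_probs, num_sampled, replacement=True)    # [K]

    true_logits = (hidden * output_weights[target]).sum(dim=1, keepdim=True)     # [batch, 1]
    sampled_logits = hidden @ output_weights[negatives].t()                      # [batch, K]

    # logQ correction: subtract log(K * Q(c_k)) from each sampled logit.
    sampled_logits = sampled_logits - torch.log(num_sampled * noise_probs[negatives])

    # Mask "accidental hits" where a sampled negative equals the true class.
    hits = negatives.unsqueeze(0) == target.unsqueeze(1)                         # [batch, K]
    sampled_logits = sampled_logits.masked_fill(hits, float("-inf"))

    logits = torch.cat([true_logits, sampled_logits], dim=1)                     # [batch, K+1]
    labels = torch.zeros(target.size(0), dtype=torch.long)
    return F.cross_entropy(logits, labels)
```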
Importance sampling uses a proposal distribution Q to approximate the expensive expectation over all vocabulary words in the softmax denominator. Rather than computing scores for every class, it draws K samples from Q and reweights each sample by the ratio of the true probability to the sampling probability:
sum_j exp(z_j) ≈ (1/K) * sum_{k=1}^{K} exp(z_{c_k}) / Q(c_k)
This provides an unbiased estimate of the partition function, but can suffer from high variance when Q is a poor match for the true distribution. Adaptive importance sampling methods address this by adjusting Q during training to better match the model's evolving output distribution, achieving speedups of roughly 100x over full softmax.
Importance sampling for neural language models was explored by Bengio and Senecal (2003, 2008) and later applied to neural machine translation by Jean et al. (2015), who used it to train models with vocabularies of 500,000 words.
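A minimal sketch of this Monte Carlo estimator; the toy comparison at the end is only feasible because the example vocabulary is tiny, and all names and sizes are illustrative assumptions.

```python
import torch

def estimate_partition_function(logits, noise_probs, num_samples):
    """Importance-sampling estimate of sum_j exp(z_j).

    logits:      [num_classes] full logit vector z (only sampled entries are read)
    noise_probs: [num_classes] proposal distribution Q
    num_samples: K, number of Monte Carlo samples
    """
    idx = torch.multinomial(noise_probs, num_samples, replacement=True)   # c_k ~ Q
    weights = torch.exp(logits[idx]) / noise_probs[idx]                   # exp(z_{c_k}) / Q(c_k)
    return weights.mean()                                                 # (1/K) * sum_k ...

# Toy check against the exact value (possible only because |V| is tiny here).
logits = torch.randn(1000)
q = torch.full((1000,), 1.0 / 1000)
print(estimate_partition_function(logits, q, 200), torch.exp(logits).sum())
```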
Although not a sampling method in the strict sense, hierarchical softmax is often discussed alongside candidate sampling because it addresses the same computational bottleneck. Proposed by Morin and Bengio in 2005, hierarchical softmax replaces the flat softmax layer with a binary tree structure in which words sit at the leaves. Computing the probability of a word requires traversing a path from the root to the corresponding leaf, which takes O(log |V|) time instead of O(|V|).
Hierarchical softmax can yield speedups of 50x or more during training but requires careful construction of the tree. The original paper used WordNet to build the hierarchy. Later work by Mikolov et al. used Huffman coding, assigning shorter paths to more frequent words.
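A compact sketch of the path computation, assuming the tree paths (internal node ids and branch signs) have been precomputed, for example from a Huffman code; the names and shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def hierarchical_softmax_log_prob(hidden, node_vectors, path_nodes, path_signs):
    """log P(word | hidden) as a product of sigmoid branch decisions along the
    word's root-to-leaf path; only O(log |V|) nodes are touched.

    hidden:       [dim]         hidden state h
    node_vectors: [nodes, dim]  parameters of the internal tree nodes
    path_nodes:   [depth]       ids of the internal nodes on the word's path
    path_signs:   [depth]       +1.0 / -1.0 for the branch taken at each node
    """
    scores = node_vectors[path_nodes] @ hidden          # [depth]
    return F.logsigmoid(path_signs * scores).sum()
```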
Adaptive softmax, proposed by Grave, Joulin, Cisse, Grangier, and Jegou in 2017, exploits the highly skewed (Zipfian) distribution of word frequencies. It groups classes into clusters of varying size and assigns different embedding dimensions to each cluster. Frequent classes get more parameters and are evaluated first; rare classes share a smaller, cheaper representation. This approach is especially well suited to GPU architectures because it minimizes memory transfers. On the One Billion Word benchmark, adaptive softmax achieved accuracy close to full softmax while running significantly faster.
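PyTorch ships this method as torch.nn.AdaptiveLogSoftmaxWithLoss. The sketch below shows typical usage; the vocabulary size, cutoffs, and batch size are chosen purely for illustration, and class ids are assumed to be sorted by decreasing frequency as the module expects.

```python
import torch
import torch.nn as nn

vocab_size, dim = 200_000, 512
adaptive = nn.AdaptiveLogSoftmaxWithLoss(
    in_features=dim,
    n_classes=vocab_size,
    cutoffs=[2_000, 20_000, 100_000],  # head cluster plus three tail clusters
    div_value=4.0,                     # each tail cluster uses a smaller projection
)

hidden = torch.randn(32, dim)                      # final hidden states
targets = torch.randint(0, vocab_size, (32,))      # next-token ids
out = adaptive(hidden, targets)                    # namedtuple with .output and .loss
loss = out.loss
```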
| Method | Type | Theoretical guarantee | Learns partition function | Typical K | Complexity per example | Primary use case |
|---|---|---|---|---|---|---|
| Full softmax | Exact | N/A (exact) | Computed exactly | N/A | O(\|V\|) | Small vocabularies |
| Noise contrastive estimation (NCE) | Sampling | Consistent estimator; converges to MLE as K grows | Yes (as a parameter) | 25-100 | O(K) | Language modeling, density estimation |
| Negative sampling | Sampling | Not consistent for MLE | No (fixed to 1) | 5-20 | O(K) | Word embeddings (Word2Vec) |
| Sampled softmax | Sampling | Biased; bias decreases with K | No | 100-10,000 | O(K) | Language models, retrieval, recommendations |
| Importance sampling | Sampling | Unbiased estimate of partition function | No | 100-1,000 | O(K) | Neural machine translation |
| Hierarchical softmax | Structural | Exact (given the tree) | Computed via tree | N/A | O(log \|V\|) | Language models with large vocabularies |
| Adaptive softmax | Structural | Exact (approximate architecture) | Computed per cluster | N/A | O(\|V\|) worst case, much faster in practice | GPU-based language modeling |
The choice of noise distribution Q has a significant impact on training quality and convergence speed. Several options are commonly used:
| Distribution | Formula | Properties | Used in |
|---|---|---|---|
| Uniform | Q(c) = 1/\|V\| | Simple; treats all classes equally; may waste samples on irrelevant classes | Baseline comparisons |
| Unigram | Q(c) = freq(c) / N | Matches empirical frequency; popular classes sampled more often | Language modeling, recommendations |
| Smoothed unigram | Q(c) = freq(c)^alpha / Z, alpha < 1 | Gives rare classes more representation; alpha = 3/4 common | Word2Vec negative sampling |
| Log-uniform | Q(c) proportional to 1/(rank(c) + 1) | Approximates Zipfian distribution; used in TensorFlow's log_uniform_candidate_sampler | TensorFlow language models |
| Adaptive / learned | Q updated during training | Reduces variance; better matches true distribution | Adaptive importance sampling |
| In-batch negatives | Other examples in the same mini-batch serve as negatives | No extra sampling needed; efficient on GPUs; introduces popularity bias | Contrastive learning, two-tower retrieval |
When using in-batch negatives (as in many contrastive learning and retrieval systems), popular items appear as negatives more frequently than rare items, creating a sampling bias. The logQ correction addresses this by subtracting log(Q(c)) from the logit of each negative candidate, where Q(c) is estimated from the item's frequency in the training corpus.
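A minimal two-tower sketch of in-batch negatives with this correction. Following common practice, the correction is applied to every in-batch candidate's logit, and the frequency-based estimate of Q is assumed to be precomputed; all names and shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def in_batch_softmax_loss(user_emb, item_emb, item_log_q):
    """In-batch sampled softmax with a logQ correction.

    user_emb:   [batch, dim]  query-tower outputs
    item_emb:   [batch, dim]  item-tower outputs for each row's positive item
    item_log_q: [batch]       log of each item's sampling probability,
                              estimated from its frequency in the training data
    """
    logits = user_emb @ item_emb.t()            # [batch, batch]; row i's positive is column i
    logits = logits - item_log_q.unsqueeze(0)   # subtract log Q(c) from each candidate's logit
    labels = torch.arange(user_emb.size(0))
    return F.cross_entropy(logits, labels)
```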
Candidate sampling introduces bias because the loss is computed over a non-representative subset of classes. Several correction techniques mitigate this:
LogQ correction. For sampled softmax, the logit for each sampled negative is adjusted by subtracting log(K * Q(c)), where Q(c) is the probability of sampling class c. This correction accounts for the fact that some classes are sampled more frequently than others. If Q matched the true softmax distribution exactly, the correction would produce unbiased gradients, but since this is generally not the case, a residual bias remains.
Rejection sampling. When a sampled negative accidentally matches the target class, it should be discarded (rejected) and replaced with a new sample. Without this step, the target class would be simultaneously pushed up (as the positive) and pushed down (as a negative), introducing noise.
Importance weighting. Each sampled class's contribution to the loss is multiplied by the ratio of its true probability to its sampling probability. This produces an unbiased estimate in expectation but can have high variance if the sampling distribution is far from the target distribution.
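For rejection sampling specifically, a small sketch of the discard-and-replace variant for a single example is shown below; the masking approach used in the sampled softmax sketch earlier achieves the same effect without a loop. The function name and arguments are illustrative assumptions.

```python
import torch

def sample_negatives_excluding_target(noise_probs, num_negatives, target_class):
    """Draw K negatives from Q, redrawing any that collide with the true class."""
    negatives = torch.multinomial(noise_probs, num_negatives, replacement=True)
    hits = negatives == target_class
    while hits.any():
        # Replace only the colliding samples and re-check.
        negatives[hits] = torch.multinomial(noise_probs, int(hits.sum()), replacement=True)
        hits = negatives == target_class
    return negatives
```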
Candidate sampling first gained widespread adoption in NLP through Word2Vec. The Skip-gram model with negative sampling (SGNS) learns word embeddings by training a binary classifier to distinguish genuine word-context pairs from randomly corrupted ones. This approach, introduced in 2013, enabled training on billion-word corpora in hours rather than days.
NCE-based training was subsequently adopted for neural language models. Mnih and Teh (2012) showed that NCE could train neural language models with accuracy comparable to full softmax, and Zoph et al. (2016) applied it to large-scale recurrent language models. Importance sampling was used by Jean et al. (2015) to train neural machine translation models over vocabularies of 500,000 words.
Modern subword tokenization methods such as byte pair encoding (BPE) and SentencePiece have reduced vocabulary sizes to 30,000-100,000 tokens, which lessens the need for candidate sampling in standard language models. However, candidate sampling remains relevant for retrieval-augmented generation, entity linking, and other tasks where the output space is large.
In large-scale recommendation, the model must score millions of candidate items for each user. Two-tower architectures (also called dual-encoder models) are a common approach: one tower encodes the user and the other encodes the item, and relevance is measured by the dot product of their embeddings. During training, the full softmax over all items is replaced with sampled softmax or in-batch negative sampling.
Google's YouTube recommendation system uses a two-tower deep retrieval model trained with candidate sampling to retrieve personalized video suggestions from a corpus of tens of millions of videos. The system uses mixed negative sampling, combining in-batch negatives with uniformly sampled negatives to counteract the selection bias of implicit user feedback.
A common challenge in recommendation is the popularity bias introduced by in-batch negatives. The logQ correction, proposed by Yi et al. (2019) in the paper "Sampling-Bias-Corrected Neural Modeling for Large Corpus Item Recommendations," adjusts logits during training to prevent the model from unfairly penalizing popular items that appear as negatives more frequently.
Candidate sampling ideas underpin modern self-supervised learning methods. The InfoNCE loss, introduced by van den Oord et al. (2018) in Contrastive Predictive Coding (CPC), extends NCE to learn representations by maximizing a lower bound on the mutual information between different views of the same data point. The loss treats one positive pair and K negative pairs as inputs to a (K+1)-way classification problem:
L_InfoNCE = -log(exp(sim(q, k+)) / (exp(sim(q, k+)) + sum_{j=1}^{K} exp(sim(q, k_j-))))
where sim denotes a similarity function (typically cosine similarity or dot product).
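A minimal sketch of this loss using cosine similarity and a temperature, which is a common though not universal addition to the formula above; the explicit per-query negatives layout and all names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(query, positive_key, negative_keys, temperature=0.1):
    """InfoNCE as a (K+1)-way classification problem.

    query:         [batch, dim]
    positive_key:  [batch, dim]     one positive per query
    negative_keys: [batch, K, dim]  K negatives per query
    """
    q = F.normalize(query, dim=1)
    kp = F.normalize(positive_key, dim=1)
    kn = F.normalize(negative_keys, dim=2)

    pos = (q * kp).sum(dim=1, keepdim=True)            # [batch, 1] cosine similarities
    neg = torch.bmm(kn, q.unsqueeze(2)).squeeze(2)     # [batch, K]

    logits = torch.cat([pos, neg], dim=1) / temperature
    labels = torch.zeros(query.size(0), dtype=torch.long)   # positive is always column 0
    return F.cross_entropy(logits, labels)
```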
InfoNCE forms the foundation of influential frameworks including SimCLR, MoCo, and CLIP (see the timeline below).
The quality of negative samples matters significantly in contrastive learning. "Hard negatives" (negatives that are similar to the positive) provide stronger learning signals but can also destabilize training if not handled carefully.
Beyond contrastive learning, candidate sampling is used in object detection and image recognition tasks with very large label sets. Face recognition models, for example, must distinguish among millions of identities, and sampled softmax or ArcFace-style losses with candidate sampling make training feasible.
The number of negative samples K involves a tradeoff between computational cost and approximation quality. As general guidelines, negative sampling for word embeddings works well with K between 5 and 20, NCE for language modeling typically uses roughly 25 to 100 noise samples, and sampled softmax and importance sampling for retrieval and translation vocabularies commonly use several hundred to a few thousand negatives; larger K improves the approximation at the cost of more computation per step.
Candidate sampling provides the greatest benefit when the output space contains hundreds of thousands to millions of classes, so that the softmax over all classes dominates the cost of the forward and backward passes.
Subword tokenization has reduced vocabulary sizes in many NLP applications, diminishing the need for candidate sampling in standard language modeling. However, it remains essential for retrieval, recommendation, entity linking, extreme multi-label classification, and any task where the output space is inherently large.
A practical optimization is to draw one set of K negatives and share them across all examples in a mini-batch. This reduces the overhead of sampling and can be implemented efficiently as a single matrix multiplication. TensorFlow's sampled_softmax_loss and nce_loss functions use this strategy by default.
Candidate sampling is a training-time technique. At inference time, the model still needs to produce scores for all classes (or use an approximate nearest neighbor index for retrieval). Methods like hierarchical softmax, adaptive softmax, and differentiated softmax can provide speedups at both training and inference time, unlike sampling-based approaches that only accelerate training.
The different candidate sampling losses can be understood as points on a spectrum. Let s(c) denote the model's score for class c and Q(c) the noise distribution.
Full softmax cross-entropy:
L_softmax = -s(t) + log(sum_{c=1}^{|V|} exp(s(c)))
Sampled softmax: replaces the sum over |V| with a sum over the target and K sampled negatives, with bias correction:
L_SS ≈ -s(t) + log(exp(s(t)) + sum_{k=1}^{K} exp(s(c_k) - log(K * Q(c_k))))
NCE: reformulates as K + 1 binary classification problems:
L_NCE = -log(sigma(s(t) - log(K * Q(t)))) - sum_{k=1}^{K} log(sigma(-(s(c_k) - log(K * Q(c_k)))))
Negative sampling: simplifies NCE by setting the partition function to 1 (equivalent to dropping the log(K * Q) correction under certain assumptions):
L_NEG = -log(sigma(s(t))) - sum_{k=1}^{K} log(sigma(-s(c_k)))
As K grows, the sampled softmax loss approaches the full softmax loss and the NCE gradient approaches the full softmax gradient. Negative sampling does not share this convergence property, which is why it is effective for embedding learning but not suitable for tasks requiring calibrated probability estimates.
Major deep learning frameworks provide built-in support for candidate sampling:
| Framework | Function / Module | Method | Notes |
|---|---|---|---|
| TensorFlow | tf.nn.sampled_softmax_loss | Sampled softmax | Supports logQ correction; shared negatives across batch |
| TensorFlow | tf.nn.nce_loss | NCE | Binary classification formulation |
| TensorFlow | tf.random.log_uniform_candidate_sampler | Sampling utility | Log-uniform distribution for Zipfian vocabularies |
| TensorFlow | tf.random.fixed_unigram_candidate_sampler | Sampling utility | Custom unigram distribution |
| PyTorch | Custom implementation required | Various | No built-in sampled softmax; community libraries available |
| PyTorch | torch.nn.functional.cross_entropy with subset | Sampled softmax | Can be implemented manually by indexing into the weight matrix |
| JAX / Flax | Custom implementation | Various | Efficient with jax.numpy indexing |
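A hedged usage sketch of tf.nn.sampled_softmax_loss from the table above; the toy shapes and variable names are assumptions, and the per-example losses it returns are averaged manually.

```python
import tensorflow as tf

num_classes, dim, batch, num_sampled = 1_000_000, 256, 32, 100

weights = tf.Variable(tf.random.normal([num_classes, dim]))   # output embedding matrix
biases = tf.Variable(tf.zeros([num_classes]))
inputs = tf.random.normal([batch, dim])                        # final hidden states
labels = tf.random.uniform([batch, 1], maxval=num_classes, dtype=tf.int64)

loss = tf.reduce_mean(
    tf.nn.sampled_softmax_loss(
        weights=weights,
        biases=biases,
        labels=labels,                 # shape [batch, num_true]
        inputs=inputs,
        num_sampled=num_sampled,       # K negatives shared across the batch
        num_classes=num_classes,
        remove_accidental_hits=True,   # discard negatives that equal the target
    )
)
```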
| Year | Development | Authors |
|---|---|---|
| 2003 | Importance sampling for neural language models | Bengio and Senecal |
| 2005 | Hierarchical softmax | Morin and Bengio |
| 2010 | Noise contrastive estimation (NCE) | Gutmann and Hyvarinen |
| 2012 | NCE applied to neural language models | Mnih and Teh |
| 2013 | Negative sampling in Word2Vec | Mikolov, Sutskever, Chen, Corrado, Dean |
| 2014 | GloVe word embeddings | Pennington, Socher, Manning |
| 2015 | Importance sampling for neural machine translation | Jean, Cho, Memisevic, Bengio |
| 2017 | Adaptive softmax for GPUs | Grave, Joulin, Cisse, Grangier, Jegou |
| 2018 | InfoNCE and Contrastive Predictive Coding | van den Oord, Li, Vinyals |
| 2019 | Sampling-bias-corrected retrieval (logQ correction) | Yi, Yang, et al. (Google) |
| 2020 | SimCLR contrastive learning framework | Chen, Kornblith, Norouzi, Hinton |
| 2020 | MoCo momentum contrastive learning | He, Fan, Wu, Xie, Girshick |
| 2021 | CLIP cross-modal contrastive learning | Radford et al. (OpenAI) |