Upweighting is the practice of assigning a larger weight to certain training examples (or groups of examples) so they contribute more to the loss function and gradient updates than the rest of the data. The technique is sometimes called instance weighting, sample weighting, or example reweighting. It is one of the most common tools for handling imbalanced datasets, correcting for distribution shift, mixing pretraining corpora, and emphasizing high-quality or high-cost data points, all without changing the dataset itself.
Upweighting differs from upsampling, which physically duplicates rows, and from downsampling, which discards majority class rows. With upweighting the dataset stays the same size; only the contribution of each row to the loss changes. The two approaches give the same expected gradient but produce different gradient noise, which matters in practice for stochastic optimizers.
A standard supervised loss treats every training example equally:
L = (1/N) * sum_i loss(y_i, y_hat_i)
A weighted loss attaches a non-negative scalar w_i to each example:
L = (1/sum_i w_i) * sum_i w_i * loss(y_i, y_hat_i)
If w_i is larger than 1 the example is upweighted. If w_i is between 0 and 1 it is downweighted. Setting all w_i to 1 recovers the standard loss. Weights act as multipliers on the per-example gradient, so a row with weight 5 produces five times the gradient signal of a row with weight 1 during backpropagation. Frameworks usually expose this through a sample weight array passed alongside the labels, or through a per-class weight vector applied automatically based on the label of each row.
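The arithmetic is easy to see directly. A minimal NumPy sketch (the arrays here are illustrative, not tied to any framework):

import numpy as np

def weighted_loss(per_example_loss, w):
    # Scale each row's loss by its weight, then normalize by the total weight
    return np.sum(w * per_example_loss) / np.sum(w)

per_example_loss = np.array([0.7, 0.1, 0.2, 0.4])
w = np.array([5.0, 1.0, 1.0, 1.0])           # upweight the first row 5x
print(weighted_loss(per_example_loss, w))    # 0.525, vs 0.35 with uniform weights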
Class imbalance is the textbook motivation. In a fraud detection dataset where 0.5% of transactions are fraudulent, a model trained with uniform weights can reach 99.5% accuracy by always predicting "not fraud." Assigning a much larger weight to fraud examples forces the optimizer to take their misclassification cost seriously. This connects directly to class weight schemes and cost-sensitive learning, where misclassification costs vary across classes.
The scenarios below cover the common reasons practitioners reach for upweighting.
| Scenario | What gets upweighted | Why |
|---|---|---|
| Class imbalance | The minority class | Prevent the model from collapsing onto the majority label |
| Cost-sensitive learning | High-cost mistakes (often false negatives) | Match the loss function to real-world cost structure |
| Importance sampling | Rare but consequential samples | Get an unbiased estimate of an integral or expectation |
| Covariate shift correction | Training points whose features look like the test set | Recover an unbiased risk estimate when train and test distributions differ |
| Domain adaptation | Source-domain points that resemble the target domain | Transfer information across distributions |
| Self-training and semi-supervised learning | Pseudo-labels with high confidence scores | Trust the model's own labels in proportion to its certainty |
| Active learning | Newly labeled informative points | Reflect that recently queried samples were chosen because they were valuable |
| LLM pretraining | Higher quality or higher signal data domains | Allocate more of the compute budget to domains that improve downstream loss |
| RLHF and preference learning | High-confidence or high-margin preferences | Avoid letting noisy comparisons drown out clean ones |
scikit-learn exposes upweighting through two arguments. The class_weight parameter applies a weight per class label. The sample_weight parameter, passed to .fit(), applies a weight per row. Both can be combined; the effective weight on a row is the product of the two.
When class_weight='balanced' is set, scikit-learn computes the weight for each class with the formula:
weight(c) = n_samples / (n_classes * count(c))
In binary classification with n_pos positive and n_neg negative examples this reduces to:
w_pos = n_total / (2 * n_pos)
w_neg = n_total / (2 * n_neg)
For a 1:99 split the formula yields w_pos = 50 and w_neg = 0.505, so the rare class contributes roughly 99 times the per-row gradient of the common class. The same formula is used by compute_class_weight and compute_sample_weight in sklearn.utils.class_weight. A custom dictionary like class_weight={0: 1, 1: 10} overrides the heuristic for cases where the analyst knows the cost ratio.
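The heuristic is quick to verify. A sketch using compute_class_weight on a synthetic 1:99 label vector:

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y = np.array([0] * 99 + [1])   # 99 negatives, 1 positive
print(compute_class_weight(class_weight='balanced', classes=np.array([0, 1]), y=y))
# [ 0.50505051 50.        ]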
Upweighting a row by a factor of w and inserting w copies of that row both change the expected gradient in the same way. They are equivalent in expectation but not in stochastic behavior. An, Ying, and collaborators showed in 2020 that resampling produces lower gradient noise around the optimum than reweighting when training with stochastic gradient descent, because reweighting introduces large-magnitude updates from rare events that destabilize the iterates. Reweighting, on the other hand, requires no extra memory, leaves the dataset clean, and integrates naturally with k-fold cross validation. The choice often comes down to whether the optimizer is stable enough to handle large weight ratios.
A hybrid approach is also common: run mild oversampling to keep the minority class visible in every batch, then apply moderate upweighting on top to fine-tune the cost balance. Combining SMOTE with class weighting is a frequent recipe for tabular fraud and medical screening problems.
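A sketch of that hybrid using the imbalanced-learn package; the sampling ratio and class weights are illustrative, and X_train / y_train are assumed to exist:

from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression

# Mild oversampling: raise the minority class to 10% of the majority (illustrative ratio)
X_res, y_res = SMOTE(sampling_strategy=0.1, random_state=0).fit_resample(X_train, y_train)

# Moderate upweighting on top to fine-tune the cost balance (illustrative weights)
clf = LogisticRegression(class_weight={0: 1, 1: 5}).fit(X_res, y_res)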
Focal loss, introduced by Lin and colleagues at ICCV 2017 for dense object detection, is a smooth dynamic version of upweighting. The standard cross-entropy loss treats every misclassified example with the same weight, even when the model is already very confident. Focal loss multiplies cross-entropy by a focusing term that shrinks the contribution of easy examples and leaves the contribution of hard ones almost untouched.
The formula is:
FL(p_t) = -alpha * (1 - p_t)^gamma * log(p_t)
where p_t is the model's predicted probability for the true class, alpha is a per-class weighting factor, and gamma is a focusing parameter. With gamma = 2 and a confident easy example at p_t = 0.9, the modulating factor is (1 - 0.9)^2 = 0.01, shrinking that example's contribution by 100x. The same example would contribute the full cross-entropy if gamma = 0. The RetinaNet paper reported alpha = 0.25 and gamma = 2 as the values that worked best on COCO. Focal loss is therefore a way to upweight hard examples implicitly by downweighting easy ones, which is mathematically the same up to a normalization constant.
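A compact PyTorch sketch of the loss, using a scalar alpha for brevity (the paper's alpha_t varies with the true class):

import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    # Unreduced cross-entropy is -log(p_t) per example
    ce = F.cross_entropy(logits, targets, reduction='none')
    p_t = torch.exp(-ce)                              # recover p_t from -log(p_t)
    return (alpha * (1.0 - p_t) ** gamma * ce).mean()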
In covariate shift the marginal input distribution changes between training and deployment while the conditional distribution P(y|x) stays fixed. When the model is misspecified, standard maximum likelihood is no longer consistent for the deployment risk. Sugiyama and Kawanabe showed that a weighted loss with importance weights restores consistency:
w(x) = p_test(x) / p_train(x)
Each training example is upweighted in proportion to how much more frequently a similar input appears in the test distribution. Examples that look like test inputs get larger weights; examples that are over-represented in training get smaller ones. The same density-ratio idea drives off-policy correction in reinforcement learning and propensity weighting in causal inference. Direct density-ratio estimators such as KLIEP and uLSIF are usually preferred over estimating the two densities separately, since density estimation in high dimensions is known to be unstable.
A practical caveat: if the density ratio has heavy tails the variance of the importance-weighted loss explodes, and clipping or self-normalizing the weights becomes necessary. Practitioners often cap weights at a fixed multiple of the median or use a Pareto-smoothed estimator.
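A sketch of capping and self-normalizing, assuming the two density estimates (or a direct ratio estimate) are already in hand; the cap multiple is illustrative:

import numpy as np

def clipped_importance_weights(p_test, p_train, cap_multiple=10.0):
    w = p_test / np.maximum(p_train, 1e-12)           # w(x) = p_test(x) / p_train(x)
    w = np.minimum(w, cap_multiple * np.median(w))    # cap to tame heavy tails
    return w / w.mean()                               # self-normalize to mean weight 1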
Upweighting at the corpus level is now standard practice in language model pretraining. The Llama 1 paper, for example, used a hand-tuned mixture in which Wikipedia and books were sampled at a higher rate relative to their size than CommonCrawl. These domain weights are essentially a global form of upweighting, applied at the data loader rather than the loss.
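At its simplest, loader-level upweighting just samples domains in proportion to a weight vector. A toy sketch with made-up weights (not Llama's actual mixture):

import numpy as np

rng = np.random.default_rng(0)
domains = ['commoncrawl', 'wikipedia', 'books']
mixture = np.array([0.70, 0.15, 0.15])    # hypothetical domain weights, summing to 1
batch_domains = rng.choice(domains, size=8, p=mixture)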
DoReMi (Xie et al., NeurIPS 2023) automated this choice. The method trains a small proxy model with Group Distributionally Robust Optimization to find domain weights that minimize worst-case excess loss across domains, then uses those weights to train a much larger model. Reported gains on The Pile included a 2.6x speedup to baseline accuracy and a 6.5-point improvement in average few-shot accuracy at the same training budget. Notably DoReMi sometimes lowers the weight on a domain and still improves perplexity on it, because the freed-up budget improves shared representations.
Upweighting also appears inside RLHF training pipelines. Preferences with larger reward-model margins, or those flagged as high quality by human annotators, are sometimes weighted more heavily during reward model training and policy optimization. The same logic shows up in supervised fine-tuning, where curated instruction data is typically given a higher sampling weight than scraped instruction data.
Most machine learning libraries support upweighting through one or both of two mechanisms: a per-class weight array and a per-row weight array. The table below summarizes the API surface in widely used frameworks.
| Framework | Per-class weights | Per-sample weights | Notes |
|---|---|---|---|
| scikit-learn | class_weight={0:1, 1:10} or class_weight='balanced' | sample_weight=array passed to .fit() | Most estimators support both; effective weight is the product |
| PyTorch | nn.CrossEntropyLoss(weight=tensor) and nn.BCEWithLogitsLoss(pos_weight=...) | Multiply the per-element loss by a weight tensor and sum manually, or use WeightedRandomSampler | Loss functions provide a reduction='none' mode for custom weighting |
| TensorFlow / Keras | class_weight={0:1, 1:10} argument to model.fit() | sample_weight=array argument to model.fit() | Both accepted simultaneously |
| XGBoost | scale_pos_weight=ratio for binary; per-row weight in DMatrix | weight field on DMatrix | scale_pos_weight is binary-only; multiclass needs sample weights |
| LightGBM | class_weight='balanced' or dictionary; is_unbalance and scale_pos_weight | weight argument to Dataset | Multiple knobs that interact; documentation recommends choosing one |
| CatBoost | class_weights=[1, 10] or auto_class_weights='Balanced' | sample_weight argument | Auto mode mirrors the scikit-learn formula |
A typical scikit-learn workflow with custom weights looks like this:
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_sample_weight

# One weight per row, derived from the 'balanced' per-class heuristic
weights = compute_sample_weight(class_weight='balanced', y=y_train)
model = LogisticRegression().fit(X_train, y_train, sample_weight=weights)
In PyTorch, weights are usually attached to the loss object once and reused across batches:
import torch

# One weight per class, indexed by class label
class_weights = torch.tensor([0.5, 2.0, 1.0])
criterion = torch.nn.CrossEntropyLoss(weight=class_weights)
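For per-row rather than per-class weights, the usual pattern is reduction='none' followed by a manual weighted mean; a minimal sketch with dummy tensors:

import torch

criterion = torch.nn.CrossEntropyLoss(reduction='none')   # per-row losses
logits = torch.randn(8, 3)                                # dummy batch: 8 rows, 3 classes
targets = torch.randint(0, 3, (8,))
row_weights = torch.ones(8)
row_weights[0] = 5.0                                      # upweight the first row 5x
loss = (row_weights * criterion(logits, targets)).sum() / row_weights.sum()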
For binary problems with severe imbalance, BCEWithLogitsLoss(pos_weight=...) accepts a single positive-class multiplier that plays the same role as XGBoost's scale_pos_weight.
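With roughly 99 negatives per positive, for instance, a one-element tensor supplies that multiplier (the ratio is illustrative):

import torch

criterion = torch.nn.BCEWithLogitsLoss(pos_weight=torch.tensor([99.0]))   # ~ n_neg / n_pos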
Upweighting is one of several families of methods for handling imbalance and shifted distributions. The choice depends on dataset size, optimizer behavior, and whether the cost ratio is known.
| Technique | What it does | When to prefer it |
|---|---|---|
| Upweighting (this article) | Multiplies the loss contribution of selected rows | Default first try; cheap, no extra storage, integrates with all gradient-based learners |
| Upsampling | Duplicates minority rows in the training set | Low compute budget per epoch; small datasets where each batch must contain minority examples |
| Downsampling | Drops majority rows | Very large majority class; willing to sacrifice information for training speed |
| SMOTE | Synthesizes new minority points by interpolation | Tabular data with continuous features; minority class too small for plain duplication |
| Focal loss | Soft dynamic downweighting of easy examples | Object detection and dense prediction; when most examples are trivially classified |
| Cost-sensitive learning | Builds explicit cost matrix into the loss | Real-world cost ratios are known and asymmetric (medical screening, fraud) |
| Threshold tuning | Adjusts the decision threshold post-hoc | Model is calibrated; quick fix without retraining |
Upweighting is not a free lunch. Several failure modes show up in practice.
Extreme weight ratios destabilize training. A weight of 1000 on a single row produces a gradient step a thousand times larger than a typical step, which can blow up the parameters or push the optimizer into a bad region of the loss surface. Gradient clipping and weight smoothing (for example, capping weights at the 99th percentile) are common defenses.
Upweighting amplifies label noise. If a heavily upweighted example happens to have an incorrect label, the model is forced to fit that error. Practitioners often apply a confident-learning pass to clean labels before relying on heavy weights.
Validation sets should not be upweighted. The point of validation is to measure honest performance on a representative sample, so weighted validation loss does not match the deployment metric. Validation should use uniform weights or weights matching the test distribution. The same applies to calibration: a model trained with class weights has shifted predicted probabilities and usually needs Platt scaling or isotonic regression before its outputs can be read as calibrated probabilities.
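A common post-hoc recipe in scikit-learn, sketched here with Platt scaling (the estimator choice and data splits are assumptions for illustration):

from sklearn.calibration import CalibratedClassifierCV
from sklearn.linear_model import LogisticRegression

base = LogisticRegression(class_weight='balanced')    # weighted training shifts probabilities
calibrated = CalibratedClassifierCV(base, method='sigmoid', cv=5)
calibrated.fit(X_train, y_train)                      # calibration folds use uniform weights
probs = calibrated.predict_proba(X_test)[:, 1]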
Upweighting changes effective sample size. The variance of a weighted estimator scales with sum(w^2) / (sum(w))^2, so heavily skewed weights reduce statistical power. This is the effective sample size formula from survey statistics, and it explains why importance-weighted estimators with heavy tails often need many more samples than they appear to.
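The effective sample size itself is one line to compute; a sketch with a made-up weight vector:

import numpy as np

def effective_sample_size(w):
    # Kish's formula: (sum w)^2 / sum(w^2)
    return w.sum() ** 2 / (w ** 2).sum()

w = np.ones(100)
w[0] = 50.0                          # one heavily upweighted row
print(effective_sample_size(w))      # ~8.5: nominal n=100, but most power is gone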
Finally, upweighting alone cannot fix a fundamentally biased data collection pipeline. If the minority class is missing entire subpopulations, no per-row weight can recover information that was never collected.
Imagine a teacher grading a class of 100 students. Ninety-nine of them are studying for an easy quiz, and one is studying for a really important final exam. If the teacher spends the same amount of time on each student, the one with the big exam barely gets any help. Upweighting is like telling the teacher: "Spend ten minutes per student on the quiz kids and an hour on the exam student." The teacher now pays much more attention to the rare, important case. In machine learning, the model is the teacher, and the weights tell it which examples to take more seriously.