Upweighting is the practice of assigning a larger weight to certain training examples (or groups of examples) so they contribute more to the loss function and gradient updates than the rest of the data. The technique is sometimes called instance weighting, sample weighting, or example reweighting. It is one of the most common tools for handling imbalanced datasets, correcting for distribution shift, mixing pretraining corpora, and emphasizing high-quality or high-cost data points, all without changing the dataset itself.
Upweighting differs from upsampling, which physically duplicates rows, and from downsampling, which discards majority class rows. With upweighting the dataset stays the same size; only the contribution of each row to the loss changes. The two approaches give the same expected gradient but produce different gradient noise, which matters in practice for stochastic optimizers.
A standard supervised loss treats every training example equally:
L = (1/N) * sum_i loss(y_i, y_hat_i)
A weighted loss attaches a non-negative scalar w_i to each example:
L = (1/sum_i w_i) * sum_i w_i * loss(y_i, y_hat_i)
If w_i is larger than 1 the example is upweighted. If w_i is between 0 and 1 it is downweighted. Setting all w_i to 1 recovers the standard loss. Weights act as multipliers on the per-example gradient, so a row with weight 5 produces five times the gradient signal of a row with weight 1 during backpropagation. Frameworks usually expose this through a sample weight array passed alongside the labels, or through a per-class weight vector applied automatically based on the label of each row.
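The arithmetic is easy to see directly. A minimal NumPy sketch (the arrays here are illustrative, not tied to any framework):

import numpy as np

def weighted_loss(per_example_loss, w):
    # Scale each row's loss by its weight, then normalize by the total weight
    return np.sum(w * per_example_loss) / np.sum(w)

per_example_loss = np.array([0.7, 0.1, 0.2, 0.4])
w = np.array([5.0, 1.0, 1.0, 1.0])           # upweight the first row 5x
print(weighted_loss(per_example_loss, w))    # 0.525, vs 0.35 with uniform weights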
Class imbalance is the textbook motivation. In a fraud detection dataset where 0.5% of transactions are fraudulent, a model trained with uniform weights can reach 99.5% accuracy by always predicting "not fraud." Assigning a much larger weight to fraud examples forces the optimizer to take their misclassification cost seriously. This connects directly to class weight schemes and cost-sensitive learning, where misclassification costs vary across classes.
The scenarios below cover the common reasons practitioners reach for upweighting.
| Scenario | What gets upweighted | Why |
|---|---|---|
| Class imbalance | The minority class | Prevent the model from collapsing onto the majority label |
| Cost-sensitive learning | High-cost mistakes (often false negatives) | Match the loss function to real-world cost structure |
| Importance sampling | Rare but consequential samples | Get an unbiased estimate of an integral or expectation |
| Covariate shift correction | Training points whose features look like the test set | Recover an unbiased risk estimate when train and test distributions differ |
| Domain adaptation | Source-domain points that resemble the target domain | Transfer information across distributions |
| Self-training and semi-supervised learning | Pseudo-labels with high confidence scores | Trust the model's own labels in proportion to its certainty |
| Active learning | Newly labeled informative points | Reflect that recently queried samples were chosen because they were valuable |
| LLM pretraining | Higher quality or higher signal data domains | Allocate more of the compute budget to domains that improve downstream loss |
| RLHF and preference learning | High-confidence or high-margin preferences | Avoid letting noisy comparisons drown out clean ones |
scikit-learn exposes upweighting through two arguments. The class_weight parameter applies a weight per class label. The sample_weight parameter, passed to .fit(), applies a weight per row. Both can be combined; the effective weight on a row is the product of the two.
When class_weight='balanced' is set, scikit-learn computes the weight for each class with the formula:
weight(c) = n_samples / (n_classes * count(c))
In binary classification with n_pos positive and n_neg negative examples this reduces to:
w_pos = n_total / (2 * n_pos)
w_neg = n_total / (2 * n_neg)
For a 1:99 split the formula yields w_pos = 50 and w_neg = 0.505, so the rare class contributes roughly 99 times the per-row gradient of the common class. The same formula is used by compute_class_weight and compute_sample_weight in sklearn.utils.class_weight. A custom dictionary like class_weight={0: 1, 1: 10} overrides the heuristic for cases where the analyst knows the cost ratio.
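The heuristic is quick to verify. A sketch using compute_class_weight on a synthetic 1:99 label vector:

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y = np.array([0] * 99 + [1])   # 99 negatives, 1 positive
print(compute_class_weight(class_weight='balanced', classes=np.array([0, 1]), y=y))
# [ 0.50505051 50.        ]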
Upweighting a row by a factor of w and inserting w copies of that row both change the expected gradient in the same way. They are equivalent in expectation but not in stochastic behavior. An, Ying, and collaborators showed in 2020 that resampling produces lower gradient noise around the optimum than reweighting when training with stochastic gradient descent, because reweighting introduces large-magnitude updates from rare events that destabilize the iterates. Reweighting, on the other hand, requires no extra memory, leaves the dataset clean, and integrates naturally with k-fold cross validation. The choice often comes down to whether the optimizer is stable enough to handle large weight ratios.
A hybrid approach is also common: run mild oversampling to keep the minority class visible in every batch, then apply moderate upweighting on top to fine-tune the cost balance. Combining SMOTE with class weighting is a frequent recipe for tabular fraud and medical screening problems.
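A sketch of that hybrid using the imbalanced-learn package; the sampling ratio and class weights are illustrative, and X_train / y_train are assumed to exist:

from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression

# Mild oversampling: raise the minority class to 10% of the majority (illustrative ratio)
X_res, y_res = SMOTE(sampling_strategy=0.1, random_state=0).fit_resample(X_train, y_train)

# Moderate upweighting on top to fine-tune the cost balance (illustrative weights)
clf = LogisticRegression(class_weight={0: 1, 1: 5}).fit(X_res, y_res)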
Focal loss, introduced by Lin and colleagues at ICCV 2017 for dense object detection, is a smooth dynamic version of upweighting. The standard cross-entropy loss treats every misclassified example with the same weight, even when the model is already very confident. Focal loss multiplies cross-entropy by a focusing term that shrinks the contribution of easy examples and leaves the contribution of hard ones almost untouched.
The formula is:
FL(p_t) = -alpha * (1 - p_t)^gamma * log(p_t)
where p_t is the model's predicted probability for the true class, alpha is a per-class weighting factor, and gamma is a focusing parameter. With gamma = 2 and a confident easy example at p_t = 0.9, the modulating factor is (1 - 0.9)^2 = 0.01, shrinking that example's contribution by 100x. The same example would contribute the full cross-entropy if gamma = 0. The RetinaNet paper reported alpha = 0.25 and gamma = 2 as the values that worked best on COCO. Focal loss is therefore a way to upweight hard examples implicitly by downweighting easy ones, which is mathematically the same up to a normalization constant.
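A compact PyTorch sketch of the loss, using a scalar alpha for brevity (the paper's alpha_t varies with the true class):

import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    # Unreduced cross-entropy is -log(p_t) per example
    ce = F.cross_entropy(logits, targets, reduction='none')
    p_t = torch.exp(-ce)                              # recover p_t from -log(p_t)
    return (alpha * (1.0 - p_t) ** gamma * ce).mean()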
In covariate shift the marginal input distribution changes between training and deployment while the conditional distribution P(y|x) stays fixed. When the model is misspecified, standard maximum likelihood is no longer consistent for the deployment risk. Sugiyama and Kawanabe showed that a weighted loss with importance weights restores consistency:
w(x) = p_test(x) / p_train(x)
Each training example is upweighted in proportion to how much more frequently a similar input appears in the test distribution. Examples that look like test inputs get larger weights; examples that are over-represented in training get smaller ones. The same density-ratio idea drives off-policy correction in reinforcement learning and propensity weighting in causal inference. Direct density-ratio estimators such as KLIEP and uLSIF are usually preferred over estimating the two densities separately, since density estimation in high dimensions is known to be unstable.
A practical caveat: if the density ratio has heavy tails the variance of the importance-weighted loss explodes, and clipping or self-normalizing the weights becomes necessary. Practitioners often cap weights at a fixed multiple of the median or use a Pareto-smoothed estimator.
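A sketch of capping and self-normalizing, assuming the two density estimates (or a direct ratio estimate) are already in hand; the cap multiple is illustrative:

import numpy as np

def clipped_importance_weights(p_test, p_train, cap_multiple=10.0):
    w = p_test / np.maximum(p_train, 1e-12)           # w(x) = p_test(x) / p_train(x)
    w = np.minimum(w, cap_multiple * np.median(w))    # cap to tame heavy tails
    return w / w.mean()                               # self-normalize to mean weight 1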
Upweighting at the corpus level is now standard practice in language model pretraining. The Llama 1 paper, for example, used a hand-tuned mixture in which Wikipedia and books were sampled at a higher rate relative to their size than CommonCrawl. These domain weights are essentially a global form of upweighting, applied at the data loader rather than the loss.
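At its simplest, loader-level upweighting just samples domains in proportion to a weight vector. A toy sketch with made-up weights (not Llama's actual mixture):

import numpy as np

rng = np.random.default_rng(0)
domains = ['commoncrawl', 'wikipedia', 'books']
mixture = np.array([0.70, 0.15, 0.15])    # hypothetical domain weights, summing to 1
batch_domains = rng.choice(domains, size=8, p=mixture)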
DoReMi (Xie et al., NeurIPS 2023) automated this choice. The method trains a small proxy model with Group Distributionally Robust Optimization to find domain weights that minimize worst-case excess loss across domains, then uses those weights to train a much larger model. Reported gains on The Pile included a 2.6x speedup to baseline accuracy and a 6.5-point improvement in average few-shot accuracy at the same training budget. Notably DoReMi sometimes lowers the weight on a domain and still improves perplexity on it, because the freed-up budget improves shared representations.
Upweighting also appears inside RLHF training pipelines. Preferences with larger reward-model margins, or those flagged as high quality by human annotators, are sometimes weighted more heavily during reward model training and policy optimization. The same logic shows up in supervised fine-tuning, where curated instruction data is typically given a higher sampling weight than scraped instruction data.
Most machine learning libraries support upweighting through one or both of two mechanisms: a per-class weight array and a per-row weight array. The table below summarizes the API surface in widely used frameworks.
| Framework | Per-class weights | Per-sample weights | Notes |
|---|---|---|---|
| scikit-learn | class_weight={0:1, 1:10} or class_weight='balanced' | sample_weight=array passed to .fit() | Most estimators support both; effective weight is the product |
| PyTorch | nn.CrossEntropyLoss(weight=tensor) and nn.BCEWithLogitsLoss(pos_weight=...) | Multiply the per-element loss by a weight tensor and sum manually, or use WeightedRandomSampler | Loss functions provide a reduction='none' mode for custom weighting |
| TensorFlow / Keras | class_weight={0:1, 1:10} argument to model.fit() | sample_weight=array argument to model.fit() | Both accepted simultaneously |
| XGBoost | scale_pos_weight=ratio for binary; per-row weight in DMatrix | weight field on DMatrix | scale_pos_weight is binary-only; multiclass needs sample weights |
| LightGBM | class_weight='balanced' or dictionary; is_unbalance and scale_pos_weight | weight argument to Dataset | Multiple knobs that interact; documentation recommends choosing one |
| CatBoost | class_weights=[1, 10] or auto_class_weights='Balanced' | sample_weight argument | Auto mode mirrors the scikit-learn formula |
A typical scikit-learn workflow with custom weights looks like this:
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_sample_weight

# One weight per row, derived from the 'balanced' per-class heuristic
weights = compute_sample_weight(class_weight='balanced', y=y_train)
model = LogisticRegression().fit(X_train, y_train, sample_weight=weights)
In PyTorch, weights are usually attached to the loss object once and reused across batches:
import torch

# One weight per class, indexed by class label
class_weights = torch.tensor([0.5, 2.0, 1.0])
criterion = torch.nn.CrossEntropyLoss(weight=class_weights)
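For per-row rather than per-class weights, the usual pattern is reduction='none' followed by a manual weighted mean; a minimal sketch with dummy tensors:

import torch

criterion = torch.nn.CrossEntropyLoss(reduction='none')   # per-row losses
logits = torch.randn(8, 3)                                # dummy batch: 8 rows, 3 classes
targets = torch.randint(0, 3, (8,))
row_weights = torch.ones(8)
row_weights[0] = 5.0                                      # upweight the first row 5x
loss = (row_weights * criterion(logits, targets)).sum() / row_weights.sum()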
For binary problems with severe imbalance, BCEWithLogitsLoss(pos_weight=...) accepts a single positive-class multiplier that plays the same role as XGBoost's scale_pos_weight.
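With roughly 99 negatives per positive, for instance, a one-element tensor supplies that multiplier (the ratio is illustrative):

import torch

criterion = torch.nn.BCEWithLogitsLoss(pos_weight=torch.tensor([99.0]))   # ~ n_neg / n_pos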
Upweighting is one of several families of methods for handling imbalance and shifted distributions. The choice depends on dataset size, optimizer behavior, and whether the cost ratio is known.
| Technique | What it does | When to prefer it |
|---|---|---|
| Upweighting (this article) | Multiplies the loss contribution of selected rows | Default first try; cheap, no extra storage, integrates with all gradient-based learners |
| Upsampling | Duplicates minority rows in the training set | Low compute budget per epoch; small datasets where each batch must contain minority examples |
| Downsampling | Drops majority rows | Very large majority class; willing to sacrifice information for training speed |
| SMOTE | Synthesizes new minority points by interpolation | Tabular data with continuous features; minority class too small for plain duplication |
| Focal loss | Soft dynamic downweighting of easy examples | Object detection and dense prediction; when most examples are trivially classified |
| Cost-sensitive learning | Builds explicit cost matrix into the loss | Real-world cost ratios are known and asymmetric (medical screening, fraud) |
| Threshold tuning | Adjusts the decision threshold post-hoc | Model is calibrated; quick fix without retraining |
Upweighting is not a free lunch. Several failure modes show up in practice.
Extreme weight ratios destabilize training. A weight of 1000 on a single row produces a gradient step a thousand times larger than a typical step, which can blow up the parameters or push the optimizer into a bad region of the loss surface. Gradient clipping and weight smoothing (for example, capping weights at the 99th percentile) are common defenses.
Upweighting amplifies label noise. If a heavily upweighted example happens to have an incorrect label, the model is forced to fit that error. Practitioners often apply a confident-learning pass to clean labels before relying on heavy weights.
Validation sets should not be upweighted. The point of validation is to measure honest performance on a representative sample, so weighted validation loss does not match the deployment metric. Validation should use uniform weights or weights matching the test distribution. The same applies to calibration: a model trained with class weights has shifted predicted probabilities and usually needs Platt scaling or isotonic regression before its outputs can be read as calibrated probabilities.
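A common post-hoc recipe in scikit-learn, sketched here with Platt scaling (the estimator choice and data splits are assumptions for illustration):

from sklearn.calibration import CalibratedClassifierCV
from sklearn.linear_model import LogisticRegression

base = LogisticRegression(class_weight='balanced')    # weighted training shifts probabilities
calibrated = CalibratedClassifierCV(base, method='sigmoid', cv=5)
calibrated.fit(X_train, y_train)                      # calibration folds use uniform weights
probs = calibrated.predict_proba(X_test)[:, 1]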
Upweighting changes effective sample size. The variance of a weighted estimator scales with sum(w^2) / (sum(w))^2, so heavily skewed weights reduce statistical power. This is the effective sample size formula from survey statistics, and it explains why importance-weighted estimators with heavy tails often need many more samples than they appear to.
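The effective sample size itself is one line to compute; a sketch with a made-up weight vector:

import numpy as np

def effective_sample_size(w):
    # Kish's formula: (sum w)^2 / sum(w^2)
    return w.sum() ** 2 / (w ** 2).sum()

w = np.ones(100)
w[0] = 50.0                          # one heavily upweighted row
print(effective_sample_size(w))      # ~8.5: nominal n=100, but most power is gone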
Finally, upweighting alone cannot fix a fundamentally biased data collection pipeline. If the minority class is missing entire subpopulations, no per-row weight can recover information that was never collected.
Imagine a teacher grading a class of 100 students. Ninety-nine of them are studying for an easy quiz, and one is studying for a really important final exam. If the teacher spends the same amount of time on each student, the one with the big exam barely gets any help. Upweighting is like telling the teacher: "Spend ten minutes per student on the quiz kids and an hour on the exam student." The teacher now pays much more attention to the rare, important case. In machine learning, the model is the teacher, and the weights tell it which examples to take more seriously.