Undersampling is a class-imbalance handling technique in machine learning that reduces the number of majority class examples in a training set so the minority class is no longer drowned out. The idea is straightforward: if one label dominates the data, throw away some of those samples until the class distribution is closer to balanced, then train a classifier on the smaller, more balanced subset. Undersampling sits opposite to oversampling, which adds minority examples instead of removing majority ones, and it is one of the oldest and most widely used remedies for an imbalanced dataset.
The technique is most often used in tabular settings such as fraud detection, churn prediction, medical screening, and credit risk, where the positive class is rare and tree-based learners or linear models are the workhorse. It is less common in modern deep learning, where class weights, focal loss, and batch sampling are usually preferred, but it remains a useful tool in any pipeline that has to deal with skewed labels and limited compute.
Class imbalance hurts standard learners because the loss is dominated by majority examples. A logistic regression trained on a 99.8/0.2 split will happily push its decision boundary to predict the negative class on every input and still report 99.8% accuracy. Undersampling fixes this by changing the empirical distribution the learner sees during training. After dropping majority points, the gradient is no longer overwhelmed by the dominant label, and the optimizer is forced to spend capacity modeling the minority class.
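A few lines of scikit-learn make the failure mode concrete. The snippet below is a minimal sketch on synthetic data (the sample size, class weights, and default generator settings are illustrative, not from any benchmark): the model reports accuracy near the majority prevalence while recalling almost none of the minority class.

```python
# Minimal sketch of the accuracy paradox on a synthetic 99.8/0.2 split.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=100_000, weights=[0.998, 0.002],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
pred = clf.predict(X_te)
print(f"accuracy:        {accuracy_score(y_te, pred):.4f}")  # near 0.998
print(f"minority recall: {recall_score(y_te, pred):.4f}")    # typically very low
```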
Three practical motivations push practitioners toward undersampling rather than oversampling.

- Compute. Training on a small balanced subset is faster and cheaper than training on the full dataset, which matters when the majority class runs to millions of rows.
- No synthetic points. Nothing is duplicated or interpolated, so there is less risk of overfitting to artifacts of the kind SMOTE-style methods can introduce.
- Rebalancing without touching the loss. The effect on the class prior is comparable to the class_weight='balanced' option in scikit-learn, but achieved by throwing away data instead of changing the loss function.

The trade-off is information loss. Every removed majority example is a real observation the model never sees. Whether this matters depends on how redundant the majority class is and how much you can afford to lose.
Undersampling and oversampling attack the same problem from opposite directions. The table below summarizes the main differences.
| Aspect | Undersampling | Oversampling |
|---|---|---|
| Operates on | Majority class | Minority class |
| Effect on dataset size | Smaller | Larger |
| Risk of overfitting | Lower (no synthetic or duplicated points) | Higher (copies or interpolated points) |
| Risk of information loss | Higher (real majority samples discarded) | Lower (no original data dropped) |
| Training time | Faster than baseline | Slower than baseline |
| Common methods | Random undersampling, Tomek links, ENN, NearMiss, OSS | Random oversampling, SMOTE, ADASYN, Borderline-SMOTE |
| Works well when | Majority class is large and partially redundant | Minority class is too small to define its own region |
| Often combined with | Ensembles, threshold tuning, calibration | Cleaning steps such as Tomek links or ENN |
In practice the choice is rarely binary. SMOTE+Tomek and SMOTE+ENN combine an oversampling step with an undersampling cleaning step, and balanced ensemble methods such as EasyEnsemble and Balanced Random Forest build many small undersampled subsets and average over them. The goal in all of these is to keep the noise-reduction benefit of cleaning the majority class while avoiding the variance and information loss of a single aggressive cut.
The simplest method is random undersampling, which uniformly drops majority examples until the desired ratio is reached. If the majority class has 100,000 samples and the minority class has 1,000, random undersampling at a 1:1 ratio keeps a random 1,000 majority points and discards the other 99,000.
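The mechanics fit in a few lines of numpy. The sketch below assumes X and y are numpy arrays with binary labels and a majority_label argument marking the large class; it is an illustration of the idea, not the imbalanced-learn implementation.

```python
# Sketch of random undersampling at a 1:1 ratio with plain numpy.
import numpy as np

def random_undersample(X, y, majority_label=0, random_state=0):
    """Keep all minority rows and a uniform random subset of majority rows."""
    rng = np.random.default_rng(random_state)
    maj = np.flatnonzero(y == majority_label)
    mino = np.flatnonzero(y != majority_label)
    # Draw one majority row per minority row, without replacement.
    kept_maj = rng.choice(maj, size=mino.size, replace=False)
    keep = np.concatenate([kept_maj, mino])
    rng.shuffle(keep)
    return X[keep], y[keep]
```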
Random undersampling is fast, requires no hyperparameter tuning beyond the target ratio, and works on any data type because it does not depend on a distance metric or a feature-space neighborhood. It is a strong baseline in imbalanced-learn and in many production pipelines, and it underlies more advanced ensemble methods such as Balanced Random Forest and RUSBoost.
The drawbacks are also straightforward. Removing data uniformly throws away potentially informative majority points along with redundant ones. The resulting model has higher variance, since the kept subset is small and depends on the random seed, and the predicted probabilities are biased because the prior in the training set no longer matches the true prior in the test distribution. The first issue is usually addressed with ensembling over multiple undersampled subsets; the second is addressed with calibration or prior correction, both discussed below.
Informed undersampling methods choose which majority points to drop instead of relying on chance. Most of them are descendants of the prototype-selection literature from the 1960s and 1970s on k-nearest neighbors, repurposed for class imbalance. The table below summarizes the main families.
| Method | Year | Core idea | What it removes | Reference |
|---|---|---|---|---|
| Tomek links | 1976 | A Tomek link is a pair of nearest neighbors with different labels; remove the majority point in each link | Borderline majority points adjacent to minority points | Tomek (1976) |
| Edited Nearest Neighbors (ENN) | 1972 | Remove any majority point whose label disagrees with the majority vote of its $k$ nearest neighbors | Majority points that look like minority points to a kNN classifier | Wilson (1972) |
| Repeated ENN (RENN) | derived | Apply ENN repeatedly until no further removals occur | More aggressive cleaning than a single ENN pass | imbalanced-learn |
| All-kNN | derived | Run ENN with $k=1, 2, \ldots, K$ and keep only points that survive every pass | Even more aggressive cleaning, varying neighborhood size | Tomek (1976) |
| Condensed Nearest Neighbor (CNN) | 1968 | Iteratively keep majority points needed to correctly classify the rest under 1-NN | Redundant majority points far from the boundary | Hart (1968) |
| One-Sided Selection (OSS) | 1997 | Apply CNN to drop redundant majority points, then Tomek links to drop borderline ones | Both redundant and borderline majority points | Kubat and Matwin (1997) |
| NearMiss-1 | 2003 | Keep majority points with the smallest mean distance to the three closest minority points | Majority points far from the minority class | Mani and Zhang (2003) |
| NearMiss-2 | 2003 | Keep majority points with the smallest mean distance to the three farthest minority points | Same idea, different reference set | Mani and Zhang (2003) |
| NearMiss-3 | 2003 | For each minority point, keep its $k$ closest majority neighbors | Majority points not adjacent to any minority point | Mani and Zhang (2003) |
| Neighborhood Cleaning Rule (NCR) | 2001 | Combine ENN with rules that protect minority points whose neighborhood is mostly majority | Noisy majority points and majority neighbors of misclassified minority points | Laurikkala (2001) |
| Cluster Centroids | classical | Cluster the majority class with k-means and replace each cluster with its centroid | Many majority points, replaced by a smaller set of representatives | imbalanced-learn |
A few of these deserve a closer look because they show up in almost every imbalanced-learning pipeline.
Tomek links were introduced by Ivan Tomek in 1976 in IEEE Transactions on Systems, Man, and Cybernetics (volume 6, pages 769 to 772) as a modification of the Condensed Nearest Neighbor rule. Two examples form a Tomek link if they are each other's nearest neighbors and they belong to different classes. In a binary classification setting, every Tomek link contains exactly one majority point and one minority point. Removing the majority point in the link cleans up the boundary by erasing the point that is most likely to be confusing the classifier. The minority point is left untouched.
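A sketch of the rule, assuming numpy arrays, Euclidean nearest neighbors via scikit-learn, and no duplicate points (the function name and majority_label argument are illustrative): find mutual nearest-neighbor pairs with different labels and drop the majority member of each.

```python
# Sketch of Tomek-link cleaning for binary labels.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def remove_tomek_links(X, y, majority_label=0):
    # Nearest neighbor of each point, excluding the point itself (column 0).
    nn = NearestNeighbors(n_neighbors=2).fit(X)
    nearest = nn.kneighbors(X, return_distance=False)[:, 1]
    drop = np.zeros(len(y), dtype=bool)
    for i, j in enumerate(nearest):
        # A Tomek link: i and j are each other's nearest neighbors
        # and carry different labels.
        if nearest[j] == i and y[i] != y[j]:
            # Remove whichever member belongs to the majority class.
            drop[i if y[i] == majority_label else j] = True
    keep = ~drop
    return X[keep], y[keep]
```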
Tomek-link cleaning rarely produces a balanced dataset by itself because the number of links is usually much smaller than the imbalance gap. It is most useful as a postprocessing step after SMOTE or random oversampling. Batista, Prati, and Monard (2004) found SMOTE+Tomek to be one of the strongest baselines on a panel of 13 imbalanced datasets.
Edited Nearest Neighbors (ENN) was proposed by Dennis L. Wilson in 1972 in the same journal (volume 2, pages 408 to 421). The rule is simple: train a kNN classifier on the data, then remove every majority point whose label disagrees with the majority vote of its $k$ nearest neighbors. The original paper used $k = 3$. Wilson showed that, asymptotically, classifying with 1-NN on the edited set approaches the Bayes risk in many problems with only a few preclassified samples. ENN is more aggressive than Tomek links because it removes any majority point that looks like a minority point under kNN, not only those that form a mutual nearest-neighbor pair.
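The rule translates directly into a short sketch. The version below assumes numpy arrays and odd $k$, and restricts removal to the majority class as in the imbalance setting; the names are illustrative, not imbalanced-learn's.

```python
# Sketch of Wilson's ENN restricted to the majority class: drop each
# majority point whose label loses the vote among its k nearest neighbors.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def edited_nearest_neighbours(X, y, majority_label=0, k=3):
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    # Columns 1..k are each row's k nearest neighbors (column 0 is the row itself).
    neigh = nn.kneighbors(X, return_distance=False)[:, 1:]
    votes = y[neigh]                                  # neighbor labels, shape (n, k)
    agree = (votes == y[:, None]).sum(axis=1)
    disagree = agree <= k // 2                        # point loses the kNN vote
    drop = disagree & (y == majority_label)
    return X[~drop], y[~drop]
```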
Repeated ENN (RENN) applies ENN over and over until no more points are removed. All-kNN runs ENN once for each value of $k$ from 1 up to a chosen maximum and keeps only points that survive every pass. Both produce cleaner but smaller training sets than a single ENN pass.
Condensed Nearest Neighbor (CNN), proposed by Peter E. Hart in 1968 in IEEE Transactions on Information Theory (volume 14, pages 515 to 516), was originally a storage-reduction trick for nearest-neighbor classifiers. The algorithm grows a subset $S$ by going through the training data and adding any point that is misclassified when 1-NN is run on the current $S$. Points in dense interior regions are usually classified correctly by their neighbors and never enter $S$, so the final subset retains mostly border points.
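A single condensation pass can be sketched as follows. Hart's original algorithm repeats passes until a full sweep adds nothing; refitting 1-NN inside the loop is written for clarity, not efficiency, and the seeding is arbitrary.

```python
# Sketch of one CNN condensation pass: grow a subset S with any point
# misclassified by 1-NN on the current S.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def condense_once(X, y, random_state=0):
    rng = np.random.default_rng(random_state)
    order = rng.permutation(len(y))
    S = [order[0]]  # seed the subset with one arbitrary point
    for i in order[1:]:
        knn = KNeighborsClassifier(n_neighbors=1).fit(X[S], y[S])
        if knn.predict(X[i : i + 1])[0] != y[i]:
            S.append(i)  # keep points the current subset misclassifies
    return np.array(S)   # indices of the retained prototypes
```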
One-Sided Selection (OSS), proposed by Miroslav Kubat and Stan Matwin at ICML 1997, plugs CNN into the imbalance setting. The algorithm keeps all minority points and applies CNN only to the majority class to drop redundant interior points, then applies Tomek-link cleaning to remove borderline majority points. The resulting subset preserves the shape of the majority distribution while stripping away both noise near the boundary and redundancy in the bulk.
The NearMiss family was introduced by Mani and Zhang in their 2003 paper at the ICML Workshop on Learning from Imbalanced Datasets. Three variants are commonly used. NearMiss-1 keeps majority points with the smallest mean distance to their three closest minority points; NearMiss-2 uses the three farthest minority points; NearMiss-3 keeps, for each minority point, the $k$ closest majority points. NearMiss-1 tends to retain majority points that sit close to minority clusters, which is useful when the boundary is the part of the majority class you want the classifier to learn. NearMiss-3 guarantees that every minority point keeps some majority neighbors, which makes it more robust to outliers.
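A sketch of NearMiss-1 under these definitions, assuming numpy arrays and Euclidean distances; by default it keeps just enough majority points for a 1:1 ratio.

```python
# Sketch of NearMiss-1: rank majority points by mean distance to their
# three closest minority points and keep the closest-ranked ones.
import numpy as np
from sklearn.metrics import pairwise_distances

def nearmiss_1(X, y, majority_label=0, n_keep=None):
    maj = np.flatnonzero(y == majority_label)
    mino = np.flatnonzero(y != majority_label)
    n_keep = mino.size if n_keep is None else n_keep   # default: 1:1 ratio
    d = pairwise_distances(X[maj], X[mino])
    # Mean distance from each majority point to its 3 closest minority points.
    mean_closest = np.sort(d, axis=1)[:, :3].mean(axis=1)
    kept_maj = maj[np.argsort(mean_closest)[:n_keep]]
    keep = np.sort(np.concatenate([kept_maj, mino]))
    return X[keep], y[keep]
```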
The Neighborhood Cleaning Rule (NCR) was proposed by Jorma Laurikkala in 2001 at the Eighth Conference on Artificial Intelligence in Medicine in Europe. NCR combines ENN with a second pass that finds majority neighbors of misclassified minority points and removes them as well. On Laurikkala's experiments NCR outperformed both random undersampling and one-sided selection on a panel of medical datasets, particularly when the small class was clinically important.
A single undersampled subset throws away most of the majority data and produces a high-variance classifier. Ensemble undersampling reduces this variance by training many classifiers on different random subsets and averaging their predictions. The same idea drives bagging on full datasets, but here each bag is a balanced sample drawn from the imbalanced original.
| Method | Year | Idea |
|---|---|---|
| EasyEnsemble | 2009 | Sample several balanced subsets from the majority class, train an AdaBoost classifier on each, then aggregate predictions |
| BalanceCascade | 2009 | Same as EasyEnsemble but sequential; after each round, correctly classified majority points are removed before the next subset is drawn |
| RUSBoost | 2010 | At each AdaBoost iteration, randomly undersample the majority class to a target ratio before fitting the weak learner |
| Balanced Bagging | classical | Bagging where each bootstrap sample is balanced by undersampling the majority class |
| Balanced Random Forest | 2004 | Each tree in the forest is grown on a balanced bootstrap sample drawn by undersampling the majority class |
EasyEnsemble and BalanceCascade were proposed by Xu-Ying Liu, Jianxin Wu, and Zhi-Hua Zhou in 2009 in IEEE Transactions on Systems, Man, and Cybernetics, Part B (volume 39, pages 539 to 550). The paper showed that both methods consistently improved AUC, F-measure, and G-mean over single-classifier undersampling, with training time comparable to a single AdaBoost run when the same number of weak learners is used.
RUSBoost, introduced by Chris Seiffert, Taghi Khoshgoftaar, Jason Van Hulse, and Amri Napolitano in 2010 in IEEE Transactions on Systems, Man and Cybernetics, Part A (volume 40, pages 185 to 197), takes a similar approach, interleaving random undersampling with the AdaBoost weight updates. The authors showed that RUSBoost matched or beat SMOTEBoost on most benchmarks while being simpler and faster.
Balanced Random Forest, proposed by Chao Chen, Andy Liaw, and Leo Breiman in Technical Report 666 from the UC Berkeley Department of Statistics in 2004, is the random-forest counterpart of these methods. Each tree in the forest is grown on a bootstrap sample that is balanced by undersampling the majority class, so the forest's per-class error rates are decoupled from the original prevalence.
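With imbalanced-learn, whose API table appears below, these ensemble methods are one-liners. A usage sketch on synthetic data; the dataset shape and estimator counts are illustrative.

```python
# Usage sketch of imbalanced-learn's ensemble undersamplers.
from imblearn.ensemble import BalancedRandomForestClassifier, EasyEnsembleClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=20_000, weights=[0.99, 0.01], random_state=0)

models = {
    "balanced random forest": BalancedRandomForestClassifier(n_estimators=100,
                                                             random_state=0),
    "EasyEnsemble": EasyEnsembleClassifier(n_estimators=10, random_state=0),
}
for name, model in models.items():
    auc = cross_val_score(model, X, y, scoring="roc_auc", cv=5).mean()
    print(f"{name}: mean ROC AUC = {auc:.3f}")
```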
Undersampling and oversampling can be combined to take advantage of both. The two most cited combined methods are SMOTE+Tomek and SMOTE+ENN, both proposed by Gustavo Batista, Ronaldo Prati, and Maria Carolina Monard in their 2003 and 2004 papers on hybrid balancing methods. The idea is to enrich the minority class with synthetic samples, then clean the result with an undersampling method that removes any boundary-crossing or noisy points (whether original majority or synthetic minority) that the SMOTE step has created.
SMOTE+Tomek removes Tomek-link pairs from the post-SMOTE dataset, which deletes both members of any boundary pair where SMOTE has placed a synthetic minority point too close to a real majority point. SMOTE+ENN is more aggressive, since ENN removes any point whose label disagrees with its kNN neighborhood. On Batista's panel of 13 datasets, SMOTE+ENN ranked first on 10 of them.
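A usage sketch of the two combined samplers from imblearn.combine on synthetic data (the sizes are illustrative); comparing class counts after resampling shows ENN's more aggressive cleaning.

```python
# Usage sketch of the combined over- and undersampling methods.
from collections import Counter
from imblearn.combine import SMOTEENN, SMOTETomek
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=10_000, weights=[0.95, 0.05], random_state=0)
print("original:", Counter(y))

for sampler in (SMOTETomek(random_state=0), SMOTEENN(random_state=0)):
    X_res, y_res = sampler.fit_resample(X, y)
    # SMOTE+ENN usually ends up smaller than SMOTE+Tomek because ENN
    # removes any point that loses the kNN vote, not just link members.
    print(type(sampler).__name__ + ":", Counter(y_res))
```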
Undersampling solves one problem and introduces several others.
Information loss. Every removed majority point is data the model will never see. If the majority class is heterogeneous, undersampling can erase entire subregions, and the resulting classifier will fail on inputs that come from those subregions at test time. NearMiss-2 and NearMiss-3 are particularly aggressive in this respect because they drop majority points far from any minority cluster.
Distorted prior and biased probabilities. Suppose the true positive prevalence is 1% but you train on a 50/50 undersampled set. The trained classifier reports posterior probabilities consistent with a 50/50 prior, not a 1/99 one. The ranking of predictions is unaffected, so AUC stays the same, but the predicted probabilities are systematically too high for the positive class. Dal Pozzolo, Caelen, Johnson, and Bontempi (2015) gave a closed-form correction: if $p_s$ is the score from the model trained on the undersampled set and $\beta$ is the fraction of the original majority class that was kept, the corrected probability is $p = \beta p_s / (\beta p_s - p_s + 1)$. Wallace and Dahabreh (2012) showed that Platt scaling and isotonic regression should be used cautiously after undersampling because they were not designed to correct prior shift; closed-form prior correction or decision-threshold tuning on a held-out validation set with the original prevalence is usually safer.
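The correction fits in a one-line function. A sketch, with a worked value showing a 0.5 score from a 50/50-trained model mapping back to roughly the original prior when $\beta = 0.01$:

```python
# The closed-form prior correction as a function; beta is the fraction
# of the original majority class kept by undersampling.
import numpy as np

def correct_prior(p_s, beta):
    """Map scores from the undersampled model back to the original prior."""
    p_s = np.asarray(p_s, dtype=float)
    return beta * p_s / (beta * p_s - p_s + 1.0)

# With 1% of the majority kept, a 0.5 score maps back to ~1%.
print(correct_prior(0.5, beta=0.01))  # ~0.0099
```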
Sensitivity to which points are removed. Random undersampling depends on the random seed, and informed methods such as Tomek links and ENN depend on the distance metric, the value of $k$, and the feature scaling. Two undersampled training sets drawn from the same data can produce noticeably different classifiers, especially when the imbalance is extreme. Ensembling over multiple undersampled subsets is the standard remedy.
Leakage and improper validation. Undersampling must be applied only to the training fold of each cross-validation split, never to the validation or test fold. Undersampling the entire dataset before splitting leaks information across folds and produces optimistic estimates of test performance. The imblearn.pipeline.Pipeline class enforces this by applying samplers only during fit and not during predict.
A few rules of thumb summarize the working knowledge in the imbalanced-learning community.

- Start with random undersampling; it is a strong baseline that needs no distance metric and only one hyperparameter, the target ratio.
- Prefer an ensemble of models trained on different undersampled subsets to a single aggressive cut; averaging removes most of the dependence on the seed.
- If calibrated probabilities matter, apply the closed-form prior correction or recalibrate on held-out data with the original prevalence.
- Evaluate with metrics computed at the original prevalence, such as precision-recall AUC, never with accuracy on the resampled set.
- Resample only inside the training fold of each split; the easiest way to guarantee this is imblearn.pipeline.Pipeline, the drop-in replacement for sklearn.pipeline.Pipeline that accepts imbalanced-learn samplers.

The imbalanced-learn library, introduced by Guillaume Lemaître, Fernando Nogueira, and Christos Aridas in the Journal of Machine Learning Research (volume 18, number 17, pages 1 to 5) in 2017, is the de facto standard implementation in Python. It is part of the scikit-learn-contrib ecosystem and depends only on numpy, scipy, and scikit-learn. The table below lists the main undersampling APIs.
| imbalanced-learn class | Module | Method |
|---|---|---|
| RandomUnderSampler | imblearn.under_sampling | Random undersampling |
| TomekLinks | imblearn.under_sampling | Tomek-link cleaning |
| EditedNearestNeighbours | imblearn.under_sampling | Wilson's ENN |
| RepeatedEditedNearestNeighbours | imblearn.under_sampling | Repeated ENN |
| AllKNN | imblearn.under_sampling | ENN with varying $k$ |
| CondensedNearestNeighbour | imblearn.under_sampling | Hart's CNN |
| OneSidedSelection | imblearn.under_sampling | OSS (CNN + Tomek) |
| NearMiss | imblearn.under_sampling | NearMiss-1, -2, -3 (chosen by version parameter) |
| NeighbourhoodCleaningRule | imblearn.under_sampling | NCR |
| ClusterCentroids | imblearn.under_sampling | k-means cluster-centroid undersampling |
| InstanceHardnessThreshold | imblearn.under_sampling | Drop majority points classified with low confidence by a baseline learner |
| EasyEnsembleClassifier | imblearn.ensemble | EasyEnsemble |
| RUSBoostClassifier | imblearn.ensemble | RUSBoost |
| BalancedBaggingClassifier | imblearn.ensemble | Balanced bagging |
| BalancedRandomForestClassifier | imblearn.ensemble | Balanced Random Forest |
| SMOTETomek | imblearn.combine | SMOTE + Tomek links |
| SMOTEENN | imblearn.combine | SMOTE + ENN |
A minimal example pipeline looks like this.
```python
from imblearn.pipeline import Pipeline
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic imbalanced data so the example runs end to end.
X, y = make_classification(n_samples=50_000, weights=[0.99, 0.01], random_state=0)

pipeline = Pipeline([
    ("under", RandomUnderSampler(sampling_strategy=0.5, random_state=0)),
    ("clf", LogisticRegression(max_iter=1000)),
])
scores = cross_val_score(pipeline, X, y, scoring="average_precision", cv=5)
```
The sampling_strategy=0.5 parameter requests a final ratio of 1 minority to 2 majority. The sampler is applied only to the training data inside each cross-validation fold, so test scores reflect the original prevalence.
Outside Python, the UBL package on CRAN provides RandomUnderClassif, TomekClassif, CNNClassif, ENNClassif, and NCLClassif for R users, and the themis package provides recipe steps that integrate with the tidymodels framework. MATLAB ships RUSBoost as part of the Statistics and Machine Learning Toolbox.
The interest in undersampling techniques peaked in the late 2000s when tabular learners were the dominant family of models. Two shifts have changed how the technique is used today.
First, deep learning has moved many imbalance problems away from sampling. Class weights inside the cross-entropy loss, focal loss (Lin et al. 2017), online hard example mining, and balanced batch sampling all let the optimizer see every example without dropping any data. For an object detector or a language model, throwing away 99% of the negative class is rarely worth it, since the model has plenty of capacity and the negative examples carry useful gradient.
Second, calibration has become a first-class concern. Pipelines that produce ranked probability scores (credit risk, ad ranking, click prediction) cannot tolerate the prior shift introduced by aggressive undersampling without an explicit correction. The classical undersample-then-train recipe has shifted to undersample-then-train-then-recalibrate, often using held-out unsampled data to fit the calibration step.
Undersampling is still the right tool in several common situations. On heavily imbalanced tabular data with millions of majority rows and tight latency budgets, it is often the only way to fit a random forest or gradient boosting model in a reasonable amount of time. On streaming data with concept drift, ensembling over fresh undersampled subsets handles both class imbalance and non-stationarity. And on cost-sensitive problems where the false-negative penalty dominates, balanced ensembles such as Balanced Random Forest and EasyEnsemble are reliable defaults.
Consider a spam classifier trained on a dataset with 100,000 emails, of which 99,000 are legitimate and 1,000 are spam, an imbalance ratio of 99:1. A baseline logistic regression trained on the raw data tends to predict "legitimate" almost always and reports about 99% accuracy with near-zero recall on spam.
Applying random undersampling at a 1:1 ratio drops 98,000 legitimate emails uniformly at random and keeps a balanced training set of 2,000 emails. A logistic regression trained on this balanced subset reaches a much higher recall on spam, perhaps in the range of 80% to 90%, but precision drops because the trained probabilities are now consistent with a 50/50 prior and overestimate the chance of spam. To get usable probabilities back, the model is recalibrated with the closed-form prior correction $p = \beta p_s / (\beta p_s - p_s + 1)$ where $\beta = 1000 / 99000$ (the 1,000 kept legitimate emails out of the original 99,000), which restores the predicted base rate to roughly 1%.
A more robust pipeline replaces the single undersampled fit with an EasyEnsemble of 25 logistic regressions, each trained on a different balanced subset, and aggregates their predictions by averaging. The ensemble has lower variance than any single model and usually produces a higher AUC. If the team has time, they may swap the EasyEnsemble for SMOTE+Tomek (oversample the spam class, then clean the boundary with Tomek links) and compare the two on a held-out validation set with the original prevalence. The final choice is usually the model that maximizes precision-recall AUC at the operating recall threshold the product needs.
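A sketch of that comparison with imbalanced-learn on synthetic data. Note that EasyEnsembleClassifier defaults to AdaBoost base learners rather than the logistic regressions described above, and the dataset shape here is illustrative.

```python
# Sketch of the final model comparison: EasyEnsemble versus SMOTE+Tomek
# followed by logistic regression, scored by precision-recall AUC at the
# original prevalence.
from imblearn.combine import SMOTETomek
from imblearn.ensemble import EasyEnsembleClassifier
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=100_000, weights=[0.99, 0.01], random_state=0)

candidates = {
    "easy_ensemble": EasyEnsembleClassifier(n_estimators=25, random_state=0),
    "smote_tomek_lr": Pipeline([
        ("resample", SMOTETomek(random_state=0)),
        ("clf", LogisticRegression(max_iter=1000)),
    ]),
}
for name, model in candidates.items():
    ap = cross_val_score(model, X, y, scoring="average_precision", cv=5).mean()
    print(f"{name}: mean PR AUC = {ap:.3f}")
```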