Undersampling is a class-imbalance handling technique in machine learning that reduces the number of majority class examples in a training set so the minority class is no longer drowned out. The idea is straightforward: if one label dominates the data, throw away some of those samples until the class distribution is closer to balanced, then train a classifier on the smaller, more balanced subset. Undersampling sits opposite to oversampling, which adds minority examples instead of removing majority ones, and it is one of the oldest and most widely used remedies for an imbalanced dataset.
The technique is most often used in tabular settings such as fraud detection, churn prediction, medical screening, and credit risk, where the positive class is rare and tree-based learners or linear models are the workhorse. It is less common in modern deep learning, where class weights, focal loss, and batch sampling are usually preferred, but it remains a useful tool in any pipeline that has to deal with skewed labels and limited compute.
Class imbalance hurts standard learners because the loss is dominated by majority examples. A logistic regression trained on a 99.8/0.2 split will happily push its decision boundary to predict the negative class on every input and still report 99.8% accuracy. Undersampling fixes this by changing the empirical distribution the learner sees during training. After dropping majority points, the gradient is no longer overwhelmed by the dominant label, and the optimizer is forced to spend capacity modeling the minority class.
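A few lines of scikit-learn make the failure mode concrete. The snippet below is a minimal sketch on synthetic data (the sample size, class weights, and default generator settings are illustrative, not from any benchmark): the model reports accuracy near the majority prevalence while recalling almost none of the minority class.

```python
# Minimal sketch of the accuracy paradox on a synthetic 99.8/0.2 split.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=100_000, weights=[0.998, 0.002],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
pred = clf.predict(X_te)
print(f"accuracy:        {accuracy_score(y_te, pred):.4f}")  # near 0.998
print(f"minority recall: {recall_score(y_te, pred):.4f}")    # typically very low
```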
Three practical motivations push practitioners toward undersampling rather than oversampling.

- Compute. Training on a small balanced subset is faster and cheaper than training on the full dataset, which matters when the majority class runs to millions of rows.
- No synthetic points. Nothing is duplicated or interpolated, so there is less risk of overfitting to artifacts of the kind SMOTE-style methods can introduce.
- Rebalancing without touching the loss. The effect on the class prior is comparable to the class_weight='balanced' option in scikit-learn, but achieved by throwing away data instead of changing the loss function.

The trade-off is information loss. Every removed majority example is a real observation the model never sees. Whether this matters depends on how redundant the majority class is and how much you can afford to lose.
Undersampling and oversampling attack the same problem from opposite directions. The table below summarizes the main differences.
| Aspect | Undersampling | Oversampling |
|---|---|---|
| Operates on | Majority class | Minority class |
| Effect on dataset size | Smaller | Larger |
| Risk of overfitting | Lower (no synthetic or duplicated points) | Higher (copies or interpolated points) |
| Risk of information loss | Higher (real majority samples discarded) | Lower (no original data dropped) |
| Training time | Faster than baseline | Slower than baseline |
| Common methods | Random undersampling, Tomek links, ENN, NearMiss, OSS | Random oversampling, SMOTE, ADASYN, Borderline-SMOTE |
| Works well when | Majority class is large and partially redundant | Minority class is too small to define its own region |
| Often combined with | Ensembles, threshold tuning, calibration | Cleaning steps such as Tomek links or ENN |
In practice the choice is rarely binary. SMOTE+Tomek and SMOTE+ENN combine an oversampling step with an undersampling cleaning step, and balanced ensemble methods such as EasyEnsemble and Balanced Random Forest build many small undersampled subsets and average over them. The goal in all of these is to keep the noise-reduction benefit of cleaning the majority class while avoiding the variance and information loss of a single aggressive cut.
The simplest method is random undersampling, which uniformly drops majority examples until the desired ratio is reached. If the majority class has 100,000 samples and the minority class has 1,000, random undersampling at a 1:1 ratio keeps a random 1,000 majority points and discards the other 99,000.
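The mechanics fit in a few lines of numpy. The sketch below assumes X and y are numpy arrays with binary labels and a majority_label argument marking the large class; it is an illustration of the idea, not the imbalanced-learn implementation.

```python
# Sketch of random undersampling at a 1:1 ratio with plain numpy.
import numpy as np

def random_undersample(X, y, majority_label=0, random_state=0):
    """Keep all minority rows and a uniform random subset of majority rows."""
    rng = np.random.default_rng(random_state)
    maj = np.flatnonzero(y == majority_label)
    mino = np.flatnonzero(y != majority_label)
    # Draw one majority row per minority row, without replacement.
    kept_maj = rng.choice(maj, size=mino.size, replace=False)
    keep = np.concatenate([kept_maj, mino])
    rng.shuffle(keep)
    return X[keep], y[keep]
```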
Random undersampling is fast, requires no hyperparameter tuning beyond the target ratio, and works on any data type because it does not depend on a distance metric or a feature-space neighborhood. It is a strong baseline in imbalanced-learn and in many production pipelines, and it underlies more advanced ensemble methods such as Balanced Random Forest and RUSBoost.
The drawbacks are also straightforward. Removing data uniformly throws away potentially informative majority points along with redundant ones. The resulting model has higher variance, since the kept subset is small and depends on the random seed, and the predicted probabilities are biased because the prior in the training set no longer matches the true prior in the test distribution. The first issue is usually addressed with ensembling over multiple undersampled subsets; the second is addressed with calibration or prior correction, both discussed below.
Informed undersampling methods choose which majority points to drop instead of relying on chance. Most of them are descendants of the prototype-selection literature from the 1960s and 1970s on k-nearest neighbors, repurposed for class imbalance. The table below summarizes the main families.
| Method | Year | Core idea | What it removes | Reference |
|---|---|---|---|---|
| Tomek links | 1976 | A Tomek link is a pair of nearest neighbors with different labels; remove the majority point in each link | Borderline majority points adjacent to minority points | Tomek (1976) |
| Edited Nearest Neighbors (ENN) | 1972 | Remove any majority point whose label disagrees with the majority vote of its $k$ nearest neighbors | Majority points that look like minority points to a kNN classifier | Wilson (1972) |
| Repeated ENN (RENN) | derived | Apply ENN repeatedly until no further removals occur | More aggressive cleaning than a single ENN pass | imbalanced-learn |
| All-kNN | derived | Run ENN with $k=1, 2, \ldots, K$ and keep only points that survive every pass | Even more aggressive cleaning, varying neighborhood size | Tomek (1976) |
| Condensed Nearest Neighbor (CNN) | 1968 | Iteratively keep majority points needed to correctly classify the rest under 1-NN | Redundant majority points far from the boundary | Hart (1968) |
| One-Sided Selection (OSS) | 1997 | Apply CNN to drop redundant majority points, then Tomek links to drop borderline ones | Both redundant and borderline majority points | Kubat and Matwin (1997) |
| NearMiss-1 | 2003 | Keep majority points with the smallest mean distance to the three closest minority points | Majority points far from the minority class | Mani and Zhang (2003) |
| NearMiss-2 | 2003 | Keep majority points with the smallest mean distance to the three farthest minority points | Same idea, different reference set | Mani and Zhang (2003) |
| NearMiss-3 | 2003 | For each minority point, keep its $k$ closest majority neighbors | Majority points not adjacent to any minority point | Mani and Zhang (2003) |
| Neighborhood Cleaning Rule (NCR) | 2001 | Combine ENN with rules that protect minority points whose neighborhood is mostly majority | Noisy majority points and majority neighbors of misclassified minority points | Laurikkala (2001) |
| Cluster Centroids | classical | Cluster the majority class with k-means and replace each cluster with its centroid | Many majority points, replaced by a smaller set of representatives | imbalanced-learn |
A few of these deserve a closer look because they show up in almost every imbalanced-learning pipeline.
Tomek links were introduced by Ivan Tomek in 1976 in IEEE Transactions on Systems, Man, and Cybernetics (volume 6, pages 769 to 772) as a modification of the Condensed Nearest Neighbor rule. Two examples form a Tomek link if they are each other's nearest neighbors and they belong to different classes. In a binary classification setting, every Tomek link contains exactly one majority point and one minority point. Removing the majority point in the link cleans up the boundary by erasing the point that is most likely to be confusing the classifier. The minority point is left untouched.
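A sketch of the rule, assuming numpy arrays, Euclidean nearest neighbors via scikit-learn, and no duplicate points (the function name and majority_label argument are illustrative): find mutual nearest-neighbor pairs with different labels and drop the majority member of each.

```python
# Sketch of Tomek-link cleaning for binary labels.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def remove_tomek_links(X, y, majority_label=0):
    # Nearest neighbor of each point, excluding the point itself (column 0).
    nn = NearestNeighbors(n_neighbors=2).fit(X)
    nearest = nn.kneighbors(X, return_distance=False)[:, 1]
    drop = np.zeros(len(y), dtype=bool)
    for i, j in enumerate(nearest):
        # A Tomek link: i and j are each other's nearest neighbors
        # and carry different labels.
        if nearest[j] == i and y[i] != y[j]:
            # Remove whichever member belongs to the majority class.
            drop[i if y[i] == majority_label else j] = True
    keep = ~drop
    return X[keep], y[keep]
```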
Tomek-link cleaning rarely produces a balanced dataset by itself because the number of links is usually much smaller than the imbalance gap. It is most useful as a postprocessing step after SMOTE or random oversampling. Batista, Prati, and Monard (2004) found SMOTE+Tomek to be one of the strongest baselines on a panel of 13 imbalanced datasets.
Edited Nearest Neighbors (ENN) was proposed by Dennis L. Wilson in 1972 in the same journal (volume 2, pages 408 to 421). The rule is simple: train a kNN classifier on the data, then remove every majority point whose label disagrees with the majority vote of its $k$ nearest neighbors. The original paper used $k = 3$. Wilson showed that, asymptotically, classifying with 1-NN on the edited set approaches the Bayes risk in many problems with only a few preclassified samples. ENN is more aggressive than Tomek links because it removes any majority point that looks like a minority point under kNN, not only those that form a mutual nearest-neighbor pair.
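The rule translates directly into a short sketch. The version below assumes numpy arrays and odd $k$, and restricts removal to the majority class as in the imbalance setting; the names are illustrative, not imbalanced-learn's.

```python
# Sketch of Wilson's ENN restricted to the majority class: drop each
# majority point whose label loses the vote among its k nearest neighbors.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def edited_nearest_neighbours(X, y, majority_label=0, k=3):
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    # Columns 1..k are each row's k nearest neighbors (column 0 is the row itself).
    neigh = nn.kneighbors(X, return_distance=False)[:, 1:]
    votes = y[neigh]                                  # neighbor labels, shape (n, k)
    agree = (votes == y[:, None]).sum(axis=1)
    disagree = agree <= k // 2                        # point loses the kNN vote
    drop = disagree & (y == majority_label)
    return X[~drop], y[~drop]
```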
Repeated ENN (RENN) applies ENN over and over until no more points are removed. All-kNN runs ENN once for each value of $k$ from 1 up to a chosen maximum and keeps only points that survive every pass. Both produce cleaner but smaller training sets than a single ENN pass.
Condensed Nearest Neighbor (CNN), proposed by Peter E. Hart in 1968 in IEEE Transactions on Information Theory (volume 14, pages 515 to 516), was originally a storage-reduction trick for nearest-neighbor classifiers. The algorithm grows a subset $S$ by going through the training data and adding any point that is misclassified when 1-NN is run on the current $S$. Points in dense interior regions are usually classified correctly by their neighbors and never enter $S$, so the final subset retains mostly border points.
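A single condensation pass can be sketched as follows. Hart's original algorithm repeats passes until a full sweep adds nothing; refitting 1-NN inside the loop is written for clarity, not efficiency, and the seeding is arbitrary.

```python
# Sketch of one CNN condensation pass: grow a subset S with any point
# misclassified by 1-NN on the current S.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def condense_once(X, y, random_state=0):
    rng = np.random.default_rng(random_state)
    order = rng.permutation(len(y))
    S = [order[0]]  # seed the subset with one arbitrary point
    for i in order[1:]:
        knn = KNeighborsClassifier(n_neighbors=1).fit(X[S], y[S])
        if knn.predict(X[i : i + 1])[0] != y[i]:
            S.append(i)  # keep points the current subset misclassifies
    return np.array(S)   # indices of the retained prototypes
```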
One-Sided Selection (OSS), proposed by Miroslav Kubat and Stan Matwin at ICML 1997, plugs CNN into the imbalance setting. The algorithm keeps all minority points and applies CNN only to the majority class to drop redundant interior points, then applies Tomek-link cleaning to remove borderline majority points. The resulting subset preserves the shape of the majority distribution while stripping away both noise near the boundary and redundancy in the bulk.
The NearMiss family was introduced by Mani and Zhang in their 2003 paper at the ICML Workshop on Learning from Imbalanced Datasets. Three variants are commonly used. NearMiss-1 keeps majority points with the smallest mean distance to their three closest minority points; NearMiss-2 uses the three farthest minority points; NearMiss-3 keeps, for each minority point, the $k$ closest majority points. NearMiss-1 tends to retain majority points that sit close to minority clusters, which is useful when the boundary is the part of the majority class you want the classifier to learn. NearMiss-3 guarantees that every minority point keeps some majority neighbors, which makes it more robust to outliers.
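A sketch of NearMiss-1 under these definitions, assuming numpy arrays and Euclidean distances; by default it keeps just enough majority points for a 1:1 ratio.

```python
# Sketch of NearMiss-1: rank majority points by mean distance to their
# three closest minority points and keep the closest-ranked ones.
import numpy as np
from sklearn.metrics import pairwise_distances

def nearmiss_1(X, y, majority_label=0, n_keep=None):
    maj = np.flatnonzero(y == majority_label)
    mino = np.flatnonzero(y != majority_label)
    n_keep = mino.size if n_keep is None else n_keep   # default: 1:1 ratio
    d = pairwise_distances(X[maj], X[mino])
    # Mean distance from each majority point to its 3 closest minority points.
    mean_closest = np.sort(d, axis=1)[:, :3].mean(axis=1)
    kept_maj = maj[np.argsort(mean_closest)[:n_keep]]
    keep = np.sort(np.concatenate([kept_maj, mino]))
    return X[keep], y[keep]
```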
The Neighborhood Cleaning Rule (NCR) was proposed by Jorma Laurikkala in 2001 at the Eighth Conference on Artificial Intelligence in Medicine in Europe. NCR combines ENN with a second pass that finds majority neighbors of misclassified minority points and removes them as well. On Laurikkala's experiments NCR outperformed both random undersampling and one-sided selection on a panel of medical datasets, particularly when the small class was clinically important.
A single undersampled subset throws away most of the majority data and produces a high-variance classifier. Ensemble undersampling reduces this variance by training many classifiers on different random subsets and averaging their predictions. The same idea drives bagging on full datasets, but here each bag is a balanced sample drawn from the imbalanced original.
| Method | Year | Idea |
|---|---|---|
| EasyEnsemble | 2009 | Sample several balanced subsets from the majority class, train an AdaBoost classifier on each, then aggregate predictions |
| BalanceCascade | 2009 | Same as EasyEnsemble but sequential; after each round, correctly classified majority points are removed before the next subset is drawn |
| RUSBoost | 2010 | At each AdaBoost iteration, randomly undersample the majority class to a target ratio before fitting the weak learner |
| Balanced Bagging | classical | Bagging where each bootstrap sample is balanced by undersampling the majority class |
| Balanced Random Forest | 2004 | Each tree in the forest is grown on a balanced bootstrap sample drawn by undersampling the majority class |
EasyEnsemble and BalanceCascade were proposed by Xu-Ying Liu, Jianxin Wu, and Zhi-Hua Zhou in 2009 in IEEE Transactions on Systems, Man, and Cybernetics, Part B (volume 39, pages 539 to 550). The paper showed that both methods consistently improved AUC, F-measure, and G-mean over single-classifier undersampling, with training time comparable to a single AdaBoost run when the same number of weak learners is used.
RUSBoost, introduced by Chris Seiffert, Taghi Khoshgoftaar, Jason Van Hulse, and Amri Napolitano in 2010 in IEEE Transactions on Systems, Man and Cybernetics, Part A (volume 40, pages 185 to 197), takes a similar approach, interleaving random undersampling with the AdaBoost weight updates. The authors showed that RUSBoost matched or beat SMOTEBoost on most benchmarks while being simpler and faster.
Balanced Random Forest, proposed by Chao Chen, Andy Liaw, and Leo Breiman in Technical Report 666 from the UC Berkeley Department of Statistics in 2004, is the random-forest counterpart of these methods. Each tree in the forest is grown on a bootstrap sample that is balanced by undersampling the majority class, so the forest's per-class error rates are decoupled from the original prevalence.
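With imbalanced-learn, whose API table appears below, these ensemble methods are one-liners. A usage sketch on synthetic data; the dataset shape and estimator counts are illustrative.

```python
# Usage sketch of imbalanced-learn's ensemble undersamplers.
from imblearn.ensemble import BalancedRandomForestClassifier, EasyEnsembleClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=20_000, weights=[0.99, 0.01], random_state=0)

models = {
    "balanced random forest": BalancedRandomForestClassifier(n_estimators=100,
                                                             random_state=0),
    "EasyEnsemble": EasyEnsembleClassifier(n_estimators=10, random_state=0),
}
for name, model in models.items():
    auc = cross_val_score(model, X, y, scoring="roc_auc", cv=5).mean()
    print(f"{name}: mean ROC AUC = {auc:.3f}")
```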
Undersampling and oversampling can be combined to take advantage of both. The two most cited combined methods are SMOTE+Tomek and SMOTE+ENN, both proposed by Gustavo Batista, Ronaldo Prati, and Maria Carolina Monard in their 2003 and 2004 papers on hybrid balancing methods. The idea is to enrich the minority class with synthetic samples, then clean the result with an undersampling method that removes any boundary-crossing or noisy points (whether original majority or synthetic minority) that the SMOTE step has created.
SMOTE+Tomek removes Tomek-link pairs from the post-SMOTE dataset, which deletes both members of any boundary pair where SMOTE has placed a synthetic minority point too close to a real majority point. SMOTE+ENN is more aggressive, since ENN removes any point whose label disagrees with its kNN neighborhood. On Batista's panel of 13 datasets, SMOTE+ENN ranked first on 10 of them.
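A usage sketch of the two combined samplers from imblearn.combine on synthetic data (the sizes are illustrative); comparing class counts after resampling shows ENN's more aggressive cleaning.

```python
# Usage sketch of the combined over- and undersampling methods.
from collections import Counter
from imblearn.combine import SMOTEENN, SMOTETomek
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=10_000, weights=[0.95, 0.05], random_state=0)
print("original:", Counter(y))

for sampler in (SMOTETomek(random_state=0), SMOTEENN(random_state=0)):
    X_res, y_res = sampler.fit_resample(X, y)
    # SMOTE+ENN usually ends up smaller than SMOTE+Tomek because ENN
    # removes any point that loses the kNN vote, not just link members.
    print(type(sampler).__name__ + ":", Counter(y_res))
```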
Undersampling solves one problem and introduces several others.
Information loss. Every removed majority point is data the model will never see. If the majority class is heterogeneous, undersampling can erase entire subregions, and the resulting classifier will fail on inputs that come from those subregions at test time. NearMiss-2 and NearMiss-3 are particularly aggressive in this respect because they drop majority points far from any minority cluster.
Distorted prior and biased probabilities. Suppose the true positive prevalence is 1% but you train on a 50/50 undersampled set. The trained classifier reports posterior probabilities consistent with a 50/50 prior, not a 1/99 one. The ranking of predictions is unaffected, so AUC stays the same, but the predicted probabilities are systematically too high for the positive class. Dal Pozzolo, Caelen, Johnson, and Bontempi (2015) gave a closed-form correction: if $p_s$ is the score from the model trained on the undersampled set and $\beta$ is the fraction of the original majority class that was kept, the corrected probability is $p = \beta p_s / (\beta p_s - p_s + 1)$. Wallace and Dahabreh (2012) showed that Platt scaling and isotonic regression should be used cautiously after undersampling because they were not designed to correct prior shift; closed-form prior correction or decision-threshold tuning on a held-out validation set with the original prevalence is usually safer.
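The correction fits in a one-line function. A sketch, with a worked value showing a 0.5 score from a 50/50-trained model mapping back to roughly the original prior when $\beta = 0.01$:

```python
# The closed-form prior correction as a function; beta is the fraction
# of the original majority class kept by undersampling.
import numpy as np

def correct_prior(p_s, beta):
    """Map scores from the undersampled model back to the original prior."""
    p_s = np.asarray(p_s, dtype=float)
    return beta * p_s / (beta * p_s - p_s + 1.0)

# With 1% of the majority kept, a 0.5 score maps back to ~1%.
print(correct_prior(0.5, beta=0.01))  # ~0.0099
```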
Sensitivity to which points are removed. Random undersampling depends on the random seed, and informed methods such as Tomek links and ENN depend on the distance metric, the value of $k$, and the feature scaling. Two undersampled training sets drawn from the same data can produce noticeably different classifiers, especially when the imbalance is extreme. Ensembling over multiple undersampled subsets is the standard remedy.
Leakage and improper validation. Undersampling must be applied only to the training fold of each cross-validation split, never to the validation or test fold. Undersampling the entire dataset before splitting leaks information across folds and produces optimistic estimates of test performance. The imblearn.pipeline.Pipeline class enforces this by applying samplers only during fit and not during predict.
A few rules of thumb summarize the working knowledge in the imbalanced-learning community.

- Start with random undersampling; it is a strong baseline that needs no distance metric and only one hyperparameter, the target ratio.
- Prefer an ensemble of models trained on different undersampled subsets to a single aggressive cut; averaging removes most of the dependence on the seed.
- If calibrated probabilities matter, apply the closed-form prior correction or recalibrate on held-out data with the original prevalence.
- Evaluate with metrics computed at the original prevalence, such as precision-recall AUC, never with accuracy on the resampled set.
- Resample only inside the training fold of each split; the easiest way to guarantee this is imblearn.pipeline.Pipeline, the drop-in replacement for sklearn.pipeline.Pipeline that accepts imbalanced-learn samplers.

The imbalanced-learn library, introduced by Guillaume Lemaître, Fernando Nogueira, and Christos Aridas in the Journal of Machine Learning Research (volume 18, number 17, pages 1 to 5) in 2017, is the de facto standard implementation in Python. It is part of the scikit-learn-contrib ecosystem and depends only on numpy, scipy, and scikit-learn. The table below lists the main undersampling APIs.
| imbalanced-learn class | Module | Method |
|---|---|---|
| RandomUnderSampler | imblearn.under_sampling | Random undersampling |
| TomekLinks | imblearn.under_sampling | Tomek-link cleaning |
| EditedNearestNeighbours | imblearn.under_sampling | Wilson's ENN |
| RepeatedEditedNearestNeighbours | imblearn.under_sampling | Repeated ENN |
| AllKNN | imblearn.under_sampling | ENN with varying $k$ |
| CondensedNearestNeighbour | imblearn.under_sampling | Hart's CNN |
| OneSidedSelection | imblearn.under_sampling | OSS (CNN + Tomek) |
| NearMiss | imblearn.under_sampling | NearMiss-1, -2, -3 (chosen by version parameter) |
| NeighbourhoodCleaningRule | imblearn.under_sampling | NCR |
| ClusterCentroids | imblearn.under_sampling | k-means cluster-centroid undersampling |
| InstanceHardnessThreshold | imblearn.under_sampling | Drop majority points classified with low confidence by a baseline learner |
| EasyEnsembleClassifier | imblearn.ensemble | EasyEnsemble |
| RUSBoostClassifier | imblearn.ensemble | RUSBoost |
| BalancedBaggingClassifier | imblearn.ensemble | Balanced bagging |
| BalancedRandomForestClassifier | imblearn.ensemble | Balanced Random Forest |
| SMOTETomek | imblearn.combine | SMOTE + Tomek links |
| SMOTEENN | imblearn.combine | SMOTE + ENN |
A minimal example pipeline looks like this.
```python
from imblearn.pipeline import Pipeline
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic imbalanced data so the example runs end to end.
X, y = make_classification(n_samples=50_000, weights=[0.99, 0.01], random_state=0)

pipeline = Pipeline([
    ("under", RandomUnderSampler(sampling_strategy=0.5, random_state=0)),
    ("clf", LogisticRegression(max_iter=1000)),
])
scores = cross_val_score(pipeline, X, y, scoring="average_precision", cv=5)
```
The sampling_strategy=0.5 parameter requests a final ratio of 1 minority to 2 majority. The sampler is applied only to the training data inside each cross-validation fold, so test scores reflect the original prevalence.
Outside Python, the UBL package on CRAN provides RandomUnderClassif, TomekClassif, CNNClassif, ENNClassif, and NCLClassif for R users, and the themis package provides recipe steps that integrate with the tidymodels framework. MATLAB ships RUSBoost as part of the Statistics and Machine Learning Toolbox.
The interest in undersampling techniques peaked in the late 2000s when tabular learners were the dominant family of models. Two shifts have changed how the technique is used today.
First, deep learning has moved many imbalance problems away from sampling. Class weights inside the cross-entropy loss, focal loss (Lin et al. 2017), online hard example mining, and balanced batch sampling all let the optimizer see every example without dropping any data. For an object detector or a language model, throwing away 99% of the negative class is rarely worth it, since the model has plenty of capacity and the negative examples carry useful gradient.
Second, calibration has become a first-class concern. Pipelines that produce ranked probability scores (credit risk, ad ranking, click prediction) cannot tolerate the prior shift introduced by aggressive undersampling without an explicit correction. The classical undersample-then-train recipe has shifted to undersample-then-train-then-recalibrate, often using held-out unsampled data to fit the calibration step.
Undersampling is still the right tool in several common situations. On heavily imbalanced tabular data with millions of majority rows and tight latency budgets, it is often the only way to fit a random forest or gradient boosting model in a reasonable amount of time. On streaming data with concept drift, ensembling over fresh undersampled subsets handles both class imbalance and non-stationarity. And on cost-sensitive problems where the false-negative penalty dominates, balanced ensembles such as Balanced Random Forest and EasyEnsemble are reliable defaults.
Consider a spam classifier trained on a dataset with 100,000 emails, of which 99,000 are legitimate and 1,000 are spam, an imbalance ratio of 99:1. A baseline logistic regression trained on the raw data tends to predict "legitimate" almost always and reports about 99% accuracy with near-zero recall on spam.
Applying random undersampling at a 1:1 ratio drops 98,000 legitimate emails uniformly at random and keeps a balanced training set of 2,000 emails. A logistic regression trained on this balanced subset reaches a much higher recall on spam, perhaps in the range of 80% to 90%, but precision drops because the trained probabilities are now consistent with a 50/50 prior and overestimate the chance of spam. To get usable probabilities back, the model is recalibrated with the closed-form prior correction $p = \beta p_s / (\beta p_s - p_s + 1)$ where $\beta = 1000 / 99000$ (the 1,000 kept legitimate emails out of the original 99,000), which restores the predicted base rate to roughly 1%.
A more robust pipeline replaces the single undersampled fit with an EasyEnsemble of 25 logistic regressions, each trained on a different balanced subset, and aggregates their predictions by averaging. The ensemble has lower variance than any single model and usually produces a higher AUC. If the team has time, they may swap the EasyEnsemble for SMOTE+Tomek (oversample the spam class, then clean the boundary with Tomek links) and compare the two on a held-out validation set with the original prevalence. The final choice is usually the model that maximizes precision-recall AUC at the operating recall threshold the product needs.
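A sketch of that comparison with imbalanced-learn on synthetic data. Note that EasyEnsembleClassifier defaults to AdaBoost base learners rather than the logistic regressions described above, and the dataset shape here is illustrative.

```python
# Sketch of the final model comparison: EasyEnsemble versus SMOTE+Tomek
# followed by logistic regression, scored by precision-recall AUC at the
# original prevalence.
from imblearn.combine import SMOTETomek
from imblearn.ensemble import EasyEnsembleClassifier
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=100_000, weights=[0.99, 0.01], random_state=0)

candidates = {
    "easy_ensemble": EasyEnsembleClassifier(n_estimators=25, random_state=0),
    "smote_tomek_lr": Pipeline([
        ("resample", SMOTETomek(random_state=0)),
        ("clf", LogisticRegression(max_iter=1000)),
    ]),
}
for name, model in candidates.items():
    ap = cross_val_score(model, X, y, scoring="average_precision", cv=5).mean()
    print(f"{name}: mean PR AUC = {ap:.3f}")
```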