# Undersampling

> Source: https://aiwiki.ai/wiki/undersampling
> Updated: 2026-06-25
> Categories: Machine Learning, Statistics
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

**Undersampling** is a [class imbalance](/wiki/class_imbalance) handling technique in [machine learning](/wiki/machine_learning) that removes examples from the [majority class](/wiki/majority_class) of a training set so the [minority class](/wiki/minority_class) is no longer drowned out. Concretely: if one label dominates the data, you discard some of those samples until the class distribution is closer to balanced, then train a classifier on the smaller, more balanced subset. The main methods are random undersampling, Tomek links, NearMiss-1/2/3, Condensed Nearest Neighbour, Edited Nearest Neighbours, One-Sided Selection, Neighbourhood Cleaning Rule, and Cluster Centroids, all implemented in the `imbalanced-learn` Python library.[13][15] Undersampling sits opposite to [oversampling](/wiki/oversampling), which adds minority examples instead of removing majority ones, and it is one of the oldest and most widely used remedies for an [imbalanced dataset](/wiki/imbalanced_dataset).

The technique is most often used in tabular settings such as fraud detection, churn prediction, medical screening, and credit risk, where the positive class is rare and tree-based learners or linear models are the workhorse. It is less common in modern deep learning, where class weights, [focal loss](/wiki/focal_loss), and batch sampling are usually preferred, but it remains a useful tool in any pipeline that has to deal with skewed labels and limited compute.

The `imbalanced-learn` documentation frames the whole family in one line: "One way of handling imbalanced datasets is to reduce the number of observations from all classes but the minority class."[15]

## What is undersampling, and what are the two families of methods?

Undersampling methods split into two groups, a distinction the `imbalanced-learn` user guide draws explicitly.[15] Controlled undersampling methods "reduce the number of observations from the targeted classes to a number specified by the user," so you set the final ratio (random undersampling, NearMiss, and Cluster Centroids all belong here). Cleaning undersampling methods instead "clean" the feature space by removing noisy or too-easy-to-classify majority points, and the final count "varies with the cleaning method and cannot be specified by the user" (Tomek links, Edited Nearest Neighbours, One-Sided Selection, and the Neighbourhood Cleaning Rule belong here).[15] The practical consequence is that controlled methods can hit any target balance but may delete useful data, while cleaning methods only prune the boundary and rarely balance the data on their own.

## Why undersample?

Class imbalance hurts standard learners because the loss is dominated by majority examples. A logistic regression trained on a 99.8/0.2 split will happily push its [decision boundary](/wiki/decision_boundary) to predict the negative class on every input and still report 99.8% accuracy. Undersampling fixes this by changing the empirical distribution the learner sees during training. After dropping majority points, the gradient is no longer overwhelmed by the dominant label, and the optimizer is forced to spend capacity modeling the minority class.

Three practical motivations push practitioners toward undersampling rather than oversampling.

1. **Faster training.** A balanced subset of a 100-million-row dataset with 0.1% positive prevalence is around 200,000 rows (100,000 positives plus 100,000 sampled negatives), roughly 500 times smaller than the original. Iteration on a small balanced set is cheap, and the entire pipeline (cross-validation, hyperparameter search, calibration) becomes interactive instead of overnight.
2. **Sometimes better generalization.** When the majority class contains a lot of redundant or noisy points, removing them can sharpen the [decision boundary](/wiki/decision_boundary) and reduce variance in subsequent splits. Several empirical surveys (Batista, Prati, and Monard 2004; Lemaitre, Nogueira, and Aridas 2017) show undersampling matching or beating oversampling on tabular benchmarks, particularly when paired with ensembles.[8][13]
3. **Reduced majority dominance in the loss.** Random undersampling to a balanced ratio is equivalent in expectation to reweighting the loss so that each class contributes the same total. This is the same intuition behind [cost-sensitive learning](/wiki/cost-sensitive_learning) and the `class_weight='balanced'` option in [scikit-learn](/wiki/scikit-learn), but achieved by throwing away data instead of changing the loss function.
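
The reweighting view in the last point can be made concrete with a small sketch. It assumes the "balanced" heuristic that scikit-learn documents, `n_samples / (n_classes * class_count)`; the helper name `balanced_class_weights` and the toy label counts are illustrative only.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Per-class weights n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Toy labels with a 99:1 imbalance (illustrative numbers only).
labels = [0] * 990 + [1] * 10
weights = balanced_class_weights(labels)

# After reweighting, each class contributes the same total weight to the loss.
total_per_class = {cls: weights[cls] * labels.count(cls) for cls in weights}
```

Here the minority weight is 50 and the majority weight is about 0.505, so both classes carry a total weight of 500. Undersampling reaches a similar balance by deleting majority rows instead of shrinking their weight.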

The trade-off is information loss. Every removed majority example is a real observation the model never sees. Whether this matters depends on how redundant the majority class is and how much you can afford to lose.

## How does undersampling differ from oversampling?

Undersampling and [oversampling](/wiki/oversampling) attack the same problem from opposite directions. The table below summarizes the main differences.

| Aspect | Undersampling | Oversampling |
|---|---|---|
| Operates on | Majority class | Minority class |
| Effect on dataset size | Smaller | Larger |
| Risk of overfitting | Lower (no synthetic or duplicated points) | Higher (copies or interpolated points) |
| Risk of information loss | Higher (real majority samples discarded) | Lower (no original data dropped) |
| Training time | Faster than baseline | Slower than baseline |
| Common methods | Random undersampling, [Tomek links](/wiki/tomek_links), ENN, NearMiss, OSS | Random oversampling, [SMOTE](/wiki/smote), ADASYN, Borderline-SMOTE |
| Works well when | Majority class is large and partially redundant | Minority class is too small to define its own region |
| Often combined with | Ensembles, threshold tuning, calibration | Cleaning steps such as Tomek links or ENN |

In practice the choice is rarely binary. SMOTE+Tomek and SMOTE+ENN combine an oversampling step with an undersampling cleaning step, and balanced ensemble methods such as EasyEnsemble and Balanced [Random Forest](/wiki/random_forest) build many small undersampled subsets and average over them. The goal in all of these is to keep the noise-reduction benefit of cleaning the majority class while avoiding the variance and information loss of a single aggressive cut.

## What is random undersampling?

The simplest method is random undersampling, which uniformly drops majority examples until the desired ratio is reached. If the majority class has 100,000 samples and the minority class has 1,000, random undersampling at a 1:1 ratio keeps a random 1,000 majority points and discards the other 99,000.
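
A minimal pure-Python sketch of the procedure, using the same counts as the example above. The function name `random_undersample` and its `ratio` parameter (desired minority-to-majority count after resampling) are hypothetical and do not mirror any library's API.

```python
import random


def random_undersample(X, y, majority_label, ratio=1.0, seed=0):
    """Keep every non-majority row and a uniform random subset of majority rows.

    ratio is the desired minority-to-majority count after resampling
    (1.0 means 1:1). This is an illustrative helper, not a library function.
    """
    rng = random.Random(seed)
    majority_idx = [i for i, label in enumerate(y) if label == majority_label]
    minority_idx = [i for i, label in enumerate(y) if label != majority_label]
    n_keep = min(len(majority_idx), round(len(minority_idx) / ratio))
    kept_majority = rng.sample(majority_idx, n_keep)  # sampling without replacement
    keep = sorted(minority_idx + kept_majority)
    return [X[i] for i in keep], [y[i] for i in keep]


# 100,000 majority rows and 1,000 minority rows, as in the text.
y = [0] * 100_000 + [1] * 1_000
X = [[float(i)] for i in range(len(y))]
X_res, y_res = random_undersample(X, y, majority_label=0, ratio=1.0)
```

The result has 1,000 rows of each class. Changing `seed` changes which majority rows survive, which is the source of the variance discussed next.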

Random undersampling is fast, requires no hyperparameter tuning beyond the target ratio, and works on any data type because it does not depend on a distance metric or a feature-space neighborhood. It is a strong baseline in `imbalanced-learn` and in many production pipelines, and it underlies more advanced ensemble methods such as [Balanced Random Forest](/wiki/random_forest) and RUSBoost.

The drawbacks are also straightforward. Removing data uniformly throws away potentially informative majority points along with redundant ones. The resulting model has higher variance, since the kept subset is small and depends on the random seed, and the predicted probabilities are biased because the prior in the training set no longer matches the true prior in the test distribution. The first issue is usually addressed with ensembling over multiple undersampled subsets; the second is addressed with calibration or prior correction, both discussed below.

## What are the informed undersampling techniques?

Informed undersampling methods choose which majority points to drop instead of relying on chance. Most of them are descendants of the prototype-selection literature from the 1960s and 1970s on [k-nearest neighbors](/wiki/k_nearest_neighbors), repurposed for class imbalance. The table below summarizes the main families.

| Method | Year | Core idea | What it removes | Reference |
|---|---|---|---|---|
| [Tomek links](/wiki/tomek_links) | 1976 | A Tomek link is a pair of nearest neighbors with different labels; remove the majority point in each link | Borderline majority points adjacent to minority points | Tomek (1976) |
| Edited Nearest Neighbors (ENN) | 1972 | Remove any majority point whose label disagrees with the majority vote of its $k$ nearest neighbors | Majority points that look like minority points to a kNN classifier | Wilson (1972) |
| Repeated ENN (RENN) | derived | Apply ENN repeatedly until no further removals occur | More aggressive cleaning than a single ENN pass | imbalanced-learn |
| All-kNN | derived | Run ENN with $k=1, 2, \ldots, K$ and keep only points that survive every pass | Even more aggressive cleaning, varying neighborhood size | Tomek (1976) |
| Condensed Nearest Neighbor (CNN) | 1968 | Iteratively keep majority points needed to correctly classify the rest under 1-NN | Redundant majority points far from the boundary | Hart (1968) |
| One-Sided Selection (OSS) | 1997 | Apply CNN to drop redundant majority points, then Tomek links to drop borderline ones | Both redundant and borderline majority points | Kubat and Matwin (1997) |
| NearMiss-1 | 2003 | Keep majority points with the smallest mean distance to the three closest minority points | Majority points far from the minority class | Mani and Zhang (2003) |
| NearMiss-2 | 2003 | Keep majority points with the smallest mean distance to the three farthest minority points | Same idea, different reference set | Mani and Zhang (2003) |
| NearMiss-3 | 2003 | For each minority point keep its $M$ nearest majority neighbors, then among those keep the points whose mean distance to their $N$ nearest minority points is largest | Majority points not adjacent to any minority point | Mani and Zhang (2003) |
| Neighborhood Cleaning Rule (NCR) | 2001 | Combine ENN with rules that protect minority points whose neighborhood is mostly majority | Noisy majority points and majority neighbors of misclassified minority points | Laurikkala (2001) |
| Cluster Centroids | classical | Cluster the majority class with k-means and replace each cluster with its centroid | Many majority points, replaced by a smaller set of representatives | imbalanced-learn |

A few of these deserve a closer look because they show up in almost every imbalanced-learning pipeline.

### What are Tomek links?

[Tomek links](/wiki/tomek_links) were introduced by Ivan Tomek in 1976 in *IEEE Transactions on Systems, Man, and Cybernetics* (volume 6, pages 769 to 772) as a modification of the Condensed Nearest Neighbor rule.[3] As the `imbalanced-learn` docs put it, "A Tomek's link exists when two samples from different classes are closest neighbors to each other."[15] In a binary classification setting, every Tomek link contains exactly one majority point and one minority point. Removing the majority point in the link cleans up the boundary by erasing the point that is most likely to be confusing the classifier. The minority point is left untouched.
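
The definition translates almost directly into code. The brute-force sketch below assumes Euclidean distance and a tiny dataset; the helper names are illustrative, and a real implementation would use a spatial index instead of the quadratic search.

```python
import math


def nearest_neighbor(i, X):
    """Index of the point closest to X[i], excluding i itself."""
    return min((j for j in range(len(X)) if j != i),
               key=lambda j: math.dist(X[i], X[j]))


def tomek_links(X, y):
    """Pairs (i, j), i < j, that are mutual nearest neighbors with different labels."""
    nn = [nearest_neighbor(i, X) for i in range(len(X))]
    return [(i, j) for i, j in enumerate(nn)
            if i < j and nn[j] == i and y[i] != y[j]]


def remove_majority_in_links(X, y, majority_label):
    """Drop the majority member of every Tomek link; minority points are kept."""
    drop = {k for pair in tomek_links(X, y) for k in pair if y[k] == majority_label}
    keep = [i for i in range(len(X)) if i not in drop]
    return [X[i] for i in keep], [y[i] for i in keep]


# Toy 1-D data: the majority point at 2.0 sits right next to the minority point at 2.2.
X = [[0.0], [0.5], [1.0], [2.0], [2.2], [3.0]]
y = [0, 0, 0, 0, 1, 1]
links = tomek_links(X, y)
X_clean, y_clean = remove_majority_in_links(X, y, majority_label=0)
```

On this toy data the only link is the pair at 2.0 and 2.2, so a single majority point is removed and the class counts go from 4:2 to 3:2.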

Tomek-link cleaning rarely produces a balanced dataset by itself because the number of links is usually much smaller than the imbalance gap. It is most useful as a postprocessing step after [SMOTE](/wiki/smote) or random oversampling. Batista, Prati, and Monard (2004) found SMOTE+Tomek to be one of the strongest baselines on a panel of 13 imbalanced datasets.[8]

### How do Edited Nearest Neighbors and its variants work?

Edited Nearest Neighbors (ENN) was proposed by Dennis L. Wilson in 1972 in *IEEE Transactions on Systems, Man, and Cybernetics* (volume 2, pages 408 to 421). The rule is simple: train a kNN classifier on the data, then remove every majority point whose label disagrees with the majority vote of its $k$ nearest neighbors. The original paper used $k = 3$. Wilson showed that, asymptotically, classifying with 1-NN on the edited set approaches the Bayes risk in many problems with only a few preclassified samples.[2] ENN is more aggressive than Tomek links because it removes any majority point that looks like a minority point under kNN, not only those that form a mutual nearest-neighbor pair.
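
A brute-force sketch of the majority-only variant described above, assuming Euclidean distance and $k = 3$. Wilson's original rule edits every class; restricting removal to the majority class is the imbalance-specific adaptation. All names are illustrative.

```python
import math
from collections import Counter


def knn_vote(i, X, y, k):
    """Most common label among the k nearest neighbors of X[i], excluding i."""
    others = sorted((j for j in range(len(X)) if j != i),
                    key=lambda j: math.dist(X[i], X[j]))
    return Counter(y[j] for j in others[:k]).most_common(1)[0][0]


def edited_nearest_neighbours(X, y, majority_label, k=3):
    """Drop majority points whose label disagrees with their k-NN vote."""
    keep = [i for i in range(len(X))
            if y[i] != majority_label or knn_vote(i, X, y, k) == y[i]]
    return [X[i] for i in keep], [y[i] for i in keep]


# Toy 1-D data: one majority point (1.05) sits inside the minority cluster.
X = [[0.0], [0.1], [0.2], [0.3], [1.05], [1.0], [1.1], [1.2]]
y = [0, 0, 0, 0, 0, 1, 1, 1]
X_clean, y_clean = edited_nearest_neighbours(X, y, majority_label=0, k=3)
```

The stray majority point at 1.05 has three minority neighbors, so it is removed; the four majority points in their own cluster survive.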

Repeated ENN (RENN) applies ENN over and over until no more points are removed. All-kNN runs ENN once for each value of $k$ from 1 up to a chosen maximum and keeps only points that survive every pass. Both produce cleaner but smaller training sets than a single ENN pass.

### How do Condensed Nearest Neighbor and One-Sided Selection work?

Condensed Nearest Neighbor (CNN), proposed by Peter E. Hart in 1968 in *IEEE Transactions on Information Theory* (volume 14, pages 515 to 516), was originally a storage-reduction trick for nearest-neighbor classifiers.[1] The algorithm grows a subset $S$ by going through the training data and adding any point that is misclassified when 1-NN is run on the current $S$. Points in dense interior regions are usually classified correctly by their neighbors and never enter $S$, so the final subset retains mostly border points.
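
The growth of $S$ can be sketched in a few lines. This version assumes Euclidean distance, seeds $S$ with the first training point, and repeats passes until one adds nothing; the seeding and scan order are choices made here for illustration, and different orders can yield different subsets.

```python
import math


def predict_1nn(x, S_X, S_y):
    """Label of the stored point closest to x."""
    j = min(range(len(S_X)), key=lambda j: math.dist(x, S_X[j]))
    return S_y[j]


def condensed_nearest_neighbour(X, y):
    """Hart-style condensing: return the indices of the retained subset S."""
    store = [0]  # seed S with the first training point (an arbitrary choice)
    changed = True
    while changed:  # repeat full passes until one adds nothing
        changed = False
        for i in range(len(X)):
            if i in store:
                continue
            S_X = [X[j] for j in store]
            S_y = [y[j] for j in store]
            if predict_1nn(X[i], S_X, S_y) != y[i]:
                store.append(i)  # misclassified by the current S, so keep it
                changed = True
    return sorted(store)


# Two well-separated 1-D classes; only the seed and the border points should survive.
X = [[0.0], [1.0], [2.0], [3.0], [4.0], [6.5], [7.5], [8.5], [9.5]]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1]
kept = condensed_nearest_neighbour(X, y)
```

On this toy data the retained indices are 0, 4, and 5: the seed plus the two points facing each other across the class boundary (4.0 and 6.5).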

One-Sided Selection (OSS), proposed by Miroslav Kubat and Stan Matwin at ICML 1997, plugs CNN into the imbalance setting.[4] The algorithm keeps all minority points and applies CNN only to the majority class to drop redundant interior points, then applies Tomek-link cleaning to remove borderline majority points. The resulting subset preserves the shape of the majority distribution while stripping away both noise near the boundary and redundancy in the bulk.

### How does NearMiss work?

The NearMiss family was introduced by Inderjeet Mani and Jianping Zhang in 2003 in the *Workshop on Learning from Imbalanced Datasets* held at ICML.[6] Three variants are commonly used. NearMiss-1 keeps majority points with the smallest mean distance to their three closest minority points; NearMiss-2 uses the three farthest minority points; NearMiss-3 is a two-step algorithm that first keeps, for each minority point, its $M$ nearest majority neighbors, then among those retains the majority points whose mean distance to their $N$ nearest minority points is largest.[15] NearMiss-1 tends to retain majority points that sit close to minority clusters, which is useful when the boundary is the part of the majority class you want the classifier to learn. NearMiss-3 guarantees that every minority point keeps some majority neighbors, which makes it more robust to outliers.
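
As an illustration of the first variant only, the sketch below implements the NearMiss-1 selection rule with brute-force Euclidean distances. The function name, the `n_keep` parameter, and the toy data are assumptions made for this example.

```python
import math


def nearmiss_1(X, y, majority_label, n_keep, n_neighbors=3):
    """Keep the n_keep majority points with the smallest mean distance
    to their n_neighbors closest minority points."""
    minority = [i for i, label in enumerate(y) if label != majority_label]
    majority = [i for i, label in enumerate(y) if label == majority_label]

    def mean_dist_to_closest_minority(i):
        dists = sorted(math.dist(X[i], X[j]) for j in minority)
        closest = dists[:n_neighbors]
        return sum(closest) / len(closest)

    kept_majority = sorted(majority, key=mean_dist_to_closest_minority)[:n_keep]
    keep = sorted(kept_majority + minority)
    return [X[i] for i in keep], [y[i] for i in keep]


# Majority points at 0..5 and a minority cluster at 6.0, 6.5, 7.0.
X = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0], [6.0], [6.5], [7.0]]
y = [0, 0, 0, 0, 0, 0, 1, 1, 1]
X_res, y_res = nearmiss_1(X, y, majority_label=0, n_keep=3)
```

The three surviving majority points are the ones at 3.0, 4.0, and 5.0, which are the closest to the minority cluster. Everything farther away is discarded, which is the behavior that makes the method sensitive to how heterogeneous the majority class is.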

### What is the Neighborhood Cleaning Rule?

The Neighborhood Cleaning Rule (NCR) was proposed by Jorma Laurikkala in 2001 at the Eighth Conference on Artificial Intelligence in Medicine in Europe.[5] NCR combines ENN with a second pass that finds majority neighbors of misclassified minority points and removes them as well. In Laurikkala's experiments NCR outperformed both random undersampling and one-sided selection on a panel of medical datasets, particularly when the small class was clinically important.[5]

## How does ensemble undersampling reduce variance?

A single undersampled subset throws away most of the majority data and produces a high-variance classifier. Ensemble undersampling reduces this variance by training many classifiers on different random subsets and averaging their predictions. The same idea drives bagging on full datasets, but here each bag is a balanced sample drawn from the imbalanced original.
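
The mechanics can be sketched without any library. The example below trains one deliberately simple base learner (a nearest-centroid rule, chosen here only to keep the code short) on each balanced subset and reports the fraction of members voting for the positive class. All names and the toy data are illustrative; the published methods in the table use stronger base learners such as AdaBoost or decision trees.

```python
import math
import random


def fit_centroids(X, y):
    """A deliberately simple base learner: one centroid per class."""
    centroids = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def predict_centroids(centroids, x):
    """Label of the closest class centroid."""
    return min(centroids, key=lambda label: math.dist(x, centroids[label]))


def fit_balanced_ensemble(X, y, majority_label, n_members=10, seed=0):
    """Train one base learner per balanced random subset of the data."""
    rng = random.Random(seed)
    majority = [i for i, lab in enumerate(y) if lab == majority_label]
    minority = [i for i, lab in enumerate(y) if lab != majority_label]
    members = []
    for _ in range(n_members):
        # All minority rows plus an equally sized random draw of majority rows.
        subset = minority + rng.sample(majority, len(minority))
        members.append(fit_centroids([X[i] for i in subset], [y[i] for i in subset]))
    return members


def predict_positive_fraction(members, x, positive_label):
    """Fraction of ensemble members that vote for positive_label."""
    votes = [predict_centroids(m, x) == positive_label for m in members]
    return sum(votes) / len(votes)


# 400 majority points on a grid in [0, 1.9]^2 and 10 minority points near (5, 5).
X = [[(i % 20) * 0.1, (i // 20) * 0.1] for i in range(400)]
X += [[5 + 0.1 * j, 5 + 0.1 * j] for j in range(10)]
y = [0] * 400 + [1] * 10
members = fit_balanced_ensemble(X, y, majority_label=0, n_members=10)
```

Each member sees only 20 rows, but across the ensemble far more of the majority class is used than in any single undersampled fit, which is where the variance reduction comes from.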

| Method | Year | Idea |
|---|---|---|
| EasyEnsemble | 2009 | Sample several balanced subsets from the majority class, train an AdaBoost classifier on each, then aggregate predictions |
| BalanceCascade | 2009 | Same as EasyEnsemble but sequential; after each round, correctly classified majority points are removed before the next subset is drawn |
| RUSBoost | 2010 | At each AdaBoost iteration, randomly undersample the majority class to a target ratio before fitting the weak learner |
| Balanced Bagging | classical | Bagging where each bootstrap sample is balanced by undersampling the majority class |
| Balanced [Random Forest](/wiki/random_forest) | 2004 | Each tree in the forest is grown on a balanced bootstrap sample drawn by undersampling the majority class |

EasyEnsemble and BalanceCascade were proposed by Xu-Ying Liu, Jianxin Wu, and Zhi-Hua Zhou in 2009 in *IEEE Transactions on Systems, Man, and Cybernetics, Part B* (volume 39, pages 539 to 550).[10] As the authors describe it, "EasyEnsemble samples several subsets from the majority class, trains a learner using each of them, and combines the outputs of those learners."[10] The paper showed that both methods consistently improved AUC, F-measure, and G-mean over single-classifier undersampling, with training time comparable to a single AdaBoost run when the same number of weak learners is used.[10]

RUSBoost, introduced by Chris Seiffert, Taghi Khoshgoftaar, Jason Van Hulse, and Amri Napolitano in 2010 in *IEEE Transactions on Systems, Man and Cybernetics, Part A* (volume 40, pages 185 to 197), takes the same direction by interleaving random undersampling with AdaBoost weight updates.[11] The authors showed that RUSBoost matched or beat SMOTEBoost on most benchmarks while being simpler and faster.[11]

Balanced [Random Forest](/wiki/random_forest), proposed by Chao Chen, Andy Liaw, and Leo Breiman in Technical Report 666 from the UC Berkeley Department of Statistics in 2004, is the random-forest counterpart of these methods.[9] Each tree in the forest is grown on a bootstrap sample that is balanced by undersampling the majority class, so the forest's per-class error rates are decoupled from the original prevalence.

## Can you combine undersampling with oversampling?

Undersampling and [oversampling](/wiki/oversampling) can be combined to take advantage of both. The two most cited combined methods are SMOTE+Tomek and SMOTE+ENN, proposed by Gustavo Batista, Maria Carolina Monard, and their co-authors (Ana Bazzan in 2003, Ronaldo Prati in 2004) in papers on hybrid balancing methods.[7][8] The idea is to enrich the minority class with synthetic samples, then clean the result with an undersampling method that removes any boundary-crossing or noisy points (whether original majority or synthetic minority) that the SMOTE step has created.

SMOTE+Tomek removes Tomek-link pairs from the post-SMOTE dataset, which deletes both members of any boundary pair where SMOTE has placed a synthetic minority point too close to a real majority point. SMOTE+ENN is more aggressive, since ENN removes any point whose label disagrees with its kNN neighborhood. On Batista's panel of 13 datasets, SMOTE+ENN ranked first on 10 of them.[8]

## What are the risks and pitfalls of undersampling?

Undersampling solves one problem and introduces several others.

**Information loss.** Every removed majority point is data the model will never see. If the majority class is heterogeneous, undersampling can erase entire subregions, and the resulting classifier will fail on inputs that come from those subregions at test time. The NearMiss variants are particularly aggressive in this respect because they select majority points by their distance to minority points and drop the majority regions that lie far from any minority cluster.

**Distorted prior and biased probabilities.** Suppose the true positive prevalence is 1% but you train on a 50/50 undersampled set. The trained classifier reports posterior probabilities consistent with a 50/50 prior, not a 1/99 one. In the idealized case this distortion is a monotone transformation of the true posterior, so the ranking of predictions (and therefore AUC) is largely unaffected, but the predicted probabilities are systematically too high for the positive class. Dal Pozzolo, Caelen, Johnson, and Bontempi (2015) showed that undersampling "warps the posterior probabilities" and gave a closed-form correction: if $p_s$ is the score from the model trained on the undersampled set and $\beta$ is the fraction of the original majority class that was kept, the corrected probability is $p = \beta p_s / (\beta p_s - p_s + 1)$.[12] Later analysis (Phelps, Lizotte, and Woolford 2024) showed that Platt scaling "should not be used for calibration after undersampling without critical thought," because it was not designed to correct prior shift; closed-form prior correction, the modified Platt-scaling variant those authors propose, or decision-threshold tuning on a held-out validation set with the original prevalence is usually safer.[16]
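
The closed-form correction is a one-line function. The sketch below applies it to the 1% prevalence scenario from this paragraph, where balancing to 50/50 keeps 1 of every 99 majority rows; the function name is illustrative.

```python
def correct_undersampled_probability(p_s, beta):
    """Map a score from the undersampled model back to the original prior.

    p_s  : probability predicted by the model trained on undersampled data
    beta : fraction of the original majority class that was kept
    """
    return beta * p_s / (beta * p_s - p_s + 1.0)


# 1% positive prevalence, majority undersampled down to a 50/50 training set:
# per 100 original rows, 1 of the 99 majority rows is kept.
beta = 1 / 99
corrected = [correct_undersampled_probability(p, beta) for p in (0.1, 0.5, 0.9)]
```

A score of 0.5 from the balanced model maps back to 0.01, the original base rate, and a score of 0.9 maps to roughly 0.083. With $\beta = 1$ (no undersampling) the function returns its input unchanged, and because the mapping is monotone it does not reorder predictions.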

**Sensitivity to which points are removed.** Random undersampling depends on the random seed, and informed methods such as Tomek links and ENN depend on the distance metric, the value of $k$, and the feature scaling. Two undersampled training sets drawn from the same data can produce noticeably different classifiers, especially when the imbalance is extreme. Ensembling over multiple undersampled subsets is the standard remedy.

**Leakage and improper validation.** Undersampling must be applied only to the training fold of each cross-validation split, never to the validation or test fold. Undersampling the entire dataset before splitting leaks information across folds and produces optimistic estimates of test performance. The `imblearn.pipeline.Pipeline` class enforces this by applying samplers only during `fit` and not during `predict`.[15]

## When should you use undersampling, and how? (practical guidance)

A few rules of thumb summarize the working knowledge in the imbalanced-learning community.

1. Start with the simplest baseline: train your model on the original imbalanced data with class weights and a tuned [decision threshold](/wiki/classification_threshold). Often this matches or beats elaborate sampling.
2. If you do undersample, ensemble. A single random subset has high variance; an EasyEnsemble or Balanced Random Forest with 10 to 50 subsets is much more stable.
3. Combine sampling with [cost-sensitive learning](/wiki/cost-sensitive_learning) or threshold tuning. The two interventions are complementary.
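4. Apply samplers inside the cross-validation loop, never before. Use `imblearn.pipeline.Pipeline` for this; scikit-learn's own `sklearn.pipeline.Pipeline` does not accept imbalanced-learn samplers as steps, because samplers change the number of rows and do not implement `transform`.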
5. Recalibrate predicted probabilities if you need them. Use a closed-form prior correction or fit a calibration model (Platt scaling, isotonic regression) on a held-out set that preserves the original prevalence.
6. Measure with metrics that are not dominated by the majority class. AUC, [precision-recall](/wiki/precision-recall_curve) AUC, F1, and G-mean are all preferable to raw [accuracy](/wiki/accuracy).
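
The last point is easy to check numerically. The sketch below computes accuracy, F1, and G-mean from confusion-matrix counts for a classifier that always predicts the majority class; the helper name and the counts are illustrative.

```python
import math


def imbalance_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, F1, and G-mean from confusion counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    g_mean = math.sqrt(recall * specificity)  # geometric mean of per-class recalls
    return {"accuracy": accuracy, "precision": precision, "recall": recall,
            "f1": f1, "g_mean": g_mean}


# A classifier that always predicts the majority class on 99,000 negatives
# and 1,000 positives (illustrative counts).
always_negative = imbalance_metrics(tp=0, fp=0, fn=1000, tn=99000)
```

Accuracy comes out at 0.99 while recall, F1, and G-mean are all zero, which is why accuracy alone cannot be used to compare models on skewed data.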

## How is undersampling implemented in software?

The `imbalanced-learn` library, introduced by Guillaume Lemaitre, Fernando Nogueira, and Christos Aridas in *Journal of Machine Learning Research* (volume 18, number 17, pages 1 to 5) in 2017, is the de facto standard implementation in Python.[13] It is part of the `scikit-learn-contrib` ecosystem and depends only on numpy, scipy, and [scikit-learn](/wiki/scikit-learn). The table below lists the main undersampling APIs.

| imbalanced-learn class | Module | Method |
|---|---|---|
| `RandomUnderSampler` | `imblearn.under_sampling` | Random undersampling |
| `TomekLinks` | `imblearn.under_sampling` | Tomek-link cleaning |
| `EditedNearestNeighbours` | `imblearn.under_sampling` | Wilson's ENN |
| `RepeatedEditedNearestNeighbours` | `imblearn.under_sampling` | Repeated ENN |
| `AllKNN` | `imblearn.under_sampling` | ENN with varying $k$ |
| `CondensedNearestNeighbour` | `imblearn.under_sampling` | Hart's CNN |
| `OneSidedSelection` | `imblearn.under_sampling` | OSS (CNN + Tomek) |
| `NearMiss` | `imblearn.under_sampling` | NearMiss-1, -2, -3 (chosen by `version` parameter) |
| `NeighbourhoodCleaningRule` | `imblearn.under_sampling` | NCR |
| `ClusterCentroids` | `imblearn.under_sampling` | k-means cluster-centroid undersampling |
| `InstanceHardnessThreshold` | `imblearn.under_sampling` | Drop majority points classified with low confidence by a baseline learner |
| `EasyEnsembleClassifier` | `imblearn.ensemble` | EasyEnsemble |
| `RUSBoostClassifier` | `imblearn.ensemble` | RUSBoost |
| `BalancedBaggingClassifier` | `imblearn.ensemble` | Balanced bagging |
| `BalancedRandomForestClassifier` | `imblearn.ensemble` | Balanced Random Forest |
| `SMOTETomek` | `imblearn.combine` | SMOTE + Tomek links |
| `SMOTEENN` | `imblearn.combine` | SMOTE + ENN |

A minimal example pipeline looks like this.

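```python
from imblearn.pipeline import Pipeline
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data with roughly 1% positives; replace with your own X, y.
X, y = make_classification(
    n_samples=20_000, n_features=10, weights=[0.99, 0.01], random_state=0
)

pipeline = Pipeline([
    # The sampler runs only during fit, so validation folds keep their prevalence.
    ("under", RandomUnderSampler(sampling_strategy=0.5, random_state=0)),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Average precision is computed on the untouched validation folds.
scores = cross_val_score(pipeline, X, y, scoring="average_precision", cv=5)
```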

The `sampling_strategy=0.5` parameter requests a final ratio of 1 minority to 2 majority. The sampler is applied only to the training data inside each cross-validation fold, so test scores reflect the original prevalence.

Outside Python, the `UBL` package on CRAN provides `RandUnderClassif`, `TomekClassif`, `CNNClassif`, `ENNClassif`, and `NCLClassif` for R users, and the `themis` package provides recipe steps that integrate with the `tidymodels` framework. MATLAB ships RUSBoost as part of the Statistics and Machine Learning Toolbox.

## Is undersampling still used in modern machine learning?

The interest in undersampling techniques peaked in the late 2000s when tabular learners were the dominant family of models. Two shifts have changed how the technique is used today.

First, deep learning has moved many imbalance problems away from sampling. Class weights inside the cross-entropy loss, [focal loss](/wiki/focal_loss) (Lin et al. 2017), online hard example mining, and balanced batch sampling all let the optimizer see every example without dropping any data. For an object detector or a language model, throwing away 99% of the negative class is rarely worth it, since the model has plenty of capacity and the negative examples carry useful gradient.

Second, calibration has become a first-class concern. Pipelines that produce ranked probability scores (credit risk, ad ranking, click prediction) cannot tolerate the prior shift introduced by aggressive undersampling without an explicit correction. The classical undersample-then-train recipe has shifted to undersample-then-train-then-recalibrate, often using held-out unsampled data to fit the calibration step.

Undersampling is still the right tool in several common situations. On heavily imbalanced tabular data with millions of majority rows and tight training-time budgets, it is often the only way to fit a [random forest](/wiki/random_forest) or [gradient boosting](/wiki/gradient_boosting) model in a reasonable amount of time. On streaming data with concept drift, ensembling over fresh undersampled subsets handles both class imbalance and non-stationarity.[14] And on cost-sensitive problems where the false-negative penalty dominates, balanced ensembles such as Balanced Random Forest and EasyEnsemble are reliable defaults.

## Worked example: undersampling a 99:1 spam classifier

Consider a spam classifier trained on a dataset with 100,000 emails, of which 99,000 are legitimate and 1,000 are spam, an imbalance ratio of 99:1. A baseline logistic regression trained on the raw data tends to predict "legitimate" almost always and reports about 99% accuracy with near-zero recall on spam.

Applying random undersampling at a 1:1 ratio drops 98,000 legitimate emails uniformly at random and keeps a balanced training set of 2,000 emails. A logistic regression trained on this balanced subset reaches a much higher recall on spam, perhaps in the range of 80% to 90%, but precision drops because the trained probabilities are now consistent with a 50/50 prior and overestimate the chance of spam. To get usable probabilities back, the model is recalibrated with the closed-form prior correction $p = \beta p_s / (\beta p_s - p_s + 1)$, where $\beta = 1000 / 99000 \approx 0.0101$ is the fraction of legitimate emails that were kept. A balanced-model score of $p_s = 0.5$ then maps to $p = 0.01$, which restores the predicted base rate to roughly 1%.

A more robust pipeline replaces the single undersampled fit with an EasyEnsemble of 25 logistic regressions, each trained on a different balanced subset, and aggregates their predictions by averaging. The ensemble has lower variance than any single model and usually produces a higher AUC. If the team has time, they may swap the EasyEnsemble for SMOTE+Tomek (oversample the spam class, then clean the boundary with Tomek links) and compare the two on a held-out validation set with the original prevalence. The final choice is usually the model with the higher precision-recall AUC, or the higher precision at the recall level the product needs.

## ELI5: undersampling in plain terms

Imagine a classroom where 99 students always shout "no" and 1 student quietly says "yes." If a teacher only listens to the loudest answer, they will always hear "no" and never learn that "yes" is even possible. Undersampling sends most of the "no" students out of the room so the teacher can actually hear the one "yes." The risk is that some of those "no" students had something useful to say, and now the teacher never hears it, which is why people often bring several different rooms (ensembles) and double-check the answers afterward (calibration).

## See also

- [Oversampling](/wiki/oversampling)
- [SMOTE](/wiki/smote)
- [Class imbalance](/wiki/class_imbalance)
- [Imbalanced dataset](/wiki/imbalanced_dataset)
- [Majority class](/wiki/majority_class)
- [Minority class](/wiki/minority_class)
- [Cost-sensitive learning](/wiki/cost-sensitive_learning)
- [Tomek links](/wiki/tomek_links)
- [imbalanced-learn](/wiki/imbalanced-learn)
- [Random forest](/wiki/random_forest)

## References

1. Hart, P. E. (1968). "The condensed nearest neighbor rule." *IEEE Transactions on Information Theory*, 14(3), 515 to 516.
2. Wilson, D. L. (1972). "Asymptotic properties of nearest neighbor rules using edited data." *IEEE Transactions on Systems, Man, and Cybernetics*, 2(3), 408 to 421.
3. Tomek, I. (1976). "Two modifications of CNN." *IEEE Transactions on Systems, Man, and Cybernetics*, 6(11), 769 to 772.
4. Kubat, M. and Matwin, S. (1997). "Addressing the curse of imbalanced training sets: one-sided selection." *Proceedings of the 14th International Conference on Machine Learning*, 179 to 186.
5. Laurikkala, J. (2001). "Improving identification of difficult small classes by balancing class distribution." *Artificial Intelligence in Medicine, AIME 2001*, Lecture Notes in Computer Science 2101, Springer.
6. Mani, I. and Zhang, J. (2003). "kNN approach to unbalanced data distributions: a case study involving information extraction." *Workshop on Learning from Imbalanced Datasets*, ICML.
7. Batista, G. E. A. P. A., Bazzan, A. L. C., and Monard, M. C. (2003). "Balancing training data for automated annotation of keywords: a case study." *Proceedings of the Second Brazilian Workshop on Bioinformatics*.
8. Batista, G. E. A. P. A., Prati, R. C., and Monard, M. C. (2004). "A study of the behavior of several methods for balancing machine learning training data." *ACM SIGKDD Explorations Newsletter*, 6(1), 20 to 29.
9. Chen, C., Liaw, A., and Breiman, L. (2004). "Using random forest to learn imbalanced data." *Technical Report 666*, Department of Statistics, University of California Berkeley.
10. Liu, X.-Y., Wu, J., and Zhou, Z.-H. (2009). "Exploratory undersampling for class-imbalance learning." *IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics)*, 39(2), 539 to 550.
11. Seiffert, C., Khoshgoftaar, T. M., Van Hulse, J., and Napolitano, A. (2010). "RUSBoost: a hybrid approach to alleviating class imbalance." *IEEE Transactions on Systems, Man and Cybernetics, Part A: Systems and Humans*, 40(1), 185 to 197.
12. Pozzolo, A. D., Caelen, O., Johnson, R. A., and Bontempi, G. (2015). "Calibrating probability with undersampling for unbalanced classification." *IEEE Symposium Series on Computational Intelligence (SSCI)*, 159 to 166.
13. Lemaitre, G., Nogueira, F., and Aridas, C. K. (2017). "Imbalanced-learn: a Python toolbox to tackle the curse of imbalanced datasets in machine learning." *Journal of Machine Learning Research*, 18(17), 1 to 5.
14. Krawczyk, B. (2016). "Learning from imbalanced data: open challenges and future directions." *Progress in Artificial Intelligence*, 5(4), 221 to 232.
15. imbalanced-learn user guide, "3. Under-sampling." https://imbalanced-learn.org/stable/under_sampling.html
16. Phelps, N., Lizotte, D. J., and Woolford, D. G. (2024). "Using Platt's scaling for calibration after undersampling: limitations and how to address them." arXiv:2410.18144.

