# Oversampling

> Source: https://aiwiki.ai/wiki/oversampling
> Updated: 2026-06-23
> Categories: Data & Datasets, Machine Learning
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

Oversampling is a data preprocessing technique in [machine learning](/wiki/machine_learning) that addresses [class imbalance](/wiki/class-imbalanced_dataset) by increasing the number of minority class examples in the [training set](/wiki/training_set), either by duplicating existing samples (random oversampling) or by generating new [synthetic data](/wiki/synthetic_data) points (SMOTE, ADASYN, and their variants). It rebalances the class distribution before training so that a [classification model](/wiki/classification_model) is less likely to learn a bias toward the majority class. The most widely used synthetic method, SMOTE, was introduced by Nitesh V. Chawla and colleagues in a 2002 paper in the *Journal of Artificial Intelligence Research* (volume 16, pages 321-357) that has since been cited tens of thousands of times.[1]

*See also: [Machine learning terms](/wiki/machine_learning_terms)*

## Introduction

When a training set contains far fewer examples of one class than another, a classification model trained on that data tends to develop a bias toward the majority class, predicting it more often and performing poorly on the underrepresented class. Oversampling counteracts this problem by adding more minority class samples to the training data, either by duplicating existing ones or by generating entirely new synthetic examples.

Class imbalance is common across many real-world domains. In fraud detection, legitimate transactions vastly outnumber fraudulent ones. In medical diagnosis, healthy patients far exceed those with rare diseases. In manufacturing quality control, defective items make up a small fraction of total output. Standard classifiers such as [logistic regression](/wiki/logistic_regression), [decision trees](/wiki/decision_tree), and [neural networks](/wiki/neural_network) struggle with these skewed distributions because their [loss functions](/wiki/loss_function) are dominated by the majority class. Oversampling provides a data-level solution that rebalances the class distribution before model training begins.

Oversampling is one of several strategies for handling class imbalance. Others include [downsampling](/wiki/downsampling) (removing majority class instances), cost-sensitive learning (assigning higher misclassification penalties to the minority class), and threshold tuning (adjusting the decision threshold after training). Each approach has distinct trade-offs, and practitioners often combine multiple strategies for best results. Whether resampling actually improves prediction over these alternatives is an active debate, covered below.

## What is random oversampling?

Random oversampling is the simplest form of the technique. It works by randomly selecting instances from the minority class and duplicating them until the desired class ratio is achieved. If the majority class has 1,000 examples and the minority class has 100, random oversampling would replicate minority instances until there are also 1,000 minority examples in the training set.
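
As a rough illustration of the duplication step, the sketch below uses only the Python standard library. The function name `random_oversample` and the toy data are assumptions made for this example; they are not taken from any library.

```python
import random
from collections import Counter

def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen rows of each smaller class until every class
    is as large as the largest one. (Hypothetical helper for illustration.)"""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        pool = [x for x, lab in zip(X, y) if lab == label]
        # Sampling with replacement: the same row can be copied several times.
        extra = rng.choices(pool, k=target - count)
        X_out.extend(extra)
        y_out.extend([label] * len(extra))
    return X_out, y_out

# Toy data: 8 majority rows (label 0) and 2 minority rows (label 1).
X = [[float(i)] for i in range(10)]
y = [0] * 8 + [1] * 2
X_res, y_res = random_oversample(X, y)
print(Counter(y_res))
```

Every added row is an exact copy of one of the two original minority rows, which is why the method adds no new information.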

### Advantages

- Extremely simple to implement and understand.
- Does not require assumptions about the feature space or data distribution.
- Preserves the original data exactly without introducing artificial points.
- Works with any data type, including tabular, text, and image data.

### Disadvantages

- Increases the risk of [overfitting](/wiki/overfitting) because the model trains on exact copies of existing minority samples. The classifier may memorize specific instances rather than learning generalizable patterns.
- Does not add new information to the training set. The model sees the same minority examples repeatedly, which can lead to overly narrow decision boundaries.
- Increases training time in proportion to the duplication factor without improving the diversity of the training data.

Despite its limitations, random oversampling remains a useful baseline. Batista, Prati, and Monard (2004), for example, found it very competitive with more complex oversampling methods on their benchmark datasets.[4] Regularizing the classifier can help offset the overfitting risk that duplication introduces.

## What is SMOTE and how does it work?

SMOTE (Synthetic Minority Over-sampling Technique) is the most widely used synthetic oversampling method. It was introduced by Nitesh V. Chawla, Kevin W. Bowyer, Lawrence O. Hall, and W. Philip Kegelmeyer in their 2002 paper "SMOTE: Synthetic Minority Over-sampling Technique," published in the *Journal of Artificial Intelligence Research* (Volume 16, pages 321-357).[1] The paper has been cited tens of thousands of times and remains one of the most influential contributions to the study of class imbalance in machine learning. Its central claim is that "a combination of our method of over-sampling the minority (abnormal) class and under-sampling the majority (normal) class can achieve better classifier performance (in ROC space) than only under-sampling the majority class."[1]

### How SMOTE works

Instead of duplicating existing minority class instances, SMOTE generates new synthetic samples by interpolating between existing minority class instances and their nearest neighbors.[1] The algorithm proceeds as follows:

1. For each minority class instance **x_i**, identify its **k** nearest neighbors within the minority class (the default is k = 5).
2. Randomly select one of the k nearest neighbors, call it **x_nn**.
3. Compute the difference vector between **x_i** and **x_nn**.
4. Multiply this difference by a random number **lambda** uniformly distributed between 0 and 1.
5. Add the scaled difference to **x_i** to produce the new synthetic sample: **x_new = x_i + lambda * (x_nn - x_i)**.
6. Repeat until the desired number of synthetic samples has been generated.

Because the synthetic samples lie along line segments connecting real minority class instances in the feature space, they are plausible and maintain the general distribution of the minority class without being exact duplicates.
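
The interpolation steps above can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not the reference implementation: the function name `smote` and its parameters are chosen for this example, distances are Euclidean, and the base point is drawn at random for each new sample rather than looping over every minority instance in turn.

```python
import math
import random

def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority points
    and their k nearest minority-class neighbors. (Illustrative sketch.)"""
    if len(minority) < 2:
        raise ValueError("need at least two minority points to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        x_i = minority[i]
        # Steps 1-2: find the k nearest minority neighbors and pick one at random.
        others = [p for j, p in enumerate(minority) if j != i]
        neighbors = sorted(others, key=lambda p: math.dist(x_i, p))[:k]
        x_nn = rng.choice(neighbors)
        # Steps 3-5: x_new = x_i + lambda * (x_nn - x_i), with lambda drawn uniformly.
        lam = rng.random()
        synthetic.append([a + lam * (b - a) for a, b in zip(x_i, x_nn)])
    return synthetic

# Toy minority class: the four corners of the unit square.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote(minority, n_new=6, k=2)
```

With `k=2`, each corner's nearest neighbors are the two adjacent corners, so every synthetic point falls on an edge of the square, never outside the region spanned by the real minority points.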

### Advantages of SMOTE over random oversampling

- Reduces the risk of overfitting by introducing novel, non-duplicate data points.
- Expands the decision region of the minority class, making the learned decision boundary more generalizable.
- Encourages the classifier to learn broader patterns rather than memorizing specific instances.

### Limitations of SMOTE

- Operates only on continuous numerical features. Standard SMOTE cannot handle categorical or mixed data types directly (though variants like SMOTE-NC address this).
- May generate noisy synthetic samples if the minority class overlaps significantly with the majority class, because interpolation between a minority sample and a neighbor near the decision boundary can produce a point that falls in the majority class region.
- Can yield poorly calibrated models that overestimate the probability of belonging to the minority class.
- The k-nearest neighbors parameter must be chosen carefully. If k is too small, synthetic samples cluster tightly around existing points. If k is too large, synthetic samples may span inappropriate regions of the feature space.

## What are the main SMOTE variants?

Since its introduction, SMOTE has inspired a large family of variants designed to address its limitations.[6] The following table summarizes the most widely used ones.

| Variant | Authors / Year | Key Idea | When to use |
|---|---|---|---|
| Borderline-SMOTE | Han, Wang, & Mao (2005) | Generates synthetic samples only from minority instances near the decision boundary ("danger" zone) | When boundary samples are most informative for classification |
| ADASYN | He, Bai, Garcia, & Li (2008) | Generates more synthetic samples for minority instances that are harder to learn (more majority neighbors) | When different minority regions have varying difficulty levels |
| SVM-SMOTE | Nguyen, Cooper, & Kamei (2011) | Uses a support vector machine to identify support vectors, then applies SMOTE to those instances | When the SVM decision boundary is a good guide for where to synthesize |
| KMeans-SMOTE | Douzas, Bacao, & Last (2018) | Clusters the data with k-means, then applies SMOTE within minority-dense clusters | When the minority class has multiple sub-clusters |
| SMOTE-NC | Chawla et al. (2002) | Extends SMOTE to handle datasets with both nominal (categorical) and continuous features | When the dataset contains a mix of feature types |
| SMOTE-N | Chawla et al. (2002) | Handles purely nominal (categorical) features using the Value Difference Metric | When all features are categorical |

### Borderline-SMOTE

Borderline-SMOTE, introduced by Han, Wang, and Mao in 2005, improves on standard SMOTE by focusing synthetic sample generation on the decision boundary region.[3] The algorithm classifies each minority instance into one of three categories based on its neighborhood composition:

- **Safe:** The majority of the instance's k nearest neighbors belong to the minority class. These points are well inside the minority region and do not need additional support.
- **Danger (borderline):** At least half, but not all, of the neighbors belong to the majority class. These points sit near the decision boundary and are the most informative for the classifier.
- **Noise:** All of the neighbors belong to the majority class. These points are likely outliers or mislabeled.

Borderline-SMOTE generates synthetic samples only from the "danger" instances, concentrating new data where it matters most. This approach typically produces better decision boundaries than standard SMOTE because it does not waste synthetic samples in regions already well-represented or in noisy outlier regions.

Two sub-variants exist. Borderline-SMOTE1 interpolates only between borderline minority instances and their minority-class neighbors. Borderline-SMOTE2 also allows interpolation toward majority-class neighbors, producing samples slightly closer to the majority class region.[3]
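
The three-way split can be sketched as follows. The helper name `borderline_category` and the toy data are assumptions for illustration. The thresholds follow the rule in the original paper, where a point is noise when all of its nearest neighbors are majority points and in danger when at least half, but not all, are.

```python
import math

def borderline_category(i, X, y, minority_label, m=5):
    """Classify minority row i as 'safe', 'danger', or 'noise' from the labels
    of its m nearest neighbors in the full training set. (Illustrative sketch.)"""
    others = [j for j in range(len(X)) if j != i]
    nearest = sorted(others, key=lambda j: math.dist(X[i], X[j]))[:m]
    n_majority = sum(1 for j in nearest if y[j] != minority_label)
    if n_majority == len(nearest):
        return "noise"    # every neighbor is a majority point
    if n_majority >= len(nearest) / 2:
        return "danger"   # at least half, but not all, are majority points
    return "safe"

# One-dimensional toy data: six minority rows (label 1), seven majority rows (label 0).
X = [[0.0], [0.1], [0.25], [0.45], [1.0], [3.0],
     [1.2], [1.35], [1.6], [2.8], [2.9], [3.15], [3.3]]
y = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
categories = {X[i][0]: borderline_category(i, X, y, minority_label=1, m=3)
              for i in range(len(X)) if y[i] == 1}
print(categories)
```

Only the point at 1.0 is labeled "danger" here, so it is the only one that Borderline-SMOTE would use as a seed for synthetic samples.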

### ADASYN (Adaptive Synthetic Sampling)

ADASYN was proposed by Haibo He, Yang Bai, Edwardo A. Garcia, and Shutao Li at the 2008 IEEE International Joint Conference on Neural Networks (IJCNN).[2] Like Borderline-SMOTE, ADASYN focuses on harder-to-learn minority instances, but it uses a different mechanism to determine how many synthetic samples each instance should receive.

For each minority instance, ADASYN computes the ratio of majority-class neighbors to total neighbors. Instances surrounded by more majority-class neighbors are considered harder to classify and receive proportionally more synthetic samples. This adaptive distribution shifts the classifier's attention toward the most challenging regions of the feature space.

The two key benefits of ADASYN over standard SMOTE are: (1) it reduces the bias introduced by class imbalance by generating data proportional to difficulty, and (2) it adaptively shifts the decision boundary toward difficult examples, improving classification accuracy in those regions.[2]
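
A minimal sketch of the allocation step, assuming Euclidean distance and a hypothetical helper name `adasyn_allocation`: each minority point's share of the synthetic samples is its majority-neighbor ratio divided by the sum of all such ratios.

```python
import math

def adasyn_allocation(X, y, minority_label, n_new, k=5):
    """Return how many synthetic samples each minority row should seed,
    in the order the minority rows appear in X. (Illustrative sketch.)"""
    ratios = []
    for i in range(len(X)):
        if y[i] != minority_label:
            continue
        others = [j for j in range(len(X)) if j != i]
        nearest = sorted(others, key=lambda j: math.dist(X[i], X[j]))[:k]
        # r_i: fraction of the k nearest neighbors that belong to the majority class.
        ratios.append(sum(1 for j in nearest if y[j] != minority_label) / len(nearest))
    total = sum(ratios)
    if total == 0:
        raise ValueError("no minority point has a majority-class neighbor")
    # Normalize the ratios into a distribution and scale by the number requested.
    # Rounding means the counts may not add up to n_new exactly in general.
    return [round(n_new * r / total) for r in ratios]

# Same layout as a simple 1-D example: minority rows first, then majority rows.
X = [[0.0], [0.1], [0.25], [0.45], [1.0], [3.0],
     [1.2], [1.35], [1.6], [2.8], [2.9], [3.15], [3.3]]
y = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
allocation = adasyn_allocation(X, y, minority_label=1, n_new=10, k=3)
print(allocation)
```

In this toy example the isolated point at 3.0, whose neighbors are all majority points, receives the largest share. That illustrates a practical caveat: the weighting rewards exactly the points that Borderline-SMOTE would treat as noise, so outliers can attract many synthetic samples.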

## How is oversampling combined with undersampling?

Practitioners often achieve better results by combining oversampling of the minority class with undersampling of the majority class. Two hybrid methods are particularly popular.

### SMOTE-Tomek

SMOTE-Tomek first applies SMOTE to generate synthetic minority samples, then uses Tomek links to clean the resulting dataset. A Tomek link is a pair of instances from different classes that are each other's nearest neighbor. When Tomek links are used purely for undersampling, only the majority-class member of each link is removed; in the SMOTE-Tomek combination proposed by Batista, Prati, and Monard, both members are removed, which increases the separation between classes along the decision boundary.[4] This post-processing step helps remove noisy or ambiguous instances, including some of the synthetic samples that SMOTE may have placed in overlapping regions.
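
To make the definition concrete, the sketch below finds Tomek links by brute force and then drops both members of each link, which is one common cleaning choice (removing only the majority-class member is the other). The function names and toy data are assumptions for illustration.

```python
import math

def nearest_index(i, X):
    """Index of the row closest to row i (excluding i itself)."""
    return min((j for j in range(len(X)) if j != i),
               key=lambda j: math.dist(X[i], X[j]))

def tomek_links(X, y):
    """Return index pairs (i, j), i < j, that are mutual nearest neighbors
    with different class labels. (Illustrative brute-force sketch.)"""
    links = []
    for i in range(len(X)):
        j = nearest_index(i, X)
        if i < j and y[i] != y[j] and nearest_index(j, X) == i:
            links.append((i, j))
    return links

# Toy data: rows 2 and 3 sit right next to each other but carry different labels.
X = [[0.0], [0.2], [1.0], [1.1], [2.0], [2.3]]
y = [0, 0, 0, 1, 1, 1]
links = tomek_links(X, y)

# Cleaning step: drop every row that takes part in a link.
linked = {i for pair in links for i in pair}
X_clean = [x for i, x in enumerate(X) if i not in linked]
y_clean = [lab for i, lab in enumerate(y) if i not in linked]
```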

### SMOTE-ENN

SMOTE-ENN combines SMOTE with Edited Nearest Neighbors (ENN). After SMOTE generates synthetic samples, ENN removes any instance (majority or minority) whose class label differs from the label held by most of its three nearest neighbors. This cleaning step typically removes more instances than Tomek links and tends to produce cleaner class regions with wider margins between them.
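
The ENN rule can be sketched as a single pass over the data. The helper name `edited_nearest_neighbors` and the toy data are assumptions for illustration; all removal decisions are made against the original dataset rather than sequentially.

```python
import math
from collections import Counter

def edited_nearest_neighbors(X, y, k=3):
    """Keep only rows whose label matches the most common label among their
    k nearest neighbors. (Illustrative brute-force sketch.)"""
    keep = []
    for i in range(len(X)):
        others = [j for j in range(len(X)) if j != i]
        nearest = sorted(others, key=lambda j: math.dist(X[i], X[j]))[:k]
        vote = Counter(y[j] for j in nearest).most_common(1)[0][0]
        if vote == y[i]:
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]

# Two well-separated clusters plus one class-1 point stranded inside class 0.
X = [[0.0], [0.1], [0.2], [0.3], [0.15], [1.0], [1.1], [1.2], [1.3]]
y = [0, 0, 0, 0, 1, 1, 1, 1, 1]
X_clean, y_clean = edited_nearest_neighbors(X, y)
```

Only the stranded point at 0.15 is removed; its class-0 neighbors survive because two of their three nearest neighbors still share their label.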

Batista, Prati, and Monard (2004), who proposed both combinations, compared ten balancing methods on thirteen UCI datasets. They reported that SMOTE-Tomek and SMOTE-ENN gave very good results on datasets with few minority examples, and that oversampling methods in general achieved higher AUC than undersampling methods.[4]

## Can deep generative models be used for oversampling?

Recent advances in deep generative models have opened new avenues for oversampling that go beyond the linear interpolation of SMOTE.

### GAN-based oversampling

[Generative adversarial networks](/wiki/generative_adversarial_network_gan) (GANs) can learn the underlying distribution of the minority class and generate highly realistic synthetic samples.[10] A GAN consists of a generator network that creates synthetic data and a discriminator network that distinguishes real from generated samples. When trained on minority class data, the generator learns to produce new instances that resemble the minority class distribution.

Several GAN architectures have been adapted for oversampling:

| Method | Description |
|---|---|
| Conditional GAN (CGAN) | Conditions generation on class labels, allowing targeted minority-class sample generation |
| Wasserstein GAN (WGAN) | Uses the Wasserstein distance for more stable training, reducing mode collapse |
| BAGAN (Balancing GAN) | Specifically designed for class balancing; initializes the generator with autoencoder pre-training |
| CTGAN | Designed for tabular data; uses mode-specific normalization for mixed data types |

GAN-based oversampling can capture complex, non-linear distributions that SMOTE's linear interpolation misses.[10] However, GANs require significantly more computational resources, are harder to train (mode collapse, training instability), and may not outperform simpler methods on small or low-dimensional datasets.

### VAE-based oversampling

[Variational autoencoders](/wiki/variational_autoencoder) (VAEs) provide another generative approach. A VAE learns a compressed latent representation of the minority class data and can sample from this latent space to generate new instances. VAEs produce smoother and more diverse samples than GANs in some settings and offer a more stable training process, though the generated samples tend to be blurrier or less sharply defined in image domains.

## How does oversampling differ across data types?

The choice of oversampling technique depends heavily on the type of data being processed.

### Tabular data

Tabular data is the most common setting for oversampling. SMOTE and its variants work well with continuous numerical features. For datasets containing categorical features, SMOTE-NC or SMOTE-N should be used. GAN-based methods such as CTGAN handle mixed feature types by applying mode-specific normalization to continuous columns and using softmax for categorical columns.

### Text data

Oversampling text data for [natural language processing](/wiki/natural_language_understanding) (NLP) tasks presents unique challenges because text is high-dimensional and discrete. Simple duplication of minority-class documents is possible but offers the same overfitting risks as random oversampling with tabular data. More effective approaches include synonym replacement, back-translation (translating text to another language and back), contextual augmentation using pre-trained language models, and applying SMOTE in the embedding space (after converting text to dense vector representations using models like [BERT](/wiki/bert_bidirectional_encoder_representations_from_transformers) or sentence transformers). A 2025 study in *Scientific Reports* benchmarked 31 SMOTE-based oversampling techniques on text classification datasets using transformer-based vectorization, finding that performance varies considerably depending on the embedding model and the degree of imbalance.[9]

### Image data

For image classification tasks, oversampling overlaps with [data augmentation](/wiki/data_augmentation). Standard augmentation techniques (rotation, flipping, cropping, color jittering) applied selectively to minority-class images serve as a form of oversampling while also increasing diversity. GAN-based generation of minority-class images is also effective, particularly for medical imaging tasks where minority conditions are rare and collecting additional real data is expensive or ethically constrained. Unlike direct pixel-space interpolation (which would produce ghostly blended images), GANs generate plausible new images that reflect the learned distribution of the minority class.

## Why must oversampling happen inside cross-validation?

A critical and frequently misunderstood aspect of oversampling is its interaction with [cross-validation](/wiki/cross-validation). Applying oversampling before splitting the data into cross-validation folds introduces data leakage, because synthetic samples generated from the entire dataset may contain information about instances that end up in the validation fold.[7] This leads to overly optimistic performance estimates that do not reflect real-world generalization.

A 2024 study published in *Scientific Reports* quantified this problem in a radiomics context, finding that applying oversampling before cross-validation inflated AUC by up to 0.34, sensitivity by up to 0.33, and balanced accuracy by up to 0.37 compared to the correct procedure.[8]

### The correct procedure

Oversampling must be applied **inside** each cross-validation fold, only to the training portion of the split. The validation fold should always reflect the original, imbalanced class distribution to provide an honest estimate of model performance.[7] The correct workflow is:

1. Split the data into k folds using stratified k-fold cross-validation (to preserve class proportions in each fold).
2. For each fold iteration, apply oversampling only to the training folds.
3. Train the model on the oversampled training data.
4. Evaluate on the held-out validation fold (which remains in its original, imbalanced state).

The imbalanced-learn library provides a `Pipeline` class that automates this workflow, ensuring that oversampling is applied only within the training step of each cross-validation fold.[5]
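
To see why the order of operations matters, the following standard-library sketch tracks row indices only, with no model involved. It compares oversampling before the split with oversampling inside the split. The helper names, the 90/10 toy labels, and the use of random duplication are all assumptions for illustration.

```python
import random

def oversample(rows, y, rng):
    """Randomly duplicate minority rows (label 1) until both classes match."""
    minority = [i for i in rows if y[i] == 1]
    n_extra = len(rows) - 2 * len(minority)  # majority count minus minority count
    return rows + rng.choices(minority, k=n_extra)

def stratified_folds(rows, y, n_splits, rng):
    """Shuffle each class separately and deal its rows round-robin into folds."""
    folds = [[] for _ in range(n_splits)]
    for label in (0, 1):
        members = [i for i in rows if y[i] == label]
        rng.shuffle(members)
        for pos, i in enumerate(members):
            folds[pos % n_splits].append(i)
    return folds

rng = random.Random(0)
y = [0] * 90 + [1] * 10   # row index -> label, 10% minority
rows = list(range(100))

# Wrong order: oversample the whole dataset, then split.
leaky = stratified_folds(oversample(rows, y, rng), y, 5, rng)
leaked_rows = set(leaky[0]) & {i for fold in leaky[1:] for i in fold}

# Correct order: split first, then oversample only the training folds.
clean = stratified_folds(rows, y, 5, rng)
validation = clean[0]
training = oversample([i for fold in clean[1:] for i in fold], y, rng)
shared_rows = set(validation) & set(training)

print(len(leaked_rows), len(shared_rows))
```

In the wrong order, copies of the same minority row end up on both sides of the split, so the model is evaluated on rows it has effectively already seen. In the correct order, the validation fold shares no rows with the training data and keeps its original 18-to-2 class ratio. Synthetic methods such as SMOTE leak in a less visible way, because interpolated points carry information about validation rows without being exact copies.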

## How do you implement oversampling with imbalanced-learn?

The imbalanced-learn library (also known as imblearn) is the standard Python toolkit for handling class imbalance. Released as part of the scikit-learn-contrib project and described in a 2017 *Journal of Machine Learning Research* paper, it is built on top of [scikit-learn](/wiki/scikit-learn), depends only on NumPy, SciPy, and scikit-learn, and provides a consistent API for all oversampling, undersampling, and combination methods.[5]

The following table lists the key oversampling classes available in imbalanced-learn:

| Class | Method | Key parameter |
|---|---|---|
| `RandomOverSampler` | Random duplication of minority instances | `sampling_strategy`: target ratio or count |
| `SMOTE` | Synthetic interpolation between minority neighbors | `k_neighbors`: number of nearest neighbors (default 5) |
| `BorderlineSMOTE` | SMOTE applied to borderline minority instances | `m_neighbors`: neighborhood size for safety classification |
| `ADASYN` | Adaptive synthetic generation weighted by difficulty | `n_neighbors`: number of nearest neighbors |
| `SVMSMOTE` | SMOTE applied to SVM support vectors | `svm_estimator`: the SVM classifier to use |
| `KMeansSMOTE` | SMOTE within k-means clusters | `kmeans_estimator`: k-means configuration |
| `SMOTENC` | SMOTE for mixed nominal and continuous features | `categorical_features`: indices of categorical columns |
| `SMOTEN` | SMOTE for purely nominal features | `k_neighbors`: number of nearest neighbors (all features are treated as categorical) |

A typical usage pattern with imbalanced-learn's pipeline integration looks like this:

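```python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Placeholder data (an assumption for illustration): a synthetic binary problem
# with roughly 10% minority examples. Replace with your own X and y.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

# Define pipeline with SMOTE inside
pipeline = Pipeline([
    ('smote', SMOTE(k_neighbors=5, random_state=42)),
    ('classifier', RandomForestClassifier(random_state=42))
])

# Cross-validate correctly (SMOTE applied only to training folds)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, X, y, cv=cv, scoring='f1')
print(scores.mean())  # mean F1 for the minority (positive) class across folds
```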

This pipeline approach ensures that SMOTE is applied only to the training data within each cross-validation fold, preventing data leakage.

## Oversampling vs. undersampling

The choice between oversampling and [downsampling](/wiki/downsampling) (undersampling) depends on the specific characteristics of the dataset and the problem.

| Factor | Oversampling | Undersampling |
|---|---|---|
| Dataset size change | Increases (adds minority instances) | Decreases (removes majority instances) |
| Information preservation | Retains all original data | May discard informative majority examples |
| Overfitting risk | Higher, especially with random duplication | Lower, though majority class may be under-learned |
| Training time | Longer (more data) | Shorter (less data) |
| Best for | Small datasets where losing majority data is harmful | Large datasets with redundant majority examples |
| Synthetic data risk | May introduce noisy or unrealistic synthetic points | No synthetic data generated (in basic approaches) |

In practice, the two approaches are often combined. Oversampling the minority class with SMOTE and then cleaning the dataset with Tomek links or ENN (undersampling) gave very good results in the benchmark by Batista, Prati, and Monard, particularly on datasets with few minority examples.[4]

## Does resampling actually help, or should you use class weights instead?

Whether resampling improves prediction at all is an active and contested research question. A growing body of evidence finds that for strong modern classifiers, resampling offers little or no benefit over simply adjusting class weights or the decision threshold, and it can actively harm probability calibration.

In the 2022 study "To SMOTE, or not to SMOTE?," Yotam Elor and Hadar Averbuch-Elor ran extensive experiments using state-of-the-art boosted-tree classifiers (XGBoost, CatBoost, and LightGBM) alongside weaker learners. They concluded that "balancing does not improve prediction performance for strong classifiers," and that balancing is effective mainly when using a weak classifier or when exceptionally good oversampler hyperparameters are known in advance.[11] In other words, a strong classifier left on imbalanced data, with the decision threshold tuned afterward, often matches or beats the same classifier trained on resampled data.

For models that must output trustworthy probabilities, the calibration cost can be severe. Van den Goorbergh and colleagues, in a 2022 *Journal of the American Medical Informatics Association* paper, simulated random undersampling, random oversampling, and SMOTE on logistic-regression risk models. They reported that "all imbalance correction methods led to poor calibration (strong overestimation of the probability to belong to the minority class), but not to better discrimination in terms of the area under the receiver operating characteristic curve."[12] The authors recommended against applying imbalance corrections for risk prediction.

These findings do not make oversampling obsolete, but they reframe when it is worth using. Cost-sensitive learning (class weights), which raises the misclassification penalty on the minority class, and threshold tuning, which shifts the decision cutoff after training, are now common alternatives that avoid generating synthetic data and preserve calibrated probabilities. Resampling tends to add the most value with weaker learners, with very severe imbalance, or when the downstream metric rewards minority [recall](/wiki/recall) rather than calibrated probabilities.
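
As a small illustration of threshold tuning, the sketch below picks the cutoff that maximizes F1 on a set of predicted probabilities. The helper names and the toy scores are assumptions for this example. In practice the probabilities would come from a model trained on the original imbalanced data, and the cutoff would be chosen on a validation set rather than the test set.

```python
def f1_at(threshold, probs, labels):
    """F1 for the positive (minority) class when predicting 1 if prob >= threshold."""
    tp = sum(1 for p, t in zip(probs, labels) if p >= threshold and t == 1)
    fp = sum(1 for p, t in zip(probs, labels) if p >= threshold and t == 0)
    fn = sum(1 for p, t in zip(probs, labels) if p < threshold and t == 1)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def best_threshold(probs, labels):
    """Return the candidate cutoff (one per distinct score) with the highest F1."""
    return max(sorted(set(probs)), key=lambda t: f1_at(t, probs, labels))

# Hypothetical validation scores: both minority rows score below 0.5,
# so the default cutoff would never predict the minority class.
val_probs = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40]
val_labels = [0, 0, 0, 0, 0, 1, 0, 1]
cutoff = best_threshold(val_probs, val_labels)
print(cutoff, f1_at(cutoff, val_probs, val_labels), f1_at(0.5, val_probs, val_labels))
```

Lowering the cutoff recovers both minority rows at the cost of one false positive, without touching the training data or the model's probability estimates.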

## When should you use oversampling?

Oversampling is most appropriate in the following scenarios:

- The dataset is small and removing majority class instances (undersampling) would leave too few training examples overall.
- The minority class is severely underrepresented (for example, less than 5% of the dataset), making it difficult for the classifier to learn meaningful patterns.
- The cost of misclassifying minority instances is high (for example, failing to detect fraud or missing a disease diagnosis).
- The evaluation metric prioritizes minority class performance, such as [recall](/wiki/recall), [F1 score](/wiki/f1_score), or area under the [ROC curve](/wiki/roc_receiver_operating_characteristic_curve).

Oversampling may be less effective or unnecessary when:

- The dataset is already large and the imbalance is moderate. Cost-sensitive learning or threshold adjustment may suffice.
- The classifier inherently handles imbalance well (for example, [gradient boosting](/wiki/gradient_boosting) methods with built-in class weighting), where resampling has been shown to add little.[11]
- The problem requires well-calibrated probability estimates, since oversampling can distort predicted probabilities.[12]

## Explain Like I'm 5 (ELI5)

Imagine you are trying to teach a computer to tell the difference between cats and dogs using a big pile of photos. But there is a problem: you have 100 photos of dogs and only 5 photos of cats. The computer looks at all the photos and thinks, "Almost everything is a dog!" So whenever it sees a new picture, it just guesses "dog" every time, even when the picture is actually a cat.

Oversampling is like making more copies of those 5 cat photos so the computer has an equal number of cat and dog pictures to study. The simplest way is to just photocopy the same cat photos over and over. A smarter way, called SMOTE, is like taking two cat photos and blending them together to make a brand-new cat photo that looks a little different from both. That way, the computer gets to see more variety in what cats look like, and it does a much better job of recognizing cats alongside dogs.

## References

1. Chawla, N.V., Bowyer, K.W., Hall, L.O., & Kegelmeyer, W.P. (2002). "SMOTE: Synthetic Minority Over-sampling Technique." *Journal of Artificial Intelligence Research*, 16, 321-357. https://www.jair.org/index.php/jair/article/view/10302
2. He, H., Bai, Y., Garcia, E.A., & Li, S. (2008). "ADASYN: Adaptive Synthetic Sampling Approach for Imbalanced Learning." *Proceedings of the IEEE International Joint Conference on Neural Networks (IJCNN)*, 1322-1328.
3. Han, H., Wang, W.Y., & Mao, B.H. (2005). "Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning." *Advances in Intelligent Computing (ICIC 2005), Lecture Notes in Computer Science*, 3644, 878-887.
4. Batista, G.E., Prati, R.C., & Monard, M.C. (2004). "A Study of the Behavior of Several Methods for Balancing Machine Learning Training Data." *ACM SIGKDD Explorations Newsletter*, 6(1), 20-29.
5. Lemaitre, G., Nogueira, F., & Aridas, C.K. (2017). "Imbalanced-learn: A Python Toolbox to Tackle the Curse of Imbalanced Datasets in Machine Learning." *Journal of Machine Learning Research*, 18(17), 1-5. https://jmlr.org/papers/v18/16-365.html
6. Douzas, G., Bacao, F., & Last, F. (2018). "Improving Imbalanced Learning Through a Heuristic Oversampling Method Based on K-Means and SMOTE." *Information Sciences*, 465, 1-20.
7. Santos, M.S., Soares, J.P., Abreu, P.H., Araujo, H., & Santos, J. (2018). "Cross-validation for Imbalanced Datasets: Avoiding Overoptimistic and Overfitting Approaches." *IEEE Computational Intelligence Magazine*, 13(4), 59-76.
8. Park, J.E., Kim, D., Kim, H.S., et al. (2024). "Applying oversampling before cross-validation will lead to high bias in radiomics." *Scientific Reports*, 14, 12081.
9. Sharma, T., Gosain, A., & Malhotra, R. (2025). "A comprehensive evaluation of oversampling techniques for enhancing text classification performance." *Scientific Reports*, 15, 5891.
10. Engelmann, J. & Lessmann, S. (2021). "Conditional Wasserstein GAN-based Oversampling of Tabular Data for Imbalanced Learning." *Expert Systems with Applications*, 174, 114582.
11. Elor, Y. & Averbuch-Elor, H. (2022). "To SMOTE, or not to SMOTE?" arXiv preprint arXiv:2201.08528. https://arxiv.org/abs/2201.08528
12. van den Goorbergh, R., van Smeden, M., Timmerman, D., & Van Calster, B. (2022). "The harm of class imbalance corrections for risk prediction models: illustration and simulation using logistic regression." *Journal of the American Medical Informatics Association*, 29(9), 1525-1534. https://doi.org/10.1093/jamia/ocac093