See also: Machine learning terms
Oversampling is a data preprocessing technique used in machine learning to address class imbalance by increasing the number of instances in the minority class. When a training set contains far fewer examples of one class than another, a classification model trained on that data tends to develop a bias toward the majority class, predicting it more often and performing poorly on the underrepresented class. Oversampling counteracts this problem by adding more minority class samples to the training data, either by duplicating existing ones or by generating entirely new synthetic examples.
Class imbalance is common across many real-world domains. In fraud detection, legitimate transactions vastly outnumber fraudulent ones. In medical diagnosis, healthy patients far exceed those with rare diseases. In manufacturing quality control, defective items make up a small fraction of total output. Standard classifiers such as logistic regression, decision trees, and neural networks struggle with these skewed distributions because their loss functions are dominated by the majority class. Oversampling provides a data-level solution that rebalances the class distribution before model training begins.
Oversampling is one of several strategies for handling class imbalance. Others include downsampling (removing majority class instances), cost-sensitive learning (assigning higher misclassification penalties to the minority class), and threshold tuning (adjusting the decision threshold after training). Each approach has distinct trade-offs, and practitioners often combine multiple strategies for best results.
Random oversampling is the simplest form of the technique. It works by randomly selecting instances from the minority class and duplicating them until the desired class ratio is achieved. If the majority class has 1,000 examples and the minority class has 100, random oversampling would replicate minority instances until there are also 1,000 minority examples in the training set.
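As a minimal sketch of this mechanism (array names and class sizes are illustrative, not from any particular dataset), the duplication step can be written directly in NumPy:

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative imbalanced data: 1,000 majority rows, 100 minority rows
X_majority = rng.normal(0.0, 1.0, size=(1000, 4))
X_minority = rng.normal(2.0, 1.0, size=(100, 4))

# Duplicate minority rows at random (with replacement) until the classes match
dup_idx = rng.integers(0, len(X_minority), size=len(X_majority))
X_balanced = np.vstack([X_majority, X_minority[dup_idx]])
y_balanced = np.concatenate([np.zeros(len(X_majority)), np.ones(len(X_majority))])
```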
The main drawback of random oversampling is overfitting: because the model repeatedly sees exact copies of the same minority instances, it can memorize them rather than learn generalizable patterns. Despite this limitation, random oversampling remains a useful baseline. Research has shown that it is robust across many datasets and can perform comparably to more sophisticated synthetic methods, particularly when combined with appropriate regularization techniques.
SMOTE is the most widely used synthetic oversampling method. It was introduced by Nitesh V. Chawla, Kevin W. Bowyer, Lawrence O. Hall, and W. Philip Kegelmeyer in their 2002 paper "SMOTE: Synthetic Minority Over-sampling Technique," published in the Journal of Artificial Intelligence Research (Volume 16, pages 321-357). The paper has been cited tens of thousands of times and remains one of the most influential contributions to the study of class imbalance in machine learning.
Instead of duplicating existing minority class instances, SMOTE generates new synthetic samples by interpolating between existing minority class instances and their nearest neighbors. The algorithm proceeds as follows:

1. For each minority class instance, find its k nearest minority-class neighbors (k = 5 by default).
2. Randomly select one of those neighbors.
3. Generate a synthetic sample at a random point along the line segment between the instance and the selected neighbor: x_new = x + λ(x_neighbor − x), where λ is drawn uniformly from [0, 1].
4. Repeat until the desired number of synthetic samples has been generated.
Because the synthetic samples lie along line segments connecting real minority class instances in the feature space, they are plausible and maintain the general distribution of the minority class without being exact duplicates.
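The interpolation at the heart of the algorithm is compact enough to sketch directly. The following is a simplified, single-sample illustration rather than a full SMOTE implementation; the function name and structure are ours:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sample(X_minority, k=5, rng=None):
    """Generate one synthetic sample by SMOTE-style interpolation."""
    rng = rng or np.random.default_rng()
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)
    i = rng.integers(len(X_minority))             # pick a random minority instance
    _, nbrs = nn.kneighbors(X_minority[i:i + 1])  # its k nearest minority neighbors
    j = rng.choice(nbrs[0][1:])                   # skip index 0 (the point itself)
    lam = rng.uniform()                           # interpolation factor in [0, 1]
    return X_minority[i] + lam * (X_minority[j] - X_minority[i])
```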
Since its introduction, SMOTE has inspired a large family of variants designed to address its limitations. The following table summarizes the most widely used ones.
| Variant | Authors / Year | Key Idea | When to use |
|---|---|---|---|
| Borderline-SMOTE | Han, Wang, & Mao (2005) | Generates synthetic samples only from minority instances near the decision boundary ("danger" zone) | When boundary samples are most informative for classification |
| ADASYN | He, Bai, Garcia, & Li (2008) | Generates more synthetic samples for minority instances that are harder to learn (more majority neighbors) | When different minority regions have varying difficulty levels |
| SVM-SMOTE | Nguyen, Cooper, & Kamei (2011) | Uses a support vector machine to identify support vectors, then applies SMOTE to those instances | When the SVM decision boundary is a good guide for where to synthesize |
| KMeans-SMOTE | Douzas, Bacao, & Last (2018) | Clusters the data with k-means, then applies SMOTE within minority-dense clusters | When the minority class has multiple sub-clusters |
| SMOTE-NC | Chawla et al. (2002) | Extends SMOTE to handle datasets with both nominal (categorical) and continuous features | When the dataset contains a mix of feature types |
| SMOTE-N | Chawla et al. (2002) | Handles purely nominal (categorical) features using the Value Difference Metric | When all features are categorical |
Borderline-SMOTE, introduced by Han, Wang, and Mao in 2005, improves on standard SMOTE by focusing synthetic sample generation on the decision boundary region. The algorithm classifies each minority instance into one of three categories based on its neighborhood composition:

- Safe: most of the instance's nearest neighbors belong to the minority class, so it lies well inside the minority region.
- Danger: half or more of the nearest neighbors belong to the majority class, placing the instance near the decision boundary.
- Noise: all nearest neighbors belong to the majority class, suggesting the instance is an outlier.
Borderline-SMOTE generates synthetic samples only from the "danger" instances, concentrating new data where it matters most. This approach typically produces better decision boundaries than standard SMOTE because it does not waste synthetic samples in regions already well-represented or in noisy outlier regions.
Two sub-variants exist. Borderline-SMOTE1 interpolates only between borderline minority instances and their minority-class neighbors. Borderline-SMOTE2 also allows interpolation toward majority-class neighbors, producing samples slightly closer to the majority class region.
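In imbalanced-learn, the sub-variant is chosen with the `kind` parameter of `BorderlineSMOTE`. A brief sketch, assuming `X` and `y` are already loaded:

```python
from imblearn.over_sampling import BorderlineSMOTE

# Borderline-SMOTE1: interpolate only toward minority-class neighbors
X_b1, y_b1 = BorderlineSMOTE(kind='borderline-1', random_state=42).fit_resample(X, y)

# Borderline-SMOTE2: also allow interpolation toward majority-class neighbors
X_b2, y_b2 = BorderlineSMOTE(kind='borderline-2', random_state=42).fit_resample(X, y)
```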
ADASYN was proposed by Haibo He, Yang Bai, Edwardo A. Garcia, and Shutao Li at the 2008 IEEE International Joint Conference on Neural Networks (IJCNN). Like Borderline-SMOTE, ADASYN focuses on harder-to-learn minority instances, but it uses a different mechanism to determine how many synthetic samples each instance should receive.
For each minority instance, ADASYN computes the ratio of majority-class neighbors to total neighbors. Instances surrounded by more majority-class neighbors are considered harder to classify and receive proportionally more synthetic samples. This adaptive distribution shifts the classifier's attention toward the most challenging regions of the feature space.
The two key benefits of ADASYN over standard SMOTE are: (1) it reduces the bias introduced by class imbalance by generating data proportional to difficulty, and (2) it adaptively shifts the decision boundary toward difficult examples, improving classification accuracy in those regions.
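The allocation step can be sketched in a few lines of NumPy. This is a simplified illustration of the paper's weighting scheme, not a complete ADASYN implementation; all names are ours:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def adasyn_allocation(X, y, minority_label=1, k=5, n_synthetic=500):
    """Return the number of synthetic samples to create per minority instance."""
    X_min = X[y == minority_label]
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X_min)
    # Fraction of each instance's k neighbors that belong to the majority class
    r = np.array([(y[nbrs[1:]] != minority_label).mean() for nbrs in idx])
    r_hat = r / r.sum()                                # normalize to a distribution
    return np.round(r_hat * n_synthetic).astype(int)   # harder instances get more
```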
Practitioners often achieve better results by combining oversampling of the minority class with undersampling of the majority class. Two hybrid methods are particularly popular.
SMOTE-Tomek first applies SMOTE to generate synthetic minority samples, then uses Tomek links to clean the resulting dataset. A Tomek link is a pair of instances from different classes that are each other's nearest neighbor. By removing the majority-class member of each Tomek link, the method increases the separation between classes along the decision boundary. This post-processing step helps remove noisy or ambiguous instances, including some of the synthetic samples that SMOTE may have placed in overlapping regions.
SMOTE-ENN combines SMOTE with Edited Nearest Neighbors (ENN). After SMOTE generates synthetic samples, ENN removes any instance, majority or minority, whose class label disagrees with the label held by the majority of its three nearest neighbors. This cleaning step is more aggressive than Tomek links and tends to produce cleaner class regions with wider margins between them.
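Both hybrids are available in imbalanced-learn's `imblearn.combine` module. A brief sketch, assuming `X` and `y` are already loaded:

```python
from imblearn.combine import SMOTEENN, SMOTETomek

# SMOTE followed by Tomek-link removal
X_tl, y_tl = SMOTETomek(random_state=42).fit_resample(X, y)

# SMOTE followed by Edited Nearest Neighbors cleaning
X_enn, y_enn = SMOTEENN(random_state=42).fit_resample(X, y)
```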
Research by Batista, Prati, and Monard (2004) found that both SMOTE-Tomek and SMOTE-ENN consistently outperformed standalone SMOTE across a range of datasets, with SMOTE-ENN generally providing the largest improvements due to its more thorough cleaning process.
Recent advances in deep generative models have opened new avenues for oversampling that go beyond the linear interpolation of SMOTE.
Generative adversarial networks (GANs) can learn the underlying distribution of the minority class and generate highly realistic synthetic samples. A GAN consists of a generator network that creates synthetic data and a discriminator network that distinguishes real from generated samples. When trained on minority class data, the generator learns to produce new instances that resemble the minority class distribution.
Several GAN architectures have been adapted for oversampling:
| Method | Description |
|---|---|
| Conditional GAN (CGAN) | Conditions generation on class labels, allowing targeted minority-class sample generation |
| Wasserstein GAN (WGAN) | Uses the Wasserstein distance for more stable training, reducing mode collapse |
| BAGAN (Balancing GAN) | Specifically designed for class balancing; initializes the generator with autoencoder pre-training |
| CTGAN | Designed for tabular data; uses mode-specific normalization for mixed data types |
GAN-based oversampling can capture complex, non-linear distributions that SMOTE's linear interpolation misses. However, GANs require significantly more computational resources, are harder to train (mode collapse, training instability), and may not outperform simpler methods on small or low-dimensional datasets.
Variational autoencoders (VAEs) provide another generative approach. A VAE learns a compressed latent representation of the minority class data and can sample from this latent space to generate new instances. VAEs produce smoother and more diverse samples than GANs in some settings and offer a more stable training process, though the generated samples tend to be blurrier or less sharply defined in image domains.
The choice of oversampling technique depends heavily on the type of data being processed.
Tabular data is the most common setting for oversampling. SMOTE and its variants work well with continuous numerical features. For datasets containing categorical features, SMOTE-NC or SMOTE-N should be used. GAN-based methods such as CTGAN handle mixed feature types by applying mode-specific normalization to continuous columns and using softmax for categorical columns.
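For example, `SMOTENC` needs only the indices of the categorical columns. A short sketch with illustrative column indices, assuming `X` and `y` are already loaded:

```python
from imblearn.over_sampling import SMOTENC

# Suppose columns 0 and 3 hold categorical features in this illustrative dataset
smote_nc = SMOTENC(categorical_features=[0, 3], random_state=42)
X_res, y_res = smote_nc.fit_resample(X, y)
```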
Oversampling text data for natural language processing (NLP) tasks presents unique challenges because text is high-dimensional and discrete. Simple duplication of minority-class documents is possible but carries the same overfitting risks as random oversampling of tabular data. More effective approaches include synonym replacement, back-translation (translating text to another language and back), contextual augmentation using pre-trained language models, and applying SMOTE in the embedding space (after converting text to dense vector representations using models like BERT or sentence transformers). A 2025 study in Scientific Reports benchmarked 31 SMOTE-based oversampling techniques on text classification datasets using transformer-based vectorization, finding that performance varies considerably depending on the embedding model and the degree of imbalance.
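A sketch of the embedding-space approach, assuming the sentence-transformers package and an illustrative model name (`texts` and `labels` are presumed to be a list of documents and their class labels):

```python
from imblearn.over_sampling import SMOTE
from sentence_transformers import SentenceTransformer

# Encode documents into dense vectors, then oversample in embedding space
model = SentenceTransformer('all-MiniLM-L6-v2')  # illustrative model choice
embeddings = model.encode(texts)                 # `texts`: list of raw documents
X_res, y_res = SMOTE(random_state=42).fit_resample(embeddings, labels)
```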
For image classification tasks, oversampling overlaps with data augmentation. Standard augmentation techniques (rotation, flipping, cropping, color jittering) applied selectively to minority-class images serve as a form of oversampling while also increasing diversity. GAN-based generation of minority-class images is also effective, particularly for medical imaging tasks where minority conditions are rare and collecting additional real data is expensive or ethically constrained. Unlike direct pixel-space interpolation (which would produce ghostly blended images), GANs generate plausible new images that reflect the learned distribution of the minority class.
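A sketch of selective minority-class augmentation using torchvision (the specific transforms and repetition count are illustrative choices):

```python
from torchvision import transforms

# Augmentations applied only to minority-class images
minority_augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(degrees=15),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
])

# Generate several augmented variants per minority-class image (`img`: PIL image)
augmented_variants = [minority_augment(img) for _ in range(4)]
```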
A critical and frequently misunderstood aspect of oversampling is its interaction with cross-validation. Applying oversampling before splitting the data into cross-validation folds introduces data leakage, because synthetic samples generated from the entire dataset may contain information about instances that end up in the validation fold. This leads to overly optimistic performance estimates that do not reflect real-world generalization.
A 2024 study published in Scientific Reports quantified this problem in a radiomics context, finding that applying oversampling before cross-validation inflated AUC by up to 0.34, sensitivity by up to 0.33, and balanced accuracy by up to 0.37 compared to the correct procedure.
Oversampling must be applied inside each cross-validation fold, only to the training portion of the split. The validation fold should always reflect the original, imbalanced class distribution to provide an honest estimate of model performance. The correct workflow is:

1. Split the data into cross-validation folds.
2. Within each fold, apply oversampling to the training portion only.
3. Train the model on the oversampled training data.
4. Evaluate on the untouched, still-imbalanced validation portion.
5. Aggregate the metrics across folds.
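For contrast, the leaky anti-pattern looks like the following sketch and should be avoided (`clf`, `X`, and `y` are assumed to be defined):

```python
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import cross_val_score

# WRONG: resampling the full dataset before cross-validation lets synthetic
# points derived from future validation instances leak into the training folds
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
scores = cross_val_score(clf, X_res, y_res, cv=5)  # optimistically biased scores
```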
The imbalanced-learn library provides a Pipeline class that automates this workflow, ensuring that oversampling is applied only within the training step of each cross-validation fold.
The imbalanced-learn library (also known as imblearn) is the standard Python toolkit for handling class imbalance. It is built on top of scikit-learn and provides a consistent API for all oversampling, undersampling, and combination methods.
The following table lists the key oversampling classes available in imbalanced-learn:
| Class | Method | Key parameter |
|---|---|---|
| `RandomOverSampler` | Random duplication of minority instances | `sampling_strategy`: target ratio or count |
| `SMOTE` | Synthetic interpolation between minority neighbors | `k_neighbors`: number of nearest neighbors (default 5) |
| `BorderlineSMOTE` | SMOTE applied to borderline minority instances | `m_neighbors`: neighborhood size for safety classification |
| `ADASYN` | Adaptive synthetic generation weighted by difficulty | `n_neighbors`: number of nearest neighbors |
| `SVMSMOTE` | SMOTE applied to SVM support vectors | `svm_estimator`: the SVM classifier to use |
| `KMeansSMOTE` | SMOTE within k-means clusters | `kmeans_estimator`: k-means configuration |
| `SMOTENC` | SMOTE for mixed nominal and continuous features | `categorical_features`: indices of categorical columns |
| `SMOTEN` | SMOTE for purely nominal features | `k_neighbors`: number of nearest neighbors (all features treated as nominal) |
A typical usage pattern with imbalanced-learn's pipeline integration looks like this:
```python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Illustrative imbalanced dataset (roughly 90% majority / 10% minority)
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

# Define the pipeline with SMOTE inside it
pipeline = Pipeline([
    ('smote', SMOTE(k_neighbors=5, random_state=42)),
    ('classifier', RandomForestClassifier(random_state=42)),
])

# Cross-validate correctly: SMOTE is refit on the training portion of each fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, X, y, cv=cv, scoring='f1')
```
This pipeline approach ensures that SMOTE is applied only to the training data within each cross-validation fold, preventing data leakage.
The choice between oversampling and downsampling (undersampling) depends on the specific characteristics of the dataset and the problem.
| Factor | Oversampling | Undersampling |
|---|---|---|
| Dataset size change | Increases (adds minority instances) | Decreases (removes majority instances) |
| Information preservation | Retains all original data | May discard informative majority examples |
| Overfitting risk | Higher, especially with random duplication | Lower, though majority class may be under-learned |
| Training time | Longer (more data) | Shorter (less data) |
| Best for | Small datasets where losing majority data is harmful | Large datasets with redundant majority examples |
| Synthetic data risk | May introduce noisy or unrealistic synthetic points | No synthetic data generated (in basic approaches) |
In practice, the best strategy often combines both approaches. Oversampling the minority class with SMOTE and then cleaning the dataset with Tomek links or ENN (undersampling) has been shown to outperform either technique in isolation across many benchmark datasets.
Oversampling is most appropriate in the following scenarios:

- The dataset is small, so discarding majority examples through undersampling would lose valuable information.
- The minority class is severely underrepresented and collecting additional real minority data is expensive or impractical.
- The model or training framework does not support class weights or other cost-sensitive options.
- Minority-class recall is the primary concern, as in fraud detection or rare-disease diagnosis.
Oversampling may be less effective or unnecessary when:

- The dataset is very large and the majority class is highly redundant, making undersampling cheaper and roughly as effective.
- The imbalance is mild and the classifier already performs acceptably on the minority class.
- The classes overlap heavily, in which case synthetic samples risk adding noise to ambiguous regions.
- Algorithm-level remedies such as cost-sensitive learning or threshold tuning already address the problem.
Imagine you are trying to teach a computer to tell the difference between cats and dogs using a big pile of photos. But there is a problem: you have 100 photos of dogs and only 5 photos of cats. The computer looks at all the photos and thinks, "Almost everything is a dog!" So whenever it sees a new picture, it just guesses "dog" every time, even when the picture is actually a cat.
Oversampling is like making more copies of those 5 cat photos so the computer has an equal number of cat and dog pictures to study. The simplest way is to just photocopy the same cat photos over and over. A smarter way, called SMOTE, is like taking two cat photos and blending them together to make a brand-new cat photo that looks a little different from both. That way, the computer gets to see more variety in what cats look like, and it does a much better job of recognizing cats alongside dogs.