See also: Machine learning terms
Oversampling is a data preprocessing technique used in machine learning to address class imbalance by increasing the number of instances in the minority class. When a training set contains far fewer examples of one class than another, a classification model trained on that data tends to develop a bias toward the majority class, predicting it more often and performing poorly on the underrepresented class. Oversampling counteracts this problem by adding more minority class samples to the training data, either by duplicating existing ones or by generating entirely new synthetic examples.
Class imbalance is common across many real-world domains. In fraud detection, legitimate transactions vastly outnumber fraudulent ones. In medical diagnosis, healthy patients far exceed those with rare diseases. In manufacturing quality control, defective items make up a small fraction of total output. Standard classifiers such as logistic regression, decision trees, and neural networks struggle with these skewed distributions because their loss functions are dominated by the majority class. Oversampling provides a data-level solution that rebalances the class distribution before model training begins.
Oversampling is one of several strategies for handling class imbalance. Others include downsampling (removing majority class instances), cost-sensitive learning (assigning higher misclassification penalties to the minority class), and threshold tuning (adjusting the decision threshold after training). Each approach has distinct trade-offs, and practitioners often combine multiple strategies for best results.
Random oversampling is the simplest form of the technique. It works by randomly selecting instances from the minority class and duplicating them until the desired class ratio is achieved. If the majority class has 1,000 examples and the minority class has 100, random oversampling would replicate minority instances until there are also 1,000 minority examples in the training set.
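As a minimal sketch of this mechanism (array names and class sizes are illustrative, not from any particular dataset), the duplication step can be written directly in NumPy:

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative imbalanced data: 1,000 majority rows, 100 minority rows
X_majority = rng.normal(0.0, 1.0, size=(1000, 4))
X_minority = rng.normal(2.0, 1.0, size=(100, 4))

# Duplicate minority rows at random (with replacement) until the classes match
dup_idx = rng.integers(0, len(X_minority), size=len(X_majority))
X_balanced = np.vstack([X_majority, X_minority[dup_idx]])
y_balanced = np.concatenate([np.zeros(len(X_majority)), np.ones(len(X_majority))])
```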
The main drawback of random oversampling is overfitting: because the model repeatedly sees exact copies of the same minority instances, it can memorize them rather than learn generalizable patterns. Despite this limitation, random oversampling remains a useful baseline. Research has shown that it is robust across many datasets and can perform comparably to more sophisticated synthetic methods, particularly when combined with appropriate regularization techniques.
SMOTE is the most widely used synthetic oversampling method. It was introduced by Nitesh V. Chawla, Kevin W. Bowyer, Lawrence O. Hall, and W. Philip Kegelmeyer in their 2002 paper "SMOTE: Synthetic Minority Over-sampling Technique," published in the Journal of Artificial Intelligence Research (Volume 16, pages 321-357). The paper has been cited tens of thousands of times and remains one of the most influential contributions to the study of class imbalance in machine learning.
Instead of duplicating existing minority class instances, SMOTE generates new synthetic samples by interpolating between existing minority class instances and their nearest neighbors. The algorithm proceeds as follows:

1. For each minority class instance, find its k nearest minority-class neighbors (k = 5 by default).
2. Randomly select one of those neighbors.
3. Generate a synthetic sample at a random point along the line segment between the instance and the selected neighbor: x_new = x + λ(x_neighbor − x), where λ is drawn uniformly from [0, 1].
4. Repeat until the desired number of synthetic samples has been generated.
Because the synthetic samples lie along line segments connecting real minority class instances in the feature space, they are plausible and maintain the general distribution of the minority class without being exact duplicates.
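The interpolation at the heart of the algorithm is compact enough to sketch directly. The following is a simplified, single-sample illustration rather than a full SMOTE implementation; the function name and structure are ours:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sample(X_minority, k=5, rng=None):
    """Generate one synthetic sample by SMOTE-style interpolation."""
    rng = rng or np.random.default_rng()
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)
    i = rng.integers(len(X_minority))             # pick a random minority instance
    _, nbrs = nn.kneighbors(X_minority[i:i + 1])  # its k nearest minority neighbors
    j = rng.choice(nbrs[0][1:])                   # skip index 0 (the point itself)
    lam = rng.uniform()                           # interpolation factor in [0, 1]
    return X_minority[i] + lam * (X_minority[j] - X_minority[i])
```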
Since its introduction, SMOTE has inspired a large family of variants designed to address its limitations. The following table summarizes the most widely used ones.
| Variant | Authors / Year | Key Idea | When to use |
|---|---|---|---|
| Borderline-SMOTE | Han, Wang, & Mao (2005) | Generates synthetic samples only from minority instances near the decision boundary ("danger" zone) | When boundary samples are most informative for classification |
| ADASYN | He, Bai, Garcia, & Li (2008) | Generates more synthetic samples for minority instances that are harder to learn (more majority neighbors) | When different minority regions have varying difficulty levels |
| SVM-SMOTE | Nguyen, Cooper, & Kamei (2011) | Uses a support vector machine to identify support vectors, then applies SMOTE to those instances | When the SVM decision boundary is a good guide for where to synthesize |
| KMeans-SMOTE | Douzas, Bacao, & Last (2018) | Clusters the data with k-means, then applies SMOTE within minority-dense clusters | When the minority class has multiple sub-clusters |
| SMOTE-NC | Chawla et al. (2002) | Extends SMOTE to handle datasets with both nominal (categorical) and continuous features | When the dataset contains a mix of feature types |
| SMOTE-N | Chawla et al. (2002) | Handles purely nominal (categorical) features using the Value Difference Metric | When all features are categorical |
Borderline-SMOTE, introduced by Han, Wang, and Mao in 2005, improves on standard SMOTE by focusing synthetic sample generation on the decision boundary region. The algorithm classifies each minority instance into one of three categories based on its neighborhood composition:

- Safe: most of the instance's nearest neighbors belong to the minority class, so it lies well inside the minority region.
- Danger: half or more of the nearest neighbors belong to the majority class, placing the instance near the decision boundary.
- Noise: all nearest neighbors belong to the majority class, suggesting the instance is an outlier.
Borderline-SMOTE generates synthetic samples only from the "danger" instances, concentrating new data where it matters most. This approach typically produces better decision boundaries than standard SMOTE because it does not waste synthetic samples in regions already well-represented or in noisy outlier regions.
Two sub-variants exist. Borderline-SMOTE1 interpolates only between borderline minority instances and their minority-class neighbors. Borderline-SMOTE2 also allows interpolation toward majority-class neighbors, producing samples slightly closer to the majority class region.
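In imbalanced-learn, the sub-variant is chosen with the `kind` parameter of `BorderlineSMOTE`. A brief sketch, assuming `X` and `y` are already loaded:

```python
from imblearn.over_sampling import BorderlineSMOTE

# Borderline-SMOTE1: interpolate only toward minority-class neighbors
X_b1, y_b1 = BorderlineSMOTE(kind='borderline-1', random_state=42).fit_resample(X, y)

# Borderline-SMOTE2: also allow interpolation toward majority-class neighbors
X_b2, y_b2 = BorderlineSMOTE(kind='borderline-2', random_state=42).fit_resample(X, y)
```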
ADASYN was proposed by Haibo He, Yang Bai, Edwardo A. Garcia, and Shutao Li at the 2008 IEEE International Joint Conference on Neural Networks (IJCNN). Like Borderline-SMOTE, ADASYN focuses on harder-to-learn minority instances, but it uses a different mechanism to determine how many synthetic samples each instance should receive.
For each minority instance, ADASYN computes the ratio of majority-class neighbors to total neighbors. Instances surrounded by more majority-class neighbors are considered harder to classify and receive proportionally more synthetic samples. This adaptive distribution shifts the classifier's attention toward the most challenging regions of the feature space.
The two key benefits of ADASYN over standard SMOTE are: (1) it reduces the bias introduced by class imbalance by generating data proportional to difficulty, and (2) it adaptively shifts the decision boundary toward difficult examples, improving classification accuracy in those regions.
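The allocation step can be sketched in a few lines of NumPy. This is a simplified illustration of the paper's weighting scheme, not a complete ADASYN implementation; all names are ours:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def adasyn_allocation(X, y, minority_label=1, k=5, n_synthetic=500):
    """Return the number of synthetic samples to create per minority instance."""
    X_min = X[y == minority_label]
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X_min)
    # Fraction of each instance's k neighbors that belong to the majority class
    r = np.array([(y[nbrs[1:]] != minority_label).mean() for nbrs in idx])
    r_hat = r / r.sum()                                # normalize to a distribution
    return np.round(r_hat * n_synthetic).astype(int)   # harder instances get more
```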
Practitioners often achieve better results by combining oversampling of the minority class with undersampling of the majority class. Two hybrid methods are particularly popular.
SMOTE-Tomek first applies SMOTE to generate synthetic minority samples, then uses Tomek links to clean the resulting dataset. A Tomek link is a pair of instances from different classes that are each other's nearest neighbor. By removing the majority-class member of each Tomek link, the method increases the separation between classes along the decision boundary. This post-processing step helps remove noisy or ambiguous instances, including some of the synthetic samples that SMOTE may have placed in overlapping regions.
SMOTE-ENN combines SMOTE with Edited Nearest Neighbors (ENN). After SMOTE generates synthetic samples, ENN removes any instance, majority or minority, whose class label disagrees with the label held by the majority of its three nearest neighbors. This cleaning step is more aggressive than Tomek links and tends to produce cleaner class regions with wider margins between them.
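Both hybrids are available in imbalanced-learn's `imblearn.combine` module. A brief sketch, assuming `X` and `y` are already loaded:

```python
from imblearn.combine import SMOTEENN, SMOTETomek

# SMOTE followed by Tomek-link removal
X_tl, y_tl = SMOTETomek(random_state=42).fit_resample(X, y)

# SMOTE followed by Edited Nearest Neighbors cleaning
X_enn, y_enn = SMOTEENN(random_state=42).fit_resample(X, y)
```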
Research by Batista, Prati, and Monard (2004) found that both SMOTE-Tomek and SMOTE-ENN consistently outperformed standalone SMOTE across a range of datasets, with SMOTE-ENN generally providing the largest improvements due to its more thorough cleaning process.
Recent advances in deep generative models have opened new avenues for oversampling that go beyond the linear interpolation of SMOTE.
Generative adversarial networks (GANs) can learn the underlying distribution of the minority class and generate highly realistic synthetic samples. A GAN consists of a generator network that creates synthetic data and a discriminator network that distinguishes real from generated samples. When trained on minority class data, the generator learns to produce new instances that resemble the minority class distribution.
Several GAN architectures have been adapted for oversampling:
| Method | Description |
|---|---|
| Conditional GAN (CGAN) | Conditions generation on class labels, allowing targeted minority-class sample generation |
| Wasserstein GAN (WGAN) | Uses the Wasserstein distance for more stable training, reducing mode collapse |
| BAGAN (Balancing GAN) | Specifically designed for class balancing; initializes the generator with autoencoder pre-training |
| CTGAN | Designed for tabular data; uses mode-specific normalization for mixed data types |
GAN-based oversampling can capture complex, non-linear distributions that SMOTE's linear interpolation misses. However, GANs require significantly more computational resources, are harder to train (mode collapse, training instability), and may not outperform simpler methods on small or low-dimensional datasets.
Variational autoencoders (VAEs) provide another generative approach. A VAE learns a compressed latent representation of the minority class data and can sample from this latent space to generate new instances. VAEs produce smoother and more diverse samples than GANs in some settings and offer a more stable training process, though the generated samples tend to be blurrier or less sharply defined in image domains.
The choice of oversampling technique depends heavily on the type of data being processed.
Tabular data is the most common setting for oversampling. SMOTE and its variants work well with continuous numerical features. For datasets containing categorical features, SMOTE-NC or SMOTE-N should be used. GAN-based methods such as CTGAN handle mixed feature types by applying mode-specific normalization to continuous columns and using softmax for categorical columns.
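For example, `SMOTENC` needs only the indices of the categorical columns. A short sketch with illustrative column indices, assuming `X` and `y` are already loaded:

```python
from imblearn.over_sampling import SMOTENC

# Suppose columns 0 and 3 hold categorical features in this illustrative dataset
smote_nc = SMOTENC(categorical_features=[0, 3], random_state=42)
X_res, y_res = smote_nc.fit_resample(X, y)
```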
Oversampling text data for natural language processing (NLP) tasks presents unique challenges because text is high-dimensional and discrete. Simple duplication of minority-class documents is possible but carries the same overfitting risks as random oversampling of tabular data. More effective approaches include synonym replacement, back-translation (translating text to another language and back), contextual augmentation using pre-trained language models, and applying SMOTE in the embedding space (after converting text to dense vector representations using models like BERT or sentence transformers). A 2025 study in Scientific Reports benchmarked 31 SMOTE-based oversampling techniques on text classification datasets using transformer-based vectorization, finding that performance varies considerably depending on the embedding model and the degree of imbalance.
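A sketch of the embedding-space approach, assuming the sentence-transformers package and an illustrative model name (`texts` and `labels` are presumed to be a list of documents and their class labels):

```python
from imblearn.over_sampling import SMOTE
from sentence_transformers import SentenceTransformer

# Encode documents into dense vectors, then oversample in embedding space
model = SentenceTransformer('all-MiniLM-L6-v2')  # illustrative model choice
embeddings = model.encode(texts)                 # `texts`: list of raw documents
X_res, y_res = SMOTE(random_state=42).fit_resample(embeddings, labels)
```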
For image classification tasks, oversampling overlaps with data augmentation. Standard augmentation techniques (rotation, flipping, cropping, color jittering) applied selectively to minority-class images serve as a form of oversampling while also increasing diversity. GAN-based generation of minority-class images is also effective, particularly for medical imaging tasks where minority conditions are rare and collecting additional real data is expensive or ethically constrained. Unlike direct pixel-space interpolation (which would produce ghostly blended images), GANs generate plausible new images that reflect the learned distribution of the minority class.
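A sketch of selective minority-class augmentation using torchvision (the specific transforms and repetition count are illustrative choices):

```python
from torchvision import transforms

# Augmentations applied only to minority-class images
minority_augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(degrees=15),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
])

# Generate several augmented variants per minority-class image (`img`: PIL image)
augmented_variants = [minority_augment(img) for _ in range(4)]
```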
A critical and frequently misunderstood aspect of oversampling is its interaction with cross-validation. Applying oversampling before splitting the data into cross-validation folds introduces data leakage, because synthetic samples generated from the entire dataset may contain information about instances that end up in the validation fold. This leads to overly optimistic performance estimates that do not reflect real-world generalization.
A 2024 study published in Scientific Reports quantified this problem in a radiomics context, finding that applying oversampling before cross-validation inflated AUC by up to 0.34, sensitivity by up to 0.33, and balanced accuracy by up to 0.37 compared to the correct procedure.
Oversampling must be applied inside each cross-validation fold, only to the training portion of the split. The validation fold should always reflect the original, imbalanced class distribution to provide an honest estimate of model performance. The correct workflow is:

1. Split the data into cross-validation folds.
2. Within each fold, apply oversampling to the training portion only.
3. Train the model on the oversampled training data.
4. Evaluate on the untouched, still-imbalanced validation portion.
5. Aggregate the metrics across folds.
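For contrast, the leaky anti-pattern looks like the following sketch and should be avoided (`clf`, `X`, and `y` are assumed to be defined):

```python
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import cross_val_score

# WRONG: resampling the full dataset before cross-validation lets synthetic
# points derived from future validation instances leak into the training folds
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
scores = cross_val_score(clf, X_res, y_res, cv=5)  # optimistically biased scores
```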
The imbalanced-learn library provides a Pipeline class that automates this workflow, ensuring that oversampling is applied only within the training step of each cross-validation fold.
The imbalanced-learn library (also known as imblearn) is the standard Python toolkit for handling class imbalance. It is built on top of scikit-learn and provides a consistent API for all oversampling, undersampling, and combination methods.
The following table lists the key oversampling classes available in imbalanced-learn:
| Class | Method | Key parameter |
|---|---|---|
| `RandomOverSampler` | Random duplication of minority instances | `sampling_strategy`: target ratio or count |
| `SMOTE` | Synthetic interpolation between minority neighbors | `k_neighbors`: number of nearest neighbors (default 5) |
| `BorderlineSMOTE` | SMOTE applied to borderline minority instances | `m_neighbors`: neighborhood size for safety classification |
| `ADASYN` | Adaptive synthetic generation weighted by difficulty | `n_neighbors`: number of nearest neighbors |
| `SVMSMOTE` | SMOTE applied to SVM support vectors | `svm_estimator`: the SVM classifier to use |
| `KMeansSMOTE` | SMOTE within k-means clusters | `kmeans_estimator`: k-means configuration |
| `SMOTENC` | SMOTE for mixed nominal and continuous features | `categorical_features`: indices of categorical columns |
| `SMOTEN` | SMOTE for purely nominal features | `k_neighbors`: number of nearest neighbors (all features treated as nominal) |
A typical usage pattern with imbalanced-learn's pipeline integration looks like this:
```python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Illustrative imbalanced dataset (roughly 90% majority / 10% minority)
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

# Define the pipeline with SMOTE inside it
pipeline = Pipeline([
    ('smote', SMOTE(k_neighbors=5, random_state=42)),
    ('classifier', RandomForestClassifier(random_state=42)),
])

# Cross-validate correctly: SMOTE is refit on the training portion of each fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, X, y, cv=cv, scoring='f1')
```
This pipeline approach ensures that SMOTE is applied only to the training data within each cross-validation fold, preventing data leakage.
The choice between oversampling and downsampling (undersampling) depends on the specific characteristics of the dataset and the problem.
| Factor | Oversampling | Undersampling |
|---|---|---|
| Dataset size change | Increases (adds minority instances) | Decreases (removes majority instances) |
| Information preservation | Retains all original data | May discard informative majority examples |
| Overfitting risk | Higher, especially with random duplication | Lower, though majority class may be under-learned |
| Training time | Longer (more data) | Shorter (less data) |
| Best for | Small datasets where losing majority data is harmful | Large datasets with redundant majority examples |
| Synthetic data risk | May introduce noisy or unrealistic synthetic points | No synthetic data generated (in basic approaches) |
In practice, the best strategy often combines both approaches. Oversampling the minority class with SMOTE and then cleaning the dataset with Tomek links or ENN (undersampling) has been shown to outperform either technique in isolation across many benchmark datasets.
Oversampling is most appropriate in the following scenarios:

- The dataset is small, so discarding majority examples through undersampling would lose valuable information.
- The minority class is severely underrepresented and collecting additional real minority data is expensive or impractical.
- The model or training framework does not support class weights or other cost-sensitive options.
- Minority-class recall is the primary concern, as in fraud detection or rare-disease diagnosis.
Oversampling may be less effective or unnecessary when:

- The dataset is very large and the majority class is highly redundant, making undersampling cheaper and roughly as effective.
- The imbalance is mild and the classifier already performs acceptably on the minority class.
- The classes overlap heavily, in which case synthetic samples risk adding noise to ambiguous regions.
- Algorithm-level remedies such as cost-sensitive learning or threshold tuning already address the problem.
Imagine you are trying to teach a computer to tell the difference between cats and dogs using a big pile of photos. But there is a problem: you have 100 photos of dogs and only 5 photos of cats. The computer looks at all the photos and thinks, "Almost everything is a dog!" So whenever it sees a new picture, it just guesses "dog" every time, even when the picture is actually a cat.
Oversampling is like making more copies of those 5 cat photos so the computer has an equal number of cat and dog pictures to study. The simplest way is to just photocopy the same cat photos over and over. A smarter way, called SMOTE, is like taking two cat photos and blending them together to make a brand-new cat photo that looks a little different from both. That way, the computer gets to see more variety in what cats look like, and it does a much better job of recognizing cats alongside dogs.