SMOTE, short for Synthetic Minority Over-sampling Technique, is a data preprocessing algorithm designed to address class imbalance in supervised classification problems. It was introduced by Nitesh V. Chawla, Kevin W. Bowyer, Lawrence O. Hall, and W. Philip Kegelmeyer in a 2002 paper published in the Journal of Artificial Intelligence Research [1]. The core idea is simple: rather than duplicating existing minority class samples (as random oversampling does) or discarding majority class samples (as random undersampling does), SMOTE manufactures plausible new minority class examples by interpolating between existing ones along line segments to their nearest neighbors of the same class. The result is a more balanced training set that nominally retains the geometric structure of the minority class without exactly repeating any sample.
Since its publication, SMOTE has become one of the most cited algorithms in machine learning, accumulating tens of thousands of citations across statistics, biomedical informatics, finance, software engineering, and applied data science. It is the default oversampling baseline in nearly every empirical study of imbalanced learning and the headline method in the widely used imbalanced-learn Python library. SMOTE has also spawned a large family of variants such as Borderline-SMOTE, ADASYN, SMOTE-NC, SVM-SMOTE, KMeans-SMOTE, and DeepSMOTE, each tweaking either where in feature space the synthetic points are placed or how many are drawn from each minority cluster. Despite its popularity, the practical value of SMOTE has been increasingly questioned by recent empirical work, most prominently the 2022 study To SMOTE, or not to SMOTE? by Yotam Elor and Hadar Averbuch-Elor, which argues that strong modern classifiers like XGBoost and well-tuned deep learning models gain little or nothing from SMOTE preprocessing and often lose calibration when it is applied [2].
Many real-world classification tasks involve datasets where one class vastly outnumbers another. In credit card fraud detection, genuine transactions can outnumber fraudulent ones by 1,000 to 1 or more. In medical diagnosis for rare diseases, the prevalence of a positive case may be a fraction of one percent. In customer churn prediction, the share of customers who actually leave in a given quarter is typically far below the share who stay. In manufacturing defect detection, software bug prediction, network intrusion detection, and clinical adverse-event modeling, the pattern repeats: the interesting class is rare, and a naive classifier that simply predicts the majority label can achieve very high accuracy while being utterly useless.
The problem is not just one of metrics. Standard supervised learning algorithms minimize an aggregate loss over the training set, and when one class dominates, the loss landscape is shaped almost entirely by the majority class. Decision trees may produce splits that never isolate the minority class because the resulting subtrees would be too small. Logistic regression may learn a decision boundary that sits far from the minority cluster because moving it would barely change the average loss. K-nearest neighbor classifiers may consistently predict the majority class even within minority neighborhoods because the local sample is dominated by majority points. Imbalanced learning is therefore not just a metric-selection problem but a representation and optimization problem that affects what the model can plausibly learn.
The traditional remedies fall into three families. The first family is resampling: changing the training distribution by adding minority samples (oversampling) or removing majority samples (undersampling). The second family is cost-sensitive learning: keeping the data unchanged but assigning higher misclassification costs to the minority class through class weights, sample weights, or modified loss functions. The third family is post-hoc threshold adjustment: training a calibrated probabilistic model on the original distribution and then choosing a decision threshold that optimizes the operating characteristic the application cares about. SMOTE belongs squarely to the resampling family and was introduced to give it a more principled oversampling option than mere replication.
The SMOTE procedure operates on a single class at a time, typically the minority class in a binary problem. It assumes that features are numeric and that a Euclidean distance metric is meaningful in the feature space. The algorithm takes three inputs: the set of minority class samples, the desired oversampling amount expressed as a percentage or a target number of synthetic samples, and the number of neighbors k used when sampling (Chawla et al. used k = 5 in the original paper [1]).
The steps are as follows. For each minority sample x_i, compute its k nearest neighbors among the other minority samples using Euclidean distance. To create a single synthetic sample, choose one of those k neighbors x_nn uniformly at random, draw a random scalar u uniformly from the interval [0, 1], and define the new point as x_new = x_i + u * (x_nn - x_i). Geometrically, x_new lies somewhere along the line segment connecting x_i and x_nn. The number of synthetic samples generated per original minority sample depends on the desired oversampling ratio. If the goal is to triple the size of the minority class, the algorithm generates two synthetic samples per original sample; if the goal is full balance with a much larger majority class, many more synthetic samples per original sample are produced.
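To make the interpolation step concrete, here is a minimal sketch in numpy and scikit-learn. The function name `smote_sample` and the toy data are illustrative assumptions; this is not the imbalanced-learn implementation, which adds input validation, sampling strategies, and more careful neighbor handling.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sample(X_min, n_synthetic, k=5, random_state=0):
    """Minimal SMOTE sketch: interpolate between minority points and
    their k nearest minority neighbors. Illustrative only."""
    rng = np.random.default_rng(random_state)
    # k + 1 neighbors because each point is its own nearest neighbor
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, neighbor_idx = nn.kneighbors(X_min)

    synthetic = np.empty((n_synthetic, X_min.shape[1]))
    for j in range(n_synthetic):
        i = rng.integers(len(X_min))              # pick a seed minority point
        nn_i = rng.choice(neighbor_idx[i][1:])    # pick one of its k neighbors (skip itself)
        u = rng.uniform()                         # interpolation coefficient in [0, 1]
        synthetic[j] = X_min[i] + u * (X_min[nn_i] - X_min[i])
    return synthetic

# Example: 20 minority points in 2-D, generate 40 synthetic ones
X_min = np.random.default_rng(1).normal(size=(20, 2))
X_new = smote_sample(X_min, n_synthetic=40)
```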
The original 2002 paper combined SMOTE-based oversampling with random undersampling of the majority class, and it framed the contribution as a joint resampling strategy rather than as oversampling alone. Modern usage, however, treats SMOTE as a standalone oversampling step that is often applied without any majority undersampling. The same paper benchmarked the method against C4.5 decision trees, RIPPER rule induction, and naive Bayes, evaluating performance with the area under the ROC curve (AUC) rather than raw accuracy [1]. The reported gains in AUC over alternative resampling baselines were the empirical foundation that established SMOTE as a standard tool.
The geometric intuition behind SMOTE is that the convex hull of the minority class is roughly the right region in which to add synthetic points, since points near the segment between two same-class examples are likely to also belong to that class. This intuition is exact when the minority class is convex and well separated from the majority, but it can mislead in two important regimes. When minority and majority classes overlap in feature space, the segment between two minority points can pass through majority territory, and synthetic points on that segment may end up inside what should be majority class regions. When minority samples are noisy or mislabeled, interpolating with their neighbors propagates that noise rather than removing it. These failure modes have motivated the long line of SMOTE variants discussed below.
A more recent line of work has analyzed SMOTE through the lens of probability distributions rather than geometry. The synthetic points produced by SMOTE do not form a smooth multivariate distribution; instead they are supported on the union of one-dimensional line segments connecting pairs of original minority samples. The marginal distribution of each feature is a mixture of these segments, and certain regions of the convex hull are systematically under-covered. This finding helps explain why SMOTE often appears to fit the training data tightly but generalizes only modestly: the synthetic points lie on a measure-zero set within feature space, which is a peculiar prior to impose on the minority class.
A related theoretical observation is that SMOTE shrinks the variance of the minority class along directions transverse to the line segments while preserving variance along the segments themselves. For minority classes whose true distribution has nontrivial spread in all directions, this anisotropic shrinkage biases any classifier trained on the augmented data. The bias is small when the minority class is approximately one-dimensional but can be significant when the class is genuinely high-dimensional. This is one reason why SMOTE often performs better on tabular data with strong feature correlations than on dense high-dimensional data such as images or embeddings.
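The shrinkage can be illustrated with a short simulation under the simplifying assumption that the seed and its interpolation partner are independent draws from the same minority distribution (with real nearest neighbors the partner is correlated with the seed, which is what preserves variance along the segment). Under that assumption the covariance of the synthetic points contracts to roughly two thirds of the original covariance:

```python
import numpy as np

rng = np.random.default_rng(0)
# Simplified setting: seed and partner drawn i.i.d. from the same Gaussian
Sigma = np.array([[1.0, 0.6], [0.6, 1.0]])
X = rng.multivariate_normal(mean=[0, 0], cov=Sigma, size=200_000)
seeds, partners = X[::2], X[1::2]            # pair points up as (seed, partner)
u = rng.uniform(size=(len(seeds), 1))
synthetic = seeds + u * (partners - seeds)   # SMOTE interpolation rule

# Each entry is roughly 2/3: the synthetic cloud is a shrunken copy of the original
print(np.cov(synthetic, rowvar=False) / Sigma)
```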
A third concern is leakage and privacy. Because each synthetic sample is a deterministic interpolation of two real minority samples plus a random scalar, the synthetic points carry information about specific individuals in the training data. A 2025 line of work showed that SMOTE-augmented datasets can leak membership information and even allow approximate reconstruction of original minority points, which has implications when the minority class corresponds to sensitive medical or financial cases. Practitioners working in regulated domains should treat SMOTE-generated data with the same care as the underlying real data and should not assume that the synthetic samples are anonymous.
The SMOTE family has grown into dozens of named variants, each adapting the basic interpolation idea to a particular failure mode of the original algorithm. The table below summarizes the most widely used members of the family.
| Variant | Year | Key idea | Best suited for |
|---|---|---|---|
| SMOTE | 2002 | Random interpolation between minority sample and one of its k nearest minority neighbors [1] | Numeric, well-separated classes |
| Borderline-SMOTE1 / Borderline-SMOTE2 | 2005 | Oversample only minority samples whose neighborhood contains many majority points (the borderline) [3] | Overlapping classes near decision boundary |
| Safe-Level SMOTE | 2009 | Weight interpolation by a safety score derived from the neighborhood composition | Mixed safe and noisy minority regions |
| ADASYN (Adaptive Synthetic Sampling) | 2008 | Generate more synthetic samples for harder-to-learn minority points, fewer for easy ones [4] | Adaptive emphasis on hard examples |
| SMOTE-NC | 2002 (in original paper) | Handle mixed numeric and categorical features by interpolating numerics and majority-voting categoricals | Tabular data with categorical attributes |
| SMOTE-N | 2002 | Extension for purely nominal feature spaces using a value difference metric | Categorical-only datasets |
| SVM-SMOTE | 2009 | Use support vectors of an SVM trained on the original data to identify borderline minority points | Borderline cases via SVM geometry |
| KMeans-SMOTE | 2017 | Cluster the data with KMeans, then apply SMOTE only within minority-rich clusters | Multi-modal minority distributions |
| Cluster-SMOTE | 2006 | Cluster minority class first, then apply SMOTE within each cluster | Avoiding interpolation across modes |
| Geometric SMOTE | 2019 | Generate samples within a geometric region (such as a hyper-ellipsoid) around each minority point rather than along line segments | Densifying full neighborhoods, not just segments |
| MWMOTE | 2014 | Majority-Weighted Minority Oversampling that weights samples by informativeness | Hard-to-learn minority subregions |
| ROSE | 2014 | Random over-sampling examples via smoothed bootstrap sampling | Smoother synthetic distributions |
| DeepSMOTE | 2021 | Apply SMOTE in the latent space of an encoder-decoder network rather than raw pixel space | Image data and high-dimensional inputs [5] |
| SMOTified-GAN | 2021 | Run SMOTE first to produce candidate minority samples, then refine them with a GAN | Tabular data when GANs alone underfit minority class |
Introduced by Hui Han, Wen-Yuan Wang, and Bing-Huan Mao in 2005, Borderline-SMOTE recognized that not all minority samples are equally useful for training a classifier [3]. Samples deep inside the minority cluster contribute little to the decision boundary, while samples near the boundary between the minority and majority classes carry most of the discriminative information. The Borderline-SMOTE algorithm therefore identifies the DANGER set: minority samples for which at least half, but not all, of the k nearest neighbors (computed over the entire training set, not just the minority class) belong to the majority class. Only these borderline samples are used as seeds for synthetic generation. The variant Borderline-SMOTE2 additionally interpolates between borderline minority samples and their majority neighbors, with the interpolation parameter constrained to keep the synthetic point closer to the minority side.
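A sketch of the DANGER-set identification using scikit-learn's neighbor search; the function name `danger_set` is illustrative and follows the description above rather than the imbalanced-learn internals. In imbalanced-learn, the two variants are exposed as `BorderlineSMOTE(kind="borderline-1")` and `kind="borderline-2"`.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def danger_set(X, y, minority_label, k=5):
    """Sketch of the Borderline-SMOTE DANGER step: flag minority points whose
    neighborhood (over the whole training set) is at least half, but not
    entirely, majority. Illustrative, not the imbalanced-learn implementation."""
    X_min = X[y == minority_label]
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X_min)
    # count majority neighbors, skipping each point's self-match
    maj_counts = np.array([(y[row[1:]] != minority_label).sum() for row in idx])
    is_danger = (maj_counts >= k / 2) & (maj_counts < k)  # at least half, but not all
    return X_min[is_danger]
```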
The Adaptive Synthetic Sampling approach (ADASYN), introduced by Haibo He and colleagues in 2008, generalizes Borderline-SMOTE by computing a continuous difficulty score for each minority sample rather than a binary borderline indicator [4]. The difficulty score is the fraction of majority neighbors among the k nearest neighbors, so a minority sample surrounded entirely by majority points has a difficulty of 1.0 and a sample surrounded entirely by other minority points has a difficulty of 0.0. The number of synthetic samples generated per original minority point is proportional to its difficulty score, normalized so that the total reaches the desired class balance. The motivation is that hard examples deserve more synthetic neighbors to help the classifier carve out the minority region near them, while easy examples already have plenty of representation. Empirical comparisons typically rank ADASYN slightly above plain SMOTE on benchmark datasets when paired with decision tree classifiers, although the advantage often disappears for stronger classifiers.
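The allocation step can be sketched as follows; `adasyn_allocation` is an illustrative name, and imbalanced-learn's `ADASYN` class implements the full procedure, including the synthesis step.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def adasyn_allocation(X, y, minority_label, n_to_generate, k=5):
    """Sketch of ADASYN's allocation step: each minority point receives a share of
    the synthetic budget proportional to how many of its k neighbors are majority."""
    X_min = X[y == minority_label]
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X_min)
    # difficulty score in [0, 1]: fraction of majority points among the k neighbors
    r = np.array([(y[row[1:]] != minority_label).mean() for row in idx])
    if r.sum() == 0:
        return np.zeros(len(X_min), dtype=int)   # no hard points: nothing to emphasize
    weights = r / r.sum()
    return np.round(weights * n_to_generate).astype(int)  # synthetic samples per seed
```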
Real tabular datasets often mix numeric and categorical features, and standard SMOTE breaks down on categorical attributes because there is no meaningful interpolation between, say, the categories red, green, and blue. SMOTE-NC (Nominal and Continuous) handles this by treating each feature type differently. For numeric features, it interpolates as in standard SMOTE. For categorical features, the synthetic sample takes the most frequent value among the k nearest neighbors of the seed minority point. The distance metric used to find neighbors is itself adapted: continuous features contribute the usual Euclidean component, while categorical mismatches contribute a fixed penalty equal to the median standard deviation of the continuous features in the minority class. SMOTE-N is a further variant for purely categorical datasets that uses a Value Difference Metric (VDM) for both neighbor search and synthetic value selection. Both variants are implemented in imbalanced-learn.
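A minimal usage sketch with imbalanced-learn's `SMOTENC`; the toy matrix and the choice of which columns are categorical are illustrative assumptions.

```python
import numpy as np
from imblearn.over_sampling import SMOTENC

# Toy mixed-type matrix: columns 0 and 2 are categorical (label-encoded),
# column 1 is numeric. Data and column roles are purely illustrative.
X = np.array([[0, 1.2, 2], [1, 0.7, 0], [0, 1.9, 1], [2, 0.1, 2],
              [1, 1.5, 0], [0, 0.3, 1], [2, 2.2, 2], [1, 0.9, 0],
              [0, 2.5, 1], [2, 1.1, 2], [1, 0.4, 0], [0, 1.7, 1]])
y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1])

sm = SMOTENC(categorical_features=[0, 2], k_neighbors=3, random_state=0)
X_res, y_res = sm.fit_resample(X, y)
# Column 1 is interpolated; columns 0 and 2 keep valid category values
# (the most frequent value among the seed's nearest neighbors).
```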
SVM-SMOTE replaces the k-nearest-neighbor borderline detection of Borderline-SMOTE with the support vectors of a Support Vector Machine trained on the original imbalanced data. The intuition is that SVM support vectors lie precisely on or near the decision boundary, so they are a more principled way to identify borderline minority points than counting majority neighbors. Synthetic samples are generated along segments connecting each minority support vector to its same-class neighbors, with extrapolation allowed in regions where the minority class is sparse. KMeans-SMOTE first partitions the data with KMeans clustering, then evaluates each cluster's imbalance ratio and minority density. SMOTE is applied within clusters that are minority-rich and reasonably dense, in proportion to how sparse the minority class is locally. The cluster-aware approach helps prevent interpolation across disjoint minority modes, which can otherwise create synthetic points in genuinely majority territory between two minority subclusters.
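Both samplers are available in imbalanced-learn under the usual `fit_resample` interface. A short usage sketch with synthetic data; note that `KMeansSMOTE` typically needs its clustering parameters tuned before it finds usable minority-rich clusters, so it is shown commented out here.

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SVMSMOTE, KMeansSMOTE

# Synthetic imbalanced data, roughly 5% positives (illustrative)
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05],
                           n_informative=4, random_state=0)

X_svm, y_svm = SVMSMOTE(random_state=0).fit_resample(X, y)
print(Counter(y), Counter(y_svm))

# KMeansSMOTE follows the same interface but usually needs its clustering
# parameters (kmeans_estimator, cluster_balance_threshold) adapted to the data:
# X_km, y_km = KMeansSMOTE(random_state=0).fit_resample(X, y)
```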
Classical SMOTE struggles on high-dimensional data such as images, where Euclidean interpolation between two pixel arrays produces nonsensical blurred composites rather than realistic new images. DeepSMOTE, introduced by Damien Dablain, Bartosz Krawczyk, and Nitesh Chawla in 2021, addresses this by training an encoder-decoder network on the original data and then applying SMOTE in the learned latent space rather than the raw input space [5]. The interpolated latent codes are decoded back into the input space to produce synthetic samples that lie on the data manifold. SMOTified-GAN takes a complementary approach: SMOTE is run first to produce candidate synthetic samples, then those candidates are passed through a Generative Adversarial Network's discriminator-generator loop to refine them into more realistic distributions. Both lines of work try to combine SMOTE's local interpolation principle with the manifold-learning capabilities of modern neural networks.
SMOTE is one option among many for handling class imbalance, and it is rarely the only sensible choice. The table below contrasts the major families of methods.
| Method | Family | Mechanism | Pros | Cons |
|---|---|---|---|---|
| Random oversampling | Resampling | Duplicate minority samples until balanced | Simple, no new points to validate | Pure overfitting risk; trees memorize duplicates |
| Random undersampling | Resampling | Drop majority samples until balanced | Smaller training set; trains faster | Throws away potentially useful information |
| SMOTE | Resampling | Interpolate new minority samples between neighbors [1] | Adds variety to minority class; widely supported | Can place points in majority regions; weakens calibration |
| Borderline-SMOTE | Resampling | Interpolate only near the boundary [3] | Focuses synthetic mass where it matters | Sensitive to k and to noise on the boundary |
| ADASYN | Resampling | Adaptive synthesis weighted by local hardness [4] | Emphasizes hard cases automatically | Can over-generate near outliers and noise |
| Tomek links | Resampling | Remove majority samples that form Tomek pairs with minority samples | Cleans the boundary | Removes only a small fraction of majority points |
| Edited Nearest Neighbors (ENN) | Resampling | Drop samples whose class disagrees with their neighbor majority | Removes noise from both classes | Aggressive; can remove valid minority points |
| NearMiss-1, 2, 3 | Resampling | Retain only the majority samples closest to the minority class; the three variants differ in how closeness is computed | Targeted majority reduction | Highly sensitive to k and feature scaling |
| SMOTE + Tomek / SMOTE + ENN | Hybrid | Apply SMOTE then clean with Tomek or ENN | Adds variety then prunes noise | More hyperparameters; longer pipeline |
| Class weights | Cost-sensitive | Multiply per-class loss in the objective | No data manipulation; preserves calibration well | Effectiveness depends on optimizer and loss landscape |
| Focal loss | Cost-sensitive | Down-weight easy examples to focus on hard cases [6] | Strong on dense detection; works for deep learning | Needs tuning of focusing parameter gamma |
| Class-balanced loss | Cost-sensitive | Reweight by effective number of samples per class | Principled handling of long-tailed data | Adds a tunable hyperparameter |
| Threshold moving | Post-hoc | Train on original distribution, choose decision threshold to optimize a target metric | Preserves probability calibration | Requires a probabilistic classifier and a held-out set |
Resampling (oversampling, undersampling, SMOTE) and reweighting (class weights, sample weights, focal loss) are mathematically related but operationally different. For empirical risk minimization with a separable loss, applying class weight w_c to class c is equivalent to oversampling class c by a factor of w_c in expectation, since both modify the per-class contribution to the gradient by the same multiplicative factor. The two diverge when the loss is not strictly separable, when training proceeds in mini-batches rather than full passes over the data, or when the optimizer has memory (as with momentum or adaptive methods). In practice, class weighting is often preferred because it preserves the data exactly, requires no preprocessing pipeline, and keeps probability estimates better calibrated.
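A sketch of the two routes with scikit-learn; the dataset is synthetic, the oversampling factor is rounded to an integer, and the comparison is only approximate.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)

# Route 1: reweight the loss so each class contributes equally in expectation.
clf_weighted = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)

# Route 2: replicate minority rows by (approximately) the same factor.
w = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)
factor = int(round(w[1] / w[0]))                 # ratio of minority weight to majority weight
reps = np.where(y == 1, factor, 1)
clf_oversampled = LogisticRegression(max_iter=1000).fit(np.repeat(X, reps, axis=0),
                                                        np.repeat(y, reps))
# The two models make similar predictions; exact equivalence holds only in
# expectation, for a separable loss, and up to an overall scale that interacts
# with the regularization term.
```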
Threshold moving is a complementary technique that often suffices on its own. The idea is to train a probabilistic classifier on the original imbalanced distribution, accept the resulting Bayes-optimal probabilities, and then pick a decision threshold (such as 0.1 instead of 0.5 for a rare positive class) to maximize a target operating characteristic such as F1, recall at fixed precision, or expected cost. When the goal is a binary decision and the classifier outputs well-calibrated probabilities, threshold moving achieves much of what resampling tries to accomplish without distorting the probabilities themselves. This is particularly important in domains like medical decision support and credit scoring where calibrated probabilities are required downstream.
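A minimal threshold-moving sketch with scikit-learn; the choice of F1 as the target metric and of `HistGradientBoostingClassifier` as the model are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, stratify=y, random_state=0)

# Train on the original imbalanced distribution; no resampling.
clf = HistGradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
proba = clf.predict_proba(X_val)[:, 1]

# Pick the decision threshold that maximizes F1 on a held-out set.
prec, rec, thresholds = precision_recall_curve(y_val, proba)
f1 = 2 * prec * rec / np.clip(prec + rec, 1e-12, None)
best = thresholds[np.argmax(f1[:-1])]   # the last prec/rec point has no threshold
y_pred = (proba >= best).astype(int)
```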
In deep learning, SMOTE is rarely the first method tried. The dominant alternatives are loss-function modifications and curriculum strategies. The table below summarizes the most widely used approaches.
| Approach | Year | Description | Typical use case |
|---|---|---|---|
| Focal loss | 2017 | Multiplies cross-entropy by (1 - p_t)^gamma to down-weight easy examples [6] | Object detection, medical image segmentation |
| Class-balanced loss | 2019 | Reweights by 1 / effective_num(c) where effective_num is a function of class size | Long-tailed image classification |
| Logit adjustment | 2021 | Adds a class-prior offset to the logits during training and removes it at inference | Long-tailed classification with strong theoretical grounding |
| Decoupled training | 2020 | Train representation on original data, then retrain only the classifier head on rebalanced data | Long-tailed image and text classification |
| Self-supervised pretraining | 2020 onward | Pretrain on the full unlabeled dataset to learn balanced representations, then fine-tune | Any task with abundant unlabeled data |
| Mixup and CutMix | 2018 / 2019 | Augment with linear or patch-wise interpolations between training samples | General regularization with side benefit on imbalance |
| Synthetic data generation with diffusion or LLMs | 2023 onward | Generate realistic minority samples from generative models | Text and image classification with text and vision foundation models |
Focal loss, introduced by Tsung-Yi Lin and colleagues in 2017 for dense object detection [6], modifies the cross-entropy loss with a factor (1 minus the predicted probability of the true class) raised to a focusing power gamma. This down-weights examples that are already classified well and concentrates gradient signal on hard, often minority, examples. Focal loss is now standard in object detection (originally for RetinaNet) and has been widely adopted in medical image segmentation, where pixel-level class imbalance is severe. It does not require resampling and preserves probability calibration better than SMOTE, although it introduces a new tunable parameter.
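A minimal numpy sketch of the binary, alpha-balanced form of the loss; deep learning frameworks provide batched, autograd-ready versions, and the toy inputs here are illustrative.

```python
import numpy as np

def binary_focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss as in Lin et al. [6]: FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t),
    where p are predicted positive-class probabilities and y are 0/1 labels."""
    p = np.clip(p, 1e-7, 1 - 1e-7)
    p_t = np.where(y == 1, p, 1 - p)              # probability assigned to the true class
    alpha_t = np.where(y == 1, alpha, 1 - alpha)  # optional class-balancing factor
    return -np.mean(alpha_t * (1 - p_t) ** gamma * np.log(p_t))

# gamma = 0 recovers (alpha-weighted) cross-entropy; larger gamma focuses on hard examples
y = np.array([1, 0, 0, 0, 1])
p = np.array([0.9, 0.1, 0.4, 0.05, 0.3])
print(binary_focal_loss(p, y, gamma=2.0))
```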
Class-balanced loss, introduced by Yin Cui and colleagues in 2019, reweights each class by an effective number of samples that diminishes returns on additional samples in already-large classes. Logit adjustment shifts the logits by the log of the class prior at training time, with theoretical guarantees of Bayes-consistency for balanced error. Decoupled training observes that representation learning and classifier learning have different sensitivities to class imbalance: representations are largely robust to imbalance and can be learned on the natural distribution, while the final classifier head benefits from being retrained on a rebalanced distribution. Self-supervised pretraining exploits the full unlabeled dataset and is increasingly the dominant approach for any imbalanced task that has access to large unlabeled corpora.
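Both reweighting schemes reduce to a few lines. The sketch below follows the published formulas, but the normalization convention and the toy class counts are illustrative choices.

```python
import numpy as np

def class_balanced_weights(counts, beta=0.999):
    """Class-balanced loss weights (Cui et al. 2019): weight each class by the inverse
    of its 'effective number' of samples, E_n = (1 - beta^n) / (1 - beta)."""
    effective_num = (1.0 - np.power(beta, counts)) / (1.0 - beta)
    weights = 1.0 / effective_num
    return weights / weights.sum() * len(counts)   # normalize so weights average to 1

def logit_adjustment(logits, class_priors, tau=1.0):
    """Additive logit adjustment (Menon et al. 2021): subtract tau * log(prior) from the
    logits at inference (or, equivalently, add it to the logits during training)."""
    return logits - tau * np.log(class_priors)

counts = np.array([9000, 900, 100])                # long-tailed class counts (illustrative)
print(class_balanced_weights(counts))
print(logit_adjustment(np.array([2.0, 1.5, 1.0]), counts / counts.sum()))
```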
The most influential recent critique of SMOTE is the 2022 paper To SMOTE, or not to SMOTE? by Yotam Elor and Hadar Averbuch-Elor, which conducted a large-scale empirical study spanning 73 datasets, multiple oversampling techniques, and a range of classifiers from weak (decision trees, k-nearest neighbors, multilayer perceptron) to strong (XGBoost, LightGBM, CatBoost) [2]. The headline finding is that for strong modern classifiers, no oversampling method (including SMOTE, Borderline-SMOTE, ADASYN, and several others) provided improvements that exceeded one standard deviation in either rank or mean across datasets and metrics. For XGBoost in particular, oversampling failed to deliver meaningful gains regardless of the metric used, while it consistently degraded log loss and probability calibration.
The paper offers a coherent explanation. SMOTE and similar methods change the training distribution to make the classifier more sensitive to the minority class. With weak classifiers that have limited capacity and lack built-in mechanisms for handling skewed distributions, this can help. With strong classifiers that are flexible enough to fit the original imbalanced distribution well and are paired with appropriate scoring metrics, the resampling distorts the learned probabilities without improving discrimination. Threshold moving achieves the same operational effect (favoring the minority class in decisions) without distorting calibration.
The practical recommendations from the study are nuanced. When using a strong classifier and caring about a proper metric such as log loss or AUC, the best strategy is typically to use the original imbalanced data and tune the classifier directly, possibly combined with threshold moving for binary decisions. When using a weaker classifier (decision tree, multilayer perceptron, AdaBoost, naive Bayes), SMOTE-like techniques can still help. When the metric of interest is itself sensitive to class balance (such as accuracy on a balanced test set, or balanced accuracy), oversampling can improve scores even with strong classifiers, although this is closer to gaming the metric than to improving the underlying model.
This critique echoes earlier observations by other researchers. The 2009 IEEE TKDE survey by Haibo He and Edwardo Garcia surveyed the state of imbalanced learning and noted that the empirical advantage of SMOTE over simpler baselines was often modest and inconsistent, with results varying widely across datasets and classifiers [7]. Later meta-analyses found that SMOTE and its variants frequently failed to outperform a well-tuned random forest with appropriate class weights, and that the most reliable gains came from cleaning techniques (Tomek links, ENN) applied after resampling rather than from the resampling itself.
The canonical Python implementation is the imbalanced-learn library (imblearn), an extension of the scikit-learn ecosystem that provides SMOTE, Borderline-SMOTE (versions 1 and 2), ADASYN, SMOTE-NC, SMOTE-N, SVM-SMOTE, KMeans-SMOTE, and combinations such as SMOTE+Tomek and SMOTE+ENN. The library is maintained by the scikit-learn-contrib organization and follows the scikit-learn API conventions for fit and transform methods. R users can access SMOTE through the DMwR package and through the themis package, which integrates with the tidymodels ecosystem. Weka, the venerable Java machine learning workbench, includes SMOTE as a built-in filter under weka.filters.supervised.instance.SMOTE. SAS and SPSS both offer SMOTE-style preprocessing as part of their imbalanced classification pipelines.
A typical imblearn workflow first splits the data into training and test sets, then applies SMOTE only to the training set (applying it to the full dataset before splitting would leak synthetic samples into the test set and inflate evaluation metrics). The Pipeline class in imblearn supports composing a sampler with downstream transformers and an estimator so that the resampling is correctly wrapped inside cross-validation folds. Recommended practice is to tune both the classifier hyperparameters and the SMOTE parameters (most importantly k_neighbors, sampling_strategy, and the choice of variant) jointly using cross-validation, and to evaluate using a metric such as AUC or PR-AUC that is robust to the class imbalance in the test set.
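A representative sketch of that workflow; the dataset, grid values, and choice of RandomForestClassifier are illustrative, but the key point carries over to real projects: the sampler sits inside an imblearn Pipeline, so it runs only on the training portion of each cross-validation fold.

```python
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=3000, weights=[0.93, 0.07], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Scaling and SMOTE are fitted inside each fold; no synthetic points leak into validation data.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("smote", SMOTE(random_state=0)),
    ("clf", RandomForestClassifier(random_state=0)),
])
param_grid = {
    "smote__k_neighbors": [3, 5],
    "smote__sampling_strategy": [0.3, 0.5, 1.0],   # partial as well as full rebalancing
    "clf__n_estimators": [200],
}
search = GridSearchCV(pipe, param_grid, scoring="average_precision",
                      cv=StratifiedKFold(5, shuffle=True, random_state=0))
search.fit(X_tr, y_tr)
print(search.best_params_, search.score(X_te, y_te))   # PR-AUC on the untouched test set
```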
For very large tabular datasets where the cost of nearest-neighbor search becomes prohibitive, approximate nearest-neighbor libraries such as Annoy, FAISS, and HNSW can be used to accelerate the SMOTE inner loop. Some implementations also support GPU-accelerated nearest-neighbor search through RAPIDS cuML, which can make SMOTE practical on datasets with millions of minority samples.
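A sketch of what that acceleration can look like, assuming FAISS is installed; `smote_with_faiss` is an illustrative name, not a library API, and an exact `IndexFlatL2` index is used here for simplicity.

```python
import numpy as np
import faiss  # fast (and optionally approximate or GPU-backed) nearest-neighbor search

def smote_with_faiss(X_min, n_synthetic, k=5, random_state=0):
    """Sketch of the SMOTE inner loop with FAISS doing the neighbor search.
    IndexFlatL2 is exact; an approximate index (e.g. IndexHNSWFlat) can be
    swapped in for very large minority classes. Illustrative only."""
    rng = np.random.default_rng(random_state)
    X32 = np.ascontiguousarray(X_min, dtype="float32")
    index = faiss.IndexFlatL2(X32.shape[1])
    index.add(X32)
    _, neighbor_idx = index.search(X32, k + 1)   # +1 because each point matches itself

    seeds = rng.integers(len(X32), size=n_synthetic)
    picks = neighbor_idx[seeds, rng.integers(1, k + 1, size=n_synthetic)]
    u = rng.uniform(size=(n_synthetic, 1)).astype("float32")
    return X32[seeds] + u * (X32[picks] - X32[seeds])
```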
SMOTE has been applied across an enormous range of domains. The most common application areas include the following.
| Domain | Typical imbalance ratio | Why SMOTE is used | Notes |
|---|---|---|---|
| Credit card and payment fraud detection | 1:100 to 1:10000 | Surface rare fraud patterns the classifier would otherwise ignore | Often combined with ensemble methods such as XGBoost; recent work questions the necessity |
| Medical diagnosis of rare diseases | 1:50 to 1:1000 | Improve recall on rare conditions in screening pipelines | Calibration is critical for clinical use; SMOTE can hurt this |
| Customer churn prediction | 1:5 to 1:20 | Improve identification of customers likely to leave | Hybrid SMOTE+ENN reported strong results in telecoms studies |
| Software defect prediction | 1:5 to 1:50 | Identify rare buggy modules before release | Long line of NASA defect dataset benchmarks using SMOTE |
| Network intrusion detection | 1:100 to 1:10000 | Detect rare attack patterns in flow logs | Often paired with feature selection and anomaly detection |
| Manufacturing defect detection | 1:50 to 1:1000 | Find rare defective units in QA inspection | Vision-based variants increasingly use DeepSMOTE-style methods |
| Insurance claim fraud | 1:20 to 1:100 | Spot fraudulent claims in mostly legitimate volumes | Often combined with categorical features via SMOTE-NC |
| Gene and protein function prediction | 1:10 to 1:1000 | Predict rare functional categories from sequence features | Classic biomedical informatics application of SMOTE |
| Bankruptcy prediction | 1:10 to 1:100 | Identify firms likely to default from financial ratios | Often benchmarked against random oversampling and ADASYN |
In each of these domains, SMOTE rarely operates alone. A typical production pipeline includes data cleaning, feature engineering, optional dimensionality reduction, resampling (SMOTE or a variant), classifier training, and a calibration step. The choice of variant often depends on the structure of the features (mixed numeric and categorical favors SMOTE-NC, dense numeric favors plain SMOTE) and on the position of the minority class in feature space (well-separated favors plain SMOTE, near-boundary favors Borderline-SMOTE or ADASYN).
Several practical pitfalls have been documented across the SMOTE literature, and avoiding them often matters more than picking the right variant.
The first pitfall is applying SMOTE before train-test split. This leaks information from the test set into the training set through the synthetic samples and inflates apparent test performance. SMOTE must be applied only to the training fold, never to the test fold or to the data before the split.
The second pitfall is failing to scale features. SMOTE uses Euclidean distance for nearest-neighbor search, and unscaled features with very different ranges will dominate the distance computation. Standard scaling or min-max normalization should always precede SMOTE on numeric features.
The third pitfall is ignoring categorical features. Plain SMOTE produces meaningless interpolations on encoded categoricals, especially one-hot encoded ones, where the synthetic point may have nonsensical fractional category indicators. SMOTE-NC or a categorical encoder applied after SMOTE is required for mixed-type data.
The fourth pitfall is oversampling too aggressively. Pushing the minority class to full balance with the majority class often introduces more noise than signal, especially when the imbalance ratio is extreme. Many practitioners report that partial oversampling (for instance, to a 1:3 ratio rather than 1:1) yields better results.
The fifth pitfall is trusting accuracy. Once the training distribution has been rebalanced, accuracy is a poor metric because the model has been optimized on a distribution that does not match deployment. Evaluate on the original imbalanced test set using AUC, PR-AUC, F1, or expected cost.
The sixth pitfall is forgetting calibration. SMOTE shifts the predicted probabilities toward the rebalanced distribution rather than the true deployment distribution. Calibration techniques such as Platt scaling, isotonic regression, or temperature scaling can help, although the underlying signal loss from SMOTE-induced distortion is not fully recoverable.
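One way to combine SMOTE with a calibration step is to wrap the resampling pipeline in scikit-learn's `CalibratedClassifierCV`, so that each calibration fold keeps the original class ratio; the sketch below is illustrative, and the isotonic method and fold count are arbitrary choices.

```python
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=4000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

smote_pipe = Pipeline([("smote", SMOTE(random_state=0)),
                       ("clf", RandomForestClassifier(random_state=0))])

# Within each CV fold the sampler runs only on the fit portion, so the
# calibration portion keeps the original (deployment-like) class ratio.
calibrated = CalibratedClassifierCV(smote_pipe, method="isotonic", cv=5).fit(X_tr, y_tr)
frac_pos, mean_pred = calibration_curve(y_te, calibrated.predict_proba(X_te)[:, 1], n_bins=10)
```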
The 2002 SMOTE paper has accumulated tens of thousands of citations and remains one of the most cited algorithms in applied machine learning. It helped formalize class imbalance as a recognized subfield with its own journals, workshops, and benchmarks, and it provided a baseline against which every new imbalanced learning method is measured. The Chawla group at Notre Dame has continued to publish in the area, including a 2018 follow-up paper SMOTE for Learning from Imbalanced Data: Progress and Challenges, Marking the 15-year Anniversary that surveyed the field's evolution.
Research in imbalanced learning continues along several active lines. The first line is better synthetic generation: replacing linear interpolation with manifold-aware methods such as DeepSMOTE [5], variational autoencoder-based generators, and diffusion model-based augmentation. The second line is better integration with strong classifiers: studying which combinations of resampling, reweighting, and threshold adjustment work for which classes of models, including the questions raised by Elor and Averbuch-Elor [2]. The third line is long-tailed deep learning: handling the case of many classes with widely varying frequency, where techniques like decoupled training, logit adjustment, and balanced softmax have largely supplanted SMOTE-style methods. The fourth line is fairness and bias: studying how resampling interacts with subgroup performance, since rebalancing one variable can worsen imbalance on another.
A fifth and increasingly important line is the privacy of synthetic data. The 2025 work on SMOTE membership inference attacks demonstrated that synthetic samples can leak information about individual training points, raising questions about the use of SMOTE in healthcare, financial services, and other regulated domains. Differential privacy-aware variants of SMOTE are an active research direction.
SMOTE remains a fixture of the applied machine learning toolbox more than two decades after its publication. Its central idea, that synthetic minority samples can be manufactured by interpolating between existing ones, is intuitive, easy to implement, and often improves the performance of weaker classifiers on imbalanced datasets. The SMOTE family of variants (Borderline-SMOTE, ADASYN, SMOTE-NC, SVM-SMOTE, KMeans-SMOTE, DeepSMOTE) addresses many of the original method's known weaknesses. At the same time, recent empirical work has cast doubt on the necessity and value of SMOTE for modern strong classifiers like XGBoost, gradient-boosted trees, and well-tuned deep networks, where calibration-preserving methods such as class weighting, focal loss, and threshold moving often perform as well or better. The right approach to class imbalance depends on the classifier, the metric, the structure of the data, and the downstream use case, and SMOTE should be treated as one option in a portfolio rather than as a default.