SMOTE, short for Synthetic Minority Over-sampling Technique, is a data preprocessing algorithm designed to address class imbalance in supervised classification problems. It was introduced by Nitesh V. Chawla, Kevin W. Bowyer, Lawrence O. Hall, and W. Philip Kegelmeyer in a 2002 paper published in the Journal of Artificial Intelligence Research [1]. The core idea is simple: rather than duplicating existing minority class samples (as random oversampling does) or discarding majority class samples (as random undersampling does), SMOTE manufactures plausible new minority class examples by interpolating between existing ones along line segments to their nearest neighbors of the same class. The result is a more balanced training set that nominally retains the geometric structure of the minority class without exactly repeating any sample.
Since its publication, SMOTE has become one of the most cited algorithms in machine learning, accumulating tens of thousands of citations across statistics, biomedical informatics, finance, software engineering, and applied data science. It is the default oversampling baseline in nearly every empirical study of imbalanced learning and the headline method in the widely used imbalanced-learn Python library. SMOTE has also spawned a large family of variants such as Borderline-SMOTE, ADASYN, SMOTE-NC, SVM-SMOTE, KMeans-SMOTE, and DeepSMOTE, each tweaking either where in feature space the synthetic points are placed or how many are drawn from each minority cluster. Despite its popularity, the practical value of SMOTE has been increasingly questioned by recent empirical work, most prominently the 2022 study To SMOTE, or not to SMOTE? by Yotam Elor and Hadar Averbuch-Elor, which argues that strong modern classifiers like XGBoost and well-tuned deep learning models gain little or nothing from SMOTE preprocessing and often lose calibration when it is applied [2].
Many real-world classification tasks involve datasets where one class vastly outnumbers another. In credit card fraud detection, genuine transactions can outnumber fraudulent ones by 1,000 to 1 or more. In medical diagnosis for rare diseases, the prevalence of a positive case may be a fraction of one percent. In customer churn prediction, the share of customers who actually leave in a given quarter is typically far below the share who stay. In manufacturing defect detection, software bug prediction, network intrusion detection, and clinical adverse-event modeling, the pattern repeats: the interesting class is rare, and a naive classifier that simply predicts the majority label can achieve very high accuracy while being utterly useless.
The problem is not just one of metrics. Standard supervised learning algorithms minimize an aggregate loss over the training set, and when one class dominates, the loss landscape is shaped almost entirely by the majority class. Decision trees may produce splits that never isolate the minority class because the resulting subtrees would be too small. Logistic regression may learn a decision boundary that sits far from the minority cluster because moving it would barely change the average loss. K-nearest neighbor classifiers may consistently predict the majority class even within minority neighborhoods because the local sample is dominated by majority points. Imbalanced learning is therefore not just a metric-selection problem but a representation and optimization problem that affects what the model can plausibly learn.
The traditional remedies fall into three families. The first family is resampling: changing the training distribution by adding minority samples (oversampling) or removing majority samples (undersampling). The second family is cost-sensitive learning: keeping the data unchanged but assigning higher misclassification costs to the minority class through class weights, sample weights, or modified loss functions. The third family is post-hoc threshold adjustment: training a calibrated probabilistic model on the original distribution and then choosing a decision threshold that optimizes the operating characteristic the application cares about. SMOTE belongs squarely to the resampling family and was introduced to give it a more principled oversampling option than mere replication.
The SMOTE procedure operates on a single class at a time, typically the minority class in a binary problem. It assumes that features are numeric and that a Euclidean distance metric is meaningful in the feature space. The algorithm takes three inputs: the set of minority class samples, the desired oversampling amount expressed as a percentage or a target number of synthetic samples, and the number of neighbors k used when sampling (Chawla et al. used k = 5 in the original paper [1]).
The steps are as follows. For each minority sample x_i, compute its k nearest neighbors among the other minority samples using Euclidean distance. To create a single synthetic sample, choose one of those k neighbors x_nn uniformly at random, draw a random scalar u uniformly from the interval [0, 1], and define the new point as x_new = x_i + u * (x_nn - x_i). Geometrically, x_new lies somewhere along the line segment connecting x_i and x_nn. The number of synthetic samples generated per original minority sample depends on the desired oversampling ratio. If the goal is to triple the size of the minority class, the algorithm generates two synthetic samples per original sample; if the goal is full balance with a much larger majority class, many more synthetic samples per original sample are produced.
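To make the interpolation step concrete, here is a minimal sketch in numpy and scikit-learn. The function name `smote_sample` and the toy data are illustrative assumptions; this is not the imbalanced-learn implementation, which adds input validation, sampling strategies, and more careful neighbor handling.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sample(X_min, n_synthetic, k=5, random_state=0):
    """Minimal SMOTE sketch: interpolate between minority points and
    their k nearest minority neighbors. Illustrative only."""
    rng = np.random.default_rng(random_state)
    # k + 1 neighbors because each point is its own nearest neighbor
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, neighbor_idx = nn.kneighbors(X_min)

    synthetic = np.empty((n_synthetic, X_min.shape[1]))
    for j in range(n_synthetic):
        i = rng.integers(len(X_min))              # pick a seed minority point
        nn_i = rng.choice(neighbor_idx[i][1:])    # pick one of its k neighbors (skip itself)
        u = rng.uniform()                         # interpolation coefficient in [0, 1]
        synthetic[j] = X_min[i] + u * (X_min[nn_i] - X_min[i])
    return synthetic

# Example: 20 minority points in 2-D, generate 40 synthetic ones
X_min = np.random.default_rng(1).normal(size=(20, 2))
X_new = smote_sample(X_min, n_synthetic=40)
```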
The original 2002 paper combined SMOTE-based oversampling with random undersampling of the majority class, and it framed the contribution as a joint resampling strategy rather than as oversampling alone. Modern usage, however, treats SMOTE as a standalone oversampling step that is often applied without any majority undersampling. The same paper benchmarked the method against C4.5 decision trees, RIPPER rule induction, and naive Bayes, evaluating performance with the area under the ROC curve (AUC) rather than raw accuracy [1]. The reported gains in AUC over alternative resampling baselines were the empirical foundation that established SMOTE as a standard tool.
The geometric intuition behind SMOTE is that the convex hull of the minority class is roughly the right region in which to add synthetic points, since points near the segment between two same-class examples are likely to also belong to that class. This intuition is exact when the minority class is convex and well separated from the majority, but it can mislead in two important regimes. When minority and majority classes overlap in feature space, the segment between two minority points can pass through majority territory, and synthetic points on that segment may end up inside what should be majority class regions. When minority samples are noisy or mislabeled, interpolating with their neighbors propagates that noise rather than removing it. These failure modes have motivated the long line of SMOTE variants discussed below.
A more recent line of work has analyzed SMOTE through the lens of probability distributions rather than geometry. The synthetic points produced by SMOTE do not form a smooth multivariate distribution; instead they are supported on the union of one-dimensional line segments connecting pairs of original minority samples. The marginal distribution of each feature is a mixture of these segments, and certain regions of the convex hull are systematically under-covered. This finding helps explain why SMOTE often appears to fit the training data tightly but generalizes only modestly: the synthetic points lie on a measure-zero set within feature space, which is a peculiar prior to impose on the minority class.
A related theoretical observation is that SMOTE shrinks the variance of the minority class along directions transverse to the line segments while preserving variance along the segments themselves. For minority classes whose true distribution has nontrivial spread in all directions, this anisotropic shrinkage biases any classifier trained on the augmented data. The bias is small when the minority class is approximately one-dimensional but can be significant when the class is genuinely high-dimensional. This is one reason why SMOTE often performs better on tabular data with strong feature correlations than on dense high-dimensional data such as images or embeddings.
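The shrinkage can be illustrated with a short simulation under the simplifying assumption that the seed and its interpolation partner are independent draws from the same minority distribution (with real nearest neighbors the partner is correlated with the seed, which is what preserves variance along the segment). Under that assumption the covariance of the synthetic points contracts to roughly two thirds of the original covariance:

```python
import numpy as np

rng = np.random.default_rng(0)
# Simplified setting: seed and partner drawn i.i.d. from the same Gaussian
Sigma = np.array([[1.0, 0.6], [0.6, 1.0]])
X = rng.multivariate_normal(mean=[0, 0], cov=Sigma, size=200_000)
seeds, partners = X[::2], X[1::2]            # pair points up as (seed, partner)
u = rng.uniform(size=(len(seeds), 1))
synthetic = seeds + u * (partners - seeds)   # SMOTE interpolation rule

# Each entry is roughly 2/3: the synthetic cloud is a shrunken copy of the original
print(np.cov(synthetic, rowvar=False) / Sigma)
```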
A third concern is leakage and privacy. Because each synthetic sample is a deterministic interpolation of two real minority samples plus a random scalar, the synthetic points carry information about specific individuals in the training data. A 2025 line of work showed that SMOTE-augmented datasets can leak membership information and even allow approximate reconstruction of original minority points, which has implications when the minority class corresponds to sensitive medical or financial cases. Practitioners working in regulated domains should treat SMOTE-generated data with the same care as the underlying real data and should not assume that the synthetic samples are anonymous.
The SMOTE family has grown into dozens of named variants, each adapting the basic interpolation idea to a particular failure mode of the original algorithm. The table below summarizes the most widely used members of the family.
| Variant | Year | Key idea | Best suited for |
|---|---|---|---|
| SMOTE | 2002 | Random interpolation between minority sample and one of its k nearest minority neighbors [1] | Numeric, well-separated classes |
| Borderline-SMOTE1 / Borderline-SMOTE2 | 2005 | Oversample only minority samples whose neighborhood contains many majority points (the borderline) [3] | Overlapping classes near decision boundary |
| Safe-Level SMOTE | 2009 | Weight interpolation by a safety score derived from the neighborhood composition | Mixed safe and noisy minority regions |
| ADASYN (Adaptive Synthetic Sampling) | 2008 | Generate more synthetic samples for harder-to-learn minority points, fewer for easy ones [4] | Adaptive emphasis on hard examples |
| SMOTE-NC | 2002 (in original paper) | Handle mixed numeric and categorical features by interpolating numerics and majority-voting categoricals | Tabular data with categorical attributes |
| SMOTE-N | 2002 | Extension for purely nominal feature spaces using a value difference metric | Categorical-only datasets |
| SVM-SMOTE | 2009 | Use support vectors of an SVM trained on the original data to identify borderline minority points | Borderline cases via SVM geometry |
| KMeans-SMOTE | 2017 | Cluster the data with KMeans, then apply SMOTE only within minority-rich clusters | Multi-modal minority distributions |
| Cluster-SMOTE | 2006 | Cluster minority class first, then apply SMOTE within each cluster | Avoiding interpolation across modes |
| Geometric SMOTE | 2019 | Generate samples within a geometric region (such as a hyper-ellipsoid) around each minority point rather than along line segments | Densifying full neighborhoods, not just segments |
| MWMOTE | 2014 | Majority-Weighted Minority Oversampling that weights samples by informativeness | Hard-to-learn minority subregions |
| ROSE | 2014 | Random over-sampling examples via smoothed bootstrap sampling | Smoother synthetic distributions |
| DeepSMOTE | 2021 | Apply SMOTE in the latent space of an encoder-decoder network rather than raw pixel space | Image data and high-dimensional inputs [5] |
| SMOTified-GAN | 2021 | Run SMOTE first to produce candidate minority samples, then refine them with a GAN | Tabular data when GANs alone underfit minority class |
Introduced by Hui Han, Wen-Yuan Wang, and Bing-Huan Mao in 2005, Borderline-SMOTE recognized that not all minority samples are equally useful for training a classifier [3]. Samples deep inside the minority cluster contribute little to the decision boundary, while samples near the boundary between the minority and majority classes carry most of the discriminative information. The Borderline-SMOTE algorithm therefore identifies the DANGER set: minority samples for which at least half, but not all, of the k nearest neighbors (computed over the entire training set, not just the minority class) belong to the majority class. Only these borderline samples are used as seeds for synthetic generation. The variant Borderline-SMOTE2 additionally interpolates between borderline minority samples and their majority neighbors, with the interpolation parameter constrained to keep the synthetic point closer to the minority side.
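A sketch of the DANGER-set identification using scikit-learn's neighbor search; the function name `danger_set` is illustrative and follows the description above rather than the imbalanced-learn internals. In imbalanced-learn, the two variants are exposed as `BorderlineSMOTE(kind="borderline-1")` and `kind="borderline-2"`.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def danger_set(X, y, minority_label, k=5):
    """Sketch of the Borderline-SMOTE DANGER step: flag minority points whose
    neighborhood (over the whole training set) is at least half, but not
    entirely, majority. Illustrative, not the imbalanced-learn implementation."""
    X_min = X[y == minority_label]
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X_min)
    # count majority neighbors, skipping each point's self-match
    maj_counts = np.array([(y[row[1:]] != minority_label).sum() for row in idx])
    is_danger = (maj_counts >= k / 2) & (maj_counts < k)  # at least half, but not all
    return X_min[is_danger]
```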
The Adaptive Synthetic Sampling approach (ADASYN), introduced by Haibo He and colleagues in 2008, generalizes Borderline-SMOTE by computing a continuous difficulty score for each minority sample rather than a binary borderline indicator [4]. The difficulty score is the fraction of majority neighbors among the k nearest neighbors, so a minority sample surrounded entirely by majority points has a difficulty of 1.0 and a sample surrounded entirely by other minority points has a difficulty of 0.0. The number of synthetic samples generated per original minority point is proportional to its difficulty score, normalized so that the total reaches the desired class balance. The motivation is that hard examples deserve more synthetic neighbors to help the classifier carve out the minority region near them, while easy examples already have plenty of representation. Empirical comparisons typically rank ADASYN slightly above plain SMOTE on benchmark datasets when paired with decision tree classifiers, although the advantage often disappears for stronger classifiers.
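The allocation step can be sketched as follows; `adasyn_allocation` is an illustrative name, and imbalanced-learn's `ADASYN` class implements the full procedure, including the synthesis step.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def adasyn_allocation(X, y, minority_label, n_to_generate, k=5):
    """Sketch of ADASYN's allocation step: each minority point receives a share of
    the synthetic budget proportional to how many of its k neighbors are majority."""
    X_min = X[y == minority_label]
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X_min)
    # difficulty score in [0, 1]: fraction of majority points among the k neighbors
    r = np.array([(y[row[1:]] != minority_label).mean() for row in idx])
    if r.sum() == 0:
        return np.zeros(len(X_min), dtype=int)   # no hard points: nothing to emphasize
    weights = r / r.sum()
    return np.round(weights * n_to_generate).astype(int)  # synthetic samples per seed
```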
Real tabular datasets often mix numeric and categorical features, and standard SMOTE breaks down on categorical attributes because there is no meaningful interpolation between, say, the categories red, green, and blue. SMOTE-NC (Nominal and Continuous) handles this by treating each feature type differently. For numeric features, it interpolates as in standard SMOTE. For categorical features, the synthetic sample takes the most frequent value among the k nearest neighbors of the seed minority point. The distance metric used to find neighbors is itself adapted: continuous features contribute the usual Euclidean component, while categorical mismatches contribute a fixed penalty equal to the median standard deviation of the continuous features in the minority class. SMOTE-N is a further variant for purely categorical datasets that uses a Value Difference Metric (VDM) for both neighbor search and synthetic value selection. Both variants are implemented in imbalanced-learn.
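A minimal usage sketch with imbalanced-learn's `SMOTENC`; the toy matrix and the choice of which columns are categorical are illustrative assumptions.

```python
import numpy as np
from imblearn.over_sampling import SMOTENC

# Toy mixed-type matrix: columns 0 and 2 are categorical (label-encoded),
# column 1 is numeric. Data and column roles are purely illustrative.
X = np.array([[0, 1.2, 2], [1, 0.7, 0], [0, 1.9, 1], [2, 0.1, 2],
              [1, 1.5, 0], [0, 0.3, 1], [2, 2.2, 2], [1, 0.9, 0],
              [0, 2.5, 1], [2, 1.1, 2], [1, 0.4, 0], [0, 1.7, 1]])
y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1])

sm = SMOTENC(categorical_features=[0, 2], k_neighbors=3, random_state=0)
X_res, y_res = sm.fit_resample(X, y)
# Column 1 is interpolated; columns 0 and 2 keep valid category values
# (the most frequent value among the seed's nearest neighbors).
```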
SVM-SMOTE replaces the k-nearest-neighbor borderline detection of Borderline-SMOTE with the support vectors of a Support Vector Machine trained on the original imbalanced data. The intuition is that SVM support vectors lie precisely on or near the decision boundary, so they are a more principled way to identify borderline minority points than counting majority neighbors. Synthetic samples are generated along segments connecting each minority support vector to its same-class neighbors, with extrapolation allowed in regions where the minority class is sparse. KMeans-SMOTE first partitions the data with KMeans clustering, then evaluates each cluster's imbalance ratio and minority density. SMOTE is applied within clusters that are minority-rich and reasonably dense, in proportion to how sparse the minority class is locally. The cluster-aware approach helps prevent interpolation across disjoint minority modes, which can otherwise create synthetic points in genuinely majority territory between two minority subclusters.
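Both samplers are available in imbalanced-learn under the usual `fit_resample` interface. A short usage sketch with synthetic data; note that `KMeansSMOTE` typically needs its clustering parameters tuned before it finds usable minority-rich clusters, so it is shown commented out here.

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SVMSMOTE, KMeansSMOTE

# Synthetic imbalanced data, roughly 5% positives (illustrative)
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05],
                           n_informative=4, random_state=0)

X_svm, y_svm = SVMSMOTE(random_state=0).fit_resample(X, y)
print(Counter(y), Counter(y_svm))

# KMeansSMOTE follows the same interface but usually needs its clustering
# parameters (kmeans_estimator, cluster_balance_threshold) adapted to the data:
# X_km, y_km = KMeansSMOTE(random_state=0).fit_resample(X, y)
```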
Classical SMOTE struggles on high-dimensional data such as images, where Euclidean interpolation between two pixel arrays produces nonsensical blurred composites rather than realistic new images. DeepSMOTE, introduced by Damien Dablain, Bartosz Krawczyk, and Nitesh Chawla in 2021, addresses this by training an encoder-decoder network on the original data and then applying SMOTE in the learned latent space rather than the raw input space [5]. The interpolated latent codes are decoded back into the input space to produce synthetic samples that lie on the data manifold. SMOTified-GAN takes a complementary approach: SMOTE is run first to produce candidate synthetic samples, then those candidates are passed through a Generative Adversarial Network's discriminator-generator loop to refine them into more realistic distributions. Both lines of work try to combine SMOTE's local interpolation principle with the manifold-learning capabilities of modern neural networks.
SMOTE is one option among many for handling class imbalance, and it is rarely the only sensible choice. The table below contrasts the major families of methods.
| Method | Family | Mechanism | Pros | Cons |
|---|---|---|---|---|
| Random oversampling | Resampling | Duplicate minority samples until balanced | Simple, no new points to validate | Pure overfitting risk; trees memorize duplicates |
| Random undersampling | Resampling | Drop majority samples until balanced | Smaller training set; trains faster | Throws away potentially useful information |
| SMOTE | Resampling | Interpolate new minority samples between neighbors [1] | Adds variety to minority class; widely supported | Can place points in majority regions; weakens calibration |
| Borderline-SMOTE | Resampling | Interpolate only near the boundary [3] | Focuses synthetic mass where it matters | Sensitive to k and to noise on the boundary |
| ADASYN | Resampling | Adaptive synthesis weighted by local hardness [4] | Emphasizes hard cases automatically | Can over-generate near outliers and noise |
| Tomek links | Resampling | Remove majority samples that form Tomek pairs with minority samples | Cleans the boundary | Removes only a small fraction of majority points |
| Edited Nearest Neighbors (ENN) | Resampling | Drop samples whose class disagrees with their neighbor majority | Removes noise from both classes | Aggressive; can remove valid minority points |
| NearMiss-1, 2, 3 | Resampling | Retain only the majority samples closest to the minority class; the three variants differ in how closeness is computed | Targeted majority reduction | Highly sensitive to k and feature scaling |
| SMOTE + Tomek / SMOTE + ENN | Hybrid | Apply SMOTE then clean with Tomek or ENN | Adds variety then prunes noise | More hyperparameters; longer pipeline |
| Class weights | Cost-sensitive | Multiply per-class loss in the objective | No data manipulation; preserves calibration well | Effectiveness depends on optimizer and loss landscape |
| Focal loss | Cost-sensitive | Down-weight easy examples to focus on hard cases [6] | Strong on dense detection; works for deep learning | Needs tuning of focusing parameter gamma |
| Class-balanced loss | Cost-sensitive | Reweight by effective number of samples per class | Principled handling of long-tailed data | Adds a tunable hyperparameter |
| Threshold moving | Post-hoc | Train on original distribution, choose decision threshold to optimize a target metric | Preserves probability calibration | Requires a probabilistic classifier and a held-out set |
Resampling (oversampling, undersampling, SMOTE) and reweighting (class weights, sample weights, focal loss) are mathematically related but operationally different. For empirical risk minimization with a separable loss, applying class weight w_c to class c is equivalent to oversampling class c by a factor of w_c in expectation, since both modify the per-class contribution to the gradient by the same multiplicative factor. The two diverge when the loss is not strictly separable, when training proceeds in mini-batches rather than full passes over the data, or when the optimizer has memory (as with momentum or adaptive methods). In practice, class weighting is often preferred because it preserves the data exactly, requires no preprocessing pipeline, and keeps probability estimates better calibrated.
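A sketch of the two routes with scikit-learn; the dataset is synthetic, the oversampling factor is rounded to an integer, and the comparison is only approximate.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)

# Route 1: reweight the loss so each class contributes equally in expectation.
clf_weighted = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)

# Route 2: replicate minority rows by (approximately) the same factor.
w = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)
factor = int(round(w[1] / w[0]))                 # ratio of minority weight to majority weight
reps = np.where(y == 1, factor, 1)
clf_oversampled = LogisticRegression(max_iter=1000).fit(np.repeat(X, reps, axis=0),
                                                        np.repeat(y, reps))
# The two models make similar predictions; exact equivalence holds only in
# expectation, for a separable loss, and up to an overall scale that interacts
# with the regularization term.
```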
Threshold moving is a complementary technique that often suffices on its own. The idea is to train a probabilistic classifier on the original imbalanced distribution, accept the resulting Bayes-optimal probabilities, and then pick a decision threshold (such as 0.1 instead of 0.5 for a rare positive class) to maximize a target operating characteristic such as F1, recall at fixed precision, or expected cost. When the goal is a binary decision and the classifier outputs well-calibrated probabilities, threshold moving achieves much of what resampling tries to accomplish without distorting the probabilities themselves. This is particularly important in domains like medical decision support and credit scoring where calibrated probabilities are required downstream.
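A minimal threshold-moving sketch with scikit-learn; the choice of F1 as the target metric and of `HistGradientBoostingClassifier` as the model are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, stratify=y, random_state=0)

# Train on the original imbalanced distribution; no resampling.
clf = HistGradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
proba = clf.predict_proba(X_val)[:, 1]

# Pick the decision threshold that maximizes F1 on a held-out set.
prec, rec, thresholds = precision_recall_curve(y_val, proba)
f1 = 2 * prec * rec / np.clip(prec + rec, 1e-12, None)
best = thresholds[np.argmax(f1[:-1])]   # the last prec/rec point has no threshold
y_pred = (proba >= best).astype(int)
```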
In deep learning, SMOTE is rarely the first method tried. The dominant alternatives are loss-function modifications and curriculum strategies. The table below summarizes the most widely used approaches.
| Approach | Year | Description | Typical use case |
|---|---|---|---|
| Focal loss | 2017 | Multiplies cross-entropy by (1 - p_t)^gamma to down-weight easy examples [6] | Object detection, medical image segmentation |
| Class-balanced loss | 2019 | Reweights by 1 / effective_num(c) where effective_num is a function of class size | Long-tailed image classification |
| Logit adjustment | 2021 | Adds a class-prior offset to the logits during training and removes it at inference | Long-tailed classification with strong theoretical grounding |
| Decoupled training | 2020 | Train representation on original data, then retrain only the classifier head on rebalanced data | Long-tailed image and text classification |
| Self-supervised pretraining | 2020 onward | Pretrain on the full unlabeled dataset to learn balanced representations, then fine-tune | Any task with abundant unlabeled data |
| Mixup and CutMix | 2018 / 2019 | Augment with linear or patch-wise interpolations between training samples | General regularization with side benefit on imbalance |
| Synthetic data generation with diffusion or LLMs | 2023 onward | Generate realistic minority samples from generative models | Text and image classification with text and vision foundation models |
Focal loss, introduced by Tsung-Yi Lin and colleagues in 2017 for dense object detection [6], modifies the cross-entropy loss with a factor (1 minus the predicted probability of the true class) raised to a focusing power gamma. This down-weights examples that are already classified well and concentrates gradient signal on hard, often minority, examples. Focal loss is now standard in object detection (originally for RetinaNet) and has been widely adopted in medical image segmentation, where pixel-level class imbalance is severe. It does not require resampling and preserves probability calibration better than SMOTE, although it introduces a new tunable parameter.
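A minimal numpy sketch of the binary, alpha-balanced form of the loss; deep learning frameworks provide batched, autograd-ready versions, and the toy inputs here are illustrative.

```python
import numpy as np

def binary_focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss as in Lin et al. [6]: FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t),
    where p are predicted positive-class probabilities and y are 0/1 labels."""
    p = np.clip(p, 1e-7, 1 - 1e-7)
    p_t = np.where(y == 1, p, 1 - p)              # probability assigned to the true class
    alpha_t = np.where(y == 1, alpha, 1 - alpha)  # optional class-balancing factor
    return -np.mean(alpha_t * (1 - p_t) ** gamma * np.log(p_t))

# gamma = 0 recovers (alpha-weighted) cross-entropy; larger gamma focuses on hard examples
y = np.array([1, 0, 0, 0, 1])
p = np.array([0.9, 0.1, 0.4, 0.05, 0.3])
print(binary_focal_loss(p, y, gamma=2.0))
```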
Class-balanced loss, introduced by Yin Cui and colleagues in 2019, reweights each class by an effective number of samples that diminishes returns on additional samples in already-large classes. Logit adjustment shifts the logits by the log of the class prior at training time, with theoretical guarantees of Bayes-consistency for balanced error. Decoupled training observes that representation learning and classifier learning have different sensitivities to class imbalance: representations are largely robust to imbalance and can be learned on the natural distribution, while the final classifier head benefits from being retrained on a rebalanced distribution. Self-supervised pretraining exploits the full unlabeled dataset and is increasingly the dominant approach for any imbalanced task that has access to large unlabeled corpora.
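Both reweighting schemes reduce to a few lines. The sketch below follows the published formulas, but the normalization convention and the toy class counts are illustrative choices.

```python
import numpy as np

def class_balanced_weights(counts, beta=0.999):
    """Class-balanced loss weights (Cui et al. 2019): weight each class by the inverse
    of its 'effective number' of samples, E_n = (1 - beta^n) / (1 - beta)."""
    effective_num = (1.0 - np.power(beta, counts)) / (1.0 - beta)
    weights = 1.0 / effective_num
    return weights / weights.sum() * len(counts)   # normalize so weights average to 1

def logit_adjustment(logits, class_priors, tau=1.0):
    """Additive logit adjustment (Menon et al. 2021): subtract tau * log(prior) from the
    logits at inference (or, equivalently, add it to the logits during training)."""
    return logits - tau * np.log(class_priors)

counts = np.array([9000, 900, 100])                # long-tailed class counts (illustrative)
print(class_balanced_weights(counts))
print(logit_adjustment(np.array([2.0, 1.5, 1.0]), counts / counts.sum()))
```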
The most influential recent critique of SMOTE is the 2022 paper To SMOTE, or not to SMOTE? by Yotam Elor and Hadar Averbuch-Elor, which conducted a large-scale empirical study spanning 73 datasets, multiple oversampling techniques, and a range of classifiers from weak (decision trees, k-nearest neighbors, multilayer perceptron) to strong (XGBoost, LightGBM, CatBoost) [2]. The headline finding is that for strong modern classifiers, no oversampling method (including SMOTE, Borderline-SMOTE, ADASYN, and several others) provided improvements that exceeded one standard deviation in either rank or mean across datasets and metrics. For XGBoost in particular, oversampling failed to deliver meaningful gains regardless of the metric used, while it consistently degraded log loss and probability calibration.
The paper offers a coherent explanation. SMOTE and similar methods change the training distribution to make the classifier more sensitive to the minority class. With weak classifiers that have limited capacity and lack built-in mechanisms for handling skewed distributions, this can help. With strong classifiers that are flexible enough to fit the original imbalanced distribution well and are paired with appropriate scoring metrics, the resampling distorts the learned probabilities without improving discrimination. Threshold moving achieves the same operational effect (favoring the minority class in decisions) without distorting calibration.
The practical recommendations from the study are nuanced. When using a strong classifier and caring about a proper metric such as log loss or AUC, the best strategy is typically to use the original imbalanced data and tune the classifier directly, possibly combined with threshold moving for binary decisions. When using a weaker classifier (decision tree, multilayer perceptron, AdaBoost, naive Bayes), SMOTE-like techniques can still help. When the metric of interest is itself sensitive to class balance (such as accuracy on a balanced test set, or balanced accuracy), oversampling can improve scores even with strong classifiers, although this is closer to gaming the metric than to improving the underlying model.
This critique echoes earlier observations by other researchers. The 2009 IEEE TKDE survey by Haibo He and Edwardo Garcia surveyed the state of imbalanced learning and noted that the empirical advantage of SMOTE over simpler baselines was often modest and inconsistent, with results varying widely across datasets and classifiers [7]. Later meta-analyses found that SMOTE and its variants frequently failed to outperform a well-tuned random forest with appropriate class weights, and that the most reliable gains came from cleaning techniques (Tomek links, ENN) applied after resampling rather than from the resampling itself.
The canonical Python implementation is the imbalanced-learn library (imblearn), an extension of the scikit-learn ecosystem that provides SMOTE, Borderline-SMOTE (versions 1 and 2), ADASYN, SMOTE-NC, SMOTE-N, SVM-SMOTE, KMeans-SMOTE, and combinations such as SMOTE+Tomek and SMOTE+ENN. The library is maintained by the scikit-learn-contrib organization and follows the scikit-learn API conventions for fit and transform methods. R users can access SMOTE through the DMwR package and through the themis package, which integrates with the tidymodels ecosystem. Weka, the venerable Java machine learning workbench, includes SMOTE as a built-in filter under weka.filters.supervised.instance.SMOTE. SAS and SPSS both offer SMOTE-style preprocessing as part of their imbalanced classification pipelines.
A typical imblearn workflow first splits the data into training and test sets, then applies SMOTE only to the training set (applying it to the full dataset before splitting would leak synthetic samples into the test set and inflate evaluation metrics). The Pipeline class in imblearn supports composing a sampler with downstream transformers and an estimator so that the resampling is correctly wrapped inside cross-validation folds. Recommended practice is to tune both the classifier hyperparameters and the SMOTE parameters (most importantly k_neighbors, sampling_strategy, and the choice of variant) jointly using cross-validation, and to evaluate using a metric such as AUC or PR-AUC that is robust to the class imbalance in the test set.
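A representative sketch of that workflow; the dataset, grid values, and choice of RandomForestClassifier are illustrative, but the key point carries over to real projects: the sampler sits inside an imblearn Pipeline, so it runs only on the training portion of each cross-validation fold.

```python
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=3000, weights=[0.93, 0.07], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Scaling and SMOTE are fitted inside each fold; no synthetic points leak into validation data.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("smote", SMOTE(random_state=0)),
    ("clf", RandomForestClassifier(random_state=0)),
])
param_grid = {
    "smote__k_neighbors": [3, 5],
    "smote__sampling_strategy": [0.3, 0.5, 1.0],   # partial as well as full rebalancing
    "clf__n_estimators": [200],
}
search = GridSearchCV(pipe, param_grid, scoring="average_precision",
                      cv=StratifiedKFold(5, shuffle=True, random_state=0))
search.fit(X_tr, y_tr)
print(search.best_params_, search.score(X_te, y_te))   # PR-AUC on the untouched test set
```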
For very large tabular datasets where the cost of nearest-neighbor search becomes prohibitive, approximate nearest-neighbor libraries such as Annoy, FAISS, and HNSW can be used to accelerate the SMOTE inner loop. Some implementations also support GPU-accelerated nearest-neighbor search through RAPIDS cuML, which can make SMOTE practical on datasets with millions of minority samples.
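A sketch of what that acceleration can look like, assuming FAISS is installed; `smote_with_faiss` is an illustrative name, not a library API, and an exact `IndexFlatL2` index is used here for simplicity.

```python
import numpy as np
import faiss  # fast (and optionally approximate or GPU-backed) nearest-neighbor search

def smote_with_faiss(X_min, n_synthetic, k=5, random_state=0):
    """Sketch of the SMOTE inner loop with FAISS doing the neighbor search.
    IndexFlatL2 is exact; an approximate index (e.g. IndexHNSWFlat) can be
    swapped in for very large minority classes. Illustrative only."""
    rng = np.random.default_rng(random_state)
    X32 = np.ascontiguousarray(X_min, dtype="float32")
    index = faiss.IndexFlatL2(X32.shape[1])
    index.add(X32)
    _, neighbor_idx = index.search(X32, k + 1)   # +1 because each point matches itself

    seeds = rng.integers(len(X32), size=n_synthetic)
    picks = neighbor_idx[seeds, rng.integers(1, k + 1, size=n_synthetic)]
    u = rng.uniform(size=(n_synthetic, 1)).astype("float32")
    return X32[seeds] + u * (X32[picks] - X32[seeds])
```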
SMOTE has been applied across an enormous range of domains. The most common application areas include the following.
| Domain | Typical imbalance ratio | Why SMOTE is used | Notes |
|---|---|---|---|
| Credit card and payment fraud detection | 1:100 to 1:10000 | Surface rare fraud patterns the classifier would otherwise ignore | Often combined with ensemble methods such as XGBoost; recent work questions the necessity |
| Medical diagnosis of rare diseases | 1:50 to 1:1000 | Improve recall on rare conditions in screening pipelines | Calibration is critical for clinical use; SMOTE can hurt this |
| Customer churn prediction | 1:5 to 1:20 | Improve identification of customers likely to leave | Hybrid SMOTE+ENN reported strong results in telecoms studies |
| Software defect prediction | 1:5 to 1:50 | Identify rare buggy modules before release | Long line of NASA defect dataset benchmarks using SMOTE |
| Network intrusion detection | 1:100 to 1:10000 | Detect rare attack patterns in flow logs | Often paired with feature selection and anomaly detection |
| Manufacturing defect detection | 1:50 to 1:1000 | Find rare defective units in QA inspection | Vision-based variants increasingly use DeepSMOTE-style methods |
| Insurance claim fraud | 1:20 to 1:100 | Spot fraudulent claims in mostly legitimate volumes | Often combined with categorical features via SMOTE-NC |
| Gene and protein function prediction | 1:10 to 1:1000 | Predict rare functional categories from sequence features | Classic biomedical informatics application of SMOTE |
| Bankruptcy prediction | 1:10 to 1:100 | Identify firms likely to default from financial ratios | Often benchmarked against random oversampling and ADASYN |
In each of these domains, SMOTE rarely operates alone. A typical production pipeline includes data cleaning, feature engineering, optional dimensionality reduction, resampling (SMOTE or a variant), classifier training, and a calibration step. The choice of variant often depends on the structure of the features (mixed numeric and categorical favors SMOTE-NC, dense numeric favors plain SMOTE) and on the position of the minority class in feature space (well-separated favors plain SMOTE, near-boundary favors Borderline-SMOTE or ADASYN).
Several practical pitfalls have been documented across the SMOTE literature, and avoiding them often matters more than picking the right variant.
The first pitfall is applying SMOTE before train-test split. This leaks information from the test set into the training set through the synthetic samples and inflates apparent test performance. SMOTE must be applied only to the training fold, never to the test fold or to the data before the split.
The second pitfall is failing to scale features. SMOTE uses Euclidean distance for nearest-neighbor search, and unscaled features with very different ranges will dominate the distance computation. Standard scaling or min-max normalization should always precede SMOTE on numeric features.
The third pitfall is ignoring categorical features. Plain SMOTE produces meaningless interpolations on encoded categoricals, especially one-hot encoded ones, where the synthetic point may have nonsensical fractional category indicators. SMOTE-NC or a categorical encoder applied after SMOTE is required for mixed-type data.
The fourth pitfall is oversampling too aggressively. Pushing the minority class to full balance with the majority class often introduces more noise than signal, especially when the imbalance ratio is extreme. Many practitioners report that partial oversampling (for instance, to a 1:3 ratio rather than 1:1) yields better results.
The fifth pitfall is trusting accuracy. Once the training distribution has been rebalanced, accuracy is a poor metric because the model has been optimized on a distribution that does not match deployment. Evaluate on the original imbalanced test set using AUC, PR-AUC, F1, or expected cost.
The sixth pitfall is forgetting calibration. SMOTE shifts the predicted probabilities toward the rebalanced distribution rather than the true deployment distribution. Calibration techniques such as Platt scaling, isotonic regression, or temperature scaling can help, although the underlying signal loss from SMOTE-induced distortion is not fully recoverable.
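One way to combine SMOTE with a calibration step is to wrap the resampling pipeline in scikit-learn's `CalibratedClassifierCV`, so that each calibration fold keeps the original class ratio; the sketch below is illustrative, and the isotonic method and fold count are arbitrary choices.

```python
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=4000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

smote_pipe = Pipeline([("smote", SMOTE(random_state=0)),
                       ("clf", RandomForestClassifier(random_state=0))])

# Within each CV fold the sampler runs only on the fit portion, so the
# calibration portion keeps the original (deployment-like) class ratio.
calibrated = CalibratedClassifierCV(smote_pipe, method="isotonic", cv=5).fit(X_tr, y_tr)
frac_pos, mean_pred = calibration_curve(y_te, calibrated.predict_proba(X_te)[:, 1], n_bins=10)
```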
The 2002 SMOTE paper has accumulated tens of thousands of citations and remains one of the most cited algorithms in applied machine learning. It helped formalize class imbalance as a recognized subfield with its own journals, workshops, and benchmarks, and it provided a baseline against which every new imbalanced learning method is measured. The Chawla group at Notre Dame has continued to publish in the area, including a 2018 follow-up paper SMOTE for Learning from Imbalanced Data: Progress and Challenges, Marking the 15-year Anniversary that surveyed the field's evolution.
Research in imbalanced learning continues along several active lines. The first line is better synthetic generation: replacing linear interpolation with manifold-aware methods such as DeepSMOTE [5], variational autoencoder-based generators, and diffusion model-based augmentation. The second line is better integration with strong classifiers: studying which combinations of resampling, reweighting, and threshold adjustment work for which classes of models, including the questions raised by Elor and Averbuch-Elor [2]. The third line is long-tailed deep learning: handling the case of many classes with widely varying frequency, where techniques like decoupled training, logit adjustment, and balanced softmax have largely supplanted SMOTE-style methods. The fourth line is fairness and bias: studying how resampling interacts with subgroup performance, since rebalancing one variable can worsen imbalance on another.
A fifth and increasingly important line is the privacy of synthetic data. The 2025 work on SMOTE membership inference attacks demonstrated that synthetic samples can leak information about individual training points, raising questions about the use of SMOTE in healthcare, financial services, and other regulated domains. Differential privacy-aware variants of SMOTE are an active research direction.
SMOTE remains a fixture of the applied machine learning toolbox more than two decades after its publication. Its central idea, that synthetic minority samples can be manufactured by interpolating between existing ones, is intuitive, easy to implement, and often improves the performance of weaker classifiers on imbalanced datasets. The SMOTE family of variants (Borderline-SMOTE, ADASYN, SMOTE-NC, SVM-SMOTE, KMeans-SMOTE, DeepSMOTE) addresses many of the original method's known weaknesses. At the same time, recent empirical work has cast doubt on the necessity and value of SMOTE for modern strong classifiers like XGBoost, gradient-boosted trees, and well-tuned deep networks, where calibration-preserving methods such as class weighting, focal loss, and threshold moving often perform as well or better. The right approach to class imbalance depends on the classifier, the metric, the structure of the data, and the downstream use case, and SMOTE should be treated as one option in a portfolio rather than as a default.