An imbalanced dataset is a dataset used in machine learning where the classes are not represented equally. In a typical imbalanced scenario, one class (the majority class) contains far more samples than one or more other classes (the minority class or classes). For example, a credit card fraud detection dataset might contain 99.8% legitimate transactions and only 0.2% fraudulent ones. This skewed distribution causes most standard classification algorithms to develop a strong bias toward the majority class: they are designed to minimize overall prediction error, and the easiest way to do that is to predict the majority class for every input.
Imbalanced datasets appear in nearly every domain where predictive modeling is used. In medical diagnosis, rare diseases may account for less than 1% of patient records. In cybersecurity, malicious network packets are vastly outnumbered by normal traffic. In manufacturing, defective products might represent less than 2% of all units. In each of these cases, the minority class is the one that matters most: missed fraud, missed diagnoses, and missed intrusions carry high real-world costs. This makes the class imbalance problem one of the most studied and practically relevant topics in applied machine learning.
Imagine you have a big bag of marbles. Almost all of them are blue (990 blue marbles), but only a few are red (10 red marbles). Now suppose you are trying to teach a robot to sort marbles by color. Because the robot sees blue marbles almost every time, it learns to just say "blue" for everything. It gets the answer right 99% of the time, but it never finds the red ones.
To fix this, you can do a few things. You could make copies of the red marbles so the robot sees them more often. You could take away some of the blue marbles so the colors are more even. Or you could tell the robot: "Getting a red marble wrong is a much bigger deal than getting a blue marble wrong, so pay extra attention to the red ones." All of these ideas help the robot learn to spot both colors, not just the common one.
Class imbalance is the norm rather than the exception in many applied settings. The table below lists common domains, typical imbalance ratios, and why the minority class carries disproportionate importance.
| Domain | Minority class | Typical imbalance ratio | Why the minority class matters |
|---|---|---|---|
| Credit card fraud detection | Fraudulent transactions | 0.1% to 0.2% of all transactions | Undetected fraud causes direct financial losses |
| Medical diagnosis (rare diseases) | Patients with the condition | Less than 1% of patient records | Missed diagnoses can be life-threatening |
| Manufacturing quality control | Defective products | 0.5% to 2% of items produced | Shipping defective products harms brand reputation and safety |
| Network intrusion detection | Malicious packets | Less than 1% of network traffic | A single undetected intrusion can compromise an entire system |
| Cancer screening | Malignant tumors | 1% to 5% of cases | False negatives delay treatment |
| Insurance claim fraud | Fraudulent claims | 1% to 5% of all claims | Fraudulent payouts increase premiums for all policyholders |
| Loan default prediction | Defaulting borrowers | 2% to 10% of applicants | Undetected defaults lead to significant financial losses |
| Spam email detection | Spam emails | 10% to 20% of total emails | Spam wastes user time and can carry phishing threats |
| Equipment failure prediction | Failure events | Less than 1% of sensor readings | Unexpected failures cause costly downtime |
| Anti-money laundering | Suspicious transactions | Less than 0.1% of transactions | Undetected laundering enables organized crime |
Before selecting a mitigation strategy, it helps to quantify how imbalanced a dataset actually is. Several metrics exist for this purpose.
The simplest measure is the imbalance ratio (IR), defined as the number of majority-class samples divided by the number of minority-class samples. A dataset with 9,500 negative examples and 500 positive examples has an IR of 19:1. Higher IR values indicate more severe imbalance.
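Computed directly from the labels, the IR takes only a few lines (a minimal helper; the function name `imbalance_ratio` is our own):

```python
from collections import Counter

def imbalance_ratio(y):
    """Majority-class count divided by minority-class count."""
    counts = Counter(y)
    return max(counts.values()) / min(counts.values())

# 9,500 negative and 500 positive examples -> IR of 19:1
y = [0] * 9500 + [1] * 500
print(imbalance_ratio(y))  # 19.0
```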
However, IR has limitations when applied to multi-class classification problems. For multi-class settings, researchers have proposed additional measures. The imbalance coefficient normalizes the ratio to a bounded range. The Bayes Imbalance Impact Index (BI3), proposed by Song et al. (2019), quantifies how much of a dataset's learning difficulty is attributable to imbalance alone, separating its effect from other data complexity factors like class overlap and noise. The Likelihood-Ratio Imbalance Degree (LRID) uses a likelihood-ratio test to measure imbalance extent across multiple classes.
Research has shown that imbalance ratio alone does not fully explain classifier performance degradation. Other data characteristics, including class overlap (where majority and minority class distributions share significant regions in feature space), small disjuncts (small clusters of minority-class samples separated from the main minority cluster), and label noise, interact with imbalance to compound the difficulty of learning.
The most intuitive evaluation metric for classification is accuracy, defined as the proportion of correct predictions out of all predictions. However, accuracy becomes deeply misleading when classes are imbalanced. Consider a fraud detection dataset where only 0.1% of transactions are fraudulent. A naive model that predicts every transaction as legitimate achieves 99.9% accuracy while catching zero fraud. This phenomenon is known as the accuracy paradox: a model can report high accuracy while being completely useless for the task it was designed to perform.
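The paradox is easy to reproduce with scikit-learn's DummyClassifier, which here stands in for a model that always predicts the majority class (a small illustrative sketch with synthetic labels):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

rng = np.random.default_rng(0)
# 1,000 transactions, 0.1% fraud -> a single positive example
y = np.zeros(1000, dtype=int)
y[0] = 1
X = rng.normal(size=(1000, 4))  # features are irrelevant to the dummy model

# A "model" that always predicts the majority class
clf = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = clf.predict(X)

print(accuracy_score(y, y_pred))  # 0.999 -- looks excellent
print(recall_score(y, y_pred))    # 0.0   -- catches zero fraud
```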
Because accuracy weights per-class performance proportionally to class size, it largely disregards how well the model handles the minority class. In domains where the minority class is the class of interest, accuracy reveals more about the distribution of classes than about actual model quality.
Most classification algorithms, including logistic regression, decision trees, and neural networks, are designed to minimize overall error during training. When one class vastly outnumbers the other, the loss function landscape is dominated by majority-class examples. The model learns to assign higher prior probability to the majority class, and its decision boundary shifts away from the minority class.
In practical terms, this means the model becomes very good at predicting the common outcome (for example, "not fraud") but very poor at detecting the rare outcome (for example, "fraud"). Since the rare outcome is typically the one that matters most, this bias defeats the purpose of building the model.
With very few minority-class samples, the model may not encounter enough diverse examples to learn the underlying patterns that distinguish the minority class from the majority class. This leads to poor generalization: the model overfits to the few minority samples it has seen and fails to recognize new minority-class instances at inference time.
Imbalance rarely exists in isolation. Real-world datasets often exhibit overlapping class distributions, noisy labels, and small disjuncts. These factors interact with imbalance to amplify the difficulty. A moderately imbalanced dataset (IR of 10:1) with heavy class overlap can be harder to learn than a severely imbalanced dataset (IR of 100:1) with well-separated classes. Understanding these interactions is important for selecting the right mitigation strategy.
Data-level approaches modify the training dataset to reduce the degree of imbalance before the model is trained. These techniques are algorithm-agnostic and can be applied as a preprocessing step.
Oversampling increases the number of minority-class examples in the training set. Several strategies exist, ranging from simple duplication to sophisticated synthetic data generation.
| Technique | How it works | Strengths | Limitations |
|---|---|---|---|
| Random oversampling | Duplicates randomly selected minority-class samples | Simple to implement; no new data fabricated | Can cause overfitting by repeating identical samples |
| SMOTE | Creates synthetic samples by interpolating between a minority-class sample and its k-nearest minority-class neighbors | Generates novel samples; reduces overfitting risk compared to random oversampling | Can create noisy samples in overlapping class regions |
| Borderline-SMOTE | Applies SMOTE only to minority-class samples near the decision boundary | Focuses synthetic generation where it matters most | Requires careful identification of borderline samples |
| ADASYN | Generates more synthetic samples for harder-to-learn minority instances (those with more majority-class neighbors) | Adapts generation density to local difficulty | May amplify noise if hard-to-learn samples are actually outliers |
| SVM-SMOTE | Uses support vector machine support vectors to identify the borderline area, then generates synthetic data along lines connecting minority-class support vectors to their nearest neighbors | Leverages the SVM decision boundary for targeted generation | Computationally expensive due to SVM training step |
| SMOTE-NC | Handles datasets with both numerical and categorical features by using a modified distance metric with a median-based penalty for categorical differences | Works with mixed data types, unlike standard SMOTE | Requires at least one continuous feature; slower than standard SMOTE |
| SMOTE-ENN | Combines SMOTE oversampling with Edited Nearest Neighbors cleaning | Removes noisy synthetic samples after generation | More computationally expensive than SMOTE alone |
| K-Means SMOTE | Clusters the feature space with K-Means, then applies SMOTE within sparse minority clusters | Avoids generating synthetic samples in dense or noisy areas | Adds clustering overhead; sensitive to K selection |
Downsampling (also called undersampling) reduces the number of majority-class examples. This approach is useful when the dataset is very large and the computational cost of training on all majority-class samples is prohibitive.
| Technique | How it works | Strengths | Limitations |
|---|---|---|---|
| Random undersampling | Removes randomly selected majority-class samples | Simple and fast | May discard informative majority-class examples; increases variance in the learned model |
| Tomek Links | Identifies pairs of nearest-neighbor samples from different classes (Tomek links) and removes the majority-class member | Cleans the decision boundary region | Only removes borderline samples; may not reduce imbalance substantially |
| NearMiss-1 | Keeps majority-class samples whose average distance to the closest minority-class samples is smallest | Preserves majority samples near the boundary | Can be sensitive to noise and outliers |
| NearMiss-2 | Keeps majority-class samples whose average distance to the farthest minority-class samples is smallest | Retains samples that are globally close to the minority class | May remove important majority-class structure |
| NearMiss-3 | For each minority sample, keeps its M nearest majority-class neighbors, then selects majority samples with the largest average distance to their N nearest minority neighbors | Two-step process provides finer control | Computationally expensive; parameter-sensitive |
| Condensed Nearest Neighbor (CNN) | Iteratively selects majority-class samples that are misclassified by a 1-NN classifier trained on the current subset | Produces a compact, representative subset | Result depends on sample ordering |
| Edited Nearest Neighbors (ENN) | Removes any sample whose class label differs from the majority of its k nearest neighbors | Cleans noisy and borderline samples from both classes | Mild effect on imbalance ratio |
| One-Sided Selection (OSS) | Combines Tomek Links removal with CNN to remove both borderline noise and redundant majority-class samples | More thorough than either technique alone | More complex to tune |
Some methods combine oversampling and undersampling in a single pipeline. SMOTE-Tomek first applies SMOTE to generate synthetic minority samples, then removes Tomek links from the augmented dataset to clean up noisy boundary regions. SMOTE-ENN applies SMOTE followed by Edited Nearest Neighbors, which removes any sample whose class label differs from the majority of its k nearest neighbors. These combination approaches often outperform either oversampling or undersampling used in isolation, because they both augment the minority class and clean up the resulting decision boundary.
The Synthetic Minority Over-sampling Technique (SMOTE) was introduced by Chawla, Bowyer, Hall, and Kegelmeyer in 2002 and remains the most widely cited method for handling class imbalance. As of 2025, the original paper has over 25,000 citations. SMOTE addresses a fundamental limitation of random oversampling: rather than duplicating existing minority-class samples (which risks overfitting), it generates entirely new synthetic samples in feature space.
For each selected minority-class sample, SMOTE picks one of its k nearest minority-class neighbors and creates a synthetic point at a random position along the line segment connecting the two. The result is that new synthetic samples lie along the line segments connecting existing minority-class samples in feature space. This produces more varied training data than simple duplication and helps the classifier generalize better. The original authors evaluated SMOTE using C4.5, Ripper, and Naive Bayes classifiers, measuring performance with the area under the ROC curve.
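The interpolation step can be sketched in a few lines of numpy (a simplified illustration, not the imbalanced-learn implementation; `smote_sample` is our own helper that generates a single synthetic point):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sample(X_min, k=5, rng=None):
    """Generate one synthetic sample via SMOTE-style interpolation.

    X_min: array of minority-class samples, shape (n, d).
    """
    if rng is None:
        rng = np.random.default_rng()
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    x = X_min[rng.integers(len(X_min))]
    # Neighbors of x within the minority class (index 0 is x itself)
    _, idx = nn.kneighbors(x.reshape(1, -1))
    x_nn = X_min[rng.choice(idx[0][1:])]
    lam = rng.uniform()              # interpolation factor in [0, 1)
    return x + lam * (x_nn - x)      # point on the segment between x and x_nn

rng = np.random.default_rng(0)
X_min = rng.normal(size=(20, 3))    # 20 minority samples, 3 features
synthetic = smote_sample(X_min, rng=rng)
```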
Since its introduction, numerous extensions of SMOTE have been proposed to address its limitations.
Borderline-SMOTE (Han, Wang, and Mao, 2005) restricts synthetic sample generation to minority-class instances that lie near the decision boundary. These are instances that have a roughly equal number of majority-class and minority-class neighbors. This focuses the augmentation effort where it is most needed and avoids generating synthetic samples deep within the minority-class cluster where the classifier already performs well.
ADASYN (He et al., 2008) assigns a density distribution to minority-class samples based on their difficulty of learning. Samples surrounded by more majority-class neighbors are considered harder to learn and receive more synthetic neighbors. This adaptively shifts the decision boundary toward the difficult examples.
SVM-SMOTE uses a support vector machine to identify the borderline region. After training an SVM classifier on the original data, the minority-class support vectors approximate the decision boundary. Synthetic samples are generated along lines connecting these support vectors to their nearest minority-class neighbors. This approach can produce better-targeted synthetic samples than Borderline-SMOTE in some settings.
SMOTE-NC (Nominal and Continuous) extends SMOTE to datasets containing both numerical and categorical features. For numerical features, it uses the standard SMOTE interpolation. For categorical features, it assigns the most frequent category among the k nearest neighbors. A constant M, computed as the median of the standard deviations of numerical features in the minority class, is used as a penalty term when calculating distances involving categorical variables.
K-Means SMOTE first clusters the data using K-Means, identifies clusters dominated by minority-class samples, and applies SMOTE within those clusters. This avoids generating synthetic samples in noisy or heavily overlapping regions.
SMOTE-Tomek and SMOTE-ENN are hybrid approaches that apply SMOTE for oversampling and then use Tomek Links or Edited Nearest Neighbors (respectively) to clean up noisy or ambiguous samples created during synthesis.
SMOTE has several well-documented limitations that practitioners should be aware of.
First, SMOTE operates in continuous feature space and relies on Euclidean distance for nearest-neighbor calculations. In very high-dimensional spaces, distance metrics become unreliable (the curse of dimensionality), and SMOTE may generate synthetic samples that do not reflect the true minority-class distribution. Applying dimensionality reduction before SMOTE can mitigate this issue.
Second, SMOTE does not consider the majority-class distribution when generating synthetic samples. If minority and majority classes overlap significantly, SMOTE can generate synthetic points that fall within majority-class regions, introducing noise and potentially degrading classifier performance. Borderline-SMOTE and ADASYN partially address this by focusing generation on boundary regions.
Third, the linear interpolation mechanism can produce synthetic samples that deviate from the true minority-class manifold, particularly when the minority class has a complex, non-linear distribution. GAN-based approaches (discussed below) can better capture such distributions.
Fourth, SMOTE was designed for binary classification. Applying it to multi-class problems requires either decomposing the problem into multiple binary problems or using multi-class extensions, which adds complexity.
Generative adversarial networks offer an alternative to SMOTE-family methods for generating synthetic minority-class samples. Instead of linear interpolation, GANs learn the underlying data distribution through an adversarial training process involving a generator and a discriminator.
For tabular data, CTGAN (Conditional Tabular GAN) is specifically designed to handle the challenges of mixed data types (continuous and categorical columns) and imbalanced categorical variables. CTGAN uses a variational Gaussian mixture model to encode continuous columns and a training-by-sampling strategy that conditions the generator on specific column values. This allows it to generate synthetic samples that better capture complex, non-linear relationships in the data compared to interpolation-based methods.
Other GAN variants used for imbalanced data include CopulaGAN, which models the joint distribution of features using copulas, and WGAN-GP (Wasserstein GAN with Gradient Penalty), which provides more stable training dynamics.
GAN-based oversampling tends to produce more realistic synthetic samples than SMOTE when the minority class has a complex distribution. However, GANs are significantly more expensive to train, require careful hyperparameter tuning, and may suffer from mode collapse (generating only a limited variety of samples). For small minority classes (fewer than a few hundred samples), GANs may not have enough training data to learn the distribution effectively, making SMOTE-family methods a more practical choice.
Algorithm-level approaches modify the learning algorithm itself so that it pays more attention to the minority class, without altering the training data.
Most modern classifiers, including logistic regression, support vector machines, random forests, and neural networks, support a class_weight parameter that assigns higher importance to minority-class samples during training. When class weights are set inversely proportional to class frequencies, the loss contribution of each minority-class sample is amplified, effectively forcing the model to pay equal attention to both classes.
In scikit-learn, setting class_weight='balanced' automatically computes weights as:
weight_j = n_samples / (n_classes * n_samples_j)
where n_samples_j is the number of samples in class j. For a binary dataset with 950 negative and 50 positive samples, the positive class receives a weight of 1000 / (2 * 50) = 10, meaning each positive-class error counts ten times as much as each negative-class error during training.
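scikit-learn exposes this computation directly through compute_class_weight; a quick check against the 950/50 example above:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# 950 negative and 50 positive samples, as in the example above
y = np.array([0] * 950 + [1] * 50)

weights = compute_class_weight(class_weight="balanced",
                               classes=np.array([0, 1]), y=y)
print(weights)  # -> [0.526..., 10.0], i.e. n_samples / (n_classes * n_samples_j)
```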
Cost-sensitive learning generalizes class weighting by assigning different misclassification costs to different types of errors. A cost matrix specifies the penalty for each cell of the confusion matrix. For instance, in medical diagnosis, the cost of a false negative (missing a disease) is typically set much higher than the cost of a false positive (unnecessary follow-up test). The learning algorithm then minimizes expected cost rather than raw error count.
Cost-sensitive approaches can be implemented at three levels.
Direct algorithm modification involves changing the objective function of the learning algorithm to incorporate costs. For example, a cost-sensitive decision tree can use cost-weighted impurity measures instead of standard Gini impurity or information gain.
Meta-learning wrappers convert any existing classifier into a cost-sensitive one by manipulating instance weights or probability thresholds. This approach has the advantage of being model-agnostic.
Threshold adjustment modifies the probability threshold at prediction time rather than during training, shifting it to reflect the asymmetric costs.
In binary classification, most classifiers output a probability score and apply a default threshold of 0.5 to convert it into a class label. For imbalanced data, this default threshold is usually suboptimal. Threshold moving (also called threshold tuning) involves selecting a threshold that optimizes a metric more appropriate for the task, such as the F1 score or the geometric mean of sensitivity and specificity. The optimal threshold can be determined by analyzing the precision-recall curve or the ROC curve on a validation set.
For example, if a model outputs probability 0.3 for a positive case, the default 0.5 threshold would classify it as negative. Lowering the threshold to 0.2 would correctly classify it as positive. The trade-off is that a lower threshold increases recall (catching more true positives) at the expense of precision (producing more false positives).
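One way to implement threshold tuning is to scan the precision-recall curve on a held-out validation set and keep the threshold that maximizes F1. A minimal sketch on synthetic data (the exact threshold found will vary with the data):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

# Imbalanced synthetic data: roughly 5% positives
X, y = make_classification(n_samples=5000, weights=[0.95], random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = clf.predict_proba(X_val)[:, 1]

# Pick the threshold that maximizes F1 on the validation set
precision, recall, thresholds = precision_recall_curve(y_val, proba)
f1 = 2 * precision * recall / (precision + recall + 1e-12)
best = thresholds[np.argmax(f1[:-1])]  # last precision/recall pair has no threshold
print(f"best threshold: {best:.3f}")

y_pred = (proba >= best).astype(int)   # replaces the default 0.5 cut-off
```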
Introduced by Lin et al. (2017) for dense object detection, focal loss has become widely adopted for training deep learning models on imbalanced data. Focal loss modifies the standard cross-entropy loss by adding a modulating factor that down-weights easy (well-classified) examples and focuses training on hard, misclassified instances:
FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t)
The hyperparameter gamma (typically set between 1 and 5) controls how aggressively easy examples are down-weighted. When gamma = 0, focal loss reduces to standard cross-entropy. The alpha_t term provides class-specific weighting. In practice, focal loss effectively concentrates the gradient signal on minority-class samples and hard examples near the decision boundary.
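A minimal numpy version of the formula makes the down-weighting concrete (illustrative only; deep learning frameworks ship optimized, batched implementations):

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss, FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t).

    p: predicted probability of the positive class; y: true label (0 or 1).
    """
    p_t = np.where(y == 1, p, 1 - p)
    alpha_t = np.where(y == 1, alpha, 1 - alpha)
    return -alpha_t * (1 - p_t) ** gamma * np.log(p_t)

# An easy, well-classified positive contributes almost nothing...
easy = focal_loss(np.array([0.95]), np.array([1]))
# ...while a hard, misclassified one dominates the loss
hard = focal_loss(np.array([0.05]), np.array([1]))
print(easy, hard)
```

With gamma = 0 and alpha_t = 1, the expression collapses to plain cross-entropy, matching the reduction described above.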
Ensemble methods combine multiple base learners, each trained on a different resampled version of the data, to produce a more robust classifier. Several ensemble approaches are specifically designed for imbalanced classification.
| Method | Base learner | Resampling strategy | Description |
|---|---|---|---|
| BalancedRandomForest | Decision tree | Random undersampling per bootstrap | Each tree in the forest is trained on a balanced bootstrap sample created by randomly undersampling the majority class to match the minority class |
| EasyEnsemble | AdaBoost | Random undersampling per subset | Creates multiple balanced subsets by undersampling the majority class, trains an AdaBoost classifier on each subset, and aggregates predictions |
| RUSBoost | Decision tree (boosted) | Random undersampling per boosting round | Integrates random undersampling into the AdaBoost boosting process, balancing the data at each iteration |
| BalancedBagging | Any classifier | Random undersampling per bag | Extends standard bagging by undersampling each bootstrap sample before training the base learner |
| SMOTEBagging | Any classifier | SMOTE per bag | Applies SMOTE to each bootstrap sample to generate a balanced training set for each base learner |
| SMOTEBoost | Decision tree (boosted) | SMOTE per boosting round | Integrates SMOTE into the boosting procedure; generates synthetic minority samples at each round before updating weights |
In comparative studies, ensemble methods that incorporate resampling (EasyEnsemble, RUSBoost, SMOTEBagging) consistently outperform standalone resampling or standalone ensemble approaches on imbalanced benchmarks. EasyEnsemble, in particular, has shown strong results across multiple studies, likely because it combines the variance reduction benefits of ensembling with the bias correction benefits of undersampling.
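The core idea behind EasyEnsemble can be sketched without a specialized library: draw several balanced subsets by undersampling the majority class, train one model per subset, and average their predictions. This simplified version uses plain decision trees rather than the AdaBoost learners of the original method (imbalanced-learn provides a full EasyEnsembleClassifier):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def easy_ensemble_fit(X, y, n_subsets=10, seed=0):
    """EasyEnsemble-style training: one balanced subset per base learner."""
    rng = np.random.default_rng(seed)
    minority = np.flatnonzero(y == 1)
    majority = np.flatnonzero(y == 0)
    models = []
    for _ in range(n_subsets):
        # Undersample the majority class to the minority-class size
        sampled = rng.choice(majority, size=len(minority), replace=False)
        idx = np.concatenate([minority, sampled])
        models.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))
    return models

def easy_ensemble_predict(models, X):
    # Average the per-model positive-class probabilities, then threshold
    proba = np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)
    return (proba >= 0.5).astype(int)

X, y = make_classification(n_samples=3000, weights=[0.95], random_state=42)
models = easy_ensemble_fit(X, y)
y_pred = easy_ensemble_predict(models, X)
```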
When class imbalance is extreme (IR greater than 1000:1, or when labeled minority-class samples are very scarce), framing the problem as anomaly detection rather than binary classification can be more effective. Anomaly detection methods learn a model of "normal" behavior from the majority class and flag deviations as anomalies.
One-Class SVM learns a tight boundary around normal data in feature space and classifies points outside this boundary as anomalies. It works well when the normal class is compact in feature space but can be computationally intensive for large datasets, especially with non-linear kernels.
Isolation Forest builds an ensemble of random binary trees that recursively partition the feature space. Anomalies, being rare and different from normal patterns, tend to be isolated by fewer partitions and thus appear in shorter paths within the trees. Isolation Forest handles large, high-dimensional datasets efficiently and is relatively insensitive to the contamination rate.
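A minimal Isolation Forest example with scikit-learn, using synthetic "normal" traffic plus a handful of planted outliers (the data and contamination rate are invented for illustration):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
# "Normal" behavior clustered near the origin, plus a few far-away anomalies
normal = rng.normal(loc=0.0, scale=1.0, size=(990, 2))
anomalies = rng.normal(loc=8.0, scale=0.5, size=(10, 2))
X = np.vstack([normal, anomalies])

iso = IsolationForest(contamination=0.01, random_state=42).fit(X)
labels = iso.predict(X)  # +1 for inliers, -1 for anomalies

print((labels[-10:] == -1).sum(), "of 10 planted anomalies flagged")
```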
Autoencoders trained on majority-class data learn to reconstruct normal patterns. At inference time, anomalies produce high reconstruction error because the model has never seen similar patterns during training. This approach is particularly useful for complex, high-dimensional data like images and time series.
The anomaly detection framing is most appropriate when very few labeled minority-class examples are available or when the minority class is too heterogeneous to model directly.
Imbalance in multi-class classification is more complex than in binary settings because the imbalance can exist between any pair of classes. A dataset might have three classes with distributions of 90%, 8%, and 2%, creating multiple simultaneous imbalance relationships.
One common approach is to decompose the multi-class problem into multiple binary problems. One-vs-Rest (OvR) creates one binary classifier per class, where each classifier distinguishes one class from all others. However, OvR inherently creates imbalanced binary problems: the "rest" group is almost always larger than the single target class. One-vs-One (OvO) creates a binary classifier for each pair of classes. This naturally produces more balanced binary subproblems but requires training O(k^2) classifiers for k classes.
Applying SMOTE to multi-class problems requires deciding which classes to oversample and by how much. Common strategies include oversampling all minority classes to match the majority class, oversampling each class to match the median class size, or using class-specific oversampling ratios based on the degree of imbalance each class faces.
Cost-sensitive learning extends naturally to multi-class settings by defining a k-by-k cost matrix where each entry specifies the cost of predicting class i when the true class is j. In practice, designing an appropriate multi-class cost matrix requires domain expertise, since the relative costs of confusing different class pairs may vary significantly.
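As an illustration, minimum-expected-cost prediction with a hypothetical 3-class cost matrix (the matrix values below are invented for the example):

```python
import numpy as np

# Cost matrix C[i, j]: cost of predicting class i when the true class is j.
# Here, missing the rare class 2 is penalized most heavily.
C = np.array([
    [0, 1, 10],   # predict 0
    [1, 0, 10],   # predict 1
    [2, 2,  0],   # predict 2
])

def min_cost_predict(proba, C):
    """Pick the class minimizing expected misclassification cost."""
    # expected_cost[i] = sum_j C[i, j] * P(true class = j)
    return np.argmin(C @ proba.T, axis=0)

# A sample with only 15% probability of the rare class 2...
proba = np.array([[0.60, 0.25, 0.15]])
print(min_cost_predict(proba, C))  # [2] -- argmax over proba would say 0
```

Expected costs here are 1.75, 2.1, and 1.7 for predicting classes 0, 1, and 2 respectively, so the cost-sensitive decision flips to the rare class even though it is not the most probable one.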
Deep learning models face the same challenges as traditional classifiers when trained on imbalanced data, but their scale and flexibility enable additional mitigation strategies.
The most common approach is to replace standard cross-entropy loss with a loss function that accounts for class imbalance. Focal loss (described above) is the most widely used option; class-weighted cross-entropy, which scales each sample's loss contribution by its class weight, is a simpler alternative.
Because deep learning models train in mini-batches, severe imbalance can result in batches that contain no minority-class samples at all. Strategies to address this include class-balanced sampling, which constructs each mini-batch so that every class is represented, and weighted random samplers that select minority-class examples with higher probability.
For image and text data, data augmentation can serve as a form of oversampling that generates genuinely new minority-class examples rather than simple interpolations. Techniques such as random cropping, rotation, and color jittering (for images), as well as synonym replacement and back-translation (for text), increase the diversity of minority-class training data. When combined with focal loss or class-balanced sampling, augmentation-based approaches can substantially improve minority-class recall without sacrificing overall performance.
Choosing the right evaluation metric is as important as choosing the right resampling strategy. Standard accuracy is unreliable for imbalanced data. The following metrics provide a more faithful picture of model performance.
| Metric | Formula or definition | Why it helps with imbalanced data |
|---|---|---|
| Precision | TP / (TP + FP) | Measures the fraction of predicted positives that are truly positive; high precision means few false alarms |
| Recall (sensitivity) | TP / (TP + FN) | Measures the fraction of actual positives that are correctly detected; high recall means few missed cases |
| F1 score | 2 * (Precision * Recall) / (Precision + Recall) | Harmonic mean of precision and recall; balances both concerns in a single number |
| F-beta score | (1 + beta^2) * (Precision * Recall) / (beta^2 * Precision + Recall) | Generalization of F1 that allows weighting recall more (beta > 1) or precision more (beta < 1) |
| PR-AUC | Area under the Precision-Recall curve | Focuses on positive-class performance; more informative than ROC-AUC when the positive class is rare |
| ROC-AUC | Area under the ROC curve | Measures trade-off between true positive rate and false positive rate across all thresholds |
| Matthews Correlation Coefficient (MCC) | (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)) | Uses all four cells of the confusion matrix; returns a value between -1 and +1 that is informative even with severe imbalance |
| Balanced accuracy | (Sensitivity + Specificity) / 2 | Averages per-class accuracy; corrects for majority-class dominance |
| Cohen's Kappa | Agreement beyond chance | Compares observed accuracy with expected accuracy under random prediction; penalizes models that merely predict the majority class |
| G-Mean | sqrt(Sensitivity * Specificity) | Geometric mean of per-class accuracies; penalizes models that sacrifice one class for another |
A long-standing debate in the literature concerns whether ROC-AUC or PR-AUC is more appropriate for imbalanced settings. ROC-AUC plots the true positive rate against the false positive rate and tends to present an optimistic view when the negative class is very large, because a small false positive rate still corresponds to a large absolute number of false positives. PR-AUC plots precision against recall and is more sensitive to errors involving the positive (minority) class.
Saito and Rehmsmeier (2015) demonstrated that the precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. Their analysis showed that a method performing well on the ROC curve may still perform poorly on the PR curve if it generates many false positives, which are obscured by the large number of true negatives in the ROC analysis.
As a general guideline, PR-AUC is preferred when the primary goal is to accurately identify minority-class instances (for example, fraud detection, rare disease diagnosis), while ROC-AUC is appropriate when the costs of false positives and false negatives are roughly symmetric. In practice, reporting both curves alongside the MCC provides the most complete picture.
The MCC deserves special attention for imbalanced datasets. Chicco and Jurman (2020) demonstrated that MCC is more informative than both F1 score and accuracy for binary classification evaluation. Unlike F1, MCC accounts for true negatives and is symmetric with respect to both classes. An MCC of +1 indicates perfect prediction, 0 indicates performance no better than random, and -1 indicates complete disagreement. Because MCC uses all four quadrants of the confusion matrix, it is harder to "game" by simply predicting the majority class.
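The contrast with accuracy is easy to demonstrate using a majority-class-only predictor, as in the accuracy paradox discussed earlier:

```python
import numpy as np
from sklearn.metrics import accuracy_score, matthews_corrcoef

# 990 negatives, 10 positives; the model always predicts the majority class
y_true = np.array([0] * 990 + [1] * 10)
y_pred = np.zeros(1000, dtype=int)

print(accuracy_score(y_true, y_pred))     # 0.99 -- looks strong
print(matthews_corrcoef(y_true, y_pred))  # 0.0  -- no better than chance
```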
imbalanced-learn is an open-source Python library specifically designed to handle class-imbalanced datasets. It is part of the scikit-learn-contrib ecosystem and provides a consistent API that integrates seamlessly with scikit-learn pipelines.
The library organizes its methods into four categories: over-sampling, under-sampling, combinations of over- and under-sampling, and ensemble methods.
A basic usage example with SMOTE:
```python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split, cross_val_score

# Example imbalanced dataset (~5% positives)
X, y = make_classification(n_samples=5000, weights=[0.95], random_state=42)

# Split data (stratified to preserve class distribution)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Create a pipeline with SMOTE and a classifier
pipeline = Pipeline([
    ('smote', SMOTE(random_state=42)),
    ('classifier', RandomForestClassifier(random_state=42))
])

# Cross-validate on training data
scores = cross_val_score(pipeline, X_train, y_train, cv=5, scoring='f1')
print(f"Cross-validated F1: {scores.mean():.3f} +/- {scores.std():.3f}")

# Fit and evaluate on test set
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
print(classification_report(y_test, y_pred))
```
An important best practice is to apply resampling only to the training set, never to the validation or test set. The imblearn.pipeline.Pipeline class handles this automatically by applying the resampling step only during fit, not during predict or score. Resampling the test set would distort the evaluation metrics and give a misleading picture of real-world performance.
The following table summarizes the main categories of techniques for handling imbalanced data, along with their typical use cases and trade-offs.
| Approach | When to use | Advantages | Disadvantages |
|---|---|---|---|
| Random oversampling | Small datasets with moderate imbalance | Simple; no hyperparameters | Risk of overfitting from duplicated samples |
| SMOTE and variants | Moderate imbalance with continuous features | Generates diverse synthetic samples | May create noise in overlapping regions; struggles with high-dimensional data |
| GAN-based oversampling | Complex distributions; sufficient minority samples for GAN training | Captures non-linear distributions | Expensive to train; risk of mode collapse |
| Random undersampling | Very large datasets where computation is a concern | Reduces training time | Discards potentially useful majority-class information |
| Informed undersampling (Tomek, NearMiss) | Moderate to large datasets with noisy boundaries | Cleans decision boundary | May not reduce imbalance enough on its own |
| Class weights | Any classifier that supports weighted loss | No data modification needed; easy to implement | May not be sufficient for extreme imbalance |
| Cost-sensitive learning | Problems with well-defined misclassification costs | Directly optimizes for business objectives | Requires domain knowledge to set cost matrix |
| Threshold moving | Any probabilistic classifier | Simple post-hoc adjustment; no retraining needed | Only adjusts the decision point, not the learned representation |
| Focal loss | Deep learning models on imbalanced data | Automatically down-weights easy examples | Requires tuning gamma and alpha hyperparameters |
| Ensemble methods | General-purpose; when single models underperform | Combines benefits of resampling and model averaging | Higher computational cost; more complex to deploy |
| Anomaly detection | Extreme imbalance (>1000:1) or very few minority labels | Does not require balanced training data | Cannot leverage minority-class labels effectively |
Always apply resampling inside an imbalanced-learn pipeline (imblearn.pipeline.Pipeline) to prevent data leakage during cross-validation.