An imbalanced dataset is a dataset used in machine learning where the classes are not represented equally. In a typical imbalanced scenario, one class (the majority class) contains far more samples than one or more other classes (the minority class or classes). For example, a credit card fraud detection dataset might contain 99.8% legitimate transactions and only 0.2% fraudulent ones. This skewed distribution causes most standard classification algorithms to develop a strong bias toward the majority class: they are designed to minimize overall prediction error, and the easiest way to do that is to predict the majority class for every input.
Imbalanced datasets appear in nearly every domain where predictive modeling is used. In medical diagnosis, rare diseases may account for less than 1% of patient records. In cybersecurity, malicious network packets are vastly outnumbered by normal traffic. In manufacturing, defective products might represent less than 2% of all units. In each of these cases, the minority class is the one that matters most: missed fraud, missed diagnoses, and missed intrusions carry high real-world costs. This makes the class imbalance problem one of the most studied and practically relevant topics in applied machine learning.
Imagine you have a big bag of marbles. Almost all of them are blue (990 blue marbles), but only a few are red (10 red marbles). Now suppose you are trying to teach a robot to sort marbles by color. Because the robot sees blue marbles almost every time, it learns to just say "blue" for everything. It gets the answer right 99% of the time, but it never finds the red ones.
To fix this, you can do a few things. You could make copies of the red marbles so the robot sees them more often. You could take away some of the blue marbles so the colors are more even. Or you could tell the robot: "Getting a red marble wrong is a much bigger deal than getting a blue marble wrong, so pay extra attention to the red ones." All of these ideas help the robot learn to spot both colors, not just the common one.
Class imbalance is the norm rather than the exception in many applied settings. The table below lists common domains, typical imbalance ratios, and why the minority class carries disproportionate importance.
| Domain | Minority class | Typical imbalance ratio | Why the minority class matters |
|---|---|---|---|
| Credit card fraud detection | Fraudulent transactions | 0.1% to 0.2% of all transactions | Undetected fraud causes direct financial losses |
| Medical diagnosis (rare diseases) | Patients with the condition | Less than 1% of patient records | Missed diagnoses can be life-threatening |
| Manufacturing quality control | Defective products | 0.5% to 2% of items produced | Shipping defective products harms brand reputation and safety |
| Network intrusion detection | Malicious packets | Less than 1% of network traffic | A single undetected intrusion can compromise an entire system |
| Cancer screening | Malignant tumors | 1% to 5% of cases | False negatives delay treatment |
| Insurance claim fraud | Fraudulent claims | 1% to 5% of all claims | Fraudulent payouts increase premiums for all policyholders |
| Loan default prediction | Defaulting borrowers | 2% to 10% of applicants | Undetected defaults lead to significant financial losses |
| Spam email detection | Spam emails | 10% to 20% of total emails | Spam wastes user time and can carry phishing threats |
| Equipment failure prediction | Failure events | Less than 1% of sensor readings | Unexpected failures cause costly downtime |
| Anti-money laundering | Suspicious transactions | Less than 0.1% of transactions | Undetected laundering enables organized crime |
Before selecting a mitigation strategy, it helps to quantify how imbalanced a dataset actually is. Several metrics exist for this purpose.
The simplest measure is the imbalance ratio (IR), defined as the number of majority-class samples divided by the number of minority-class samples. A dataset with 9,500 negative examples and 500 positive examples has an IR of 19:1. Higher IR values indicate more severe imbalance.
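Computed directly from the labels, the IR takes only a few lines (a minimal helper; the function name `imbalance_ratio` is our own):

```python
from collections import Counter

def imbalance_ratio(y):
    """Majority-class count divided by minority-class count."""
    counts = Counter(y)
    return max(counts.values()) / min(counts.values())

# 9,500 negative and 500 positive examples -> IR of 19:1
y = [0] * 9500 + [1] * 500
print(imbalance_ratio(y))  # 19.0
```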
However, IR has limitations when applied to multi-class classification problems. For multi-class settings, researchers have proposed additional measures. The imbalance coefficient normalizes the ratio to a bounded range. The Bayes Imbalance Impact Index (BI3), proposed by Song et al. (2019), quantifies how much of a dataset's learning difficulty is attributable to imbalance alone, separating its effect from other data complexity factors like class overlap and noise. The Likelihood-Ratio Imbalance Degree (LRID) uses a likelihood-ratio test to measure imbalance extent across multiple classes.
Research has shown that imbalance ratio alone does not fully explain classifier performance degradation. Other data characteristics, including class overlap (where majority and minority class distributions share significant regions in feature space), small disjuncts (small clusters of minority-class samples separated from the main minority cluster), and label noise, interact with imbalance to compound the difficulty of learning.
The most intuitive evaluation metric for classification is accuracy, defined as the proportion of correct predictions out of all predictions. However, accuracy becomes deeply misleading when classes are imbalanced. Consider a fraud detection dataset where only 0.1% of transactions are fraudulent. A naive model that predicts every transaction as legitimate achieves 99.9% accuracy while catching zero fraud. This phenomenon is known as the accuracy paradox: a model can report high accuracy while being completely useless for the task it was designed to perform.
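The paradox is easy to reproduce with scikit-learn's DummyClassifier, which here stands in for a model that always predicts the majority class (a small illustrative sketch with synthetic labels):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

rng = np.random.default_rng(0)
# 1,000 transactions, 0.1% fraud -> a single positive example
y = np.zeros(1000, dtype=int)
y[0] = 1
X = rng.normal(size=(1000, 4))  # features are irrelevant to the dummy model

# A "model" that always predicts the majority class
clf = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = clf.predict(X)

print(accuracy_score(y, y_pred))  # 0.999 -- looks excellent
print(recall_score(y, y_pred))    # 0.0   -- catches zero fraud
```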
Because accuracy weights per-class performance proportionally to class size, it largely disregards how well the model handles the minority class. In domains where the minority class is the class of interest, accuracy reveals more about the distribution of classes than about actual model quality.
Most classification algorithms, including logistic regression, decision trees, and neural networks, are designed to minimize overall error during training. When one class vastly outnumbers the other, the loss function landscape is dominated by majority-class examples. The model learns to assign higher prior probability to the majority class, and its decision boundary shifts away from the minority class.
In practical terms, this means the model becomes very good at predicting the common outcome (for example, "not fraud") but very poor at detecting the rare outcome (for example, "fraud"). Since the rare outcome is typically the one that matters most, this bias defeats the purpose of building the model.
With very few minority-class samples, the model may not encounter enough diverse examples to learn the underlying patterns that distinguish the minority class from the majority class. This leads to poor generalization: the model overfits to the few minority samples it has seen and fails to recognize new minority-class instances at inference time.
Imbalance rarely exists in isolation. Real-world datasets often exhibit overlapping class distributions, noisy labels, and small disjuncts. These factors interact with imbalance to amplify the difficulty. A moderately imbalanced dataset (IR of 10:1) with heavy class overlap can be harder to learn than a severely imbalanced dataset (IR of 100:1) with well-separated classes. Understanding these interactions is important for selecting the right mitigation strategy.
Data-level approaches modify the training dataset to reduce the degree of imbalance before the model is trained. These techniques are algorithm-agnostic and can be applied as a preprocessing step.
Oversampling increases the number of minority-class examples in the training set. Several strategies exist, ranging from simple duplication to sophisticated synthetic data generation.
| Technique | How it works | Strengths | Limitations |
|---|---|---|---|
| Random oversampling | Duplicates randomly selected minority-class samples | Simple to implement; no new data fabricated | Can cause overfitting by repeating identical samples |
| SMOTE | Creates synthetic samples by interpolating between a minority-class sample and its k-nearest minority-class neighbors | Generates novel samples; reduces overfitting risk compared to random oversampling | Can create noisy samples in overlapping class regions |
| Borderline-SMOTE | Applies SMOTE only to minority-class samples near the decision boundary | Focuses synthetic generation where it matters most | Requires careful identification of borderline samples |
| ADASYN | Generates more synthetic samples for harder-to-learn minority instances (those with more majority-class neighbors) | Adapts generation density to local difficulty | May amplify noise if hard-to-learn samples are actually outliers |
| SVM-SMOTE | Uses support vector machine support vectors to identify the borderline area, then generates synthetic data along lines connecting minority-class support vectors to their nearest neighbors | Leverages the SVM decision boundary for targeted generation | Computationally expensive due to SVM training step |
| SMOTE-NC | Handles datasets with both numerical and categorical features by using a modified distance metric with a median-based penalty for categorical differences | Works with mixed data types, unlike standard SMOTE | Requires at least one continuous feature; slower than standard SMOTE |
| SMOTE-ENN | Combines SMOTE oversampling with Edited Nearest Neighbors cleaning | Removes noisy synthetic samples after generation | More computationally expensive than SMOTE alone |
| K-Means SMOTE | Clusters the feature space with K-Means, then applies SMOTE within sparse minority clusters | Avoids generating synthetic samples in dense or noisy areas | Adds clustering overhead; sensitive to K selection |
Downsampling (also called undersampling) reduces the number of majority-class examples. This approach is useful when the dataset is very large and the computational cost of training on all majority-class samples is prohibitive.
| Technique | How it works | Strengths | Limitations |
|---|---|---|---|
| Random undersampling | Removes randomly selected majority-class samples | Simple and fast | May discard informative majority-class examples; increases variance in the learned model |
| Tomek Links | Identifies pairs of nearest-neighbor samples from different classes (Tomek links) and removes the majority-class member | Cleans the decision boundary region | Only removes borderline samples; may not reduce imbalance substantially |
| NearMiss-1 | Keeps majority-class samples whose average distance to the closest minority-class samples is smallest | Preserves majority samples near the boundary | Can be sensitive to noise and outliers |
| NearMiss-2 | Keeps majority-class samples whose average distance to the farthest minority-class samples is smallest | Retains samples that are globally close to the minority class | May remove important majority-class structure |
| NearMiss-3 | For each minority sample, keeps its M nearest majority-class neighbors, then selects majority samples with the largest average distance to their N nearest minority neighbors | Two-step process provides finer control | Computationally expensive; parameter-sensitive |
| Condensed Nearest Neighbor (CNN) | Iteratively selects majority-class samples that are misclassified by a 1-NN classifier trained on the current subset | Produces a compact, representative subset | Result depends on sample ordering |
| Edited Nearest Neighbors (ENN) | Removes any sample whose class label differs from the majority of its k nearest neighbors | Cleans noisy and borderline samples from both classes | Mild effect on imbalance ratio |
| One-Sided Selection (OSS) | Combines Tomek Links removal with CNN to remove both borderline noise and redundant majority-class samples | More thorough than either technique alone | More complex to tune |
Some methods combine oversampling and undersampling in a single pipeline. SMOTE-Tomek first applies SMOTE to generate synthetic minority samples, then removes Tomek links from the augmented dataset to clean up noisy boundary regions. SMOTE-ENN applies SMOTE followed by Edited Nearest Neighbors, which removes any sample whose class label differs from the majority of its k nearest neighbors. These combination approaches often outperform either oversampling or undersampling used in isolation, because they both augment the minority class and clean up the resulting decision boundary.
The Synthetic Minority Over-sampling Technique (SMOTE) was introduced by Chawla, Bowyer, Hall, and Kegelmeyer in 2002 and remains the most widely cited method for handling class imbalance. As of 2025, the original paper has over 25,000 citations. SMOTE addresses a fundamental limitation of random oversampling: rather than duplicating existing minority-class samples (which risks overfitting), it generates entirely new synthetic samples in feature space.
For each selected minority-class sample, SMOTE picks one of its k nearest minority-class neighbors and creates a synthetic point at a random position along the line segment connecting the two. The result is that new synthetic samples lie along the line segments connecting existing minority-class samples in feature space. This produces more varied training data than simple duplication and helps the classifier generalize better. The original authors evaluated SMOTE using C4.5, Ripper, and Naive Bayes classifiers, measuring performance with the area under the ROC curve.
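The interpolation step can be sketched in a few lines of numpy (a simplified illustration, not the imbalanced-learn implementation; `smote_sample` is our own helper that generates a single synthetic point):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sample(X_min, k=5, rng=None):
    """Generate one synthetic sample via SMOTE-style interpolation.

    X_min: array of minority-class samples, shape (n, d).
    """
    if rng is None:
        rng = np.random.default_rng()
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    x = X_min[rng.integers(len(X_min))]
    # Neighbors of x within the minority class (index 0 is x itself)
    _, idx = nn.kneighbors(x.reshape(1, -1))
    x_nn = X_min[rng.choice(idx[0][1:])]
    lam = rng.uniform()              # interpolation factor in [0, 1)
    return x + lam * (x_nn - x)      # point on the segment between x and x_nn

rng = np.random.default_rng(0)
X_min = rng.normal(size=(20, 3))    # 20 minority samples, 3 features
synthetic = smote_sample(X_min, rng=rng)
```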
Since its introduction, numerous extensions of SMOTE have been proposed to address its limitations.
Borderline-SMOTE (Han, Wang, and Mao, 2005) restricts synthetic sample generation to minority-class instances that lie near the decision boundary. These are instances that have a roughly equal number of majority-class and minority-class neighbors. This focuses the augmentation effort where it is most needed and avoids generating synthetic samples deep within the minority-class cluster where the classifier already performs well.
ADASYN (He et al., 2008) assigns a density distribution to minority-class samples based on their difficulty of learning. Samples surrounded by more majority-class neighbors are considered harder to learn and receive more synthetic neighbors. This adaptively shifts the decision boundary toward the difficult examples.
SVM-SMOTE uses a support vector machine to identify the borderline region. After training an SVM classifier on the original data, the minority-class support vectors approximate the decision boundary. Synthetic samples are generated along lines connecting these support vectors to their nearest minority-class neighbors. This approach can produce better-targeted synthetic samples than Borderline-SMOTE in some settings.
SMOTE-NC (Nominal and Continuous) extends SMOTE to datasets containing both numerical and categorical features. For numerical features, it uses the standard SMOTE interpolation. For categorical features, it assigns the most frequent category among the k nearest neighbors. A constant M, computed as the median of the standard deviations of numerical features in the minority class, is used as a penalty term when calculating distances involving categorical variables.
K-Means SMOTE first clusters the data using K-Means, identifies clusters dominated by minority-class samples, and applies SMOTE within those clusters. This avoids generating synthetic samples in noisy or heavily overlapping regions.
SMOTE-Tomek and SMOTE-ENN are hybrid approaches that apply SMOTE for oversampling and then use Tomek Links or Edited Nearest Neighbors (respectively) to clean up noisy or ambiguous samples created during synthesis.
SMOTE has several well-documented limitations that practitioners should be aware of.
First, SMOTE operates in continuous feature space and relies on Euclidean distance for nearest-neighbor calculations. In very high-dimensional spaces, distance metrics become unreliable (the curse of dimensionality), and SMOTE may generate synthetic samples that do not reflect the true minority-class distribution. Applying dimensionality reduction before SMOTE can mitigate this issue.
Second, SMOTE does not consider the majority-class distribution when generating synthetic samples. If minority and majority classes overlap significantly, SMOTE can generate synthetic points that fall within majority-class regions, introducing noise and potentially degrading classifier performance. Borderline-SMOTE and ADASYN partially address this by focusing generation on boundary regions.
Third, the linear interpolation mechanism can produce synthetic samples that deviate from the true minority-class manifold, particularly when the minority class has a complex, non-linear distribution. GAN-based approaches (discussed below) can better capture such distributions.
Fourth, SMOTE was designed for binary classification. Applying it to multi-class problems requires either decomposing the problem into multiple binary problems or using multi-class extensions, which adds complexity.
Generative adversarial networks offer an alternative to SMOTE-family methods for generating synthetic minority-class samples. Instead of linear interpolation, GANs learn the underlying data distribution through an adversarial training process involving a generator and a discriminator.
For tabular data, CTGAN (Conditional Tabular GAN) is specifically designed to handle the challenges of mixed data types (continuous and categorical columns) and imbalanced categorical variables. CTGAN uses a variational Gaussian mixture model to encode continuous columns and a training-by-sampling strategy that conditions the generator on specific column values. This allows it to generate synthetic samples that better capture complex, non-linear relationships in the data compared to interpolation-based methods.
Other GAN variants used for imbalanced data include CopulaGAN, which models the joint distribution of features using copulas, and WGAN-GP (Wasserstein GAN with Gradient Penalty), which provides more stable training dynamics.
GAN-based oversampling tends to produce more realistic synthetic samples than SMOTE when the minority class has a complex distribution. However, GANs are significantly more expensive to train, require careful hyperparameter tuning, and may suffer from mode collapse (generating only a limited variety of samples). For small minority classes (fewer than a few hundred samples), GANs may not have enough training data to learn the distribution effectively, making SMOTE-family methods a more practical choice.
Algorithm-level approaches modify the learning algorithm itself so that it pays more attention to the minority class, without altering the training data.
Most modern classifiers, including logistic regression, support vector machines, random forests, and neural networks, support a class_weight parameter that assigns higher importance to minority-class samples during training. When class weights are set inversely proportional to class frequencies, the loss contribution of each minority-class sample is amplified, effectively forcing the model to pay equal attention to both classes.
In scikit-learn, setting class_weight='balanced' automatically computes weights as:
weight_j = n_samples / (n_classes * n_samples_j)
where n_samples_j is the number of samples in class j. For a binary dataset with 950 negative and 50 positive samples, the positive class receives a weight of 1000 / (2 * 50) = 10, meaning each positive-class error counts ten times as much as each negative-class error during training.
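scikit-learn exposes this computation directly through compute_class_weight; a quick check against the 950/50 example above:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# 950 negative and 50 positive samples, as in the example above
y = np.array([0] * 950 + [1] * 50)

weights = compute_class_weight(class_weight="balanced",
                               classes=np.array([0, 1]), y=y)
print(weights)  # -> [0.526..., 10.0], i.e. n_samples / (n_classes * n_samples_j)
```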
Cost-sensitive learning generalizes class weighting by assigning different misclassification costs to different types of errors. A cost matrix specifies the penalty for each cell of the confusion matrix. For instance, in medical diagnosis, the cost of a false negative (missing a disease) is typically set much higher than the cost of a false positive (unnecessary follow-up test). The learning algorithm then minimizes expected cost rather than raw error count.
Cost-sensitive approaches can be implemented at three levels.
Direct algorithm modification involves changing the objective function of the learning algorithm to incorporate costs. For example, a cost-sensitive decision tree can use cost-weighted impurity measures instead of standard Gini impurity or information gain.
Meta-learning wrappers convert any existing classifier into a cost-sensitive one by manipulating instance weights or probability thresholds. This approach has the advantage of being model-agnostic.
Threshold adjustment modifies the probability threshold at prediction time rather than during training, shifting it to reflect the asymmetric costs.
In binary classification, most classifiers output a probability score and apply a default threshold of 0.5 to convert it into a class label. For imbalanced data, this default threshold is usually suboptimal. Threshold moving (also called threshold tuning) involves selecting a threshold that optimizes a metric more appropriate for the task, such as the F1 score or the geometric mean of sensitivity and specificity. The optimal threshold can be determined by analyzing the precision-recall curve or the ROC curve on a validation set.
For example, if a model outputs probability 0.3 for a positive case, the default 0.5 threshold would classify it as negative. Lowering the threshold to 0.2 would correctly classify it as positive. The trade-off is that a lower threshold increases recall (catching more true positives) at the expense of precision (producing more false positives).
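One way to implement threshold tuning is to scan the precision-recall curve on a held-out validation set and keep the threshold that maximizes F1. A minimal sketch on synthetic data (the exact threshold found will vary with the data):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

# Imbalanced synthetic data: roughly 5% positives
X, y = make_classification(n_samples=5000, weights=[0.95], random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = clf.predict_proba(X_val)[:, 1]

# Pick the threshold that maximizes F1 on the validation set
precision, recall, thresholds = precision_recall_curve(y_val, proba)
f1 = 2 * precision * recall / (precision + recall + 1e-12)
best = thresholds[np.argmax(f1[:-1])]  # last precision/recall pair has no threshold
print(f"best threshold: {best:.3f}")

y_pred = (proba >= best).astype(int)   # replaces the default 0.5 cut-off
```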
Introduced by Lin et al. (2017) for dense object detection, focal loss has become widely adopted for training deep learning models on imbalanced data. Focal loss modifies the standard cross-entropy loss by adding a modulating factor that down-weights easy (well-classified) examples and focuses training on hard, misclassified instances:
FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t)
The hyperparameter gamma (typically set between 1 and 5) controls how aggressively easy examples are down-weighted. When gamma = 0, focal loss reduces to standard cross-entropy. The alpha_t term provides class-specific weighting. In practice, focal loss effectively concentrates the gradient signal on minority-class samples and hard examples near the decision boundary.
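A minimal numpy version of the formula makes the down-weighting concrete (illustrative only; deep learning frameworks ship optimized, batched implementations):

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss, FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t).

    p: predicted probability of the positive class; y: true label (0 or 1).
    """
    p_t = np.where(y == 1, p, 1 - p)
    alpha_t = np.where(y == 1, alpha, 1 - alpha)
    return -alpha_t * (1 - p_t) ** gamma * np.log(p_t)

# An easy, well-classified positive contributes almost nothing...
easy = focal_loss(np.array([0.95]), np.array([1]))
# ...while a hard, misclassified one dominates the loss
hard = focal_loss(np.array([0.05]), np.array([1]))
print(easy, hard)
```

With gamma = 0 and alpha_t = 1, the expression collapses to plain cross-entropy, matching the reduction described above.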
Ensemble methods combine multiple base learners, each trained on a different resampled version of the data, to produce a more robust classifier. Several ensemble approaches are specifically designed for imbalanced classification.
| Method | Base learner | Resampling strategy | Description |
|---|---|---|---|
| BalancedRandomForest | Decision tree | Random undersampling per bootstrap | Each tree in the forest is trained on a balanced bootstrap sample created by randomly undersampling the majority class to match the minority class |
| EasyEnsemble | AdaBoost | Random undersampling per subset | Creates multiple balanced subsets by undersampling the majority class, trains an AdaBoost classifier on each subset, and aggregates predictions |
| RUSBoost | Decision tree (boosted) | Random undersampling per boosting round | Integrates random undersampling into the AdaBoost boosting process, balancing the data at each iteration |
| BalancedBagging | Any classifier | Random undersampling per bag | Extends standard bagging by undersampling each bootstrap sample before training the base learner |
| SMOTEBagging | Any classifier | SMOTE per bag | Applies SMOTE to each bootstrap sample to generate a balanced training set for each base learner |
| SMOTEBoost | Decision tree (boosted) | SMOTE per boosting round | Integrates SMOTE into the boosting procedure; generates synthetic minority samples at each round before updating weights |
In comparative studies, ensemble methods that incorporate resampling (EasyEnsemble, RUSBoost, SMOTEBagging) consistently outperform standalone resampling or standalone ensemble approaches on imbalanced benchmarks. EasyEnsemble, in particular, has shown strong results across multiple studies, likely because it combines the variance reduction benefits of ensembling with the bias correction benefits of undersampling.
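The core idea behind EasyEnsemble can be sketched without a specialized library: draw several balanced subsets by undersampling the majority class, train one model per subset, and average their predictions. This simplified version uses plain decision trees rather than the AdaBoost learners of the original method (imbalanced-learn provides a full EasyEnsembleClassifier):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def easy_ensemble_fit(X, y, n_subsets=10, seed=0):
    """EasyEnsemble-style training: one balanced subset per base learner."""
    rng = np.random.default_rng(seed)
    minority = np.flatnonzero(y == 1)
    majority = np.flatnonzero(y == 0)
    models = []
    for _ in range(n_subsets):
        # Undersample the majority class to the minority-class size
        sampled = rng.choice(majority, size=len(minority), replace=False)
        idx = np.concatenate([minority, sampled])
        models.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))
    return models

def easy_ensemble_predict(models, X):
    # Average the per-model positive-class probabilities, then threshold
    proba = np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)
    return (proba >= 0.5).astype(int)

X, y = make_classification(n_samples=3000, weights=[0.95], random_state=42)
models = easy_ensemble_fit(X, y)
y_pred = easy_ensemble_predict(models, X)
```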
When class imbalance is extreme (IR greater than 1000:1, or when labeled minority-class samples are very scarce), framing the problem as anomaly detection rather than binary classification can be more effective. Anomaly detection methods learn a model of "normal" behavior from the majority class and flag deviations as anomalies.
One-Class SVM learns a tight boundary around normal data in feature space and classifies points outside this boundary as anomalies. It works well when the normal class is compact in feature space but can be computationally intensive for large datasets, especially with non-linear kernels.
Isolation Forest builds an ensemble of random binary trees that recursively partition the feature space. Anomalies, being rare and different from normal patterns, tend to be isolated by fewer partitions and thus appear in shorter paths within the trees. Isolation Forest handles large, high-dimensional datasets efficiently and is relatively insensitive to the contamination rate.
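A minimal Isolation Forest example with scikit-learn, using synthetic "normal" traffic plus a handful of planted outliers (the data and contamination rate are invented for illustration):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
# "Normal" behavior clustered near the origin, plus a few far-away anomalies
normal = rng.normal(loc=0.0, scale=1.0, size=(990, 2))
anomalies = rng.normal(loc=8.0, scale=0.5, size=(10, 2))
X = np.vstack([normal, anomalies])

iso = IsolationForest(contamination=0.01, random_state=42).fit(X)
labels = iso.predict(X)  # +1 for inliers, -1 for anomalies

print((labels[-10:] == -1).sum(), "of 10 planted anomalies flagged")
```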
Autoencoders trained on majority-class data learn to reconstruct normal patterns. At inference time, anomalies produce high reconstruction error because the model has never seen similar patterns during training. This approach is particularly useful for complex, high-dimensional data like images and time series.
The anomaly detection framing is most appropriate when very few labeled minority-class examples are available or when the minority class is too heterogeneous to model directly.
Imbalance in multi-class classification is more complex than in binary settings because the imbalance can exist between any pair of classes. A dataset might have three classes with distributions of 90%, 8%, and 2%, creating multiple simultaneous imbalance relationships.
One common approach is to decompose the multi-class problem into multiple binary problems. One-vs-Rest (OvR) creates one binary classifier per class, where each classifier distinguishes one class from all others. However, OvR inherently creates imbalanced binary problems: the "rest" group is almost always larger than the single target class. One-vs-One (OvO) creates a binary classifier for each pair of classes. This naturally produces more balanced binary subproblems but requires training O(k^2) classifiers for k classes.
Applying SMOTE to multi-class problems requires deciding which classes to oversample and by how much. Common strategies include oversampling all minority classes to match the majority class, oversampling each class to match the median class size, or using class-specific oversampling ratios based on the degree of imbalance each class faces.
Cost-sensitive learning extends naturally to multi-class settings by defining a k-by-k cost matrix where each entry specifies the cost of predicting class i when the true class is j. In practice, designing an appropriate multi-class cost matrix requires domain expertise, since the relative costs of confusing different class pairs may vary significantly.
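As an illustration, minimum-expected-cost prediction with a hypothetical 3-class cost matrix (the matrix values below are invented for the example):

```python
import numpy as np

# Cost matrix C[i, j]: cost of predicting class i when the true class is j.
# Here, missing the rare class 2 is penalized most heavily.
C = np.array([
    [0, 1, 10],   # predict 0
    [1, 0, 10],   # predict 1
    [2, 2,  0],   # predict 2
])

def min_cost_predict(proba, C):
    """Pick the class minimizing expected misclassification cost."""
    # expected_cost[i] = sum_j C[i, j] * P(true class = j)
    return np.argmin(C @ proba.T, axis=0)

# A sample with only 15% probability of the rare class 2...
proba = np.array([[0.60, 0.25, 0.15]])
print(min_cost_predict(proba, C))  # [2] -- argmax over proba would say 0
```

Expected costs here are 1.75, 2.1, and 1.7 for predicting classes 0, 1, and 2 respectively, so the cost-sensitive decision flips to the rare class even though it is not the most probable one.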
Deep learning models face the same challenges as traditional classifiers when trained on imbalanced data, but their scale and flexibility enable additional mitigation strategies.
The most common approach is to replace standard cross-entropy loss with a loss function that accounts for class imbalance. Focal loss (described above) is the most widely used option; class-weighted cross-entropy, which scales each sample's loss contribution by its class weight, is a simpler alternative.
Because deep learning models train in mini-batches, severe imbalance can result in batches that contain no minority-class samples at all. Strategies to address this include class-balanced sampling, which constructs each mini-batch so that every class is represented, and weighted random samplers that select minority-class examples with higher probability.
For image and text data, data augmentation can serve as a form of oversampling that generates genuinely new minority-class examples rather than simple interpolations. Techniques such as random cropping, rotation, and color jittering (for images), as well as synonym replacement and back-translation (for text), increase the diversity of minority-class training data. When combined with focal loss or class-balanced sampling, augmentation-based approaches can substantially improve minority-class recall without sacrificing overall performance.
Choosing the right evaluation metric is as important as choosing the right resampling strategy. Standard accuracy is unreliable for imbalanced data. The following metrics provide a more faithful picture of model performance.
| Metric | Formula or definition | Why it helps with imbalanced data |
|---|---|---|
| Precision | TP / (TP + FP) | Measures the fraction of predicted positives that are truly positive; high precision means few false alarms |
| Recall (sensitivity) | TP / (TP + FN) | Measures the fraction of actual positives that are correctly detected; high recall means few missed cases |
| F1 score | 2 * (Precision * Recall) / (Precision + Recall) | Harmonic mean of precision and recall; balances both concerns in a single number |
| F-beta score | (1 + beta^2) * (Precision * Recall) / (beta^2 * Precision + Recall) | Generalization of F1 that allows weighting recall more (beta > 1) or precision more (beta < 1) |
| PR-AUC | Area under the Precision-Recall curve | Focuses on positive-class performance; more informative than ROC-AUC when the positive class is rare |
| ROC-AUC | Area under the ROC curve | Measures trade-off between true positive rate and false positive rate across all thresholds |
| Matthews Correlation Coefficient (MCC) | (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)) | Uses all four cells of the confusion matrix; returns a value between -1 and +1 that is informative even with severe imbalance |
| Balanced accuracy | (Sensitivity + Specificity) / 2 | Averages per-class accuracy; corrects for majority-class dominance |
| Cohen's Kappa | Agreement beyond chance | Compares observed accuracy with expected accuracy under random prediction; penalizes models that merely predict the majority class |
| G-Mean | sqrt(Sensitivity * Specificity) | Geometric mean of per-class accuracies; penalizes models that sacrifice one class for another |
A long-standing debate in the literature concerns whether ROC-AUC or PR-AUC is more appropriate for imbalanced settings. ROC-AUC plots the true positive rate against the false positive rate and tends to present an optimistic view when the negative class is very large, because a small false positive rate still corresponds to a large absolute number of false positives. PR-AUC plots precision against recall and is more sensitive to errors involving the positive (minority) class.
Saito and Rehmsmeier (2015) demonstrated that the precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. Their analysis showed that a method performing well on the ROC curve may still perform poorly on the PR curve if it generates many false positives, which are obscured by the large number of true negatives in the ROC analysis.
As a general guideline, PR-AUC is preferred when the primary goal is to accurately identify minority-class instances (for example, fraud detection, rare disease diagnosis), while ROC-AUC is appropriate when the costs of false positives and false negatives are roughly symmetric. In practice, reporting both curves alongside the MCC provides the most complete picture.
The MCC deserves special attention for imbalanced datasets. Chicco and Jurman (2020) demonstrated that MCC is more informative than both F1 score and accuracy for binary classification evaluation. Unlike F1, MCC accounts for true negatives and is symmetric with respect to both classes. An MCC of +1 indicates perfect prediction, 0 indicates performance no better than random, and -1 indicates complete disagreement. Because MCC uses all four quadrants of the confusion matrix, it is harder to "game" by simply predicting the majority class.
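The contrast with accuracy is easy to demonstrate using a majority-class-only predictor, as in the accuracy paradox discussed earlier:

```python
import numpy as np
from sklearn.metrics import accuracy_score, matthews_corrcoef

# 990 negatives, 10 positives; the model always predicts the majority class
y_true = np.array([0] * 990 + [1] * 10)
y_pred = np.zeros(1000, dtype=int)

print(accuracy_score(y_true, y_pred))     # 0.99 -- looks strong
print(matthews_corrcoef(y_true, y_pred))  # 0.0  -- no better than chance
```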
imbalanced-learn is an open-source Python library specifically designed to handle class-imbalanced datasets. It is part of the scikit-learn-contrib ecosystem and provides a consistent API that integrates seamlessly with scikit-learn pipelines.
The library organizes its methods into four categories: over-sampling, under-sampling, combinations of over- and under-sampling, and ensemble methods.
A basic usage example with SMOTE:
```python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split, cross_val_score

# Example imbalanced dataset (~5% positives)
X, y = make_classification(n_samples=5000, weights=[0.95], random_state=42)

# Split data (stratified to preserve class distribution)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Create a pipeline with SMOTE and a classifier
pipeline = Pipeline([
    ('smote', SMOTE(random_state=42)),
    ('classifier', RandomForestClassifier(random_state=42))
])

# Cross-validate on training data
scores = cross_val_score(pipeline, X_train, y_train, cv=5, scoring='f1')
print(f"Cross-validated F1: {scores.mean():.3f} +/- {scores.std():.3f}")

# Fit and evaluate on test set
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
print(classification_report(y_test, y_pred))
```
An important best practice is to apply resampling only to the training set, never to the validation or test set. The imblearn.pipeline.Pipeline class handles this automatically by applying the resampling step only during fit, not during predict or score. Resampling the test set would distort the evaluation metrics and give a misleading picture of real-world performance.
The following table summarizes the main categories of techniques for handling imbalanced data, along with their typical use cases and trade-offs.
| Approach | When to use | Advantages | Disadvantages |
|---|---|---|---|
| Random oversampling | Small datasets with moderate imbalance | Simple; no hyperparameters | Risk of overfitting from duplicated samples |
| SMOTE and variants | Moderate imbalance with continuous features | Generates diverse synthetic samples | May create noise in overlapping regions; struggles with high-dimensional data |
| GAN-based oversampling | Complex distributions; sufficient minority samples for GAN training | Captures non-linear distributions | Expensive to train; risk of mode collapse |
| Random undersampling | Very large datasets where computation is a concern | Reduces training time | Discards potentially useful majority-class information |
| Informed undersampling (Tomek, NearMiss) | Moderate to large datasets with noisy boundaries | Cleans decision boundary | May not reduce imbalance enough on its own |
| Class weights | Any classifier that supports weighted loss | No data modification needed; easy to implement | May not be sufficient for extreme imbalance |
| Cost-sensitive learning | Problems with well-defined misclassification costs | Directly optimizes for business objectives | Requires domain knowledge to set cost matrix |
| Threshold moving | Any probabilistic classifier | Simple post-hoc adjustment; no retraining needed | Only adjusts the decision point, not the learned representation |
| Focal loss | Deep learning models on imbalanced data | Automatically down-weights easy examples | Requires tuning gamma and alpha hyperparameters |
| Ensemble methods | General-purpose; when single models underperform | Combines benefits of resampling and model averaging | Higher computational cost; more complex to deploy |
| Anomaly detection | Extreme imbalance (>1000:1) or very few minority labels | Does not require balanced training data | Cannot leverage minority-class labels effectively |
Always apply resampling inside an imbalanced-learn pipeline (imblearn.pipeline.Pipeline) to prevent data leakage during cross-validation.