A class-imbalanced dataset is a dataset in which the distribution of examples across the target classes is significantly unequal. In a typical imbalanced scenario, one class (the majority class) contains far more samples than the remaining class or classes (the minority class or classes). Class imbalance is one of the most pervasive challenges in applied machine learning, because most standard classification algorithms assume roughly equal class representation during training.
Imbalanced datasets appear across nearly every domain where predictive modeling is used, from healthcare and finance to manufacturing and cybersecurity. The core difficulty is that models trained on imbalanced data tend to develop a strong bias toward the majority class, resulting in poor detection of the minority class, which is often the class of greatest practical interest.
Class imbalance is the norm rather than the exception in many real-world applications. The table below summarizes common domains, typical imbalance ratios, and why the minority class matters.
| Domain | Minority Class | Typical Imbalance Ratio | Why Minority Class Matters |
|---|---|---|---|
| Credit card fraud detection | Fraudulent transactions | 0.1% to 0.2% of all transactions | Undetected fraud causes direct financial losses |
| Medical diagnosis (rare diseases) | Patients with the condition | Less than 1% of patient records | Missed diagnoses can be life-threatening |
| Manufacturing quality control | Defective products | 0.5% to 2% of items produced | Shipping defective products harms brand reputation and safety |
| Network intrusion detection | Malicious packets | Less than 1% of network traffic | A single undetected intrusion can compromise an entire system |
| Cancer screening | Malignant tumors | 1% to 5% of cases | False negatives delay critical treatment |
| Insurance claim fraud | Fraudulent claims | 1% to 5% of all claims | Fraudulent payouts increase premiums for all policyholders |
| Loan default prediction | Defaulting borrowers | 2% to 10% of applicants | Undetected defaults lead to significant financial losses |
| Spam email detection | Spam emails | 10% to 20% of total emails | Spam wastes user time and can carry phishing threats |
These examples illustrate a recurring pattern: the minority class is often the one that carries the highest cost when misclassified, yet it is the hardest for models to learn because so few training examples are available.
The most intuitive evaluation metric for classification is accuracy, defined as the proportion of correct predictions out of all predictions. However, accuracy becomes deeply misleading when classes are imbalanced. Consider a fraud detection dataset where only 0.1% of transactions are fraudulent. A naive model that predicts every transaction as legitimate achieves 99.9% accuracy while catching zero fraud. This phenomenon is known as the accuracy paradox: a model can report high accuracy while being completely useless for the task it was designed to perform.
Because accuracy weights per-class performance proportionally to class size, it largely disregards how well the model performs on the minority class. In domains where the minority class is the class of interest, accuracy can reveal more about the distribution of classes than about actual model quality.
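The toy sketch below illustrates the accuracy paradox using scikit-learn's DummyClassifier on a synthetic dataset; the 0.1% positive rate mirrors the fraud example above, and the generated data itself is an illustrative assumption rather than real transactions.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic data with roughly 0.1% positives (flip_y=0 keeps labels noise-free)
X, y = make_classification(n_samples=100_000, weights=[0.999, 0.001],
                           flip_y=0, random_state=0)

# A "model" that always predicts the majority class
dummy = DummyClassifier(strategy="most_frequent").fit(X, y)

print(accuracy_score(y, dummy.predict(X)))  # ~0.999 -- looks excellent
print(recall_score(y, dummy.predict(X)))    # 0.0   -- catches no positives at all
```

Despite near-perfect accuracy, the recall of 0.0 shows that this model never identifies a single minority-class instance.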
Most classification algorithms, including logistic regression, decision trees, and neural networks, are designed to minimize overall error during training. When one class vastly outnumbers the other, the loss landscape is dominated by majority-class examples. The model learns to assign higher prior probability to the majority class, and its decision boundary shifts away from the minority class.
In practical terms, this means the model becomes very good at predicting the common outcome (e.g., "not fraud") but very poor at detecting the rare outcome (e.g., "fraud"). Since the rare outcome is typically the one that matters most, this bias defeats the purpose of building the model in the first place.
With very few minority-class samples, the model may not encounter enough diverse examples to learn the underlying patterns that distinguish the minority class from the majority class. This leads to poor generalization: the model overfits to the few minority samples it has seen and fails to recognize new minority-class instances at inference time.
Data-level approaches modify the training dataset to reduce the degree of imbalance before the model is trained. These techniques are algorithm-agnostic and can be applied as a preprocessing step.
Oversampling increases the number of minority-class examples in the training set. Several strategies exist, ranging from simple duplication to sophisticated synthetic data generation.
| Technique | How It Works | Strengths | Limitations |
|---|---|---|---|
| Random oversampling | Duplicates randomly selected minority-class samples | Simple to implement; no new data fabricated | Can cause overfitting by repeating identical samples |
| SMOTE | Creates synthetic samples by interpolating between a minority-class sample and its k-nearest minority-class neighbors | Generates novel samples; reduces overfitting risk compared to random oversampling | Can create noisy samples in overlapping class regions |
| Borderline-SMOTE | Applies SMOTE only to minority-class samples near the decision boundary | Focuses synthetic generation where it matters most | Requires careful identification of borderline samples |
| ADASYN | Generates more synthetic samples for harder-to-learn minority instances (those with more majority-class neighbors) | Adapts generation density to local difficulty | May amplify noise if hard-to-learn samples are actually outliers |
| SMOTE-ENN | Combines SMOTE oversampling with Edited Nearest Neighbors cleaning | Removes noisy synthetic samples after generation | More computationally expensive than SMOTE alone |
| K-Means SMOTE | Clusters the feature space with K-Means, then applies SMOTE within sparse minority clusters | Avoids generating synthetic samples in dense or noisy areas | Adds clustering overhead; sensitive to K selection |
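As a rough sketch of how the first two techniques in the table are applied in practice, the snippet below uses imbalanced-learn's RandomOverSampler and SMOTE; the synthetic dataset and its 95/5 class split are illustrative assumptions.

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler, SMOTE

# Toy dataset with roughly a 95/5 class split
X, y = make_classification(n_samples=5_000, weights=[0.95, 0.05], random_state=0)
print(Counter(y))                                   # roughly 95% class 0, 5% class 1

# Random oversampling: duplicate minority samples until the classes are balanced
X_ro, y_ro = RandomOverSampler(random_state=0).fit_resample(X, y)
print(Counter(y_ro))                                # both classes now the same size

# SMOTE: interpolate between minority samples and their k nearest minority neighbors
X_sm, y_sm = SMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)
print(Counter(y_sm))                                # both classes now the same size
```

Both samplers expose the same fit_resample interface, so swapping one strategy for another is a one-line change.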
Downsampling (also called undersampling) reduces the number of majority-class examples. This approach is particularly useful when the dataset is very large and the computational cost of training on all majority-class samples is prohibitive.
| Technique | How It Works | Strengths | Limitations |
|---|---|---|---|
| Random undersampling | Removes randomly selected majority-class samples | Simple and fast | May discard informative majority-class examples |
| Tomek Links | Identifies pairs of nearest-neighbor samples from different classes (Tomek links) and removes the majority-class member | Cleans the decision boundary region | Only removes borderline samples; may not reduce imbalance substantially |
| NearMiss-1 | Keeps majority-class samples whose average distance to the closest minority-class samples is smallest | Preserves majority samples near the boundary | Can be sensitive to noise and outliers |
| NearMiss-2 | Keeps majority-class samples whose average distance to the farthest minority-class samples is smallest | Retains samples that are globally close to the minority class | May remove important majority-class structure |
| NearMiss-3 | For each minority sample, keeps its M nearest majority-class neighbors, then selects majority samples with the largest average distance to their N nearest minority neighbors | Two-step process provides finer control | Computationally expensive; parameter-sensitive |
| Condensed Nearest Neighbor (CNN) | Iteratively selects majority-class samples that are misclassified by a 1-NN classifier trained on the current subset | Produces a compact, representative subset | Result depends on sample ordering |
| One-Sided Selection (OSS) | Combines Tomek Links removal with CNN to remove both borderline noise and redundant majority-class samples | More thorough than either technique alone | More complex to tune |
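A comparable sketch for three of the undersampling techniques above, again using imbalanced-learn on an assumed synthetic dataset:

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler, TomekLinks, NearMiss

X, y = make_classification(n_samples=5_000, weights=[0.95, 0.05], random_state=0)

# Random undersampling: drop majority samples until the classes are balanced
X_ru, y_ru = RandomUnderSampler(random_state=0).fit_resample(X, y)
print(Counter(y_ru))

# Tomek links: removes only the majority member of cross-class nearest-neighbor pairs,
# so the imbalance is reduced slightly rather than eliminated
X_tl, y_tl = TomekLinks().fit_resample(X, y)
print(Counter(y_tl))

# NearMiss-1: keep the majority samples closest (on average) to nearby minority samples
X_nm, y_nm = NearMiss(version=1).fit_resample(X, y)
print(Counter(y_nm))
```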
Some methods combine oversampling and undersampling in a single pipeline. SMOTE-Tomek first applies SMOTE to generate synthetic minority samples, then removes Tomek links from the augmented dataset to clean up noisy boundary regions. Similarly, SMOTE-ENN applies SMOTE followed by Edited Nearest Neighbors, which removes any sample whose class label differs from the majority of its k nearest neighbors. These combination approaches often outperform either oversampling or undersampling used in isolation.
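Both combination pipelines are available directly in imbalanced-learn; a minimal sketch on an assumed toy dataset:

```python
from sklearn.datasets import make_classification
from imblearn.combine import SMOTETomek, SMOTEENN

X, y = make_classification(n_samples=5_000, weights=[0.95, 0.05], random_state=0)

# SMOTE oversampling followed by removal of Tomek links
X_st, y_st = SMOTETomek(random_state=0).fit_resample(X, y)

# SMOTE oversampling followed by Edited Nearest Neighbors cleaning
X_se, y_se = SMOTEENN(random_state=0).fit_resample(X, y)
```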
Algorithm-level approaches modify the learning algorithm itself so that it pays more attention to the minority class, without altering the training data.
Most modern classifiers (including logistic regression, support vector machines, random forests, and neural networks) support a class_weight parameter that assigns higher importance to minority-class samples during training. When class weights are set inversely proportional to class frequencies, the loss contribution of each minority-class sample is amplified, effectively forcing the model to pay equal attention to both classes.
In scikit-learn, setting class_weight='balanced' automatically computes weights as:
weight_j = n_samples / (n_classes * n_samples_j)
where n_samples_j is the number of samples in class j.
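For illustration, the sketch below sets class_weight='balanced' on a logistic regression and reproduces the same weights manually with scikit-learn's compute_class_weight utility; the 990/10 class counts are an assumed example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# The classifier computes balanced weights internally when it is fitted
clf = LogisticRegression(class_weight="balanced", max_iter=1000)

# Manual equivalent of weight_j = n_samples / (n_classes * n_samples_j)
y = np.array([0] * 990 + [1] * 10)
weights = compute_class_weight(class_weight="balanced", classes=np.unique(y), y=y)
print(dict(zip(np.unique(y), weights)))   # {0: ~0.505, 1: 50.0}
```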
Cost-sensitive learning generalizes class weighting by assigning different misclassification costs to different types of errors. A cost matrix specifies the penalty for each cell of the confusion matrix. For instance, in medical diagnosis, the cost of a false negative (missing a disease) is typically set much higher than the cost of a false positive (unnecessary follow-up test). The learning algorithm then minimizes expected cost rather than raw error count.
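One simple way to approximate cost-sensitive learning with standard scikit-learn estimators is to pass per-sample weights derived from an assumed cost matrix; the sketch below illustrates that idea rather than a full cost-sensitive algorithm, and the 10:1 cost ratio is hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=5_000, weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Assumed costs: a false negative (missed positive) is 10x worse than a false positive
cost_fn, cost_fp = 10.0, 1.0

# Weight each training sample by the cost of misclassifying its true class
sample_weight = np.where(y_train == 1, cost_fn, cost_fp)

clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train, sample_weight=sample_weight)
```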
In binary classification, most classifiers output a probability score and apply a default threshold of 0.5 to convert it into a class label. For imbalanced data, this default threshold is usually suboptimal. Threshold moving (also called threshold tuning) involves selecting a threshold that optimizes a metric more appropriate for the task, such as the F1 score or the geometric mean of sensitivity and specificity. The optimal threshold can be determined by analyzing the precision-recall curve or the ROC curve on a validation set.
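The sketch below tunes the threshold on a held-out validation split by scanning the precision-recall curve for the value that maximizes F1; the dataset and the logistic regression model are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_recall_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5_000, weights=[0.95, 0.05], random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, stratify=y, random_state=0)

probs = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_val)[:, 1]

# precision_recall_curve returns one more precision/recall pair than thresholds
precision, recall, thresholds = precision_recall_curve(y_val, probs)
f1 = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-12)
best_t = thresholds[np.argmax(f1)]

print(f"F1 at default 0.5 threshold: {f1_score(y_val, (probs >= 0.5).astype(int)):.3f}")
print(f"F1 at tuned threshold {best_t:.2f}: {f1_score(y_val, (probs >= best_t).astype(int)):.3f}")
```

In a real project the threshold is chosen on a validation set and then applied unchanged to the test set.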
Introduced by Lin et al. (2017) for dense object detection, focal loss has become widely adopted for training deep learning models on imbalanced data. Focal loss modifies the standard cross-entropy loss by adding a modulating factor that down-weights easy (well-classified) examples and focuses training on hard, misclassified instances:
FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t)
The hyperparameter gamma controls how aggressively easy examples are down-weighted; the original paper explored values from 0 to 5 and found gamma = 2 to work well in practice. When gamma = 0, focal loss reduces to standard cross-entropy. The alpha_t term provides class-specific weighting. In practice, focal loss effectively concentrates the gradient signal on minority-class samples and hard examples near the decision boundary.
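A minimal NumPy sketch of the binary form of this loss, following the formula above (alpha = 0.25 and gamma = 2 are the values reported as working best in the original paper; in practice the loss would be implemented in your deep learning framework of choice):

```python
import numpy as np

def binary_focal_loss(p, y, alpha=0.25, gamma=2.0):
    """FL(p_t) = -alpha_t * (1 - p_t)**gamma * log(p_t) for binary labels y in {0, 1}."""
    p = np.clip(p, 1e-7, 1 - 1e-7)               # avoid log(0)
    p_t = np.where(y == 1, p, 1 - p)             # probability assigned to the true class
    alpha_t = np.where(y == 1, alpha, 1 - alpha)
    return -alpha_t * (1 - p_t) ** gamma * np.log(p_t)

# Easy example (confident and correct) vs. hard example (confident but wrong)
print(binary_focal_loss(np.array([0.95]), np.array([1])))  # tiny loss
print(binary_focal_loss(np.array([0.05]), np.array([1])))  # much larger loss
```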
Ensemble methods combine multiple base learners, each trained on a different resampled version of the data, to produce a more robust classifier. Several ensemble approaches are specifically designed for imbalanced classification.
| Method | Base Learner | Resampling Strategy | Description |
|---|---|---|---|
| BalancedRandomForest | Decision tree | Random undersampling per bootstrap | Each tree in the forest is trained on a balanced bootstrap sample created by randomly undersampling the majority class to match the minority class |
| EasyEnsemble | AdaBoost | Random undersampling per subset | Creates multiple balanced subsets by undersampling the majority class, trains an AdaBoost classifier on each subset, and aggregates predictions |
| RUSBoost | Decision tree (boosted) | Random undersampling per boosting round | Integrates random undersampling into the AdaBoost boosting process, balancing the data at each iteration |
| BalancedBagging | Any classifier | Random undersampling per bag | Extends standard bagging by undersampling each bootstrap sample before training the base learner |
| SMOTEBagging | Any classifier | SMOTE per bag | Applies SMOTE to each bootstrap sample to generate a balanced training set for each base learner |
In comparative studies, ensemble methods that incorporate resampling (EasyEnsemble, RUSBoost, SMOTEBagging) consistently outperform standalone resampling or standalone ensemble approaches on imbalanced benchmarks.
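Several of these ensembles are implemented in imbalanced-learn's imblearn.ensemble module; a brief sketch on an assumed synthetic dataset:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from imblearn.ensemble import BalancedRandomForestClassifier, EasyEnsembleClassifier

X, y = make_classification(n_samples=5_000, weights=[0.95, 0.05], random_state=0)

# Each tree sees a balanced bootstrap sample (majority class undersampled)
brf = BalancedRandomForestClassifier(n_estimators=100, random_state=0)
print("Balanced RF F1:  ", cross_val_score(brf, X, y, scoring="f1", cv=5).mean())

# Multiple AdaBoost learners, each trained on a balanced undersampled subset
ee = EasyEnsembleClassifier(n_estimators=10, random_state=0)
print("EasyEnsemble F1: ", cross_val_score(ee, X, y, scoring="f1", cv=5).mean())
```

RUSBoostClassifier and BalancedBaggingClassifier follow the same estimator interface.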
The Synthetic Minority Over-sampling Technique (SMOTE) was introduced by Chawla et al. in 2002 and remains the most widely cited method for handling class imbalance. SMOTE addresses a fundamental limitation of random oversampling: rather than duplicating existing minority-class samples (which risks overfitting), it generates entirely new synthetic samples in feature space.
For each selected minority-class sample, SMOTE picks one of its k nearest minority-class neighbors and places a new synthetic point at a randomly chosen position along the line segment between the two. The result is that new synthetic samples lie along the line segments connecting existing minority-class samples in feature space. This produces more varied training data than simple duplication and helps the classifier generalize better.
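A minimal NumPy sketch of this interpolation step (not the imbalanced-learn implementation, just an illustration of the core idea; the helper name smote_interpolate is hypothetical):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_interpolate(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic points along segments between minority samples X_min."""
    rng = np.random.default_rng(seed)
    # k + 1 neighbors because each point is returned as its own nearest neighbor
    _, idx = NearestNeighbors(n_neighbors=k + 1).fit(X_min).kneighbors(X_min)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))        # pick a minority sample at random
        j = rng.choice(idx[i, 1:])          # pick one of its k minority neighbors
        lam = rng.random()                  # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.asarray(synthetic)
```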
Since its introduction, numerous extensions of SMOTE have been proposed to address its limitations.
Borderline-SMOTE restricts synthetic sample generation to minority-class instances that lie near the decision boundary (i.e., those that have a roughly equal number of majority-class and minority-class neighbors). This focuses the augmentation effort where it is most needed and avoids generating synthetic samples deep within the minority-class cluster where the classifier already performs well.
ADASYN (Adaptive Synthetic Sampling) goes a step further by assigning a density distribution to minority-class samples based on their difficulty of learning. Samples surrounded by more majority-class neighbors are considered harder to learn and receive more synthetic neighbors. This shifts the decision boundary toward the difficult examples.
K-Means SMOTE first clusters the data using K-Means, identifies clusters that are dominated by minority-class samples, and applies SMOTE within those clusters. This avoids generating synthetic samples in noisy or heavily overlapping regions.
SMOTE-Tomek and SMOTE-ENN are hybrid approaches that apply SMOTE for oversampling and then use Tomek Links or Edited Nearest Neighbors (respectively) to clean up noisy or ambiguous samples created during the synthetic generation process.
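All of these variants share SMOTE's fit_resample interface in imbalanced-learn, so they can be swapped in with minimal code changes. A brief sketch on an assumed toy dataset (K-Means SMOTE is omitted here because its cluster-related parameters may need tuning on small datasets):

```python
from sklearn.datasets import make_classification
from imblearn.over_sampling import BorderlineSMOTE, ADASYN

X, y = make_classification(n_samples=5_000, weights=[0.95, 0.05], random_state=0)

# Synthetic samples generated only near the class boundary
X_bl, y_bl = BorderlineSMOTE(random_state=0).fit_resample(X, y)

# More synthetic samples for minority points with many majority-class neighbors
X_ad, y_ad = ADASYN(random_state=0).fit_resample(X, y)
```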
Choosing the right evaluation metric is as important as choosing the right resampling strategy. Standard accuracy is unreliable for imbalanced data. The following metrics provide a more faithful picture of model performance.
| Metric | Formula / Definition | Why It Helps with Imbalanced Data |
|---|---|---|
| Precision | TP / (TP + FP) | Measures the fraction of predicted positives that are truly positive; high precision means few false alarms |
| Recall (Sensitivity) | TP / (TP + FN) | Measures the fraction of actual positives that are correctly detected; high recall means few missed cases |
| F1 Score | 2 * (Precision * Recall) / (Precision + Recall) | Harmonic mean of precision and recall; balances both concerns in a single number |
| F-beta Score | (1 + beta^2) * (Precision * Recall) / (beta^2 * Precision + Recall) | Generalization of F1 that allows weighting recall more (beta > 1) or precision more (beta < 1) |
| PR-AUC | Area under the Precision-Recall curve | Focuses on positive-class performance; more informative than ROC-AUC when the positive class is rare |
| ROC-AUC | Area under the Receiver Operating Characteristic curve | Measures trade-off between true positive rate and false positive rate; less sensitive to class distribution |
| Matthews Correlation Coefficient (MCC) | (TP * TN - FP * FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN)) | Uses all four cells of the confusion matrix; returns a value between -1 and +1 that is informative even with severe imbalance |
| Balanced Accuracy | (Sensitivity + Specificity) / 2 | Averages per-class accuracy; corrects for majority-class dominance |
| Cohen's Kappa | Agreement beyond chance | Compares observed accuracy with expected accuracy under random prediction; penalizes models that merely predict the majority class |
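The sketch below computes most of these metrics with scikit-learn on an assumed synthetic dataset and a simple logistic regression; PR-AUC is reported via average_precision_score, a common summary of the precision-recall curve.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import (f1_score, fbeta_score, matthews_corrcoef,
                             balanced_accuracy_score, cohen_kappa_score,
                             average_precision_score, roc_auc_score)

X, y = make_classification(n_samples=5_000, weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = clf.predict(X_test)
y_score = clf.predict_proba(X_test)[:, 1]   # probabilities for threshold-free metrics

print("F1:               ", f1_score(y_test, y_pred))
print("F2 (recall-heavy):", fbeta_score(y_test, y_pred, beta=2))
print("MCC:              ", matthews_corrcoef(y_test, y_pred))
print("Balanced accuracy:", balanced_accuracy_score(y_test, y_pred))
print("Cohen's kappa:    ", cohen_kappa_score(y_test, y_pred))
print("PR-AUC:           ", average_precision_score(y_test, y_score))
print("ROC-AUC:          ", roc_auc_score(y_test, y_score))
```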
A long-standing debate in the literature concerns whether ROC-AUC or PR-AUC is more appropriate for imbalanced settings. ROC-AUC plots the true positive rate against the false positive rate and tends to present an optimistic view when the negative class is very large, because a small false positive rate still corresponds to a large absolute number of false positives. PR-AUC plots precision against recall and is more sensitive to errors involving the positive (minority) class.
As a general guideline, PR-AUC is preferred when the primary goal is to accurately identify minority-class instances (e.g., fraud detection, rare disease diagnosis), while ROC-AUC is appropriate when the costs of false positives and false negatives are roughly symmetric. In practice, reporting both curves alongside the MCC provides the most complete picture.
imbalanced-learn is an open-source Python library specifically designed to handle class-imbalanced datasets. It is part of the scikit-learn-contrib ecosystem and provides a consistent API that integrates seamlessly with scikit-learn pipelines.
The library organizes its methods into four categories: over-sampling, under-sampling, combinations of over- and under-sampling, and ensemble methods, mirroring the approaches described earlier in this article.
A basic usage example with SMOTE:
```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

# Create an imbalanced toy dataset (in practice, load your own X and y)
X, y = make_classification(n_samples=10_000, weights=[0.95, 0.05], random_state=42)

# Split data; stratify preserves the class ratio in both splits
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# Apply SMOTE to the training data only
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)

# Train on the resampled data, evaluate on the untouched test set
clf = RandomForestClassifier(random_state=42)
clf.fit(X_resampled, y_resampled)
print(classification_report(y_test, clf.predict(X_test)))
```
An important best practice is to apply resampling only to the training set, never to the validation or test set. Resampling the test set would distort the evaluation metrics and give a misleading picture of real-world performance.
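imbalanced-learn's Pipeline makes it straightforward to respect this rule during cross-validation: the sampler is fitted and applied only on each training fold, while validation folds keep their original distribution. A minimal sketch on an assumed toy dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

X, y = make_classification(n_samples=5_000, weights=[0.95, 0.05], random_state=42)

pipe = Pipeline([
    ("smote", SMOTE(random_state=42)),            # applied to training folds only
    ("clf", RandomForestClassifier(random_state=42)),
])

print(cross_val_score(pipe, X, y, scoring="f1", cv=5).mean())
```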
Deep learning models face the same challenges as traditional classifiers when trained on imbalanced data, but their scale and flexibility enable additional mitigation strategies.
The most common approach is to replace standard cross-entropy loss with a loss function that accounts for class imbalance. Focal loss (described above) is the most widely used option. Other alternatives include class-balanced loss, which re-weights the loss by the effective number of samples per class, and dice loss, originally developed for image segmentation tasks with imbalanced foreground and background pixels.
Because deep learning models train in mini-batches, severe imbalance can result in batches that contain no minority-class samples at all. Common strategies to address this include weighting the batch sampler so that minority-class examples appear in most batches, constructing class-balanced (stratified) batches explicitly, and oversampling the minority class at the data-loader level rather than modifying the dataset itself.
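As one possible illustration of the first strategy, the sketch below uses PyTorch's WeightedRandomSampler to draw samples with probability inversely proportional to class frequency; the tensors here are random placeholders standing in for a real dataset.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# Placeholder data: 950 majority-class and 50 minority-class examples
X = torch.randn(1000, 20)
y = torch.cat([torch.zeros(950, dtype=torch.long), torch.ones(50, dtype=torch.long)])

# Weight each sample inversely to its class frequency
class_counts = torch.bincount(y).float()
sample_weights = 1.0 / class_counts[y]

sampler = WeightedRandomSampler(weights=sample_weights,
                                num_samples=len(y),
                                replacement=True)
loader = DataLoader(TensorDataset(X, y), batch_size=64, sampler=sampler)

# Each mini-batch now contains roughly equal numbers of both classes on average
xb, yb = next(iter(loader))
print(yb.float().mean())   # close to 0.5 rather than 0.05
```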
For image and text data, data augmentation can serve as a form of oversampling that generates genuinely new minority-class examples rather than simple interpolations. Techniques such as random cropping, rotation, color jittering (for images), and synonym replacement or back-translation (for text) increase the diversity of minority-class training data. When combined with focal loss or class-balanced sampling, augmentation-based approaches can substantially improve minority-class recall without sacrificing overall performance.
Imagine you have a big bag of marbles. Almost all of them are blue (990 blue marbles), but only a few are red (10 red marbles). Now suppose you are trying to teach a robot to sort marbles by color. Because the robot sees blue marbles almost every time, it learns to just say "blue" for everything. It gets the answer right 99% of the time, but it never finds the red ones.
To fix this, you can do a few things. You could make copies of the red marbles so the robot sees them more often. You could take away some of the blue marbles so the colors are more even. Or you could tell the robot: "Getting a red marble wrong is a much bigger deal than getting a blue marble wrong, so pay extra attention to the red ones." All of these ideas help the robot learn to spot both colors, not just the common one.