# Imbalanced Dataset

> Source: https://aiwiki.ai/wiki/imbalanced_dataset
> Updated: 2026-07-12
> Categories: Data & Datasets, Machine Learning
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

An **imbalanced dataset** is a dataset used in [machine learning](/wiki/machine_learning) where the [classification](/wiki/classification) categories are not approximately equally represented, so that one class (the **majority class**) contains far more samples than one or more other classes (the **minority class**). For example, a credit card fraud detection dataset might contain 99.8% legitimate transactions and only 0.2% fraudulent ones. This skewed distribution causes most standard [classification model](/wiki/classification_model) algorithms to develop a strong bias toward the majority class, because they are designed to minimize overall prediction error and the easiest way to do that is to predict the majority class for every input. The canonical defense is to rebalance the data with [Synthetic Minority Over-sampling Technique (SMOTE)](/wiki/smote), introduced by Chawla et al. in 2002, whose paper has accumulated more than 28,000 citations and remains the most cited method for class imbalance [1]. The original authors define the problem precisely: "A dataset is imbalanced if the classification categories are not approximately equally represented" [1].

The second pillar of handling imbalance is measurement: on a dataset where 99.9% of records belong to one class, a model that blindly predicts the majority class scores 99.9% [accuracy](/wiki/accuracy) while detecting none of the minority cases, an effect known as the **accuracy paradox**. For this reason, practitioners evaluate imbalanced problems with [precision](/wiki/precision), [recall](/wiki/recall), [F1 score](/wiki/f1_score), the area under the precision-recall curve (PR-AUC), and the Matthews Correlation Coefficient (MCC) rather than raw accuracy [8][9].

Imbalanced datasets appear in nearly every domain where predictive modeling is used. In medical diagnosis, rare diseases may account for less than 1% of patient records. In cybersecurity, malicious network packets are vastly outnumbered by normal traffic. In manufacturing, defective products might represent less than 2% of all units. In each of these cases, the minority class is the one that matters most: missed fraud, missed diagnoses, and missed intrusions carry high real-world costs. This makes the class imbalance problem one of the most studied and practically relevant topics in applied machine learning.

## Explain like I'm 5 (ELI5)

Imagine you have a big bag of marbles. Almost all of them are blue (990 blue marbles), but only a few are red (10 red marbles). Now suppose you are trying to teach a robot to sort marbles by color. Because the robot sees blue marbles almost every time, it learns to just say "blue" for everything. It gets the answer right 99% of the time, but it never finds the red ones.

To fix this, you can do a few things. You could make copies of the red marbles so the robot sees them more often. You could take away some of the blue marbles so the colors are more even. Or you could tell the robot: "Getting a red marble wrong is a much bigger deal than getting a blue marble wrong, so pay extra attention to the red ones." All of these ideas help the robot learn to spot both colors, not just the common one.

## Where do imbalanced datasets occur?

Class imbalance is the norm rather than the exception in many applied settings. The table below lists common domains, typical imbalance ratios, and why the minority class carries disproportionate importance.

| Domain | Minority class | Typical imbalance ratio | Why the minority class matters |
|---|---|---|---|
| Credit card fraud detection | Fraudulent transactions | 0.1% to 0.2% of all transactions | Undetected fraud causes direct financial losses |
| Medical diagnosis (rare diseases) | Patients with the condition | Less than 1% of patient records | Missed diagnoses can be life-threatening |
| Manufacturing quality control | Defective products | 0.5% to 2% of items produced | Shipping defective products harms brand reputation and safety |
| Network intrusion detection | Malicious packets | Less than 1% of network traffic | A single undetected intrusion can compromise an entire system |
| Cancer screening | Malignant tumors | 1% to 5% of cases | False negatives delay treatment |
| Insurance claim fraud | Fraudulent claims | 1% to 5% of all claims | Fraudulent payouts increase premiums for all policyholders |
| Loan default prediction | Defaulting borrowers | 2% to 10% of applicants | Undetected defaults lead to significant financial losses |
| Spam email detection | Spam emails | 10% to 20% of total emails | Spam wastes user time and can carry phishing threats |
| Equipment failure prediction | Failure events | Less than 1% of sensor readings | Unexpected failures cause costly downtime |
| Anti-money laundering | Suspicious transactions | Less than 0.1% of transactions | Undetected laundering enables organized crime |

## How is the degree of imbalance measured?

Before selecting a mitigation strategy, it helps to quantify how imbalanced a dataset actually is. Several metrics exist for this purpose.

The simplest measure is the **imbalance ratio (IR)**, defined as the number of majority-class samples divided by the number of minority-class samples. A dataset with 9,500 negative examples and 500 positive examples has an IR of 19:1. Higher IR values indicate more severe imbalance.
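
As a minimal sketch of the imbalance ratio defined above, the following Python snippet counts labels and divides the largest class count by the smallest. The function name `imbalance_ratio` is an illustrative choice, not part of any library.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Return the majority-class count divided by the minority-class count."""
    counts = Counter(labels)
    if len(counts) < 2:
        raise ValueError("need at least two classes to compute an imbalance ratio")
    return max(counts.values()) / min(counts.values())


# The example from the text: 9,500 negatives and 500 positives.
labels = [0] * 9500 + [1] * 500
print(imbalance_ratio(labels))  # 19.0
```

With more than two classes this returns the ratio between the largest and smallest class only, which is one reason the measure is less informative in multi-class settings.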

However, IR has limitations when applied to [multi-class classification](/wiki/multi-class_classification) problems, where a single majority-to-minority ratio cannot describe how the remaining classes are distributed. For multi-class settings, researchers have proposed additional measures. The **imbalance coefficient** normalizes the ratio to a bounded range. The **Bayes Imbalance Impact Index (BI3)**, proposed by Song et al. (2019) [15], is designed to reflect how much the imbalance itself affects classification across the whole dataset, separating that effect from other data complexity factors such as class overlap and noise. The **Likelihood-Ratio Imbalance Degree (LRID)** uses a likelihood-ratio test to measure the extent of imbalance across multiple classes.

Research has shown that imbalance ratio alone does not fully explain classifier performance degradation. Other data characteristics, including class overlap (where majority and minority class distributions share significant regions in feature space), small disjuncts (small clusters of minority-class samples separated from the main minority cluster), and label noise, interact with imbalance to compound the difficulty of learning.

## Why is class imbalance a problem?

### What is the accuracy paradox?

The most intuitive evaluation metric for [classification](/wiki/classification_model) is [accuracy](/wiki/accuracy), defined as the proportion of correct predictions out of all predictions. However, accuracy becomes deeply misleading when classes are imbalanced. Consider a fraud detection dataset where only 0.1% of transactions are fraudulent. A naive model that predicts every transaction as legitimate achieves 99.9% accuracy while catching zero fraud. This phenomenon is known as the **accuracy paradox**: a model can report high accuracy while being completely useless for the task it was designed to perform.

Because accuracy weights per-class performance proportionally to class size, it largely disregards how well the model handles the minority class. In domains where the minority class is the class of interest, accuracy reveals more about the distribution of classes than about actual model quality. Chicco and Jurman (2020) warn that on imbalanced data "accuracy and F1 score can dangerously show overoptimistic inflated results" [9].
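
The accuracy paradox can be reproduced in a few lines. The sketch below scores a hypothetical "always predict the majority class" model on a dataset with 0.1% positives and reports accuracy alongside recall, precision, F1, and MCC. Returning 0 for a metric whose denominator is zero is an assumption made here for convenience; libraries may handle that case differently.

```python
import math


def binary_metrics(y_true, y_pred):
    """Compute common binary metrics from 0/1 labels (1 = minority class)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    accuracy = (tp + tn) / len(y_true)
    # Assumption: undefined ratios (zero denominator) are reported as 0.0.
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "recall": recall,
            "precision": precision, "f1": f1, "mcc": mcc}


# 9,990 legitimate transactions, 10 fraudulent ones, and a model that
# labels everything as legitimate.
y_true = [0] * 9990 + [1] * 10
y_pred = [0] * len(y_true)
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Accuracy comes out at 0.999 while recall, F1, and MCC are all zero, which is the gap between headline accuracy and minority-class performance that the paradox describes.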

### Model bias toward the majority class

Most classification algorithms, including [logistic regression](/wiki/logistic_regression), [decision trees](/wiki/decision_tree), and [neural networks](/wiki/neural_network), are designed to minimize overall error during training. When one class vastly outnumbers the other, the [loss function](/wiki/loss_function) landscape is dominated by majority-class examples. The model learns to assign higher prior probability to the majority class, and its [decision boundary](/wiki/decision_boundary) shifts away from the minority class.

In practical terms, this means the model becomes very good at predicting the common outcome (for example, "not fraud") but very poor at detecting the rare outcome (for example, "fraud"). Since the rare outcome is typically the one that matters most, this bias defeats the purpose of building the model.

### Insufficient minority-class signal

With very few minority-class samples, the model may not encounter enough diverse examples to learn the underlying patterns that distinguish the minority class from the majority class. This leads to poor [generalization](/wiki/generalization): the model [overfits](/wiki/overfitting) to the few minority samples it has seen and fails to recognize new minority-class instances at inference time.

### Interaction with data complexity factors

Imbalance rarely exists in isolation. Real-world datasets often exhibit overlapping class distributions, noisy labels, and small disjuncts. These factors interact with imbalance to amplify the difficulty. A moderately imbalanced dataset (IR of 10:1) with heavy class overlap can be harder to learn than a severely imbalanced dataset (IR of 100:1) with well-separated classes. Understanding these interactions is important for selecting the right mitigation strategy [10].

## How do you fix an imbalanced dataset? Data-level solutions

Data-level approaches modify the training dataset to reduce the degree of imbalance before the model is trained. These techniques are algorithm-agnostic and can be applied as a preprocessing step.

### Oversampling techniques

[Oversampling](/wiki/oversampling) increases the number of minority-class examples in the training set. Several strategies exist, ranging from simple duplication to sophisticated synthetic data generation.

| Technique | How it works | Strengths | Limitations |
|---|---|---|---|
| Random oversampling | Duplicates randomly selected minority-class samples | Simple to implement; no new data fabricated | Can cause [overfitting](/wiki/overfitting) by repeating identical samples |
| SMOTE | Creates synthetic samples by interpolating between a minority-class sample and its k-nearest minority-class neighbors | Generates novel samples; reduces overfitting risk compared to random oversampling | Can create noisy samples in overlapping class regions |
| Borderline-SMOTE | Applies SMOTE only to minority-class samples near the [decision boundary](/wiki/decision_boundary) | Focuses synthetic generation where it matters most | Requires careful identification of borderline samples |
| ADASYN | Generates more synthetic samples for harder-to-learn minority instances (those with more majority-class neighbors) | Adapts generation density to local difficulty | May amplify noise if hard-to-learn samples are actually outliers |
| SVM-SMOTE | Uses [support vector machine](/wiki/support_vector_machine_svm) support vectors to identify the borderline area, then generates synthetic data along lines connecting minority-class support vectors to their nearest neighbors | Leverages the SVM decision boundary for targeted generation | Computationally expensive due to SVM training step |
| SMOTE-NC | Handles datasets with both numerical and categorical features by using a modified distance metric with a median-based penalty for categorical differences | Works with mixed data types, unlike standard SMOTE | Requires at least one continuous feature; slower than standard SMOTE |
| SMOTE-ENN | Combines SMOTE oversampling with Edited Nearest Neighbors cleaning | Removes noisy synthetic samples after generation | More computationally expensive than SMOTE alone |
| K-Means SMOTE | Clusters the feature space with [K-Means](/wiki/k-means), then applies SMOTE within sparse minority clusters | Avoids generating synthetic samples in dense or noisy areas | Adds clustering overhead; sensitive to K selection |

### Undersampling techniques

[Downsampling](/wiki/downsampling) (also called undersampling) reduces the number of majority-class examples. This approach is useful when the dataset is very large and the computational cost of training on all majority-class samples is prohibitive.

| Technique | How it works | Strengths | Limitations |
|---|---|---|---|
| Random undersampling | Removes randomly selected majority-class samples | Simple and fast | May discard informative majority-class examples; increases [variance](/wiki/bias) in the learned model |
| Tomek Links | Identifies pairs of nearest-neighbor samples from different classes (Tomek links) and removes the majority-class member | Cleans the decision boundary region | Only removes borderline samples; may not reduce imbalance substantially |
| NearMiss-1 | Keeps majority-class samples whose average distance to the closest minority-class samples is smallest | Preserves majority samples near the boundary | Can be sensitive to noise and outliers |
| NearMiss-2 | Keeps majority-class samples whose average distance to the farthest minority-class samples is smallest | Retains samples that are globally close to the minority class | May remove important majority-class structure |
| NearMiss-3 | For each minority sample, keeps its M nearest majority-class neighbors, then selects majority samples with the largest average distance to their N nearest minority neighbors | Two-step process provides finer control | Computationally expensive; parameter-sensitive |
| Condensed Nearest Neighbor (CNN) | Iteratively selects majority-class samples that are misclassified by a 1-NN classifier trained on the current subset | Produces a compact, representative subset | Result depends on sample ordering |
| Edited Nearest Neighbors (ENN) | Removes any sample whose class label differs from the majority of its k nearest neighbors | Cleans noisy and borderline samples from both classes | Mild effect on imbalance ratio |
| One-Sided Selection (OSS) | Combines Tomek Links removal with CNN to remove both borderline noise and redundant majority-class samples | More thorough than either technique alone | More complex to tune |
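
To make the Tomek Links row above concrete, here is a brute-force sketch in plain Python. It treats two samples as a Tomek link when they are each other's nearest neighbor and carry different labels. The function name and the toy data are illustrative assumptions, and the quadratic search is only suitable for small datasets.

```python
import math


def tomek_links(X, y):
    """Return index pairs (i, j), i < j, that are mutual nearest neighbors
    with different class labels."""
    def nearest(i):
        # Index of the closest other point under Euclidean distance.
        return min((j for j in range(len(X)) if j != i),
                   key=lambda j: math.dist(X[i], X[j]))

    nn = [nearest(i) for i in range(len(X))]
    return [(i, j) for i, j in enumerate(nn)
            if i < j and nn[j] == i and y[i] != y[j]]


# Toy data: label 0 is the majority class, label 1 the minority class.
X = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.05, 1.0), (3.0, 3.0)]
y = [0, 0, 0, 1, 0]

links = tomek_links(X, y)
# Undersampling step: drop the majority-class member of each link.
to_remove = {i if y[i] == 0 else j for i, j in links}
print(links, to_remove)
```

Only one pair qualifies in this toy set, so a single majority sample is removed, which illustrates why Tomek Links cleans the boundary but barely changes the imbalance ratio.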

### Combination approaches

Some methods combine oversampling and undersampling in a single pipeline. **SMOTE-Tomek** first applies SMOTE to generate synthetic minority samples, then removes Tomek links from the augmented dataset to clean up noisy boundary regions. **SMOTE-ENN** applies SMOTE followed by Edited Nearest Neighbors, which removes any sample whose class label differs from the majority of its k nearest neighbors. These combination approaches often outperform either oversampling or undersampling used in isolation, because they both augment the minority class and clean up the resulting decision boundary.

## How does SMOTE work?

The [Synthetic Minority Over-sampling Technique (SMOTE)](/wiki/smote) was introduced by Chawla, Bowyer, Hall, and Kegelmeyer in 2002 in the *Journal of Artificial Intelligence Research* (volume 16, pages 321-357) and remains the most widely cited method for handling class imbalance [1]. As of 2025 the original paper has more than 28,000 citations (Semantic Scholar), making it one of the most cited machine learning papers of its era [16]. SMOTE addresses a fundamental limitation of random oversampling: rather than duplicating existing minority-class samples (which risks overfitting), it generates entirely new synthetic samples in feature space. As the authors put it, "Our method of over-sampling the minority class involves creating synthetic minority class examples" [1].

### The SMOTE algorithm

1. For each minority-class sample $$x$$, find its $$k$$ nearest minority-class neighbors in feature space ($$k$$ is typically 5).
2. Randomly select one of the $$k$$ neighbors, call it $$x_{\text{nn}}$$.
3. Compute the difference vector: $$\text{diff} = x_{\text{nn}} - x$$.
4. Generate a random number $$\lambda$$ uniformly distributed between 0 and 1.
5. Create the synthetic sample: $$x_{\text{new}} = x + \lambda \cdot \text{diff}$$.
6. Repeat until the desired level of oversampling is achieved.

The result is that new synthetic samples lie along the line segments connecting existing minority-class samples in feature space. This produces more varied training data than simple duplication and helps the classifier generalize better. The original authors evaluated SMOTE using C4.5, Ripper, and [Naive Bayes](/wiki/naive_bayes) classifiers, measuring performance with the area under the [ROC curve](/wiki/roc_receiver_operating_characteristic_curve), and reported that combining minority over-sampling with majority under-sampling achieves better performance in ROC space than under-sampling alone [1].
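
The interpolation steps listed above can be sketched in plain Python as follows. This is a simplified illustration rather than the reference implementation: it picks the base sample at random for each synthetic point (the original procedure loops over every minority sample), uses a brute-force neighbor search, and the function name `smote_sketch` is hypothetical.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority-class neighbors."""
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    if k < 1:
        raise ValueError("need at least two minority samples")

    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        x = minority[i]
        # Step 1: k nearest minority neighbors of x (excluding x itself).
        others = [p for j, p in enumerate(minority) if j != i]
        neighbors = sorted(others, key=lambda p: math.dist(x, p))[:k]
        # Step 2: pick one neighbor at random.
        x_nn = rng.choice(neighbors)
        # Steps 3-5: x_new = x + lambda * (x_nn - x), lambda in [0, 1).
        lam = rng.random()
        synthetic.append(tuple(a + lam * (b - a) for a, b in zip(x, x_nn)))
    return synthetic


minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_sketch(minority, n_new=10, k=2)
print(new_points[:3])
```

Because each synthetic point lies on a segment between two existing minority samples, every generated point stays inside the bounding box of the original minority data.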

### SMOTE variants

Since its introduction, numerous extensions of SMOTE have been proposed to address its limitations.

**Borderline-SMOTE** (Han, Wang, and Mao, 2005) [3] restricts synthetic sample generation to minority-class instances that lie near the decision boundary. These are instances for which at least half, but not all, of the nearest neighbors belong to the majority class; instances whose neighbors are all majority-class are treated as noise and skipped. This focuses the augmentation effort where it is most needed and avoids generating synthetic samples deep within the minority-class cluster where the classifier already performs well.

**ADASYN** (He et al., 2008) [2] assigns a density distribution to minority-class samples based on their difficulty of learning. Samples surrounded by more majority-class neighbors are considered harder to learn and receive more synthetic neighbors. This adaptively shifts the decision boundary toward the difficult examples.

**SVM-SMOTE** uses a [support vector machine](/wiki/support_vector_machine_svm) to identify the borderline region. After training an SVM classifier on the original data, the minority-class support vectors approximate the decision boundary. Synthetic samples are generated along lines connecting these support vectors to their nearest minority-class neighbors. This approach can produce better-targeted synthetic samples than Borderline-SMOTE in some settings.

**SMOTE-NC** (Nominal and Continuous) extends SMOTE to datasets containing both numerical and categorical features. For numerical features, it uses the standard SMOTE interpolation. For categorical features, it assigns the most frequent category among the k nearest neighbors. A constant M, computed as the median of the standard deviations of numerical features in the minority class, is used as a penalty term when calculating distances involving categorical variables.

**K-Means SMOTE** first clusters the data using [K-Means](/wiki/k-means), identifies clusters dominated by minority-class samples, and applies SMOTE within those clusters. This avoids generating synthetic samples in noisy or heavily overlapping regions.

**SMOTE-Tomek** and **SMOTE-ENN** are hybrid approaches that apply SMOTE for oversampling and then use Tomek Links or Edited Nearest Neighbors (respectively) to clean up noisy or ambiguous samples created during synthesis.

### Limitations of SMOTE

SMOTE has several well-documented limitations that practitioners should be aware of.

First, SMOTE operates in continuous feature space and relies on [Euclidean distance](/wiki/embedding_space) for nearest-neighbor calculations. In very high-dimensional spaces, distance metrics become unreliable (the curse of dimensionality), and SMOTE may generate synthetic samples that do not reflect the true minority-class distribution. Applying [dimensionality reduction](/wiki/dimension_reduction) before SMOTE can mitigate this issue.

Second, SMOTE does not consider the majority-class distribution when generating synthetic samples. If minority and majority classes overlap significantly, SMOTE can generate synthetic points that fall within majority-class regions, introducing noise and potentially degrading classifier performance. Borderline-SMOTE and ADASYN partially address this by focusing generation on boundary regions.

Third, the linear interpolation mechanism can produce synthetic samples that deviate from the true minority-class manifold, particularly when the minority class has a complex, non-linear distribution. GAN-based approaches (discussed below) can better capture such distributions.

Fourth, SMOTE was designed for binary classification. Applying it to multi-class problems requires either decomposing the problem into multiple binary problems or using multi-class extensions, which adds complexity.

## GAN-based oversampling

[Generative adversarial networks](/wiki/generative_adversarial_network_gan) offer an alternative to SMOTE-family methods for generating synthetic minority-class samples. Instead of linear interpolation, GANs learn the underlying data distribution through an adversarial training process involving a generator and a discriminator.

For tabular data, **CTGAN** (Conditional Tabular GAN), introduced by Xu et al. (2019) [14], is specifically designed to handle the challenges of mixed data types (continuous and categorical columns) and imbalanced categorical variables. CTGAN uses a variational Gaussian mixture model to encode continuous columns and a training-by-sampling strategy that conditions the generator on specific column values. This allows it to generate synthetic samples that better capture complex, non-linear relationships in the data compared to interpolation-based methods.

Other GAN variants used for imbalanced data include **CopulaGAN**, which models the joint distribution of features using copulas, and **WGAN-GP** (Wasserstein GAN with Gradient Penalty), which provides more stable training dynamics.

GAN-based oversampling tends to produce more realistic synthetic samples than SMOTE when the minority class has a complex distribution. However, GANs are significantly more expensive to train, require careful hyperparameter tuning, and may suffer from mode collapse (generating only a limited variety of samples). For small minority classes (fewer than a few hundred samples), GANs may not have enough training data to learn the distribution effectively, making SMOTE-family methods a more practical choice.

## Algorithm-level solutions: class weights, cost-sensitive learning, and focal loss

Algorithm-level approaches modify the learning algorithm itself so that it pays more attention to the minority class, without altering the training data.

### Class weights

Many classifier implementations, including [logistic regression](/wiki/logistic_regression), [support vector machines](/wiki/support_vector_machine_svm), [random forests](/wiki/random_forest), and [neural networks](/wiki/neural_network), support a `class_weight` parameter or an equivalent per-class weighting option that assigns higher importance to minority-class samples during training. When class weights are set inversely proportional to class frequencies, the loss contribution of each minority-class sample is amplified, so that both classes contribute equally to the total loss.

In [scikit-learn](/wiki/scikit-learn), setting `class_weight='balanced'` automatically computes weights as:

```
weight_j = n_samples / (n_classes * n_samples_j)
```

where `n_samples_j` is the number of samples in class `j`. For a binary dataset with 950 negative and 50 positive samples, the positive class receives a weight of $$1000 / (2 \times 50) = 10$$ and the negative class a weight of $$1000 / (2 \times 950) \approx 0.53$$, meaning each positive-class error counts about 19 times as much as each negative-class error during training, which matches the 19:1 imbalance ratio.
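
The formula can be checked by hand with a short standard-library sketch. The helper `balanced_class_weights` is a hypothetical name that re-implements the stated formula for illustration; it does not call scikit-learn.

```python
from collections import Counter


def balanced_class_weights(labels):
    """weight_j = n_samples / (n_classes * n_samples_j) for every class j."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}


labels = [0] * 950 + [1] * 50
weights = balanced_class_weights(labels)
print(weights[1])               # 10.0
print(round(weights[0], 3))     # 0.526
print(weights[1] / weights[0])  # ratio of the two weights, about 19
```

The ratio between the two weights equals the imbalance ratio of the data, so the weighted classes contribute the same total amount to the loss.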

### Cost-sensitive learning

Cost-sensitive learning generalizes class weighting by assigning different misclassification costs to different types of errors. A **cost matrix** specifies the penalty for each cell of the [confusion matrix](/wiki/confusion_matrix). For instance, in medical diagnosis, the cost of a false negative (missing a disease) is typically set much higher than the cost of a false positive (unnecessary follow-up test). The learning algorithm then minimizes expected cost rather than raw error count.

Cost-sensitive approaches can be implemented at three levels.

**Direct algorithm modification** involves changing the objective function of the learning algorithm to incorporate costs. For example, a cost-sensitive [decision tree](/wiki/decision_tree) can use cost-weighted impurity measures instead of standard Gini impurity or information gain.

**Meta-learning wrappers** convert any existing classifier into a cost-sensitive one by manipulating instance weights or probability thresholds. This approach has the advantage of being model-agnostic.

**Threshold adjustment** modifies the probability threshold at prediction time rather than during training, shifting it to reflect the asymmetric costs.
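
For the binary case, threshold adjustment has a simple closed form under two assumptions that are labeled here because the text does not state them: correct predictions cost nothing, and the model's probabilities are well calibrated. Predicting positive is then cheaper whenever $$(1 - p) \cdot C_{FP} \le p \cdot C_{FN}$$, which rearranges to $$p \ge C_{FP} / (C_{FP} + C_{FN})$$. The function names below are illustrative.

```python
def cost_optimal_threshold(cost_fp, cost_fn):
    """Probability threshold that minimizes expected cost, assuming
    zero cost for correct predictions and calibrated probabilities."""
    return cost_fp / (cost_fp + cost_fn)


def cost_sensitive_predict(p_positive, cost_fp, cost_fn):
    """Return 1 if predicting positive has the lower expected cost."""
    return int(p_positive >= cost_optimal_threshold(cost_fp, cost_fn))


# A missed disease (false negative) is assumed to cost 9 units,
# an unnecessary follow-up test (false positive) 1 unit.
print(cost_optimal_threshold(cost_fp=1, cost_fn=9))        # 0.1
print(cost_sensitive_predict(0.3, cost_fp=1, cost_fn=9))   # 1
print(cost_sensitive_predict(0.05, cost_fp=1, cost_fn=9))  # 0
```

With equal costs the threshold falls back to 0.5, so the usual default is the special case of symmetric misclassification costs.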

### Threshold moving

In [binary classification](/wiki/binary_classification), most classifiers output a probability score and apply a default threshold of 0.5 to convert it into a class label. For imbalanced data, this default threshold is usually suboptimal. **Threshold moving** (also called threshold tuning) involves selecting a threshold that optimizes a metric more appropriate for the task, such as the [F1 score](/wiki/f1_score) or the geometric mean of sensitivity and specificity. The optimal threshold can be determined by analyzing the [precision](/wiki/precision)-[recall](/wiki/recall) curve or the ROC curve on a validation set.

For example, if a model outputs probability 0.3 for a positive case, the default 0.5 threshold would classify it as negative. Lowering the threshold to 0.2 would correctly classify it as positive. The trade-off is that a lower threshold increases recall (catching more true positives) at the expense of precision (producing more false positives).
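
A minimal sketch of threshold moving follows: it tries every distinct score in a validation set as a candidate threshold and keeps the one with the highest F1. The validation labels and scores are made-up illustrative values, and the function names are hypothetical.

```python
def f1_at_threshold(y_true, scores, threshold):
    """F1 score when every score >= threshold is labeled positive."""
    tp = sum(1 for t, s in zip(y_true, scores) if t == 1 and s >= threshold)
    fp = sum(1 for t, s in zip(y_true, scores) if t == 0 and s >= threshold)
    fn = sum(1 for t, s in zip(y_true, scores) if t == 1 and s < threshold)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def best_threshold(y_true, scores):
    """Pick the candidate threshold with the highest validation F1."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda th: f1_at_threshold(y_true, scores, th))


# Hypothetical validation set: three positives among ten samples.
y_val = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.05, 0.10, 0.12, 0.20, 0.25, 0.28, 0.30, 0.35, 0.45, 0.60]

print(f1_at_threshold(y_val, scores, 0.5))  # 0.5
print(best_threshold(y_val, scores))        # 0.3
```

On this toy data the default threshold of 0.5 finds only one of the three positives, while the tuned threshold of 0.3 finds all three at the cost of one false positive. The threshold should be chosen on validation data, not on the test set, to avoid an optimistic estimate.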

### Focal loss

Introduced by Lin et al. (2017) for dense object detection, **focal loss** has become widely adopted for training [deep learning](/wiki/deep_learning) models on imbalanced data [4]. The authors observed that "the extreme foreground-background class imbalance encountered during training of dense detectors is the central cause" of the accuracy gap between one-stage and two-stage detectors, and proposed "reshaping the standard cross entropy loss such that it down-weights the loss assigned to well-classified examples" [4]. Focal loss modifies the standard [cross-entropy](/wiki/cross-entropy) loss by adding a modulating factor that down-weights easy (well-classified) examples and focuses training on hard, misclassified instances:

$$
\mathrm{FL}(p_t) = -\alpha_t (1 - p_t)^\gamma \log(p_t)
$$

The hyperparameter $$\gamma$$ (typically set between 1 and 5) controls how aggressively easy examples are down-weighted. When $$\gamma = 0$$, focal loss reduces to standard cross-entropy. The $$\alpha_t$$ term provides class-specific weighting. Lin et al. report that "we use gamma = 2.0 with alpha = .25 for all experiments," and that these settings let their RetinaNet detector reach 40.8 AP on the COCO test-dev benchmark, surpassing every prior one-stage and two-stage detector at the time [4]. In practice, focal loss effectively concentrates the gradient signal on minority-class samples and hard examples near the decision boundary.
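
The formula can be evaluated for a single example with the standard library. This sketch assumes the binary convention in which $$p_t = p$$ and $$\alpha_t = \alpha$$ for the positive class, and $$p_t = 1 - p$$ and $$\alpha_t = 1 - \alpha$$ for the negative class; the clipping constant `eps` is an added safeguard against `log(0)` and is not part of the formula.

```python
import math


def focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-12):
    """Binary focal loss for one example.

    p is the predicted probability of the positive class, y is 0 or 1.
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    p_t = min(max(p_t, eps), 1.0)  # avoid log(0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy = focal_loss(0.9, 1)  # well-classified positive
hard = focal_loss(0.1, 1)  # badly misclassified positive
print(easy, hard, hard / easy)
```

With `gamma=0` and `alpha=1.0` the positive-class loss is exactly the cross-entropy term `-log(p)`, and with `gamma=2` the modulating factor shrinks the loss of the easy example by a factor of about 100 while leaving the hard example almost untouched.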

## Ensemble solutions

[Ensemble](/wiki/ensemble) methods combine multiple base learners, each trained on a different resampled version of the data, to produce a more robust classifier. Several ensemble approaches are specifically designed for imbalanced classification.

| Method | Base learner | Resampling strategy | Description |
|---|---|---|---|
| BalancedRandomForest | [Decision tree](/wiki/decision_tree) | Random undersampling per bootstrap | Each tree in the forest is trained on a balanced bootstrap sample created by randomly undersampling the majority class to match the minority class |
| EasyEnsemble | [AdaBoost](/wiki/boosting) | Random undersampling per subset | Creates multiple balanced subsets by undersampling the majority class, trains an AdaBoost classifier on each subset, and aggregates predictions |
| RUSBoost | Decision tree (boosted) | Random undersampling per boosting round | Integrates random undersampling into the AdaBoost boosting process, balancing the data at each iteration |
| BalancedBagging | Any classifier | Random undersampling per bag | Extends standard [bagging](/wiki/bagging) by undersampling each bootstrap sample before training the base learner |
| SMOTEBagging | Any classifier | SMOTE per bag | Applies SMOTE to each bootstrap sample to generate a balanced training set for each base learner |
| SMOTEBoost | Decision tree (boosted) | SMOTE per boosting round | Integrates SMOTE into the boosting procedure; generates synthetic minority samples at each round before updating weights |

In comparative studies, ensemble methods that incorporate resampling (EasyEnsemble, RUSBoost, SMOTEBagging) often outperform standalone resampling or standalone ensemble approaches on imbalanced benchmarks. EasyEnsemble, in particular, has shown strong results across multiple studies [7], likely because it combines the variance reduction benefits of ensembling with the bias correction benefits of undersampling.

## Anomaly detection as an alternative

When class imbalance is extreme (for example, an IR greater than 1000:1) or labeled minority-class samples are very scarce, framing the problem as [anomaly detection](/wiki/anomaly_detection) rather than binary classification can be more effective. Anomaly detection methods learn a model of "normal" behavior from the majority class and flag deviations as anomalies.

**One-Class SVM** learns a tight boundary around normal data in feature space and classifies points outside this boundary as anomalies. It works well when the normal class is compact in feature space but can be computationally intensive for large datasets, especially with non-linear kernels.

**Isolation Forest** builds an ensemble of random binary trees that recursively partition the feature space. Anomalies, being rare and different from normal patterns, tend to be isolated by fewer partitions and thus appear in shorter paths within the trees. Isolation Forest handles large, high-dimensional datasets efficiently and is relatively insensitive to the contamination rate.

**Autoencoders** trained on majority-class data learn to reconstruct normal patterns. At inference time, anomalies produce high reconstruction error because the model has never seen similar patterns during training. This approach is particularly useful for complex, high-dimensional data like images and time series.

The anomaly detection framing is most appropriate when very few labeled minority-class examples are available or when the minority class is too heterogeneous to model directly.

## Multi-class imbalance

Imbalance in [multi-class classification](/wiki/multi-class_classification) is more complex than in binary settings because the imbalance can exist between any pair of classes. A dataset might have three classes with distributions of 90%, 8%, and 2%, creating multiple simultaneous imbalance relationships.

### Decomposition strategies

One common approach is to decompose the multi-class problem into multiple binary problems. **One-vs-Rest (OvR)** creates one binary classifier per class, where each classifier distinguishes one class from all others. However, OvR inherently creates imbalanced binary problems: the "rest" group is almost always larger than the single target class. **One-vs-One (OvO)** creates a binary classifier for each pair of classes. This naturally produces more balanced binary subproblems but requires training $$k(k-1)/2$$ classifiers for $$k$$ classes, which grows as $$O(k^2)$$.

### Multi-class SMOTE

Applying SMOTE to multi-class problems requires deciding which classes to oversample and by how much. Common strategies include oversampling all minority classes to match the majority class, oversampling each class to match the median class size, or using class-specific oversampling ratios based on the degree of imbalance each class faces.
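
The first two strategies can be expressed as a function that maps class counts to target counts. This is an illustrative sketch with a hypothetical function name; it only computes how many samples each class should end up with and leaves the actual synthesis to an oversampler.

```python
import statistics


def oversampling_targets(counts, strategy="majority"):
    """Return the desired sample count per class after oversampling.

    counts maps class label -> current number of samples.
    """
    if strategy == "majority":
        target = max(counts.values())
    elif strategy == "median":
        target = int(statistics.median(counts.values()))
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    # Oversampling never shrinks a class, so larger classes keep their size.
    return {cls: max(n, target) for cls, n in counts.items()}


counts = {"A": 900, "B": 80, "C": 20}
print(oversampling_targets(counts, "majority"))  # {'A': 900, 'B': 900, 'C': 900}
print(oversampling_targets(counts, "median"))    # {'A': 900, 'B': 80, 'C': 80}
```

Matching the median produces far fewer synthetic samples than matching the majority, which can matter when the rarest class has too few real samples to interpolate from reliably. A dictionary of this shape (class label to desired count) is also the form that the `sampling_strategy` argument of imbalanced-learn's over-samplers accepts.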

### Multi-class cost matrices

Cost-sensitive learning extends naturally to multi-class settings by defining a $$k$$-by-$$k$$ cost matrix where each entry specifies the cost of predicting class $$i$$ when the true class is $$j$$. In practice, designing an appropriate multi-class cost matrix requires domain expertise, since the relative costs of confusing different class pairs may vary significantly.
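
Given predicted class probabilities, a cost matrix turns prediction into an expected-cost minimization. The sketch below follows the convention above (entry $$(i, j)$$ is the cost of predicting $$i$$ when the truth is $$j$$); the cost values and probabilities are invented for illustration.

```python
import numpy as np

# Hypothetical 3-class cost matrix: cost[i, j] is the cost of predicting
# class i when the true class is j. Correct predictions (diagonal) cost 0.
cost = np.array([
    [0.0, 5.0, 50.0],  # predict class 0
    [1.0, 0.0, 10.0],  # predict class 1
    [2.0, 2.0, 0.0],   # predict class 2
])

def min_expected_cost_class(probs, cost):
    """Pick the prediction with the lowest expected cost.

    probs has shape (n_samples, k). The expected cost of predicting i is
    sum_j cost[i, j] * P(true class = j).
    """
    expected = probs @ cost.T  # shape (n_samples, k)
    return expected.argmin(axis=1), expected

probs = np.array([[0.90, 0.07, 0.03]])
pred, expected = min_expected_cost_class(probs, cost)

print("most probable class:", probs.argmax(axis=1)[0])
print("expected costs:", expected[0])
print("min-cost class:", pred[0])
```

Here class 0 is the most probable (0.90), yet predicting class 1 has the lowest expected cost (1.20 versus 1.85 and 1.94), because missing the rare class 2 is expensive under the assumed costs.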

## Deep learning with imbalanced data

Deep learning models face the same challenges as traditional classifiers when trained on imbalanced data, but their scale and flexibility enable additional mitigation strategies.

### Loss function modifications

The most common approach is to replace standard cross-entropy loss with a loss function that accounts for class imbalance. **Focal loss** (described above) is the most widely used option. Other alternatives include:

- **Class-balanced loss** (Cui et al., 2019) [13]: re-weights the loss for each class by the inverse of its effective number of samples, $$(1 - \beta^n) / (1 - \beta)$$, where $$n$$ is the number of samples in the class and $$\beta \in [0, 1)$$ is a hyperparameter.
- **Dice loss**: originally developed for image segmentation tasks with imbalanced foreground and background pixels; measures the overlap between predicted and ground-truth regions.
- **Label-distribution-aware margin (LDAM) loss**: enforces larger classification margins for minority classes, providing stronger regularization for under-represented classes.
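
For the class-balanced loss, the per-class weights are inversely proportional to the effective number of samples. A minimal NumPy sketch follows; the class counts are invented, and normalizing the weights to sum to the number of classes is an assumed convention (only the ratios matter).

```python
import numpy as np

def class_balanced_weights(samples_per_class, beta=0.999):
    """Per-class loss weights from the effective number of samples."""
    n = np.asarray(samples_per_class, dtype=float)
    effective_num = (1.0 - np.power(beta, n)) / (1.0 - beta)
    weights = 1.0 / effective_num
    # Assumed normalization: weights sum to the number of classes.
    return weights * len(n) / weights.sum()

counts = [9000, 800, 200]  # hypothetical class sizes
for beta in (0.0, 0.99, 0.999, 0.9999):
    print(beta, np.round(class_balanced_weights(counts, beta), 3))
```

Setting $$\beta = 0$$ gives every class the same weight (no re-weighting), while values of $$\beta$$ close to 1 approach inverse-frequency weighting, so $$\beta$$ interpolates between the two.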

### Data-level strategies for deep learning

Because deep learning models train in mini-batches, severe imbalance can result in batches that contain no minority-class samples at all. Strategies to address this include:

- **Class-balanced sampling**: constructing each mini-batch so that it contains an approximately equal number of samples from each class.
- **Curriculum learning**: starting training with a balanced subset of easy examples and gradually introducing harder and more imbalanced batches.
- **Two-phase training**: first pre-training on a balanced subset (using oversampling or undersampling), then [fine-tuning](/wiki/fine_tuning) on the original imbalanced data with a lower [learning rate](/wiki/learning_rate).
- **Decoupled training**: training the feature extractor (backbone) with instance-balanced sampling for representation learning, then retraining only the classification head with class-balanced sampling. This approach, introduced by Kang et al. (2020), has shown strong results on long-tailed recognition benchmarks [12].
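
Class-balanced sampling, the first strategy in the list, can be sketched framework-independently by weighting each sample by the inverse of its class frequency. The label array and batch size below are assumptions for illustration. Deep learning frameworks offer equivalent utilities (for example, PyTorch's `WeightedRandomSampler`), which are not used here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical labels: 9,900 majority (0) and 100 minority (1) samples.
y = np.array([0] * 9900 + [1] * 100)

# Weight each sample by the inverse of its class frequency so that every
# class carries the same total probability mass.
class_counts = np.bincount(y)
sample_weights = 1.0 / class_counts[y]
sample_probs = sample_weights / sample_weights.sum()

def balanced_batch(batch_size):
    """Draw indices (with replacement) for one class-balanced mini-batch."""
    return rng.choice(len(y), size=batch_size, replace=True, p=sample_probs)

# Under uniform sampling, the chance that a batch of 32 has no minority sample.
p_no_minority = (1 - 100 / 10000) ** 32
print(f"P(no minority in a uniform batch of 32) = {p_no_minority:.3f}")

batch = balanced_batch(32)
print("class counts in a balanced batch:", np.bincount(y[batch], minlength=2))
```

Sampling with replacement means minority examples repeat often within an epoch, which is why this strategy is commonly paired with augmentation or regularization.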

### Data augmentation

For image and text data, [data augmentation](/wiki/data_augmentation) can serve as a form of oversampling that generates genuinely new minority-class examples rather than simple interpolations. Techniques such as random cropping, rotation, and color jittering (for images), as well as synonym replacement and back-translation (for text), increase the diversity of minority-class training data. When combined with focal loss or class-balanced sampling, augmentation-based approaches can substantially improve minority-class recall, often with little loss in overall performance.

## Which metrics should you use for imbalanced data?

Choosing the right evaluation metric is as important as choosing the right resampling strategy. Standard [accuracy](/wiki/accuracy) is unreliable for imbalanced data. The following metrics provide a more faithful picture of model performance.

| Metric | Formula or definition | Why it helps with imbalanced data |
|---|---|---|
| [Precision](/wiki/precision) | $$\text{TP} / (\text{TP} + \text{FP})$$ | Measures the fraction of predicted positives that are truly positive; high precision means few false alarms |
| [Recall](/wiki/recall) (sensitivity) | $$\text{TP} / (\text{TP} + \text{FN})$$ | Measures the fraction of actual positives that are correctly detected; high recall means few missed cases |
| [F1 score](/wiki/f1_score) | $$\frac{2 \cdot (\text{Precision} \cdot \text{Recall})}{\text{Precision} + \text{Recall}}$$ | Harmonic mean of precision and recall; balances both concerns in a single number |
| F-beta score | $$\frac{(1 + \beta^2) \cdot (\text{Precision} \cdot \text{Recall})}{\beta^2 \cdot \text{Precision} + \text{Recall}}$$ | Generalization of F1 that allows weighting recall more ($$\beta > 1$$) or precision more ($$\beta < 1$$) |
| PR-AUC | Area under the Precision-Recall curve | Focuses on positive-class performance; more informative than ROC-AUC when the positive class is rare |
| ROC-AUC | Area under the [ROC curve](/wiki/roc_receiver_operating_characteristic_curve) | Measures trade-off between true positive rate and false positive rate across all thresholds |
| Matthews Correlation Coefficient (MCC) | $$\frac{\text{TP} \cdot \text{TN} - \text{FP} \cdot \text{FN}}{\sqrt{(\text{TP}+\text{FP})(\text{TP}+\text{FN})(\text{TN}+\text{FP})(\text{TN}+\text{FN})}}$$ | Uses all four cells of the [confusion matrix](/wiki/confusion_matrix); returns a value between -1 and +1 that is informative even with severe imbalance |
| Balanced accuracy | $$(\text{Sensitivity} + \text{Specificity}) / 2$$ | Averages per-class recall; corrects for majority-class dominance |
| Cohen's Kappa | $$(p_o - p_e) / (1 - p_e)$$, where $$p_o$$ is observed agreement and $$p_e$$ is agreement expected by chance | Compares observed accuracy with expected accuracy under random prediction; penalizes models that merely predict the majority class |
| G-Mean | $$\sqrt{\text{Sensitivity} \cdot \text{Specificity}}$$ | Geometric mean of per-class recalls; penalizes models that sacrifice one class for another |
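
The count-based formulas in the table can be evaluated directly from the four confusion-matrix cells. The sketch below uses invented counts and, for brevity, assumes every denominator is non-zero (a production implementation would need to handle empty rows or columns).

```python
from math import sqrt

def imbalance_metrics(tp, fp, fn, tn):
    """Compute the table's count-based metrics from confusion-matrix cells."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # sensitivity
    specificity = tn / (tn + fp)
    f1 = 2 * precision * recall / (precision + recall)
    mcc_den = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "balanced_accuracy": (recall + specificity) / 2,
        "g_mean": sqrt(recall * specificity),
        "mcc": (tp * tn - fp * fn) / mcc_den,
    }

# Hypothetical counts: 10 positives among 1,000 samples.
metrics = imbalance_metrics(tp=5, fp=10, fn=5, tn=980)
for name, value in metrics.items():
    print(f"{name:>17}: {value:.3f}")
```

With these counts accuracy is 0.985, while precision (0.333), recall (0.500), F1 (0.400), and MCC (about 0.401) all show that half of the positives are missed and most alarms are false.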

### PR-AUC vs. ROC-AUC

A long-standing debate in the literature concerns whether ROC-AUC or PR-AUC is more appropriate for imbalanced settings. ROC-AUC plots the true positive rate against the false positive rate and tends to present an optimistic view when the negative class is very large, because a small false positive rate still corresponds to a large absolute number of false positives. PR-AUC plots precision against recall and is more sensitive to errors involving the positive (minority) class.

Saito and Rehmsmeier (2015) demonstrated, in a paper titled "The Precision-Recall Plot Is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets," that "PRC plots can provide the viewer with an accurate prediction of future classification performance" because only the precision-recall curve changes with the ratio of positives to negatives [8]. Their analysis showed that a method performing well on the ROC curve may still perform poorly on the PR curve if it generates many false positives, which are obscured by the large number of true negatives in the ROC analysis.

As a general guideline, PR-AUC is preferred when the primary goal is to accurately identify minority-class instances (for example, fraud detection, rare disease diagnosis), while ROC-AUC is appropriate when the costs of false positives and false negatives are roughly symmetric. In practice, reporting both curves alongside the MCC provides the most complete picture.
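
The sensitivity of the two curves to class ratio can be seen in a small simulation. The sketch below keeps a fixed, assumed score distribution (positives drawn from $$N(1.5, 1)$$, negatives from $$N(0, 1)$$) and changes only the number of positives. It uses scikit-learn's `average_precision_score` as a step-wise estimate of PR-AUC.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)

def simulate(n_pos, n_neg):
    """Scores from a fixed (assumed) model: pos ~ N(1.5, 1), neg ~ N(0, 1)."""
    y = np.concatenate([np.ones(n_pos, dtype=int), np.zeros(n_neg, dtype=int)])
    s = np.concatenate([rng.normal(1.5, 1.0, n_pos), rng.normal(0.0, 1.0, n_neg)])
    return y, s

results = {}
for name, n_pos in [("balanced (1:1)", 50_000), ("rare (1:100)", 500)]:
    y, s = simulate(n_pos, 50_000)
    results[name] = (roc_auc_score(y, s), average_precision_score(y, s))
    roc, ap = results[name]
    print(f"{name:>15}  ROC-AUC={roc:.3f}  PR-AUC (AP)={ap:.3f}")
```

The model's ability to rank positives above negatives is identical in both runs, so ROC-AUC stays roughly constant, while the average precision drops sharply once positives become rare.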

### The Matthews Correlation Coefficient

The MCC deserves special attention for imbalanced datasets. Chicco and Jurman (2020) demonstrated that MCC is more informative than both F1 score and accuracy for binary classification evaluation, because the coefficient "produces a high score only if the prediction obtained good results in all of the four confusion matrix categories (true positives, false negatives, true negatives, and false positives), proportionally both to the size of positive elements and the size of negative elements in the dataset" [9]. Unlike F1, MCC accounts for true negatives and is symmetric with respect to both classes. An MCC of +1 indicates perfect prediction, 0 indicates performance no better than random, and -1 indicates complete disagreement. Because MCC uses all four quadrants of the confusion matrix, it is harder to "game" by simply predicting the majority class.
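
The symmetry property can be checked numerically with scikit-learn's `f1_score` and `matthews_corrcoef`. The confusion matrix below is invented for illustration: class 1 is the majority, and the classifier labels almost everything as class 1.

```python
import numpy as np
from sklearn.metrics import f1_score, matthews_corrcoef

# Hypothetical confusion matrix where class 1 is the MAJORITY:
# TP=90, FN=1, FP=8, TN=1 (91 positives, 9 negatives).
y_true = np.array([1] * 91 + [0] * 9)
y_pred = np.array([1] * 90 + [0] * 1 + [1] * 8 + [0] * 1)

f1 = f1_score(y_true, y_pred)
mcc = matthews_corrcoef(y_true, y_pred)

# Swap which class is called "positive" without changing any prediction.
f1_swapped = f1_score(1 - y_true, 1 - y_pred)
mcc_swapped = matthews_corrcoef(1 - y_true, 1 - y_pred)

print(f"F1  = {f1:.3f}   after swapping labels: {f1_swapped:.3f}")
print(f"MCC = {mcc:.3f}   after swapping labels: {mcc_swapped:.3f}")
```

F1 moves from about 0.952 to about 0.182 depending only on which class is labeled positive, whereas MCC stays at about 0.205 in both cases and reflects that only one of the nine minority samples was recognized.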

## What is the imbalanced-learn library?

**imbalanced-learn** (imported as `imblearn`) is an open-source Python library specifically designed to handle class-imbalanced datasets, first described by Lemaitre, Nogueira, and Aridas (2017) [5]. It is part of the scikit-learn-contrib ecosystem and provides a consistent API that integrates seamlessly with [scikit-learn](/wiki/scikit-learn) pipelines. As of June 2026 the library is at version 0.14.2 and requires Python 3.10 or newer and scikit-learn 1.4.2 or newer [17].

The library organizes its methods into four categories:

1. **Over-sampling:** Random oversampling, SMOTE, ADASYN, Borderline-SMOTE, K-Means SMOTE, SVM-SMOTE, SMOTE-NC
2. **Under-sampling:** Random undersampling, Tomek Links, NearMiss (versions 1, 2, 3), Edited Nearest Neighbors, Condensed Nearest Neighbor, One-Sided Selection, Neighbourhood Cleaning Rule
3. **Combination:** SMOTE-Tomek, SMOTE-ENN
4. **Ensemble:** EasyEnsembleClassifier, BalancedRandomForestClassifier, BalancedBaggingClassifier, RUSBoostClassifier

A basic usage example with SMOTE:

```python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in data (roughly 95% / 5%); replace with your own X and y
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=42)

# Split data (stratified to preserve class distribution)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Create a pipeline with SMOTE and a classifier
pipeline = Pipeline([
    ('smote', SMOTE(random_state=42)),
    ('classifier', RandomForestClassifier(random_state=42))
])

# Cross-validate on training data
scores = cross_val_score(pipeline, X_train, y_train, cv=5, scoring='f1')
print(f"Cross-validated F1: {scores.mean():.3f} +/- {scores.std():.3f}")

# Fit and evaluate on test set
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
print(classification_report(y_test, y_pred))
```

An important best practice is to apply resampling only to the training set, never to the validation or test set. The `imblearn.pipeline.Pipeline` class handles this automatically by applying the resampling step only during `fit`, not during `predict` or `score`. Resampling the test set would distort the evaluation metrics and give a misleading picture of real-world performance.
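
To make the train-only rule concrete, the sketch below performs the same fold-wise logic by hand using only NumPy and scikit-learn. The `random_oversample` helper is a hypothetical stand-in for a sampler (plain duplication, not SMOTE), and the dataset is synthetic.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold

def random_oversample(X, y, rng):
    """Duplicate minority rows (with replacement) until classes are equal."""
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    idx = [np.arange(len(y))]
    for cls, n in zip(classes, counts):
        if n < target:
            idx.append(rng.choice(np.flatnonzero(y == cls), size=target - n))
    idx = np.concatenate(idx)
    return X[idx], y[idx]

X, y = make_classification(n_samples=3000, weights=[0.95, 0.05], random_state=0)
rng = np.random.default_rng(0)
fold_scores, test_minority_rates = [], []

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in cv.split(X, y):
    # Resample ONLY the training fold; the test fold keeps its natural ratio.
    X_res, y_res = random_oversample(X[train_idx], y[train_idx], rng)
    model = LogisticRegression(max_iter=1000).fit(X_res, y_res)
    fold_scores.append(f1_score(y[test_idx], model.predict(X[test_idx])))
    test_minority_rates.append(y[test_idx].mean())

print("F1 per fold:", np.round(fold_scores, 3))
print("minority rate in test folds:", np.round(test_minority_rates, 3))
```

Oversampling before the split would instead place copies of the same minority rows in both training and test folds, which inflates the measured scores.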

## Comparison of approaches

The following table summarizes the main categories of techniques for handling imbalanced data, along with their typical use cases and trade-offs.

| Approach | When to use | Advantages | Disadvantages |
|---|---|---|---|
| Random oversampling | Small datasets with moderate imbalance | Simple; no hyperparameters | Risk of overfitting from duplicated samples |
| SMOTE and variants | Moderate imbalance with continuous features | Generates diverse synthetic samples | May create noise in overlapping regions; struggles with high-dimensional data |
| GAN-based oversampling | Complex distributions; sufficient minority samples for GAN training | Captures non-linear distributions | Expensive to train; risk of mode collapse |
| Random undersampling | Very large datasets where computation is a concern | Reduces training time | Discards potentially useful majority-class information |
| Informed undersampling (Tomek, NearMiss) | Moderate to large datasets with noisy boundaries | Cleans decision boundary | May not reduce imbalance enough on its own |
| Class weights | Any classifier that supports weighted loss | No data modification needed; easy to implement | May not be sufficient for extreme imbalance |
| Cost-sensitive learning | Problems with well-defined misclassification costs | Directly optimizes for business objectives | Requires domain knowledge to set cost matrix |
| Threshold moving | Any probabilistic classifier | Simple post-hoc adjustment; no retraining needed | Only adjusts the decision point, not the learned representation |
| Focal loss | Deep learning models on imbalanced data | Automatically down-weights easy examples | Requires tuning $$\gamma$$ and $$\alpha$$ hyperparameters |
| Ensemble methods | General-purpose; when single models underperform | Combines benefits of resampling and model averaging | Higher computational cost; more complex to deploy |
| Anomaly detection | Extreme imbalance (>1000:1) or very few minority labels | Does not require balanced training data | Cannot leverage minority-class labels effectively |

## Best practices

1. **Always evaluate with appropriate metrics.** Use [precision](/wiki/precision), [recall](/wiki/recall), F1, PR-AUC, MCC, or balanced accuracy rather than raw accuracy.
2. **Resample training data only.** Never apply SMOTE, undersampling, or any resampling to the test or validation set.
3. **Use stratified splits.** When splitting data into training, validation, and test sets, use stratified sampling to preserve the original class distribution in each split.
4. **Try multiple strategies.** There is no universally best approach. Experiment with data-level, algorithm-level, and ensemble methods, and compare results using [cross-validation](/wiki/cross-validation).
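5. **Combine techniques.** Using class weights together with SMOTE, or ensemble methods with focal loss, can outperform any single technique, but stacked corrections can also overcompensate toward the minority class, so validate each combination on held-out data.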
6. **Consider the cost structure.** If false negatives are far more costly than false positives (for example, missed cancer diagnoses), weight your evaluation and training accordingly.
7. **Be cautious with SMOTE on high-dimensional data.** SMOTE relies on nearest-neighbor distances, which can become unreliable in very high-dimensional spaces. [Dimensionality reduction](/wiki/dimension_reduction) before SMOTE can help.
8. **Monitor for overfitting.** Oversampling, especially random oversampling, can lead to [overfitting](/wiki/overfitting). Track performance on a held-out validation set throughout training.
9. **Use pipelines for resampling.** Wrap resampling and modeling in a single pipeline (e.g., `imblearn.pipeline.Pipeline`) to prevent data leakage during cross-validation.
10. **Account for data complexity.** Imbalance ratio alone does not determine difficulty. Assess class overlap, noise, and small disjuncts to choose the most appropriate technique.
11. **Start simple.** Try class weights or threshold moving before resorting to complex resampling or ensemble methods. In many cases, simple approaches are competitive with more elaborate ones.
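
The two simple baselines from the last item can be sketched in a few lines with scikit-learn. The dataset, the logistic regression model, and the choice of F1 as the criterion for the threshold are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Baseline 1: class weights inversely proportional to class frequency.
plain = LogisticRegression(max_iter=1000).fit(X_train, y_train)
weighted = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_train, y_train)
recall_plain = recall_score(y_val, plain.predict(X_val))
recall_weighted = recall_score(y_val, weighted.predict(X_val))

# Baseline 2: threshold moving on the unweighted model. Pick the threshold
# that maximizes F1 on validation data instead of using the default 0.5.
probs = plain.predict_proba(X_val)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_val, probs)
f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
best_threshold = thresholds[np.argmax(f1[:-1])]  # last P/R point has no threshold
recall_moved = recall_score(y_val, (probs >= best_threshold).astype(int))

print(f"recall at 0.5, no weights:    {recall_plain:.3f}")
print(f"recall at 0.5, class weights: {recall_weighted:.3f}")
print(f"recall at {best_threshold:.2f}, threshold moved: {recall_moved:.3f}")
```

In a real project the threshold would be chosen on a validation split and then reported on a separate test split; reusing one split for both, as this short sketch does, gives an optimistic estimate.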

## See also

- [Oversampling](/wiki/oversampling)
- [Downsampling](/wiki/downsampling)
- [Confusion matrix](/wiki/confusion_matrix)
- [Precision](/wiki/precision)
- [Recall](/wiki/recall)
- [F1 score](/wiki/f1_score)
- [ROC curve](/wiki/roc_receiver_operating_characteristic_curve)
- [Anomaly detection](/wiki/anomaly_detection)
- [Data augmentation](/wiki/data_augmentation)
- [Binary classification](/wiki/binary_classification)

## References

1. Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: Synthetic Minority Over-sampling Technique. *Journal of Artificial Intelligence Research*, 16, 321-357.
2. He, H., Bai, Y., Garcia, E. A., & Li, S. (2008). ADASYN: Adaptive Synthetic Sampling Approach for Imbalanced Learning. *IEEE International Joint Conference on Neural Networks*, 1322-1328.
3. Han, H., Wang, W. Y., & Mao, B. H. (2005). Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning. *International Conference on Intelligent Computing (ICIC)*, 878-887.
4. Lin, T. Y., Goyal, P., Girshick, R., He, K., & Dollar, P. (2017). Focal Loss for Dense Object Detection. *IEEE International Conference on Computer Vision (ICCV)*, 2980-2988.
5. Lemaitre, G., Nogueira, F., & Aridas, C. K. (2017). Imbalanced-learn: A Python Toolbox to Tackle the Curse of Imbalanced Datasets in Machine Learning. *Journal of Machine Learning Research*, 18(17), 1-5.
6. Tomek, I. (1976). Two Modifications of CNN. *IEEE Transactions on Systems, Man, and Cybernetics*, 6(11), 769-772.
7. Liu, X. Y., Wu, J., & Zhou, Z. H. (2009). Exploratory Undersampling for Class-Imbalance Learning. *IEEE Transactions on Systems, Man, and Cybernetics, Part B*, 39(2), 539-550.
8. Saito, T., & Rehmsmeier, M. (2015). The Precision-Recall Plot Is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets. *PLOS ONE*, 10(3), e0118432.
9. Chicco, D., & Jurman, G. (2020). The Advantages of the Matthews Correlation Coefficient (MCC) over F1 Score and Accuracy in Binary Classification Evaluation. *BMC Genomics*, 21, Article 6.
10. Krawczyk, B. (2016). Learning from Imbalanced Data: Open Challenges and Future Directions. *Progress in Artificial Intelligence*, 5(4), 221-232.
11. Fernandez, A., Garcia, S., Galar, M., Prati, R. C., Krawczyk, B., & Herrera, F. (2018). *Learning from Imbalanced Data Sets*. Springer.
12. Kang, B., Xie, S., Rohrbach, M., Yan, Z., Gordo, A., Feng, J., & Kalantidis, Y. (2020). Decoupling Representation and Classifier for Long-Tailed Recognition. *International Conference on Learning Representations (ICLR)*.
13. Cui, Y., Jia, M., Lin, T. Y., Song, Y., & Belongie, S. (2019). Class-Balanced Loss Based on Effective Number of Samples. *IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)*, 9268-9277.
14. Xu, L., Skoularidou, M., Cuesta-Infante, A., & Veeramachaneni, K. (2019). Modeling Tabular Data using Conditional GAN. *Advances in Neural Information Processing Systems (NeurIPS)*, 32.
15. Song, J., Huang, X., Qin, S., & Song, Q. (2019). Bayes Imbalance Impact Index: A Measure of Class Imbalanced Dataset for Classification Problem. *IEEE Transactions on Neural Networks and Learning Systems*, 30(11), 3525-3538.
16. Semantic Scholar. SMOTE: Synthetic Minority Over-sampling Technique (citation count). Retrieved 2026. https://www.semanticscholar.org/paper/8cb44f06586f609a29d9b496cc752ec01475dffe
17. Imbalanced-learn developers. Getting Started (version 0.14.2 installation requirements). Retrieved June 2026. https://imbalanced-learn.org/stable/install.html