In a classification problem, the minority class is the label with the fewest training instances in an imbalanced dataset. For example, in a credit card transaction dataset where 99.83% of records are legitimate and 0.17% are fraudulent, "fraudulent" is the minority class. Its counterpart is the majority class, which contains the bulk of the samples and dominates the training loss.
The minority class is usually the class of practical interest. Fraud, disease, defects, intrusions, and rare astronomical events are all minority outcomes that carry high cost when missed. Standard learning algorithms tend to ignore them, which is why a large body of techniques (resampling, cost-sensitive losses, focal loss, threshold tuning, anomaly detection) has been developed specifically to recover minority-class performance.
Given a labeled training set with $K$ classes, let $n_k$ denote the number of samples in class $k$. The minority class is $\arg\min_k n_k$. In the binary case the dataset reduces to two counts, $n_+$ and $n_-$, and the minority class is whichever of the two is smaller. The imbalance ratio (IR) is defined as $n_{\text{maj}} / n_{\text{min}}$. A balanced dataset has IR around 1; values above 10 are considered moderately imbalanced; values above 100 are severe; and values above 1000 are extreme, often appearing in fraud, intrusion detection, and rare disease screening.
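As a quick illustration, the counts and imbalance ratio can be computed from a label list with the standard library; the numbers below mirror the credit card example:

```python
from collections import Counter

y = ["legit"] * 9983 + ["fraud"] * 17       # mirrors the 99.83% / 0.17% split
counts = Counter(y)
minority = min(counts, key=counts.get)      # arg min_k n_k
ir = max(counts.values()) / min(counts.values())
print(minority, round(ir, 1))               # fraud 587.2
```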
The term "minority class" is mostly used in supervised binary or multi-class settings. In unsupervised contexts the analogous concept is a rare cluster or an outlier. In multi-label classification, every label can have its own imbalance ratio, and a single instance may carry both majority and minority labels at once.
Several application domains routinely produce datasets where the minority class is the one we want the model to find. The table below summarizes typical positive-class prevalence based on commonly cited benchmarks and surveys.
| Domain | Minority class | Typical prevalence | Why it matters |
|---|---|---|---|
| Credit card fraud detection | Fraudulent transaction | 0.1% to 0.2% of records | Each missed fraud is a direct chargeback or laundered payment |
| Cancer screening | Malignant case in general population | 1% to 5% of screened patients | A false negative delays treatment of a life-threatening disease |
| Click-through prediction | Click event | 0.5% to 5% of impressions | Wrong predictions waste ad spend and suppress revenue per impression |
| Manufacturing defect detection | Defective unit | 0.1% to 2% of units | Defects shipped to customers cause recalls and brand damage |
| Customer churn (in some segments) | Churning customer | 1% to 10% of users in a billing period | Retained customers are far cheaper than acquired ones |
| Network intrusion detection | Malicious packet | Less than 1% of traffic | A single intrusion can compromise an entire system |
| Hallucination detection in LLM outputs | Hallucinated span | A few percent of generations on factual prompts | Determines whether downstream factual checks fire |
| Adverse drug reaction reports | Confirmed serious reaction | Less than 1% of post-market reports | Drives label changes and recalls |
In all of these settings, the cost of a false negative on the minority class is much higher than the cost of a false positive on the majority class. This asymmetry is why metrics built on raw accuracy are misleading for these problems and why the minority class drives almost every modeling choice.
Most classifiers are trained to minimize empirical risk, usually a sum of per-sample losses. When one class dominates the dataset, that class also dominates the loss. The optimization easily finds a decision rule that ignores the minority class entirely and still scores high accuracy. A trivial classifier that predicts "not fraud" on every transaction in the credit card dataset described above achieves about 99.83% accuracy and zero recall on the class anyone actually cares about.
The skewed loss is only one mechanism at work; class overlap and noise compound the imbalance. Krawczyk (2016) shows that the imbalance ratio alone is a poor predictor of difficulty: data complexity factors interact with imbalance to determine how badly classifiers will fail. This is why "just rebalance the data" rarely solves the problem on its own.
Oversampling expands the minority class so the learner sees more of it during training. The simplest form duplicates existing minority samples; more advanced methods generate synthetic samples in feature space.
| Technique | Year | Idea | Strength | Weakness |
|---|---|---|---|---|
| Random oversampling | classical | Duplicate randomly chosen minority samples until the desired ratio is reached | No new assumptions; trivial to implement | Repeated copies tend to cause overfitting |
| SMOTE (Chawla et al.) | 2002 | For each minority sample, pick one of its $k$ nearest minority neighbors and create a synthetic point on the line segment between them | Generates novel samples that vary per run; widely supported | Linear interpolation can land in majority regions when classes overlap |
| Borderline-SMOTE (Han, Wang, Mao) | 2005 | Apply SMOTE only to minority samples whose neighbors include many majority points (those near the boundary) | Focuses synthetic data where the classifier struggles | Borderline detection itself is sensitive to noise |
| SMOTE-NC (Chawla et al.) | 2002 | Handle nominal/categorical features by interpolating numerical features and selecting the most frequent category for nominal ones | Works on mixed-type tables | Requires at least one continuous feature |
| SVMSMOTE | 2009 | Train an SVM, take its minority support vectors as the borderline, and generate synthetic samples around them | Boundary defined by a strong learner rather than $k$-NN voting | Adds the cost of training an SVM upfront |
| ADASYN (He et al.) | 2008 | Weight each minority sample by how many of its $k$ neighbors are majority, then generate more synthetic data for the harder examples | Adaptive density: hard regions get more attention | Can amplify outliers and label noise |
| K-Means SMOTE | 2017 | Cluster the data with K-Means, find clusters dominated by the minority class, and apply SMOTE within them | Avoids generating synthetic data in noisy or empty regions | Adds a clustering step with its own hyperparameters |
| SMOTE+ENN, SMOTE+Tomek | classical | Run SMOTE first, then clean the dataset with Edited Nearest Neighbors or Tomek links | Removes noisy or boundary-crossing synthetic samples | More expensive than SMOTE alone |
| GAN-based oversampling (CTGAN, WGAN-GP, CopulaGAN) | 2019+ | Train a generative adversarial network on the minority class and sample from it | Captures non-linear distributions better than interpolation | Needs enough minority data to train the GAN; mode collapse risk |
SMOTE is the most cited of these methods. The original 2002 paper by Chawla, Bowyer, Hall, and Kegelmeyer published in the Journal of Artificial Intelligence Research has been cited tens of thousands of times. Its core idea is simple: rather than copy an existing minority sample $x$, pick one of its $k$ nearest minority neighbors $x_{nn}$ and produce $x_{\text{new}} = x + \lambda (x_{nn} - x)$ with $\lambda$ uniform on $[0,1]$. The synthetic point lies somewhere on the segment between two real minority samples, which gives the model new variation rather than repeated duplicates.
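The interpolation step is small enough to sketch directly. The following is a minimal NumPy illustration, not the imbalanced-learn implementation; the function name and arguments are made up for the example:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_samples(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic points by SMOTE-style interpolation."""
    rng = np.random.default_rng(seed)
    # k + 1 neighbors because each point is its own nearest neighbor
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)
    base = rng.integers(0, len(X_min), size=n_new)             # base samples x
    neigh = idx[base, rng.integers(1, k + 1, size=n_new)]      # neighbor x_nn (skip self)
    lam = rng.random((n_new, 1))                               # lambda ~ U[0, 1]
    return X_min[base] + lam * (X_min[neigh] - X_min[base])    # x + lambda (x_nn - x)
```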
SMOTE assumes that the line segment between two minority neighbors stays inside the minority region. That assumption breaks down when the minority class is heavily overlapped with the majority class, when the feature space is very high dimensional (where Euclidean nearest-neighbor distances become unstable), or when features are not actually continuous. Applying SMOTE directly to raw pixels of an image or token IDs of a sentence produces nonsense; for those modalities, augmentation in input space (rotation, crop, paraphrase, back-translation) or interpolation in a learned embedding space is preferred.
Undersampling shrinks the majority class instead of growing the minority class. Random undersampling drops majority examples uniformly at random and is fast but throws away potentially useful data. Informed methods such as Tomek links, Edited Nearest Neighbors, NearMiss, and One-Sided Selection target borderline or redundant majority points. Combined approaches such as SMOTE+Tomek and SMOTE+ENN use oversampling to enrich the minority class and then clean noisy synthetic samples with an undersampling step. These hybrids are common defaults in practice because they address both bias toward the majority class and noise introduced by interpolation.
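A minimal sketch of one such hybrid, SMOTE followed by Tomek-link cleaning via imbalanced-learn; the dataset here is synthetic, generated purely for illustration:

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.combine import SMOTETomek

# Synthetic roughly 19:1 imbalanced dataset, for illustration only
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=42)
X_res, y_res = SMOTETomek(random_state=42).fit_resample(X, y)
print(Counter(y), Counter(y_res))   # class counts before vs. after resampling
```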
Instead of changing the data, you can change the loss. Cost-sensitive learning assigns higher penalty to misclassifying minority instances. In binary terms, the cost matrix has $C(\text{FN}) > C(\text{FP})$, so the optimizer prefers to err on the side of predicting the minority class.
The most common implementation is per-class weighting. scikit-learn supports this directly: passing class_weight='balanced' to most classifiers computes

$w_j = n_{\text{samples}} / (n_{\text{classes}} \cdot n_j)$

so that each class $j$ gets a weight inversely proportional to its frequency $n_j$. For a 950 negative / 50 positive split, the positive class receives weight $1000 / (2 \cdot 50) = 10$ while each negative receives $1000 / (2 \cdot 950) \approx 0.53$, so a positive example counts roughly 19 times as much. The same effect can be achieved per-instance with the sample_weight argument at fit time, which lets you encode richer cost structures than uniform per-class weights.
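The same numbers can be checked with scikit-learn's compute_class_weight utility (documented in the reference at the end of this section):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y = np.array([0] * 950 + [1] * 50)   # the 950 negative / 50 positive split
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)
print(weights)                       # [ 0.52631579 10.        ]
```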
Class weights are attractive because they require no data manipulation and integrate cleanly into cross-validation, but they are not a silver bullet. For very extreme imbalance (IR above 1000) they can cause unstable optimization, and they do not change the underlying lack of minority-class diversity in the training set.
Focal loss, introduced by Lin, Goyal, Girshick, He, and Dollar at ICCV 2017, was designed for one-stage dense object detection (RetinaNet) where the foreground/background ratio in image pixels is roughly 1 to 1000. It modifies the standard cross-entropy loss by adding a factor that down-weights well-classified examples:
$$\text{FL}(p_t) = -\alpha_t \, (1 - p_t)^\gamma \log(p_t)$$
The table below decomposes this expression.
| Symbol | Meaning | Effect on loss |
|---|---|---|
| $p_t$ | Predicted probability for the true class | Confident correct predictions have $p_t \to 1$ |
| $-\log(p_t)$ | Standard cross-entropy term | Penalizes low confidence on the true class |
| $(1 - p_t)^\gamma$ | Modulating factor | Approaches 0 for easy examples, stays near 1 for hard ones |
| $\gamma$ | Focusing parameter, typical value 2 | Higher $\gamma$ pushes more emphasis onto hard examples |
| $\alpha_t$ | Per-class weighting | Lets you also up-weight the minority class |
With $\gamma = 0$ focal loss reduces to ordinary cross-entropy. With $\gamma = 2$ (the default in the original paper) a confidently classified easy example with $p_t = 0.9$ contributes only $0.01$ times its cross-entropy loss, since $(1 - 0.9)^2 = 0.01$, so the gradient is dominated by harder, misclassified samples. Lin et al. report a 2.9-point average precision gain from focal loss on COCO and show the result is fairly robust across choices of $\gamma$ and $\alpha$. Focal loss is now a default choice for detection, segmentation, and other dense prediction problems with severe class imbalance, and it has been adapted to text and tabular tasks as well.
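For concreteness, here is a minimal PyTorch sketch of the binary form, assuming raw logits and float 0/1 targets; it is not the RetinaNet reference implementation:

```python
import torch
import torch.nn.functional as F

def binary_focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Minimal sketch of binary focal loss (Lin et al., 2017)."""
    # targets: float tensor of 0.0 / 1.0 labels; ce = -log(p_t)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)             # prob. of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()       # -alpha_t (1-p_t)^gamma log(p_t)
```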
When the minority class is so rare that even oversampling cannot produce a stable model, it is sometimes more effective to skip binary classification and treat the problem as one-class learning or anomaly detection. A model is trained on the majority ("normal") class, and inputs that look unusual relative to that model are flagged as candidates for the minority class.
This framing is most useful when minority labels are not just rare but unreliable, expensive, or non-existent. It pairs well with downstream human review on flagged candidates and is widely used in fraud, intrusion detection, and industrial sensor monitoring.
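A minimal sketch with scikit-learn's IsolationForest; X_normal and X_new are assumed placeholders for majority-class training rows and incoming data:

```python
from sklearn.ensemble import IsolationForest

# X_normal: rows from the majority ("normal") class only (assumed name)
detector = IsolationForest(contamination=0.002, random_state=42).fit(X_normal)

flags = detector.predict(X_new)          # -1 = anomaly (minority candidate), +1 = normal
scores = detector.score_samples(X_new)   # lower score = more anomalous
```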
A probabilistic classifier produces a score in $[0,1]$ and applies a threshold (usually 0.5) to assign a class. For imbalanced problems, 0.5 is almost always the wrong choice. Lowering the threshold trades precision for recall on the minority class: more borderline cases are flagged as positive, catching more true positives at the cost of more false alarms.
Threshold tuning is cheap because it requires no retraining. The standard recipe is to fit the model normally, then sweep the threshold on a held-out validation set to optimize a target metric such as $F_1$, $F_\beta$ with $\beta > 1$, the precision-recall break-even point, or expected business cost. In production, the same tuning loop can be re-run as base rates drift, without touching the underlying model.
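A sketch of that recipe with scikit-learn; y_val and probs are assumed placeholders for held-out labels and minority-class scores:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# probs could come from model.predict_proba(X_val)[:, 1] (assumed names)
precision, recall, thresholds = precision_recall_curve(y_val, probs)
f1 = 2 * precision * recall / (precision + recall + 1e-12)   # avoid division by zero
best_threshold = thresholds[np.argmax(f1[:-1])]   # final P/R point has no threshold
y_pred = (probs >= best_threshold).astype(int)
```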
Accuracy hides minority-class failure. The metrics below are the ones practitioners actually report.
| Metric | Definition | Why it helps with the minority class |
|---|---|---|
| Precision | $TP / (TP + FP)$ | Among predicted positives, how many are real |
| Recall (sensitivity) | $TP / (TP + FN)$ | Among actual positives, how many we caught |
| F1 score | $2 \cdot P \cdot R / (P + R)$ | Harmonic mean of precision and recall |
| F-beta | $(1 + \beta^2) \cdot P \cdot R / (\beta^2 \cdot P + R)$ | Lets you favor recall ($\beta > 1$) or precision ($\beta < 1$) |
| PR-AUC | Area under the precision-recall curve | Sensitive to false positives even when the negative class is huge |
| ROC-AUC | Area under the ROC curve | Less informative than PR-AUC under heavy imbalance |
| Matthews correlation coefficient (MCC) | Correlation between predicted and true labels using all four cells of the confusion matrix | Single number that is hard to game by predicting the majority class |
| Balanced accuracy | $(\text{Sensitivity} + \text{Specificity}) / 2$ | Per-class accuracy averaged so the minority class has equal weight |
| G-mean | $\sqrt{\text{Sensitivity} \cdot \text{Specificity}}$ | Penalizes a model that wins on one class only |
Saito and Rehmsmeier (2015), in PLOS ONE, ran simulation studies on imbalanced binary classifiers and showed that the precision-recall plot is more informative than the ROC plot in this setting. ROC-AUC can stay deceptively high because the false positive rate is normalized by a very large number of true negatives; PR-AUC reacts directly to false positives among the predicted minority class. Chicco and Jurman (2020), in BMC Genomics, made the parallel case for MCC over $F_1$ and accuracy: MCC is high only when all four entries of the confusion matrix look good, so it cannot be inflated by simply predicting the majority class.
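All of these metrics are one-liners in scikit-learn. A small sketch, assuming y_true (labels), y_pred (hard predictions), and y_score (minority-class probabilities) from a fitted model:

```python
from sklearn.metrics import (average_precision_score, balanced_accuracy_score,
                             f1_score, matthews_corrcoef)

# average_precision_score is a standard estimate of PR-AUC
print("PR-AUC:", average_precision_score(y_true, y_score))
print("MCC:", matthews_corrcoef(y_true, y_pred))
print("Balanced accuracy:", balanced_accuracy_score(y_true, y_pred))
print("F1:", f1_score(y_true, y_pred))
```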
Before declaring a model a success on imbalanced data, check it against trivial baselines:
- A constant classifier that always predicts the majority class; on a 99/1 split it scores 99% accuracy with zero minority recall.
- A plain logistic regression with class_weight='balanced' and no resampling. This is hard to beat without genuine signal.

Any model that does not clearly beat these should be treated with suspicion. "99% accurate" on a 99/1 split is meaningless on its own.
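Both baselines take a few lines with scikit-learn; X_train and y_train are assumed placeholders:

```python
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression

# Baseline 1: always predict the majority class
majority = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)

# Baseline 2: balanced linear model, no resampling
linear = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_train, y_train)
```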
A few common mistakes show up repeatedly in applied work on the minority class.
Generating too many synthetic minority samples can distort the class distribution, especially when SMOTE produces points that fall into majority regions. Keeping the synthetic ratio modest (for example, 1:2 minority to majority after augmentation rather than a fully balanced 1:1) and combining SMOTE with an undersampling cleanup step usually helps.
Applying SMOTE in very high-dimensional spaces is unreliable because nearest-neighbor distances become almost equal between any two points, so the synthetic samples lose their geometric meaning. Reducing dimensionality before SMOTE, or using an embedding-aware variant, is recommended.
Resampling the entire dataset before splitting into train and test leaks information and inflates reported metrics. Resampling must happen inside each cross-validation fold and only on the training portion. The imblearn.pipeline.Pipeline from imbalanced-learn enforces this by skipping the resampler at predict time.
SMOTE on raw text or image inputs makes no sense: interpolating between token IDs or pixel values produces gibberish. Use augmentation in input space or interpolate in a learned embedding instead.
Extremely high class weights or aggressive focal loss settings can trigger numerical instability and divergent training, especially with small batch sizes that may contain no minority samples at all. Class-balanced batch sampling and gradient clipping are common mitigations.
Finally, optimizing for $F_1$ on a fixed validation set can overfit the threshold itself when the validation set is small. Reporting performance over multiple folds, or using nested cross-validation, gives a more honest estimate.
The imbalanced-learn (imblearn) library, created by Guillaume Lemaitre, Fernando Nogueira, and Christos K. Aridas, is the de facto Python toolkit for minority-class problems. It is part of the scikit-learn-contrib ecosystem and exposes resampling estimators that follow the scikit-learn API. The library groups its methods into four categories:
- Over-sampling: RandomOverSampler, SMOTE, SMOTENC, BorderlineSMOTE, SVMSMOTE, KMeansSMOTE, ADASYN.
- Under-sampling: RandomUnderSampler, TomekLinks, NearMiss, EditedNearestNeighbours, OneSidedSelection, CondensedNearestNeighbour.
- Combined over- and under-sampling: SMOTETomek, SMOTEENN.
- Ensemble methods: BalancedRandomForestClassifier, EasyEnsembleClassifier, RUSBoostClassifier, BalancedBaggingClassifier.

A typical pipeline wraps a resampler with a classifier:
```python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# X, y: feature matrix and labels of an imbalanced dataset
pipe = Pipeline([
    ("smote", SMOTE(random_state=42)),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
print(cross_val_score(pipe, X, y, cv=cv, scoring="average_precision").mean())
```
Using imblearn.pipeline.Pipeline (rather than sklearn.pipeline.Pipeline) ensures that SMOTE is applied only during fit, never during predict or score, which keeps test-set evaluation honest.
Deep learning frameworks such as PyTorch and TensorFlow expose WeightedRandomSampler-style utilities to draw class-balanced batches, and most modern training loops accept focal loss or class-balanced cross-entropy as drop-in replacements for the standard loss.
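A minimal sketch of class-balanced batch sampling in PyTorch; X and y are assumed tensors of features and integer labels:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

class_counts = torch.bincount(y)                    # per-class sample counts
sample_weights = (1.0 / class_counts.float())[y]    # rare classes drawn more often
sampler = WeightedRandomSampler(sample_weights, num_samples=len(y), replacement=True)
loader = DataLoader(TensorDataset(X, y), batch_size=64, sampler=sampler)
```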
Minority-class problems have gained new visibility in safety-critical machine learning evaluation. Hallucination detection in large language models is a clear example: factual errors are a small fraction of generations on most prompts, so the positive class ("this output contains a hallucination") is a minority class with all the usual symptoms. Bias and fairness audits, jailbreak detection, agent-trajectory monitoring, and adversarial input detection share the same shape. Each of them needs careful evaluation with PR-AUC and recall on the minority class rather than aggregate accuracy.
Long-tailed visual recognition benchmarks, such as iNaturalist and ImageNet-LT, have driven a wave of methods aimed specifically at the minority ("tail") classes: class-balanced loss (Cui et al., 2019), label-distribution-aware margin loss, decoupled training of feature extractor and classifier (Kang et al., 2020), and balanced contrastive learning. These methods generalize the older imbalanced-learning toolkit to the deep-learning regime and show that even with millions of training samples, the minority-class problem persists as soon as the head and tail of the class distribution are far apart.
scikit-learn developers. compute_class_weight API documentation. https://scikit-learn.org/stable/modules/generated/sklearn.utils.class_weight.compute_class_weight.html