In a classification problem, the minority class is the label with the fewest training instances in an imbalanced dataset. For example, in a credit card transaction dataset where 99.83% of records are legitimate and 0.17% are fraudulent, "fraudulent" is the minority class. Its counterpart is the majority class, which contains the bulk of the samples and dominates the training loss.
The minority class is usually the class of practical interest. Fraud, disease, defects, intrusions, and rare astronomical events are all minority outcomes that carry high cost when missed. Standard learning algorithms tend to ignore them, which is why a large body of techniques (resampling, cost-sensitive losses, focal loss, threshold tuning, anomaly detection) has been developed specifically to recover minority-class performance.
Given a labeled training set with $K$ classes, let $n_k$ denote the number of samples in class $k$. The minority class is $\arg\min_k n_k$. In the binary case the dataset reduces to two counts, $n_+$ and $n_-$, and the minority class is whichever of the two is smaller. The imbalance ratio (IR) is defined as $n_{\text{maj}} / n_{\text{min}}$. A balanced dataset has IR around 1; values above 10 are considered moderately imbalanced; values above 100 are severe; and values above 1000 are extreme, often appearing in fraud, intrusion detection, and rare disease screening.
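As a quick illustration, the counts and imbalance ratio can be computed from a label list with the standard library; the numbers below mirror the credit card example:

```python
from collections import Counter

y = ["legit"] * 9983 + ["fraud"] * 17       # mirrors the 99.83% / 0.17% split
counts = Counter(y)
minority = min(counts, key=counts.get)      # arg min_k n_k
ir = max(counts.values()) / min(counts.values())
print(minority, round(ir, 1))               # fraud 587.2
```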
The term "minority class" is mostly used in supervised binary or multi-class settings. In unsupervised contexts the analogous concept is a rare cluster or an outlier. In multi-label classification, every label can have its own imbalance ratio, and a single instance may carry both majority and minority labels at once.
Several application domains routinely produce datasets where the minority class is the one we want the model to find. The table below summarizes typical positive-class prevalence based on commonly cited benchmarks and surveys.
| Domain | Minority class | Typical prevalence | Why it matters |
|---|---|---|---|
| Credit card fraud detection | Fraudulent transaction | 0.1% to 0.2% of records | Each missed fraud is a direct chargeback or laundered payment |
| Cancer screening | Malignant case in general population | 1% to 5% of screened patients | A false negative delays treatment of a life-threatening disease |
| Click-through prediction | Click event | 0.5% to 5% of impressions | Wrong predictions waste ad spend and suppress revenue per impression |
| Manufacturing defect detection | Defective unit | 0.1% to 2% of units | Defects shipped to customers cause recalls and brand damage |
| Customer churn (in some segments) | Churning customer | 1% to 10% of users in a billing period | Retained customers are far cheaper than acquired ones |
| Network intrusion detection | Malicious packet | Less than 1% of traffic | A single intrusion can compromise an entire system |
| Hallucination detection in LLM outputs | Hallucinated span | A few percent of generations on factual prompts | Determines whether downstream factual checks fire |
| Adverse drug reaction reports | Confirmed serious reaction | Less than 1% of post-market reports | Drives label changes and recalls |
In all of these settings, the cost of a false negative on the minority class is much higher than the cost of a false positive on the majority class. This asymmetry is why metrics built on raw accuracy are misleading for these problems and why the minority class drives almost every modeling choice.
Most classifiers are trained to minimize empirical risk, usually a sum of per-sample losses. When one class dominates the dataset, that class also dominates the loss. The optimization easily finds a decision rule that ignores the minority class entirely and still scores high accuracy. A trivial classifier that predicts "not fraud" on every transaction in the credit card dataset described above achieves about 99.83% accuracy and zero recall on the class anyone actually cares about.
The skewed loss is only one mechanism at work; class overlap and noise compound the imbalance. Krawczyk (2016) shows that the imbalance ratio alone is a poor predictor of difficulty: data complexity factors interact with imbalance to determine how badly classifiers will fail. This is why "just rebalance the data" rarely solves the problem on its own.
Oversampling expands the minority class so the learner sees more of it during training. The simplest form duplicates existing minority samples; more advanced methods generate synthetic samples in feature space.
| Technique | Year | Idea | Strength | Weakness |
|---|---|---|---|---|
| Random oversampling | classical | Duplicate randomly chosen minority samples until the desired ratio is reached | No new assumptions; trivial to implement | Repeated copies tend to cause overfitting |
| SMOTE (Chawla et al.) | 2002 | For each minority sample, pick one of its $k$ nearest minority neighbors and create a synthetic point on the line segment between them | Generates novel samples that vary per run; widely supported | Linear interpolation can land in majority regions when classes overlap |
| Borderline-SMOTE (Han, Wang, Mao) | 2005 | Apply SMOTE only to minority samples whose neighbors include many majority points (those near the boundary) | Focuses synthetic data where the classifier struggles | Borderline detection itself is sensitive to noise |
| SMOTE-NC (Chawla et al.) | 2002 | Handle nominal/categorical features by interpolating numerical features and selecting the most frequent category for nominal ones | Works on mixed-type tables | Requires at least one continuous feature |
| SVMSMOTE | 2009 | Train an SVM, take its minority support vectors as the borderline, and generate synthetic samples around them | Boundary defined by a strong learner rather than $k$-NN voting | Adds the cost of training an SVM upfront |
| ADASYN (He et al.) | 2008 | Weight each minority sample by how many of its $k$ neighbors are majority, then generate more synthetic data for the harder examples | Adaptive density: hard regions get more attention | Can amplify outliers and label noise |
| K-Means SMOTE | 2017 | Cluster the data with K-Means, find clusters dominated by the minority class, and apply SMOTE within them | Avoids generating synthetic data in noisy or empty regions | Adds a clustering step with its own hyperparameters |
| SMOTE+ENN, SMOTE+Tomek | classical | Run SMOTE first, then clean the dataset with Edited Nearest Neighbors or Tomek links | Removes noisy or boundary-crossing synthetic samples | More expensive than SMOTE alone |
| GAN-based oversampling (CTGAN, WGAN-GP, CopulaGAN) | 2019+ | Train a generative adversarial network on the minority class and sample from it | Captures non-linear distributions better than interpolation | Needs enough minority data to train the GAN; mode collapse risk |
SMOTE is the most cited of these methods. The original 2002 paper by Chawla, Bowyer, Hall, and Kegelmeyer published in the Journal of Artificial Intelligence Research has been cited tens of thousands of times. Its core idea is simple: rather than copy an existing minority sample $x$, pick one of its $k$ nearest minority neighbors $x_{nn}$ and produce $x_{\text{new}} = x + \lambda (x_{nn} - x)$ with $\lambda$ uniform on $[0,1]$. The synthetic point lies somewhere on the segment between two real minority samples, which gives the model new variation rather than repeated duplicates.
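The interpolation step is small enough to sketch directly. The following is a minimal NumPy illustration, not the imbalanced-learn implementation; the function name and arguments are made up for the example:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_samples(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic points by SMOTE-style interpolation."""
    rng = np.random.default_rng(seed)
    # k + 1 neighbors because each point is its own nearest neighbor
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)
    base = rng.integers(0, len(X_min), size=n_new)             # base samples x
    neigh = idx[base, rng.integers(1, k + 1, size=n_new)]      # neighbor x_nn (skip self)
    lam = rng.random((n_new, 1))                               # lambda ~ U[0, 1]
    return X_min[base] + lam * (X_min[neigh] - X_min[base])    # x + lambda (x_nn - x)
```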
SMOTE assumes that the line segment between two minority neighbors stays inside the minority region. That assumption breaks down when the minority class is heavily overlapped with the majority class, when the feature space is very high dimensional (where Euclidean nearest-neighbor distances become unstable), or when features are not actually continuous. Applying SMOTE directly to raw pixels of an image or token IDs of a sentence produces nonsense; for those modalities, augmentation in input space (rotation, crop, paraphrase, back-translation) or interpolation in a learned embedding space is preferred.
Undersampling shrinks the majority class instead of growing the minority class. Random undersampling drops majority examples uniformly at random and is fast but throws away potentially useful data. Informed methods such as Tomek links, Edited Nearest Neighbors, NearMiss, and One-Sided Selection target borderline or redundant majority points. Combined approaches such as SMOTE+Tomek and SMOTE+ENN use oversampling to enrich the minority class and then clean noisy synthetic samples with an undersampling step. These hybrids are common defaults in practice because they address both bias toward the majority class and noise introduced by interpolation.
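A minimal sketch of one such hybrid, SMOTE followed by Tomek-link cleaning via imbalanced-learn; the dataset here is synthetic, generated purely for illustration:

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.combine import SMOTETomek

# Synthetic roughly 19:1 imbalanced dataset, for illustration only
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=42)
X_res, y_res = SMOTETomek(random_state=42).fit_resample(X, y)
print(Counter(y), Counter(y_res))   # class counts before vs. after resampling
```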
Instead of changing the data, you can change the loss. Cost-sensitive learning assigns higher penalty to misclassifying minority instances. In binary terms, the cost matrix has $C(\text{FN}) > C(\text{FP})$, so the optimizer prefers to err on the side of predicting the minority class.
The most common implementation is per-class weighting. scikit-learn supports this directly: passing class_weight='balanced' to most classifiers computes

$w_j = n_{\text{samples}} / (n_{\text{classes}} \cdot n_j)$

so that each class $j$ gets a weight inversely proportional to its frequency $n_j$. For a 950 negative / 50 positive split, the positive class receives weight $1000 / (2 \cdot 50) = 10$ while each negative receives $1000 / (2 \cdot 950) \approx 0.53$, so a positive example counts roughly 19 times as much. The same effect can be achieved per-instance with the sample_weight argument at fit time, which lets you encode richer cost structures than uniform per-class weights.
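The same numbers can be checked with scikit-learn's compute_class_weight utility (documented in the reference at the end of this section):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y = np.array([0] * 950 + [1] * 50)   # the 950 negative / 50 positive split
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)
print(weights)                       # [ 0.52631579 10.        ]
```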
Class weights are attractive because they require no data manipulation and integrate cleanly into cross-validation, but they are not a silver bullet. For very extreme imbalance (IR above 1000) they can cause unstable optimization, and they do not change the underlying lack of minority-class diversity in the training set.
Focal loss, introduced by Lin, Goyal, Girshick, He, and Dollar at ICCV 2017, was designed for one-stage dense object detection (RetinaNet) where the foreground/background ratio in image pixels is roughly 1 to 1000. It modifies the standard cross-entropy loss by adding a factor that down-weights well-classified examples:
$$\text{FL}(p_t) = -\alpha_t \, (1 - p_t)^\gamma \log(p_t)$$
The table below decomposes this expression.
| Symbol | Meaning | Effect on loss |
|---|---|---|
| $p_t$ | Predicted probability for the true class | Confident correct predictions have $p_t \to 1$ |
| $-\log(p_t)$ | Standard cross-entropy term | Penalizes low confidence on the true class |
| $(1 - p_t)^\gamma$ | Modulating factor | Approaches 0 for easy examples, stays near 1 for hard ones |
| $\gamma$ | Focusing parameter, typical value 2 | Higher $\gamma$ pushes more emphasis onto hard examples |
| $\alpha_t$ | Per-class weighting | Lets you also up-weight the minority class |
With $\gamma = 0$ focal loss reduces to ordinary cross-entropy. With $\gamma = 2$ (the default in the original paper) a confidently classified easy example with $p_t = 0.9$ contributes only $0.01$ times its cross-entropy loss, since $(1 - 0.9)^2 = 0.01$, so the gradient is dominated by harder, misclassified samples. Lin et al. report a 2.9-point average precision gain from focal loss on COCO and show the result is fairly robust across choices of $\gamma$ and $\alpha$. Focal loss is now a default choice for detection, segmentation, and other dense prediction problems with severe class imbalance, and it has been adapted to text and tabular tasks as well.
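For concreteness, here is a minimal PyTorch sketch of the binary form, assuming raw logits and float 0/1 targets; it is not the RetinaNet reference implementation:

```python
import torch
import torch.nn.functional as F

def binary_focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Minimal sketch of binary focal loss (Lin et al., 2017)."""
    # targets: float tensor of 0.0 / 1.0 labels; ce = -log(p_t)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)             # prob. of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()       # -alpha_t (1-p_t)^gamma log(p_t)
```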
When the minority class is so rare that even oversampling cannot produce a stable model, it is sometimes more effective to skip binary classification and treat the problem as one-class learning or anomaly detection. A model is trained on the majority ("normal") class, and inputs that look unusual relative to that model are flagged as candidates for the minority class.
This framing is most useful when minority labels are not just rare but unreliable, expensive, or non-existent. It pairs well with downstream human review on flagged candidates and is widely used in fraud, intrusion detection, and industrial sensor monitoring.
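A minimal sketch with scikit-learn's IsolationForest; X_normal and X_new are assumed placeholders for majority-class training rows and incoming data:

```python
from sklearn.ensemble import IsolationForest

# X_normal: rows from the majority ("normal") class only (assumed name)
detector = IsolationForest(contamination=0.002, random_state=42).fit(X_normal)

flags = detector.predict(X_new)          # -1 = anomaly (minority candidate), +1 = normal
scores = detector.score_samples(X_new)   # lower score = more anomalous
```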
A probabilistic classifier produces a score in $[0,1]$ and applies a threshold (usually 0.5) to assign a class. For imbalanced problems, 0.5 is almost always the wrong choice. Lowering the threshold trades precision for recall on the minority class: more borderline cases are flagged as positive, catching more true positives at the cost of more false alarms.
Threshold tuning is cheap because it requires no retraining. The standard recipe is to fit the model normally, then sweep the threshold on a held-out validation set to optimize a target metric such as $F_1$, $F_\beta$ with $\beta > 1$, the precision-recall break-even point, or expected business cost. In production, the same tuning loop can be re-run as base rates drift, without touching the underlying model.
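A sketch of that recipe with scikit-learn; y_val and probs are assumed placeholders for held-out labels and minority-class scores:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# probs could come from model.predict_proba(X_val)[:, 1] (assumed names)
precision, recall, thresholds = precision_recall_curve(y_val, probs)
f1 = 2 * precision * recall / (precision + recall + 1e-12)   # avoid division by zero
best_threshold = thresholds[np.argmax(f1[:-1])]   # final P/R point has no threshold
y_pred = (probs >= best_threshold).astype(int)
```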
Accuracy hides minority-class failure. The metrics below are the ones practitioners actually report.
| Metric | Definition | Why it helps with the minority class |
|---|---|---|
| Precision | $TP / (TP + FP)$ | Among predicted positives, how many are real |
| Recall (sensitivity) | $TP / (TP + FN)$ | Among actual positives, how many we caught |
| F1 score | $2 \cdot P \cdot R / (P + R)$ | Harmonic mean of precision and recall |
| F-beta | $(1 + \beta^2) \cdot P \cdot R / (\beta^2 \cdot P + R)$ | Lets you favor recall ($\beta > 1$) or precision ($\beta < 1$) |
| PR-AUC | Area under the precision-recall curve | Sensitive to false positives even when the negative class is huge |
| ROC-AUC | Area under the ROC curve | Less informative than PR-AUC under heavy imbalance |
| Matthews correlation coefficient (MCC) | Correlation between predicted and true labels using all four cells of the confusion matrix | Single number that is hard to game by predicting the majority class |
| Balanced accuracy | $(\text{Sensitivity} + \text{Specificity}) / 2$ | Per-class accuracy averaged so the minority class has equal weight |
| G-mean | $\sqrt{\text{Sensitivity} \cdot \text{Specificity}}$ | Penalizes a model that wins on one class only |
Saito and Rehmsmeier (2015), in PLOS ONE, ran simulation studies on imbalanced binary classifiers and showed that the precision-recall plot is more informative than the ROC plot in this setting. ROC-AUC can stay deceptively high because the false positive rate is normalized by a very large number of true negatives; PR-AUC reacts directly to false positives among the predicted minority class. Chicco and Jurman (2020), in BMC Genomics, made the parallel case for MCC over $F_1$ and accuracy: MCC is high only when all four entries of the confusion matrix look good, so it cannot be inflated by simply predicting the majority class.
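All of these metrics are one-liners in scikit-learn. A small sketch, assuming y_true (labels), y_pred (hard predictions), and y_score (minority-class probabilities) from a fitted model:

```python
from sklearn.metrics import (average_precision_score, balanced_accuracy_score,
                             f1_score, matthews_corrcoef)

# average_precision_score is a standard estimate of PR-AUC
print("PR-AUC:", average_precision_score(y_true, y_score))
print("MCC:", matthews_corrcoef(y_true, y_pred))
print("Balanced accuracy:", balanced_accuracy_score(y_true, y_pred))
print("F1:", f1_score(y_true, y_pred))
```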
Before declaring a model a success on imbalanced data, check it against trivial baselines:
- A constant classifier that always predicts the majority class; on a 99/1 split it scores 99% accuracy with zero minority recall.
- A plain logistic regression with class_weight='balanced' and no resampling. This is hard to beat without genuine signal.

Any model that does not clearly beat these should be treated with suspicion. "99% accurate" on a 99/1 split is meaningless on its own.
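Both baselines take a few lines with scikit-learn; X_train and y_train are assumed placeholders:

```python
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression

# Baseline 1: always predict the majority class
majority = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)

# Baseline 2: balanced linear model, no resampling
linear = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_train, y_train)
```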
A few common mistakes show up repeatedly in applied work on the minority class.
Generating too many synthetic minority samples can distort the class distribution, especially when SMOTE produces points that fall into majority regions. Keeping the synthetic ratio modest (for example, 1:2 minority to majority after augmentation rather than a fully balanced 1:1) and combining SMOTE with an undersampling cleanup step usually helps.
Applying SMOTE in very high-dimensional spaces is unreliable because nearest-neighbor distances become almost equal between any two points, so the synthetic samples lose their geometric meaning. Reducing dimensionality before SMOTE, or using an embedding-aware variant, is recommended.
Resampling the entire dataset before splitting into train and test leaks information and inflates reported metrics. Resampling must happen inside each cross-validation fold and only on the training portion. The imblearn.pipeline.Pipeline from imbalanced-learn enforces this by skipping the resampler at predict time.
SMOTE on raw text or image inputs makes no sense: interpolating between token IDs or pixel values produces gibberish. Use augmentation in input space or interpolate in a learned embedding instead.
Extremely high class weights or aggressive focal loss settings can trigger numerical instability and divergent training, especially with small batch sizes that may contain no minority samples at all. Class-balanced batch sampling and gradient clipping are common mitigations.
Finally, optimizing for $F_1$ on a fixed validation set can overfit the threshold itself when the validation set is small. Reporting performance over multiple folds, or using nested cross-validation, gives a more honest estimate.
The imbalanced-learn (imblearn) library, created by Guillaume Lemaitre, Fernando Nogueira, and Christos K. Aridas, is the de facto Python toolkit for minority-class problems. It is part of the scikit-learn-contrib ecosystem and exposes resampling estimators that follow the scikit-learn API. The library groups its methods into four categories:
- Over-sampling: RandomOverSampler, SMOTE, SMOTENC, BorderlineSMOTE, SVMSMOTE, KMeansSMOTE, ADASYN.
- Under-sampling: RandomUnderSampler, TomekLinks, NearMiss, EditedNearestNeighbours, OneSidedSelection, CondensedNearestNeighbour.
- Combined over- and under-sampling: SMOTETomek, SMOTEENN.
- Ensemble methods: BalancedRandomForestClassifier, EasyEnsembleClassifier, RUSBoostClassifier, BalancedBaggingClassifier.

A typical pipeline wraps a resampler with a classifier:
```python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# X, y: feature matrix and labels of an imbalanced dataset
pipe = Pipeline([
    ("smote", SMOTE(random_state=42)),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
print(cross_val_score(pipe, X, y, cv=cv, scoring="average_precision").mean())
```
Using imblearn.pipeline.Pipeline (rather than sklearn.pipeline.Pipeline) ensures that SMOTE is applied only during fit, never during predict or score, which keeps test-set evaluation honest.
Deep learning frameworks such as PyTorch and TensorFlow expose WeightedRandomSampler-style utilities to draw class-balanced batches, and most modern training loops accept focal loss or class-balanced cross-entropy as drop-in replacements for the standard loss.
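A minimal sketch of class-balanced batch sampling in PyTorch; X and y are assumed tensors of features and integer labels:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

class_counts = torch.bincount(y)                    # per-class sample counts
sample_weights = (1.0 / class_counts.float())[y]    # rare classes drawn more often
sampler = WeightedRandomSampler(sample_weights, num_samples=len(y), replacement=True)
loader = DataLoader(TensorDataset(X, y), batch_size=64, sampler=sampler)
```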
Minority-class problems have gained new visibility in safety-critical machine learning evaluation. Hallucination detection in large language models is a clear example: factual errors are a small fraction of generations on most prompts, so the positive class ("this output contains a hallucination") is a minority class with all the usual symptoms. Bias and fairness audits, jailbreak detection, agent-trajectory monitoring, and adversarial input detection share the same shape. Each of them needs careful evaluation with PR-AUC and recall on the minority class rather than aggregate accuracy.
Long-tailed visual recognition benchmarks, such as iNaturalist and ImageNet-LT, have driven a wave of methods aimed specifically at the minority ("tail") classes: class-balanced loss (Cui et al., 2019), label-distribution-aware margin loss, decoupled training of feature extractor and classifier (Kang et al., 2020), and balanced contrastive learning. These methods generalize the older imbalanced-learning toolkit to the deep-learning regime and show that even with millions of training samples, the minority-class problem persists as soon as the head and tail of the class distribution are far apart.
scikit-learn developers. compute_class_weight API documentation. https://scikit-learn.org/stable/modules/generated/sklearn.utils.class_weight.compute_class_weight.html