In machine learning, the majority class is the class label that appears most frequently in a labeled dataset used for classification. It is the direct counterpart of the minority class, which is the underrepresented label in the same dataset. The terms are most often used in the context of binary classification on a class-imbalanced dataset, where one label dominates the sample count by a large margin. In a fraud detection problem with 99.8% legitimate transactions and 0.2% fraudulent ones, "legitimate" is the majority class. In a chest X-ray screening dataset where 2% of scans show pneumonia, "no pneumonia" is the majority class.
The majority class plays a special role in supervised learning because most standard algorithms minimize a global loss. When one label dominates, the loss landscape is dominated by that label as well, and a model can drive its training error very low simply by predicting the majority answer for everything. Understanding the majority class, measuring how much it dominates the data, and deciding what to do about that dominance is a core part of working with real-world classification problems.
Given a labeled dataset with $N$ examples and $K$ classes, let $N_k$ denote the number of examples with label $k$. The majority class is the label $k^* = \arg\max_k N_k$. In binary problems with classes $\{0, 1\}$, the majority class is simply whichever label has the larger count. The remaining classes are minority classes. When the dataset has only two classes, practitioners usually call them "majority" and "minority" without further qualification.
The degree to which the majority class dominates is captured by the imbalance ratio (IR):
$$\text{IR} = \frac{N_\text{majority}}{N_\text{minority}}$$
An IR of 1 means the data is balanced. An IR of 100 means there are 100 majority examples for every minority example. Some authors instead report the minority prevalence $p = N_\text{minority} / N$, which is more directly comparable to a base rate. The two quantities are related: $\text{IR} = (1 - p) / p$.
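Both quantities are easy to compute directly from label counts. A minimal pure-Python sketch (the label list here is invented for illustration):

```python
from collections import Counter

def imbalance_stats(labels):
    """Return (majority_class, imbalance_ratio, minority_prevalence)."""
    counts = Counter(labels)
    majority, n_majority = counts.most_common(1)[0]
    n_minority = min(counts.values())
    ir = n_majority / n_minority
    p = n_minority / len(labels)
    return majority, ir, p

# 95 "legit" labels and 5 "fraud" labels: IR = 19, p = 0.05
labels = ["legit"] * 95 + ["fraud"] * 5
majority, ir, p = imbalance_stats(labels)
print(majority, ir, p)  # legit 19.0 0.05
```

Note that $(1 - p) / p = 0.95 / 0.05 = 19$, matching the IR, as the relation above requires.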
In most operational machine learning problems, the majority class is whatever label corresponds to the "normal" or "uninteresting" outcome. The minority class is usually the rare event the model is being built to detect. The table below lists representative domains and the typical proportion of the majority class.
| Domain | Majority class | Approximate share | Source |
|---|---|---|---|
| Credit card fraud (Kaggle ULB dataset) | Legitimate transactions | 99.83% | Dal Pozzolo et al., 2014 dataset notes |
| Network intrusion detection (KDD Cup 99) | Normal traffic | Roughly 80% across categories | Tavallaee et al., 2009 |
| Manufacturing defect inspection | Non-defective items | Usually 95% to 99% | He and Garcia, 2009 |
| Online ad click-through rate | No-click impressions | Usually 95% or more | Richardson et al., 2007 |
| Spam email filtering | Variable, often closer to balanced | 50% to 80% non-spam | Dataset dependent |
| Cancer screening (mammography) | Negative scans | Roughly 99% | Woods et al., 1993 |
These percentages move with the operational pipeline. A bank that pre-filters obvious fraud will see a more imbalanced supervised dataset downstream than the raw transaction stream. A hospital that screens only high-risk patients will see a higher minority prevalence than population-level screening. The majority share is a function of both the underlying base rate and the data collection pipeline.
A model that always predicts the majority class achieves accuracy equal to the majority share. On the Kaggle credit card fraud dataset, that constant predictor scores 99.83% accuracy while catching zero fraud. The model is useless, but accuracy looks great. This is the accuracy paradox, and it is the first thing to check when an imbalanced classifier reports a high score. The majority-class baseline (sometimes called the zero rule or majority classifier) sets the accuracy floor that any useful model must clear, and clearing it by only a sliver is rarely interesting.
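The baseline itself is trivial to implement, which is exactly why it is worth running before anything else (a pure-Python sketch with made-up labels mirroring the fraud example):

```python
from collections import Counter

class MajorityClassifier:
    """Zero-rule baseline: always predict the most frequent training label."""
    def fit(self, y_train):
        self.majority_ = Counter(y_train).most_common(1)[0][0]
        return self
    def predict(self, n):
        return [self.majority_] * n

# 998 legitimate (0) vs 2 fraudulent (1) training labels
y_train = [0] * 998 + [1] * 2
clf = MajorityClassifier().fit(y_train)
preds = clf.predict(len(y_train))
accuracy = sum(p == t for p, t in zip(preds, y_train)) / len(y_train)
print(accuracy)  # 0.998 -- high accuracy, zero fraud caught
```

scikit-learn ships the same baseline as `DummyClassifier(strategy="most_frequent")`.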
Most classifiers are trained by minimizing a per-example loss summed across the dataset. With an imbalance ratio of 100, roughly 99% of the gradient signal at each step comes from majority examples. The optimizer therefore spends almost all of its capacity learning to fit the majority distribution. Decision boundaries shift away from the minority class, and the prior probability the model assigns to the majority label drifts upward. He and Garcia (2009) survey this effect in detail and call it one of the central difficulties of imbalanced learning.
The minority class is usually the class of interest. The cost of a missed fraud, a missed disease, or a missed defect is typically much higher than the cost of a false alarm. A model trained naively on the raw majority distribution will have low recall on the minority class even when its overall accuracy is high. This is why so much work on imbalanced learning focuses specifically on what to do with the majority side of the dataset.
Methods that target the majority class fall mostly into the family of undersampling approaches. They reduce the influence of majority examples either by removing some of them outright or by reweighting them so they contribute less to the loss.
| Technique | Origin | What it does to the majority class |
|---|---|---|
| Random undersampling | Long-standing baseline | Drops majority examples uniformly at random until the desired ratio is reached |
| Tomek links removal | Tomek, 1976 | Finds pairs of nearest neighbors with different labels and removes the majority member of each pair |
| Edited Nearest Neighbors (ENN) | Wilson, 1972 | Removes any majority example whose label disagrees with the majority of its k nearest neighbors |
| Condensed Nearest Neighbor (CNN) | Hart, 1968 | Keeps only majority examples needed to correctly classify the rest with a 1-NN rule |
| One-Sided Selection (OSS) | Kubat and Matwin, 1997 | Combines Tomek link removal with CNN, removing borderline noise and redundant interior majority points |
| NearMiss-1, NearMiss-2, NearMiss-3 | Mani and Zhang, 2003 | Selects majority examples based on distances to the nearest or farthest minority examples |
| Cluster-based undersampling | Various | Clusters the majority class and keeps centroids or representatives from each cluster |
Random undersampling is the simplest member of this family. It is fast, easy to reason about, and often a strong baseline, but it can throw away informative examples by chance. The neighborhood-based methods try to be smarter: Tomek links and Edited Nearest Neighbors target majority points sitting on the wrong side of the decision boundary, while One-Sided Selection adds a second pass that condenses the interior of the majority class. Kubat and Matwin (1997) argued that this kind of "one-sided" cleaning is preferable to symmetric resampling because it preserves the structure of the majority distribution where it is unambiguous and only intervenes where the boundary is noisy.
NearMiss takes a distance-based view. The three NearMiss variants differ in which majority points they keep: those closest on average to the nearest minority points (NearMiss-1), those closest to the farthest minority points (NearMiss-2), or a per-minority neighborhood selection (NearMiss-3). All three are sensitive to noise because a single mislabeled minority point can pull many majority points into the kept set.
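Random undersampling, the simplest entry in the table above, fits in a few lines. This is a toy sketch of the idea, not the imbalanced-learn implementation; the `ratio` parameter here is an invented name for the target majority-to-minority ratio:

```python
import random

def random_undersample(X, y, majority_label, ratio=1.0, seed=0):
    """Drop majority examples uniformly at random until
    n_majority <= ratio * n_minority."""
    rng = random.Random(seed)
    majority = [i for i, label in enumerate(y) if label == majority_label]
    minority = [i for i, label in enumerate(y) if label != majority_label]
    keep_n = min(len(majority), int(ratio * len(minority)))
    kept = rng.sample(majority, keep_n) + minority
    rng.shuffle(kept)
    return [X[i] for i in kept], [y[i] for i in kept]

X = list(range(110))
y = [0] * 100 + [1] * 10            # IR = 10
X_res, y_res = random_undersample(X, y, majority_label=0)
# After resampling: 10 majority and 10 minority examples remain.
```

The equivalent call in imbalanced-learn is `RandomUnderSampler`, which also handles multiclass data and arbitrary sampling strategies.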
Undersampling and oversampling are not mutually exclusive. The most popular hybrid pipelines apply SMOTE to the minority class first and then clean the majority side:

- SMOTE + Tomek links: oversample the minority class, then remove both members of any Tomek link pair, cleaning the boundary that synthetic points may have blurred.
- SMOTE + ENN: oversample, then apply Edited Nearest Neighbors to remove examples whose labels disagree with most of their neighbors.
Batista, Prati, and Monard (2004) compared these combinations on a wide range of UCI benchmarks and found that SMOTE + ENN often outperformed SMOTE alone, especially when the original data was noisy.
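The SMOTE half of these pipelines generates synthetic minority examples by interpolating between a minority point and one of its minority nearest neighbors. A toy sketch of that interpolation step (not the reference implementation, and brute-force rather than using a k-d tree):

```python
import math
import random

def smote_sketch(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic points: sample a minority point, pick one
    of its k nearest minority neighbors, and interpolate between them."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        neighbors = sorted((p for p in minority if p is not x),
                           key=lambda p: math.dist(x, p))[:k]
        nn = rng.choice(neighbors)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(xi + gap * (ni - xi)
                               for xi, ni in zip(x, nn)))
    return synthetic

# Four minority points at the corners of the unit square; every
# synthetic point lands on a segment between two of them.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_sketch(minority, n_new=4)
```

A cleaning pass (ENN or Tomek link removal) would then run over the union of real and synthetic points.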
The other lever for handling the majority class is to leave the data untouched and change how the model treats different labels. These approaches are sometimes called cost-sensitive learning.
Class weights scale the per-example loss so that minority examples count for more. In scikit-learn, passing class_weight="balanced" to a classifier sets the weight for class $k$ to $N / (K \cdot N_k)$, the inverse of the class frequency normalized so the weights average to 1. The same idea appears as pos_weight in PyTorch's BCEWithLogitsLoss and as scale_pos_weight in XGBoost. The effect is similar to undersampling the majority class but without throwing away data, which is useful when the dataset is small.
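The balanced-weight formula is easy to verify by hand. A pure-Python sketch of the $N / (K \cdot N_k)$ rule (scikit-learn's `compute_class_weight("balanced", ...)` does the same arithmetic):

```python
from collections import Counter

def balanced_class_weights(y):
    """w_k = N / (K * N_k): inverse class frequency, scaled so the
    per-example average weight is 1."""
    counts = Counter(y)
    n, k = len(y), len(counts)
    return {label: n / (k * n_k) for label, n_k in counts.items()}

y = [0] * 90 + [1] * 10
weights = balanced_class_weights(y)
print(weights)  # {0: 0.555..., 1: 5.0}
```

Each minority example now contributes nine times the loss of a majority example, and the frequency-weighted average of the weights is exactly 1 (0.9 · 0.556 + 0.1 · 5.0 = 1).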
Threshold moving keeps the trained model fixed and adjusts the decision threshold at inference. By default, a binary classifier predicts the positive class when the predicted probability exceeds 0.5. Lowering this threshold to, say, 0.1 trades precision for recall on the minority class. Threshold moving is often the first thing to try after training a probabilistic model on imbalanced data because it requires no retraining.
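A small sketch makes the precision-recall trade visible. The probability scores below are invented: the positives receive moderate scores that a 0.5 threshold misses entirely, and lowering the threshold to 0.1 recovers them at the cost of extra false alarms:

```python
def predict_with_threshold(probs, threshold=0.5):
    """Binary decisions from predicted positive-class probabilities."""
    return [int(p >= threshold) for p in probs]

def precision_recall(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1]
probs  = [0.01, 0.05, 0.1, 0.2, 0.3, 0.6, 0.15, 0.35, 0.4]

for threshold in (0.5, 0.1):
    y_pred = predict_with_threshold(probs, threshold)
    print(threshold, precision_recall(y_true, y_pred))
# At 0.5 the recall is 0.0; at 0.1 it rises to 1.0 with precision 3/7.
```

In practice the threshold is chosen on a validation set, for example by sweeping it and picking the point that maximizes F1 or satisfies a recall target.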
Focal loss, introduced by Lin et al. (2017) for the RetinaNet object detector, modifies cross entropy with a modulating factor $(1 - p_t)^\gamma$ that down-weights well-classified examples. Most majority examples are easy, so focal loss reduces their contribution and lets the optimizer focus on the harder, often minority, examples. The technique was originally developed to handle the extreme foreground-background imbalance in dense object detection, where there are roughly 1,000 background anchors per object, but it is now widely used outside computer vision.
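The effect of the modulating factor is easy to see numerically. A pure-Python sketch of the binary focal loss formula (omitting the optional class-weight term $\alpha$ from the paper):

```python
import math

def cross_entropy(p, y):
    """Standard binary cross entropy for one example."""
    p_t = p if y == 1 else 1.0 - p
    return -math.log(p_t)

def focal_loss(p, y, gamma=2.0):
    """Focal loss: -(1 - p_t)^gamma * log(p_t), where p_t is the
    probability the model assigns to the true class y."""
    p_t = p if y == 1 else 1.0 - p
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

# Easy majority example: true class 0, predicted p = 0.01 (p_t = 0.99).
# Its loss is scaled by (1 - 0.99)^2 = 1e-4, i.e. almost silenced.
print(cross_entropy(0.01, 0), focal_loss(0.01, 0))

# Hard example: true class 1, predicted p = 0.6 (p_t = 0.6).
# Its loss is only scaled by (1 - 0.6)^2 = 0.16, so it dominates.
print(cross_entropy(0.6, 1), focal_loss(0.6, 1))
```

With $\gamma = 0$ the focal loss reduces to plain cross entropy; the paper found $\gamma = 2$ worked best for RetinaNet.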
A single random undersample throws away most of the majority data. Ensemble methods reuse the majority class by training many learners on different subsets and combining their predictions. Liu, Wu, and Zhou (2009) introduced two influential designs:

- EasyEnsemble: draws several independent random undersamples of the majority class, trains an AdaBoost ensemble on each, and combines all the resulting weak learners.
- BalanceCascade: trains learners sequentially, and after each round removes the majority examples the current ensemble already classifies correctly, so later learners concentrate on the harder majority points.
RUSBoost, proposed by Seiffert et al. (2010), combines random undersampling with AdaBoost.M2 in a single algorithm. It tends to match or beat SMOTEBoost on standard imbalanced benchmarks while being noticeably faster, because random undersampling is cheap compared with synthetic generation.
The imbalanced-learn library exposes these as BalancedBaggingClassifier, EasyEnsembleClassifier, BalancedRandomForestClassifier, and RUSBoostClassifier, all with scikit-learn-compatible interfaces.
Accuracy and other metrics that weight per-class performance proportionally to class size are misleading when the majority class dominates. The metrics in the table below are commonly used instead.
| Metric | What it measures | Behavior under imbalance |
|---|---|---|
| Precision | Fraction of predicted positives that are correct | Sensitive to false positives, which dominate when the negative class is huge |
| Recall | Fraction of actual positives that were caught | Insensitive to majority class size; key for rare-event detection |
| F1 score | Harmonic mean of precision and recall | Balances precision and recall; ignores true negatives |
| Balanced accuracy | Mean of per-class recall | Treats both classes equally regardless of size |
| Matthews correlation coefficient (MCC) | Correlation between predicted and actual labels | Uses all four confusion-matrix cells; symmetric in classes |
| Cohen's kappa | Agreement above chance | Adjusts for the high agreement expected from a constant predictor |
| ROC AUC | Area under the receiver operating characteristic curve | Invariant to class balance; can look optimistic on highly imbalanced data |
| PR AUC | Area under the precision-recall curve | Baseline equals minority prevalence; more informative when minority class is small |
Saito and Rehmsmeier (2015) showed that ROC curves can be visually misleading on highly imbalanced data because changing the negative count by an order of magnitude leaves the ROC curve unchanged while completely changing precision. They recommend reporting PR curves and PR AUC alongside or instead of ROC AUC whenever the minority prevalence is below roughly 10%.
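Two of the metrics from the table, evaluated on the majority-class baseline, show why they are preferred over accuracy. A pure-Python sketch on an invented 98:2 split:

```python
import math

def confusion(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall."""
    tp, tn, fp, fn = confusion(y_true, y_pred)
    return 0.5 * (tp / (tp + fn) + tn / (tn + fp))

def mcc(y_true, y_pred):
    """Matthews correlation coefficient from the four confusion cells."""
    tp, tn, fp, fn = confusion(y_true, y_pred)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Constant majority prediction on a 98:2 split: accuracy is 0.98,
# but balanced accuracy is 0.5 and MCC is 0, exposing the useless model.
y_true = [0] * 98 + [1] * 2
y_pred = [0] * 100
print(balanced_accuracy(y_true, y_pred), mcc(y_true, y_pred))  # 0.5 0.0
```

Both metrics treat the constant predictor the same way they would treat a coin flip, which is the behavior you want on imbalanced data.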
Undersampling and reweighting are not free. The main risks include:

- Information loss: discarded majority examples may cover regions of the input space the model never gets to see.
- Increased variance: results depend on which majority examples happen to be kept, so repeated runs can disagree noticeably.
- Miscalibrated probabilities: training on a rebalanced distribution shifts predicted probabilities away from the true base rate.
A model trained on a 1:1 resampled dataset will overpredict the minority class at deployment time unless its threshold or its output probabilities are corrected. This is particularly important in regulated domains like credit scoring or medical screening, where calibrated probabilities are required for downstream decisions.
At extreme imbalance ratios, treating the problem as classification stops paying off. With an imbalance ratio of 10,000 or more, even SMOTE struggles because there are too few minority examples to interpolate from, and even undersampling leaves you with a tiny training set. In that regime it often makes more sense to switch to anomaly detection: model only the majority distribution and flag anything that looks unusual.
Common one-class methods include one-class SVM, isolation forest, local outlier factor, and autoencoder reconstruction error. These approaches treat the majority class as "normal" and learn a description of it, then score new points by how far they deviate. They sidestep the imbalance entirely because they never need minority labels at training time, although they typically need some minority examples for tuning the decision threshold.
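The core idea, fit a description of the majority distribution and score deviation from it, can be illustrated with a deliberately simple per-feature Gaussian model. This is a toy sketch only; real systems would reach for isolation forests or autoencoders as listed above:

```python
import math

class ZScoreDetector:
    """Fit per-feature mean/std on majority-only data; score a new
    point by its largest absolute per-feature z-score."""
    def fit(self, X):
        n, d = len(X), len(X[0])
        self.mean_ = [sum(x[j] for x in X) / n for j in range(d)]
        self.std_ = [math.sqrt(sum((x[j] - self.mean_[j]) ** 2 for x in X) / n)
                     for j in range(d)]
        return self
    def score(self, x):
        return max(abs(x[j] - self.mean_[j]) / self.std_[j]
                   for j in range(len(x)))

# Trained only on "normal" points clustered near the origin;
# no minority labels are needed at training time.
normal = [(0.1, -0.2), (0.0, 0.1), (-0.1, 0.0), (0.2, 0.1), (-0.2, 0.0)]
det = ZScoreDetector().fit(normal)
print(det.score((0.0, 0.1)) < det.score((5.0, 5.0)))  # True
```

The remaining design decision, where to put the alert threshold on the score, is exactly where a handful of labeled minority examples becomes useful.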
The imbalanced-learn library, introduced by Lemaitre, Nogueira, and Aridas in JMLR (2017), is the most widely used implementation of these techniques in Python. It groups methods into four categories: undersampling, oversampling, combinations, and ensembles. All estimators follow the scikit-learn API and can be dropped into existing pipelines. Other relevant libraries include smote-variants, which packages dozens of SMOTE variants, and imbalanced-ensemble, which extends imbalanced-learn with more ensemble methods. In R, the ROSE, UBL, and themis packages cover similar ground.
Class imbalance is not a problem that has been solved and put away. It shows up in nearly every applied machine learning system, from search ranking (clicks are rare relative to impressions) to LLM safety evaluation (hallucinations and policy violations are rare relative to ordinary completions). The basic vocabulary of majority and minority classes, imbalance ratios, undersampling, and threshold moving still applies to these modern problems, and the techniques described above remain part of the standard toolkit. What has changed is the scale: a deep model trained on a billion examples can absorb a much higher absolute number of minority examples than a classical model could, which sometimes lets practitioners skip explicit rebalancing entirely. Whether that is a good idea depends, as it always has, on what the minority class is worth.
Imagine a jar with 99 red marbles and 1 blue marble. If a friend asks you to guess the color of a marble pulled from the jar without looking, your best bet is to always guess red. You will be right 99 times out of 100, and your guessing strategy is super simple: it ignores blue completely. Computers do the same thing when they learn from data. If they see way more red marbles, they learn to just say "red" all the time. The red marbles are the majority class. To teach the computer to also notice the blue marble, you can hide some of the red ones, ask it to pay extra attention when it sees blue, or count blue answers as worth more points than red ones. All of these tricks are different ways of telling the computer that the rare thing actually matters.