In machine learning, the majority class is the class label that appears most frequently in a labeled dataset used for classification. It is the direct counterpart of the minority class, which is the underrepresented label in the same dataset. The terms are most often used in the context of binary classification on a class-imbalanced dataset, where one label dominates the sample count by a large margin. In a fraud detection problem with 99.8% legitimate transactions and 0.2% fraudulent ones, "legitimate" is the majority class. In a chest X-ray screening dataset where 2% of scans show pneumonia, "no pneumonia" is the majority class.
The majority class plays a special role in supervised learning because most standard algorithms minimize a global loss. When one label dominates, the loss landscape is dominated by that label as well, and a model can drive its training error very low simply by predicting the majority answer for everything. Understanding the majority class, measuring how much it dominates the data, and deciding what to do about that dominance is a core part of working with real-world classification problems.
Given a labeled dataset with $N$ examples and $K$ classes, let $N_k$ denote the number of examples with label $k$. The majority class is the label $k^* = \arg\max_k N_k$. In binary problems with classes $\{0, 1\}$, the majority class is simply whichever label has the larger count. The remaining classes are minority classes. When the dataset has only two classes, practitioners usually call them "majority" and "minority" without further qualification.
The degree to which the majority class dominates is captured by the imbalance ratio (IR):
$$\text{IR} = \frac{N_\text{majority}}{N_\text{minority}}$$
An IR of 1 means the data is balanced. An IR of 100 means there are 100 majority examples for every minority example. Some authors instead report the minority prevalence $p = N_\text{minority} / N$, which is more directly comparable to a base rate. The two quantities are related: $\text{IR} = (1 - p) / p$.
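Both quantities are easy to compute directly from label counts. A minimal pure-Python sketch (the label list here is invented for illustration):

```python
from collections import Counter

def imbalance_stats(labels):
    """Return (majority_class, imbalance_ratio, minority_prevalence)."""
    counts = Counter(labels)
    majority, n_majority = counts.most_common(1)[0]
    n_minority = min(counts.values())
    ir = n_majority / n_minority
    p = n_minority / len(labels)
    return majority, ir, p

# 95 "legit" labels and 5 "fraud" labels: IR = 19, p = 0.05
labels = ["legit"] * 95 + ["fraud"] * 5
majority, ir, p = imbalance_stats(labels)
print(majority, ir, p)  # legit 19.0 0.05
```

Note that $(1 - p) / p = 0.95 / 0.05 = 19$, matching the IR, as the relation above requires.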
In most operational machine learning problems, the majority class is whatever label corresponds to the "normal" or "uninteresting" outcome. The minority class is usually the rare event the model is being built to detect. The table below lists representative domains and the typical proportion of the majority class.
| Domain | Majority class | Approximate share | Source |
|---|---|---|---|
| Credit card fraud (Kaggle ULB dataset) | Legitimate transactions | 99.83% | Dal Pozzolo et al., 2014 dataset notes |
| Network intrusion detection (KDD Cup 99) | Normal traffic | Roughly 80% across categories | Tavallaee et al., 2009 |
| Manufacturing defect inspection | Non-defective items | Usually 95% to 99% | He and Garcia, 2009 |
| Online ad click-through rate | No-click impressions | Usually 95% or more | Richardson et al., 2007 |
| Spam email filtering | Variable, often closer to balanced | 50% to 80% non-spam | Dataset dependent |
| Cancer screening (mammography) | Negative scans | Roughly 99% | Woods et al., 1993 |
These percentages move with the operational pipeline. A bank that pre-filters obvious fraud will see a more imbalanced supervised dataset downstream than the raw transaction stream. A hospital that screens only high-risk patients will see a higher minority prevalence than population-level screening. The majority share is a function of both the underlying base rate and the data collection pipeline.
A model that always predicts the majority class achieves accuracy equal to the majority share. On the Kaggle credit card fraud dataset, that constant predictor scores 99.83% accuracy while catching zero fraud. The model is useless, but accuracy looks great. This is the accuracy paradox, and it is the first thing to check when an imbalanced classifier reports a high score. The majority-class baseline (sometimes called the zero rule or majority classifier) sets the accuracy floor that any useful model must clear, and clearing it by only a sliver is rarely interesting.
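The baseline itself is trivial to implement, which is exactly why it is worth running before anything else (a pure-Python sketch with made-up labels mirroring the fraud example):

```python
from collections import Counter

class MajorityClassifier:
    """Zero-rule baseline: always predict the most frequent training label."""
    def fit(self, y_train):
        self.majority_ = Counter(y_train).most_common(1)[0][0]
        return self
    def predict(self, n):
        return [self.majority_] * n

# 998 legitimate (0) vs 2 fraudulent (1) training labels
y_train = [0] * 998 + [1] * 2
clf = MajorityClassifier().fit(y_train)
preds = clf.predict(len(y_train))
accuracy = sum(p == t for p, t in zip(preds, y_train)) / len(y_train)
print(accuracy)  # 0.998 -- high accuracy, zero fraud caught
```

scikit-learn ships the same baseline as `DummyClassifier(strategy="most_frequent")`.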
Most classifiers are trained by minimizing a per-example loss summed across the dataset. With an imbalance ratio of 100, roughly 99% of the gradient signal at each step comes from majority examples. The optimizer therefore spends almost all of its capacity learning to fit the majority distribution. Decision boundaries shift away from the minority class, and the prior probability the model assigns to the majority label drifts upward. He and Garcia (2009) survey this effect in detail and call it one of the central difficulties of imbalanced learning.
The minority class is usually the class of interest. The cost of a missed fraud, a missed disease, or a missed defect is typically much higher than the cost of a false alarm. A model trained naively on the raw majority distribution will have low recall on the minority class even when its overall accuracy is high. This is why so much work on imbalanced learning focuses specifically on what to do with the majority side of the dataset.
Methods that target the majority class fall mostly into the family of undersampling approaches. They reduce the influence of majority examples either by removing some of them outright or by reweighting them so they contribute less to the loss.
| Technique | Origin | What it does to the majority class |
|---|---|---|
| Random undersampling | Long-standing baseline | Drops majority examples uniformly at random until the desired ratio is reached |
| Tomek links removal | Tomek, 1976 | Finds pairs of nearest neighbors with different labels and removes the majority member of each pair |
| Edited Nearest Neighbors (ENN) | Wilson, 1972 | Removes any majority example whose label disagrees with the majority of its k nearest neighbors |
| Condensed Nearest Neighbor (CNN) | Hart, 1968 | Keeps only majority examples needed to correctly classify the rest with a 1-NN rule |
| One-Sided Selection (OSS) | Kubat and Matwin, 1997 | Combines Tomek link removal with CNN, removing borderline noise and redundant interior majority points |
| NearMiss-1, NearMiss-2, NearMiss-3 | Mani and Zhang, 2003 | Selects majority examples based on distances to the nearest or farthest minority examples |
| Cluster-based undersampling | Various | Clusters the majority class and keeps centroids or representatives from each cluster |
Random undersampling is the simplest member of this family. It is fast, easy to reason about, and often a strong baseline, but it can throw away informative examples by chance. The neighborhood-based methods try to be smarter: Tomek links and Edited Nearest Neighbors target majority points sitting on the wrong side of the decision boundary, while One-Sided Selection adds a second pass that condenses the interior of the majority class. Kubat and Matwin (1997) argued that this kind of "one-sided" cleaning is preferable to symmetric resampling because it preserves the structure of the majority distribution where it is unambiguous and only intervenes where the boundary is noisy.
NearMiss takes a distance-based view. The three NearMiss variants differ in which majority points they keep: those closest on average to the nearest minority points (NearMiss-1), those closest to the farthest minority points (NearMiss-2), or a per-minority neighborhood selection (NearMiss-3). All three are sensitive to noise because a single mislabeled minority point can pull many majority points into the kept set.
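Random undersampling, the simplest entry in the table above, fits in a few lines. This is a toy sketch of the idea, not the imbalanced-learn implementation; the `ratio` parameter here is an invented name for the target majority-to-minority ratio:

```python
import random

def random_undersample(X, y, majority_label, ratio=1.0, seed=0):
    """Drop majority examples uniformly at random until
    n_majority <= ratio * n_minority."""
    rng = random.Random(seed)
    majority = [i for i, label in enumerate(y) if label == majority_label]
    minority = [i for i, label in enumerate(y) if label != majority_label]
    keep_n = min(len(majority), int(ratio * len(minority)))
    kept = rng.sample(majority, keep_n) + minority
    rng.shuffle(kept)
    return [X[i] for i in kept], [y[i] for i in kept]

X = list(range(110))
y = [0] * 100 + [1] * 10            # IR = 10
X_res, y_res = random_undersample(X, y, majority_label=0)
# After resampling: 10 majority and 10 minority examples remain.
```

The equivalent call in imbalanced-learn is `RandomUnderSampler`, which also handles multiclass data and arbitrary sampling strategies.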
Undersampling and oversampling are not mutually exclusive. The most popular hybrid pipelines apply SMOTE to the minority class first and then clean the majority side:

- SMOTE + Tomek links: oversample the minority class, then remove both members of any Tomek link pair, cleaning the boundary that synthetic points may have blurred.
- SMOTE + ENN: oversample, then apply Edited Nearest Neighbors to remove examples whose labels disagree with most of their neighbors.
Batista, Prati, and Monard (2004) compared these combinations on a wide range of UCI benchmarks and found that SMOTE + ENN often outperformed SMOTE alone, especially when the original data was noisy.
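The SMOTE half of these pipelines generates synthetic minority examples by interpolating between a minority point and one of its minority nearest neighbors. A toy sketch of that interpolation step (not the reference implementation, and brute-force rather than using a k-d tree):

```python
import math
import random

def smote_sketch(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic points: sample a minority point, pick one
    of its k nearest minority neighbors, and interpolate between them."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        neighbors = sorted((p for p in minority if p is not x),
                           key=lambda p: math.dist(x, p))[:k]
        nn = rng.choice(neighbors)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(xi + gap * (ni - xi)
                               for xi, ni in zip(x, nn)))
    return synthetic

# Four minority points at the corners of the unit square; every
# synthetic point lands on a segment between two of them.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_sketch(minority, n_new=4)
```

A cleaning pass (ENN or Tomek link removal) would then run over the union of real and synthetic points.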
The other lever for handling the majority class is to leave the data untouched and change how the model treats different labels. These approaches are sometimes called cost-sensitive learning.
Class weights scale the per-example loss so that minority examples count for more. In scikit-learn, passing class_weight="balanced" to a classifier sets the weight for class $k$ to $N / (K \cdot N_k)$, the inverse of the class frequency normalized so the weights average to 1. The same idea appears as pos_weight in PyTorch's BCEWithLogitsLoss and as scale_pos_weight in XGBoost. The effect is similar to undersampling the majority class but without throwing away data, which is useful when the dataset is small.
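The balanced-weight formula is easy to verify by hand. A pure-Python sketch of the $N / (K \cdot N_k)$ rule (scikit-learn's `compute_class_weight("balanced", ...)` does the same arithmetic):

```python
from collections import Counter

def balanced_class_weights(y):
    """w_k = N / (K * N_k): inverse class frequency, scaled so the
    per-example average weight is 1."""
    counts = Counter(y)
    n, k = len(y), len(counts)
    return {label: n / (k * n_k) for label, n_k in counts.items()}

y = [0] * 90 + [1] * 10
weights = balanced_class_weights(y)
print(weights)  # {0: 0.555..., 1: 5.0}
```

Each minority example now contributes nine times the loss of a majority example, and the frequency-weighted average of the weights is exactly 1 (0.9 · 0.556 + 0.1 · 5.0 = 1).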
Threshold moving keeps the trained model fixed and adjusts the decision threshold at inference. By default, a binary classifier predicts the positive class when the predicted probability exceeds 0.5. Lowering this threshold to, say, 0.1 trades precision for recall on the minority class. Threshold moving is often the first thing to try after training a probabilistic model on imbalanced data because it requires no retraining.
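A small sketch makes the precision-recall trade visible. The probability scores below are invented: the positives receive moderate scores that a 0.5 threshold misses entirely, and lowering the threshold to 0.1 recovers them at the cost of extra false alarms:

```python
def predict_with_threshold(probs, threshold=0.5):
    """Binary decisions from predicted positive-class probabilities."""
    return [int(p >= threshold) for p in probs]

def precision_recall(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1]
probs  = [0.01, 0.05, 0.1, 0.2, 0.3, 0.6, 0.15, 0.35, 0.4]

for threshold in (0.5, 0.1):
    y_pred = predict_with_threshold(probs, threshold)
    print(threshold, precision_recall(y_true, y_pred))
# At 0.5 the recall is 0.0; at 0.1 it rises to 1.0 with precision 3/7.
```

In practice the threshold is chosen on a validation set, for example by sweeping it and picking the point that maximizes F1 or satisfies a recall target.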
Focal loss, introduced by Lin et al. (2017) for the RetinaNet object detector, modifies cross entropy with a modulating factor $(1 - p_t)^\gamma$ that down-weights well-classified examples. Most majority examples are easy, so focal loss reduces their contribution and lets the optimizer focus on the harder, often minority, examples. The technique was originally developed to handle the extreme foreground-background imbalance in dense object detection, where there are roughly 1,000 background anchors per object, but it is now widely used outside computer vision.
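The effect of the modulating factor is easy to see numerically. A pure-Python sketch of the binary focal loss formula (omitting the optional class-weight term $\alpha$ from the paper):

```python
import math

def cross_entropy(p, y):
    """Standard binary cross entropy for one example."""
    p_t = p if y == 1 else 1.0 - p
    return -math.log(p_t)

def focal_loss(p, y, gamma=2.0):
    """Focal loss: -(1 - p_t)^gamma * log(p_t), where p_t is the
    probability the model assigns to the true class y."""
    p_t = p if y == 1 else 1.0 - p
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

# Easy majority example: true class 0, predicted p = 0.01 (p_t = 0.99).
# Its loss is scaled by (1 - 0.99)^2 = 1e-4, i.e. almost silenced.
print(cross_entropy(0.01, 0), focal_loss(0.01, 0))

# Hard example: true class 1, predicted p = 0.6 (p_t = 0.6).
# Its loss is only scaled by (1 - 0.6)^2 = 0.16, so it dominates.
print(cross_entropy(0.6, 1), focal_loss(0.6, 1))
```

With $\gamma = 0$ the focal loss reduces to plain cross entropy; the paper found $\gamma = 2$ worked best for RetinaNet.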
A single random undersample throws away most of the majority data. Ensemble methods reuse the majority class by training many learners on different subsets and combining their predictions. Liu, Wu, and Zhou (2009) introduced two influential designs:

- EasyEnsemble: draws several independent random undersamples of the majority class, trains an AdaBoost ensemble on each, and combines all the resulting weak learners.
- BalanceCascade: trains learners sequentially, and after each round removes the majority examples the current ensemble already classifies correctly, so later learners concentrate on the harder majority points.
RUSBoost, proposed by Seiffert et al. (2010), combines random undersampling with AdaBoost.M2 in a single algorithm. It tends to match or beat SMOTEBoost on standard imbalanced benchmarks while being noticeably faster, because random undersampling is cheap compared with synthetic generation.
The imbalanced-learn library exposes these as BalancedBaggingClassifier, EasyEnsembleClassifier, BalancedRandomForestClassifier, and RUSBoostClassifier, all with scikit-learn-compatible interfaces.
Accuracy and other metrics that weight per-class performance proportionally to class size are misleading when the majority class dominates. The metrics in the table below are commonly used instead.
| Metric | What it measures | Behavior under imbalance |
|---|---|---|
| Precision | Fraction of predicted positives that are correct | Sensitive to false positives, which dominate when the negative class is huge |
| Recall | Fraction of actual positives that were caught | Insensitive to majority class size; key for rare-event detection |
| F1 score | Harmonic mean of precision and recall | Balances precision and recall; ignores true negatives |
| Balanced accuracy | Mean of per-class recall | Treats both classes equally regardless of size |
| Matthews correlation coefficient (MCC) | Correlation between predicted and actual labels | Uses all four confusion-matrix cells; symmetric in classes |
| Cohen's kappa | Agreement above chance | Adjusts for the high agreement expected from a constant predictor |
| ROC AUC | Area under the receiver operating characteristic curve | Invariant to class balance; can look optimistic on highly imbalanced data |
| PR AUC | Area under the precision-recall curve | Baseline equals minority prevalence; more informative when minority class is small |
Saito and Rehmsmeier (2015) showed that ROC curves can be visually misleading on highly imbalanced data because changing the negative count by an order of magnitude leaves the ROC curve unchanged while completely changing precision. They recommend reporting PR curves and PR AUC alongside or instead of ROC AUC whenever the minority prevalence is below roughly 10%.
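Two of the metrics from the table, evaluated on the majority-class baseline, show why they are preferred over accuracy. A pure-Python sketch on an invented 98:2 split:

```python
import math

def confusion(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall."""
    tp, tn, fp, fn = confusion(y_true, y_pred)
    return 0.5 * (tp / (tp + fn) + tn / (tn + fp))

def mcc(y_true, y_pred):
    """Matthews correlation coefficient from the four confusion cells."""
    tp, tn, fp, fn = confusion(y_true, y_pred)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Constant majority prediction on a 98:2 split: accuracy is 0.98,
# but balanced accuracy is 0.5 and MCC is 0, exposing the useless model.
y_true = [0] * 98 + [1] * 2
y_pred = [0] * 100
print(balanced_accuracy(y_true, y_pred), mcc(y_true, y_pred))  # 0.5 0.0
```

Both metrics treat the constant predictor the same way they would treat a coin flip, which is the behavior you want on imbalanced data.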
Undersampling and reweighting are not free. The main risks include:

- Information loss: discarded majority examples may cover regions of the input space the model never gets to see.
- Increased variance: results depend on which majority examples happen to be kept, so repeated runs can disagree noticeably.
- Miscalibrated probabilities: training on a rebalanced distribution shifts predicted probabilities away from the true base rate.
A model trained on a 1:1 resampled dataset will overpredict the minority class at deployment time unless its threshold or its output probabilities are corrected. This is particularly important in regulated domains like credit scoring or medical screening, where calibrated probabilities are required for downstream decisions.
At extreme imbalance ratios, treating the problem as classification stops paying off. With an imbalance ratio of 10,000 or more, even SMOTE struggles because there are too few minority examples to interpolate from, and even undersampling leaves you with a tiny training set. In that regime it often makes more sense to switch to anomaly detection: model only the majority distribution and flag anything that looks unusual.
Common one-class methods include one-class SVM, isolation forest, local outlier factor, and autoencoder reconstruction error. These approaches treat the majority class as "normal" and learn a description of it, then score new points by how far they deviate. They sidestep the imbalance entirely because they never need minority labels at training time, although they typically need some minority examples for tuning the decision threshold.
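The core idea, fit a description of the majority distribution and score deviation from it, can be illustrated with a deliberately simple per-feature Gaussian model. This is a toy sketch only; real systems would reach for isolation forests or autoencoders as listed above:

```python
import math

class ZScoreDetector:
    """Fit per-feature mean/std on majority-only data; score a new
    point by its largest absolute per-feature z-score."""
    def fit(self, X):
        n, d = len(X), len(X[0])
        self.mean_ = [sum(x[j] for x in X) / n for j in range(d)]
        self.std_ = [math.sqrt(sum((x[j] - self.mean_[j]) ** 2 for x in X) / n)
                     for j in range(d)]
        return self
    def score(self, x):
        return max(abs(x[j] - self.mean_[j]) / self.std_[j]
                   for j in range(len(x)))

# Trained only on "normal" points clustered near the origin;
# no minority labels are needed at training time.
normal = [(0.1, -0.2), (0.0, 0.1), (-0.1, 0.0), (0.2, 0.1), (-0.2, 0.0)]
det = ZScoreDetector().fit(normal)
print(det.score((0.0, 0.1)) < det.score((5.0, 5.0)))  # True
```

The remaining design decision, where to put the alert threshold on the score, is exactly where a handful of labeled minority examples becomes useful.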
The imbalanced-learn library, introduced by Lemaitre, Nogueira, and Aridas in JMLR (2017), is the most widely used implementation of these techniques in Python. It groups methods into four categories: undersampling, oversampling, combinations, and ensembles. All estimators follow the scikit-learn API and can be dropped into existing pipelines. Other relevant libraries include smote-variants, which packages dozens of SMOTE variants, and imbalanced-ensemble, which extends imbalanced-learn with more ensemble methods. In R, the ROSE, UBL, and themis packages cover similar ground.
Class imbalance is not a problem that has been solved and put away. It shows up in nearly every applied machine learning system, from search ranking (clicks are rare relative to impressions) to LLM safety evaluation (hallucinations and policy violations are rare relative to ordinary completions). The basic vocabulary of majority and minority classes, imbalance ratios, undersampling, and threshold moving still applies to these modern problems, and the techniques described above remain part of the standard toolkit. What has changed is the scale: a deep model trained on a billion examples can absorb a much higher absolute number of minority examples than a classical model could, which sometimes lets practitioners skip explicit rebalancing entirely. Whether that is a good idea depends, as it always has, on what the minority class is worth.
Imagine a jar with 99 red marbles and 1 blue marble. If a friend asks you to guess the color of a marble pulled from the jar without looking, your best bet is to always guess red. You will be right 99 times out of 100, and your guessing strategy is super simple: it ignores blue completely. Computers do the same thing when they learn from data. If they see way more red marbles, they learn to just say "red" all the time. The red marbles are the majority class. To teach the computer to also notice the blue marble, you can hide some of the red ones, ask it to pay extra attention when it sees blue, or count blue answers as worth more points than red ones. All of these tricks are different ways of telling the computer that the rare thing actually matters.