# Majority class

> Source: https://aiwiki.ai/wiki/majority_class
> Updated: 2026-06-27
> Categories: Machine Learning
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

In [machine learning](/wiki/machine_learning), the **majority class** is the class label that appears most frequently in a labeled [dataset](/wiki/dataset) used for [classification](/wiki/classification). It is the direct counterpart of the [minority class](/wiki/minority_class), the underrepresented label in the same dataset. Google's Machine Learning Glossary defines it concisely: "The more common label in a class-imbalanced dataset. For example, given a dataset containing 99% negative labels and 1% positive labels, the negative labels are the majority class."[16] The terms are most often used in the context of [binary classification](/wiki/binary_classification) on a [class-imbalanced dataset](/wiki/class-imbalanced_dataset), where one label dominates the sample count by a large margin. In a fraud detection problem with 99.8% legitimate transactions and 0.2% fraudulent ones, "legitimate" is the majority class. In a chest X-ray screening dataset where 2% of scans show pneumonia, "no pneumonia" is the majority class.

The majority class plays a special role in supervised learning because most standard algorithms minimize a global loss. When one label dominates, the loss landscape is dominated by that label as well, and a model can drive its training error very low simply by predicting the majority answer for everything.[8] Understanding the majority class, measuring how much it dominates the data, and deciding what to do about that dominance is a core part of working with real-world classification problems.

## What is the majority class? Definition and notation

Given a labeled dataset with $N$ examples and $K$ classes, let $N_k$ denote the number of examples with label $k$. The majority class is the label $k^* = \arg\max_k N_k$. In binary problems with classes $\{0, 1\}$, this collapses to whichever count is larger. The remaining classes are minority classes. When the dataset has only two classes, practitioners usually call them "majority" and "minority" without further qualification. As Google's Machine Learning Crash Course puts it, "The more common label is called the majority class. The less common label is called the minority class."[17]

The degree to which the majority class dominates is captured by the **imbalance ratio (IR)**:

$$\text{IR} = \frac{N_\text{majority}}{N_\text{minority}}$$

An IR of 1 means the data is balanced. An IR of 100 means there are 100 majority examples for every minority example. Google's glossary illustrates the extreme end with a binary dataset of "1,000,000 negative labels [and] 10 positive labels," a ratio of 100,000 to 1.[16] Some authors instead report the **minority prevalence** $p = N_\text{minority} / N$, which is more directly comparable to a base rate. The two quantities are related: $\text{IR} = (1 - p) / p$.
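
As an illustration, both quantities can be computed directly from a list of labels. The sketch below is a minimal example on a toy dataset; the function name `imbalance_summary` and the 990/10 split are assumptions made for illustration, and for multiclass data it takes the smallest class as "the" minority.

```python
from collections import Counter

def imbalance_summary(labels):
    """Return the majority label, imbalance ratio, and minority prevalence."""
    counts = Counter(labels)
    # most_common() sorts (label, count) pairs by descending count.
    ranked = counts.most_common()
    majority_label, n_majority = ranked[0]
    _, n_minority = ranked[-1]  # smallest class
    n_total = sum(counts.values())
    return {
        "majority_label": majority_label,
        "imbalance_ratio": n_majority / n_minority,
        "minority_prevalence": n_minority / n_total,
    }

# Toy binary dataset (hypothetical): 990 negatives (0) and 10 positives (1).
labels = [0] * 990 + [1] * 10
summary = imbalance_summary(labels)
print(summary)
```

On this toy data the imbalance ratio is 990 / 10 = 99 and the minority prevalence is 10 / 1000 = 0.01, which agrees with $(1 - p) / p = 0.99 / 0.01 = 99$.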

## What are real-world examples of the majority class?

In most operational machine learning problems, the majority class is whatever label corresponds to the "normal" or "uninteresting" outcome. The minority class is usually the rare event the model is being built to detect. Google's Crash Course gives two canonical illustrations: in credit card data, "fraudulent purchases might make up less than 0.1% of the examples," and in medical diagnosis, "the number of patients with a rare virus might be less than 0.01% of the total examples."[17] The table below lists representative domains and the typical proportion of the majority class.

| Domain | Majority class | Approximate share | Source |
|---|---|---|---|
| Credit card fraud (Kaggle ULB dataset) | Legitimate transactions | 99.83% | Pozzolo et al., 2014 dataset notes |
| Network intrusion detection (KDD Cup 99) | Normal traffic | Roughly 80% across categories | Tavallaee et al., 2009 |
| Manufacturing defect inspection | Non-defective items | Usually 95% to 99% | He and Garcia, 2009 |
| Online ad click-through rate | No-click impressions | Usually 95% or more | Richardson et al., 2007 |
| Spam email filtering | Variable, often closer to balanced | 50% to 80% non-spam | Dataset dependent |
| Cancer screening (mammography) | Negative scans | Roughly 99% | Woods et al., 1993 |

These percentages move with the operational pipeline. A bank that pre-filters obvious fraud will see a more imbalanced supervised dataset downstream than the raw transaction stream. A hospital that screens only high-risk patients will see a higher minority prevalence than population-level screening. The majority share is a function of both the underlying base rate and the data collection pipeline.

## Why is the majority class a problem?

### The accuracy paradox

A model that always predicts the majority class achieves accuracy equal to the majority share. On the Kaggle credit card fraud dataset, that constant predictor scores 99.83% [accuracy](/wiki/accuracy) while catching zero fraud.[15] The model is useless, but its accuracy looks excellent. This is the **accuracy paradox**, and it is the first thing to check when an imbalanced classifier reports a high score. The majority-class baseline (sometimes called the **zero rule** or **majority classifier**) sets the accuracy floor that any useful model must clear, usually by a wide margin.
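
The paradox is easy to reproduce. The following sketch builds a zero-rule predictor on toy labels whose 9,983 / 17 split is a made-up stand-in for a 99.83% majority share; the helper names are hypothetical and not taken from any library.

```python
from collections import Counter

def zero_rule_predict(train_labels, n_predictions):
    """Predict the most frequent training label for every input."""
    majority_label = Counter(train_labels).most_common(1)[0][0]
    return [majority_label] * n_predictions

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred, positive=1):
    n_positive = sum(t == positive for t in y_true)
    hits = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    return hits / n_positive

# Toy labels (hypothetical): 9,983 legitimate (0) and 17 fraudulent (1).
y = [0] * 9983 + [1] * 17
y_pred = zero_rule_predict(y, len(y))

baseline_accuracy = accuracy(y, y_pred)  # equals the majority share
baseline_recall = recall(y, y_pred)      # no positive is ever predicted
print(baseline_accuracy, baseline_recall)
```

The constant predictor reaches an accuracy of 0.9983 with a minority-class recall of 0, so any candidate model should be compared against this number rather than against 50%.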

### Loss imbalance during training

Most classifiers are trained by minimizing a per-example loss summed across the dataset. With an imbalance ratio of 100, roughly 99% of the gradient signal at each step comes from majority examples. The optimizer therefore spends almost all of its capacity learning to fit the majority distribution. Decision boundaries shift away from the minority class, and the prior probability the model assigns to the majority label drifts upward. He and Garcia (2009) survey this effect in detail and call it one of the central difficulties of imbalanced learning.[8]

### Loss of practical signal

The minority class is usually the class of interest. The cost of a missed fraud, a missed disease, or a missed defect is typically much higher than the cost of a false alarm. A model trained naively on the raw majority distribution will have low [recall](/wiki/recall) on the minority class even when its overall accuracy is high. This is why so much work on imbalanced learning focuses specifically on what to do with the majority side of the dataset.

## How do you handle the majority class with resampling?

Methods that target the majority class fall mostly into the family of **undersampling** approaches. They reduce the influence of majority examples either by removing some of them outright or by reweighting them so they contribute less to the loss. Google's Crash Course describes the core idea in two steps: "Downsampling means training on a disproportionately low percentage of majority class examples," after which "you must 'upweight' the majority classes by the factor to which you downsampled."[17]
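
A minimal sketch of the downsample-then-upweight idea follows. It assumes a downsampling factor of 20 and a toy dataset of 1,000 majority and 10 minority examples; the function name and signature are illustrative only, and the resulting weights would be passed to whatever per-example weighting mechanism the training code supports.

```python
import random

def downsample_and_upweight(examples, labels, majority_label, factor, seed=0):
    """Keep about 1/factor of the majority examples and weight the kept ones by factor."""
    rng = random.Random(seed)
    majority_idx = [i for i, y in enumerate(labels) if y == majority_label]
    other_idx = [i for i, y in enumerate(labels) if y != majority_label]
    n_keep = max(1, len(majority_idx) // factor)
    kept_majority = set(rng.sample(majority_idx, n_keep))
    kept = sorted(kept_majority | set(other_idx))
    # Kept majority examples stand in for `factor` originals each.
    weights = [float(factor) if i in kept_majority else 1.0 for i in kept]
    return [examples[i] for i in kept], [labels[i] for i in kept], weights

# Toy data (hypothetical): 1,000 majority (0) and 10 minority (1) examples.
X = list(range(1010))
y = [0] * 1000 + [1] * 10
X_small, y_small, w_small = downsample_and_upweight(X, y, majority_label=0, factor=20)
print(y_small.count(0), y_small.count(1), sum(w_small))
```

The kept majority examples carry a total weight of 50 × 20 = 1,000, so the weighted class totals match the original data even though the model sees far fewer majority rows.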

| Technique | Origin | What it does to the majority class |
|---|---|---|
| [Random undersampling](/wiki/undersampling) | Long-standing baseline | Drops majority examples uniformly at random until the desired ratio is reached |
| Tomek links removal | Tomek, 1976 | Finds pairs of mutual nearest neighbors with different labels and removes the majority member of each pair |
| Edited Nearest Neighbors (ENN) | Wilson, 1972 | Removes any majority example whose label disagrees with the majority of its k nearest neighbors |
| Condensed Nearest Neighbor (CNN) | Hart, 1968 | Keeps only majority examples needed to correctly classify the rest with a 1-NN rule |
| One-Sided Selection (OSS) | Kubat and Matwin, 1997 | Combines Tomek link removal with CNN, removing borderline noise and redundant interior majority points |
| NearMiss-1, NearMiss-2, NearMiss-3 | Mani and Zhang, 2003 | Selects majority examples based on distances to the nearest or farthest minority examples |
| Cluster-based undersampling | Various | Clusters the majority class and keeps centroids or representatives from each cluster |

[Random undersampling](/wiki/undersampling) is the simplest member of this family. It is fast, easy to reason about, and often a strong baseline, but it can throw away informative examples by chance. The neighborhood-based methods try to be smarter: [Tomek links](/wiki/tomek_links)[3] and Edited Nearest Neighbors[1] target majority points sitting on the wrong side of the decision boundary, while One-Sided Selection adds a second pass that condenses the interior of the majority class.[2] Kubat and Matwin (1997) argued that this kind of "one-sided" cleaning is preferable to symmetric resampling because it preserves the structure of the majority distribution where it is unambiguous and only intervenes where the boundary is noisy.[4]
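
To make the Tomek-link idea concrete, the sketch below finds pairs of points that are each other's nearest neighbor and carry different labels, then drops the majority member of each pair. It is a brute-force illustration using Euclidean distance on a tiny hand-made dataset; the function names are hypothetical, and a practical implementation would use a spatial index rather than an all-pairs search.

```python
import math

def nearest_neighbor(i, points):
    """Index of the point closest to points[i], excluding i itself."""
    best_j, best_d = None, math.inf
    for j, q in enumerate(points):
        if j == i:
            continue
        d = math.dist(points[i], q)
        if d < best_d:
            best_j, best_d = j, d
    return best_j

def tomek_links(points, labels):
    """Pairs (i, j), i < j, that are mutual nearest neighbors with different labels."""
    nn = [nearest_neighbor(i, points) for i in range(len(points))]
    return [(i, j) for i, j in enumerate(nn)
            if i < j and nn[j] == i and labels[i] != labels[j]]

def remove_majority_in_links(points, labels, majority_label):
    """Drop the majority member of every Tomek link."""
    drop = {k for pair in tomek_links(points, labels)
            for k in pair if labels[k] == majority_label}
    keep = [k for k in range(len(points)) if k not in drop]
    return [points[k] for k in keep], [labels[k] for k in keep]

# Toy data (hypothetical): label 0 is the majority, label 1 the minority.
points = [(0.0, 0.0), (0.2, 0.0), (1.0, 0.0), (1.1, 0.0), (3.0, 0.0)]
labels = [0, 0, 0, 1, 0]
links = tomek_links(points, labels)
clean_points, clean_labels = remove_majority_in_links(points, labels, majority_label=0)
print(links, clean_labels)
```

Here the only link is the pair at indices 2 and 3: the majority point at x = 1.0 sits right next to the minority point at x = 1.1, so it is removed while the unambiguous majority points are left alone.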

NearMiss takes a distance-based view. The three NearMiss variants differ in which majority points they keep: those closest on average to the nearest minority points (NearMiss-1), those closest to the farthest minority points (NearMiss-2), or a per-minority neighborhood selection (NearMiss-3).[6] All three are sensitive to noise because a single mislabeled minority point can pull many majority points into the kept set.
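
The NearMiss-1 rule can be sketched in a few lines. The example below assumes Euclidean distance, a neighborhood of `k` minority points, and a hypothetical helper name `near_miss_1`; it shows the selection step only, not a tuned implementation.

```python
import math

def near_miss_1(points, labels, majority_label, n_keep, k=3):
    """Indices of the n_keep majority points with the smallest average
    distance to their k nearest minority points."""
    minority = [p for p, y in zip(points, labels) if y != majority_label]

    def avg_dist_to_nearest_minority(p):
        nearest = sorted(math.dist(p, m) for m in minority)[:k]
        return sum(nearest) / len(nearest)

    majority_idx = [i for i, y in enumerate(labels) if y == majority_label]
    ranked = sorted(majority_idx, key=lambda i: avg_dist_to_nearest_minority(points[i]))
    return sorted(ranked[:n_keep])

# Toy 1-D data (hypothetical): two minority points near the origin,
# four majority points at increasing distance from them.
points = [(0.0,), (1.0,), (0.5,), (2.0,), (5.0,), (10.0,)]
labels = [1, 1, 0, 0, 0, 0]
kept = near_miss_1(points, labels, majority_label=0, n_keep=2, k=2)
print(kept)
```

The two majority points closest to the minority cluster (indices 2 and 3) are kept and the distant ones are discarded. Relabeling a far-away majority point as minority in this toy data would change which points are kept, which illustrates the noise sensitivity described above.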

## How do you combine majority undersampling with oversampling?

Undersampling and oversampling are not mutually exclusive. The most popular hybrid pipelines apply [SMOTE](/wiki/smote)[5] to the minority class first and then clean the majority side:

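- **SMOTE + Tomek**: After SMOTE generates synthetic minority points, Tomek links are identified on the augmented data and removed. Implementations differ on whether only the majority member of each link or both members are deleted; either way, the step removes points that sit right next to an example of the opposite label, which sharpens the boundary.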
- **SMOTE + ENN**: Edited Nearest Neighbors is applied after SMOTE to remove any example, majority or minority, that disagrees with its k nearest neighbors. ENN is more aggressive than Tomek and tends to clean more noise at the cost of removing more data.

Batista, Prati, and Monard (2004) compared these combinations on a wide range of UCI benchmarks and found that SMOTE + ENN often outperformed SMOTE alone, especially when the original data was noisy.[7]

## How do you handle the majority class without resampling?

The other lever for handling the majority class is to leave the data untouched and change how the model treats different labels. These approaches are sometimes called [cost-sensitive learning](/wiki/cost-sensitive_learning).

[Class weights](/wiki/class_weight) scale the per-example loss so that minority examples count for more. In [scikit-learn](/wiki/scikit-learn), passing `class_weight="balanced"` to a classifier sets the weight for class $k$ to $N / (K \cdot N_k)$, the inverse of the class frequency scaled so that the per-example weights average to 1 over the training set.[18] The same idea appears as `pos_weight` in PyTorch's `BCEWithLogitsLoss` and as `scale_pos_weight` in XGBoost. The effect is similar to undersampling the majority class but without throwing away data, which is useful when the dataset is small.
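
The balanced-weight formula $N / (K \cdot N_k)$ is simple enough to compute by hand. The sketch below reimplements it in plain Python for illustration rather than calling scikit-learn; the function name and the 990/10 toy split are assumptions.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight for class k is N / (K * N_k)."""
    counts = Counter(labels)
    n_total, n_classes = len(labels), len(counts)
    return {k: n_total / (n_classes * n_k) for k, n_k in counts.items()}

# Toy binary dataset (hypothetical): 990 majority (0) and 10 minority (1) examples.
labels = [0] * 990 + [1] * 10
weights = balanced_class_weights(labels)

# Total weight carried by each class after reweighting.
total_majority = weights[0] * 990
total_minority = weights[1] * 10
print(weights, total_majority, total_minority)
```

The minority weight is 1000 / (2 × 10) = 50 and the majority weight is 1000 / (2 × 990), about 0.505. Each class then contributes a total weight of 500, so the two classes pull equally on the loss while the per-example weights still sum to $N$.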

**Threshold moving** keeps the trained model fixed and adjusts the decision threshold at inference. By default, a binary classifier predicts the positive class when the predicted probability exceeds 0.5. Lowering this threshold to, say, 0.1 trades [precision](/wiki/precision) for recall on the minority class. Threshold moving is often the first thing to try after training a probabilistic model on imbalanced data because it requires no retraining.
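
The precision-recall trade can be seen on a handful of scores. In the sketch below the predicted probabilities and labels are made up for illustration, and `precision_recall_at` is a hypothetical helper that predicts the positive class when the probability exceeds the threshold.

```python
def precision_recall_at(probs, y_true, threshold):
    """Precision and recall of the positive class (label 1) at a given threshold."""
    preds = [1 if p > threshold else 0 for p in probs]
    tp = sum(p == 1 and t == 1 for p, t in zip(preds, y_true))
    fp = sum(p == 1 and t == 0 for p, t in zip(preds, y_true))
    fn = sum(p == 0 and t == 1 for p, t in zip(preds, y_true))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical model outputs for 10 examples, 3 of them positive.
probs = [0.05, 0.08, 0.12, 0.30, 0.02, 0.15, 0.60, 0.20, 0.03, 0.07]
y_true = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

default_pr = precision_recall_at(probs, y_true, threshold=0.5)
lowered_pr = precision_recall_at(probs, y_true, threshold=0.1)
print(default_pr, lowered_pr)
```

At 0.5 only one positive is caught (precision 1.0, recall 1/3). At 0.1 all three are caught at the cost of two false alarms (precision 0.6, recall 1.0). In practice the threshold would be chosen on a validation set according to the relative cost of the two error types.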

**Focal loss**, introduced by Lin et al. (2017) for the RetinaNet object detector, modifies cross entropy with a modulating factor $(1 - p_t)^\gamma$ that down-weights well-classified examples.[13] Most majority examples are easy, so focal loss reduces their contribution and lets the optimizer focus on the harder, often minority, examples. The technique was originally developed to handle the extreme foreground-background imbalance in dense object detection, where there are roughly 1,000 background anchors per object, but it is now widely used outside computer vision.
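
The effect of the modulating factor is easiest to see numerically. The sketch below implements the basic binary form $-(1 - p_t)^\gamma \log p_t$ for a single example and omits the optional class-balancing weight used in the paper; $\gamma = 2$ and the two example probabilities are illustrative choices.

```python
import math

def focal_loss(p, y, gamma=2.0):
    """Binary focal loss for predicted positive-class probability p and label y in {0, 1}.

    Assumes 0 < p < 1 so that the logarithm is defined.
    """
    p_t = p if y == 1 else 1.0 - p  # probability assigned to the true class
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

def cross_entropy(p, y):
    """Ordinary cross entropy is focal loss with gamma = 0."""
    return focal_loss(p, y, gamma=0.0)

# An easy majority (negative) example and a hard minority (positive) example.
easy_ratio = focal_loss(0.05, 0) / cross_entropy(0.05, 0)  # (1 - 0.95)^2 = 0.0025
hard_ratio = focal_loss(0.10, 1) / cross_entropy(0.10, 1)  # (1 - 0.10)^2 = 0.81
print(easy_ratio, hard_ratio)
```

With $\gamma = 2$, the confidently correct negative keeps only 0.25% of its cross-entropy loss, while the badly misclassified positive keeps 81%, so the many easy majority examples contribute far less to the total.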

## How do ensemble methods reuse the majority class?

A single random undersample throws away most of the majority data. Ensemble methods reuse the majority class by training many learners on different subsets and combining their predictions. Liu, Wu, and Zhou (2009) introduced two influential designs:[9]

- **EasyEnsemble** trains an AdaBoost ensemble on each of several random undersamples of the majority class, then combines all base learners. Each majority example has many chances to appear in some subset, so almost no information is lost.
- **BalanceCascade** trains learners sequentially. After each round, majority examples that the current ensemble already classifies correctly are dropped from the pool, so subsequent learners focus on the harder remaining majority points.
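
The sampling step behind EasyEnsemble can be sketched as follows. The example only draws the balanced subsets and measures how much of the majority class they cover; training a boosted learner on each subset is omitted, and the function name, subset count, and toy class sizes are assumptions.

```python
import random

def balanced_subsets(labels, majority_label, n_subsets, seed=0):
    """Draw n_subsets index sets, each with all minority examples and an
    equally sized random sample of majority examples."""
    rng = random.Random(seed)
    majority_idx = [i for i, y in enumerate(labels) if y == majority_label]
    minority_idx = [i for i, y in enumerate(labels) if y != majority_label]
    subsets = []
    for _ in range(n_subsets):
        sampled = rng.sample(majority_idx, len(minority_idx))
        subsets.append(sorted(sampled + minority_idx))
    return subsets

# Toy data (hypothetical): 200 majority (0) and 20 minority (1) examples.
labels = [0] * 200 + [1] * 20
subsets = balanced_subsets(labels, majority_label=0, n_subsets=10)

# Majority examples that appear in at least one subset.
covered = {i for s in subsets for i in s if labels[i] == 0}
print(len(subsets[0]), len(covered))
```

A single undersample would use 20 of the 200 majority examples. Across ten independent draws the union is much larger, which is the sense in which the ensemble recovers information that one undersample throws away.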

**RUSBoost**, proposed by Seiffert et al. (2010), combines random undersampling with AdaBoost.M2 in a single algorithm. It tends to match or beat SMOTEBoost on standard imbalanced benchmarks while being noticeably faster, because random undersampling is cheap compared with synthetic generation.[10]

The `imbalanced-learn` library provides ensemble estimators in this family, including `BalancedBaggingClassifier`, `EasyEnsembleClassifier`, `BalancedRandomForestClassifier`, and `RUSBoostClassifier`, all with scikit-learn-compatible interfaces.

## Which evaluation metrics suit imbalanced data?

Accuracy and other metrics that weight per-class performance proportionally to class size are misleading when the majority class dominates. The metrics in the table below are commonly used instead.

| Metric | What it measures | Behavior under imbalance |
|---|---|---|
| [Precision](/wiki/precision) | Fraction of predicted positives that are correct | Sensitive to false positives, which dominate when the negative class is huge |
| [Recall](/wiki/recall) | Fraction of actual positives that were caught | Insensitive to majority class size; key for rare-event detection |
| [F1 score](/wiki/f1_score) | Harmonic mean of precision and recall | Balances precision and recall; ignores true negatives |
| Balanced accuracy | Mean of per-class recall | Treats both classes equally regardless of size |
| Matthews correlation coefficient (MCC) | Correlation between predicted and actual labels | Uses all four confusion-matrix cells; symmetric in classes[14] |
| Cohen's kappa | Agreement above chance | Adjusts for the high agreement expected from a constant predictor |
| [ROC AUC](/wiki/auc_area_under_the_curve) | Area under the receiver operating characteristic curve | Invariant to class balance; can look optimistic on highly imbalanced data |
| [PR AUC](/wiki/pr_auc_area_under_the_pr_curve) | Area under the precision-recall curve | Baseline equals minority prevalence; more informative when minority class is small |
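
Several of these metrics can be computed from the four confusion-matrix counts. The sketch below evaluates a constant majority-class predictor on a toy 990/10 split; the function name is hypothetical, and returning 0 when a ratio is undefined (for example, MCC with an empty predicted-positive column) is a convention assumed here.

```python
import math

def confusion_metrics(tp, fp, fn, tn):
    """Common metrics from binary confusion-matrix counts (positive = minority)."""
    total = tp + fp + fn + tn
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "balanced_accuracy": (recall + specificity) / 2,
        "f1": f1,
        "mcc": mcc,
    }

# Constant "always negative" predictor on 990 negatives and 10 positives.
m = confusion_metrics(tp=0, fp=0, fn=10, tn=990)
print(m)
```

Accuracy is 0.99, but balanced accuracy is 0.5 and both F1 and MCC are 0, which exposes the predictor as uninformative.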

Saito and Rehmsmeier (2015) showed that ROC curves can be visually misleading on highly imbalanced data because changing the negative count by an order of magnitude leaves the ROC curve unchanged while completely changing precision.[11] They recommend reporting PR curves and PR AUC alongside or instead of ROC AUC whenever the minority prevalence is below roughly 10%.
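
The invariance is straightforward to check with arithmetic. The sketch below fixes a classifier's operating point at a true positive rate of 0.8 and a false positive rate of 0.05 (made-up values) and multiplies the number of negatives by ten.

```python
def rates(tp, fn, fp, tn):
    """Return (true positive rate, false positive rate, precision)."""
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    precision = tp / (tp + fp)
    return tpr, fpr, precision

# Same operating point, 100 positives in both cases (hypothetical counts).
base = rates(tp=80, fn=20, fp=50, tn=950)      # 1,000 negatives
scaled = rates(tp=80, fn=20, fp=500, tn=9500)  # 10,000 negatives

print(base)
print(scaled)
```

The ROC coordinates (TPR and FPR) are identical in both cases, yet precision falls from 80/130 (about 0.62) to 80/580 (about 0.14). A PR curve shows this drop; a ROC curve does not.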

## What are the risks of acting on the majority class?

Undersampling and reweighting are not free. The main risks include:

- **Information loss**: Random undersampling discards data that may contain rare but useful patterns. A subset that happens to drop the few "borderline" majority examples can hurt the model's ability to define the boundary precisely.
- **Distorted prior**: After undersampling, the training distribution no longer matches deployment. The model's predicted probabilities are calibrated to the resampled distribution, not to the natural base rate. Predictions then need to be recalibrated, for example by Platt scaling or by adjusting the log-odds offset back to the original prior.[15]
- **Boundary artifacts**: Aggressive cleaning methods like ENN can over-prune the majority class in regions where it legitimately overlaps with the minority class, biasing the boundary in the other direction.
- **Variance**: A single random undersample is noisy. Ensemble methods like EasyEnsemble exist precisely to average this noise out.

A model trained on a 1:1 resampled dataset will overpredict the minority class at deployment time unless its threshold or its output probabilities are corrected. This is particularly important in regulated domains like credit scoring or medical screening, where calibrated probabilities are required for downstream decisions.
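
One way to apply the prior correction is the closed-form adjustment analyzed by Dal Pozzolo et al.[15] The sketch below assumes the minority is the positive class, that only the majority class was undersampled, and that `beta` is the fraction of majority examples kept; the function name and the example counts are illustrative.

```python
import math

def correct_for_undersampling(p_s, beta):
    """Map a probability p_s from a model trained on undersampled data back to
    the original class prior, where beta is the kept fraction of the majority class."""
    return beta * p_s / (beta * p_s - p_s + 1.0)

def logit(p):
    return math.log(p / (1.0 - p))

# Hypothetical setup: 10,000 majority and 100 minority examples,
# undersampled to 1:1, so beta = 100 / 10,000 = 0.01.
beta = 0.01
corrected_half = correct_for_undersampling(0.5, beta)  # maps back to the base rate
corrected_high = correct_for_undersampling(0.8, beta)

# Equivalent view: the correction shifts the log-odds by log(beta).
shifted = logit(0.8) + math.log(beta)
print(corrected_half, corrected_high, logit(corrected_high), shifted)
```

A score of 0.5 from the balanced model maps back to 100 / 10,100, about 0.0099, which is the original minority prevalence. In log-odds terms the correction is a constant offset of $\log \beta$.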

## When should you treat it as anomaly detection instead?

At extreme imbalance ratios, treating the problem as classification stops paying off. With an imbalance ratio of 10,000 or more, even SMOTE can struggle because there are often too few minority examples to interpolate from, and undersampling to a balanced ratio can leave a tiny training set. In that regime it often makes more sense to switch to [anomaly detection](/wiki/anomaly_detection): model only the majority distribution and flag anything that looks unusual.

Common one-class methods include one-class SVM, isolation forest, local outlier factor, and autoencoder reconstruction error. These approaches treat the majority class as "normal" and learn a description of it, then score new points by how far they deviate. They sidestep the imbalance entirely because they never need minority labels at training time, although they typically need some minority examples for tuning the decision threshold.
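
The general pattern can be shown with a much simpler stand-in than the methods named above: a one-dimensional z-score detector that fits a mean and standard deviation to majority examples only. The data, the threshold of 3, and the function names are assumptions made for illustration.

```python
import statistics

def fit_normal_profile(majority_values):
    """Describe the majority class by its mean and (population) standard deviation."""
    return statistics.fmean(majority_values), statistics.pstdev(majority_values)

def anomaly_score(x, profile):
    """Absolute z-score of x relative to the majority profile."""
    mean, std = profile
    return abs(x - mean) / std

# Hypothetical sensor readings from normal (majority) operation only.
normal_readings = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3]
profile = fit_normal_profile(normal_readings)

threshold = 3.0  # would normally be tuned on a few labeled anomalies
new_readings = [10.1, 14.0]
flags = [anomaly_score(x, profile) > threshold for x in new_readings]
print(flags)
```

No minority label is used to fit the profile. The only place labeled anomalies would enter is in choosing the threshold, which matches the caveat above.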

## What software handles the majority class?

The [imbalanced-learn](/wiki/imbalanced-learn) library, introduced by Lemaitre, Nogueira, and Aridas in JMLR (2017), is the most widely used implementation of these techniques in Python.[12] It groups methods into four categories: undersampling, oversampling, combinations, and ensembles. All estimators follow the [scikit-learn](/wiki/scikit-learn) API and can be dropped into existing pipelines. Other relevant libraries include `smote-variants`, which packages dozens of SMOTE variants, and `imbalanced-ensemble`, which extends imbalanced-learn with more ensemble methods. In R, the `ROSE`, `UBL`, and `themis` packages cover similar ground.

## How does the majority class show up in modern AI?

Class imbalance is not a problem that has been solved and put away. It shows up in nearly every applied machine learning system, from search ranking (clicks are rare relative to impressions) to LLM safety evaluation (hallucinations and policy violations are rare relative to ordinary completions). The basic vocabulary of majority and minority classes, imbalance ratios, undersampling, and threshold moving still applies to these modern problems, and the techniques described above remain part of the standard toolkit. What has changed is the scale: a deep model trained on a billion examples can absorb a much higher absolute number of minority examples than a classical model could, which sometimes lets practitioners skip explicit rebalancing entirely. Whether that is a good idea depends, as it always has, on what the minority class is worth.

## Explain like I'm 5

Imagine a jar with 99 red marbles and 1 blue marble. If a friend asks you to guess the color of a marble pulled from the jar without looking, your best bet is to always guess red. You will be right 99 times out of 100, and your guessing strategy is super simple: it ignores blue completely. Computers do the same thing when they learn from data. If they see way more red marbles, they learn to just say "red" all the time. The red marbles are the majority class. To teach the computer to also notice the blue marble, you can hide some of the red ones, ask it to pay extra attention when it sees blue, or count blue answers as worth more points than red ones. All of these tricks are different ways of telling the computer that the rare thing actually matters.

## See also

- [Minority class](/wiki/minority_class)
- [Class-imbalanced dataset](/wiki/class-imbalanced_dataset)
- [Imbalanced dataset](/wiki/imbalanced_dataset)
- [Cost-sensitive learning](/wiki/cost-sensitive_learning)
- [Anomaly detection](/wiki/anomaly_detection)
- [imbalanced-learn](/wiki/imbalanced-learn)

## References

1. Wilson, D. L. (1972). "Asymptotic Properties of Nearest Neighbor Rules Using Edited Data." *IEEE Transactions on Systems, Man, and Cybernetics*, 2(3), 408-421.
2. Hart, P. E. (1968). "The Condensed Nearest Neighbor Rule." *IEEE Transactions on Information Theory*, 14(3), 515-516.
3. Tomek, I. (1976). "Two Modifications of CNN." *IEEE Transactions on Systems, Man, and Cybernetics*, 6, 769-772.
4. Kubat, M. and Matwin, S. (1997). "Addressing the Curse of Imbalanced Training Sets: One-Sided Selection." *Proceedings of the 14th International Conference on Machine Learning*, 179-186.
5. Chawla, N. V., Bowyer, K. W., Hall, L. O., and Kegelmeyer, W. P. (2002). "SMOTE: Synthetic Minority Over-sampling Technique." *Journal of Artificial Intelligence Research*, 16, 321-357.
6. Mani, I. and Zhang, J. (2003). "kNN Approach to Unbalanced Data Distributions: A Case Study Involving Information Extraction." *Proceedings of the ICML Workshop on Learning from Imbalanced Datasets*.
7. Batista, G. E. A. P. A., Prati, R. C., and Monard, M. C. (2004). "A Study of the Behavior of Several Methods for Balancing Machine Learning Training Data." *ACM SIGKDD Explorations Newsletter*, 6(1), 20-29.
8. He, H. and Garcia, E. A. (2009). "Learning from Imbalanced Data." *IEEE Transactions on Knowledge and Data Engineering*, 21(9), 1263-1284.
9. Liu, X.-Y., Wu, J., and Zhou, Z.-H. (2009). "Exploratory Undersampling for Class-Imbalance Learning." *IEEE Transactions on Systems, Man, and Cybernetics, Part B*, 39(2), 539-550.
10. Seiffert, C., Khoshgoftaar, T. M., Van Hulse, J., and Napolitano, A. (2010). "RUSBoost: A Hybrid Approach to Alleviating Class Imbalance." *IEEE Transactions on Systems, Man, and Cybernetics, Part A*, 40(1), 185-197.
11. Saito, T. and Rehmsmeier, M. (2015). "The Precision-Recall Plot Is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets." *PLOS ONE*, 10(3), e0118432.
12. Lemaitre, G., Nogueira, F., and Aridas, C. K. (2017). "Imbalanced-learn: A Python Toolbox to Tackle the Curse of Imbalanced Datasets in Machine Learning." *Journal of Machine Learning Research*, 18(17), 1-5.
13. Lin, T.-Y., Goyal, P., Girshick, R., He, K., and Dollar, P. (2017). "Focal Loss for Dense Object Detection." *Proceedings of the IEEE International Conference on Computer Vision (ICCV)*, 2980-2988.
14. Chicco, D. and Jurman, G. (2020). "The Advantages of the Matthews Correlation Coefficient (MCC) over F1 Score and Accuracy in Binary Classification Evaluation." *BMC Genomics*, 21(6).
15. Dal Pozzolo, A., Caelen, O., Johnson, R. A., and Bontempi, G. (2015). "Calibrating Probability with Undersampling for Unbalanced Classification." *IEEE Symposium Series on Computational Intelligence (SSCI)*, 159-166.
16. Google for Developers. "Machine Learning Glossary: ML Fundamentals (Majority class, Minority class, Class-imbalanced dataset)." developers.google.com/machine-learning/glossary/fundamentals. Accessed 2026.
17. Google for Developers. "Datasets: Class-imbalanced datasets." Machine Learning Crash Course. developers.google.com/machine-learning/crash-course/overfitting/imbalanced-datasets. Accessed 2026.
18. scikit-learn developers. "compute_class_weight." scikit-learn 1.x documentation. scikit-learn.org/stable/modules/generated/sklearn.utils.class_weight.compute_class_weight.html. Accessed 2026.

