See also: Machine learning terms
The negative class is the label assigned to instances in a binary classification problem that do not possess the target characteristic the model is trying to detect. By convention, the positive class is the outcome the practitioner cares about (a tumor present, a fraudulent transaction, a defective part), and the negative class is everything else. The split is asymmetric in practice: the negative class is usually larger, less interesting on its own, and often more heterogeneous, which has consequences for how models are trained and evaluated.
The positive/negative vocabulary comes from medical screening and diagnostic testing, where a "positive" test result indicates the suspected condition is present and a "negative" result indicates it is absent. The same convention carried over into epidemiology and statistics, and from there into machine learning, where it now anchors the language of confusion matrix entries (true positive, false positive, true negative, false negative), precision, recall, and ROC analysis.
The choice of which label is positive is a design decision, not a property of the data. In a churn prediction model the positive class is usually "will churn" because that is what the business wants to act on, even though most customers do not churn. In quality control the positive class is usually "defect" even though most parts are fine. Flipping the convention does not change what the model can learn, but it does flip the meaning of every metric, which is a frequent source of confusion when comparing reports across teams.
In supervised learning the model sees labeled examples and learns a decision function that maps inputs to one of the two classes. During training, both classes contribute to the loss; the model is penalized for both false positives (negatives misclassified as positive) and false negatives (positives misclassified as negative). At inference the model produces a score, and a threshold is applied to decide which side of the boundary the example falls on. Examples below the threshold are predicted as negative, and the negative class therefore acts as the default or background label.
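The default-label behavior of the negative class can be sketched in a few lines; the scores and threshold below are invented for illustration.

```python
# Sketch of score thresholding at inference, assuming a model that
# emits a probability-like score in [0, 1]; scores here are made up.
scores = [0.92, 0.35, 0.64, 0.08, 0.51]
threshold = 0.5

# Examples scoring below the threshold fall back to the negative
# class, which acts as the default or background label.
predictions = ["positive" if s >= threshold else "negative" for s in scores]
print(predictions)  # ['positive', 'negative', 'positive', 'negative', 'positive']
```

Moving the threshold changes only which side of the boundary each score falls on, not the underlying scores, which is why threshold moving is itself a cost-sensitive technique.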
In a spam detector the negative class is "not spam," which encompasses a much wider distribution of email content than the positive "spam" class. In a fraud model the negative class is the bulk of legitimate transactions. In intrusion detection it is normal network traffic. The negative label hides this internal diversity behind a single tag, which is convenient for evaluation but can hurt generalization when the negative population shifts.
Two of the four cells in a confusion matrix describe negative-class outcomes.
| Actual \ Predicted | Predicted positive | Predicted negative |
|---|---|---|
| Actual positive | True positive (TP) | False negative (FN) |
| Actual negative | False positive (FP) | True negative (TN) |
True negatives are negative-class instances correctly predicted as negative. False positives are negative-class instances the model mistakenly flags as positive. The base rate of the negative class drives several derived metrics. Specificity, defined as TN / (TN + FP), is the recall of the negative class. The false positive rate, FP / (FP + TN), is one minus specificity and forms the x-axis of the ROC curve. When negatives vastly outnumber positives, even a small false positive rate can produce more false alarms than true detections, which is why precision and the precision-recall curve are often preferred over ROC for highly imbalanced problems.
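The arithmetic behind these definitions is simple enough to sketch directly; the counts below are hypothetical, chosen so that actual negatives outnumber actual positives 100:1.

```python
# Sketch of negative-class metrics from raw confusion-matrix counts.
# Counts are invented: 100 actual positives, 10,000 actual negatives.
tp, fn = 80, 20        # actual positives: detected vs missed
fp, tn = 200, 9800     # actual negatives: false alarms vs correct rejections

specificity = tn / (tn + fp)   # recall of the negative class
fpr = fp / (fp + tn)           # 1 - specificity, the ROC x-axis
precision = tp / (tp + fp)

print(f"specificity={specificity:.3f} fpr={fpr:.3f} precision={precision:.3f}")
# A 2% false positive rate yields 200 false alarms against only
# 80 true detections, so precision falls below 0.3.
```

This is the imbalance effect described above: the false positive rate looks excellent while precision does not, because the rate is measured against a huge negative base.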
In multi-class classification, the binary positive/negative framing extends through the one-vs-rest (OvR) reduction. For each class c, the algorithm trains a binary classifier that treats c as positive and the union of every other class as negative. With ten classes, the same example serves as positive for one classifier and as negative for the other nine. The negative class in each subproblem is therefore a deliberate mixture, which is one reason OvR scores need to be calibrated before they can be compared across classifiers. The same idea appears in one-vs-one decomposition, although there each pairwise classifier sees only two classes and the negative-class definition is narrower.
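The one-vs-rest label construction itself is mechanical and can be sketched without any learning machinery; the labels below are invented for illustration.

```python
# Sketch of one-vs-rest target construction: for each class c,
# relabel the data so c is positive (1) and every other class is
# negative (0). The label list is made up.
labels = ["cat", "dog", "bird", "dog", "cat"]
classes = sorted(set(labels))

ovr_targets = {c: [1 if y == c else 0 for y in labels] for c in classes}
print(ovr_targets["dog"])  # [0, 1, 0, 1, 0]
# Each example is positive in exactly one subproblem and negative
# in all the others, so each negative class is a mixture of real classes.
```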
In anomaly detection the labeling is often inverted relative to intuition. Normal behavior is by far the larger class, and rare anomalies are what the system needs to flag. Some pipelines treat the rare events as the positive class and normal behavior as the negative class, mirroring the medical screening convention. Other one-class methods, such as one-class SVM and isolation forest, train only on the normal data and treat anything sufficiently different as anomalous at inference. Here the negative class is essentially everything the model has ever seen, and the model never gets explicit anomaly examples during training.
Open-set recognition is a related setting: a model trained on a fixed set of known classes must reject inputs from unknown classes at test time. The rejected inputs form a kind of dynamic negative class that was never present in the training distribution. Methods for open-set recognition include OpenMax, evidential deep learning, and energy-based out-of-distribution detection.
Most real-world classification problems are imbalanced, and the imbalance almost always favors the negative class. The imbalance ratio (IR) is defined as the size of the majority class divided by the size of the minority class. When IR is large, naive accuracy becomes misleading: a model that always predicts the negative class on a 1:99 problem achieves 99% accuracy while detecting nothing. See imbalanced dataset for a fuller treatment.
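The 1:99 failure mode is easy to demonstrate concretely:

```python
# Sketch of why raw accuracy misleads at a 1:99 imbalance ratio:
# a degenerate model that always predicts the negative class.
y_true = [1] * 1 + [0] * 99   # 1 positive among 100 examples
y_pred = [0] * 100            # always predict negative

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recall = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred)) / sum(y_true)

print(accuracy, recall)  # 0.99 0.0
```

The model scores 99% accuracy while its recall on the positive class is zero, which is exactly the pathology that imbalance-aware metrics are designed to expose.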
| Domain | Approximate positive rate | Implied imbalance ratio |
|---|---|---|
| Credit card fraud detection | < 1% | > 100:1 |
| Click-through prediction (display ads) | 0.1% to 1% | 100:1 to 1000:1 |
| Rare disease screening | < 0.1% | > 1000:1 |
| Manufacturing defect detection | 1% to 5% | 20:1 to 100:1 |
| Spam detection (modern email) | 30% to 50% | roughly 1:1 to 2:1 |
| Sentiment classification (balanced corpus) | ~50% | 1:1 |
Metrics designed to cope with imbalance include the area under the precision-recall curve, the F1 score, Matthews correlation coefficient, balanced accuracy, and the geometric mean of class-wise recalls (G-mean), which is widely used in the imbalanced-learning literature.
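Two of these metrics, balanced accuracy and G-mean, are direct functions of the class-wise recalls and can be sketched from confusion-matrix counts; the counts below are hypothetical.

```python
import math

# Sketch of imbalance-aware metrics built from class-wise recalls.
# Confusion-matrix counts are invented for illustration.
tp, fn, fp, tn = 80, 20, 200, 9800

sensitivity = tp / (tp + fn)   # recall of the positive class
specificity = tn / (tn + fp)   # recall of the negative class

balanced_accuracy = (sensitivity + specificity) / 2
g_mean = math.sqrt(sensitivity * specificity)

print(f"balanced acc={balanced_accuracy:.3f} g-mean={g_mean:.3f}")
```

Because both metrics average the two recalls (arithmetically or geometrically), the always-negative classifier from the imbalance example scores 0.5 balanced accuracy and 0.0 G-mean instead of a flattering raw accuracy.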
Resampling rebalances the training distribution before fitting a model. The methods fall into three groups.
| Technique | What it does | Risk |
|---|---|---|
| Random undersampling | Drops negative examples uniformly at random | Discards potentially useful information |
| Tomek links removal | Removes negative examples that form a nearest-neighbor pair with a positive | Modest balancing effect, mainly cleans the boundary |
| NearMiss-1/2/3 | Keeps negatives that are closest to positives by various distance rules | Can amplify noise near the decision boundary |
| Edited Nearest Neighbours (ENN) | Removes examples whose label disagrees with the majority of their k nearest neighbors | Sensitive to k and to noisy labels |
| Random oversampling | Duplicates positive examples until the classes match | Encourages overfitting to the duplicated points |
| SMOTE | Generates synthetic positives by interpolating between minority examples and their k nearest neighbors | Can blur the class boundary; struggles with categorical features |
| Borderline-SMOTE | Restricts SMOTE to minority points near the decision boundary | Sensitive to noise on the boundary |
| ADASYN | Adapts the per-point sampling weight so harder positives get more synthetic neighbors | Can over-emphasize outliers |
| SMOTE-Tomek | Combines SMOTE oversampling with Tomek-link cleanup | Inherits the limitations of both components |
The SMOTE family begins with the original SMOTE algorithm published by Chawla, Bowyer, Hall, and Kegelmeyer in the Journal of Artificial Intelligence Research in 2002. Borderline-SMOTE was introduced by Han, Wang, and Mao in 2005, and ADASYN by He, Bai, Garcia, and Li at IJCNN 2008. The most widely used implementation in Python is imbalanced-learn (imported as imblearn), which is part of the scikit-learn-contrib ecosystem and provides drop-in API-compatible samplers.
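The interpolation step at the core of SMOTE is compact enough to sketch in plain Python; the two points below stand in for a minority example and one of its k nearest minority neighbors, which in the real algorithm come from a k-NN search.

```python
import random

# Sketch of SMOTE's interpolation step: a synthetic positive is
# placed at a uniformly random point on the segment between a
# minority example and one of its nearest minority neighbors.
# The coordinates are made up for illustration.
random.seed(0)
x = [1.0, 2.0]          # a minority (positive) example
neighbor = [3.0, 4.0]   # one of its k nearest minority neighbors

gap = random.random()   # uniform in [0, 1)
synthetic = [xi + gap * (ni - xi) for xi, ni in zip(x, neighbor)]
print(synthetic)        # lies on the segment between x and neighbor
```

Because the synthetic point always lies between two minority examples, SMOTE can blur the class boundary when those examples sit on opposite sides of a pocket of negatives, which is the limitation noted in the table above.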
The choice between undersampling and oversampling depends on data volume and the cost of missing the positive class. Undersampling the negative class is fast and works well when there is enough data that throwing some away is harmless, but it discards examples the model could have learned from. Oversampling preserves all of the negative-class information at the cost of larger training sets and the risk that the model memorizes the duplicated or interpolated positives.
Resampling reshapes the data; cost-sensitive methods reshape the loss instead. They leave the dataset alone and tell the optimizer to weigh mistakes on the minority class more heavily than mistakes on the majority class. See cost-sensitive learning for a deeper treatment.
| Method | Mechanism |
|---|---|
| Class-weighted cross entropy | Multiplies the loss for each example by a per-class weight, typically inversely proportional to class frequency |
| class_weight='balanced' in scikit-learn | Sets weights to n_samples / (n_classes * n_samples_per_class) automatically |
| Threshold moving | Trains an unweighted model and then shifts the decision threshold to favor the minority class at inference |
| Calibrated cost matrix | Specifies an explicit cost for each (predicted, actual) pair and chooses the prediction that minimizes expected cost |
| Focal loss | Down-weights well-classified examples by a factor of (1 - p_t)^gamma, originally proposed for dense object detection |
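The class-weighted cross entropy in the first row can be sketched directly, using the same balanced-weight heuristic the table attributes to scikit-learn; the labels and scores below are invented.

```python
import math

# Sketch of class-weighted binary cross entropy with weights
# inversely proportional to class frequency, mirroring the
# balanced heuristic w_c = n_samples / (n_classes * n_c).
# Labels and model scores are made up for illustration.
y_true = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]                       # 1 positive, 9 negatives
p_pos  = [0.6, 0.2, 0.1, 0.3, 0.1, 0.2, 0.1, 0.1, 0.2, 0.1]  # predicted P(positive)

n = len(y_true)
n_pos = sum(y_true)
n_neg = n - n_pos
w = {1: n / (2 * n_pos), 0: n / (2 * n_neg)}  # balanced class weights

loss = -sum(
    w[y] * (math.log(p) if y == 1 else math.log(1.0 - p))
    for y, p in zip(y_true, p_pos)
) / n
print(f"weights={w} weighted loss={loss:.3f}")
# The lone positive carries weight 5.0 versus about 0.56 per
# negative, so missing it dominates the loss.
```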
Focal loss was introduced by Lin, Goyal, Girshick, He, and Dollar in 2017 as part of the RetinaNet object detector, where each image contains tens of thousands of background (negative) anchor boxes and only a handful of foreground (positive) ones. The same idea has since been used for long-tailed image classification, named entity recognition, and other tasks where the negative class swamps the positive one.
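The modulating factor can be sketched for the binary case; gamma = 2 is the setting commonly reported for RetinaNet, and the example scores are invented.

```python
import math

# Sketch of binary focal loss: standard cross entropy scaled by
# the modulating factor (1 - p_t)^gamma from Lin et al.
def focal_loss(p, y, gamma=2.0):
    """Focal loss for one example with predicted P(positive) = p
    and true label y in {0, 1}."""
    p_t = p if y == 1 else 1.0 - p          # probability of the true class
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

# A well-classified negative (p_t = 0.99) is down-weighted almost
# to zero, while a hard positive (p_t = 0.10) keeps most of its loss.
easy_negative = focal_loss(0.01, 0)
hard_positive = focal_loss(0.10, 1)
print(easy_negative, hard_positive)
```

This is what lets RetinaNet train against tens of thousands of easy background anchors per image: their combined loss contribution shrinks toward zero instead of swamping the rare foreground anchors.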
Algorithmic ensemble methods combine resampling with bagging. EasyEnsemble trains many base learners on independently undersampled subsets of the negatives and averages their predictions. A balanced random forest applies the same idea within a random forest, drawing a balanced bootstrap sample (undersampling the negatives) for each tree. Both keep the speed of undersampling while recovering some of the information that a single undersample would discard.
The term "negative class" also appears in self-supervised and contrastive methods, where the labels are constructed rather than given.
In word2vec, Mikolov, Sutskever, Chen, Corrado, and Dean (NeurIPS 2013) replaced the expensive softmax over the full vocabulary with negative sampling: for each true (word, context) pair, a small number of "negative" words are drawn from a noise distribution and the model is trained to score the true pair higher than the noise pairs. The negative class here is not a category in the data but a synthetic foil constructed from random vocabulary draws.
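The sampling step can be sketched in a few lines; the toy corpus counts are invented, and the 3/4 exponent on the unigram counts is the flattening used in the word2vec paper.

```python
import random

# Sketch of negative sampling as in word2vec: for each true
# (word, context) pair, draw k "negative" words from the unigram
# distribution raised to the 3/4 power. Corpus counts are made up.
random.seed(1)
counts = {"the": 100, "cat": 10, "sat": 5, "quantum": 1}
words = list(counts)
weights = [counts[w] ** 0.75 for w in words]  # flattens the distribution

def draw_negatives(true_context, k=3):
    # Reject draws that collide with the true context word, so the
    # synthetic negatives never duplicate the real pair.
    out = []
    while len(out) < k:
        w = random.choices(words, weights=weights)[0]
        if w != true_context:
            out.append(w)
    return out

negatives = draw_negatives("cat")
print(negatives)  # three sampled words, none equal to "cat"
```

The 3/4 exponent keeps frequent words from dominating the noise distribution entirely while still sampling them more often than rare words.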
Contrastive learning methods adopt the same logic for images and other modalities. SimCLR (Chen et al., 2020) treats other examples in the same minibatch as negatives, so the effective number of negatives scales with batch size. MoCo (He et al., 2020) decouples the negatives from the batch by maintaining a queue of recent encoder outputs. Hard-negative mining tries to focus the loss on the negatives that are currently hardest to push away from the anchor, which is faster than uniform sampling but can amplify noisy labels if applied too aggressively.
Metric learning for face and image retrieval relies on the same idea. FaceNet (Schroff, Kalenichenko, Philbin, CVPR 2015) trains with a triplet loss that pulls an anchor toward a positive of the same identity and pushes it away from a negative of a different identity, and uses online semi-hard negative mining to choose informative triplets within each minibatch.
The negative class is whatever is left over after the positive class is defined, which sounds simple but rarely is. A spam detector trained on a corpus of marketing email may flag legitimate marketing as spam when deployed against a wider population, because the implicit definition of "not spam" was narrower than the real world. A defect detector trained on photos of one production line may fail on another line whose backgrounds differ. The fix is usually to broaden the negative population during training, either by adding harder negatives, by using domain-randomized negatives, or by reframing the problem as anomaly detection where only the positive distribution is modeled directly.
When the negative class is heterogeneous, it can also help to break it apart. A binary intent classifier that lumps every non-purchase intent into a single negative often does worse than a multiclass model with explicit categories for browse, support, and complaint, because the multiclass loss forces the model to keep those subpopulations separate in feature space.
A few rules of thumb apply across most projects: report precision-recall metrics rather than raw accuracy when the negative class dominates; try class weights or threshold moving before resampling, since they leave the data intact; when the negative class is heterogeneous, consider splitting it into explicit subclasses; and audit the negative population after deployment, since it is the part of the distribution most likely to shift.
When we teach a computer to tell two things apart, we give the things names. The thing we want the computer to find is called the positive class, and the other thing is called the negative class.
Imagine you are sorting apples and oranges. If you want the computer to find apples, then apples are the positive class and oranges are the negative class. The computer learns the difference between them so it can pick out the apples.
Now imagine your basket has a thousand oranges and only one apple. If the computer just guesses "orange" every time, it will be right almost always, but it will never find the apple. That is the problem of an imbalanced negative class. So we use tricks like making more pretend apples (oversampling), throwing away some oranges (undersampling), or telling the computer that missing an apple is much worse than mistaking an orange. Those tricks help the computer pay attention to the rare thing we actually care about.