See also: Machine learning terms
The negative class is the label assigned to instances in a binary classification problem that do not possess the target characteristic the model is trying to detect. By convention, the positive class is the outcome the practitioner cares about (a tumor present, a fraudulent transaction, a defective part), and the negative class is everything else. The split is asymmetric in practice: the negative class is usually larger, less interesting on its own, and often more heterogeneous, which has consequences for how models are trained and evaluated.
The positive/negative vocabulary comes from medical screening and diagnostic testing, where a "positive" test result indicates the suspected condition is present and a "negative" result indicates it is absent. The same convention carried over into epidemiology and statistics, and from there into machine learning, where it now anchors the language of confusion matrix entries (true positive, false positive, true negative, false negative), precision, recall, and ROC analysis.
The choice of which label is positive is a design decision, not a property of the data. In a churn prediction model the positive class is usually "will churn" because that is what the business wants to act on, even though most customers do not churn. In quality control the positive class is usually "defect" even though most parts are fine. Flipping the convention does not change what the model can learn, but it does flip the meaning of every metric, which is a frequent source of confusion when comparing reports across teams.
In supervised learning the model sees labeled examples and learns a decision function that maps inputs to one of the two classes. During training, both classes contribute to the loss; the model is penalized for both false positives (negatives misclassified as positive) and false negatives (positives misclassified as negative). At inference the model produces a score, and a threshold is applied to decide which side of the boundary the example falls on. Examples below the threshold are predicted as negative, and the negative class therefore acts as the default or background label.
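The default-label behavior of the negative class can be sketched in a few lines; the scores and threshold below are invented for illustration.

```python
# Sketch of score thresholding at inference, assuming a model that
# emits a probability-like score in [0, 1]; scores here are made up.
scores = [0.92, 0.35, 0.64, 0.08, 0.51]
threshold = 0.5

# Examples scoring below the threshold fall back to the negative
# class, which acts as the default or background label.
predictions = ["positive" if s >= threshold else "negative" for s in scores]
print(predictions)  # ['positive', 'negative', 'positive', 'negative', 'positive']
```

Moving the threshold changes only which side of the boundary each score falls on, not the underlying scores, which is why threshold moving is itself a cost-sensitive technique.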
In a spam detector the negative class is "not spam," which encompasses a much wider distribution of email content than the positive "spam" class. In a fraud model the negative class is the bulk of legitimate transactions. In intrusion detection it is normal network traffic. The negative label hides this internal diversity behind a single tag, which is convenient for evaluation but can hurt generalization when the negative population shifts.
Two of the four cells in a confusion matrix describe negative-class outcomes.
| Actual \ Predicted | Predicted positive | Predicted negative |
|---|---|---|
| Actual positive | True positive (TP) | False negative (FN) |
| Actual negative | False positive (FP) | True negative (TN) |
True negatives are negative-class instances correctly predicted as negative. False positives are negative-class instances the model mistakenly flags as positive. The base rate of the negative class drives several derived metrics. Specificity, defined as TN / (TN + FP), is the recall of the negative class. The false positive rate, FP / (FP + TN), is one minus specificity and forms the x-axis of the ROC curve. When negatives vastly outnumber positives, even a small false positive rate can produce more false alarms than true detections, which is why precision and the precision-recall curve are often preferred over ROC for highly imbalanced problems.
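The arithmetic behind these definitions is simple enough to sketch directly; the counts below are hypothetical, chosen so that actual negatives outnumber actual positives 100:1.

```python
# Sketch of negative-class metrics from raw confusion-matrix counts.
# Counts are invented: 100 actual positives, 10,000 actual negatives.
tp, fn = 80, 20        # actual positives: detected vs missed
fp, tn = 200, 9800     # actual negatives: false alarms vs correct rejections

specificity = tn / (tn + fp)   # recall of the negative class
fpr = fp / (fp + tn)           # 1 - specificity, the ROC x-axis
precision = tp / (tp + fp)

print(f"specificity={specificity:.3f} fpr={fpr:.3f} precision={precision:.3f}")
# A 2% false positive rate yields 200 false alarms against only
# 80 true detections, so precision falls below 0.3.
```

This is the imbalance effect described above: the false positive rate looks excellent while precision does not, because the rate is measured against a huge negative base.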
In multi-class classification, the binary positive/negative framing extends through the one-vs-rest (OvR) reduction. For each class c, the algorithm trains a binary classifier that treats c as positive and the union of every other class as negative. With ten classes, the same example serves as positive for one classifier and as negative for the other nine. The negative class in each subproblem is therefore a deliberate mixture, which is one reason OvR scores need to be calibrated before they can be compared across classifiers. The same idea appears in one-vs-one decomposition, although there each pairwise classifier sees only two classes and the negative-class definition is narrower.
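The one-vs-rest label construction itself is mechanical and can be sketched without any learning machinery; the labels below are invented for illustration.

```python
# Sketch of one-vs-rest target construction: for each class c,
# relabel the data so c is positive (1) and every other class is
# negative (0). The label list is made up.
labels = ["cat", "dog", "bird", "dog", "cat"]
classes = sorted(set(labels))

ovr_targets = {c: [1 if y == c else 0 for y in labels] for c in classes}
print(ovr_targets["dog"])  # [0, 1, 0, 1, 0]
# Each example is positive in exactly one subproblem and negative
# in all the others, so each negative class is a mixture of real classes.
```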
In anomaly detection the labeling is often inverted relative to intuition. Normal behavior is by far the larger class, and rare anomalies are what the system needs to flag. Some pipelines treat the rare events as the positive class and normal behavior as the negative class, mirroring the medical screening convention. Other one-class methods, such as one-class SVM and isolation forest, train only on the normal data and treat anything sufficiently different as anomalous at inference. Here the negative class is essentially everything the model has ever seen, and the model never gets explicit anomaly examples during training.
Open-set recognition is a related setting: a model trained on a fixed set of known classes must reject inputs from unknown classes at test time. The rejected inputs form a kind of dynamic negative class that was never present in the training distribution. Methods for open-set recognition include OpenMax, evidential deep learning, and energy-based out-of-distribution detection.
Most real-world classification problems are imbalanced, and the imbalance almost always favors the negative class. The imbalance ratio (IR) is defined as the size of the majority class divided by the size of the minority class. When IR is large, naive accuracy becomes misleading: a model that always predicts the negative class on a 1:99 problem achieves 99% accuracy while detecting nothing. See imbalanced dataset for a fuller treatment.
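The 1:99 failure mode is easy to demonstrate concretely:

```python
# Sketch of why raw accuracy misleads at a 1:99 imbalance ratio:
# a degenerate model that always predicts the negative class.
y_true = [1] * 1 + [0] * 99   # 1 positive among 100 examples
y_pred = [0] * 100            # always predict negative

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recall = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred)) / sum(y_true)

print(accuracy, recall)  # 0.99 0.0
```

The model scores 99% accuracy while its recall on the positive class is zero, which is exactly the pathology that imbalance-aware metrics are designed to expose.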
| Domain | Approximate positive rate | Implied imbalance ratio |
|---|---|---|
| Credit card fraud detection | < 1% | > 100:1 |
| Click-through prediction (display ads) | 0.1% to 1% | 100:1 to 1000:1 |
| Rare disease screening | < 0.1% | > 1000:1 |
| Manufacturing defect detection | 1% to 5% | 20:1 to 100:1 |
| Spam detection (modern email) | 30% to 50% | roughly 1:1 to 2:1 |
| Sentiment classification (balanced corpus) | ~50% | 1:1 |
Metrics designed to cope with imbalance include the area under the precision-recall curve, the F1 score, Matthews correlation coefficient, balanced accuracy, and the geometric mean of class-wise recalls (G-mean), which is widely used in the imbalanced-learning literature.
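Two of these metrics, balanced accuracy and G-mean, are direct functions of the class-wise recalls and can be sketched from confusion-matrix counts; the counts below are hypothetical.

```python
import math

# Sketch of imbalance-aware metrics built from class-wise recalls.
# Confusion-matrix counts are invented for illustration.
tp, fn, fp, tn = 80, 20, 200, 9800

sensitivity = tp / (tp + fn)   # recall of the positive class
specificity = tn / (tn + fp)   # recall of the negative class

balanced_accuracy = (sensitivity + specificity) / 2
g_mean = math.sqrt(sensitivity * specificity)

print(f"balanced acc={balanced_accuracy:.3f} g-mean={g_mean:.3f}")
```

Because both metrics average the two recalls (arithmetically or geometrically), the always-negative classifier from the imbalance example scores 0.5 balanced accuracy and 0.0 G-mean instead of a flattering raw accuracy.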
Resampling rebalances the training distribution before fitting a model. The methods fall into three groups.
| Technique | What it does | Risk |
|---|---|---|
| Random undersampling | Drops negative examples uniformly at random | Discards potentially useful information |
| Tomek links removal | Removes negative examples that form a nearest-neighbor pair with a positive | Modest balancing effect, mainly cleans the boundary |
| NearMiss-1/2/3 | Keeps negatives that are closest to positives by various distance rules | Can amplify noise near the decision boundary |
| Edited Nearest Neighbours (ENN) | Removes examples whose label disagrees with the majority of their k nearest neighbors | Sensitive to k and to noisy labels |
| Random oversampling | Duplicates positive examples until the classes match | Encourages overfitting to the duplicated points |
| SMOTE | Generates synthetic positives by interpolating between minority examples and their k nearest neighbors | Can blur the class boundary; struggles with categorical features |
| Borderline-SMOTE | Restricts SMOTE to minority points near the decision boundary | Sensitive to noise on the boundary |
| ADASYN | Adapts the per-point sampling weight so harder positives get more synthetic neighbors | Can over-emphasize outliers |
| SMOTE-Tomek | Combines SMOTE oversampling with Tomek-link cleanup | Inherits the limitations of both components |
The SMOTE family begins with the original SMOTE algorithm published by Chawla, Bowyer, Hall, and Kegelmeyer in the Journal of Artificial Intelligence Research in 2002. Borderline-SMOTE was introduced by Han, Wang, and Mao in 2005, and ADASYN by He, Bai, Garcia, and Li at IJCNN 2008. The most widely used implementation in Python is imbalanced-learn (imported as imblearn), which is part of the scikit-learn-contrib ecosystem and provides drop-in API-compatible samplers.
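The interpolation step at the core of SMOTE is compact enough to sketch in plain Python; the two points below stand in for a minority example and one of its k nearest minority neighbors, which in the real algorithm come from a k-NN search.

```python
import random

# Sketch of SMOTE's interpolation step: a synthetic positive is
# placed at a uniformly random point on the segment between a
# minority example and one of its nearest minority neighbors.
# The coordinates are made up for illustration.
random.seed(0)
x = [1.0, 2.0]          # a minority (positive) example
neighbor = [3.0, 4.0]   # one of its k nearest minority neighbors

gap = random.random()   # uniform in [0, 1)
synthetic = [xi + gap * (ni - xi) for xi, ni in zip(x, neighbor)]
print(synthetic)        # lies on the segment between x and neighbor
```

Because the synthetic point always lies between two minority examples, SMOTE can blur the class boundary when those examples sit on opposite sides of a pocket of negatives, which is the limitation noted in the table above.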
The choice between undersampling and oversampling depends on data volume and the cost of missing the positive class. Undersampling the negative class is fast and works well when there is enough data that throwing some away is harmless, but it discards examples the model could have learned from. Oversampling preserves all of the negative-class information at the cost of larger training sets and the risk that the model memorizes the duplicated or interpolated positives.
Resampling reshapes the data; cost-sensitive methods reshape the loss instead. They leave the dataset alone and tell the optimizer to weigh mistakes on the minority class more heavily than mistakes on the majority class. See cost-sensitive learning for a deeper treatment.
| Method | Mechanism |
|---|---|
| Class-weighted cross entropy | Multiplies the loss for each example by a per-class weight, typically inversely proportional to class frequency |
| class_weight='balanced' in scikit-learn | Sets weights to n_samples / (n_classes * n_samples_per_class) automatically |
| Threshold moving | Trains an unweighted model and then shifts the decision threshold to favor the minority class at inference |
| Calibrated cost matrix | Specifies an explicit cost for each (predicted, actual) pair and chooses the prediction that minimizes expected cost |
| Focal loss | Down-weights well-classified examples by a factor of (1 - p_t)^gamma, originally proposed for dense object detection |
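The class-weighted cross entropy in the first row can be sketched directly, using the same balanced-weight heuristic the table attributes to scikit-learn; the labels and scores below are invented.

```python
import math

# Sketch of class-weighted binary cross entropy with weights
# inversely proportional to class frequency, mirroring the
# balanced heuristic w_c = n_samples / (n_classes * n_c).
# Labels and model scores are made up for illustration.
y_true = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]                       # 1 positive, 9 negatives
p_pos  = [0.6, 0.2, 0.1, 0.3, 0.1, 0.2, 0.1, 0.1, 0.2, 0.1]  # predicted P(positive)

n = len(y_true)
n_pos = sum(y_true)
n_neg = n - n_pos
w = {1: n / (2 * n_pos), 0: n / (2 * n_neg)}  # balanced class weights

loss = -sum(
    w[y] * (math.log(p) if y == 1 else math.log(1.0 - p))
    for y, p in zip(y_true, p_pos)
) / n
print(f"weights={w} weighted loss={loss:.3f}")
# The lone positive carries weight 5.0 versus about 0.56 per
# negative, so missing it dominates the loss.
```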
Focal loss was introduced by Lin, Goyal, Girshick, He, and Dollar in 2017 as part of the RetinaNet object detector, where each image contains tens of thousands of background (negative) anchor boxes and only a handful of foreground (positive) ones. The same idea has since been used for long-tailed image classification, named entity recognition, and other tasks where the negative class swamps the positive one.
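The modulating factor can be sketched for the binary case; gamma = 2 is the setting commonly reported for RetinaNet, and the example scores are invented.

```python
import math

# Sketch of binary focal loss: standard cross entropy scaled by
# the modulating factor (1 - p_t)^gamma from Lin et al.
def focal_loss(p, y, gamma=2.0):
    """Focal loss for one example with predicted P(positive) = p
    and true label y in {0, 1}."""
    p_t = p if y == 1 else 1.0 - p          # probability of the true class
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

# A well-classified negative (p_t = 0.99) is down-weighted almost
# to zero, while a hard positive (p_t = 0.10) keeps most of its loss.
easy_negative = focal_loss(0.01, 0)
hard_positive = focal_loss(0.10, 1)
print(easy_negative, hard_positive)
```

This is what lets RetinaNet train against tens of thousands of easy background anchors per image: their combined loss contribution shrinks toward zero instead of swamping the rare foreground anchors.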
Algorithmic ensemble methods combine resampling with bagging. EasyEnsemble trains many base learners on independently undersampled subsets of the negatives and averages their predictions. A balanced random forest applies the same idea within a random forest, drawing a balanced bootstrap sample (undersampling the negatives) for each tree. Both keep the speed of undersampling while recovering some of the information that a single undersample would discard.
The term "negative class" also appears in self-supervised and contrastive methods, where the labels are constructed rather than given.
In word2vec, Mikolov, Sutskever, Chen, Corrado, and Dean (NeurIPS 2013) replaced the expensive softmax over the full vocabulary with negative sampling: for each true (word, context) pair, a small number of "negative" words are drawn from a noise distribution and the model is trained to score the true pair higher than the noise pairs. The negative class here is not a category in the data but a synthetic foil constructed from random vocabulary draws.
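The sampling step can be sketched in a few lines; the toy corpus counts are invented, and the 3/4 exponent on the unigram counts is the flattening used in the word2vec paper.

```python
import random

# Sketch of negative sampling as in word2vec: for each true
# (word, context) pair, draw k "negative" words from the unigram
# distribution raised to the 3/4 power. Corpus counts are made up.
random.seed(1)
counts = {"the": 100, "cat": 10, "sat": 5, "quantum": 1}
words = list(counts)
weights = [counts[w] ** 0.75 for w in words]  # flattens the distribution

def draw_negatives(true_context, k=3):
    # Reject draws that collide with the true context word, so the
    # synthetic negatives never duplicate the real pair.
    out = []
    while len(out) < k:
        w = random.choices(words, weights=weights)[0]
        if w != true_context:
            out.append(w)
    return out

negatives = draw_negatives("cat")
print(negatives)  # three sampled words, none equal to "cat"
```

The 3/4 exponent keeps frequent words from dominating the noise distribution entirely while still sampling them more often than rare words.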
Contrastive learning methods adopt the same logic for images and other modalities. SimCLR (Chen et al., 2020) treats other examples in the same minibatch as negatives, so the effective number of negatives scales with batch size. MoCo (He et al., 2020) decouples the negatives from the batch by maintaining a queue of recent encoder outputs. Hard-negative mining tries to focus the loss on the negatives that are currently hardest to push away from the anchor, which is faster than uniform sampling but can amplify noisy labels if applied too aggressively.
Metric learning for face and image retrieval relies on the same idea. FaceNet (Schroff, Kalenichenko, Philbin, CVPR 2015) trains with a triplet loss that pulls an anchor toward a positive of the same identity and pushes it away from a negative of a different identity, and uses online semi-hard negative mining to choose informative triplets within each minibatch.
The negative class is whatever is left over after the positive class is defined, which sounds simple but rarely is. A spam detector trained on a corpus of marketing email may flag legitimate marketing as spam when deployed against a wider population, because the implicit definition of "not spam" was narrower than the real world. A defect detector trained on photos of one production line may fail on another line whose backgrounds differ. The fix is usually to broaden the negative population during training, either by adding harder negatives, by using domain-randomized negatives, or by reframing the problem as anomaly detection where only the positive distribution is modeled directly.
When the negative class is heterogeneous, it can also help to break it apart. A binary intent classifier that lumps every non-purchase intent into a single negative often does worse than a multiclass model with explicit categories for browse, support, and complaint, because the multiclass loss forces the model to keep those subpopulations separate in feature space.
A few rules of thumb apply across most projects: report precision-recall metrics rather than raw accuracy when the negative class dominates; try class weights or threshold moving before resampling, since they leave the data intact; when the negative class is heterogeneous, consider splitting it into explicit subclasses; and audit the negative population after deployment, since it is the part of the distribution most likely to shift.
When we teach a computer to tell two things apart, we give the things names. The thing we want the computer to find is called the positive class, and the other thing is called the negative class.
Imagine you are sorting apples and oranges. If you want the computer to find apples, then apples are the positive class and oranges are the negative class. The computer learns the difference between them so it can pick out the apples.
Now imagine your basket has a thousand oranges and only one apple. If the computer just guesses "orange" every time, it will be right almost always, but it will never find the apple. That is the problem of an imbalanced negative class. So we use tricks like making more pretend apples (oversampling), throwing away some oranges (undersampling), or telling the computer that missing an apple is much worse than mistaking an orange. Those tricks help the computer pay attention to the rare thing we actually care about.