A false negative (FN) is an outcome in classification where a machine learning model incorrectly predicts the negative class for an instance that actually belongs to the positive class. In statistical hypothesis testing, a false negative corresponds to a Type II error, meaning the test fails to reject a null hypothesis that is in fact false. False negatives are one of the four possible outcomes in a confusion matrix, alongside true positives (TP), true negatives (TN), and false positives (FP).
The consequences of false negatives vary widely depending on the application domain. In medical screening, a false negative means a sick patient is told they are healthy. In fraud detection, it means a fraudulent transaction goes undetected. In security systems, it means a genuine threat is missed. Because of these high-stakes implications, understanding, measuring, and minimizing false negatives is a core concern in applied machine learning and statistical testing.
Imagine you are playing a game where you have to find all the red balls in a big box full of red and blue balls. Every time you pick up a ball, you have to say whether it is red or blue.
A false negative happens when you look at a red ball and say, "That one is blue!" You missed it. The red ball was really there, but you said it was not.
Now imagine a doctor is checking people to see if they are sick. If a sick person comes in and the doctor says, "You are healthy, you can go home," that is a false negative. The person was actually sick, but the test said they were fine. That is why false negatives can be a big problem: they mean you miss something important.
A confusion matrix is a table that summarizes the performance of a classification model by comparing predicted labels against actual labels. For binary classification, the matrix has four cells.
| | Predicted positive | Predicted negative |
|---|---|---|
| Actual positive | True positive (TP) | False negative (FN) |
| Actual negative | False positive (FP) | True negative (TN) |
A false negative sits in the cell where the actual class is positive but the model's prediction is negative. In other words, the model "missed" a positive instance. The total number of actual positives in a dataset is the sum of true positives and false negatives:
P = TP + FN
This relationship is fundamental to several evaluation metrics discussed below.
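As a concrete illustration, the four counts can be extracted directly from predicted and actual labels. A minimal sketch using scikit-learn, with made-up labels:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical ground-truth and predicted labels (1 = positive, 0 = negative)
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 0, 0]

# ravel() flattens the 2x2 matrix in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

print(f"TP={tp}, FN={fn}, FP={fp}, TN={tn}")
print(f"Actual positives P = TP + FN = {tp + fn}")
```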
The concept of a false negative has its roots in the Neyman-Pearson framework of statistical hypothesis testing, developed by Jerzy Neyman and Egon Pearson in the early 1930s. In this framework, two types of errors can occur when testing a hypothesis:
| Error type | Statistical name | Description |
|---|---|---|
| Type I error | False positive | Rejecting a true null hypothesis |
| Type II error | False negative | Failing to reject a false null hypothesis |
The probability of committing a Type II error is denoted by the Greek letter beta. A widely used convention holds that beta should not exceed 0.20, meaning the test should detect true effects at least 80% of the time. The complement of beta (1 minus beta) is called the statistical power of a test, representing the probability of correctly detecting a true effect.
Ronald Fisher's earlier approach to significance testing did not include an alternative hypothesis, so there was no formal concept of a Type II error in his framework. The distinction between the two approaches remains an important topic in the philosophy of statistics.
Several standard evaluation metrics directly incorporate false negatives. Understanding these metrics helps practitioners diagnose whether their models are producing too many missed detections.
Recall measures the proportion of actual positive instances that the model correctly identifies. It is also called sensitivity or true positive rate.
Recall = TP / (TP + FN)
A recall of 1.0 means the model produced zero false negatives; every positive instance was correctly detected. A recall of 0.0 means the model missed every positive instance. In medical contexts, recall is often the most important metric because missing a disease diagnosis (a false negative) can be far more harmful than a false alarm.
The false negative rate (FNR), also called the miss rate, is the complement of recall:
FNR = FN / (TP + FN) = 1 - Recall
A false negative rate of 0.13 means that for every 100 actual positive cases, 13 are incorrectly classified as negative. This metric is especially informative in domains like medical imaging, where studies of AI systems for chest X-ray analysis have reported false negative rates around 0.13: roughly 13 out of every 100 patients with the condition are missed.
The F1 score is the harmonic mean of precision and recall, balancing the trade-off between false positives and false negatives:
F1 = 2 * (Precision * Recall) / (Precision + Recall)
When false negatives are more costly than false positives, practitioners may use the F-beta score with beta greater than 1 (for example, F2), which weights recall more heavily. Conversely, when false positives are more costly, an F-beta score with beta less than 1 (for example, F0.5) places greater emphasis on precision.
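A minimal sketch of how these metrics are computed in practice, using scikit-learn; the labels and the choice of beta = 2 are illustrative:

```python
from sklearn.metrics import recall_score, f1_score, fbeta_score

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 0, 0]

recall = recall_score(y_true, y_pred)      # TP / (TP + FN)
fnr = 1 - recall                           # miss rate
f1 = f1_score(y_true, y_pred)              # harmonic mean of precision and recall
f2 = fbeta_score(y_true, y_pred, beta=2)   # weights recall more heavily

print(f"Recall={recall:.2f}, FNR={fnr:.2f}, F1={f1:.2f}, F2={f2:.2f}")
```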
Specificity (true negative rate) measures performance on the negative class, and it is linked to false negatives through the sensitivity-specificity trade-off: increasing a model's sensitivity (reducing false negatives) typically comes at the cost of reduced specificity (more false positives), and vice versa.
In most classifiers, particularly probabilistic ones like logistic regression, predictions are based on a confidence score. A classification threshold converts these continuous scores into discrete class labels. Adjusting this threshold directly affects the balance between false negatives and false positives.
| Threshold adjustment | Effect on false negatives | Effect on false positives | Recall | Precision |
|---|---|---|---|---|
| Lower the threshold | Decreases (fewer missed positives) | Increases (more false alarms) | Increases | Decreases |
| Raise the threshold | Increases (more missed positives) | Decreases (fewer false alarms) | Decreases | Increases |
This trade-off is visualized in two standard plots: the receiver operating characteristic (ROC) curve, which plots the true positive rate against the false positive rate across thresholds, and the precision-recall curve, which plots precision against recall across thresholds.
The optimal threshold depends entirely on the application. In cancer screening, a low threshold is preferred because the cost of missing a malignant tumor (false negative) far outweighs the cost of an unnecessary follow-up test (false positive). In email spam filtering, a slightly higher threshold may be acceptable because incorrectly sending a legitimate email to spam (false positive) can be more disruptive than letting occasional spam through (false negative).
False negatives arise from a variety of sources. Understanding the root causes helps guide the choice of mitigation strategy.
When the training data does not contain enough examples of the positive class, or when the examples present do not cover the full range of variation in the real world, the model may fail to recognize positive instances at inference time. For example, a skin cancer detector trained primarily on images of lighter skin tones may produce more false negatives on patients with darker skin.
Class imbalance occurs when the negative class vastly outnumbers the positive class in the training dataset. In such cases, a classifier can achieve high accuracy simply by predicting the majority (negative) class for nearly every instance. This strategy naturally produces many false negatives for the minority (positive) class. Credit card fraud detection is a classic example: legitimate transactions may outnumber fraudulent ones by a ratio of 1000 to 1 or more.
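The resulting accuracy trap is easy to demonstrate: a degenerate model that always predicts the majority class scores near-perfect accuracy while missing every positive. A sketch with a hypothetical 1000:1 imbalance:

```python
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical imbalanced labels: 10 fraud cases among 10,000 legitimate transactions
y_true = [1] * 10 + [0] * 10000
y_pred = [0] * len(y_true)  # degenerate model: always predict "legitimate"

print(f"Accuracy: {accuracy_score(y_true, y_pred):.4f}")  # ~0.999
print(f"Recall:   {recall_score(y_true, y_pred):.4f}")    # 0.0 -- every fraud missed
```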
A model that is too simple to capture the underlying patterns in the data will underfit. For instance, a linear classifier applied to data with complex, non-linear decision boundaries may fail to identify many positive instances, resulting in elevated false negative rates.
Overfitting occurs when a model memorizes the training data instead of learning generalizable patterns. While overfitting often manifests as poor performance on unseen data in general, it can specifically increase false negatives if the model has memorized a narrow definition of what constitutes a positive example.
Poor feature engineering, noisy measurements, or missing feature values can obscure the signal that distinguishes positive from negative instances. In medical imaging, for example, low-resolution scans or artifacts in the image may cause a model to miss a lesion.
As discussed above, a threshold that is set too high requires the model to be very confident before predicting the positive class. This directly increases the number of false negatives because borderline positive cases are classified as negative.
If the ground truth labels in the training data contain errors (positive instances mislabeled as negative), the model learns to replicate these mistakes, leading to false negatives on correctly labeled test data.
Reducing false negatives requires a combination of data-level, algorithm-level, and post-processing techniques. The appropriate strategy depends on the specific problem and the acceptable trade-off with false positives.
The simplest approach is lowering the classification threshold. This allows the model to predict the positive class at lower confidence levels, catching more true positives at the cost of additional false positives. This is often the first technique to try because it requires no retraining.
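A minimal sketch of threshold lowering, assuming a scikit-learn probabilistic classifier; the synthetic dataset and the 0.3 threshold are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic imbalanced data: roughly 10% positives
X, y = make_classification(n_samples=1000, weights=[0.9], random_state=0)
clf = LogisticRegression().fit(X, y)

proba_pos = clf.predict_proba(X)[:, 1]         # positive-class scores
pred_default = (proba_pos >= 0.5).astype(int)  # default threshold
pred_lowered = (proba_pos >= 0.3).astype(int)  # lowered threshold

# Lowering the threshold converts some false negatives into true positives
print("FN at 0.5:", int(((y == 1) & (pred_default == 0)).sum()))
print("FN at 0.3:", int(((y == 1) & (pred_lowered == 0)).sum()))
```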
Resampling the training data can help the model learn a better representation of the positive class; a brief code sketch follows the table:
| Technique | Description | Pros | Cons |
|---|---|---|---|
| Oversampling | Duplicate or synthetically generate minority class examples | Simple to implement; more positive examples for learning | Risk of overfitting to duplicated examples |
| SMOTE | Generate synthetic minority examples by interpolating between existing ones | Produces diverse synthetic examples; reduces overfitting compared to naive oversampling | Can create noisy samples in overlapping regions |
| Undersampling | Remove majority class examples to balance the dataset | Reduces training time; simplifies the learning problem | Discards potentially useful information |
| Hybrid methods | Combine oversampling of the minority class with undersampling of the majority class | Balances trade-offs of both approaches | Requires more careful tuning |
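As a sketch of oversampling in practice, assuming the third-party imbalanced-learn package is installed (the dataset is synthetic):

```python
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Synthetic imbalanced dataset: roughly 10% positives
X, y = make_classification(n_samples=1000, weights=[0.9], random_state=0)
print("Before:", Counter(y))

# SMOTE synthesizes new minority examples by interpolating
# between existing minority neighbors in feature space
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print("After: ", Counter(y_res))  # classes balanced
```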
Cost-sensitive learning assigns different misclassification costs to different types of errors. By setting the cost of a false negative higher than the cost of a false positive, the loss function penalizes missed positives more heavily during training. Many popular algorithms support class weights or cost matrices, including support vector machines, logistic regression, decision trees, and neural networks.
For example, in scikit-learn, setting class_weight='balanced' or providing a custom weight dictionary automatically adjusts the loss function to account for class imbalance.
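A short sketch of both options; the 1:10 cost ratio here is an illustrative assumption, not a recommended value:

```python
from sklearn.linear_model import LogisticRegression

# Explicit weights: penalize a missed positive (FN) ten times
# as heavily as a false alarm (FP) -- an illustrative ratio only
clf_weighted = LogisticRegression(class_weight={0: 1, 1: 10})

# Or derive weights automatically from inverse class frequencies
clf_balanced = LogisticRegression(class_weight="balanced")
```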
Ensemble learning techniques combine the predictions of multiple models to improve overall performance. Common approaches include bagging (training models on bootstrap samples, as in random forests), boosting (training models sequentially with extra weight on previously misclassified examples, which can help recover hard positives), and stacking (combining heterogeneous models through a meta-learner).
Data augmentation increases the diversity of training examples through transformations. In computer vision, this includes rotations, flips, color jittering, and cropping. In natural language processing, it can involve synonym replacement, back-translation, or paraphrasing. More diverse training data helps the model generalize to a wider range of positive instances.
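A minimal sketch of image-style augmentation using plain NumPy; the flip-and-rotate transformations shown are a small illustrative subset:

```python
import numpy as np

def augment(image: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Randomly flip and rotate a 2-D image array."""
    if rng.random() < 0.5:
        image = np.fliplr(image)                    # horizontal flip
    return np.rot90(image, k=int(rng.integers(4)))  # 0-3 quarter turns

rng = np.random.default_rng(0)
image = rng.random((32, 32))  # hypothetical grayscale image
augmented = [augment(image, rng) for _ in range(4)]
```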
Better features can make the distinction between positive and negative classes clearer. This may involve domain-specific feature extraction, dimensionality reduction, or using deep learning models that automatically learn relevant feature representations through transfer learning.
Choosing a model with appropriate complexity for the problem is essential. Regularization techniques such as L1/L2 penalties and dropout can prevent overfitting and improve generalization. When the underlying decision boundary is complex, more expressive models like convolutional neural networks or ensemble methods may be needed to avoid underfitting.
Model calibration techniques such as Platt scaling or isotonic regression can improve the reliability of predicted probabilities, making threshold selection more principled and reducing the likelihood of false negatives.
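A sketch of calibration in scikit-learn, wrapping an uncalibrated classifier; the choice of isotonic regression and five folds is illustrative:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=1000, weights=[0.9], random_state=0)

# Isotonic regression remaps the raw scores of an uncalibrated model
# so that predicted probabilities better match observed frequencies
calibrated = CalibratedClassifierCV(LinearSVC(), method="isotonic", cv=5).fit(X, y)
proba_pos = calibrated.predict_proba(X)[:, 1]  # calibrated positive-class probabilities
```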
The impact and acceptable rate of false negatives vary significantly by domain. The following table summarizes how false negatives affect different fields.
| Domain | What a false negative means | Typical consequence | Priority |
|---|---|---|---|
| Medical diagnosis | A sick patient is classified as healthy | Delayed treatment, disease progression, potential death | Minimizing false negatives is the top priority |
| Cancer screening | A malignant tumor is missed | Cancer spreads to later stages before detection | Very high recall targets (often above 0.95) |
| Fraud detection | A fraudulent transaction is approved | Financial loss for the institution or customer | High recall needed, balanced with operational costs |
| Spam filtering | A spam or phishing email reaches the inbox | Security risk, user annoyance | Moderate recall; false positives (legitimate email in spam) are also costly |
| Autonomous vehicles | A pedestrian or obstacle is not detected | Collision, injury, or death | Near-zero false negative rate required |
| Cybersecurity / intrusion detection | A network attack goes undetected | Data breach, system compromise, financial and reputational damage | Very high recall is necessary |
| Manufacturing quality control | A defective product passes inspection | Defective products reach customers, potential recalls | High recall needed to maintain quality standards |
| Criminal justice / recidivism | A high-risk individual is classified as low risk | Potential re-offense after release | Ethically complex; false negatives and false positives both carry serious consequences |
In healthcare, false negatives can be life-threatening. If a diagnostic model classifies a patient with cancer as cancer-free, the patient may not receive timely treatment, leading to disease progression. Research on AI-assisted chest X-ray reading has demonstrated that when a computer-aided detection (CAD) system produces a false negative, only 21% of radiologists in the CAD-assisted condition correctly identified the cancer, compared to 46% in a condition without CAD assistance. This finding illustrates a dangerous pattern: clinicians may over-rely on AI predictions, and a false negative from the AI can cause a physician to overlook findings they would have otherwise caught.
Furthermore, when AI systems are trained on datasets that underrepresent certain demographic groups, the false negative rates for those groups can be disproportionately high, worsening existing health disparities.
In financial fraud detection, each false negative represents a fraudulent transaction that the system allowed to proceed. The cost of a single missed fraud can range from hundreds to millions of dollars. Institutions that consistently fail to detect fraud may also face regulatory penalties. Because the base rate of fraud is very low (often less than 0.1% of all transactions), class imbalance is a primary driver of false negatives in this domain.
In self-driving car systems, the object detection models must identify pedestrians, cyclists, vehicles, and obstacles with extremely low false negative rates. A missed detection can lead to a collision. In one well-known incident in 2018, an autonomous vehicle's perception system repeatedly misclassified a pedestrian, cycling through different object categories without ever correctly identifying her, contributing to a fatal accident. This case highlights that false negatives in safety-critical perception systems can have irreversible consequences.
In intrusion detection systems (IDS), a false negative allows an attacker to penetrate a network without triggering any alerts. Research on ransomware detection has shown that the cost of a false negative (an undetected ransomware infection) is typically far greater and less predictable than the cost of investigating a false positive (a benign file flagged as suspicious). For this reason, IDS systems generally prioritize high recall even at the expense of generating more false alarms for security analysts to investigate.
In email filtering, a false negative means a spam or phishing email reaches the user's inbox. While this may seem less severe than medical or safety applications, phishing emails that evade detection can lead to credential theft, financial fraud, and malware infections. The challenge is balancing recall (catching all spam) with precision (not sending legitimate emails to the spam folder).
False negatives and false positives represent opposite types of classification errors. The relative importance of each depends on the problem context.
| Aspect | False negative (FN) | False positive (FP) |
|---|---|---|
| Definition | Positive instance predicted as negative | Negative instance predicted as positive |
| Statistical name | Type II error | Type I error |
| Symbol for error probability | Beta | Alpha |
| Also known as | Miss, missed detection | False alarm |
| Effect on recall | Reduces recall | No direct effect |
| Effect on precision | No direct effect | Reduces precision |
| Medical example | Sick patient told they are healthy | Healthy patient told they are sick |
| Security example | Threat not detected | Harmless activity flagged as threat |
| Trade-off relationship | Reducing FN typically increases FP | Reducing FP typically increases FN |
The inverse relationship between Type I and Type II errors is a fundamental constraint in classification. Reducing one type of error generally increases the other, unless the model itself is improved (for example, through better features, more data, or a more appropriate algorithm).
Consider a disease screening model tested on 1,000 patients. Of these, 200 actually have the disease (positive class) and 800 do not (negative class). The model produces the following confusion matrix:
| | Predicted positive | Predicted negative | Total |
|---|---|---|---|
| Actually positive | 160 (TP) | 40 (FN) | 200 |
| Actually negative | 60 (FP) | 740 (TN) | 800 |
| Total | 220 | 780 | 1,000 |
From this confusion matrix, the key metrics are:
- Accuracy = (TP + TN) / Total = (160 + 740) / 1,000 = 0.90
- Recall = TP / (TP + FN) = 160 / 200 = 0.80
- Precision = TP / (TP + FP) = 160 / 220 ≈ 0.727
- False negative rate = FN / (TP + FN) = 40 / 200 = 0.20
- Specificity = TN / (TN + FP) = 740 / 800 = 0.925
Note that while the accuracy appears high at 90%, the model missed 40 out of 200 sick patients (a false negative rate of 20%). In a medical context, this means 40 patients with the disease would be sent home without treatment. This example illustrates why accuracy alone is an insufficient metric when false negatives carry significant costs.
If the classification threshold is lowered to reduce false negatives from 40 to 10, the revised confusion matrix might look like:
| | Predicted positive | Predicted negative | Total |
|---|---|---|---|
| Actually positive | 190 (TP) | 10 (FN) | 200 |
| Actually negative | 150 (FP) | 650 (TN) | 800 |
| Total | 340 | 660 | 1,000 |
Recall improved from 0.80 to 0.95 (only 10 patients are now missed instead of 40), but precision dropped from 0.727 to 0.559, and accuracy fell from 0.90 to 0.84. In a medical screening context, this trade-off is generally considered worthwhile because the 90 additional false positives simply receive further testing, while the 30 additional true positives receive needed treatment.
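The arithmetic above can be verified directly; a small sketch that recomputes the metrics from both confusion matrices:

```python
def metrics(tp, fn, fp, tn):
    """Compute the standard metrics from the four confusion matrix cells."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "recall": tp / (tp + fn),
        "precision": tp / (tp + fp),
        "fnr": fn / (tp + fn),
    }

print(metrics(tp=160, fn=40, fp=60, tn=740))   # original threshold
print(metrics(tp=190, fn=10, fp=150, tn=650))  # lowered threshold
```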
In multiclass problems with more than two classes, false negatives are computed on a per-class basis. For each class, instances belonging to that class but predicted as any other class count as false negatives for that class.
For example, in a three-class classification problem (classes A, B, and C), if an instance of class A is predicted as class B, it is a false negative for class A and a false positive for class B. Metrics like recall are then computed for each class individually, and can be aggregated using macro-averaging (the unweighted mean of per-class scores), micro-averaging (computed from pooled counts across classes), or weighted averaging (per-class scores weighted by class support).
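A sketch of per-class and macro-averaged recall for a hypothetical three-class problem, using scikit-learn:

```python
from sklearn.metrics import recall_score

# Hypothetical labels for classes A=0, B=1, C=2
y_true = [0, 0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 1, 0, 1, 1, 2, 0, 2]

# average=None returns one recall per class; each class's false negatives
# are its instances predicted as any other class
per_class = recall_score(y_true, y_pred, average=None)
macro = recall_score(y_true, y_pred, average="macro")
print(per_class, macro)
```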
The ROC curve plots the true positive rate (1 minus the false negative rate) against the false positive rate at all possible classification thresholds. A higher AUC (area under the ROC curve) generally indicates a model that achieves a better trade-off between detecting positives and avoiding false alarms. However, the ROC curve does not directly reveal the absolute number of false negatives at any given threshold, so it should be supplemented with a precision-recall analysis in cases where false negatives are the primary concern.
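Because the true positive rate returned by a ROC computation is 1 minus the false negative rate, the miss rate at each candidate threshold can be read off directly. A sketch with scikit-learn, using made-up scores:

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Made-up labels and classifier scores
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.90, 0.80, 0.40, 0.30, 0.60, 0.20, 0.10, 0.05]

fpr, tpr, thresholds = roc_curve(y_true, scores)
fnr = 1 - tpr  # miss rate at each candidate threshold

print("AUC:", roc_auc_score(y_true, scores))
for threshold, miss in zip(thresholds, fnr):
    print(f"threshold >= {threshold:.2f}: FNR = {miss:.2f}")
```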
Using cross-validation (such as k-fold cross-validation) helps produce more reliable estimates of a model's false negative rate by evaluating performance across multiple splits of the data. This is especially important when the positive class is rare, as a single train-test split may not provide a stable estimate of recall or the false negative rate.
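A sketch of estimating recall, and hence the false negative rate, with stratified k-fold cross-validation on a synthetic imbalanced dataset:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=1000, weights=[0.9], random_state=0)

# Stratification keeps the rare positive class proportionally represented in each fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
recalls = cross_val_score(LogisticRegression(), X, y, cv=cv, scoring="recall")

print("Recall per fold:", recalls)
print("Estimated FNR:  ", 1 - recalls.mean())
```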
To assess how many false negatives a model will produce in practice, it is important to evaluate on validation data and test data that closely reflect the real-world distribution of positive and negative cases. If the test set is not representative, the observed false negative rate may be misleadingly low or high.
The formal study of false negatives in hypothesis testing dates to the work of Jerzy Neyman and Egon Pearson in their landmark 1933 paper, which introduced the concepts of Type I and Type II errors, power functions, and the theory of optimal testing. Their framework established the idea that any decision procedure involves two kinds of mistakes, and that minimizing one type of error while controlling the other requires explicit trade-offs.
In the machine learning era, the concept carried over directly. The development of ROC analysis during World War II for radar signal detection extended the statistical framework to practical detection problems, where missing a real signal (a false negative) could mean failing to detect an incoming aircraft. This military application established the pattern that continues today: in high-stakes detection systems, the cost of a false negative typically exceeds the cost of a false positive.
The rapid growth of machine learning applications in healthcare, autonomous driving, and security during the 2010s and 2020s brought renewed attention to false negatives, as these fields demanded models with very low miss rates and highlighted the inadequacy of accuracy as a sole evaluation metric.