A false negative (FN) is an outcome in classification where a machine learning model incorrectly predicts the negative class for an instance that actually belongs to the positive class. In statistical hypothesis testing, a false negative corresponds to a Type II error, meaning the test fails to reject a null hypothesis that is in fact false. False negatives are one of the four possible outcomes in a confusion matrix, alongside true positives (TP), true negatives (TN), and false positives (FP).
The consequences of false negatives vary widely depending on the application domain. In medical screening, a false negative means a sick patient is told they are healthy. In fraud detection, it means a fraudulent transaction goes undetected. In security systems, it means a genuine threat is missed. Because of these high-stakes implications, understanding, measuring, and minimizing false negatives is a core concern in applied machine learning and statistical testing.
Imagine you are playing a game where you have to find all the red balls in a big box full of red and blue balls. Every time you pick up a ball, you have to say whether it is red or blue.
A false negative happens when you look at a red ball and say, "That one is blue!" You missed it. The red ball was really there, but you said it was not.
Now imagine a doctor is checking people to see if they are sick. If a sick person comes in and the doctor says, "You are healthy, you can go home," that is a false negative. The person was actually sick, but the test said they were fine. That is why false negatives can be a big problem: they mean you miss something important.
A confusion matrix is a table that summarizes the performance of a classification model by comparing predicted labels against actual labels. For binary classification, the matrix has four cells.
| | Predicted positive | Predicted negative |
|---|---|---|
| Actual positive | True positive (TP) | False negative (FN) |
| Actual negative | False positive (FP) | True negative (TN) |
A false negative sits in the cell where the actual class is positive but the model's prediction is negative. In other words, the model "missed" a positive instance. The total number of actual positives in a dataset is the sum of true positives and false negatives:
P = TP + FN
This relationship is fundamental to several evaluation metrics discussed below.
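As a concrete illustration, the four counts can be extracted directly from predicted and actual labels. A minimal sketch using scikit-learn, with made-up labels:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical ground-truth and predicted labels (1 = positive, 0 = negative)
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 0, 0]

# ravel() flattens the 2x2 matrix in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

print(f"TP={tp}, FN={fn}, FP={fp}, TN={tn}")
print(f"Actual positives P = TP + FN = {tp + fn}")
```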
The concept of a false negative has its roots in the Neyman-Pearson framework of statistical hypothesis testing, developed by Jerzy Neyman and Egon Pearson in the early 1930s. In this framework, two types of errors can occur when testing a hypothesis:
| Error type | Statistical name | Description |
|---|---|---|
| Type I error | False positive | Rejecting a true null hypothesis |
| Type II error | False negative | Failing to reject a false null hypothesis |
The probability of committing a Type II error is denoted by the Greek letter beta. A widely used convention holds that beta should not exceed 0.20, meaning the test should detect true effects at least 80% of the time. The complement of beta (1 minus beta) is called the statistical power of a test, representing the probability of correctly detecting a true effect.
Ronald Fisher's earlier approach to significance testing did not include an alternative hypothesis, so there was no formal concept of a Type II error in his framework. The distinction between the two approaches remains an important topic in the philosophy of statistics.
Several standard evaluation metrics directly incorporate false negatives. Understanding these metrics helps practitioners diagnose whether their models are producing too many missed detections.
Recall measures the proportion of actual positive instances that the model correctly identifies. It is also called sensitivity or true positive rate.
Recall = TP / (TP + FN)
A recall of 1.0 means the model produced zero false negatives; every positive instance was correctly detected. A recall of 0.0 means the model missed every positive instance. In medical contexts, recall is often the most important metric because missing a disease diagnosis (a false negative) can be far more harmful than a false alarm.
The false negative rate (FNR), also called the miss rate, is the complement of recall:
FNR = FN / (TP + FN) = 1 - Recall
A false negative rate of 0.13 means that for every 100 actual positive cases, 13 are incorrectly classified as negative. This metric is especially informative in domains like medical imaging, where studies of AI systems for chest X-ray analysis have reported false negative rates around 0.13: roughly 13 out of every 100 patients with the condition are missed.
The F1 score is the harmonic mean of precision and recall, balancing the trade-off between false positives and false negatives:
F1 = 2 * (Precision * Recall) / (Precision + Recall)
When false negatives are more costly than false positives, practitioners may use the F-beta score with beta greater than 1 (for example, F2), which weights recall more heavily. Conversely, when false positives are more costly, an F-beta score with beta less than 1 (for example, F0.5) places greater emphasis on precision.
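A minimal sketch of how these metrics are computed in practice, using scikit-learn; the labels and the choice of beta = 2 are illustrative:

```python
from sklearn.metrics import recall_score, f1_score, fbeta_score

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 0, 0]

recall = recall_score(y_true, y_pred)      # TP / (TP + FN)
fnr = 1 - recall                           # miss rate
f1 = f1_score(y_true, y_pred)              # harmonic mean of precision and recall
f2 = fbeta_score(y_true, y_pred, beta=2)   # weights recall more heavily

print(f"Recall={recall:.2f}, FNR={fnr:.2f}, F1={f1:.2f}, F2={f2:.2f}")
```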
Specificity (true negative rate) measures performance on the negative class, and it is linked to false negatives through the sensitivity-specificity trade-off: increasing a model's sensitivity (reducing false negatives) typically comes at the cost of reduced specificity (more false positives), and vice versa.
In most classifiers, particularly probabilistic ones like logistic regression, predictions are based on a confidence score. A classification threshold converts these continuous scores into discrete class labels. Adjusting this threshold directly affects the balance between false negatives and false positives.
| Threshold adjustment | Effect on false negatives | Effect on false positives | Recall | Precision |
|---|---|---|---|---|
| Lower the threshold | Decreases (fewer missed positives) | Increases (more false alarms) | Increases | Decreases |
| Raise the threshold | Increases (more missed positives) | Decreases (fewer false alarms) | Decreases | Increases |
This trade-off is visualized in two standard plots: the receiver operating characteristic (ROC) curve, which plots the true positive rate against the false positive rate across thresholds, and the precision-recall curve, which plots precision against recall across thresholds.
The optimal threshold depends entirely on the application. In cancer screening, a low threshold is preferred because the cost of missing a malignant tumor (false negative) far outweighs the cost of an unnecessary follow-up test (false positive). In email spam filtering, a slightly higher threshold may be acceptable because incorrectly sending a legitimate email to spam (false positive) can be more disruptive than letting occasional spam through (false negative).
False negatives arise from a variety of sources. Understanding the root causes helps guide the choice of mitigation strategy.
When the training data does not contain enough examples of the positive class, or when the examples present do not cover the full range of variation in the real world, the model may fail to recognize positive instances at inference time. For example, a skin cancer detector trained primarily on images of lighter skin tones may produce more false negatives on patients with darker skin.
Class imbalance occurs when the negative class vastly outnumbers the positive class in the training dataset. In such cases, a classifier can achieve high accuracy simply by predicting the majority (negative) class for nearly every instance. This strategy naturally produces many false negatives for the minority (positive) class. Credit card fraud detection is a classic example: legitimate transactions may outnumber fraudulent ones by a ratio of 1000 to 1 or more.
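The resulting accuracy trap is easy to demonstrate: a degenerate model that always predicts the majority class scores near-perfect accuracy while missing every positive. A sketch with a hypothetical 1000:1 imbalance:

```python
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical imbalanced labels: 10 fraud cases among 10,000 legitimate transactions
y_true = [1] * 10 + [0] * 10000
y_pred = [0] * len(y_true)  # degenerate model: always predict "legitimate"

print(f"Accuracy: {accuracy_score(y_true, y_pred):.4f}")  # ~0.999
print(f"Recall:   {recall_score(y_true, y_pred):.4f}")    # 0.0 -- every fraud missed
```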
A model that is too simple to capture the underlying patterns in the data will underfit. For instance, a linear classifier applied to data with complex, non-linear decision boundaries may fail to identify many positive instances, resulting in elevated false negative rates.
Overfitting occurs when a model memorizes the training data instead of learning generalizable patterns. While overfitting often manifests as poor performance on unseen data in general, it can specifically increase false negatives if the model has memorized a narrow definition of what constitutes a positive example.
Poor feature engineering, noisy measurements, or missing feature values can obscure the signal that distinguishes positive from negative instances. In medical imaging, for example, low-resolution scans or artifacts in the image may cause a model to miss a lesion.
As discussed above, a threshold that is set too high requires the model to be very confident before predicting the positive class. This directly increases the number of false negatives because borderline positive cases are classified as negative.
If the ground truth labels in the training data contain errors (positive instances mislabeled as negative), the model learns to replicate these mistakes, leading to false negatives on correctly labeled test data.
Reducing false negatives requires a combination of data-level, algorithm-level, and post-processing techniques. The appropriate strategy depends on the specific problem and the acceptable trade-off with false positives.
The simplest approach is lowering the classification threshold. This allows the model to predict the positive class at lower confidence levels, catching more true positives at the cost of additional false positives. This is often the first technique to try because it requires no retraining.
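A minimal sketch of threshold lowering, assuming a scikit-learn probabilistic classifier; the synthetic dataset and the 0.3 threshold are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic imbalanced data: roughly 10% positives
X, y = make_classification(n_samples=1000, weights=[0.9], random_state=0)
clf = LogisticRegression().fit(X, y)

proba_pos = clf.predict_proba(X)[:, 1]         # positive-class scores
pred_default = (proba_pos >= 0.5).astype(int)  # default threshold
pred_lowered = (proba_pos >= 0.3).astype(int)  # lowered threshold

# Lowering the threshold converts some false negatives into true positives
print("FN at 0.5:", int(((y == 1) & (pred_default == 0)).sum()))
print("FN at 0.3:", int(((y == 1) & (pred_lowered == 0)).sum()))
```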
Resampling the training data can help the model learn a better representation of the positive class; a brief code sketch follows the table:
| Technique | Description | Pros | Cons |
|---|---|---|---|
| Oversampling | Duplicate or synthetically generate minority class examples | Simple to implement; more positive examples for learning | Risk of overfitting to duplicated examples |
| SMOTE | Generate synthetic minority examples by interpolating between existing ones | Produces diverse synthetic examples; reduces overfitting compared to naive oversampling | Can create noisy samples in overlapping regions |
| Undersampling | Remove majority class examples to balance the dataset | Reduces training time; simplifies the learning problem | Discards potentially useful information |
| Hybrid methods | Combine oversampling of the minority class with undersampling of the majority class | Balances trade-offs of both approaches | Requires more careful tuning |
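As a sketch of oversampling in practice, assuming the third-party imbalanced-learn package is installed (the dataset is synthetic):

```python
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Synthetic imbalanced dataset: roughly 10% positives
X, y = make_classification(n_samples=1000, weights=[0.9], random_state=0)
print("Before:", Counter(y))

# SMOTE synthesizes new minority examples by interpolating
# between existing minority neighbors in feature space
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print("After: ", Counter(y_res))  # classes balanced
```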
Cost-sensitive learning assigns different misclassification costs to different types of errors. By setting the cost of a false negative higher than the cost of a false positive, the loss function penalizes missed positives more heavily during training. Many popular algorithms support class weights or cost matrices, including support vector machines, logistic regression, decision trees, and neural networks.
For example, in scikit-learn, setting class_weight='balanced' or providing a custom weight dictionary automatically adjusts the loss function to account for class imbalance.
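A short sketch of both options; the 1:10 cost ratio here is an illustrative assumption, not a recommended value:

```python
from sklearn.linear_model import LogisticRegression

# Explicit weights: penalize a missed positive (FN) ten times
# as heavily as a false alarm (FP) -- an illustrative ratio only
clf_weighted = LogisticRegression(class_weight={0: 1, 1: 10})

# Or derive weights automatically from inverse class frequencies
clf_balanced = LogisticRegression(class_weight="balanced")
```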
Ensemble learning techniques combine the predictions of multiple models to improve overall performance. Common approaches include bagging (training models on bootstrap samples, as in random forests), boosting (training models sequentially with extra weight on previously misclassified examples, which can help recover hard positives), and stacking (combining heterogeneous models through a meta-learner).
Data augmentation increases the diversity of training examples through transformations. In computer vision, this includes rotations, flips, color jittering, and cropping. In natural language processing, it can involve synonym replacement, back-translation, or paraphrasing. More diverse training data helps the model generalize to a wider range of positive instances.
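A minimal sketch of image-style augmentation using plain NumPy; the flip-and-rotate transformations shown are a small illustrative subset:

```python
import numpy as np

def augment(image: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Randomly flip and rotate a 2-D image array."""
    if rng.random() < 0.5:
        image = np.fliplr(image)                    # horizontal flip
    return np.rot90(image, k=int(rng.integers(4)))  # 0-3 quarter turns

rng = np.random.default_rng(0)
image = rng.random((32, 32))  # hypothetical grayscale image
augmented = [augment(image, rng) for _ in range(4)]
```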
Better features can make the distinction between positive and negative classes clearer. This may involve domain-specific feature extraction, dimensionality reduction, or using deep learning models that automatically learn relevant feature representations through transfer learning.
Choosing a model with appropriate complexity for the problem is essential. Regularization techniques such as L1/L2 penalties and dropout can prevent overfitting and improve generalization. When the underlying decision boundary is complex, more expressive models like convolutional neural networks or ensemble methods may be needed to avoid underfitting.
Model calibration techniques such as Platt scaling or isotonic regression can improve the reliability of predicted probabilities, making threshold selection more principled and reducing the likelihood of false negatives.
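A sketch of calibration in scikit-learn, wrapping an uncalibrated classifier; the choice of isotonic regression and five folds is illustrative:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=1000, weights=[0.9], random_state=0)

# Isotonic regression remaps the raw scores of an uncalibrated model
# so that predicted probabilities better match observed frequencies
calibrated = CalibratedClassifierCV(LinearSVC(), method="isotonic", cv=5).fit(X, y)
proba_pos = calibrated.predict_proba(X)[:, 1]  # calibrated positive-class probabilities
```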
The impact and acceptable rate of false negatives vary significantly by domain. The following table summarizes how false negatives affect different fields.
| Domain | What a false negative means | Typical consequence | Priority |
|---|---|---|---|
| Medical diagnosis | A sick patient is classified as healthy | Delayed treatment, disease progression, potential death | Minimizing false negatives is the top priority |
| Cancer screening | A malignant tumor is missed | Cancer spreads to later stages before detection | Very high recall targets (often above 0.95) |
| Fraud detection | A fraudulent transaction is approved | Financial loss for the institution or customer | High recall needed, balanced with operational costs |
| Spam filtering | A spam or phishing email reaches the inbox | Security risk, user annoyance | Moderate recall; false positives (legitimate email in spam) are also costly |
| Autonomous vehicles | A pedestrian or obstacle is not detected | Collision, injury, or death | Near-zero false negative rate required |
| Cybersecurity / intrusion detection | A network attack goes undetected | Data breach, system compromise, financial and reputational damage | Very high recall is necessary |
| Manufacturing quality control | A defective product passes inspection | Defective products reach customers, potential recalls | High recall needed to maintain quality standards |
| Criminal justice / recidivism | A high-risk individual is classified as low risk | Potential re-offense after release | Ethically complex; false negatives and false positives both carry serious consequences |
In healthcare, false negatives can be life-threatening. If a diagnostic model classifies a patient with cancer as cancer-free, the patient may not receive timely treatment, leading to disease progression. Research on AI-assisted chest X-ray reading has demonstrated that when a computer-aided detection (CAD) system produces a false negative, only 21% of radiologists in the CAD-assisted condition correctly identified the cancer, compared to 46% in a condition without CAD assistance. This finding illustrates a dangerous pattern: clinicians may over-rely on AI predictions, and a false negative from the AI can cause a physician to overlook findings they would have otherwise caught.
Furthermore, when AI systems are trained on datasets that underrepresent certain demographic groups, the false negative rates for those groups can be disproportionately high, worsening existing health disparities.
In financial fraud detection, each false negative represents a fraudulent transaction that the system allowed to proceed. The cost of a single missed fraud can range from hundreds to millions of dollars. Institutions that consistently fail to detect fraud may also face regulatory penalties. Because the base rate of fraud is very low (often less than 0.1% of all transactions), class imbalance is a primary driver of false negatives in this domain.
In self-driving car systems, the object detection models must identify pedestrians, cyclists, vehicles, and obstacles with extremely low false negative rates. A missed detection can lead to a collision. In one well-known incident in 2018, an autonomous vehicle's perception system repeatedly misclassified a pedestrian, cycling through different object categories without ever correctly identifying her, contributing to a fatal accident. This case highlights that false negatives in safety-critical perception systems can have irreversible consequences.
In intrusion detection systems (IDS), a false negative allows an attacker to penetrate a network without triggering any alerts. Research on ransomware detection has shown that the cost of a false negative (an undetected ransomware infection) is typically far greater and less predictable than the cost of investigating a false positive (a benign file flagged as suspicious). For this reason, IDS systems generally prioritize high recall even at the expense of generating more false alarms for security analysts to investigate.
In email filtering, a false negative means a spam or phishing email reaches the user's inbox. While this may seem less severe than medical or safety applications, phishing emails that evade detection can lead to credential theft, financial fraud, and malware infections. The challenge is balancing recall (catching all spam) with precision (not sending legitimate emails to the spam folder).
False negatives and false positives represent opposite types of classification errors. The relative importance of each depends on the problem context.
| Aspect | False negative (FN) | False positive (FP) |
|---|---|---|
| Definition | Positive instance predicted as negative | Negative instance predicted as positive |
| Statistical name | Type II error | Type I error |
| Symbol for error probability | Beta | Alpha |
| Also known as | Miss, missed detection | False alarm |
| Effect on recall | Reduces recall | No direct effect |
| Effect on precision | No direct effect | Reduces precision |
| Medical example | Sick patient told they are healthy | Healthy patient told they are sick |
| Security example | Threat not detected | Harmless activity flagged as threat |
| Trade-off relationship | Reducing FN typically increases FP | Reducing FP typically increases FN |
The inverse relationship between Type I and Type II errors is a fundamental constraint in classification. Reducing one type of error generally increases the other, unless the model itself is improved (for example, through better features, more data, or a more appropriate algorithm).
Consider a disease screening model tested on 1,000 patients. Of these, 200 actually have the disease (positive class) and 800 do not (negative class). The model produces the following confusion matrix:
| | Predicted positive | Predicted negative | Total |
|---|---|---|---|
| Actually positive | 160 (TP) | 40 (FN) | 200 |
| Actually negative | 60 (FP) | 740 (TN) | 800 |
| Total | 220 | 780 | 1,000 |
From this confusion matrix, the key metrics are:
- Accuracy = (TP + TN) / Total = (160 + 740) / 1,000 = 0.90
- Recall = TP / (TP + FN) = 160 / 200 = 0.80
- Precision = TP / (TP + FP) = 160 / 220 ≈ 0.727
- False negative rate = FN / (TP + FN) = 40 / 200 = 0.20
- Specificity = TN / (TN + FP) = 740 / 800 = 0.925
Note that while the accuracy appears high at 90%, the model missed 40 out of 200 sick patients (a false negative rate of 20%). In a medical context, this means 40 patients with the disease would be sent home without treatment. This example illustrates why accuracy alone is an insufficient metric when false negatives carry significant costs.
If the classification threshold is lowered to reduce false negatives from 40 to 10, the revised confusion matrix might look like:
| | Predicted positive | Predicted negative | Total |
|---|---|---|---|
| Actually positive | 190 (TP) | 10 (FN) | 200 |
| Actually negative | 150 (FP) | 650 (TN) | 800 |
| Total | 340 | 660 | 1,000 |
Recall improved from 0.80 to 0.95 (only 10 patients are now missed instead of 40), but precision dropped from 0.727 to 0.559, and accuracy fell from 0.90 to 0.84. In a medical screening context, this trade-off is generally considered worthwhile because the 90 additional false positives simply receive further testing, while the 30 additional true positives receive needed treatment.
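The arithmetic above can be verified directly; a small sketch that recomputes the metrics from both confusion matrices:

```python
def metrics(tp, fn, fp, tn):
    """Compute the standard metrics from the four confusion matrix cells."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "recall": tp / (tp + fn),
        "precision": tp / (tp + fp),
        "fnr": fn / (tp + fn),
    }

print(metrics(tp=160, fn=40, fp=60, tn=740))   # original threshold
print(metrics(tp=190, fn=10, fp=150, tn=650))  # lowered threshold
```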
In multiclass problems with more than two classes, false negatives are computed on a per-class basis. For each class, instances belonging to that class but predicted as any other class count as false negatives for that class.
For example, in a three-class classification problem (classes A, B, and C), if an instance of class A is predicted as class B, it is a false negative for class A and a false positive for class B. Metrics like recall are then computed for each class individually, and can be aggregated using macro-averaging (the unweighted mean of per-class scores), micro-averaging (computed from pooled counts across classes), or weighted averaging (per-class scores weighted by class support).
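A sketch of per-class and macro-averaged recall for a hypothetical three-class problem, using scikit-learn:

```python
from sklearn.metrics import recall_score

# Hypothetical labels for classes A=0, B=1, C=2
y_true = [0, 0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 1, 0, 1, 1, 2, 0, 2]

# average=None returns one recall per class; each class's false negatives
# are its instances predicted as any other class
per_class = recall_score(y_true, y_pred, average=None)
macro = recall_score(y_true, y_pred, average="macro")
print(per_class, macro)
```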
The ROC curve plots the true positive rate (1 minus the false negative rate) against the false positive rate at all possible classification thresholds. A higher AUC (area under the ROC curve) generally indicates a model that achieves a better trade-off between detecting positives and avoiding false alarms. However, the ROC curve does not directly reveal the absolute number of false negatives at any given threshold, so it should be supplemented with a precision-recall analysis in cases where false negatives are the primary concern.
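Because the true positive rate returned by a ROC computation is 1 minus the false negative rate, the miss rate at each candidate threshold can be read off directly. A sketch with scikit-learn, using made-up scores:

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Made-up labels and classifier scores
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.90, 0.80, 0.40, 0.30, 0.60, 0.20, 0.10, 0.05]

fpr, tpr, thresholds = roc_curve(y_true, scores)
fnr = 1 - tpr  # miss rate at each candidate threshold

print("AUC:", roc_auc_score(y_true, scores))
for threshold, miss in zip(thresholds, fnr):
    print(f"threshold >= {threshold:.2f}: FNR = {miss:.2f}")
```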
Using cross-validation (such as k-fold cross-validation) helps produce more reliable estimates of a model's false negative rate by evaluating performance across multiple splits of the data. This is especially important when the positive class is rare, as a single train-test split may not provide a stable estimate of recall or the false negative rate.
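A sketch of estimating recall, and hence the false negative rate, with stratified k-fold cross-validation on a synthetic imbalanced dataset:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=1000, weights=[0.9], random_state=0)

# Stratification keeps the rare positive class proportionally represented in each fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
recalls = cross_val_score(LogisticRegression(), X, y, cv=cv, scoring="recall")

print("Recall per fold:", recalls)
print("Estimated FNR:  ", 1 - recalls.mean())
```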
To assess how many false negatives a model will produce in practice, it is important to evaluate on validation data and test data that closely reflect the real-world distribution of positive and negative cases. If the test set is not representative, the observed false negative rate may be misleadingly low or high.
The formal study of false negatives in hypothesis testing dates to the work of Jerzy Neyman and Egon Pearson in their landmark 1933 paper, which introduced the concepts of Type I and Type II errors, power functions, and the theory of optimal testing. Their framework established the idea that any decision procedure involves two kinds of mistakes, and that minimizing one type of error while controlling the other requires explicit trade-offs.
In the machine learning era, the concept carried over directly. The development of ROC analysis during World War II for radar signal detection extended the statistical framework to practical detection problems, where missing a real signal (a false negative) could mean failing to detect an incoming aircraft. This military application established the pattern that continues today: in high-stakes detection systems, the cost of a false negative typically exceeds the cost of a false positive.
The rapid growth of machine learning applications in healthcare, autonomous driving, and security during the 2010s and 2020s brought renewed attention to false negatives, as these fields demanded models with very low miss rates and highlighted the inadequacy of accuracy as a sole evaluation metric.