See also: Machine learning terms
Recall is a performance metric commonly used in machine learning and information retrieval to evaluate the effectiveness of classification and retrieval models. It is particularly useful when the cost of false negatives (failing to identify positive instances) is high. This article provides an in-depth look at the concept of recall, its mathematical formulation, its relation to other performance metrics such as precision and F1 score, and the practical scenarios where recall takes priority.
Recall answers the question: "Of all the instances that are actually positive, how many did the model correctly identify?" A model with high recall catches most of the positive cases, even if it also flags some negatives along the way.
Recall, also known as sensitivity, true positive rate (TPR), hit rate, or probability of detection, is the proportion of true positive instances (correctly identified positive instances) among all the actual positive instances in the dataset. Mathematically, recall is defined as:
Recall = TP / (TP + FN)
where TP is the number of true positives (positive instances correctly identified as positive) and FN is the number of false negatives (positive instances incorrectly classified as negative).
Recall is expressed as a value between 0 and 1 (or 0% to 100%), where a value of 1 indicates perfect recall (every positive instance was found), and a value of 0 indicates that no positive instances were identified.
Because recall is defined as TP / (TP + FN), it is exactly the complement of the false negative rate (FNR). In other words:
FNR = FN / (TP + FN) = 1 - Recall
This means that maximizing recall is equivalent to minimizing the false negative rate, which in statistical hypothesis testing corresponds to minimizing the probability of a Type II error.
Consider a medical screening test for a disease. Out of 1,000 patients, 50 actually have the disease and 950 do not. After running the test:
| | Predicted Positive (Disease) | Predicted Negative (Healthy) |
|---|---|---|
| Actually Positive | 45 (TP) | 5 (FN) |
| Actually Negative | 30 (FP) | 920 (TN) |
The recall of this screening test is:
Recall = TP / (TP + FN) = 45 / (45 + 5) = 45 / 50 = 0.90
This means the test correctly identifies 90% of patients who actually have the disease. The remaining 10% (5 patients) have the disease but were missed by the test (false negatives).
To fully understand recall, it helps to place TP and FN within the full confusion matrix:
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actually Positive | True Positive (TP) | False Negative (FN) |
| Actually Negative | False Positive (FP) | True Negative (TN) |
Recall focuses on the top row of this matrix: of everything that is actually positive (TP + FN), what fraction did the model catch (TP)? Note that recall does not consider false positives at all. A model can flag many negative instances as positive (low precision) and still have high recall, as long as it catches all the true positives.
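As a minimal illustration, the screening example above can be reproduced in scikit-learn by constructing label arrays that match the counts in the table (45 TP, 5 FN, 30 FP, 920 TN); the arrays themselves are hypothetical stand-ins for real test results:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical labels matching the screening example:
# 50 actual positives (45 caught, 5 missed) and
# 950 actual negatives (30 false alarms, 920 correct rejections).
y_true = np.array([1] * 50 + [0] * 950)
y_pred = np.array([1] * 45 + [0] * 5 + [1] * 30 + [0] * 920)

# For binary labels 0/1, scikit-learn's confusion matrix is [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

recall_manual = tp / (tp + fn)                # 45 / (45 + 5) = 0.90
recall_sklearn = recall_score(y_true, y_pred)
print(recall_manual, recall_sklearn)          # both 0.9
```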
False negatives are costly in many real-world applications:
| Application | What a False Negative Means | Consequence |
|---|---|---|
| Cancer screening | Cancerous tumor missed by the test | Delayed treatment; potentially fatal outcome |
| Airport security | Weapon not detected by scanner | Threat passes through undetected |
| Fraud detection | Fraudulent transaction not flagged | Financial loss for the institution or customer |
| Autonomous driving | Pedestrian not detected by the perception system | Potential collision |
| Product recall identification | Defective product not identified | Harm to consumers; liability |
| Cybersecurity intrusion detection | Active attack not detected | Data breach; system compromise |
In every one of these cases, missing a positive instance is far more dangerous than generating a false alarm. This is why recall is the primary metric of concern in safety-critical applications.
Recall goes by many names across different fields. All of the following terms refer to the same formula, TP / (TP + FN):
| Term | Field Where Commonly Used |
|---|---|
| Recall | Machine learning, information retrieval |
| Sensitivity | Medicine, biostatistics, clinical testing |
| True Positive Rate (TPR) | Statistics, ROC analysis |
| Hit rate | Signal detection theory, psychology |
| Detection rate | Radar engineering, security screening |
| Probability of detection | Engineering, telecommunications |
| Power (1 - beta) | Statistical hypothesis testing |
In medical diagnostics, sensitivity is perhaps the most important metric for a screening test. A highly sensitive test catches nearly all patients with the condition, ensuring that few cases are missed. Follow-up confirmatory tests (which may prioritize specificity or precision) can then be used to eliminate false positives.
In statistical hypothesis testing, the concept of recall maps onto the power of a test: the probability of correctly rejecting a false null hypothesis. A test with low power has a high rate of Type II errors (false negatives), meaning it frequently fails to detect a real effect.
Most classification models do not output a hard binary label directly. Instead, they produce a continuous score or probability for each instance. A classification threshold determines the cutoff above which an instance is classified as positive.
Adjusting this threshold directly controls the tradeoff between recall and precision:
| Threshold Change | Effect on Predictions | Recall | Precision | False Negatives | False Positives |
|---|---|---|---|---|---|
| Lower the threshold | More instances classified as positive | Increases | Decreases | Fewer | More |
| Raise the threshold | Fewer instances classified as positive | Decreases | Increases | More | Fewer |
At the extreme, if the threshold is set to 0 (classify everything as positive), recall equals 1.0 because no positive instance is missed. However, precision drops because many negative instances are also classified as positive. Conversely, if the threshold is set to 1.0 (classify nothing as positive), recall drops to 0 because no instance is classified as positive at all.
In practice, the optimal threshold depends on the relative costs of false negatives and false positives. For safety-critical applications like disease screening, the threshold is often set low to maximize recall. For applications like spam filtering, the threshold may be set higher to avoid annoying false positives.
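The following sketch, using hypothetical scores and labels, shows this effect with scikit-learn: lowering the decision threshold from 0.5 to 0.3 raises recall (more positives are caught) while precision drops (more false alarms slip in).

```python
import numpy as np
from sklearn.metrics import recall_score, precision_score

# Hypothetical true labels and model-predicted probabilities
y_true = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])
y_scores = np.array([0.95, 0.85, 0.75, 0.40, 0.35, 0.45, 0.42, 0.20, 0.10, 0.05])

for threshold in (0.5, 0.3):
    y_pred = (y_scores >= threshold).astype(int)
    print(f"threshold={threshold}: "
          f"recall={recall_score(y_true, y_pred):.2f}, "
          f"precision={precision_score(y_true, y_pred):.2f}")
# threshold=0.5: recall=0.60, precision=1.00
# threshold=0.3: recall=1.00, precision=0.71
```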
Recall and precision exist in tension. In most practical systems, improving recall comes at the cost of precision, and vice versa. This inverse relationship is known as the precision-recall tradeoff.
The precision-recall curve visualizes this tradeoff by plotting precision (y-axis) against recall (x-axis) at various threshold settings. A model that can achieve both high precision and high recall simultaneously is superior. The area under the precision-recall curve (AUPRC) provides a single-number summary of a model's performance across all thresholds.
On highly imbalanced datasets, the precision-recall curve is often more informative than the ROC curve. This is because the ROC curve's false positive rate denominator (FP + TN) is dominated by the large negative class, which can make even a large number of false positives appear as a small FPR. The precision-recall curve, by contrast, directly penalizes false positives through the precision metric.
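As a rough sketch (again with hypothetical labels and scores), scikit-learn's precision_recall_curve produces the points of this curve, and the area under it can be summarized either by trapezoidal integration or by average_precision_score:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, auc, average_precision_score

# Hypothetical true labels and model scores
y_true = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])
y_scores = np.array([0.95, 0.85, 0.75, 0.40, 0.35, 0.45, 0.42, 0.20, 0.10, 0.05])

precision, recall, thresholds = precision_recall_curve(y_true, y_scores)

auprc = auc(recall, precision)                   # trapezoidal area under the PR curve
ap = average_precision_score(y_true, y_scores)   # step-wise summary of the same curve
print(f"AUPRC: {auprc:.3f}, Average precision: {ap:.3f}")
```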
Recall should be prioritized when the cost of a false negative is much higher than the cost of a false positive, as in the safety-critical applications listed earlier and discussed in more detail later in this article.
Recall is often used alongside precision, another performance metric in machine learning. Precision measures the proportion of true positive instances among all the instances that were predicted as positive by the model. While recall emphasizes the ability of a model to correctly identify positive instances, precision focuses on the model's accuracy in predicting positive instances.
When evaluating a classification model, it is often necessary to consider both recall and precision to get a comprehensive understanding of performance. One way to do this is by calculating the F1 score, which is the harmonic mean of recall and precision:
F1 = 2 * (Precision * Recall) / (Precision + Recall)
The F1 score ranges from 0 to 1, with a value of 1 indicating a perfect balance between recall and precision. The harmonic mean is used instead of the arithmetic mean because it penalizes large differences between the two values. If precision is 1.0 but recall is 0.1, the arithmetic mean would be 0.55 (misleadingly high), but the F1 score is only 0.18, correctly reflecting the severe imbalance.
The F-beta score generalizes the F1 score by introducing a parameter beta that controls the relative weight given to recall versus precision:
F_beta = (1 + beta^2) * (Precision * Recall) / (beta^2 * Precision + Recall)
| F-beta Variant | Beta Value | Weighting | Typical Use Case |
|---|---|---|---|
| F0.5 score | 0.5 | Precision weighted 2x more than recall | Spam filtering, where false positives are costly |
| F1 score | 1.0 | Equal weight to precision and recall | General-purpose balanced evaluation |
| F2 score | 2.0 | Recall weighted 2x more than precision | Medical screening, safety systems |
When beta is greater than 1, the F-beta score gives more weight to recall. When beta is less than 1, it gives more weight to precision. Setting beta = 1 yields the standard F1 score.
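A brief sketch with scikit-learn's f1_score and fbeta_score illustrates the weighting: with hypothetical predictions giving precision 1.0 and recall 0.5, the recall-weighted F2 score comes out lower than F1, while the precision-weighted F0.5 score comes out higher.

```python
from sklearn.metrics import f1_score, fbeta_score

# Hypothetical labels: the model is precise but misses half of the positives
y_true = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]

# Precision = 3/3 = 1.0, recall = 3/6 = 0.5
print(f1_score(y_true, y_pred))               # 0.667
print(fbeta_score(y_true, y_pred, beta=2))    # 0.556 (recall weighted more)
print(fbeta_score(y_true, y_pred, beta=0.5))  # 0.833 (precision weighted more)
```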
| Metric | Formula | Focus |
|---|---|---|
| Recall | TP / (TP + FN) | Completeness of positive identification |
| Precision | TP / (TP + FP) | Quality of positive predictions |
| F1 Score | 2 * P * R / (P + R) | Balance of precision and recall |
| Specificity | TN / (TN + FP) | Correctness of negative predictions |
| False Negative Rate | FN / (TP + FN) = 1 - Recall | Rate of missed positives |
| Accuracy | (TP + TN) / (TP + TN + FP + FN) | Overall correctness |
Recall (TPR) is one of the two axes of the Receiver Operating Characteristic (ROC) curve. The ROC curve plots the True Positive Rate (recall) on the y-axis against the False Positive Rate (FPR = FP / (FP + TN)) on the x-axis at various classification thresholds.
As the threshold decreases (more instances classified as positive), both the TPR and FPR increase. The model moves from the bottom-left corner of the ROC space (threshold = 1, nothing classified as positive) toward the top-right corner (threshold = 0, everything classified as positive). A good model reaches high TPR (recall) at a low FPR, curving toward the top-left corner of the plot.
The area under the ROC curve (AUC-ROC) is a threshold-independent metric that summarizes how well a model distinguishes between positive and negative classes. An AUC of 1.0 indicates perfect separation, while an AUC of 0.5 indicates random chance (the diagonal line).
While AUC-ROC is widely used, it can be misleading on highly imbalanced datasets because the FPR denominator (FP + TN) is dominated by the large negative class. In such cases, the precision-recall curve is often preferred, as discussed in the previous section.
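For completeness, here is a minimal sketch (with the same hypothetical labels and scores as above) of computing the ROC curve and AUC-ROC in scikit-learn; the tpr array returned by roc_curve is recall evaluated at each threshold.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical true labels and model scores
y_true = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])
y_scores = np.array([0.95, 0.85, 0.75, 0.40, 0.35, 0.45, 0.42, 0.20, 0.10, 0.05])

fpr, tpr, thresholds = roc_curve(y_true, y_scores)  # tpr is recall at each threshold
print(f"AUC-ROC: {roc_auc_score(y_true, y_scores):.3f}")
```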
For multi-class classification problems, recall is computed per class and then aggregated. The three standard aggregation methods mirror those for precision:
| Averaging Method | How It Works | When to Use |
|---|---|---|
| Macro recall | Compute recall independently for each class, then take the unweighted mean | When all classes are equally important |
| Micro recall | Sum all TP and FN across all classes, then compute a single recall value | When you care about overall detection rate |
| Weighted recall | Compute recall per class and take the mean weighted by class support | When class sizes differ and you want proportional representation |
For C classes:
Macro Recall = (1/C) * sum of Recall_c for c = 1 to C
Macro recall gives equal weight to every class, which means that performance on rare classes has the same influence as performance on common classes. This is useful when minority classes are important (e.g., rare diseases in a diagnostic system).
Micro Recall = (sum of TP_c for all c) / (sum of TP_c + FN_c for all c)
For standard multi-class problems where each instance belongs to exactly one class, micro recall equals the overall accuracy. This equivalence occurs because, in single-label multi-class classification, every false negative for one class is simultaneously a false positive for another class, making the total TP + FN equal to the total number of instances.
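This equivalence is easy to check on a small hypothetical example with scikit-learn:

```python
from sklearn.metrics import recall_score, accuracy_score

# Hypothetical single-label multi-class predictions (classes 0, 1, 2)
y_true = [0, 0, 1, 1, 1, 2, 2, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 2, 2, 2, 0, 1]

micro = recall_score(y_true, y_pred, average='micro')
acc = accuracy_score(y_true, y_pred)
print(micro, acc)  # 0.6 0.6 -- identical for single-label multi-class data
```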
Consider a multi-class classifier for three types of manufacturing defects:
| Defect Type | TP | FN | Total Actual | Recall |
|---|---|---|---|---|
| Crack | 40 | 10 | 50 | 40/50 = 0.800 |
| Scratch | 70 | 5 | 75 | 70/75 = 0.933 |
| Dent | 15 | 10 | 25 | 15/25 = 0.600 |
With these counts, macro recall = (0.800 + 0.933 + 0.600) / 3 = 0.778, while micro recall = (40 + 70 + 15) / (50 + 75 + 25) = 125 / 150 = 0.833. Micro recall is higher because the model performs well on the large "Scratch" class. Macro recall reveals that the model struggles with "Dent" defects, a fact that micro recall obscures.
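These figures can be verified directly from the per-class counts in the table; a small sketch in plain Python:

```python
# Per-class (TP, FN) counts from the defect example above
counts = {"Crack": (40, 10), "Scratch": (70, 5), "Dent": (15, 10)}

per_class = {name: tp / (tp + fn) for name, (tp, fn) in counts.items()}
macro = sum(per_class.values()) / len(per_class)
micro = sum(tp for tp, _ in counts.values()) / sum(tp + fn for tp, fn in counts.values())

print(per_class)              # {'Crack': 0.8, 'Scratch': 0.933..., 'Dent': 0.6}
print(f"Macro: {macro:.3f}")  # 0.778
print(f"Micro: {micro:.3f}")  # 0.833
```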
Recall@K is a variant of recall used in information retrieval and ranking systems. It measures the proportion of all relevant items that appear within the top K results:
Recall@K = (Number of relevant items in the top K results) / (Total number of relevant items)
For example, if a database contains 20 relevant documents for a query, and a search engine retrieves 12 of them within the top 50 results, then Recall@50 = 12 / 20 = 0.60.
Recall@K is important for evaluating systems that return a ranked list of results, such as search engines, recommender systems, and the candidate-retrieval stage of larger ranking pipelines.
| Metric | Question It Answers | Denominator |
|---|---|---|
| Precision@K | Of the K items returned, how many are relevant? | K (the number of items returned) |
| Recall@K | Of all relevant items, how many appear in the top K? | Total relevant items in the entire collection |
A key property of Recall@K is that it does not consider the ranking order of items within the top K. Whether a relevant item appears at position 1 or position K, it contributes equally to Recall@K. Metrics like Mean Reciprocal Rank (MRR) and Normalized Discounted Cumulative Gain (NDCG) are used when ranking order matters.
As K increases, Recall@K generally increases (more relevant items are found), while Precision@K may decrease (the incremental results are less likely to be relevant).
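Recall@K is simple to compute from a ranked result list; the sketch below uses hypothetical document IDs and a hypothetical set of relevant items:

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of all relevant items that appear in the top-k ranked results."""
    top_k = set(ranked_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)

# Hypothetical ranked results and ground-truth relevant documents
ranked = ["d7", "d2", "d9", "d1", "d5", "d3", "d8"]
relevant = {"d2", "d5", "d6", "d9"}

print(recall_at_k(ranked, relevant, k=5))  # 3 of 4 relevant items in the top 5 -> 0.75
```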
In object detection, recall measures the proportion of ground-truth objects that the model successfully detects. A detection is considered a true positive only if its predicted bounding box overlaps sufficiently with a ground-truth bounding box, as measured by the Intersection over Union (IoU) threshold.
For example, at an IoU threshold of 0.5, a predicted bounding box must overlap at least 50% with the ground-truth box to count as a correct detection. If the IoU falls below the threshold, the prediction is either a false positive (if no matching ground-truth box exists) or the ground-truth object remains a false negative.
The COCO (Common Objects in Context) benchmark uses a metric called Average Recall (AR), which averages recall across multiple IoU thresholds ranging from 0.5 to 0.95 in steps of 0.05. AR is reported at different maximum detection limits per image:
| COCO Recall Metric | Description |
|---|---|
| AR@1 | Average recall with at most 1 detection per image |
| AR@10 | Average recall with at most 10 detections per image |
| AR@100 | Average recall with at most 100 detections per image |
| AR (small) | AR for objects with area less than 32x32 pixels |
| AR (medium) | AR for objects with area between 32x32 and 96x96 pixels |
| AR (large) | AR for objects with area greater than 96x96 pixels |
AR correlates strongly with localization accuracy for IoU thresholds above 0.5. Models with higher AR tend to produce tighter, more accurate bounding boxes. Unlike mean Average Precision (mAP), which incorporates both precision and recall at multiple confidence thresholds, AR focuses solely on the recall side and does not account for confidence scores.
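A simplified sketch of detection recall: compute IoU between each ground-truth box and every prediction, and count a ground-truth object as found if any prediction exceeds the IoU threshold. Benchmark evaluators such as COCO additionally enforce one-to-one matching and per-class bookkeeping, which this toy version omits; all boxes below are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def detection_recall(predictions, ground_truths, iou_threshold=0.5):
    """Fraction of ground-truth boxes matched by at least one prediction."""
    matched = sum(
        1 for gt in ground_truths
        if any(iou(pred, gt) >= iou_threshold for pred in predictions)
    )
    return matched / len(ground_truths)

# Hypothetical boxes: one object detected accurately, one missed entirely
gts = [(10, 10, 50, 50), (100, 100, 140, 140)]
preds = [(12, 8, 52, 48), (200, 200, 240, 240)]
print(detection_recall(preds, gts))  # 0.5
```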
The scikit-learn library provides straightforward functions for computing recall in Python:
```python
from sklearn.metrics import recall_score, classification_report

# Binary classification example
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

# Binary recall
recall = recall_score(y_true, y_pred)
print(f"Recall: {recall:.3f}")  # Output: 0.800

# Multi-class averaging options
recall_macro = recall_score(y_true, y_pred, average='macro')
recall_micro = recall_score(y_true, y_pred, average='micro')
recall_weighted = recall_score(y_true, y_pred, average='weighted')

# Full classification report with per-class recall
print(classification_report(y_true, y_pred))
```
The classification_report function produces a table showing precision, recall, and F1 score for each class, along with macro, weighted, and (in some configurations) micro averages. For binary classification, the pos_label parameter controls which class is treated as the positive class (default is 1). When average=None, the function returns per-class recall values as an array, which is useful for inspecting recall on individual classes.
Scikit-learn also provides precision_recall_curve for generating the precision-recall curve and PrecisionRecallDisplay for plotting it, both of which are useful for visualizing how recall changes across different threshold values.
Recall is especially important in situations where false negatives carry a high cost.
In medical screening (e.g., mammography for breast cancer, PCR tests for infectious diseases), high recall is essential. Missing a positive case can lead to delayed treatment, disease progression, or even death. Regulatory bodies such as the U.S. Food and Drug Administration (FDA) often set minimum sensitivity requirements for diagnostic devices. For instance, rapid antigen tests for COVID-19 were evaluated primarily on sensitivity (recall) to ensure that infected individuals were identified. A typical screening workflow uses a high-recall first-stage test followed by a high-precision confirmatory test.
In financial fraud detection, a missed fraudulent transaction (false negative) results in direct monetary loss. Banks and payment processors typically tune their fraud models for high recall, accepting a moderate number of false alarms (which are resolved through customer verification) to ensure that as many fraudulent transactions as possible are caught.
Intrusion detection systems, airport scanners, and autonomous vehicle perception systems all require high recall. In autonomous driving, for example, failing to detect a pedestrian (false negative) can result in a collision, while a false positive (perceiving a pedestrian that is not there) results in unnecessary braking, which is inconvenient but not dangerous.
In patent search, legal discovery, and systematic reviews (e.g., medical literature reviews for clinical guidelines), recall is often the primary metric. Missing a relevant document can have legal or clinical consequences, while including an irrelevant document simply adds to the review workload.
Imagine you have a bag of colored balls. Some are red and some are blue. Your job is to pull out all the red balls.
Recall tells you how good you are at finding the red balls. If there are 10 red balls in the bag and you found 8 of them, your recall is 8 out of 10, or 80%. You missed 2 red balls.
A perfect recall score of 100% means you found every single red ball. It does not matter if you also accidentally grabbed some blue balls along the way. Recall only cares about one thing: did you find all the red ones?