See also: Machine learning terms
Recall is a performance metric commonly used in machine learning and information retrieval to evaluate the effectiveness of classification and retrieval models. It is particularly useful when the cost of false negatives (failing to identify positive instances) is high. This article provides an in-depth look at the concept of recall, its mathematical formulation, its relation to other performance metrics such as precision and F1 score, and the practical scenarios where recall takes priority.
Recall answers the question: "Of all the instances that are actually positive, how many did the model correctly identify?" A model with high recall catches most of the positive cases, even if it also flags some negatives along the way.
Recall, also known as sensitivity, true positive rate (TPR), hit rate, or probability of detection, is the proportion of true positive instances (correctly identified positive instances) among all the actual positive instances in the dataset. Mathematically, recall is defined as:
Recall = TP / (TP + FN)
where TP is the number of true positives (positive instances correctly identified as positive) and FN is the number of false negatives (positive instances incorrectly classified as negative).
Recall is expressed as a value between 0 and 1 (or 0% to 100%), where a value of 1 indicates perfect recall (every positive instance was found), and a value of 0 indicates that no positive instances were identified.
Because recall is defined as TP / (TP + FN), it is exactly the complement of the false negative rate (FNR). In other words:
FNR = FN / (TP + FN) = 1 - Recall
This means that maximizing recall is equivalent to minimizing the false negative rate, which in statistical hypothesis testing corresponds to minimizing the probability of a Type II error.
Consider a medical screening test for a disease. Out of 1,000 patients, 50 actually have the disease and 950 do not. After running the test:
| | Predicted Positive (Disease) | Predicted Negative (Healthy) |
|---|---|---|
| Actually Positive | 45 (TP) | 5 (FN) |
| Actually Negative | 30 (FP) | 920 (TN) |
The recall of this screening test is:
Recall = TP / (TP + FN) = 45 / (45 + 5) = 45 / 50 = 0.90
This means the test correctly identifies 90% of patients who actually have the disease. The remaining 10% (5 patients) have the disease but were missed by the test (false negatives).
To fully understand recall, it helps to place TP and FN within the full confusion matrix:
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actually Positive | True Positive (TP) | False Negative (FN) |
| Actually Negative | False Positive (FP) | True Negative (TN) |
Recall focuses on the top row of this matrix: of everything that is actually positive (TP + FN), what fraction did the model catch (TP)? Note that recall does not consider false positives at all. A model can flag many negative instances as positive (low precision) and still have high recall, as long as it catches all the true positives.
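As a minimal illustration, the screening example above can be reproduced in scikit-learn by constructing label arrays that match the counts in the table (45 TP, 5 FN, 30 FP, 920 TN); the arrays themselves are hypothetical stand-ins for real test results:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical labels matching the screening example:
# 50 actual positives (45 caught, 5 missed) and
# 950 actual negatives (30 false alarms, 920 correct rejections).
y_true = np.array([1] * 50 + [0] * 950)
y_pred = np.array([1] * 45 + [0] * 5 + [1] * 30 + [0] * 920)

# For binary labels 0/1, scikit-learn's confusion matrix is [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

recall_manual = tp / (tp + fn)                # 45 / (45 + 5) = 0.90
recall_sklearn = recall_score(y_true, y_pred)
print(recall_manual, recall_sklearn)          # both 0.9
```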
False negatives are costly in many real-world applications:
| Application | What a False Negative Means | Consequence |
|---|---|---|
| Cancer screening | Cancerous tumor missed by the test | Delayed treatment; potentially fatal outcome |
| Airport security | Weapon not detected by scanner | Threat passes through undetected |
| Fraud detection | Fraudulent transaction not flagged | Financial loss for the institution or customer |
| Autonomous driving | Pedestrian not detected by the perception system | Potential collision |
| Product recall identification | Defective product not identified | Harm to consumers; liability |
| Cybersecurity intrusion detection | Active attack not detected | Data breach; system compromise |
In every one of these cases, missing a positive instance is far more dangerous than generating a false alarm. This is why recall is the primary metric of concern in safety-critical applications.
Recall goes by many names across different fields. All of the following terms refer to the same formula, TP / (TP + FN):
| Term | Field Where Commonly Used |
|---|---|
| Recall | Machine learning, information retrieval |
| Sensitivity | Medicine, biostatistics, clinical testing |
| True Positive Rate (TPR) | Statistics, ROC analysis |
| Hit rate | Signal detection theory, psychology |
| Detection rate | Radar engineering, security screening |
| Probability of detection | Engineering, telecommunications |
| Power (1 - beta) | Statistical hypothesis testing |
In medical diagnostics, sensitivity is perhaps the most important metric for a screening test. A highly sensitive test catches nearly all patients with the condition, ensuring that few cases are missed. Follow-up confirmatory tests (which may prioritize specificity or precision) can then be used to eliminate false positives.
In statistical hypothesis testing, the concept of recall maps onto the power of a test: the probability of correctly rejecting a false null hypothesis. A test with low power has a high rate of Type II errors (false negatives), meaning it frequently fails to detect a real effect.
Most classification models do not output a hard binary label directly. Instead, they produce a continuous score or probability for each instance. A classification threshold determines the cutoff above which an instance is classified as positive.
Adjusting this threshold directly controls the tradeoff between recall and precision:
| Threshold Change | Effect on Predictions | Recall | Precision | False Negatives | False Positives |
|---|---|---|---|---|---|
| Lower the threshold | More instances classified as positive | Increases | Decreases | Fewer | More |
| Raise the threshold | Fewer instances classified as positive | Decreases | Increases | More | Fewer |
At the extreme, if the threshold is set to 0 (classify everything as positive), recall equals 1.0 because no positive instance is missed. However, precision drops because many negative instances are also classified as positive. Conversely, if the threshold is set to 1.0 (classify nothing as positive), recall drops to 0 because no instance is classified as positive at all.
In practice, the optimal threshold depends on the relative costs of false negatives and false positives. For safety-critical applications like disease screening, the threshold is often set low to maximize recall. For applications like spam filtering, the threshold may be set higher to avoid annoying false positives.
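The following sketch, using hypothetical scores and labels, shows this effect with scikit-learn: lowering the decision threshold from 0.5 to 0.3 raises recall (more positives are caught) while precision drops (more false alarms slip in).

```python
import numpy as np
from sklearn.metrics import recall_score, precision_score

# Hypothetical true labels and model-predicted probabilities
y_true = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])
y_scores = np.array([0.95, 0.85, 0.75, 0.40, 0.35, 0.45, 0.42, 0.20, 0.10, 0.05])

for threshold in (0.5, 0.3):
    y_pred = (y_scores >= threshold).astype(int)
    print(f"threshold={threshold}: "
          f"recall={recall_score(y_true, y_pred):.2f}, "
          f"precision={precision_score(y_true, y_pred):.2f}")
# threshold=0.5: recall=0.60, precision=1.00
# threshold=0.3: recall=1.00, precision=0.71
```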
Recall and precision exist in tension. In most practical systems, improving recall comes at the cost of precision, and vice versa. This inverse relationship is known as the precision-recall tradeoff.
The precision-recall curve visualizes this tradeoff by plotting precision (y-axis) against recall (x-axis) at various threshold settings. A model that can achieve both high precision and high recall simultaneously is superior. The area under the precision-recall curve (AUPRC) provides a single-number summary of a model's performance across all thresholds.
On highly imbalanced datasets, the precision-recall curve is often more informative than the ROC curve. This is because the ROC curve's false positive rate denominator (FP + TN) is dominated by the large negative class, which can make even a large number of false positives appear as a small FPR. The precision-recall curve, by contrast, directly penalizes false positives through the precision metric.
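As a rough sketch (again with hypothetical labels and scores), scikit-learn's precision_recall_curve produces the points of this curve, and the area under it can be summarized either by trapezoidal integration or by average_precision_score:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, auc, average_precision_score

# Hypothetical true labels and model scores
y_true = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])
y_scores = np.array([0.95, 0.85, 0.75, 0.40, 0.35, 0.45, 0.42, 0.20, 0.10, 0.05])

precision, recall, thresholds = precision_recall_curve(y_true, y_scores)

auprc = auc(recall, precision)                   # trapezoidal area under the PR curve
ap = average_precision_score(y_true, y_scores)   # step-wise summary of the same curve
print(f"AUPRC: {auprc:.3f}, Average precision: {ap:.3f}")
```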
Recall should be prioritized when the cost of a false negative is much higher than the cost of a false positive, as in the safety-critical applications listed earlier and discussed in more detail later in this article.
Recall is often used alongside precision, another performance metric in machine learning. Precision measures the proportion of true positive instances among all the instances that were predicted as positive by the model. While recall emphasizes the ability of a model to correctly identify positive instances, precision focuses on the model's accuracy in predicting positive instances.
When evaluating a classification model, it is often necessary to consider both recall and precision to get a comprehensive understanding of performance. One way to do this is by calculating the F1 score, which is the harmonic mean of recall and precision:
F1 = 2 * (Precision * Recall) / (Precision + Recall)
The F1 score ranges from 0 to 1, with a value of 1 indicating a perfect balance between recall and precision. The harmonic mean is used instead of the arithmetic mean because it penalizes large differences between the two values. If precision is 1.0 but recall is 0.1, the arithmetic mean would be 0.55 (misleadingly high), but the F1 score is only 0.18, correctly reflecting the severe imbalance.
The F-beta score generalizes the F1 score by introducing a parameter beta that controls the relative weight given to recall versus precision:
F_beta = (1 + beta^2) * (Precision * Recall) / (beta^2 * Precision + Recall)
| F-beta Variant | Beta Value | Weighting | Typical Use Case |
|---|---|---|---|
| F0.5 score | 0.5 | Precision weighted 2x more than recall | Spam filtering, where false positives are costly |
| F1 score | 1.0 | Equal weight to precision and recall | General-purpose balanced evaluation |
| F2 score | 2.0 | Recall weighted 2x more than precision | Medical screening, safety systems |
When beta is greater than 1, the F-beta score gives more weight to recall. When beta is less than 1, it gives more weight to precision. Setting beta = 1 yields the standard F1 score.
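A brief sketch with scikit-learn's f1_score and fbeta_score illustrates the weighting: with hypothetical predictions giving precision 1.0 and recall 0.5, the recall-weighted F2 score comes out lower than F1, while the precision-weighted F0.5 score comes out higher.

```python
from sklearn.metrics import f1_score, fbeta_score

# Hypothetical labels: the model is precise but misses half of the positives
y_true = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]

# Precision = 3/3 = 1.0, recall = 3/6 = 0.5
print(f1_score(y_true, y_pred))               # 0.667
print(fbeta_score(y_true, y_pred, beta=2))    # 0.556 (recall weighted more)
print(fbeta_score(y_true, y_pred, beta=0.5))  # 0.833 (precision weighted more)
```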
| Metric | Formula | Focus |
|---|---|---|
| Recall | TP / (TP + FN) | Completeness of positive identification |
| Precision | TP / (TP + FP) | Quality of positive predictions |
| F1 Score | 2 * P * R / (P + R) | Balance of precision and recall |
| Specificity | TN / (TN + FP) | Correctness of negative predictions |
| False Negative Rate | FN / (TP + FN) = 1 - Recall | Rate of missed positives |
| Accuracy | (TP + TN) / (TP + TN + FP + FN) | Overall correctness |
Recall (TPR) is one of the two axes of the Receiver Operating Characteristic (ROC) curve. The ROC curve plots the True Positive Rate (recall) on the y-axis against the False Positive Rate (FPR = FP / (FP + TN)) on the x-axis at various classification thresholds.
As the threshold decreases (more instances classified as positive), both the TPR and FPR increase. The model moves from the bottom-left corner of the ROC space (threshold = 1, nothing classified as positive) toward the top-right corner (threshold = 0, everything classified as positive). A good model reaches high TPR (recall) at a low FPR, curving toward the top-left corner of the plot.
The area under the ROC curve (AUC-ROC) is a threshold-independent metric that summarizes how well a model distinguishes between positive and negative classes. An AUC of 1.0 indicates perfect separation, while an AUC of 0.5 indicates random chance (the diagonal line).
While AUC-ROC is widely used, it can be misleading on highly imbalanced datasets because the FPR denominator (FP + TN) is dominated by the large negative class. In such cases, the precision-recall curve is often preferred, as discussed in the previous section.
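For completeness, here is a minimal sketch (with the same hypothetical labels and scores as above) of computing the ROC curve and AUC-ROC in scikit-learn; the tpr array returned by roc_curve is recall evaluated at each threshold.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical true labels and model scores
y_true = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])
y_scores = np.array([0.95, 0.85, 0.75, 0.40, 0.35, 0.45, 0.42, 0.20, 0.10, 0.05])

fpr, tpr, thresholds = roc_curve(y_true, y_scores)  # tpr is recall at each threshold
print(f"AUC-ROC: {roc_auc_score(y_true, y_scores):.3f}")
```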
For multi-class classification problems, recall is computed per class and then aggregated. The three standard aggregation methods mirror those for precision:
| Averaging Method | How It Works | When to Use |
|---|---|---|
| Macro recall | Compute recall independently for each class, then take the unweighted mean | When all classes are equally important |
| Micro recall | Sum all TP and FN across all classes, then compute a single recall value | When you care about overall detection rate |
| Weighted recall | Compute recall per class and take the mean weighted by class support | When class sizes differ and you want proportional representation |
For C classes:
Macro Recall = (1/C) * sum of Recall_c for c = 1 to C
Macro recall gives equal weight to every class, which means that performance on rare classes has the same influence as performance on common classes. This is useful when minority classes are important (e.g., rare diseases in a diagnostic system).
Micro Recall = (sum of TP_c for all c) / (sum of TP_c + FN_c for all c)
For standard multi-class problems where each instance belongs to exactly one class, micro recall equals the overall accuracy. This equivalence occurs because, in single-label multi-class classification, every false negative for one class is simultaneously a false positive for another class, making the total TP + FN equal to the total number of instances.
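This equivalence is easy to check on a small hypothetical example with scikit-learn:

```python
from sklearn.metrics import recall_score, accuracy_score

# Hypothetical single-label multi-class predictions (classes 0, 1, 2)
y_true = [0, 0, 1, 1, 1, 2, 2, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 2, 2, 2, 0, 1]

micro = recall_score(y_true, y_pred, average='micro')
acc = accuracy_score(y_true, y_pred)
print(micro, acc)  # 0.6 0.6 -- identical for single-label multi-class data
```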
Consider a multi-class classifier for three types of manufacturing defects:
| Defect Type | TP | FN | Total Actual | Recall |
|---|---|---|---|---|
| Crack | 40 | 10 | 50 | 40/50 = 0.800 |
| Scratch | 70 | 5 | 75 | 70/75 = 0.933 |
| Dent | 15 | 10 | 25 | 15/25 = 0.600 |
With these counts, macro recall = (0.800 + 0.933 + 0.600) / 3 = 0.778, while micro recall = (40 + 70 + 15) / (50 + 75 + 25) = 125 / 150 = 0.833. Micro recall is higher because the model performs well on the large "Scratch" class. Macro recall reveals that the model struggles with "Dent" defects, a fact that micro recall obscures.
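These figures can be verified directly from the per-class counts in the table; a small sketch in plain Python:

```python
# Per-class (TP, FN) counts from the defect example above
counts = {"Crack": (40, 10), "Scratch": (70, 5), "Dent": (15, 10)}

per_class = {name: tp / (tp + fn) for name, (tp, fn) in counts.items()}
macro = sum(per_class.values()) / len(per_class)
micro = sum(tp for tp, _ in counts.values()) / sum(tp + fn for tp, fn in counts.values())

print(per_class)              # {'Crack': 0.8, 'Scratch': 0.933..., 'Dent': 0.6}
print(f"Macro: {macro:.3f}")  # 0.778
print(f"Micro: {micro:.3f}")  # 0.833
```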
Recall@K is a variant of recall used in information retrieval and ranking systems. It measures the proportion of all relevant items that appear within the top K results:
Recall@K = (Number of relevant items in the top K results) / (Total number of relevant items)
For example, if a database contains 20 relevant documents for a query, and a search engine retrieves 12 of them within the top 50 results, then Recall@50 = 12 / 20 = 0.60.
Recall@K is important for evaluating systems that return a ranked list of results, such as search engines, recommender systems, and the candidate-retrieval stage of larger ranking pipelines.
| Metric | Question It Answers | Denominator |
|---|---|---|
| Precision@K | Of the K items returned, how many are relevant? | K (the number of items returned) |
| Recall@K | Of all relevant items, how many appear in the top K? | Total relevant items in the entire collection |
A key property of Recall@K is that it does not consider the ranking order of items within the top K. Whether a relevant item appears at position 1 or position K, it contributes equally to Recall@K. Metrics like Mean Reciprocal Rank (MRR) and Normalized Discounted Cumulative Gain (NDCG) are used when ranking order matters.
As K increases, Recall@K generally increases (more relevant items are found), while Precision@K may decrease (the incremental results are less likely to be relevant).
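Recall@K is simple to compute from a ranked result list; the sketch below uses hypothetical document IDs and a hypothetical set of relevant items:

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of all relevant items that appear in the top-k ranked results."""
    top_k = set(ranked_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)

# Hypothetical ranked results and ground-truth relevant documents
ranked = ["d7", "d2", "d9", "d1", "d5", "d3", "d8"]
relevant = {"d2", "d5", "d6", "d9"}

print(recall_at_k(ranked, relevant, k=5))  # 3 of 4 relevant items in the top 5 -> 0.75
```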
In object detection, recall measures the proportion of ground-truth objects that the model successfully detects. A detection is considered a true positive only if its predicted bounding box overlaps sufficiently with a ground-truth bounding box, as measured by the Intersection over Union (IoU) threshold.
For example, at an IoU threshold of 0.5, a predicted bounding box must overlap at least 50% with the ground-truth box to count as a correct detection. If the IoU falls below the threshold, the prediction is either a false positive (if no matching ground-truth box exists) or the ground-truth object remains a false negative.
The COCO (Common Objects in Context) benchmark uses a metric called Average Recall (AR), which averages recall across multiple IoU thresholds ranging from 0.5 to 0.95 in steps of 0.05. AR is reported at different maximum detection limits per image:
| COCO Recall Metric | Description |
|---|---|
| AR@1 | Average recall with at most 1 detection per image |
| AR@10 | Average recall with at most 10 detections per image |
| AR@100 | Average recall with at most 100 detections per image |
| AR (small) | AR for objects with area less than 32x32 pixels |
| AR (medium) | AR for objects with area between 32x32 and 96x96 pixels |
| AR (large) | AR for objects with area greater than 96x96 pixels |
AR correlates strongly with localization accuracy for IoU thresholds above 0.5. Models with higher AR tend to produce tighter, more accurate bounding boxes. Unlike mean Average Precision (mAP), which incorporates both precision and recall at multiple confidence thresholds, AR focuses solely on the recall side and does not account for confidence scores.
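A simplified sketch of detection recall: compute IoU between each ground-truth box and every prediction, and count a ground-truth object as found if any prediction exceeds the IoU threshold. Benchmark evaluators such as COCO additionally enforce one-to-one matching and per-class bookkeeping, which this toy version omits; all boxes below are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def detection_recall(predictions, ground_truths, iou_threshold=0.5):
    """Fraction of ground-truth boxes matched by at least one prediction."""
    matched = sum(
        1 for gt in ground_truths
        if any(iou(pred, gt) >= iou_threshold for pred in predictions)
    )
    return matched / len(ground_truths)

# Hypothetical boxes: one object detected accurately, one missed entirely
gts = [(10, 10, 50, 50), (100, 100, 140, 140)]
preds = [(12, 8, 52, 48), (200, 200, 240, 240)]
print(detection_recall(preds, gts))  # 0.5
```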
The scikit-learn library provides straightforward functions for computing recall in Python:
```python
from sklearn.metrics import recall_score, classification_report

# Binary classification example
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

# Binary recall
recall = recall_score(y_true, y_pred)
print(f"Recall: {recall:.3f}")  # Output: 0.800

# Multi-class averaging options
recall_macro = recall_score(y_true, y_pred, average='macro')
recall_micro = recall_score(y_true, y_pred, average='micro')
recall_weighted = recall_score(y_true, y_pred, average='weighted')

# Full classification report with per-class recall
print(classification_report(y_true, y_pred))
```
The classification_report function produces a table showing precision, recall, and F1 score for each class, along with macro, weighted, and (in some configurations) micro averages. For binary classification, the pos_label parameter controls which class is treated as the positive class (default is 1). When average=None, the function returns per-class recall values as an array, which is useful for inspecting recall on individual classes.
Scikit-learn also provides precision_recall_curve for generating the precision-recall curve and PrecisionRecallDisplay for plotting it, both of which are useful for visualizing how recall changes across different threshold values.
Recall is especially important in situations where false negatives carry a high cost.
In medical screening (e.g., mammography for breast cancer, PCR tests for infectious diseases), high recall is essential. Missing a positive case can lead to delayed treatment, disease progression, or even death. Regulatory bodies such as the U.S. Food and Drug Administration (FDA) often set minimum sensitivity requirements for diagnostic devices. For instance, rapid antigen tests for COVID-19 were evaluated primarily on sensitivity (recall) to ensure that infected individuals were identified. A typical screening workflow uses a high-recall first-stage test followed by a high-precision confirmatory test.
In financial fraud detection, a missed fraudulent transaction (false negative) results in direct monetary loss. Banks and payment processors typically tune their fraud models for high recall, accepting a moderate number of false alarms (which are resolved through customer verification) to ensure that as many fraudulent transactions as possible are caught.
Intrusion detection systems, airport scanners, and autonomous vehicle perception systems all require high recall. In autonomous driving, for example, failing to detect a pedestrian (false negative) can result in a collision, while a false positive (perceiving a pedestrian that is not there) results in unnecessary braking, which is inconvenient but not dangerous.
In patent search, legal discovery, and systematic reviews (e.g., medical literature reviews for clinical guidelines), recall is often the primary metric. Missing a relevant document can have legal or clinical consequences, while including an irrelevant document simply adds to the review workload.
Imagine you have a bag of colored balls. Some are red and some are blue. Your job is to pull out all the red balls.
Recall tells you how good you are at finding the red balls. If there are 10 red balls in the bag and you found 8 of them, your recall is 8 out of 10, or 80%. You missed 2 red balls.
A perfect recall score of 100% means you found every single red ball. It does not matter if you also accidentally grabbed some blue balls along the way. Recall only cares about one thing: did you find all the red ones?