# Recall (metric)

> Source: https://aiwiki.ai/wiki/recall
> Updated: 2026-06-20
> Categories: Machine Learning, Model Evaluation
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

*See also: [Machine learning terms](/wiki/machine_learning_terms)*

## What is recall?

**Recall** is a classification and retrieval metric that measures the proportion of actual positive instances a model correctly identifies, defined as TP / (TP + FN), the number of [true positives](/wiki/true_positive_tp) divided by the total number of real positives.[1][3] It answers a single question: of all the instances that are truly positive, how many did the model find? Recall is also called sensitivity, the true positive rate (TPR), hit rate, or probability of detection, and it is the metric of choice when the cost of a missed positive (a false negative) is high, such as in disease screening, fraud detection, or safety systems.[4][6]

Google's Machine Learning Crash Course states the definition directly: "The true positive rate (TPR), or the proportion of all actual positives that were classified correctly as positives, is also known as recall."[12] Recall is reported as a value between 0 and 1 (or 0% to 100%): the scikit-learn reference notes that for recall "the best value is 1 and the worst value is 0."[8] A recall of 1 means every positive instance was found; a recall of 0 means none were.

Recall is widely used in [machine learning](/wiki/machine_learning) and [information retrieval](/wiki/information_retrieval) to evaluate classification and retrieval models.[1][3] This article covers its mathematical formulation, its relation to other performance metrics such as [precision](/wiki/precision) and [F1 score](/wiki/f1_score), and the practical scenarios where recall takes priority. A model with high recall catches most of the positive cases, even if it also flags some negatives along the way.

## Definition and formula

Recall, also known as **sensitivity**, **true positive rate (TPR)**, **hit rate**, or **probability of detection**, is the proportion of [true positive](/wiki/true_positive_tp) instances (correctly identified positive instances) among all the actual positive instances in the dataset.[1][6] Mathematically, recall is defined as:

**Recall = TP / (TP + FN)**[1]

Where:

- **TP (True Positives)** denotes the number of instances that are actually positive and were correctly predicted as positive.
- **FN ([False Negatives](/wiki/false_negative_fn))** denotes the number of instances that are actually positive but were incorrectly predicted as negative. False negatives are also known as Type II errors, or "misses."[6]

The [scikit-learn](/wiki/scikit-learn) documentation gives the same definition in plain language: "The recall is the ratio tp / (tp + fn) where tp is the number of true positives and fn the number of false negatives. The recall is intuitively the ability of the classifier to find all the positive samples."[8]

Recall is expressed as a value between 0 and 1 (or 0% to 100%), where a value of 1 indicates perfect recall (every positive instance was found), and a value of 0 indicates that no positive instances were identified.

Because recall is defined as TP / (TP + FN), it is exactly the complement of the false negative rate (FNR). In other words:

**FNR = FN / (TP + FN) = 1 - Recall**

This means that maximizing recall is equivalent to minimizing the false negative rate, which in statistical hypothesis testing corresponds to minimizing the probability of a Type II error.[1]

### Worked example

Consider a medical screening test for a disease. Out of 1,000 patients, 50 actually have the disease and 950 do not. After running the test:

| | Predicted Positive (Disease) | Predicted Negative (Healthy) |
|--|--|--|
| **Actually Positive** | 45 (TP) | 5 (FN) |
| **Actually Negative** | 30 (FP) | 920 (TN) |

The recall of this screening test is:

*Recall = TP / (TP + FN) = 45 / (45 + 5) = 45 / 50 = 0.90*

This means the test correctly identifies 90% of patients who actually have the disease. The remaining 10% (5 patients) have the disease but were missed by the test (false negatives).
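The arithmetic in this example can be checked with a few lines of Python. This is a minimal sketch: the helper names `recall` and `false_negative_rate` are illustrative rather than part of any library, and returning 0.0 when there are no actual positives is an assumed convention.

```python
def recall(tp: int, fn: int) -> float:
    """Recall = TP / (TP + FN)."""
    actual_positives = tp + fn
    # Assumed convention: report 0.0 when there are no actual positives.
    return tp / actual_positives if actual_positives else 0.0


def false_negative_rate(tp: int, fn: int) -> float:
    """FNR = FN / (TP + FN), the complement of recall."""
    actual_positives = tp + fn
    return fn / actual_positives if actual_positives else 0.0


# Counts from the screening example: 45 true positives, 5 false negatives.
screening_recall = recall(tp=45, fn=5)
screening_fnr = false_negative_rate(tp=45, fn=5)

print(f"Recall: {screening_recall:.2f}")  # Recall: 0.90
print(f"FNR:    {screening_fnr:.2f}")     # FNR:    0.10
```

The two values sum to 1, which is the FNR = 1 - Recall identity applied to the screening counts.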

## Understanding true positives and false negatives

To fully understand recall, it helps to place TP and FN within the full [confusion matrix](/wiki/confusion_matrix):[1]

| | Predicted Positive | Predicted Negative |
|--|--|--|
| **Actually Positive** | True Positive (TP) | False Negative (FN) |
| **Actually Negative** | False Positive (FP) | True Negative (TN) |

Recall focuses on the top row of this matrix: of everything that is actually positive (TP + FN), what fraction did the model catch (TP)? Note that recall does not consider false positives at all. A model can flag many negative instances as positive (low precision) and still have high recall, as long as it catches all the true positives.[4]
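To make the point about false positives concrete, the sketch below counts the four confusion-matrix cells for a degenerate "model" that flags every instance as positive. The function name `confusion_counts` and the toy labels are made up for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FN, FP, TN for a binary problem."""
    tp = fn = fp = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive and pred == positive:
            tp += 1
        elif truth == positive:
            fn += 1
        elif pred == positive:
            fp += 1
        else:
            tn += 1
    return tp, fn, fp, tn


# Toy data: 3 actual positives, 7 actual negatives.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
flag_everything = [1] * len(y_true)  # predict positive for every instance

tp, fn, fp, tn = confusion_counts(y_true, flag_everything)
recall_value = tp / (tp + fn)
precision_value = tp / (tp + fp)

print(recall_value)     # 1.0
print(precision_value)  # 0.3
```

Recall is perfect because FN is zero, yet seven of the ten positive predictions are wrong, which only precision reveals.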

### Why do false negatives matter?

False negatives are costly in many real-world applications:

| Application | What a False Negative Means | Consequence |
|-------------|----------------------------|-------------|
| Cancer screening | Cancerous tumor missed by the test | Delayed treatment; potentially fatal outcome |
| Airport security | Weapon not detected by scanner | Threat passes through undetected |
| [Fraud detection](/wiki/fraud_detection) | Fraudulent transaction not flagged | Financial loss for the institution or customer |
| Autonomous driving | Pedestrian not detected by the perception system | Potential collision |
| Product recall identification | Defective product not identified | Harm to consumers; liability |
| Cybersecurity intrusion detection | Active attack not detected | Data breach; system compromise |

In every one of these cases, missing a positive instance is far more dangerous than generating a false alarm. This is why recall is the primary metric of concern in safety-critical applications. Google's Machine Learning Crash Course recommends recall as the headline metric precisely in this situation: "Use when false negatives are more expensive than false positives."[12]

## Connection to sensitivity and other synonyms

Recall goes by many names across different fields. All of the following terms refer to the same formula, TP / (TP + FN):[1][6]

| Term | Field Where Commonly Used |
|------|---------------------------|
| Recall | Machine learning, information retrieval |
| Sensitivity | Medicine, biostatistics, clinical testing |
| True Positive Rate (TPR) | Statistics, [ROC analysis](/wiki/roc_receiver_operating_characteristic_curve) |
| Hit rate | Signal detection theory, psychology |
| Detection rate | Radar engineering, security screening |
| Probability of detection | Engineering, telecommunications |
| Power (1 - beta) | Statistical hypothesis testing |

In **medical diagnostics**, sensitivity is perhaps the most important metric for a screening test. A highly sensitive test catches nearly all patients with the condition, ensuring that few cases are missed. Follow-up confirmatory tests (which may prioritize specificity or precision) can then be used to eliminate false positives. Regulatory bodies such as the U.S. Food and Drug Administration (FDA) often set minimum sensitivity requirements for diagnostic devices. Rapid antigen tests for COVID-19, for instance, were evaluated primarily on sensitivity (recall) to ensure that infected individuals were identified: the FDA Emergency Use Authorization acceptance criterion for SARS-CoV-2 antigen tests was a point estimate of at least 80% Positive Percent Agreement (PPA), the regulatory equivalent of sensitivity, measured against an RT-PCR reference standard.[13]

In **statistical hypothesis testing**, the concept of recall maps onto the **power** of a test: the probability of correctly rejecting a false null hypothesis.[1] A test with low power has a high rate of Type II errors (false negatives), meaning it frequently fails to detect a real effect.

## How does the classification threshold affect recall?

Most [classification models](/wiki/classification_model) do not output a hard binary label directly. Instead, they produce a continuous score or probability for each instance. A [classification threshold](/wiki/classification_threshold) determines the cutoff above which an instance is classified as positive.[6]

Adjusting this threshold directly controls the tradeoff between recall and precision:

| Threshold Change | Effect on Predictions | Recall | Precision | False Negatives | False Positives |
|------------------|-----------------------|--------|-----------|----------------|----------------|
| Lower the threshold | More instances classified as positive | Increases | Decreases | Fewer | More |
| Raise the threshold | Fewer instances classified as positive | Decreases | Increases | More | Fewer |

At the extreme, if the threshold is set to 0 (classify everything as positive), recall equals 1.0 because no positive instance is missed. However, precision drops because many negative instances are also classified as positive. Conversely, if the threshold is set to 1.0 (classify nothing as positive), recall drops to 0 because no instance is classified as positive at all.[6]

In practice, the optimal threshold depends on the relative costs of false negatives and false positives. For safety-critical applications like disease screening, the threshold is often set low to maximize recall. For applications like spam filtering, the threshold may be set higher to avoid annoying false positives.
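The effect of moving the threshold can be sketched on a handful of made-up scores. The example assumes that a score greater than or equal to the threshold is classified as positive, and that precision is reported as 1.0 when nothing is flagged; both are conventions chosen for illustration, and the function name is hypothetical.

```python
def precision_recall_at(y_true, scores, threshold):
    """Classify score >= threshold as positive; return (precision, recall)."""
    tp = sum(1 for t, s in zip(y_true, scores) if t == 1 and s >= threshold)
    fp = sum(1 for t, s in zip(y_true, scores) if t == 0 and s >= threshold)
    fn = sum(1 for t, s in zip(y_true, scores) if t == 1 and s < threshold)
    # Assumed convention: precision is 1.0 when no instance is flagged.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy labels and model scores (illustrative values only).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1]

for threshold in (0.25, 0.5, 0.75):
    p, r = precision_recall_at(y_true, scores, threshold)
    print(f"threshold={threshold:.2f}  precision={p:.2f}  recall={r:.2f}")
```

On this toy data, recall falls from 1.00 to 0.75 to 0.50 as the threshold rises from 0.25 to 0.75, while precision rises from 0.67 to 0.75 to 1.00.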

## The recall vs. precision tradeoff

Recall and precision exist in tension. In most practical systems, improving recall comes at the cost of precision, and vice versa. This inverse relationship is known as the **precision-recall tradeoff**.

The **precision-recall curve** visualizes this tradeoff by plotting precision (y-axis) against recall (x-axis) at various threshold settings.[2] The closer a curve stays to the top-right corner, where precision and recall are both high, the better the model. The area under the precision-recall curve (AUPRC) provides a single-number summary of a model's performance across all thresholds.[2]

On highly imbalanced datasets, the precision-recall curve is often more informative than the ROC curve.[7] This is because the ROC curve's false positive rate denominator (FP + TN) is dominated by the large negative class, which can make even a large number of false positives appear as a small FPR. The precision-recall curve, by contrast, directly penalizes false positives through the precision metric.[2][7]
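A small numeric sketch shows why the false positive rate can look reassuring on imbalanced data. The counts below are hypothetical and chosen only to illustrate the effect.

```python
# Hypothetical counts: 100 actual positives and 10,000 actual negatives.
tp, fn = 90, 10
fp, tn = 500, 9_500

recall = tp / (tp + fn)               # share of positives that were found
false_positive_rate = fp / (fp + tn)  # share of negatives wrongly flagged
precision = tp / (tp + fp)            # share of flagged items that are positive

print(f"recall={recall:.3f}")
print(f"FPR={false_positive_rate:.3f}")
print(f"precision={precision:.3f}")
```

With these counts the FPR is only 0.05, so the ROC point looks strong, but precision is about 0.15: roughly five of every six flagged instances are false alarms.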

### When should recall be favored over precision?

Recall should be prioritized when the cost of a false negative is much higher than the cost of a false positive:[4]

- **Medical screening:** Missing a disease (false negative) can be fatal. A false alarm leads to a follow-up test, which is inconvenient but not dangerous.
- **Safety systems:** Failing to detect a hazard (earthquake alert, fire detection, intrusion alarm) can result in loss of life. False alarms cause temporary disruption but keep people safe.
- **Search and rescue:** It is better to investigate a false lead than to miss a trapped survivor.
- **Legal discovery:** Missing a relevant document in litigation can result in sanctions or an adverse verdict. Reviewing an irrelevant document wastes time but has no legal consequence.

## Relation to precision and F-scores

Recall is often used alongside precision, another performance metric in machine learning. Precision measures the proportion of true positive instances among all the instances that were predicted as positive by the model.[1][3] While recall emphasizes the ability of a model to correctly identify positive instances, precision focuses on the model's accuracy in predicting positive instances.

When evaluating a [classification model](/wiki/classification_model), it is often necessary to consider both recall and precision to get a comprehensive understanding of performance. One way to do this is by calculating the **F1 score**, which is the harmonic mean of recall and precision:[5]

**F1 = 2 * (Precision * Recall) / (Precision + Recall)**[5]

The F1 score ranges from 0 to 1, with a value of 1 indicating perfect precision and perfect recall. The harmonic mean is used instead of the arithmetic mean because it penalizes large differences between the two values. If precision is 1.0 but recall is 0.1, the arithmetic mean would be 0.55 (misleadingly high), but the F1 score is only 0.18, correctly reflecting the severe imbalance.

The **F-beta score** generalizes the F1 score by introducing a parameter beta that controls the relative weight given to recall versus precision:[5]

**F_beta = (1 + beta^2) * (Precision * Recall) / (beta^2 * Precision + Recall)**[5]

| F-beta Variant | Beta Value | Weighting | Typical Use Case |
|---------------|------------|-----------|------------------|
| F0.5 score | 0.5 | Precision weighted 2x more than recall | Spam filtering, where false positives are costly |
| F1 score | 1.0 | Equal weight to precision and recall | General-purpose balanced evaluation |
| F2 score | 2.0 | Recall weighted 2x more than precision | Medical screening, safety systems |

When beta is greater than 1, the F-beta score gives more weight to recall. When beta is less than 1, it gives more weight to precision. Setting beta = 1 yields the standard F1 score.[4][10]
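The F-beta formula translates directly into code. The sketch below is a plain-Python illustration (the function name `f_beta` is an assumption, not a library call) applied to the precision 1.0, recall 0.1 case discussed earlier in this section.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """F_beta = (1 + beta^2) * P * R / (beta^2 * P + R)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0  # assumed convention to avoid dividing by zero
    beta_sq = beta ** 2
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


precision, recall = 1.0, 0.1

f1 = f_beta(precision, recall, beta=1.0)
f2 = f_beta(precision, recall, beta=2.0)
f_half = f_beta(precision, recall, beta=0.5)

print(f"F1   = {f1:.3f}")      # F1   = 0.182
print(f"F2   = {f2:.3f}")      # F2   = 0.122
print(f"F0.5 = {f_half:.3f}")  # F0.5 = 0.357
```

Because recall is the weak side here, the recall-weighted F2 score is the lowest of the three and the precision-weighted F0.5 score is the highest.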

### Summary of related metrics

| Metric | Formula | Focus |
|--------|---------|-------|
| [Recall](/wiki/recall) | TP / (TP + FN) | Completeness of positive identification |
| [Precision](/wiki/precision) | TP / (TP + FP) | Quality of positive predictions |
| [F1 Score](/wiki/f1_score) | 2 * P * R / (P + R) | Balance of precision and recall |
| Specificity | TN / (TN + FP) | Correctness of negative predictions |
| False Negative Rate | FN / (TP + FN) = 1 - Recall | Rate of missed positives |
| [Accuracy](/wiki/accuracy) | (TP + TN) / (TP + TN + FP + FN) | Overall correctness |

## ROC curves and the recall connection

Recall (TPR) is one of the two axes of the **Receiver Operating Characteristic (ROC) curve**.[6] The [ROC curve](/wiki/roc_receiver_operating_characteristic_curve) plots the True Positive Rate (recall) on the y-axis against the False Positive Rate (FPR = FP / (FP + TN)) on the x-axis at various classification thresholds.[6]

As the threshold decreases (more instances classified as positive), both the TPR and FPR increase. The model moves from the bottom-left corner of the ROC space (threshold = 1, nothing classified as positive) toward the top-right corner (threshold = 0, everything classified as positive). A good model reaches high TPR (recall) at a low FPR, curving toward the top-left corner of the plot.

The area under the ROC curve (AUC-ROC) is a threshold-independent metric that summarizes how well a model distinguishes between positive and negative classes. An AUC of 1.0 indicates perfect separation, while an AUC of 0.5 indicates random chance (the diagonal line).[6]

While AUC-ROC is widely used, it can be misleading on highly imbalanced datasets because the FPR denominator (FP + TN) is dominated by the large negative class. In such cases, the precision-recall curve is often preferred, as discussed in the previous section.[7]
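The construction of the ROC curve can be sketched by hand. This is a simplified illustration, not scikit-learn's implementation: it assumes binary labels coded as 0 and 1, at least one instance of each class, and a "score >= threshold" decision rule, and the function names are hypothetical.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs from the strictest threshold to the loosest."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing is flagged
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if t == 1 and s >= threshold)
        fp = sum(1 for t, s in zip(y_true, scores) if t == 0 and s >= threshold)
        points.append((fp / negatives, tp / positives))
    return points


def trapezoid_auc(points):
    """Area under a piecewise-linear curve given as (x, y) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Toy data (illustrative values only).
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

points = roc_points(y_true, scores)
auc = trapezoid_auc(points)

print(points)
print(f"AUC = {auc:.2f}")  # AUC = 0.75
```

Each step down in threshold moves the point up (a positive is newly captured, so recall rises) or right (a negative is newly flagged, so FPR rises).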

## Recall in multi-class settings

For multi-class classification problems, recall is computed per class and then aggregated. The three standard aggregation methods mirror those for precision:[4]

| Averaging Method | How It Works | When to Use |
|-----------------|--------------|-------------|
| **Macro recall** | Compute recall independently for each class, then take the unweighted mean | When all classes are equally important |
| **Micro recall** | Sum all TP and FN across all classes, then compute a single recall value | When you care about overall detection rate |
| **Weighted recall** | Compute recall per class and take the mean weighted by class support | When class sizes differ and you want proportional representation |

### Macro recall

For C classes:

*Macro Recall = (1/C) * sum of Recall_c for c = 1 to C*

Macro recall gives equal weight to every class, which means that performance on rare classes has the same influence as performance on common classes. This is useful when minority classes are important (e.g., rare diseases in a diagnostic system).[4]

### Micro recall

*Micro Recall = (sum of TP_c for all c) / (sum of TP_c + FN_c for all c)*

For standard multi-class problems where each instance belongs to exactly one class, micro recall equals the overall accuracy.[4] Every instance has exactly one true class, so it is counted once, as either a true positive or a false negative for that class: the summed TP is the number of correct predictions and the summed TP + FN is the total number of instances. Because every false negative for one class is simultaneously a false positive for another class, micro precision takes the same value.
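The equivalence with accuracy can be checked on a small single-label example. The class names and predictions below are invented for illustration.

```python
# Toy single-label, three-class data (illustrative values only).
y_true = ["cat", "dog", "bird", "cat", "dog", "bird", "cat", "dog"]
y_pred = ["cat", "dog", "cat", "cat", "bird", "bird", "dog", "dog"]

classes = sorted(set(y_true))
total_tp = 0
total_fn = 0
for c in classes:
    # Per-class true positives and false negatives, pooled across classes.
    total_tp += sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
    total_fn += sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)

micro_recall = total_tp / (total_tp + total_fn)
accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

print(micro_recall, accuracy)  # 0.625 0.625
```

The pooled TP + FN equals the number of instances (8), so the micro recall denominator matches the accuracy denominator.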

### Practical example

Consider a multi-class classifier for three types of manufacturing defects:

| Defect Type | TP | FN | Total Actual | Recall |
|-------------|----|----|-------------|--------|
| Crack | 40 | 10 | 50 | 40/50 = 0.800 |
| Scratch | 70 | 5 | 75 | 70/75 = 0.933 |
| Dent | 15 | 10 | 25 | 15/25 = 0.600 |

- **Macro recall** = (0.800 + 0.933 + 0.600) / 3 = 0.778
- **Micro recall** = (40 + 70 + 15) / (50 + 75 + 25) = 125 / 150 = 0.833

Micro recall is higher because the model performs well on the large "Scratch" class. Macro recall reveals that the model struggles with "Dent" defects, a fact that micro recall obscures.
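The same figures can be reproduced from the TP and FN counts in the table. The sketch below also computes the support-weighted average, which is not shown in the table; the dictionary layout is simply a convenient assumption.

```python
# (TP, FN) per defect type, taken from the table above.
counts = {"Crack": (40, 10), "Scratch": (70, 5), "Dent": (15, 10)}

per_class = {name: tp / (tp + fn) for name, (tp, fn) in counts.items()}

# Macro: unweighted mean of the per-class recalls.
macro = sum(per_class.values()) / len(per_class)

# Micro: pool TP and FN across classes, then compute one recall.
total_tp = sum(tp for tp, _ in counts.values())
total_actual = sum(tp + fn for tp, fn in counts.values())
micro = total_tp / total_actual

# Weighted: mean of per-class recalls weighted by class support (TP + FN).
weighted = sum(
    per_class[name] * (tp + fn) for name, (tp, fn) in counts.items()
) / total_actual

print(f"macro={macro:.3f}  micro={micro:.3f}  weighted={weighted:.3f}")
```

For recall specifically, the support-weighted average matches micro recall, because each per-class recall multiplied by its support is just that class's TP count.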

## Recall@K in information retrieval

**Recall@K** is a variant of recall used in information retrieval and ranking systems. It measures the proportion of all relevant items that appear within the top K results:[3][5]

**Recall@K = (Number of relevant items in the top K results) / (Total number of relevant items)**

For example, if a database contains 20 relevant documents for a query, and a search engine retrieves 12 of them within the top 50 results, then Recall@50 = 12 / 20 = 0.60.

The concept predates its widespread use in machine learning evaluation. C. J. van Rijsbergen's 1979 textbook *Information Retrieval* defined recall as the proportion of the relevant documents in the collection that have actually been retrieved, pairing it with precision (the proportion of retrieved documents that are relevant) as the two foundational measures of retrieval effectiveness.[5]

Recall@K is important for evaluating:

- **Search engines:** What fraction of all relevant web pages appear on the first few pages of results?
- **Recommendation systems:** What fraction of items a user would find interesting are included in the top-K recommendations?
- **Candidate retrieval stages:** In two-stage retrieval pipelines (retrieval followed by re-ranking), the retrieval stage must achieve high Recall@K to ensure that no relevant items are lost before re-ranking.
- **Vector similarity search:** In approximate nearest neighbor systems, Recall@K measures how many of the true nearest neighbors are returned by the approximate search.

| Metric | Question It Answers | Denominator |
|--------|---------------------|-------------|
| Precision@K | Of the K items returned, how many are relevant? | K (the number of items returned) |
| Recall@K | Of all relevant items, how many appear in the top K? | Total relevant items in the entire collection |

A key property of Recall@K is that it does not consider the ranking order of items within the top K. Whether a relevant item appears at position 1 or position K, it contributes equally to Recall@K. Metrics like Mean Reciprocal Rank (MRR) and Normalized Discounted Cumulative Gain (NDCG) are used when ranking order matters.[3]

As K increases, Recall@K generally increases (more relevant items are found), while Precision@K may decrease (the incremental results are less likely to be relevant).
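Recall@K and Precision@K are straightforward to compute from a ranked list. The sketch below uses hypothetical document IDs and function names, and treats an empty relevant set as a recall of 0.0 by assumption.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of all relevant items that appear in the top k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0  # assumed convention when nothing is relevant
    hits = len(set(ranked_ids[:k]) & relevant)
    return hits / len(relevant)


def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top k results that are relevant (denominator is k)."""
    if k <= 0:
        return 0.0
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / k


# Hypothetical ranking and ground truth; "d8" is relevant but never retrieved.
ranked = ["d3", "d7", "d1", "d9", "d4", "d2"]
relevant = {"d1", "d2", "d3", "d8"}

for k in (3, 6):
    r = recall_at_k(ranked, relevant, k)
    p = precision_at_k(ranked, relevant, k)
    print(f"k={k}  recall@k={r:.2f}  precision@k={p:.2f}")
```

Going from K = 3 to K = 6 raises Recall@K from 0.50 to 0.75 while Precision@K drops from 0.67 to 0.50, and swapping the order of items inside the top K would change neither value.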

## Recall in object detection

In [object detection](/wiki/object_detection), recall measures the proportion of ground-truth objects that the model successfully detects.[11] A detection is considered a true positive only if its predicted bounding box overlaps sufficiently with a ground-truth bounding box, as measured by the Intersection over Union (IoU) threshold.[9][11]

For example, at an IoU threshold of 0.5, the area of intersection between a predicted box and the ground-truth box must be at least half the area of their union for the prediction to count as a correct detection. A prediction whose IoU with every ground-truth box falls below the threshold counts as a false positive, and a ground-truth object that no prediction matches counts as a false negative.
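IoU is the intersection area divided by the union area of the two boxes. The sketch below assumes axis-aligned boxes given as `(x_min, y_min, x_max, y_max)` tuples; the function name and coordinates are illustrative.

```python
def iou(box_a, box_b):
    """Intersection over Union for boxes given as (x_min, y_min, x_max, y_max)."""
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])

    # Width and height clamp to zero when the boxes do not overlap.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


ground_truth = (0, 0, 10, 10)
prediction = (5, 0, 15, 10)  # same size, shifted right by half its width

score = iou(ground_truth, prediction)
print(f"IoU = {score:.3f}")  # IoU = 0.333
```

The prediction covers half of the ground-truth box, yet its IoU is only about 0.33, so it would not count as a true positive at an IoU threshold of 0.5.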

The COCO (Common Objects in Context) benchmark reports 12 standard evaluation metrics, split between Average Precision and Average Recall, and defines **Average Recall (AR)** as the maximum recall given a fixed number of detections per image, averaged over all categories and over the 10 IoU thresholds from 0.50 to 0.95 in steps of 0.05.[9][11] AR is reported at different maximum detection limits per image:

| COCO Recall Metric | Description |
|--------------------|-------------|
| AR@1 | Average recall with at most 1 detection per image |
| AR@10 | Average recall with at most 10 detections per image |
| AR@100 | Average recall with at most 100 detections per image |
| AR (small) | AR for objects with area less than 32x32 pixels |
| AR (medium) | AR for objects with area between 32x32 and 96x96 pixels |
| AR (large) | AR for objects with area greater than 96x96 pixels |

AR correlates strongly with localization accuracy for IoU thresholds above 0.5. Models with higher AR tend to produce tighter, more accurate bounding boxes. Unlike mean Average Precision (mAP), which incorporates both precision and recall at multiple confidence thresholds, AR focuses solely on the recall side and does not account for confidence scores.[11]
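The averaging over IoU thresholds can be illustrated with a deliberately simplified sketch. It assumes the best-matching IoU for each ground-truth object has already been computed, and it ignores the per-image detection cap, per-category averaging, and one-to-one matching that the full COCO protocol applies; the IoU values and function name are made up.

```python
def recall_at_iou(best_ious, threshold):
    """Fraction of ground-truth objects whose best match reaches the threshold."""
    if not best_ious:
        return 0.0
    return sum(1 for value in best_ious if value >= threshold) / len(best_ious)


# The 10 IoU thresholds from 0.50 to 0.95 in steps of 0.05.
iou_thresholds = [(50 + 5 * i) / 100 for i in range(10)]

# Hypothetical best IoU achieved for each of five ground-truth objects.
best_ious = [0.92, 0.81, 0.66, 0.52, 0.30]

recalls = [recall_at_iou(best_ious, t) for t in iou_thresholds]
average_recall = sum(recalls) / len(recalls)

print(f"recall@0.50 = {recalls[0]:.2f}")
print(f"recall@0.95 = {recalls[-1]:.2f}")
print(f"simplified AR = {average_recall:.2f}")
```

Recall is 0.80 at IoU 0.50 but 0.00 at IoU 0.95, so the averaged value (0.42 here) rewards detections that are tightly localized rather than merely overlapping.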

## Implementation in scikit-learn

The [scikit-learn](/wiki/scikit-learn) library provides straightforward functions for computing recall in Python.[8] The library documents the function as returning the ratio tp / (tp + fn), where "the best value is 1 and the worst value is 0":[8]

```python
from sklearn.metrics import recall_score, classification_report

# Binary classification example
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

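# Binary recall: 4 of the 5 actual positives are predicted positive
recall = recall_score(y_true, y_pred)
print(f"Recall: {recall:.3f}")  # prints "Recall: 0.800"

# Averaging options (intended for multi-class labels; shown here on the binary data)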
recall_macro = recall_score(y_true, y_pred, average='macro')
recall_micro = recall_score(y_true, y_pred, average='micro')
recall_weighted = recall_score(y_true, y_pred, average='weighted')

# Full classification report with per-class recall
print(classification_report(y_true, y_pred))
```

The `classification_report` function produces a table showing precision, recall, and F1 score for each class, along with macro, weighted, and (in some configurations) micro averages.[8] For [binary classification](/wiki/binary_classification), the `pos_label` parameter controls which class is treated as the positive class (default is 1). When `average=None`, the function returns per-class recall values as an array, which is useful for inspecting recall on individual classes.

Scikit-learn also provides `precision_recall_curve` for generating the precision-recall curve and `PrecisionRecallDisplay` for plotting it, both of which are useful for visualizing how recall changes across different threshold values.[8]

## What is recall used for?

Recall is especially important in situations where false negatives carry a high cost.

### Medical diagnosis

In medical screening (e.g., mammography for breast cancer, PCR tests for infectious diseases), high recall is essential because a missed positive case can lead to delayed treatment, disease progression, or even death. Regulators often set minimum sensitivity requirements for diagnostic devices: the FDA Emergency Use Authorization criterion for SARS-CoV-2 antigen tests required a point estimate of at least 80% Positive Percent Agreement (the regulatory term for sensitivity) versus an RT-PCR reference.[13] A typical screening workflow uses a high-recall first-stage test followed by a high-precision confirmatory test.

### Fraud detection

In financial fraud detection, a missed fraudulent transaction (false negative) results in direct monetary loss. Banks and payment processors typically tune their fraud models for high recall, accepting a moderate number of false alarms (which are resolved through customer verification) to ensure that as many fraudulent transactions as possible are caught.

### Security and safety systems

Intrusion detection systems, airport scanners, and autonomous vehicle perception systems all require high recall. In autonomous driving, for example, failing to detect a pedestrian (false negative) can result in a collision, while a false positive (perceiving a pedestrian that is not there) results in unnecessary braking, which is usually far less harmful.

### Information retrieval and legal discovery

In patent search, legal discovery, and systematic reviews (e.g., medical literature reviews for clinical guidelines), recall is often the primary metric.[3] Missing a relevant document can have legal or clinical consequences, while including an irrelevant document simply adds to the review workload.

## Explain like I'm 5 (ELI5)

Imagine you have a bag of colored balls. Some are red and some are blue. Your job is to pull out all the red balls.

Recall tells you how good you are at finding the red balls. If there are 10 red balls in the bag and you found 8 of them, your recall is 8 out of 10, or 80%. You missed 2 red balls.

A perfect recall score of 100% means you found every single red ball. It does not matter if you also accidentally grabbed some blue balls along the way. Recall only cares about one thing: did you find all the red ones?

## References

1. Powers, D. M. W. (2011). "Evaluation: From Precision, Recall and F-measure to ROC, Informedness, Markedness and Correlation." *Journal of Machine Learning Technologies*, 2(1), 37-63.
2. Davis, J. and Goadrich, M. (2006). "The Relationship Between Precision-Recall and ROC Curves." *Proceedings of the 23rd International Conference on Machine Learning (ICML)*, 233-240.
3. Manning, C. D., Raghavan, P., and Schutze, H. (2008). *Introduction to Information Retrieval*. Cambridge University Press.
4. Sokolova, M. and Lapalme, G. (2009). "A Systematic Analysis of Performance Measures for Classification Tasks." *Information Processing and Management*, 45(4), 427-437.
5. Van Rijsbergen, C. J. (1979). *Information Retrieval*. 2nd edition. Butterworths.
6. Fawcett, T. (2006). "An Introduction to ROC Analysis." *Pattern Recognition Letters*, 27(8), 861-874.
7. Saito, T. and Rehmsmeier, M. (2015). "The Precision-Recall Plot Is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets." *PLOS ONE*, 10(3), e0118432.
8. Pedregosa, F. et al. (2011). "Scikit-learn: Machine Learning in Python." *Journal of Machine Learning Research*, 12, 2825-2830. recall_score documentation: scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html
9. Lin, T.-Y. et al. (2014). "Microsoft COCO: Common Objects in Context." *Proceedings of the European Conference on Computer Vision (ECCV)*, 740-755.
10. Hossin, M. and Sulaiman, M. N. (2015). "A Review on Evaluation Metrics for Data Classification Evaluations." *International Journal of Data Mining and Knowledge Management Process*, 5(2), 1-11.
11. Padilla, R., Netto, S. L., and da Silva, E. A. B. (2020). "A Survey on Performance Metrics for Object-Detection Algorithms." *Proceedings of the International Conference on Systems, Signals and Image Processing (IWSSIP)*, 237-242.
12. Google for Developers. "Classification: Accuracy, recall, precision, and related metrics." Machine Learning Crash Course. developers.google.com/machine-learning/crash-course/classification/accuracy-precision-recall
13. U.S. Food and Drug Administration. "In Vitro Diagnostics EUAs - Antigen Diagnostic Tests for SARS-CoV-2" (template performance criterion: point estimate of at least 80% Positive Percent Agreement versus RT-PCR). fda.gov