The true positive rate (TPR) is the proportion of actual positive cases that a classifier correctly identifies as positive. It is one of the most widely used evaluation metrics in binary classification, medical diagnostics, signal detection, and information retrieval, and it appears under several names depending on the field. In statistics and medicine it is called sensitivity; in machine learning and information retrieval it is called recall; in radar and psychophysics it is called the hit rate or probability of detection. All of these refer to the same quantity: the conditional probability that a positive case is correctly flagged.
Formally, TPR answers the question: "Of all the cases that are truly positive, how many did the system catch?" It is computed as the number of true positives (TP) divided by the total number of actual positives, which is the sum of true positives and false negatives (FN). TPR is the y-axis of the ROC curve and the recall component of the F1 score.
Imagine you are playing hide and seek with ten friends. They all hide, and your job is to find them. If you find seven of them and miss three, your true positive rate is 7 out of 10, or 70 percent. The TPR does not care how many bushes you searched or how many times you yelled "Found you!" at a squirrel. It only cares about the friends who were actually hiding and whether you found them. A high TPR means you are good at finding the people who are really there. A low TPR means a lot of your friends are still hiding and waiting to be found.
The true positive rate is defined as the conditional probability of a positive prediction given that the true class is positive:
TPR = TP / (TP + FN)
where TP is the number of true positives (actual positives predicted positive) and FN is the number of false negatives (actual positives predicted negative).
In probability notation:
TPR = P(predicted positive | actual positive)
TPR ranges from 0 to 1. A value of 1 means every positive case was caught; a value of 0 means every positive case was missed. Because TPR conditions only on the actual positive class, it is mathematically independent of the number of true negatives and false positives, and therefore independent of class prevalence in the test set. This is why it remains stable under class imbalance, although interpreting it correctly still requires looking at the false positive rate or precision alongside it.
TPR is also the complement of the false negative rate (FNR):
TPR = 1 - FNR
where FNR = FN / (TP + FN).
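These identities are easy to sanity-check in a few lines of Python. The sketch below simply reuses the hide-and-seek counts from earlier (7 friends found, 3 missed) and is illustrative only.

```python
# Minimal sketch: TPR and FNR from raw counts.
# Counts reuse the hide-and-seek example above: 7 friends found, 3 missed.
tp, fn = 7, 3

tpr = tp / (tp + fn)  # true positive rate (sensitivity, recall)
fnr = fn / (tp + fn)  # false negative rate

print(f"TPR = {tpr:.2f}")              # 0.70
print(f"FNR = {fnr:.2f}")              # 0.30
print(f"TPR + FNR = {tpr + fnr:.2f}")  # always 1.00
```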
TPR is one of the rare statistics that was invented independently in several disciplines, each of which gave it a different name. The table below summarizes the common synonyms.
| Term | Field | Typical context |
|---|---|---|
| True positive rate (TPR) | Machine learning, statistics | ROC analysis, classifier evaluation |
| Sensitivity | Epidemiology, medicine | Diagnostic test performance |
| Recall | Information retrieval, NLP | Search, document classification |
| Hit rate | Signal detection theory, psychophysics | Stimulus detection experiments |
| Probability of detection (Pd) | Radar, sonar, engineering | Target detection systems |
| Statistical power (1 - beta) | Hypothesis testing | Likelihood of detecting a true effect |
The equivalence is exact. A 0.92 sensitivity in a clinical study, a 0.92 recall in a search engine, and a 0.92 hit rate in a radar trial all describe the same fraction of true positives that were correctly flagged.
TPR is one of several metrics derived from the confusion matrix, the 2x2 table that tabulates the four possible outcomes of a binary classification task. The matrix is conventionally laid out with actual classes as rows and predicted classes as columns:
| | Predicted positive | Predicted negative |
|---|---|---|
| Actual positive | True positive (TP) | False negative (FN) |
| Actual negative | False positive (FP) | True negative (TN) |
TPR is computed by dividing the TP cell by the sum of its row (TP + FN). In other words, TPR is a row-wise statistic of the matrix: it normalizes by the actual positive count. Several related metrics use the same matrix but normalize differently:
| Metric | Formula | What it normalizes by |
|---|---|---|
| True positive rate (TPR), recall, sensitivity | TP / (TP + FN) | Actual positives (row) |
| False positive rate (FPR) | FP / (FP + TN) | Actual negatives (row) |
| Specificity, true negative rate (TNR) | TN / (TN + FP) | Actual negatives (row) |
| False negative rate (FNR) | FN / (TP + FN) | Actual positives (row) |
| Precision, positive predictive value | TP / (TP + FP) | Predicted positives (column) |
| Negative predictive value | TN / (TN + FN) | Predicted negatives (column) |
| F1 score | 2 * P * R / (P + R) | Harmonic mean of precision and recall |
A useful mental shortcut: TPR and specificity are paired (one for each true class), and precision and recall are paired (precision conditions on the prediction, recall on the truth).
Suppose a fraud detection model is evaluated on 10,000 credit card transactions, of which 200 are actual fraud. The model produces the following confusion matrix:
| | Predicted fraud | Predicted legitimate | Total |
|---|---|---|---|
| Actually fraud | 160 (TP) | 40 (FN) | 200 |
| Actually legitimate | 320 (FP) | 9,480 (TN) | 9,800 |
| Total | 480 | 9,520 | 10,000 |
The true positive rate is:
TPR = 160 / (160 + 40) = 160 / 200 = 0.80 (80 percent)
The model catches 80 percent of fraudulent transactions. For comparison, the false positive rate is 320 / 9,800 = 0.033 (3.3 percent), and the precision is 160 / 480 = 0.333 (33.3 percent). This combination is typical of fraud and rare-event detection: a respectable TPR, a low FPR, but a precision that suffers because the negative class is so much larger than the positive class. The TPR alone says nothing about the cost of those 320 false alarms, which is why production systems usually report TPR together with at least one prediction-conditioned metric.
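The same numbers are straightforward to reproduce in code; the short sketch below recomputes the three metrics directly from the counts in the table above.

```python
# Illustrative recomputation of the fraud example's metrics
# from its confusion-matrix counts (no library calls needed).
tp, fn = 160, 40
fp, tn = 320, 9_480

tpr = tp / (tp + fn)        # 0.800  -> catches 80% of fraud
fpr = fp / (fp + tn)        # ~0.033 -> 3.3% of legitimate transactions flagged
precision = tp / (tp + fp)  # ~0.333 -> only a third of alerts are real fraud

print(f"TPR = {tpr:.3f}, FPR = {fpr:.3f}, precision = {precision:.3f}")
```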
The concept of measuring how often a true positive is detected has roots in two parallel traditions: medical diagnostics in the 1940s and signal detection theory in the 1950s.
The terms sensitivity and specificity were introduced by the American biostatistician Jacob Yerushalmy in a 1947 paper for the U.S. Public Health Reports titled "Statistical Problems in Assessing Methods of Medical Diagnosis, with Special Reference to X-ray Techniques." Yerushalmy was studying how reliably radiologists could diagnose tuberculosis from chest X-rays. He found that two readers looking at the same film often disagreed, and that the same reader looking at the same film twice could disagree with himself. To quantify this, he proposed evaluating any diagnostic test using two probabilities: "a measure of sensitivity or the probability of correct diagnosis of positive cases, and a measure of specificity or the probability of correct diagnosis of negative cases." That paper effectively gave the modern definition of TPR.
Independently, during and after World War II, radar engineers needed a way to describe the trade-off between detecting real aircraft and reacting to noise, and that work grew into signal detection theory. Wilson P. Tanner and John A. Swets applied the framework to human perception in a 1954 Psychological Review paper, "A decision-making theory of visual detection," introducing the receiver operating characteristic to psychology. In this tradition the true positive rate is called the hit rate and the false positive rate is called the false alarm rate. The signal detection framework established that any detector, biological or mechanical, can be characterized by a curve of hit rate versus false alarm rate as its decision criterion is varied; that curve is the modern ROC curve.
The medical and signal detection lineages converged in machine learning in the 1990s, when researchers including Foster Provost, Tom Fawcett, and others popularized ROC analysis as a tool for comparing classifiers under skewed class distributions and asymmetric misclassification costs. The recall name entered the same vocabulary from information retrieval, where it had been used since at least Cyril Cleverdon's Cranfield experiments in the 1960s. By the 2000s the three terms (TPR, sensitivity, recall) were used interchangeably in the machine learning literature, with the choice usually signaling the author's home discipline.
Most classifiers do not output hard 0/1 labels directly. They output a continuous score, such as a probability, a margin, or a logit, and a threshold converts that score into a class label. Sweeping the threshold from very strict (predict positive only when very confident) to very lenient (predict positive almost always) traces out a curve in TPR-FPR space. This curve is the receiver operating characteristic (ROC). TPR sits on the y-axis, FPR on the x-axis.
A few canonical points on the ROC curve are worth memorizing:
| Threshold behavior | TPR | FPR | Meaning |
|---|---|---|---|
| Predict positive for everything | 1 | 1 | All positives caught, but every negative is also flagged |
| Predict negative for everything | 0 | 0 | No false alarms, but every positive is missed |
| Perfect classifier | 1 | 0 | Top-left corner of ROC space |
| Random guessing | p | p | Diagonal line from (0,0) to (1,1) |
The area under this curve, the AUC, has a clean probabilistic interpretation: it equals the probability that the classifier ranks a randomly chosen positive instance higher than a randomly chosen negative instance. An AUC of 1.0 is a perfect ranker; 0.5 is no better than chance. Because the ROC curve is built from TPR and FPR, both of which condition only on the true class, the curve and its AUC are invariant to class prior shifts in the test set. That property makes ROC analysis attractive for imbalanced problems, although for very rare positives the precision-recall curve often gives a more honest picture.
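The pairwise-ranking interpretation can be verified directly. The sketch below compares a brute-force count over positive-negative pairs with roc_auc_score; the labels and scores are arbitrary toy values.

```python
# Sketch: check the pairwise-ranking interpretation of AUC on toy data.
from itertools import product
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 0, 1, 1, 0, 1]
y_scores = [0.2, 0.4, 0.35, 0.1, 0.8, 0.65, 0.5, 0.9]

# Brute force: over all (positive, negative) pairs, count how often the
# positive scores higher; ties count as half.
pos = [s for s, y in zip(y_scores, y_true) if y == 1]
neg = [s for s, y in zip(y_scores, y_true) if y == 0]
wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in product(pos, neg))
pairwise_auc = wins / (len(pos) * len(neg))

print(pairwise_auc, roc_auc_score(y_true, y_scores))  # the two values match
```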
Moving the decision threshold changes TPR and FPR in lockstep, and it changes precision in the opposite direction from recall. Lowering the threshold increases TPR (you catch more positives) but also increases FPR and lowers precision (you raise more false alarms). Raising the threshold does the reverse. The right operating point depends on the relative cost of false negatives versus false positives in the application.
| Application | Cost asymmetry | Typical threshold choice |
|---|---|---|
| Cancer screening | Missing a tumor is far worse than a follow-up scan | Low threshold, high TPR, accept high FPR |
| Spam filtering | A wrongly blocked legitimate email is worse than a missed spam | High threshold, lower TPR, very low FPR |
| Fraud detection | Missing fraud and blocking a real customer both cost money | Tuned per merchant using cost-weighted ROC |
| Search ranking | Missing a relevant document hurts recall, surfacing junk hurts precision | Tuned by F1 or position-aware metrics |
In medical screening it is common to fix a target sensitivity (for example, 95 percent) and report the corresponding specificity. In information retrieval it is more common to fix recall and report precision, or to compute F1. Both conventions are picking a single point on the same underlying ROC or PR curve.
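One way to follow the screening convention in code is to sweep the thresholds returned by roc_curve and pick the first operating point that reaches the target sensitivity. The sketch below does this on synthetic scores; the data and the 0.95 target are illustrative assumptions, not a recipe for any particular application.

```python
# Sketch: fix a target sensitivity and read off the corresponding specificity.
# y_true and y_scores are placeholder synthetic data.
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(0)
y_true = np.array([0] * 500 + [1] * 100)
y_scores = np.concatenate([rng.normal(0.0, 1.0, 500),   # negatives
                           rng.normal(1.5, 1.0, 100)])  # positives score higher

fpr, tpr, thresholds = roc_curve(y_true, y_scores)

target_tpr = 0.95
idx = np.argmax(tpr >= target_tpr)  # first operating point reaching the target
print(f"threshold   = {thresholds[idx]:.3f}")
print(f"sensitivity = {tpr[idx]:.3f}")
print(f"specificity = {1 - fpr[idx]:.3f}")
```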
A frequent source of confusion is whether TPR is robust to class imbalance. The arithmetic answer is yes: because TPR = TP / (TP + FN) only involves actual positives, multiplying the number of negatives by ten or by ten thousand leaves TPR unchanged. The interpretive answer is more nuanced. A model that predicts positive for every input has a TPR of 1.0 regardless of how rare the positive class is, because it catches every positive by definition. That model is useless. So TPR is necessary but not sufficient as a summary of classifier quality on imbalanced data, and it is almost always reported together with at least one of FPR, precision, or specificity.
For very imbalanced problems where the positive class is the minority of interest (rare disease screening, fraud, defect detection), the precision-recall curve and its area (average precision) are often more informative than the ROC curve, because precision actively penalizes the false positives that a low-prevalence ROC plot can hide.
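The contrast is easy to demonstrate. In the sketch below, a fixed block of positives and a fixed 5 percent false positive rate are held constant while the number of negatives grows: recall stays put while precision collapses. The synthetic labels are illustrative only.

```python
# Sketch: recall (TPR) ignores how many negatives there are; precision does not.
from sklearn.metrics import precision_score, recall_score

# A fixed block of positives: 8 of 10 are caught.
pos_true, pos_pred = [1] * 10, [1] * 8 + [0] * 2

for n_neg in (100, 10_000, 1_000_000):
    n_fp = n_neg // 20  # fixed 5% false positive rate on the negatives
    y_true = pos_true + [0] * n_neg
    y_pred = pos_pred + [1] * n_fp + [0] * (n_neg - n_fp)
    r = recall_score(y_true, y_pred)
    p = precision_score(y_true, y_pred)
    print(f"{n_neg:>9} negatives: recall = {r:.2f}, precision = {p:.4f}")
```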
TPR is naturally a binary-classification metric, but it generalizes to multi-class problems through one-vs-rest decomposition. For each class, treat that class as the positive label and all others as negative, compute the per-class TPR (which is the per-class recall), and then aggregate. The standard aggregation strategies, as implemented in scikit-learn's recall_score function, are:
| Average | Formula | Behavior |
|---|---|---|
| macro | unweighted mean of per-class recalls | Treats every class equally, regardless of frequency |
| micro | sum of TP across classes / (sum of TP + FN across classes) | Equivalent to overall accuracy in the multi-class single-label case |
| weighted | mean of per-class recalls weighted by class support | Reflects performance on the actual class distribution |
| samples | per-sample recall averaged over samples | Used for multi-label problems |
The choice matters. Macro recall punishes a model that ignores rare classes; micro recall does not. Reporting both is a common diagnostic for understanding whether a multi-class classifier is genuinely competent or is leaning on the dominant class.
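A short example with scikit-learn's recall_score makes the difference concrete. The three-class labels below are toy values chosen so that one class has noticeably lower per-class recall.

```python
# Sketch: per-class and averaged recall for a three-class problem (toy labels).
from sklearn.metrics import recall_score

y_true = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 1, 0, 0, 2]

print(recall_score(y_true, y_pred, average=None))        # per-class recalls
print(recall_score(y_true, y_pred, average="macro"))     # unweighted mean
print(recall_score(y_true, y_pred, average="micro"))     # equals accuracy here
print(recall_score(y_true, y_pred, average="weighted"))  # support-weighted mean
```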
Medical screening. A mammography program reports a sensitivity of 0.87, meaning 87 percent of women who actually have breast cancer in the screened population get a positive screening result. The remaining 13 percent are false negatives, the most consequential kind of error in cancer screening.
Fraud detection. A credit card issuer's model has a TPR of 0.65 at an FPR of 0.001. Sixty-five percent of fraudulent charges are flagged, and only 0.1 percent of legitimate charges trigger a false alarm. Because legitimate transactions outnumber fraud by roughly 1,000 to 1, even that low FPR translates into a substantial absolute number of false alarms, which is why precision is also tracked.
Information retrieval. A search engine returns 80 of the 100 truly relevant documents in its index for a given query, achieving a recall (TPR) of 0.80. The remaining 20 relevant documents are false negatives that the user never sees.
Spam filtering. An email provider tunes its classifier to a recall of 0.99 on spam while keeping the false positive rate on legitimate mail under 0.001, reflecting the asymmetric cost of accidentally blocking a real message.
Object detection. In computer vision, the recall of an object detector at a given intersection-over-union (IoU) threshold is the fraction of ground-truth objects that the detector successfully localized. Average precision, the area under the per-class precision-recall curve, is the standard summary for benchmarks like COCO.
The standard Python implementation lives in scikit-learn's metrics module. The relevant functions are recall_score, roc_curve, roc_auc_score, and the RocCurveDisplay plotting helper.
```python
from sklearn.metrics import recall_score, roc_curve, roc_auc_score, RocCurveDisplay
import matplotlib.pyplot as plt

# Binary classification: TPR equals recall
y_true = [0, 1, 1, 0, 1, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1, 1, 1, 1]

recall = recall_score(y_true, y_pred)
print(f"TPR (recall) = {recall:.3f}")  # 0.800

# ROC curve from continuous scores
y_scores = [0.1, 0.9, 0.4, 0.2, 0.8, 0.6, 0.7, 0.95]
fpr, tpr, thresholds = roc_curve(y_true, y_scores)
auc = roc_auc_score(y_true, y_scores)

RocCurveDisplay(fpr=fpr, tpr=tpr, roc_auc=auc).plot()
plt.show()
```
The roc_curve function returns three arrays: false positive rates, true positive rates, and the decreasing thresholds used to compute them. As of scikit-learn 1.3, the first threshold is set to np.inf to represent a classifier that always predicts the negative class, so the curve always starts at (0, 0). For multi-class problems, recall_score accepts an average parameter taking the values described above.
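Assuming scikit-learn 1.3 or newer, that behavior can be checked directly on the toy data from the block above:

```python
# Sketch (assumes scikit-learn >= 1.3): the first threshold returned by
# roc_curve is np.inf, i.e. an all-negative classifier, so the curve starts at (0, 0).
import numpy as np
from sklearn.metrics import roc_curve

y_true = [0, 1, 1, 0, 1, 0, 1, 1]
y_scores = [0.1, 0.9, 0.4, 0.2, 0.8, 0.6, 0.7, 0.95]

fpr, tpr, thresholds = roc_curve(y_true, y_scores)
print(np.isinf(thresholds[0]), fpr[0], tpr[0])  # True 0.0 0.0
```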
TPR is a single number summarizing one corner of classifier behavior, and treating it as a complete report is the most common mistake. A few specific pitfalls:
- A classifier that always predicts the positive class has a perfect TPR of 1.0 and is useless. Always check FPR or precision alongside TPR.
- TPR computed at a single fixed threshold can hide large differences between models that would be visible from the full ROC or PR curve. Whenever scores are available, prefer curve-based summaries.
- TPR is unstable for very small positive classes. With ten true positives, a single misclassification swings TPR by 10 percentage points. Confidence intervals (Wilson, Clopper-Pearson, or bootstrap) are worth reporting; see the sketch after this list.
- In multi-class settings, micro-averaged recall in a single-label problem equals accuracy, which can mask poor performance on rare classes. Reporting macro recall as well is standard hygiene.
- TPR is invariant to the class prior on the test set, but precision is not. Models tuned and validated on a balanced test set may behave differently in deployment if the underlying base rate shifts. The TPR will hold; the precision will not.
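Because TPR is a binomial proportion over the actual positives, a Wilson score interval can be computed by hand. The sketch below uses illustrative counts (8 of 10 positives caught) to show how wide the interval is when only ten positives are available.

```python
# Sketch: 95% Wilson score interval for a TPR estimated from few positives.
# Computed by hand so no extra dependency is needed; z = 1.96 for 95%.
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion."""
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# 8 of 10 positives caught: point estimate 0.80, interval roughly (0.49, 0.94).
print(wilson_interval(8, 10))
```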