See also: Machine learning terms
In machine learning and statistics, a true positive (TP) is a prediction that correctly assigns the positive label to a sample whose actual label is positive. It is one of the four basic outcomes of a binary classification decision, alongside false positive, true negative, and false negative. Together these four counts make up the confusion matrix, the foundational accounting device used to evaluate almost every classification model.
The true positive count is the load-bearing quantity behind a long list of derived metrics: precision, recall, sensitivity, the F1 score, and the true positive rate (TPR) all depend on TP. Whether a classifier is filtering spam, screening for cancer, or drawing bounding boxes around pedestrians, the question "how many real positives did the model actually catch?" is answered by counting true positives.
Given a binary classifier that outputs a predicted label (positive or negative) for each sample with a known ground-truth label, the four possible outcomes are:

- True positive (TP): the actual label is positive and the predicted label is positive.
- False positive (FP): the actual label is negative but the predicted label is positive.
- True negative (TN): the actual label is negative and the predicted label is negative.
- False negative (FN): the actual label is positive but the predicted label is negative.
Following the convention used in the Wikipedia article on the confusion matrix, the true positive count occupies the cell at row "actual positive" and column "predicted positive" in the 2x2 matrix; it tallies the samples whose positive class was correctly identified.
The definition assumes that one of the two classes has been designated as the positive class. In medical screening the positive class is usually "has the disease." In spam filtering it is usually "spam." In fraud detection it is "fraudulent transaction." The choice is conventional, but it determines what counts as a true positive and which derived metrics carry the meaning the practitioner cares about.
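As a minimal sketch (with made-up labels and 1 designated as the positive class), the four counts can be tallied directly with NumPy:

```python
import numpy as np

# Ground-truth and predicted labels for ten samples; 1 is the positive class.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1, 0, 1, 0, 1, 0])

# A true positive is a sample that is actually positive AND predicted positive.
tp = int(np.sum((y_true == 1) & (y_pred == 1)))
fp = int(np.sum((y_true == 0) & (y_pred == 1)))
fn = int(np.sum((y_true == 1) & (y_pred == 0)))
tn = int(np.sum((y_true == 0) & (y_pred == 0)))

print(tp, fp, fn, tn)  # 4 1 1 4
```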
The confusion matrix lays out the four counts in a single table. The convention used by scikit-learn, by Wikipedia, and by most machine-learning textbooks places actual labels on the rows and predicted labels on the columns:
| | Predicted positive | Predicted negative | Row total |
|---|---|---|---|
| Actual positive | True positive (TP) | False negative (FN) | P = TP + FN |
| Actual negative | False positive (FP) | True negative (TN) | N = FP + TN |
| Column total | TP + FP | FN + TN | TP + FP + FN + TN |
P is the total number of actual positives in the evaluation set and N is the total number of actual negatives. The diagonal of the matrix (TP and TN) counts correct predictions; the off-diagonal (FP and FN) counts errors. Some references swap rows and columns or list predictions on the rows; either layout works as long as the labels are read carefully. Confusing the two is a common source of bugs in evaluation code.
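A short sketch of the scikit-learn convention on the same toy labels as above. Note that with the default sorted label order the positive class (1) occupies the second row and column, so the flattened matrix reads TN, FP, FN, TP:

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0, 1, 0]

# Rows are actual labels, columns are predicted labels;
# with labels=[0, 1] the layout is [[TN, FP], [FN, TP]].
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()

print(cm)
print(tp, fp, fn, tn)  # 4 1 1 4
```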
The vocabulary of true and false positives predates machine learning by several decades. During and after World War II, radar operators and acousticians faced the problem of deciding whether a faint blip on a screen was a real target or random noise. The mathematical framework that grew out of this work, signal detection theory, classified each decision as a hit (signal present, observer says yes), a miss (signal present, observer says no), a false alarm (signal absent, observer says yes), or a correct rejection (signal absent, observer says no). These four outcomes map directly onto today's true positive, false negative, false positive, and true negative.
The terms sensitivity and specificity were introduced by the American biostatistician Jacob Yerushalmy in a 1947 paper on radiographic screening for tuberculosis. Sensitivity (the fraction of actual cases a test catches) and specificity (the fraction of actual non-cases it clears) became the standard way to describe the accuracy of a diagnostic test. The confusion matrix itself was later adapted from human perceptual studies and used by Frank Rosenblatt, among other early researchers, to compare human and machine classifications. Modern machine-learning evaluation inherited the entire conceptual machinery: the four cells, the trade-off between catching real positives and avoiding false alarms, and the language of "true" and "false" labels.
A worked example helps fix the definitions. Suppose a spam classifier is evaluated on 1,000 emails. Of these, 200 are actually spam and 800 are legitimate. The classifier flags 240 messages as spam. Among the 240 flagged, 180 are genuinely spam and 60 are legitimate messages misclassified as spam. The remaining 760 messages are predicted to be legitimate, and of those, 20 are actually spam that slipped through.
| | Predicted spam | Predicted legitimate | Row total |
|---|---|---|---|
| Actual spam | TP = 180 | FN = 20 | 200 |
| Actual legitimate | FP = 60 | TN = 740 | 800 |
| Column total | 240 | 760 | 1,000 |
From this table:

- Recall (sensitivity, TPR) = TP / (TP + FN) = 180 / 200 = 0.90
- Precision = TP / (TP + FP) = 180 / 240 = 0.75
- F1 score = 2 * 180 / (2 * 180 + 60 + 20) ≈ 0.82
- Accuracy = (180 + 740) / 1,000 = 0.92
- Specificity = 740 / 800 = 0.925
Notice how every metric except specificity depends on TP. Change the count of true positives and the picture of the model changes immediately.
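A quick arithmetic check of these figures, using nothing but the four counts from the table above:

```python
# Counts from the worked spam-filter example.
tp, fn, fp, tn = 180, 20, 60, 740

recall = tp / (tp + fn)                      # 0.90
precision = tp / (tp + fp)                   # 0.75
f1 = 2 * tp / (2 * tp + fp + fn)             # ~0.818
accuracy = (tp + tn) / (tp + fp + fn + tn)   # 0.92
specificity = tn / (tn + fp)                 # 0.925

print(recall, precision, round(f1, 3), accuracy, specificity)
```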
The table below summarises the most common metrics whose definition includes TP, along with their formulas and the question each metric answers. All of these are computed routinely by libraries like scikit-learn through sklearn.metrics.classification_report and sklearn.metrics.confusion_matrix.
| Metric | Formula | Question it answers |
|---|---|---|
| Recall, sensitivity, TPR | TP / (TP + FN) | Of all real positives, what fraction did the model catch? |
| Precision, positive predictive value (PPV) | TP / (TP + FP) | Of all positive predictions, what fraction were correct? |
| F1 score | 2 * TP / (2 * TP + FP + FN) | Harmonic mean of precision and recall. |
| F-beta score | (1 + beta^2) * TP / ((1 + beta^2) * TP + beta^2 * FN + FP) | Weighted balance of precision and recall (beta > 1 favours recall). |
| Accuracy | (TP + TN) / (TP + FP + TN + FN) | Of all predictions, what fraction were correct? |
| Specificity, true negative rate (TNR) | TN / (TN + FP) | Of all real negatives, what fraction were correctly cleared? |
| False positive rate (FPR) | FP / (FP + TN) | Of all real negatives, what fraction were wrongly flagged? |
| False negative rate (FNR) | FN / (TP + FN) | Of all real positives, what fraction were missed? |
| Negative predictive value (NPV) | TN / (TN + FN) | Of all negative predictions, what fraction were correct? |
| Matthews correlation coefficient (MCC) | (TP * TN - FP * FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN)) | Single balanced score in [-1, 1] using all four cells. |
F1 has the convenient algebraic identity 2 * TP / (2 * TP + FP + FN), which makes plain that it ignores TN entirely. Accuracy treats every cell equally, which is why it can be misleading on skewed datasets. Sensitivity and recall are mathematically identical; the first term comes from medicine, the second from information retrieval, and most libraries report both.
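As a hedged sketch, the spam-example counts can be reproduced as label vectors and fed to the scikit-learn functions named above; the array construction here is only for illustration:

```python
import numpy as np
from sklearn.metrics import (classification_report, f1_score,
                             matthews_corrcoef, precision_score, recall_score)

# Rebuild label vectors that reproduce TP=180, FN=20, FP=60, TN=740
# (1 = spam, 0 = legitimate); sample order does not affect these metrics.
y_true = np.concatenate([np.ones(180), np.ones(20), np.zeros(60), np.zeros(740)])
y_pred = np.concatenate([np.ones(180), np.zeros(20), np.ones(60), np.zeros(740)])

print(recall_score(y_true, y_pred))       # 0.90
print(precision_score(y_true, y_pred))    # 0.75
print(f1_score(y_true, y_pred))           # ~0.818
print(matthews_corrcoef(y_true, y_pred))  # ~0.77, uses all four cells
print(classification_report(y_true, y_pred, target_names=["legitimate", "spam"]))
```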
The confusion matrix extends naturally to k-class problems by becoming a k by k table whose rows are actual classes and whose columns are predicted classes. Diagonal entries are correct predictions and off-diagonal entries are confusions between specific class pairs.
When there are more than two classes, true positives are defined per class using a one-vs-rest scheme: for class c, a TP for c is a sample whose true label is c and whose predicted label is also c. Everything not in c is treated as the negative class for the purposes of computing per-class precision, recall, and F1. Library functions like sklearn.metrics.precision_recall_fscore_support apply this logic automatically and offer averaging strategies (micro, macro, weighted, samples) to combine per-class scores. In micro averaging the TPs, FPs, and FNs from every class are summed before computing the metric, which weights every sample equally. In macro averaging the per-class metric is computed first and then averaged, which weights every class equally regardless of frequency.
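A brief sketch of per-class and averaged scores on a made-up three-class problem (the labels are invented for illustration):

```python
from sklearn.metrics import precision_recall_fscore_support

# Three-class toy problem; classes are 0, 1 and 2.
y_true = [0, 0, 1, 1, 1, 2, 2, 2, 2, 1]
y_pred = [0, 1, 1, 1, 2, 2, 2, 0, 2, 1]

# average=None gives one-vs-rest scores per class: each class is treated as
# "positive" in turn.  Here precision per class is (0.5, 0.75, 0.75).
prec, rec, f1, support = precision_recall_fscore_support(y_true, y_pred, average=None)
print(prec, rec, f1, support)

# Micro averaging pools TP/FP/FN across classes (every sample counts equally);
# macro averaging averages the per-class scores (every class counts equally).
print(precision_recall_fscore_support(y_true, y_pred, average="micro"))
print(precision_recall_fscore_support(y_true, y_pred, average="macro"))
```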
Reporting the raw count of true positives, by itself, is rarely informative because the count grows trivially with the number of positive predictions. A spam filter that flags every email as spam will achieve TP = 200 in the example above, but it will also have FP = 800 and a precision of 0.20. The complementary errors matter just as much as the hits.
The problem becomes acute on imbalanced datasets, where one class is much rarer than the other. The accuracy paradox is the classic illustration: in a population where 1% of patients have a disease, a test that always predicts "healthy" achieves 99% accuracy yet has zero true positives and is medically useless. On data this skewed, TP-aware metrics like recall, precision, F1, the precision-recall curve, and the Matthews correlation coefficient are more informative than accuracy. Chicco and Jurman (2020) recommend MCC as a single robust score because it uses all four confusion-matrix cells and is less affected by class imbalance than F1 or accuracy.
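A sketch of the accuracy paradox with the 1%-prevalence example, using synthetic labels and a degenerate "always healthy" predictor:

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# 1% prevalence: 10 sick patients out of 1,000, and a test that always says healthy.
y_true = np.array([1] * 10 + [0] * 990)
y_pred = np.zeros(1000, dtype=int)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
print(tp)                              # 0: not a single real case caught
print(accuracy_score(y_true, y_pred))  # 0.99 nonetheless
print(recall_score(y_true, y_pred))    # 0.0
```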
Most classifiers produce a continuous score (a probability or logit) rather than a hard label. The label is obtained by comparing the score to a threshold. Sweeping the threshold from 0 to 1 traces out two important curves: the receiver operating characteristic (ROC) curve, which plots the true positive rate against the false positive rate, and the precision-recall (PR) curve, which plots precision against recall.
In both cases, lowering the threshold tends to increase TP (the model labels more samples as positive) at the cost of also increasing FP. Raising the threshold reverses the trade. Choosing the operating point is a deliberate decision about where on the curve the application should live.
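A sketch of both curves on synthetic scores; the score distributions below are arbitrary assumptions, chosen only so that positives tend to score higher than negatives:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, roc_curve

rng = np.random.default_rng(0)
y_true = np.array([1] * 100 + [0] * 900)
scores = np.concatenate([rng.normal(0.7, 0.15, 100), rng.normal(0.3, 0.15, 900)])

# ROC curve: true positive rate against false positive rate, one point per threshold.
fpr, tpr, roc_thresholds = roc_curve(y_true, scores)

# Precision-recall curve: precision against recall, one point per threshold.
precision, recall, pr_thresholds = precision_recall_curve(y_true, scores)

# As the threshold drops, the TP count (and hence TPR) rises along with FPR.
print(tpr[0], tpr[-1])  # 0.0 at the strictest threshold, 1.0 at the loosest
print(fpr[0], fpr[-1])  # 0.0 ... 1.0
```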
The interpretation of a true positive depends on what the positive class represents. The same arithmetic supports very different practical decisions.
| Domain | Positive class | True positive means | Cost of FN | Cost of FP |
|---|---|---|---|---|
| Medical screening | Patient has the disease | Disease correctly detected | Patient goes untreated | Healthy patient sent for follow-up tests |
| Spam filter | Email is spam | Spam correctly blocked | Spam reaches the inbox | Legitimate mail lost in the spam folder |
| Fraud detection | Transaction is fraudulent | Fraud correctly flagged | Fraudulent charge succeeds | Legitimate purchase declined |
| Object detection | Bounding box matches a real object | Predicted box overlaps a real box with IoU above the threshold | Object missed entirely | Phantom object reported |
| Information retrieval | Document is relevant | Relevant document returned | Relevant document missing from results | Irrelevant document clutters results |
| Quality control | Part is defective | Defective part removed from the line | Defective part ships to a customer | Good part scrapped |
Object detection deserves a special note because the definition of a TP is not just a label match. A predicted bounding box counts as a true positive only if its intersection-over-union (IoU) with a ground-truth box exceeds a chosen threshold and the predicted class is correct. PASCAL VOC uses a single IoU threshold of 0.5; the COCO benchmark averages average precision over ten thresholds from 0.5 to 0.95 in steps of 0.05, written as AP@[.5:.05:.95]. Tightening the threshold makes the TP definition stricter and rewards models that localise objects precisely.
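A minimal sketch of the IoU test that decides whether a detection counts as a TP, assuming axis-aligned boxes in (x1, y1, x2, y2) form and the PASCAL VOC threshold of 0.5; the helper function and coordinates are illustrative, not taken from any particular benchmark:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# A prediction of the correct class counts as a TP only if its IoU with a
# ground-truth box clears the threshold (0.5 under the PASCAL VOC convention).
ground_truth = (10, 10, 50, 50)
prediction = (12, 14, 48, 52)
print(iou(prediction, ground_truth) >= 0.5)  # True for this well-localised box
```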
In most real applications the cost of a false negative differs from the cost of a false positive, sometimes by orders of magnitude. Missing a malignant tumour is not equivalent to recalling a patient for an unnecessary biopsy, and failing to block a fraudulent wire transfer is not equivalent to declining a legitimate one. This asymmetry is what makes the choice between favouring recall and favouring precision an engineering and policy decision rather than a pure modelling one.
Several practical levers exist for tuning the trade-off without changing the model itself. Adjusting the decision threshold on the predicted score moves the operating point along the ROC or PR curve. Cost-sensitive learning attaches different misclassification costs to FP and FN during training, so the optimiser produces a model that already favours the cheaper errors. Class-weighted loss functions, available in sklearn via the class_weight parameter and in deep-learning frameworks via per-sample weights, behave similarly. Resampling the training data (oversampling the minority class with techniques like SMOTE, or undersampling the majority class) is another common approach for imbalanced problems. The F-beta score generalises F1 by weighting recall and precision unequally: F2 weights recall twice as heavily as precision and is often used in medical screening, while F0.5 favours precision and is used in spam filtering.
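A small sketch of the F-beta lever using sklearn.metrics.fbeta_score on invented counts (TP = 15, FN = 5, FP = 10, TN = 70):

```python
import numpy as np
from sklearn.metrics import fbeta_score

# Labels constructed so that TP=15, FN=5, FP=10, TN=70 (1 is the positive class).
y_true = np.array([1] * 20 + [0] * 80)
y_pred = np.array([1] * 15 + [0] * 5 + [1] * 10 + [0] * 70)

# F2 weights recall more heavily than precision; F0.5 does the opposite.
print(fbeta_score(y_true, y_pred, beta=2))    # ~0.71, favours catching positives
print(fbeta_score(y_true, y_pred, beta=0.5))  # ~0.63, favours precise positive calls
```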
A handful of recurring errors crop up in code that counts true positives. Row and column confusion is the most frequent: sklearn.metrics.confusion_matrix returns a matrix where rows are true labels and columns are predicted labels, but other libraries and many textbook diagrams swap them. Reading TP from the wrong cell is one of the most common bugs in evaluation code. A second pitfall is silently flipping the positive class. In scikit-learn the positive class for binary metrics defaults to the label 1, but precision_score and related functions accept an explicit pos_label argument. If the positive class is encoded as 0 (or as a string like "spam") and pos_label is not set, the reported numbers may answer a different question than intended.
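A sketch of the pos_label pitfall with string labels; the tiny label lists are invented for illustration:

```python
from sklearn.metrics import precision_score, recall_score

y_true = ["spam", "ham", "spam", "spam", "ham"]
y_pred = ["spam", "spam", "ham", "spam", "ham"]

# With string labels the positive class must be named explicitly; otherwise
# scikit-learn raises an error (and with 0/1 labels it silently assumes 1).
print(precision_score(y_true, y_pred, pos_label="spam"))  # 2/3
print(recall_score(y_true, y_pred, pos_label="spam"))     # 2/3
```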
A third pitfall is reporting a single TP-derived metric without context. A model can have very high recall and very poor precision, or vice versa, and either deficiency may be unacceptable in deployment. Reporting precision and recall together, or using F1 or MCC as a single summary, is safer than reporting either in isolation. The fourth pitfall is evaluating on the training set. TP, FP, TN, and FN should be computed on a held-out test set or by cross-validation; numbers computed on the training data measure memorisation rather than generalisation.