True positive (TP)

See also: Machine learning terms

In machine learning and statistics, a true positive (TP) is a prediction that correctly assigns the positive label to a sample whose actual label is positive. It is one of the four basic outcomes of a binary classification decision, alongside false positive, true negative, and false negative. Together these four counts make up the confusion matrix, the foundational accounting device used to evaluate almost every classification model.

The true positive count underpins a long list of derived metrics: precision, recall (also called sensitivity or the true positive rate, TPR), and the F1 score all depend on TP. Whether a classifier is filtering spam, screening for cancer, or drawing bounding boxes around pedestrians, the question "how many real positives did the model actually catch?" is answered by counting true positives.

Definition

Given a binary classifier that outputs a predicted label (positive or negative) for each sample with a known ground-truth label, the four possible outcomes are:

True positive (TP): actual positive, predicted positive.
False negative (FN): actual positive, predicted negative.
False positive (FP): actual negative, predicted positive.
True negative (TN): actual negative, predicted negative.

Following the convention used in the Wikipedia article on the confusion matrix, a true positive is the cell at row "actual positive" and column "predicted positive" in the 2x2 matrix; the positive class was correctly identified.

The definition assumes that one of the two classes has been designated as the positive class. In medical screening the positive class is usually "has the disease." In spam filtering it is usually "spam." In fraud detection it is "fraudulent transaction." The choice is conventional, but it determines what counts as a true positive and which derived metrics carry the meaning the practitioner cares about.
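The four outcomes can be tallied directly from paired labels. The sketch below is a minimal illustration rather than a library API: the function name confusion_counts, its positive argument, and the toy labels are all hypothetical. It also shows how the counts change when the other class is designated as positive.

```python
def confusion_counts(y_true, y_pred, positive):
    """Count (TP, FN, FP, TN), treating `positive` as the positive class."""
    tp = fn = fp = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == positive and predicted == positive:
            tp += 1  # actual positive, predicted positive
        elif actual == positive:
            fn += 1  # actual positive, predicted negative
        elif predicted == positive:
            fp += 1  # actual negative, predicted positive
        else:
            tn += 1  # actual negative, predicted negative
    return tp, fn, fp, tn


# Toy labels, invented for illustration only.
y_true = ["spam", "spam", "spam", "ham", "ham", "ham", "ham", "ham"]
y_pred = ["spam", "spam", "ham", "spam", "ham", "ham", "ham", "ham"]

spam_counts = confusion_counts(y_true, y_pred, positive="spam")
ham_counts = confusion_counts(y_true, y_pred, positive="ham")
print("spam as positive:", spam_counts)
print("ham as positive: ", ham_counts)
```

Swapping the positive class exchanges TP with TN and FP with FN, which is why the designation has to be fixed before any derived metric is read.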

The 2x2 confusion matrix

The confusion matrix lays out the four counts in a single table. The convention used by scikit-learn, by Wikipedia, and by most machine-learning textbooks places actual labels on the rows and predicted labels on the columns:

	Predicted positive	Predicted negative	Row total
Actual positive	True positive (TP)	False negative (FN)	P = TP + FN
Actual negative	False positive (FP)	True negative (TN)	N = FP + TN
Column total	TP + FP	FN + TN	TP + FP + FN + TN

P is the total number of actual positives in the evaluation set and N is the total number of actual negatives. The diagonal of the matrix (TP and TN) counts correct predictions; the off-diagonal (FP and FN) counts errors. Some references transpose the table, placing predictions on the rows, and some list the negative class first; any layout works as long as the axis labels are read carefully. Misreading the layout is a common source of bugs in evaluation code.
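As a small sketch of this layout, the table can be held as a nested list with actual labels on the rows and predicted labels on the columns. The counts below are hypothetical, chosen only so that the four cells are easy to tell apart.

```python
# Hypothetical counts, chosen only to make the four cells distinguishable.
tp, fn, fp, tn = 5, 2, 3, 10

# Layout used in this article: rows = actual, columns = predicted,
# positive class listed first.
matrix = [
    [tp, fn],  # actual positive
    [fp, tn],  # actual negative
]

p = sum(matrix[0])  # row total, P = TP + FN
n = sum(matrix[1])  # row total, N = FP + TN
predicted_positive = matrix[0][0] + matrix[1][0]  # column total, TP + FP

# Transposed layout (rows = predicted): FP and FN trade places,
# while the diagonal cells stay where they are.
transposed = [list(column) for column in zip(*matrix)]

print(p, n, predicted_positive)
print(transposed)
```

Reading cell [0][1] gives FN in the first layout and FP in the transposed one, which is exactly the mix-up described above.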

Origin in signal detection theory and medical statistics

The vocabulary of true and false positives predates machine learning by several decades. During and after World War II, radar operators and acousticians faced the problem of deciding whether a faint blip on a screen was a real target or random noise. The mathematical framework that grew out of this work, signal detection theory, classified each decision as a hit (signal present, observer says yes), a miss (signal present, observer says no), a false alarm (signal absent, observer says yes), or a correct rejection (signal absent, observer says no). These four outcomes map directly onto today's true positive, false negative, false positive, and true negative.

The terms sensitivity and specificity were introduced by the American biostatistician Jacob Yerushalmy in a 1947 paper on radiographic screening for tuberculosis. Sensitivity (the fraction of actual cases a test catches) and specificity (the fraction of actual non-cases it clears) became the standard way to describe the accuracy of a diagnostic test. The confusion matrix itself was later adapted from human perceptual studies and used by Frank Rosenblatt, among other early researchers, to compare human and machine classifications. Modern machine-learning evaluation inherited the entire conceptual machinery: the four cells, the trade-off between catching real positives and avoiding false alarms, and the language of "true" and "false" labels.

Worked numerical example

A worked example helps fix the definitions. Suppose a spam classifier is evaluated on 1,000 emails. Of these, 200 are actually spam and 800 are legitimate. The classifier flags 240 messages as spam. Among the 240 flagged, 180 are genuinely spam and 60 are legitimate messages misclassified as spam. The remaining 760 messages are predicted to be legitimate, and of those, 20 are actually spam that slipped through.

	Predicted spam	Predicted legitimate	Row total
Actual spam	TP = 180	FN = 20	200
Actual legitimate	FP = 60	TN = 740	800
Column total	240	760	1,000

From this table:

Recall (sensitivity, TPR) = TP / (TP + FN) = 180 / 200 = 0.90. The filter catches 90% of real spam.
Precision (PPV) = TP / (TP + FP) = 180 / 240 = 0.75. Three out of four messages flagged as spam really are spam.
Specificity (TNR) = TN / (TN + FP) = 740 / 800 = 0.925. The filter correctly clears 92.5% of legitimate mail.
Accuracy = (TP + TN) / 1,000 = 920 / 1,000 = 0.92.
F1 = 2 * (0.75 * 0.90) / (0.75 + 0.90) = 1.35 / 1.65 ~ 0.818.

Notice how every metric except specificity depends on TP. Change the count of true positives and the picture of the model changes immediately.
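The arithmetic above can be checked in a few lines. This sketch simply plugs the four counts from the spam example into the formulas; the variable names are illustrative.

```python
# Counts taken from the spam example above.
tp, fn, fp, tn = 180, 20, 60, 740

recall = tp / (tp + fn)              # sensitivity, TPR
precision = tp / (tp + fp)           # PPV
specificity = tn / (tn + fp)         # TNR
accuracy = (tp + tn) / (tp + fn + fp + tn)
f1 = 2 * precision * recall / (precision + recall)

print(f"recall={recall:.3f} precision={precision:.3f} "
      f"specificity={specificity:.3f} accuracy={accuracy:.3f} f1={f1:.3f}")
```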

Metrics derived from true positives

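The table below summarises the most common metrics derived from the confusion matrix, along with their formulas and the question each metric answers. Most of them include TP directly; specificity, FPR, and NPV are built from the other cells and are listed for comparison. Libraries like scikit-learn cover these quantities: sklearn.metrics.confusion_matrix returns the raw counts, and sklearn.metrics.classification_report summarises per-class precision, recall, and F1.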

Metric	Formula	Question it answers
Recall, sensitivity, TPR	TP / (TP + FN)	Of all real positives, what fraction did the model catch?
Precision, positive predictive value (PPV)	TP / (TP + FP)	Of all positive predictions, what fraction were correct?
F1 score	2 * TP / (2 * TP + FP + FN)	Harmonic mean of precision and recall.
F-beta score	(1 + beta^2) * TP / ((1 + beta^2) * TP + beta^2 * FN + FP)	Weighted balance of precision and recall (beta > 1 favours recall).
Accuracy	(TP + TN) / (TP + FP + TN + FN)	Of all predictions, what fraction were correct?
Specificity, true negative rate (TNR)	TN / (TN + FP)	Of all real negatives, what fraction were correctly cleared?
False positive rate (FPR)	FP / (FP + TN)	Of all real negatives, what fraction were wrongly flagged?
False negative rate (FNR)	FN / (TP + FN)	Of all real positives, what fraction were missed?
Negative predictive value (NPV)	TN / (TN + FN)	Of all negative predictions, what fraction were correct?
Matthews correlation coefficient (MCC)	(TP * TN - FP * FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN))	Single balanced score in [-1, 1] using all four cells.

F1 can be written as 2 * TP / (2 * TP + FP + FN), which makes plain that it ignores TN entirely. Accuracy weights every sample equally, so on skewed datasets it is dominated by the majority class and can be misleading. Sensitivity and recall are mathematically identical; the first term comes from medicine, the second from information retrieval, and a given library typically uses one name or the other (scikit-learn, for example, uses recall).
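The count-based formulas from the table translate directly into code. The following is a sketch under stated assumptions: the function names f_beta and mcc are hypothetical, and returning 0.0 when a denominator is zero is a convention chosen here for illustration, not something the formulas themselves dictate. The counts are reused from the spam example.

```python
import math


def f_beta(tp, fp, fn, beta=1.0):
    """F-beta from raw counts; beta > 1 favours recall."""
    b2 = beta ** 2
    denominator = (1 + b2) * tp + b2 * fn + fp
    # Assumption: report 0.0 when the score is undefined.
    return (1 + b2) * tp / denominator if denominator else 0.0


def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient from the four cells."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumption: report 0.0 when any marginal total is zero.
    return (tp * tn - fp * fn) / denominator if denominator else 0.0


tp, fn, fp, tn = 180, 20, 60, 740  # spam example

precision = tp / (tp + fp)
recall = tp / (tp + fn)
harmonic_mean = 2 * precision * recall / (precision + recall)

print(f"F1   = {f_beta(tp, fp, fn):.4f} (harmonic mean: {harmonic_mean:.4f})")
print(f"F2   = {f_beta(tp, fp, fn, beta=2.0):.4f}")
print(f"F0.5 = {f_beta(tp, fp, fn, beta=0.5):.4f}")
print(f"MCC  = {mcc(tp, fp, fn, tn):.4f}")
```

Note that tn never appears in f_beta, while mcc uses all four cells.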

Multi-class generalisation

The confusion matrix extends naturally to k-class problems by becoming a k by k table whose rows are actual classes and whose columns are predicted classes. Diagonal entries are correct predictions and off-diagonal entries are confusions between specific class pairs.

When there are more than two classes, true positives are defined per class using a one-vs-rest scheme: for class c, a TP for c is a sample whose true label is c and whose predicted label is also c. Everything not in c is treated as the negative class for the purposes of computing per-class precision, recall, and F1. Library functions like sklearn.metrics.precision_recall_fscore_support apply this logic automatically and offer averaging strategies (micro, macro, weighted, samples) to combine per-class scores. In micro averaging the TPs, FPs, and FNs from every class are summed before computing the metric, which weights every sample equally. In macro averaging the per-class metric is computed first and then averaged, which weights every class equally regardless of frequency.
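The one-vs-rest counting and the two main averaging strategies can be sketched without any library. The helper names and the three-class toy labels below are hypothetical; recall is used as the example metric, and an undefined recall is reported as 0.0 by assumption.

```python
def per_class_counts(y_true, y_pred):
    """Return {class: (tp, fp, fn)} using a one-vs-rest view of each class."""
    pairs = list(zip(y_true, y_pred))
    counts = {}
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        counts[c] = (tp, fp, fn)
    return counts


def recall(tp, fn):
    # Assumption: report 0.0 when the class has no actual samples.
    return tp / (tp + fn) if (tp + fn) else 0.0


# Toy labels, invented for illustration only.
y_true = ["cat", "cat", "cat", "cat", "dog", "dog", "bird"]
y_pred = ["cat", "cat", "cat", "dog", "dog", "cat", "cat"]

counts = per_class_counts(y_true, y_pred)

# Macro: compute recall per class, then average the per-class scores.
macro_recall = sum(recall(tp, fn) for tp, _, fn in counts.values()) / len(counts)

# Micro: sum the counts over all classes, then compute recall once.
total_tp = sum(tp for tp, _, _ in counts.values())
total_fn = sum(fn for _, _, fn in counts.values())
micro_recall = recall(total_tp, total_fn)

print(counts)
print(f"macro recall = {macro_recall:.3f}, micro recall = {micro_recall:.3f}")
```

The rare class (bird) is never predicted, which drags the macro average well below the micro average.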

Why true positives alone are not enough

Reporting the raw count of true positives, by itself, is rarely informative because the count grows trivially with the number of positive predictions. A spam filter that flags every email as spam will achieve TP = 200 in the example above, but it will also have FP = 800 and a precision of 0.20. The complementary errors matter just as much as the hits.

The problem becomes acute on imbalanced datasets, where one class is much rarer than the other. The accuracy paradox is the classic illustration: in a population where 1% of patients have a disease, a test that always predicts "healthy" achieves 99% accuracy yet has zero true positives and is medically useless. On data this skewed, TP-aware metrics like recall, precision, F1, the precision-recall curve, and the Matthews correlation coefficient are more informative than accuracy. Chicco and Jurman (2020) recommend MCC as a single robust score because it uses all four confusion-matrix cells and is less affected by class imbalance than F1 or accuracy.
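The accuracy paradox is easy to reproduce. This sketch builds a hypothetical population of 1,000 patients with 1% prevalence and a degenerate test that always predicts "healthy"; treating an undefined MCC as 0.0 is an assumption made for the illustration.

```python
import math

# Hypothetical population: 10 sick (label 1) and 990 healthy (label 0).
y_true = [1] * 10 + [0] * 990
y_pred = [0] * 1000  # the test always predicts "healthy"

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)
fn = sum(1 for t, p in pairs if t == 1 and p == 0)
fp = sum(1 for t, p in pairs if t == 0 and p == 1)
tn = sum(1 for t, p in pairs if t == 0 and p == 0)

accuracy = (tp + tn) / len(pairs)
recall = tp / (tp + fn)

# MCC is undefined when a row or column total is zero;
# 0.0 is used here by assumption.
denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0

print(f"TP={tp} accuracy={accuracy:.2f} recall={recall:.2f} MCC={mcc:.2f}")
```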

Relationship to the ROC and precision-recall curves

Most classifiers produce a continuous score (a probability or logit) rather than a hard label. The label is obtained by comparing the score to a threshold. Sweeping the threshold across the range of scores (0 to 1 for probabilities) traces out two important curves:

The ROC curve plots the true positive rate (TPR = TP / (TP + FN)) against the false positive rate (FPR = FP / (FP + TN)). Each point on the curve corresponds to a different threshold, and the area under the curve (AUC) summarises performance across all thresholds.
The precision-recall (PR) curve plots precision against recall as the threshold changes. It is generally preferred over the ROC curve when the positive class is rare, because neither precision nor recall uses TN, so a large pool of correctly rejected negatives cannot flatter the curve.

In both cases, lowering the threshold tends to increase TP (the model labels more samples as positive) at the cost of also increasing FP. Raising the threshold reverses the trade. Choosing the operating point is a deliberate decision about where on the curve the application should live.
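The effect of moving the threshold can be seen on a handful of scored samples. The sketch below is illustrative only: the function name rates_at_threshold, the scores, and the candidate thresholds are hypothetical, a score at or above the threshold is taken to mean "positive", and undefined ratios are reported as 0.0 by assumption (conventions differ).

```python
def rates_at_threshold(y_true, scores, threshold):
    """Return (TPR, FPR, precision) when score >= threshold means positive."""
    tp = fp = fn = tn = 0
    for actual, score in zip(y_true, scores):
        predicted = 1 if score >= threshold else 0
        if actual == 1 and predicted == 1:
            tp += 1
        elif actual == 1:
            fn += 1
        elif predicted == 1:
            fp += 1
        else:
            tn += 1
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return tpr, fpr, precision


# Toy scores, invented for illustration (1 = positive, 0 = negative).
y_true = [1, 1, 1, 0, 1, 0, 0, 0]
scores = [0.95, 0.85, 0.70, 0.60, 0.40, 0.30, 0.20, 0.05]

for threshold in (0.9, 0.5, 0.25):
    tpr, fpr, precision = rates_at_threshold(y_true, scores, threshold)
    print(f"threshold={threshold:.2f} TPR={tpr:.2f} FPR={fpr:.2f} "
          f"precision={precision:.2f}")
```

Each (FPR, TPR) pair is one point on the ROC curve and each (recall, precision) pair is one point on the PR curve; lowering the threshold raises TPR and FPR together.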

Domain examples

The interpretation of a true positive depends on what the positive class represents. The same arithmetic supports very different practical decisions.

Domain	Positive class	True positive means	Cost of FN	Cost of FP
Medical screening	Patient has the disease	Disease correctly detected	Patient goes untreated	Healthy patient sent for follow-up tests
Spam filter	Email is spam	Spam correctly blocked	Spam reaches the inbox	Legitimate mail lost in the spam folder
Fraud detection	Transaction is fraudulent	Fraud correctly flagged	Fraudulent charge succeeds	Legitimate purchase declined
Object detection	Bounding box matches a real object	Predicted box overlaps a real box with IoU above the threshold	Object missed entirely	Phantom object reported
Information retrieval	Document is relevant	Relevant document returned	Relevant document missing from results	Irrelevant document clutters results
Quality control	Part is defective	Defective part removed from the line	Defective part ships to a customer	Good part scrapped

Object detection deserves a special note because the definition of a TP is not just a label match. A predicted bounding box counts as a true positive only if its intersection-over-union (IoU) with a ground-truth box exceeds a chosen threshold and the predicted class is correct. PASCAL VOC uses a single IoU threshold of 0.5; the COCO benchmark computes average precision (AP) at ten IoU thresholds from 0.5 to 0.95 in steps of 0.05 and reports their mean, written as AP@[.5:.05:.95]. Tightening the threshold makes the TP definition stricter and rewards models that localise objects precisely.
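A minimal sketch of the IoU test for a single predicted box against a single ground-truth box follows. The function names, the (x_min, y_min, x_max, y_max) box format, and the example boxes are assumptions made for illustration. Real benchmark code also ensures that each ground-truth box is matched by at most one prediction, which this sketch leaves out, and whether a tie at the threshold counts varies between implementations.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def is_true_positive(pred_box, pred_label, gt_box, gt_label, threshold=0.5):
    """TP only if the class matches and IoU exceeds the threshold."""
    return pred_label == gt_label and iou(pred_box, gt_box) > threshold


# Hypothetical boxes: the prediction is shifted 2 units to the right.
gt_box = (0, 0, 10, 10)
pred_box = (2, 0, 12, 10)

print(f"IoU = {iou(pred_box, gt_box):.3f}")
print(is_true_positive(pred_box, "person", gt_box, "person", threshold=0.5))
print(is_true_positive(pred_box, "person", gt_box, "person", threshold=0.75))

# The ten COCO-style thresholds, 0.50 to 0.95 in steps of 0.05.
thresholds = [0.5 + 0.05 * i for i in range(10)]
passing = sum(
    is_true_positive(pred_box, "person", gt_box, "person", t) for t in thresholds
)
print(f"counts as a TP at {passing} of {len(thresholds)} thresholds")
```

The same prediction is a true positive at a loose threshold and a false positive at a strict one, so the TP count is only meaningful alongside the IoU threshold that produced it.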

Cost-sensitive considerations

In most real applications the cost of a false negative differs from the cost of a false positive, sometimes by orders of magnitude. Missing a malignant tumour is not equivalent to recalling a patient for an unnecessary biopsy, and failing to block a fraudulent wire transfer is not equivalent to declining a legitimate one. This asymmetry is what makes the choice between favouring recall and favouring precision an engineering and policy decision rather than a pure modelling one.

Several practical levers exist for tuning the trade-off without redesigning the model. Adjusting the decision threshold on the predicted score moves the operating point along the ROC or PR curve and requires no retraining. Cost-sensitive learning attaches different misclassification costs to FP and FN during training, so the optimiser produces a model that already favours the cheaper errors. Class-weighted loss functions, available in scikit-learn via the class_weight parameter and in deep-learning frameworks via per-class or per-sample weights, behave similarly. Resampling the training data (oversampling the minority class with techniques like SMOTE, or undersampling the majority class) is another common approach for imbalanced problems. The F-beta score generalises F1 by weighting recall and precision unequally: F2 treats recall as twice as important as precision and is often used in medical screening, while F0.5 favours precision and suits applications such as spam filtering, where false positives are the costlier error.
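Threshold adjustment under asymmetric costs can be sketched as a search over candidate thresholds for the one with the lowest total misclassification cost. Everything in this example is hypothetical: the function names, the scores, the candidate thresholds, and the cost values, which in practice would come from the application.

```python
def total_cost(y_true, scores, threshold, cost_fn, cost_fp):
    """Total cost of the errors made when score >= threshold means positive."""
    pairs = list(zip(y_true, scores))
    fn = sum(1 for t, s in pairs if t == 1 and s < threshold)
    fp = sum(1 for t, s in pairs if t == 0 and s >= threshold)
    return cost_fn * fn + cost_fp * fp


def best_threshold(y_true, scores, candidates, cost_fn, cost_fp):
    """Pick the candidate threshold with the lowest total cost."""
    return min(
        candidates,
        key=lambda th: total_cost(y_true, scores, th, cost_fn, cost_fp),
    )


# Toy scores, invented for illustration (1 = positive, 0 = negative).
y_true = [1, 1, 1, 0, 1, 0, 0, 0]
scores = [0.95, 0.85, 0.70, 0.60, 0.40, 0.30, 0.20, 0.05]
candidates = (0.9, 0.65, 0.5, 0.35, 0.1)

# Misses are ten times as costly as false alarms: prefer a low threshold.
recall_first = best_threshold(y_true, scores, candidates, cost_fn=10, cost_fp=1)

# False alarms are ten times as costly as misses: prefer a high threshold.
precision_first = best_threshold(y_true, scores, candidates, cost_fn=1, cost_fp=10)

print(recall_first, precision_first)
```

The thresholds should be chosen on validation data rather than on the test set, so that the reported TP and FP counts remain an honest estimate.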

Common pitfalls

A handful of recurring errors crop up in code that counts true positives. Row and column confusion is the most frequent: sklearn.metrics.confusion_matrix returns a matrix where rows are true labels and columns are predicted labels, but other libraries and many textbook diagrams swap them. It also orders classes by sorted label, so with 0/1 labels the top-left cell is TN and TP sits at the bottom right, the reverse of the table layout used in this article. A second pitfall is silently flipping the positive class. In scikit-learn the positive class for binary metrics defaults to the label 1, but precision_score and related functions accept an explicit pos_label argument. If the positive class is encoded as 0 and pos_label is not set, the reported numbers describe the other class rather than the one the practitioner cares about; if the labels are strings like "spam", pos_label should be set explicitly.
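The positive-class pitfall can be illustrated without any library. In the sketch below, precision_for is a hypothetical helper that mimics the idea of an explicit pos_label argument; the encoding (0 = spam, 1 = legitimate) and the labels are invented, and undefined precision is reported as 0.0 by assumption.

```python
def precision_for(y_true, y_pred, pos_label):
    """Precision with `pos_label` treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if p == pos_label and t == pos_label)
    fp = sum(1 for t, p in pairs if p == pos_label and t != pos_label)
    return tp / (tp + fp) if (tp + fp) else 0.0


# Hypothetical encoding: 0 = spam (the class of interest), 1 = legitimate.
y_true = [0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 1, 0, 0, 1, 1, 1, 1]

intended = precision_for(y_true, y_pred, pos_label=0)  # precision on spam
accidental = precision_for(y_true, y_pred, pos_label=1)  # precision on legitimate

print(f"pos_label=0: {intended:.3f}")
print(f"pos_label=1: {accidental:.3f}")
```

Both numbers are valid precisions, but only the first answers the question "how many messages flagged as spam really were spam?".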

A third pitfall is reporting a single TP-derived metric without context. A model can have very high recall and very poor precision, or vice versa, and either deficiency may be unacceptable in deployment. Reporting precision and recall together, or using F1 or MCC as a single summary, is safer than reporting either in isolation. The fourth pitfall is evaluating on the training set. TP, FP, TN, and FN should be computed on a held-out test set or by cross-validation; numbers computed on the training data measure memorisation rather than generalisation.
