True positive
Last reviewed
May 26, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v3 · 4,520 words
A confusion matrix, also called an error matrix or contingency table, is a tabular layout that summarizes the performance of a classification algorithm by comparing its predictions against ground-truth labels.[1][2] In the binary case the matrix has four cells: true positive (TP) counts cases correctly predicted as positive, true negative (TN) counts cases correctly predicted as negative, false positive (FP) counts negative cases mislabelled as positive, and false negative (FN) counts positive cases mislabelled as negative.[1][3] These four counts are the raw material for almost every classification metric, including accuracy, precision, recall, specificity, the F1 score, the Matthews correlation coefficient (MCC), and Cohen's kappa.[1][3][4] The terms false positive and false negative correspond, respectively, to the Type I and Type II errors introduced in the 1933 paper by Jerzy Neyman and Egon Pearson on hypothesis testing.[5] The confusion matrix is widely used to evaluate machine-learning classifiers, medical diagnostic tests, fraud-detection systems, and any other procedure that assigns discrete labels to instances.[6][7]
The conceptual distinction between the two kinds of classification error predates machine learning. In their 1933 paper "On the problem of the most efficient tests of statistical hypotheses", published in the Philosophical Transactions of the Royal Society, Series A, Jerzy Neyman and Egon Pearson formalized hypothesis testing as a decision problem in which one chooses between a null hypothesis and an alternative, and described the two ways such a decision can be wrong.[5] An error of the first kind (Type I) is the rejection of a true null hypothesis, and an error of the second kind (Type II) is the failure to reject a false null hypothesis.[5] When the null corresponds to the negative class, Type I and Type II coincide with the classification labels false positive and false negative.[8] Neyman and Pearson also fixed the convention that the probability of a Type I error is denoted by alpha and the probability of a Type II error by beta, so the power of the test is 1 minus beta.[5]
Signal detection theory, developed at the University of Michigan and Bell Labs in the 1950s for radar and psychophysics applications, gave the same four outcomes operational names. In a signal-versus-noise discrimination experiment the response can be a hit (TP), a miss (FN), a false alarm (FP), or a correct rejection (TN), and the performance of an observer is traced out as a receiver operating characteristic (ROC) curve.[7] In medical diagnostics the same vocabulary is recast in terms of sensitivity and specificity, which D. G. Altman and J. M. Bland popularized in the Statistics Notes series of the BMJ in 1994.[6] The terminology entered the machine learning vocabulary through pattern-recognition research, where early perceptron and nearest-neighbour studies used contingency tables to compare predicted and observed labels; the name "confusion matrix" itself comes from the observation that off-diagonal entries record which classes the model confuses for which others.[1][2]
By the mid-1990s the four-cell vocabulary was the standard way to report classifier performance in supervised machine learning. Foster Provost and Tom Fawcett argued in a sequence of papers from 1997 to 2006 that bare classification accuracy was an inadequate summary, and that practitioners should reason directly about the confusion matrix and the trade-off between its off-diagonal cells.[7] Christopher Bishop's Pattern Recognition and Machine Learning (2006) and Hastie, Tibshirani, and Friedman's The Elements of Statistical Learning (second edition, 2009) both present the confusion matrix as the canonical bridge between the binary loss function and the metrics that practitioners actually report.[9][10]
For a binary task with positive and negative classes, every prediction lands in exactly one of four cells. The convention adopted in this article (and in most modern textbooks) places the true label on the rows and the predicted label on the columns; some textbooks transpose the layout, which is one source of confusion when comparing references.
| | Predicted positive | Predicted negative |
|---|---|---|
| Actual positive | True positive (TP) | False negative (FN) |
| Actual negative | False positive (FP) | True negative (TN) |
A true positive is an instance whose true label is positive and whose predicted label is also positive.[1][3] In a spam filter a TP is a spam email correctly routed to the junk folder; in a mammography screen a TP is a malignant tumour correctly flagged for biopsy; in a credit-card fraud detection system a TP is a fraudulent transaction correctly blocked.[6][7] The number of TPs is the diagonal entry in the positive row and positive column of the matrix and is the numerator of both precision and recall.[3]
A true negative is an instance whose true label is negative and whose predicted label is also negative.[1][3] In a spam filter a TN is a legitimate email correctly delivered to the inbox; in mammography a TN is a benign image correctly cleared; in fraud detection a TN is a legitimate transaction correctly approved.[6] The TN count is the diagonal entry in the negative row and negative column and is the numerator of specificity and of the negative predictive value (NPV).[6][11]
A false positive, also called a Type I error or a false alarm, is an instance whose true label is negative but whose predicted label is positive.[1][5] In spam filtering an FP is a wanted email incorrectly diverted to junk; in mammography an FP is a benign image incorrectly flagged for biopsy, sending the patient through follow-up procedures she did not need; in fraud detection an FP is a legitimate purchase incorrectly blocked.[6][7] FPs inflate the denominator of precision and reduce specificity. The false-positive rate FPR = FP / (FP + TN) is also called the fall-out; in signal detection it is the probability of issuing a false alarm.[7]
A false negative, also called a Type II error or a miss, is an instance whose true label is positive but whose predicted label is negative.[1][5] In spam filtering an FN is a spam email that lands in the inbox; in mammography an FN is a missed cancer; in fraud detection an FN is a fraudulent transaction that is wrongly approved.[6][7] FNs reduce recall and increase the false-negative rate FNR = FN / (FN + TP), also called the miss rate.[7] Whether FPs or FNs are the more costly kind of error depends entirely on the application: in mammography FNs are usually viewed as far more damaging than FPs because a missed cancer can be lethal, whereas in spam filtering FPs are typically more damaging because losing a legitimate email is worse than reading a junk one.[6][7]
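The four definitions above translate directly into a counting loop. The following is a minimal sketch rather than any library's API: the function name `confusion_counts`, the `positive` parameter, and the toy spam labels are illustrative assumptions.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FN, FP, TN) for two equal-length label sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp = fn = fp = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive and pred == positive:
            tp += 1          # positive case, predicted positive
        elif truth == positive:
            fn += 1          # positive case, predicted negative (miss)
        elif pred == positive:
            fp += 1          # negative case, predicted positive (false alarm)
        else:
            tn += 1          # negative case, predicted negative
    return tp, fn, fp, tn


# Hypothetical toy data: 1 = spam, 0 = legitimate email.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
counts = confusion_counts(y_true, y_pred)
print(counts)  # (3, 1, 1, 3)
```

Every instance falls into exactly one branch, so the four counts always sum to the number of instances.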
The canonical two-by-two confusion matrix collects the four counts TP, FN, FP, TN for a single decision threshold. Suppose a model classifies 1000 patients for a disease with 100 actual cases. After applying the chosen threshold the predictions resolve into 90 TPs, 10 FNs, 50 FPs, and 850 TNs. The matrix is then:
| | Predicted positive | Predicted negative | Total |
|---|---|---|---|
| Actual positive | TP = 90 | FN = 10 | P = 100 |
| Actual negative | FP = 50 | TN = 850 | N = 900 |
| Total | 140 | 860 | 1000 |
Several derived quantities follow directly. The total number of actual positives is P = TP + FN = 100; the total number of actual negatives is N = FP + TN = 900; the total number of predicted positives is PP = TP + FP = 140; the total number of predicted negatives is PN = FN + TN = 860; and the total sample size is N + P = 1000.[3][7] The class prevalence in this example is 100 / 1000 = 10 percent, which immediately reveals that a constant "predict negative" classifier would already achieve 90 percent accuracy, illustrating why bare accuracy is misleading on imbalanced data.[4][11]
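The marginal totals and the majority-class baseline can be checked with a few lines of arithmetic. This is a sketch using the counts from the worked example; the variable names are chosen here for illustration.

```python
# Counts from the worked example.
tp, fn, fp, tn = 90, 10, 50, 850

actual_pos = tp + fn        # P
actual_neg = fp + tn        # N
predicted_pos = tp + fp     # PP
predicted_neg = fn + tn     # PN
total = actual_pos + actual_neg

prevalence = actual_pos / total
# Accuracy of a constant classifier that always predicts "negative":
# it gets every actual negative right and every actual positive wrong.
baseline_accuracy = actual_neg / total

print(actual_pos, actual_neg, predicted_pos, predicted_neg, total)
print(prevalence, baseline_accuracy)
```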
A confusion matrix is always constructed at a specific decision threshold. Most modern classifiers (logistic regression, neural networks, gradient-boosted decision trees) produce a continuous score and convert it to a binary label by comparison with a classification threshold; sweeping the threshold from low to high traces out a family of confusion matrices, each with different TP, FN, FP, and TN counts.[7] This is the construction that produces the receiver operating characteristic (ROC) curve and the precision-recall curve.
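To make the threshold dependence concrete, the sketch below recomputes the four counts at three thresholds. The scores and labels are made-up toy data, and the rule "predict positive when score >= threshold" is one common convention assumed here; some systems use a strict inequality instead.

```python
def confusion_at_threshold(y_true, scores, threshold):
    """Return (TP, FN, FP, TN), predicting positive when score >= threshold."""
    tp = fn = fp = tn = 0
    for truth, score in zip(y_true, scores):
        predicted_positive = score >= threshold
        if truth == 1 and predicted_positive:
            tp += 1
        elif truth == 1:
            fn += 1
        elif predicted_positive:
            fp += 1
        else:
            tn += 1
    return tp, fn, fp, tn


# Hypothetical scores from some classifier; 1 = positive, 0 = negative.
y_true = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.3, 0.35, 0.6, 0.7, 0.9]

sweep = {t: confusion_at_threshold(y_true, scores, t) for t in (0.2, 0.5, 0.8)}
for threshold, counts in sweep.items():
    print(threshold, counts)
```

Raising the threshold moves instances out of the predicted-positive column: FP falls while FN rises.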
For a problem with K classes the confusion matrix generalises to a K-by-K table whose rows are true classes and whose columns are predicted classes; the diagonal entries are correct predictions and off-diagonal entries are misclassifications.[11] In scikit-learn's convention, entry (i, j) gives the number of samples actually in class i that were predicted as class j.[11] For the three-class example in the scikit-learn documentation, with y_true = [2, 0, 2, 2, 0, 1] and y_pred = [0, 0, 2, 2, 0, 2], the resulting matrix is
[[2, 0, 0],
[0, 0, 1],
[1, 0, 2]]
so class 0 is always predicted correctly, class 1 is always misclassified as class 2, and class 2 is mostly correct with one confusion as class 0.[11] Inspecting which off-diagonal cells are large is a common diagnostic step: for example, in MNIST digit recognition the confusion of 4 with 9 and 3 with 8 are well-known failure modes that the matrix reveals at a glance.
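The three-class matrix quoted above can be reproduced without any library. The helper below is an illustrative sketch (the name `multiclass_confusion` is an assumption, not scikit-learn's function); it follows the same convention that entry (i, j) counts samples of true class i predicted as class j.

```python
def multiclass_confusion(y_true, y_pred, labels):
    """Build a K-by-K matrix: rows are true classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for truth, pred in zip(y_true, y_pred):
        matrix[index[truth]][index[pred]] += 1
    return matrix


y_true = [2, 0, 2, 2, 0, 1]
y_pred = [0, 0, 2, 2, 0, 2]
matrix = multiclass_confusion(y_true, y_pred, labels=[0, 1, 2])
for row in matrix:
    print(row)
```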
To compute per-class precision, recall, and F1 from a multiclass matrix, each class k is treated as the positive class in a one-versus-rest binary problem. TP for class k is the diagonal entry C[k, k]; FN is the sum of off-diagonal entries in row k; FP is the sum of off-diagonal entries in column k; TN is the sum of all other entries.[11][12] These per-class metrics are then aggregated. Macro averaging computes the metric for each class and takes an unweighted mean, giving each class equal weight regardless of its frequency. Weighted averaging does the same but weights each class by its support (number of true instances). Micro averaging sums the per-class TP, FP, and FN across all classes before computing the metric, so micro-averaged precision, recall, and F1 are equal to overall accuracy in the standard multiclass setting.[11][12] The choice between macro and micro matters most when classes are imbalanced: macro F1 punishes a model for ignoring a rare class, whereas micro F1 lets the model coast on the majority class.[12]
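The one-versus-rest decomposition and the three averaging schemes can be sketched as follows, using the three-class matrix from the scikit-learn example. The function names are illustrative, and returning 0.0 when a class has no true or predicted instances is an assumption made here to keep the example total.

```python
def one_vs_rest_counts(matrix, k):
    """Return (TP, FN, FP) for class k treated as the positive class."""
    tp = matrix[k][k]
    fn = sum(matrix[k]) - tp                   # rest of row k
    fp = sum(row[k] for row in matrix) - tp    # rest of column k
    return tp, fn, fp


def f1(tp, fn, fp):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0    # assumed convention for 0/0


matrix = [[2, 0, 0],
          [0, 0, 1],
          [1, 0, 2]]
classes = range(len(matrix))
counts = [one_vs_rest_counts(matrix, k) for k in classes]
per_class_f1 = [f1(*c) for c in counts]
support = [sum(row) for row in matrix]         # true instances per class

macro_f1 = sum(per_class_f1) / len(per_class_f1)
weighted_f1 = sum(f * s for f, s in zip(per_class_f1, support)) / sum(support)
micro_f1 = f1(sum(c[0] for c in counts),
              sum(c[1] for c in counts),
              sum(c[2] for c in counts))
accuracy = sum(matrix[k][k] for k in classes) / sum(support)

print(per_class_f1, macro_f1, weighted_f1, micro_f1, accuracy)
```

On this matrix the macro average is pulled down by class 1, which is never predicted correctly, while the micro average coincides with overall accuracy.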
The four cells of the confusion matrix are the building blocks of a long list of scalar metrics. The most important are defined below using the abbreviations TP, FP, FN, TN. Most of the formulas appear in scikit-learn's Metrics and scoring documentation,[11] in Fawcett's ROC tutorial,[7] and in Sokolova and Lapalme's systematic survey of twenty-four classification metrics.[12]
Accuracy is the fraction of predictions that are correct:
Accuracy = (TP + TN) / (TP + FN + FP + TN)
For the worked example above, accuracy = (90 + 850) / 1000 = 0.94.[11] Accuracy is intuitive but degenerates on imbalanced datasets, where a trivial classifier that always predicts the majority class can score high without learning anything.[4][11] Balanced accuracy averages sensitivity and specificity to correct for this bias; for binary classification it is Balanced accuracy = (TP / (TP + FN) + TN / (TN + FP)) / 2.[11]
Precision, also called the positive predictive value (PPV), is the fraction of predicted positives that are actually positive:
Precision = TP / (TP + FP)
Precision answers the question "of the cases the model flagged as positive, how many really are positive?".[7][11] In the worked example precision = 90 / 140 = 0.643. High precision matters whenever a false positive triggers a costly downstream action, such as blocking a customer's credit card or biopsying a patient.[6]
Recall, also called sensitivity or the true positive rate (TPR), is the fraction of actual positives that the model finds:
Recall = TP / (TP + FN)
Recall answers the question "of the positive cases that existed, how many did the model catch?".[7][11] In the worked example recall = 90 / 100 = 0.90. High recall matters whenever a false negative is more costly than a false positive, such as missing a cancer or letting a fraudulent transaction pass.[6]
Specificity, also called the true negative rate (TNR), is the fraction of actual negatives that the model correctly rejects:
Specificity = TN / (TN + FP)
Specificity is the medical-diagnostics counterpart of recall.[6][7] In the worked example specificity = 850 / 900 = 0.944. The false-positive rate is the complement: FPR = 1 - Specificity = FP / (FP + TN).[7]
The negative predictive value (NPV) is the fraction of predicted negatives that are actually negative:
NPV = TN / (TN + FN)
In the worked example NPV = 850 / 860 = 0.988. PPV and NPV are sensitive to disease prevalence in ways that sensitivity and specificity are not, and Altman and Bland recommended that diagnostic-test reports include all four.[6]
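The ratio metrics defined so far can be collected into one helper and checked against the worked example. This is a minimal sketch: `basic_metrics` is a hypothetical name, and the function deliberately omits zero-division handling, so it assumes every row and column of the matrix is non-empty.

```python
def basic_metrics(tp, fn, fp, tn):
    """Ratio metrics from the four cells; assumes no marginal is zero."""
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "balanced_accuracy": (recall + specificity) / 2,
        "precision": tp / (tp + fp),
        "recall": recall,
        "specificity": specificity,
        "npv": tn / (tn + fn),
    }


metrics = basic_metrics(tp=90, fn=10, fp=50, tn=850)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```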
The F1 score is the harmonic mean of precision and recall:
F1 = 2 * Precision * Recall / (Precision + Recall) = 2 * TP / (2 * TP + FP + FN)
For the worked example F1 = 2 * 0.643 * 0.90 / (0.643 + 0.90) = 0.750.[11] The F-beta score generalises F1 by weighting recall beta times as heavily as precision:
F_beta = (1 + beta^2) * Precision * Recall / (beta^2 * Precision + Recall)
F2 (beta = 2) emphasises recall and is used when missing positives is more costly than producing false positives; F0.5 emphasises precision.[11] F1 is widely reported but inherits a structural problem: it ignores the TN cell entirely, so two classifiers can have the same F1 yet very different specificity.[4]
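The F-beta family can be computed directly from counts. The sketch below uses the count form F_beta = (1 + beta^2) * TP / ((1 + beta^2) * TP + beta^2 * FN + FP), which is algebraically equivalent to the precision/recall form above; the function name and the 0.0 return for an empty denominator are assumptions made for illustration.

```python
def f_beta(tp, fn, fp, beta=1.0):
    """F-beta from counts; note that TN does not appear anywhere."""
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0


# Worked example: TP = 90, FN = 10, FP = 50.
f1_score = f_beta(90, 10, 50, beta=1.0)
f2_score = f_beta(90, 10, 50, beta=2.0)      # leans towards recall (0.90)
f_half_score = f_beta(90, 10, 50, beta=0.5)  # leans towards precision (0.643)

print(round(f1_score, 3), round(f2_score, 3), round(f_half_score, 3))
```

Because the signature takes no TN argument, the example also makes the structural limitation visible: any two matrices that share TP, FN, and FP receive the same score regardless of how many true negatives they contain.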
The Matthews correlation coefficient was introduced by the biochemist Brian W. Matthews in 1975, in a paper that compared predicted and observed secondary structures of T4 phage lysozyme.[13] It is the Pearson correlation between the binary vectors of true and predicted labels and uses all four cells of the matrix:
MCC = (TP * TN - FP * FN) / sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
MCC ranges from -1 (totally inverted prediction) through 0 (no better than chance) to +1 (perfect prediction).[13][14] Davide Chicco and Giuseppe Jurman argued in BMC Genomics in 2020 that MCC is more informative than F1 and accuracy on imbalanced binary problems because it yields a high score only when all four cells are good. Their argument is illustrated by classifiers that label almost every instance positive on a heavily imbalanced dataset: with 91 positives and 9 negatives, a matrix of TP = 90, FN = 1, FP = 9, TN = 0 gives accuracy 0.90 and F1 0.95 but an MCC of approximately -0.03, exposing what F1 hides.[4] If every instance is labelled positive, TN + FN = 0 and the MCC formula is undefined. Scikit-learn implements both binary and multiclass MCC; the multiclass form generalises the formula using class totals over the K-by-K matrix.[11]
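The MCC formula can be evaluated directly from the four cells. In the sketch below, returning `None` when a marginal total is zero is a choice made for illustration, since the formula is then 0/0; some implementations report 0 in that case instead. Two cases on a 91-positive / 9-negative dataset are included alongside the worked example: a classifier that labels all but one instance positive, and one that labels every instance positive.

```python
import math


def mcc(tp, fn, fp, tn):
    """Matthews correlation coefficient, or None when it is undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return None  # a whole row or column of the matrix is empty
    return (tp * tn - fp * fn) / denom


worked = mcc(tp=90, fn=10, fp=50, tn=850)
nearly_all_positive = mcc(tp=90, fn=1, fp=9, tn=0)
all_positive = mcc(tp=91, fn=0, fp=9, tn=0)

print(round(worked, 3), round(nearly_all_positive, 3), all_positive)
```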
Cohen's kappa was introduced by Jacob Cohen in 1960 in Educational and Psychological Measurement to quantify inter-rater agreement on nominal categories while correcting for chance agreement.[15] Applied to a confusion matrix it measures how much better than random guessing the classifier is, given the marginal class distribution:
kappa = (p_o - p_e) / (1 - p_e)
where p_o is the observed agreement (overall accuracy) and p_e is the agreement expected by chance given the row and column marginals.[15] Kappa is 1 for perfect agreement, 0 for chance agreement, and negative when agreement is worse than chance.[11][15] Kappa shares MCC's virtue of penalising trivial classifiers but is sometimes hard to interpret because the same observed score can arise from very different combinations of marginals.
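For the binary case, p_e is the sum over both classes of (row total * column total) / total^2. A minimal sketch applied to the worked example follows; the function name is illustrative, and returning `None` when p_e equals 1 is an assumption made to avoid dividing by zero.

```python
def cohens_kappa(tp, fn, fp, tn):
    """Cohen's kappa for a 2-by-2 confusion matrix."""
    total = tp + fn + fp + tn
    p_o = (tp + tn) / total
    # Chance agreement: both "say positive" plus both "say negative",
    # assuming truth and prediction are independent with these marginals.
    p_e = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / total ** 2
    if p_e == 1:
        return None
    return (p_o - p_e) / (1 - p_e)


kappa = cohens_kappa(tp=90, fn=10, fp=50, tn=850)
print(round(kappa, 3))
```

Here p_o = 0.94 and p_e = (100 * 140 + 900 * 860) / 1000^2 = 0.788, so kappa is 0.152 / 0.212, noticeably lower than the raw accuracy.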
Several additional one-line metrics appear in the literature.[7][12] The false-positive rate (FPR) is FP / (FP + TN), the x-axis of the ROC curve. The false-negative rate (FNR) or miss rate is FN / (FN + TP). The false discovery rate (FDR), used heavily in multiple-comparisons statistics, is FP / (FP + TP). Prevalence is (TP + FN) / total. David Powers's 2011 paper introduced informedness (TPR + TNR - 1, also called Youden's J) and markedness (PPV + NPV - 1) and showed that MCC is the geometric mean of these two quantities, giving MCC a clean probabilistic interpretation.[16]
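The relationship between informedness, markedness, and MCC can be verified numerically on the worked example. This is a sketch for one matrix, not a proof; the sign is taken from informedness because both quantities share the sign of TP * TN - FP * FN.

```python
import math

tp, fn, fp, tn = 90, 10, 50, 850

tpr = tp / (tp + fn)
tnr = tn / (tn + fp)
ppv = tp / (tp + fp)
npv = tn / (tn + fn)

informedness = tpr + tnr - 1   # Youden's J
markedness = ppv + npv - 1

# Signed geometric mean of the two quantities.
geometric_mean = math.copysign(math.sqrt(informedness * markedness), informedness)

mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)

print(round(informedness, 3), round(markedness, 3))
print(round(geometric_mean, 6), round(mcc, 6))
```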
A classifier that outputs a continuous score induces a different confusion matrix at every decision threshold. Sweeping the threshold from very strict (predicts positive only when very confident) to very loose (predicts positive almost always) produces a family of (FPR, TPR) pairs. Plotting TPR against FPR produces the receiver operating characteristic (ROC) curve; the area under that curve is the AUC (area under the ROC curve), which equals the probability that the model assigns a higher score to a randomly chosen positive instance than to a randomly chosen negative one.[7] The ROC curve is independent of class prevalence and of the chosen threshold, which is why it is the standard threshold-agnostic summary; the precision-recall curve plays the analogous role when one wants a prevalence-sensitive view.[11]
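The ranking interpretation of AUC suggests a direct, if quadratic-time, way to compute it: compare every positive score with every negative score. The sketch below uses made-up toy data and counts ties as half a win, which is the usual convention but is stated here as an assumption.

```python
def auc_by_ranking(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative instance")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5   # assumed tie convention
    return wins / (len(pos) * len(neg))


# Hypothetical scores; 1 = positive, 0 = negative.
y_true = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.3, 0.35, 0.6, 0.7, 0.9]
auc = auc_by_ranking(y_true, scores)
print(round(auc, 3))
```

Only one of the nine pairs is misordered (the positive scored 0.35 against the negative scored 0.6), so the result is 8/9.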
The link to the confusion matrix is direct: each point on a ROC or PR curve corresponds to a specific confusion matrix; reporting a confusion matrix without a curve fixes performance to a single threshold, while reporting a curve without a confusion matrix obscures what happens at the operating point that will actually be deployed.[7] Practitioners often report both: the curve and AUC for threshold-agnostic comparison and a confusion matrix at the chosen operating threshold for deployment.
In supervised machine learning the confusion matrix is the canonical first step in error analysis.[11] Scikit-learn's classification_report function reports per-class precision, recall, F1, and support derived from the matrix; the confusion_matrix function returns the raw counts and is often visualised with the ConfusionMatrixDisplay heatmap helper.[11] Library functions in scikit-learn, TensorFlow, PyTorch, R's caret package, and MATLAB's Statistics and Machine Learning Toolbox all expose the same four-cell abstraction. The matrix is typically computed on a held-out test set (or via cross-validation) rather than on training data, since training accuracy overstates performance for any model capable of overfitting.[10]
The confusion matrix is the lingua franca of diagnostic-test evaluation. A screening test for a disease produces TPs (correctly identified cases), TNs (correctly cleared patients), FPs (false alarms that trigger follow-up procedures), and FNs (missed cases). Sensitivity and specificity are the principal performance summaries, but as Altman and Bland argued in 1994, positive and negative predictive values are what actually matter to a patient because they answer the question "given my test result, how likely is it that I have (do not have) the disease?".[6] PPV and NPV depend on disease prevalence, which is why a high-sensitivity test can still have a low PPV when the disease is rare. The same vocabulary applies to PCR tests, antigen tests, mammography, colonoscopy screening, and laboratory assays.[6]
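The dependence of PPV on prevalence follows from Bayes' rule: PPV = sens * prev / (sens * prev + (1 - spec) * (1 - prev)). The sketch below holds sensitivity and specificity fixed at the values from the worked example and varies only the prevalence; the 1 percent figure is a hypothetical rare-disease setting chosen for illustration.

```python
def ppv_from_rates(sensitivity, specificity, prevalence):
    """Positive predictive value via Bayes' rule."""
    true_pos_mass = sensitivity * prevalence
    false_pos_mass = (1 - specificity) * (1 - prevalence)
    return true_pos_mass / (true_pos_mass + false_pos_mass)


sensitivity = 90 / 100     # from the worked example
specificity = 850 / 900    # from the worked example

ppv_at_10_percent = ppv_from_rates(sensitivity, specificity, prevalence=0.10)
ppv_at_1_percent = ppv_from_rates(sensitivity, specificity, prevalence=0.01)

print(round(ppv_at_10_percent, 3), round(ppv_at_1_percent, 3))
```

At 10 percent prevalence the formula recovers the worked example's precision of 0.643; at 1 percent prevalence the same test yields a PPV of roughly 0.14, so most positive results would be false alarms.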
Credit-card and payment-fraud systems are highly imbalanced (fraud is typically well under one percent of transactions), so accuracy is a useless metric and confusion-matrix-derived measures take over.[14] Operators tune the threshold to balance the cost of FPs (annoyed customers, friction in checkout) against the cost of FNs (direct monetary loss and regulatory exposure). The matrix is also the input to cost-sensitive metrics such as expected cost = c_FP * FP + c_FN * FN, which allows asymmetric error costs to be encoded explicitly.[7][14] Fraud-detection pipelines often report the confusion matrix at several operating thresholds, plus precision at fixed recall and the area under the precision-recall curve.
The same Type I / Type II framework that motivates the confusion matrix appears in A/B testing, where alpha controls the chance of declaring a winning variant when there is no real effect (FP) and beta controls the chance of missing a real effect (FN). Other application areas that reduce to confusion-matrix evaluation include spam filtering, malware detection, content moderation, biometric authentication (binary classification of "same person" versus "different person"), anomaly detection, churn prediction, and image-segmentation pixel labelling.[7][12]
When one class dominates, accuracy and even F1 can mislead. A classifier that predicts the majority class for everything scores high on accuracy but has zero recall for the minority class.[4][11] The remedy is to report metrics that use all four cells (MCC, balanced accuracy, kappa) or to inspect precision and recall for the minority class explicitly.[4][14] Davide Chicco and Giuseppe Jurman recommend MCC as the default summary for binary tasks on imbalanced data.[4]
FPs and FNs are rarely equally costly, but precision, recall, and F1 treat them as if they were comparable. When costs differ, the right object to optimise is expected cost, not accuracy.[7] Operators choose the threshold that minimises expected cost given the prevalence and the cost ratio, which corresponds to picking a specific point on the ROC curve.[7]
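Choosing a threshold by expected cost can be sketched as a search over candidate thresholds, scoring each by c_FP * FP + c_FN * FN. The data, the candidate thresholds, the cost values, and the function names below are all hypothetical; in practice the search would be run on validation data.

```python
def fp_fn_at_threshold(y_true, scores, threshold):
    """Count FP and FN when predicting positive for score >= threshold."""
    fp = sum(1 for t, s in zip(y_true, scores) if t == 0 and s >= threshold)
    fn = sum(1 for t, s in zip(y_true, scores) if t == 1 and s < threshold)
    return fp, fn


def cheapest_threshold(y_true, scores, thresholds, c_fp, c_fn):
    """Return the candidate threshold with the lowest expected cost."""
    def cost(threshold):
        fp, fn = fp_fn_at_threshold(y_true, scores, threshold)
        return c_fp * fp + c_fn * fn
    return min(thresholds, key=cost)


y_true = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.3, 0.35, 0.6, 0.7, 0.9]
candidates = [0.2, 0.5, 0.8]

# Misses five times as costly as false alarms: prefer a loose threshold.
when_fn_costly = cheapest_threshold(y_true, scores, candidates, c_fp=1, c_fn=5)
# False alarms five times as costly as misses: prefer a strict threshold.
when_fp_costly = cheapest_threshold(y_true, scores, candidates, c_fp=5, c_fn=1)

print(when_fn_costly, when_fp_costly)
```

The same scores yield different operating points purely because the cost ratio changed, which is the sense in which the cost ratio picks out a point on the ROC curve.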
Because the confusion matrix depends on a single threshold, comparing models on the basis of one matrix can be deceptive: model A may dominate model B at one threshold and lose at another. Threshold-agnostic summaries (AUC, average precision) avoid this problem; when a single threshold must be chosen, it should be selected on a validation set rather than the test set to avoid leakage.[7][11]
Macro, micro, and weighted averages can disagree dramatically on imbalanced multiclass problems. Papers that report a single F1 number without specifying the averaging convention can be misleading; for this reason the scikit-learn classification_report prints several summary rows by default: macro avg, weighted avg, and accuracy, which equals the micro average in the single-label multiclass case.[11][12]
The confusion matrix says nothing about whether the model's probability scores are calibrated. A model can have perfect precision and recall at the chosen threshold while still being wildly overconfident or underconfident in its probability outputs. Calibration is a complementary axis of evaluation that requires Brier scores, log loss, or reliability diagrams rather than four-cell counts.[11]
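A small example shows how two models can share a confusion matrix yet differ in the quality of their probabilities. The sketch below uses the Brier score (mean squared difference between predicted probability and the 0/1 outcome); the two sets of probabilities are invented for illustration, and a threshold of 0.5 is assumed.

```python
def brier_score(y_true, probs):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return sum((p - t) ** 2 for t, p in zip(y_true, probs)) / len(y_true)


def labels_at(probs, threshold=0.5):
    return [1 if p >= threshold else 0 for p in probs]


y_true = [1, 1, 0, 0]
confident_probs = [0.90, 0.80, 0.20, 0.10]   # hypothetical model A
hesitant_probs = [0.51, 0.51, 0.49, 0.49]    # hypothetical model B

# Both models produce the same hard labels, hence the same confusion matrix.
same_labels = labels_at(confident_probs) == labels_at(hesitant_probs) == y_true

brier_confident = brier_score(y_true, confident_probs)
brier_hesitant = brier_score(y_true, hesitant_probs)

print(same_labels, round(brier_confident, 4), round(brier_hesitant, 4))
```

Both models classify every instance correctly at the 0.5 threshold, but the Brier score separates them by nearly a factor of ten (lower is better).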
Because the confusion matrix is a deterministic function of predictions on a labelled dataset, it inherits all the usual concerns about test-set construction: leakage from training, label noise, distribution shift between training and deployment, and selection bias.[10] A glittering confusion matrix on a contaminated test set proves nothing.
The four-cell matrix is usually drawn as a square table with cell colour intensity proportional to count. Modern libraries render multiclass confusion matrices as heatmaps with a colour scale, often with row-normalised entries (each row sums to 1) so the diagonal shows per-class recall and off-diagonal entries show per-class confusion probabilities.[11] Scikit-learn's ConfusionMatrixDisplay.from_predictions and from_estimator helpers generate both raw-count and normalised heatmaps; TensorBoard and Weights & Biases offer similar interactive renders.
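Row normalisation itself is a one-line operation per row. The sketch below applies it to the three-class matrix from the scikit-learn example; the function name is illustrative, and leaving an all-zero row as zeros is an assumption made to avoid division by zero.

```python
def row_normalise(matrix):
    """Divide each row by its total so that non-empty rows sum to 1."""
    normalised = []
    for row in matrix:
        total = sum(row)
        normalised.append([count / total if total else 0.0 for count in row])
    return normalised


matrix = [[2, 0, 0],
          [0, 0, 1],
          [1, 0, 2]]
normalised = row_normalise(matrix)
per_class_recall = [normalised[k][k] for k in range(len(matrix))]

for row in normalised:
    print([round(value, 3) for value in row])
print([round(value, 3) for value in per_class_recall])
```

The diagonal of the normalised matrix reads off per-class recall directly: 1.0 for class 0, 0.0 for class 1, and about 0.667 for class 2.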
Two layout conventions exist. The "row equals truth, column equals prediction" convention (used by scikit-learn) is the one used throughout this article and matches the matrix that Wikipedia and most modern textbooks publish.[1][11] The transposed convention ("row equals prediction, column equals truth") is found in older signal-detection texts and in some medical-statistics references; the four cells contain the same numbers but TPR and FPR are read from different axes.[6] When comparing references, the safest practice is to label each axis explicitly.
For multiclass matrices, ordering the rows and columns to group semantically similar classes makes clusters of off-diagonal mass visible at a glance: in MNIST, for instance, ordering digits by visual similarity rather than numeric value reveals the well-known 4-9, 3-8, and 7-1 confusions.[11] For very large K the matrix becomes hard to read directly and is often summarised by per-class precision-recall bar charts or by reporting only the largest off-diagonal entries.
The confusion matrix is a complete summary of classification performance at a single threshold, but it has known limitations.[4][7][12] First, it discards the rank ordering of confidence scores; two classifiers with identical confusion matrices can have very different ROC curves at other thresholds. Second, by reducing each instance to "correct or incorrect" it loses information about probability calibration. Third, in multilabel and structured-output settings the binary-confusion abstraction is not always natural: a model that predicts five tags per instance has a different error structure than one that predicts a single label, and bespoke metrics (sample-averaged F1, Hamming loss, mean average precision) are usually required.[11][12] Fourth, when the cost of FPs and FNs varies across instances rather than just across classes (for instance, a fraud system where transaction amounts vary), the four counts need to be replaced by amount-weighted versions.[7]
Despite these caveats the four-cell decomposition remains the foundation of classification evaluation. Almost every metric in the standard toolbox can be written in terms of TP, TN, FP, and FN, and the matrix itself is the most direct picture of where a classifier is succeeding and where it is failing.