A confusion matrix (also called an error matrix or classification matrix) is a table that summarizes the performance of a classification model by comparing predicted class labels against actual class labels. It is one of the most widely used tools in machine learning evaluation because it provides a detailed breakdown of correct and incorrect predictions, making it far more informative than a single aggregate metric like accuracy.
The confusion matrix is not limited to binary classification. It generalizes to multi-class problems, where the matrix becomes an N x N table for N classes. By examining the entries of a confusion matrix, practitioners can calculate a wide range of performance metrics, diagnose specific failure modes of a classifier, and make informed decisions about model improvements.
The mathematical foundations of the confusion matrix trace back to 1904, when Karl Pearson published "On the theory of contingency and its relation to association and normal correlation." Pearson introduced the concept of contingency tables to study associations between categorical variables, and his work laid the groundwork for what would later become the confusion matrix in classification analysis.
The term "confusion matrix" itself emerged from the field of human perception studies. In the 1950s and 1960s, researchers studying auditory and visual stimuli used the term to describe tables that recorded how often subjects confused one stimulus with another. James Townsend's 1971 paper "Theoretical analysis of an alphabetic confusion matrix" applied the concept to the classification of uppercase English letters, analyzing how frequently participants misidentified one letter as another. The tool was later adopted by early machine learning researchers, including Frank Rosenblatt, who used it to compare human and machine classification performance.
The term gained formal recognition in the machine learning community through the glossary published by Ron Kohavi and Foster Provost in 1998, which appeared in a special issue of the journal Machine Learning.
For a binary classifier that assigns instances to either a positive class or a negative class, the confusion matrix is a 2 x 2 table with four cells:
|  | Predicted positive | Predicted negative |
|---|---|---|
| Actual positive | True positive (TP) | False negative (FN) |
| Actual negative | False positive (FP) | True negative (TN) |
Each cell counts the number of instances that fall into that category:

- True positive (TP): the instance is actually positive and is predicted positive.
- False negative (FN): the instance is actually positive but is predicted negative.
- False positive (FP): the instance is actually negative but is predicted positive.
- True negative (TN): the instance is actually negative and is predicted negative.

The diagonal of the matrix (TP and TN) represents correct predictions, while the off-diagonal entries (FP and FN) represent misclassifications.
Suppose a medical test is used to screen 200 patients for a disease. Of these, 50 actually have the disease and 150 do not. The test produces the following results:
|  | Predicted: Disease | Predicted: No disease | Total |
|---|---|---|---|
| Actual: Disease | 45 (TP) | 5 (FN) | 50 |
| Actual: No disease | 10 (FP) | 140 (TN) | 150 |
| Total | 55 | 145 | 200 |
From this table, the test correctly identified 45 of the 50 patients who have the disease and correctly cleared 140 of the 150 patients who do not. However, it produced 10 false alarms and missed 5 actual cases.
The confusion matrix provides a direct way to count and analyze the two fundamental types of classification errors from statistical hypothesis testing:
| Error type | Confusion matrix cell | Description | Also known as |
|---|---|---|---|
| Type I error | False positive (FP) | Rejecting a true null hypothesis; predicting positive when the actual class is negative | False alarm, alpha error |
| Type II error | False negative (FN) | Failing to reject a false null hypothesis; predicting negative when the actual class is positive | Miss, beta error |
In many real-world applications, the two error types carry different costs. A false negative in cancer screening (missing a cancer case) is generally far more harmful than a false positive (an unnecessary follow-up test). Understanding the relative costs of Type I and Type II errors is critical for selecting an appropriate classification threshold and evaluating model fitness for a given task.
One of the most valuable aspects of the confusion matrix is that it serves as the basis for computing a wide range of performance metrics. Each metric highlights a different aspect of classifier performance.
| Metric | Formula | Description |
|---|---|---|
| Accuracy | (TP + TN) / (TP + TN + FP + FN) | Proportion of all predictions that are correct |
| Precision (positive predictive value) | TP / (TP + FP) | Proportion of positive predictions that are actually positive |
| Recall (sensitivity, true positive rate) | TP / (TP + FN) | Proportion of actual positives that are correctly identified |
| Specificity (true negative rate) | TN / (TN + FP) | Proportion of actual negatives that are correctly identified |
| F1 score | 2 * (Precision * Recall) / (Precision + Recall) | Harmonic mean of precision and recall |
| Negative predictive value (NPV) | TN / (TN + FN) | Proportion of negative predictions that are actually negative |

The complementary error rates can also be read directly from the confusion matrix:

| Metric | Formula | Description |
|---|---|---|
| False positive rate (FPR) | FP / (FP + TN) | Proportion of actual negatives incorrectly classified as positive; equals 1 - specificity |
| False negative rate (FNR) | FN / (FN + TP) | Proportion of actual positives incorrectly classified as negative; equals 1 - recall |
| False discovery rate (FDR) | FP / (FP + TP) | Proportion of positive predictions that are actually negative; equals 1 - precision |
| False omission rate (FOR) | FN / (FN + TN) | Proportion of negative predictions that are actually positive; equals 1 - NPV |
Using the medical screening example above: accuracy = (45 + 140) / 200 = 0.925, precision = 45 / 55 = 0.818, recall = 45 / 50 = 0.90, specificity = 140 / 150 = 0.933, and F1 score = 2 * (0.818 * 0.90) / (0.818 + 0.90) = 0.857.
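These figures are easy to verify with a few lines of Python; the sketch below simply hard-codes the four counts from the screening table and applies the formulas above.

```python
# Counts from the medical screening example
TP, FN, FP, TN = 45, 5, 10, 140

accuracy = (TP + TN) / (TP + TN + FP + FN)          # 0.925
precision = TP / (TP + FP)                          # ~0.818
recall = TP / (TP + FN)                             # 0.90
specificity = TN / (TN + FP)                        # ~0.933
f1 = 2 * precision * recall / (precision + recall)  # ~0.857

print(accuracy, precision, recall, specificity, f1)
```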
The Matthews correlation coefficient (MCC) is a metric that uses all four cells of the confusion matrix to produce a single score between -1 and +1. It is calculated as:
MCC = (TP x TN - FP x FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN))
A value of +1 indicates perfect classification, 0 indicates performance no better than random guessing, and -1 indicates complete inverse classification. The MCC is widely regarded as one of the most reliable single-number metrics for binary classification evaluation. Unlike accuracy, it remains informative even when classes are severely imbalanced. Chicco and Jurman (2020) demonstrated that the MCC is more informative than the F1 score and accuracy in binary classification evaluation, and a follow-up study (Chicco, Tötsch, and Jurman, 2021) found it more reliable than balanced accuracy, bookmaker informedness, and markedness in two-class confusion matrix evaluation. Chicco and Jurman (2023) further argued that the MCC should replace the ROC AUC as the standard metric for assessing binary classification.
The MCC is mathematically equivalent to the phi coefficient used in statistics, which measures the association between two binary variables.
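As a quick illustration, the sketch below evaluates the MCC formula on the medical screening counts used earlier; the result of roughly 0.81 is consistent with the high precision and recall computed above.

```python
import math

# Counts from the medical screening example
TP, FN, FP, TN = 45, 5, 10, 140

numerator = TP * TN - FP * FN
denominator = math.sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
mcc = numerator / denominator
print(round(mcc, 3))  # ~0.808
```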
Cohen's kappa is another metric derived from the confusion matrix that accounts for the possibility of correct predictions occurring by chance. It is computed as:
Kappa = (observed accuracy - expected accuracy) / (1 - expected accuracy)
Here, "observed accuracy" is the overall accuracy of the model, and "expected accuracy" is the probability that the model and the actual labels agree by random chance alone. The coefficient ranges from -1 to +1, where +1 represents perfect agreement and 0 represents agreement no better than chance.
Landis and Koch (1977) proposed the following interpretation guidelines:
| Kappa value | Interpretation |
|---|---|
| < 0.00 | Poor |
| 0.00 - 0.20 | Slight |
| 0.21 - 0.40 | Fair |
| 0.41 - 0.60 | Moderate |
| 0.61 - 0.80 | Substantial |
| 0.81 - 1.00 | Almost perfect |
Cohen's kappa is particularly useful when dealing with imbalanced datasets because it penalizes classifiers that achieve high accuracy simply by predicting the majority class. However, some researchers have cautioned that kappa can be difficult to interpret and depends on the prevalence of each class. Delgado and Tibau (2019) argued that kappa should be avoided as a standalone performance metric and used alongside other measures.
Accuracy, despite being the most intuitive metric, can be deeply misleading when class distributions are imbalanced. Consider a dataset where 95% of instances belong to the negative class and only 5% belong to the positive class. A naive classifier that always predicts "negative" would achieve 95% accuracy while failing to identify a single positive instance.
The confusion matrix for such a classifier would look like this:
|  | Predicted positive | Predicted negative |
|---|---|---|
| Actual positive | 0 | 50 |
| Actual negative | 0 | 950 |
Accuracy = 950 / 1000 = 0.95, but recall = 0 / 50 = 0, precision is undefined, and MCC = 0. The confusion matrix immediately exposes the problem that a single accuracy number hides.
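Assuming scikit-learn is available, these numbers can be reproduced directly:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score, matthews_corrcoef

# 1,000 instances: 50 positives, 950 negatives; the model always predicts "negative"
y_true = np.array([1] * 50 + [0] * 950)
y_pred = np.zeros(1000, dtype=int)

print(accuracy_score(y_true, y_pred))     # 0.95 -- looks impressive
print(recall_score(y_true, y_pred))       # 0.0  -- not a single positive found
print(matthews_corrcoef(y_true, y_pred))  # 0.0  -- no better than chance
```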
For imbalanced datasets, metrics such as precision, recall, F1 score, and MCC are generally more informative. The confusion matrix itself remains the best starting point for understanding model behavior because it shows exactly how predictions are distributed across all classes.
In many practical applications, different types of misclassifications carry different costs. A confusion matrix can be paired with a cost matrix to compute the total expected cost of a classifier's predictions.
A cost matrix assigns a numerical penalty to each cell of the confusion matrix:
|  | Predicted positive | Predicted negative |
|---|---|---|
| Actual positive | C(TP) = 0 | C(FN) = cost of missing a positive |
| Actual negative | C(FP) = cost of a false alarm | C(TN) = 0 |
The total cost is calculated as: Total Cost = C(FP) x FP + C(FN) x FN.
For example, in fraud detection, a false negative (missing a fraudulent transaction) might cost $10,000 on average, while a false positive (flagging a legitimate transaction) might cost $50 in customer inconvenience. A model with 20 false negatives and 200 false positives would have a total cost of 20 x $10,000 + 200 x $50 = $210,000. Cost-sensitive learning algorithms use this framework to optimize classifiers for minimum total cost rather than maximum accuracy.
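A small helper function (hypothetical, not part of any library) makes the calculation explicit:

```python
def total_cost(fp, fn, cost_fp, cost_fn):
    """Total cost of a classifier's errors when correct predictions cost nothing."""
    return cost_fp * fp + cost_fn * fn

# Fraud-detection example from the text: 200 false positives at $50 each,
# 20 false negatives at $10,000 each
print(total_cost(fp=200, fn=20, cost_fp=50, cost_fn=10_000))  # 210000
```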
When a classifier assigns instances to one of N classes (where N > 2), the confusion matrix expands to an N x N table. Each row corresponds to the actual class, and each column corresponds to the predicted class. The diagonal elements represent correct classifications, while off-diagonal elements represent misclassifications.
For example, a three-class classifier (classes A, B, C) might produce:
|  | Predicted A | Predicted B | Predicted C |
|---|---|---|---|
| Actual A | 40 | 5 | 5 |
| Actual B | 3 | 42 | 5 |
| Actual C | 2 | 8 | 40 |
From this matrix, several observations can be made. Class B has the highest per-class recall (42 out of 50 correct), while classes A and C are each correct on 40 of 50 instances. Class C is most often confused with class B (8 instances), suggesting these two classes share similar features. Class B also shows some confusion with class C (5 instances), further supporting the similarity between those classes.
For multi-class problems, per-class metrics can be derived by treating each class as a one-vs-rest binary problem. For class A, TP = 40, FP = 5 (instances predicted as A but actually B or C), FN = 10 (instances actually A but predicted as B or C), and TN = all remaining entries. These per-class metrics can then be averaged across classes using macro-averaging (unweighted mean), micro-averaging (aggregate TP, FP, FN across all classes), or weighted averaging (weighted by class support).
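One way to check these one-vs-rest numbers, assuming scikit-learn is available, is to expand the matrix back into label arrays and let the library compute each averaging scheme:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Three-class confusion matrix from the example (rows = actual, columns = predicted; 0=A, 1=B, 2=C)
cm = np.array([[40, 5, 5],
               [3, 42, 5],
               [2, 8, 40]])

# Rebuild label arrays consistent with the matrix
y_true, y_pred = [], []
for actual in range(3):
    for predicted in range(3):
        count = int(cm[actual, predicted])
        y_true += [actual] * count
        y_pred += [predicted] * count

print(recall_score(y_true, y_pred, average=None))           # per-class recall (one-vs-rest)
print(precision_score(y_true, y_pred, average="macro"))     # unweighted mean over classes
print(precision_score(y_true, y_pred, average="micro"))     # aggregate counts over all classes
print(precision_score(y_true, y_pred, average="weighted"))  # weighted by class support
```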
Raw counts in a confusion matrix can be hard to interpret when classes have different numbers of instances. Normalization converts the raw counts to proportions, making it easier to compare performance across classes.
Dividing each row by its row sum converts each entry to the proportion of instances in that actual class that were classified into each predicted class. The diagonal entries become the per-class recall values. This normalization is useful for answering: "Given an instance from class X, what is the probability the model predicts each class?"
Dividing each column by its column sum converts each entry to the proportion of instances predicted as that class that actually belong to each actual class. The diagonal entries become the per-class precision values. This normalization answers: "Given a prediction of class X, what is the probability the instance actually belongs to each class?"
Using the three-class example above with row normalization:
|  | Predicted A | Predicted B | Predicted C |
|---|---|---|---|
| Actual A | 0.80 | 0.10 | 0.10 |
| Actual B | 0.06 | 0.84 | 0.10 |
| Actual C | 0.04 | 0.16 | 0.80 |
Normalized confusion matrices are especially helpful when classes are imbalanced, as they prevent large classes from dominating the visual impression of the matrix.
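Both normalizations of the three-class example can be computed in a couple of NumPy operations:

```python
import numpy as np

cm = np.array([[40, 5, 5],
               [3, 42, 5],
               [2, 8, 40]], dtype=float)

row_normalized = cm / cm.sum(axis=1, keepdims=True)  # rows sum to 1; diagonal = per-class recall
col_normalized = cm / cm.sum(axis=0, keepdims=True)  # columns sum to 1; diagonal = per-class precision

print(np.round(row_normalized, 2))
print(np.round(col_normalized, 2))
```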
The confusion matrix has a long history of use in clinical medicine, where sensitivity and specificity are the primary measures of diagnostic test performance.
| Medical term | Statistical equivalent | Formula |
|---|---|---|
| Sensitivity | Recall / true positive rate | TP / (TP + FN) |
| Specificity | True negative rate | TN / (TN + FP) |
| Positive predictive value (PPV) | Precision | TP / (TP + FP) |
| Negative predictive value (NPV) | -- | TN / (TN + FN) |
Screening tests (such as mammography for breast cancer or PCR tests for infectious diseases) are typically designed with high sensitivity to minimize the chance of missing a true case, even at the expense of lower specificity. Confirmatory diagnostic tests, by contrast, are designed with high specificity to minimize false positives.
The trade-off between sensitivity and specificity is inherent in any diagnostic or classification system. Lowering the classification threshold increases sensitivity (catching more true positives) but decreases specificity (producing more false positives). Clinicians and machine learning practitioners must select a threshold that balances these concerns based on the clinical or business context.
A receiver operating characteristic (ROC) curve is closely related to the confusion matrix. Each point on an ROC curve corresponds to a specific classification threshold, and at each threshold, a distinct confusion matrix is generated. The ROC curve plots the true positive rate (recall) on the y-axis against the false positive rate (1 - specificity) on the x-axis across all possible thresholds.
While a confusion matrix captures model performance at a single threshold, the ROC curve provides a comprehensive view of performance across all thresholds. The area under the ROC curve (AUC) summarizes this into a single number between 0 and 1, where 1 represents perfect classification and 0.5 represents random guessing.
The two tools complement each other. The confusion matrix is essential when a specific operating point (threshold) has been chosen and the exact counts of correct and incorrect predictions matter. The ROC curve is more useful during model selection and threshold tuning, when the goal is to understand how performance changes as the threshold varies.
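The sketch below, using made-up scores, illustrates the relationship: roc_curve sweeps over every threshold, while fixing one threshold collapses the curve back to a single confusion matrix.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score, confusion_matrix

# Hypothetical ground-truth labels and classifier scores
y_true  = np.array([0, 0, 1, 1, 0, 1, 0, 1, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.05, 0.65, 0.7, 0.3])

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # one (FPR, TPR) pair per threshold
print(roc_auc_score(y_true, y_score))              # area under the ROC curve

# A single threshold yields a single confusion matrix
print(confusion_matrix(y_true, (y_score >= 0.5).astype(int)))
```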
Confusion matrices are commonly visualized as heatmaps, where the intensity of each cell's color corresponds to its value. This makes it easy to spot patterns at a glance: bright diagonal cells indicate strong classification, while bright off-diagonal cells indicate systematic confusion between specific classes.
Most machine learning libraries provide built-in support for confusion matrix visualization:
scikit-learn (Python):
```python
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

y_true = [0, 1, 0, 1, 0, 1, 0, 0, 1, 1]  # actual labels
y_pred = [0, 1, 0, 0, 0, 1, 1, 0, 1, 1]  # model predictions

cm = confusion_matrix(y_true, y_pred)

# Render the matrix as a heatmap with labeled axes
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=["Negative", "Positive"])
disp.plot(cmap=plt.cm.Blues)
plt.title("Confusion Matrix")
plt.show()
```
The sklearn.metrics.confusion_matrix function accepts y_true and y_pred arrays and returns a NumPy array. The optional normalize parameter can be set to 'true' (row normalization), 'pred' (column normalization), or 'all' (normalization over the entire matrix).
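For example, building on the arrays defined in the snippet above, a row-normalized matrix can be requested directly (in recent versions of scikit-learn):

```python
# Rows sum to 1; the diagonal gives per-class recall
cm_normalized = confusion_matrix(y_true, y_pred, normalize="true")
print(cm_normalized)
```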
PyTorch (via torchmetrics):
```python
from torchmetrics.classification import BinaryConfusionMatrix
import torch

metric = BinaryConfusionMatrix()
preds = torch.tensor([0, 1, 0, 0, 0, 1, 1, 0, 1, 1])   # model predictions
target = torch.tensor([0, 1, 0, 1, 0, 1, 0, 0, 1, 1])  # actual labels

cm = metric(preds, target)  # 2 x 2 tensor of counts
print(cm)
```
TensorFlow:
```python
import tensorflow as tf

y_true = [0, 1, 0, 1, 0, 1, 0, 0, 1, 1]  # actual labels
y_pred = [0, 1, 0, 0, 0, 1, 1, 0, 1, 1]  # model predictions

cm = tf.math.confusion_matrix(y_true, y_pred)  # 2 x 2 tensor of counts
print(cm)
```
Imagine you have a big pile of toys, and your job is to sort them into two boxes: one box for cars and one box for dolls. After you finish sorting, your mom checks your work and makes a chart showing how many cars ended up in the car box, how many cars accidentally went into the doll box, how many dolls went into the doll box, and how many dolls ended up with the cars.
That chart is a confusion matrix. It shows you exactly what you got right and what you got wrong, so you can figure out where you need to be more careful next time. If you were sorting into three boxes (cars, dolls, and blocks), the chart would be bigger, with more rows and columns to track all the possible mix-ups.
While the confusion matrix is an essential evaluation tool, it has several limitations: