A confusion matrix (also called an error matrix or classification matrix) is a table that summarizes the performance of a classification model by comparing predicted class labels against actual class labels. It is one of the most widely used tools in machine learning evaluation because it provides a detailed breakdown of correct and incorrect predictions, making it far more informative than a single aggregate metric like accuracy.
The confusion matrix is not limited to binary classification. It generalizes to multi-class problems, where the matrix becomes an N x N table for N classes. By examining the entries of a confusion matrix, practitioners can calculate a wide range of performance metrics, diagnose specific failure modes of a classifier, and make informed decisions about model improvements.
The mathematical foundations of the confusion matrix trace back to 1904, when Karl Pearson published "On the theory of contingency and its relation to association and normal correlation." Pearson introduced the concept of contingency tables to study associations between categorical variables, and his work laid the groundwork for what would later become the confusion matrix in classification analysis.
The term "confusion matrix" itself emerged from the field of human perception studies. In the 1950s and 1960s, researchers studying auditory and visual stimuli used the term to describe tables that recorded how often subjects confused one stimulus with another. James Townsend's 1971 paper "Theoretical analysis of an alphabetic confusion matrix" applied the concept to the classification of uppercase English letters, analyzing how frequently participants misidentified one letter as another. The tool was later adopted by early machine learning researchers, including Frank Rosenblatt, who used it to compare human and machine classification performance.
The term gained formal recognition in the machine learning community through the glossary published by Ron Kohavi and Foster Provost in 1998, which appeared in a special issue of the journal Machine Learning.
For a binary classifier that assigns instances to either a positive class or a negative class, the confusion matrix is a 2 x 2 table with four cells:
|  | Predicted positive | Predicted negative |
|---|---|---|
| Actual positive | True positive (TP) | False negative (FN) |
| Actual negative | False positive (FP) | True negative (TN) |
Each cell counts the number of instances that fall into that category:

- True positive (TP): the instance is actually positive and is predicted positive.
- False negative (FN): the instance is actually positive but is predicted negative.
- False positive (FP): the instance is actually negative but is predicted positive.
- True negative (TN): the instance is actually negative and is predicted negative.

The diagonal of the matrix (TP and TN) represents correct predictions, while the off-diagonal entries (FP and FN) represent misclassifications.
Suppose a medical test is used to screen 200 patients for a disease. Of these, 50 actually have the disease and 150 do not. The test produces the following results:
|  | Predicted: Disease | Predicted: No disease | Total |
|---|---|---|---|
| Actual: Disease | 45 (TP) | 5 (FN) | 50 |
| Actual: No disease | 10 (FP) | 140 (TN) | 150 |
| Total | 55 | 145 | 200 |
From this table, the test correctly identified 45 of the 50 patients who have the disease and correctly cleared 140 of the 150 patients who do not. However, it produced 10 false alarms and missed 5 actual cases.
The confusion matrix provides a direct way to count and analyze the two fundamental types of classification errors from statistical hypothesis testing:
| Error type | Confusion matrix cell | Description | Also known as |
|---|---|---|---|
| Type I error | False positive (FP) | Rejecting a true null hypothesis; predicting positive when the actual class is negative | False alarm, alpha error |
| Type II error | False negative (FN) | Failing to reject a false null hypothesis; predicting negative when the actual class is positive | Miss, beta error |
In many real-world applications, the two error types carry different costs. A false negative in cancer screening (missing a cancer case) is generally far more harmful than a false positive (an unnecessary follow-up test). Understanding the relative costs of Type I and Type II errors is critical for selecting an appropriate classification threshold and evaluating model fitness for a given task.
One of the most valuable aspects of the confusion matrix is that it serves as the basis for computing a wide range of performance metrics. Each metric highlights a different aspect of classifier performance.
| Metric | Formula | Description |
|---|---|---|
| Accuracy | (TP + TN) / (TP + TN + FP + FN) | Proportion of all predictions that are correct |
| Precision (positive predictive value) | TP / (TP + FP) | Proportion of positive predictions that are actually positive |
| Recall (sensitivity, true positive rate) | TP / (TP + FN) | Proportion of actual positives that are correctly identified |
| Specificity (true negative rate) | TN / (TN + FP) | Proportion of actual negatives that are correctly identified |
| F1 score | 2 * (Precision * Recall) / (Precision + Recall) | Harmonic mean of precision and recall |
| Negative predictive value (NPV) | TN / (TN + FN) | Proportion of negative predictions that are actually negative |

The complementary error rates can also be read directly from the confusion matrix:

| Metric | Formula | Description |
|---|---|---|
| False positive rate (FPR) | FP / (FP + TN) | Proportion of actual negatives incorrectly classified as positive; equals 1 - specificity |
| False negative rate (FNR) | FN / (FN + TP) | Proportion of actual positives incorrectly classified as negative; equals 1 - recall |
| False discovery rate (FDR) | FP / (FP + TP) | Proportion of positive predictions that are actually negative; equals 1 - precision |
| False omission rate (FOR) | FN / (FN + TN) | Proportion of negative predictions that are actually positive; equals 1 - NPV |
Using the medical screening example above: accuracy = (45 + 140) / 200 = 0.925, precision = 45 / 55 = 0.818, recall = 45 / 50 = 0.90, specificity = 140 / 150 = 0.933, and F1 score = 2 * (0.818 * 0.90) / (0.818 + 0.90) = 0.857.
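These figures are easy to verify with a few lines of Python; the sketch below simply hard-codes the four counts from the screening table and applies the formulas above.

```python
# Counts from the medical screening example
TP, FN, FP, TN = 45, 5, 10, 140

accuracy = (TP + TN) / (TP + TN + FP + FN)          # 0.925
precision = TP / (TP + FP)                          # ~0.818
recall = TP / (TP + FN)                             # 0.90
specificity = TN / (TN + FP)                        # ~0.933
f1 = 2 * precision * recall / (precision + recall)  # ~0.857

print(accuracy, precision, recall, specificity, f1)
```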
The Matthews correlation coefficient (MCC) is a metric that uses all four cells of the confusion matrix to produce a single score between -1 and +1. It is calculated as:
MCC = (TP x TN - FP x FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN))
A value of +1 indicates perfect classification, 0 indicates performance no better than random guessing, and -1 indicates complete inverse classification. The MCC is widely regarded as one of the most reliable single-number metrics for binary classification evaluation. Unlike accuracy, it remains informative even when classes are severely imbalanced. Chicco and Jurman (2020) demonstrated that the MCC is more informative than the F1 score and accuracy in binary classification evaluation, and a follow-up study (Chicco, Tötsch, and Jurman, 2021) found it more reliable than balanced accuracy, bookmaker informedness, and markedness in two-class confusion matrix evaluation. Chicco and Jurman (2023) further argued that the MCC should replace the ROC AUC as the standard metric for assessing binary classification.
The MCC is mathematically equivalent to the phi coefficient used in statistics, which measures the association between two binary variables.
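As a quick illustration, the sketch below evaluates the MCC formula on the medical screening counts used earlier; the result of roughly 0.81 is consistent with the high precision and recall computed above.

```python
import math

# Counts from the medical screening example
TP, FN, FP, TN = 45, 5, 10, 140

numerator = TP * TN - FP * FN
denominator = math.sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
mcc = numerator / denominator
print(round(mcc, 3))  # ~0.808
```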
Cohen's kappa is another metric derived from the confusion matrix that accounts for the possibility of correct predictions occurring by chance. It is computed as:
Kappa = (observed accuracy - expected accuracy) / (1 - expected accuracy)
Here, "observed accuracy" is the overall accuracy of the model, and "expected accuracy" is the probability that the model and the actual labels agree by random chance alone. The coefficient ranges from -1 to +1, where +1 represents perfect agreement and 0 represents agreement no better than chance.
Landis and Koch (1977) proposed the following interpretation guidelines:
| Kappa value | Interpretation |
|---|---|
| < 0.00 | Poor |
| 0.00 - 0.20 | Slight |
| 0.21 - 0.40 | Fair |
| 0.41 - 0.60 | Moderate |
| 0.61 - 0.80 | Substantial |
| 0.81 - 1.00 | Almost perfect |
Cohen's kappa is particularly useful when dealing with imbalanced datasets because it penalizes classifiers that achieve high accuracy simply by predicting the majority class. However, some researchers have cautioned that kappa can be difficult to interpret and depends on the prevalence of each class. Delgado and Tibau (2019) argued that kappa should be avoided as a standalone performance metric and used alongside other measures.
Accuracy, despite being the most intuitive metric, can be deeply misleading when class distributions are imbalanced. Consider a dataset where 95% of instances belong to the negative class and only 5% belong to the positive class. A naive classifier that always predicts "negative" would achieve 95% accuracy while failing to identify a single positive instance.
The confusion matrix for such a classifier would look like this:
|  | Predicted positive | Predicted negative |
|---|---|---|
| Actual positive | 0 | 50 |
| Actual negative | 0 | 950 |
Accuracy = 950 / 1000 = 0.95, but recall = 0 / 50 = 0, precision is undefined, and MCC = 0. The confusion matrix immediately exposes the problem that a single accuracy number hides.
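Assuming scikit-learn is available, these numbers can be reproduced directly:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score, matthews_corrcoef

# 1,000 instances: 50 positives, 950 negatives; the model always predicts "negative"
y_true = np.array([1] * 50 + [0] * 950)
y_pred = np.zeros(1000, dtype=int)

print(accuracy_score(y_true, y_pred))     # 0.95 -- looks impressive
print(recall_score(y_true, y_pred))       # 0.0  -- not a single positive found
print(matthews_corrcoef(y_true, y_pred))  # 0.0  -- no better than chance
```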
For imbalanced datasets, metrics such as precision, recall, F1 score, and MCC are generally more informative. The confusion matrix itself remains the best starting point for understanding model behavior because it shows exactly how predictions are distributed across all classes.
In many practical applications, different types of misclassifications carry different costs. A confusion matrix can be paired with a cost matrix to compute the total expected cost of a classifier's predictions.
A cost matrix assigns a numerical penalty to each cell of the confusion matrix:
|  | Predicted positive | Predicted negative |
|---|---|---|
| Actual positive | C(TP) = 0 | C(FN) = cost of missing a positive |
| Actual negative | C(FP) = cost of a false alarm | C(TN) = 0 |
The total cost is calculated as: Total Cost = C(FP) x FP + C(FN) x FN.
For example, in fraud detection, a false negative (missing a fraudulent transaction) might cost $10,000 on average, while a false positive (flagging a legitimate transaction) might cost $50 in customer inconvenience. A model with 20 false negatives and 200 false positives would have a total cost of 20 x $10,000 + 200 x $50 = $210,000. Cost-sensitive learning algorithms use this framework to optimize classifiers for minimum total cost rather than maximum accuracy.
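A small helper function (hypothetical, not part of any library) makes the calculation explicit:

```python
def total_cost(fp, fn, cost_fp, cost_fn):
    """Total cost of a classifier's errors when correct predictions cost nothing."""
    return cost_fp * fp + cost_fn * fn

# Fraud-detection example from the text: 200 false positives at $50 each,
# 20 false negatives at $10,000 each
print(total_cost(fp=200, fn=20, cost_fp=50, cost_fn=10_000))  # 210000
```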
When a classifier assigns instances to one of N classes (where N > 2), the confusion matrix expands to an N x N table. Each row corresponds to the actual class, and each column corresponds to the predicted class. The diagonal elements represent correct classifications, while off-diagonal elements represent misclassifications.
For example, a three-class classifier (classes A, B, C) might produce:
|  | Predicted A | Predicted B | Predicted C |
|---|---|---|---|
| Actual A | 40 | 5 | 5 |
| Actual B | 3 | 42 | 5 |
| Actual C | 2 | 8 | 40 |
From this matrix, several observations can be made. Class B has the highest per-class recall (42 out of 50 correct), while classes A and C are each correct on 40 of 50 instances. Class C is most often confused with class B (8 instances), suggesting these two classes share similar features. Class B also shows some confusion with class C (5 instances), further supporting the similarity between those classes.
For multi-class problems, per-class metrics can be derived by treating each class as a one-vs-rest binary problem. For class A, TP = 40, FP = 5 (instances predicted as A but actually B or C), FN = 10 (instances actually A but predicted as B or C), and TN = all remaining entries. These per-class metrics can then be averaged across classes using macro-averaging (unweighted mean), micro-averaging (aggregate TP, FP, FN across all classes), or weighted averaging (weighted by class support).
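One way to check these one-vs-rest numbers, assuming scikit-learn is available, is to expand the matrix back into label arrays and let the library compute each averaging scheme:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Three-class confusion matrix from the example (rows = actual, columns = predicted; 0=A, 1=B, 2=C)
cm = np.array([[40, 5, 5],
               [3, 42, 5],
               [2, 8, 40]])

# Rebuild label arrays consistent with the matrix
y_true, y_pred = [], []
for actual in range(3):
    for predicted in range(3):
        count = int(cm[actual, predicted])
        y_true += [actual] * count
        y_pred += [predicted] * count

print(recall_score(y_true, y_pred, average=None))           # per-class recall (one-vs-rest)
print(precision_score(y_true, y_pred, average="macro"))     # unweighted mean over classes
print(precision_score(y_true, y_pred, average="micro"))     # aggregate counts over all classes
print(precision_score(y_true, y_pred, average="weighted"))  # weighted by class support
```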
Raw counts in a confusion matrix can be hard to interpret when classes have different numbers of instances. Normalization converts the raw counts to proportions, making it easier to compare performance across classes.
Dividing each row by its row sum converts each entry to the proportion of instances in that actual class that were classified into each predicted class. The diagonal entries become the per-class recall values. This normalization is useful for answering: "Given an instance from class X, what is the probability the model predicts each class?"
Dividing each column by its column sum converts each entry to the proportion of instances predicted as that class that actually belong to each actual class. The diagonal entries become the per-class precision values. This normalization answers: "Given a prediction of class X, what is the probability the instance actually belongs to each class?"
Using the three-class example above with row normalization:
|  | Predicted A | Predicted B | Predicted C |
|---|---|---|---|
| Actual A | 0.80 | 0.10 | 0.10 |
| Actual B | 0.06 | 0.84 | 0.10 |
| Actual C | 0.04 | 0.16 | 0.80 |
Normalized confusion matrices are especially helpful when classes are imbalanced, as they prevent large classes from dominating the visual impression of the matrix.
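Both normalizations of the three-class example can be computed in a couple of NumPy operations:

```python
import numpy as np

cm = np.array([[40, 5, 5],
               [3, 42, 5],
               [2, 8, 40]], dtype=float)

row_normalized = cm / cm.sum(axis=1, keepdims=True)  # rows sum to 1; diagonal = per-class recall
col_normalized = cm / cm.sum(axis=0, keepdims=True)  # columns sum to 1; diagonal = per-class precision

print(np.round(row_normalized, 2))
print(np.round(col_normalized, 2))
```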
The confusion matrix has a long history of use in clinical medicine, where sensitivity and specificity are the primary measures of diagnostic test performance.
| Medical term | Statistical equivalent | Formula |
|---|---|---|
| Sensitivity | Recall / true positive rate | TP / (TP + FN) |
| Specificity | True negative rate | TN / (TN + FP) |
| Positive predictive value (PPV) | Precision | TP / (TP + FP) |
| Negative predictive value (NPV) | -- | TN / (TN + FN) |
Screening tests (such as mammography for breast cancer or PCR tests for infectious diseases) are typically designed with high sensitivity to minimize the chance of missing a true case, even at the expense of lower specificity. Confirmatory diagnostic tests, by contrast, are designed with high specificity to minimize false positives.
The trade-off between sensitivity and specificity is inherent in any diagnostic or classification system. Lowering the classification threshold increases sensitivity (catching more true positives) but decreases specificity (producing more false positives). Clinicians and machine learning practitioners must select a threshold that balances these concerns based on the clinical or business context.
A receiver operating characteristic (ROC) curve is closely related to the confusion matrix. Each point on an ROC curve corresponds to a specific classification threshold, and at each threshold, a distinct confusion matrix is generated. The ROC curve plots the true positive rate (recall) on the y-axis against the false positive rate (1 - specificity) on the x-axis across all possible thresholds.
While a confusion matrix captures model performance at a single threshold, the ROC curve provides a comprehensive view of performance across all thresholds. The area under the ROC curve (AUC) summarizes this into a single number between 0 and 1, where 1 represents perfect classification and 0.5 represents random guessing.
The two tools complement each other. The confusion matrix is essential when a specific operating point (threshold) has been chosen and the exact counts of correct and incorrect predictions matter. The ROC curve is more useful during model selection and threshold tuning, when the goal is to understand how performance changes as the threshold varies.
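The sketch below, using made-up scores, illustrates the relationship: roc_curve sweeps over every threshold, while fixing one threshold collapses the curve back to a single confusion matrix.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score, confusion_matrix

# Hypothetical ground-truth labels and classifier scores
y_true  = np.array([0, 0, 1, 1, 0, 1, 0, 1, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.05, 0.65, 0.7, 0.3])

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # one (FPR, TPR) pair per threshold
print(roc_auc_score(y_true, y_score))              # area under the ROC curve

# A single threshold yields a single confusion matrix
print(confusion_matrix(y_true, (y_score >= 0.5).astype(int)))
```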
Confusion matrices are commonly visualized as heatmaps, where the intensity of each cell's color corresponds to its value. This makes it easy to spot patterns at a glance: bright diagonal cells indicate strong classification, while bright off-diagonal cells indicate systematic confusion between specific classes.
Most machine learning libraries provide built-in support for confusion matrix visualization:
scikit-learn (Python):
```python
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

y_true = [0, 1, 0, 1, 0, 1, 0, 0, 1, 1]  # actual labels
y_pred = [0, 1, 0, 0, 0, 1, 1, 0, 1, 1]  # model predictions

cm = confusion_matrix(y_true, y_pred)

# Render the matrix as a heatmap with labeled axes
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=["Negative", "Positive"])
disp.plot(cmap=plt.cm.Blues)
plt.title("Confusion Matrix")
plt.show()
```
The sklearn.metrics.confusion_matrix function accepts y_true and y_pred arrays and returns a NumPy array. The optional normalize parameter can be set to 'true' (row normalization), 'pred' (column normalization), or 'all' (normalization over the entire matrix).
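For example, building on the arrays defined in the snippet above, a row-normalized matrix can be requested directly (in recent versions of scikit-learn):

```python
# Rows sum to 1; the diagonal gives per-class recall
cm_normalized = confusion_matrix(y_true, y_pred, normalize="true")
print(cm_normalized)
```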
PyTorch (via torchmetrics):
```python
from torchmetrics.classification import BinaryConfusionMatrix
import torch

metric = BinaryConfusionMatrix()
preds = torch.tensor([0, 1, 0, 0, 0, 1, 1, 0, 1, 1])   # model predictions
target = torch.tensor([0, 1, 0, 1, 0, 1, 0, 0, 1, 1])  # actual labels

cm = metric(preds, target)  # 2 x 2 tensor of counts
print(cm)
```
TensorFlow:
```python
import tensorflow as tf

y_true = [0, 1, 0, 1, 0, 1, 0, 0, 1, 1]  # actual labels
y_pred = [0, 1, 0, 0, 0, 1, 1, 0, 1, 1]  # model predictions

cm = tf.math.confusion_matrix(y_true, y_pred)  # 2 x 2 tensor of counts
print(cm)
```
Imagine you have a big pile of toys, and your job is to sort them into two boxes: one box for cars and one box for dolls. After you finish sorting, your mom checks your work and makes a chart showing how many cars ended up in the car box, how many cars accidentally went into the doll box, how many dolls went into the doll box, and how many dolls ended up with the cars.
That chart is a confusion matrix. It shows you exactly what you got right and what you got wrong, so you can figure out where you need to be more careful next time. If you were sorting into three boxes (cars, dolls, and blocks), the chart would be bigger, with more rows and columns to track all the possible mix-ups.
While the confusion matrix is an essential evaluation tool, it has several limitations: